CN108399213B - User-oriented personal file clustering method and system - Google Patents

Info

Publication number
CN108399213B
Authority
CN
China
Prior art keywords
file
clustering
files
directory
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810112624.0A
Other languages
Chinese (zh)
Other versions
CN108399213A (en)
Inventor
李鹏
王斌
齐保元
周美林
郭莉
梅钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201810112624.0A
Publication of CN108399213A
Application granted
Publication of CN108399213B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a user-oriented personal file clustering method comprising the following steps: grouping the user's files according to the user's habits of storing similar files, to obtain a plurality of file groups; clustering the files within each file group to obtain one or more local clusters, the files in each local cluster having similar content; and treating each local cluster as a single file and clustering all the local clusters to generate global clusters. The invention also provides a user-oriented personal file clustering system comprising a clustering calculation unit, a clustering result storage unit and a clustering result searching unit, wherein the clustering calculation unit comprises a batch file clustering calculation unit and an incremental file clustering calculation unit.

Description

User-oriented personal file clustering method and system
Technical Field
The invention relates to the field of personal file management, and in particular to content-based clustering of user files that makes use of the user's file-storage habits.
Background
With the spread of computer-based office work, users create and edit large numbers of files (such as Word/WPS/PDF documents) in the course of their work. In existing file systems, files are stored in a tree structure, which differs markedly from the way people remember files: human memory of documents is usually organized by task (or topic). One consequence of this difference is that users must devote considerable effort to finding and managing documents later, and searching becomes even harder when multiple versions of a document on the same topic are stored in different locations.
At present, the field of Personal Information Management (PIM) studies how to search information generated by a user, and various companies have developed search systems for user files, such as Google Desktop Search and Baidu Desktop Search. These tools all perform keyword-based file search, however, and do not solve the problem of converting "file organization based on a tree structure" into "file organization based on a topic structure".
Organizing files by topic may use content-based clustering, where all files are directly input to a clustering algorithm, and each cluster (set of files) resulting from clustering is used as a file topic. Directly processing files ignores the user's saving habits of similar files (e.g., files under the same directory are more likely to belong to the same cluster), while significantly increasing the amount of unnecessary computation.
Disclosure of Invention
The invention provides a user-oriented personal file clustering method and a user-oriented personal file clustering system.
In order to achieve this purpose, the invention adopts the following technical scheme:
a user-oriented personal file clustering method comprises the following steps:
(1) file grouping: grouping the user's files according to the user's habits of storing similar files, to obtain a plurality of file groups;
(2) local cluster generation: clustering the files within each file group to obtain one or more local clusters, the files in each local cluster having similar content;
(3) global cluster generation: treating each local cluster as a single file and clustering all the local clusters to generate global clusters.
Further, the user files are grouped by file tree directory structure, file type, file name, etc.
Further, the method for grouping the user files according to the file tree directory structure comprises the following steps: let NodeSet denote all directory nodes in the file directory tree and X_i denote the distance from directory node i to the root node; set a distance threshold parameter δ_X; and take the files contained in each directory node satisfying X_i > δ_X as a single file group.
Further, the method for grouping the user files according to the file tree directory structure comprises the following steps: let NodeSet denote all directory nodes in the file directory tree and Y_i denote the distance from directory node i to the farthest leaf node; set a distance threshold parameter δ_Y; and perform the following steps:
(1) based on the Y values of the directory nodes, selecting the directory node whose Y value is greater than δ_Y and closest to δ_Y;
(2) taking the direct files contained in the selected directory node as a single file group;
(3) deleting the subtree corresponding to the selected directory node from the whole directory tree;
(4) updating the Y values of all directory nodes in the directory tree;
(5) repeating steps (1) - (4) until no directory node has a Y value greater than δ_Y;
(6) taking all files contained in the remaining directory nodes as a single file group.
Further, the direct file refers to a file under a directory node, and the difference between the depth of the file in the directory tree and the depth of the directory node is 1.
Further, the method for grouping the user files according to file type and file name comprises the following steps: if the clustering target requires the files in a cluster to have the same file type, the files can first be grouped by file type and then clustered within each group based on file name, each resulting cluster being used as a file group; if the clustering target does not require the files in a cluster to have the same file type, the files can be clustered directly based on file name, each resulting cluster being used as a file group.
Further, the file name-based clustering method comprises the following steps: calculating the edit distance between file name strings, and placing files whose edit distance is smaller than a set threshold into one file group.
Further, the file name-based clustering method comprises the following steps: tokenizing the file names, representing each file name as a term vector, calculating the text distance between file names, and placing files whose text distance is larger than a set threshold into one file group, wherein the text distance is the Jaccard distance or the cosine distance.
Furthermore, the method for clustering the files comprises the steps of file representation, file similarity calculation and operation of a clustering algorithm.
Further, in the global cluster generation stage each local cluster is regarded as a file and given a single file representation: either one file in the local cluster is selected to represent it (e.g., the representation of the longest file in the local cluster is used as the representation of the local cluster), or a representation is computed from the representations of all files in the local cluster (e.g., the representations of all files in the local cluster are concatenated to form the representation of the local cluster).
Further, the file representation method comprises the following steps: vectorizing the file, wherein each dimension of the vector corresponds to a part of the file content; a text file can be represented using a vector space model, and an image file can be represented using image features or a neural-network-based feature extraction method.
Further, the file representation method comprises the following steps: calculating a hash value of the file using a fuzzy hashing method, and using the hash value as the file representation.
Further, the file similarity calculation method comprises the following steps: calculating the Jaccard distance or cosine distance between files.
Further, the file similarity calculation method comprises the following steps: comparing the hash values of the two files and converting the comparison result into a similarity.
Further, the clustering algorithm comprises the k-means clustering algorithm and density-based clustering algorithms.
A user-oriented personal file clustering system, comprising:
the cluster calculating unit is used for clustering the user files;
the clustering result storage unit is used for storing clustering results and reserving the mapping relation between the files and the class clusters;
and the clustering result searching unit is used for realizing the functions of returning the file according to the class cluster and returning the class cluster according to the file.
Further, the cluster calculating unit includes:
the batch file clustering calculation unit is used for clustering the personal files for the first time;
and the incremental file clustering calculation unit is used for clustering subsequent incremental files and combining the incremental clustering result with the existing clustering result.
Further, merging the incremental clustering result with the existing clustering result means that the clusters obtained by incremental clustering and the existing clusters are clustered together once, using the clustering process of global cluster generation, to obtain the merged result.
The input of the method is a file directory with a tree structure, and the output is a list of file clusters organized by topic, the files in each file cluster having similar content, as shown in FIG. 1.
The invention groups files using the file tree directory structure, the file type and the file name, first clusters the files within each file group, and then performs global clustering. Compared with applying a clustering algorithm directly to all files, this can significantly improve clustering efficiency. Assume that the total number of files is N, the number of file groups is D, the average number of files in each file group is N/D, the number of local clusters generated by each file group is K, and the number of clusters in the final result is M. Assume that the time complexity of the clustering algorithm used is f(x), where x is the number of files input for clustering. The upper bound on the time complexity of file clustering using the method of the invention is then D·f(N/D) + f(DK), where D·f(N/D) is the time complexity of local cluster generation and f(DK) is an upper bound on the time complexity of global cluster generation. It is an upper bound because, during global cluster generation, the K local clusters obtained from the same file group need not be compared with one another, local clustering having already placed them in different clusters. The time complexity of directly clustering all files using the conventional method is f(N). In practice, K ≪ N/D and K ≪ M can be achieved by choosing a suitable file grouping strategy, which yields the clustering speed-up.
For example, for the k-means clustering algorithm the time complexity is f(x) = λTxM, where x is the number of files input for clustering, T is the number of k-means iterations, M is the number of clusters produced (K when local clusters are generated), and λ is the time needed to compare two files. The time complexity of directly clustering all files is λTNM, while that of the method of the invention is λTNK + λTDKM. λTNM > λTNK + λTDKM holds whenever 1 > K/M + DK/N, so file clustering is accelerated, and the efficiency improvement rate is 1 - (K/M + DK/N).
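The k-means comparison can be written out step by step. The derivation below only restates the quantities defined in the text; the symbols C_direct and C_grouped and the numbers in the comments are hypothetical and chosen purely for illustration.

```latex
\begin{align*}
C_{\text{direct}}  &= f(N) = \lambda T N M,\\
C_{\text{grouped}} &= D \cdot \lambda T \frac{N}{D} K + \lambda T (DK) M
                    = \lambda T N K + \lambda T D K M,\\
\frac{C_{\text{grouped}}}{C_{\text{direct}}}
                   &= \frac{K}{M} + \frac{DK}{N},
\qquad
\text{improvement rate} = 1 - \left(\frac{K}{M} + \frac{DK}{N}\right).
\end{align*}
% Hypothetical numbers: N = 10000, D = 100, K = 5, M = 50
% K/M = 0.1, DK/N = 500/10000 = 0.05, improvement rate = 1 - 0.15 = 0.85
```

With these illustrative values both K ≪ N/D (5 against 100) and K ≪ M (5 against 50) hold, which is the regime in which the text expects a speed-up.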
After file clustering is completed, the user can browse files using the file clusters (that is, the global clusters) generated by the method. Two browsing modes are available: the first provides a file organization based on a topic directory structure, in which the user first browses the file clusters and then selects one to browse the related files in it; the second provides a file association function, in which the user browses files using the conventional tree-structured directory and, on clicking a file, is shown the other files in the cluster to which that file belongs, thereby realizing file recommendation. When cluster files are displayed in either mode, the user can sort them by file size, creation time or modification time, so that files can be browsed in order of priority.
Drawings
FIG. 1 is a schematic diagram of document organization based on a subject structure obtained by the method of the present invention.
FIG. 2 is a flowchart of an embodiment of a user-oriented personal file clustering method.
FIG. 3 is a sample diagram of a file tree directory structure.
FIG. 4 is a diagram of an embodiment of a clustering system for user-oriented personal files.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The embodiment provides a user-oriented personal file clustering method, as shown in fig. 2, which includes the following 3 steps:
1. grouping of files
Files are grouped according to the user's habits of storing similar files. The aim is for the files in each file group to fall into as few clusters as possible, and for no file to appear in more than one file group.
The following grouping strategy may be used:
1) grouping using a file tree directory structure:
Let NodeSet denote all directory nodes in the tree. For each directory node, the "distance to the root node" and the "distance to the farthest leaf node" are calculated and denoted X and Y respectively, so that for each directory node i the two distances can be written as <X_i, Y_i>.
Files are grouped using X and Y as follows:
Method 1: grouping of files is done using X. The user sets a distance parameter δ_X, and the files contained in each directory node i ∈ NodeSet satisfying the condition X_i > δ_X are taken as a file group, which completes the grouping.
Taking FIG. 3 as an example, and assuming that the distance parameter δ_X is set to 2, the resulting file groups are {A0}, {A1.1}, {A2.1, A3.1, A3.2}, {B1.1} and {C1.1, C1.2}.
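The <X_i, Y_i> bookkeeping and method 1 can be sketched in Python as follows. This is only an illustrative reading of the text, not code from the patent: the `DirNode` class, the function names and the example tree are hypothetical, directory names are assumed to be unique, a "leaf node" is assumed to be a directory without subdirectories, and each directory with X_i > δ_X is assumed to contribute its direct files as one group. The example tree is not the tree of FIG. 3, and the sketch does not attempt to reproduce the FIG. 3 result.

```python
from dataclasses import dataclass, field


@dataclass
class DirNode:
    """Hypothetical directory node: a name, its direct files, its subdirectories."""
    name: str
    files: list = field(default_factory=list)
    children: list = field(default_factory=list)


def annotate(node, depth=0, out=None):
    """Return {directory name: (X, Y)} for the subtree rooted at node.

    X is the distance to the root; Y is the distance to the farthest
    leaf directory below the node (0 for a directory with no subdirectories).
    """
    if out is None:
        out = {}
    y = 0
    for child in node.children:
        annotate(child, depth + 1, out)
        y = max(y, out[child.name][1] + 1)
    out[node.name] = (depth, y)
    return out


def group_by_x(root, delta_x):
    """Method 1 (assumed reading): every directory with X > delta_x
    contributes its direct files as one file group."""
    xy = annotate(root)
    groups = []
    stack = [root]
    while stack:
        node = stack.pop()
        if xy[node.name][0] > delta_x and node.files:
            groups.append(sorted(node.files))
        stack.extend(node.children)
    return sorted(groups)


# Hypothetical example tree: root/{a/{b/{c}}, d}
root = DirNode("root", ["r.txt"], [
    DirNode("a", ["a1.txt"], [
        DirNode("b", ["b1.txt", "b2.txt"], [DirNode("c", ["c1.txt"])]),
    ]),
    DirNode("d", ["d1.txt"]),
])
xy = annotate(root)
groups = group_by_x(root, delta_x=1)
```

Under this reading, files in directories with X_i ≤ δ_X are not assigned to any group by method 1 alone; how they are handled is not stated in the text.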
Method 2: grouping of files is done using Y. The user sets a distance parameter δ_Y. The pseudo-code of the procedure is given in the original publication:
[Figure BDA0001569695530000051: pseudo-code of method 2, not reproduced]
According to the pseudo-code, the algorithm consists of the following steps: (1) based on the Y values of the directory nodes, select the directory node whose Y value is greater than δ_Y and closest to δ_Y; (2) take the direct files contained in the selected directory node as a file group; (3) delete the subtree corresponding to the selected directory node from the whole directory tree; (4) update the Y values of all directory nodes in the directory tree; (5) repeat steps (1) - (4) until no directory node has a Y value greater than δ_Y; (6) take all files contained in the remaining directory nodes as a single file group.
Still taking FIG. 3 as an example, and assuming that the distance parameter δ_Y is set to 2, the resulting file groups are {A0, A1.1}, {A2.1, A3.1, A3.2}, {B1.1} and {C1.1, C1.2}.
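Since the pseudo-code figure is not available here, the listed steps of method 2 can be sketched in Python as a literal reading of the text. Everything named below (`DirNode`, `height`, `walk`, `group_by_y`, `example_tree`, and the `include_descendants` flag) is hypothetical. In the literal reading, files that sit deeper in a removed subtree end up in no group; the optional flag shows one possible assumption, namely that they join the selected node's group. The sketch does not attempt to reproduce the FIG. 3 result.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by content
class DirNode:
    name: str
    files: list = field(default_factory=list)
    children: list = field(default_factory=list)


def height(node):
    """Y value: edges from node down to its farthest leaf directory."""
    return 1 + max(height(c) for c in node.children) if node.children else 0


def walk(node, parent=None):
    """Yield (node, parent) pairs in pre-order."""
    yield node, parent
    for child in node.children:
        yield from walk(child, node)


def group_by_y(root, delta_y, include_descendants=False):
    """Group files following steps (1)-(6); the tree is modified in place."""
    groups = []
    root_alive = True
    while root_alive:
        # (1) among nodes with Y > delta_y, pick the one with Y closest to delta_y
        candidates = [(height(n), n, p) for n, p in walk(root)]
        candidates = [c for c in candidates if c[0] > delta_y]
        if not candidates:
            break  # (5) stop when no node has Y > delta_y
        _, node, parent = min(candidates, key=lambda c: c[0])
        # (2) the selected node's direct files form one group
        picked = list(node.files)
        if include_descendants:  # assumption: also keep files deeper in the subtree
            picked = [f for n, _ in walk(node) for f in n.files]
        groups.append(picked)
        # (3) delete the selected subtree from the tree
        if parent is None:
            root_alive = False
        else:
            parent.children.remove(node)
        # (4) Y values are recomputed by height() on the next pass
    if root_alive:
        # (6) all files of the remaining nodes form a single group
        rest = [f for n, _ in walk(root) for f in n.files]
        if rest:
            groups.append(rest)
    return groups


def example_tree():
    """Hypothetical tree root/{a/{b/{c}}, d}; rebuilt per call because grouping mutates it."""
    return DirNode("root", ["r.txt"], [
        DirNode("a", ["a1.txt"], [
            DirNode("b", ["b1.txt", "b2.txt"], [DirNode("c", ["c1.txt"])]),
        ]),
        DirNode("d", ["d1.txt"]),
    ])


literal = group_by_y(example_tree(), delta_y=1)
with_subtree = group_by_y(example_tree(), delta_y=1, include_descendants=True)
```

Recomputing `height` on every pass keeps the sketch short at the cost of repeated traversals; an implementation for large directory trees would likely cache and update Y values incrementally, as step (4) suggests.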
Compared to method 1, method 2 can avoid some file groups that contain too many files. The user can select either one of the above two methods or combine the two methods for grouping files.
2) Grouping using file name, file type:
If the clustering target requires the files in a cluster to have the same file type, the files can first be grouped by file type and then clustered within each group based on file name, each resulting cluster being used as a file group. If the clustering target does not require the files in a cluster to have the same type, the files can be clustered directly based on file name. Because file names are short, this clustering is very fast.
There are two methods for file name-based clustering:
Method 1: calculate the edit distance between file names and take files whose edit distance is smaller than a specified threshold as a group, the threshold being set by the user as needed.
Method 2: tokenize the file names, represent each file name as a term vector, and calculate the text distance between file names, such as the Jaccard distance (see https://baike.baidu.com/item/%E6%9D%B0%E5%8D%A1%E5%BE%B7%E8%B7%9D%E7%A6%BB/15416212?fr=aladdin) or the cosine distance (see https://zh.wikipedia.org/zh-cn/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7); files whose similarity is greater than a set threshold are taken as a group, the threshold being set by the user as needed.
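The two file name-based methods can be sketched in Python as follows. The function names, the regular-expression tokenizer, the thresholds and the sample file names are all hypothetical. The text does not say how pairwise comparisons are turned into groups, so the sketch assumes single-linkage grouping (names connected by a chain of "close enough" pairs share a group); for Chinese file names a proper word segmenter would be needed in place of the simple tokenizer.

```python
import re
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def tokens(name):
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard_similarity(a, b):
    """Jaccard similarity of the token sets of two file names."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def group(names, related):
    """Single-linkage grouping with a small union-find structure."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if related(a, b):
            parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


names = ["report_v1.docx", "report_v2.docx",
         "budget_2018.xlsx", "budget_2017.xlsx", "notes.txt"]
by_edit = group(names, lambda a, b: edit_distance(a, b) < 3)
by_tokens = group(names, lambda a, b: jaccard_similarity(a, b) >= 0.5)
```

With these sample names both methods yield the same three groups; on real file names the two can disagree, for instance when a version suffix is long relative to the name.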
3) Grouping is carried out by combining the 2 grouping strategies:
the strategy 1) can be used for grouping, and the strategy 2) can be used for grouping again for each obtained group; or grouping by using the strategy 2) and then grouping by using the strategy 1) for each obtained group.
2. Local cluster generation
All files in each file group obtained in step 1 are clustered to obtain local clusters, the files in each local cluster having similar content. One file group may generate several local clusters.
3. Global cluster generation
Each local cluster obtained in step 2 is regarded as a file, and all local clusters are clustered to generate global clusters.
Specifically, each local cluster is treated as a file, using a unique representation. A local cluster file can be represented by using a single file in a cluster, such as selecting a file with the longest length as a cluster file representation; it is also possible to use all file representations in a cluster for calculation, such as concatenating the representations of all files in a cluster.
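The two ways of representing a local cluster, and the step of expanding the clustering of those representations back into global clusters, can be sketched as follows. The data layout (a local cluster as a mapping from file name to file text), all function names and the toy `by_first_word` clustering function are hypothetical; any real clustering algorithm that returns one label per representation could be plugged in instead.

```python
def represent_by_longest(cluster):
    """Use the longest file in the local cluster as its representation."""
    return max(cluster.values(), key=len)


def represent_by_concatenation(cluster):
    """Concatenate all files of the local cluster (in name order)."""
    return "\n".join(cluster[name] for name in sorted(cluster))


def global_clusters(local_clusters, represent, cluster_fn):
    """Treat each local cluster as one pseudo-file, cluster the pseudo-files,
    then expand every resulting cluster back into the files it contains.

    cluster_fn maps a list of representations to a list of integer labels.
    """
    labels = cluster_fn([represent(c) for c in local_clusters])
    merged = {}
    for label, cluster in zip(labels, local_clusters):
        merged.setdefault(label, []).extend(sorted(cluster))
    return [merged[label] for label in sorted(merged)]


def by_first_word(representations):
    """Toy stand-in for a clustering algorithm: label by first word."""
    seen = {}
    return [seen.setdefault(r.split()[0], len(seen)) for r in representations]


# Hypothetical local clusters: file name -> file text
local = [
    {"a.txt": "tax report 2017", "b.txt": "tax report"},
    {"c.txt": "tax summary"},
    {"d.txt": "holiday photos list"},
]
result = global_clusters(local, represent_by_longest, by_first_word)
```

Selecting one file keeps the representation small, whereas concatenation retains information from every file at the cost of a larger input to the similarity calculation.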
The clustering processes of step 2 and step 3 involve file representation, file similarity calculation and running a clustering algorithm. The clustering algorithm may be the k-means clustering algorithm or a density-based clustering algorithm. The file representation and the file similarity calculation are chosen according to requirements, which commonly fall into the following two types:
1) The clustering target is to gather files on the same topic, and the content of the files within a cluster may differ considerably. For example, if the aim is to gather articles of the "economy" class, the cluster may contain articles of the "international economy" class as well as articles of the "local economic development" class, even though these two kinds of articles discuss different content. For this requirement, the file representation and file similarity calculation used for clustering are as follows:
File representation: the file is vectorized, each dimension corresponding to one piece of file content. A text file can be represented using a vector space model (see https://en.wikipedia.org/wiki/Vector_space_model), and an image file can use image features or a neural-network-based feature extraction method.
File similarity calculation: the distance between files is calculated, such as the Jaccard distance or the cosine distance.
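For text files, a vector space representation and the two distance measures can be sketched as follows. This is a minimal illustration using raw term counts; the function names, the tokenizer and the sample texts are hypothetical, and weighting schemes such as TF-IDF are deliberately left out.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> count (one dimension per term)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard_distance(u, v):
    """1 minus the Jaccard similarity of the two term sets."""
    terms_u, terms_v = set(u), set(v)
    union = terms_u | terms_v
    return 1.0 - len(terms_u & terms_v) / len(union) if union else 0.0


a = term_vector("International economy report")
b = term_vector("Local economy development report")
c = term_vector("Holiday photo album")
sim_ab = cosine_similarity(a, b)   # shares "economy" and "report"
sim_ac = cosine_similarity(a, c)   # no shared terms
```

Cosine distance can be taken as 1 minus the cosine similarity, so either measure can be fed to a distance-based clustering algorithm.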
2) The clustering target is to gather homologous files. Homologous files are files that can be converted into one another by additions, deletions and modifications; they are often the successive drafts a user produces on the way to a final file. For this requirement, the file representation and file similarity calculation used for clustering are as follows:
File representation: the file hash value is calculated using fuzzy hashing (see http://dx.doi.org/10.1016/j.diin.2006.06.015). Fuzzy hashing is based on context triggered piecewise hashing (CTPH), which judges the degree of similarity between files by splitting them into pieces.
File similarity calculation: the hash values of the two files are compared and the result is converted into a similarity (see the fuzzy hashing algorithm, http://dx.doi.org/10.1016/j.diin.2006.06.015).
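The idea behind context triggered piecewise hashing can be illustrated with a deliberately simplified toy in Python. This is not the algorithm of the cited paper and is not compatible with any existing fuzzy hashing tool: the rolling sum, the `window` and `modulus` parameters, the one-character piece hashes and the use of `difflib` for comparison are all assumptions made for illustration.

```python
import hashlib
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def piece_char(piece):
    """Hash one piece of the file down to a single character."""
    return ALPHABET[hashlib.sha256(piece).digest()[0] % 64]


def toy_fuzzy_hash(data, window=7, modulus=16):
    """Cut data wherever a rolling sum over the last `window` bytes hits a
    trigger value, and emit one character per piece."""
    signature = []
    start = 0
    for i in range(len(data)):
        rolling = sum(data[max(0, i - window + 1): i + 1])
        if rolling % modulus == modulus - 1:  # context trigger: end of a piece
            signature.append(piece_char(data[start:i + 1]))
            start = i + 1
    if start < len(data):                     # trailing piece, if any
        signature.append(piece_char(data[start:]))
    return "".join(signature)


def similarity(hash_a, hash_b):
    """Convert two signatures into a 0-100 similarity score."""
    return round(100 * SequenceMatcher(None, hash_a, hash_b).ratio())


# Hypothetical draft and a later version with text appended
draft = b"Quarterly report: revenue grew in the first quarter. " * 20
final = draft + b"Appendix: figures are preliminary."
h_draft, h_final = toy_fuzzy_hash(draft), toy_fuzzy_hash(final)
score = similarity(h_draft, h_final)
```

Because piece boundaries depend only on nearby bytes, appending text leaves the earlier pieces, and therefore most of the signature, unchanged; this is the property that makes such hashes suitable for recognizing successive drafts of the same file.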
The embodiment also provides a user-oriented personal file clustering system, as shown in fig. 4, including:
the cluster calculating unit is used for clustering the user files; the cluster calculating unit includes: the batch file clustering calculation unit, used for clustering the personal files for the first time; and the incremental file clustering calculation unit, used for clustering subsequent incremental files and for merging the incremental clustering result with the existing clustering result by clustering the incrementally obtained clusters and the existing clusters together once, using the clustering process of global cluster generation;
the clustering result storage unit is used for storing clustering results and reserving the mapping relation between the files and the class clusters;
and the clustering result searching unit is used for realizing the functions of returning the file according to the class cluster and returning the class cluster according to the file.
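The storage and searching units amount to a two-way mapping between files and clusters. A minimal in-memory sketch, assuming each file belongs to exactly one cluster, could look as follows; the class name `ClusterIndex`, its methods and the sample paths are hypothetical, and a real system would presumably persist the mapping rather than keep it in memory.

```python
from collections import defaultdict


class ClusterIndex:
    """Keeps the file <-> cluster mapping and answers lookups in both directions."""

    def __init__(self):
        self._files_by_cluster = defaultdict(set)
        self._cluster_by_file = {}

    def assign(self, file_path, cluster_id):
        """Record (or update) the cluster a file belongs to."""
        previous = self._cluster_by_file.get(file_path)
        if previous is not None:
            self._files_by_cluster[previous].discard(file_path)
        self._cluster_by_file[file_path] = cluster_id
        self._files_by_cluster[cluster_id].add(file_path)

    def files_in(self, cluster_id):
        """Return the files of a cluster."""
        return sorted(self._files_by_cluster.get(cluster_id, ()))

    def cluster_of(self, file_path):
        """Return the cluster of a file, or None if the file is unknown."""
        return self._cluster_by_file.get(file_path)

    def related_files(self, file_path):
        """Other files in the same cluster, e.g. for file recommendation."""
        cluster_id = self.cluster_of(file_path)
        if cluster_id is None:
            return []
        return [f for f in self.files_in(cluster_id) if f != file_path]


index = ClusterIndex()
index.assign("docs/report_v1.docx", 0)
index.assign("backup/report_v2.docx", 0)
index.assign("photos/trip.jpg", 1)
```

The `assign` method also covers the incremental case in which merging moves a file from one cluster to another, since the old membership is removed before the new one is recorded.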
The user's personal files are clustered through the above steps. After clustering, the user can browse files using the global clusters generated by the method. Two browsing modes are available: the first provides a file organization based on a topic directory structure, in which the user first browses the file clusters and then selects one to browse the related files in it; the second provides a file association function, in which the user browses files using the conventional tree-structured directory and, on clicking a file, is shown the other files in the cluster to which that file belongs, thereby realizing file recommendation. When cluster files are displayed in either mode, the user can sort them by file size, creation time or modification time, so that files can be browsed in order of priority.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (7)

1. A user-oriented personal file clustering method comprises the following steps:
grouping user files according to a file tree directory structure, grouping the user files according to file types and file names, and combining the two grouping strategies to obtain a plurality of file groups; the method for grouping the user files according to the file tree directory structure comprises method one: let NodeSet denote all directory nodes in the file directory tree and X_i denote the distance from directory node i to the root node, set a distance threshold parameter δ_X, and take the direct files contained in each directory node satisfying X_i > δ_X as a single file group; or method two: let NodeSet denote all directory nodes in the file directory tree and Y_i denote the distance from directory node i to the farthest leaf node under that directory node, set a distance threshold parameter δ_Y, and perform the following steps: based on the Y values of the directory nodes, selecting the directory node whose Y value is greater than δ_Y and closest to δ_Y; taking the direct files contained in that directory node as a single file group; deleting the subtree corresponding to that directory node from the whole directory tree; updating the Y values of all directory nodes in the directory tree; repeating the above steps until no directory node has a Y value greater than δ_Y; and taking all files contained in the remaining directory nodes as a single file group; the method for grouping the user files according to the file types and the file names comprises the following steps: if the file clustering target requires the files in a cluster to have the same file type, the files can first be grouped according to file type and then clustered within each group based on the file names, each resulting cluster being used as a file group; if the file clustering target does not require the files in a cluster to have the same file type, the files can be clustered directly based on the file names, each resulting cluster being used as a file group;
clustering the files in the file group to obtain one or more local clusters, wherein the file content in each local cluster is similar;
and regarding each local cluster as a file, clustering all the local clusters to generate a global cluster.
2. The method of claim 1, wherein the method for file name based clustering comprises:
method one: calculating the edit distance between file name strings, and placing files whose edit distance is smaller than a set threshold into one file group;
method two: tokenizing the file names, representing each file name as a term vector, calculating the text distance between file names, and placing files whose text distance is larger than a set threshold into one file group, wherein the text distance is the Jaccard distance or the cosine distance.
3. The method of claim 1, wherein the step of clustering files or local clusters comprises: first representing the files, then calculating the file similarity, and finally running a clustering algorithm.
4. The method of claim 3, wherein the method of file representation comprises:
method one: vectorizing the file, wherein each dimension of the vector corresponds to a part of the file content; a text file can be represented using a vector space model, and an image file can be represented using image features or a neural-network-based feature extraction method;
method two: calculating a hash value of the file using a fuzzy hashing method, and using the hash value as the file representation.
5. The method according to claim 3, wherein the file similarity calculation method includes:
method one: calculating the Jaccard distance or cosine distance between the files;
method two: comparing the hash values of the two files and converting the comparison result into a similarity.
6. The method of claim 3, wherein the clustering algorithm comprises the k-means clustering algorithm and density-based clustering algorithms.
7. A user-oriented personal file clustering system for implementing the method of any one of claims 1 to 6, the system comprising:
the clustering result storage unit is used for storing clustering results and reserving the mapping relation between the files and the class clusters;
the clustering result searching unit is used for realizing the functions of returning files according to the clusters and returning the clusters according to the files;
the cluster calculating unit is used for clustering the user files and comprises:
the batch file clustering calculation unit is used for clustering the personal files for the first time;
and the incremental file clustering calculation unit is used for clustering subsequent incremental files and combining the incremental clustering result with the existing clustering result.
CN201810112624.0A 2018-02-05 2018-02-05 User-oriented personal file clustering method and system Active CN108399213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810112624.0A CN108399213B (en) 2018-02-05 2018-02-05 User-oriented personal file clustering method and system

Publications (2)

Publication Number Publication Date
CN108399213A CN108399213A (en) 2018-08-14
CN108399213B true CN108399213B (en) 2022-04-01

Family

ID=63096263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810112624.0A Active CN108399213B (en) 2018-02-05 2018-02-05 User-oriented personal file clustering method and system

Country Status (1)

Country Link
CN (1) CN108399213B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933707B (en) * 2018-10-31 2022-10-14 中国科学院信息工程研究所 Topic corpus construction method and system based on search engine
CN110245209B (en) * 2019-06-20 2022-09-23 贵州电网有限责任公司 Method for extracting milestone events from massive texts
CN111209375B (en) * 2020-01-13 2023-01-17 中国科学院信息工程研究所 Universal clause and document matching method
CN112632953B (en) * 2020-12-22 2023-07-25 云汉芯城(上海)互联网科技股份有限公司 Method for rapidly and accurately detecting that multiple uploaded bill of materials belongs to same product
CN116363667B (en) * 2023-04-26 2023-10-13 公安部信息通信中心 Aggregation file theme identification and classification system
CN117573875A (en) * 2023-12-05 2024-02-20 安芯网盾(北京)科技有限公司 Method and device for optimizing homonymy file clustering algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1773492A (en) * 2004-11-09 2006-05-17 国际商业机器公司 Method for organizing multi-file and equipment for displaying multi-file
CN101004737A (en) * 2007-01-24 2007-07-25 贵阳易特软件有限公司 Individualized document processing system based on keywords
CN105095209A (en) * 2014-04-21 2015-11-25 北京金山网络科技有限公司 Document clustering method, document clustering device and network equipment
CN105574005A (en) * 2014-10-10 2016-05-11 富士通株式会社 Device and method for clustering source data containing a plurality of documents
CN106294568A (en) * 2016-07-27 2017-01-04 北京明朝万达科技股份有限公司 A kind of Chinese Text Categorization rule generating method based on BP network and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW561377B (en) * 2001-12-17 2003-11-11 Webstorage Corp Intelligent document management and usage method

Also Published As

Publication number Publication date
CN108399213A (en) 2018-08-14

Similar Documents

Publication Publication Date Title
CN108399213B (en) User-oriented personal file clustering method and system
US20220261427A1 (en) Methods and system for semantic search in large databases
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
KR102046096B1 (en) Resource efficient document search
CN108920556B (en) Expert recommending method based on discipline knowledge graph
CN108021658B (en) Intelligent big data searching method and system based on whale optimization algorithm
CN107729290B (en) Representation learning method of super-large scale graph by using locality sensitive hash optimization
CN110046713B (en) Robustness ordering learning method based on multi-target particle swarm optimization and application thereof
WO2023029356A1 (en) Sentence embedding generation method and apparatus based on sentence embedding model, and computer device
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN114186084A (en) Online multi-mode Hash retrieval method, system, storage medium and equipment
CN107832319B (en) Heuristic query expansion method based on semantic association network
Sharaff et al. Analysing fuzzy based approach for extractive text summarization
CN109086386B (en) Data processing method, device, computer equipment and storage medium
JP5552981B2 (en) Index method, search method, and storage medium thereof
CN110209895B (en) Vector retrieval method, device and equipment
JP4219122B2 (en) Feature word extraction system
US20220179890A1 (en) Information processing apparatus, non-transitory computer-readable storage medium, and information processing method
CN113407702B (en) Employee cooperation relationship intensity quantization method, system, computer and storage medium
JP6495206B2 (en) Document concept base generation device, document concept search device, method, and program
CN113901278A (en) Data search method and device based on global multi-detection and adaptive termination
CN114995729A (en) Voice drawing method and device and computer equipment
CN111723179B (en) Feedback model information retrieval method, system and medium based on conceptual diagram
US20090319505A1 (en) Techniques for extracting authorship dates of documents
CN106682129B (en) Hierarchical concept vectorization increment processing method in personal big data management

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant