CN108399213A - User-oriented personal file clustering method and system - Google Patents

User-oriented personal file clustering method and system

Info

Publication number
CN108399213A
CN108399213A CN201810112624.0A CN201810112624A CN108399213A CN 108399213 A CN108399213 A CN 108399213A CN 201810112624 A CN201810112624 A CN 201810112624A CN 108399213 A CN108399213 A CN 108399213A
Authority
CN
China
Prior art keywords
file
cluster
directory
clustered
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810112624.0A
Other languages
Chinese (zh)
Other versions
CN108399213B (en)
Inventor
李鹏
王斌
齐保元
周美林
郭莉
梅钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201810112624.0A
Publication of CN108399213A
Application granted
Publication of CN108399213B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The present invention provides a user-oriented personal file clustering method comprising the steps of: grouping the user's files according to the user's habits in saving similar files, obtaining multiple file groups; clustering the files within each file group to obtain one or more local clusters, the files in each local cluster having similar content; and treating each local cluster as a single file and clustering all local clusters to generate global clusters. The invention also provides a user-oriented personal file clustering system comprising a cluster calculation unit, a cluster result storage unit, and a cluster result lookup unit, wherein the cluster calculation unit comprises a batch file cluster calculation unit and an incremental file cluster calculation unit.

Description

User-oriented personal file clustering method and system
Technical field
The present invention relates to the field of personal file management for users. It enables user files to be clustered by content, and relates in particular to file clustering based on the user's usage habits.
Background
With the spread of computer-based office applications, users create and edit large numbers of files at work (such as WORD, WPS, and PDF files). Under the design of existing file systems, these files are stored in the computer in a tree structure, which differs markedly from the way people retain (remember) files: people tend to remember files organized by task (or topic). One consequence of this difference is that users must spend considerable effort when later searching for and managing files; in particular, when multiple versions of a file on the same topic are stored in different locations, later lookup becomes even harder.
At present, research in personal information management (Personal Information Management) studies how to retrieve the information that users generate, and various companies have developed retrieval systems for user files, such as Google Desktop Search and Baidu Desktop Search. These tools, however, all perform keyword-based file retrieval and do not address the conversion from "file organization based on a tree structure" to "file organization based on a topic structure".
Organizing files by topic can be done with content-based clustering: all files are fed directly into a clustering algorithm, and each resulting cluster (a set of files) serves as one document topic. Processing the files directly in this way ignores the user's habits in saving similar files (for example, files in the same directory are more likely to belong to the same cluster) and also significantly increases unnecessary computation.
Summary of the invention
The present invention proposes a user-oriented personal file clustering method and system. The method clusters a user's personal files by exploiting the user's habits in saving and browsing files, which makes the clustering efficient.
To achieve the above object, the present invention adopts the following technical solution:
A user-oriented personal file clustering method, comprising the steps of:
(1) File grouping: group the user's files according to the user's habits in saving similar files, obtaining multiple file groups;
(2) Local cluster generation: cluster the files within each file group to obtain one or more local clusters, the files in each local cluster having similar content;
(3) Global cluster generation: treat each local cluster as a single file and cluster all local clusters to generate global clusters.
Further, the user's files are grouped according to the file directory tree structure, file type, file name, and the like.
Further, one method of grouping user files according to the file directory tree structure is: let NodeSet denote all directory nodes in the file directory tree and let Xi denote the distance from directory node i to the root node; set a distance threshold parameter δX, and take the direct files contained in each directory node satisfying Xi>δX as a separate file group.
Further, another method of grouping user files according to the file directory tree structure is: let NodeSet denote all directory nodes in the file directory tree and let Yi denote the distance from directory node i to its farthest leaf node; set a distance threshold parameter δY; the method comprises the steps of:
(1) based on the Y values of the directory nodes, select a directory node whose Y value is greater than δY and closest to δY;
(2) take the direct files contained in the selected directory node as a separate file group;
(3) delete the subtree corresponding to the selected directory node from the directory tree;
(4) update the Y values of all directory nodes in the directory tree;
(5) repeat steps (1)-(4) until no directory node has a Y value greater than δY;
(6) take all files contained in the remaining directory nodes as a separate file group.
Further, a direct file is a file located under a directory node whose depth in the directory tree differs from the depth of that directory node by 1.
Further, the method of grouping user files according to file type and file name is: if the clustering target requires files within a cluster to have the same file type, first group the files by file type, then cluster by file name within each group, taking each resulting cluster as a file group; if the clustering target does not require files within a cluster to have the same file type, cluster the files directly by file name, taking each resulting cluster as a file group.
Further, one file-name-based clustering method is: compute the edit distance between file name strings, and place files whose edit distance is less than a set threshold in the same file group.
Further, another file-name-based clustering method is: segment each file name into words, represent the file name as a term vector, compute the text distance between file names, and place files whose text distance is greater than a set threshold in the same file group; the text distance is the Jaccard distance or the cosine distance.
Further, clustering files comprises the steps of first representing the files, then computing file similarity, and finally running a clustering algorithm.
Further, in the file representation of the global cluster generation stage, each local cluster is treated as a single file with a single representation: either one file in the local cluster is selected as the representation (for example, the representation of the longest file in the local cluster is used as the representation of the local cluster), or the representation is computed from the representations of all files in the local cluster (for example, the representations of all files in the local cluster are concatenated to form the representation of the local cluster).
Further, one file representation method is: vectorize the file, with each dimension of the vector corresponding to a part of the file; text files may be represented with the vector space model, and image files may use image features or neural-network-based feature extraction.
Further, another file representation method is: compute a hash value of the file using fuzzy hashing and use the hash value as the representation.
Further, one file similarity computation method is: compute the Jaccard distance or the cosine distance between files.
Further, another file similarity computation method is: compare the hash values of two files and convert the hash values into a similarity.
Further, the clustering algorithms include the kmeans clustering algorithm and density-based clustering algorithms.
A user-oriented personal file clustering system, comprising:
a cluster calculation unit, for clustering the user's files;
a cluster result storage unit, for storing the clustering results and retaining the mapping between files and clusters;
a cluster result lookup unit, for returning files given a cluster and returning a cluster given a file.
Further, the cluster calculation unit comprises:
a batch file cluster calculation unit, for performing the initial clustering of the user's personal files;
an incremental file cluster calculation unit, for clustering subsequently added files and merging the incremental clustering result with the existing clustering result.
Further, merging the incremental clustering result with the existing clustering result means using the clustering procedure of global cluster generation to cluster, in one pass, the clusters obtained from the incremental files together with the existing clusters, yielding the merged result.
The input of the method is a tree-structured file directory, and the output is a list of file clusters organized by topic, the files in each cluster having similar content, as shown in Fig. 1.
The present invention groups files using the directory tree structure, file type, and file name, clusters first within each file group, and then performs global clustering. Compared with applying a clustering algorithm directly to all files, this can significantly improve the computational efficiency of clustering. Let the total number of files be N, the number of file groups be D (so that each file group contains N/D files on average), the number of local clusters generated per file group be K, and the final number of clusters be M. Let the time complexity of the clustering algorithm be f(x), where x is the number of files input to the clustering. The time complexity of file clustering with the method of the invention is then bounded above by Df(N/D)+f(DK), where Df(N/D) is the time complexity of local cluster generation and f(DK) is an upper bound on the time complexity of global cluster generation. It is an upper bound because, during global cluster generation, the K local clusters obtained from the same file group need no similarity computation among themselves (they certainly do not belong to the same cluster). By contrast, clustering all files directly with the conventional approach has time complexity f(N). In practice, a reasonable file grouping strategy can ensure K<<N/D and K<<M, which accelerates the clustering. For example, with the Kmeans clustering algorithm the time complexity is f(x)=λTxM, where x is the number of files input to the clustering, T is the number of Kmeans iterations, λ is the time to compare two files, and M is the number of clusters produced. Clustering all files directly then has time complexity λTNM, whereas the method of the invention has time complexity λTNK+λTDKM (D·λT(N/D)K for the local stage, which produces K clusters per group, plus λT(DK)M for the global stage). When 1>K/M+DK/N holds, λTNM>λTNK+λTDKM, so file clustering is accelerated, with an efficiency improvement of 1-(K/M+DK/N).
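The cost comparison above can be checked numerically. The following Python sketch is an illustration only: the function names and the sample values of N, D, K, and M are assumptions chosen for the example, and the cost model is the Kmeans formula f(x)=λTxM from the text, with the number of clusters passed in explicitly.

```python
def kmeans_cost(x, clusters, iterations=1, compare_time=1.0):
    """Cost model from the text: lambda * T * x * (number of clusters)."""
    return compare_time * iterations * x * clusters


def direct_cost(n, m, iterations=1, compare_time=1.0):
    """Cluster all N files into M clusters in one pass: lambda*T*N*M."""
    return kmeans_cost(n, m, iterations, compare_time)


def two_stage_cost(n, d, k, m, iterations=1, compare_time=1.0):
    """Upper bound for the grouped method: lambda*T*N*K + lambda*T*D*K*M."""
    local = d * kmeans_cost(n / d, k, iterations, compare_time)
    global_stage = kmeans_cost(d * k, m, iterations, compare_time)
    return local + global_stage


def efficiency_gain(n, d, k, m):
    """Relative saving 1 - (K/M + D*K/N) stated in the text."""
    return 1 - (k / m + d * k / n)


# Hypothetical sizes: 10,000 files, 100 groups, 2 local clusters per group,
# 50 final clusters.
N, D, K, M = 10_000, 100, 2, 50
print(direct_cost(N, M))            # cost of clustering everything at once
print(two_stage_cost(N, D, K, M))   # cost bound for the grouped approach
print(efficiency_gain(N, D, K, M))  # relative saving
```

With these assumed sizes the condition 1>K/M+DK/N holds, so the grouped approach is cheaper under this cost model; the saving shrinks as K approaches M or as DK approaches N.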
After clustering is complete, the user can browse files using the file clusters (i.e., the global clusters) generated by the method. Two browsing modes are available. One provides a file organization based on a topic directory structure: the user first browses the file clusters and then selects one to browse the related files within it. The other provides a file association function: the user browses files using the traditional tree-structured directory, and clicking a file displays the related files in that file's cluster, thereby providing file recommendation. In both browsing modes, the user can sort the displayed files by file size, file creation time, or file modification time, so that files of interest are browsed first.
Description of the drawings
Fig. 1 is a schematic diagram of obtaining a topic-structured file organization from a tree-structured file organization using the method of the invention.
Fig. 2 is a flowchart of the user-oriented personal file clustering method of an embodiment.
Fig. 3 is an example of a file directory tree structure.
Fig. 4 is a structural diagram of the user-oriented personal file clustering system of an embodiment.
Detailed description
To make the above features and advantages of the invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
This embodiment provides a user-oriented personal file clustering method which, as shown in Fig. 2, comprises the following 3 steps:
1. File grouping
Files are grouped according to the user's habits in saving similar files. The goal is to ensure that the files in each file group fall into as few clusters as possible and that no file appears in more than one file group.
The following grouping strategies can be used:
1) Grouping by the file directory tree structure:
Let NodeSet denote all directory nodes in the tree. For each directory node, compute its "distance to the root node" and its "distance to the farthest leaf node", denoted X and Y respectively; the two distances of each directory node i can then be written as <Xi,Yi>.
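Both distances can be obtained in a single traversal of the directory tree. The sketch below is a minimal illustration under stated assumptions: the tree is given as a hypothetical dictionary mapping each directory to its subdirectories, distances are counted in edges, and a leaf node is taken to be a directory with no subdirectories. The function name node_distances is not from the source.

```python
def node_distances(children, root):
    """Return {node: (X, Y)} for every directory node reachable from root.

    X is the distance to the root; Y is the distance to the farthest
    leaf (assumed here: a directory without subdirectories).
    """
    distances = {}

    def visit(node, depth):
        height = 0
        for child in children.get(node, []):
            # A node's height is one more than its tallest child's height.
            height = max(height, 1 + visit(child, depth + 1))
        distances[node] = (depth, height)
        return height

    visit(root, 0)
    return distances


# Hypothetical tree: root has subdirectories A and B; A contains A1,
# which contains A2.
tree = {"root": ["A", "B"], "A": ["A1"], "A1": ["A2"]}
for node, (x, y) in sorted(node_distances(tree, "root").items()):
    print(node, x, y)
```

The recursive form keeps the sketch short; for very deep directory trees an explicit stack would avoid Python's recursion limit.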
X and Y are then used to group the files:
Method 1: Group files using X. The user sets a distance parameter δX, and the files contained in each directory node satisfying Xi>δX, i ∈ NodeSet, form a file group, which completes the grouping.
Taking Fig. 3 as an example, with the distance parameter δX set to 2, the resulting file groups are {A0}, {A1.1}, {A2.1, A3.1, A3.2}, {B1.1}, {C1.1, C1.2}.
Method 2: Group files using Y. The user sets a distance parameter δY. The pseudocode of the procedure is as follows:
As the pseudocode shows, the algorithm consists mainly of the following steps: (1) based on the Y values of the directory nodes, select a directory node whose Y value is greater than δY and closest to δY; (2) take the direct files contained in the selected directory node as a file group; (3) delete the subtree corresponding to the selected directory node from the directory tree; (4) update the Y values of all directory nodes in the directory tree; (5) repeat steps (1)-(4) until no directory node has a Y value greater than δY; (6) take all files contained in the remaining directory nodes as a separate file group.
Again taking Fig. 3 as an example, with the distance parameter δY set to 2, the resulting file groups are {A0, A1.1}, {A2.1, A3.1, A3.2}, {B1.1}, {C1.1, C1.2}.
Compared with Method 1, Method 2 avoids producing file groups that contain too many files. The user may choose either of the two methods, or combine them, to group files.
2) Grouping by file name and file type:
If the clustering target requires files within a cluster to have the same file type, the files can first be grouped by file type and then clustered by file name within each type, with each resulting cluster serving as a file group. If the clustering target does not require files within a cluster to have the same type, the files can be clustered directly by file name. Because file names are short, this clustering is fast.
File-name-based clustering can be done in 2 ways:
Method 1: Compute the edit distance between file names and place files whose edit distance is less than a specified threshold in one group; the threshold is set by the user as needed.
Method 2: Segment each file name into words, represent the file name as a term vector, and compute the text distance between file names, such as the Jaccard distance (see https://baike.baidu.com/item/%E6%9D%B0%E5%8D%A1%E5%BE%B7%E8%B7%9D%E7%A6%BB/15416212Fr=aladdin) or the cosine distance (see https://zh.wikipedia.org/zh-cn/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7); files whose similarity exceeds a given threshold form a file group, with the threshold set by the user as needed.
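The two file-name measures can be sketched as follows. This is an illustration under assumptions rather than the patented procedure: tokens are obtained by splitting on non-alphanumeric characters (Chinese file names would need a word segmenter instead), a small distance is treated as high similarity, and the greedy single-pass grouping in group_names is a simple stand-in for whichever clustering routine is actually used. All function names and sample file names are hypothetical.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def tokenize(filename):
    """Split a file name into a set of lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^0-9a-z]+", filename.lower()) if t}


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the two names' token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def group_names(names, close_enough):
    """Greedy single pass: add each name to the first group that already
    holds a name judged close enough, else start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if any(close_enough(name, other) for other in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["report_v1.doc", "report_v2.doc", "budget.xls", "report_v3.doc"]
print(group_names(names, lambda a, b: edit_distance(a, b) < 3))
print(group_names(names, lambda a, b: jaccard_distance(a, b) < 0.6))
```

The thresholds 3 and 0.6 are arbitrary values for the example; as the text notes, they would be set by the user as needed.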
3) Grouping by combining the above 2 strategies:
Strategy 1) can be applied first, and strategy 2) then applied to each resulting group; or strategy 2) can be applied first, and strategy 1) then applied to each resulting group.
2. Local cluster generation
All files in each file group from step 1 are clustered to obtain local clusters; the files within a local cluster have similar content. One file group may produce multiple local clusters.
3. Global cluster generation
Each local cluster obtained in step 2 is treated as a single file, and all local clusters are clustered to generate global clusters.
Specifically, each local cluster is treated as a single file with a single representation. The local cluster may be represented by one file in the cluster, for example by selecting the longest file as the representation of the cluster; or its representation may be computed from the representations of all files in the cluster, for example by concatenating the representations of all files in the cluster.
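The two ways of representing a local cluster can be illustrated with a short sketch. It assumes, for simplicity, that a file's representation is its text content and that a local cluster is a dictionary from file name to content; both choices, and the function names, are assumptions made for the example.

```python
def represent_by_longest(local_cluster):
    """Use the longest file in the cluster as the cluster's representation."""
    return max(local_cluster.values(), key=len)


def represent_by_concatenation(local_cluster):
    """Concatenate all files in the cluster, in file-name order."""
    return "\n".join(local_cluster[name] for name in sorted(local_cluster))


# Hypothetical local cluster holding two drafts of the same plan.
local_cluster = {
    "plan_v1.txt": "budget outline",
    "plan_v2.txt": "budget outline with revised figures",
}
print(represent_by_longest(local_cluster))
print(represent_by_concatenation(local_cluster))
```

Either result can then be passed to the same representation and similarity steps used for ordinary files, so the global stage needs no separate clustering code.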
The clustering in steps 2 and 3 involves file representation, file similarity computation, and running a clustering algorithm. The clustering algorithm can be the kmeans clustering algorithm or a density-based clustering algorithm. File representation and file similarity computation are chosen according to the requirement; common requirements fall into the following two kinds:
1) The clustering target is to gather files on the same topic, and the content of files within a cluster may differ considerably. For example, one may wish to gather articles in the "economy" category; the cluster may contain an article on "international economy" as well as an article on "local economic development", even though the two articles do not discuss the same thing. For this requirement, the file representation and file similarity computation used for clustering are as follows:
File representation: vectorize the file, with each dimension corresponding to a piece of the file's content. Text files can be represented with the vector space model (see https://en.wikipedia.org/wiki/Vector_space_model); image files can use image feature extraction methods or neural-network-based feature extraction methods.
File similarity computation: compute the distance between files, such as the Jaccard distance or the cosine distance.
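For text files, a vector space representation and the cosine distance can be sketched as below. This is a minimal illustration assuming raw term-frequency weights and a simple regular-expression tokenizer; practical systems often use TF-IDF weights and a language-specific tokenizer, neither of which the source specifies.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent text as a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    if norm == 0:
        return 1.0  # at least one vector is empty: treat as maximally distant
    return 1.0 - dot / norm


a = term_vector("economy growth")
b = term_vector("economy policy")
print(cosine_distance(a, b))
```

Because only shared terms contribute to the dot product, two articles on different aspects of the same topic still end up closer to each other than to unrelated files.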
2) The clustering target is to gather same-source files. Same-source files are files that can be converted into one another through additions, deletions, and modifications; they are often the successive drafts a user produces on the way to a final file. For this requirement, the file representation and file similarity computation used for clustering are as follows:
File representation: compute a hash value of the file using fuzzy hashing (see http://dx.doi.org/10.1016/j.diin.2006.06.015). Fuzzy hashing is a content-triggered piecewise hashing algorithm (context triggered piecewise hashing, CTPH) that splits a file into pieces in order to judge the degree of similarity between files.
File similarity computation: compare the hash values of two files and convert them into a similarity (see the fuzzy hashing algorithm, http://dx.doi.org/10.1016/j.diin.2006.06.015).
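The idea behind piecewise hashing can be shown with a deliberately simplified sketch. It is not the CTPH algorithm of the cited reference: here piece boundaries are triggered by a rolling sum over a small window, each piece is hashed with MD5, and similarity is the share of piece hashes two files have in common, whereas the reference algorithm uses its own rolling hash and compares signature strings. The function names, window size, and modulus are all assumptions.

```python
import hashlib


def piecewise_signature(data, window=4, modulus=8):
    """Split bytes at content-defined trigger points and hash each piece."""
    pieces, start = [], 0
    for i in range(len(data)):
        rolling = sum(data[max(0, i - window + 1): i + 1])  # rolling window sum
        if rolling % modulus == modulus - 1:                # trigger point
            pieces.append(data[start: i + 1])
            start = i + 1
    if start < len(data):
        pieces.append(data[start:])                         # trailing piece
    return [hashlib.md5(piece).hexdigest()[:8] for piece in pieces]


def signature_similarity(sig_a, sig_b):
    """Share of piece hashes that two signatures have in common (0.0 to 1.0)."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


draft_1 = b"Quarterly report: revenue grew in all regions this year."
draft_2 = b"Quarterly report: revenue grew in most regions this year."
print(signature_similarity(piecewise_signature(draft_1),
                           piecewise_signature(draft_2)))
```

Because boundaries depend on local content rather than fixed offsets, an insertion or deletion tends to change only the pieces around the edit, which is what lets drafts of the same file keep overlapping signatures.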
This embodiment also provides a user-oriented personal file clustering system which, as shown in Fig. 4, comprises:
a cluster calculation unit, for clustering the user's files; the cluster calculation unit comprises a batch file cluster calculation unit, for performing the initial clustering of the user's personal files, and an incremental file cluster calculation unit, for clustering subsequently added files and then, using the clustering procedure of global cluster generation, clustering the clusters obtained from the incremental files together with the existing clusters in one pass to obtain the merged result, thereby merging the incremental clustering result with the existing clustering result;
a cluster result storage unit, for storing the clustering results and retaining the mapping between files and clusters;
a cluster result lookup unit, for returning files given a cluster and returning a cluster given a file.
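The storage and lookup units amount to a two-way mapping between files and clusters. The class below is a minimal in-memory sketch of that mapping; the class name ClusterIndex and its methods are hypothetical, and a real system would presumably persist the mapping rather than keep it in dictionaries.

```python
class ClusterIndex:
    """Two-way mapping between files and cluster identifiers."""

    def __init__(self):
        self._cluster_of = {}  # file -> cluster id
        self._files_in = {}    # cluster id -> set of files

    def assign(self, file, cluster_id):
        """Record that file belongs to cluster_id, replacing any old entry."""
        old = self._cluster_of.get(file)
        if old is not None:
            self._files_in[old].discard(file)
            if not self._files_in[old]:
                del self._files_in[old]
        self._cluster_of[file] = cluster_id
        self._files_in.setdefault(cluster_id, set()).add(file)

    def files_in(self, cluster_id):
        """Return the files of a cluster (lookup by cluster)."""
        return sorted(self._files_in.get(cluster_id, ()))

    def cluster_of(self, file):
        """Return the cluster of a file (lookup by file), or None."""
        return self._cluster_of.get(file)

    def related_files(self, file):
        """Return the other files in the same cluster as file."""
        cluster_id = self._cluster_of.get(file)
        if cluster_id is None:
            return []
        return [f for f in self.files_in(cluster_id) if f != file]


index = ClusterIndex()
index.assign("plan_v1.doc", 1)
index.assign("plan_v2.doc", 1)
index.assign("budget.xls", 2)
print(index.files_in(1))
print(index.related_files("plan_v1.doc"))
```

Keeping both directions up to date on every assignment makes each lookup a single dictionary access, and reassigning a file after an incremental merge only touches the two clusters involved.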
The user's personal files are clustered by the above steps. After clustering is complete, the user can browse files using the global clusters generated by the method. Two browsing modes are available. One provides a file organization based on a topic directory structure: the user first browses the file clusters and then selects one to browse the related files within it. The other provides a file association function: the user browses files using the traditional tree-structured directory, and clicking a file displays the related files in that file's cluster, thereby providing file recommendation. In both browsing modes, the user can sort the displayed files by file size, file creation time, or file modification time, so that files of interest are browsed first.
The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the invention or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention is defined by the claims.

Claims (10)

1. A user-oriented personal file clustering method, comprising the steps of:
grouping the user's files to obtain multiple file groups;
clustering the files within each file group to obtain one or more local clusters, the files in each local cluster having similar content;
treating each local cluster as a single file and clustering all local clusters to generate global clusters.
2. The method according to claim 1, characterized in that the user's files are grouped according to the file directory tree structure, file type, and file name.
3. The method according to claim 2, characterized in that the methods of grouping user files according to the file directory tree structure include:
Method one: let NodeSet denote all directory nodes in the file directory tree and let Xi denote the distance from directory node i to the root node; set a distance threshold parameter δX, and take the direct files contained in each directory node satisfying Xi>δX as a separate file group;
Method two: let NodeSet denote all directory nodes in the file directory tree and let Yi denote the distance from directory node i to its farthest leaf node; set a distance threshold parameter δY; the method comprises the steps of: based on the Y values of the directory nodes, selecting a directory node whose Y value is greater than δY and closest to δY; taking the direct files contained in that directory node as a separate file group; deleting the subtree corresponding to that directory node from the directory tree; updating the Y values of all directory nodes in the directory tree; repeating the above steps until no directory node has a Y value greater than δY; and taking all files contained in the remaining directory nodes as a separate file group.
4. The method according to claim 2, characterized in that the method of grouping user files according to file type and file name includes: if the clustering target requires files within a cluster to have the same file type, first grouping the files by file type, then clustering by file name within each group, and taking each resulting cluster as a file group; if the clustering target does not require files within a cluster to have the same file type, clustering the files directly by file name and taking each resulting cluster as a file group.
5. The method according to claim 3, characterized in that the file-name-based clustering methods include:
Method one: computing the edit distance between file name strings, and placing files whose edit distance is less than a set threshold in the same file group;
Method two: segmenting each file name into words, representing the file name as a term vector, computing the text distance between file names, and placing files whose text distance is greater than a set threshold in the same file group, the text distance being the Jaccard distance or the cosine distance.
6. The method according to claim 1, characterized in that clustering files or local clusters comprises the steps of: first representing the files, then computing file similarity, and finally running a clustering algorithm.
7. The method according to claim 6, characterized in that the file representation methods include:
Method one: vectorizing the file, with each dimension of the vector corresponding to a part of the file, wherein text files may be represented with the vector space model and image files may use image features or neural-network-based feature extraction;
Method two: computing a hash value of the file using fuzzy hashing and using the hash value as the representation.
8. The method according to claim 6, characterized in that the file similarity computation methods include:
Method one: computing the Jaccard distance or the cosine distance between files;
Method two: comparing the hash values of two files and converting the hash values into a similarity.
9. The method according to claim 6, characterized in that the clustering algorithms include the kmeans clustering algorithm and density-based clustering algorithms.
10. A user-oriented personal file clustering system, comprising:
a cluster result storage unit, for storing the clustering results and retaining the mapping between files and clusters;
a cluster result lookup unit, for returning files given a cluster and returning a cluster given a file;
a cluster calculation unit, for clustering the user's files, comprising:
a batch file cluster calculation unit, for performing the initial clustering of the user's personal files;
an incremental file cluster calculation unit, for clustering subsequently added files and merging the incremental clustering result with the existing clustering result.
CN201810112624.0A 2018-02-05 2018-02-05 User-oriented personal file clustering method and system Active CN108399213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810112624.0A CN108399213B (en) 2018-02-05 2018-02-05 User-oriented personal file clustering method and system

Publications (2)

Publication Number Publication Date
CN108399213A true CN108399213A (en) 2018-08-14
CN108399213B CN108399213B (en) 2022-04-01

Family

ID=63096263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810112624.0A Active CN108399213B (en) 2018-02-05 2018-02-05 User-oriented personal file clustering method and system

Country Status (1)

Country Link
CN (1) CN108399213B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933707A (en) * 2018-10-31 2019-06-25 中国科学院信息工程研究所 A kind of theme corpus construction method and system based on search engine
CN110245209A (en) * 2019-06-20 2019-09-17 贵州电网有限责任公司 A method of extracting milestone event from mass text
CN111209375A (en) * 2020-01-13 2020-05-29 中国科学院信息工程研究所 Universal clause and document matching method
CN112632953A (en) * 2020-12-22 2021-04-09 云汉芯城(上海)互联网科技股份有限公司 Method for quickly and accurately detecting that multiple uploaded bill of materials belong to same product
CN116363667A (en) * 2023-04-26 2023-06-30 公安部信息通信中心 Aggregation file theme identification and classification system
CN117573875A (en) * 2023-12-05 2024-02-20 安芯网盾(北京)科技有限公司 Method and device for optimizing homonymy file clustering algorithm

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030115173A1 (en) * 2001-12-17 2003-06-19 Tsung-Yuan Liu Intelligent document management and usage method
CN1773492A (en) * 2004-11-09 2006-05-17 国际商业机器公司 Method for organizing multi-file and equipment for displaying multi-file
CN101004737A (en) * 2007-01-24 2007-07-25 贵阳易特软件有限公司 Individualized document processing system based on keywords
CN105095209A (en) * 2014-04-21 2015-11-25 北京金山网络科技有限公司 Document clustering method, document clustering device and network equipment
CN105574005A (en) * 2014-10-10 2016-05-11 富士通株式会社 Device and method for clustering source data containing a plurality of documents
CN106294568A (en) * 2016-07-27 2017-01-04 北京明朝万达科技股份有限公司 A kind of Chinese Text Categorization rule generating method based on BP network and system

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933707A (en) * 2018-10-31 2019-06-25 中国科学院信息工程研究所 A kind of theme corpus construction method and system based on search engine
CN109933707B (en) * 2018-10-31 2022-10-14 中国科学院信息工程研究所 Topic corpus construction method and system based on search engine
CN110245209A (en) * 2019-06-20 2019-09-17 贵州电网有限责任公司 A method of extracting milestone event from mass text
CN110245209B (en) * 2019-06-20 2022-09-23 贵州电网有限责任公司 Method for extracting milestone events from massive texts
CN111209375A (en) * 2020-01-13 2020-05-29 中国科学院信息工程研究所 Universal clause and document matching method
CN111209375B (en) * 2020-01-13 2023-01-17 中国科学院信息工程研究所 Universal clause and document matching method
CN112632953A (en) * 2020-12-22 2021-04-09 云汉芯城(上海)互联网科技股份有限公司 Method for quickly and accurately detecting that multiple uploaded bill of materials belong to same product
CN112632953B (en) * 2020-12-22 2023-07-25 云汉芯城(上海)互联网科技股份有限公司 Method for rapidly and accurately detecting that multiple uploaded bill of materials belongs to same product
CN116363667A (en) * 2023-04-26 2023-06-30 公安部信息通信中心 Aggregation file theme identification and classification system
CN116363667B (en) * 2023-04-26 2023-10-13 公安部信息通信中心 Aggregation file theme identification and classification system
CN117573875A (en) * 2023-12-05 2024-02-20 安芯网盾(北京)科技有限公司 Method and device for optimizing homonymy file clustering algorithm

Also Published As

Publication number Publication date
CN108399213B (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN108399213A (en) A kind of clustering method and system of user oriented personal document
KR102046096B1 (en) Resource efficient document search
JP6216467B2 (en) Visual-semantic composite network and method for forming the network
US20050203943A1 (en) Personalized classification for browsing documents
JP2009093652A (en) Automatic generation of hierarchy of terms
JPWO2014109127A1 (en) Index generation apparatus and method, search apparatus, and search method
Eda et al. The effectiveness of latent semantic analysis for building up a bottom-up taxonomy from folksonomy tags
KR101055306B1 (en) Web service based content management system
JP5929532B2 (en) Event detection apparatus, event detection method, and event detection program
JP5470082B2 (en) Information storage search method and information storage search program
Li et al. Ranking weighted clustering coefficient in large dynamic graphs
Li et al. On mining webclick streams for path traversal patterns
Kumar et al. Personalized web search engine using dynamic user profile and clustering techniques
Shrivastava et al. Comparison between K-mean and C-mean clustering for CBIR
CN110162580A (en) Data mining and depth analysis method and application based on distributed early warning platform
Lu et al. An Algorithm of Top-k High Utility Itemsets Mining over Data Stream.
Mohotti et al. An efficient ranking-centered density-based document clustering method
JP2008210335A (en) Consciousness system construction system, consciousness system construction method, and consciousness system construction program
JP2011170666A (en) Retrieval device
KR20200094674A (en) Method and device for grape spasification using edge prunning
Dichev et al. A study on community formation in collaborative tagging systems
Krylov et al. Metrized Small World Properties Based Data Structure.
Ravakhah et al. Semantic similarity based focused crawling
JP2011170460A (en) Information accumulation retrieval method and information accumulation retrieval program
Aiello et al. Tagging with DHARMA, a DHT-based Approach for Resource Mapping through Approximation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant