CN108399213A - User-oriented personal file clustering method and system - Google Patents
- Publication number
- CN108399213A (application CN201810112624.0A)
- Authority
- CN
- China
- Prior art keywords
- file
- cluster
- directory
- clustered
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/137—Hash-based
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Abstract
The present invention provides a clustering method for a user's personal files, comprising the steps of: grouping the user's files according to the user's habits in saving similar files, to obtain multiple file groups; clustering the files within each file group to obtain one or more local clusters, the files in each local cluster being similar in content; and treating each local cluster as a single file and clustering all local clusters to generate global clusters. The invention also provides a clustering system for a user's personal files, comprising a cluster calculation unit, a cluster result storage unit and a cluster result lookup unit, wherein the cluster calculation unit comprises a batch file cluster calculation unit and an incremental file cluster calculation unit.
Description
Technical Field
The present invention relates to the field of personal file management. It enables a user's files to be clustered by content, and in particular relates to file clustering based on the user's usage habits.
Background
With the spread of computer-based office work, users create and edit large numbers of files at work (for example WORD, WPS and PDF files). In existing file systems these files are stored in a tree structure, which differs markedly from the way people remember them: people tend to remember files by task (or topic). One consequence of this mismatch is that users must spend considerable effort finding and managing their files later, and the search becomes harder still when several versions of a file on the same topic are stored in different locations.
At present, research on personal information management (Personal Information Management) studies how to retrieve the information users generate, and various companies have developed search systems for user files, such as Google Desktop Search and Baidu desktop search. These tools all perform keyword-based file retrieval, however, and do not address the conversion of a "tree-structured file organization" into a "topic-structured file organization".
Files can be organized by topic using content-based clustering: all files are fed directly into a clustering algorithm, and each resulting cluster (set of files) serves as one file topic. Processing the files directly in this way ignores the user's habits in saving similar files (for example, files in the same directory are more likely to belong to the same cluster) and significantly increases the amount of unnecessary computation.
Summary of the Invention
The present invention proposes a clustering method and system for a user's personal files. The method clusters the user's personal files by exploiting the user's habits in saving and browsing files, and clusters efficiently.
To achieve the above object, the invention adopts the following technical solution:
A clustering method for a user's personal files, comprising the steps of:
(1) File grouping: grouping the user's files according to the user's habits in saving similar files, to obtain multiple file groups;
(2) Local cluster generation: clustering the files within each file group to obtain one or more local clusters, the files in each local cluster being similar in content;
(3) Global cluster generation: treating each local cluster as a single file and clustering all local clusters to generate global clusters.
Further, the user's files are grouped according to the file directory tree structure, file type, file name and the like.
Further, one method of grouping the user's files according to the file directory tree structure is: let NodeSet denote all directory nodes in the file directory tree and X_i denote the distance from directory node i to the root node; set a distance threshold parameter δ_X, and take the direct files contained in each directory node satisfying X_i > δ_X as a separate file group.
Further, another method of grouping the user's files according to the file directory tree structure is: let NodeSet denote all directory nodes in the file directory tree and Y_i denote the distance from directory node i to its farthest leaf node; set a distance threshold parameter δ_Y. The method comprises the steps of:
(1) based on the Y values of the directory nodes, selecting a directory node whose Y value is greater than δ_Y and closest to δ_Y;
(2) taking the direct files contained in the selected directory node as a separate file group;
(3) deleting the subtree rooted at the selected directory node from the directory tree;
(4) updating the Y values of all directory nodes in the directory tree;
(5) repeating steps (1)-(4) until no directory node has a Y value greater than δ_Y;
(6) taking all files contained in the remaining directory nodes as a separate file group.
Further, a direct file is a file under a directory node whose depth in the directory tree differs from the depth of that directory node by 1.
Further, the method of grouping the user's files according to file type and file name is: if the clustering goal requires all files in a cluster to have the same file type, the files may first be grouped by file type and then clustered by file name within each group, each resulting cluster being taken as a file group; if the clustering goal does not require the files in a cluster to have the same file type, the files may be clustered by file name directly, each resulting cluster being taken as a file group.
Further, one clustering method based on file names is: compute the edit distance between file name strings and place files whose edit distance is less than a set threshold in one file group.
Further, another clustering method based on file names is: segment each file name into words, represent the file name as a term vector, and compute the text distance between file names, the text distance being the Jaccard distance or the cosine distance; files whose similarity exceeds a set threshold (that is, whose text distance is sufficiently small) are placed in one file group.
Further, clustering files comprises first computing a file representation, then computing file similarity, and finally running a clustering algorithm.
Further, in the file representation of the global cluster generation stage, each local cluster is treated as a single file with a single representation: either one file in the local cluster is chosen to represent it (for example, the representation of the longest file in the local cluster is used as the representation of the local cluster), or the representation is computed from the representations of all files in the local cluster (for example, the representations of all files in the local cluster are concatenated).
Further, one method of file representation is: vectorize the file, each dimension of the vector corresponding to one part of the file; text files may be represented with the vector space model, and image files with image features or with a neural-network-based feature extraction method.
Further, another method of file representation is: compute a hash value of the file using a fuzzy hash method and use the hash value as the representation.
Further, one method of computing file similarity is: compute the Jaccard distance or the cosine distance between files.
Further, another method of computing file similarity is: compare the hash values of two files and convert the comparison result into a similarity.
Further, the clustering algorithm includes the k-means clustering algorithm and density-based clustering algorithms.
A clustering system for a user's personal files, comprising:
a cluster calculation unit for clustering the user's files;
a cluster result storage unit for storing the clustering result and keeping the mapping between files and clusters;
a cluster result lookup unit for returning the files of a given cluster and returning the cluster of a given file.
Further, the cluster calculation unit comprises:
a batch file cluster calculation unit for performing the initial clustering of the user's personal files;
an incremental file cluster calculation unit for clustering subsequently added files and merging the incremental clustering result with the existing clustering result.
Further, merging the incremental clustering result with the existing clustering result means clustering the incrementally obtained clusters and the existing clusters once more, using the clustering procedure of global cluster generation, to obtain the merged result.
The input of the method is a tree-structured file directory, and the output is a list of file clusters organized by topic, the files in each cluster being similar in content, as shown in Fig. 1.
The invention groups files using the file directory tree structure, file type and file name, clusters first within each file group and then globally, and thereby significantly improves computational efficiency compared with applying a clustering algorithm directly to all files. Suppose the total number of files is N, the number of file groups is D, the average number of files per group is N/D, the number of local clusters generated per file group is K, and the number of final clusters is M. Suppose the time complexity of the clustering algorithm is f(x), where x is the number of input files. The upper bound on the time complexity of file clustering with the method of the invention is then D·f(N/D) + f(DK), where D·f(N/D) is the time complexity of local cluster generation and f(DK) is an upper bound on the time complexity of global cluster generation. It is an upper bound because, during global cluster generation, the K local clusters obtained from the same file group need no similarity computation with one another (they cannot belong to the same cluster). Clustering all files directly with a conventional method has time complexity f(N). In practice, a suitable file grouping strategy can achieve K << N/D and K << M, which accelerates clustering.
Take the k-means clustering algorithm as an example. Its time complexity is f(x) = λTxM, where x is the number of input files, T is the number of k-means iterations, M is the number of clusters produced, and λ is the time to compare two files. Clustering all files directly therefore costs λTNM. With the method of the invention, local cluster generation costs D·λT(N/D)K = λTNK, because each group is clustered into K clusters, and global cluster generation costs at most λT(DK)M = λTDKM, for a total of λTNK + λTDKM. When 1 > K/M + DK/N, it follows that λTNM > λTNK + λTDKM, so file clustering is accelerated; the relative efficiency gain is 1 - (K/M + DK/N).
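The comparison above can be checked numerically. The following is a minimal sketch, assuming the k-means cost model f(x) = λTxM stated in the text; the function names and the sample values of N, D, K and M are illustrative assumptions, not figures from the patent.

```python
def direct_cost(n_files, n_final, iters=1, compare_cost=1.0):
    """Cost of clustering all files at once: lambda * T * N * M."""
    return compare_cost * iters * n_files * n_final


def two_stage_cost(n_files, n_groups, k_local, n_final, iters=1, compare_cost=1.0):
    """Upper bound lambda*T*N*K + lambda*T*D*K*M for grouped clustering."""
    # Local stage: D groups, each with N/D files clustered into K clusters.
    local = n_groups * (compare_cost * iters * (n_files / n_groups) * k_local)
    # Global stage: D*K local clusters clustered into M final clusters.
    global_stage = compare_cost * iters * (n_groups * k_local) * n_final
    return local + global_stage


def efficiency_gain(n_files, n_groups, k_local, n_final):
    """Relative saving 1 - (K/M + D*K/N); positive only if K/M + D*K/N < 1."""
    return 1.0 - (k_local / n_final + n_groups * k_local / n_files)


# Hypothetical sizes: N=10000 files, D=100 groups, K=5 local clusters, M=50 topics.
direct = direct_cost(10000, 50)
staged = two_stage_cost(10000, 100, 5, 50)
gain = efficiency_gain(10000, 100, 5, 50)
```

With these assumed sizes the staged cost is well below the direct cost, and the gain computed from the closed form agrees with 1 - staged/direct.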
After clustering is complete, the user can browse files using the file clusters (i.e. global clusters) generated by the method. Two browsing modes are available. One provides a file organization based on a topic directory structure: the user first browses the file clusters and then selects one cluster to browse the related files in it. The other provides a file association function: the user browses files using the traditional tree-structured directory, and clicking a file displays the related files in the cluster to which it belongs, thereby implementing file recommendation. In either mode, when the files of a cluster are displayed, the user can sort them by file size, file creation time or file modification time so that preferred files are browsed first.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a tree-structured file organization and the topic-structured file organization obtained with the method of the invention.
Fig. 2 is a flow chart of a clustering method for a user's personal files according to an embodiment.
Fig. 3 is an example of a file directory tree structure.
Fig. 4 is a structural diagram of a clustering system for a user's personal files according to an embodiment.
Detailed Description
To make the above features and advantages of the invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
This embodiment provides a clustering method for a user's personal files which, as shown in Fig. 2, comprises the following 3 steps:
1. File grouping
Files are grouped according to the user's habits in saving similar files. The goal is to ensure that the files in each file group fall into as few clusters as possible and that no file appears in more than one file group.
The following grouping strategies can be used:
1) Grouping by file directory tree structure:
Let NodeSet denote all directory nodes in the tree. For each directory node, compute its distance to the root node and its distance to its farthest leaf node, denoted X and Y respectively, so that for each directory node i the two distances can be written as <X_i, Y_i>.
Files are then grouped using X and Y:
Method 1: group files using X. The user sets a distance parameter δ_X, and the files contained in each directory node satisfying X_i > δ_X, i ∈ NodeSet, are taken as a file group, which completes the grouping.
Taking Fig. 3 as an example and setting the distance parameter δ_X to 2, the resulting file groups are {A0}, {A1.1}, {A2.1, A3.1, A3.2}, {B1.1} and {C1.1, C1.2}.
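Method 1 can be sketched as a single traversal of the directory tree. This is a minimal sketch under stated assumptions: `DirNode` and `group_by_depth` are hypothetical names, each qualifying directory contributes only its direct files (the reading given in the summary), and files in directories with X_i ≤ δ_X are left unassigned because the text does not say how they are handled. It is not intended to reproduce the Fig. 3 result.

```python
from dataclasses import dataclass, field


@dataclass
class DirNode:
    name: str
    files: list = field(default_factory=list)      # direct files of this directory
    children: list = field(default_factory=list)   # sub-directories


def group_by_depth(root, delta_x):
    """Return one file group per directory whose distance X to the root exceeds delta_x."""
    groups = []
    stack = [(root, 0)]  # (directory, X = number of edges from the root)
    while stack:
        node, x = stack.pop()
        if x > delta_x and node.files:
            groups.append(list(node.files))
        for child in node.children:
            stack.append((child, x + 1))
    return groups


# Hypothetical tree: root/{a/{b}, c}
tree = DirNode("root", ["r.txt"], [
    DirNode("a", ["a1.doc"], [DirNode("b", ["b1.doc", "b2.doc"])]),
    DirNode("c", ["c1.doc"]),
])
deep_groups = group_by_depth(tree, 1)
```

Because every directory is visited once and only its direct files are emitted, no file can appear in two groups, which matches the stated goal of the grouping step.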
Method 2: group files using Y. The user sets a distance parameter δ_Y. The procedure, which the original presents as pseudocode, consists of six main steps: (1) based on the Y values of the directory nodes, select a directory node whose Y value is greater than δ_Y and closest to δ_Y; (2) take the direct files contained in the selected directory node as a file group; (3) delete the subtree rooted at the selected directory node from the directory tree; (4) update the Y values of all directory nodes in the directory tree; (5) repeat steps (1)-(4) until no directory node has a Y value greater than δ_Y; (6) take all files contained in the remaining directory nodes as a separate file group.
Again taking Fig. 3 as an example and setting the distance parameter δ_Y to 2, the resulting file groups are {A0, A1.1}, {A2.1, A3.1, A3.2}, {B1.1} and {C1.1, C1.2}.
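The pseudocode referred to for Method 2 is not reproduced in this text, so the following is a hedged reconstruction from the six listed steps and not the patent's own code. Several points are assumptions: `DirNode`, `group_by_height` and the helper names are hypothetical; Y is taken as the number of edges down to the farthest descendant directory; ties in step (1) are broken in pre-order. The text speaks of direct files in step (2) but does not say where files in deeper directories of the deleted subtree go, so the sketch exposes a `direct_only` flag for both readings instead of choosing one.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove() drops exactly this node
class DirNode:
    name: str
    files: list = field(default_factory=list)      # direct files of this directory
    children: list = field(default_factory=list)   # sub-directories


def height(node):
    """Y value: edges from node down to its farthest descendant directory."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def walk(node, parent=None):
    """Yield (node, parent) pairs in pre-order."""
    yield node, parent
    for child in node.children:
        yield from walk(child, node)


def subtree_files(node):
    return [f for n, _ in walk(node) for f in n.files]


def group_by_height(root, delta_y, direct_only=True):
    """Steps (1)-(6); note that this prunes the tree passed in."""
    groups = []
    while root is not None:
        # Steps (1) and (4): Y values are recomputed on the pruned tree each round.
        over = [(height(n), n, p) for n, p in walk(root) if height(n) > delta_y]
        if not over:
            break  # step (5): no directory has Y > delta_y any more
        _, node, parent = min(over, key=lambda item: item[0])  # Y closest to delta_y
        # Step (2): emit one group for the chosen directory.
        files = list(node.files) if direct_only else subtree_files(node)
        if files:
            groups.append(files)
        # Step (3): delete the chosen subtree.
        if parent is None:
            root = None
        else:
            parent.children.remove(node)
    # Step (6): whatever is left forms one final group.
    if root is not None:
        rest = subtree_files(root)
        if rest:
            groups.append(rest)
    return groups
```

With `direct_only=True` the files of deeper directories in a deleted subtree are not assigned to any group, which follows the steps literally; an actual implementation would have to decide how to treat them.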
Compared with Method 1, Method 2 avoids producing file groups that contain too many files. The user may choose either of the two methods, or combine them, to group files.
2) Grouping by file name and file type:
If the clustering goal requires all files in a cluster to have the same file type, the files can first be grouped by file type and then clustered by file name within each type, each resulting cluster serving as a file group. If the clustering goal does not require the files in a cluster to share a type, the files can be clustered by file name directly. Because file names are short, this clustering is fast.
There are 2 methods of clustering by file name:
Method 1: compute the edit distance between file names and group files whose edit distance is less than a specified threshold; the threshold is set by the user as needed.
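A minimal sketch of Method 1 follows. The edit distance is the standard Levenshtein distance; the text does not say how pairwise decisions are turned into groups, so linking any two names whose distance is below the threshold and taking connected components (single linkage via union-find) is an assumption, as are the function names and sample file names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def group_by_edit_distance(names, threshold):
    """Group names connected by pairs with edit distance below the threshold."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if edit_distance(names[i], names[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


name_groups = group_by_edit_distance(
    ["report_v1.doc", "report_v2.doc", "budget.xls"], threshold=3)
```

The pairwise loop is quadratic in the number of names, which is acceptable here because this step runs on short file names within a group, as the text notes.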
Method 2: segment each file name into words, represent it as a term vector, and compute the text distance between file names, such as the Jaccard distance (see https://baike.baidu.com/item/%E6%9D%B0%E5%8D%A1%E5%BE%B7%E8%B7%9D%E7%A6%BB/15416212Fr=aladdin) or the cosine distance (see https://zh.wikipedia.org/zh-cn/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7); files whose similarity exceeds a set threshold are taken as one file group, the threshold being set by the user as needed.
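Method 2 can be sketched with token sets and the Jaccard distance. The tokenizer below is an assumption that only handles ASCII names by splitting on non-alphanumeric characters; Chinese file names would need a proper word segmenter, which the text does not name. The grouping rule (merge groups linked by any pair whose similarity exceeds the threshold) and all function names are likewise illustrative.

```python
import re


def tokens(filename):
    """Lower-cased word set of the file name without its extension."""
    stem = filename.rsplit(".", 1)[0]
    return {t for t in re.split(r"[^0-9a-z]+", stem.lower()) if t}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def group_by_name_tokens(names, min_similarity):
    """Group names linked by Jaccard similarity above min_similarity."""
    token_sets = [tokens(n) for n in names]
    groups = []  # each group is a list of indices into names
    for i in range(len(names)):
        linked = [g for g in groups
                  if any(1.0 - jaccard_distance(token_sets[i], token_sets[j]) > min_similarity
                         for j in g)]
        merged = [i]
        for g in linked:  # merge every group the new name is linked to
            merged = g + merged
            groups.remove(g)
        groups.append(merged)
    return [[names[i] for i in sorted(g)] for g in groups]


token_groups = group_by_name_tokens(
    ["q3_budget_draft.xlsx", "q3_budget_final.xlsx", "trip_photos.zip"], 0.4)
```

Here the two budget files share two of four distinct tokens, giving a similarity of 0.5, so they fall into one group at the assumed threshold of 0.4.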
3) Combining the 2 grouping strategies above:
Strategy 1) can be applied first and strategy 2) then applied to each resulting group; or strategy 2) can be applied first and strategy 1) then applied to each resulting group.
2. Local cluster generation
All files in each file group from step 1 are clustered to obtain local clusters, the files within a local cluster being similar in content. One file group can produce multiple local clusters.
3. Global cluster generation
Each local cluster obtained in step 2 is treated as a single file, and all local clusters are clustered to generate global clusters.
Specifically, each local cluster is treated as a single file with a single representation. The local cluster may be represented by one file in the cluster, for example by choosing the longest file as the representation of the cluster; or the representation may be computed from all files in the cluster, for example by concatenating the representations of all files in the cluster.
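Steps 2 and 3 can be sketched as a two-stage pipeline around a pluggable clustering routine. In this minimal sketch files are plain strings, and `cluster` is a hypothetical callable that takes a list of items and returns clusters as lists of indices; the function names are assumptions. For simplicity the sketch does not exploit the observation that local clusters from the same file group need not be compared with one another in the global stage.

```python
def longest_file(texts):
    """Represent a local cluster by its longest member."""
    return max(texts, key=len)


def concatenated(texts):
    """Represent a local cluster by the concatenation of all members."""
    return "\n".join(texts)


def two_stage_cluster(file_groups, cluster, represent=longest_file):
    """Cluster within each file group, then cluster the local clusters."""
    # Stage 1 (local cluster generation): cluster each file group separately.
    local_clusters = []
    for group in file_groups:
        for indices in cluster(group):
            local_clusters.append([group[i] for i in indices])

    # Stage 2 (global cluster generation): one representation per local cluster.
    representations = [represent(members) for members in local_clusters]
    global_clusters = []
    for indices in cluster(representations):
        merged = [f for i in indices for f in local_clusters[i]]
        global_clusters.append(merged)
    return global_clusters
```

Passing `represent=concatenated` switches to the second representation option described above without changing the rest of the pipeline. The same global stage could in principle be reused to merge incrementally obtained clusters with existing ones, since both are simply lists of local clusters.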
The clustering in steps 2 and 3 involves file representation, file similarity computation and running a clustering algorithm. The clustering algorithm may be the k-means clustering algorithm or a density-based clustering algorithm. The file representation and the similarity computation depend on the requirement; common requirements fall into the following two kinds:
1) The clustering goal is to gather files on the same topic, and the contents of the files in a cluster may differ considerably. For example, one may wish to gather articles in the "economy" category; the articles in the cluster may be about "international economy" or about "local economic development", even though the two do not discuss the same thing. For this requirement, the file representation and file similarity computation used for clustering are as follows:
File representation: vectorize the file, each dimension of the vector corresponding to one piece of the file's content. Text files can be represented with the vector space model (see https://en.wikipedia.org/wiki/Vector_space_model); image files can use image feature extraction methods or neural-network-based feature extraction methods.
File similarity computation: compute the distance between files, such as the Jaccard distance or the cosine distance.
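For text files, the vector space representation and the cosine distance can be sketched as follows. This is a minimal sketch assuming raw term counts and whitespace tokenization; weighting schemes such as TF-IDF and language-specific segmentation are left out, and the function names are illustrative.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse term-count vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def cosine_distance(u, v):
    """1 - cos(u, v); defined as 1.0 when either vector is empty."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


near = cosine_distance(term_vector("international economy outlook"),
                       term_vector("local economy outlook"))
far = cosine_distance(term_vector("international economy outlook"),
                      term_vector("holiday photo album"))
```

Two texts that share most of their words get a small distance, while texts with no words in common get the maximum distance of 1, which is the property topic clustering relies on.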
2) The clustering goal is to gather files of common origin, that is, files that can be transformed into one another by additions, deletions and modifications; such files are often the successive drafts a user produces on the way to a final file. For this requirement, the file representation and file similarity computation used for clustering are as follows:
File representation: compute a hash value of the file using fuzzy hashing (see http://dx.doi.org/10.1016/j.diin.2006.06.015). Fuzzy hashing is a context triggered piecewise hashing (CTPH) algorithm, which splits a file into pieces based on its content and hashes the pieces in order to judge how similar files are.
File similarity computation: compare the hash values of the two files and convert the comparison result into a similarity (see the fuzzy hashing algorithm, http://dx.doi.org/10.1016/j.diin.2006.06.015).
This embodiment also provides a clustering system for a user's personal files which, as shown in Fig. 4, comprises:
a cluster calculation unit for clustering the user's files, the cluster calculation unit comprising: a batch file cluster calculation unit for performing the initial clustering of the user's personal files; and an incremental file cluster calculation unit for clustering subsequently added files and, using the clustering procedure of global cluster generation, clustering the incrementally obtained clusters and the existing clusters once more to obtain a merged result, thereby merging the incremental clustering result with the existing clustering result;
a cluster result storage unit for storing the clustering result and keeping the mapping between files and clusters;
a cluster result lookup unit for returning the files of a given cluster and returning the cluster of a given file.
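The storage and lookup units amount to a mapping that can be queried in both directions. The following is a minimal in-memory sketch; the class name `ClusterIndex` and its methods are hypothetical, and a real system would presumably persist the mapping, which the text does not detail.

```python
class ClusterIndex:
    """Keeps the file-to-cluster mapping and answers lookups in both directions."""

    def __init__(self):
        self._cluster_of = {}  # file path -> cluster id
        self._files_in = {}    # cluster id -> set of file paths

    def assign(self, path, cluster_id):
        """Record (or update) the cluster a file belongs to."""
        old = self._cluster_of.get(path)
        if old is not None:
            self._files_in[old].discard(path)
            if not self._files_in[old]:
                del self._files_in[old]
        self._cluster_of[path] = cluster_id
        self._files_in.setdefault(cluster_id, set()).add(path)

    def files_in(self, cluster_id):
        """Return the files of a cluster (browse by topic)."""
        return sorted(self._files_in.get(cluster_id, ()))

    def cluster_of(self, path):
        """Return the cluster of a file, or None if the file is unknown."""
        return self._cluster_of.get(path)

    def related(self, path):
        """Return the other files in the same cluster (file recommendation)."""
        cluster_id = self.cluster_of(path)
        if cluster_id is None:
            return []
        return [p for p in self.files_in(cluster_id) if p != path]
```

Calling `assign` again for a known file moves it to the new cluster, which is the kind of update a merge of incremental and existing clustering results would require.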
The user's personal files are clustered by the above steps. After clustering is complete, the user can browse files using the global clusters generated by the method. Two browsing modes are available. One provides a file organization based on a topic directory structure: the user first browses the file clusters and then selects one cluster to browse the related files in it. The other provides a file association function: the user browses files using the traditional tree-structured directory, and clicking a file displays the related files in the cluster to which it belongs, thereby implementing file recommendation. In either mode, when the files of a cluster are displayed, the user can sort them by file size, file creation time or file modification time so that preferred files are browsed first.
The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention, and the scope of protection of the invention shall be as defined by the claims.
Claims (10)
1. A clustering method for a user's personal files, comprising the steps of:
grouping the user's files to obtain multiple file groups;
clustering the files in each file group to obtain one or more local clusters, the files in each local cluster being similar in content;
treating each local cluster as a single file and clustering all local clusters to generate global clusters.
2. The method according to claim 1, characterized in that the user's files are grouped according to the file directory tree structure, file type and file name.
3. The method according to claim 2, characterized in that the methods of grouping the user's files according to the file directory tree structure include:
Method one: let NodeSet denote all directory nodes in the file directory tree and X_i denote the distance from directory node i to the root node; set a distance threshold parameter δ_X, and take the direct files contained in each directory node satisfying X_i > δ_X as a separate file group;
Method two: let NodeSet denote all directory nodes in the file directory tree and Y_i denote the distance from directory node i to its farthest leaf node; set a distance threshold parameter δ_Y; the method comprises the steps of: based on the Y values of the directory nodes, selecting a directory node whose Y value is greater than δ_Y and closest to δ_Y; taking the direct files contained in that directory node as a separate file group; deleting the subtree rooted at that directory node from the directory tree; updating the Y values of all directory nodes in the directory tree; repeating the above steps until no directory node has a Y value greater than δ_Y; and taking all files contained in the remaining directory nodes as a separate file group.
4. The method according to claim 2, characterized in that the method of grouping the user's files according to file type and file name includes: if the clustering goal requires the files in a cluster to have the same file type, first grouping by file type and then clustering by file name within each group, each resulting cluster being taken as a file group; if the clustering goal does not require the files in a cluster to have the same file type, clustering the files by file name directly, each resulting cluster being taken as a file group.
5. The method according to claim 3, characterized in that the methods of clustering by file name include:
Method one: computing the edit distance between file name strings and placing files whose edit distance is less than a set threshold in one file group;
Method two: segmenting each file name into words, representing the file name as a term vector, computing the text distance between file names, and placing files whose text distance exceeds a set threshold in one file group, the text distance being the Jaccard distance or the cosine distance.
6. The method according to claim 1, characterized in that clustering files or local clusters comprises: first computing a file representation, then computing file similarity, and finally running a clustering algorithm.
7. The method according to claim 6, characterized in that the methods of file representation include:
Method one: vectorizing the file, each dimension of the vector corresponding to one part of the file, wherein text files may be represented with the vector space model and image files may use image features or a neural-network-based feature extraction method;
Method two: computing a hash value of the file using a fuzzy hash method and using the hash value as the representation.
8. The method according to claim 6, characterized in that the methods of computing file similarity include:
Method one: computing the Jaccard distance or the cosine distance between files;
Method two: comparing the hash values of two files and converting the comparison result into a similarity.
9. The method according to claim 6, characterized in that the clustering algorithm includes the k-means clustering algorithm and density-based clustering algorithms.
10. A clustering system for a user's personal files, comprising:
a cluster result storage unit for storing the clustering result and keeping the mapping between files and clusters;
a cluster result lookup unit for returning the files of a given cluster and returning the cluster of a given file;
a cluster calculation unit for clustering the user's files, comprising:
a batch file cluster calculation unit for performing the initial clustering of the user's personal files;
an incremental file cluster calculation unit for clustering subsequently added files and merging the incremental clustering result with the existing clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810112624.0A CN108399213B (en) | 2018-02-05 | 2018-02-05 | User-oriented personal file clustering method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810112624.0A CN108399213B (en) | 2018-02-05 | 2018-02-05 | User-oriented personal file clustering method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108399213A (en) | 2018-08-14 |
CN108399213B (en) | 2022-04-01 |
Family
ID=63096263
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810112624.0A Active CN108399213B (en) | 2018-02-05 | 2018-02-05 | User-oriented personal file clustering method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108399213B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933707A (en) * | 2018-10-31 | 2019-06-25 | 中国科学院信息工程研究所 | A kind of theme corpus construction method and system based on search engine |
CN110245209A (en) * | 2019-06-20 | 2019-09-17 | 贵州电网有限责任公司 | A method of extracting milestone event from mass text |
CN111209375A (en) * | 2020-01-13 | 2020-05-29 | 中国科学院信息工程研究所 | Universal clause and document matching method |
CN112632953A (en) * | 2020-12-22 | 2021-04-09 | 云汉芯城(上海)互联网科技股份有限公司 | Method for quickly and accurately detecting that multiple uploaded bill of materials belong to same product |
CN116363667A (en) * | 2023-04-26 | 2023-06-30 | 公安部信息通信中心 | Aggregation file theme identification and classification system |
CN117573875A (en) * | 2023-12-05 | 2024-02-20 | 安芯网盾(北京)科技有限公司 | Method and device for optimizing homonymy file clustering algorithm |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030115173A1 (en) * | 2001-12-17 | 2003-06-19 | Tsung-Yuan Liu | Intelligent document management and usage method |
CN1773492A (en) * | 2004-11-09 | 2006-05-17 | 国际商业机器公司 | Method for organizing multi-file and equipment for displaying multi-file |
CN101004737A (en) * | 2007-01-24 | 2007-07-25 | 贵阳易特软件有限公司 | Individualized document processing system based on keywords |
CN105095209A (en) * | 2014-04-21 | 2015-11-25 | 北京金山网络科技有限公司 | Document clustering method, document clustering device and network equipment |
CN105574005A (en) * | 2014-10-10 | 2016-05-11 | 富士通株式会社 | Device and method for clustering source data containing a plurality of documents |
CN106294568A (en) * | 2016-07-27 | 2017-01-04 | 北京明朝万达科技股份有限公司 | A kind of Chinese Text Categorization rule generating method based on BP network and system |
-
2018
- 2018-02-05 CN CN201810112624.0A patent/CN108399213B/en active Active
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933707A (en) * | 2018-10-31 | 2019-06-25 | 中国科学院信息工程研究所 | A kind of theme corpus construction method and system based on search engine |
CN109933707B (en) * | 2018-10-31 | 2022-10-14 | 中国科学院信息工程研究所 | Topic corpus construction method and system based on search engine |
CN110245209A (en) * | 2019-06-20 | 2019-09-17 | 贵州电网有限责任公司 | A method of extracting milestone event from mass text |
CN110245209B (en) * | 2019-06-20 | 2022-09-23 | 贵州电网有限责任公司 | Method for extracting milestone events from massive texts |
CN111209375A (en) * | 2020-01-13 | 2020-05-29 | 中国科学院信息工程研究所 | Universal clause and document matching method |
CN111209375B (en) * | 2020-01-13 | 2023-01-17 | 中国科学院信息工程研究所 | Universal clause and document matching method |
CN112632953A (en) * | 2020-12-22 | 2021-04-09 | 云汉芯城(上海)互联网科技股份有限公司 | Method for quickly and accurately detecting that multiple uploaded bill of materials belong to same product |
CN112632953B (en) * | 2020-12-22 | 2023-07-25 | 云汉芯城(上海)互联网科技股份有限公司 | Method for rapidly and accurately detecting that multiple uploaded bill of materials belongs to same product |
CN116363667A (en) * | 2023-04-26 | 2023-06-30 | 公安部信息通信中心 | Aggregation file theme identification and classification system |
CN116363667B (en) * | 2023-04-26 | 2023-10-13 | 公安部信息通信中心 | Aggregation file theme identification and classification system |
CN117573875A (en) * | 2023-12-05 | 2024-02-20 | 安芯网盾(北京)科技有限公司 | Method and device for optimizing homonymy file clustering algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN108399213B (en) | 2022-04-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108399213A (en) | | A kind of clustering method and system of user oriented personal document |
KR102046096B1 (en) | | Resource efficient document search |
JP6216467B2 (en) | | Visual-semantic composite network and method for forming the network |
US20050203943A1 (en) | | Personalized classification for browsing documents |
JP2009093652A (en) | | Automatic generation of hierarchy of terms |
JPWO2014109127A1 (en) | | Index generation apparatus and method, search apparatus, and search method |
Eda et al. | | The effectiveness of latent semantic analysis for building up a bottom-up taxonomy from folksonomy tags |
KR101055306B1 (en) | | Web service based content management system |
JP5929532B2 (en) | | Event detection apparatus, event detection method, and event detection program |
JP5470082B2 (en) | | Information storage search method and information storage search program |
Li et al. | | Ranking weighted clustering coefficient in large dynamic graphs |
Li et al. | | On mining webclick streams for path traversal patterns |
Kumar et al. | | Personalized web search engine using dynamic user profile and clustering techniques |
Shrivastava et al. | | Comparison between K-mean and C-mean clustering for CBIR |
CN110162580A (en) | | Data mining and depth analysis method and application based on distributed early warning platform |
Lu et al. | | An Algorithm of Top-k High Utility Itemsets Mining over Data Stream. |
Mohotti et al. | | An efficient ranking-centered density-based document clustering method |
JP2008210335A (en) | | Consciousness system construction system, consciousness system construction method, and consciousness system construction program |
JP2011170666A (en) | | Retrieval device |
KR20200094674A (en) | | Method and device for grape spasification using edge prunning |
Dichev et al. | | A study on community formation in collaborative tagging systems |
Krylov et al. | | Metrized Small World Properties Based Data Structure. |
Ravakhah et al. | | Semantic similarity based focused crawling |
JP2011170460A (en) | | Information accumulation retrieval method and information accumulation retrieval program |
Aiello et al. | | Tagging with DHARMA, a DHT-based Approach for Resource Mapping through Approximation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||