CN108268620A - A document classification method based on Hadoop data mining - Google Patents
A document classification method based on Hadoop data mining
- Publication number
- CN108268620A (application number CN201810015666.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- document
- vector
- keyword
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document classification method based on Hadoop data mining, comprising: A, preprocessing the data documents and determining the keywords and the correspondence between each keyword and the document it belongs to; B, describing the attribute features of the data in the documents by means of attribute-feature conversion; C, generating a keyword vector from the keyword set using a matching rule, and generating a concept vector from the keyword vector and the set of data attribute features obtained in step B; D, calculating, from the keyword vectors and concept vectors of step C, the similarity between any two text documents among the data documents to be classified; E, performing a clustering-based classification operation on the attribute vectors to obtain their classification results, where the classification results indicate the category of the target object corresponding to each attribute vector; F, having Hadoop automatically collect these classification results and classify the data documents to be classified. The invention is easy to implement and achieves high classification accuracy.
Description
Technical field
The invention belongs to the technical field of data classification, and in particular relates to a document classification method based on Hadoop data mining.
Background art
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data, which makes it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.
With the rapid development of Internet technology, the number of network documents is growing explosively. This mass of documents makes it easy for users to obtain documents, but it also makes it very hard for them to find documents that are usable and that they actually want. Document classification is a technique for sorting documents efficiently: using sample documents that the user submits to a classification device, it quickly and accurately classifies the unclassified documents in a document library. Document classification in the prior art requires a very large number of text-similarity matching computations, whose time and space costs are difficult for a system to bear.
Summary of the invention
The purpose of the present invention is to provide a document classification method based on Hadoop data mining, so as to solve the problems raised in the background art above.
To achieve this purpose, the present invention provides the following technical solution: a document classification method based on Hadoop data mining, comprising the following steps:
A, preprocess the data documents, and determine each keyword in the data document library and the correspondence between each keyword and the document it belongs to;
B, describe the attribute features of the data in the documents by means of attribute-feature conversion;
C, using a matching rule, generate a keyword vector from the keyword set of the data documents obtained in step A, and generate a concept vector from the keyword vector and the set of data attribute features obtained in step B;
D, from the keyword vectors and concept vectors of step C, calculate the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;
E, perform a clustering-based classification operation on the attribute vectors from step D to obtain their classification results, where the classification results indicate the category of the target object corresponding to each attribute vector;
F, Hadoop automatically collects the attribute-vector classification results from step E, and the data documents to be classified are classified accordingly.
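Step A can be illustrated with a small keyword index. The following is a minimal sketch, not the patented implementation: the stop-word list, the tokenization rule, and the function names `extract_keywords` and `build_keyword_index` are assumptions introduced only for illustration, since the text does not say how keywords are extracted.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}

def extract_keywords(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_keyword_index(documents):
    """Map each keyword to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text):
            index[keyword].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}

documents = {
    "d1": "Hadoop stores data in a distributed file system",
    "d2": "Clustering groups the data by similarity",
}
index = build_keyword_index(documents)
print(index["data"])  # ids of the documents that contain "data"
```

The resulting dictionary is one possible form of the keyword-to-document correspondence; on a real cluster the same mapping would typically be built by a distributed job rather than in memory.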
Preferably, the matching method of the matching rule in step C comprises the following steps:
A, obtain a matching condition, which includes one or more of the following items of matching information: one or more query attributes, query attribute values, matching operations on query attribute values, or logical operations between multiple query attributes;
B, generate a matching tree from the matching condition; the matching tree records the query attribute values, the positions of the query attributes in the raw data, and the matching functions or logical operations used to match the query attributes;
C, hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and use that hash index value to find the matching content to be searched in a lookup table;
D, use the matching tree to find, within the content to be searched, the data that satisfies the matching condition.
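The hash lookup of step C can be sketched as follows. This is an illustrative sketch under stated assumptions: the use of CRC32, the table size `NUM_BUCKETS`, the record layout, and all function names are hypothetical, and step D is reduced to a single comparison rather than a full matching tree.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical table size

def hash_index(keyword):
    """Return a stable bucket index for a keyword (CRC32 is an assumption)."""
    return zlib.crc32(keyword.encode("utf-8")) % NUM_BUCKETS

def build_lookup_table(records):
    """Group records into buckets by the hash index of their keyword field."""
    table = [[] for _ in range(NUM_BUCKETS)]
    for record in records:
        table[hash_index(record["keyword"])].append(record)
    return table

def find_candidates(table, keyword):
    """Step C: fetch the bucket for the keyword, then drop hash collisions."""
    bucket = table[hash_index(keyword)]
    return [r for r in bucket if r["keyword"] == keyword]

records = [
    {"keyword": "hadoop", "doc": "d1", "count": 3},
    {"keyword": "cluster", "doc": "d2", "count": 1},
    {"keyword": "hadoop", "doc": "d3", "count": 7},
]
table = build_lookup_table(records)
candidates = find_candidates(table, "hadoop")
# Step D, reduced here to one comparison instead of a full matching tree.
matches = [r["doc"] for r in candidates if r["count"] > 5]
print(matches)
```

Restricting the later matching step to one bucket is what keeps the lookup cheap: only records that share the keyword's hash index are ever compared against the condition.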
Preferably, the clustering-based classification operation in step E comprises the following steps:
A, read the attribute vector data and obtain multiple preset cluster centers for the data to be processed;
B, classify the data to be processed according to the preset cluster centers to obtain the classified data;
C, establish multiple mergeable computation tasks from the classified data;
D, compute the mergeable computation tasks using multiple computation threads, and merge the computation results;
E, correct and save the preset cluster centers according to the merged results, and determine the clustering result from the preset cluster centers, the corrected cluster centers, and the number of correction operations.
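One way to read steps B to E is as a single k-means style correction in which each thread produces partial sums that can be merged. The sketch below follows that reading; the choice of squared Euclidean distance, the chunking scheme, and every function name are assumptions, not details taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor

def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def partial_sums(chunk, centers):
    """One mergeable task: per-cluster coordinate sums and counts for a chunk."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        k = nearest(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def merge(results, k, dim):
    """Merge step: add up the partial sums and counts from all tasks."""
    total = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for sums, cnts in results:
        for i in range(k):
            counts[i] += cnts[i]
            for d in range(dim):
                total[i][d] += sums[i][d]
    return total, counts

def update_centers(points, centers, workers=2):
    """Run one correction of the preset centers using several threads."""
    chunks = [points[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda chunk: partial_sums(chunk, centers), chunks))
    total, counts = merge(results, len(centers), len(centers[0]))
    # An empty cluster keeps its previous center.
    return [[s / n for s in total[i]] if n else list(centers[i])
            for i, n in enumerate(counts)]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
print(update_centers(points, centers))
```

Because sums and counts are additive, the merged result does not depend on how the points are split across tasks. In CPython, threads mainly illustrate the structure here; CPU-bound work would usually need processes or a distributed framework to run faster.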
Preferably, in step D of the clustering operation, the computer first preprocesses the data objects to be processed and divides them into groups; it then computes the similarity matrix of the data objects within each group, merges data objects according to their similarity to generate new data objects, records the merge process, and deletes the old data objects.
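The grouping and the intra-group similarity matrix can be sketched as below. This is a minimal sketch under assumptions: the group labels are taken as already computed by the preprocessing, cosine similarity stands in for the unspecified similarity measure, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_objects(object_ids, labels):
    """Preprocessing: collect object ids by their precomputed group label."""
    groups = defaultdict(list)
    for object_id, label in zip(object_ids, labels):
        groups[label].append(object_id)
    return dict(groups)

def similarity_matrix(ids, vectors):
    """Matrix of pairwise cosine similarities for the objects of one group."""
    return [[cosine(vectors[a], vectors[b]) for b in ids] for a in ids]

vectors = {"d1": (1.0, 0.0), "d2": (1.0, 1.0), "d3": (0.0, 2.0)}
groups = group_objects(["d1", "d2", "d3"], [0, 0, 1])
matrix = similarity_matrix(groups[0], vectors)
print(groups)
print(matrix)
```

Computing similarities only inside each group is what limits the cost: a group of m objects needs an m by m matrix instead of one matrix over the whole data set.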
Compared with the prior art, the beneficial effects of the present invention are:
1, the classification method used by the present invention is easy to implement and achieves high classification accuracy; the matching method it uses can filter, query, or match data;
2, a matching tree for matching data can be generated automatically from the matching condition, which addresses the diversity of query requirements and enables flexible data matching or filtering;
3, the clustering-based classification operation reduces the overall computational complexity and improves the stability of the computation; it also has a strong capability for analyzing the overall characteristics of the data and is suitable for fast clustering of massive data, which further improves the accuracy of document classification.
Brief description of the drawings
Fig. 1 is the overall classification flow chart of the present invention;
Fig. 2 is the flow chart of the matching method of the present invention;
Fig. 3 is the flow chart of the clustering-based classification operation of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the present invention provides a technical solution: a document classification method based on Hadoop data mining, comprising the following steps:
A, preprocess the data documents, and determine each keyword in the data document library and the correspondence between each keyword and the document it belongs to;
B, describe the attribute features of the data in the documents by means of attribute-feature conversion;
C, using a matching rule, generate a keyword vector from the keyword set of the data documents obtained in step A, and generate a concept vector from the keyword vector and the set of data attribute features obtained in step B;
D, from the keyword vectors and concept vectors of step C, calculate the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;
E, perform a clustering-based classification operation on the attribute vectors from step D to obtain their classification results, where the classification results indicate the category of the target object corresponding to each attribute vector;
F, Hadoop automatically collects the attribute-vector classification results from step E, and the data documents to be classified are classified accordingly.
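The pairwise similarity of step D can be illustrated with keyword vectors alone. The sketch below assumes term-frequency weighting and cosine similarity, neither of which is specified in the text, and it leaves out the concept vectors; the names `keyword_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def keyword_vector(keywords, vocabulary):
    """Term-frequency vector of one document over a fixed vocabulary."""
    counts = Counter(keywords)
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    "d1": ["hadoop", "hdfs", "cluster"],
    "d2": ["hadoop", "cluster", "cluster"],
    "d3": ["poetry", "novel"],
}
vocabulary = sorted({k for keywords in docs.values() for k in keywords})
vectors = {doc_id: keyword_vector(kws, vocabulary) for doc_id, kws in docs.items()}
# Similarity between any two documents, keyed by the pair of document ids.
similarity = {(a, b): cosine(vectors[a], vectors[b])
              for a, b in combinations(sorted(vectors), 2)}
for pair, value in similarity.items():
    print(pair, round(value, 3))
```

With n documents this loop evaluates n(n-1)/2 pairs, which is the cost that the grouping and clustering steps of the method are meant to keep manageable.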
As shown in Fig. 2, in the present invention, the matching method of the matching rule in step C comprises the following steps:
A, obtain a matching condition, which includes one or more of the following items of matching information: one or more query attributes, query attribute values, matching operations on query attribute values, or logical operations between multiple query attributes;
B, generate a matching tree from the matching condition; the matching tree records the query attribute values, the positions of the query attributes in the raw data, and the matching functions or logical operations used to match the query attributes;
C, hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and use that hash index value to find the matching content to be searched in a lookup table;
D, use the matching tree to find, within the content to be searched, the data that satisfies the matching condition.
The matching method can filter, query, or match data. Matching information is obtained from the matching condition for the raw data, and a matching tree is generated automatically; because the matching tree carries the matching information, it can be used to find the data in the raw data that satisfies the matching condition.
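A matching tree of this kind can be sketched with two node types: leaves that hold an attribute position, a query value, and a matching function, and branches that hold a logical operation. The condition format, the operation names in `MATCH_FUNCTIONS`, and the class and function names are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, List, Union

@dataclass
class Leaf:
    position: int                      # position of the attribute in a record
    value: Any                         # query attribute value
    match: Callable[[Any, Any], bool]  # matching function

@dataclass
class Branch:
    op: str                            # logical operation: "and" or "or"
    children: List[Union["Leaf", "Branch"]]

# Hypothetical names for the supported matching operations.
MATCH_FUNCTIONS = {"eq": operator.eq, "gt": operator.gt, "contains": operator.contains}

def build_tree(condition, positions):
    """Turn a nested condition into a matching tree.

    A condition is either (attribute, operation, value) or
    ("and" | "or", [sub-conditions]).
    """
    if condition[0] in ("and", "or"):
        return Branch(condition[0], [build_tree(c, positions) for c in condition[1]])
    attribute, operation, value = condition
    return Leaf(positions[attribute], value, MATCH_FUNCTIONS[operation])

def matches(node, record):
    """Evaluate the tree against one record (a tuple of attribute values)."""
    if isinstance(node, Leaf):
        return node.match(record[node.position], node.value)
    results = (matches(child, record) for child in node.children)
    return all(results) if node.op == "and" else any(results)

positions = {"title": 0, "year": 1}
condition = ("and", [("title", "contains", "Hadoop"), ("year", "gt", 2015)])
tree = build_tree(condition, positions)
rows = [("Hadoop in practice", 2017), ("Hadoop basics", 2012), ("Spark notes", 2018)]
print([row for row in rows if matches(tree, row)])
```

Because the tree is built from the condition at run time, a new query only needs a new condition tuple; the evaluation code stays the same, which is the flexibility the text attributes to the matching tree.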
As shown in Fig. 3, in the present invention, the clustering-based classification operation in step E comprises the following steps:
A, read the attribute vector data and obtain multiple preset cluster centers for the data to be processed;
B, classify the data to be processed according to the preset cluster centers to obtain the classified data;
C, establish multiple mergeable computation tasks from the classified data;
D, compute the mergeable computation tasks using multiple computation threads, and merge the computation results;
E, correct and save the preset cluster centers according to the merged results, and determine the clustering result from the preset cluster centers, the corrected cluster centers, and the number of correction operations.
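Step E decides when clustering is finished by looking at the previous centers, the corrected centers, and the number of corrections. A plausible reading, sketched below, stops when the centers move less than a tolerance or when a correction budget is used up; `tolerance`, `max_corrections`, and the function names are assumptions rather than values from the patent, and the sketch runs serially for clarity.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def corrected_centers(points, labels, centers):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centers.append(tuple(old))
    return new_centers

def cluster(points, centers, tolerance=1e-6, max_corrections=100):
    """Correct the centers until they stop moving or the budget is spent."""
    corrections = 0
    while corrections < max_corrections:
        labels = assign(points, centers)
        new_centers = corrected_centers(points, labels, centers)
        corrections += 1
        # Largest distance any center moved in this correction.
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return assign(points, centers), centers, corrections

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels, final_centers, corrections = cluster(points, [(1.0, 1.0), (9.0, 9.0)])
print(labels, final_centers, corrections)
```

The returned labels play the role of the classification results: each attribute vector is assigned the index of the cluster it ends up in.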
In step D of the clustering operation, the computer first preprocesses the data objects to be processed and divides them into groups; it then computes the similarity matrix of the data objects within each group, merges data objects according to their similarity to generate new data objects, records the merge process, and deletes the old data objects.
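The merge step can be sketched as a simple agglomerative loop over one group. The details are assumptions: cosine similarity, a fixed `threshold`, and averaging the two merged vectors are illustrative choices, and `merge_group` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_group(vectors, threshold=0.8):
    """Merge the most similar pair repeatedly until no pair reaches the threshold.

    Returns the remaining objects and the recorded merge history.
    """
    objects = dict(vectors)  # working copy; the input is left untouched
    history = []
    while len(objects) > 1:
        ids = sorted(objects)
        similarity, a, b = max((cosine(objects[a], objects[b]), a, b)
                               for i, a in enumerate(ids) for b in ids[i + 1:])
        if similarity < threshold:
            break
        merged_id = f"{a}+{b}"
        # Assumption: the new object is the plain average of the two old ones.
        merged = tuple((x + y) / 2 for x, y in zip(objects[a], objects[b]))
        history.append((a, b, merged_id, similarity))
        del objects[a], objects[b]  # delete the old data objects
        objects[merged_id] = merged
    return objects, history

vectors = {"d1": (1.0, 0.0), "d2": (0.9, 0.1), "d3": (0.0, 1.0)}
remaining, history = merge_group(vectors)
print(sorted(remaining))
print(history)
```

The history list is the record of the merge process: each entry names the two old objects, the new object, and the similarity at which they were merged.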
The classification method used by the present invention is easy to implement and achieves high classification accuracy. The matching method it uses can filter, query, or match data, and a matching tree for matching data can be generated automatically from the matching condition, which addresses the diversity of query requirements and enables flexible data matching or filtering. The clustering-based classification operation reduces the overall computational complexity and improves the stability of the computation; it also has a strong capability for analyzing the overall characteristics of the data and is suitable for fast clustering of massive data, which further improves the accuracy of document classification.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.
Claims (4)
1. A document classification method based on Hadoop data mining, characterized in that it comprises the following steps:
A, preprocessing the data documents, and determining each keyword in the data document library and the correspondence between each keyword and the document it belongs to;
B, describing the attribute features of the data in the documents by means of attribute-feature conversion;
C, using a certain matching rule, generating a keyword vector from the keyword set of the data documents obtained in step A, and generating a concept vector from the keyword vector and the set of data attribute features obtained in step B;
D, from the keyword vectors and concept vectors of step C, calculating the similarity between any two text documents among the data documents to be classified, and identifying the values of at least one stable attribute of each document as its attribute vector;
E, performing a clustering-based classification operation on the attribute vectors from step D to obtain their classification results, where the classification results indicate the category of the target object corresponding to each attribute vector;
F, using Hadoop to automatically collect the attribute-vector classification results from step E, and classifying the data documents to be classified.
2. The document classification method based on Hadoop data mining according to claim 1, characterized in that the matching rule in step C comprises the following steps:
A, obtaining a matching condition, which includes one or more of the following items of matching information: one or more query attributes, query attribute values, matching operations on query attribute values, or logical operations between multiple query attributes;
B, generating a matching tree from the matching condition, where the matching tree records the query attribute values, the positions of the query attributes in the raw data, and the matching functions or logical operations used to match the query attributes;
C, hashing the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and using that hash index value to find the matching content to be searched in a lookup table;
D, using the matching tree to find, within the content to be searched, the data that satisfies the matching condition.
3. The document classification method based on Hadoop data mining according to claim 1, characterized in that the clustering-based classification operation in step E comprises the following steps:
A, reading the attribute vector data and obtaining multiple preset cluster centers for the data to be processed;
B, classifying the data to be processed according to the preset cluster centers to obtain the classified data;
C, establishing multiple mergeable computation tasks from the classified data;
D, computing the mergeable computation tasks using multiple computation threads, and merging the computation results;
E, correcting and saving the preset cluster centers according to the merged results, and determining the clustering result from the preset cluster centers, the corrected cluster centers, and the number of correction operations.
4. The document classification method based on Hadoop data mining according to claim 3, characterized in that in step D, during the computation, the computer first preprocesses the data objects to be processed and divides them into groups, then computes the similarity matrix of the data objects within each group, merges data objects according to their similarity to generate new data objects, records the merge process, and deletes the old data objects.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810015666.2A CN108268620A (en) | 2018-01-08 | 2018-01-08 | A document classification method based on Hadoop data mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810015666.2A CN108268620A (en) | 2018-01-08 | 2018-01-08 | A document classification method based on Hadoop data mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108268620A true CN108268620A (en) | 2018-07-10 |
Family
ID=62773213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810015666.2A Pending CN108268620A (en) | 2018-01-08 | 2018-01-08 | A document classification method based on Hadoop data mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268620A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109684272A (en) * | 2018-12-29 | 2019-04-26 | 国家电网有限公司 | Document storage method, system and terminal device |
CN111597232A (en) * | 2020-05-26 | 2020-08-28 | 华北科技学院 | Data mining method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744935A (en) * | 2013-12-31 | 2014-04-23 | 华北电力大学(保定) | Rapid mass data cluster processing method for computer |
CN104699702A (en) * | 2013-12-09 | 2015-06-10 | 中国银联股份有限公司 | Data mining and classifying method |
CN104866502A (en) * | 2014-02-25 | 2015-08-26 | 深圳市中兴微电子技术有限公司 | Data matching method and device |
CN106095809A (en) * | 2016-05-30 | 2016-11-09 | 广东凯通科技股份有限公司 | Data matching method and system |
CN106295670A (en) * | 2015-06-11 | 2017-01-04 | 腾讯科技(深圳)有限公司 | Data processing method and data processing equipment |
CN106372122A (en) * | 2016-08-23 | 2017-02-01 | 温州大学瓯江学院 | Wiki semantic matching-based document classification method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li | An improved DBSCAN algorithm based on the neighbor similarity and fast nearest neighbor query | |
Sreedhar et al. | Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop | |
Christen | Automatic record linkage using seeded nearest neighbour and support vector machine classification | |
CN110135494A (en) | Feature selection method based on maximum information coefficient and Gini index | |
CN105426426B (en) | A KNN file classification method based on improved K-Medoids | |
Sitompul et al. | Optimization model of K-means clustering using artificial neural networks to handle class imbalance problem | |
CN107832456B (en) | Parallel KNN text classification method based on critical value data division | |
Kumar et al. | Canopy clustering: a review on pre-clustering approach to K-Means clustering | |
CN107291895B (en) | Quick hierarchical document query method | |
CN106095951B (en) | Data space multi-dimensional indexing method based on load balancing and inquiry log | |
Ou et al. | Non-transitive hashing with latent similarity components | |
Jenni et al. | Pre-processing image database for efficient Content Based Image Retrieval | |
Eghbali et al. | Online nearest neighbor search using hamming weight trees | |
Zaw et al. | Web document clustering by using PSO-based cuckoo search clustering algorithm | |
CN108268620A (en) | A document classification method based on Hadoop data mining | |
Gupta et al. | Feature selection: an overview | |
CN105760478A (en) | Large-scale distributed data clustering method based on machine learning | |
Davardoost et al. | An innovative model for extracting olap cubes from nosql database based on scalable naïve bayes classifier | |
Yan et al. | Fast approximate matching of binary codes with distinctive bits | |
Diao et al. | An improved DBSCAN algorithm using local parameters | |
CN108090182B (en) | A distributed index method and system for large-scale high-dimensional data | |
Zhao et al. | MapReduce-based clustering for near-duplicate image identification | |
Liu et al. | A potential-based clustering method with hierarchical optimization | |
Chernyshova et al. | Technique of cluster validity for Text Mining | |
Papanikolaou | Distributed algorithms for skyline computation using apache spark |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180710 |