CN108268620A - A document classification method based on Hadoop data mining - Google Patents

A document classification method based on Hadoop data mining

Info

Publication number
CN108268620A
Authority
CN
China
Prior art keywords
data
document
vector
keyword
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810015666.2A
Other languages
Chinese (zh)
Inventor
王海勇
窦敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201810015666.2A
Publication of CN108268620A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses a document classification method based on Hadoop data mining, comprising: A. preprocessing the data documents and determining the keywords and the correspondence between each keyword and the documents that contain it; B. describing the attribute features of the data in each document by means of attribute-feature conversion; C. generating a keyword vector from the keyword set using a matching rule, and generating a concept vector from the keyword vector and the set of data attribute features obtained in step B; D. computing, from the keyword vectors and concept vectors of step C, the similarity between any two text documents among the data documents to be classified; E. performing a clustering-based classification operation on the attribute vectors to obtain their classification results, which indicate the category of the target object corresponding to each attribute vector; F. having Hadoop automatically collect these classification results and classify the data documents to be classified. The invention is easy to implement and offers high classification accuracy.

Description

A document classification method based on Hadoop data mining
Technical field
The invention belongs to the technical field of data classification, and specifically relates to a document classification method based on Hadoop data mining.
Background
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.
With the rapid development of Internet technology, the number of documents on the network is growing explosively. This mass of documents makes documents easy for users to obtain, but it also makes it very hard to find the usable documents a user actually wants. Document classification is a technique for sorting documents efficiently: using sample documents that the user submits to a classification device, it quickly and accurately classifies the unclassified documents in a document library. Document classification in the prior art requires a very large number of text-similarity matching computations, whose time and space costs are more than a system can bear.
Summary of the invention
The purpose of the invention is to provide a document classification method based on Hadoop data mining, so as to solve the problems raised in the background section above.
To achieve this purpose, the invention provides the following technical solution: a document classification method based on Hadoop data mining, comprising the following steps:
A. Preprocess the data documents and determine each keyword in the document library, together with the correspondence between each keyword and the documents that contain it;
B. Describe the attribute features of the data in each document by means of attribute-feature conversion;
C. Using a matching rule, generate a keyword vector from the keyword set of the data documents obtained in step A, and generate a concept vector from the keyword vector and the set of data attribute features obtained in step B;
D. From the keyword vectors and concept vectors of step C, compute the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;
E. Perform a clustering-based classification operation on the attribute vectors of step D to obtain their classification results, which indicate the category of the target object corresponding to each attribute vector;
F. Use Hadoop to automatically collect the classification results of the attribute vectors from step E, and classify the data documents to be classified.
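As an illustration of step A, the correspondence between keywords and their documents can be held in an inverted index. The following Python sketch is not taken from the patent: the whitespace tokenizer, the stopword set, and the names build_keyword_index and keyword_vector are assumptions made for this example, and the count-based keyword vector is only one simple way to read step C.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()\"'"  # assumed set of characters to strip from tokens


def tokenize(text):
    """Lower-case, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(PUNCTUATION) for t in text.lower().split())
    return [t for t in tokens if t]


def build_keyword_index(documents, stopwords=frozenset()):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            if token not in stopwords:
                index[token].add(doc_id)
    return dict(index)


def keyword_vector(text, vocabulary):
    """Count how often each vocabulary keyword occurs in one document."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]


# Hypothetical two-document library.
docs = {
    "d1": "Hadoop stores data in HDFS.",
    "d2": "Clustering groups similar data.",
}
index = build_keyword_index(docs, stopwords={"in"})
vector_d1 = keyword_vector(docs["d1"], ["data", "hadoop", "clustering"])
```

Here index["data"] holds both document ids, while vector_d1 counts the three chosen keywords in the first document.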
Preferably, the matching procedure of the matching rule in step C comprises the following steps:
A. Obtain a matching condition, which includes one or more of the following items of matching information: one or more query attributes, query attribute values, matching operations on query attribute values, or logical operations between multiple query attributes;
B. Generate a matching tree from the matching condition; the matching tree records the query attribute values, the positions of the query attributes in the raw data, and the matching functions used to match the query attributes or the logical operations;
C. Hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and use that hash index value to find the matching candidate content in a lookup table;
D. Use the matching tree to find, within the candidate content, the data that satisfies the matching condition.
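The hash lookup in step C of the matching procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not name a hash function or table layout, so the CRC32 hash, the bucket count NUM_BUCKETS, and the function names are all hypothetical.

```python
import zlib

NUM_BUCKETS = 16  # assumed table size, not specified in the source


def hash_index(keyword, num_buckets=NUM_BUCKETS):
    """Hash a keyword to a bucket number (CRC32 is an assumed choice)."""
    return zlib.crc32(keyword.encode("utf-8")) % num_buckets


def build_lookup_table(records, key_field):
    """Group raw records into buckets by the hash index of their keyword."""
    table = {}
    for record in records:
        table.setdefault(hash_index(record[key_field]), []).append(record)
    return table


def find_candidates(table, keyword, key_field):
    """Return the records whose keyword equals the one being looked up."""
    bucket = table.get(hash_index(keyword), [])
    # Different keywords can share a bucket, so compare the keyword itself.
    return [r for r in bucket if r[key_field] == keyword]


# Hypothetical raw data: one record per keyword occurrence.
records = [
    {"keyword": "hadoop", "doc": "d1"},
    {"keyword": "cluster", "doc": "d2"},
    {"keyword": "hadoop", "doc": "d3"},
]
table = build_lookup_table(records, "keyword")
candidates = find_candidates(table, "hadoop", "keyword")
```

Only the candidates returned by the lookup then need to be checked against the matching tree, which is what keeps the final matching step small.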
Preferably, the clustering-based classification operation in step E comprises the following steps:
A. Read the attribute vector data and obtain multiple preset cluster centers for the data to be processed;
B. Classify the data to be processed according to the preset cluster centers to obtain classified data;
C. Establish multiple mergeable computing tasks from the classified data;
D. Execute the mergeable computing tasks on multiple computing threads and merge the computation results;
E. Correct and save the preset cluster centers according to the merged results, and determine the data clustering result from the preset cluster centers, the corrected cluster centers, and the number of correction operations.
Preferably, in step D of the clustering operation, the computer first preprocesses the data objects to be processed and groups them, then computes the similarity matrix of the data objects within each group, merges objects according to their similarity to generate new data objects, records the merge process, and deletes the old data objects.
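The clustering steps above resemble a k-means-style update in which each thread produces partial sums that can be merged. The sketch below is one possible reading, not the patented implementation: squared Euclidean distance, the chunking scheme, the tolerance tol, and all function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))


def partial_sums(chunk, centers):
    """One mergeable task: per-cluster coordinate sums and counts for a chunk."""
    sums = [[0.0] * len(centers[0]) for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        i = nearest(point, centers)
        counts[i] += 1
        sums[i] = [s + p for s, p in zip(sums[i], point)]
    return sums, counts


def merge(results, k, dim):
    """Merge the partial results of all tasks into global sums and counts."""
    total = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for sums, cnts in results:
        for i in range(k):
            counts[i] += cnts[i]
            total[i] = [a + b for a, b in zip(total[i], sums[i])]
    return total, counts


def cluster(points, centers, workers=2, max_rounds=20, tol=1e-9):
    """Correct the preset centers until they stop moving or rounds run out."""
    k, dim = len(centers), len(centers[0])
    chunks = [points[i::workers] for i in range(workers)]
    for rounds in range(1, max_rounds + 1):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda ch: partial_sums(ch, centers), chunks))
        total, counts = merge(results, k, dim)
        updated = [[s / counts[i] for s in total[i]] if counts[i] else centers[i]
                   for i in range(k)]
        # Compare the previous and corrected centers to decide whether to stop.
        shift = max(sum((a - b) ** 2 for a, b in zip(u, c))
                    for u, c in zip(updated, centers))
        centers = updated
        if shift <= tol:
            break
    labels = [nearest(p, centers) for p in points]
    return centers, labels, rounds


# Hypothetical attribute vectors and preset cluster centers.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, labels, rounds = cluster(points, [[0, 0], [10, 10]])
```

Because sums and counts add up across chunks, the per-thread results can be merged in any order, which is the property the "mergeable computing tasks" rely on in this reading.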
Compared with the prior art, the beneficial effects of the invention are:
1. The classification method used by the invention is easy to implement and has high classification accuracy; the matching procedure it uses can filter, query, or match data;
2. A matching tree for matching data can be generated automatically from the matching condition, which addresses the diversity of query requirements and allows flexible data matching or filtering;
3. The clustering-based classification operation reduces overall computational complexity and improves computational stability; it analyzes the overall characteristics of the data well and is suited to fast clustering of massive data, which further improves the accuracy of document classification.
Description of the drawings
Fig. 1 is the overall classification flow chart of the invention;
Fig. 2 is the flow chart of the matching procedure of the invention;
Fig. 3 is the flow chart of the clustering-based classification operation of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, the invention provides a technical solution: a document classification method based on Hadoop data mining, comprising the following steps:
A. Preprocess the data documents and determine each keyword in the document library, together with the correspondence between each keyword and the documents that contain it;
B. Describe the attribute features of the data in each document by means of attribute-feature conversion;
C. Using a matching rule, generate a keyword vector from the keyword set of the data documents obtained in step A, and generate a concept vector from the keyword vector and the set of data attribute features obtained in step B;
D. From the keyword vectors and concept vectors of step C, compute the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;
E. Perform a clustering-based classification operation on the attribute vectors of step D to obtain their classification results, which indicate the category of the target object corresponding to each attribute vector;
F. Use Hadoop to automatically collect the classification results of the attribute vectors from step E, and classify the data documents to be classified.
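The pairwise similarity of step D can be illustrated with a short Python sketch. The patent does not name a similarity measure, so cosine similarity, the equal weighting of keyword and concept vectors (keyword_weight), and the function names are assumptions made for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_similarity(doc_a, doc_b, keyword_weight=0.5):
    """Blend keyword-vector and concept-vector similarity (assumed weighting)."""
    kw = cosine(doc_a["keywords"], doc_b["keywords"])
    cp = cosine(doc_a["concepts"], doc_b["concepts"])
    return keyword_weight * kw + (1 - keyword_weight) * cp


def pairwise_similarity(docs):
    """Similarity for every unordered pair of documents."""
    return {(a, b): document_similarity(docs[a], docs[b])
            for a, b in combinations(sorted(docs), 2)}


# Hypothetical keyword and concept vectors for three documents.
docs = {
    "d1": {"keywords": [1, 1, 0], "concepts": [1, 0]},
    "d2": {"keywords": [1, 1, 0], "concepts": [1, 0]},
    "d3": {"keywords": [0, 0, 1], "concepts": [0, 1]},
}
sims = pairwise_similarity(docs)
```

Computing every pair grows quadratically with the number of documents, which is the cost the background section attributes to prior-art classification.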
As shown in Fig. 2, the matching procedure of the matching rule in step C comprises the following steps:
A. Obtain a matching condition, which includes one or more of the following items of matching information: one or more query attributes, query attribute values, matching operations on query attribute values, or logical operations between multiple query attributes;
B. Generate a matching tree from the matching condition; the matching tree records the query attribute values, the positions of the query attributes in the raw data, and the matching functions used to match the query attributes or the logical operations;
C. Hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and use that hash index value to find the matching candidate content in a lookup table;
D. Use the matching tree to find, within the candidate content, the data that satisfies the matching condition.
The matching procedure can filter, query, or match data. Matching information is obtained from the matching condition for the raw data, and a matching tree is generated automatically; because the matching tree carries the matching information, it can be used to find the data in the raw data that satisfies the matching condition.
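A matching tree of this kind can be sketched as a small expression tree whose leaves compare one query attribute and whose inner nodes apply a logical operation. The condition format, the tuple node layout, and the operator names below are hypothetical; in this sketch the field name stands in for the attribute's position in the raw data.

```python
import operator

# Assumed node shapes (not from the source):
#   ("cmp", field, op_name, value)  leaf: compare one query attribute
#   ("and" | "or", left, right)     inner node: logical operation
OPS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "contains": lambda a, b: b in a,
}


def build_match_tree(condition):
    """Turn a nested condition dict into a tuple-based matching tree."""
    if "all" in condition:
        return _fold("and", condition["all"])
    if "any" in condition:
        return _fold("or", condition["any"])
    return ("cmp", condition["field"], condition["op"], condition["value"])


def _fold(kind, parts):
    """Chain the sub-conditions (assumed non-empty) under one logical operation."""
    nodes = [build_match_tree(p) for p in parts]
    tree = nodes[0]
    for node in nodes[1:]:
        tree = (kind, tree, node)
    return tree


def matches(tree, record):
    """Evaluate the matching tree against one record."""
    kind = tree[0]
    if kind == "cmp":
        _, field, op_name, value = tree
        return field in record and OPS[op_name](record[field], value)
    _, left, right = tree
    if kind == "and":
        return matches(left, record) and matches(right, record)
    return matches(left, record) or matches(right, record)


# Hypothetical matching condition and raw records.
condition = {"all": [
    {"field": "keyword", "op": "eq", "value": "hadoop"},
    {"any": [
        {"field": "year", "op": "gt", "value": 2015},
        {"field": "title", "op": "contains", "value": "HDFS"},
    ]},
]}
tree = build_match_tree(condition)
records = [
    {"keyword": "hadoop", "year": 2018, "title": "Document classification"},
    {"keyword": "hadoop", "year": 2010, "title": "HDFS internals"},
    {"keyword": "hadoop", "year": 2010, "title": "Old notes"},
    {"keyword": "spark", "year": 2018, "title": "HDFS and Spark"},
]
matched = [r["title"] for r in records if matches(tree, r)]
```

Because the tree is built from the condition rather than hard-coded, a different query only needs a different condition dict, which is the flexibility the description claims for the matching tree.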
As shown in Fig. 3, the clustering-based classification operation in step E comprises the following steps:
A. Read the attribute vector data and obtain multiple preset cluster centers for the data to be processed;
B. Classify the data to be processed according to the preset cluster centers to obtain classified data;
C. Establish multiple mergeable computing tasks from the classified data;
D. Execute the mergeable computing tasks on multiple computing threads and merge the computation results;
E. Correct and save the preset cluster centers according to the merged results, and determine the data clustering result from the preset cluster centers, the corrected cluster centers, and the number of correction operations.
In step D above, the computer first preprocesses the data objects to be processed and groups them, then computes the similarity matrix of the data objects within each group, merges objects according to their similarity to generate new data objects, records the merge process, and deletes the old data objects.
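The in-group merging described for step D can be read as an agglomerative step: build the similarity matrix for one group, merge the most similar pair into a new object, record the merge, and drop the two old objects. The sketch below assumes the grouping has already been done; the similarity measure, the threshold, and the weighted-mean merge rule are assumptions, not details from the patent.

```python
def similarity(u, v):
    """Assumed measure: 1 / (1 + Euclidean distance)."""
    dist = sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return 1.0 / (1.0 + dist)


def merge_group(objects, threshold):
    """Repeatedly merge the most similar pair of objects inside one group.

    objects maps an object id to (vector, member_count).
    Returns the remaining objects and the recorded merge history.
    """
    objects = dict(objects)
    history = []
    next_id = 0
    while len(objects) > 1:
        ids = sorted(objects)
        # Similarity matrix of the group, stored as a dict keyed by id pair.
        matrix = {(a, b): similarity(objects[a][0], objects[b][0])
                  for i, a in enumerate(ids) for b in ids[i + 1:]}
        (a, b), best = max(matrix.items(), key=lambda kv: kv[1])
        if best < threshold:
            break
        # Remove the old objects and create the merged one (weighted mean).
        (va, na), (vb, nb) = objects.pop(a), objects.pop(b)
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(va, vb)]
        new_id = f"m{next_id}"
        next_id += 1
        objects[new_id] = (merged, na + nb)
        history.append((a, b, new_id))
    return objects, history


# Hypothetical group of three data objects.
group = {
    "a": ([0.0, 0.0], 1),
    "b": ([0.0, 1.0], 1),
    "c": ([5.0, 5.0], 1),
}
remaining, history = merge_group(group, threshold=0.4)
```

In this run the two nearby objects are merged into one new object and the distant one is left alone, and the history keeps the record of which objects were combined.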
The classification method used by the invention is easy to implement and has high classification accuracy. The matching procedure it uses can filter, query, or match data. A matching tree for matching data can be generated automatically from the matching condition, which addresses the diversity of query requirements and allows flexible data matching or filtering. The clustering-based classification operation reduces overall computational complexity and improves computational stability; it analyzes the overall characteristics of the data well and is suited to fast clustering of massive data, which further improves the accuracy of document classification.
Although embodiments of the invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the appended claims.

Claims (4)

1. A document classification method based on Hadoop data mining, characterized by comprising the following steps:
A. Preprocess the data documents and determine each keyword in the document library, together with the correspondence between each keyword and the documents that contain it;
B. Describe the attribute features of the data in each document by means of attribute-feature conversion;
C. Using a certain matching rule, generate a keyword vector from the keyword set of the data documents obtained in step A, and generate a concept vector from the keyword vector and the set of data attribute features obtained in step B;
D. From the keyword vectors and concept vectors of step C, compute the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;
E. Perform a clustering-based classification operation on the attribute vectors of step D to obtain their classification results, which indicate the category of the target object corresponding to each attribute vector;
F. Use Hadoop to automatically collect the classification results of the attribute vectors from step E, and classify the data documents to be classified.
2. The document classification method based on Hadoop data mining according to claim 1, characterized in that the matching rule in step C comprises the following steps:
A. Obtain a matching condition, which includes one or more of the following items of matching information: one or more query attributes, query attribute values, matching operations on query attribute values, or logical operations between multiple query attributes;
B. Generate a matching tree from the matching condition; the matching tree records the query attribute values, the positions of the query attributes in the raw data, and the matching functions used to match the query attributes or the logical operations;
C. Hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and use that hash index value to find the matching candidate content in a lookup table;
D. Use the matching tree to find, within the candidate content, the data that satisfies the matching condition.
3. The document classification method based on Hadoop data mining according to claim 1, characterized in that the clustering-based classification operation in step E comprises the following steps:
A. Read the attribute vector data and obtain multiple preset cluster centers for the data to be processed;
B. Classify the data to be processed according to the preset cluster centers to obtain classified data;
C. Establish multiple mergeable computing tasks from the classified data;
D. Execute the mergeable computing tasks on multiple computing threads and merge the computation results;
E. Correct and save the preset cluster centers according to the merged results, and determine the data clustering result from the preset cluster centers, the corrected cluster centers, and the number of correction operations.
4. The document classification method based on Hadoop data mining according to claim 3, characterized in that, in step D, the computer first preprocesses the data objects to be processed and groups them, then computes the similarity matrix of the data objects within each group, merges objects according to their similarity to generate new data objects, records the merge process, and deletes the old data objects.
CN201810015666.2A 2018-01-08 2018-01-08 A document classification method based on Hadoop data mining Pending CN108268620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810015666.2A CN108268620A (en) 2018-01-08 2018-01-08 A document classification method based on Hadoop data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810015666.2A CN108268620A (en) 2018-01-08 2018-01-08 A document classification method based on Hadoop data mining

Publications (1)

Publication Number Publication Date
CN108268620A true CN108268620A (en) 2018-07-10

Family

ID=62773213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810015666.2A Pending CN108268620A (en) 2018-01-08 2018-01-08 A document classification method based on Hadoop data mining

Country Status (1)

Country Link
CN (1) CN108268620A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684272A (en) * 2018-12-29 2019-04-26 国家电网有限公司 Document storage method, system and terminal device
CN111597232A (en) * 2020-05-26 2020-08-28 华北科技学院 Data mining method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744935A (en) * 2013-12-31 2014-04-23 华北电力大学(保定) Rapid mass data cluster processing method for computer
CN104699702A (en) * 2013-12-09 2015-06-10 中国银联股份有限公司 Data mining and classifying method
CN104866502A (en) * 2014-02-25 2015-08-26 深圳市中兴微电子技术有限公司 Data matching method and device
CN106095809A (en) * 2016-05-30 2016-11-09 广东凯通科技股份有限公司 Data matching method and system
CN106295670A (en) * 2015-06-11 2017-01-04 腾讯科技(深圳)有限公司 Data processing method and data processing equipment
CN106372122A (en) * 2016-08-23 2017-02-01 温州大学瓯江学院 Wiki semantic matching-based document classification method and system

Similar Documents

Publication Publication Date Title
Li An improved DBSCAN algorithm based on the neighbor similarity and fast nearest neighbor query
Sreedhar et al. Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop
CN110135494A (en) Feature selection approach based on maximum information coefficient and Geordie index
CN110471916A (en) Querying method, device, server and the medium of database
Eluri et al. A comparative study of various clustering techniques on big data sets using Apache Mahout
CN107832456B (en) Parallel KNN text classification method based on critical value data division
CN107291895B (en) Quick hierarchical document query method
Tang et al. Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce.
Connor et al. High-dimensional simplexes for supermetric search
Etezadifar et al. Scalable video summarization via sparse dictionary learning and selection simultaneously
Eghbali et al. Online nearest neighbor search using hamming weight trees
Zaw et al. Web document clustering by using PSO-based cuckoo search clustering algorithm
CN108268620A (en) A kind of Document Classification Method based on hadoop data minings
Čech et al. Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data
Meena et al. Handling data-skewness in character based string similarity join using Hadoop
CN105760478A (en) Large-scale distributed data clustering method based on machine learning
Gupta et al. Feature selection: an overview
Antaris et al. Similarity search over the cloud based on image descriptors' dimensions value cardinalities
Yan et al. Fast approximate matching of binary codes with distinctive bits
Lu et al. Dynamic Partition Forest: An Efficient and Distributed Indexing Scheme for Similarity Search based on Hashing
Kumar et al. Visual semantic based 3D video retrieval system using HDFS
Liu et al. A potential-based clustering method with hierarchical optimization
Papanikolaou Distributed algorithms for skyline computation using apache spark
Zhao et al. MapReduce-based clustering for near-duplicate image identification
Jiang et al. Gvos: a general system for near-duplicate video-related applications on storm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180710)