CN108268620A

CN108268620A - A document classification method based on Hadoop data mining

Info

Publication number: CN108268620A
Application number: CN201810015666.2A
Authority: CN
Inventors: 王海勇; 窦敏
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2018-01-08
Filing date: 2018-01-08
Publication date: 2018-07-10

Abstract

The invention discloses a document classification method based on Hadoop data mining, comprising: A, pre-processing the data documents and determining the keywords and the correspondence between each keyword and the documents it belongs to; B, describing the attribute features of the data in the documents using an attribute feature conversion method; C, generating keyword vectors from the keyword set using a matching rule, and generating concept vectors from the keyword vectors and the data attribute feature set obtained in step B; D, calculating, from the keyword vectors and concept vectors of step C, the similarity between any two text documents among the data documents to be classified; E, performing a clustering-based classification operation on the attribute vectors to obtain their classification results, the classification results indicating the category of the target object corresponding to each attribute vector; F, Hadoop automatically collecting the above classification results and classifying the data documents to be classified. The invention has the notable advantages of being easy to implement and achieving high classification accuracy.

Description

A document classification method based on Hadoop data mining

Technical field

The invention belongs to the technical field of data classification, and in particular relates to a document classification method based on Hadoop data mining.

Background art

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.

With the rapid development of Internet technology, the number of documents on the network is growing explosively. This mass of documents makes it easy for users to obtain documents, but it also makes it very challenging to find the documents that are usable and that the user actually wants. Document classification is a technique for sorting documents efficiently: using sample documents that the user submits to a classification device, it quickly and accurately classifies the unclassified documents in a document library. Document classification in the prior art requires a very large number of text similarity matching computations, and the time and space consumed are more than a system can bear.

Summary of the invention

The purpose of the invention is to provide a document classification method based on Hadoop data mining, so as to solve the problems raised in the background art above.

To achieve the above purpose, the invention provides the following technical solution: a document classification method based on Hadoop data mining, comprising the following steps:

A, pre-process the data documents, and determine each keyword in the data document library and the correspondence between each keyword and the documents it belongs to;

B, describe the attribute features of the data in the documents using an attribute feature conversion method;

C, using a matching rule, generate keyword vectors from the keyword set of the data documents obtained in step A, and generate concept vectors from the keyword vectors and the data attribute feature set obtained in step B;

D, from the keyword vectors and concept vectors of step C, calculate the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;

E, perform a clustering-based classification operation on the attribute vectors of step D to obtain their classification results, the classification results indicating the category of the target object corresponding to each attribute vector;

F, Hadoop automatically collects the classification results of the attribute vectors from step E and classifies the data documents to be classified.
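Steps A, C and D can be illustrated with a minimal single-machine sketch. The source does not specify the tokenisation, the vector weighting or the similarity measure, so whitespace tokenisation, raw term counts and cosine similarity are assumptions made here for illustration only; the function names and sample documents are hypothetical, and the concept vectors of step C are not modelled.

```python
import math
from collections import Counter, defaultdict

def build_keyword_index(docs):
    """Step A (sketch): map each keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # assumed tokenisation: whitespace split
            index[word].add(doc_id)
    return index

def keyword_vector(text, vocabulary):
    """Step C (sketch): raw term-count vector over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Step D (sketch): cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "hdfs stores data blocks",
    "d3": "cats chase mice",
}
index = build_keyword_index(docs)
vocabulary = sorted(index)
vectors = {doc_id: keyword_vector(text, vocabulary) for doc_id, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
```

In this sketch d1 and d2 share three keywords and therefore score higher than d1 and d3, which share none. On a Hadoop cluster the index construction would more naturally be expressed as a map and reduce job, which is not shown here.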

Preferably, the matching process of the matching rule in step C comprises the following steps:

A, obtain a matching condition, the matching condition comprising one or more of the following items of matching information: one or more query attributes, query attribute values, the matching operation on a query attribute value, or the logical operations between multiple query attributes;

B, generate a matching tree from the matching condition, the matching tree recording the query attribute values, the positions of the query attributes in the raw data, and the matching functions used to match the query attributes or the logical operations;

C, hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up; using that hash index value, find the matching content to be searched in a lookup table;

D, use the matching tree to find, in the content to be searched, the data that satisfies the matching condition.
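The matching steps above can be sketched as follows. This is only one possible reading: the source does not define the structure of the matching tree or of the lookup table, so the nested-tuple condition format, the dictionary-based tree nodes, the use of Python's built-in `hash`, and all function and field names are assumptions. Records are modelled as dictionaries, so the attribute name stands in for the position of a query attribute in the raw data.

```python
import operator

# Hypothetical matching functions, keyed by the name of the matching operation.
MATCH_FUNCS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}
LOGIC_OPS = ("and", "or")

def build_match_tree(condition):
    """Step B (sketch): build a tree from a nested condition.

    A leaf is (attribute, op, value); an inner node is ("and" | "or", [children]).
    """
    if condition[0] in LOGIC_OPS:
        logic, children = condition
        return {"logic": logic, "children": [build_match_tree(c) for c in children]}
    attribute, op, value = condition
    return {"attr": attribute, "func": MATCH_FUNCS[op], "value": value}

def evaluate(node, record):
    """Evaluate a matching tree against one record."""
    if "logic" in node:
        results = [evaluate(child, record) for child in node["children"]]
        return all(results) if node["logic"] == "and" else any(results)
    if node["attr"] not in record:
        return False
    return node["func"](record[node["attr"]], node["value"])

def build_lookup_table(records, key_attr):
    """Step C (sketch): bucket records by the hash index value of their keyword."""
    table = {}
    for record in records:
        table.setdefault(hash(record[key_attr]), []).append(record)
    return table

def find_matches(table, keyword, key_attr, tree):
    """Steps C and D (sketch): hash lookup first, then filter with the tree."""
    # The equality check guards against two keywords sharing a hash value.
    candidates = [r for r in table.get(hash(keyword), []) if r[key_attr] == keyword]
    return [r for r in candidates if evaluate(tree, r)]

# Hypothetical raw data and matching condition.
records = [
    {"keyword": "hadoop", "year": 2017, "lang": "en"},
    {"keyword": "hadoop", "year": 2012, "lang": "zh"},
    {"keyword": "spark", "year": 2018, "lang": "en"},
]
condition = ("and", [("year", "gt", 2015),
                     ("or", [("lang", "eq", "en"), ("lang", "eq", "zh")])])
tree = build_match_tree(condition)
table = build_lookup_table(records, "keyword")
matches = find_matches(table, "hadoop", "keyword", tree)
```

The hash lookup narrows the search to records carrying the keyword before the tree is evaluated, so the more expensive condition check runs only on the candidate records.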

Preferably, the clustering-based classification operation in step E comprises the following steps:

A, read the attribute vector data and obtain multiple preset cluster centres for the data to be processed;

B, classify the data to be processed according to the preset cluster centres to obtain post-classification comparison data;

C, create multiple mergeable computing tasks from the post-classification comparison data;

D, execute the mergeable computing tasks using multiple computing threads, and merge the computation results;

E, correct and save the preset cluster centres according to the merged computation results, and determine the data clustering result from the preset cluster centres, the corrected cluster centres and the number of correction operations.
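The clustering steps above resemble a k-means style iteration, and can be sketched under that assumption. The source does not name the algorithm, the distance measure or the stopping rule; squared Euclidean distance, per-cluster sums and counts as the mergeable partial results, and stopping when the centres no longer move or a maximum number of corrections is reached are all assumptions, as are the function names and sample data.

```python
from concurrent.futures import ThreadPoolExecutor

def nearest(point, centres):
    """Index of the centre closest to point (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centres[i])))

def partial_sums(chunk, centres):
    """One mergeable task: per-cluster coordinate sums and counts for a chunk."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for point in chunk:
        k = nearest(point, centres)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts

def merge(results, n_clusters, dim):
    """Merge the partial results of all tasks by element-wise addition."""
    total_sums = [[0.0] * dim for _ in range(n_clusters)]
    total_counts = [0] * n_clusters
    for sums, counts in results:
        for k in range(n_clusters):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    return total_sums, total_counts

def cluster(points, preset_centres, n_workers=2, max_corrections=20, tol=1e-9):
    """Correct the preset centres until they stop moving or the limit is hit."""
    centres = [list(c) for c in preset_centres]
    chunks = [points[i::n_workers] for i in range(n_workers)]
    for corrections in range(1, max_corrections + 1):
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            results = list(pool.map(lambda ch: partial_sums(ch, centres), chunks))
        sums, counts = merge(results, len(centres), len(centres[0]))
        new_centres = [[s / counts[k] for s in sums[k]] if counts[k] else centres[k]
                       for k in range(len(centres))]
        shift = max(abs(a - b)
                    for old, new in zip(centres, new_centres)
                    for a, b in zip(old, new))
        centres = new_centres
        if shift <= tol:
            break
    labels = [nearest(p, centres) for p in points]
    return centres, labels, corrections

# Hypothetical attribute vectors and preset cluster centres.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centres, labels, n_corrections = cluster(points, [(0.0, 0.0), (10.0, 10.0)])
```

Because sums and counts combine by addition, the partial results can be merged in any order, which is what makes the tasks mergeable. Threads are used here only to mirror the wording of step D; in CPython they do not speed up CPU-bound work, and on a Hadoop cluster the same partial results would more likely come from separate map tasks.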

Preferably, in step D, during computation the computer first pre-processes the data objects to be processed and completes their grouping, then computes the similarity matrix of the data objects within each group, merges data objects according to their similarity to generate new data objects, records the merge process, and at the same time deletes the old data objects.
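The merge procedure within one group can be sketched as below. The source does not define the similarity measure, the merge rule or the stopping criterion, so the inverse-distance similarity, the midpoint merge and the similarity threshold are assumptions chosen for illustration, and the grouping itself is taken as already done.

```python
import math

def similarity(a, b):
    """Hypothetical similarity: 1 / (1 + Euclidean distance), in (0, 1]."""
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

def similarity_matrix(objects):
    """Pairwise similarity matrix of the data objects in one group."""
    n = len(objects)
    return [[similarity(objects[i], objects[j]) for j in range(n)] for i in range(n)]

def merge_group(objects, threshold):
    """Merge the most similar pair repeatedly while its similarity >= threshold.

    Returns the remaining objects and a history of (old_a, old_b, new) merges.
    """
    objects = [tuple(o) for o in objects]
    history = []
    while len(objects) > 1:
        matrix = similarity_matrix(objects)
        best, i, j = max((matrix[i][j], i, j)
                         for i in range(len(objects))
                         for j in range(i + 1, len(objects)))
        if best < threshold:
            break
        # Assumed merge rule: the new object is the midpoint of the pair.
        merged = tuple((x + y) / 2 for x, y in zip(objects[i], objects[j]))
        history.append((objects[i], objects[j], merged))  # record the merge
        # Delete the two old objects and keep the newly generated one.
        objects = [o for k, o in enumerate(objects) if k not in (i, j)] + [merged]
    return objects, history

# Hypothetical group of data objects.
group = [(0.0, 0.0), (0.0, 1.0), (8.0, 8.0)]
remaining, history = merge_group(group, threshold=0.4)
```

Recomputing the full matrix after every merge keeps the sketch short; restricting the computation to objects within one group is what keeps each matrix small.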

Compared with the prior art, the beneficial effects of the invention are:

1, the classification method used by the invention is easy to implement and has high classification accuracy; the matching process it uses can filter, query or match data;

2, a matching tree for matching data can be generated automatically from the matching condition, which addresses the diversity of query requirements and enables flexible data matching or filtering;

3, the clustering-based classification operation reduces overall computational complexity and improves computational stability, has strong capability for analysing the overall profile of the data, and is suited to fast clustering of massive data, further improving the accuracy of data document classification.

Description of the drawings

Fig. 1 is the overall classification flow chart of the invention;

Fig. 2 is the flow chart of the matching process of the invention;

Fig. 3 is the flow chart of the clustering-based classification operation of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Referring to Fig. 1, the invention provides a technical solution: a document classification method based on Hadoop data mining, comprising steps A to F as set out above.

As shown in Fig. 2, in the invention, the matching process of the matching rule in step C comprises steps A to D as set out above.

The matching process can filter, query or match data. Matching information is obtained from the matching condition for the raw data and a matching tree is generated automatically; because the matching tree carries the matching information, it can be used to find the data in the raw data that satisfies the matching condition.

As shown in Fig. 3, in the invention, the clustering-based classification operation in step E comprises steps A to E as set out above.

In step D, during computation the computer first pre-processes the data objects to be processed and completes their grouping, then computes the similarity matrix of the data objects within each group, merges data objects according to their similarity to generate new data objects, records the merge process, and at the same time deletes the old data objects.

The classification method used by the invention is easy to implement and has high classification accuracy. The matching process it uses can filter, query or match data. A matching tree for matching data can be generated automatically from the matching condition, which addresses the diversity of query requirements and enables flexible data matching or filtering. The clustering-based classification operation reduces overall computational complexity and improves computational stability, has strong capability for analysing the overall profile of the data, and is suited to fast clustering of massive data, further improving the accuracy of data document classification.

Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the appended claims.

Claims

1. A document classification method based on Hadoop data mining, characterised by comprising the following steps:

A, pre-process the data documents, and determine each keyword in the data document library and the correspondence between each keyword and the documents it belongs to;

C, using a certain matching rule, generate keyword vectors from the keyword set of the data documents obtained in step A, and generate concept vectors from the keyword vectors and the data attribute feature set obtained in step B;

D, from the keyword vectors and concept vectors of step C, calculate the similarity between any two text documents among the data documents to be classified, and identify the values of at least one stable attribute of each document as its attribute vector;

E, perform a clustering-based classification operation on the attribute vectors of step D to obtain their classification results, the classification results indicating the category of the target object corresponding to each attribute vector;

F, use Hadoop to automatically collect the classification results of the attribute vectors from step E, and classify the data documents to be classified.

2. The document classification method based on Hadoop data mining according to claim 1, characterised in that the matching rule in step C comprises the following steps:

A, obtain a matching condition, the matching condition comprising one or more of the following items of matching information: one or more query attributes, query attribute values, the matching operation on a query attribute value, or the logical operations between multiple query attributes;

B, generate a matching tree from the matching condition, the matching tree recording the query attribute values, the positions of the query attributes in the raw data, and the matching functions used to match the query attributes or the logical operations;

C, hash the keywords in the raw data to obtain the hash index value of the keyword to be looked up, and, using that hash index value, find the matching content to be searched in a lookup table;

3. The document classification method based on Hadoop data mining according to claim 1, characterised in that the clustering-based classification operation in step E comprises the following steps:

D, execute the mergeable computing tasks using multiple computing threads, and merge the computation results;

E, correct and save the preset cluster centres according to the merged computation results, and determine the data clustering result from the preset cluster centres, the corrected cluster centres and the number of correction operations.

4. The document classification method based on Hadoop data mining according to claim 3, characterised in that, in step D, during computation the computer first pre-processes the data objects to be processed and completes their grouping, then computes the similarity matrix of the data objects within each group, merges data objects according to their similarity to generate new data objects, records the merge process, and at the same time deletes the old data objects.