CN104239479A

CN104239479A - Document classification method and system

Info

Publication number: CN104239479A
Application number: CN201410449140.7A
Authority: CN
Inventors: 宗栋瑞; 郭美思; 吴楠
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-09-04
Filing date: 2014-09-04
Publication date: 2014-12-24

Abstract

The invention discloses a document classification method and system, applied to a Hadoop cluster comprising a Map program and a Reduce program. The method comprises the following steps: the Map program parses training documents and documents to be classified, determines feature attributes according to the parsing result, and divides the feature attributes; the Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents; and the Reduce program uses the classifier to classify the documents to be classified, obtaining their classification results. The method and system make full use of the distributed nature of the Hadoop cluster and avoid the limitations of a conventional system framework; because processing is parallel and fast, massive numbers of documents can be classified quickly, which saves classification time and improves document classification efficiency and system performance.

Description

Document classification method and system

Technical field

The present invention relates to the field of computer technology, and in particular to a document classification method and system.

Background

With the growing popularity of network technology, the volume of data on networks is increasing sharply and the types of applications are increasingly varied. Data mining technology makes full use of existing information resources to uncover hidden knowledge in massive data, and is a rapidly developing direction. Data mining draws on fields such as machine learning, pattern recognition, statistics, intelligent databases, data visualization, and high-performance computing; its purpose is to discover implicit, novel, and interesting relationships and rules in massive data. Document classification is an important direction within data mining.

In the prior art, document classification is usually performed with a traditional system framework; when massive data is processed, classification takes too long and system performance is low.

Summary of the invention

The invention provides a document classification method and system to overcome the low system performance of the prior art.

The invention provides a document classification method, applied to a Hadoop cluster comprising a Map program and a Reduce program, the method comprising the following steps:

The Map program parses training documents and documents to be classified, determines feature attributes according to the parsing result, and divides the feature attributes;

The Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents;

The Reduce program uses the classifier to classify the documents to be classified, obtaining the classification results of the documents to be classified.

Optionally, after the Map program determines the feature attributes according to the parsing result, the method further comprises:

The Map program performs format conversion on the training documents and the documents to be classified, respectively, according to the feature attributes, obtaining training documents and documents to be classified that conform to a preset format;

The Map program generating a classifier according to the feature attributes of the training documents and the classification results of the training documents is specifically:

The Map program generates the classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents;

The Reduce program using the classifier to classify the documents to be classified and obtaining their classification results is specifically:

The Reduce program uses the classifier to classify the format-converted documents to be classified, obtaining the classification results of the documents to be classified.

Optionally, the Map program generating the classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents is specifically:

The Map program, according to the value ranges of each feature attribute of the format-converted training documents and the classification results of the training documents, calculates the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every feature attribute under each class, and records the frequencies of occurrence and the conditional probability estimates as the classifier.

Optionally, the Reduce program using the classifier to classify the format-converted documents to be classified and obtaining their classification results is specifically:

The Reduce program obtains the value ranges of all feature attributes of a format-converted document to be classified; according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every feature attribute under each class, it calculates the conditional probability that the document to be classified belongs to each class, and takes the class with the largest conditional probability as the classification result of the document to be classified.

Optionally, the Map program parsing the training documents and the documents to be classified, determining the feature attributes according to the parsing result, and dividing the feature attributes is specifically:

By parsing the training documents and the documents to be classified, the Map program obtains the attributes that they contain, selects feature attributes from the attributes obtained by parsing, and divides each feature attribute into multiple value ranges.

The invention also provides a document classification system, applied to a Hadoop cluster, the system comprising:

A parsing module, configured to parse training documents and documents to be classified, determine feature attributes according to the parsing result, and divide the feature attributes;

A generation module, configured to generate a classifier according to the feature attributes of the training documents determined by the parsing module and the classification results of the training documents;

A classification module, configured to use the classifier generated by the generation module to classify the documents to be classified, obtaining the classification results of the documents to be classified.

Optionally, the system further comprises:

A conversion module, configured to perform format conversion on the training documents and the documents to be classified, respectively, according to the feature attributes determined by the parsing module, obtaining training documents and documents to be classified that conform to a preset format;

The generation module is specifically configured to generate the classifier according to the feature attributes of the training documents converted by the conversion module and the classification results of the training documents;

The classification module is specifically configured to use the classifier generated by the generation module to classify the documents to be classified that were converted by the conversion module, obtaining the classification results of the documents to be classified.

Optionally, the generation module is specifically configured to, according to the value ranges of each feature attribute of the training documents converted by the conversion module and the classification results of the training documents, calculate the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every feature attribute under each class, and record the frequencies of occurrence and the conditional probability estimates as the classifier.

Optionally, the classification module is specifically configured to obtain the value ranges of all feature attributes of a document to be classified that was converted by the conversion module; according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every feature attribute under each class, calculate the conditional probability that the document to be classified belongs to each class; and take the class with the largest conditional probability as the classification result of the document to be classified.

Optionally, the parsing module is specifically configured to parse the training documents and the documents to be classified to obtain the attributes that they contain, select feature attributes from the attributes obtained by parsing, and divide each feature attribute into multiple value ranges.

The invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of a traditional system framework. Because processing is parallel and fast, massive numbers of documents can be classified quickly, which saves classification time and improves both document classification efficiency and system performance.

Brief description of the drawings

Fig. 1 is a flowchart of a document classification method in an embodiment of the invention;

Fig. 2 is a schematic structural diagram of a document classification system in an embodiment of the invention.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.

It should be noted that, where no conflict arises, the embodiments of the invention and the features within them may be combined with one another, and all such combinations fall within the protection scope of the invention. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.

An embodiment of the invention proposes a document classification method applied to a Hadoop cluster comprising a Map program and a Reduce program. After Hadoop commands are used to place the training documents and the documents to be classified on HDFS (Hadoop Distributed File System), the operations shown in Fig. 1 are performed:

Step 101: the Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing result, and divides the feature attributes.

Specifically, by parsing the training documents and the documents to be classified, the Map program can obtain the attributes that they contain, select feature attributes from the attributes obtained by parsing, and divide each feature attribute into multiple value ranges.

The training documents and the documents to be classified may be located in different directories of HDFS and organized into subdirectories: the name of each folder is a class label, and the folder contains the documents that belong to the class corresponding to that label.
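The folder-per-class layout described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: it walks a local directory with `pathlib` as a stand-in for an HDFS directory, and the function name `labeled_documents` is hypothetical and does not appear in the source.

```python
from pathlib import Path


def labeled_documents(root):
    """Yield (class_label, document_path) pairs from a folder-per-class layout.

    Assumption: `root` is a local directory standing in for an HDFS
    directory; each subfolder name is treated as a class label and every
    file inside it as a document of that class.
    """
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        for doc in sorted(class_dir.iterdir()):
            if doc.is_file():
                yield class_dir.name, doc
```

Sorting the folder and file names only makes the iteration order deterministic; it has no bearing on the classification itself.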

For example, the training documents are located under the /train directory of HDFS, and the documents to be classified are located under the /test directory of HDFS. According to the result of parsing the training documents and the documents to be classified, the Map program selects 3 feature attributes: a, number of log entries / number of days since registration; b, number of friends / number of days since registration; c, whether a real profile picture is used. Each feature attribute is divided into value ranges: {a<=0.05, 0.05<a<0.2, a>=0.2}; {b<=0.1, 0.1<b<0.8, b>=0.8}; {c=0 (no), c=1 (yes)}.
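The division of the three example attributes into value ranges can be sketched as follows. The thresholds come from the example above; the function names and the input parameters (`log_count`, `friend_count`, `days_registered`, `uses_real_avatar`) are hypothetical names chosen for this sketch.

```python
def bin_a(a: float) -> str:
    """Value range of attribute a (log entries per day since registration)."""
    if a <= 0.05:
        return "a<=0.05"
    if a < 0.2:
        return "0.05<a<0.2"
    return "a>=0.2"


def bin_b(b: float) -> str:
    """Value range of attribute b (friends per day since registration)."""
    if b <= 0.1:
        return "b<=0.1"
    if b < 0.8:
        return "0.1<b<0.8"
    return "b>=0.8"


def bin_c(uses_real_avatar: bool) -> str:
    """Value range of attribute c (whether a real profile picture is used)."""
    return "c=1" if uses_real_avatar else "c=0"


def discretize(log_count, friend_count, days_registered, uses_real_avatar):
    """Map raw account statistics to the three value ranges of the example."""
    a = log_count / days_registered
    b = friend_count / days_registered
    return (bin_a(a), bin_b(b), bin_c(uses_real_avatar))
```

For instance, an account with 10 log entries and 50 friends after 100 days and no real profile picture falls into the ranges 0.05<a<0.2, 0.1<b<0.8, and c=0.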

Step 102: the Map program performs format conversion on the training documents and the documents to be classified, respectively, according to the determined feature attributes, obtaining training documents and documents to be classified that conform to a preset format.

Specifically, the Map program can use the PrepareTwentyNewsgroups class of Mahout from the command line to convert the training documents and the documents to be classified into training documents and documents to be classified that conform to the preset format. The preset format may be the VectorWritable format; in a document conforming to the VectorWritable format, the first character is the class label and the remaining characters are the feature attributes.
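The "class label first, feature attributes after" layout can be illustrated with a plain-text analogue. This sketch does not reproduce Mahout's VectorWritable format or the PrepareTwentyNewsgroups class; the tab-separated encoding and the helper names `to_record` and `from_record` are assumptions made purely for illustration.

```python
def to_record(label: str, value_ranges: tuple) -> str:
    """Encode one document as a line: class label first, then its value ranges.

    Assumption: a tab-separated text line stands in for the preset format.
    """
    return "\t".join((label, *value_ranges))


def from_record(line: str):
    """Decode a line produced by to_record back into (label, value_ranges)."""
    label, *value_ranges = line.rstrip("\n").split("\t")
    return label, tuple(value_ranges)
```

Keeping the label in a fixed leading position lets the later steps read it without knowing how many feature attributes follow.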

Step 103: the Map program generates a classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents.

Specifically, the Map program can, according to the value ranges of each feature attribute of the format-converted training documents and the classification results of the training documents, calculate the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every feature attribute under each class, and record these frequencies of occurrence and conditional probability estimates as the classifier.

For example, there are 10,000 training documents, classified as follows: 8,900 training documents belong to real accounts (that is, C=0), and 1,100 training documents belong to non-genuine accounts (that is, C=1).

The frequency of occurrence of each class in the training documents is:

P(C=0) = 8900/10000 = 0.89;

P(C=1) = 1100/10000 = 0.11;

The conditional probability estimates of each value range of every feature attribute under each class are:

P(a<=0.05 | C=0) = 0.3

P(0.05<a<0.2 | C=0) = 0.5

P(a>=0.2 | C=0) = 0.2

P(a<=0.05 | C=1) = 0.8

P(0.05<a<0.2 | C=1) = 0.1

P(a>=0.2 | C=1) = 0.1

P(b<=0.1 | C=0) = 0.1

P(0.1<b<0.8 | C=0) = 0.7

P(b>=0.8 | C=0) = 0.2

P(b<=0.1 | C=1) = 0.7

P(0.1<b<0.8 | C=1) = 0.2

P(b>=0.8 | C=1) = 0.1

P(c=0 | C=0) = 0.2

P(c=1 | C=0) = 0.8

P(c=0 | C=1) = 0.9

P(c=1 | C=1) = 0.1
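How frequencies and conditional probability estimates of this kind could be derived from labeled records is sketched below. This is a single-process illustration by relative-frequency counting, not the Map program itself; the function name `train` and the record layout (a class label paired with a tuple of value ranges) are assumptions introduced here.

```python
from collections import Counter, defaultdict


def train(records):
    """Estimate class frequencies and per-class value-range probabilities.

    records: iterable of (label, (range_1, ..., range_n)) pairs, where each
    range is the value range that a feature attribute falls into.
    Returns (priors, conditionals) with
      priors[label]                       = frequency of the class
      conditionals[label][(i, range_str)] = estimate of P(range | label)
    """
    class_counts = Counter()
    range_counts = defaultdict(Counter)
    for label, ranges in records:
        class_counts[label] += 1
        for i, value_range in enumerate(ranges):
            range_counts[label][(i, value_range)] += 1

    total = sum(class_counts.values())
    priors = {label: n / total for label, n in class_counts.items()}
    conditionals = {
        label: {key: n / class_counts[label] for key, n in counts.items()}
        for label, counts in range_counts.items()
    }
    return priors, conditionals
```

Value ranges that never occur together with a class are simply absent from `conditionals` in this sketch; how such cases are handled is not specified in the source.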

Step 104: the Reduce program uses the classifier to classify the format-converted documents to be classified, obtaining the classification results of the documents to be classified.

Specifically, the Reduce program can obtain the value ranges of all feature attributes of a format-converted document to be classified; according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every feature attribute under each class, calculate the conditional probability that the document to be classified belongs to each class; and record the class with the largest conditional probability on HDFS as the classification result of the document to be classified.
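The calculation in this step can be written compactly as a decision rule. The notation is an assumption introduced here for illustration: x_i denotes the value range into which the i-th feature attribute of the document falls, and C ranges over the classes.

```latex
\hat{C} \;=\; \arg\max_{C} \; P(C) \prod_{i} P(x_i \mid C)
```

The product form treats the feature attributes as independent of one another given the class, which is the form commonly known as a naive Bayes classifier; the source itself does not use that name. The quantity being maximized is proportional to the probability of the class given the document, so comparing it across classes is enough to pick the result.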

For example, suppose the value ranges of the 3 feature attributes of a document to be classified are 0.05<a<0.2, 0.1<b<0.8, and c=0. The conditional probability that the document to be classified belongs to a real account (that is, C=0) is then:

P(C=0) P(x | C=0)

= P(C=0) P(0.05<a<0.2 | C=0) P(0.1<b<0.8 | C=0) P(c=0 | C=0)

= 0.89 * 0.5 * 0.7 * 0.2

= 0.0623;

The conditional probability that the document to be classified belongs to a non-genuine account (that is, C=1) is:

P(C=1) P(x | C=1)

= P(C=1) P(0.05<a<0.2 | C=1) P(0.1<b<0.8 | C=1) P(c=0 | C=1)

= 0.11 * 0.1 * 0.2 * 0.9

= 0.00198

Because the conditional probability that the document to be classified belongs to a real account is the larger of the two (0.0623 > 0.00198), the Reduce program determines that this document belongs to a real account.
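The worked example can be checked with a short Python sketch. The class frequencies and conditional probability estimates are the figures from the example; the dictionary layout and the names `score` and `classify` are assumptions made for illustration and run in a single process rather than as a Reduce program.

```python
# Frequencies of each class in the training documents (from the example).
priors = {0: 0.89, 1: 0.11}

# Conditional probability estimates of each value range under each class.
conditionals = {
    0: {"a<=0.05": 0.3, "0.05<a<0.2": 0.5, "a>=0.2": 0.2,
        "b<=0.1": 0.1, "0.1<b<0.8": 0.7, "b>=0.8": 0.2,
        "c=0": 0.2, "c=1": 0.8},
    1: {"a<=0.05": 0.8, "0.05<a<0.2": 0.1, "a>=0.2": 0.1,
        "b<=0.1": 0.7, "0.1<b<0.8": 0.2, "b>=0.8": 0.1,
        "c=0": 0.9, "c=1": 0.1},
}


def score(label, value_ranges):
    """Class frequency times the estimate for each observed value range."""
    result = priors[label]
    for value_range in value_ranges:
        result *= conditionals[label][value_range]
    return result


def classify(value_ranges):
    """Return the class whose score is largest."""
    return max(priors, key=lambda label: score(label, value_ranges))


document = ("0.05<a<0.2", "0.1<b<0.8", "c=0")
print(score(0, document), score(1, document), classify(document))
```

Up to floating-point rounding, the two scores are 0.0623 and 0.00198, and the selected class is 0 (real account), in agreement with the calculation above.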

The embodiment of the invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of a traditional system framework. Because processing is parallel and fast, massive numbers of documents can be classified quickly, which saves classification time and improves both document classification efficiency and system performance.

Based on the above document classification method, an embodiment of the invention proposes a document classification system applied to a Hadoop cluster. As shown in Fig. 2, the system comprises:

A parsing module 210, configured to parse training documents and documents to be classified, determine feature attributes according to the parsing result, and divide the feature attributes;

Specifically, the parsing module 210 is configured to parse the training documents and the documents to be classified to obtain the attributes that they contain, select feature attributes from the attributes obtained by parsing, and divide each feature attribute into multiple value ranges.

A generation module 220, configured to generate a classifier according to the feature attributes of the training documents determined by the parsing module 210 and the classification results of the training documents;

A classification module 230, configured to use the classifier generated by the generation module 220 to classify the documents to be classified, obtaining the classification results of the documents to be classified.

Further, the system also comprises:

A conversion module 240, configured to perform format conversion on the training documents and the documents to be classified, respectively, according to the feature attributes determined by the parsing module 210, obtaining training documents and documents to be classified that conform to a preset format;

Correspondingly, the generation module 220 is specifically configured to generate the classifier according to the feature attributes of the training documents converted by the conversion module 240 and the classification results of the training documents;

The classification module 230 is specifically configured to use the classifier generated by the generation module 220 to classify the documents to be classified that were converted by the conversion module 240, obtaining the classification results of the documents to be classified.

Further, the generation module 220 is specifically configured to, according to the value ranges of each feature attribute of the training documents converted by the conversion module 240 and the classification results of the training documents, calculate the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every feature attribute under each class, and record these frequencies of occurrence and conditional probability estimates as the classifier.

Correspondingly, the classification module 230 is specifically configured to obtain the value ranges of all feature attributes of a document to be classified that was converted by the conversion module 240; according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every feature attribute under each class, calculate the conditional probability that the document to be classified belongs to each class; and take the class with the largest conditional probability as the classification result of the document to be classified.

The steps of the method described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above is only a specific embodiment of the invention, but the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims

1. A document classification method, characterized in that it is applied to a Hadoop cluster comprising a Map program and a Reduce program, the method comprising the following steps:

2. The method of claim 1, characterized in that, after the Map program determines the feature attributes according to the parsing result, the method further comprises:

3. The method of claim 2, characterized in that the Map program generating the classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents is specifically:

4. The method of claim 3, characterized in that the Reduce program using the classifier to classify the format-converted documents to be classified and obtaining the classification results of the documents to be classified is specifically:

5. The method of claim 1, characterized in that the Map program parsing the training documents and the documents to be classified, determining the feature attributes according to the parsing result, and dividing the feature attributes is specifically:

6. A document classification system, characterized in that it is applied to a Hadoop cluster, the system comprising:

7. The system of claim 6, characterized in that it further comprises:

8. The system of claim 7, characterized in that

the generation module is specifically configured to, according to the value ranges of each feature attribute of the training documents converted by the conversion module and the classification results of the training documents, calculate the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every feature attribute under each class, and record the frequencies of occurrence and the conditional probability estimates as the classifier.

9. The system of claim 8, characterized in that

the classification module is specifically configured to obtain the value ranges of all feature attributes of a document to be classified that was converted by the conversion module; according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every feature attribute under each class, calculate the conditional probability that the document to be classified belongs to each class; and take the class with the largest conditional probability as the classification result of the document to be classified.

10. The system of claim 6, characterized in that

the parsing module is specifically configured to parse the training documents and the documents to be classified to obtain the attributes that they contain, select feature attributes from the attributes obtained by parsing, and divide each feature attribute into multiple value ranges.