CN105630931A

CN105630931A - Document classification method and device

Info

Publication number: CN105630931A
Application number: CN201510974508.6A
Authority: CN
Inventors: 唐旋; 毛立花; 王传超
Original assignee: Inspur Software Group Co Ltd
Current assignee: Inspur Software Group Co Ltd
Priority date: 2015-12-22
Filing date: 2015-12-22
Publication date: 2016-06-01

Abstract

The invention provides a method and a device for document classification. The method comprises: acquiring a plurality of training documents and determining the category corresponding to each training document; determining a feature vector of each category according to the training documents corresponding to that category, wherein the feature vector comprises the word strings appearing in the category and the occurrence probability of each word string in the category; obtaining a current document to be classified and extracting from it a matching feature vector, wherein the matching feature vector comprises the word strings to be matched that appear in the current document to be classified; determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category; and taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified. The method and device allow documents to be classified more flexibly.

Description

Method and device for document classification

Technical field

The present invention relates to the field of computer technology, and in particular to a method and a device for document classification.

Background

With the continued development of technology, natural language processing has received unprecedented attention, made considerable progress, and evolved into a relatively independent discipline. With the popularity of concepts and technologies such as Internet+ and big data, various industries are trying to make full use of web page text data on the network, and natural language processing plays the leading role in processing, analyzing, and utilizing these web page texts.

In the prior art, web page text data is processed mainly with preset, fixed classification methods, which are difficult to adjust to the user's needs. For example, when the accuracy of the classification results does not meet the user's needs, the user can hardly adjust the classification method to reach the required accuracy. As can be seen from the above, the classification methods of the prior art are not flexible enough.

Summary of the invention

The invention provides a method and a device for document classification that allow documents to be classified more flexibly.

In one aspect, the invention provides a method for document classification, including:

S1: acquire a plurality of training documents and determine the category corresponding to each training document;

S2: determine the feature vector of each category according to the training documents corresponding to that category, the feature vector including: the word strings occurring in the category, and the occurrence probability of each word string in the category;

S3: obtain the current document to be classified and extract from it the matching feature vector of the current document to be classified, the matching feature vector including: the word strings to be matched that occur in the current document to be classified;

S4: determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;

S5: take the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.

Further, S2 includes:

Processing the training documents corresponding to each category into plain text documents, and performing word segmentation on the plain text document corresponding to each training document to obtain the words of each training document;

Combining each run of adjacent words in each training document, of a length equal to a preset value, into a word string, and determining the occurrence probability of each word string in its corresponding category;

Determining the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to that category.

Further, S3 includes:

Processing the current document to be classified into a plain text document, and performing word segmentation on that plain text document to obtain the words of the current document to be classified;

Combining each run of adjacent words in the current document to be classified, of a length equal to the preset value, into a word string;

Determining the matching feature vector according to the word strings in the current document to be classified.

Further, S4 includes:

Determining, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category;

For each category, determining the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified, and taking that sum as the similarity corresponding to the current category.
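The similarity in S4 and the selection in S5 can be written compactly. The notation below is introduced here for illustration and does not appear in the original text: M(d) denotes the set of word strings to be matched in document d, and P_j(s) denotes the occurrence probability of word string s in category j, taken as 0 when s is absent from that category's feature vector.

```latex
\[
  \mathrm{sim}(d, j) = \sum_{s \in M(d)} P_j(s),
  \qquad
  \hat{\jmath}(d) = \arg\max_{j} \, \mathrm{sim}(d, j)
\]
```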

Further, after S2 and before S3, the method also includes:

A1: obtain a plurality of test documents and determine the actual category of each test document;

A2: obtain the word strings to be tested from each test document;

A3: determine, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;

A4: for each category, determine the sum of the occurrence probabilities in the current category of all word strings to be tested of the current test document;

A5: take the category with the largest sum of occurrence probabilities as the matched category of the current test document;

A6: determine the classification accuracy of each category according to the matched category and the actual category of each test document;

A7: judge, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, perform step S3; otherwise, perform step A8;

A8: take the plurality of test documents as training documents and perform step S1.

In another aspect, the invention provides a device for document classification, including:

A first acquiring unit, configured to acquire a plurality of training documents and determine the category corresponding to each training document;

A training unit, configured to determine the feature vector of each category according to the training documents corresponding to that category, the feature vector including: the word strings occurring in the category, and the occurrence probability of each word string in the category;

A second acquiring unit, configured to obtain the current document to be classified and extract from it the matching feature vector of the current document to be classified, the matching feature vector including: the word strings to be matched that occur in the current document to be classified;

A determining unit, configured to determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;

A classification unit, configured to take the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.

Further, the training unit is configured to process the training documents corresponding to each category into plain text documents, perform word segmentation on the plain text document corresponding to each training document to obtain the words of each training document, combine each run of adjacent words in each training document, of a length equal to a preset value, into a word string, determine the occurrence probability of each word string in its corresponding category, and determine the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to that category.

Further, the second acquiring unit is configured to process the current document to be classified into a plain text document, perform word segmentation on that plain text document to obtain the words of the current document to be classified, combine each run of adjacent words in the current document to be classified, of a length equal to the preset value, into a word string, and determine the matching feature vector according to the word strings in the current document to be classified.

Further, the determining unit is configured to determine, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category, and, for each category, to determine the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified and take that sum as the similarity corresponding to the current category.

Further, the device also includes a testing unit, configured to perform:

A2: obtain the word strings to be tested from each test document;

A7: judge, for each category, whether its classification accuracy is greater than or equal to the preset accuracy threshold; if so, trigger the second acquiring unit; otherwise, perform step A8;

A8: take the plurality of test documents as training documents and trigger the first acquiring unit.

The method and device for document classification provided by the invention train each category with training documents to obtain the feature vector of each category, determine the similarity between the matching feature vector of the document to be classified and the feature vector of each category, and take the category whose feature vector is most similar to the matching feature vector as the category of the document to be classified. When the classification results do not meet the user's requirements, the feature vectors can be updated by adjusting the training documents, so that the classification results better match the user's needs and documents can be classified more flexibly.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for document classification provided by an embodiment of the present invention;

Fig. 2 is a flow chart of another method for document classification provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of a device for document classification provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of another device for document classification provided by an embodiment of the present invention.

Detailed description of the invention

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

As shown in Fig. 1, an embodiment of the present invention provides a method for document classification, which may include the following steps:

The method for document classification provided by the embodiment of the present invention trains each category with training documents to obtain the feature vector of each category, determines the similarity between the matching feature vector of the document to be classified and the feature vector of each category, and takes the category whose feature vector is most similar to the matching feature vector as the category of the document to be classified. When the classification results do not meet the user's requirements, the feature vectors can be updated by adjusting the training documents, so that the classification results better match the user's needs and documents can be classified more flexibly.

In a possible implementation, S2 includes:

Here, a training document may be a web page. To make word strings easier to extract, the training document is preprocessed into a plain text document; the preprocessing may include extracting the main text content and removing spaces, special symbols, and the like. The plain text document is then segmented into words. For example, segmenting "a kind of method of document classification" may yield, in the original word order, the consecutive words "a kind of", "document", "classification", "of", and "method". A word string is made up of words: when the preset value is 2, a word string consists of 2 words, so "a kind of document" and "document classification" can each serve as a word string. The occurrence probability of each word string in the current category may be calculated as follows: determine the number of occurrences of each word string that occurs in the current category, and determine the total number of occurrences of all word strings in the current category; divide the number of occurrences of the current word string by that total to obtain its occurrence probability. For example, category C has two training documents, A and B. Training document A contains word strings A, B, and C, which occur 2, 3, and 4 times respectively; training document B contains word strings A and B, which occur 5 and 7 times respectively. Word string A therefore occurs 2+5=7 times in category C, word string B occurs 3+7=10 times, and word string C occurs 4 times. The total number of occurrences of all word strings in the category is 7+10+4=21, so the occurrence probability of word string A is 7/21.
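The word-string construction and the probability calculation described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment; the function names `word_strings` and `category_feature_vector` are hypothetical, and the input is assumed to be already segmented into words.

```python
from collections import Counter


def word_strings(words, n=2):
    """Combine every run of n adjacent words into one word string (a tuple)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def category_feature_vector(counts_per_document):
    """Merge per-document word-string counts for one category and divide
    each count by the total number of word-string occurrences."""
    totals = Counter()
    for counts in counts_per_document:
        totals.update(counts)  # adds counts, does not overwrite
    grand_total = sum(totals.values())
    return {s: c / grand_total for s, c in totals.items()}


# Worked example from the text: category C has training documents A and B.
doc_a = {"word string A": 2, "word string B": 3, "word string C": 4}
doc_b = {"word string A": 5, "word string B": 7}
vector_c = category_feature_vector([doc_a, doc_b])
# vector_c["word string A"] is 7/21, matching the example above.
```

With a preset value of 2, `word_strings(["w1", "w2", "w3"])` yields the two pairs `("w1", "w2")` and `("w2", "w3")`; counting such pairs per document with `Counter` would give inputs of the shape used for `doc_a` and `doc_b`.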

In a possible implementation, S3 includes:

In this implementation, word segmentation may use the same method as in step S2, which makes the classification result more accurate. The preset value here is the same as the preset value in step S2; only then can matching word strings be found in the feature vector of each category. The word strings may also be formed in the same way as in step S2.

In a possible implementation, S4 includes:

For example, there are two categories, category A and category B. The feature vector of category A includes word strings A, B, and C, whose occurrence probabilities in category A are 0.2, 0.3, and 0.5 respectively. The feature vector of category B includes word strings C and D, whose occurrence probabilities in category B are 0.2 and 0.8 respectively. The word strings to be matched of document A to be classified are word strings A, C, and E. Word string A has an occurrence probability of 0.2 in category A and 0 in category B; word string C has 0.5 in category A and 0.2 in category B; word string E has 0 in both categories. For category A, the sum of the occurrence probabilities of all word strings to be matched of document A is 0.2 + 0.5 + 0 = 0.7. For category B, the sum is 0 + 0.2 + 0 = 0.2.

As can be seen, the similarity between the matching feature vector and the feature vector of category A is 0.7, and the similarity with the feature vector of category B is 0.2, so document A to be classified belongs to category A.
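The example above can be reproduced with a short Python sketch. The names `similarity` and `classify` are hypothetical. Two details are assumptions rather than statements of the original text: each word string to be matched is counted once, as in the example, and a tie between categories is resolved in favor of the first category encountered.

```python
def similarity(strings_to_match, feature_vector):
    """Sum the occurrence probabilities of the word strings to be matched;
    a word string absent from the feature vector contributes 0."""
    return sum(feature_vector.get(s, 0.0) for s in strings_to_match)


def classify(strings_to_match, feature_vectors):
    """Return the category with the highest similarity, plus all scores."""
    scores = {
        category: similarity(strings_to_match, vector)
        for category, vector in feature_vectors.items()
    }
    best = max(scores, key=scores.get)  # first maximum wins on a tie
    return best, scores


feature_vectors = {
    "category A": {"word string A": 0.2, "word string B": 0.3, "word string C": 0.5},
    "category B": {"word string C": 0.2, "word string D": 0.8},
}
document_a = ["word string A", "word string C", "word string E"]
best, scores = classify(document_a, feature_vectors)
# best is "category A"; the scores are about 0.7 and 0.2.
```

Because the score is a plain sum, a document with many word strings can reach a higher total than a short one; the sketch follows the description as written and does not normalize for document length.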

To meet the user's requirement on classification accuracy, the feature vector of each category may be tested before the documents to be classified are classified. In a possible implementation, after S2 and before S3, the method further includes:

A2: obtain the word strings to be tested from each test document;

In this implementation, test documents of a particular category can be obtained to test that category, and the preset accuracy threshold of each category can be set as needed. In this way, the feature vectors are tested with the test documents; when a category does not meet the requirement, the test documents can be used as training documents to train that category again.

In this implementation, when the classification accuracy of each category is calculated, it is judged for each test document whether its matched category is the same as its actual category. If so, the test document is determined to be classified correctly; otherwise, it is determined to be classified incorrectly. The total number A of test documents in each actual category and the number B of correctly classified test documents in each actual category are determined; the classification accuracy of each category is B/A. For example, 10 test documents have actual category A. After classification, 8 of these 10 test documents have matched category A, that is, 8 are classified correctly for category A, so the classification accuracy of category A is 8/10 = 0.8. The preset accuracy threshold may be set to 80%.
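The accuracy calculation and the threshold check can be sketched in Python as below. The function names `accuracy_per_category` and `categories_below_threshold` are hypothetical, and a single threshold shared by all categories is a simplifying assumption; the description allows a threshold per category.

```python
from collections import defaultdict


def accuracy_per_category(actual, matched):
    """For each actual category, return B/A: correctly classified test
    documents divided by all test documents of that category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for actual_category, matched_category in zip(actual, matched):
        totals[actual_category] += 1
        if actual_category == matched_category:
            correct[actual_category] += 1
    return {c: correct[c] / totals[c] for c in totals}


def categories_below_threshold(accuracies, threshold):
    """List the categories whose accuracy is below the preset threshold."""
    return [c for c, acc in accuracies.items() if acc < threshold]


# Worked example from the text: 10 test documents of category A, 8 matched.
actual = ["category A"] * 10
matched = ["category A"] * 8 + ["category B"] * 2
accuracies = accuracy_per_category(actual, matched)
needs_retraining = categories_below_threshold(accuracies, 0.8)
# accuracies["category A"] is 0.8, which meets a threshold of 80%,
# so needs_retraining is empty and classification can proceed.
```

A non-empty result from `categories_below_threshold` corresponds to the branch in which the test documents are added as training documents and training is repeated.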

To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and a specific embodiment.

In this embodiment, tax-related web pages on the network need to be classified. Specifically, they can be divided into six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.

As shown in Fig. 2, an embodiment of the present invention provides a method for document classification, which may include the following steps:

Step 201: acquire a plurality of training documents and determine the category corresponding to each training document.

Specifically, training documents can be collected from the website of the State Administration of Taxation, the websites of provincial and municipal tax bureaus, and authoritative websites such as China Tax Net and Accounting Net, and each training document is assigned to one of the six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, or enterprise tax news.

Step 202: process the training documents corresponding to each category into plain text documents, and perform word segmentation on the plain text document corresponding to each training document to obtain the words of each training document.

Specifically, jsoup can be used to process the training documents and obtain the plain text documents.

Step 203: combine each run of adjacent words in each training document, of a length equal to the preset value, into a word string, and determine the occurrence probability of each word string in its corresponding category.

Specifically, the occurrence probability of each word string is determined in each of the six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.

Step 204: determine the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to that category, the feature vector including: the word strings occurring in the category, and the occurrence probability of each word string in the category.

Specifically, the feature vector of each category can be represented as follows:

$$T_j = \{(w_1 w_2 \ldots w_n,\ P(w_1 w_2 \ldots w_n))_1,\ \ldots,\ (w_1 w_2 \ldots w_n,\ P(w_1 w_2 \ldots w_n))_m\}$$

where $T_j$ is the feature vector of the $j$-th category, $w_n$ is the $n$-th word in the current word string, $w_1 w_2 \ldots w_n$ is a word string, and $P(w_1 w_2 \ldots w_n)$ is the occurrence probability of the word string $w_1 w_2 \ldots w_n$. The element $(w_1 w_2 \ldots w_n,\ P(w_1 w_2 \ldots w_n))_m$ denotes the $m$-th word string together with its occurrence probability. For example, the feature vector of the enterprise tax news category may be {("one" "enterprise", 0.2), ("enterprise" "pays taxes", 0.8)}. Each word string here consists of 2 words: the word string "one" "enterprise" has an occurrence probability of 0.2 in the enterprise tax news category, and the word string "enterprise" "pays taxes" has an occurrence probability of 0.8 in that category.

Step 205: obtain the current document to be classified, process it into a plain text document, and perform word segmentation on that plain text document to obtain the words of the current document to be classified.

Specifically, jsoup can be used to process the current document to be classified and obtain the plain text document.

Specifically, the word segmentation method may be the same as the one described above.

Step 206: combine each run of adjacent words in the current document to be classified, of a length equal to the preset value, into a word string.

Step 207: determine the matching feature vector according to the word strings in the current document to be classified, the matching feature vector including: the word strings to be matched that occur in the current document to be classified.

Step 208: determine, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category.

Step 209: for each category, determine the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified, and take that sum as the similarity corresponding to the current category.

Step 210: take the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.

The preset value above can be adjusted as needed; adjusting it can tune the accuracy of the classification.

The above embodiment can be implemented in Java by integrating open-source natural language processing tools such as OpenNLP, FudanNLP, LingPipe, IKAnalyzer, and word2vec.
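Steps 202 and 205 reduce a web page to plain text before word segmentation. The embodiment names jsoup, a Java library, for this. The sketch below is a stand-in that uses the `html.parser` module from Python's standard library instead of jsoup, and the names `TextExtractor` and `to_plain_text` are hypothetical. It only drops markup and the contents of script and style elements; isolating the main body text of a page, as the description suggests, would need more than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def to_plain_text(html):
    """Return the text fragments of the page joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Tax <b>notice</b></p><script>track()</script></body></html>"
)
text = to_plain_text(page)
```

The resulting text would then be passed to a word segmenter; for Chinese text that step needs a dedicated segmentation tool and is not shown here.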

As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a device for document classification. The device embodiment may be implemented in software, in hardware, or in a combination of software and hardware. At the hardware level, Fig. 3 is a hardware structure diagram of the equipment in which the document classification device provided by the embodiment is located. In addition to the processor, memory, network interface, and non-volatile storage shown in Fig. 3, the equipment in which the device is located may generally also include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed by the CPU of the equipment in which it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them. The document classification device provided by this embodiment includes:

A first acquiring unit 401, configured to acquire a plurality of training documents and determine the category corresponding to each training document;

A training unit 402, configured to determine the feature vector of each category according to the training documents corresponding to that category, the feature vector including: the word strings occurring in the category, and the occurrence probability of each word string in the category;

A second acquiring unit 403, configured to obtain the current document to be classified and extract from it the matching feature vector of the current document to be classified, the matching feature vector including: the word strings to be matched that occur in the current document to be classified;

A determining unit 404, configured to determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;

A classification unit 405, configured to take the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.

In a possible implementation, the training unit 402 is configured to process the training documents corresponding to each category into plain text documents, perform word segmentation on the plain text document corresponding to each training document to obtain the words of each training document, combine each run of adjacent words in each training document, of a length equal to a preset value, into a word string, determine the occurrence probability of each word string in its corresponding category, and determine the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to that category.

In a possible implementation, the second acquiring unit 403 is configured to process the current document to be classified into a plain text document, perform word segmentation on that plain text document to obtain the words of the current document to be classified, combine each run of adjacent words in the current document to be classified, of a length equal to the preset value, into a word string, and determine the matching feature vector according to the word strings in the current document to be classified.

In a possible implementation, the determining unit 404 is configured to determine, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category, and, for each category, to determine the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified and take that sum as the similarity corresponding to the current category.

In a possible implementation, the device further includes a testing unit, configured to perform:

A2: obtain the word strings to be tested from each test document;

The information exchange between the units of the above device, their execution processes, and similar details are based on the same concept as the method embodiments of the present invention. For specifics, refer to the description of the method embodiments; they are not repeated here.

The method of a kind of document classification that the embodiment of the present invention provides and device, have the advantages that

1, the method of a kind of document classification that the embodiment of the present invention provides and device, by Training document, every kind is trained, obtain every kind characteristic of correspondence vector, determine that the matching characteristic of document to be sorted is vectorial and the similarity of the characteristic vector of every kind, determine classification corresponding to the characteristic vector the highest with the matching characteristic vector similarity classification as document to be sorted, when classification results can not reach user require time, characteristic vector can be updated by adjusting training document, make classification results can more conform to user's request, document classification can be carried out more neatly.

2. In the method and device for document classification provided by the embodiments of the present invention, test documents of a given category can be obtained and that category tested, and the preset accuracy threshold of each category can be set as needed. With this implementation, the feature vectors can be tested using test documents; when a category fails to meet the requirement, the test documents can be used as training documents to train that category again, thereby improving classification accuracy.
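The per-category test described in item 2 might be sketched as follows. This is a hedged illustration only: measuring accuracy as the fraction of a category's test documents that are assigned to that category, the function names, and the `classify_doc` callable are all assumptions, since this passage does not fix how accuracy is computed.

```python
from typing import Callable, List


def category_accuracy(test_docs: List[str],
                      expected_category: str,
                      classify_doc: Callable[[str], str]) -> float:
    """Fraction of test documents assigned to the expected category.

    This definition of accuracy is an assumption made for the sketch.
    """
    if not test_docs:
        raise ValueError("at least one test document is required")
    correct = sum(1 for doc in test_docs
                  if classify_doc(doc) == expected_category)
    return correct / len(test_docs)


def needs_retraining(accuracy: float, threshold: float) -> bool:
    """True when a category falls short of its preset accuracy threshold."""
    return accuracy < threshold


def toy_classifier(doc: str) -> str:
    """Stand-in classifier, invented purely for this example."""
    return "sports" if "ball" in doc else "finance"


docs = ["ball game", "stock report", "ball", "ballpark figures"]
acc = category_accuracy(docs, "sports", toy_classifier)
print(acc, needs_retraining(acc, threshold=0.8))
```

When `needs_retraining` returns `True`, the test documents of that category would be added to its training documents and the category's feature vector recomputed, as the passage above describes.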

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the statement "including" does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.

A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, a magnetic disk, or an optical disc.

Finally, it should be understood that the foregoing describes only preferred embodiments of the present invention, which serve merely to illustrate its technical solution and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A method for document classification, characterized by comprising:

2. The method according to claim 1, characterized in that S2 comprises:

3. The method according to claim 2, characterized in that S3 comprises:

4. The method according to claim 1, characterized in that S4 comprises:

5. The method according to any one of claims 1-4, characterized in that, after S2 and before S3, the method further comprises:

A2: obtaining word strings to be tested from each test document;

6. A device for document classification, characterized by comprising:

7. The device according to claim 6, characterized in that the training unit is configured to convert the training documents corresponding to each category into plain text documents, perform word segmentation on the plain text document corresponding to each training document to obtain the words of each training document, combine each preset number of adjacent words in each training document into a word string, determine the occurrence probability of each word string in its corresponding category, and determine the feature vector of each category according to the occurrence probabilities, in the corresponding category, of the word strings in the training documents corresponding to that category.

8. The device according to claim 7, characterized in that the second acquisition unit is configured to convert the current document to be classified into a plain text document, perform word segmentation on that plain text document to obtain the words of the current document to be classified, combine each preset number of adjacent words in the current document to be classified into a word string, and determine the matching feature vector according to the word strings in the current document to be classified.

9. The device according to claim 6, characterized in that the determining unit is configured to determine, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category; and, for each category, to determine the sum of the occurrence probabilities, in the current category, of all word strings to be matched of the current document to be classified, and to take the sum of occurrence probabilities corresponding to the current category as the similarity corresponding to the current category.

10. The device according to any one of claims 6-9, characterized by further comprising a test unit configured to perform:

A2: obtaining word strings to be tested from each test document;