CN105630931A - Document classification method and device - Google Patents

Info

Publication number
CN105630931A
Authority
CN
China
Prior art keywords
classification
document
current
word string
characteristic vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510974508.6A
Other languages
Chinese (zh)
Inventor
唐旋
毛立花
王传超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Group Co Ltd
Priority to CN201510974508.6A
Publication of CN105630931A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The invention provides a method and a device for classifying documents. The method comprises: acquiring a plurality of training documents and determining the category corresponding to each training document; determining a feature vector for each category according to the training documents corresponding to that category, wherein the feature vector comprises the word strings that occur in the category and the occurrence probability of each word string in the category; acquiring a current document to be classified and extracting from it a matching feature vector, wherein the matching feature vector comprises the word strings to be matched that occur in the current document to be classified; determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category; and taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified. The method and device allow documents to be classified more flexibly.

Description

Method and device for document classification
Technical field
The present invention relates to the field of computer technology, and in particular to a method and a device for document classification.
Background
With the continuing development of technology, natural language processing has received unprecedented attention and made considerable progress, developing into a relatively independent discipline that attracts wide interest. Now that concepts and technologies such as "Internet+" and big data are in the spotlight, every industry is experimenting with ways to make full use of web page text data on the network, and natural language processing plays the leading role in processing, analyzing, and using these web page texts.
In the prior art, web page text data is processed mainly with preset, fixed classification methods, which are difficult to adjust to a user's needs. For example, when the accuracy of the classification results does not meet the user's needs, the user can hardly adjust the classification method to reach the required accuracy. As this shows, prior-art classification methods are not flexible enough.
Summary of the invention
The present invention provides a method and a device for document classification that can classify documents more flexibly.
In one aspect, the present invention provides a method for document classification, comprising:
S1: acquiring a plurality of training documents, and determining the category corresponding to each training document;
S2: determining a feature vector for each category according to the training documents corresponding to that category, wherein the feature vector comprises the word strings that occur in the category and the occurrence probability of each word string in the category;
S3: acquiring a current document to be classified, and extracting from it a matching feature vector, wherein the matching feature vector comprises the word strings to be matched that occur in the current document to be classified;
S4: determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;
S5: taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
Further, S2 comprises:
processing the training documents corresponding to each category into plain text documents, and performing word segmentation on the plain text document corresponding to each training document to obtain a plurality of words for each training document;
combining a preset number of adjacent words in each training document into word strings, and determining the occurrence probability of each word string in its corresponding category;
determining the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to the category.
Further, S3 comprises:
processing the current document to be classified into a plain text document, and performing word segmentation on that plain text document to obtain a plurality of words for the current document to be classified;
combining the preset number of adjacent words in the current document to be classified into word strings;
determining the matching feature vector according to the word strings in the current document to be classified.
Further, S4 comprises:
determining, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category;
for each category, determining the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified, and taking that sum as the similarity corresponding to the current category.
Further, after S2 and before S3, the method also comprises:
A1: acquiring a plurality of test documents, and determining the actual category of each test document;
A2: obtaining word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;
A4: for each category, determining the sum of the occurrence probabilities in the current category of all word strings to be tested of the current test document;
A5: taking the category with the largest sum of occurrence probabilities as the matched category of the current test document;
A6: determining the classification accuracy of each category according to the matched category and the actual category of each test document;
A7: judging, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, performing step S3; otherwise, performing step A8;
A8: taking the plurality of test documents as training documents, and performing step S1.
In another aspect, the present invention provides a device for document classification, comprising:
a first acquiring unit, configured to acquire a plurality of training documents and determine the category corresponding to each training document;
a training unit, configured to determine a feature vector for each category according to the training documents corresponding to that category, wherein the feature vector comprises the word strings that occur in the category and the occurrence probability of each word string in the category;
a second acquiring unit, configured to acquire a current document to be classified and extract from it a matching feature vector, wherein the matching feature vector comprises the word strings to be matched that occur in the current document to be classified;
a determining unit, configured to determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;
a classifying unit, configured to take the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
Further, the training unit is configured to process the training documents corresponding to each category into plain text documents, perform word segmentation on the plain text document corresponding to each training document to obtain a plurality of words for each training document, combine a preset number of adjacent words in each training document into word strings, determine the occurrence probability of each word string in its corresponding category, and determine the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to the category.
Further, the second acquiring unit is configured to process the current document to be classified into a plain text document, perform word segmentation on that plain text document to obtain a plurality of words for the current document to be classified, combine the preset number of adjacent words in the current document to be classified into word strings, and determine the matching feature vector according to the word strings in the current document to be classified.
Further, the determining unit is configured to determine, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category, and, for each category, to determine the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified and take that sum as the similarity corresponding to the current category.
Further, the device also comprises a test unit, configured to perform:
A1: acquiring a plurality of test documents, and determining the actual category of each test document;
A2: obtaining word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;
A4: for each category, determining the sum of the occurrence probabilities in the current category of all word strings to be tested of the current test document;
A5: taking the category with the largest sum of occurrence probabilities as the matched category of the current test document;
A6: determining the classification accuracy of each category according to the matched category and the actual category of each test document;
A7: judging, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, triggering the second acquiring unit; otherwise, performing step A8;
A8: taking the plurality of test documents as training documents, and triggering the first acquiring unit.
The method and device for document classification provided by the present invention train each category with training documents to obtain a feature vector for each category, determine the similarity between the matching feature vector of the document to be classified and the feature vector of each category, and take the category whose feature vector is most similar to the matching feature vector as the category of the document to be classified. When the classification results do not meet the user's requirements, the feature vectors can be updated by adjusting the training documents, so that the classification results better match the user's needs and documents can be classified more flexibly.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for document classification provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another method for document classification provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a device for document classification provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of another device for document classification provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method for document classification, which may comprise the following steps:
S1: acquiring a plurality of training documents, and determining the category corresponding to each training document;
S2: determining a feature vector for each category according to the training documents corresponding to that category, wherein the feature vector comprises the word strings that occur in the category and the occurrence probability of each word string in the category;
S3: acquiring a current document to be classified, and extracting from it a matching feature vector, wherein the matching feature vector comprises the word strings to be matched that occur in the current document to be classified;
S4: determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;
S5: taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
The method for document classification provided by this embodiment trains each category with training documents to obtain a feature vector for each category, determines the similarity between the matching feature vector of the document to be classified and the feature vector of each category, and takes the category whose feature vector is most similar to the matching feature vector as the category of the document to be classified. When the classification results do not meet the user's requirements, the feature vectors can be updated by adjusting the training documents, so that the classification results better match the user's needs and documents can be classified more flexibly.
In one possible implementation, S2 comprises:
processing the training documents corresponding to each category into plain text documents, and performing word segmentation on the plain text document corresponding to each training document to obtain a plurality of words for each training document;
combining a preset number of adjacent words in each training document into word strings, and determining the occurrence probability of each word string in its corresponding category;
determining the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to the category.
Here, a training document may be a web page. To make word strings easy to extract, the training document needs to be preprocessed into a plain text document; the processing may include extracting the main body text and removing spaces, special symbols, and the like. Word segmentation is then performed on the plain text document to obtain a plurality of words. For example, after segmentation, "a method of document classification" may yield the words "a", "document", "classification", "of", "method". Word strings are composed of words: when the preset value is 2, each word string consists of 2 words, so "a document" and "document classification" can each serve as a word string. The occurrence probability of each word string in the current category may be calculated as follows: determine the number of occurrences of each word string that occurs in the current category, and determine the total number of occurrences of all word strings in the current category; then divide the number of occurrences of the current word string by that total to obtain its occurrence probability. For example, suppose category C has two training documents, A and B. Training document A contains word string A, word string B, and word string C, which occur 2, 3, and 4 times in it respectively; training document B contains word string A and word string B, which occur 5 and 7 times respectively. Word string A then occurs 2+5=7 times in category C, word string B occurs 3+7=10 times, and word string C occurs 4 times; the total number of occurrences of all word strings in the category is 7+10+4=21, and the occurrence probability of word string A is 7/21.
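The word-string construction and the probability calculation described above can be sketched as follows. This is a minimal illustration in Python, not the patent's own implementation; the function names `word_strings` and `occurrence_probabilities`, and the representation of a word string as a tuple of words, are assumptions made for the example.

```python
from collections import Counter

def word_strings(words, n=2):
    """Join every run of n adjacent words into one word string (a tuple).

    n plays the role of the "preset value" in the text (an assumed name).
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def occurrence_probabilities(counts_per_document):
    """Pool per-document word-string counts for one category and divide
    each pooled count by the total count of all word strings."""
    totals = Counter()
    for counts in counts_per_document:
        totals.update(counts)  # adds counts, it does not overwrite them
    grand_total = sum(totals.values())
    return {s: c / grand_total for s, c in totals.items()}

# Segmentation example from the text, with a preset value of 2.
words = ["a", "document", "classification", "of", "method"]
print(word_strings(words, 2)[:2])

# Worked example from the text: category C with training documents A and B.
doc_a = Counter({"word string A": 2, "word string B": 3, "word string C": 4})
doc_b = Counter({"word string A": 5, "word string B": 7})
probabilities = occurrence_probabilities([doc_a, doc_b])
print(probabilities["word string A"])  # 7 occurrences out of 21
```

The probabilities of one category sum to 1 by construction, because every count is divided by the same category-wide total.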
In one possible implementation, S3 comprises:
processing the current document to be classified into a plain text document, and performing word segmentation on that plain text document to obtain a plurality of words for the current document to be classified;
combining the preset number of adjacent words in the current document to be classified into word strings;
determining the matching feature vector according to the word strings in the current document to be classified.
In this implementation, word segmentation may use the same method as in step S2, which can make the classification result more accurate. The preset value here is the same as the preset value in step S2; only then can matching word strings be found in the feature vector of each category. The way word strings are formed may likewise be the same as in step S2.
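Why the preset value must match the one used in training can be seen in a small sketch. The Python below is only an illustration under assumed names (`word_strings`, `category_vector`) and with made-up probabilities; it shows that word strings built with a different preset value never appear in a category's feature vector.

```python
def word_strings(words, n):
    """Join every run of n adjacent words into one word string (a tuple)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical feature vector of one category, trained with a preset value of 2.
category_vector = {
    ("document", "classification"): 0.6,
    ("classification", "method"): 0.4,
}

words = ["document", "classification", "method"]

# Same preset value as in training: both word strings are found.
found_same = [s for s in word_strings(words, 2) if s in category_vector]
# Different preset value: the 3-word string cannot match any 2-word entry.
found_other = [s for s in word_strings(words, 3) if s in category_vector]

print(len(found_same), len(found_other))
```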
In one possible implementation, S4 comprises:
determining, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category;
for each category, determining the sum of the occurrence probabilities in the current category of all word strings to be matched of the current document to be classified, and taking that sum as the similarity corresponding to the current category.
For example, suppose there are two categories, category A and category B. The feature vector of category A comprises word string A, word string B, and word string C, whose occurrence probabilities in category A are 0.2, 0.3, and 0.5 respectively. The feature vector of category B comprises word string C and word string D, whose occurrence probabilities in category B are 0.2 and 0.8 respectively. The word strings to be matched of document A to be classified are word string A, word string C, and word string E. The occurrence probabilities of word string A in category A and category B are 0.2 and 0 respectively; those of word string C are 0.5 and 0.2 respectively; and those of word string E are both 0. For category A, the sum of the occurrence probabilities of all word strings to be matched of document A is: probability of word string A in category A + probability of word string C in category A + probability of word string E in category A = 0.7. For category B, the sum is: probability of word string A in category B + probability of word string C in category B + probability of word string E in category B = 0.2.
Thus the similarity between the matching feature vector and the feature vector of category A is 0.7, and the similarity between the matching feature vector and the feature vector of category B is 0.2, so document A to be classified belongs to category A.
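The similarity calculation in this example can be sketched as follows. This is a hedged Python illustration; the function names `similarity` and `classify` are assumptions, and the sketch assumes each listed word string to be matched contributes its probability once, with 0 for a word string absent from a category's feature vector.

```python
def similarity(strings_to_match, category_vector):
    """Sum the occurrence probabilities of the word strings to be matched;
    a word string missing from the category's feature vector contributes 0."""
    return sum(category_vector.get(s, 0.0) for s in strings_to_match)

def classify(strings_to_match, category_vectors):
    """Return the category with the highest similarity, plus all scores."""
    scores = {
        name: similarity(strings_to_match, vector)
        for name, vector in category_vectors.items()
    }
    return max(scores, key=scores.get), scores

# Feature vectors from the worked example in the text.
category_vectors = {
    "category A": {"word string A": 0.2, "word string B": 0.3, "word string C": 0.5},
    "category B": {"word string C": 0.2, "word string D": 0.8},
}
to_match = ["word string A", "word string C", "word string E"]

best, scores = classify(to_match, category_vectors)
print(best, scores)
```

Because the score is a plain sum rather than a product, an unseen word string such as word string E simply adds nothing instead of driving the score to zero.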
To meet the user's requirements for classification accuracy, the feature vector of each category may be tested before documents are classified. In one possible implementation, after S2 and before S3, the method also comprises:
A1: acquiring a plurality of test documents, and determining the actual category of each test document;
A2: obtaining word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;
A4: for each category, determining the sum of the occurrence probabilities in the current category of all word strings to be tested of the current test document;
A5: taking the category with the largest sum of occurrence probabilities as the matched category of the current test document;
A6: determining the classification accuracy of each category according to the matched category and the actual category of each test document;
A7: judging, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, performing step S3; otherwise, performing step A8;
A8: taking the plurality of test documents as training documents, and performing step S1.
In this implementation, test documents of a given category may be obtained to test that category, and the preset accuracy threshold of each category can be set as needed. In this way the feature vectors can be tested with test documents; when a category fails to meet the requirement, the test documents can be used as training documents and that category trained again.
In this implementation, the classification accuracy of each category is calculated by judging, for each test document, whether its matched category is the same as its actual category; if they are the same, the current test document is classified correctly, otherwise it is classified incorrectly. Determine the total number A of test documents in each actual category and the number B of those that were classified correctly; the classification accuracy of each category is B/A. For example, 10 test documents have actual category A; after classification, 8 of them have matched category A, that is, 8 are classified correctly for category A, so the classification accuracy for category A is 8/10=0.8. The preset accuracy threshold may be set to 80%.
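Steps A6 and A7 can be sketched as follows. The Python below is a minimal illustration with assumed names (`per_category_accuracy`, `categories_below_threshold`); the 2 misclassified documents are assigned to a made-up "category B" purely to complete the example.

```python
from collections import defaultdict

def per_category_accuracy(actual, matched):
    """For each actual category, compute B/A: correctly classified test
    documents (B) over all test documents of that category (A)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for actual_category, matched_category in zip(actual, matched):
        totals[actual_category] += 1
        if actual_category == matched_category:
            correct[actual_category] += 1
    return {category: correct[category] / totals[category] for category in totals}

def categories_below_threshold(accuracy, threshold=0.8):
    """Categories whose accuracy is below the preset accuracy threshold,
    i.e. the ones for which step A8 (retraining) would apply."""
    return [category for category, value in accuracy.items() if value < threshold]

# Example from the text: 10 test documents of category A, 8 matched correctly.
actual = ["category A"] * 10
matched = ["category A"] * 8 + ["category B"] * 2  # "category B" is made up

accuracy = per_category_accuracy(actual, matched)
print(accuracy, categories_below_threshold(accuracy, threshold=0.8))
```

With a threshold of 80%, an accuracy of exactly 0.8 passes, since step A7 tests for "greater than or equal to".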
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and a specific embodiment.
In this embodiment, tax-related web pages on the network need to be classified. Specifically, they may be divided into six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.
As shown in Fig. 2, an embodiment of the present invention provides a method for document classification, which may comprise the following steps:
Step 201: acquire a plurality of training documents, and determine the category corresponding to each training document.
Specifically, training documents may be collected from the State Administration of Taxation website, the websites of provincial and municipal tax bureaus, and authoritative websites such as China Tax Net and Accounting Net, and each training document is then assigned to one of the six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.
Step 202: process the training documents corresponding to each category into plain text documents, and perform word segmentation on the plain text document corresponding to each training document to obtain a plurality of words for each training document.
Specifically, jsoup may be used to process the training documents into plain text documents.
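The text names jsoup, which is a Java library. As a rough illustration of the same preprocessing idea, the sketch below uses Python's standard-library `html.parser` as a stand-in; it is not jsoup and does not reproduce jsoup's behavior. The class name `TextExtractor`, the function `html_to_plain_text`, and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())

def html_to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

# Made-up sample page for illustration.
page = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Tax  notice</p><script>x()</script><p>for 2015</p>"
    "</body></html>"
)
print(html_to_plain_text(page))
```

This sketch collapses whitespace rather than removing it, since English words need separators; for Chinese text, the preprocessing described above removes spaces entirely before segmentation.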
Step 203: combine a preset number of adjacent words in each training document into word strings, and determine the occurrence probability of each word string in its corresponding category.
Specifically, the occurrence probability of each word string is determined separately for each of the six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.
Step 204: determine the feature vector of each category according to the occurrence probabilities, in that category, of the word strings in the training documents corresponding to the category, wherein the feature vector comprises the word strings that occur in the category and the occurrence probability of each word string in the category.
Specifically, the feature vector of each category can be represented as follows:
$$T_j = \{(w_1 w_2 \ldots w_n,\ P(w_1 w_2 \ldots w_n))_1,\ \ldots,\ (w_1 w_2 \ldots w_n,\ P(w_1 w_2 \ldots w_n))_m\}$$
where $T_j$ is the feature vector of the $j$-th category, $w_n$ is the $n$-th word in the current word string, $w_1 w_2 \ldots w_n$ is a word string, and $P(w_1 w_2 \ldots w_n)$ is the occurrence probability of the word string $w_1 w_2 \ldots w_n$. $(w_1 w_2 \ldots w_n,\ P(w_1 w_2 \ldots w_n))_m$ denotes the $m$-th word string together with its occurrence probability. For example, the feature vector of the enterprise tax news category may be {("one" "enterprise", 0.2), ("enterprise" "pays taxes", 0.8)}. Here each word string consists of 2 words; the occurrence probability of the word string "one" "enterprise" in the enterprise tax news category is 0.2, and that of the word string "enterprise" "pays taxes" is 0.8.
Step 205: obtain the current document to be classified, process it into a plain text document, and perform word segmentation on the plain text document to obtain the multiple words corresponding to the current document to be classified.
Specifically, jsoup can be used to process the current document to be classified into a plain text document.
Specifically, the word segmentation method can be the same as the word segmentation method described above.
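As a rough illustration of Step 205, the sketch below strips markup from an HTML document and splits the result into words. It is a stand-in under stated assumptions: the text names jsoup (a Java library), whereas this sketch uses Python's standard `html.parser` module; the whitespace-based `segment` function is only a placeholder for a real word segmenter, and all names and the sample document are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_plain_text(html):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def segment(text):
    """Placeholder segmenter: splits on whitespace. Chinese text would need
    a dedicated word segmentation tool instead."""
    return text.split()


sample = ("<html><body><h1>Tax notice</h1><script>var x = 1;</script>"
          "<p>Enterprise pays taxes</p></body></html>")
plain = to_plain_text(sample)
words = segment(plain)
```

The same two functions would be applied to training documents and to documents being classified, so that both produce comparable word sequences.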
Step 206: form word strings from adjacent words in the current document to be classified, where the number of words in each word string is the preset value.
Step 207: determine the matching feature vector according to the word strings in the current document to be classified. The matching feature vector includes the word strings to be matched that occur in the current document to be classified.
Step 208: according to the feature vector of each category, determine the probability of occurrence of each word string to be matched in each category.
Step 209: for each category, determine the sum of the probabilities of occurrence, in the current category, of all word strings to be matched of the current document to be classified, and use this sum as the similarity corresponding to the current category.
Step 210: use the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
The preset value can be adjusted as required, and adjusting it changes the accuracy of the classification.
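Steps 206 to 210 can be sketched as follows. This is an illustrative reading rather than the definitive method: it assumes that each distinct word string of the document is counted once, and that a word string absent from a category's feature vector contributes a probability of 0. The function names and the toy feature vectors are hypothetical.

```python
def word_strings(words, n):
    """Step 206: join every run of n adjacent words into one word string."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def classify(words, category_vectors, n=2):
    """Steps 207-210: score each category by the summed probabilities of the
    document's word strings and return the best category with all scores."""
    # Step 207: the matching feature vector, taken here as distinct strings.
    to_match = set(word_strings(words, n))
    # Steps 208-209: look up each string per category and sum the results.
    similarity = {
        category: sum(vector.get(string, 0.0) for string in to_match)
        for category, vector in category_vectors.items()
    }
    # Step 210: the category with the highest similarity wins.
    best = max(similarity, key=similarity.get)
    return best, similarity


# Toy feature vectors (made-up numbers).
category_vectors = {
    "enterprise tax news": {
        ("one", "enterprise"): 0.2,
        ("enterprise", "pays taxes"): 0.8,
    },
    "stock market tax news": {
        ("stock", "market"): 0.7,
        ("market", "pays taxes"): 0.3,
    },
}
document = ["one", "enterprise", "pays taxes"]
best, similarity = classify(document, category_vectors, n=2)
```

The parameter `n` plays the role of the preset value: the same value has to be used for training and for classification, otherwise the word strings of the document cannot match those stored in the feature vectors.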
The above embodiment can be implemented in Java by integrating open-source Java natural language processing tools such as OpenNLP, FudanNLP, LingPipe, IKAnalyzer, and word2vec.
As shown in Figures 3 and 4, an embodiment of the present invention provides a document classification device. The device embodiment may be implemented in software, in hardware, or in a combination of software and hardware. At the hardware level, Figure 3 is a hardware structure diagram of the equipment in which the document classification device provided by the embodiment resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 3, the equipment in which the device resides may also include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in Figure 4, the device in the logical sense is formed when the CPU of the equipment in which it resides reads the corresponding computer program instructions from the non-volatile memory into memory and runs them. The document classification device provided by this embodiment includes:
A first acquiring unit 401, configured to obtain multiple training documents and determine the category corresponding to each training document;
A training unit 402, configured to determine the feature vector of each category according to the training documents corresponding to that category, where the feature vector includes the word strings that occur in the current category and the probability of occurrence of each word string in the current category;
A second acquiring unit 403, configured to obtain the current document to be classified and extract from it the matching feature vector of the current document to be classified, where the matching feature vector includes the word strings to be matched that occur in the current document to be classified;
A determining unit 404, configured to determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the probabilities of occurrence in the feature vector of each category;
A classification unit 405, configured to use the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
In one possible implementation, the training unit 402 is configured to process the training documents corresponding to each category into plain text documents, perform word segmentation on the plain text document corresponding to each training document to obtain the multiple words corresponding to each training document, form word strings from adjacent words in each training document with a preset number of words per word string, determine the probability of occurrence of each word string in its corresponding category, and determine the feature vector of each category according to the probability of occurrence, in the corresponding category, of each word string in the training documents of that category.
In one possible implementation, the second acquiring unit 403 is configured to process the current document to be classified into a plain text document, perform word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified, form word strings from adjacent words in the current document to be classified with the preset number of words per word string, and determine the matching feature vector according to the word strings in the current document to be classified.
In one possible implementation, the determining unit 404 is configured to determine, according to the feature vector of each category, the probability of occurrence of each word string to be matched in each category and, for each category, to determine the sum of the probabilities of occurrence in the current category of all word strings to be matched of the current document to be classified, using this sum as the similarity corresponding to the current category.
In one possible implementation, the device further includes a testing unit, configured to perform:
A1: obtain multiple test documents and determine the actual category of each test document;
A2: obtain the word strings to be tested from each test document;
A3: according to the feature vector of each category, determine the probability of occurrence of each word string to be tested in each category;
A4: for each category, determine the sum of the probabilities of occurrence in the current category of all word strings to be tested of the current test document;
A5: use the category with the largest probability sum as the matched category of the current test document;
A6: determine the classification accuracy corresponding to each category according to the matched category and the actual category of each test document;
A7: judge whether the classification accuracy corresponding to each category is greater than or equal to a preset accuracy threshold; if so, trigger the second acquiring unit; otherwise, perform step A8;
A8: use the multiple test documents as the training documents and trigger the first acquiring unit.
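Steps A6 to A8 can be sketched as below. The text does not define how the per-category classification accuracy is computed, so this sketch assumes it is the fraction of test documents of a given actual category whose matched category equals that actual category. The function names, the threshold of 0.8, and the labels are hypothetical.

```python
from collections import defaultdict


def per_category_accuracy(actual, matched):
    """A6: for each actual category, the share of its test documents whose
    matched category is the same (assumed definition of accuracy)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for actual_category, matched_category in zip(actual, matched):
        totals[actual_category] += 1
        if actual_category == matched_category:
            correct[actual_category] += 1
    return {category: correct[category] / totals[category] for category in totals}


def categories_below_threshold(accuracy, threshold):
    """A7: categories whose accuracy is strictly below the preset threshold."""
    return sorted(c for c, value in accuracy.items() if value < threshold)


# Made-up labels for four test documents.
actual = ["tax news", "tax news",
          "policy interpretation", "policy interpretation"]
matched = ["tax news", "policy interpretation",
           "policy interpretation", "policy interpretation"]

accuracy = per_category_accuracy(actual, matched)
failing = categories_below_threshold(accuracy, threshold=0.8)
# A7/A8: classify new documents only if every category passes; otherwise
# the test documents are reused as training documents and training reruns.
next_step = "classify" if not failing else "retrain"
```

Because step A7 uses "greater than or equal to", a category whose accuracy exactly equals the threshold passes, which is why the comparison in `categories_below_threshold` is strict.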
Because the information exchange between the units of the above device and their execution processes are based on the same concept as the method embodiment of the present invention, refer to the description of the method embodiment for details, which are not repeated here.
The document classification method and device provided by the embodiments of the present invention have the following beneficial effects:
1. In the document classification method and device provided by the embodiments of the present invention, each category is trained with training documents to obtain the feature vector corresponding to that category, the similarity between the matching feature vector of the document to be classified and the feature vector of each category is determined, and the category corresponding to the feature vector with the highest similarity to the matching feature vector is used as the category of the document to be classified. When the classification result does not meet the user's requirements, the feature vectors can be updated by adjusting the training documents, so that the classification result better matches the user's needs and documents can be classified more flexibly.
2. In the document classification method and device provided by the embodiments of the present invention, test documents of a given category can be obtained to test that category, and the preset accuracy threshold of each category can be set as required. In this implementation, the feature vectors are tested with the test documents; when a category does not meet the requirement, the test documents are used as training documents to train that category again, which improves the classification accuracy.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" and any variants thereof are intended to be non-exclusive, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be understood that the foregoing describes only preferred embodiments of the present invention, is intended only to illustrate the technical solution of the present invention, and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. the method for a document classification, it is characterised in that including:
S1: obtain multiple Training document, it is determined that the classification that each Training document is corresponding;
S2: according to the Training document that each classification is corresponding, it is determined that the characteristic vector of each classification, described characteristic vector includes: the word string occurred in corresponding current class, and each word string occurs in the probability of occurrence of current class;
S3: obtain current document to be sorted, from current document to be sorted, extracts the matching characteristic vector of current document to be sorted, and described matching characteristic vector includes: the word string to be matched occurred in current document to be sorted;
S4: according to the probability of occurrence in the characteristic vector of the word string to be matched in described matching characteristic vector and each classification, it is determined that described matching characteristic is vectorial and the similarity of the characteristic vector of each classification;
S5: using the classification corresponding for characteristic vector the highest for the similarity classification as described current document to be sorted.
2. The method according to claim 1, characterized in that S2 comprises:
processing the training documents corresponding to each category into plain text documents, and performing word segmentation on the plain text document corresponding to each training document to obtain the multiple words corresponding to each training document;
forming word strings from adjacent words in each training document with a preset number of words per word string, and determining the probability of occurrence of each word string in its corresponding category;
determining the feature vector of each category according to the probability of occurrence, in the corresponding category, of each word string in the training documents of that category.
3. The method according to claim 2, characterized in that S3 comprises:
processing the current document to be classified into a plain text document, and performing word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified;
forming word strings from adjacent words in the current document to be classified with the preset number of words per word string;
determining the matching feature vector according to the word strings in the current document to be classified.
4. The method according to claim 1, characterized in that S4 comprises:
determining, according to the feature vector of each category, the probability of occurrence of each word string to be matched in each category;
for each category, determining the sum of the probabilities of occurrence in the current category of all word strings to be matched of the current document to be classified, and using this sum as the similarity corresponding to the current category.
5. The method according to any one of claims 1-4, characterized by further comprising, after S2 and before S3:
A1: obtaining multiple test documents and determining the actual category of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the probability of occurrence of each word string to be tested in each category;
A4: for each category, determining the sum of the probabilities of occurrence in the current category of all word strings to be tested of the current test document;
A5: using the category with the largest probability sum as the matched category of the current test document;
A6: determining the classification accuracy corresponding to each category according to the matched category and the actual category of each test document;
A7: judging whether the classification accuracy corresponding to each category is greater than or equal to a preset accuracy threshold; if so, performing step S3; otherwise, performing step A8;
A8: using the multiple test documents as the training documents and performing step S1.
6. A document classification device, characterized by comprising:
a first acquiring unit, configured to obtain multiple training documents and determine the category corresponding to each training document;
a training unit, configured to determine the feature vector of each category according to the training documents corresponding to that category, the feature vector including the word strings that occur in the current category and the probability of occurrence of each word string in the current category;
a second acquiring unit, configured to obtain the current document to be classified and extract from it the matching feature vector of the current document to be classified, the matching feature vector including the word strings to be matched that occur in the current document to be classified;
a determining unit, configured to determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the probabilities of occurrence in the feature vector of each category;
a classification unit, configured to use the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
7. The device according to claim 6, characterized in that the training unit is configured to process the training documents corresponding to each category into plain text documents, perform word segmentation on the plain text document corresponding to each training document to obtain the multiple words corresponding to each training document, form word strings from adjacent words in each training document with a preset number of words per word string, determine the probability of occurrence of each word string in its corresponding category, and determine the feature vector of each category according to the probability of occurrence, in the corresponding category, of each word string in the training documents of that category.
8. The device according to claim 7, characterized in that the second acquiring unit is configured to process the current document to be classified into a plain text document, perform word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified, form word strings from adjacent words in the current document to be classified with the preset number of words per word string, and determine the matching feature vector according to the word strings in the current document to be classified.
9. The device according to claim 6, characterized in that the determining unit is configured to determine, according to the feature vector of each category, the probability of occurrence of each word string to be matched in each category and, for each category, to determine the sum of the probabilities of occurrence in the current category of all word strings to be matched of the current document to be classified, using this sum as the similarity corresponding to the current category.
10. The device according to any one of claims 6-9, characterized by further comprising a testing unit, configured to perform:
A1: obtaining multiple test documents and determining the actual category of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the probability of occurrence of each word string to be tested in each category;
A4: for each category, determining the sum of the probabilities of occurrence in the current category of all word strings to be tested of the current test document;
A5: using the category with the largest probability sum as the matched category of the current test document;
A6: determining the classification accuracy corresponding to each category according to the matched category and the actual category of each test document;
A7: judging whether the classification accuracy corresponding to each category is greater than or equal to a preset accuracy threshold; if so, triggering the second acquiring unit; otherwise, performing step A8;
A8: using the multiple test documents as the training documents and triggering the first acquiring unit.
CN201510974508.6A 2015-12-22 2015-12-22 Document classification method and device Pending CN105630931A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510974508.6A CN105630931A (en) 2015-12-22 2015-12-22 Document classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510974508.6A CN105630931A (en) 2015-12-22 2015-12-22 Document classification method and device

Publications (1)

Publication Number Publication Date
CN105630931A true CN105630931A (en) 2016-06-01

Family

ID=56045864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510974508.6A Pending CN105630931A (en) 2015-12-22 2015-12-22 Document classification method and device

Country Status (1)

Country Link
CN (1) CN105630931A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN106649274A (en) * 2016-12-27 2017-05-10 东华互联宜家数据服务有限公司 Text content tag labeling method and device
CN107291896A (en) * 2017-06-21 2017-10-24 北京小度信息科技有限公司 Data-updating method and device
CN107766869A (en) * 2016-08-22 2018-03-06 富士通株式会社 Object classification method and object sorting device
CN107783989A (en) * 2016-08-25 2018-03-09 北京国双科技有限公司 Document belongs to the determination method and apparatus in field
CN108038101A (en) * 2017-12-07 2018-05-15 杭州迪普科技股份有限公司 A kind of recognition methods for distorting text and device
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system
CN111259658A (en) * 2020-02-05 2020-06-09 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
万乐等: "类别特征词权重加权文本分类方法", 《军民两用技术与产品》 *
周新栋等: "基于N元语言模型的文本分类方法", 《计算机应用》 *
李雪蕾等: "一种基于向量空间模型的文本分类方法", 《计算机工程》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN106126734B (en) * 2016-07-04 2019-06-28 北京奇艺世纪科技有限公司 The classification method and device of document
CN107766869A (en) * 2016-08-22 2018-03-06 富士通株式会社 Object classification method and object sorting device
CN107783989A (en) * 2016-08-25 2018-03-09 北京国双科技有限公司 Document belongs to the determination method and apparatus in field
CN106649274A (en) * 2016-12-27 2017-05-10 东华互联宜家数据服务有限公司 Text content tag labeling method and device
CN107291896A (en) * 2017-06-21 2017-10-24 北京小度信息科技有限公司 Data-updating method and device
CN108038101A (en) * 2017-12-07 2018-05-15 杭州迪普科技股份有限公司 A kind of recognition methods for distorting text and device
CN108038101B (en) * 2017-12-07 2021-04-27 杭州迪普科技股份有限公司 Method and device for identifying tampered text
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system
CN111259658A (en) * 2020-02-05 2020-06-09 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN111259658B (en) * 2020-02-05 2022-08-19 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation

Similar Documents

Publication Publication Date Title
CN105630931A (en) Document classification method and device
CN108717406B (en) Text emotion analysis method and device and storage medium
US9477750B2 (en) System and method for real-time dynamic measurement of best-estimate quality levels while reviewing classified or enriched data
CN109299362B (en) Similar enterprise recommendation method and device, computer equipment and storage medium
CN110765770A (en) Automatic contract generation method and device
CN111241389B (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
CN107808011A (en) Classification abstracting method, device, computer equipment and the storage medium of information
CN111125343A (en) Text analysis method and device suitable for human-sentry matching recommendation system
WO2018171295A1 (en) Method and apparatus for tagging article, terminal, and computer readable storage medium
CN110688536A (en) Label prediction method, device, equipment and storage medium
CN112163072A (en) Data processing method and device based on multiple data sources
CN107491536A (en) A kind of examination question method of calibration, examination question calibration equipment and electronic equipment
CN106503266A (en) Document Classification Method and device
CN106445907A (en) Domain lexicon generation method and apparatus
CN114692628A (en) Sample generation method, model training method, text extraction method and text extraction device
US20170103059A1 (en) Method and system for preserving sensitive information in a confidential document
CN112016294B (en) Text-based news importance evaluation method and device and electronic equipment
CN105787004A (en) Text classification method and device
CN111898378B (en) Industry classification method and device for government enterprise clients, electronic equipment and storage medium
CN105095203B (en) Determination, searching method and the server of synonym
KR102299525B1 (en) Product Evolution Mining Method And Apparatus Thereof
CN104462552A (en) Question and answer page core word extracting method and device
CN109409091B (en) Method, device and equipment for detecting Web page and computer storage medium
CN111639250A (en) Enterprise description information acquisition method and device, electronic equipment and storage medium
CN107577667B (en) Entity word processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160601