CN105630931A - Document classification method and device - Google Patents
- Publication number
- CN105630931A (application CN201510974508.6A)
- Authority
- CN
- China
- Prior art keywords
- classification
- document
- current
- word string
- characteristic vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The invention provides a document classification method and device. The method comprises: acquiring a plurality of training documents and determining the category corresponding to each training document; determining a feature vector for each category according to the training documents corresponding to that category, the feature vector comprising the word strings that occur in the corresponding current category and the occurrence probability of each word string in the current category; obtaining a current document to be classified and extracting from it a matching feature vector, the matching feature vector comprising the word strings to be matched that occur in the current document to be classified; determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category; and taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified. The method and device allow documents to be classified more flexibly.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a document classification method and device.
Background
With the continued development of technology, natural language processing has received unprecedented attention, made considerable progress, and evolved into a relatively independent discipline. As concepts and technologies such as Internet+ and big data have become popular, every industry is trying various ways to make full use of the text data in web pages on the network, and natural language processing plays the main role in processing, analyzing, and utilizing this web page text.
In the prior art, web page text data is mainly processed with preset, fixed classification methods, which are difficult to adjust to a user's needs. For example, when the accuracy of the classification results does not meet the user's needs, the user can hardly adjust the classification method to reach the required accuracy. As this shows, the classification methods of the prior art are not flexible enough.
Summary of the invention
The invention provides a document classification method and device that can classify documents more flexibly.
In one aspect, the invention provides a document classification method, comprising:
S1: obtaining a plurality of training documents and determining the category corresponding to each training document;
S2: determining a feature vector for each category according to the training documents corresponding to that category, the feature vector comprising: the word strings that occur in the corresponding current category, and the occurrence probability of each word string in the current category;
S3: obtaining a current document to be classified and extracting from it a matching feature vector, the matching feature vector comprising: the word strings to be matched that occur in the current document to be classified;
S4: determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;
S5: taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
Further, S2 comprises:
processing the training documents corresponding to each category into plain-text documents, and performing word segmentation on the plain-text document corresponding to each training document to obtain the words of each training document;
combining every preset number of adjacent words in each training document into a word string, and determining the occurrence probability of each word string in its corresponding category;
determining the feature vector of each category according to the occurrence probability, in the corresponding category, of each word string in the training documents of that category.
Further, S3 comprises:
processing the current document to be classified into a plain-text document, and performing word segmentation on that plain-text document to obtain the words of the current document to be classified;
combining every preset number of adjacent words in the current document to be classified into a word string, using the same preset number as in S2;
determining the matching feature vector according to the word strings in the current document to be classified.
Further, S4 comprises:
determining, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category;
for each category, determining the sum of the occurrence probabilities, in the current category, of all word strings to be matched of the current document to be classified, and taking that sum as the similarity corresponding to the current category.
Further, after S2 and before S3, the method also comprises:
A1: obtaining a plurality of test documents and determining the actual category of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;
A4: for each category, determining the sum of the occurrence probabilities, in the current category, of all word strings to be tested of the current test document;
A5: taking the category with the largest sum of occurrence probabilities as the matched category of the current test document;
A6: determining the classification accuracy of each category according to the matched category and the actual category of each test document;
A7: judging, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, performing step S3; otherwise, performing step A8;
A8: using the plurality of test documents as training documents and performing step S1.
In another aspect, the invention provides a document classification device, comprising:
a first acquisition unit, configured to obtain a plurality of training documents and determine the category corresponding to each training document;
a training unit, configured to determine a feature vector for each category according to the training documents corresponding to that category, the feature vector comprising: the word strings that occur in the corresponding current category, and the occurrence probability of each word string in the current category;
a second acquisition unit, configured to obtain a current document to be classified and extract from it a matching feature vector, the matching feature vector comprising: the word strings to be matched that occur in the current document to be classified;
a determining unit, configured to determine the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;
a classification unit, configured to take the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
Further, the training unit is configured to process the training documents corresponding to each category into plain-text documents, perform word segmentation on the plain-text document corresponding to each training document to obtain the words of each training document, combine every preset number of adjacent words in each training document into a word string, determine the occurrence probability of each word string in its corresponding category, and determine the feature vector of each category according to the occurrence probability, in the corresponding category, of each word string in the training documents of that category.
Further, the second acquisition unit is configured to process the current document to be classified into a plain-text document, perform word segmentation on that plain-text document to obtain the words of the current document to be classified, combine every preset number of adjacent words in the current document to be classified into a word string, and determine the matching feature vector according to the word strings in the current document to be classified.
Further, the determining unit is configured to determine, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category and, for each category, to determine the sum of the occurrence probabilities, in the current category, of all word strings to be matched of the current document to be classified, taking that sum as the similarity corresponding to the current category.
Further, the device also comprises a test unit, configured to perform:
A1: obtaining a plurality of test documents and determining the actual category of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;
A4: for each category, determining the sum of the occurrence probabilities, in the current category, of all word strings to be tested of the current test document;
A5: taking the category with the largest sum of occurrence probabilities as the matched category of the current test document;
A6: determining the classification accuracy of each category according to the matched category and the actual category of each test document;
A7: judging, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, triggering the second acquisition unit; otherwise, performing step A8;
A8: using the plurality of test documents as training documents and triggering the first acquisition unit.
The document classification method and device provided by the invention train each category with training documents to obtain the feature vector corresponding to each category, determine the similarity between the matching feature vector of the document to be classified and the feature vector of each category, and take the category whose feature vector is most similar to the matching feature vector as the category of the document to be classified. When the classification results do not meet the user's requirements, the feature vectors can be updated by adjusting the training documents so that the results better match the user's needs; documents can thus be classified more flexibly.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a document classification method provided by an embodiment of the invention;
Fig. 2 is a flow chart of another document classification method provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of a document classification device provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of another document classification device provided by an embodiment of the invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
As shown in Fig. 1, an embodiment of the invention provides a document classification method, which may comprise the following steps:
S1: obtaining a plurality of training documents and determining the category corresponding to each training document;
S2: determining a feature vector for each category according to the training documents corresponding to that category, the feature vector comprising: the word strings that occur in the corresponding current category, and the occurrence probability of each word string in the current category;
S3: obtaining a current document to be classified and extracting from it a matching feature vector, the matching feature vector comprising: the word strings to be matched that occur in the current document to be classified;
S4: determining the similarity between the matching feature vector and the feature vector of each category according to the word strings to be matched in the matching feature vector and the occurrence probabilities in the feature vector of each category;
S5: taking the category corresponding to the feature vector with the highest similarity as the category of the current document to be classified.
The document classification method provided by this embodiment trains each category with training documents to obtain the feature vector corresponding to each category, determines the similarity between the matching feature vector of the document to be classified and the feature vector of each category, and takes the category whose feature vector is most similar to the matching feature vector as the category of the document to be classified. When the classification results do not meet the user's requirements, the feature vectors can be updated by adjusting the training documents so that the results better match the user's needs; documents can thus be classified more flexibly.
In one possible implementation, S2 comprises:
processing the training documents corresponding to each category into plain-text documents, and performing word segmentation on the plain-text document corresponding to each training document to obtain the words of each training document;
combining every preset number of adjacent words in each training document into a word string, and determining the occurrence probability of each word string in its corresponding category;
determining the feature vector of each category according to the occurrence probability, in the corresponding category, of each word string in the training documents of that category.
Here a training document may be a web page. To make word strings easier to extract, the training document is preprocessed into a plain-text document; the processing may include extracting the main text content and removing spaces, special symbols, and the like. The plain-text document is then segmented into words. For example, segmenting the phrase "a method of document classification" (in the original Chinese) may yield the words "a kind of", "document", "classification", "of", "method". Word strings are made up of words: when the preset value is 2, each word string consists of 2 adjacent words, so "a kind of" "document" and "document" "classification" can each serve as a word string.
The occurrence probability of each word string in the current category can be calculated as follows: determine the number of occurrences of each word string in the current category and the total number of occurrences of all word strings in the current category, then divide the number of occurrences of the current word string by that total.
For example, suppose category C has two training documents, A and B. Training document A contains word strings A, B, and C, which occur 2, 3, and 4 times respectively; training document B contains word strings A and B, which occur 5 and 7 times respectively. In category C, word string A therefore occurs 2+5=7 times, word string B occurs 3+7=10 times, and word string C occurs 4 times. The total number of occurrences of all word strings in the category is 7+10+4=21, so the occurrence probability of word string A is 7/21.
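The calculation in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `category_feature_vector` and the use of plain dictionaries for per-document counts are hypothetical choices, not something the patent specifies.

```python
from collections import Counter

def category_feature_vector(doc_counts):
    """Merge per-document word-string counts for one category and
    normalise them into occurrence probabilities (hypothetical helper)."""
    totals = Counter()
    for counts in doc_counts:
        totals.update(counts)              # add this document's counts
    grand_total = sum(totals.values())     # all word-string occurrences in the category
    return {s: c / grand_total for s, c in totals.items()}

# Counts taken from the worked example: category C, training documents A and B.
doc_a = {"A": 2, "B": 3, "C": 4}
doc_b = {"A": 5, "B": 7}
vector_c = category_feature_vector([doc_a, doc_b])
print(vector_c["A"])  # 7/21
```

Because every count is divided by the same total, the probabilities within one category sum to 1.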
In one possible implementation, S3 comprises:
processing the current document to be classified into a plain-text document, and performing word segmentation on that plain-text document to obtain the words of the current document to be classified;
combining every preset number of adjacent words in the current document to be classified into a word string;
determining the matching feature vector according to the word strings in the current document to be classified.
In this implementation, word segmentation can use the same method as in step S2, which makes the classification results more accurate. The preset value here is the same as the preset value in step S2; only then can matching word strings be found in the feature vector of each category. The way word strings are formed can also be the same as in step S2.
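Forming word strings from adjacent words is the same operation for training documents and for the document to be classified. A minimal sketch, assuming the words have already been produced by some word segmenter (the name `word_strings` and the tuple representation are illustrative assumptions):

```python
def word_strings(words, n=2):
    """Return every run of n adjacent words as a tuple.
    n plays the role of the preset value and must match the one used in training."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Segmented words from the example phrase in the description.
words = ["a kind of", "document", "classification", "of", "method"]
strings = word_strings(words, n=2)
print(strings[:2])
```

A document with fewer than n words yields no word strings, so it cannot match any category.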
In one possible implementation, S4 comprises:
determining, according to the feature vector of each category, the occurrence probability of each word string to be matched in each category;
for each category, determining the sum of the occurrence probabilities, in the current category, of all word strings to be matched of the current document to be classified, and taking that sum as the similarity corresponding to the current category.
For example, suppose there are two categories, A and B. The feature vector of category A includes word strings A, B, and C, with occurrence probabilities in category A of 0.2, 0.3, and 0.5 respectively. The feature vector of category B includes word strings C and D, with occurrence probabilities in category B of 0.2 and 0.8 respectively. The word strings to be matched of document A to be classified are word strings A, C, and E. Word string A has occurrence probabilities of 0.2 and 0 in categories A and B, word string C has 0.5 and 0.2, and word string E has 0 in both. For category A, the sum of the occurrence probabilities of all word strings to be matched is 0.2 + 0.5 + 0 = 0.7; for category B, the sum is 0 + 0.2 + 0 = 0.2.
The similarity between the matching feature vector and the feature vector of category A is therefore 0.7, and the similarity with the feature vector of category B is 0.2, so document A to be classified belongs to category A.
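The scoring in this example can be sketched as follows. The sketch assumes each feature vector is a dictionary from word string to occurrence probability and that a word string absent from a category contributes 0, as in the example; the function names are hypothetical.

```python
def similarity(strings_to_match, feature_vector):
    """Sum the occurrence probabilities of the word strings to be matched;
    a word string missing from the category contributes 0."""
    return sum(feature_vector.get(s, 0.0) for s in strings_to_match)

def classify(strings_to_match, feature_vectors):
    """Return the category with the highest similarity, plus all scores."""
    scores = {cat: similarity(strings_to_match, fv)
              for cat, fv in feature_vectors.items()}
    return max(scores, key=scores.get), scores

# Values from the worked example.
feature_vectors = {
    "A": {"A": 0.2, "B": 0.3, "C": 0.5},
    "B": {"C": 0.2, "D": 0.8},
}
best, scores = classify(["A", "C", "E"], feature_vectors)
print(best, scores)
```

Note that the source does not say how ties between categories are broken; `max` here simply returns the first category with the top score.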
To meet the user's requirements for classification accuracy, the feature vector of each category can be tested before the documents to be classified are classified. In one possible implementation, after S2 and before S3, the method also comprises:
A1: obtaining a plurality of test documents and determining the actual category of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the feature vector of each category, the occurrence probability of each word string to be tested in each category;
A4: for each category, determining the sum of the occurrence probabilities, in the current category, of all word strings to be tested of the current test document;
A5: taking the category with the largest sum of occurrence probabilities as the matched category of the current test document;
A6: determining the classification accuracy of each category according to the matched category and the actual category of each test document;
A7: judging, for each category, whether its classification accuracy is greater than or equal to a preset accuracy threshold; if so, performing step S3; otherwise, performing step A8;
A8: using the plurality of test documents as training documents and performing step S1.
In this implementation, test documents of a given category can be obtained in order to test that category, and the preset accuracy threshold of each category can be set as needed. In this way the feature vectors are tested with the test documents, and when a category fails to meet the requirement, the test documents can be used as training documents to train that category again.
When calculating the classification accuracy of each category, it is determined for each test document whether its matched category is the same as its actual category; if so, the current test document is classified correctly, otherwise it is classified incorrectly. Let A be the total number of test documents in an actual category and B the number of correctly classified test documents in that actual category; the classification accuracy of the category is then B/A. For example, if 10 test documents have actual category A and, after classification, 8 of them have matched category A, then 8 documents are classified correctly for category A, and the classification accuracy of category A is 8/10 = 0.8. The preset accuracy threshold may be set to 80%.
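The per-category accuracy B/A and the threshold check of step A7 can be sketched as below. The function names and the single shared threshold are assumptions made for illustration; the description allows a separate threshold per category.

```python
from collections import Counter

def per_category_accuracy(actual, matched):
    """For each actual category, divide the number of correctly
    classified test documents (B) by the total number (A)."""
    totals = Counter(actual)
    correct = Counter(a for a, m in zip(actual, matched) if a == m)
    return {cat: correct[cat] / totals[cat] for cat in totals}

def needs_retraining(accuracy, threshold=0.8):
    """True if any category falls below the preset accuracy threshold."""
    return any(acc < threshold for acc in accuracy.values())

# Example from the description: 10 test documents of category A, 8 matched to A.
actual = ["A"] * 10
matched = ["A"] * 8 + ["B"] * 2
accuracy = per_category_accuracy(actual, matched)
print(accuracy, needs_retraining(accuracy))
```

With the threshold at 80%, an accuracy of exactly 0.8 passes, because step A7 tests for greater than or equal to.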
To make the purpose, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and a specific embodiment.
In this embodiment, tax-related web page text from the network needs to be classified. Specifically, it can be divided into six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.
As shown in Fig. 2, an embodiment of the invention provides a document classification method, which may comprise the following steps:
Step 201: obtain a plurality of training documents and determine the category corresponding to each training document.
Specifically, training documents can be collected from authoritative websites such as the State Administration of Taxation website, the websites of provincial and municipal tax bureaus, and authoritative Chinese tax and accounting websites, and each training document is determined to belong to one of the six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, or enterprise tax news.
Step 202: process the training documents corresponding to each category into plain-text documents, and perform word segmentation on the plain-text document corresponding to each training document to obtain the words of each training document.
Specifically, jsoup can be used to process the training documents into plain-text documents.
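The embodiment names jsoup, a Java library, for this step. As a rough stand-in that shows what "processing into plain text" involves, the following Python sketch uses only the standard-library `html.parser` module; the class and function names are hypothetical, and the whitespace handling is a simplifying assumption rather than the patent's procedure.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_plain_text(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

page = ("<html><head><style>p{}</style></head><body>"
        "<p>Tax  notice</p><script>x()</script><p>2015</p></body></html>")
print(html_to_plain_text(page))
```

For Chinese text, where spaces carry no meaning, the description suggests removing spaces entirely; that would replace the final join with an empty-string join.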
Step 203: combine every preset number of adjacent words in each training document into a word string, and determine the occurrence probability of each word string in its corresponding category.
Specifically, the occurrence probability of each word string is determined separately for each of the six major categories: policies and regulations, notices and announcements, policy interpretation, tax news, stock market tax news, and enterprise tax news.
Step 204: determine the feature vector of each category according to the occurrence probability, in the corresponding category, of each word string in the training documents of that category, the feature vector comprising: the word strings that occur in the corresponding current category, and the occurrence probability of each word string in the current category.
Specifically, the feature vector of each category can be represented as follows:
$T_j = \{(w_1 w_2 \ldots w_n, P(w_1 w_2 \ldots w_n))_1, \ldots, (w_1 w_2 \ldots w_n, P(w_1 w_2 \ldots w_n))_m\}$, where $T_j$ is the feature vector of the $j$-th category, $w_n$ is the $n$-th word in the current word string, $w_1 w_2 \ldots w_n$ is a word string, and $P(w_1 w_2 \ldots w_n)$ is the occurrence probability corresponding to word string $w_1 w_2 \ldots w_n$. $(w_1 w_2 \ldots w_n, P(w_1 w_2 \ldots w_n))_m$ denotes the $m$-th word string together with its occurrence probability. For example, the feature vector of the enterprise tax news category may be {("one" "enterprise", 0.2), ("enterprise" "pays taxes", 0.8)}. Here each word string consists of 2 words; word string "one" "enterprise" has an occurrence probability of 0.2 in the enterprise tax news category, and word string "enterprise" "pays taxes" has an occurrence probability of 0.8 in that category.
Step 205: Obtain the current document to be classified, process it into a plain text document, and perform word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified.
Specifically, jsoup can be used to process the current document to be classified into a plain text document.
Specifically, the word segmentation method can be the same as the segmentation method described above.
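The plain-text conversion in step 205 can be sketched as follows. The embodiment names jsoup, a Java library; this sketch swaps in the `html.parser` module from the Python standard library as a stand-in, and the class name, function name, and sample page are hypothetical. Word segmentation is not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def to_plain_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical input page.
page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Enterprise <b>tax</b> notice</p></body></html>")
text = to_plain_text(page)
```

The resulting string would then be passed to the word segmenter before word strings are formed.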
Step 206: Combine each said preset number of adjacent words in the current document to be classified into a word string.
Step 207: Determine the matching characteristic vector according to the word strings in the current document to be classified. The matching characteristic vector includes: the word strings to be matched that occur in the current document to be classified.
Step 208: According to the characteristic vector of each classification, determine the probability of occurrence of each word string to be matched in each classification.
Step 209: For each classification, determine the sum of the probabilities of occurrence, in the current classification, of all word strings to be matched of the current document to be classified, and take the sum corresponding to the current classification as the similarity corresponding to the current classification.
Step 210: Take the classification corresponding to the characteristic vector with the highest similarity as the classification of the current document to be classified.
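Steps 206 to 210 can be sketched in Python as below. This is only an illustration under stated assumptions: the function names are hypothetical, each distinct word string of the document is counted once, and a word string that is absent from a classification's characteristic vector contributes 0 to that classification. The first vector reuses the enterprise tax news example from the text; the second vector is invented for the demonstration.

```python
def word_strings(words, n):
    """Join every run of n adjacent words into one word string (a tuple)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def classify(words, vectors, n):
    """Return (best classification, similarity of every classification).

    words:   segmented document being classified.
    vectors: {classification: {word string: probability of occurrence}}.
    ASSUMPTIONS: distinct word strings are counted once each, and unseen
    word strings contribute a probability of 0.
    """
    matching = set(word_strings(words, n))  # matching characteristic vector
    similarity = {
        label: sum(vector.get(ws, 0.0) for ws in matching)
        for label, vector in vectors.items()
    }
    best = max(similarity, key=similarity.get)
    return best, similarity


vectors = {
    # Values taken from the example in the text.
    "enterprise tax news": {("one", "enterprise"): 0.2,
                            ("enterprise", "pays taxes"): 0.8},
    # Hypothetical second classification, for illustration only.
    "policy interpretation": {("policy", "explains"): 0.7,
                              ("one", "enterprise"): 0.3},
}
best, scores = classify(["one", "enterprise", "pays taxes"], vectors, n=2)
```

If two classifications have the same similarity, `max` returns whichever comes first in the dictionary; the text does not say how ties should be resolved.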
The preset value can be adjusted as needed; adjusting it tunes the accuracy of classification.
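To show what changing the preset value does to the word strings, here is a small Python sketch with a hypothetical helper and toy input. It only illustrates how the word strings change; it makes no claim about which value gives the best accuracy.

```python
def word_strings(words, n):
    """Join every run of n adjacent words into one space-separated word string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical segmented sentence.
words = ["one", "enterprise", "pays", "taxes"]
by_preset_value = {n: word_strings(words, n) for n in (1, 2, 3)}
```

A larger preset value yields fewer and more specific word strings from the same text, so the choice is a trade-off that would need to be checked against test documents.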
The above embodiment can be implemented in Java by integrating open-source natural language processing tools such as OpenNLP, FudanNLP, LingPipe, IKAnalyzer, and word2vec.
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a document classification device. The device embodiment can be implemented in software, in hardware, or in a combination of software and hardware. At the hardware level, Fig. 3 is a hardware structure diagram of the equipment in which the document classification device provided by the embodiment of the present invention resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Fig. 3, the equipment in which the device resides may generally also include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed by the CPU of the equipment in which it resides reading the corresponding computer program instructions from the non-volatile memory into memory and running them. The document classification device provided by this embodiment includes:
A first acquiring unit 401, configured to obtain multiple training documents and determine the classification corresponding to each training document;
A training unit 402, configured to determine the characteristic vector of each classification according to the training documents corresponding to that classification, the characteristic vector including: the word strings that occur in the corresponding current classification, and the probability of occurrence of each word string in the current classification;
A second acquiring unit 403, configured to obtain the current document to be classified and extract from it the matching characteristic vector of the current document to be classified, the matching characteristic vector including: the word strings to be matched that occur in the current document to be classified;
A determining unit 404, configured to determine the similarity between the matching characteristic vector and the characteristic vector of each classification according to the word strings to be matched in the matching characteristic vector and the probabilities of occurrence in the characteristic vector of each classification;
A classification unit 405, configured to take the classification corresponding to the characteristic vector with the highest similarity as the classification of the current document to be classified.
In one possible implementation, the training unit 402 is configured to process the training documents corresponding to each classification into plain text documents; perform word segmentation on the plain text document corresponding to each training document to obtain the multiple words corresponding to each training document; combine each preset number of adjacent words in each training document into a word string; determine the probability of occurrence of each word string in its corresponding classification; and determine the characteristic vector of each classification according to the probability of occurrence, in the corresponding classification, of each word string in the training documents of that classification.
In one possible implementation, the second acquiring unit 403 is configured to process the current document to be classified into a plain text document; perform word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified; combine each said preset number of adjacent words in the current document to be classified into a word string; and determine the matching characteristic vector according to the word strings in the current document to be classified.
In one possible implementation, the determining unit 404 is configured to determine, according to the characteristic vector of each classification, the probability of occurrence of each word string to be matched in each classification; and, for each classification, to determine the sum of the probabilities of occurrence, in the current classification, of all word strings to be matched of the current document to be classified, and take the sum corresponding to the current classification as the similarity corresponding to the current classification.
In one possible implementation, the device further includes a testing unit, configured to perform:
A1: obtain multiple test documents and determine the actual classification of each test document;
A2: obtain the word strings to be tested from each test document;
A3: according to the characteristic vector of each classification, determine the probability of occurrence of each word string to be tested in each classification;
A4: for each classification, determine the sum of the probabilities of occurrence, in the current classification, of all word strings to be tested of the current test document;
A5: take the classification with the largest sum of probabilities of occurrence as the matching classification corresponding to the current test document;
A6: according to the matching classification and the actual classification of each test document, determine the classification accuracy corresponding to each classification;
A7: judge whether the classification accuracy corresponding to each classification is greater than or equal to a preset accuracy threshold; if so, trigger the second acquiring unit; otherwise, perform step A8;
A8: use the multiple test documents as the training documents and trigger the first acquiring unit.
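Steps A6 and A7 can be sketched in Python as below. The text does not define how the classification accuracy of a classification is computed, so this sketch assumes it is the fraction of test documents whose actual classification is that classification and whose matching classification agrees with it. It also assumes that retraining (step A8) is triggered when any classification falls below the threshold. All names, the sample labels, and the threshold of 0.8 are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(actual, matched):
    """A6: accuracy per classification from parallel lists of labels.

    ASSUMPTION: accuracy of a classification = correctly matched test
    documents of that classification / all test documents of that classification.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for a, m in zip(actual, matched):
        total[a] += 1
        if a == m:
            correct[a] += 1
    return {label: correct[label] / total[label] for label in total}


def needs_retraining(accuracy, threshold):
    """A7: True when any classification is below the preset accuracy threshold."""
    return any(acc < threshold for acc in accuracy.values())


# Hypothetical test results for four test documents.
actual = ["tax news", "tax news", "policy interpretation", "policy interpretation"]
matched = ["tax news", "policy interpretation",
           "policy interpretation", "policy interpretation"]
accuracy = per_class_accuracy(actual, matched)
retrain = needs_retraining(accuracy, threshold=0.8)
```

In this toy run the tax news classification is below the threshold, so under the stated assumption the test documents would be fed back as training documents.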
The information exchange between the units of the above device, their execution process, and similar details are based on the same concept as the method embodiment of the present invention. For specifics, refer to the description in the method embodiment; they are not repeated here.
The document classification method and device provided by the embodiments of the present invention have at least the following beneficial effects:
1. In the document classification method and device provided by the embodiments of the present invention, each classification is trained with training documents to obtain the characteristic vector corresponding to each classification; the similarity between the matching characteristic vector of the document to be classified and the characteristic vector of each classification is determined; and the classification corresponding to the characteristic vector most similar to the matching characteristic vector is taken as the classification of the document to be classified. When the classification results do not meet the user's requirements, the characteristic vectors can be updated by adjusting the training documents, so that the classification results better match the user's needs and documents can be classified more flexibly.
2. In the document classification method and device provided by the embodiments of the present invention, test documents of a given classification can be obtained to test that classification, and the preset accuracy threshold of each classification can be set as needed. In this implementation, the characteristic vectors are tested with the test documents; when a classification does not meet the requirement, the test documents are used as training documents to train that classification again, thereby improving classification accuracy.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the word "including" does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiment can be carried out by hardware controlled by program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiment. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be understood that the foregoing describes only preferred embodiments of the present invention. It serves merely to illustrate the technical solution of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (10)
1. the method for a document classification, it is characterised in that including:
S1: obtain multiple Training document, it is determined that the classification that each Training document is corresponding;
S2: according to the Training document that each classification is corresponding, it is determined that the characteristic vector of each classification, described characteristic vector includes: the word string occurred in corresponding current class, and each word string occurs in the probability of occurrence of current class;
S3: obtain current document to be sorted, from current document to be sorted, extracts the matching characteristic vector of current document to be sorted, and described matching characteristic vector includes: the word string to be matched occurred in current document to be sorted;
S4: according to the probability of occurrence in the characteristic vector of the word string to be matched in described matching characteristic vector and each classification, it is determined that described matching characteristic is vectorial and the similarity of the characteristic vector of each classification;
S5: using the classification corresponding for characteristic vector the highest for the similarity classification as described current document to be sorted.
2. The method according to claim 1, characterized in that S2 comprises:
processing the training documents corresponding to each classification into plain text documents, and performing word segmentation on the plain text document corresponding to each training document to obtain the multiple words corresponding to each training document;
combining each preset number of adjacent words in each training document into a word string, and determining the probability of occurrence of each word string in its corresponding classification;
determining the characteristic vector of each classification according to the probability of occurrence, in the corresponding classification, of each word string in the training documents of that classification.
3. The method according to claim 2, characterized in that S3 comprises:
processing the current document to be classified into a plain text document, and performing word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified;
combining each said preset number of adjacent words in the current document to be classified into a word string;
determining the matching characteristic vector according to the word strings in the current document to be classified.
4. The method according to claim 1, characterized in that S4 comprises:
determining, according to the characteristic vector of each classification, the probability of occurrence of each word string to be matched in each classification;
for each classification, determining the sum of the probabilities of occurrence, in the current classification, of all word strings to be matched of the current document to be classified, and taking the sum corresponding to the current classification as the similarity corresponding to the current classification.
5. The method according to any one of claims 1-4, characterized in that, after S2 and before S3, the method further comprises:
A1: obtaining multiple test documents and determining the actual classification of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the characteristic vector of each classification, the probability of occurrence of each word string to be tested in each classification;
A4: for each classification, determining the sum of the probabilities of occurrence, in the current classification, of all word strings to be tested of the current test document;
A5: taking the classification with the largest sum of probabilities of occurrence as the matching classification corresponding to the current test document;
A6: determining the classification accuracy corresponding to each classification according to the matching classification and the actual classification of each test document;
A7: judging whether the classification accuracy corresponding to each classification is greater than or equal to a preset accuracy threshold; if so, performing step S3; otherwise, performing step A8;
A8: using the multiple test documents as the training documents and performing step S1.
6. A document classification device, characterized by comprising:
a first acquiring unit, configured to obtain multiple training documents and determine the classification corresponding to each training document;
a training unit, configured to determine the characteristic vector of each classification according to the training documents corresponding to that classification, the characteristic vector including: the word strings that occur in the corresponding current classification, and the probability of occurrence of each word string in the current classification;
a second acquiring unit, configured to obtain the current document to be classified and extract from it the matching characteristic vector of the current document to be classified, the matching characteristic vector including: the word strings to be matched that occur in the current document to be classified;
a determining unit, configured to determine the similarity between the matching characteristic vector and the characteristic vector of each classification according to the word strings to be matched in the matching characteristic vector and the probabilities of occurrence in the characteristic vector of each classification;
a classification unit, configured to take the classification corresponding to the characteristic vector with the highest similarity as the classification of the current document to be classified.
7. The device according to claim 6, characterized in that the training unit is configured to process the training documents corresponding to each classification into plain text documents; perform word segmentation on the plain text document corresponding to each training document to obtain the multiple words corresponding to each training document; combine each preset number of adjacent words in each training document into a word string; determine the probability of occurrence of each word string in its corresponding classification; and determine the characteristic vector of each classification according to the probability of occurrence, in the corresponding classification, of each word string in the training documents of that classification.
8. The device according to claim 7, characterized in that the second acquiring unit is configured to process the current document to be classified into a plain text document; perform word segmentation on that plain text document to obtain the multiple words corresponding to the current document to be classified; combine each said preset number of adjacent words in the current document to be classified into a word string; and determine the matching characteristic vector according to the word strings in the current document to be classified.
9. The device according to claim 6, characterized in that the determining unit is configured to determine, according to the characteristic vector of each classification, the probability of occurrence of each word string to be matched in each classification; and, for each classification, to determine the sum of the probabilities of occurrence, in the current classification, of all word strings to be matched of the current document to be classified, and take the sum corresponding to the current classification as the similarity corresponding to the current classification.
10. The device according to any one of claims 6-9, characterized by further comprising a testing unit, configured to perform:
A1: obtaining multiple test documents and determining the actual classification of each test document;
A2: obtaining the word strings to be tested from each test document;
A3: determining, according to the characteristic vector of each classification, the probability of occurrence of each word string to be tested in each classification;
A4: for each classification, determining the sum of the probabilities of occurrence, in the current classification, of all word strings to be tested of the current test document;
A5: taking the classification with the largest sum of probabilities of occurrence as the matching classification corresponding to the current test document;
A6: determining the classification accuracy corresponding to each classification according to the matching classification and the actual classification of each test document;
A7: judging whether the classification accuracy corresponding to each classification is greater than or equal to a preset accuracy threshold; if so, triggering the second acquiring unit; otherwise, performing step A8;
A8: using the multiple test documents as the training documents and triggering the first acquiring unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510974508.6A CN105630931A (en) | 2015-12-22 | 2015-12-22 | Document classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105630931A (en) | 2016-06-01 |
Family
ID=56045864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510974508.6A Pending CN105630931A (en) | 2015-12-22 | 2015-12-22 | Document classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105630931A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727500A (en) * | 2010-01-15 | 2010-06-09 | 清华大学 | Text classification method of Chinese web page based on steam clustering |
CN104142998A (en) * | 2014-08-01 | 2014-11-12 | 中国传媒大学 | Text classification method |
Non-Patent Citations (3)
Title |
---|
万乐等: "类别特征词权重加权文本分类方法", 《军民两用技术与产品》 * |
周新栋等: "基于N元语言模型的文本分类方法", 《计算机应用》 * |
李雪蕾等: "一种基于向量空间模型的文本分类方法", 《计算机工程》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095845A (en) * | 2016-06-02 | 2016-11-09 | 腾讯科技(深圳)有限公司 | File classification method and device |
CN106126734A (en) * | 2016-07-04 | 2016-11-16 | 北京奇艺世纪科技有限公司 | The sorting technique of document and device |
CN106126734B (en) * | 2016-07-04 | 2019-06-28 | 北京奇艺世纪科技有限公司 | The classification method and device of document |
CN107766869A (en) * | 2016-08-22 | 2018-03-06 | 富士通株式会社 | Object classification method and object sorting device |
CN107783989A (en) * | 2016-08-25 | 2018-03-09 | 北京国双科技有限公司 | Document belongs to the determination method and apparatus in field |
CN106649274A (en) * | 2016-12-27 | 2017-05-10 | 东华互联宜家数据服务有限公司 | Text content tag labeling method and device |
CN107291896A (en) * | 2017-06-21 | 2017-10-24 | 北京小度信息科技有限公司 | Data-updating method and device |
CN108038101A (en) * | 2017-12-07 | 2018-05-15 | 杭州迪普科技股份有限公司 | A kind of recognition methods for distorting text and device |
CN108038101B (en) * | 2017-12-07 | 2021-04-27 | 杭州迪普科技股份有限公司 | Method and device for identifying tampered text |
CN108763477A (en) * | 2018-05-29 | 2018-11-06 | 厦门快商通信息技术有限公司 | A kind of short text classification method and system |
CN111259658A (en) * | 2020-02-05 | 2020-06-09 | 中国科学院计算技术研究所 | General text classification method and system based on category dense vector representation |
CN111259658B (en) * | 2020-02-05 | 2022-08-19 | 中国科学院计算技术研究所 | General text classification method and system based on category dense vector representation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105630931A (en) | Document classification method and device | |
CN108717406B (en) | Text emotion analysis method and device and storage medium | |
US9477750B2 (en) | System and method for real-time dynamic measurement of best-estimate quality levels while reviewing classified or enriched data | |
CN109299362B (en) | Similar enterprise recommendation method and device, computer equipment and storage medium | |
CN110765770A (en) | Automatic contract generation method and device | |
CN111241389B (en) | Sensitive word filtering method and device based on matrix, electronic equipment and storage medium | |
CN107808011A (en) | Classification abstracting method, device, computer equipment and the storage medium of information | |
CN111125343A (en) | Text analysis method and device suitable for human-sentry matching recommendation system | |
WO2018171295A1 (en) | Method and apparatus for tagging article, terminal, and computer readable storage medium | |
CN110688536A (en) | Label prediction method, device, equipment and storage medium | |
CN112163072A (en) | Data processing method and device based on multiple data sources | |
CN107491536A (en) | A kind of examination question method of calibration, examination question calibration equipment and electronic equipment | |
CN106503266A (en) | Document Classification Method and device | |
CN106445907A (en) | Domain lexicon generation method and apparatus | |
CN114692628A (en) | Sample generation method, model training method, text extraction method and text extraction device | |
US20170103059A1 (en) | Method and system for preserving sensitive information in a confidential document | |
CN112016294B (en) | Text-based news importance evaluation method and device and electronic equipment | |
CN105787004A (en) | Text classification method and device | |
CN111898378B (en) | Industry classification method and device for government enterprise clients, electronic equipment and storage medium | |
CN105095203B (en) | Determination, searching method and the server of synonym | |
KR102299525B1 (en) | Product Evolution Mining Method And Apparatus Thereof | |
CN104462552A (en) | Question and answer page core word extracting method and device | |
CN109409091B (en) | Method, device and equipment for detecting Web page and computer storage medium | |
CN111639250A (en) | Enterprise description information acquisition method and device, electronic equipment and storage medium | |
CN107577667B (en) | Entity word processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160601 |