CN110046264A - Automatic classification method for mobile-phone documents - Google Patents
- Publication number: CN110046264A · Application number: CN201910260996.2A
- Authority: CN (China)
- Prior art keywords: document, class library, text, label, image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; clustering/classification
- G06F16/45 — Information retrieval of multimedia data (e.g. slideshows comprising image and additional audio data); clustering/classification
- G06F16/55 — Information retrieval of still image data; clustering/classification
Abstract
The invention discloses an automatic classification method for mobile-phone documents. The method first constructs document class libraries and divides them into a training set and a test set. Text content and image content are extracted from the training set, and a corpus class library and an image class library are built in correspondence with the labels of the original document class libraries. After data preprocessing, a text prediction label vector and an image prediction label vector are obtained by deep learning from the corpus class library and the image class library respectively. Finally, the two label vectors are combined with a weighted-fusion formula and normalized to yield a document prediction label probability, which is compared with a preset threshold to complete automatic classification of the document. Because both image content and text content serve as classification indicators, the invention achieves fast and effective classification of unstructured documents.
Description
Technical field
The present invention relates to the field of document management, and more particularly to an automatic classification method for mobile-phone documents.
Background art
With the development of the internet, digital office work has grown continuously, but some problems have gradually been exposed in the process. The most obvious is that digital office work produces large numbers of files, and people's natural inertia lets documents pile up, so personal document collections fall into disorder, reducing both office efficiency and the office experience. A survey by the State Archives Administration shows that nearly 80% of central government organs and central state-owned enterprises use office-automation or e-government systems, generating nearly 200 million electronic documents of all kinds. It is easy to foresee that, in the near future, electronic documents will become the main carrier and form of expression of the information resources of governments, enterprises and institutions. To address the heterogeneous, chaotically classified documents on mobile phones, this work is dedicated to automating document management: building an automatic document classification system that gives users a clear view of the files on their phones and makes the documents convenient to classify and retrieve. Such a system not only manages files effectively but, crucially, classifies documents automatically and intelligently, restoring order to large, heterogeneous local document collections.
So far, classification of unstructured documents (Word/PDF/PPT) has been confined to the text in the document, and most research has focused on natural language processing (NLP). The images in a document are often ignored, yet images are one of the main sources of information for humans and may carry important information about the document; they cannot be neglected. In image-heavy unstructured documents, image content is also an important factor in classification. Existing office software focuses on processing text, tables and the like, but software that automatically sorts large volumes of documents is nearly absent from the market, and existing document classification methods still have shortcomings worth studying and improving.
Summary of the invention
To solve the above technical problems, the present invention provides an automatic classification method for mobile-phone documents.
To solve the above technical problems, the technical scheme adopted by the invention is an automatic classification method for mobile-phone documents, comprising:
S1: collect and organize multiple labels most commonly used for document classification as keywords for building document class libraries, and construct multiple document class libraries under the rule that one label corresponds to one document class library. The document class libraries comprise several libraries whose labels are common words plus one library whose label is "unclassified". Each document class library is divided into a training set and a test set;
S2: extract the text content and image content from the training set of the document class libraries and, according to each document class library and its corresponding label, build a corresponding corpus class library and image class library, each divided into a training set and a test set;
S3: perform data preprocessing on the text content in the training set of the corpus class library, build a dictionary, and obtain a text prediction label vector by building a text classification model; perform data preprocessing on the image content in the training set of the image class library and obtain an image prediction label vector by building an image classification model;
S4: combine the text prediction label vector and the image prediction label vector by weighted fusion into a document prediction label vector, and normalize it to obtain a document prediction label probability;
S5: compare the document prediction label probability with a preset threshold. When the probability is greater than or equal to the threshold, the document is placed into the document class library of the common classification word corresponding to the predicted label; when it is less than the threshold, the document is placed into the class library whose label is "unclassified".
Preferably, step S1 also covers the case where one document appears in multiple document class libraries: assuming the document to be classified is Xi, Yi denotes the set of document class libraries corresponding to Xi, and J is the number of all possible document class libraries.
Preferably, in step S2 the text in the image content of each class library is recognized by OCR and added to the corresponding corpus class library as text content.
Preferably, step S3 specifically includes:
S31: segment the text content into words using a Chinese word-segmentation algorithm;
S32: remove stop words and low-frequency words from the segmentation result of step S31; specifically, reject the stop words of a common stop-word list from the segmentation result, set a minimum word frequency according to the document's text length, and filter out the low-frequency words whose frequency is below that minimum;
S33: use the Word2vec toolkit to map the text content remaining after step S32 into word-vector form;
S34: perform further feature extraction with a convolutional neural network, in which a convolutional layer extracts preliminary features from the word vectors of step S33, a pooling layer turns the extracted preliminary features into feature vectors, a fully connected layer joins all the feature vectors, and an added output layer with a sigmoid activation function computes the probability of each label and finally outputs the text prediction label vector.
Preferably, step S3 also specifically includes:
S35: rotate, scale, crop and normalize the image content;
S36: apply preliminary convolutional feature extraction to the image content processed in step S35, feed the extracted preliminary features into a pooling layer to generate feature vectors, join all the feature vectors in a fully connected layer, add an output layer with a sigmoid activation function, compute the probability of each label, and finally output the image prediction label vector.
Preferably, the text classification model measures its performance with a cross-entropy formula, and the image classification model assesses the loss during learning with the mean squared error.
In contrast to the prior art, the beneficial effects of the present invention are:
1. Unstructured documents can be classified quickly and effectively.
2. Machine-learning methods are used to build a text classification model and an image classification model; text content and image content are extracted from the full document, and corresponding corpus and image class libraries are established. Training on large amounts of data lets documents be classified automatically by machine, saving manpower and material resources and thereby improving work efficiency.
3. The classification results of the corpus class library and the image class library are used together as indicators of the document's class, which makes the classification results more accurate and widens the range of applicable document contents and formats.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the automatic classification method for mobile-phone documents according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of the specific procedure of step S3 of the method shown in Fig. 1.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, the method of the present invention comprises S1: collect and organize the multiple labels most commonly used for document classification as keywords for building document class libraries, construct multiple document class libraries under the rule that one label corresponds to one library — several libraries with common words as labels plus one library labeled "unclassified" — and divide each document class library into a training set and a test set.
When collecting the labels of the document class libraries, a crawler can be used to harvest candidate label data, or words highly correlated with document classification in search engines can be combined to choose the label of each library. The class libraries themselves can be built with a crawled or open-source document collection, or by manual collection. In this embodiment, N+1 document class libraries are built in total: N libraries with common words as labels, plus one library labeled "unclassified" that holds documents whose labels do not belong to the common words used for classification. In the initial state the unclassified library contains no documents; in subsequent steps it is used only for the classification result of step S5 and is otherwise not involved.
In this embodiment, step S1 also covers the case where one document appears in multiple document class libraries: assuming the document to be classified is Xi, Yi denotes the set of document class libraries corresponding to the labels of Xi, and J is the number of document class libraries corresponding to all possible labels.
S2: extract the text content and image content from the training set of the document class libraries, build corresponding corpus and image class libraries according to each document class library and its label, and divide the corpus and image class libraries into an 80% training set and a 20% test set.
The image content of the image class library is extracted and saved in files, and the text inside the images is recognized by OCR and added to the corresponding corpus class library as text content. In this embodiment, the text in the image content of each class library is added to the corpus library after OCR recognition. For example, the Baidu OCR API can be used; it offers high-accuracy general text recognition, table-text recognition and QR-code recognition, and can extract common characters, rare characters, table text, certificate text and so on from pictures.
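As an illustration of this OCR step, the sketch below (not from the patent — the function and variable names are assumptions) injects the OCR engine as a callable, so the web API described above or any local engine can be plugged in without changing the pipeline:

```python
from typing import Callable, Dict, Iterable, List

def merge_ocr_text(corpus: Dict[str, List[str]],
                   images_by_label: Dict[str, Iterable[bytes]],
                   ocr: Callable[[bytes], str]) -> Dict[str, List[str]]:
    """Run OCR over every image of a class library and append the recognized
    text to the corpus class library under the same label (step S2)."""
    for label, images in images_by_label.items():
        texts = corpus.setdefault(label, [])
        for img in images:
            recognized = ocr(img).strip()
            if recognized:                      # skip images containing no text
                texts.append(recognized)
    return corpus
```

Because the engine is a parameter, the same function works in tests with a stub and in production with a real OCR backend.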
S3: perform data preprocessing on the text content in the training set of the corpus class library, build a dictionary, and obtain the text prediction label vector by building a text classification model; perform data preprocessing on the image content in the training set of the image class library and obtain the image prediction label vector by building an image classification model.
As shown in Fig. 2, step S3 specifically includes S31: segment the text content into words using a Chinese word-segmentation algorithm.
English uses spaces as natural separators; Chinese, because of the particularity of the language, has no separators other than punctuation, which affects subsequent processing, so word segmentation is the foundation of Chinese natural language processing. Chinese word-segmentation algorithms are by now mature, so existing algorithms or open-source tools such as jieba, SnowNLP or THULAC can be used directly to segment the text in the corpus class library.
S32: remove stop words and low-frequency words from the segmentation result of step S31; specifically, reject the stop words of a common stop-word list from the segmentation result, set a minimum word frequency according to the document's text length, and filter out the low-frequency words whose frequency is below that minimum.
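Step S32 can be sketched in a few lines of plain Python (an illustration, not the patent's implementation; the stop-word list and minimum frequency are inputs chosen by the user):

```python
from collections import Counter
from typing import Iterable, List, Set

def filter_tokens(docs: List[List[str]],
                  stopwords: Set[str],
                  min_freq: int = 2) -> List[List[str]]:
    """Drop stop words, then drop tokens whose corpus-wide frequency is
    below min_freq (the 'minimum word frequency' of step S32)."""
    kept = [[t for t in doc if t not in stopwords] for doc in docs]
    freq = Counter(t for doc in kept for t in doc)
    return [[t for t in doc if freq[t] >= min_freq] for doc in kept]
```

Note that the frequency count is taken over the whole corpus after stop-word removal, so rare words are filtered consistently across documents.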
S33: use the Word2vec toolkit to map the text content remaining after step S32 into word-vector form.
S34: perform further feature extraction with a convolutional neural network, in which a convolutional layer extracts preliminary features from the word vectors of step S33, a pooling layer turns the extracted preliminary features into feature vectors, a fully connected layer joins all the feature vectors, and an added output layer with a sigmoid activation function computes the probability of each label and finally outputs the text prediction label vector.
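Steps S33 and S34 assume each text becomes a fixed-size matrix of word vectors. A minimal sketch of that mapping, assuming a pre-trained embedding matrix (in practice the Word2vec toolkit, e.g. gensim's `Word2Vec`, would train it); all names here are hypothetical:

```python
import numpy as np

def embed_document(tokens, vocab, emb, seq_len):
    """Map a token list to a fixed seq_len x dim matrix (the A x B matrix
    described below): look up each known token's vector, truncate to
    seq_len, and zero-pad the remainder."""
    dim = emb.shape[1]
    rows = [emb[vocab[t]] for t in tokens[:seq_len] if t in vocab]
    mat = np.zeros((seq_len, dim))
    if rows:
        mat[:len(rows)] = np.stack(rows)
    return mat
```

Fixing the sequence length in this way is what allows texts of different lengths to be processed in batches by the convolutional network.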
Specifically, an A×B matrix is constructed, where A is the number of words and B is the word-vector dimension. To process vectors in batches, every text is fixed to length A. A convolution is then applied to each text with a filter W ∈ R^(h×B) of size h×B, where h is the n-gram length; the convolution target is c_i = f(W·x_{i:i+h−1} + d), where d is an offset (bias) and f is a nonlinear activation function. During convolution, one filter produces a group of features {c_1, c_2, …, c_{N−h+1}} over the N−h+1 windows. This group of features is fed into the pooling layer, which generates the feature ĉ = max{c_1, c_2, …, c_{N−h+1}}, realizing the goal of extracting a single feature from each group. All feature vectors are then joined in the fully connected layer, an output layer with a sigmoid activation function is added, the probability of each label is computed, and the text prediction label vector is finally output.
To assess the performance of the text classification model, an output layer is added on top of the original one and measured with the cross-entropy formula H(p, q) = −Σ_x p(x)·log q(x), where p(x), which can only take the value 0 or 1, indicates whether class x is the correct class, and q(x) ∈ (0, 1) is the predicted probability that x is the correct class.
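The convolution, max-pooling, sigmoid output and cross-entropy described above can be sketched in NumPy. This is a single-filter toy version under assumed shapes, not the patent's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def textcnn_forward(X, W, d, V, b):
    """One-filter sketch of step S34.
    X: (N, B) word-vector matrix, W: (h, B) filter, d: scalar offset,
    V: (L,) output weights for the pooled feature, b: (L,) output biases.
    Returns per-label probabilities via a sigmoid output layer."""
    h, N = W.shape[0], X.shape[0]
    # convolution over N-h+1 windows: c_i = relu(<W, x_{i:i+h-1}> + d)
    c = np.array([np.maximum(0.0, np.sum(W * X[i:i + h]) + d)
                  for i in range(N - h + 1)])
    c_hat = c.max()                      # max-pooling: one feature per filter
    return sigmoid(V * c_hat + b)        # sigmoid output layer (multi-label)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), with p in {0,1} and q in (0,1)."""
    q = np.clip(q, eps, 1.0 - eps)
    return -np.sum(p * np.log(q))
```

A real model would use many filters of several widths h and learn W, d, V and b by gradient descent; the forward pass above only shows the data flow the text describes.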
S35: rotate, scale, crop and normalize the image content;
S36: apply preliminary convolutional feature extraction to the image content processed in step S35, feed the extracted preliminary features into the pooling layer to generate feature vectors, join all the feature vectors in the fully connected layer, add an output layer with a sigmoid activation function, compute the probability of each label, and finally output the image prediction label vector.
The concrete processing in this step is identical to that of step S34; the final output is the image prediction label vector.
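The preprocessing of step S35 can be illustrated with a small NumPy sketch (an assumption-laden toy: rotation is restricted to 90-degree steps and scaling is omitted; the names are illustrative):

```python
import numpy as np

def preprocess_image(img, size, rotations=0):
    """Step S35 sketch: rotate by 90-degree steps, center-crop to
    size x size, and normalize pixel values into [0, 1]."""
    img = np.rot90(img, rotations)
    H, W = img.shape[:2]
    top, left = (H - size) // 2, (W - size) // 2
    crop = img[top:top + size, left:left + size]
    return crop.astype(np.float64) / 255.0
```

In practice an image library would handle arbitrary rotation angles and rescaling, but the crop-then-normalize pattern is the same.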
To measure the performance of the image classification model, an output layer is added to the model and the loss during learning is assessed with the mean squared error; the cost function is e = (1/n)·Σ_i (O_cnn,i − O_real,i)², where O_cnn denotes the labels predicted by the image classification model for the data set and O_real denotes the data set's true labels. The smaller e is, the better the model's predictive performance.
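The cost function above is the standard mean squared error; as a one-liner (function name assumed):

```python
import numpy as np

def mse_loss(o_cnn, o_real):
    """e = (1/n) * sum_i (O_cnn,i - O_real,i)^2 — smaller e, better fit."""
    return float(np.mean((np.asarray(o_cnn) - np.asarray(o_real)) ** 2))
```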
S4: combine the text prediction label vector and the image prediction label vector by weighted fusion into the document prediction label vector, and normalize it to obtain the document prediction label probability.
S5: compare the document prediction label probability with the preset threshold. When the probability is greater than or equal to the threshold, the document is placed into the document class library of the common classification word corresponding to the predicted label; when it is less than the threshold, the document is placed into the class library whose label is "unclassified".
Specifically, the text prediction label vector P_text and the image prediction label vector P_img are fused by the weighted formula P = a·P_text + b·P_img, where a is the text-feature similarity weight and b is the image-feature similarity weight. The sigmoid function is then used for numerical processing, normalizing the outputs of the multiple classifiers into the final document prediction label probability P_j. This is equivalent to adding one layer of logistic-regression (LR) classifier on top of the two original models by weighted averaging, completing the fusion of the text classification model and the image classification model. Finally, when the document prediction label probability P_j (1 ≤ j ≤ N) is greater than or equal to the threshold, the document is placed into the document class library of the common classification word corresponding to the predicted label.
Further, the threshold must be neither too high nor too low: too high, and a document cannot be placed into several highly relevant classes at once; too low, and correct classification is hindered and loses its meaning. To choose the threshold, the text test set and the image test set are first divided into several equal parts and cross-validation is used to validate the models and retain the best document classification model. Hamming loss is used here to measure the accuracy of the document classification model; it expresses the proportion of wrongly predicted entries over all labels, so the smaller the value, the stronger the network's classification ability. The calculation formula is HammingLoss = 1/(|D|·|L|) · Σ_i Σ_j xor(x_ij, y_ij), where |D| is the total number of samples, |L| the total number of labels, x_i and y_i the prediction result and the true value respectively, and xor the exclusive-or operation. It is stipulated that during this process the weights a and b are fixed values satisfying a + b = 1; the threshold is obtained by repeated tests on the document class library test set.
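The Hamming loss formula above is a direct average of per-entry mismatches, e.g.:

```python
import numpy as np

def hamming_loss(X, Y):
    """HL = 1/(|D|*|L|) * sum_i sum_j xor(x_ij, y_ij); smaller is better.
    X, Y: |D| x |L| binary matrices of predicted / true label indicators."""
    X, Y = np.asarray(X, bool), np.asarray(Y, bool)
    return float(np.mean(np.logical_xor(X, Y)))
```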
To obtain the text-feature similarity weight a and the image-feature similarity weight b, precision and recall are introduced. Precision means, for a given test data set, the proportion of correctly retrieved relevant documents among the documents actually retrieved; recall means, for a given test data set, the proportion of correctly retrieved relevant documents among all the relevant documents in the document class library. In the multi-label setting the formulas become Precision = (1/|D|)·Σ_i |x_i ∩ y_i| / |x_i| and Recall = (1/|D|)·Σ_i |x_i ∩ y_i| / |y_i|, where |D| is the total number of samples and x_i and y_i are the predicted and true label sets. With the threshold fixed at its optimal value, a and b are swept over the interval [0, 1] in steps of 0.01 under the constraint a + b = 1; repeated tests on the document class library test set, weighing precision and recall together, yield the text-feature similarity weight a and the image-feature similarity weight b.
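These example-based multi-label precision and recall formulas can be sketched with label sets (illustrative code; the function name is an assumption):

```python
from typing import List, Set, Tuple

def multilabel_pr(pred: List[Set[int]],
                  true: List[Set[int]]) -> Tuple[float, float]:
    """P = (1/|D|) * sum |x_i ∩ y_i| / |x_i|,
       R = (1/|D|) * sum |x_i ∩ y_i| / |y_i|;
    empty prediction / truth sets contribute 0 to the respective sum."""
    D = len(pred)
    P = sum(len(x & y) / len(x) for x, y in zip(pred, true) if x) / D
    R = sum(len(x & y) / len(y) for x, y in zip(pred, true) if y) / D
    return P, R
```

Sweeping a from 0 to 1 (with b = 1 − a) and recomputing these two scores on the test set is all the weight search described above requires.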
When P_j is greater than or equal to the threshold, Xi is successfully classified into the document class library whose label is j, and that document class library and the current model are updated; when P_j is less than the threshold for every common-word label, Xi is classified into the document class library with the "unclassified" label rather than into any document class library with a common-word label, and the unclassified document class library is updated.
Depending on how the document prediction label probabilities compare with the threshold, a document may be placed into the several document class libraries corresponding to the labels of multiple common classification words, or into the document class library with the "unclassified" label. That is, after the document Xi to be classified has been classified, each updated document class library Z receives Xi, where Y'_i denotes the set of document class libraries of the labels assigned to Xi and I (1 ≤ I ≤ N+1) ranges over all possible labels. When I = N+1, the document Xi is classified into the "unclassified" library and into none of the N libraries corresponding to common classification words; when I = l, Xi is placed into the document class library corresponding to label l.
By the above means, the present invention achieves fast and effective classification of unstructured documents. Machine-learning methods are used to build the text classification model and the image classification model; text content and image content are extracted from the full document and corresponding corpus and image class libraries are established; training on large amounts of data makes document classification machine-automated, saving manpower and material resources and improving work efficiency. Using the classification results of the corpus class library and the image class library together as classification indicators makes the results more accurate and widens the range of applicable document contents and formats.
The above description is only an embodiment of the present invention and does not limit the scope of the invention. All equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the invention, applied directly or indirectly in other related technical fields, are likewise included within the protection scope of the present invention.
Claims (6)
1. An automatic classification method for mobile-phone documents, characterized by comprising:
S1: collecting and organizing multiple labels most commonly used for document classification as keywords for building document class libraries, and constructing multiple document class libraries under the rule that one label corresponds to one document class library, the document class libraries comprising several libraries whose labels are common words and one library whose label is "unclassified", each document class library being divided into a training set and a test set;
S2: extracting the text content and image content from the training set of the document class libraries and, according to each document class library and its corresponding label, building a corresponding corpus class library and image class library, each divided into a training set and a test set;
S3: performing data preprocessing on the text content in the training set of the corpus class library, building a dictionary, and obtaining a text prediction label vector by building a text classification model; performing data preprocessing on the image content in the training set of the image class library, and obtaining an image prediction label vector by building an image classification model;
S4: combining the text prediction label vector and the image prediction label vector by weighted fusion into a document prediction label vector, and normalizing it to obtain a document prediction label probability;
S5: comparing the document prediction label probability with a preset threshold; when the probability is greater than or equal to the threshold, the document is placed into the document class library of the common classification word corresponding to the predicted label; when it is less than the threshold, the document is placed into the class library whose label is "unclassified".
2. The automatic classification method for mobile-phone documents according to claim 1, characterized in that step S1 also covers the case where one document appears in multiple document class libraries: assuming the document to be classified is Xi, Yi denotes the set of document class libraries corresponding to Xi, and J is the number of all possible document class libraries.
3. The automatic classification method for mobile-phone documents according to claim 1, characterized in that in step S2 the text in the image content of each class library is recognized by OCR and added to the corresponding corpus class library as text content.
4. The automatic classification method for mobile phone documents according to claim 1, characterised in that step S3 specifically includes:
S31: segmenting the text content into words with a Chinese word-segmentation tool;
S32: removing stop words and low-frequency words from the segmentation result of step S31; specifically, the stop words listed in a common stop-word table are rejected, a minimum word frequency is set according to the length of the document text, and words whose frequency falls below that minimum are filtered out;
S33: mapping the text content remaining after step S32 into word-vector form with the Word2vec toolkit;
S34: performing deeper feature extraction with a convolutional neural network: a convolutional layer performs preliminary feature extraction on the word vectors of step S33, a pooling layer turns the preliminary features into feature vectors, a fully connected layer concatenates all the feature vectors, and an added output layer with a sigmoid activation function computes the probability of each label and finally outputs the text prediction label vector.
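Steps S31-S32 can be sketched as below. In practice a segmenter such as jieba would produce the token lists and Word2vec (e.g. via gensim) would supply the word vectors of S33; the stop-word list and minimum frequency here are toy assumptions:

```python
from collections import Counter

# A tiny illustrative stop-word list; the claim refers to a common
# stop-word table, which would be far larger.
STOP_WORDS = {"的", "了", "是"}

def preprocess(tokenised_docs, min_freq=2):
    """S31-S32 sketch: given already-segmented documents, drop stop words
    and words whose corpus-wide frequency is below min_freq."""
    freq = Counter(w for doc in tokenised_docs for w in doc)
    return [[w for w in doc if w not in STOP_WORDS and freq[w] >= min_freq]
            for doc in tokenised_docs]
```

The surviving tokens would then be looked up in the dictionary and replaced by their Word2vec vectors before entering the convolutional layers of S34.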
5. The automatic classification method for mobile phone documents according to claim 1, characterised in that step S3 further specifically includes:
S35: rotating, scaling, cropping and normalising the image material;
S36: performing preliminary convolutional feature extraction on the image material processed in step S35 and feeding the preliminary features into a pooling layer to generate feature vectors; a fully connected layer then concatenates all the feature vectors, an output layer with a sigmoid activation function is added, the probability of each label is computed, and the image prediction label vector is finally output.
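A minimal sketch of the scaling and normalisation parts of S35, assuming a grayscale image given as a 2-D array; rotation, cropping and the convolutional layers of S36 are omitted, and the target size is an arbitrary choice:

```python
import numpy as np

def preprocess_image(img, size=(64, 64)):
    """S35 sketch: nearest-neighbour resize to a fixed size, then
    min-max normalisation into [0, 1]. The target size is illustrative."""
    img = np.asarray(img, dtype=np.float32)
    # Pick evenly spaced source rows/columns (nearest-neighbour resize).
    rows = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
    resized = img[np.ix_(rows, cols)]
    lo, hi = resized.min(), resized.max()
    # Scale pixel values to [0, 1]; a constant image becomes all zeros.
    return (resized - lo) / (hi - lo) if hi > lo else np.zeros(size, np.float32)
```

The normalised array would then be fed to the convolutional and pooling layers described in S36.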
6. The automatic classification method for mobile phone documents according to claim 1, characterised in that the text classification model measures its performance with the cross-entropy formula, and the image classification model evaluates the loss during learning with the mean squared error.
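The two losses named in claim 6 can be written out directly; `y_true` and `y_pred` are assumed to be per-label target values and predicted probabilities:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-label cross entropy, as used to measure the text
    classification model's performance (claim 6)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean squared error ("average variance"), as used to assess the
    image classification model's training loss (claim 6)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

A perfect prediction drives both losses to zero; the cross entropy penalises confident wrong predictions far more sharply than the squared error does.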
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910260996.2A CN110046264A (en) | 2019-04-02 | 2019-04-02 | A kind of automatic classification method towards mobile phone document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110046264A true CN110046264A (en) | 2019-07-23 |
Family
ID=67275718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910260996.2A Pending CN110046264A (en) | 2019-04-02 | 2019-04-02 | A kind of automatic classification method towards mobile phone document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110046264A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN107832663A (en) * | 2017-09-30 | 2018-03-23 | 天津大学 | A kind of multi-modal sentiment analysis method based on quantum theory |
CN108764268A (en) * | 2018-04-02 | 2018-11-06 | 华南理工大学 | A kind of multi-modal emotion identification method of picture and text based on deep learning |
CN108960073A (en) * | 2018-06-05 | 2018-12-07 | 大连理工大学 | Cross-module state image steganalysis method towards Biomedical literature |
CN109299341A (en) * | 2018-10-29 | 2019-02-01 | 山东师范大学 | One kind confrontation cross-module state search method dictionary-based learning and system |
CN109522548A (en) * | 2018-10-26 | 2019-03-26 | 天津大学 | A kind of text emotion analysis method based on two-way interactive neural network |
CN109522942A (en) * | 2018-10-29 | 2019-03-26 | 中国科学院深圳先进技术研究院 | A kind of image classification method, device, terminal device and storage medium |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110503081A (en) * | 2019-08-30 | 2019-11-26 | 山东师范大学 | Act of violence detection method, system, equipment and medium based on inter-frame difference |
CN111614786A (en) * | 2020-06-05 | 2020-09-01 | 易盼红 | System and method for processing data at high speed by remote server based on block chain |
CN112100379A (en) * | 2020-09-15 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for processing data |
CN112100379B (en) * | 2020-09-15 | 2023-07-28 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for processing data |
CN112329669A (en) * | 2020-11-11 | 2021-02-05 | 孙立业 | Electronic file management method |
CN113361249A (en) * | 2021-06-30 | 2021-09-07 | 北京百度网讯科技有限公司 | Document duplication judgment method and device, electronic equipment and storage medium |
CN113361249B (en) * | 2021-06-30 | 2023-11-17 | 北京百度网讯科技有限公司 | Document weight judging method, device, electronic equipment and storage medium |
CN116843377A (en) * | 2023-07-25 | 2023-10-03 | 河北鑫考科技股份有限公司 | Consumption behavior prediction method, device, equipment and medium based on big data |
Similar Documents
Publication | Title
---|---
CN111753060B (en) | Information retrieval method, apparatus, device and computer readable storage medium
CN110046264A (en) | A kind of automatic classification method towards mobile phone document
WO2020224097A1 (en) | Intelligent semantic document recommendation method and device, and computer-readable storage medium
CN112667794A (en) | Intelligent question-answer matching method and system based on twin network BERT model
EP3848797A1 (en) | Automatic parameter value resolution for api evaluation
CN109657011B (en) | Data mining system for screening terrorist attack event crime groups
CN108804595B (en) | Short text representation method based on word2vec
CN106845358B (en) | Method and system for recognizing image features of handwritten characters
CN110516074B (en) | Website theme classification method and device based on deep learning
CN108959305A (en) | A kind of event extraction method and system based on internet big data
CN112559684A (en) | Keyword extraction and information retrieval method
CN109960727A (en) | For the individual privacy information automatic testing method and system of non-structured text
WO2021190662A1 (en) | Medical text sorting method and apparatus, electronic device, and storage medium
CN111177367A (en) | Case classification method, classification model training method and related products
CN116187444A (en) | K-means++ based professional field sensitive entity knowledge base construction method
CN114092948B (en) | Bill identification method, device, equipment and storage medium
CN109582743B (en) | Data mining system for terrorist attack event
CN112489689B (en) | Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN113946657A (en) | Knowledge reasoning-based automatic identification method for power service intention
CN116049376B (en) | Method, device and system for retrieving and replying information and creating knowledge
CN115269816A (en) | Core personnel mining method and device based on information processing method and storage medium
CN113435213B (en) | Method and device for returning answers to user questions and knowledge base
CN112579783B (en) | Short text clustering method based on Laplace atlas
CN113761123A (en) | Keyword acquisition method and device, computing equipment and storage medium
CN111798217A (en) | Data analysis system and method
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190723 |