CN110929746A - Electronic file title positioning, extracting and classifying method based on deep neural network - Google Patents

Electronic file title positioning, extracting and classifying method based on deep neural network

Info

Publication number
CN110929746A
CN110929746A
Authority
CN
China
Prior art keywords
title
image
neural network
frame
electronic file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910454209.8A
Other languages
Chinese (zh)
Inventor
葛季栋
李传艺
刘宇翔
姚林霞
乔洪波
周筱羽
骆斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201910454209.8A priority Critical patent/CN110929746A/en
Publication of CN110929746A publication Critical patent/CN110929746A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an electronic file title positioning, extracting and classifying method based on a deep neural network, which comprises the following steps: inputting file pictures into a neural network to extract a plurality of feature maps at multiple sizes; calculating category scores and border positions from the output feature maps; and selecting the title position and title category in the document through several title election algorithms. The invention aims to solve the problem that electronic file images often have to be classified manually during actual electronic file processing. Titles are extracted directly at the image level rather than through OCR (optical character recognition) or similar means, so the position and category of an image title can be obtained accurately from image features alone, which improves overall robustness and raises the accuracy of image classification.

Description

Electronic file title positioning, extracting and classifying method based on deep neural network
Technical Field
The invention relates to a classification method for electronic files, in particular to a method for positioning, extracting and classifying the titles of electronic files based on a deep neural network, and belongs to the fields of computer vision and deep learning.
Background
To promote the case-synchronized generation of electronic files, deepen the integration of modern information technology with court work, and support the continued upgrade of the "smart court", courts across the country now generate electronic files for the cases they accept; for most cases the electronic file is produced synchronously with the case and covers the whole process of filing, handling, archiving and settlement. Case handlers must convert case materials into electronic documents in real time and generate the electronic file, ensuring that the entire case-handling process leaves traces in the system. Related personnel such as department heads, collegial panels and case-management offices can use the electronic file system to track case-handling progress online, review cases and examine case quality, raising the level of judicial management. Courts at all levels can transfer case files online through the electronic file system, improving the efficiency of cooperation between courts. The parties and their litigation agents can scan and upload electronic litigation materials on equipment provided by the provincial high courts, apply to consult and print the electronic file information of their cases, and follow case progress in real time, which better promotes judicial openness and supervision of enforcement.
However, cases and electronic files still require manual processing: workers must browse information of the relevant types, other data-mining and information-extraction tasks likewise depend on files of specific types, and cataloguing personnel must identify and split the electronic data, extract the file titles and enter the file names by hand, which is time-consuming and laborious.
The value of classifying electronic file pictures is twofold. On the one hand, when every picture in an electronic file carries a clear type label, related personnel can find the pictures they need to browse more quickly, track specific information and check whether any material is missing, which greatly improves the efficiency of consulting electronic files. On the other hand, as a first step in building the smart court, many subsequent artificial-intelligence steps such as information extraction depend heavily on already-classified pictures; classifying thousands of pictures by hand consumes considerable manpower, so automatically labelling electronic file pictures provides great convenience for these subsequent steps and saves a large amount of time and labour.
In computer vision, image classification is a fundamental problem, but when it is applied to electronic file images, classifying whole document images directly does not give an ideal result, because the global appearance of text-type document images is roughly the same, new types of litigation material keep appearing, and the distinguishability of the file materials is limited. The invention therefore applies another classical computer-vision task, object detection and recognition, to locate and classify the title region of the text image. Object detection methods fall roughly into two kinds. Two-stage methods first extract regions of interest and then extract image features of those regions again for subsequent classification and inference. One-stage methods extract features of the whole image end to end, shrinking the feature maps in a pyramid while each layer outputs prediction boxes of different aspect ratios and target sizes; the burden of predicting different target sizes is thus shared across layers, and the multi-task form of jointly predicting title categories and regressing the length and width of title boxes lets the two tasks reinforce each other and improve accuracy. During computation, because many long text boxes with extremely high aspect ratios must be predicted, several convolution layers with very elongated kernels are added, so the features of text titles and characters are easily lost, and computing the title category from high-level features alone can hardly achieve a satisfactory result; lower-level features must therefore be concatenated when predicting the title category, so that font features do not vanish as the network deepens. On this basis the method builds on a basic end-to-end object-detection deep network, simplifies the traditional pipeline of scanning the electronic file, running global OCR (optical character recognition) and then performing text analysis and title extraction, and focuses on positioning, extracting and classifying electronic file titles.
Disclosure of Invention
The invention discloses an electronic file title positioning, extracting and classifying method based on a deep neural network, together with an electronic file image preprocessing method. The method can greatly reduce the manpower and time a court spends manually consulting, classifying and archiving electronic files, provides convenient retrieval when a judge needs to consult file data or files of a specific type, and supplies clear image categories for the subsequent information extraction from specific document types, such as litigation documents and judgment documents, in artificial-intelligence-related processing.
The invention relates to an electronic file title positioning, extracting and classifying method based on a deep neural network, which is characterized by comprising the following steps of:
Step (1), inputting the file pictures into a neural network to extract a plurality of feature maps at multiple sizes.
Step (2), calculating a category score and a border position according to the output feature maps.
Step (3), inferring the title position and the title category in the document by using a plurality of title election algorithms.
2. The method for extracting and classifying electronic file titles based on the deep neural network as claimed in claim 1, wherein in step (1) the file pictures are input into the neural network to extract a plurality of feature maps at multiple sizes, the specific sub-steps comprising:
and (1.1) carrying out size correction on the file image and carrying out image preprocessing.
Step (1.2) input the preprocessed portfolio image into the underlying neural network and transmit into the title proposal neural network when the signature size becomes 1/8 initial.
Step (1.3) multiple long transverse convolutions and dilation convolutions are performed on the feature map and combined in the title proposed neural network.
And (1.4) finishing the final reduction of the feature map to 1/32 of the original map, and calculating and classifying two 1/8, two 1/16 and one 1/32 feature maps in the middle process.
And (1.5) image enhancement is carried out by adopting image rotation, filling, blurring, interception and brightness contrast adjustment in a training stage. And selecting a corresponding frame as a positive category mark through the Jaccard, selecting a specified number of frames with the lowest predicted values as negative category marks, and dividing the data set into a training, verifying and testing set. And training the network by continuously changing the hierarchical structure of the network parameters, and finally jointly evaluating the network by the f-measure of the frame and the classified call.
3. The method for extracting and classifying electronic file titles based on the deep neural network as claimed in claim 1, wherein in step (2) the category score and the border position are calculated according to the output feature maps, the specific sub-steps comprising:
Step (2.1), for each layer of the feature map, predicting through box-regression convolution the positions, mapped back to the original image, of title centers in several vertical columns and the lengths and widths of titles at various aspect ratios.
Step (2.2), continuing to compute the title-category features of each layer's feature map through an additional multi-layer classification module, and outputting the title category of each point mapped into the original image.
4. The method for extracting and classifying electronic file titles based on the deep neural network as claimed in claim 1, wherein in step (3) the title position and the title category in the document are selected by a plurality of title election algorithms, the specific sub-steps comprising:
Step (3.1), judging for every point in the image the probability of each title category, and, where a title exists, obtaining the predicted values of the title box's center, height and width.
Step (3.2), screening all prediction boxes by a threshold.
Step (3.3), clipping all predicted title-box values that exceed the image boundary.
Step (3.4), sorting all remaining title boxes in descending order of the probability of each title category and extracting the top k boxes with the highest probability.
Step (3.5), selecting the several box categories and box positions with the highest predicted probability by using the NMS algorithm.
Step (3.6), reprocessing the several results obtained in step (3.5) with a post-suppression NMS algorithm and finally electing one bounding-box result.
Step (3.7), evaluating the effect of the title proposal and classification network jointly by the f-measure of box predictions with IOU greater than 0.5 and the classification accuracy.
Compared with the prior art, the invention has the following notable advantages. It replaces the traditional approach to electronic file cataloguing, in which OCR is run on the file image and the title is then extracted by text analysis, with direct prediction of the title box position from image features, while the title box category is computed at the same time through shared convolutions; this simplifies the pipeline and removes both OCR errors and the cases where an indistinct title name cannot be classified. Image features are strongly robust to handwriting, rotation and blur, so the title position can still be detected and a plausible title category estimated where OCR fails, and time is saved. When an additional electronic file picture category must be supported, only additional training is required, with no extra processing steps. Because the extracted image features are distinctive, the judgment accuracy is reliable. When manual verification is needed, the title box marked on the image and the category on the box make it possible to check errors visually, which eases subsequent review. With the method, titles in a large volume of plain-text electronic file images can be extracted and recognized well without extra steps such as OCR, and for handwritten or blurred text images that OCR struggles to recognize, the title position can still be located and recognized from image features. When images of a new category must be added, only image samples of the new category need to be added for training, and the network converges quickly.
Drawings
FIG. 1 is a flow chart of the method for extracting and classifying electronic file titles based on a deep neural network
FIG. 2 is the general architecture of the title-locating network
FIG. 3 shows the special modules used for title boxes in the title extraction network
FIG. 4 shows examples of electronic file titles
FIG. 5 is a schematic diagram of part of the pre-allocated borders (only aspect ratios 3 and 13 are shown; in reality there are more than two, and the aspect ratios are placed in two columns for visibility)
FIG. 6 is the general network architecture of the classifier module
FIG. 7 is a flow chart of the post-suppression NMS algorithm
FIG. 8 is an experimental comparison between the conventional TextBoxes method and the electronic file title extraction, positioning and classification network presented herein
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention aims to solve the problem of electronic file cataloguing and provides an electronic file title positioning, extracting and classifying method based on a deep neural network. Using a deep neural network, the title position and title category in the electronic volume are extracted directly, avoiding the step of finding the document title by text extraction after OCR recognition of the whole page. The method replaces the traditional approach to electronic file cataloguing, in which OCR is run on the file image and the title is then extracted by text analysis, with direct prediction of the title box position from image features, while the title box category is computed at the same time through shared convolutions; this simplifies the pipeline and removes both OCR errors and the cases where an indistinct title name cannot be classified. Image features are strongly robust to handwriting, rotation and blur, so the title position can still be detected and a plausible title category estimated where OCR fails, and time is saved. When an additional electronic file picture category must be supported, only additional training is required, with no extra processing steps. Because the extracted image features are distinctive, the judgment accuracy is reliable. When manual verification is needed, the title box marked on the image and the category on the box make it possible to check errors visually, which eases subsequent review. With the method, titles in a large volume of plain-text electronic file images can be extracted and recognized well without extra steps such as OCR, and for handwritten or blurred text images that OCR struggles to recognize, the title position can still be located and recognized from image features. When images of a new category must be added, only image samples of the new category need to be added for training, and the network converges quickly. The invention mainly comprises the following steps:
Step (1), inputting the file pictures into a neural network to extract a plurality of feature maps at multiple sizes.
Step (2), calculating a category score and a border position according to the output feature maps.
Step (3), inferring the title position and the title category in the document by using a plurality of title election algorithms.
The detailed workflow of the electronic file title positioning, extracting and classifying method based on the deep neural network is shown in FIG. 1. The above steps are described in detail below.
1. Because electronic file images vary widely in size and proportion, a series of preprocessing operations is needed before the file images are input into the neural network to extract feature maps at multiple sizes, so that every file image can be fed into the deep neural network for processing. The specific steps are:
step (1.1) resizes the portfolio image to a fixed scale resolution (e.g., 320 x 320).
Step (1.2) inputs the preprocessed file image into a base neural network and passes it into the title proposal neural network when the feature map size reaches 1/8 of the initial size. The base neural network is chosen to extract the features of the electronic file image; base networks such as the Inception series, VGG or ResNet can be used, and their pre-trained models help the title proposal network learn the basic image features better and faster. The base network extracts the character features of the electronic file image and image features at levels such as the relations between the characters in each line. The structure of the whole network is shown in FIG. 2.
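A minimal sketch of this hand-off, assuming a VGG16 backbone (one of the bases the text names) truncated where the feature map reaches 1/8 of the input; the exact cut point is an assumption based on torchvision's layer ordering:

```python
import torch
import torchvision

# Truncate a pretrained VGG16 at pool3, where a 320x320 input becomes a
# 40x40 (1/8-scale) feature map handed to the title proposal network.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
backbone_1_8 = torch.nn.Sequential(*list(vgg)[:17])   # conv1_1 .. pool3

x = torch.randn(1, 3, 320, 320)   # a preprocessed file image
feat = backbone_1_8(x)            # -> (1, 256, 40, 40), 1/8 of the input
```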
Step (1.3) performs multiple long horizontal convolutions and dilated convolutions on the feature map in the title proposal neural network and combines them. Since the title part of an electronic volume image is a long strip rather than the roughly square shapes of conventional object detection tasks, the features of the title box are mostly extracted with long horizontal convolutions, while vertical convolutions extract the features of other information on the image; and because high-level image features must be obtained without shrinking the feature map, forms such as dilated convolution and pooling are adopted so that the features of the image's title box are captured better. The structure of this network module is shown in FIG. 3.
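A sketch of such a module, with assumed kernel shapes (1x7 horizontal, 7x1 vertical, 3x3 dilated); the patent fixes only the convolution types, not their exact sizes:

```python
import torch
import torch.nn as nn

class TitleProposalBlock(nn.Module):
    """Combines a long horizontal convolution (strip-shaped title boxes),
    a vertical convolution (other layout cues) and a dilated convolution
    (larger receptive field without shrinking the map), as in FIG. 3."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.horizontal = nn.Conv2d(in_ch, out_ch, (1, 7), padding=(0, 3))
        self.vertical = nn.Conv2d(in_ch, out_ch, (7, 1), padding=(3, 0))
        self.dilated = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2)
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)   # merge the branches
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.horizontal(x), self.vertical(x), self.dilated(x)], 1)
        return self.relu(self.fuse(y))
```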
Step (1.4) finishes the final reduction of the feature map to 1/32 of the original image, and performs box regression and classification on five intermediate feature maps: two at 1/8, two at 1/16 and one at 1/32 scale. Since title sizes in electronic volume images span a very large range, as shown in FIG. 4, the title box must be predicted from an image pyramid of several layers with different sizes so that the network covers a wide range of scales.
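The multi-scale prediction can be sketched as below; the channel counts, the number of default boxes per location and the class count are illustrative assumptions:

```python
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """Box-regression and classification heads over the five intermediate
    feature maps (two at 1/8, two at 1/16, one at 1/32 scale)."""

    def __init__(self, chans=(256, 256, 512, 512, 512), boxes=6, classes=5):
        super().__init__()
        self.reg = nn.ModuleList(
            nn.Conv2d(c, boxes * 4, (1, 5), padding=(0, 2)) for c in chans)
        self.cls = nn.ModuleList(
            nn.Conv2d(c, boxes * (classes + 1), (1, 5), padding=(0, 2))
            for c in chans)

    def forward(self, feats):                  # feats: list of 5 tensors
        return ([r(f) for r, f in zip(self.reg, feats)],   # box offsets
                [c(f) for c, f in zip(self.cls, feats)])   # class scores
```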
Step (1.5) performs image augmentation in the training stage by rotation, padding, blurring, cropping and brightness-contrast adjustment. The matching boxes are selected as positive category labels via the Jaccard index, a specified number of boxes with the lowest predicted values are selected as negative category labels, and the data set is divided into training, validation and test sets. The network is trained while its parameters and hierarchy are adjusted, and it is finally evaluated jointly by the f-measure of the boxes and the classification recall. Meanwhile, in the training stage, softmax is used to convert the predicted values for box-category judgment, and cross entropy is used for the loss value of the training network; for negative samples, the top-k negative samples with the worst prediction results are selected, where k is the number of positive samples. The specific calculation formula is as follows:
$L_{conf}(x,c) = -\sum_{i \in Pos} x_{ij}^{p} \log(\hat{c}_{i}^{p}) - \sum_{i \in Neg} \log(\hat{c}_{i}^{0}), \qquad \hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{p} \exp(c_{i}^{p})}$
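In code, the hard-negative selection described above might look like this (a sketch; the tensor shapes are assumptions):

```python
import torch.nn.functional as F

def conf_loss(logits, labels):
    """Softmax cross-entropy over all default boxes, keeping every positive
    plus the top-k worst-predicted negatives, k = number of positives.
    logits: (N, classes+1) with class 0 = background; labels: (N,) int64."""
    per_box = F.cross_entropy(logits, labels, reduction="none")
    pos = labels > 0
    k = int(pos.sum())
    neg = per_box[~pos]
    hard_neg, _ = neg.topk(min(k, neg.numel()))   # most confusing negatives
    return per_box[pos].sum() + hard_neg.sum()
```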
The length and width of the box and the loss of the center point are computed with smooth L1; the center-point regression is scaled by a factor of 0.1 and the length-width regression by a factor of 0.15, so that their distributions become similar. The specific calculation formulas are as follows:
$L_{loc}(x,l,g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_{i}^{m} - \hat{g}_{j}^{m}\right)$
$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$
$\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{0.1\, d_{i}^{w}}, \quad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{0.1\, d_{i}^{h}}, \quad \hat{g}_{j}^{w} = \frac{1}{0.15} \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad \hat{g}_{j}^{h} = \frac{1}{0.15} \log\frac{g_{j}^{h}}{d_{i}^{h}}$
The sum of the two losses is divided by the number N of selected positive samples; the specific calculation formula is:
$L(x,c,l,g) = \frac{1}{N}\left(L_{conf}(x,c) + L_{loc}(x,l,g)\right)$
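A corresponding sketch of the regression loss and the joint objective; the target encoding with the 0.1/0.15 factors is assumed to follow the SSD convention:

```python
import torch.nn.functional as F

def loc_loss(pred_offsets, target_offsets):
    """Smooth-L1 over the matched (positive) boxes; targets are assumed to
    be already encoded with the 0.1 center / 0.15 size scaling."""
    return F.smooth_l1_loss(pred_offsets, target_offsets, reduction="sum")

def total_loss(l_conf, l_loc, num_pos):
    """Joint objective: the summed losses divided by N positive samples."""
    return (l_conf + l_loc) / max(num_pos, 1)
```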
the positive sample is matched with pre-configured frames on the feature map through a marking frame during input, wherein the pre-configured frames are generally distributed as shown in fig. 5 and are frames with various aspect ratios, a node of each feature map is responsible for a plurality of frames with different aspect ratios generated by a plurality of central points of the area in the original map, and a specific calculation formula of the length and the width of each frame is as follows:
$w_{k}^{a} = v_{k}\sqrt{a_{r}}$
$h_{k}^{a} = \frac{v_{k}}{\sqrt{a_{r}}}$
where a_r is the aspect ratio, v_k is a preset box size, and f_k is the ratio of the box step size to the image size. The IOU is computed via the Jaccard index, i.e. the ratio of the intersection area of two rectangular boxes to the area of their union; an IOU greater than 0.5 marks a positive sample. The training goal is to select an optimal title proposal model; pre-training can first be carried out on a general title data set and then transferred to electronic file title extraction by transfer learning, which reduces over-fitting of the model.
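A sketch of the pre-configured boxes and the Jaccard matching; the aspect-ratio list and the grid layout are assumptions consistent with FIG. 5:

```python
import numpy as np

def default_boxes(fw, fh, f_k, v_k, aspect_ratios=(3, 5, 7, 9, 13)):
    """One set of long boxes per feature-map cell: centers on a regular
    grid, w = v_k * sqrt(a), h = v_k / sqrt(a), in normalized coordinates."""
    out = []
    for j in range(fh):
        for i in range(fw):
            cx, cy = (i + 0.5) * f_k, (j + 0.5) * f_k
            out += [[cx, cy, v_k * np.sqrt(a), v_k / np.sqrt(a)]
                    for a in aspect_ratios]
    return np.array(out)          # (N, 4) as (cx, cy, w, h)

def jaccard(a, b):
    """IOU of two (x1, y1, x2, y2) boxes; IOU > 0.5 marks a positive match."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```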
2. To convert the image features extracted by the neural network into the final multi-scale title detection result, the category score and box position must be calculated from the output feature maps in step (2). The specific sub-steps are:
and (2.1) predicting the position of the center of the title in a plurality of vertical rows in the original image and the length and width of the title in various aspect ratios by using the point mapping for each layer of feature map through frame regression convolution.
Step (2.2) continues to compute the title-category features of each layer's feature map through an additional multi-layer classification module and outputs the title category of each point mapped into the original image. The general structure is shown in FIG. 6. Since most titles share many characters, and high-level image features may no longer carry the values of the image's low-level features, especially after features have been computed over a long title, the title category is predicted jointly from low-level features combined with high-level features, which amplifies the distinction between small fonts inside a large title box.
3. To integrate all preset title boxes output by the neural network and finally obtain, by screening, a single title box position and category for the electronic file image, the title position and title category in the document must be selected in step (3) through several title election algorithms. The specific sub-steps are:
and (3.1) judging the possibility of each title of all points in the image, and if the titles exist, acquiring predicted values of the center and the height width of the title frame. I.e. the classification output and the title frame regression output in the step (2).
Step (3.2) screens all prediction boxes by a threshold. Since most of the title boxes are background or need not be adopted, and the number of title boxes output across the feature maps is very large, the boxes with small prediction scores are excluded directly by a threshold to simplify subsequent processing and reduce subsequent computation.
Step (3.3) clips all predicted title-box values that exceed the image boundary. Since the regression values of the output title box may exceed the boundary of the image, any value that exceeds it is corrected to the image boundary.
Step (3.4) sorts all remaining title boxes in descending order of the probability of each title category and extracts the top k boxes with the highest probability, thereby obtaining all the boxes most likely to be the title of the electronic volume.
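Steps (3.2) to (3.4) can be sketched together as one filtering pass; the 0.1 score threshold and k = 200 are illustrative assumptions:

```python
import numpy as np

def select_candidates(boxes, scores, img_w, img_h, thresh=0.1, top_k=200):
    """boxes: (N, 4) pixel (x1, y1, x2, y2); scores: (N,) best title-class
    probability per box. Returns the clipped top-k candidates."""
    keep = scores > thresh                            # step (3.2): threshold
    boxes, scores = boxes[keep].copy(), scores[keep]
    boxes[:, 0::2] = boxes[:, 0::2].clip(0, img_w)    # step (3.3): clip x
    boxes[:, 1::2] = boxes[:, 1::2].clip(0, img_h)    # step (3.3): clip y
    order = np.argsort(-scores)[:top_k]               # step (3.4): top-k
    return boxes[order], scores[order]
```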
Step (3.5) selects the several box categories and box positions with the highest predicted probability using the NMS algorithm.
Step (3.6) reprocesses the several results obtained in step (3.5) with a post-suppression NMS algorithm and finally elects one bounding-box result. The general flow of the post-suppression NMS algorithm is shown in FIG. 7. Its purpose is to filter out final judgment errors caused by mistakenly outputting the predicted box category with the highest probability: because the network ultimately adopts only one bounding box, the conventional NMS algorithm degenerates, for electronic file title positioning, into selecting only the single box with the highest probability, which produces some fluctuation. The post-suppression NMS algorithm reduces such interference in the detected title result.
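A sketch of step (3.5) and a stand-in for the post-suppression election of step (3.6); the patent does not disclose the exact suppression rule, so the score-margin test below is an assumption:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Plain NMS over (x1, y1, x2, y2) boxes, as in step (3.5)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order, keep = np.argsort(-scores), []
    while order.size:
        i, rest = order[0], order[1:]
        keep.append(i)
        iw = np.maximum(0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou < iou_thresh]
    return keep

def elect_title(boxes, scores, labels, margin=0.05):
    """Step (3.6) stand-in: keep the best NMS survivor; a small score margin
    over the runner-up flags ambiguous elections for manual review."""
    keep = sorted(nms(boxes, scores), key=lambda i: -scores[i])
    best = keep[0]
    ambiguous = len(keep) > 1 and scores[best] - scores[keep[1]] < margin
    return boxes[best], labels[best], ambiguous
```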
Step (3.7) evaluates the effect of the title proposal and classification network jointly by the f-measure of box predictions with IOU greater than 0.5 and the classification accuracy. Because the aspect ratios of electronic file titles vary widely, no single feature characterizes them well, and the title fonts of different types of electronic files differ greatly. In the scenario of intelligent electronic file cataloguing, a good electronic file image classification network should take this font variability into account; the invention therefore evaluates jointly with the f-measure of box prediction and the accuracy of classification. In the experimental evaluation, the conventional TextBoxes method is compared with the electronic file title extraction, positioning and classification network presented herein on five different types of electronic file images; the results are shown in FIG. 8. The title positions and category judgments extracted by the invention are superior to the other methods on all five types of electronic file images; the other conventional methods cannot cover all titles of the electronic file and cannot extract the title category information of the electronic file well.
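For completeness, the joint evaluation can be sketched as follows, assuming one ground-truth title per image; iou_fn can be any IOU helper such as the jaccard() sketch above:

```python
def evaluate(predictions, ground_truth, iou_fn):
    """predictions / ground_truth: per-image (box, label) pairs, box = None
    when nothing was detected. Returns (f-measure at IOU > 0.5, accuracy)."""
    tp = fp = fn = correct = 0
    for (p_box, p_label), (g_box, g_label) in zip(predictions, ground_truth):
        if p_box is None:
            fn += 1
        elif iou_fn(p_box, g_box) > 0.5:
            tp += 1
            correct += int(p_label == g_label)
        else:
            fp += 1
            fn += 1
    precision, recall = tp / max(tp + fp, 1), tp / max(tp + fn, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-9)
    return f_measure, correct / max(tp, 1)
```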
The method for positioning, extracting and classifying electronic file titles based on a deep neural network according to the present invention has been described in detail above with reference to the accompanying drawings. The invention has the following advantages. It replaces the traditional approach to electronic file cataloguing, in which OCR is run on the file image and the title is then extracted by text analysis, with direct prediction of the title box position from image features, while the title box category is computed at the same time through shared convolutions; this simplifies the pipeline and removes both OCR errors and the cases where an indistinct title name cannot be classified. Image features are strongly robust to handwriting, rotation and blur, so the title position can still be detected and a plausible title category estimated where OCR fails, and time is saved. When an additional electronic file picture category must be supported, only additional training is required, with no extra processing steps. Because the extracted image features are distinctive, the judgment accuracy is reliable. When manual verification is needed, the title box marked on the image and the category on the box make it possible to check errors visually, which eases subsequent review. With the method, titles in a large volume of plain-text electronic file images can be extracted and recognized well without extra steps such as OCR, and for handwritten or blurred text images that OCR struggles to recognize, the title position can still be located and recognized from image features. When images of a new category must be added, only image samples of the new category need to be added for training, and the network converges quickly. Compared with a conventional object detection network, the network modules specially designed for electronic files allow the network to detect the title positions of almost all electronic files well while also extracting the features of all the characters in electronic file titles.
It is to be understood that the invention is not limited to the specific arrangements and instrumentalities described above and shown in the drawings; a detailed description of known techniques is omitted here for brevity. The present embodiments are to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (4)

1. A method for extracting and classifying electronic file titles based on a deep neural network, characterized by comprising the following steps:
step (1), inputting the file pictures into a neural network to extract a plurality of feature maps at multiple sizes;
step (2), calculating a category score and a border position according to the output feature maps;
step (3), inferring the title position and the title category in the document by using a plurality of title election algorithms.
2. The method for extracting and classifying electronic file titles based on the deep neural network as claimed in claim 1, wherein in step (1) the file pictures are input into the neural network to extract a plurality of feature maps at multiple sizes, the specific sub-steps comprising:
step (1.1), correcting the size of the file image and performing image preprocessing;
step (1.2), inputting the preprocessed file image into the base neural network and passing it into the title proposal neural network when the feature map size reaches 1/8 of the initial size;
step (1.3), performing multiple long horizontal convolutions and dilated convolutions on the feature map in the title proposal neural network and combining the results;
step (1.4), finishing the final reduction of the feature map to 1/32 of the original image, and performing box regression and classification on five intermediate feature maps: two at 1/8, two at 1/16 and one at 1/32 scale;
step (1.5), in the training stage, performing image augmentation by rotation, padding, blurring, cropping and brightness-contrast adjustment; selecting the matching boxes as positive category labels via the Jaccard index, selecting a specified number of boxes with the lowest predicted values as negative category labels, and dividing the data set into training, validation and test sets; training the network while adjusting its parameters and hierarchy, and finally evaluating it jointly by the f-measure of the boxes and the classification recall.
3. The method for extracting and classifying electronic file titles based on the deep neural network as claimed in claim 1, wherein in step (2) the category score and the border position are calculated according to the output feature maps, the specific sub-steps comprising:
step (2.1), for each layer of the feature map, predicting through box-regression convolution the positions, mapped back to the original image, of title centers in several vertical columns and the lengths and widths of titles at various aspect ratios;
step (2.2), continuing to compute the title-category features of each layer's feature map through an additional multi-layer classification module, and outputting the title category of each point mapped into the original image.
4. The method for extracting and classifying electronic file titles based on the deep neural network as claimed in claim 1, wherein in step (3) the title position and the title category in the document are selected by a plurality of title election algorithms, the specific sub-steps comprising:
step (3.1), judging for every point in the image the probability of each title category, and, where a title exists, obtaining the predicted values of the title box's center, height and width;
step (3.2), screening all prediction boxes by a threshold;
step (3.3), clipping all predicted title-box values that exceed the image boundary;
step (3.4), sorting all remaining title boxes in descending order of the probability of each title category and extracting the top k boxes with the highest probability;
step (3.5), selecting the several box categories and box positions with the highest predicted probability by using the NMS algorithm;
step (3.6), reprocessing the several results obtained in step (3.5) with a post-suppression NMS algorithm and finally electing one bounding-box result;
step (3.7), evaluating the effect of the title proposal and classification network jointly by the f-measure of box predictions with IOU greater than 0.5 and the classification accuracy.
CN201910454209.8A 2019-05-24 2019-05-24 Electronic file title positioning, extracting and classifying method based on deep neural network Pending CN110929746A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910454209.8A CN110929746A (en) 2019-05-24 2019-05-24 Electronic file title positioning, extracting and classifying method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910454209.8A CN110929746A (en) 2019-05-24 2019-05-24 Electronic file title positioning, extracting and classifying method based on deep neural network

Publications (1)

Publication Number Publication Date
CN110929746A 2020-03-27

Family

ID=69855684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910454209.8A Pending CN110929746A (en) 2019-05-24 2019-05-24 Electronic file title positioning, extracting and classifying method based on deep neural network

Country Status (1)

Country Link
CN (1) CN110929746A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860524A (en) * 2020-07-28 2020-10-30 上海兑观信息科技技术有限公司 Intelligent classification device and method for digital files
CN112132710A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Legal element processing method and device, electronic equipment and storage medium
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112560902A (en) * 2020-12-01 2021-03-26 中国农业科学院农业信息研究所 Book identification method and system based on spine visual information
CN112766246A (en) * 2021-04-09 2021-05-07 上海旻浦科技有限公司 Document title identification method, system, terminal and medium based on deep learning
CN113781607A (en) * 2021-09-17 2021-12-10 平安科技(深圳)有限公司 Method, device and equipment for processing annotation data of OCR (optical character recognition) image and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682697A (en) * 2016-12-29 2017-05-17 华中科技大学 End-to-end object detection method based on convolutional neural network
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 A kind of multiscale target detection method based on depth convolutional neural networks
CN109584227A (en) * 2018-11-27 2019-04-05 山东大学 A kind of quality of welding spot detection method and its realization system based on deep learning algorithm of target detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682697A (en) * 2016-12-29 2017-05-17 华中科技大学 End-to-end object detection method based on convolutional neural network
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 A kind of multiscale target detection method based on depth convolutional neural networks
CN109584227A (en) * 2018-11-27 2019-04-05 山东大学 A kind of quality of welding spot detection method and its realization system based on deep learning algorithm of target detection

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860524A (en) * 2020-07-28 2020-10-30 上海兑观信息科技技术有限公司 Intelligent classification device and method for digital files
CN112132710A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Legal element processing method and device, electronic equipment and storage medium
CN112132710B (en) * 2020-09-23 2023-02-03 平安国际智慧城市科技股份有限公司 Legal element processing method and device, electronic equipment and storage medium
CN112560902A (en) * 2020-12-01 2021-03-26 中国农业科学院农业信息研究所 Book identification method and system based on spine visual information
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112766246A (en) * 2021-04-09 2021-05-07 上海旻浦科技有限公司 Document title identification method, system, terminal and medium based on deep learning
CN113781607A (en) * 2021-09-17 2021-12-10 平安科技(深圳)有限公司 Method, device and equipment for processing annotation data of OCR (optical character recognition) image and storage medium
CN113781607B (en) * 2021-09-17 2023-09-19 平安科技(深圳)有限公司 Processing method, device, equipment and storage medium for labeling data of OCR (optical character recognition) image

Similar Documents

Publication Publication Date Title
CN111325203B (en) American license plate recognition method and system based on image correction
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN109857889B (en) Image retrieval method, device and equipment and readable storage medium
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
US20200302248A1 (en) Recognition system for security check and control method thereof
CN110717534B (en) Target classification and positioning method based on network supervision
CN109740603A (en) Based on the vehicle character identifying method under CNN convolutional neural networks
CN112446388A (en) Multi-category vegetable seedling identification method and system based on lightweight two-stage detection model
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
CN112633382B (en) Method and system for classifying few sample images based on mutual neighbor
Ahranjany et al. A very high accuracy handwritten character recognition system for Farsi/Arabic digits using convolutional neural networks
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN110781648A (en) Test paper automatic transcription system and method based on deep learning
CN114155527A (en) Scene text recognition method and device
CN110674777A (en) Optical character recognition method in patent text scene
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN112365497A (en) High-speed target detection method and system based on Trident Net and Cascade-RCNN structures
CN111242131B (en) Method, storage medium and device for identifying images in intelligent paper reading
CN114092938A (en) Image recognition processing method and device, electronic equipment and storage medium
CN108960005B (en) Method and system for establishing and displaying object visual label in intelligent visual Internet of things
CN111832497B (en) Text detection post-processing method based on geometric features
CN117911697A (en) Hyperspectral target tracking method, system, medium and equipment based on large model segmentation
CN117557860A (en) Freight train foreign matter detection method and system based on YOLOv8
CN112418262A (en) Vehicle re-identification method, client and system
CN112465821A (en) Multi-scale pest image detection method based on boundary key point perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination