CN113065396A - Automatic filing processing system and method for scanned archive image based on deep learning - Google Patents

Automatic filing processing system and method for scanned archive image based on deep learning

Info

Publication number
CN113065396A
CN113065396A
Authority
CN
China
Prior art keywords
picture
image
processing
document
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110230772.4A
Other languages
Chinese (zh)
Inventor
陈文正
栾杉
李琳
占娜
魏馨霆
王溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Central China Technology Development Of Electric Power Co ltd
State Grid Hubei Electric Power Co Ltd
Original Assignee
Hubei Central China Technology Development Of Electric Power Co ltd
State Grid Hubei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Central China Technology Development Of Electric Power Co ltd, State Grid Hubei Electric Power Co Ltd filed Critical Hubei Central China Technology Development Of Electric Power Co ltd
Priority to CN202110230772.4A priority Critical patent/CN113065396A/en
Publication of CN113065396A publication Critical patent/CN113065396A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/243 Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention provides an automatic filing processing system and method for scanned archive images based on deep learning. The method comprises the following steps: step one: data preprocessing and model training; step two: subject identification; step three: tilt correction; step four: automatic thresholding and image reconstruction; step five: processing of table picture data; step six: table segmentation, in which the data annotated in step five are fed into a Unet network for training to obtain a table segmentation model, and input pictures are segmented into cell pictures according to that model; step seven: text line segmentation, in which a CTPN model performs text line segmentation on each cell obtained in step six. The invention has the advantages that the overall accuracy of archive picture processing reaches 91.7% or higher, the efficiency and standardization of archive filing can be improved, and the personnel and equipment investment and the environmental requirements of the processing site are reduced.

Description

Automatic filing processing system and method for scanned archive image based on deep learning
Technical Field
The invention relates to the field of automatic document processing, in particular to an automatic filing processing system and method for scanned archive images based on deep learning.
Background
With the continuing application of informatization, networking and digitization across industries, digitized management has become widely accepted. Digital archive management, however, is still advancing slowly, mainly because of the sheer volume of files and the limited capacity of archive management staff. In the electric power industry, for example, intelligent terminals have been widely adopted in specialties such as operations and maintenance, marketing, safety supervision and office work, and the proportion of unstructured data they generate has risen sharply.
The unstructured data generated by intelligent terminals consist mainly of photographed images of complex documents, and such images typically exhibit abnormalities such as skew, uneven illumination, noise interference, softened edges and geometric distortion. A large amount of cleanup and processing work is therefore required to convert the unstructured data into structured data before they can be used effectively; the workload is enormous and is at present performed mostly by hand. Faced with an ever-growing volume of photographed document images, valuable unstructured information and knowledge must be acquired quickly, effectively and correctly, so intensive research into automated techniques for cleaning up complex photographed document images and extracting their information is urgently needed.
Archive digitization is labor-intensive. Automating the scanning and standardized filing of archive images promises to let machines replace manual labor in the filing step, improve filing efficiency and standardization, and reduce the personnel and equipment investment and the environmental requirements of the processing site.
Disclosure of Invention
To address these problems in automated digital archive filing, the invention provides an automatic filing processing system and method for scanned archive images based on deep learning.
An automatic filing processing method for scanned archive images based on deep learning comprises the following steps:
step one: data preprocessing and model training
the pictures to be processed are divided into five categories: drawings, handwriting, tables, photos and others, and the document body and text lines are annotated for each category of picture; the preprocessed pictures are then trained with the Object Detection framework and a Faster R-CNN model to obtain a picture classification and document body localization model;
step two: subject identification
using the document body localization model obtained in step one, the document body and text lines are located in the input picture, and the document body is segmented out to obtain the text line portions of the document;
step three: tilt correction
pixel points of the text line portion obtained in step two are selected and a straight line is fitted through them to obtain the overall inclination angle of the document; the document body cropped out in step two is then rotated by this angle to obtain a corrected document picture;
step four: automated thresholding and image reconstruction
automatic thresholding and image reconstruction are applied to the corrected document picture obtained in step three to obtain a standardized output picture;
step five: processing of tabular picture data
a subset of table archive images is selected from the standardized output pictures of step four, and the line edges of the tables are annotated with labelme;
step six: table partitioning
the data annotated in step five are fed into a Unet network for training to obtain a table segmentation model, and the tables in input pictures are segmented according to the table segmentation model to obtain cell pictures;
step seven: text line segmentation
a CTPN model performs text line segmentation on each cell segmented in step six.
Further, the automatic thresholding in step four establishes a dynamic threshold from the local pixel distribution of the picture and applies threshold segmentation to the picture according to that dynamic threshold, so that most details of the picture are retained and loss of picture content is avoided; the image reconstruction in step four outputs the pictures in standardized form, at A4 or A3 size according to category.
Further, in step four, the thresholding part of the system implements dynamic thresholding based on local image characteristics. Let σ_xy and m_xy denote the standard deviation and mean of the set of pixels in a neighborhood S_xy centered on coordinates (x, y) of the image. The general form of the variable local threshold is:

T_xy = a·σ_xy + b·m_xy

where a and b are non-negative constants. The segmented image is then computed as:

g(x, y) = 1 if f(x, y) > T_xy, and 0 otherwise

where f(x, y) is the input image. This rule is evaluated at every pixel location, and a different threshold T_xy is computed at each point (x, y) from the pixels in its neighborhood S_xy.
Further, in step seven, the CTPN model is trained on an open-source data set, with the training pictures split into a training set and a validation set at a ratio of 99:1. The data are generated from a Chinese corpus: 5,990 distinct characters, covering Chinese characters, English letters, digits and punctuation, are rendered with random variations in font, size, gray level, blur, perspective and stretching; each sample contains a fixed 10 characters randomly excerpted from sentences in the corpus, and picture resolution is unified at 280×32.
Further, 1000 table archive images are selected in step five.
Further, in step six the annotated data are fed into the Unet for training, with 80,000 iterations.
An automatic filing processing system for scanned archive images based on deep learning comprises an image finishing system and an automatic document processing system;
the image finishing system is used for carrying out data preprocessing and model training on a picture to be processed, obtaining a document main body positioning model based on deep learning through training, identifying and carrying out inclination correction on a main body of an image document by using the document main body positioning model, and then carrying out automatic threshold processing on a text to obtain a form archive image after image finishing;
the automatic document processing system is used for segmenting the table by using a deep learning segmentation network Unet and segmenting each cell in the table by using a CTPN model to the table archive image processed by the image finishing system, and provides a basis for OCR recognition and data extraction of subsequent table data.
Further, the automatic thresholding establishes a dynamic threshold from the local pixel distribution of the picture and applies threshold segmentation to the picture according to that dynamic threshold, so that most details of the picture are retained and loss of picture content is avoided.
Further, the thresholding part of the image finishing system implements dynamic thresholding based on local image characteristics. Let σ_xy and m_xy denote the standard deviation and mean of the set of pixels in a neighborhood S_xy centered on coordinates (x, y) of the image. The general form of the variable local threshold is:

T_xy = a·σ_xy + b·m_xy

where a and b are non-negative constants. The segmented image is then computed as:

g(x, y) = 1 if f(x, y) > T_xy, and 0 otherwise

where f(x, y) is the input image. This rule is evaluated at every pixel location, and a different threshold T_xy is computed at each point (x, y) from the pixels in its neighborhood S_xy.
Further, the CTPN model is trained on an open-source data set, with the training pictures split into a training set and a validation set at a ratio of 99:1. The data are generated from a Chinese corpus: 5,990 distinct characters, covering Chinese characters, English letters, digits and punctuation, are rendered with random variations in font, size, gray level, blur, perspective and stretching; each sample contains a fixed 10 characters randomly excerpted from sentences in the corpus, and picture resolution is unified at 280×32.
The method first performs tilt correction on the document body using the document body localization model obtained by training, then applies automatic thresholding to the text, and finally performs region segmentation and text extraction. Experimental analysis shows that the overall accuracy of the system's archive image processing reaches 91.7% or higher, demonstrating the usability of the system in the field of archive digitization.
Drawings
FIG. 1 is a schematic flow chart of an automated archive processing method for scanned archive images based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a Unet network architecture for use with the present invention;
FIG. 3 is a comparison of the effect of a conventional single threshold versus a dynamic threshold;
FIG. 4 is a process effect diagram of the system of the present invention;
fig. 5 is a schematic diagram of CTPN implementation steps of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Referring to fig. 1, an embodiment of the invention provides an automatic filing processing system for scanned archive images based on deep learning, comprising an image finishing system and an automatic document processing system.
The image finishing system performs data preprocessing and model training on the pictures to be processed, obtains a deep-learning document body localization model through training, uses that model to identify the body of an image document and correct its tilt, and then applies automatic thresholding to the text to obtain table archive images after image finishing. The image finishing system divides image documents into five categories: drawings, handwriting, tables, photos and others, applies different processing according to category, and performs image reconstruction after body recognition, tilt calculation, tilt correction and thresholding to obtain the preprocessed image result; the output pictures are likewise divided into the five categories above.
the automatic document processing system is used for segmenting the table by using a deep learning segmentation network Unet and segmenting each cell in the table by using a CTPN model to the table archive image processed by the image finishing system, and provides a basis for OCR recognition and data extraction of subsequent table data.
The embodiment of the invention also provides an automatic filing processing method for scanned archive images based on deep learning, comprising the following steps:
Step one: Data preprocessing and model training
Experiments show that no single processing mode can meet the processing requirements of all pictures, so the pictures to be processed are first preprocessed. Specifically, they are divided into five categories: drawings, handwriting, tables, photos and others, and the document body and text lines are annotated for each category of picture; the preprocessed pictures are then trained with the Object Detection framework and a Faster R-CNN model to obtain a picture classification and document body localization model.
Step two: subject identification
Using the document body localization model obtained in step one, the document body and text lines are located in the input picture, and the document body is segmented out to obtain the text line portions of the document.
Step three: tilt correction
Pixel points of the text line portion obtained in step two are selected and a straight line is fitted through them to obtain the overall inclination angle of the document; the document body cropped out in step two is then rotated by this angle to obtain a corrected document picture.
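As an illustration of this step, the following is a minimal sketch using OpenCV and NumPy. The function names, the least-squares line fit via cv2.fitLine, and the white border fill are our assumptions; the patent specifies neither the fitting routine nor the rotation implementation.

import cv2
import numpy as np

def estimate_skew_angle(text_line_mask):
    # Collect the foreground pixel coordinates of the detected text line
    ys, xs = np.nonzero(text_line_mask)
    pts = np.column_stack((xs, ys)).astype(np.float32)
    # Fit a straight line through the pixels; return its angle in degrees
    vx, vy, _, _ = cv2.fitLine(pts, cv2.DIST_L2, 0, 0.01, 0.01).ravel()
    return float(np.degrees(np.arctan2(vy, vx)))

def deskew(document_body, angle_deg):
    # Rotate the cropped document body about its center by the fitted angle
    h, w = document_body.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(document_body, M, (w, h),
                          flags=cv2.INTER_LINEAR,
                          borderValue=(255, 255, 255))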
Step four: automated thresholding and image reconstruction
Automatic thresholding and image reconstruction are applied to the corrected document picture obtained in step three. The automatic thresholding establishes a dynamic threshold from the local pixel distribution of the picture and applies threshold segmentation according to it, so that most details of the picture are retained and loss of picture content is avoided; the image reconstruction outputs the normalized picture at A4 or A3 size according to its category.
During image processing it is desirable to retain as much of the valuable information in the archive image as possible while eliminating noise and similar defects, so the thresholding part of the system implements dynamic thresholding based on local image characteristics. Compared with a traditional single (global) threshold, dynamic thresholding segments unevenly illuminated images better. Let σ_xy and m_xy denote the standard deviation and mean of the set of pixels in a neighborhood S_xy centered on coordinates (x, y) of the image. The general form of the variable local threshold is:

T_xy = a·σ_xy + b·m_xy

where a and b are non-negative constants. The segmented image is then computed as:

g(x, y) = 1 if f(x, y) > T_xy, and 0 otherwise

where f(x, y) is the input image. This rule is evaluated at every pixel location, and a different threshold T_xy is computed at each point (x, y) from the pixels in its neighborhood S_xy. Dynamic thresholding based on local image characteristics avoids loss of image information and preserves the quality of the processed archive image.
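A minimal sketch of this variable-threshold rule follows. The square neighborhood, its size, and the constants a and b are illustrative choices of ours; the patent fixes none of them.

import numpy as np
from scipy.ndimage import uniform_filter

def dynamic_threshold(f, size=15, a=0.3, b=0.8):
    # Segment image f with the variable local threshold
    # T_xy = a*sigma_xy + b*m_xy, where m_xy and sigma_xy are the mean and
    # standard deviation of the size x size neighborhood S_xy at each pixel.
    f = f.astype(np.float64)
    m = uniform_filter(f, size)                   # local mean m_xy
    m2 = uniform_filter(f * f, size)              # local mean of squares
    sigma = np.sqrt(np.maximum(m2 - m * m, 0.0))  # local std sigma_xy
    T = a * sigma + b * m                         # threshold T_xy at each pixel
    return (f > T).astype(np.uint8)               # g(x, y) = 1 where f > T_xy

With a = 0 the rule reduces to a scaled local-mean threshold; the patent leaves the choice of a and b open.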
After threshold segmentation, the image must be reconstructed to 2480 × 3508 pixels to meet the requirement of a resolution of 300 pixels/inch; when the image is smaller than 2480 × 3508, pixels are padded around it to bring it to A4 size, ensuring standardized output.
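A sketch of the padding step under the stated A4 target of 2480 × 3508 pixels at 300 pixels/inch; centering the image and filling with white are our assumptions.

import cv2

A4_W, A4_H = 2480, 3508  # A4 at 300 pixels/inch, portrait

def pad_to_a4(img):
    # Center an image smaller than A4 on a white A4-sized canvas
    h, w = img.shape[:2]
    top = max((A4_H - h) // 2, 0)
    left = max((A4_W - w) // 2, 0)
    bottom = max(A4_H - h - top, 0)
    right = max(A4_W - w - left, 0)
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(255, 255, 255))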
The results obtained with a conventional single threshold and with the dynamic threshold are compared in fig. 3. Comparing the left and right halves of the figure, characters on the right side of the left half are blurred or lost, while the right half preserves them well. The reason is that the right side of the original image is poorly lit, so the pixel values of the text region are large (the larger the pixel value, the closer the color is to white); processing the image with a single threshold therefore loses detail. With a dynamic threshold, the threshold is chosen from the local characteristics of the picture, so little detail is lost.
Step five: processing of tabular picture data
Steps one through four constitute the image finishing stage, which yields the five categories of standardized output pictures. From these, 1000 table archive images are selected and the line edges of the tables are annotated with labelme.
Step six: table partitioning
The data annotated in step five are fed into a Unet network for training, with 80,000 iterations, to obtain a table segmentation model; the tables in input pictures are then segmented according to this model to obtain cell pictures.
The embodiment of the invention uses the Unet network structure to further segment the table pictures output by image finishing. As shown in fig. 2, the Unet network consists of two parts: a contracting path on the left (down) and an expanding path on the right (up). The contracting path follows the typical structure of a convolutional network: repeated application of two 3 × 3 convolutions (unpadded), each followed by a rectified linear unit (ReLU), and a 2 × 2 max pooling operation with stride 2 for downsampling; the number of feature channels is doubled at each downsampling step.
In the expanding path, each step upsamples the feature map, then applies a 2 × 2 convolution that halves the number of feature channels, concatenates the correspondingly cropped feature map from the contracting path, and finally applies two 3 × 3 convolutions, each followed by a ReLU. The cropping is necessary because border pixels are lost in every convolution. At the last layer, a 1 × 1 convolution maps each 64-dimensional feature vector to the output layer of the network. As fig. 2 shows, the network has 23 convolutional layers in total.
The network adopts the common encoder-decoder structure and adds to it skip connections that pass information directly from the encoder into the decoder; this effectively preserves the edge detail of the original image and prevents excessive loss of edge information. Note that to ensure seamless tiling of the output segmentation map, the input picture size must be chosen carefully so that every max pooling operation is applied to a layer with even x- and y-dimensions.
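To make the block structure concrete, the following PyTorch sketch implements a reduced-depth version of the architecture just described: two unpadded 3 × 3 convolutions with ReLU per block, 2 × 2 max pooling with stride 2, transposed-convolution upsampling that halves the channels, cropped skip connections, and a final 1 × 1 convolution. It is an illustration under those assumptions, not the exact 23-layer network of fig. 2, and the input size must likewise keep every pooled dimension even.

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two repeated 3x3 unpadded convolutions, each followed by a ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3), nn.ReLU(inplace=True),
    )

def center_crop(enc_feat, target):
    # Crop the encoder feature map to the decoder map's size, compensating
    # for the border pixels lost to the unpadded convolutions
    _, _, h, w = target.shape
    _, _, H, W = enc_feat.shape
    dy, dx = (H - h) // 2, (W - w) // 2
    return enc_feat[:, :, dy:dy + h, dx:dx + w]

class MiniUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)                    # channels double per step
        self.pool = nn.MaxPool2d(2, stride=2)               # 2x2 max pooling, stride 2
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)  # upsample, halve channels
        self.dec1 = double_conv(128, 64)
        self.out = nn.Conv2d(64, n_classes, 1)              # final 1x1 convolution

    def forward(self, x):
        e1 = self.enc1(x)                                   # contracting path
        e2 = self.enc2(self.pool(e1))
        d1 = self.up(e2)                                    # expanding path
        d1 = torch.cat([center_crop(e1, d1), d1], dim=1)    # skip connection
        return self.out(self.dec1(d1))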
Step seven: text line segmentation
Text line segmentation is performed on each segmented cell with a CTPN model. The CTPN model is trained on an open-source data set of roughly 3.64 million pictures, split into a training set and a validation set at a ratio of 99:1. The data are generated from a Chinese corpus (news plus classical Chinese): 5,990 distinct characters, covering Chinese characters, English letters, digits and punctuation, are rendered with random variations in font, size, gray level, blur, perspective and stretching; each sample contains a fixed 10 characters randomly excerpted from sentences in the corpus, and picture resolution is unified at 280×32.
The CTPN pipeline comprises three stages: detecting small-scale text boxes, recurrently connecting the text boxes, and refining the text line edges. The concrete implementation steps, illustrated in fig. 5, are as follows:
1. extract features using VGG16 as the base net, taking the conv5_3 features as the feature map, of size W × H × C;
2. slide a 3 × 3 window over the feature map; each window yields a feature vector of length 3 × 3 × C, and each sliding-window center predicts offsets relative to k anchors;
3. feed the features obtained in the previous step into a bidirectional LSTM to obtain an output of length W × 256, followed by a 512-unit fully connected layer to prepare the outputs;
4. the output layer has three main outputs: first, 2k vertical coordinates, expressed as offsets relative to the anchors, since each anchor is described by two values, the y-coordinate of its center and the height of its box; second, 2k scores, a text and a non-text score for each of the k predicted proposals; third, k side-refinements, which represent the horizontal offset of each proposal and are mainly used to refine the two end points of a text line;
5. filter out redundant text proposals using a standard non-maximum suppression algorithm;
6. finally, merge the resulting text segments into text lines using a graph-based text line construction algorithm.
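The network portion of this pipeline (steps 1 through 4) can be sketched in PyTorch as follows. The layer sizes follow the description above, while k = 10 anchors, the module names, and the use of torchvision's VGG16 are our assumptions.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class CTPNHead(nn.Module):
    def __init__(self, k=10):                            # k anchors per position
        super().__init__()
        self.base = vgg16(weights=None).features[:-1]    # VGG16 up to conv5_3 (C = 512)
        self.window = nn.Conv2d(512, 512, 3, padding=1)  # 3x3 sliding window on features
        self.rnn = nn.LSTM(512, 128, bidirectional=True,
                           batch_first=True)             # bidirectional LSTM, 2 x 128 = 256
        self.fc = nn.Linear(256, 512)                    # 512-unit fully connected layer
        self.coords = nn.Linear(512, 2 * k)              # 2k vertical coordinates (offsets)
        self.scores = nn.Linear(512, 2 * k)              # 2k text / non-text scores
        self.sides = nn.Linear(512, k)                   # k side-refinement offsets

    def forward(self, img):
        f = self.window(self.base(img))                  # B x 512 x H x W feature map
        B, C, H, W = f.shape
        seq = f.permute(0, 2, 3, 1).reshape(B * H, W, C) # one sequence per feature-map row
        h, _ = self.rnn(seq)                             # (B*H) x W x 256
        h = torch.relu(self.fc(h))                       # (B*H) x W x 512
        return self.coords(h), self.scores(h), self.sides(h)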
Through these steps, the text lines in the cells are segmented and recognized, providing a basis for OCR recognition of the text lines and extraction of the data. Table segmentation is required because, without it, global OCR applied directly to the document image cannot digitally restore a structured document such as a table; for example, when the table document is reconstructed as an Excel sheet, the correspondence between its cells is lost. Text line segmentation within the table cells, in turn, mainly improves the accuracy of text recognition.
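To illustrate why the cell correspondence matters, the following sketch (our own, using openpyxl; the cell bounding boxes and recognized texts are assumed inputs from the segmentation and OCR stages) rebuilds a spreadsheet by sorting cell boxes into rows and columns rather than discarding their positions.

from openpyxl import Workbook

def cells_to_excel(cells, path, row_tol=15):
    # cells: list of ((x, y, w, h), text) for each segmented table cell.
    # Group the boxes into rows by y-coordinate, then order each row by x,
    # so the cell correspondence of the table is preserved.
    cells = sorted(cells, key=lambda c: (c[0][1], c[0][0]))
    rows, current, last_y = [], [], None
    for (x, y, w, h), text in cells:
        if last_y is not None and abs(y - last_y) > row_tol:
            rows.append(current)
            current = []
        current.append(((x, y), text))
        last_y = y
    if current:
        rows.append(current)

    wb = Workbook()
    ws = wb.active
    for r, row in enumerate(rows, start=1):
        for c, (_, text) in enumerate(sorted(row, key=lambda t: t[0][0]), start=1):
            ws.cell(row=r, column=c, value=text)
    wb.save(path)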
Results and analysis of the experiments
Experimental setup:
the data used in the embodiment of the invention is real data extracted from paper archives of a company in Wuhan City in nearly 30 years. Because of the large number of files, the samples with high frequency and strong characteristics in the files are selected. And finally, 6779 samples are obtained by selection, and the ratio of the training sample to the verification sample is 1: 1.
The system involves two parts: the first is image finishing and the second is automatic document processing. The model training processes of the two parts are independent of each other, and the output of the first part serves as the input of the second. The detailed experimental procedures for the two parts follow.
For the image finishing part, the 6779 archive samples selected for the experiment were annotated for text localization features with the LabelImg tool to produce a VOC2007-format data set. The features fall into two classes: one class for text body localization and the other for text line localization. For the training parameters, batch_size was set to 1; the learning rate was initialized to 0.002, dropped automatically to 0.00002 after 90,000 iterations and to 0.000002 after 120,000 iterations; pooling was set to 2 × 2; the number of iterations was set to 200,000; the remaining parameters were left at their defaults. The accuracy of the trained model was then evaluated: if the recognition success rate of the first class is a1 and that of the second class is b1, the overall recognition success rate is a1 × b1. By analyzing the success rates and related information, weaknesses of the weight model were identified and the data set and some parameters were adjusted until the overall accuracy reached about 90%. The learning-rate schedule is sketched after this paragraph.
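The stepped learning-rate schedule described above amounts to the following rule; this is a sketch, and the framework hook that applies it is not specified in the patent.

def learning_rate(iteration):
    # 0.002 initially, 0.00002 beyond 90,000 iterations,
    # 0.000002 beyond 120,000 iterations
    if iteration > 120000:
        return 0.000002
    if iteration > 90000:
        return 0.00002
    return 0.002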
For the automatic document processing part, 1000 table archive images were selected and the line edges of the tables annotated with labelme. These were fed into the Unet for training, with 80,000 iterations. After the tables were segmented by the Unet, text line segmentation was applied to each segmented table cell with the CTPN model. The CTPN model was trained on an open-source data set of roughly 3.64 million pictures, split into a training set and a validation set at a ratio of 99:1. The data are generated from a Chinese corpus (news plus classical Chinese): 5,990 distinct characters, covering Chinese characters, English letters, digits and punctuation, are rendered with random variations in font, size, gray level, blur, perspective and stretching; each sample contains a fixed 10 characters randomly excerpted from sentences in the corpus, and picture resolution is unified at 280×32.
Results and analysis:
table 1 below is the training effect of the image finishing,
TABLE 1 image grooming training results
(table reproduced as an image in the original publication)
A test set of 100 pictures that were not used in training was used to verify the Unet table segmentation; the results are shown in Table 2. Of the 100 pictures, 50 had been preprocessed by image finishing and 50 had not.
Table 2 Unet table segmentation evaluation
(table reproduced as an image in the original publication)
For CTPN, the invention also replaces the original CNN backbone VGG-16 [8] with a DenseNet [9] network; the comparison is shown in Table 3.
TABLE 3 CTPN text line segmentation evaluation
(table reproduced as an image in the original publication)
The system processing effect is shown in fig. 4.
Analysis of the experimental results shows that the overall accuracy of image finishing reaches 96.5%. Some errors remain because certain input images have weak features, such as blur or overly dense type, which prevent text lines from being extracted effectively and degrade the processing result. The Unet table segmentation performed well on the 100 test pictures: segmentation errors do occur, but with low probability, and they do not greatly affect the subsequent data extraction. The experiment also compared the processing speed and accuracy of the two CNN backbones, VGG-16 and DenseNet, within CTPN; both are sufficiently accurate, and DenseNet achieves higher accuracy at the cost of time, so the choice between them can be made according to specific requirements.
Through the above work, pictures with irregularities such as skew, uneven illumination and noise interference can be finished into the standardized pictures specified for digitized paper archives; on that basis, the automatic processing scheme of the invention can segment and label table archive pictures, providing a well-formed picture data format for extracting data from designated positions of table archives.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An automatic filing processing method for scanned archive images based on deep learning, characterized in that the method comprises the following steps:
step one: data preprocessing and model training
the pictures to be processed are divided into five categories: drawings, handwriting, tables, photos and others, and the document body and text lines are annotated for each category of picture; the preprocessed pictures are then trained with the Object Detection framework and a Faster R-CNN model to obtain a picture classification and document body localization model;
step two: subject identification
using the document body localization model obtained in step one, the document body and text lines are located in the input picture, and the document body is segmented out to obtain the text line portions of the document;
step three: tilt correction
pixel points of the text line portion obtained in step two are selected and a straight line is fitted through them to obtain the overall inclination angle of the document; the document body cropped out in step two is then rotated by this angle to obtain a corrected document picture;
step four: automated thresholding and image reconstruction
automatic thresholding and image reconstruction are applied to the corrected document picture obtained in step three to obtain a standardized output picture;
step five: processing of tabular picture data
a subset of table archive images is selected from the standardized output pictures of step four, and the line edges of the tables are annotated with labelme;
step six: table partitioning
the data annotated in step five are fed into a Unet network for training to obtain a table segmentation model, and the tables in input pictures are segmented according to the table segmentation model to obtain cell pictures;
step seven: text line segmentation
a CTPN model performs text line segmentation on each cell segmented in step six.
2. The automatic filing processing method for scanned archive images based on deep learning according to claim 1, characterized in that: the automatic thresholding in step four establishes a dynamic threshold from the local pixel distribution of the picture and applies threshold segmentation to the picture according to that dynamic threshold, so that most details of the picture are retained and loss of picture content is avoided; the image reconstruction in step four outputs the pictures in standardized form, at A4 or A3 size according to category.
3. The automatic filing processing method for scanned archive images based on deep learning according to claim 1, characterized in that: in step four, the thresholding part of the system implements dynamic thresholding based on local image characteristics. Let σ_xy and m_xy denote the standard deviation and mean of the set of pixels in a neighborhood S_xy centered on coordinates (x, y) of the image. The general form of the variable local threshold is:

T_xy = a·σ_xy + b·m_xy

where a and b are non-negative constants. The segmented image is then computed as:

g(x, y) = 1 if f(x, y) > T_xy, and 0 otherwise

where f(x, y) is the input image. This rule is evaluated at every pixel location, and a different threshold T_xy is computed at each point (x, y) from the pixels in its neighborhood S_xy.
4. The automatic filing processing method for scanned archive images based on deep learning according to claim 1, characterized in that: in step seven, the CTPN model is trained on an open-source data set, with the training pictures split into a training set and a validation set at a ratio of 99:1; the data are generated from a Chinese corpus: 5,990 distinct characters, covering Chinese characters, English letters, digits and punctuation, are rendered with random variations in font, size, gray level, blur, perspective and stretching; each sample contains a fixed 10 characters randomly excerpted from sentences in the corpus, and picture resolution is unified at 280×32.
5. The automatic filing processing method for scanned archive images based on deep learning according to claim 1, characterized in that: in step five, 1000 table archive images are selected.
6. The automatic filing processing method for scanned archive images based on deep learning according to claim 1, characterized in that: in step six, the annotated data are fed into the Unet for training, with 80,000 iterations.
7. An automatic filing processing system for scanned archive images based on deep learning, characterized in that it comprises an image finishing system and an automatic document processing system;
the image finishing system is used for carrying out data preprocessing and model training on a picture to be processed, obtaining a document main body positioning model based on deep learning through training, identifying and carrying out inclination correction on a main body of an image document by using the document main body positioning model, and then carrying out automatic threshold processing on a text to obtain a form archive image after image finishing;
the automatic document processing system is used for segmenting the table by using a deep learning segmentation network Unet and segmenting each cell in the table by using a CTPN model to the table archive image processed by the image finishing system, and provides a basis for OCR recognition and data extraction of subsequent table data.
8. The automatic filing processing system for scanned archive images based on deep learning according to claim 7, characterized in that: the automatic thresholding establishes a dynamic threshold from the local pixel distribution of the picture and applies threshold segmentation to the picture according to that dynamic threshold, so that most details of the picture are retained and loss of picture content is avoided.
9. The automatic filing processing system for scanned archive images based on deep learning according to claim 8, characterized in that: the thresholding part of the image finishing system implements dynamic thresholding based on local image characteristics. Let σ_xy and m_xy denote the standard deviation and mean of the set of pixels in a neighborhood S_xy centered on coordinates (x, y) of the image. The general form of the variable local threshold is:

T_xy = a·σ_xy + b·m_xy

where a and b are non-negative constants. The segmented image is then computed as:

g(x, y) = 1 if f(x, y) > T_xy, and 0 otherwise

where f(x, y) is the input image. This rule is evaluated at every pixel location, and a different threshold T_xy is computed at each point (x, y) from the pixels in its neighborhood S_xy.
10. The automatic filing processing system for scanned archive images based on deep learning according to claim 7, characterized in that: the CTPN model is trained on an open-source data set, with the training pictures split into a training set and a validation set at a ratio of 99:1; the data are generated from a Chinese corpus: 5,990 distinct characters, covering Chinese characters, English letters, digits and punctuation, are rendered with random variations in font, size, gray level, blur, perspective and stretching; each sample contains a fixed 10 characters randomly excerpted from sentences in the corpus, and picture resolution is unified at 280×32.
CN202110230772.4A 2021-03-02 2021-03-02 Automatic filing processing system and method for scanned archive image based on deep learning Pending CN113065396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110230772.4A CN113065396A (en) 2021-03-02 2021-03-02 Automatic filing processing system and method for scanned archive image based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110230772.4A CN113065396A (en) 2021-03-02 2021-03-02 Automatic filing processing system and method for scanned archive image based on deep learning

Publications (1)

Publication Number Publication Date
CN113065396A true CN113065396A (en) 2021-07-02

Family

ID=76559534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110230772.4A Pending CN113065396A (en) 2021-03-02 2021-03-02 Automatic filing processing system and method for scanned archive image based on deep learning

Country Status (1)

Country Link
CN (1) CN113065396A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642401A (en) * 2021-07-13 2021-11-12 常州微亿智造科技有限公司 Document line segmentation and classification method and system based on deep learning network
CN113793264A (en) * 2021-09-07 2021-12-14 北京航星永志科技有限公司 Archive image processing method and system based on convolution model and electronic equipment
CN114141108A (en) * 2021-12-03 2022-03-04 中国科学技术大学 Blind-aiding voice-aided reading equipment and method
CN116433494A (en) * 2023-04-19 2023-07-14 南通大学 File scanning image automatic correction and trimming method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163198A (en) * 2018-09-27 2019-08-23 腾讯科技(深圳)有限公司 A kind of Table recognition method for reconstructing, device and storage medium
CN110348294A (en) * 2019-05-30 2019-10-18 平安科技(深圳)有限公司 The localization method of chart, device and computer equipment in PDF document
CN110363102A (en) * 2019-06-24 2019-10-22 北京融汇金信信息技术有限公司 A kind of identification of objects process method and device of pdf document
CN111027297A (en) * 2019-12-23 2020-04-17 海南港澳资讯产业股份有限公司 Method for processing key form information of image type PDF financial data
CN112052853A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Text positioning method of handwritten meteorological archive data based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163198A (en) * 2018-09-27 2019-08-23 腾讯科技(深圳)有限公司 A kind of Table recognition method for reconstructing, device and storage medium
CN110348294A (en) * 2019-05-30 2019-10-18 平安科技(深圳)有限公司 The localization method of chart, device and computer equipment in PDF document
CN110363102A (en) * 2019-06-24 2019-10-22 北京融汇金信信息技术有限公司 A kind of identification of objects process method and device of pdf document
CN111027297A (en) * 2019-12-23 2020-04-17 海南港澳资讯产业股份有限公司 Method for processing key form information of image type PDF financial data
CN112052853A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Text positioning method of handwritten meteorological archive data based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
孙婧婧 et al.: "Text detection in natural scenes based on a lightweight network", 《电子测量技术》 (Electronic Measurement Technology) *
应自炉 et al.: "Document image layout analysis with multi-feature fusion", 《中国图象图形学报》 (Journal of Image and Graphics) *
盛业华: "Mathematical Morphology Theory and Map Scanning Recognition Technology" (《数学形态学理论与地图扫描识别技术》), 30 June 1999, China University of Mining and Technology Press *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642401A (en) * 2021-07-13 2021-11-12 常州微亿智造科技有限公司 Document line segmentation and classification method and system based on deep learning network
CN113793264A (en) * 2021-09-07 2021-12-14 北京航星永志科技有限公司 Archive image processing method and system based on convolution model and electronic equipment
CN113793264B (en) * 2021-09-07 2022-11-15 北京航星永志科技有限公司 Archive image processing method and system based on convolution model and electronic equipment
CN114141108A (en) * 2021-12-03 2022-03-04 中国科学技术大学 Blind-aiding voice-aided reading equipment and method
CN116433494A (en) * 2023-04-19 2023-07-14 南通大学 File scanning image automatic correction and trimming method based on deep learning
CN116433494B (en) * 2023-04-19 2024-02-02 南通大学 File scanning image automatic correction and trimming method based on deep learning

Similar Documents

Publication Publication Date Title
CN110210413B (en) Multidisciplinary test paper content detection and identification system and method based on deep learning
CN109241894B (en) Bill content identification system and method based on form positioning and deep learning
CN111814722B (en) Method and device for identifying table in image, electronic equipment and storage medium
CN113065396A (en) Automatic filing processing system and method for scanned archive image based on deep learning
CN111401372A (en) Method for extracting and identifying image-text information of scanned document
CN112836650B (en) Semantic analysis method and system for quality inspection report scanning image table
CN110647885B (en) Test paper splitting method, device, equipment and medium based on picture identification
CN110647795A (en) Form recognition method
CN111626292B (en) Text recognition method of building indication mark based on deep learning technology
CN112052852A (en) Character recognition method of handwritten meteorological archive data based on deep learning
CN112364834A (en) Form identification restoration method based on deep learning and image processing
CN109741273A (en) A kind of mobile phone photograph low-quality images automatically process and methods of marking
CN115331245A (en) Table structure identification method based on image instance segmentation
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN111652117A (en) Method and medium for segmenting multi-document image
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN112200789B (en) Image recognition method and device, electronic equipment and storage medium
CN113139535A (en) OCR document recognition method
CN112509026A (en) Insulator crack length identification method
JP5211449B2 (en) Program, apparatus and method for adjusting recognition distance, and program for recognizing character string
CN115171133A (en) Table structure detection method for leveling irregular table image
CN115880566A (en) Intelligent marking system based on visual analysis
CN114565749A (en) Method and system for identifying key content of visa document of power construction site
CN114066861A (en) Coal and gangue identification method based on cross algorithm edge detection theory and visual features
CN112565549A (en) Book image scanning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination