CN116433494A - File scanning image automatic correction and trimming method based on deep learning - Google Patents

File scanning image automatic correction and trimming method based on deep learning

Info

Publication number
CN116433494A
CN116433494A (application CN202310420534.9A)
Authority
CN
China
Prior art keywords
image
model
trimming
module
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310420534.9A
Other languages
Chinese (zh)
Other versions
CN116433494B (en)
Inventor
孙强
吉红慧
陈逸彬
蒋行健
曹张华
邵蔚
黄勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University
Priority to CN202310420534.9A
Publication of CN116433494A
Application granted
Publication of CN116433494B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/60: Rotation of whole images or parts thereof
    • G06T 3/608: Rotation of whole images or parts thereof by skew deformation, e.g. two-pass or three-pass rotation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046: Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/90: Dynamic range modification of images or parts thereof
    • G06T 5/94: Dynamic range modification of images or parts thereof based on local image properties, e.g. for local contrast enhancement
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/11: Region-based segmentation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10: Character recognition
    • G06V 30/14: Image acquisition
    • G06V 30/146: Aligning or centring of the image pick-up or image-field
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20084: Artificial neural networks [ANN]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of automatic correction and trimming of file scan images, and in particular to a deep-learning-based method for automatically correcting and trimming file scan images, comprising the following steps: preprocessing the file scan images; feeding the processed image data set into an angle-correction and edge-trimming network model for training; extracting features from the images; performing automatic correction and trimming on the images; and processing file scan images with the trained model and outputting the automatically corrected and trimmed results. The model comprises a feature extraction module, a deviation-rectifying module and a trimming module. An adaptive convolution module and a channel attention module are added after the deviation-rectifying module and the trimming module respectively, so that small angular offsets can be handled accurately and the edge blurriness of the image can be reduced. The method increases the computation speed of the model, makes the model more lightweight, improves the efficiency of correcting and trimming file scan images, and accurately handles images with small angular offsets.

Description

File scanning image automatic correction and trimming method based on deep learning
Technical Field
The invention relates to the technical field of automatic correction and trimming of file scan images, and in particular to a deep-learning-based method for automatically correcting and trimming file scan images.
Background
In modern society, digitization has become the dominant mode of information processing, and digitizing all kinds of documents is increasingly common. When files are scanned, improper scanner placement, unevenly placed documents and similar causes lead to distortion, tilt, missing corners and other defects in the scanned image. These problems seriously affect subsequent document recognition, classification, retrieval and browsing. Automatic correction and trimming of file scan images has therefore become an important link in digital archive processing.
At present, many methods exist for document correction and trimming. Traditional corner-detection methods exploit the geometric features of the image to correct skew, but can produce large errors when the document image contains noise, shadows and the like. Edge-detection methods likewise use image edge information for correction. In recent years, deep learning has markedly improved document deskewing, and correction methods based on convolutional neural networks (CNNs) are widely applied. For example, one such method performs adaptive document correction by building a CNN model that learns the structural information of the document, but corrects inaccurately when the distortion angle is small. Another CNN-based method corrects and crops documents by learning their feature points and boundary information, but handles documents with redundant borders poorly.
Although existing document-correction methods achieve some success, it remains challenging to obtain accurate results on file scan images that are not warped, have no missing corners, and are offset by only a very small angle or hardly at all, yet still need their redundant borders removed. The rotation angles of such document images are so small that conventional rotation-angle detection struggles to detect them, and the border information of the images is unclear. There is therefore a need for a deep-learning-based automatic correction and trimming method that achieves high-precision correction and trimming on such file scan images, copes with difficult cases such as small angles and unclear border information, and improves the efficiency and precision of digital archive processing. The invention aims to overcome these defects and limitations of the prior art and to improve the efficiency and accuracy of automatic correction and trimming of file scan images.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a deep-learning-based method for automatically correcting and trimming file scan images, which achieves higher-precision automatic correction and trimming, copes with difficult cases such as small angles and unclear border information, and improves the efficiency and precision of digital archive processing.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a file scanning image automatic correction and trimming method based on deep learning comprises the following steps:
S11, preprocessing an input file scan image, including: 1) image edge cropping, 2) image resizing, 3) data set synthesis, and 4) data set grouping, so that the data set meets the model's input requirements;
S12, feeding the processed file scan image data set into the angle-correction and edge-trimming network model for training; the model extracts features from the input images;
S13, using the ACMCN model to automatically detect the edges in the image and correct the skew according to the edge positions, and to automatically detect the content in the image and trim it according to the content position, removing redundant border regions;
S14, judging whether the processed image meets the set correction and trimming requirements, namely that the average angular deviation of all straight lines in the processed file scan image lies between 0° and 1° and that the edge blurriness of all edges lies in [0, 0.1]; if the requirements are met, outputting the processed image; if not, iterating the correction and trimming until the requirements are met and then outputting the processed image;
S15, processing file scan images with the trained model and outputting the automatically corrected and trimmed file scan images.
Preferably, in step S11, said 1): image edge cropping means first reading n original pictures that are already roughly cropped and deflected by 0°, any picture not at 0° being manually corrected to 0°; a further cropping pass then removes 60 pixels from the top, bottom, left and right of each image to cut away the mottled noise at the document edges, generating a new image data set;
said 2): image resizing means setting the height of every image in the new data set to 480 pixels and scaling the width proportionally, so that image sizes are unified, generating a new image data set;
said 3): data set synthesis means taking 75% of the new image data set for vertical flipping and 75% for horizontal flipping, selecting 50% of the images for rotation by an arbitrary angle in [-90°, 90°], then applying compression enhancement to all images with the post-compression image quality bounded below by a JFIF quality of 30 and above by 80, and selecting 70% of the images for random shadow enhancement, with the shadow region allowed anywhere in the image and the number of shadows between 0 and 1; 50% of the pictures then receive random brightness and contrast enhancement, with the contrast adjustment range set to 0.1-0.34 and the image brightness reduced by 50%; finally, z background pictures common to archival scans are randomly composited with the n processed images, generating the final data set;
said 4): data set grouping means dividing the final data set into groups of 6 pictures, taking 1 picture from each group as the validation set and the remaining 5 as the training set, thereby splitting the final data set into a training set and a validation set.
Preferably, in step S12, the angle-correction and edge-trimming network model comprises: a) a feature extraction module, b) a deviation-rectifying module, and c) a trimming module, with the following architecture:
a) The feature extraction module has a network depth of 16 layers: 5 convolutional layers, 3 fully connected layers and 8 nonlinear activation layers, each nonlinear activation layer using the ReLU activation function; a dropout layer follows each fully connected layer to prevent overfitting, with dropout rates of 0.4, 0.3 and 0.25 respectively, and a softmax activation function performs the final classification;
b) The deviation-rectifying module has a network depth of 9 layers: 4 convolutional layers, 4 pooling layers and 1 fully connected layer; each convolutional layer uses 3×3 kernels, with 32, 64, 128 and 256 kernels in turn; each pooling layer uses a 2×2 pooling window; all convolutional and pooling layers use the ReLU activation function; and the fully connected layer has 1 neuron with a sigmoid activation function;
c) The trimming module has a network depth of 4 layers: 3 convolutional layers and 1 fully connected stage; each convolutional layer uses 3×3 kernels, with 32, 64 and 128 kernels in turn; in the fully connected stage, the first hidden layer has 256 neurons and the second 128. In addition, an adaptive convolution module and a channel attention module are added after the deviation-rectifying module and the trimming module. The adaptive convolution module comprises one one-dimensional convolutional layer and two adaptive convolutional layers: the one-dimensional convolutional layer has 1 output channel; the first adaptive convolutional layer has 1 input channel and 2 output channels; and the second has 2 input channels and 1 output channel. The size and shape of an adaptive convolutional layer's kernel are generated dynamically and adapt to the size and shape of the input feature map. The channel attention module comprises a global average pooling layer, two fully connected layers and a sigmoid activation layer: the global average pooling layer average-pools the input feature map along the channel dimension to obtain a one-dimensional feature map; the two fully connected layers respectively compress and expand the input along the channel dimension to obtain two one-dimensional feature maps; the sigmoid activation layer adds the two one-dimensional feature maps and normalizes them with a sigmoid function to obtain a channel attention weight vector; and finally the weight vector is multiplied channel by channel with the input feature map to obtain the weighted feature map.
Preferably, in step S15, the trained model means that the training set is fed into the ACMCN model for training, with num_epochs set to 100, i.e. the model traverses the whole training set 100 times, BATCH_size set to 16, i.e. the model trains on 16 samples at a time, and num_classes set to 2, i.e. the model performs binary classification; the performance of the trained model in step S15 is evaluated with an IOU function and a Loss function, where the IOU measures the performance of object detection and semantic segmentation, its value lies between 0 and 1, and the larger the value, the more accurate the prediction, i.e. the better the performance; the formula is expressed as follows:
IOU = Area(A∩B) / Area(A∪B) = (w_overlap × h_overlap) / (w_A × h_A + w_B × h_B - w_overlap × h_overlap)
where Area(A∩B) denotes the intersection area of A and B, Area(A∪B) denotes their union area, w_overlap and h_overlap denote the width and height of the overlap between A and B, and w_A, h_A, w_B and h_B denote the widths and heights of A and B respectively; A represents the predicted correction-and-trimming result for the file scan image, and B represents the actual correction-and-trimming result;
Loss is the loss function: during training, the ACMCN model makes predictions from the input data, and the difference between the predicted result and the true result is the value of the loss function; the smaller the Loss value, the closer the model's prediction is to the truth, i.e. the better the performance; the formula is expressed as follows:
L_i = -[y_i · log(p_i) + (1 - y_i) · log(1 - p_i)], Loss = (1/N) · Σ L_i (summing over i = 1, …, N)
where y_i is the label of the i-th file scan image's prediction (a file scan image needing correction and trimming is labelled 1, one that does not is labelled 0), p_i is the predicted probability that the i-th file scan image needs correction and trimming, L_i is the value of the cross-entropy loss function for the i-th file scan image, and N is the total number of training samples;
whether the model has finished training is measured by the IOU and the Loss: when the IOU reaches 0.5 and the Loss falls below 0.1, the target is considered correctly detected.
Compared with the prior art, the invention has the following beneficial effects:
1. In contrast to traditional document-deskewing data sets, which feature heavy distortion, missing corners and large tilt angles, the data set of the invention effectively reduces the complexity of model processing, increases computation speed, makes the model more lightweight, and improves the efficiency of correcting and trimming file scan images.
2. Traditional document correction and trimming methods perform poorly on file scan images that are not warped, have no missing corners, and are offset by a small angle or hardly at all, yet need their redundant borders removed. The invention therefore proposes the ACMCN model, adding an adaptive convolution module and a channel attention module after the deviation-rectifying module; the size and shape of the adaptive convolutional layer's kernel are generated dynamically and adapt to the size and shape of the input feature map. The channel attention module weights the channels of the input feature map, further improving feature expressiveness and discriminability. Small angular offsets of scanned documents can thus be handled accurately, improving the correction result.
3. The ACMCN model of the invention also adds an adaptive convolution module and a channel attention module after the trimming module, letting the trimming module detect the edges of the file scan image more sharply, reducing edge blurriness, and effectively improving the model's discrimination ability and robustness with respect to the input image.
4. The invention further adds a judgment mechanism after ACMCN processing: if the processed image meets the set requirements, it is output; otherwise the correction and trimming are iterated until the requirements are met, and the processed image is then output. This mechanism greatly strengthens the model's handling of file scan images.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of data preprocessing in the present invention;
FIG. 3 is a flow chart of ACMCN model training in the present invention;
FIG. 4 is a block diagram of a feature extraction module of the ACMCN model of the present invention;
FIG. 5 is a block diagram of a correction module of the ACMCN model according to the present invention;
fig. 6 is a block diagram of a trimming module of the ACMCN model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings, so that those skilled in the art can better understand the advantages and features of the invention and its scope of protection is more clearly defined. The described embodiments are only some, not all, embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the invention.
As shown in FIG. 1, the method for automatically rectifying and trimming the file scan image based on deep learning comprises the following steps:
S11, preprocessing an input file scan image, including: 1) image edge cropping, 2) image resizing, 3) data set synthesis, and 4) data set grouping, so that the data set meets the model's input requirements;
S12, feeding the processed file scan image data set into the angle-correction and edge-trimming network model for training; the model extracts features from the input images;
S13, using the ACMCN model to automatically detect the edges in the image and correct the skew according to the edge positions, and to automatically detect the content in the image and trim it according to the content position, removing redundant border regions;
S14, judging whether the processed image meets the set correction and trimming requirements, namely that the average angular deviation of all straight lines in the processed file scan image lies between 0° and 1° and that the edge blurriness of all edges lies in [0, 0.1]; if the requirements are met, outputting the processed image; if not, iterating the correction and trimming until the requirements are met and then outputting the processed image;
The average angular deviation of all straight lines in the processed file scan image refers to the deviation of each straight line from the expected angle: the absolute values of all deviations are summed and divided by the number of straight lines to give the average deviation. The calculation formula is as follows:
θ_avg = (1/l) · Σ |θ_k - θ_0| (summing over k = 1, …, l)
where l is the number of straight lines, θ_k is the angle of the k-th straight line, and θ_0 is the expected angle, here set to 0.
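For illustration, this formula translates directly into a few lines of Python; the line angles themselves are assumed to come from an upstream line detector (e.g. a Hough transform), which the patent does not specify:
```python
import numpy as np

def average_angle_deviation(line_angles_deg, expected_deg=0.0):
    """Mean absolute deviation of detected line angles from the expected
    angle theta_0, per the formula above: (1/l) * sum(|theta_k - theta_0|)."""
    angles = np.asarray(line_angles_deg, dtype=np.float64)
    return float(np.abs(angles - expected_deg).mean())

# lines detected at 0.3, -0.5 and 0.8 degrees -> about 0.53 degrees,
# which satisfies the 0-1 degree requirement of step S14
print(average_angle_deviation([0.3, -0.5, 0.8]))
```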
The requirement that the edge blurriness of all edges lie in [0, 0.1] refers to the degree of edge blur in image processing, which describes the sharpness of image edges: smaller values indicate sharper edges and larger values indicate blurrier edges. The invention uses the Sobel operator; the specific calculation formula is as follows:
Blurry = (1/m) · Σ |I(i,j) - I_avg| (summing over i = 0, …, u and j = 0, …, v)
where I(i,j) denotes the pixel value in row i, column j of the (Sobel-filtered) image, I_avg denotes the average pixel value of the whole image, and m denotes the total number of pixels in the image. The summation accumulates the blur contribution of every one of the m pixels. To avoid negative values, pixel values are normalized to the range [0, 1].
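A minimal sketch of this measure under the interpretation above, i.e. the mean absolute deviation of the normalized Sobel gradient magnitude; the use of OpenCV and the combination of the two Sobel directions into a gradient magnitude are assumptions, not details given in the patent:
```python
import cv2
import numpy as np

def edge_blurriness(gray: np.ndarray) -> float:
    """Blur measure sketched from the formula above: Sobel gradient
    magnitude, normalized to [0, 1], then mean absolute deviation from
    the image-wide average response."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag /= mag.max() + 1e-12  # normalize to [0, 1] to avoid negative values
    return float(np.abs(mag - mag.mean()).mean())

# an image passes the S14 check when this value falls inside [0, 0.1]
```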
S15, processing the file scanning image by using the model obtained through training and outputting the file scanning image after automatic correction and trimming processing.
Specifically, as shown in FIG. 2, the steps for preprocessing an input file scan image are as follows:
s21, the image edge cutting means that 8000 original pictures which are basically cut and are deflected by 0 degree are read, and if the original pictures are not 0 degree, the manual deviation correction is 0 degree. Then, a next round of cropping operation is performed, 60 pixels are cropped out from the image up, down, left and right, so that the variegation of the document edge is cropped out, and a new image data set is generated.
S22, adjusting the image size refers to setting the height pixels of all images of the new image data set to 480dpi, and adjusting the width pixels of the images according to the corresponding proportion, so that the image size is unified, and the new image data set is generated.
S23, synthesizing a data set, namely respectively taking 75% of a new image data set to perform vertical overturning and horizontal overturning, simultaneously selecting 50% of images to perform rotating operation, taking any value between intervals of-90 degrees and 90 degrees, then performing compression enhancement on all the images, designating the lower limit of the quality of the compressed images as 30JFIF, the upper limit as 80JFIF, and selecting 70% of the images to perform random shadow enhancement, wherein the possible occurrence area of designated shadows is the whole image, the lower limit of the number of shadows is 0, and the upper limit is 1. Then, 50% of the pictures are selected for random brightness and contrast enhancement, wherein the range of the designated contrast adjustment is between 0.1 and 0.34, and the designated image brightness is reduced by 50%. And finally, randomly synthesizing the common background pictures of 100 archival scanning images and 8000 processed images, and generating a final data set.
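The S23 recipe maps naturally onto off-the-shelf augmentation libraries. The following hypothetical sketch uses albumentations (not named in the patent; parameter names follow its 1.3-era API), with the stated percentages treated as per-image probabilities:
```python
import albumentations as A

# Sketch of the S23 synthesis step; the background compositing is omitted.
augment = A.Compose([
    A.VerticalFlip(p=0.75),
    A.HorizontalFlip(p=0.75),
    A.Rotate(limit=90, p=0.5),                    # any angle in [-90, 90] degrees
    A.ImageCompression(quality_lower=30,          # JFIF quality bounded in [30, 80]
                       quality_upper=80, p=1.0),
    A.RandomShadow(shadow_roi=(0, 0, 1, 1),       # shadows may appear anywhere
                   num_shadows_lower=0,
                   num_shadows_upper=1, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=(-0.5, -0.5),  # brightness reduced by 50%
                               contrast_limit=(0.1, 0.34), p=0.5),
])

# usage: out = augment(image=np.asarray(pil_image))["image"]
```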
In S23, applying compression enhancement to all images with the post-compression quality bounded below by 30 and above by 80 (JFIF) refers to the JFIF format commonly used to represent JPEG images in image processing and storage. The lower and upper bounds on compressed image quality are specified through the compression quality parameter of the JFIF format: the quality parameter Q of a JPEG image is usually an integer between 0 and 100, where Q = 100 means the highest quality and lowest compression ratio and Q = 0 the lowest quality and highest compression ratio. The lower compression bound is the minimum compression ratio the JPEG encoder achieves when compressing an image, and the upper compression bound is the maximum. The formula is as follows:
Y=X/Q
where Y is the file size of the compressed file scan image, X is the file size of the original uncompressed file scan image, and Q is the compression ratio; all file scan images in the invention are JPEG images.
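As a sketch of this relationship using Pillow (an assumption; the patent does not name an encoder): note that the patent uses Q both for the JPEG/JFIF quality parameter (0-100) and for the compression ratio in Y = X / Q, and the snippet below recovers the latter as X / Y:
```python
import io
import os
from PIL import Image

def jpeg_compress(path: str, quality: int) -> bytes:
    """Re-encode a scan as JPEG at the given JFIF quality (30-80 in S23)."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

# compression ratio Q recovered from Y = X / Q:
# X = os.path.getsize("scan.jpg")
# Y = len(jpeg_compress("scan.jpg", 30))
# Q = X / Y
```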
S24, data set grouping: the final data set is divided into groups of 6 pictures; from each group 1 picture is taken as the validation set and the remaining 5 as the training set, splitting the final data set into a training set and a validation set.
Specifically, as shown in FIG. 3, the architecture of the angle-correction and edge-trimming network model is as follows:
s31, a feature extraction module: as shown in fig. 4, the network depth is 16 layers, the network contains 5 convolutional layers, 3 fully-connected layers, and 8 nonlinear active layers, each using a ReLU activation function, the network uses dropout layers after each fully-connected layer to prevent overfitting, discard rates of 0.4, 0.3, and 0.25, respectively, and final classification using a softmax activation function.
S32, a deviation rectifying module: as shown in fig. 5, the network depth is 9 layers, the network comprises 4 convolution layers, 4 pooling layers and 1 fully connected layer, wherein the convolution kernel size of each convolution layer is 3×3, the number of convolution kernels is 32, 64, 128 and 256 in turn, the pooling window size of each pooling layer is 2×2, all convolution layers and pooling layers use a ReLU activation function, and the number of neurons of the fully connected layer is 1 and uses a sigmoid activation function.
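Read literally, S32 can be sketched in PyTorch as follows; the padding, the 480×480 input resolution, the class name, and the reading of the single sigmoid output as a normalized skew-angle estimate are all assumptions the patent leaves open:
```python
import torch
import torch.nn as nn

class RectifyModule(nn.Module):
    """Deviation-rectifying branch per S32: 4x (3x3 conv + ReLU + 2x2 max
    pool) with 32/64/128/256 kernels, then one fully connected neuron
    with a sigmoid activation."""
    def __init__(self, in_channels: int = 3, input_size: int = 480):
        super().__init__()
        layers, ch = [], in_channels
        for out_ch in (32, 64, 128, 256):
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(kernel_size=2)]
            ch = out_ch
        self.features = nn.Sequential(*layers)
        feat = input_size // 16                   # four 2x2 poolings halve 4 times
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(256 * feat * feat, 1),
                                  nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))       # normalized angle estimate

# RectifyModule()(torch.randn(1, 3, 480, 480)) -> tensor of shape (1, 1)
```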
S33, a trimming module: as shown in fig. 6, the network depth is 4 layers, and the network comprises 3 convolution layers and 1 full connection layer, wherein the convolution kernel size of each convolution layer is 3×3, the number of convolution kernels is 32, 64 and 128 in sequence, and in the full connection layer, the number of neurons of the first hidden layer is 256, and the number of neurons of the second hidden layer is 128.
In addition, the invention adds an adaptive convolution module and a channel attention module after the deviation correcting module and the trimming module, wherein the adaptive convolution module comprises a one-dimensional convolution layer, two adaptive convolution layers, the number of output channels of the one-dimensional convolution layer is 1, the number of input channels of the first adaptive convolution layer is 1, the number of output channels is 2, the number of input channels of the second adaptive convolution layer is 2, and the number of output channels is 1. The size and the shape of the convolution kernel of the self-adaptive convolution layer are dynamically generated, and can be self-adaptively adjusted according to the size and the shape of the input feature map. The channel attention module comprises a global average pooling layer, two full connection layers and a sigmoid activation layer. The global average pooling layer carries out average pooling on the input feature images along the channel dimension to obtain a one-dimensional feature image, the two full-connection layers respectively compress and expand the input feature images along the channel dimension to obtain two one-dimensional feature images, the sigmoid activation layer adds the two one-dimensional feature images and normalizes the two one-dimensional feature images by using a sigmoid function to obtain a channel attention weight vector, and finally the channel attention weight vector and the input feature images are multiplied channel by channel to obtain the weighted feature images.
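A hedged PyTorch sketch of the two add-on modules follows. The channel attention implements the pattern the text describes (global average pooling, compress/expand fully connected layers, sigmoid normalization, channel-by-channel multiplication). For the adaptive convolution, only the kernel weights are generated dynamically here, from global context; the patent's dynamically varying kernel size and shape are not reproduced, and the 3×3 kernel and reduction factor are assumptions:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Global average pooling -> compressing FC -> expanding FC -> sigmoid,
    then channel-by-channel multiplication with the input feature map."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc1 = nn.Linear(channels, hidden)   # compress along channels
        self.fc2 = nn.Linear(hidden, channels)   # expand along channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = F.adaptive_avg_pool2d(x, 1).view(b, c)        # one value per channel
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))  # attention weights
        return x * w.view(b, c, 1, 1)                     # weighted feature map

class AdaptiveConv2d(nn.Module):
    """Per-sample dynamic convolution: a small generator predicts the 3x3
    kernel weights from the input's global context, and the kernels are
    applied with a grouped conv (one group per batch element)."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.gen = nn.Linear(in_ch, out_ch * in_ch * k * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        kernels = self.gen(x.mean(dim=(2, 3)))            # (B, out*in*k*k)
        kernels = kernels.view(b * self.out_ch, self.in_ch, self.k, self.k)
        y = F.conv2d(x.reshape(1, b * self.in_ch, h, w),
                     kernels, padding=self.k // 2, groups=b)
        return y.view(b, self.out_ch, h, w)

# e.g. x = torch.randn(2, 1, 64, 64)
# AdaptiveConv2d(1, 2)(x).shape -> (2, 2, 64, 64); ChannelAttention(2) keeps the shape
```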
Specifically, the constructed training set is fed into the ACMCN model for training, with num_epochs set to 100, i.e. the model traverses the whole training set 100 times, BATCH_size set to 16, i.e. the model trains on 16 samples at a time, and num_classes set to 2, i.e. the model performs binary classification. The performance of the trained model is evaluated with an IOU function and a Loss function, where the IOU measures the performance of object detection and semantic segmentation; its value lies between 0 and 1, and the larger the value, the more accurate the prediction, i.e. the better the performance. The formula is expressed as follows:
IOU = Area(A∩B) / Area(A∪B) = (w_overlap × h_overlap) / (w_A × h_A + w_B × h_B - w_overlap × h_overlap)
where Area(A∩B) denotes the intersection area of A and B, Area(A∪B) denotes their union area, w_overlap and h_overlap denote the width and height of the overlap between A and B, and w_A, h_A, w_B and h_B denote the widths and heights of A and B respectively; A represents the predicted correction-and-trimming result for the file scan image, and B represents the actual correction-and-trimming result;
Loss is the loss function: during training, the ACMCN model makes predictions from the input data, and the difference between the predicted result and the true result is the value of the loss function. The smaller the Loss value, the closer the model's prediction is to the truth, i.e. the better the performance. The formula is expressed as follows:
L_i = -[y_i · log(p_i) + (1 - y_i) · log(1 - p_i)], Loss = (1/N) · Σ L_i (summing over i = 1, …, N)
where y_i is the label of the i-th file scan image's prediction (a file scan image needing correction and trimming is labelled 1, one that does not is labelled 0), p_i is the predicted probability that the i-th file scan image needs correction and trimming, L_i is the value of the cross-entropy loss function for the i-th file scan image, and N is the total number of training samples.
Therefore, whether the model has finished training is measured by the IOU and the Loss: when the IOU reaches 0.5 and the Loss falls below 0.1, the target can be considered correctly detected.
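Under the stated hyper-parameters (num_epochs = 100, BATCH_size = 16, binary labels), a training-loop sketch could look as follows; the dataset object, the optimizer choice and the learning rate are placeholders and assumptions, not the patent's implementation:
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

NUM_EPOCHS, BATCH_SIZE = 100, 16

def train(model: nn.Module, train_set, device: str = "cuda") -> None:
    """Binary cross-entropy training matching the Loss formula above;
    the model outputs one logit per image (needs correction-and-trimming
    vs. not)."""
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    criterion = nn.BCEWithLogitsLoss()          # numerically stable BCE
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(NUM_EPOCHS):
        total = 0.0
        for images, labels in loader:
            images, labels = images.to(device), labels.float().to(device)
            optimizer.zero_grad()
            loss = criterion(model(images).squeeze(1), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / len(loader):.4f}")
```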
How p_i is calculated is explained below:
the ACMCN model transmits the input image data to the neural network for processing, and a final output result is obtained. This output is a vector with each element representing the score of the corresponding class, i.e. the likelihood that the sample belongs to each class.
To convert the output into probability values, a sigmoid transformation is applied to each score. The sigmoid function is as follows:
sigmoid(x) = 1 / (1 + e^(-x))
where x is the score output by the neural network, i.e. the score that the i-th file scan image needs correction and trimming; substituting x into the sigmoid function yields p_i, the probability that the sample is a file scan image needing correction and trimming. The calculation formula for p_i is as follows:
p_i = 1 / (1 + e^(-x))
in addition, other metrics may be selected by those skilled in the art to evaluate the performance of the model. When the index value of the predicted result given by the model on the test set reaches a certain threshold value, the trained ACMCN model can be considered to give an accurate predicted result, and the invention does not limit the threshold value.
The description and practice of the invention disclosed herein will be readily apparent to those skilled in the art and may be modified and adapted in several ways without departing from the principles of the invention. Accordingly, modifications or improvements made without departing from the spirit of the invention are also to be considered within its scope of protection.

Claims (4)

1. A deep-learning-based method for automatically correcting and trimming file scan images, characterized by comprising the following steps:
S11, preprocessing an input file scan image, including: 1) image edge cropping, 2) image resizing, 3) data set synthesis, and 4) data set grouping, so that the data set meets the model's input requirements;
S12, feeding the processed file scan image data set into the angle-correction and edge-trimming network model for training; the model extracts features from the input images;
S13, using the ACMCN model to automatically detect the edges in the image and correct the skew according to the edge positions, and to automatically detect the content in the image and trim it according to the content position, removing redundant border regions;
S14, judging whether the processed image meets the set correction and trimming requirements, namely that the average angular deviation of all straight lines in the processed file scan image lies between 0° and 1° and that the edge blurriness of all edges lies in [0, 0.1]; if the requirements are met, outputting the processed image; if not, iterating the correction and trimming until the requirements are met and then outputting the processed image;
S15, processing file scan images with the trained model and outputting the automatically corrected and trimmed file scan images.
2. The method for automatically correcting and trimming file scan images based on deep learning according to claim 1, wherein in step S11, said 1): image edge cropping means first reading n original pictures that are already roughly cropped and deflected by 0°, any picture not at 0° being manually corrected to 0°; a further cropping pass then removes 60 pixels from the top, bottom, left and right of each image to cut away the mottled noise at the document edges, generating a new image data set;
said 2): image resizing means setting the height of every image in the new data set to 480 pixels and scaling the width proportionally, so that image sizes are unified, generating a new image data set;
said 3): data set synthesis means taking 75% of the new image data set for vertical flipping and 75% for horizontal flipping, selecting 50% of the images for rotation by an arbitrary angle in [-90°, 90°], then applying compression enhancement to all images with the post-compression image quality bounded below by a JFIF quality of 30 and above by 80, and selecting 70% of the images for random shadow enhancement, with the shadow region allowed anywhere in the image and the number of shadows between 0 and 1; 50% of the pictures then receive random brightness and contrast enhancement, with the contrast adjustment range set to 0.1-0.34 and the image brightness reduced by 50%; finally, z background pictures common to archival scans are randomly composited with the n processed images, generating the final data set;
said 4): data set grouping means dividing the final data set into groups of 6 pictures, taking 1 picture from each group as the validation set and the remaining 5 as the training set, thereby splitting the final data set into a training set and a validation set.
3. The method for automatically correcting and trimming file scan images based on deep learning according to claim 1, wherein in step S12 the angle-correction and edge-trimming network model comprises: a) a feature extraction module, b) a deviation-rectifying module, and c) a trimming module, with the following architecture:
a) The feature extraction module has a network depth of 16 layers: 5 convolutional layers, 3 fully connected layers and 8 nonlinear activation layers, each nonlinear activation layer using the ReLU activation function; a dropout layer follows each fully connected layer to prevent overfitting, with dropout rates of 0.4, 0.3 and 0.25 respectively, and a softmax activation function performs the final classification;
b) The deviation-rectifying module has a network depth of 9 layers: 4 convolutional layers, 4 pooling layers and 1 fully connected layer; each convolutional layer uses 3×3 kernels, with 32, 64, 128 and 256 kernels in turn; each pooling layer uses a 2×2 pooling window; all convolutional and pooling layers use the ReLU activation function; and the fully connected layer has 1 neuron with a sigmoid activation function;
c) The trimming module has a network depth of 4 layers: 3 convolutional layers and 1 fully connected stage; each convolutional layer uses 3×3 kernels, with 32, 64 and 128 kernels in turn; in the fully connected stage, the first hidden layer has 256 neurons and the second 128; in addition, an adaptive convolution module and a channel attention module are added after the deviation-rectifying module and the trimming module, wherein the adaptive convolution module comprises one one-dimensional convolutional layer and two adaptive convolutional layers: the one-dimensional convolutional layer has 1 output channel, the first adaptive convolutional layer has 1 input channel and 2 output channels, and the second has 2 input channels and 1 output channel; the size and shape of an adaptive convolutional layer's kernel are generated dynamically and adapt to the size and shape of the input feature map; the channel attention module comprises a global average pooling layer, two fully connected layers and a sigmoid activation layer: the global average pooling layer average-pools the input feature map along the channel dimension to obtain a one-dimensional feature map, the two fully connected layers respectively compress and expand the input along the channel dimension to obtain two one-dimensional feature maps, the sigmoid activation layer adds the two one-dimensional feature maps and normalizes them with a sigmoid function to obtain a channel attention weight vector, and finally the weight vector is multiplied channel by channel with the input feature map to obtain the weighted feature map.
4. The method for automatically correcting and trimming file scan images based on deep learning according to claim 2, wherein in step S15 the trained model means that the training set of claim 2 is fed into the ACMCN model for training, with num_epochs set to 100, i.e. the model traverses the whole training set 100 times, BATCH_size set to 16, i.e. the model trains on 16 samples at a time, and num_classes set to 2, i.e. the model performs binary classification; the performance of the trained model in step S15 is evaluated with an IOU function and a Loss function, where the IOU measures the performance of object detection and semantic segmentation, its value lies between 0 and 1, and the larger the value, the more accurate the prediction, i.e. the better the performance; the formula is expressed as follows:
IOU = Area(A∩B) / Area(A∪B) = (w_overlap × h_overlap) / (w_A × h_A + w_B × h_B - w_overlap × h_overlap)
where Area(A∩B) denotes the intersection area of A and B, Area(A∪B) denotes their union area, w_overlap and h_overlap denote the width and height of the overlap between A and B, and w_A, h_A, w_B and h_B denote the widths and heights of A and B respectively; A represents the predicted correction-and-trimming result for the file scan image, and B represents the actual correction-and-trimming result;
Loss is the loss function: during training, the ACMCN model makes predictions from the input data, and the difference between the predicted result and the true result is the value of the loss function; the smaller the Loss value, the closer the model's prediction is to the truth, i.e. the better the performance; the formula is expressed as follows:
L_i = -[y_i · log(p_i) + (1 - y_i) · log(1 - p_i)], Loss = (1/N) · Σ L_i (summing over i = 1, …, N)
where y_i is the label of the i-th file scan image's prediction (a file scan image needing correction and trimming is labelled 1, one that does not is labelled 0), p_i is the predicted probability that the i-th file scan image needs correction and trimming, L_i is the value of the cross-entropy loss function for the i-th file scan image, and N is the total number of training samples;
whether the model has finished training is measured by the IOU and the Loss: when the IOU reaches 0.5 and the Loss falls below 0.1, the target is considered correctly detected.
CN202310420534.9A 2023-04-19 2023-04-19 File scanning image automatic correction and trimming method based on deep learning Active CN116433494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310420534.9A CN116433494B (en) 2023-04-19 2023-04-19 File scanning image automatic correction and trimming method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310420534.9A CN116433494B (en) 2023-04-19 2023-04-19 File scanning image automatic correction and trimming method based on deep learning

Publications (2)

Publication Number Publication Date
CN116433494A (en) 2023-07-14
CN116433494B (en) 2024-02-02

Family

ID=87081211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310420534.9A Active CN116433494B (en) 2023-04-19 2023-04-19 File scanning image automatic correction and trimming method based on deep learning

Country Status (1)

Country Link
CN (1) CN116433494B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209632A (en) * 2019-05-27 2019-09-06 武汉市润普网络科技有限公司 A kind of electronics folder with case production, turn shelves system
CN113065396A (en) * 2021-03-02 2021-07-02 国网湖北省电力有限公司 Automatic filing processing system and method for scanned archive image based on deep learning
US20210407045A1 (en) * 2020-06-26 2021-12-30 Adobe Inc. Methods and systems for automatically correcting image rotation
CN114066919A (en) * 2021-11-18 2022-02-18 吉林省通联信用服务有限公司 Method for automatically cutting edges, correcting, denoising and replacing background of file image
CN114358137A (en) * 2021-12-10 2022-04-15 同略科技有限公司 Automatic image correction method for file scanning piece based on deep learning
CN115619656A (en) * 2022-09-19 2023-01-17 郑州大学 Digital file deviation rectifying method and system


Also Published As

Publication number Publication date
CN116433494B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN111814722B (en) Method and device for identifying table in image, electronic equipment and storage medium
CN111325203B (en) American license plate recognition method and system based on image correction
CN109993040B (en) Text recognition method and device
CN109190625B (en) Large-angle perspective deformation container number identification method
CN109360179B (en) Image fusion method and device and readable storage medium
CN110647795A (en) Form recognition method
CN111626292B (en) Text recognition method of building indication mark based on deep learning technology
CN111353961A (en) Document curved surface correction method and device
CN113065396A (en) Automatic filing processing system and method for scanned archive image based on deep learning
CN110969164A (en) Low-illumination imaging license plate recognition method and device based on deep learning end-to-end
CN110689003A (en) Low-illumination imaging license plate recognition method and system, computer equipment and storage medium
CN113537211A (en) Deep learning license plate frame positioning method based on asymmetric IOU
CN109657682B (en) Electric energy representation number identification method based on deep neural network and multi-threshold soft segmentation
CN113095445B (en) Target identification method and device
CN116433494B (en) File scanning image automatic correction and trimming method based on deep learning
CN113139535A (en) OCR document recognition method
CN113052234A (en) Jade classification method based on image features and deep learning technology
JP5211449B2 (en) Program, apparatus and method for adjusting recognition distance, and program for recognizing character string
CN113989823B (en) Image table restoration method and system based on OCR coordinates
CN116152824A (en) Invoice information extraction method and system
CN113269136B (en) Off-line signature verification method based on triplet loss
CN115222652A (en) Method for identifying, counting and centering end faces of bundled steel bars and memory thereof
CN114821174A (en) Power transmission line aerial image data cleaning method based on content perception
CN114549649A (en) Feature matching-based rapid identification method for scanned map point symbols
CN114821582A (en) OCR recognition method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant