CN116863476A - Image generation method and system for removing seal noise based on mask guidance - Google Patents

Image generation method and system for removing seal noise based on mask guidance

Info

Publication number
CN116863476A
CN116863476A (application CN202310733846.5A)
Authority
CN
China
Prior art keywords
seal
image
mask
picture
module
Prior art date
Legal status
Pending
Application number
CN202310733846.5A
Other languages
Chinese (zh)
Inventor
周宇 (Zhou Yu)
杨欣烨 (Yang Xinye)
杨东宝 (Yang Dongbao)
王伟平 (Wang Weiping)
Current Assignee
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN202310733846.5A
Publication of CN116863476A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image generation method and system for removing seal noise based on mask guidance, relating to the field of image text recognition. A seal positioning module generates a mask representing the seal position, which avoids excessive erasure of text in non-seal areas and focuses attention on the seal area. The module also extracts texture information of the background text and passes it, together with the mask, to a seal erasing module through a skip connection, so that the erasing module removes the seal while retaining the background text and erases the seal accurately. The invention can automatically erase seals that occlude text in document images and hinder text recognition, while preserving the background text covered by the seal, thereby preventing or reducing the negative influence of seal occlusion on a text recognizer and enabling more accurate recognition of text in document images.

Description

Image generation method and system for removing seal noise based on mask guidance
Technical Field
The invention relates to the field of image character recognition, in particular to an image generation method and system for removing seal noise based on mask guidance.
Background
Text recognition is an important task in computer vision and is widely used in the analysis of document images such as invoices and contracts. Although existing text recognition methods achieve satisfactory performance, recognition accuracy drops sharply when a seal covers the text. This problem is common in practice, yet few solutions address it. Some conventional schemes use color filtering or threshold segmentation to separate the seal from the background, but such methods cannot adapt well to different types of document images and seals, so their practicality is poor. Deep learning methods use a single, deeper U-Net or a CycleGAN-based approach to remove the seal automatically, but current methods cannot reliably preserve the background text and tend to erase it by mistake. In addition, no public dataset is available for training and evaluating such models.
Disclosure of Invention
The invention aims to provide an image generation method and system for removing seal noise based on mask guidance. The method automatically erases seals that occlude text in a document image and hinder text recognition, while preserving the background text covered by the seal, thereby preventing or reducing the negative influence of seal occlusion on a text recognizer and enabling more accurate recognition of the text in the document image.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an image generation method for removing seal noise based on mask guidance comprises the following steps:
modifying the network structure of the U-Net, and adding a residual block between the encoder and the decoder as a bridge to connect the encoder and the decoder to obtain a seal positioning module;
modifying the network structure of the U-Net, and adding a global context attention module between the encoder and the decoder to obtain a seal erasing module;
inputting a seal picture into the seal positioning module, extracting abstract image features through its encoder, obtaining a seal mask from these features through its decoder, and extracting background texture information features of the seal picture through the bridge;
inputting the seal picture and the seal mask into the seal erasing module, extracting image features from them through its encoder, concatenating the image features with the background texture information features, feeding the result into the global context attention module, and processing the output with the decoder to obtain a seal-removed picture;
replacing the seal area of the seal picture with the corresponding area of the seal-removed picture according to the seal mask to obtain a seal-removed combined image;
constructing a training data set from real seal-free images and the corresponding seal images, and training the seal positioning module and the seal erasing module simultaneously on this data set by optimizing their losses;
and processing a seal picture to be processed with the trained seal positioning module and seal erasing module to generate an image with the seal noise removed.
Preferably, the encoder of the seal positioning module comprises four downsampling layers, each composed of a residual block; the decoder comprises four upsampling layers, each composed of a deconvolution and a residual block; and the bridge consists of three residual blocks.
Preferably, the main branch of every residual block in the seal positioning module consists of two convolutions, and its skip connection consists of one convolution.
Preferably, the encoder of the seal erasing module comprises four downsampling layers, each formed by one convolution; the decoder contains four upsampling layers, each made up of a deconvolution and a convolution.
Preferably, replacing the seal area in the seal picture with the corresponding area of the seal-removed picture according to the seal mask to obtain the seal-removed combined image comprises the following steps:
computing the element-wise (dot) product of the seal mask and the seal-removed picture to obtain the image information of the seal area of the seal-removed picture;
computing the element-wise (dot) product of the seal picture and the inverted seal mask to obtain the image information of the non-seal area of the seal picture;
and adding the image information of the seal area of the seal-removed picture to the image information of the non-seal area of the seal picture to obtain the combined image.
Preferably, the seal mask used in the product with the seal-removed picture is replaced by a dilated seal mask, and the subsequent processing is performed to obtain the combined image.
Preferably, the training data set is constructed by:
stamping a batch of real seal-free images to obtain seal images;
the training data set is composed of the real seal-free images and the corresponding seal images, as illustrated by the synthesis sketch below.
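For illustration only, the following minimal sketch shows one way such a (seal-free, sealed) training pair could be synthesized with Pillow. The function name, file names, paste position and alpha value are hypothetical placeholders, not parameters taken from the patent.

```python
from PIL import Image


def synthesize_pair(clean_path, stamp_path, position=(100, 100), alpha=0.8):
    """Overlay a semi-transparent stamp onto a clean document image and return
    the (seal-free ground truth, sealed input) pair."""
    clean = Image.open(clean_path).convert("RGB")
    stamp = Image.open(stamp_path).convert("RGBA")

    # Scale the stamp's alpha channel so the underlying text remains partly
    # visible, mimicking a semi-transparent ink seal.
    alpha_band = stamp.split()[-1]
    stamp.putalpha(alpha_band.point(lambda v: int(v * alpha)))

    sealed = clean.copy()
    sealed.paste(stamp, position, mask=stamp)  # alpha-composite the stamp
    return clean, sealed


# Placeholder file names; replace with real document and stamp images.
clean_img, sealed_img = synthesize_pair("clean_document.png", "seal.png")
```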
Preferably, the seal positioning module is trained by optimizing an L1 loss and a Dice loss, and the seal erasing module is trained by optimizing an L1 loss, a perceptual loss, a style loss, and a GAN adversarial loss.
Preferably, the seal positioning module and the seal erasing module together form an end-to-end model and are trained simultaneously.
An image generation system for removing seal noise based on mask guidance, comprising:
the seal positioning module, a convolutional neural network based on a modified U-Net encoder-decoder structure; its encoder comprises four downsampling layers, each composed of a residual block, for extracting abstract image features from the seal picture; its decoder comprises four upsampling layers, each composed of a deconvolution and a residual block, for obtaining a seal mask from the abstract image features; three residual blocks arranged between the encoder and the decoder serve as a bridge connecting them and extract background texture information features of the seal picture;
the seal erasing module, a convolutional neural network based on a modified U-Net encoder-decoder structure; its encoder comprises four downsampling layers, each formed by one convolution, for extracting image features; its decoder comprises four upsampling layers, each composed of a deconvolution and an ordinary convolution, for outputting the seal-removed picture; and a global context attention module arranged between the encoder and the decoder, which receives and processes the concatenation of the image features and the background texture information features and outputs the result to the decoder.
The technical scheme of the invention has the following advantages:
According to the technical scheme provided by the invention, a mask representing the seal position is generated by the seal positioning module, which avoids excessive erasure of text in non-seal areas and focuses attention on the seal area. To better retain the background text, the seal positioning module extracts texture information of the background text and passes it, together with the mask, to the seal erasing module through a skip connection, so that the erasing module retains the background text while removing the seal and erases the seal accurately. To make the final generated image more natural, the invention uses a dilated mask when computing the loss. Experiments show that the invention achieves excellent performance on a real dataset, and the designed system is lightweight and can be used flexibly as a preprocessing module for existing recognition methods.
Drawings
Fig. 1 is a block diagram of an image generation system for removing seal noise based on mask guidance in the embodiment.
Fig. 2 is a block diagram of a residual block in an embodiment.
Fig. 3 is a comparison of the visualization results of the method of the present invention and the baseline methods in the embodiment.
Detailed Description
In order to make the technical features and advantages or technical effects of the technical scheme of the invention more obvious and understandable, the following detailed description is given with reference to the accompanying drawings.
This embodiment provides an image generation method and system for removing seal noise based on mask guidance. The method can be implemented by the system, and fig. 1 shows the structure of the system together with the data flow it processes (the seal content in the figure is desensitized and not real). The system mainly comprises two parts: a seal positioning module and a seal erasing module. The seal positioning module generates an accurate seal mask, and the seal erasing module erases the seal and restores the background text occluded by the seal. The two modules are described in detail below.
1. Seal positioning module
The input of the seal positioning module is a seal picture (see input image in fig. 1), and the output is an accurate seal mask (see prediction mask in fig. 1).
The seal positioning module is a convolutional neural network built on a suitably modified U-Net structure, i.e. it adopts the U-Net encoder-decoder architecture. The encoder comprises four downsampling layers, each composed of a residual block, which progressively extract abstract features of the image. The decoder comprises four upsampling layers, each comprising a deconvolution and a residual block, which restore the features extracted by the encoder to the original image size and generate a seal mask of the same size as the original image. The first modification is to add three residual blocks between the encoder and the decoder as a bridge connection: the three residual blocks sit between the last downsampling layer of the encoder and the first upsampling layer of the decoder, and their output features serve as the background texture information for the later erasure task.
As this structural description shows, the residual block is the main component of the seal positioning module, and it helps stabilize model training. The second modification is a specially designed residual block whose structure is shown in Fig. 2. The main branch of the residual block consists of two convolutions: the first is 3×3 with a stride of 2 or 1 (stride 2 when downsampling features, stride 1 when upsampling features); the second is 3×3 with a stride of 1. Using two convolutions in the main branch instead of one increases the depth of the network, which benefits feature extraction. The skip connection of the residual block uses a 3×3 convolution with stride 2 to adjust the feature size and match the output of the main branch; if no size adjustment is needed, a 1×1 convolution with stride 1 is used instead. This strengthens feature extraction and integration between different layers. The residual block design reinforces feature extraction and integration across the downsampling layers, with the aim of better extracting the texture information of the background text and improving the accuracy and robustness of the module.
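The following PyTorch sketch illustrates one way the residual block and the seal positioning module described above could be implemented. It is a minimal illustration under stated assumptions, not the patented implementation: the class names, the base channel width and the ReLU activations are assumptions, while the 3×3/1×1 convolution sizes and strides, the four-stage encoder/decoder and the three-block bridge follow the text.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Main branch: two 3x3 convolutions; the first uses stride 2 when the
        # block downsamples, otherwise stride 1.
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        )
        # Skip connection: a 3x3 convolution when the feature size must be
        # adjusted, otherwise a 1x1 stride-1 convolution.
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        else:
            self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))


class SealLocator(nn.Module):
    """Encoder of 4 downsampling residual blocks, a 3-block bridge, and a
    decoder of 4 (deconvolution + residual block) stages predicting a
    single-channel seal mask."""

    def __init__(self, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.enc = nn.ModuleList(
            ResidualBlock(3 if i == 0 else chs[i - 1], chs[i], stride=2)
            for i in range(4))
        self.bridge = nn.Sequential(*(ResidualBlock(chs[3], chs[3]) for _ in range(3)))
        self.dec = nn.ModuleList()
        for i in range(3, -1, -1):
            out = chs[i - 1] if i > 0 else base
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(chs[i], out, 2, stride=2),
                ResidualBlock(out, out)))
        self.head = nn.Conv2d(base, 1, 1)

    def forward(self, x):
        for block in self.enc:
            x = block(x)
        texture = self.bridge(x)            # background texture features for the eraser
        y = texture
        for up in self.dec:
            y = up(y)
        mask = torch.sigmoid(self.head(y))  # predicted seal mask in [0, 1]
        return mask, texture
```

Given a (b, 3, h, w) input whose spatial size is a multiple of 16, this sketch returns a full-resolution mask and 16x-downsampled texture features that the erasing module can concatenate with its own encoder output.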
2. Seal erasing module
The inputs of the seal erasing module are the seal picture and the seal mask output by the seal positioning module, and its output is the picture with the seal erased (see the output image in fig. 1).
The seal erasing module is also a convolutional neural network built on a suitably modified U-Net encoder-decoder structure. The encoder comprises four downsampling layers, each formed by one ordinary convolution, for extracting image features. The decoder comprises four upsampling layers, each formed by a deconvolution and an ordinary convolution, for outputting the seal-removed picture. Each upsampling and downsampling layer uses only one ordinary convolution in order to reduce the number of parameters, which is one of the modifications to the U-Net structure. The other modification adds a global context attention module (GC Block) between the encoder and the decoder to receive the background texture features from the seal positioning module and the image features extracted by the downsampling layers, which are concatenated along the channel dimension. The GC Block is an attention module whose output attention map indicates which regions of the image are important, so that the model's attention can be focused on the seal region. The background texture features provide texture information about the background text, so that the seal erasing module can retain the background text as much as possible while erasing the seal.
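As an illustration of the attention component, the sketch below follows the commonly published GC Block formulation (softmax-pooled global context followed by a channel bottleneck transform). The patent does not spell out the exact variant it uses, so the reduction ratio, the LayerNorm and the class name here are assumptions.

```python
import torch
import torch.nn as nn


class GCBlock(nn.Module):
    """Global context attention in the GCNet style: softmax-pooled global
    context, transformed by a channel bottleneck, added back to every position."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.context_conv = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Context modelling: softmax-weighted pooling over all spatial positions.
        attn = self.context_conv(x).view(b, 1, h * w).softmax(dim=-1)   # (b, 1, hw)
        context = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))  # (b, c, 1)
        context = context.view(b, c, 1, 1)
        # Transform the pooled context and broadcast-add it to every position.
        return x + self.transform(context)


# In the eraser, the downsampled image features and the texture features from the
# locator bridge would be concatenated on the channel axis before this block, e.g.
# fused = GCBlock(c_img + c_tex)(torch.cat([img_feat, tex_feat], dim=1)).
```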
After producing the seal-removed picture, i.e. the output image, the seal erasing module replaces the seal area of the originally input seal picture with the corresponding area of the seal-removed picture according to the seal mask, and takes the resulting combined image as the final result. Specifically, the module computes the element-wise product of the seal mask and the seal-removed picture to obtain the image of the seal area of the seal-removed picture (the other areas are black); it computes the element-wise product of the seal picture and the inverted seal mask to obtain the image of the non-seal area of the seal picture (the other areas are black); the two complementary images are then added to obtain the final combined image.
In a preferred embodiment, the seal erasing module replaces the original seal mask with a dilated mask (the dilated mask shown in fig. 1, formed by expanding each edge of the predicted mask outward by several pixels) when extracting the seal area from the output image, and the optimized combined image is then obtained by the same addition of the two complementary parts. The purpose of this preferred embodiment is to let the system fill the seal area using pixels around the seal, which makes the generated image more natural; using the dilated mask also avoids residual seal borders caused by unclear seal boundaries.
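A compact sketch of this mask-guided composition, including the optional dilation, could look as follows. The dilation radius is an assumed illustrative value, and dilation is implemented here with max pooling, which is equivalent to morphological dilation for a near-binary mask.

```python
import torch.nn.functional as F


def compose(seal_img, out_img, mask, dilate_px=5):
    """seal_img, out_img: (b, 3, h, w) torch tensors; mask: (b, 1, h, w) in [0, 1].
    Returns I_com = m * I_out + (1 - m) * I_in with an optionally dilated m."""
    if dilate_px > 0:
        # Max-pooling a (near-)binary mask is equivalent to morphological dilation.
        k = 2 * dilate_px + 1
        mask = F.max_pool2d(mask, kernel_size=k, stride=1, padding=dilate_px)
    seal_region = mask * out_img           # de-sealed pixels inside the (dilated) seal area
    background = (1.0 - mask) * seal_img   # untouched pixels outside the seal area
    return seal_region + background
```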
Based on the system structure above, the method provided by this embodiment removes the seal from a picture containing seal noise through the following steps:
1: Input the seal picture into the seal positioning module to obtain an accurate mask representing the seal position; background texture information features are extracted by the bridge inside the module and passed to the seal erasing module.
2: Concatenate the seal picture and the mask along the channel dimension and input them to the seal erasing module. After multi-layer downsampling, the module concatenates the downsampled image features with the background texture information features received from the seal positioning module along the channel dimension, feeds them into the global context attention module, and then obtains the seal-removed picture by upsampling.
3: Replace the seal area of the seal picture with the corresponding area of the predicted seal-removed picture according to the mask, in particular the dilated mask, to obtain the combined image as the final result.
4: The seal positioning module and the seal erasing module must be trained before being used for the actual seal erasing task; the loss is computed on a training data set consisting of real seal-free images (the ground truth) and the corresponding seal images. The seal positioning module is trained by optimizing an L1 loss and a Dice loss, and the seal erasing module is trained by optimizing an L1 loss, a perceptual loss, a style loss, and a GAN adversarial loss. Although the two modules use different loss functions, the system model they form is trained end to end, and the modules do not need to be trained separately.
The loss function of the seal positioning module is specifically as follows:
L_loc = Dice(m, m′) + ||m - m′||_1
where m and m′ denote the mask predicted by the model and the ground-truth mask, respectively, and x and y denote the pixel coordinates of the image over which the Dice term is computed.
The loss function of the seal erasing module is specifically as follows:
L_pix = 10 · ||m * (I_out - I_gt)||_1 + 2 · ||(1 - m) * (I_out - I_gt)||_1
where I_out and I_gt denote the seal-removed image output by the model and the ground truth, respectively.
I_com = m * I_out + (1 - m) * I_in, where I_in denotes the input seal picture and I_com the combined image.
where φ_n denotes the features output by the n-th pooling layer of VGG (used in the perceptual loss).
where Gram_n denotes the Gram matrix computed from the features output by the n-th pooling layer of VGG (used in the style loss).
L_adv = -E[D(G(I_in))]
where G denotes the generator and D denotes the discriminator.
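For reference, the locator loss and the pixel and adversarial terms of the eraser loss defined above can be written out as in the sketch below. The perceptual and style terms are omitted because their exact weights are not given in the text; implementing the L1 norms as means and the Dice smoothing constant are implementation assumptions.

```python
import torch


def dice_loss(pred, gt, eps=1e-6):
    # Soft Dice loss between a predicted mask and the ground-truth mask.
    inter = (pred * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)


def locator_loss(pred_mask, gt_mask):
    # L_loc = Dice(m, m') + ||m - m'||_1
    return dice_loss(pred_mask, gt_mask) + (pred_mask - gt_mask).abs().mean()


def eraser_pixel_loss(out_img, gt_img, mask):
    # L_pix = 10 * ||m * (I_out - I_gt)||_1 + 2 * ||(1 - m) * (I_out - I_gt)||_1
    diff = (out_img - gt_img).abs()
    return 10.0 * (mask * diff).mean() + 2.0 * ((1.0 - mask) * diff).mean()


def generator_adv_loss(disc_scores):
    # L_adv = -E[D(G(I_in))], where disc_scores = D(G(I_in)) for a batch.
    return -disc_scores.mean()


# Example with random tensors (shapes: batch, channel, height, width).
if __name__ == "__main__":
    m_pred = torch.rand(2, 1, 64, 64)
    m_gt = (torch.rand(2, 1, 64, 64) > 0.5).float()
    print(locator_loss(m_pred, m_gt).item())
```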
Experimental test:
the invention performs extensive experiments to evaluate the effect of the technical solution of the invention (taking the system proposed by the invention as an example). Before the beginning of the experiment, the inventors made a synthetic dataset as a training set for the training of the system of the present invention, due to the lack of a published stamp-related dataset. The composite data set takes different types of document pictures collected from the internet as the background, and then different types of seals are added on the composite data set, so that the system can adapt to different conditions. Wherein the seal shape comprises a circle, an ellipse and a square, and the color comprises red and blue. The composite dataset had a total of 19055 pictures, containing 22551 stamps. In order to be able to evaluate the actual performance of the inventive system accurately, the experiment uses the actual data set as the test set of the inventive system. The data in the real data set is a real picture with a seal, which is collected by a method of online collection and photographing, and then coordinates of text lines shielded by the seal and labels thereof are marked. The real dataset had a total of 224 pictures, containing 400 stamps and 616 text line notes. In terms of evaluation index, the experiment selects the field accuracy and edit distance of peak signal-to-noise ratio (PSNR), structural Similarity (SSIM), and recognizer (pad OCR).
The experimental results are shown in Tables 1-3. Table 1 compares the effect of the system under different settings: whether the mask is dilated, whether the mask is used as an input to the seal erasing module, and whether a skip connection is used to transfer the background texture information. The results show that each part of the design (see the method column in Table 1) significantly improves performance, and the best results are obtained with the dilation + mask + skip connection configuration.
Table 1 Comparison of ablation experiment results on the real dataset
Table 2 compares the results when the residual block is replaced with one or two plain convolutions, indicating that the residual block extracts the background texture information more effectively.
Table 2 Results of replacing the residual block with one or two convolutions on the real dataset
Method Field accuracy Edit distance
1 convolution 57.79 0.9156
2 convolution 60.55 0.8782
Residual block 62.18 0.8198
The experiments also compare the invention with the advanced scene text erasure methods EraseNet and Stroke-base, used as baselines; the results in Table 3 show that the output of the invention is better than both baseline methods.
Table 3 Comparison of the invention and the baseline methods on the real dataset
Method Field accuracy Edit distance
EraseNet 57.79 0.9156
Stroke-base 60.55 0.8782
The invention 62.18 0.8198
Fig. 3 compares the visualization results of the present invention and the baselines, showing that, compared with the two baseline methods, the invention is able to erase the seal, retains the background text more effectively, and generates more natural images.
Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, and that modifications and equivalents may be made thereto by those skilled in the art, which modifications and equivalents are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (10)

1. The image generation method for removing seal noise based on mask guidance is characterized by comprising the following steps:
modifying the network structure of the U-Net, and adding a residual block between the encoder and the decoder as a bridge to connect the encoder and the decoder to obtain a seal positioning module;
modifying the network structure of the U-Net, and adding a global context attention module between the encoder and the decoder to obtain a seal erasing module;
inputting a seal picture into the seal positioning module, extracting abstract image features through its encoder, obtaining a seal mask from these features through its decoder, and extracting background texture information features of the seal picture through the bridge;
inputting the seal picture and the seal mask into the seal erasing module, extracting image features from them through its encoder, concatenating the image features with the background texture information features, feeding the result into the global context attention module, and processing the output with the decoder to obtain a seal-removed picture;
replacing the seal area of the seal picture with the corresponding area of the seal-removed picture according to the seal mask to obtain a seal-removed combined image;
constructing a training data set from real seal-free images and the corresponding seal images, and training the seal positioning module and the seal erasing module simultaneously on this data set by optimizing their losses;
and processing a seal picture to be processed with the trained seal positioning module and seal erasing module to generate an image with the seal noise removed.
2. The method of claim 1, wherein the encoder of the seal positioning module comprises four downsampling layers, each composed of a residual block; the decoder comprises four upsampling layers, each composed of a deconvolution and a residual block; and the bridge consists of three residual blocks.
3. The method of claim 2, wherein the main branch of each residual block in the seal positioning module consists of two convolutions and its skip connection consists of one convolution.
4. The method of claim 1, wherein the encoder of the seal erasing module comprises four downsampling layers, each formed by one convolution; the decoder contains four upsampling layers, each made up of a deconvolution and a convolution.
5. The method of claim 1, wherein replacing the seal area in the seal picture with the corresponding area of the seal-removed picture according to the seal mask to obtain the seal-removed combined image comprises the following steps:
computing the element-wise (dot) product of the seal mask and the seal-removed picture to obtain the image information of the seal area of the seal-removed picture;
computing the element-wise (dot) product of the seal picture and the inverted seal mask to obtain the image information of the non-seal area of the seal picture;
and adding the image information of the seal area of the seal-removed picture to the image information of the non-seal area of the seal picture to obtain the combined image.
6. The method of claim 5, wherein the seal mask used in the product with the seal-removed picture is replaced by a dilated seal mask, and the subsequent processing is performed to obtain the combined image.
7. The method of claim 1, wherein the training data set is constructed by:
stamping a batch of real seal-free images to obtain seal images;
the training data set is composed of the real seal-free image and the corresponding seal image.
8. The method of claim 1, wherein the seal positioning module is trained by optimizing an L1 loss and a Dice loss, and the seal erasing module is trained by optimizing an L1 loss, a perceptual loss, a style loss, and a GAN adversarial loss.
9. The method of claim 1 or 8, wherein the seal positioning module and the seal erasing module together form an end-to-end model and are trained simultaneously.
10. An image generation system for removing seal noise based on mask guidance, comprising:
the seal positioning module, a convolutional neural network based on a modified U-Net encoder-decoder structure; its encoder comprises four downsampling layers, each composed of a residual block, for extracting abstract image features from the seal picture; its decoder comprises four upsampling layers, each composed of a deconvolution and a residual block, for obtaining a seal mask from the abstract image features; three residual blocks arranged between the encoder and the decoder serve as a bridge connecting them and extract background texture information features of the seal picture;
the seal erasing module, a convolutional neural network based on a modified U-Net encoder-decoder structure; its encoder comprises four downsampling layers, each formed by one convolution, for extracting image features; its decoder comprises four upsampling layers, each composed of a deconvolution and an ordinary convolution, for outputting the seal-removed picture; and a global context attention module arranged between the encoder and the decoder, which receives and processes the concatenation of the image features and the background texture information features and outputs the result to the decoder.
CN202310733846.5A 2023-06-20 2023-06-20 Image generation method and system for removing seal noise based on mask guidance Pending CN116863476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310733846.5A CN116863476A (en) 2023-06-20 2023-06-20 Image generation method and system for removing seal noise based on mask guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310733846.5A CN116863476A (en) 2023-06-20 2023-06-20 Image generation method and system for removing seal noise based on mask guidance

Publications (1)

Publication Number Publication Date
CN116863476A (en) 2023-10-10

Family

ID=88218203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310733846.5A Pending CN116863476A (en) 2023-06-20 2023-06-20 Image generation method and system for removing seal noise based on mask guidance

Country Status (1)

Country Link
CN (1) CN116863476A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351032A (en) * 2023-10-23 2024-01-05 杭州核新软件技术有限公司 Seal removing method and system
CN117351032B (en) * 2023-10-23 2024-06-07 杭州核新软件技术有限公司 Seal removing method and system

Similar Documents

Publication Publication Date Title
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
JP7246104B2 (en) License plate identification method based on text line identification
CN110880000B (en) Picture character positioning method and device, computer equipment and storage medium
CN112070649B (en) Method and system for removing specific character string watermark
CN113505772A (en) License plate image generation method and system based on generation countermeasure network
CN116863476A (en) Image generation method and system for removing seal noise based on mask guidance
CN111932431A (en) Visible watermark removing method based on watermark decomposition model and electronic equipment
CN116912257B (en) Concrete pavement crack identification method based on deep learning and storage medium
CN114863441A (en) Text image editing method and system based on character attribute guidance
CN113052759B (en) Scene complex text image editing method based on MASK and automatic encoder
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN116402067B (en) Cross-language self-supervision generation method for multi-language character style retention
CN117237641A (en) Polyp segmentation method and system based on dual-branch feature fusion network
CN116778164A (en) Semantic segmentation method for improving deep V < 3+ > network based on multi-scale structure
CN116385289A (en) Progressive inscription character image restoration model and restoration method
CN114898096A (en) Segmentation and annotation method and system for figure image
CN113901913A (en) Convolution network for ancient book document image binaryzation
CN116311275B (en) Text recognition method and system based on seq2seq language model
CN114821056B (en) Automatic judging and reading method for forest and grass resource change in remote sensing image based on AI technology
CN116452420B (en) Hyper-spectral image super-resolution method based on fusion of Transformer and CNN (CNN) group
CN114025165B (en) Image compression method and system for maintaining face recognition precision
CN118297806A (en) Scene text image super-resolution method
CN112258539A (en) Water system data processing method, device, electronic equipment and readable storage medium
CN115661676A (en) Building segmentation system and method based on serial attention module and parallel attention module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination