CN112801911A - Method and device for removing text-like noise in a natural image, and storage medium - Google Patents

Info

Publication number
CN112801911A
CN112801911A (application CN202110172477.8A; granted publication CN112801911B)
Authority
CN
China
Prior art keywords
image
repaired
spatial
features
channel
Prior art date
Legal status
Granted
Application number
CN202110172477.8A
Other languages
Chinese (zh)
Other versions
CN112801911B (en)
Inventor
王波
张百灵
崔嵬
Current Assignee
Suzhou Changzuichu Software Co ltd
Original Assignee
Suzhou Changzuichu Software Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Changzuichu Software Co ltd filed Critical Suzhou Changzuichu Software Co ltd
Priority to CN202110172477.8A
Publication of CN112801911A
Application granted
Publication of CN112801911B
Legal status: Active
Anticipated expiration


Classifications

    • G06T5/70
    • G06F18/214 (pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting)
    • G06T5/77
    • G06T7/11 (image analysis: region-based segmentation)
    • G06T2207/20081 (training; learning)
    • G06T2207/30176 (subject of image: document)

Abstract

The application discloses a method, a device, and a storage medium for removing text-like noise from natural images. An image semantic segmentation network detects the regions of the image to be repaired that contain text elements, and the segmentation result serves as the mask of the region to be repaired; an image inpainting model then repairs the masked text regions, where the inpainting model is the generator of a generative adversarial network. The scheme can quickly and automatically detect common text-element regions in the image to be repaired and remove the text-like noise automatically, or let the user correct the regions to be repaired through manual interaction. Because inpainting is based on a generative adversarial network, the restored image is more natural and realistic.

Description

Method and device for removing text-like noise in a natural image, and storage medium
Technical Field
Embodiments of this application relate to the technical field of image classification, and in particular to a method, a device, and a storage medium for removing text-like noise from natural images.
Background
In recent years, with the arrival of the big-data era and advances in computer hardware, artificial intelligence has become increasingly common in everyday life. Deep learning is widely applied in computer vision, and image recognition is one of its most widely deployed technologies, covering tasks such as object recognition from photos, face recognition, traffic-sign recognition, gesture recognition, and garbage classification. These technologies find corresponding applications in e-commerce, the automotive industry, gaming, and manufacturing.
Images often carry human-added elements such as text. These text elements spoil the appearance of an image, hinder its reuse, and reduce its storage value and quality. A large number of application scenarios therefore require removing text-like elements from natural-scene images to obtain a clean image. However, text elements in natural images vary widely in style and distribution (handwriting, subtitles, watermarks, scratches, and so on), all of which increase the difficulty of removal. Mainstream removal methods generally require manually annotating a text mask region before inpainting; the restored images are of poor quality and inconsistent with natural-image characteristics, and the process is time-consuming and labor-intensive.
Traditional diffusion-based image inpainting, on the other hand, uses the edge information of the region to be repaired to determine the diffusion direction and propagates known information inward from the edges. Images restored this way are unnatural, blurry, and lacking in texture detail, and large defect regions cannot be repaired. Other traditional methods suffer from similar problems: complex processing pipelines, heavy computation, and poor generalization.
Disclosure of Invention
In view of the above, embodiments of the present application provide a method and an apparatus for removing text-like noise in natural images, and a storage medium.
According to a first aspect of the present application, there is provided a method for removing text-like noise from a natural image, including:
detecting the regions containing text elements in the image to be repaired with an image semantic segmentation network, and taking the segmentation result as the mask of the region to be repaired;
repairing the text-element regions in the image to be repaired with an image inpainting model, using the mask of the region to be repaired; the inpainting model is the generator of a generative adversarial network.
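The two steps above amount to: predict a binary text mask, run the generator on the image, and keep the original pixels everywhere outside the mask. A minimal numpy sketch of the final composition step (the function name and array layout are illustrative assumptions, not from the patent):

```python
import numpy as np

def composite_repair(image, generated, mask):
    """Blend generator output into the original image.

    image, generated: float arrays in [0, 1] with shape (H, W, C).
    mask: float array in {0, 1} with shape (H, W, 1); 1 marks text
    pixels to repair. Outside the mask the original pixels are kept.
    """
    return mask * generated + (1.0 - mask) * image
```

Only the masked text region receives synthesized content, so untouched parts of the image are preserved exactly.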
As one implementation, the detecting of text-element regions with the image semantic segmentation network and the use of the segmentation result as the mask of the region to be repaired further include:
after detecting the regions containing text elements in the image to be repaired with the image semantic segmentation network, determining whether the user chooses a manual-interaction mode for the repair; if so, receiving the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, taking the segmentation result directly as the mask of the region to be repaired.
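The delete/modify/add corrections described above can be modeled as edits on a binary mask. A hedged sketch (the `edit_mask` helper and the rectangular selection format are assumptions for illustration; the patent does not specify the selection shape):

```python
import numpy as np

def edit_mask(mask, op, region):
    """Apply one interactive correction to a binary repair mask.

    op: "add" marks a rectangle for repair, "delete" unmarks it
    (a "modify" can be expressed as a delete followed by an add).
    region: (r0, r1, c0, c1), a hypothetical rectangular selection.
    """
    r0, r1, c0, c1 = region
    out = mask.copy()
    if op == "add":
        out[r0:r1, c0:c1] = 1
    elif op == "delete":
        out[r0:r1, c0:c1] = 0
    else:
        raise ValueError(f"unknown op: {op}")
    return out
```

Each user gesture produces a new mask, so the automatic prediction is only a starting point that interaction can refine.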
As one implementation, the image semantic segmentation network uses the U-shaped skip-connection structure of the U-Net segmentation network, and an atrous spatial pyramid pooling (ASPP) module is added on top of U-Net to extract and fuse multi-scale context features.
As an implementation, the method further comprises:
adding an attention mechanism to enhance the feature characterization capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign weights to each channel, and uses a spatial attention module to assign spatial feature weights.
As an implementation, the method further comprises:
the channel attention module global-pools the feature map of each channel to obtain global information, learns a weight for each channel with two fully-connected layers, and multiplies the weights with the initial features;
the spatial attention module compresses the channel count of the feature map with a 1×1 convolution; normalizes the spatial features to 4 different scales with adaptive pooling; concatenates and regularizes the 4 pooled scales and feeds them into the two fully-connected layers to learn local weights for the spatial features; reshapes the learned weight parameters back to the scale of the compressed features; restores the spatial scale to that of the channel-attention features with a 1×1 convolution and multiplies the two; and finally adds the resulting spatial features to the original features to obtain the final attention features.
As an implementation, the method further comprises:
the image inpainting model is the generator G of a trained Pixel2Pixel generative adversarial network; the Pixel2Pixel model adopts a U-Net segmentation network as the generator G.
According to a second aspect of the present application, there is provided an apparatus for removing text-like noise in natural images, comprising:
a detection and mask generation unit, configured to detect the regions containing text elements in the image to be repaired with an image semantic segmentation network and to take the segmentation result as the mask of the region to be repaired;
an image inpainting unit, configured to repair the text-element regions in the image to be repaired with an image inpainting model, using the mask of the region to be repaired; the inpainting model is the generator of a generative adversarial network.
As an implementation, the apparatus further comprises:
a manual interaction unit, configured to determine, after the detection and mask generation unit detects the text-element regions in the image to be repaired, whether the user chooses a manual-interaction mode for the repair; if so, to receive the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, to notify the detection and mask generation unit to take the segmentation result as the mask of the region to be repaired.
As one implementation, the image semantic segmentation network in the detection and mask generation unit uses the U-shaped skip-connection structure of the U-Net segmentation network, and an atrous spatial pyramid pooling (ASPP) module is added on top of U-Net to extract and fuse multi-scale context features.
As an implementation, the detection and mask generating unit is further configured to:
adding an attention mechanism to enhance the feature characterization capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign weights to each channel, and uses a spatial attention module to assign spatial feature weights.
As an implementation, the detection and mask generating unit is further configured to:
the channel attention module global-pools the feature map of each channel to obtain global information, learns a weight for each channel with two fully-connected layers, and multiplies the weights with the initial features;
the spatial attention module compresses the channel count of the feature map with a 1×1 convolution; normalizes the spatial features to 4 different scales with adaptive pooling; concatenates and regularizes the 4 pooled scales and feeds them into the two fully-connected layers to learn local weights for the spatial features; reshapes the learned weight parameters back to the scale of the compressed features; restores the spatial scale to that of the channel-attention features with a 1×1 convolution and multiplies the two; and finally adds the resulting spatial features to the original features to obtain the final attention features.
As an implementation manner, the image restoration unit is further configured to:
the image inpainting model is the generator G of a trained Pixel2Pixel generative adversarial network; the Pixel2Pixel model adopts a U-Net segmentation network as the generator G.
According to a third aspect of the present application, there is provided a storage medium storing an executable program which, when executed by a processor, performs the steps of the above method for removing text-like noise from natural images.
With the method, device, and storage medium for removing text-like noise from natural images provided above, the regions containing text elements in the image to be repaired are detected with an image semantic segmentation network, the segmentation result is taken as the mask of the region to be repaired, and the masked text regions are repaired with an image inpainting model that is the generator of a generative adversarial network. The scheme can quickly and automatically detect common text-element regions in the image to be repaired and remove the text-like noise automatically, or let the user correct the regions to be repaired through manual interaction. Because inpainting is based on a generative adversarial network, the restored image is more natural and realistic.
Drawings
FIG. 1 is a schematic flow chart of a method for removing text-like noise in a natural image according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of the improved semantic segmentation model according to an embodiment of the present application;
FIG. 3 is a flowchart of a specific example of the method for removing text-like noise in a natural image according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the attention module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the Pixel2Pixel model training framework according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for removing text-like noise in a natural image according to an embodiment of the present application.
Detailed Description
The following explains the essence of the technical solution of the embodiments of the present application in detail with reference to examples.
With the rise of deep learning, deep convolutional neural networks can readily detect text in document or natural-scene images and locate the text regions. Mainstream deep-learning text detection falls into two families: methods based on object detection and methods based on semantic segmentation. Compared with object-detection algorithms, which regress rectangular boxes, semantic segmentation recognizes at the pixel level, localizes more accurately, imposes no strict requirement on text orientation, and follows the text-region contour more closely. The mainstream semantic segmentation architecture is the encoder-decoder, e.g. the FCN, U-Net, and DeepLab families of segmentation models.
Deep-learning image inpainting based on generative adversarial networks (GANs) can learn rich semantic information from large-scale datasets and then fill in the missing content of an image end-to-end, so the restored image is more natural and realistic and the repair quality is better.
This application combines recent semantic segmentation and image inpainting techniques: text regions in the natural image are obtained by semantic segmentation, and the image is then repaired with a generative adversarial network, combined with a manual-interaction mechanism. For different application scenarios, the repair can be driven either by automatic selection of text regions or by manual interaction; the result is convenient to use, light on labor, and the restored image is natural and realistic.
Fig. 1 is a schematic flow chart of the method for removing text-like noise in a natural image according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step 101: detect the regions containing text elements in the image to be repaired with an image semantic segmentation network, and take the segmentation result as the mask of the region to be repaired.
In this embodiment, after the text-element regions are detected with the image semantic segmentation network, the system determines whether the user chooses a manual-interaction mode for the repair; if so, it receives the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, it takes the segmentation result directly as the mask of the region to be repaired.
In this embodiment, the image semantic segmentation network uses the U-shaped skip-connection structure of the U-Net segmentation network, and an atrous spatial pyramid pooling (ASPP) module is added on top of U-Net to extract and fuse multi-scale context features.
The improved semantic segmentation model of this embodiment is shown in FIG. 2; the overall U-Net structure resembles a large letter "U". The network first downsamples; it then upsamples by deconvolution, fusing in the corresponding earlier encoder features through skip connections; and then upsamples again. Repeating this process yields the output attention image.
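The downsample/upsample-and-fuse pattern described above can be traced at the level of array shapes. A minimal sketch of one U-Net level using average pooling and nearest-neighbor upsampling (a real U-Net uses learned convolutions and deconvolutions; this only illustrates the skip-connection data flow):

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling: (H, W) -> (H/2, W/2)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbor upsampling: (H, W) -> (2H, 2W)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_level(x):
    """One encoder-decoder level: downsample, upsample back, and fuse
    the decoder feature with the skipped encoder feature (stacked here
    along a channel axis, standing in for U-Net's concatenation)."""
    down = avg_pool2(x)       # encoder: halve spatial resolution
    up = upsample2(down)      # decoder: restore spatial resolution
    return np.stack([x, up])  # skip connection: (2, H, W)
```

The skip branch carries the full-resolution encoder feature to the decoder unchanged, which is what lets U-Net recover fine detail lost in downsampling.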
In this embodiment, atrous spatial pyramid pooling (ASPP) samples the given input in parallel with atrous (dilated) convolutions at different sampling rates, which is equivalent to capturing the context of the image at multiple scales.
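The key primitive in ASPP is the atrous (dilated) convolution: kernel taps are spaced `rate` samples apart, so the same kernel covers a wider context at higher rates. A 1-D sketch of the sampling pattern (ASPP itself applies several 2-D rates in parallel and fuses the results; this only illustrates the mechanism):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """Valid-padding 1-D atrous convolution: taps are `rate` apart,
    enlarging the receptive field without extra parameters."""
    k = len(kernel)
    span = (k - 1) * rate + 1          # receptive field of one output
    n = len(x) - span + 1
    return np.array([sum(kernel[j] * x[i + j * rate] for j in range(k))
                     for i in range(n)])
```

At rate 1 this reduces to an ordinary convolution; at rate 2 a 3-tap kernel already sees a span of 5 samples.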
In the embodiment of the application, an attention mechanism is added to enhance the feature characterization capability of the image semantic segmentation network; the attention mechanism uses a channel attention module to assign weights to each channel, and uses a spatial attention module to assign spatial feature weights.
In this embodiment, the channel attention module global-pools the feature map of each channel to obtain global information, learns a weight for each channel with two fully-connected layers, and multiplies the weights with the initial features;
the spatial attention module compresses the channel count of the feature map with a 1×1 convolution; normalizes the spatial features to 4 different scales with adaptive pooling; concatenates and regularizes the 4 pooled scales and feeds them into the two fully-connected layers to learn local weights for the spatial features; reshapes the learned weight parameters back to the scale of the compressed features; restores the spatial scale to that of the channel-attention features with a 1×1 convolution and multiplies the two; and finally adds the resulting spatial features to the original features to obtain the final attention features.
Step 102: repair the text-element regions in the image to be repaired with an image inpainting model, using the mask of the region to be repaired; the inpainting model is the generator of a generative adversarial network.
In this embodiment, the image inpainting model (image inpainting module) is the generator G of a trained Pixel2Pixel generative adversarial network, with a U-Net segmentation network adopted as the generator G.
After the repair-region mask is generated, the selected region is repaired with the image inpainting module, which uses the generator G of the trained Pixel2Pixel model to synthesize a realistic natural image. Pixel2Pixel is a generative adversarial network trained on paired images and consists mainly of a generator G and a discriminator D. To improve image detail and retain information at different scales, a U-Net model is adopted as the generator G.
The embodiments of the present application will be described in further detail below with reference to specific examples.
In this embodiment, a natural image is taken as an example; it should be noted that other pictures or images, such as screenshots of text, can use the same technical means.
Fig. 3 is a flowchart illustrating a specific example of a method for removing text noise in a natural image according to an embodiment of the present application, including the following specific steps:
first, a user loads an image to be restored. And automatically detecting the region containing the character elements in the natural image through a character element detection module. The character detection module detects a character area by adopting a trained image semantic segmentation network, and takes a segmentation recognition result as a mask of the area to be repaired. The semantic segmentation network model refers to a U-shaped jump layer connection network structure of a classical segmentation network U-Net. Aiming at the character characteristics, an ASPP module is added on the basis of an original U-Net to extract and fuse multi-scale contextual characteristics, and further, the characteristic characterization capability of a new attention mechanism enhanced network is provided, and the overall structure of the model is shown in FIG. 2.
In particular, the attention mechanism enhances both channel and spatial features. It first applies a channel attention module, whose main function is to assign a weight to each channel, and then a spatial attention module to assign spatial feature weights. The channel attention module first global-pools the feature map of each channel to obtain global information, then learns a weight for each channel with two fully-connected (fc) layers, and multiplies the weights with the initial features. On this basis, the spatial attention module first compresses the channel count of the newly obtained feature map with a 1×1 convolution to reduce computation; then normalizes the spatial features to 4 different scales, [1×1, 8×8, 16×16, 32×32], with adaptive pooling to capture global or local statistics of the feature maps; after concatenating and regularizing the 4 pooled scales, feeds them into two fully-connected (fc) layers to learn local weights for the spatial features; reshapes the learned weight parameters back to the scale of the previously compressed features; then restores the spatial scale to that of the channel-attention features with a 1×1 convolution and multiplies the two; and finally adds the resulting spatial features to the original features to obtain the final attention features. The structure of the attention module is shown in FIG. 4.
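The channel branch described above is essentially a squeeze-and-excitation gate: global-pool each channel, pass the pooled vector through two fc layers, and rescale the channels. A minimal numpy sketch (the ReLU/sigmoid activations and weight shapes are plausible assumptions; the patent does not specify them):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, b1, w2, b2):
    """feat: (C, H, W). Global-pool each channel, learn per-channel
    weights with two fully-connected layers, and rescale the channels."""
    c = feat.shape[0]
    pooled = feat.reshape(c, -1).mean(axis=1)      # global pooling: (C,)
    hidden = np.maximum(0.0, w1 @ pooled + b1)     # fc layer 1 + ReLU
    weights = sigmoid(w2 @ hidden + b2)            # fc layer 2 -> (0, 1)
    return feat * weights[:, None, None]           # multiply with features
```

The spatial branch follows the same gate-and-multiply pattern, but over pooled spatial positions instead of channels.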
Next, the system determines whether the user chooses to correct the region to be repaired predicted by the U-Net through manual interaction. If so, the user can correct the region through delete, modify, and add operations before the final mask of the region to be repaired is generated. If not, the mask is generated directly from the predicted text region.
After the repair-region mask is generated, the selected region is repaired with the image inpainting module, which uses the generator G of the trained Pixel2Pixel model to synthesize a realistic natural image. Pixel2Pixel is a generative adversarial network trained on paired images and consists mainly of a generator G and a discriminator D. To improve image detail and retain information at different scales, a U-Net model is adopted as the generator G. The Pixel2Pixel training framework is shown in FIG. 5.
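During training, the Pixel2Pixel generator is typically driven by two terms: an adversarial loss that rewards fooling the discriminator D, plus an L1 term that pulls the output toward the paired ground truth. A hedged sketch of that combined objective (the λ = 100 weighting follows the original pix2pix formulation, not a value stated in this patent):

```python
import numpy as np

def pix2pix_generator_loss(d_fake, fake, target, lam=100.0):
    """Combined generator objective: BCE against the 'real' label on the
    discriminator's outputs for generated images, plus lambda * L1."""
    eps = 1e-12                                  # numerical guard for log
    adv = -np.mean(np.log(d_fake + eps))         # fool D: push d_fake -> 1
    l1 = np.mean(np.abs(fake - target))          # stay close to ground truth
    return adv + lam * l1
```

The L1 term keeps the repaired region aligned with the paired clean image, while the adversarial term supplies the natural-looking texture that L1 alone tends to blur.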
Finally, the repaired natural image is saved; once all images have been processed, the system exits.
Fig. 6 is a schematic structural diagram of an apparatus for removing text-like noise in a natural image according to an embodiment of the present application. As shown in Fig. 6, the apparatus includes:
a detection and mask generation unit 61, configured to detect the regions containing text elements in the image to be repaired with an image semantic segmentation network and to take the segmentation result as the mask of the region to be repaired;
an image inpainting unit 62, configured to repair the text-element regions in the image to be repaired with an image inpainting model, using the mask of the region to be repaired; the inpainting model is the generator of a generative adversarial network.
The device further comprises:
a manual interaction unit 63, configured to determine, after the detection and mask generation unit 61 detects the text-element regions in the image to be repaired, whether the user chooses a manual-interaction mode for the repair; if so, to receive the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, to notify the detection and mask generation unit 61 to take the segmentation result as the mask of the region to be repaired.
The image semantic segmentation network in the detection and mask generation unit 61 uses the U-shaped skip-connection structure of the U-Net segmentation network, and an atrous spatial pyramid pooling (ASPP) module is added on top of U-Net to extract and fuse multi-scale context features.
The detection and mask generating unit 61 is further configured to:
adding an attention mechanism to enhance the feature characterization capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign weights to each channel, and uses a spatial attention module to assign spatial feature weights.
The detection and mask generating unit 61 is further configured to:
the channel attention module global-pools the feature map of each channel to obtain global information, learns a weight for each channel with two fully-connected layers, and multiplies the weights with the initial features;
the spatial attention module compresses the channel count of the feature map with a 1×1 convolution; normalizes the spatial features to 4 different scales with adaptive pooling; concatenates and regularizes the 4 pooled scales and feeds them into the two fully-connected layers to learn local weights for the spatial features; reshapes the learned weight parameters back to the scale of the compressed features; restores the spatial scale to that of the channel-attention features with a 1×1 convolution and multiplies the two; and finally adds the resulting spatial features to the original features to obtain the final attention features.
The image restoration unit 62 is further configured to:
the image inpainting model is the generator G of a trained Pixel2Pixel generative adversarial network; the Pixel2Pixel model adopts a U-Net segmentation network as the generator G.
In an exemplary embodiment, the processing units of the apparatus for removing text-like noise in natural images according to an embodiment of the present application may be implemented by one or more central processing units (CPUs), graphics processing units (GPUs), baseband processors (BPs), application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components.
In the embodiment of the present disclosure, the specific manner in which each processing unit in the apparatus for removing text noise in natural images shown in fig. 6 performs operations has been described in detail in the embodiment related to the method, and will not be described in detail here.
The embodiment of the application also describes a storage medium, on which an executable program is stored, and the executable program realizes the steps of the method for removing the character noise in the natural image.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention. The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (13)

1. A method for removing text noise in a natural image, characterized by comprising the following steps:
detecting a region containing text elements in an image to be repaired according to an image semantic segmentation network, and taking the segmented and recognized region as a mask of a region to be repaired;
repairing the region containing text elements in the image to be repaired by using the mask of the region to be repaired according to an image restoration model; wherein the image restoration model is a generator of a generative adversarial network.
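The two-stage pipeline of claim 1 can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: `segment` and `inpaint` are placeholders standing in for the trained semantic segmentation network and the GAN generator, and the `threshold` parameter is an assumption.

```python
import numpy as np

def remove_text_noise(image, segment, inpaint, threshold=0.5):
    """Two-stage pipeline: segment text pixels, then inpaint them.

    image:   (H, W, 3) array.
    segment: callable returning a per-pixel text probability map (H, W).
    inpaint: callable returning a repaired image of the same shape as `image`.
    """
    # Stage 1: per-pixel text probability from the segmentation network,
    # thresholded into a binary mask of the region to be repaired.
    prob = segment(image)                       # (H, W) in [0, 1]
    mask = (prob > threshold).astype(np.uint8)  # 1 where text noise is detected

    # Stage 2: the generator repairs the image; only the masked region is
    # replaced, unmasked pixels are copied through from the original.
    repaired = inpaint(image, mask)
    out = image * (1 - mask[..., None]) + repaired * mask[..., None]
    return out, mask
```

In the claimed method the mask may additionally be corrected interactively by the user (claim 2) before the restoration stage runs.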
2. The method according to claim 1, wherein the detecting a region containing text elements in the image to be repaired according to the image semantic segmentation network and taking the segmentation recognition result as the mask of the region to be repaired further comprises:
after detecting the region containing text elements in the image to be repaired according to the image semantic segmentation network, determining whether a user selects a manual interaction mode to repair the image to be repaired; if so, receiving corrections made by the user to the region to be repaired through deletion, modification, and addition operations; otherwise, automatically using the segmentation recognition result as the mask of the region to be repaired.
3. The method according to claim 1, wherein the image semantic segmentation network is a U-shaped skip-connection network structure of a U-Net segmentation network; and an atrous spatial pyramid pooling (ASPP) network is added on the basis of the U-Net to extract and fuse multi-scale context features.
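The atrous (dilated) convolutions that an ASPP module runs in parallel can be illustrated with the following toy NumPy sketch, which applies one kernel at several dilation rates and stacks the results. A real ASPP fuses learned per-branch filters inside the network; the function names and the default rate set are illustrative assumptions.

```python
import numpy as np

def atrous_conv3x3(x, kernel, rate):
    """3x3 dilated ('atrous') convolution on a 2-D feature map.

    Zero-padding by `rate` keeps the output the same size as the input;
    the dilation rate controls the gap between sampled positions, so a
    larger rate widens the receptive field without extra parameters.
    """
    h, w = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            # 3x3 samples spaced `rate` apart, centered on pixel (i, j)
            patch = xp[i:i + 2 * rate + 1:rate, j:j + 2 * rate + 1:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def aspp(x, kernel, rates=(1, 6, 12, 18)):
    """Toy ASPP: the same map convolved at several dilation rates, with
    the resulting multi-scale features stacked for later fusion."""
    return np.stack([atrous_conv3x3(x, kernel, r) for r in rates])
```

With an identity kernel every rate reproduces the input, which makes the receptive-field bookkeeping easy to check.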
4. The method of claim 3, further comprising:
adding an attention mechanism to enhance the feature characterization capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign a weight to each channel, and uses a spatial attention module to assign spatial feature weights.
5. The method of claim 4, further comprising:
the channel attention module performs global pooling on the feature map of each channel to obtain global information; the weight of each channel is learned by two fully-connected layers and multiplied with the initial features;
the spatial attention module compresses the number of channels of the feature map by a 1×1 convolution; the spatial features are pooled to 4 different scales by adaptive pooling; the pooled features of the 4 scales are concatenated and input into two fully-connected layers to learn different local weights of the spatial features; the learned weight parameters are reshaped to the scale of the compressed features; the spatial weights are restored to the spatial size of the channel attention features by a 1×1 convolution and multiplied with the channel attention features; and the obtained spatial features are added to the original features to obtain the final attention features.
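The channel attention step of claim 5 (global pooling, two fully-connected layers, per-channel reweighting) can be sketched as follows. This is a minimal NumPy illustration with hypothetical parameter names, not the trained module; the ReLU/sigmoid pairing is an assumption in line with common channel-attention designs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feats, w1, b1, w2, b2):
    """Channel attention as described in the claim.

    feats: (C, H, W) feature maps.
    w1, b1 / w2, b2: parameters of the two fully-connected layers.
    """
    squeeze = feats.mean(axis=(1, 2))            # (C,) global average pooling
    hidden = np.maximum(0.0, w1 @ squeeze + b1)  # first FC layer + ReLU
    weights = sigmoid(w2 @ hidden + b2)          # second FC layer -> weights in (0, 1)
    return feats * weights[:, None, None]        # reweight each channel
```

The spatial attention module of the claim follows the same squeeze-and-reweight idea, but over 4 adaptively pooled spatial scales instead of over channels.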
6. The method of claim 1, further comprising:
the image restoration model is a generator G for generating a confrontation network model for a trained Pixel2Pixel, and the image restoration model adopts a U-Net segmentation network model as the generator G.
7. An apparatus for removing text noise in a natural image, the apparatus comprising:
a detection and mask generation unit, configured to detect a region containing text elements in an image to be repaired according to an image semantic segmentation network and take the segmented and recognized region as a mask of a region to be repaired;
an image restoration unit, configured to repair the region containing text elements in the image to be repaired by using the mask of the region to be repaired according to an image restoration model; wherein the image restoration model is a generator of a generative adversarial network.
8. The apparatus of claim 7, further comprising:
a manual interaction unit, configured to determine, after the detection and mask generation unit detects the region containing text elements in the image to be repaired as the region to be repaired according to the image semantic segmentation network, whether a user selects a manual interaction mode to repair the image to be repaired; if so, receive corrections made by the user to the region to be repaired through deletion, modification, and addition operations; otherwise, notify the detection and mask generation unit to automatically use the segmentation recognition result as the mask of the region to be repaired.
9. The apparatus according to claim 7, wherein the image semantic segmentation network in the detection and mask generation unit is a U-shaped skip-connection network structure of a U-Net segmentation network; and an atrous spatial pyramid pooling (ASPP) network is added on the basis of the U-Net to extract and fuse multi-scale context features.
10. The apparatus of claim 9, wherein the detection and mask generation unit is further configured to:
adding an attention mechanism to enhance the feature characterization capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign a weight to each channel, and uses a spatial attention module to assign spatial feature weights.
11. The apparatus of claim 10, wherein the detection and mask generation unit is further configured to:
the channel attention module performs global pooling on the feature map of each channel to obtain global information; the weight of each channel is learned by two fully-connected layers and multiplied with the initial features;
the spatial attention module compresses the number of channels of the feature map by a 1×1 convolution; the spatial features are pooled to 4 different scales by adaptive pooling; the pooled features of the 4 scales are concatenated and input into two fully-connected layers to learn different local weights of the spatial features; the learned weight parameters are reshaped to the scale of the compressed features; the spatial weights are restored to the spatial size of the channel attention features by a 1×1 convolution and multiplied with the channel attention features; and the obtained spatial features are added to the original features to obtain the final attention features.
12. The apparatus of claim 7, wherein the image inpainting unit is further configured to:
the image restoration model is a generator G of a trained Pixel2Pixel generative adversarial network model; the Pixel2Pixel generative adversarial network model adopts a U-Net segmentation network model as the generator G.
13. A storage medium having stored thereon an executable program which, when executed by a processor, performs the steps of the method of removing text noise in natural images according to any one of claims 1 to 6.
CN202110172477.8A 2021-02-08 2021-02-08 Method and device for removing text noise in natural image and storage medium Active CN112801911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172477.8A CN112801911B (en) 2021-02-08 2021-02-08 Method and device for removing text noise in natural image and storage medium

Publications (2)

Publication Number Publication Date
CN112801911A (en) 2021-05-14
CN112801911B CN112801911B (en) 2024-03-26

Family

ID=75814802


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627389A (en) * 2022-03-23 2022-06-14 中国科学院空天信息创新研究院 Raft culture area extraction method based on multi-temporal optical remote sensing image
WO2023122955A1 (en) * 2021-12-28 2023-07-06 华为技术有限公司 Image processing method and apparatus, and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN107609560A (en) * 2017-09-27 2018-01-19 北京小米移动软件有限公司 Character recognition method and device
CN108805840A (en) * 2018-06-11 2018-11-13 Oppo(重庆)智能科技有限公司 Method, apparatus, terminal and the computer readable storage medium of image denoising
CN109359550A (en) * 2018-09-20 2019-02-19 大连民族大学 Language of the Manchus document seal Abstraction and minimizing technology based on depth learning technology
CN109583449A (en) * 2018-10-29 2019-04-05 深圳市华尊科技股份有限公司 Character identifying method and Related product
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
WO2019238560A1 (en) * 2018-06-12 2019-12-19 Tomtom Global Content B.V. Generative adversarial networks for image segmentation
CN110738207A (en) * 2019-09-10 2020-01-31 西南交通大学 character detection method for fusing character area edge information in character image
CN110956579A (en) * 2019-11-27 2020-04-03 中山大学 Text image rewriting method based on semantic segmentation graph generation
CN111080723A (en) * 2019-12-17 2020-04-28 易诚高科(大连)科技有限公司 Image element segmentation method based on Unet network
CN111160352A (en) * 2019-12-27 2020-05-15 创新奇智(北京)科技有限公司 Workpiece metal surface character recognition method and system based on image segmentation
CN111199550A (en) * 2020-04-09 2020-05-26 腾讯科技(深圳)有限公司 Training method, segmentation method, device and storage medium of image segmentation network
WO2020219915A1 (en) * 2019-04-24 2020-10-29 University Of Virginia Patent Foundation Denoising magnetic resonance images using unsupervised deep convolutional neural networks


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINYI WU: "Semantic Prior Based Generative Adversarial Network for Video Super-Resolution", 2019 IEEE 16TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2019), 11 July 2019 (2019-07-11) *
艾合麦提江・麦提托合提; 艾斯卡尔・艾木都拉; 阿布都萨拉木・达吾提: "A Survey of Scene Text Detection and Recognition Based on Deep Learning" (基于深度学习的场景文字检测与识别综述), 电视技术, no. 14 *
陈锟; 乔沁; 宋志坚: "Applications of Generative Adversarial Networks in Medical Image Processing" (生成对抗网络在医学图像处理中的应用), 生命科学仪器, no. 1, 25 October 2018 (2018-10-25) *



Similar Documents

Publication Publication Date Title
US11250548B2 (en) Digital image completion using deep learning
CN110163198B (en) Table identification reconstruction method and device and storage medium
CN108401112B (en) Image processing method, device, terminal and storage medium
CN113379775A (en) Generating a colorized image based on interactive color edges using a colorized neural network
CN108710893B (en) Digital image camera source model classification method based on feature fusion
US20220262009A1 (en) Generating refined alpha mattes utilizing guidance masks and a progressive refinement network
CN112801911A (en) Method and device for removing Chinese character noise in natural image and storage medium
CN114049280A (en) Image erasing and repairing method and device, equipment, medium and product thereof
CN113160062A (en) Infrared image target detection method, device, equipment and storage medium
CN116645592B (en) Crack detection method based on image processing and storage medium
CN110781980A (en) Training method of target detection model, target detection method and device
CN111461211B (en) Feature extraction method for lightweight target detection and corresponding detection method
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
KR102430743B1 (en) Apparatus and method for developing object analysis model based on data augmentation
CN115967823A (en) Video cover generation method and device, electronic equipment and readable medium
CN116167910B (en) Text editing method, text editing device, computer equipment and computer readable storage medium
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN110674721A (en) Method for automatically detecting test paper layout formula
CN112822393B (en) Image processing method and device and electronic equipment
US11869125B2 (en) Generating composite images with objects from different times
CN113111804A (en) Face detection method and device, electronic equipment and storage medium
GB2567723A (en) Digital image completion using deep learning
US20230055204A1 (en) Generating colorized digital images utilizing a re-colorization neural network with local hints
US20230132180A1 (en) Upsampling and refining segmentation masks
CN115700728A (en) Image processing method and image processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant