CN112801911B - Method and device for removing text noise in natural image and storage medium - Google Patents


Info

Publication number
CN112801911B
CN112801911B (application CN202110172477.8A)
Authority
CN
China
Prior art keywords
image
repaired
area
mask
network
Prior art date
Legal status
Active
Application number
CN202110172477.8A
Other languages
Chinese (zh)
Other versions
CN112801911A (en)
Inventor
王波
张百灵
崔嵬
Current Assignee
Suzhou Changzuichu Software Co ltd
Original Assignee
Suzhou Changzuichu Software Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Changzuichu Software Co ltd filed Critical Suzhou Changzuichu Software Co ltd
Priority to CN202110172477.8A priority Critical patent/CN112801911B/en
Publication of CN112801911A publication Critical patent/CN112801911A/en
Application granted granted Critical
Publication of CN112801911B publication Critical patent/CN112801911B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30176 Document

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Character Input (AREA)

Abstract

The application discloses a method, a device, and a storage medium for removing text noise in natural images. The method comprises the following steps: an image semantic segmentation network detects the regions containing text elements in the image to be repaired, and the segmentation result is taken as the mask of the region to be repaired; according to an image restoration model, the regions containing text elements in the image to be repaired are restored using the mask of the region to be repaired, where the image restoration model is the generator of a generative adversarial network. With this method and device, text regions commonly found in the image to be repaired can be detected quickly and automatically, text noise elements in the natural image can be removed selectively and automatically, and the regions to be repaired can be corrected through manual interaction. Because the restoration method is based on a generative adversarial network, the restored image is more natural and lifelike.

Description

Method and device for removing text noise in natural image and storage medium
Technical Field
The embodiments of the present application relate to the technical field of image classification, and in particular to a method and device for removing text noise in natural images, and a storage medium.
Background
In recent years, with the advent of the big-data era and advances in computer hardware, artificial intelligence has become increasingly common in daily life. Deep learning is widely applied in computer vision, where image recognition is among the most widely deployed technologies, for example photo-based recognition, face recognition, traffic-sign recognition, gesture recognition, and garbage classification. These techniques find corresponding applications in e-commerce, the automotive industry, gaming, and manufacturing.
Due to human factors, images often carry overlaid elements such as text. These text elements spoil the appearance of the image, hinder its reuse, and reduce its preservation value and quality. Many application scenarios therefore require removing the text elements from natural-scene images to obtain a clean image. However, text elements in natural images come in varied styles and uneven distributions, such as handwriting, subtitles, watermarks, and scratches, all of which increase the difficulty of removal. Existing mainstream removal methods generally require manually annotating the text mask regions before performing image restoration; they suffer from poor restored-image quality that does not blend with the characteristics of the natural image, and they are time-consuming and labor-intensive.
On the other hand, conventional diffusion-based image restoration methods use the edge information of the region to be restored to determine the direction of diffusion and diffuse known information inward from the edges. Images restored this way are unnatural, blurred, and lack texture detail, and large defect regions cannot be restored. Other traditional methods share similar problems: complex processing pipelines, heavy computation, and poor generalization.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method and apparatus for removing text noise in natural images, and a storage medium.
According to a first aspect of the present application, there is provided a method for removing text noise in a natural image, including:
detecting the regions containing text elements in the image to be repaired using an image semantic segmentation network, and taking the segmentation result as the mask of the region to be repaired;
restoring the regions containing text elements in the image to be repaired with the mask of the region to be repaired, according to an image restoration model; the image restoration model is the generator of a generative adversarial network.
As an implementation, detecting the regions containing text elements in the image to be repaired using the image semantic segmentation network and taking the segmentation result as the mask of the region to be repaired further includes:
after detecting the regions containing text elements in the image to be repaired using the image semantic segmentation network, determining whether the user chooses manual interaction to repair the image; if so, receiving the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, taking the segmentation result directly as the mask of the region to be repaired.
As an implementation, the image semantic segmentation network adopts the U-shaped skip-connection structure of the U-Net segmentation network, with an atrous spatial pyramid pooling (ASPP) module added on top of U-Net to extract and fuse multi-scale context features.
As an implementation, the method further includes:
adding an attention mechanism to enhance the feature representation capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign weights to the individual channels, and a spatial attention module to assign spatial feature weights.
As an implementation, the method further includes:
the channel attention module applies global pooling to each channel's feature map to obtain global information, learns the weight of each channel with two fully connected layers, and multiplies the weights into the initial features;
the spatial attention module first compresses the channel count of the resulting feature map with a 1×1 convolution; adaptive pooling then normalizes the spatial features to 4 different scales; the 4 pooled scales are concatenated, resized, and fed into two fully connected layers to learn different local weights of the spatial features; the learned weight parameters are resized back to the scale of the compressed features; a 1×1 convolution restores the spatial parameters to the spatial size of the channel attention features, which are then multiplied in; finally, the resulting spatial features are added to the original features to obtain the final attention features.
As an implementation, the method further includes:
the image restoration model is the generator G of a trained Pixel2Pixel generative adversarial network model; the Pixel2Pixel model adopts a U-Net segmentation network model as the generator G.
According to a second aspect of the present application, there is provided a device for removing text noise in natural images, including:
a detection and mask generation unit, configured to detect the regions containing text elements in the image to be repaired using the image semantic segmentation network, and to take the segmentation result as the mask of the region to be repaired;
an image restoration unit, configured to restore the regions containing text elements in the image to be repaired with the mask of the region to be repaired, according to an image restoration model; the image restoration model is the generator of a generative adversarial network.
As an implementation, the apparatus further includes:
a manual interaction unit, configured to determine, after the detection and mask generation unit detects the regions containing text elements in the image to be repaired using the image semantic segmentation network, whether the user chooses manual interaction to repair the image; if so, to receive the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, to notify the detection and mask generation unit to take the segmentation result as the mask of the region to be repaired.
As an implementation, the image semantic segmentation network in the detection and mask generation unit adopts the U-shaped skip-connection structure of the U-Net segmentation network, with an atrous spatial pyramid pooling (ASPP) module added on top of U-Net to extract and fuse multi-scale context features.
As an implementation, the detecting and mask generating unit is further configured to:
adding an attention mechanism to enhance the feature representation capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign weights to the individual channels, and a spatial attention module to assign spatial feature weights.
As an implementation, the detecting and mask generating unit is further configured to:
the channel attention module applies global pooling to each channel's feature map to obtain global information, learns the weight of each channel with two fully connected layers, and multiplies the weights into the initial features;
the spatial attention module first compresses the channel count of the resulting feature map with a 1×1 convolution; adaptive pooling then normalizes the spatial features to 4 different scales; the 4 pooled scales are concatenated, resized, and fed into two fully connected layers to learn different local weights of the spatial features; the learned weight parameters are resized back to the scale of the compressed features; a 1×1 convolution restores the spatial parameters to the spatial size of the channel attention features, which are then multiplied in; finally, the resulting spatial features are added to the original features to obtain the final attention features.
As an implementation, the image restoration unit is further configured to:
the image restoration model is the generator G of a trained Pixel2Pixel generative adversarial network model; the Pixel2Pixel model adopts a U-Net segmentation network model as the generator G.
According to a third aspect of the present application, there is provided a storage medium storing an executable program which, when executed by a processor, implements the steps of the method for removing text noise in natural images.
With the method, device, and storage medium for removing text noise in natural images, the regions containing text elements in the image to be repaired are detected by the image semantic segmentation network, and the segmentation result is taken as the mask of the region to be repaired; according to the image restoration model, the regions containing text elements are restored using the mask, where the image restoration model is the generator of a generative adversarial network. Text regions commonly found in the image to be repaired can thus be detected quickly and automatically, text noise elements in the natural image can be removed selectively and automatically, and the regions to be repaired can be corrected through manual interaction. Because the restoration method is based on a generative adversarial network, the restored image is more natural and lifelike.
Drawings
Fig. 1 is a schematic flow chart of a method for removing text noise in a natural image according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a semantic segmentation model according to an embodiment of the present application;
FIG. 3 is a flowchart of a specific example of a method for removing text noise in natural images according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an attention module structure according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Pixel2Pixel model training architecture provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a composition structure of a device for removing text noise in a natural image according to an embodiment of the present application.
Detailed Description
The following describes in detail the essence of the technical solution of the embodiments of the present application with reference to examples.
With the rise of deep learning, deep convolutional neural networks can readily detect text in document or natural-scene images and locate the text regions. Mainstream deep-learning text detection methods fall into two families: object detection and semantic segmentation. Compared with object detection algorithms, whose precision is limited to regressed rectangular boxes, semantic segmentation methods recognize at the pixel level, localize more accurately, impose no strict requirement on text orientation, and fit the contour of the text region more closely. Mainstream semantic segmentation architectures are encoder-decoders, such as the FCN, U-Net, and DeepLab families of segmentation models.
Deep-learning image restoration methods based on generative adversarial networks (GANs) can learn rich semantic information from large-scale datasets and then fill in the missing content of an image end to end; the restored images are more natural and lifelike, achieving a better restoration effect.
The embodiments of the present application combine recent semantic segmentation and image restoration techniques: text regions in the natural image are obtained by semantic segmentation, a manual interaction mechanism is incorporated, and finally the natural image is restored using a generative adversarial network. For different application scenarios, image restoration combines two decision mechanisms for the text regions, automatic selection and manual interaction; the method is convenient to use, light on labor, and produces natural, lifelike restored images.
Fig. 1 is a schematic flow chart of a method for removing text noise in a natural image according to an embodiment of the present application, as shown in fig. 1, where the method for removing text noise in a natural image according to an embodiment of the present application includes the following processing steps:
Step 101: detect the regions containing text elements in the image to be repaired using the image semantic segmentation network, and take the segmentation result as the mask of the region to be repaired.
In the embodiment of the application, after the regions containing text elements in the image to be repaired are detected by the image semantic segmentation network, it is determined whether the user chooses manual interaction to repair the image; if so, the user's corrections to the region to be repaired are received through delete, modify, and add operations; otherwise, the segmentation result is taken directly as the mask of the region to be repaired.
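The decision flow above can be sketched as simple mask operations. This is an illustrative sketch only: the helper names, the 0.5 threshold, and the rectangular edit regions are assumptions, not specified by the patent.

```python
import numpy as np

def segmentation_to_mask(prob_map, threshold=0.5):
    """Binarize the segmentation network's probability map into a repair mask.
    (Hypothetical helper; the threshold value is an assumption.)"""
    return (prob_map >= threshold).astype(np.uint8)

def delete_region(mask, y0, y1, x0, x1):
    """Interactive 'delete' correction: clear a falsely detected region."""
    mask = mask.copy()
    mask[y0:y1, x0:x1] = 0
    return mask

def add_region(mask, y0, y1, x0, x1):
    """Interactive 'add' correction: mark a missed text region for repair."""
    mask = mask.copy()
    mask[y0:y1, x0:x1] = 1
    return mask

prob = np.zeros((8, 8)); prob[2:4, 2:6] = 0.9  # stand-in network output
mask = segmentation_to_mask(prob)              # automatic mask
mask = delete_region(mask, 2, 4, 4, 6)         # user removes a false positive
mask = add_region(mask, 6, 8, 0, 2)            # user adds a missed patch
```

A "modify" operation would combine the two edits; with no interaction, the binarized mask is used as-is.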
In the embodiment of the application, the image semantic segmentation network adopts the U-shaped skip-connection structure of the U-Net segmentation network, with an atrous spatial pyramid pooling (ASPP) module added on top of U-Net to extract and fuse multi-scale context features.
The improved semantic segmentation model of the embodiment is shown in fig. 2; the overall U-Net structure resembles a large letter U. The input is first downsampled; deconvolution then upsamples it while fusing in the corresponding earlier (shallower) layers; the result is upsampled again. This process repeats to produce the network output.
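The downsample-upsample-fuse cycle can be illustrated at the shape level. In this sketch, max pooling and nearest-neighbour upsampling stand in for U-Net's convolution and deconvolution layers, which are omitted; the sizes are illustrative.

```python
import numpy as np

def downsample(x):
    """2x2 max pooling halves the spatial size (convs omitted for brevity)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling standing in for deconvolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_fuse(up, skip):
    """U-Net skip connection: concatenate decoder features with the
    same-resolution encoder features along the channel axis."""
    return np.concatenate([up, skip], axis=-1)

x  = np.random.rand(32, 32, 16)        # encoder feature map
d1 = downsample(x)                     # 16x16 encoder features
d2 = downsample(d1)                    # 8x8 bottleneck
u1 = skip_fuse(upsample(d2), d1)       # 16x16, channels doubled by the skip
u2 = skip_fuse(upsample(u1[..., :16]), x)  # back to 32x32 with input-level skip
```

The channel doubling at each fusion is why U-Net decoders follow each skip with convolutions that mix the concatenated features.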
In the embodiment of the application, atrous spatial pyramid pooling (ASPP) applies atrous (dilated) convolutions to the given input in parallel at different sampling rates, which is equivalent to capturing the context of the image at multiple scales.
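The idea can be sketched in one dimension; the 2-D ASPP module works the same way along each spatial axis. The kernel and the rates (1, 2, 4) here are illustrative assumptions, not the patent's actual configuration.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """'Same'-padded 1-D atrous convolution: kernel taps are spaced
    `rate` samples apart, enlarging the receptive field without
    adding parameters."""
    k = len(kernel)
    span = rate * (k - 1)
    xp = np.pad(x, (span // 2, span - span // 2))
    return np.array([sum(kernel[j] * xp[i + j * rate] for j in range(k))
                     for i in range(len(x))])

x = np.arange(8, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
# ASPP: the same input is convolved in parallel at several rates,
# then the branch outputs are fused (here simply stacked).
branches = [dilated_conv1d(x, kernel, r) for r in (1, 2, 4)]
aspp_out = np.stack(branches)  # multi-scale context features
```

Each branch sees a wider neighbourhood (3, 5, and 9 samples respectively) with the same 3-tap kernel, which is exactly the multi-scale context fusion ASPP provides.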
In the embodiment of the application, the method further adds an attention mechanism to enhance the feature representation capability of the image semantic segmentation network; the attention mechanism uses a channel attention module to assign weights to the individual channels, and a spatial attention module to assign spatial feature weights.
In the embodiment of the application, the channel attention module applies global pooling to each channel's feature map to obtain global information, learns the weight of each channel with two fully connected layers, and multiplies the weights into the initial features;
the spatial attention module first compresses the channel count of the resulting feature map with a 1×1 convolution; adaptive pooling then normalizes the spatial features to 4 different scales; the 4 pooled scales are concatenated, resized, and fed into two fully connected layers to learn different local weights of the spatial features; the learned weight parameters are resized back to the scale of the compressed features; a 1×1 convolution restores the spatial parameters to the spatial size of the channel attention features, which are then multiplied in; finally, the resulting spatial features are added to the original features to obtain the final attention features.
Step 102: restore the regions containing text elements in the image to be repaired with the mask of the region to be repaired, according to the image restoration model; the image restoration model is the generator of a generative adversarial network.
In this embodiment, the image restoration model (image restoration module) is the generator G of a trained Pixel2Pixel generative adversarial network model; the Pixel2Pixel model adopts a U-Net segmentation network model as the generator G.
After the repair-region mask is generated, the selected region is repaired by the image restoration module, which uses the generator G of the trained Pixel2Pixel model to synthesize a realistic natural image. Pixel2Pixel is a generative adversarial network whose training inputs are image pairs; it consists mainly of a generator G and a discriminator D. To improve image detail and preserve information at different scales, a U-Net model is adopted as the generator G.
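The training objective of such a paired-image GAN can be sketched as follows. The patent does not give the losses; the non-saturating log terms and the lambda = 100 weight on the L1 term follow the original pix2pix formulation and are assumptions here.

```python
import numpy as np

def pixel2pixel_g_loss(d_fake, g_out, target, lam=100.0):
    """Generator objective of a Pixel2Pixel-style GAN (a sketch):
    fool the discriminator while staying close to the paired
    ground-truth image in L1."""
    adv = -np.mean(np.log(d_fake + 1e-8))   # non-saturating adversarial term
    l1 = np.mean(np.abs(g_out - target))    # per-pixel fidelity term
    return adv + lam * l1

def pixel2pixel_d_loss(d_real, d_fake):
    """Discriminator objective: score real pairs toward 1, generated pairs
    toward 0."""
    return (-np.mean(np.log(d_real + 1e-8))
            - np.mean(np.log(1.0 - d_fake + 1e-8)))
```

The L1 term is what pushes the U-Net generator toward the coarse structure of the target, while the adversarial term supplies the texture detail that makes the repaired region look natural.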
Embodiments of the present application are described in further detail below in conjunction with specific examples.
The embodiments are described taking a natural image as an example; it should be noted that other pictures or images, such as screenshots and pictures containing text, can also use the technical means of the embodiments of the present application.
Fig. 3 is a flowchart of a specific example of a method for removing text noise in a natural image according to an embodiment of the present application, where specific steps are as follows:
first, the user loads the image to be repaired. And automatically detecting the region containing the literal element in the natural image by a literal element detection module. The character detection module adopts a trained image semantic segmentation network to detect character areas, and takes segmentation recognition results as masks of the areas to be repaired. The semantic segmentation network model refers to a U-shaped jump layer connection network structure of a classical segmentation network U-Net. Aiming at the character characteristics, an ASPP module is added on the basis of the original U-Net to extract and fuse multi-scale context characteristics, and further a new attention mechanism is provided to enhance the characteristic characterization capability of the network, and the overall structure of the model is shown in figure 2.
In particular, the attention mechanism enhances both channel and spatial features. It first applies a channel attention module, whose main function is to assign weights to the individual channels, and then a spatial attention module to assign spatial feature weights. The channel attention module applies global pooling to each channel's feature map to obtain global information, learns the weight of each channel with two fully connected (fc) layers, and multiplies the weights into the initial features. On this basis, the spatial attention module first compresses the channel count of the new feature map with a 1×1 convolution to reduce computation. Adaptive pooling then normalizes the spatial features to 4 different scales, such as [1×1, 8×8, 16×16, 32×32], to summarize global or local statistics of the feature maps. Next, the pooled features of the 4 scales are concatenated and resized, then fed into two fully connected (fc) layers to learn different local weights of the spatial features; the learned weight parameters are resized to the size of the compressed features from the previous step. A 1×1 convolution then restores the spatial parameters to the spatial size of the channel attention features and multiplies them in. Finally, the newly obtained spatial features are added to the original features to obtain the final attention features. The attention module structure is shown in fig. 4.
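The channel-attention half of the module can be sketched as follows; the spatial half follows analogously with the pool-concat-fc steps described above. The weights here are randomly initialized purely for illustration, and the reduction ratio is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Channel attention as described: global-pool each channel, learn
    per-channel weights with two fc layers, multiply into the features.
    feat: (H, W, C); w1: (C, C//r), w2: (C//r, C) are the fc weights."""
    pooled = feat.mean(axis=(0, 1))                       # global pooling -> (C,)
    weights = sigmoid(np.maximum(pooled @ w1, 0.0) @ w2)  # fc - relu - fc - sigmoid
    return feat * weights                                 # reweight each channel

C, r = 16, 4
rng = np.random.default_rng(0)
feat = rng.random((8, 8, C))                 # non-negative input features
w1 = rng.standard_normal((C, C // r))
w2 = rng.standard_normal((C // r, C))
out = channel_attention(feat, w1, w2)
```

Because the learned weights pass through a sigmoid, each channel is scaled by a factor in (0, 1): informative channels are preserved, uninformative ones suppressed.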
Specifically, the system determines whether the user chooses to correct and modify U-Net's predicted region to be repaired through manual interaction. If manual interaction is required, the user can correct the region to be repaired by deleting, modifying, or adding before the final region mask is generated. If no manual interaction is used, the mask of the region to be repaired is generated directly from the predicted text region.
After the repair-region mask is generated, the selected region is repaired by the image restoration module, which uses the generator G of the trained Pixel2Pixel model to synthesize a realistic natural image. Pixel2Pixel is a generative adversarial network whose training inputs are image pairs; it consists mainly of a generator G and a discriminator D. To improve image detail and preserve information at different scales, a U-Net model is adopted as the generator G. The training architecture of Pixel2Pixel is shown in fig. 5.
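A common final step when applying a generator only inside a masked region is to composite its output with the untouched original pixels. The patent does not spell out this exact compositing, so treat the sketch below as an assumption about one reasonable implementation.

```python
import numpy as np

def composite_repair(image, mask, generated):
    """Keep the original pixels outside the repair mask and take the
    generator's output inside it. image, generated: (H, W, 3) in [0, 1];
    mask: (H, W) with 1 marking the region to be repaired."""
    m = mask[..., None].astype(float)          # (H, W) -> (H, W, 1) for broadcast
    return m * generated + (1.0 - m) * image   # per-pixel blend

img = np.ones((4, 4, 3))                       # stand-in original image
gen = np.zeros((4, 4, 3))                      # stand-in generator output
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1                             # repair the centre 2x2 patch
out = composite_repair(img, mask, gen)
```

This guarantees that only the detected text region changes, while the rest of the natural image is preserved bit-for-bit.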
The restored natural images are saved; once all images have been processed, the system exits.
Fig. 6 is a schematic diagram of a composition structure of a device for removing text noise in a natural image according to an embodiment of the present application, as shown in fig. 6, where the device for removing text noise in a natural image according to an embodiment of the present application includes:
a detection and mask generation unit 61, configured to detect the regions containing text elements in the image to be repaired using the image semantic segmentation network, and to take the segmentation result as the mask of the region to be repaired;
an image restoration unit 62, configured to restore the regions containing text elements in the image to be repaired with the mask of the region to be repaired, according to an image restoration model; the image restoration model is the generator of a generative adversarial network.
The apparatus further comprises:
a manual interaction unit 63, configured to determine, after the detection and mask generation unit 61 detects the regions containing text elements in the image to be repaired using the image semantic segmentation network, whether the user chooses manual interaction to repair the image; if so, to receive the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, to notify the detection and mask generation unit 61 to take the segmentation result as the mask of the region to be repaired.
The image semantic segmentation network in the detection and mask generation unit 61 adopts the U-shaped skip-connection structure of the U-Net segmentation network, with an atrous spatial pyramid pooling (ASPP) module added on top of U-Net to extract and fuse multi-scale context features.
The detection and mask generation unit 61 is further configured to:
adding an attention mechanism to enhance the feature representation capability of the image semantic segmentation network;
the attention mechanism uses a channel attention module to assign weights to the individual channels, and a spatial attention module to assign spatial feature weights.
The detection and mask generation unit 61 is further configured to:
the channel attention module applies global pooling to each channel's feature map to obtain global information, learns the weight of each channel with two fully connected layers, and multiplies the weights into the initial features;
the spatial attention module first compresses the channel count of the resulting feature map with a 1×1 convolution; adaptive pooling then normalizes the spatial features to 4 different scales; the 4 pooled scales are concatenated, resized, and fed into two fully connected layers to learn different local weights of the spatial features; the learned weight parameters are resized back to the scale of the compressed features; a 1×1 convolution restores the spatial parameters to the spatial size of the channel attention features, which are then multiplied in; finally, the resulting spatial features are added to the original features to obtain the final attention features.
The image restoration unit 62 is further configured to:
the image restoration model is the generator G of a trained Pixel2Pixel generative adversarial network model; the Pixel2Pixel model adopts a U-Net segmentation network model as the generator G.
In an exemplary embodiment, the processing units of the device for removing text noise in natural images described above may be implemented by one or more central processing units (CPUs), graphics processing units (GPUs), baseband processors (BPs), application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components.
In the embodiments of the present disclosure, the specific manner in which each processing unit of the device shown in fig. 6 performs its operations has been described in detail in the method embodiments and will not be elaborated here.
An embodiment of the present application further provides a storage medium storing an executable program which, when executed by a processor, implements the steps of the method for removing text noise in natural images.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention. The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only one logical-function division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
The foregoing is merely an embodiment of the present invention, but the scope of the present invention is not limited thereto; any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and such changes and substitutions are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A method for removing text noise in a natural image, the method comprising:
detecting a region containing text elements in an image to be repaired using an image semantic segmentation network, and using the segmentation recognition result as a mask of the region to be repaired;
repairing the region containing text elements in the image to be repaired according to an image restoration model, using the mask of the region to be repaired; wherein the image restoration model is a generator of a generative adversarial network.
2. The method according to claim 1, wherein the detecting the region containing text elements in the image to be repaired using the image semantic segmentation network and using the segmentation recognition result as the mask of the region to be repaired further comprises:
after detecting the region containing text elements in the image to be repaired using the image semantic segmentation network, determining whether a user has selected a manual interaction mode for repairing the image to be repaired; if so, receiving the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, automatically using the segmentation recognition result as the mask of the region to be repaired.
3. The method according to claim 1, wherein the image semantic segmentation network is a U-shaped skip-connection network structure based on a U-Net segmentation network; and an atrous spatial pyramid pooling (ASPP) network is added on the basis of U-Net to extract and fuse multi-scale context features.
4. A method according to claim 3, characterized in that the method further comprises:
adding an attention mechanism to enhance the feature representation capability of the image semantic segmentation network;
wherein the attention mechanism uses a channel attention module to assign a weight to each channel, and a spatial attention module to assign spatial feature weights.
5. The method according to claim 4, wherein the method further comprises:
the channel attention module performs global pooling on the feature map of each channel to obtain global information; the weight of each channel is learned by two fully connected layers and multiplied with the initial features;
the spatial attention module compresses the channel count of the input feature map with a 1×1 convolution operation; adaptive pooling normalizes the spatial features to 4 different scales; the 4 pooled features are concatenated and reshaped, then fed into two fully connected layers that learn weights for different local regions of the spatial features; the learned weight parameters are reshaped to the scale of the compressed features; the spatial weights are then restored to the spatial size of the channel attention features and multiplied with them via a 1×1 convolution; finally, the resulting spatial features are added to the original features to obtain the final attention features.
6. The method according to claim 1, wherein the method further comprises:
the image restoration model is the generator G of a trained Pixel2Pixel generative adversarial network model, and the Pixel2Pixel generative adversarial network model uses a U-Net segmentation network model as the generator G.
7. A device for removing text noise in natural images, the device comprising:
a detection and mask generation unit, configured to detect a region containing text elements in an image to be repaired using an image semantic segmentation network, and to use the segmentation recognition result as a mask of the region to be repaired;
an image restoration unit, configured to repair the region containing text elements in the image to be repaired according to an image restoration model, using the mask of the region to be repaired; wherein the image restoration model is a generator of a generative adversarial network.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a manual interaction unit, configured to determine, after the detection and mask generation unit detects the region containing text elements in the image to be repaired using the image semantic segmentation network, whether a user has selected a manual interaction mode for repairing the image to be repaired; if so, to receive the user's corrections to the region to be repaired through delete, modify, and add operations; otherwise, to notify the detection and mask generation unit to automatically use the segmentation recognition result as the mask of the region to be repaired.
9. The apparatus according to claim 7, wherein the image semantic segmentation network in the detection and mask generation unit is a U-shaped skip-connection network structure based on a U-Net segmentation network; and an atrous spatial pyramid pooling (ASPP) network is added on the basis of U-Net to extract and fuse multi-scale context features.
10. The apparatus of claim 9, wherein the detection and mask generation unit is further configured to:
adding an attention mechanism to enhance the feature representation capability of the image semantic segmentation network;
wherein the attention mechanism uses a channel attention module to assign a weight to each channel, and a spatial attention module to assign spatial feature weights.
11. The apparatus of claim 10, wherein the detection and mask generation unit is further configured to:
the channel attention module performs global pooling on the feature map of each channel to obtain global information; the weight of each channel is learned by two fully connected layers and multiplied with the initial features;
the spatial attention module compresses the channel count of the input feature map with a 1×1 convolution operation; adaptive pooling adjusts the spatial features to 4 different scales; the 4 pooled features are concatenated and reshaped, then fed into two fully connected layers that learn weights for different local regions of the spatial features; the learned weight parameters are restored to the scale of the compressed features; the spatial weights are then restored to the spatial size of the channel attention features and multiplied with them via a 1×1 convolution; finally, the resulting spatial features are added to the original features to obtain the final attention features.
12. The apparatus of claim 7, wherein the image restoration unit is further configured to:
the image restoration model is the generator G of a trained Pixel2Pixel generative adversarial network model; the Pixel2Pixel generative adversarial network model uses a U-Net segmentation network model as the generator G.
13. A storage medium having stored thereon an executable program which, when executed by a processor, performs the steps of the method for removing text noise in natural images according to any one of claims 1 to 6.
CN202110172477.8A 2021-02-08 2021-02-08 Method and device for removing text noise in natural image and storage medium Active CN112801911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172477.8A CN112801911B (en) 2021-02-08 2021-02-08 Method and device for removing text noise in natural image and storage medium

Publications (2)

Publication Number Publication Date
CN112801911A CN112801911A (en) 2021-05-14
CN112801911B true CN112801911B (en) 2024-03-26

Family

ID=75814802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172477.8A Active CN112801911B (en) 2021-02-08 2021-02-08 Method and device for removing text noise in natural image and storage medium

Country Status (1)

Country Link
CN (1) CN112801911B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116670683A (en) * 2021-12-28 2023-08-29 华为技术有限公司 Image processing method, device and storage medium
CN114627389B (en) * 2022-03-23 2023-01-31 中国科学院空天信息创新研究院 Raft culture area extraction method based on multi-temporal optical remote sensing image

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN107609560A (en) * 2017-09-27 2018-01-19 北京小米移动软件有限公司 Character recognition method and device
CN108805840A (en) * 2018-06-11 2018-11-13 Oppo(重庆)智能科技有限公司 Method, apparatus, terminal and the computer readable storage medium of image denoising
CN109359550A (en) * 2018-09-20 2019-02-19 大连民族大学 Language of the Manchus document seal Abstraction and minimizing technology based on depth learning technology
CN109583449A (en) * 2018-10-29 2019-04-05 深圳市华尊科技股份有限公司 Character identifying method and Related product
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
WO2019238560A1 (en) * 2018-06-12 2019-12-19 Tomtom Global Content B.V. Generative adversarial networks for image segmentation
CN110738207A (en) * 2019-09-10 2020-01-31 西南交通大学 character detection method for fusing character area edge information in character image
CN110956579A (en) * 2019-11-27 2020-04-03 中山大学 Text image rewriting method based on semantic segmentation graph generation
CN111080723A (en) * 2019-12-17 2020-04-28 易诚高科(大连)科技有限公司 Image element segmentation method based on Unet network
CN111160352A (en) * 2019-12-27 2020-05-15 创新奇智(北京)科技有限公司 Workpiece metal surface character recognition method and system based on image segmentation
CN111199550A (en) * 2020-04-09 2020-05-26 腾讯科技(深圳)有限公司 Training method, segmentation method, device and storage medium of image segmentation network
WO2020219915A1 (en) * 2019-04-24 2020-10-29 University Of Virginia Patent Foundation Denoising magnetic resonance images using unsupervised deep convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Semantic Prior Based Generative Adversarial Network for Video Super-Resolution; Xinyi Wu; 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019); 2019-07-11; full text *
A survey of scene text detection and recognition based on deep learning; 艾合麦提江・麦提托合提; 艾斯卡尔・艾木都拉; 阿布都萨拉木・达吾提; Video Engineering (电视技术), (14); full text *
Applications of generative adversarial networks in medical image processing; 陈锟; 乔沁; 宋志坚; Life Science Instruments (生命科学仪器), 2018-10-25 (Z1); full text *

Also Published As

Publication number Publication date
CN112801911A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN109670558B (en) Digital image completion using deep learning
CN112232349B (en) Model training method, image segmentation method and device
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
CN110414499A (en) Text position localization method and system and model training method and system
CN112801911B (en) Method and device for removing text noise in natural image and storage medium
CN111062903A (en) Automatic processing method and system for image watermark, electronic equipment and storage medium
CN109472193A (en) Method for detecting human face and device
CN110781980B (en) Training method of target detection model, target detection method and device
WO2021238420A1 (en) Image defogging method, terminal, and computer storage medium
CN113160062A (en) Infrared image target detection method, device, equipment and storage medium
CN112906794A (en) Target detection method, device, storage medium and terminal
CN108710893A (en) A kind of digital image cameras source model sorting technique of feature based fusion
CN116645592B (en) Crack detection method based on image processing and storage medium
CN110689495A (en) Image restoration method for deep learning
CN110310224A (en) Light efficiency rendering method and device
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN113824884A (en) Photographing method and apparatus, photographing device, and computer-readable storage medium
CN113468946A (en) Semantically consistent enhanced training data for traffic light detection
CN108520263A (en) A kind of recognition methods of panoramic picture, system and computer storage media
CN111951373B (en) Face image processing method and equipment
CN108810319A (en) Image processing apparatus and image processing method
CN116091784A (en) Target tracking method, device and storage medium
CN116167910A (en) Text editing method, text editing device, computer equipment and computer readable storage medium
CN113034432B (en) Product defect detection method, system, device and storage medium
CN113033645A (en) Multi-scale fusion depth image enhancement method and device for RGB-D image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant