CN115908639A - Transformer-based scene image character modification method and device, electronic equipment and storage medium - Google Patents

Transformer-based scene image character modification method and device, electronic equipment and storage medium

Info

Publication number
CN115908639A
CN115908639A
Authority
CN
China
Prior art keywords
image
style
character
scene
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211662804.9A
Other languages
Chinese (zh)
Inventor
艾孜麦提·艾尼瓦尔
杨雅婷
马博
董瑞
王磊
周喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang Technical Institute of Physics and Chemistry of CAS
Original Assignee
Xinjiang Technical Institute of Physics and Chemistry of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang Technical Institute of Physics and Chemistry of CAS filed Critical Xinjiang Technical Institute of Physics and Chemistry of CAS
Priority to CN202211662804.9A priority Critical patent/CN115908639A/en
Publication of CN115908639A publication Critical patent/CN115908639A/en
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Transformer-based method, device, equipment and storage medium for modifying text in scene images. An encoder and decoder based on a deep convolutional neural network extract the foreground style features of the original-style image and transfer foreground styles such as character font, character color and character shape onto the target text; the same encoder-decoder structure performs word-level scene text erasing on the original-style text image; and a text style fusion module deeply fuses the styled text image with the erased-text background image of the original image to obtain the finally modified scene text image. By using image processing technology based on deep neural networks, the invention alleviates the difficulty of collecting scene text recognition corpora for resource-scarce languages or specific domains and the lack of real-scene training samples, and improves the recognition of scene text in resource-scarce languages or specific domains.

Description

Transformer-based scene image text modification method and device, electronic device and storage medium
Technical Field
The invention relates to the field of information processing, in particular to scene text recognition and instant translation. Specifically, a Transformer-based scene image text modification method and apparatus, an electronic device and a storage medium are provided.
Background
Scene text modification refers to a technology for replacing the text in a scene text image with target text while keeping the text style and background of the original scene image. Scene image text modification has important application value in scene text recognition, instant translation, office automation and other fields. Because of the complex interaction between varied text styles and complex backgrounds in scene images, it is a very challenging task.
The existing data sets for scene text recognition are mainly in Chinese and English; scene text image data sets for other languages are very scarce or even absent. Manual labeling is time-consuming and labor-intensive, and labeling a text recognition data set requires annotators familiar with the corresponding language, which further increases the difficulty of the labeling task. Moreover, the data sets used for scene text recognition mainly come from scene text image synthesis tools, which randomly select fonts, font colors and an image background, simulate a natural scene and render text onto the background to obtain scene text images. However, images produced by such synthesis tools differ in distribution from real scene text images, so deep neural network models trained on synthetic data suffer in real-scene performance. In recent years, many image generation models have been proposed, such as generative adversarial networks, variational autoencoders and autoregressive models. These models show great power in generating realistic images. Compared with earlier image generation methods, existing models can generate data close to the real distribution by modeling the distribution of image data, making the generated images more realistic.
Disclosure of Invention
The invention aims to provide a Transformer-based scene image text modification method and device, an electronic device and a storage medium. The method comprises: an encoder and decoder network based on a convolutional neural network learns the global and local features of the scene image at each stage; a multi-head depth-separable convolution attention mechanism network extracts implicitly encoded attention features of global context information; a gated depth-separable feed-forward network learns local image features through element-wise products of two parallel linear transformation paths; the encoder and decoder based on a deep convolutional neural network extract the foreground style features of the original-style image and transfer foreground styles such as character font, character color and character shape onto the target text; the same encoder-decoder structure performs word-level scene text erasing on the original-style text image; and a text style fusion module deeply fuses the styled text image with the erased-text background image to obtain the finally modified scene text image. By using image processing technology based on deep neural networks, the invention alleviates the difficulty of collecting scene text recognition corpora for resource-scarce languages or specific domains and the lack of real-scene training samples, and improves the recognition of scene text in resource-scarce languages or specific domains.
The invention relates to a Transformer-based method for modifying scene image text, which comprises the following steps: scene image text style migration, scene image background erasing and scene image text fusion, specifically operated as follows:
a. respectively inputting the original-style image and the target text image into an encoder to obtain high-level semantic features of the images;
b. fusing the image semantic features obtained in step a through a 1x1 convolution network;
c. passing the features fused in step b through a decoder based on a deep convolutional neural network to obtain a target text image with the original image style;
d. passing the original-style image through an encoder and a decoder, and then through a 3x3 convolutional neural network, to obtain the word-level background image of the original-style text image;
e. extracting features from the target text image with the original image style obtained in step c and from the original background image obtained in step d by using an encoder and a decoder;
f. performing global feature extraction on the image features obtained in step e by using a Transformer block;
g. performing feature fusion by using a multi-head depth-separable convolution attention mechanism network and a gated depth-separable feed-forward network to obtain the finally modified scene text image.
In step a, the scene image text style migration takes the style image and the target text image as input and outputs the target text image rendered in the foreground style of the original image; the scene image text style specifically includes: the font, font color and shape of the scene image text; the encoder may include a 3x3 scale-invariant convolution, three down-sampling stages and 8 Transformer modules.
The decoder in step c may include 3 up-sampling stages, each followed by a different number of Transformer modules.
In step d, only the original-style text image is input, the background image of the original-style text image is output, and the original text regions are filled with appropriate background texture.
The encoder/decoder in step d and the encoder/decoder in step a are respectively characterized in that: in step a, the decoder concatenates the feature map before each down-sampling with the feature map after the corresponding up-sampling of the decoder along the feature map channels, 3 times in total; in step d, the decoder likewise concatenates the feature map before each down-sampling with the feature map after the corresponding up-sampling 3 times, and finally halves the number of feature map channels through a 1x1 convolution.
In step g, the inputs of the scene image text fusion are the foreground text image generated by the text style migration of step a and the original-style background image generated by the scene image background erasing of step d.
A Transformer-based scene image text modification device is composed of a scene image text style migration module, a scene image background erasing module and a scene image text fusion module, wherein:
scene image text style migration module: extracts foreground style features from the original-style image and transfers them to the target text;
scene image background erasing module: performs the word-level scene text erasing task on the original-style text image;
scene image text fusion module: fuses the outputs of the scene image text style migration module and the scene image background erasing module to generate the finally modified scene text image;
An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
The invention provides a Transformer-based scene image text modification method and device, an electronic device and a storage medium, wherein the method comprises: an encoder and decoder network based on a convolutional neural network learns the global and local features of the scene image at each stage; a multi-head depth-separable convolution attention mechanism network extracts implicitly encoded attention features of global context information; and a gated depth-separable feed-forward network learns local image features through element-wise products of two parallel linear transformation paths. A scene-image-oriented text modification device is also provided, which comprises:
the scene image text style migration module: extracts foreground style features from the original-style image and transfers them to the target text;
the scene image background erasing module: performs the word-level scene text erasing task on the original-style text image;
the scene image text fusion module: fuses the outputs of the scene image text style migration module and the scene image background erasing module to generate the finally modified scene text image;
according to still another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor;
at least one GPU computing card; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor or the at least one GPU computing card to enable the at least one processor or the at least one GPU computing card to perform the method of any of the examples of this application.
According to the Transformer-based scene image text modification method, device, electronic equipment and storage medium of the present application, training data are constructed for scene text recognition models of low-resource languages, improving the accuracy of scene text recognition.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
FIG. 1 is a flowchart of the Transformer-based scene image text modification method provided by the present invention;
FIG. 2 is a flow diagram of a text style migration module provided by the present invention;
FIG. 3 is a flow diagram of another text style migration module provided by the present invention;
FIG. 4 is a flow chart of an image background repair module provided by the present invention;
FIG. 5 is a flow chart of another image background restoration module provided by the present invention;
FIG. 6 is a flow chart of a text fusion module provided by the present invention;
FIG. 7 is a flow diagram of another text fusion module provided by the present invention;
FIG. 8 is a block flow diagram of a multi-headed depth separable convolution attention mechanism provided in accordance with the present invention;
FIG. 9 is a block diagram of a gated deep separable feed forward network according to the present invention.
FIG. 10 is a block diagram of an electronic device of the present invention.
Detailed Description
In order to more clearly describe the technical solutions of the embodiments of the present invention, the present invention is further described in detail below with reference to the accompanying drawings. Various details of the embodiments of the application are included to assist understanding and should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Examples
The invention relates to a Transformer-based method for modifying scene image text, which comprises the following steps: scene image text style migration, scene image background erasing and scene image text fusion, specifically carried out as follows:
a. respectively inputting the original-style image and the target text image into an encoder to obtain high-level semantic features of the images;
b. fusing the image semantic features obtained in step a through a 1x1 convolution network;
c. passing the features fused in step b through a decoder based on a deep convolutional neural network to obtain a target text image with the original image style;
d. passing the original-style image through an encoder and a decoder, and then through a 3x3 convolutional neural network, to obtain the word-level background image of the original-style text image;
e. extracting features from the target text image with the original image style obtained in step c and from the original background image obtained in step d by using an encoder and a decoder;
f. performing global feature extraction on the image features obtained in step e by using a Transformer block;
g. performing feature fusion by using a multi-head depth-separable convolution attention mechanism network and a gated depth-separable feed-forward network to obtain the finally modified scene text image; a high-level sketch of this three-stage pipeline is given below.
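The following is a minimal, illustrative sketch of how the three stages of steps a to g could be chained. The function and parameter names are hypothetical, and the three sub-networks are assumed to be built as in the modules detailed below; this is a sketch under those assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

def modify_scene_text(style_img: torch.Tensor,        # original-style scene text image, (B, 3, H, W)
                      target_text_img: torch.Tensor,  # rendered target text image, (B, 3, H, W)
                      style_net: nn.Module,           # steps a-c: text style migration module
                      erase_net: nn.Module,           # step d: scene image background erasing module
                      fuse_net: nn.Module) -> torch.Tensor:
    """Chain the three stages of the method (steps a-g)."""
    # steps a-c: transfer the foreground style of the original image onto the target text
    styled_text = style_net(target_text_img, style_img)   # O_t
    # step d: erase the original text and recover the word-level background image
    background = erase_net(style_img)                     # O_b
    # steps e-g: fuse the styled text with the recovered background
    return fuse_net(styled_text, background)              # O_f, the modified scene text image
```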
In step a, the scene image text style migration takes the style image and the target text image as input and outputs the target text image rendered in the foreground style of the original image; the scene image text style specifically includes: the font, font color and shape of the scene image text; the encoder may include a 3x3 scale-invariant convolution, three down-sampling stages and 8 Transformer modules.
The decoder in step c may include 3 up-sampling stages, each followed by a different number of Transformer modules.
In step d, only the original-style text image is input, the background image of the original-style text image is output, and the original text regions are filled with appropriate background texture.
The encoder/decoder in step d and the encoder/decoder in step a are respectively characterized in that: in step a, the decoder concatenates the feature map before each down-sampling with the feature map after the corresponding up-sampling of the decoder along the feature map channels, 3 times in total; in step d, the decoder likewise concatenates the feature map before each down-sampling with the feature map after the corresponding up-sampling 3 times, and finally halves the number of feature map channels through a 1x1 convolution.
In step g, the inputs of the scene image text fusion are the foreground text image generated by the text style migration of step a and the original-style background image generated by the scene image background erasing of step d.
A Transformer-based scene image text modification device is composed of a scene image text style migration module, a scene image background erasing module and a scene image text fusion module, wherein:
scene image text style migration module: extracts foreground style features from the original-style image and transfers them to the target text;
scene image background erasing module: performs the word-level scene text erasing task on the original-style text image;
scene image text fusion module: fuses the outputs of the scene image text style migration module and the scene image background erasing module to generate the finally modified scene text image;
An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6;
Fig. 1 is a flowchart of a Transformer-based scene image text modification method according to an embodiment of the present application. The method can be executed by a Transformer-based scene image text modification device, which can be implemented in software and/or hardware. Referring to fig. 1, the scene image text modification method provided in the embodiment of the present application includes:
A. Text style migration module:
In an embodiment, the specific method of the text style migration module, referring to fig. 2 and fig. 3, comprises the following specific steps:
The style image and the target text image are input to the module; the aim is to extract the foreground style features of the original-style image and transfer them to the target text;
Encoder: a 3x3 scale-invariant convolution expands the channels of the original image to 48, followed by three pixel-unshuffle down-samplings and 8 Transformer modules in total, with a different number of Transformer modules after each down-sampling;
Decoder: the Decoder performs 3 pixel-shuffle up-samplings, each followed by a different number of Transformer modules;
Working process: the Encoder and Decoder modules perform the corresponding computations. For the Encoder, the original image first passes through a 3x3 scale-invariant convolution that expands the channels to 48, then through three pixel-unshuffle down-samplings and 8 Transformer modules, where each down-sampling is followed by a different number of Transformer modules. The same Encoder operation is applied to the target text image. The output features of the two Encoders are concatenated along the depth (channel) dimension. Before the Decoder, a 1x1 convolution is applied to the Encoder output to halve the number of channels; the Decoder then performs 3 pixel-shuffle up-samplings, each followed by a different number of Transformer modules;
O_t = T_convert(I_t, I_s) (formula 1)
T_convert denotes the text style migration model, I_t is the input target text image, I_s is the input style text image, and O_t is the output of the text style migration model. The invention uses an L1 loss to supervise the output of the text style migration module; the text style migration loss is:
L_convert = ||T_t - O_t||_1 (formula 2)
T_t is the label of the style migration module.
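A minimal PyTorch-style sketch of the text style migration module described above. The channel schedule (48, 96, 192, 384), the per-stage Transformer block counts, and the simple convolutional stand-in for the Transformer block are all assumptions; the patent only fixes the 3x3 stem, three pixel-unshuffle down-samplings, 8 encoder Transformer modules, the depth-wise concatenation of the two encoder outputs, the channel-halving 1x1 convolution and the three pixel-shuffle up-samplings.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for the MDTA + GDFN Transformer block sketched later in this document."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x):
        return x + self.conv(x)

def encoder(blocks=(1, 2, 2, 3), ch=48):
    """3x3 conv expands the input to 48 channels, then three pixel-unshuffle
    down-samplings, each stage followed by Transformer blocks (8 in total);
    the per-stage split of the 8 blocks and the channel schedule are assumed."""
    layers, c = [nn.Conv2d(3, ch, 3, padding=1, bias=False)], ch
    for i, n in enumerate(blocks):
        layers += [Block(c) for _ in range(n)]
        if i < len(blocks) - 1:                      # 3 down-samplings in total
            layers += [nn.Conv2d(c, c // 2, 3, padding=1, bias=False), nn.PixelUnshuffle(2)]
            c *= 2                                   # 48 -> 96 -> 192 -> 384
    return nn.Sequential(*layers), c

def decoder(in_ch, blocks=(3, 2, 2)):
    """Three pixel-shuffle up-samplings, each followed by Transformer blocks,
    then a final convolution back to a 3-channel image."""
    layers, c = [], in_ch
    for n in blocks:
        layers += [nn.Conv2d(c, c * 2, 3, padding=1, bias=False), nn.PixelShuffle(2)]
        c //= 2
        layers += [Block(c) for _ in range(n)]
    layers += [nn.Conv2d(c, 3, 3, padding=1)]
    return nn.Sequential(*layers)

class StyleTransferNet(nn.Module):
    """Text style migration module: two encoders, depth-wise concatenation of
    their outputs, a 1x1 convolution that halves the channels, and a decoder."""
    def __init__(self):
        super().__init__()
        self.enc_style, c = encoder()
        self.enc_text, _ = encoder()
        self.fuse = nn.Conv2d(2 * c, c, 1, bias=False)   # halve the concatenated channels
        self.dec = decoder(c)
    def forward(self, target_text_img, style_img):       # O_t = T_convert(I_t, I_s)
        f = torch.cat([self.enc_style(style_img), self.enc_text(target_text_img)], dim=1)
        return self.dec(self.fuse(f))
```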
B. An image background restoration module:
The specific method of the image background restoration module is shown in fig. 4 and fig. 5. The specific steps are as follows:
Only the original-style text image is input; the output is the background image of the original-style text image, with the original text regions filled with appropriate background texture;
Encoder: a 3x3 scale-invariant convolution expands the channels of the original image to 48, followed by three pixel-unshuffle down-samplings and 8 Transformer modules in total, with a different number of Transformer modules after each down-sampling;
Decoder: the Decoder performs 3 pixel-shuffle up-samplings, each followed by a different number of Transformer modules;
Working process: the goal is to perform the word-level scene text erasing task. The input of the module is only the original-style text image, and the output is the background image of the original-style image, with the original text regions filled with appropriate background texture. The Encoder and Decoder modules are the same as in the text style migration module, except that the feature map before each down-sampling of the Encoder is concatenated, along the depth, with the feature map after the corresponding up-sampling of the Decoder, and the channels are halved by a 1x1 convolution after the last two concatenations. After the Decoder, the output channels are adjusted to 3 through a 3x3 scale-invariant convolution, the background image is output, and the background restoration loss is computed.
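A sketch of the decoder pass of the background erasing module, showing only the skip wiring described above: depth-wise concatenation with the saved encoder feature maps, 1x1 channel-halving convolutions where they are applied, and the final 3x3 convolution to a 3-channel background image. The function and argument names are hypothetical.

```python
import torch

def erase_decoder_forward(bottleneck, skips, up_stages, reduce_convs, to_rgb):
    """Decoder pass of the background erasing module (skip wiring only).
    `skips`: encoder feature maps saved before each of the 3 down-samplings,
    ordered from shallow to deep; `up_stages`: pixel-shuffle up-sampling stages
    with their Transformer blocks; `reduce_convs`: 1x1 convolutions that halve
    the channel count after a concatenation (None where no reduction is applied);
    `to_rgb`: the final 3x3 convolution producing the 3-channel background image."""
    x = bottleneck
    for up, skip, reduce in zip(up_stages, reversed(skips), reduce_convs):
        x = up(x)                            # pixel-shuffle up-sampling + Transformer blocks
        x = torch.cat([x, skip], dim=1)      # connect "according to depth" (channel axis)
        if reduce is not None:
            x = reduce(x)                    # 1x1 convolution halves the channels
    return to_rgb(x)                         # background image O_b
```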
The invention employs an L1 loss and a WGAN adversarial loss to supervise the output of the background erasing module; the image background restoration loss is:
L_B = sup_{||f||_L <= 1} ( E[f(T_b)] - E[f(O_b)] ) + α ||T_b - O_b||_1 (formula 3)
f is the 1-Lipschitz constraint function of the WGAN formulation, and the output of the discriminator is constrained to satisfy 1-Lipschitz continuity. T_b is the label of the background erasing module, D_B is the discriminator of the background erasing module, O_b is the output of the background erasing module generator, and α is the weight coefficient of the L1 loss, set to 10 in the invention.
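One possible reading of formula 3 as training code: separate critic and generator objectives combining the WGAN adversarial terms with the alpha-weighted L1 term (alpha = 10). The split into two alternating objectives is an assumption about how the loss is optimized, not something the patent states explicitly.

```python
import torch
import torch.nn as nn

def background_erase_losses(d_b: nn.Module, o_b: torch.Tensor, t_b: torch.Tensor,
                            alpha: float = 10.0):
    """Critic and generator objectives for the background erasing module
    (formula 3): WGAN adversarial terms plus the alpha-weighted L1 term."""
    l1 = torch.mean(torch.abs(t_b - o_b))
    # critic step: raise the score of real backgrounds, lower that of generated ones
    critic_loss = d_b(o_b.detach()).mean() - d_b(t_b).mean()
    # generator step: fool the critic while staying close to the ground truth
    generator_loss = -d_b(o_b).mean() + alpha * l1
    return critic_loss, generator_loss
```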
C. Text fusion module:
the specific method of the text fusion module, referring to fig. 6 and 7, includes the specific steps of:
inputting only original style character images, and outputting background images of the original style images, wherein the original character areas are filled with proper background textures;
an encoder: expanding a channel of an original image into 48 through a 3x3 scale invariant convolution, performing three times of pixel-unshuffle down-sampling, and 8 transform modules, wherein the transform modules with different numbers are used after each down-sampling;
a decoder: for Decoder, pixel-shuffle upsampling is carried out for 3 times, and each upsampling is carried out by a different number of transform modules;
the working process comprises the following steps: the method comprises the steps that an Encoder frame and a Decoder frame before a module continues, a foreground character image generated by a character style migration module is input, a feature graph and a Decoder feature graph of a background erasing module with the same size are connected according to depth in a Decoder stage, after the first two connections, a 1x1 convolution is adopted to reduce a channel by half, before the third connection, the feature graph of the background erasing module is subjected to 1x1 convolution to reduce the number of the channels by half and then connected according to the depth, after the Decoder module, the feature graph is input into a fine tuning module, the fine tuning module is formed by stacking four transform modules, 3x3 scale invariant convolution is used to change an output channel into 3, and an image after characters are modified is obtained.
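A small sketch of the fine-tuning tail of the text fusion module described above: four stacked Transformer blocks followed by a 3x3 convolution to a 3-channel output. The class name and the `block_ctor` constructor (intended to build the MDTA + GDFN block sketched later) are hypothetical.

```python
import torch.nn as nn

class FusionRefiner(nn.Module):
    """Fine-tuning tail of the text fusion module: four stacked Transformer
    blocks followed by a 3x3 convolution to a 3-channel output image."""
    def __init__(self, channels: int, block_ctor):
        super().__init__()
        self.blocks = nn.Sequential(*[block_ctor(channels) for _ in range(4)])
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        return self.to_rgb(self.blocks(x))   # O_f, the image with the modified text
```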
The invention employs an L1 loss and a WGAN adversarial loss to supervise the modified text image; the loss of the text fusion module is:
L_F = sup_{||f||_L <= 1} ( E[f(T_f)] - E[f(O_f)] ) + β ||T_f - O_f||_1 (formula 4)
D_F is the discriminator of the text fusion module, T_f is the label of the text fusion module, β is the weight coefficient of the L1 loss, set to 10 in the invention, and O_f is the output of the text fusion module generator.
In order to make the generated image more realistic, the invention introduces a VGG loss into the text fusion module, similar to the idea of style migration models, comprising a content perception loss and a style loss. The VGG loss can be expressed as:
L_vgg = λ_1 L_per + λ_2 L_style (formula 5)
L_per = Σ_i ||φ_i(T_f) - φ_i(O_f)||_1 (formula 6)
L_style = Σ_i ||G(φ_i(T_f)) - G(φ_i(O_f))||_1 (formula 7)
G(φ_i(x)) = φ_i(x) φ_i(x)^T (formula 8)
λ_1 is the weight of the content perception loss and λ_2 is the weight of the style loss, set to 1 and 500 respectively. φ_i are the activation feature maps of layers relu1_1 to relu5_1 of the VGG-19 model, and G is the Gram matrix.
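A sketch of the VGG-based perceptual and style losses (formulas 5 to 8) using torchvision's pretrained VGG-19. The layer indices correspond to relu1_1 to relu5_1 in torchvision's layer layout; the mean-reduced L1 distances and the Gram-matrix normalization used here are assumptions about the exact form of the losses.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class VGGLoss(nn.Module):
    """Content-perception and style (Gram-matrix) losses on VGG-19 activations
    relu1_1..relu5_1, weighted by lambda_1 = 1 and lambda_2 = 500."""
    LAYERS = (1, 6, 11, 20, 29)   # relu1_1, relu2_1, relu3_1, relu4_1, relu5_1

    def __init__(self, lambda_per: float = 1.0, lambda_style: float = 500.0):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.lambda_per, self.lambda_style = lambda_per, lambda_style

    @staticmethod
    def gram(f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)   # normalised Gram matrix

    def _acts(self, x):
        acts = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.LAYERS:
                acts.append(x)
        return acts

    def forward(self, output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        per, style = 0.0, 0.0
        for fo, ft in zip(self._acts(output), self._acts(target)):
            per = per + torch.mean(torch.abs(fo - ft))                       # L_per term
            style = style + torch.mean(torch.abs(self.gram(fo) - self.gram(ft)))  # L_style term
        return self.lambda_per * per + self.lambda_style * style
```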
The overall framework loss is:
L = L_convert + L_B + L_F (formula 9)
The discriminator uses a structure similar to the PatchGAN network. To make the model converge more easily, the discriminator uses the WGAN loss as the adversarial loss, removes the sigmoid layer of the original PatchGAN, and replaces the batch normalization in the original structure with spectral normalization; no spectral normalization is added to the last layer, and the discriminator function is constrained to 1-Lipschitz continuity.
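A minimal sketch of a PatchGAN-like critic with the modifications described above: spectral normalization in place of batch normalization on the intermediate layers, none on the last layer, and no sigmoid because a WGAN loss is used. The exact depth and channel widths are assumptions.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def patch_critic(in_ch: int = 3, base: int = 64) -> nn.Sequential:
    """PatchGAN-like critic: spectral-normalised intermediate convolutions,
    plain first and last layers, raw per-patch scores (no sigmoid)."""
    def block(cin, cout, use_sn=True):
        conv = nn.Conv2d(cin, cout, 4, stride=2, padding=1)
        return nn.Sequential(spectral_norm(conv) if use_sn else conv,
                             nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(
        block(in_ch, base, use_sn=False),       # first layer: plain convolution
        block(base, base * 2),
        block(base * 2, base * 4),
        nn.Conv2d(base * 4, 1, 4, padding=1),   # last layer: no normalisation, no sigmoid
    )
```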
The multi-head depth separable convolution attention mechanism module comprises:
referring to fig. 8, the specific method of the multi-head depth separable convolution attention mechanism module includes the following specific steps:
the main computational overhead in the Tranformer comes from the self-attention layer, and in the conventional SA (self-attention), the temporal complexity of the key-query dot product increases by a square multiple of the input spatial pixels, and for the input image W × H, the temporal complexity is O (W × H) 2 H 2 ) This greatly increases the difficulty of model computation. To alleviate the computational complexity, the MDTA used in the present invention is shown in fig. 8, and the complexity of the MDTA is linear complexity. Here, the SA generates an implicitly encoded attention feature map for the global context information by calculating the correlation between channels. Besides, another important component of MDTA is to introduce deep separable convolution, which also incorporates the powerful ability of CNN to extract local information into the Transformer before generating the global attention feature map, which further reduces the training parameters.
Suppose a tensor Y ∈ R normalized by a layer H×W×C As input, MDTA adopts 1x1 convolution to integrate context information among channels for generating query (Q), key (K) and value (V) feature vectors with abundant local information, and then carries out 3x3 scale invariant depth separable convolution to encode spatial context information of channel level to generate
Figure BDA0004013607520000081
Is 1x1 point-by-point convolution and is greater than or equal to>
Figure BDA0004013607520000082
Is a 3x3 depth separable convolution. The present invention uses unbiased convolution in the network. Then changing the shapes of the query and the key, and generating a transposed attention feature map A with the size of R after dot product C×C Instead of the huge attention profile generated by the original Transformer, the size R HW×HW . In summary, the MDTA process can be defined as:
Figure BDA0004013607520000083
Figure BDA0004013607520000084
x and
Figure BDA0004013607520000085
for input and output of the profile, the initial Q, K, V ∈ R H×W×C Changed shape into>
Figure BDA0004013607520000086
Figure BDA0004013607520000087
Figure BDA0004013607520000088
Gamma is a learnable parameter that is paired ≦ before passing through the Softmax activation function>
Figure BDA0004013607520000089
And &>
Figure BDA00040136075200000810
The size of the dot product of (a) is controlled. Similar to the traditional multi-head self-attention mechanism, the number of channels is respectively put into different heads for parallel calculation.
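A self-contained PyTorch sketch of an MDTA block following the description above: a 1x1 point-wise plus a 3x3 depth-wise convolution produce Q, K and V, attention is computed across channels (a small per-head map instead of HW×HW), a learnable temperature plays the role of γ, and a residual connection adds the input back. The head count and the layer-norm placement are assumptions.

```python
import torch
import torch.nn as nn

class MDTA(nn.Module):
    """Multi-head depth-separable convolution (transposed) attention: attention
    is computed channel-to-channel, so each head's attention map is
    (C/heads) x (C/heads) rather than HW x HW."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.norm = nn.LayerNorm(channels)
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))     # learnable gamma
        self.qkv_point = nn.Conv2d(channels, channels * 3, 1, bias=False)
        self.qkv_depth = nn.Conv2d(channels * 3, channels * 3, 3, padding=1,
                                   groups=channels * 3, bias=False)  # 3x3 depth-wise
        self.project = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)     # layer norm over channels
        q, k, v = self.qkv_depth(self.qkv_point(y)).chunk(3, dim=1)  # 1x1 then 3x3 depth-wise
        # reshape so the attention map is channel-to-channel within each head
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        attn = (q @ k.transpose(-2, -1)) / self.temperature          # (b, heads, c/h, c/h)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return x + self.project(out)                                 # residual connection
```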
Gated depth separable feed forward network module:
Referring to fig. 9, the specific method of the gated depth-separable feed-forward network module includes the following specific steps:
The gating mechanism is realized by an element-wise product of two parallel linear transformation paths, one of which passes through a GELU nonlinear activation function. Similar to MDTA, GDFN uses depth-separable convolution to encode information from spatially neighboring elements; effectively learning the local information of the image plays an important role in image generation. Assume an input tensor X ∈ R^(H×W×C); the GDFN is formulated as:
X_hat = W_p^0 Gating(X) + X
Gating(X) = θ(W_d^1 W_p^1 LN(X)) ⊙ W_d^2 W_p^2 LN(X)
where ⊙ denotes element-wise (pixel-by-pixel) multiplication, θ denotes the GELU nonlinear activation function, and LN is layer normalization. GDFN controls the flow of information across the levels of the network structure, allowing each level to focus on finer-grained features and pass them on to the other levels. By reducing the expansion coefficient γ, the model parameters and the computational load are reduced while the same effect is maintained.
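A matching sketch of the GDFN block: two parallel 1x1 plus 3x3 depth-wise convolution paths (implemented here as a single projection split into two chunks), a GELU on the gating path, an element-wise product, and a residual connection. The default expansion ratio of 2.66 is an assumption, since the patent only says that reducing the expansion coefficient reduces parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDFN(nn.Module):
    """Gated depth-separable feed-forward network: gated element-wise product of
    two parallel depth-separable convolution paths with a GELU nonlinearity."""
    def __init__(self, channels: int, expansion: float = 2.66):
        super().__init__()
        hidden = int(channels * expansion)
        self.norm = nn.LayerNorm(channels)
        self.project_in = nn.Conv2d(channels, hidden * 2, 1, bias=False)       # two parallel paths
        self.depthwise = nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1,
                                   groups=hidden * 2, bias=False)              # 3x3 depth-wise
        self.project_out = nn.Conv2d(hidden, channels, 1, bias=False)

    def forward(self, x):
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # layer normalisation
        gate, value = self.depthwise(self.project_in(y)).chunk(2, dim=1)
        return x + self.project_out(F.gelu(gate) * value)          # gated element-wise product
```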
the invention provides an electronic device and a readable storage medium;
As shown in fig. 10, which is a block diagram of the electronic device of the invention, the electronic device refers to various modern electronic digital computers, including, for example: personal computers, portable computers and various server devices. The components shown in the present disclosure, their connections and their functions are merely examples;
As shown in fig. 10, the electronic device includes: one or more multi-core processors, one or more GPU computing cards and a memory; to enable interaction with the electronic device, it further includes an input device and an output device. The devices are interconnected and communicate through a bus;
a memory is a non-transitory computer-readable storage medium provided herein, wherein the memory stores instructions executable by the at least one processor or the at least one GPU computing card to enable the at least one processor or the at least one GPU computing card to perform the method of any one of the embodiments of the present application;
the input device provides and accepts control signals input into the electronic device by a user, and comprises a keyboard for generating digital or character information and a mouse for controlling the device to generate other key signals. The output device provides feedback information to the consumer electronic device, including a display of the results or processes of the printing execution.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof.

Claims (9)

1. A Transformer-based scene image text modification method, characterized by comprising the following steps: scene image text style migration, scene image background erasing and scene image text fusion, specifically operated as follows:
a. respectively inputting the original-style image and the target text image into an encoder to obtain high-level semantic features of the images;
b. fusing the image semantic features obtained in step a through a 1x1 convolution network;
c. passing the features fused in step b through a decoder based on a deep convolutional neural network to obtain a target text image with the original image style;
d. passing the original-style image through an encoder and a decoder, and then through a 3x3 convolutional neural network, to obtain the word-level background image of the original-style text image;
e. extracting features from the target text image with the original image style obtained in step c and from the original background image obtained in step d by using an encoder and a decoder;
f. performing global feature extraction on the image features obtained in step e by using a Transformer block;
g. performing feature fusion by using a multi-head depth-separable convolution attention mechanism network and a gated depth-separable feed-forward network to obtain the finally modified scene text image.
2. The method of claim 1, wherein the scene image text style migration in step a takes a style image and a target text image as input and outputs the target text image in the foreground style of the original image; the scene image text style specifically includes: the font, font color and shape of the scene image text; the encoder may include a 3x3 scale-invariant convolution, three down-sampling stages and 8 Transformer modules.
3. The method of claim 1, wherein the decoder in step c comprises 3 up-sampling stages, each followed by a different number of Transformer modules.
4. The method of claim 1, wherein in step d only the original-style text image is input and the background image of the original-style image is output, with the original text regions filled with appropriate background texture.
5. The method of claim 1, wherein the encoder/decoder in step d and the encoder/decoder in step a are respectively characterized in that: in step a, the decoder concatenates the feature map before each down-sampling with the feature map after the corresponding up-sampling of the decoder along the feature map channels, 3 times in total; in step d, the decoder likewise concatenates the feature map before each down-sampling with the feature map after the corresponding up-sampling 3 times, and finally halves the number of feature map channels through a 1x1 convolution.
6. The method of claim 1, wherein the inputs of the scene image text fusion in step g are the foreground text image generated by the text style migration module of step a and the original-style background image generated by the scene image background erasing of step d.
7. A Transformer-based scene image text modification device, characterized by comprising a scene image text style migration module, a scene image background erasing module and a scene image text fusion module, wherein:
the scene image text style migration module: extracts foreground style features from the original-style image and transfers them to the target text;
the scene image background erasing module: performs the word-level scene text erasing task on the original-style text image;
the scene image text fusion module: fuses the outputs of the scene image text style migration module and the scene image background erasing module to generate the finally modified scene text image.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202211662804.9A 2022-12-23 2022-12-23 Transformer-based scene image character modification method and device, electronic equipment and storage medium Pending CN115908639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211662804.9A CN115908639A (en) 2022-12-23 2022-12-23 Transformer-based scene image character modification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211662804.9A CN115908639A (en) 2022-12-23 2022-12-23 Transformer-based scene image character modification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115908639A true CN115908639A (en) 2023-04-04

Family

ID=86479606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211662804.9A Pending CN115908639A (en) 2022-12-23 2022-12-23 Transformer-based scene image character modification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115908639A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543076A (en) * 2023-07-06 2023-08-04 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN116543076B (en) * 2023-07-06 2024-04-05 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN117496531A (en) * 2023-11-02 2024-02-02 四川轻化工大学 Construction method of convolution self-encoder capable of reducing Chinese character recognition resource overhead

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Jiang et al. Scfont: Structure-guided chinese font generation via deep stacked networks
US11899927B2 (en) Simulated handwriting image generator
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN108170649B (en) Chinese character library generation method and device based on DCGAN deep network
CN115908639A (en) Transformer-based scene image character modification method and device, electronic equipment and storage medium
CN110544218B (en) Image processing method, device and storage medium
CN111161200A (en) Human body posture migration method based on attention mechanism
CN112036137A (en) Deep learning-based multi-style calligraphy digital ink simulation method and system
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
Daihong et al. Facial expression recognition based on attention mechanism
Liu et al. FontTransformer: Few-shot high-resolution Chinese glyph image synthesis via stacked transformers
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
Dong et al. Hr-prgan: High-resolution story visualization with progressive generative adversarial networks
Zeng et al. An unsupervised font style transfer model based on generative adversarial networks
Cai et al. Leveraging large language models for scalable vector graphics-driven image understanding
CN116485962A (en) Animation generation method and system based on contrast learning
Liu et al. Fontrl: Chinese font synthesis via deep reinforcement learning
KR20210048281A (en) Apparatus and method for generating video with background removed
CN113421314B (en) Multi-scale bimodal text image generation method based on generation countermeasure network
CN114399708A (en) Video motion migration deep learning system and method
CN112732943B (en) Chinese character library automatic generation method and system based on reinforcement learning
CN111724467B (en) Voxel model generation method and system for 3D printing
Kaddoura Real-World Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination