CN114359033A - Scene image-text generation method and system

Info

Publication number
CN114359033A
Authority
CN
China
Prior art keywords
image
text
style
content
scene
Prior art date
Legal status
Pending
Application number
CN202111519968.1A
Other languages
Chinese (zh)
Inventor
吕岳
陈昕苑
张灵珺
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202111519968.1A
Publication of CN114359033A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a scene image-text generation method and system. The method comprises the following steps: S1, text line extraction: extracting the region where the text is located from a given scene image as a text line style image; S2, text line style migration: fusing a given text image and the style image to obtain a fused image; and S3, scene text fusion: embedding the fused image into the input image and rendering to obtain an output image. According to the technical scheme, the text content in the input scene picture is replaced with text content in another language according to the text position information, while the style and background of the characters in the input picture are preserved, improving the replacement effect.

Description

Scene image-text generation method and system
Technical Field
The invention relates to the technical field of data processing, and in particular to a scene image-text generation method and system.
Background
Generating images and generating or replacing text have become popular techniques in the field of artificial intelligence. In most scenes, however, images and text appear together, and some new applications require that the background of the image be preserved while the text is replaced and the style of the original text is retained, which poses new challenges to current techniques.
Such new applications are numerous. For example, in photo translation on mobile phones, the text in a captured picture is converted into the language required by the user while a good visual effect is maintained. Languages used in regions with few speakers, especially economically underdeveloped regions, are rarely considered in image-text research, and their data resources are relatively scarce, which hinders communication in these languages.
As another example, for scene text appearing in films, the traditional approach requires a subtitle group to translate the text and embed the characters manually, which is time-consuming and labor-intensive, and errors introduced by manual operation are difficult to avoid.
To address these problems, techniques for synthesizing scene text datasets have been proposed in recent years along with the development of generative adversarial networks. Mainstream methods for synthesizing scene text datasets fall into three categories: generating a text picture in a real scene style from a given text sketch, placing given text at a suitable position in a scene picture, and replacing the original scene text region through style transfer.
The first category: Gong et al., in "Generating text sequence images for recognition", regard the image generation task as image-to-image translation and generate realistic scene text pictures from semantic images. Isola et al., in "Image-to-Image Translation with Conditional Adversarial Networks", propose a cGAN-based method, also known as Pix2Pix, for generating images of one domain from another, which can be used for text conversion. However, this category of methods is mainly used for generating simple text line images without backgrounds, and the style of the generated images is not controllable.
The second category: Gupta et al., in "Synthetic Data for Text Localisation in Natural Images", use an input picture as the background and place text with a given font and color into the background picture to generate a synthetic scene text image. Zhan et al. propose an image synthesis technique based on the idea of semantic segmentation in "Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes", and further propose a Spatial Fusion GAN (SFGAN) so that the text in the synthesized image is realistic in both geometric form and spatial position. However, the text positions in this type of composite image often do not correspond to the text positions in real scenes.
The third category: Yang et al., in "Context-aware text stylization", propose an unsupervised method for artistic text stylization and conduct experiments on different fonts and languages. Liu et al., in "Synthesis scene text images for retrieval with style transfer", propose a content image initialization module and an encoder-decoder network that generate natural scene text images from binarized text images and texture images. On this basis, Li et al., in "Synthesis data for text recognition with style transfer", render the binarized text image as the input of the style transfer network; they propose a SynthText-Transfer framework for generating synthetic text images with the same texture but different text content, but the diversity of the generated images is limited because many manual operations are required. Wu et al., in "Editing Text in the Wild", propose an end-to-end trainable style retention network (SRNet) for editing text in natural scene pictures and make some attempts at cross-lingual English-to-Chinese editing. Their method can handle most input images, but fails when the text structure or the background is complex.
Building on the work of Wu et al., Yang et al. propose a unified framework, SwapText, for scene text conversion, which improves the editing of curved text images. Roy et al., in "Scene text editor using font adaptive neural network", design a generative network that produces other characters with the same font characteristics from the font characteristics of a single character. All of the above methods are supervised; for minority languages with relatively few resources, and in particular for the training of cross-language models, only synthetic data can be used for training, so the generation effect on real scenes is poor.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a scene image-text generation method, which comprises the following steps:
S1, text line extraction: extracting the region where the text is located from a given scene image as a text line style image; S2, text line style migration: fusing a given text image and the style image to obtain a fused image; and S3, scene text fusion: embedding the fused image into the input image and rendering to obtain an output image. According to this technical scheme, the text content in the input scene picture is replaced with text content in another language according to the text position information, while the style and background of the characters in the input picture are preserved, improving the replacement effect.
Further, in step S2, a neural network is used to fuse the given text image and the style image to obtain the fused image, which specifically comprises:
extracting background features of the text line style image to obtain a background image;
extracting content features of the text image and style features of the style image and fusing them to obtain a foreground image;
and fusing the background image and the foreground image to obtain the fused image.
Further, the neural network is a text line style migration network comprising three subnets: a character migration subnet, a background generation subnet and a fusion subnet, wherein:
the character migration subnet separates the style features of the text in the text line style image from the content features of the text in the content image and concatenates them to generate the foreground image;
the background generation subnet is used for extracting background features of the style image to obtain the background image;
and the fusion subnet is used for fusing the background image and the foreground image to obtain the fused image.
Further, the text line style migration network is a multilayer convolutional neural network.
Further, a training method of the neural network comprises the following steps:
preparing a synthetic dataset, wherein the synthetic dataset comprises text line style images, content images and corresponding result images, and each result image has both the content of the content image and the character style and background style of the style image;
inputting the text line style image and the content image into the text line style migration network to obtain the style-migrated fused image;
calculating the pixel-level loss between the style-migrated fused image and the result image, and constructing a target loss function by combining the pixel-level loss with the optimization objective of a generative adversarial network;
and performing optimization using a gradient descent algorithm.
Further, another training method of the neural network comprises the following steps:
preparing a real-scene dataset, wherein the real-scene dataset comprises text images and annotation information of text positions and text contents;
taking the text line region in the real scene as the style input of the character migration subnet, i.e. the text line style image, and generating from the existing text content annotation a text image with the same text content as the content input of the character migration subnet, i.e. the content image;
calculating the pixel-level loss between the foreground image generated by the character migration subnet and the input style image, and constructing a target loss function by combining the pixel-level loss with the optimization objective of the generative adversarial network;
inputting other text lines from the corpus into the neural network as content images and images of the real scene as style images, the fused image obtained through the text line style migration network being required to carry the given content while keeping the original style;
calculating the consistency loss between the fused image and the content image in the content feature space, the consistency loss between the fused image and the style image in the style feature space, and the realism loss of the fused image;
and performing optimization using a gradient descent algorithm.
Further, the neural network may also be trained by alternately using the above two training methods.
Further, the method of generating the synthetic dataset comprises:
randomly selecting an image from a background image library as the background image;
embedding text deformed by perspective, curving and the like into the background image to obtain the style image;
randomly selecting a word from the corpus and rendering it into the content image with a selected font and color;
and, based on the depth information and segmentation information of the background image, deforming the text of the content image by perspective, curving and the like and embedding it into the background image to obtain the corresponding result image.
The invention also discloses a scene image-text generation system, which comprises a feature extraction module, a feature fusion module and an output module, wherein:
the feature extraction module is used for extracting the region where the text is located from an input image to obtain a style image;
the feature fusion module is used for fusing a given text image and the style image to obtain a fused image;
the output module is used for embedding the fused image into the input image and rendering to obtain an output image.
The present invention also provides an electronic device comprising a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the above scene image-text generation method.
In practical applications, the modules of the method and system disclosed by the invention can be deployed on one target server, or each module can be deployed independently on a different target server; in particular, to provide stronger computing capacity, the modules can be deployed on a cluster of target servers as needed.
Therefore, according to the technical scheme, the text content in the input picture can be replaced with text content in another language according to the text position information, while the style and background of the characters in the input picture are preserved, improving the replacement effect.
In order that the invention may be more clearly and fully understood, specific embodiments thereof are described in detail below with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a scene image-text generation method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a training process of a neural network according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a visualization result of each step in the embodiment of the present application.
FIG. 4 is a diagram illustrating a comparison effect on a synthetic data set according to an embodiment of the present application.
FIG. 5 is a diagram illustrating a comparison effect of the embodiment of the present application on a real-world dataset.
Detailed Description
According to the text position information, the text content in the input picture is replaced with text content in another language, while the style and background of the characters in the input picture are preserved. The method comprises the following steps:
S1, text line extraction: extracting the region where the text is located from the input image to obtain a style image;
S2, text line style migration: fusing a given text image and the style image to obtain a fused image, i.e. the text line image after style migration;
and S3, scene text fusion: embedding the fused image into the input image and rendering to obtain an output image.
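As an illustration of how the three steps fit together, the following is a minimal sketch of the S1-S3 pipeline; `style_transfer_net` stands for a trained text line style migration network and is a hypothetical callable, and the axis-aligned box format of the text position is an assumption, not part of the claimed method.

```python
from PIL import Image

def generate_scene_text(scene_path, box, content_line_img, style_transfer_net):
    # style_transfer_net is assumed to take and return PIL images.
    scene = Image.open(scene_path).convert("RGB")

    # S1: text line extraction -- crop the region where the text is located
    # to obtain the text line style image. box = (left, top, right, bottom).
    style_img = scene.crop(box)

    # S2: text line style migration -- fuse the given content (text) image
    # with the style image to obtain the style-migrated fused image.
    fused = style_transfer_net(content_line_img.resize(style_img.size), style_img)

    # S3: scene text fusion -- embed the fused text line back into the input
    # image at the original position and render the output image.
    output = scene.copy()
    output.paste(fused.resize(style_img.size), box[:2])
    return output
```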
Example one
Referring to fig. 1 in combination with the visualization example of fig. 3, fig. 1 is a schematic implementation flow diagram of a scene image-text generation method according to an embodiment of the present application, in which "step one" corresponds to S1, "text line" refers to the region where the text is located, usually selected as a rectangle containing the text line, and "text line style image" is the style image. The details are as follows:
and S1, extracting the region where the text is located from the input image to obtain a style image.
In this embodiment, the original scene image contains the text "skylight" in a font with a certain design style, and the rectangular image where the "skylight" text line is located is extracted as the style image.
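Scene text annotations often give the text position as a quadrilateral rather than an axis-aligned rectangle. The following hedged sketch shows one way S1 could rectify such a region into a text line style image with a perspective transform; the corner ordering and output height are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_text_line(scene_bgr, quad, out_h=64):
    # quad: 4x2 array of corners ordered top-left, top-right, bottom-right, bottom-left.
    quad = np.asarray(quad, dtype=np.float32)
    w = int(max(np.linalg.norm(quad[0] - quad[1]), np.linalg.norm(quad[3] - quad[2])))
    h = int(max(np.linalg.norm(quad[0] - quad[3]), np.linalg.norm(quad[1] - quad[2])))
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

    # Rectify the quadrilateral text region into an axis-aligned text line image.
    M = cv2.getPerspectiveTransform(quad, dst)
    style_img = cv2.warpPerspective(scene_bgr, M, (w, h))

    # Normalize the height while keeping the aspect ratio (assumed convention).
    scale = out_h / h
    return cv2.resize(style_img, (int(w * scale), out_h))
```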
S2, fusing the given text image and the style image to obtain a fused image;
The style migration network learns the text style and background style of the style image and migrates them to the text image of the given text content (i.e. the content image), obtaining a fused image that has both the style of the style image and the text content of the text image. This specifically comprises steps S21-S23:
and S21, extracting the background features of the style image by the background generation subnet to obtain a background image, and realizing feature extraction by a multilayer convolutional neural network.
And S22, extracting the content characteristics of the text image and the style characteristics of the style image by the character migration subnet, adding the style characteristics and the characteristic layers with the same pixels as the content characteristics to achieve the purpose of fusion, obtaining a foreground image by decoding, and extracting the content characteristics of the text image and the style characteristics of the style image to realize characteristic extraction through a multilayer convolutional neural network.
And S23, fusing the background image and the foreground image by the fusion subnet to obtain a fusion image.
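The following PyTorch sketch illustrates one possible arrangement of the three subnets in S21-S23. The layer widths, depths, normalization and the element-wise addition used for feature fusion are illustrative assumptions; the patent does not specify the exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def encoder():
    return nn.Sequential(conv_block(3, 32), conv_block(32, 64, 2), conv_block(64, 128, 2))

def decoder():
    return nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True),
                         nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(inplace=True),
                         nn.Conv2d(32, 3, 3, 1, 1), nn.Tanh())

class TextLineStyleMigrationNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Character migration subnet: content and style encoders plus a foreground decoder.
        self.content_enc, self.style_enc, self.fg_dec = encoder(), encoder(), decoder()
        # Background generation subnet: extracts the background of the style image.
        self.bg_enc, self.bg_dec = encoder(), decoder()
        # Fusion subnet: fuses the foreground and background images.
        self.fusion = nn.Sequential(conv_block(6, 64), conv_block(64, 64),
                                    nn.Conv2d(64, 3, 3, 1, 1), nn.Tanh())

    def forward(self, content_img, style_img):
        # Content and style images are assumed to share the same spatial size
        # (e.g. both resized to 64 x 256) so their feature maps can be added (S22).
        fg = self.fg_dec(self.content_enc(content_img) + self.style_enc(style_img))
        # S21: background image from the background generation subnet.
        bg = self.bg_dec(self.bg_enc(style_img))
        # S23: fuse foreground and background into the final fused image.
        fused = self.fusion(torch.cat([fg, bg], dim=1))
        return fg, bg, fused
```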
The method can thus realize cross-language scene character style migration: using existing public datasets annotated in widely used languages (such as Chinese and English), the character content in the input picture is replaced with character content in a minority language according to the annotated text position information, while the style and background of the characters in the input picture are preserved, improving the replacement effect.
In order to better implement the above cross-language style migration, as a preferred embodiment, the neural network used in the above method comprises a character migration subnet, a background generation subnet and a fusion subnet, wherein:
the character migration subnet is used for separating the style features of the text in the text line style image from the content features of the text in the content image and concatenating them to generate the foreground image;
the background generation subnet is used for extracting background features of the style image to obtain the background image;
and the fusion subnet is used for fusing the background image and the foreground image to obtain the fused image.
Referring to fig. 2, fig. 2 is a schematic diagram of the training process of the neural network according to an embodiment of the present application. In order to better exploit the performance of the neural network, and based on the purpose and characteristics of the invention, the application provides two training methods for the neural network of this embodiment. The first method, including its loss calculation steps, comprises:
preparing a synthetic dataset, wherein the synthetic dataset comprises style images, content images and corresponding result images, and each result image has both the content of the content image and the character style and background style of the style image;
calculating the pixel-level loss between the style-migrated fused image and the result image, and constructing the target loss function by combining the pixel-level loss with the optimization objective of a generative adversarial network;
optimization is performed using a gradient descent algorithm.
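A hedged sketch of this training objective is given below: a pixel-level L1 loss between the style-migrated fused image and the ground-truth result image, combined with a standard adversarial objective. The loss weight, the discriminator `D` and the choice of Adam as the gradient-descent variant are assumptions, not values stated in the patent.

```python
import torch
import torch.nn.functional as F

def generator_loss(fused, result_gt, D, lambda_pix=10.0):
    # Pixel-level loss between the style-migrated fused image and the
    # ground-truth result image, plus the generator part of the GAN objective.
    pixel_loss = F.l1_loss(fused, result_gt)
    logits = D(fused)
    adv_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return lambda_pix * pixel_loss + adv_loss

def discriminator_loss(fused, result_gt, D):
    # Standard discriminator objective: real result images vs. generated ones.
    real_logits, fake_logits = D(result_gt), D(fused.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

# Optimization with a gradient-descent variant (Adam is an assumed choice):
# g_opt = torch.optim.Adam(net.parameters(), lr=2e-4, betas=(0.5, 0.999))
# d_opt = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
```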
Another training method includes:
preparing a real-scene dataset, wherein the real-scene dataset comprises text images and annotation information of text positions and contents;
taking the text line region in the real scene as the style input of the character migration subnet, and generating from the existing text content annotation a text image with the same text content as the content input of the character migration subnet;
calculating the pixel-level loss between the foreground image generated by the character migration subnet and the input style image, and constructing the target loss function by combining the pixel-level loss with the optimization objective of the generative adversarial network;
inputting other text lines from the corpus into the neural network as content images and images of the real scene as style images; the output of the text line style migration network (i.e. the style-migrated fused image) is required to carry the given content while keeping the background and style of the original real scene. Then, the consistency loss between the output style-migrated fused image and the content image in the content feature space, the consistency loss with the style image in the style feature space, and the realism loss of the fused image are calculated.
Optimization is performed using a gradient descent algorithm.
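The real-scene (unpaired) losses could be sketched as follows, assuming the consistency losses are computed with the content and style encoders of the migration network above and that all images are resized to a common resolution; the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def unpaired_losses(fused, content_img, style_img, net, D,
                    w_content=1.0, w_style=1.0, w_real=1.0):
    # Consistency with the content image in content-feature space.
    content_consistency = F.l1_loss(net.content_enc(fused),
                                    net.content_enc(content_img))
    # Consistency with the style image in style-feature space.
    style_consistency = F.l1_loss(net.style_enc(fused),
                                  net.style_enc(style_img))
    # Realism loss: the fused image should look like a real scene text line.
    logits = D(fused)
    realism = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return w_content * content_consistency + w_style * style_consistency + w_real * realism
```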
In addition, in order to obtain better training results, the neural network of the application can also be trained by alternating the above two training methods, thereby achieving a better training effect.
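Alternate training could then be organized as sketched below; the 1:1 alternation between a paired synthetic batch and an unpaired real-scene batch, the loader names and the omission of the discriminator update are simplifying assumptions.

```python
def train_alternating(net, D, synth_loader, real_loader, g_opt, num_epochs=10):
    for _ in range(num_epochs):
        for synth_batch, real_batch in zip(synth_loader, real_loader):
            # Supervised step on a paired synthetic batch (pixel + adversarial loss).
            content, style, result_gt = synth_batch
            _, _, fused = net(content, style)
            loss = generator_loss(fused, result_gt, D)
            g_opt.zero_grad()
            loss.backward()
            g_opt.step()

            # Unsupervised step on an unpaired real-scene batch
            # (consistency + realism losses, no paired ground truth).
            content, style = real_batch
            _, _, fused = net(content, style)
            loss = unpaired_losses(fused, content, style, net, D)
            g_opt.zero_grad()
            loss.backward()
            g_opt.step()
        # The discriminator update (using discriminator_loss) is omitted for brevity.
```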
During training, the dataset is also an important factor affecting the result. The cross-language setting addressed by this application is characterized by the fact that widely used languages such as Chinese and English have abundant annotated datasets available, whereas minority languages have few users and lack data resources.
Due to the scarcity of cross-lingual dataset resources, it is difficult to obtain a large number of paired images for model training; therefore, a synthetic dataset needs to be created. In view of this scarcity of data resources, the application provides a preferred implementation for the preparation of the dataset: the dataset of the application comprises a real dataset and a synthetic dataset. The synthetic dataset provides input images together with the corresponding result images, while the real-scene dataset only requires the input images and their text position and content information, which solves the problem that the real-scene dataset has no corresponding style migration result.
For the synthetic dataset, the font, color and geometric deformation of the text are selected at random, an image is selected as the background, and the two are combined to obtain the style image. Meanwhile, a word in the target language is randomly selected to generate the binarized content image, and the same font, color, geometric deformation and background are used to generate the ground-truth result image. In order to make the output image closer to real scene images, a real-world dataset is also used.
The specific steps of synthetic dataset generation include (see the sketch after these steps):
randomly selecting a background image from a background image library;
embedding text deformed by perspective, curving and the like into the background image to obtain the style image;
randomly selecting a word from the corpus and rendering it into the content image with a selected font and color;
and, based on the depth information and segmentation information of the background image, deforming the text of the content image by perspective, curving and the like and embedding it into the background image to obtain the corresponding result image.
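The following sketch illustrates how such a synthetic triple (style image, content image, result image) could be produced with Pillow. The simple rotation used in place of the full perspective/curving deformation, the fixed paste position, and the assumption that the background is larger than the rendered text are illustrative simplifications.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path, size=48, color=(0, 0, 0, 255), bg=(0, 0, 0, 0)):
    # Render a single word on its own canvas; used both for the binarized
    # content image and for the text that is pasted onto backgrounds.
    font = ImageFont.truetype(font_path, size)
    l, t, r, b = font.getbbox(word)
    img = Image.new("RGBA", (r - l + 20, b - t + 20), bg)
    ImageDraw.Draw(img).text((10 - l, 10 - t), word, font=font, fill=color)
    return img

def make_pair(src_word, tgt_word, background, font_path):
    color = tuple(random.randint(0, 255) for _ in range(3)) + (255,)
    angle = random.uniform(-15, 15)   # stand-in for perspective/curving deformation

    def embed(word):
        # Deform the rendered word and composite it onto a copy of the background.
        fg = render_word(word, font_path, color=color).rotate(angle, expand=True)
        canvas = background.copy().convert("RGBA")
        canvas.alpha_composite(fg, (20, 20))
        return canvas.convert("RGB")

    style_img = embed(src_word)    # original-language word in the scene style
    result_img = embed(tgt_word)   # same background and deformation, target word
    content_img = render_word(tgt_word, font_path,
                              bg=(255, 255, 255, 255)).convert("RGB")  # binarized content
    return style_img, content_img, result_img
```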
Based on the technical scheme provided by the application, experiments were carried out on Arabic, Thai and Vietnamese. More than 100 fonts, each containing both minority-language characters (Arabic, Thai and Vietnamese) and English characters, and 8000 background images were selected for training the model.
To demonstrate the progress of the invention, quantitative and qualitative comparisons were made with the SRNet and Pix2Pix methods. For the quantitative comparison, MSE, PSNR and SSIM are adopted to measure the generation quality of the different methods: a smaller MSE and larger PSNR and SSIM indicate better image generation.
The three methods were tested on the test set and the evaluation results were compared as follows:
[Table: quantitative comparison of MSE, PSNR and SSIM on the test set; rendered as an image in the original publication.]
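For reference, the three metrics can be computed with scikit-image as sketched below (assuming 8-bit RGB images of equal size); this reproduces the evaluation criteria only, not the reported numbers.

```python
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

def evaluate(generated, ground_truth):
    # Lower MSE and higher PSNR/SSIM indicate better generation quality.
    mse = mean_squared_error(ground_truth, generated)
    psnr = peak_signal_noise_ratio(ground_truth, generated, data_range=255)
    ssim = structural_similarity(ground_truth, generated,
                                 channel_axis=-1, data_range=255)
    return mse, psnr, ssim
```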
at the same time, as shown in fig. 4 and 5, the results generated by the present method are visually superior to those generated by SRNet.
The above comparisons demonstrate the technical effects obtained by the application.
example two
Based on the above embodiments, the present application provides a scene image-text generation system, which comprises a feature extraction module, a feature fusion module and an output module, wherein:
the feature extraction module is used for extracting the region where the text is located from an input image to obtain a style image;
the feature fusion module is used for fusing a given text image and the style image to obtain a fused image;
and the output module is used for embedding the fused image into the input image and rendering to obtain an output image.
An embodiment of the present application further provides an electronic device comprising a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the above scene image-text generation method.
It should be noted that, all or part of the steps in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, which may include, but is not limited to: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A scene image-text generation method, characterized by being applied to a scene image-text generation system and comprising the following steps:
S1, text line extraction: extracting the region where the text is located from a given scene image as a text line style image;
S2, text line style migration: fusing a given text image and the style image to obtain a fused image;
and S3, scene text fusion: embedding the fused image into the input image and rendering to obtain an output image.
2. The scene image-text generation method according to claim 1, wherein in step S2 a neural network is used to fuse the given text image and the style image to obtain the fused image, specifically comprising:
extracting background features of the text line style image to obtain a background image;
extracting content features of the text image and style features of the style image and fusing them to obtain a foreground image;
and fusing the background image and the foreground image to obtain the fused image.
3. The scene image-text generation method according to claim 2, wherein the neural network is a text line style migration network comprising three subnets: a character migration subnet, a background generation subnet and a fusion subnet, wherein:
the character migration subnet separates the style features of the text in the text line style image from the content features of the text in the content image and concatenates them to generate the foreground image;
the background generation subnet is used for extracting background features of the style image to obtain the background image;
and the fusion subnet is used for fusing the background image and the foreground image to obtain the fused image.
4. The scene image-text generation method according to claim 3, wherein the text line style migration network is a multilayer convolutional neural network.
5. The scene image-text generation method according to claim 2, wherein a training method of the neural network comprises:
preparing a synthetic dataset, wherein the synthetic dataset comprises text line style images, content images and corresponding result images, and each result image has both the content of the content image and the character style and background style of the style image;
inputting the text line style image and the content image into the text line style migration network to obtain the style-migrated fused image;
calculating the pixel-level loss between the style-migrated fused image and the result image, and constructing a target loss function by combining the pixel-level loss with the optimization objective of a generative adversarial network;
and performing optimization using a gradient descent algorithm.
6. The scene image-text generation method according to claim 2, wherein a training method of the neural network comprises:
preparing a real-scene dataset, wherein the real-scene dataset comprises text images and annotation information of text positions and text contents;
taking the text line region in the real scene as the style input of the character migration subnet, i.e. the text line style image, and generating from the existing text content annotation a text image with the same text content as the content input of the character migration subnet, i.e. the content image;
calculating the pixel-level loss between the foreground image generated by the character migration subnet and the input style image, and constructing a target loss function by combining the pixel-level loss with the optimization objective of the generative adversarial network;
inputting other text lines from the corpus into the neural network as content images and images of the real scene as style images, the fused image obtained through the text line style migration network carrying the given content while keeping the original style;
calculating the consistency loss between the fused image and the content image in the content feature space, the consistency loss between the fused image and the style image in the style feature space, and the realism loss of the fused image;
and performing optimization using a gradient descent algorithm.
7. The scene image-text generation method according to claim 6, wherein the neural network is trained alternately using the method of claim 5 and the method of claim 6.
8. The scene image-text generation method according to claim 5, wherein the method of generating the synthetic dataset comprises:
randomly selecting an image from a background image library as the background image;
embedding text deformed by perspective, curving and the like into the background image to obtain the style image;
randomly selecting a word from the corpus and rendering it into the content image with a selected font and color;
and, based on the depth information and segmentation information of the background image, deforming the text of the content image by perspective, curving and the like and embedding it into the background image to obtain the corresponding result image.
9. A scene image-text generation system, characterized by comprising a feature extraction module, a feature fusion module and an output module, wherein:
the feature extraction module is used for extracting the region where the text is located from an input image to obtain a style image;
the feature fusion module is used for fusing a given text image and the style image to obtain a fused image;
and the output module is used for embedding the fused image into the input image and rendering to obtain an output image.
10. An electronic device, comprising: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the scene image-text generation method according to any one of claims 1 to 8.
CN202111519968.1A 2021-12-13 2021-12-13 Scene image-text generation method and system Pending CN114359033A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111519968.1A CN114359033A (en) 2021-12-13 2021-12-13 Scene image-text generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111519968.1A CN114359033A (en) 2021-12-13 2021-12-13 Scene image-text generation method and system

Publications (1)

Publication Number Publication Date
CN114359033A 2022-04-15

Family

ID=81098497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111519968.1A Pending CN114359033A (en) 2021-12-13 2021-12-13 Scene image-text generation method and system

Country Status (1)

Country Link
CN (1) CN114359033A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898015A (en) * 2022-05-14 2022-08-12 云知声智能科技股份有限公司 Image generation method and device, electronic equipment and storage medium
CN115310405A (en) * 2022-07-21 2022-11-08 北京汉仪创新科技股份有限公司 Font replacement method, system, device and medium based on countermeasure generation network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination