CN113191355A - Text image synthesis method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113191355A
Authority
CN
China
Prior art keywords
text
image
training
loss
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110541630.XA
Other languages
Chinese (zh)
Inventor
范湉湉
黄灿
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110541630.XA
Publication of CN113191355A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition


Abstract

The embodiment of the application provides a text image synthesis method, device, equipment and storage medium. The method includes: acquiring a target text image and a target text style image; and inputting the target text image and the target text style image into a text synthesis network to obtain a synthesized text image output by the text synthesis network. The text in the synthesized text image is the target text in the target text image, and the text style in the synthesized text image is the text style in the target text style image. The text synthesis network is trained with the aid of a text recognition module, and the text recognition module is used for recognizing text information in images. Using the text recognition module as a supervision module to assist the training of the text synthesis network improves the training accuracy of the text synthesis network, and thereby improves the synthesis effect of the trained text synthesis network when synthesizing text.

Description

Text image synthesis method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a text image synthesis method, a text image synthesis device, text image synthesis equipment and a storage medium.
Background
Natural scene text synthesis refers to the artificial synthesis of scene text data in a complex background, for example, inputting a text to be synthesized and a target style into a text synthesis network, so that the text synthesis network outputs a text image of the target style.
Before the text synthesis network is used for text synthesis, the text synthesis network needs to be trained. However, the current training process of the text synthesis network is inadequate, so the synthesis effect of the trained text synthesis network is poor.
Disclosure of Invention
The embodiment of the application provides a text image synthesis method, a text image synthesis device and a storage medium, which are used for improving the synthesis effect of a text synthesis network.
In a first aspect, an embodiment of the present application provides a text image synthesis method, including:
acquiring a target text image and a target text style image;
inputting the target text image and the target text style image into a text synthesis network to obtain a synthesized text image output by the text synthesis network;
the text in the synthesized text image is the target text in the target text image, the text style in the synthesized text image is the text style in the target text style image, the text synthesis network is trained by the aid of a text recognition module, and the text recognition module is used for recognizing text information in the image.
In a second aspect, an embodiment of the present application provides a text image synthesizing apparatus, including:
an acquisition unit configured to acquire a target text image and a target text style image;
the synthesis unit is used for inputting the target text image and the target text style image into a text synthesis network to obtain a synthesized text image output by the text synthesis network;
the text in the synthesized text image is the target text in the target text image, the text style in the synthesized text image is the text style in the target text style image, the text synthesis network is trained by the aid of a text recognition module, and the text recognition module is used for recognizing text information in the image.
In a third aspect, embodiments of the present application provide a computing device, comprising a processor and a memory;
the memory for storing a computer program;
the processor is configured to execute the computer program to implement the method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which includes computer instructions, which when executed by a computer, cause the computer to implement the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which includes a computer program, the computer program being stored in a readable storage medium, from which the computer program can be read by at least one processor of a computer, and the execution of the computer program by the at least one processor causes the computer to implement the method of the first aspect.
According to the text image synthesis method, the text image synthesis device, the text image synthesis equipment and the storage medium, the text recognition module is used as the supervision module to assist the training of the text synthesis network, so that the training accuracy of the text synthesis network is improved, and the synthesis effect of the trained text synthesis network during text image synthesis is further improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a text synthesis network training scenario according to the present application;
fig. 3 is a schematic diagram illustrating a training process of a text synthesis network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a text synthesis network to which the present application relates;
FIG. 5 is a schematic diagram illustrating a training process of a text synthesis network according to another embodiment of the present application;
FIG. 6 is a schematic diagram of another text synthesis network to which the present application relates;
FIG. 7 is a schematic diagram illustrating a training process of a text synthesis network according to another embodiment of the present application;
FIG. 8 is a schematic diagram of another text synthesis network to which the present application relates;
FIG. 9 is a schematic diagram of another text synthesis network to which the present application relates;
FIG. 10 is a schematic diagram illustrating a training process of a text synthesis network according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a network structure of another text synthesis network according to the present application;
fig. 12 is a schematic flowchart of a text image synthesis method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a network structure of another text synthesis network to which the present application relates;
fig. 14 is a schematic structural diagram of a text image synthesizing apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a text image synthesizing apparatus according to an embodiment of the present application;
fig. 16 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application relates to the technical field of artificial intelligence, and in particular to a text image synthesis method, a text image synthesis apparatus, and a computing device.
In order to facilitate understanding of the embodiments of the present application, the related concepts related to the embodiments of the present application are first briefly described as follows:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and the like.
Computer Vision (CV) is a science that studies how to make a machine "see"; it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking and measurement of a target, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
OCR refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using character recognition methods; that is, for printed characters, the characters in a paper document are optically converted into an image file of a black-and-white dot matrix, and the characters in the image are then converted into a text format by recognition software for further editing and processing by word processing software.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
It should be understood that, in the present embodiment, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In the description of the present application, "plurality" means two or more than two unless otherwise specified.
In addition, in order to facilitate a clear description of the technical solutions of the embodiments of the present application, terms such as "first" and "second" are used in the embodiments of the present application to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the terms "first", "second", etc. do not limit the quantity or the execution order, and do not denote any difference in importance.
The technical solutions of the embodiments of the present application are described in detail below with reference to some embodiments. The following several embodiments may be combined with each other and may not be described in detail in some embodiments for the same or similar concepts or processes.
Fig. 1 is a schematic view of an application scenario related to an embodiment of the present application, including: user equipment 101, data acquisition equipment 102, training equipment 103, execution equipment 104, database 105, and content repository 106.
The data acquisition device 102 is configured to read training data from the content library 106 and store the read training data in the database 105.
It should be noted that the training data in the present application includes text images and text pattern images, and in order to facilitate distinguishing from the text images and the text pattern images in the actual application process, the text images used in the training process are referred to as training text images, and the text pattern images used in the training process are referred to as training text pattern images.
The training device 103 trains the text synthesis network based on the training data maintained in the database 105 so that the trained text synthesis network can accurately synthesize text.
In some embodiments, the text synthesis network derived by the training device 103 may be applied to different systems or devices.
In one possible application scenario, the text synthesis network may be applied to a scene text recognition device, which may be understood as a computing device installed with a scene text recognition model. The scene text recognition model is used to recognize the position information of text in an image and complete text translation; specifically, the scene text recognition model recognizes the position of a word in a picture and converts the text data contained at that position into information understandable by a human, for example, translating a language A in the picture into a language B. With the development of deep learning, the effect of scene text recognition models has been greatly improved. However, a scene text recognition model that can be successfully applied to scene text recognition requires a large number of labeled data sets for training, and text labeling is time-consuming and labor-intensive and carries potential risks such as data security. The text image synthesis method greatly relieves the difficulty of text labeling: for example, text images can be synthesized by the trained text synthesis network, and the synthesized text images are used as a training data set of the scene text recognition model to train the scene text recognition model, so that no manual text labeling is needed, which further improves the training efficiency of the scene text recognition model.
It should be noted that the text synthesis network in the embodiment of the present application may be applied to other scenes requiring text synthesis besides the training data set used for synthesizing the scene text recognition model, and the present application is not limited thereto.
In some embodiments, as shown in FIG. 1, the execution device 104 is configured with an I/O interface 107 for interacting with external devices, for example, receiving a target text image and a target text style image sent by the user device 101 via the I/O interface. The computing module 109 in the execution device 104 processes the input target text image and target text style image using the trained text synthesis network, and outputs a synthesized text image in which the style of the target text is the target text style. The execution device 104 transmits the synthesized text image to the user device 101 through the I/O interface, so that the user device 101 presents the synthesized text image.
It should be noted that fig. 1 is only a schematic diagram of an application scenario provided in the embodiment of the present application, and the positional relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation. In some embodiments, the data acquisition device 102 may be the same device as the user device 101, the training device 103, and the execution device 104. The database 105 may be distributed on one server or a plurality of servers, and the content library 106 may be distributed on one server or a plurality of servers.
First, a training process of the text synthesis network will be described.
Fig. 2 is a schematic view of a training scenario of a text synthesis network according to the present application, and fig. 3 is a schematic view of a training flow of the text synthesis network according to an embodiment of the present application. As shown in fig. 3, the method includes:
s301, acquiring a training text image and a training text style image;
the execution subject of the embodiment of the present application is a device having a model training function, for example, a text image synthesis device. In some embodiments, the text image synthesizing apparatus is a computing device. In some embodiments, the text image synthesizing apparatus is a unit having a data processing function in a computing device, for example, a processor in the computing device. The embodiment of the present application takes an execution subject as an example of a computing device.
In some embodiments, the computing device may be a terminal device, such as a terminal server, a smart phone, a laptop, a tablet, a personal desktop, a smart camera, and the like.
The training text image and the training text pattern image may be understood as a training text image and a training text pattern image in a training set, where the training process of the text synthesis network is the same for each training text image and each training text pattern image in the training set.
In some embodiments, one training text image and one training text pattern image are input to the text synthesis network during each training process. After training with the current training text image and training text pattern image is completed, the next training text image and the next training text pattern image are input to start the next round of training.
In some embodiments, within each training process, multiple training text images and training text pattern images may be input, and the text synthesis network may be trained simultaneously using the multiple training text images and training text pattern images.
In some embodiments, the training set includes a plurality of training text images, and the application does not limit the specific number of training text images.
Optionally, the training text images are different from one another.
Alternatively, the training text image may include fonts, symbols, etc. in any format.
In some embodiments, the training set includes a plurality of training text pattern images, and the present application does not limit the specific number of training text pattern images and does not limit the text patterns in the training text pattern images.
Optionally, the training text pattern images are different from one another.
In some embodiments, the text pattern in the training text pattern image includes at least one of: font color, text style, font size, font position, font slope, font background, etc.
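For illustration, these style attributes could be grouped into a simple configuration structure. The following Python sketch is not part of the patent; all field names and types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TextStyle:
    """Illustrative container for the text style attributes listed above
    (all field names and default values are assumptions, not part of the patent)."""
    font_color: Tuple[int, int, int] = (255, 255, 255)  # RGB font color
    font_style: Optional[str] = None                    # e.g. a typeface or font family name
    font_size: int = 32                                  # in pixels
    font_position: Tuple[int, int] = (0, 0)              # top-left offset in the image
    font_slope: float = 0.0                              # slant angle in degrees
    font_background: Optional[str] = None                # description or path of the background
```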
S302, inputting the training text images and the training text style images into a text synthesis network, and performing end-to-end training on the text synthesis network by taking a text recognition module as a supervision module.
The text recognition module can recognize the text information in an image synthesized by the text synthesis network. The recognized text information can therefore be compared with the original text information to judge whether the image synthesized by the text synthesis network is accurate. For example, the text recognition module recognizes the text information in an image synthesized by the text synthesis network and outputs the recognized text information, and the text information output by the text recognition module is compared with the original text information. If the error between the text information output by the text recognition module and the original text information is small, the image synthesized by the text synthesis network is accurate, and the text recognition module can clearly recognize the text information in the synthesized image. If the error between the text information output by the text recognition module and the original text information is large, the image synthesized by the text synthesis network is inaccurate, and the text recognition module cannot clearly recognize the text information in the synthesized image.
Therefore, the text recognition module can measure the accuracy of the image synthesized by the text synthesis network, and based on the accuracy, as shown in fig. 2, in the training process of the text synthesis network, the text recognition module is used as a supervision module of the text synthesis network to assist the training of the text synthesis network, and after the training of the text synthesis network is completed, the text recognition module stops working and does not work in the actual use process of the text synthesis network.
In some embodiments, FIG. 2 shows that the input of the text recognition module is connected to the intermediate-layer output of the text synthesis network, i.e., the synthesized image output by an intermediate layer of the text synthesis network serves as the input of the text recognition module. It should be noted that the connection manner of the text synthesis network and the text recognition module in the present application includes, but is not limited to, that shown in fig. 2; in some embodiments, the input end of the text recognition module is connected with the output end of the text synthesis network, that is, the final output of the text synthesis network is the input of the text recognition module. The present application does not limit the specific connection manner of the text recognition module and the text synthesis network.
It should be noted that, in the embodiment of the present application, the specific network structure of the text recognition module is not limited, as long as it is a model capable of recognizing the text information in a picture.
In some embodiments, the text recognition module of the present embodiment is trained before the text synthesis network training, for example, the text recognition module is pre-trained using labeled text data, and the parameters of the text recognition module do not participate in updating during the text synthesis network training. In the training process of the text synthesis network, the pre-trained text recognition module is used as a supervision module to assist the training of the text synthesis network, so that the training accuracy of the text synthesis network is improved, and the synthesis effect of the trained text synthesis network during text synthesis is further improved.
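A minimal sketch of this arrangement, assuming a PyTorch implementation (the patent does not name a framework): the pre-trained recognition module is switched to evaluation mode and its parameters are frozen, so that only the text synthesis network receives gradient updates during training.

```python
import torch.nn as nn

def freeze(recognizer: nn.Module) -> nn.Module:
    """Switch a pre-trained text recognition module to evaluation mode and
    freeze its parameters so they do not participate in updating."""
    recognizer.eval()                  # no dropout / batch-norm statistics updates
    for p in recognizer.parameters():
        p.requires_grad_(False)        # recognizer weights stay fixed during synthesis training
    return recognizer

# usage (hypothetical): recognizer = freeze(pretrained_recognizer)
```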
In some embodiments, as shown in fig. 4, the text synthesis network includes a text conversion module to convert text patterns of training text in the training text image into text patterns in the training text pattern image. The input end of the text recognition module is connected with the output end of the text conversion module, namely, the first image output by the text conversion module is used as the input of the text recognition module, and the text recognition module is used for recognizing the text information in the first image.
On the basis of the network model shown in fig. 4, as shown in fig. 5, the above S302 includes the following steps S302-1 to S302-3:
s302-1, inputting a training text image and a training text style image into a text conversion module to obtain a first image output by the text conversion module, wherein a text in the first image is a text in the training text image, and a text style in the first image is a text style in the training text style image;
s302-2, inputting the first image into a text recognition module to obtain text information output by the text recognition module;
s302-3, performing end-to-end training on the text synthesis network according to the difference between the text information output by the text recognition module and the text information of the training text image.
Specifically, as shown in fig. 4, the text synthesis network of the present application includes a text conversion module, and the text conversion module is connected to the text recognition module. The text style in the training text style image of the present embodiment includes the font type, the font slope, the font color, and the like. The training text image and the training text style image are input into the text conversion module, the text conversion module converts the training text image into a text image having the text style of the training text style image, and this text image is recorded as a first image Ot. The first image Ot output by the text conversion module is input into the text recognition module, and the text recognition module recognizes the text information in the first image Ot to obtain recognized text information. The text information recognized by the text recognition module is compared with the text information in the training text image to judge whether the first image converted by the text conversion module is accurate. For example, when the difference (or loss) between the text information recognized by the text recognition module and the text information of the training text image is greater than a preset value, which indicates that the text synthesis network has not been sufficiently trained, the text synthesis network is trained in reverse according to this difference (or loss), for example, by adjusting parameters in the text synthesis network.
For example, as shown in fig. 4, it is assumed that the text in the training text image is "barbarous", and the text style in the training text style image is: font type 1, inclined font, white font. The training text image and the training text style image are input into the text conversion module, and the text conversion module outputs the first image Ot shown in fig. 4, in which the text style of the text "barbarous" is the text style in the training text style image, that is, font type 1, inclined, and white. The first image is input into the text recognition module, and the text recognition module recognizes and outputs the text information "barbarous" in the first image Ot. The text synthesis network is trained end-to-end according to the difference between the text information output by the text recognition module and the text information of the training text image. For example, if the text information recognized by the text recognition module in the first image Ot is "aarbarous", which is inconsistent with the text information "barbarous" in the training text image, this indicates that the text synthesis network is not yet fully trained, and training may be continued by adjusting the parameters in the text synthesis network until the training end condition is satisfied.
Optionally, the training end condition may be that the number of training iterations reaches a preset value, or that the loss reaches a preset value.
In some embodiments, as shown in fig. 6, the middle layer of the text conversion module outputs the text skeleton image Osk, and in this case, as shown in fig. 7, the above S302-3 includes:
s302-31, acquiring a text skeleton image output by a middle layer of a text conversion module;
s302-32, obtaining a first loss according to the difference between the first image and the text skeleton image;
s302-33, obtaining a second loss according to the difference between the text information output by the text recognition module and the text information of the training text image;
and S302-34, performing end-to-end training on the text synthesis network according to the first loss and the second loss.
Specifically, as shown in FIG. 6, the text conversion module includes a middle layer that can output a skeleton image of the converted text, i.e., the text skeleton image Osk in FIG. 6. The text skeleton image Osk is consistent with the first image Ot in both the font slope direction and the font type, the characters in the text skeleton image Osk are white, and the background is black. In some embodiments, the text skeleton image Osk may be understood as a binary mask of Ot. A first loss L_T is determined according to the difference between the text skeleton image Osk and the first image Ot, and a second loss L_R is determined according to the difference between the text information recognized by the text recognition module and the text information of the input training text image. The text synthesis network is trained end-to-end according to the first loss L_T and the second loss L_R, i.e., the text synthesis network is trained in reverse according to the first loss L_T and the second loss L_R, for example, by adjusting parameters in the text conversion module, until the first loss L_T and the second loss L_R reach certain preset values.
Optionally, the sum of the first loss L_T and the second loss L_R is used as the loss of the text synthesis network to train the text synthesis network.
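A hedged sketch of how these two losses could be combined, assuming a PyTorch implementation; the tensor shapes and the choice of L1 and cross-entropy losses are illustrative assumptions, since this embodiment does not fix particular loss functions:

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: Ot as an RGB first image, Osk as a single-channel text
# skeleton image, char_logits as per-character class scores from the recognition
# module, and label_ids as the character indices of the training text.
Ot = torch.rand(1, 3, 64, 256, requires_grad=True)
Osk = torch.rand(1, 1, 64, 256)
char_logits = torch.randn(12, 37, requires_grad=True)   # 12 characters, 37-class alphabet (assumed)
label_ids = torch.randint(0, 37, (12,))

L_T = F.l1_loss(Osk, Ot.mean(dim=1, keepdim=True))  # first loss: skeleton image vs. first image
L_R = F.cross_entropy(char_logits, label_ids)       # second loss: recognized text vs. training text
loss = L_T + L_R                                    # optional combined loss of the text synthesis network
loss.backward()                                     # reverse (back-propagated) training signal
```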
In some embodiments, as shown in FIG. 8, the text synthesis network includes a background repair module and a fusion module in addition to the text conversion module, wherein an output of the text conversion module and an output of the background repair module are both connected to an input of the fusion module.
The background repairing module is used for processing the background in the input training text style image to obtain a background characteristic image.
In some embodiments, the background restoration module may output a plurality of background feature maps with different sizes, and input the plurality of background feature maps with different sizes into the fusion module.
The fusion module is used for fusing the first image input by the text conversion module and the background characteristic image input by the background restoration module to obtain a second image, wherein the text style of the training text in the second image is the text style in the training text style image.
In this case, the above S302-34 includes:
step 1, inputting a training text style image into a background repairing module to obtain a background characteristic diagram output by the background repairing module;
step 2, inputting the first image and the background feature map into a fusion module to obtain a second image output by the fusion module, wherein the text style of the training text in the second image is consistent with the text style in the training text style image;
step 3, obtaining a third loss according to the difference between the second image output by the fusion module and the training text image and the training text style image;
and 4, performing end-to-end training on the text synthesis network according to the first loss, the second loss and the third loss.
In the training process, the training text image and the training text style image are input to the text synthesis network. As shown in fig. 8, the training text image and the training text style image are input to the text conversion module and processed by it; the intermediate layer of the text conversion module outputs the text skeleton image Osk, and the text conversion module finally outputs the first image Ot. A first loss L_T can be obtained according to the difference between the text skeleton image Osk and the first image Ot, specifically according to the difference between each pixel point in the text skeleton image Osk and the corresponding pixel point in the first image Ot. Meanwhile, the first image Ot is input into the text recognition module, the text recognition module recognizes the text information in the first image Ot, and a second loss L_R is obtained according to the difference between the text information recognized by the text recognition module in the first image and the text information of the training text image.
In addition, the training text style image is input into the background repair module, the background repair module outputs a background feature map and inputs it into the fusion module, and at the same time the first image output by the text conversion module is also input into the fusion module. The fusion module fuses the first image Ot and the background feature map and outputs a second image Of, wherein the style of the training text in the second image Of is consistent with the text style in the training text style image; for example, the text style of the text in the second image is consistent with the text style of the training text style image, and the background of the second image is consistent with the background of the training text style image. A third loss L_F is obtained according to the difference between the second image Of output by the fusion module and the training text image and the training text style image, specifically according to the difference between each pixel point in the second image Of and the corresponding pixel points of the training text image and the training text style image.
The text synthesis network is trained end-to-end according to the first loss L_T, the second loss L_R and the third loss L_F, i.e., the text synthesis network is trained in reverse according to the first loss L_T, the second loss L_R and the third loss L_F, for example, by adjusting parameters in the text conversion module and the background repair module, until the first loss L_T, the second loss L_R and the third loss L_F reach certain preset values.
Optionally, the sum of the first loss L_T, the second loss L_R and the third loss L_F is used as the loss of the text synthesis network to train the text synthesis network.
In some embodiments, as shown in fig. 9, during the training process, the background restoration module outputs a final processed background map Ob. In this case, the step 4 includes: obtaining a background image output by a background restoration module; obtaining a fourth loss according to a difference between the background image output by the background restoration module and the training text style image (specifically, the background in the training text style image); and performing end-to-end training on the text synthesis network according to the first loss, the second loss, the third loss and the fourth loss.
In a possible implementation, the fourth loss L_B is obtained according to the difference between the pixel points in the background map Ob and the background pixel points of the training text style image.
It should be noted that, in the embodiment of the present application, the specific loss function used for calculating the first loss, the second loss, the third loss, and the fourth loss is not limited, and may be any one of the following: a logarithmic loss function, a quadratic loss function, an exponential loss function, a cross-entropy loss function, or a mean square error loss function.
In some embodiments, the training the text synthesis network end to end according to the first loss, the second loss, the third loss, and the fourth loss includes: taking the sum of the first loss, the second loss, the third loss and the fourth loss as the loss of the text synthesis network; and training the text synthesis network end to end according to the loss of the text synthesis network.
Fig. 10 is a schematic diagram of a training process of a text synthesis network according to an embodiment of the present application, and as shown in fig. 10, a training process of the text synthesis network according to the embodiment of the present application includes:
s401, obtaining a training text image and a training text style image.
S402, inputting the training text image and the training text style image into the text conversion module to obtain a text skeleton image and a first image output by the text conversion module.
And S403, inputting the first image into the text recognition module to obtain text information output by the text recognition module.
S404, inputting the training text style image into a background repairing module to obtain a background feature image and a background image output by the background repairing module.
It should be noted that the above S404 has no precedence relationship with the above S402 and S403, that is, S404 may be executed before the above S402 and S403, may be executed after S402 and S403, or may be executed simultaneously with S402 and S403.
S405, inputting the background image and the first image into a fusion module to obtain a second image output by the fusion module;
s406, obtaining a first loss according to the difference between the text skeleton image and the first image; obtaining a second loss according to the difference between the text information output by the text recognition module and the text information of the training text image; obtaining a third loss according to the difference between the second image and the training text style image; a fourth loss is derived from a difference between the background image and the training text style image.
And S407, performing end-to-end training on the text synthesis network according to the first loss, the second loss, the third loss and the fourth loss.
It should be noted that, the present embodiment does not limit the order of determining the losses.
The specific implementation manner of each step may refer to the description of the above embodiment, and is not described herein again.
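The flow of S401 to S407 can be condensed into a single training-step sketch. This assumes a PyTorch implementation; text_conversion, background_inpainting, fusion and recognizer are hypothetical names standing in for the four modules, the L1 and cross-entropy losses are only examples of the loss functions listed above, and the optimizer is assumed to be built over the parameters of the text conversion, background repair and fusion modules only, so that the pre-trained text recognition module stays frozen.

```python
import torch
import torch.nn.functional as F

def train_step(text_conversion, background_inpainting, fusion, recognizer,
               optimizer, train_text_img, train_style_img, label_ids):
    # S402: the text conversion module outputs the text skeleton image Osk and the first image Ot
    Osk, Ot = text_conversion(train_text_img, train_style_img)
    # S403: the (frozen) text recognition module predicts per-character logits from Ot
    char_logits = recognizer(Ot)
    # S404: the background repair module outputs background feature maps and the background image Ob
    bg_feats, Ob = background_inpainting(train_style_img)
    # S405: the fusion module outputs the second image Of
    Of = fusion(Ot, bg_feats)

    # S406: the four losses (L1 / cross-entropy are illustrative choices)
    L_T = F.l1_loss(Osk, Ot.mean(dim=1, keepdim=True))          # first loss: text skeleton image vs. first image
    L_R = F.cross_entropy(char_logits.flatten(0, 1),
                          label_ids.flatten())                   # second loss: recognized text vs. text label
    L_F = F.l1_loss(Of, train_style_img)                         # third loss: second image vs. training text style image
    L_B = F.l1_loss(Ob, train_style_img)                         # fourth loss: background image vs. training text style image

    # S407: end-to-end training on the sum of the losses (the recognizer is not updated)
    loss = L_T + L_R + L_F + L_B
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A suitable optimizer would, for example, be constructed only over the parameters of the three synthesis modules, which keeps the text recognition module frozen as described above.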
In a specific embodiment, a network structure of a text synthesis network according to an embodiment of the present application is shown in fig. 11, where the text synthesis network includes a text conversion module, a background repair module, and a fusion module.
1. Text Conversion Module (TCM): used for converting the text style of a source image into a target style, including font, color, position and scale; for example, the text style in the training text image is converted into the text style of the training text style image.
For example, the training text is rendered with a fixed font against a background whose pixel value is set to 127 to obtain the training text image, and the text style of the text in the first image Ot is the text style of the training text style image. The text conversion module consists of an encoder and a decoder.
In some embodiments, on the encoding side, the text branch of the encoder includes 3 down-sampling convolutional layers and 4 residual coding blocks, and the training text image passes through the 3 down-sampling convolutional layers and the 4 residual coding blocks to output a text feature map. The text style branch of the encoder also includes 3 down-sampling convolutional layers and 4 residual coding blocks, and the training text style image passes through them to output a style feature map. The text feature map and the style feature map are concatenated along their depth axis. On the decoding side, the decoder includes 3 up-sampling convolutional layers and an activation function layer (e.g., Convolution-BatchNorm-LeakyReLU), and outputs the first image Ot.
In addition, a skeleton-guided learning mechanism is introduced. Specifically, a skeleton response module is added at the decoding end; the skeleton response module consists of 3 up-sampling convolutional layers, a single-channel text skeleton image is then predicted through a sigmoid activation function, and the text skeleton image Osk is output along the depth axis direction of the decoder.
The first loss is obtained according to the difference between each pixel point in the text skeleton image Osk and the corresponding pixel point in the first image Ot.
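A minimal sketch of such a text conversion module, assuming a PyTorch implementation. Only the layer counts (3 down-sampling layers and 4 residual blocks per encoder branch, 3 up-sampling layers in the decoder, a 3-layer skeleton response module) follow the description above; the channel widths, kernel sizes, activations and the omission of fusing the skeleton prediction back into the decoder are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

def encoder_branch(in_ch=3, base=64):
    # 3 down-sampling convolutional layers followed by 4 residual coding blocks
    layers, ch = [], in_ch
    for i in range(3):
        out = base * 2 ** i
        layers += [nn.Conv2d(ch, out, 3, stride=2, padding=1),
                   nn.BatchNorm2d(out), nn.LeakyReLU(0.2)]
        ch = out
    layers += [ResidualBlock(ch) for _ in range(4)]
    return nn.Sequential(*layers)

def up_stack(in_ch, out_ch, last_act):
    # 3 up-sampling (transposed) convolutional layers, Convolution-BatchNorm-LeakyReLU style
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
        nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
        nn.ConvTranspose2d(128, out_ch, 4, stride=2, padding=1), last_act)

class TextConversionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_enc = encoder_branch()    # text branch of the encoder
        self.style_enc = encoder_branch()   # text style branch of the encoder
        self.skeleton_head = up_stack(512, 1, nn.Sigmoid())  # skeleton response module -> Osk
        self.decoder = up_stack(512, 3, nn.Tanh())           # main decoder -> first image Ot

    def forward(self, text_img, style_img):
        # the two feature maps are concatenated along the depth (channel) axis
        feats = torch.cat([self.text_enc(text_img), self.style_enc(style_img)], dim=1)
        return self.skeleton_head(feats), self.decoder(feats)

# e.g. osk, ot = TextConversionModule()(torch.rand(1, 3, 64, 256), torch.rand(1, 3, 64, 256))
```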
2. Background repair module (BIM): used for deleting the original character stroke pixels and filling the deleted positions with appropriate textures; it follows the general "U-Net" architecture with a bottom-up feature fusion mode.
As shown in fig. 11, this module takes the training text style image as its input, deletes all the text stroke pixels in the training text style image, fills in appropriate textures, and outputs the background image Ob. The encoder of the background repair module includes 3 down-sampling convolutional layers with a stride of 2 and 4 residual blocks, and the decoder includes 3 up-sampling convolutional layers for recovering a feature map of the original size. Each layer is followed by ReLU activation. In order to make the visual effect more realistic, as much background texture as possible needs to be restored; U-Net has shown that adding skip connections between mirrored layers is effective for image-to-image translation tasks such as object segmentation.
The fourth loss is obtained according to the difference between each pixel point in the background image Ob and the corresponding pixel point in the training text style image.
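A minimal sketch of such a background repair module, assuming a PyTorch implementation; only the layer counts (3 stride-2 down-sampling layers, 4 residual blocks, 3 up-sampling layers with skip connections between mirrored layers) follow the description above, while the channel widths, activations and the number of returned feature maps are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class BackgroundRepairModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU()),     # 1/2 resolution
            nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU()),   # 1/4 resolution
            nn.Sequential(nn.Conv2d(128, 256, 3, 2, 1), nn.ReLU()),  # 1/8 resolution
        ])
        self.res = nn.Sequential(*[ResBlock(256) for _ in range(4)])
        self.up = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU()),
            nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, 2, 1), nn.ReLU()),  # skip from mirror layer
            nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, 2, 1), nn.Tanh()),     # skip from mirror layer
        ])

    def forward(self, style_img):
        skips, x = [], style_img
        for d in self.down:
            x = d(x)
            skips.append(x)
        x = self.res(x)
        feats = []                                   # multi-scale background feature maps for the fusion module
        for i, u in enumerate(self.up):
            if i > 0:
                x = torch.cat([x, skips[len(self.up) - 1 - i]], dim=1)  # U-Net style skip connection
            x = u(x)
            feats.append(x)
        return feats[:-1], feats[-1]                 # (background feature maps, background image Ob)
```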
3. Fusion module: used for fusing the foreground information with the effective background texture information so as to synthesize the edited text image.
As shown in fig. 11, the fusion module also follows an encoder-decoder FCN framework. The encoder includes three down-sampling convolutional layers and a residual block. The decoder includes 3 up-sampling transposed convolutional layers and activation function layers (e.g., Convolution-BatchNorm-LeakyReLU), and finally generates the edited text image, i.e., the second image Of. It should be noted that the background feature maps output by the background repair module are connected to the up-sampling convolutional layers of the decoder with the same resolution. In this way, the background details of the image output by the fusion module are substantially restored.
The third loss is obtained according to the difference between each pixel point in the second image Of and the corresponding pixel point in the training text style image.
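A minimal sketch of such a fusion module, assuming a PyTorch implementation and assuming it consumes the two intermediate background feature maps returned by the background repair module sketch above; the residual block after the encoder is omitted for brevity and all channel widths are assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 3, 2, 1), nn.LeakyReLU(0.2))
        self.up1 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1),
                                 nn.BatchNorm2d(128), nn.LeakyReLU(0.2))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, 2, 1),
                                 nn.BatchNorm2d(64), nn.LeakyReLU(0.2))
        self.up3 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, first_img, bg_feats):
        # bg_feats: background feature maps at 1/4 and 1/2 resolution, as returned
        # by the background repair module sketch above
        x = self.encoder(first_img)
        x = self.up1(x)
        x = self.up2(torch.cat([x, bg_feats[0]], dim=1))           # fuse 1/4-resolution background features
        second_img = self.up3(torch.cat([x, bg_feats[1]], dim=1))  # fuse 1/2-resolution background features
        return second_img                                          # second image Of
```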
4. The text recognition module is used for recognizing text information in the text image.
As shown in fig. 11, the text recognition module includes a convolutional neural network feature extraction unit (CNN feature extraction), an LSTM (Long Short-Term Memory) network, and a decoding unit. Specifically, the convolutional neural network feature extraction unit processes the first image to obtain image features of the first image, the image features of the first image are input into the LSTM network for processing and a feature vector is output, and the feature vector is input into the decoding unit, which outputs the recognized text information.
And comparing the text information output by the text recognition module with the text information of the training text image to obtain a second loss.
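A minimal sketch of such a text recognition module, assuming a PyTorch implementation; the patent only fixes the CNN feature extraction, LSTM and decoding stages, so the CNN depth, the bidirectional LSTM, the linear decoding layer and the number of character classes are assumptions.

```python
import torch
import torch.nn as nn

class TextRecognitionModule(nn.Module):
    def __init__(self, num_classes=37):
        super().__init__()
        self.cnn = nn.Sequential(                      # CNN feature extraction unit
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))           # collapse the height dimension
        self.lstm = nn.LSTM(256, 256, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(512, num_classes)     # decoding unit over character classes

    def forward(self, img):
        feats = self.cnn(img)                          # (B, 256, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)        # (B, W', 256): one feature vector per image column
        out, _ = self.lstm(seq)
        return self.decoder(out)                       # (B, W', num_classes) per-character logits
```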
Next, as shown in fig. 11, the text synthesis network is trained end-to-end according to the first loss, the second loss, the third loss, and the fourth loss until reaching a training end condition.
On the basis of the original pixel-level supervision, the text recognition module is added to provide semantic-level supervision for the text synthesis network, effectively utilizing the semantic characteristics of text images. The text recognition module predicts the text information in the first image generated by the text conversion module, and the second loss is calculated with respect to the training text image label. Because the text recognition module is pre-trained, it can effectively predict the text information in an image. If the image generated by the text conversion module is close enough to the training text image, the probability with which the text recognition module predicts the character information is high, and the corresponding second loss is small; conversely, if the text in the first image generated by the text conversion module is difficult to recognize, the corresponding second loss is large. With the help of the second loss, the text conversion module learns to synthesize a text image that is easier to recognize, rather than merely an image that is more similar to the training text image at the pixel level, thereby effectively improving the synthesis effect of the text synthesis network.
The training process of the text synthesis network is described in detail above, and the prediction process of the text synthesis network is described below.
Fig. 12 is a schematic flowchart of a text image synthesis method provided in the embodiment of the present application, that is, the embodiment of the present application mainly introduces a process of synthesizing a target text image and a target text style image by using the trained text synthesis network. As shown in fig. 12, includes:
s501, acquiring a target text image and a target text style image;
s502, inputting the target text image and the target text style image into a text synthesis network to obtain a synthesized text image output by the text synthesis network.
The text in the synthesized text image is a target text in the target text image, the text pattern in the synthesized text image is a text pattern in the target text pattern image, the text synthesis network is trained by the aid of a text recognition module, and the text recognition module is used for recognizing text information in the image.
In some embodiments, the trained text synthesis network is as shown in fig. 13, where a target text image and a target text pattern image are input into the text synthesis network, the target text image and the target text pattern image are input into the text conversion module, the text conversion module converts the text pattern of the target text image into a target text pattern, obtains a converted text image, and inputs the converted text image into the fusion module. And simultaneously, inputting the target text style image into a background repairing module, wherein the background repairing module acquires a background characteristic diagram corresponding to the target text style image and inputs the background characteristic diagram into a fusion module, the fusion module fuses the converted text image and the background characteristic diagram and outputs a synthesized text image, and the style of the target text in the synthesized text image is the target text style.
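A minimal sketch of this prediction process, reusing the hypothetical PyTorch modules from the training sketches above; the text recognition module is not involved at inference time.

```python
import torch

@torch.no_grad()
def synthesize(text_conversion, background_inpainting, fusion,
               target_text_img, target_style_img):
    _, converted = text_conversion(target_text_img, target_style_img)  # converted text image (skeleton output unused)
    bg_feats, _ = background_inpainting(target_style_img)              # background feature maps (Ob unused)
    return fusion(converted, bg_feats)                                 # synthesized text image
```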
In the embodiment of the present application, the text recognition module assists in training the text synthesis network, so that the text synthesis network pays more attention to the text information in the image, which further improves the synthesis effect of the text synthesis network.
Fig. 14 is a schematic structural diagram of a text image synthesizing apparatus according to an embodiment of the present application. The text image synthesizing apparatus may be a computing device, or may be a component (e.g., an integrated circuit, a chip, etc.) of a computing device. As shown in fig. 14, the text image synthesizing apparatus 10 may include:
a first acquiring unit 11 configured to acquire a target text image and a target text style image;
a synthesizing unit 12, configured to input the target text image and the target text style image into a text synthesis network, so as to obtain a synthesized text image output by the text synthesis network;
the text in the synthesized text image is the target text in the target text image, the text style in the synthesized text image is the text style in the target text style image, the text synthesis network is trained by the aid of a text recognition module, and the text recognition module is used for recognizing text information in the image.
Fig. 15 is a schematic structural diagram of another text image synthesizing apparatus according to an embodiment of the present application. As shown in fig. 15, the text image synthesizing apparatus 10 may further include:
a second acquiring unit 13 configured to acquire a training text image and a training text style image;
and the training unit 14 is configured to input the training text image and the training text pattern image into the text synthesis network, and perform end-to-end training on the text synthesis network by using the text recognition module as a supervision module.
In some embodiments, the text recognition module has been trained prior to the text synthesis network training.
In some embodiments, the text synthesis network includes a text conversion module configured to convert a text pattern of a training text in the training text image into a text pattern in the training text pattern image, and the training unit 14 is configured to input the training text image and the training text pattern image into the text conversion module to obtain a first image output by the text conversion module, where the text in the first image is a text in the training text image, and the text pattern in the first image is a text pattern in the training text pattern image; inputting the first image into the text recognition module to obtain text information output by the text recognition module; and performing end-to-end training on the text synthesis network according to the difference between the text information output by the text recognition module and the text information of the training text image.
In some embodiments, the training unit 14 is specifically configured to obtain a text skeleton image output by the middle layer of the text conversion module; obtaining a first loss according to the difference between the first image and the text skeleton image; obtaining a second loss according to the difference between the text information output by the text recognition module and the text information of the training text image; and performing end-to-end training on the text synthesis network according to the first loss and the second loss.
In some embodiments, the text synthesis network further includes a background restoration module and a fusion module, and the training unit 14 is specifically configured to input the training text style image into the background restoration module to obtain a background feature map output by the background restoration module; inputting the first image and the background feature map into the fusion module to obtain a second image output by the fusion module, wherein the text style of the training text in the second image is consistent with the text style of the training text style image; obtaining a third loss according to the difference between the second image output by the fusion module and the training text image and the training text style image; and performing end-to-end training on the text synthesis network according to the first loss, the second loss and the third loss.
In some embodiments, the training unit 14 is specifically configured to obtain a background image output by the background restoration module; obtaining a fourth loss according to the difference between the background image output by the background restoration module and the training text style image; and performing end-to-end training on the text synthesis network according to the first loss, the second loss, the third loss and the fourth loss.
In some embodiments, the training unit 14 is specifically configured to use a sum of the first loss, the second loss, the third loss, and the fourth loss as the loss of the text synthesis network; and performing end-to-end training on the text synthesis network according to the loss of the text synthesis network.
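Putting the pieces of the sketch together, one end-to-end training step with the loss taken as the plain sum of the four losses might look like this; the optimizer choice, the learning rate, and the batch interface are assumptions, and only the synthesis network's parameters are updated while the recognizer defined earlier stays frozen.

```python
import torch
import torch.nn.functional as F

conversion = TextConversionModule()
background = BackgroundRestorationModule()
fusion = FusionModule()

# Optimizer over the synthesis network only (type and learning rate are assumptions).
optimizer = torch.optim.Adam(
    list(conversion.parameters()) + list(background.parameters()) + list(fusion.parameters()),
    lr=1e-4)

def training_step(text_img, style_img, target_img, labels, label_lengths):
    loss_1, loss_2, first_image = first_and_second_loss(
        conversion, recognizer, text_img, style_img, labels, label_lengths)
    bg_feat, bg_img = background(style_img)
    second_image = fusion(first_image, bg_feat)
    loss_3 = third_loss(second_image, target_img)
    loss_4 = F.l1_loss(bg_img, style_img)        # fourth loss vs. the training text style image (assumed L1)

    total = loss_1 + loss_2 + loss_3 + loss_4    # loss of the text synthesis network: plain sum
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```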
It is to be understood that the apparatus embodiments correspond to the method embodiments, and for similar descriptions reference may be made to the method embodiments. To avoid repetition, details are not described here again. Specifically, the apparatus 10 shown in fig. 14 and fig. 15 may correspond to the execution subject of the methods of the embodiments of the present application, and the foregoing and other operations and/or functions of the units in the apparatus 10 are respectively intended to implement the corresponding flows of those methods; for brevity, they are not described here again.
The apparatus and system of the embodiments of the present application are described above in terms of functional units with reference to the accompanying drawings. It is to be understood that the functional units may be implemented in hardware, by instructions in software, or by a combination of hardware and software units. Specifically, the steps of the method embodiments of the present application may be completed by integrated logic circuits of hardware in a processor and/or by instructions in the form of software, and the steps of the methods disclosed in the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software units in a decoding processor. Optionally, the software unit may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware.
Fig. 16 is a block diagram of a computing device according to an embodiment of the present application. The computing device is configured to execute the text image synthesis method of the above embodiments; for details, refer to the description in the above method embodiments.
The computing device 200 shown in fig. 16 includes a memory 201, a processor 202, and a communication interface 203. The memory 201, the processor 202, and the communication interface 203 are communicatively connected to each other, for example, by a network connection. Optionally, the computing device 200 may further include a bus 204, in which case the memory 201, the processor 202, and the communication interface 203 are communicatively connected to each other via the bus 204, as shown in fig. 16.
The memory 201 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 201 may store a program; when the program stored in the memory 201 is executed by the processor 202, the processor 202 and the communication interface 203 are used to perform the above-described methods.
The processor 202 may be implemented as a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits.
The processor 202 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method of the present application may be completed by integrated logic circuits of hardware in the processor 202 or by instructions in the form of software. The processor 202 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 201, and the processor 202 reads the information in the memory 201 and completes the method of the embodiments of the present application in combination with its hardware.
The communication interface 203 enables communication between the computing device 200 and other devices or communication networks using transceiver modules such as, but not limited to, transceivers. For example, the data set may be acquired through the communication interface 203.
When computing device 200 includes bus 204, as described above, bus 204 may include a pathway to transfer information between various components of computing device 200 (e.g., memory 201, processor 202, communication interface 203).
According to an aspect of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.
In the above embodiments, the implementation may be realized, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized, in whole or in part, in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the modules is merely a logical function division, and there may be other divisions in actual implementation, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be in electrical, mechanical, or other forms.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed in the present application, and all such changes or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A text image synthesizing method, comprising:
acquiring a target text image and a target text style image;
inputting the target text image and the target text style image into a text synthesis network to obtain a synthesized text image output by the text synthesis network;
the text in the synthesized text image is the target text in the target text image, the text style in the synthesized text image is the text style in the target text style image, the text synthesis network is trained with the aid of a text recognition module, and the text recognition module is used for recognizing text information in an image.
2. The method of claim 1, further comprising:
acquiring a training text image and a training text style image;
and inputting the training text images and the training text style images into the text synthesis network, and performing end-to-end training on the text synthesis network by taking the text recognition module as a supervision module.
3. The method of claim 2, wherein the text recognition module has been trained prior to the text-synthesis-network training.
4. The method of claim 3, wherein the text synthesis network comprises a text conversion module configured to convert a text style of training text in the training text images into a text style in the training text style images, and wherein the inputting the training text images and the training text style images into the text synthesis network and performing end-to-end training on the text synthesis network by taking the text recognition module as a supervision module comprises:
inputting the training text image and the training text style image into the text conversion module to obtain a first image output by the text conversion module, wherein a text in the first image is a text in the training text image, and a text style in the first image is a text style in the training text style image;
inputting the first image into the text recognition module to obtain text information output by the text recognition module;
and performing end-to-end training on the text synthesis network according to the difference between the text information output by the text recognition module and the text information of the training text image.
5. The method of claim 4, wherein the training the text synthesis network end-to-end according to the difference between the text information output by the text recognition module and the text information of the training text image comprises:
acquiring a text skeleton image output by the middle layer of the text conversion module;
obtaining a first loss according to the difference between the first image and the text skeleton image;
obtaining a second loss according to the difference between the text information output by the text recognition module and the text information of the training text image;
and performing end-to-end training on the text synthesis network according to the first loss and the second loss.
6. The method of claim 5, wherein the text synthesis network further comprises a background restoration module and a fusion module, and wherein the training the text synthesis network end-to-end according to the first loss and the second loss comprises:
inputting the training text style image into the background restoration module to obtain a background feature map output by the background restoration module;
inputting the first image and the background feature map into the fusion module to obtain a second image output by the fusion module, wherein the text style of the training text in the second image is consistent with the text style of the training text style image;
obtaining a third loss according to the difference between the second image output by the fusion module and the training text image and the training text style image;
and performing end-to-end training on the text synthesis network according to the first loss, the second loss and the third loss.
7. The method of claim 6, wherein the training the text synthesis network end-to-end based on the first loss, the second loss, and the third loss comprises:
obtaining a background image output by the background restoration module;
obtaining a fourth loss according to the difference between the background image output by the background restoration module and the training text style image;
and performing end-to-end training on the text synthesis network according to the first loss, the second loss, the third loss and the fourth loss.
8. The method of claim 7, wherein the training the text synthesis network end-to-end based on the first loss, the second loss, the third loss, and the fourth loss comprises:
taking the sum of the first loss, the second loss, the third loss and the fourth loss as the loss of the text synthesis network;
and performing end-to-end training on the text synthesis network according to the loss of the text synthesis network.
9. A text image synthesizing apparatus, comprising:
a first acquiring unit configured to acquire a target text image and a target text style image;
the synthesis unit is used for inputting the target text image and the target text style image into a text synthesis network to obtain a synthesized text image output by the text synthesis network;
the text in the synthesized text image is the target text in the target text image, the text style in the synthesized text image is the text style in the target text style image, the text synthesis network is trained with the aid of a text recognition module, and the text recognition module is used for recognizing text information in an image.
10. The apparatus of claim 9, further comprising:
a second acquiring unit for acquiring a training text image and a training text style image;
and the training unit is used for inputting the training text images and the training text style images into the text synthesis network, and performing end-to-end training on the text synthesis network by taking the text recognition module as a supervision module.
11. The apparatus of claim 10, wherein the text recognition module has been trained prior to the text synthesis network training.
12. The apparatus according to claim 11, wherein the text synthesis network includes a text conversion module, the text conversion module is configured to convert a text style of a training text in the training text image into a text style in the training text style image, and the training unit is specifically configured to input the training text image and the training text style image into the text conversion module, to obtain a first image output by the text conversion module, where the text in the first image is a text in the training text image, and the text style in the first image is a text style in the training text style image; inputting the first image into the text recognition module to obtain text information output by the text recognition module; and performing end-to-end training on the text synthesis network according to the difference between the text information output by the text recognition module and the text information of the training text image.
13. The apparatus according to claim 12, wherein the training unit is specifically configured to obtain a text skeleton image output by a middle layer of the text conversion module; obtaining a first loss according to the difference between the first image and the text skeleton image; obtaining a second loss according to the difference between the text information output by the text recognition module and the text information of the training text image; and performing end-to-end training on the text synthesis network according to the first loss and the second loss.
14. The apparatus according to claim 13, wherein the text synthesis network further includes a background restoration module and a fusion module, and the training unit is specifically configured to input the training text style image into the background restoration module to obtain a background feature map output by the background restoration module; inputting the first image and the background feature map into the fusion module to obtain a second image output by the fusion module, wherein the text style of the training text in the second image is consistent with the text style of the training text style image; obtaining a third loss according to the difference between the second image output by the fusion module and the training text image and the training text style image; and performing end-to-end training on the text synthesis network according to the first loss, the second loss and the third loss.
15. The apparatus according to claim 14, wherein the training unit is specifically configured to obtain a background image output by the background restoration module; obtaining a fourth loss according to the difference between the background image output by the background restoration module and the training text style image; and performing end-to-end training on the text synthesis network according to the first loss, the second loss, the third loss and the fourth loss.
16. The apparatus according to claim 15, wherein the training unit is specifically configured to use a sum of the first loss, the second loss, the third loss, and the fourth loss as the loss of the text synthesis network; and performing end-to-end training on the text synthesis network according to the loss of the text synthesis network.
17. A computing device, comprising: a memory, a processor;
the memory for storing a computer program;
the processor for executing the computer program to implement the text image synthesizing method according to any one of claims 1 to 8.
18. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the text image synthesis method according to any one of claims 1 to 8.
CN202110541630.XA 2021-05-18 2021-05-18 Text image synthesis method, device, equipment and storage medium Pending CN113191355A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110541630.XA CN113191355A (en) 2021-05-18 2021-05-18 Text image synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110541630.XA CN113191355A (en) 2021-05-18 2021-05-18 Text image synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113191355A true CN113191355A (en) 2021-07-30

Family

ID=76982240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110541630.XA Pending CN113191355A (en) 2021-05-18 2021-05-18 Text image synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113191355A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709406A (en) * 2020-08-18 2020-09-25 成都数联铭品科技有限公司 Text line identification method and device, readable storage medium and electronic equipment
CN112101354A (en) * 2020-09-23 2020-12-18 广州虎牙科技有限公司 Text recognition model training method, text positioning method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIANG WU ET AL.: "Editing Text in the Wild", arXiv, pages 1-9 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496225A (en) * 2021-09-07 2021-10-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113496225B (en) * 2021-09-07 2022-02-11 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN116935169A (en) * 2023-09-13 2023-10-24 腾讯科技(深圳)有限公司 Training method for draft graph model and draft graph method
CN116935169B (en) * 2023-09-13 2024-01-02 腾讯科技(深圳)有限公司 Training method for draft graph model and draft graph method

Similar Documents

Publication Publication Date Title
CN112766244B (en) Target object detection method and device, computer equipment and storage medium
CN111369581B (en) Image processing method, device, equipment and storage medium
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
KR20210080291A (en) Method, electronic device, and storage medium for recognizing license plate
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN112749609B (en) Human body image segmentation method, device, computer equipment and storage medium
CN113191355A (en) Text image synthesis method, device, equipment and storage medium
CN114596566A (en) Text recognition method and related device
CN116311279A (en) Sample image generation, model training and character recognition methods, equipment and media
CN116453121A (en) Training method and device for lane line recognition model
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN114419621A (en) Method and device for processing image containing characters
CN113159053A (en) Image recognition method and device and computing equipment
CN113297986A (en) Handwritten character recognition method, device, medium and electronic equipment
CN112884702A (en) Polyp identification system and method based on endoscope image
CN116883737A (en) Classification method, computer device, and storage medium
CN114707017B (en) Visual question-answering method, visual question-answering device, electronic equipment and storage medium
JP2020173669A (en) Image recognition device, image recognition method, image recognition program, and image recognition system
CN115620314A (en) Text recognition method, answer text verification method, device, equipment and medium
CN114741697A (en) Malicious code classification method and device, electronic equipment and medium
CN114639132A (en) Feature extraction model processing method, device and equipment in face recognition scene
CN116824308B (en) Image segmentation model training method and related method, device, medium and equipment
CN117474932B (en) Object segmentation method and device, electronic equipment and storage medium
CN116563840B (en) Scene text detection and recognition method based on weak supervision cross-mode contrast learning
CN117173731B (en) Model training method, image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination