CN115908640A - Method and device for generating image, readable medium and electronic equipment - Google Patents


Info

Publication number: CN115908640A
Application number: CN202211668274.9A
Authority: CN (China)
Prior art keywords: text, image, sample, generation model, image generation
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 郭明宇, 刘博元, 冉蛟
Current Assignee: Beijing Zitiao Network Technology Co Ltd
Original Assignee: Beijing Zitiao Network Technology Co Ltd
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority: CN202211668274.9A

Abstract


Embodiments of the disclosure relate to a method and an apparatus for generating an image, a readable medium, and an electronic device. The method comprises the following steps: acquiring a first text describing a target object, and acquiring a second text; and inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model. The target image comprises the target object and text information corresponding to the second text. The target image generation model comprises a first text encoder and a second text encoder: the first text encoder encodes the first text to obtain a first text feature corresponding to the first text, the second text encoder encodes the second text to obtain a second text feature corresponding to the second text, and the target image generation model generates the target image according to the first text feature and the second text feature. In this way, clear text information can be included in the generated target image.

Description

Method and device for generating image, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating an image, a readable medium, and an electronic device.
Background
With the advance of computer technology, image generation technology has made great progress. For example, a piece of text describing the content of an image can be input, and an image meeting that textual description can be generated from it.
However, in the related art, the generated image cannot contain clear text information.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect of embodiments of the present disclosure, there is provided a method of generating an image, the method comprising:
acquiring a first text for describing a target object;
acquiring a second text;
inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model;
the target image comprises the target object and word information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is used for generating the target image according to the first text feature and the second text feature.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for generating an image, the apparatus comprising:
the first acquisition module is used for acquiring a first text for describing a target object;
the second acquisition module is used for acquiring a second text;
the image generation module is used for inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model; the target image comprises the target object and word information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is used for generating the target image according to the first text feature and the second text feature.
According to a third aspect of embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first aspect of the present disclosure.
By adopting the technical scheme, a first text for describing a target object is obtained, and a second text is obtained; inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model; the target image comprises a target object and word information corresponding to a second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain first text characteristics corresponding to the first text, the second text encoder is used for encoding the second text to obtain second text characteristics corresponding to the second text, and the target image generation model is further used for generating a target image according to the first text characteristics and the second text characteristics. Therefore, clear text information can be contained in the generated target image, and the text information in the generated target image can have a good visual effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a diagram illustrating an image generated from text in accordance with an exemplary embodiment.
FIG. 2 is a flow chart illustrating a method of generating an image according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a target image generation model according to an exemplary embodiment.
FIG. 4 is a flow chart illustrating another method of generating an image according to an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a method for generating a target image generation model, according to an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating a pending image generation model in accordance with an exemplary embodiment.
FIG. 7 is a diagram illustrating an image generated from text in a related art, according to an example embodiment.
FIG. 8 is a diagram illustrating a target image generated from text by a target image generation model, according to an example embodiment.
Fig. 9 is a block diagram illustrating an apparatus for generating an image according to an exemplary embodiment.
Fig. 10 is a block diagram illustrating another apparatus for generating an image according to an exemplary embodiment.
FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and its variants as used in this disclosure are intended to be inclusive, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise. In the description of the present disclosure, unless otherwise indicated, "plurality" means two or more, and similar terms are read analogously. "At least one item" or "one or more items" of a set of items refers to any combination of those items, including single items or plural items. For example, "at least one of a" may represent any number of a; as another example, "one or more of a, b, and c" may represent: a, b, c, a and b, a and c, b and c, or a, b, and c, where a, b, and c may each be singular or plural. "And/or" describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" indicates that the associated objects before and after it are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Although operations or steps may be described in a particular order in the drawings of the embodiments of the disclosure, this should not be understood as requiring that they be performed in the particular order shown or sequentially, or that all of the illustrated operations or steps must be performed, to achieve desirable results. In embodiments of the present disclosure, these operations or steps may be performed in series or in parallel, or only some of them may be performed.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
It is understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in a proper manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly indicate that the requested operation will require acquiring and using the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that performs the operations of the disclosed technical solution.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user via, for example, a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control with which the user chooses "agree" or "disagree" to provide personal information to the electronic device.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
Meanwhile, it is understood that the data involved in the present technical solution (including but not limited to the data itself, the acquisition or use of the data) should comply with the requirements of the corresponding laws and regulations and the related regulations.
The disclosure is described below with reference to specific examples.
First, an application scenario of the present disclosure will be explained. The present disclosure may be applied to image generation scenarios, such as generating images from textual descriptions. In some embodiments of the present disclosure, a piece of text description information is input into a pre-generated model, and at least one image corresponding to the text description information is output by the model. FIG. 1 is a diagram illustrating an image generated from text in accordance with an exemplary embodiment. As shown in fig. 1, when the following text information is input to the above model: "Teddy bears shopping for groceries in ancient Egypt" or "Teddy bear in Tokyo shopping at a grocery store", the model may output an image corresponding to the text information, such as the image shown in FIG. 1.
After the model is trained on large amounts of data, the generated images can achieve a vivid and fully realized visual effect. However, although a model trained with a purely contrastive-learning approach can capture the correspondence between text and images well, the generated image either contains no character information at all or, even when it does, the character information is not clear enough, which degrades the user's reading experience.
FIG. 2 is a flow chart illustrating a method of generating an image according to an exemplary embodiment. The method can be applied to electronic devices, which may include terminal devices such as smart phones, smart wearable devices, smart speakers, smart tablets, PDAs (Personal Digital Assistants), CPEs (Customer Premises Equipment), personal computers, vehicle terminals, and the like; the electronic device may also include a server, such as a local server or a cloud server. As shown in fig. 2, the method may include:
s201, obtaining a first text for describing the target object.
The target object may be an object that the user desires to see in the target image, such as a person, a building, an article, an animal, a plant, a natural scene, a behavior, an event, or any other object that can be expressed by an image, and there may be one or more target objects. The first text may be text describing the target object, for example, "Teddy bears shopping for groceries in ancient Egypt" or "Teddy bear shopping in a grocery store".
S202, acquiring a second text.
In some embodiments, the second text may be text related to the target object; for example, if the target object contains a store, the second text may be the name of the store, such as "Happy Candy Shop".
In other embodiments, the second text may be any text entered by the user.
In some embodiments, the first text and the second text may both be text entered by a user.
S203, inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model.
And the target image comprises a target object and character information corresponding to the second text.
The target image generation model can comprise a first text encoder and a second text encoder, wherein the first text encoder is used for encoding a first text to obtain a first text characteristic corresponding to the first text, the second text encoder is used for encoding a second text to obtain a second text characteristic corresponding to the second text, and the target image generation model is further used for generating a target image according to the first text characteristic and the second text characteristic.
The target image may be a picture or a video, and the type of the target image is not limited in the present disclosure.
It should be noted that the languages of the first text and the second text may be the same or different. For example, if the first text is English, the second text may be English, German, Chinese, or any other language. As another example, if the first text is Chinese, the second text may be Chinese, English, German, or any other language, or the second text may comprise a mixture of languages.
By adopting the method, a first text for describing the target object is obtained, and a second text is obtained; inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model; the target image comprises a target object and character information corresponding to a second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain first text characteristics corresponding to the first text, the second text encoder is used for encoding the second text to obtain second text characteristics corresponding to the second text, and the target image generation model is further used for generating a target image according to the first text characteristics and the second text characteristics. Therefore, clear text information can be contained in the generated target image, and the text information in the generated target image can have a better visual effect.
FIG. 3 is a schematic diagram illustrating a target image generation model according to an exemplary embodiment. As shown in fig. 3, the target image generation model may include a first text encoder 301, a second text encoder 302, a feature converter 303, and an image generator 304; wherein:
the input of the first text encoder 301 may be the first text, and the output may be a first text feature obtained by encoding the first text, where the first text feature may also be referred to as a first text feature vector.
The input of the second text encoder 302 may be the second text, and the output may be a second text feature obtained by encoding the second text, where the second text feature may also be referred to as a second text feature vector.
The input of the feature converter 303 may be the first text feature, and the output may be a first image feature obtained by converting the first text feature, where the first image feature may also be referred to as a first image feature vector.
In some embodiments, the feature converter may also be referred to as an image feature regression model.
The feature converter may comprise a pre-generated prior model, such as a diffusion model or other prior model, by which the first text feature may be mapped to the first image feature.
The input to the image generator 304 may include the first image feature and the second text feature described above, and the output may be the target image.
In some embodiments, the image generator comprises any one of a diffusion model, a generative adversarial network, a variational autoencoder, or a flow model. A diffusion-based image generator can perform noising and reverse diffusion (denoising) conditioned on the input image feature vector and/or text feature vector, and finally generate a target image in at least one style.
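To make the data flow concrete, the following is a minimal PyTorch-style sketch of the pipeline in fig. 3. It is a sketch under assumptions, not the patented implementation: every submodule is a placeholder for whichever network is actually used, and all class, attribute, and argument names are hypothetical.

```python
import torch.nn as nn

class TargetImageGenerationModel(nn.Module):
    """Sketch of the fig. 3 data flow; all submodules are placeholders."""

    def __init__(self, first_text_encoder, second_text_encoder,
                 feature_converter, image_generator):
        super().__init__()
        self.first_text_encoder = first_text_encoder    # encodes the scene description
        self.second_text_encoder = second_text_encoder  # encodes the text to render
        self.feature_converter = feature_converter      # prior: text feature -> image feature
        self.image_generator = image_generator          # e.g. a diffusion-based generator

    def forward(self, first_text_tokens, second_text_tokens):
        first_text_feature = self.first_text_encoder(first_text_tokens)
        first_image_feature = self.feature_converter(first_text_feature)
        second_text_feature = self.second_text_encoder(second_text_tokens)
        # The second text feature bypasses the feature converter and conditions
        # the generator directly, alongside the converted first image feature.
        return self.image_generator(first_image_feature, second_text_feature)
```

Note the asymmetry: only the first text feature passes through the feature converter, while the second text feature conditions the generator directly, which matches the structure shown in fig. 3.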
FIG. 4 is a flow chart illustrating another method of generating an image according to an exemplary embodiment. As shown in fig. 4, the method of generating an image may include, based on the target image generation model shown in fig. 3:
s201, obtaining a first text for describing the target object.
S202, acquiring a second text.
S2031, inputting the first text into a first text encoder to obtain a first text characteristic output by the first text encoder.
S2032, inputting the first text characteristic into the characteristic converter to obtain a first image characteristic.
S2033, inputting the second text into a second text encoder to obtain a second text characteristic output by the second text encoder.
S2034, inputting the first image characteristic and the second text characteristic into an image generator to obtain a target image generated by the image generator.
In this step, the target image may be generated in various ways, for example:
in some embodiments, the first image feature and the second text feature may be input to an image generator, which processes and weight-fuses the first image feature and the second text feature, respectively, to generate the target image.
In other embodiments, the first image feature and the second text feature may be subjected to superposition processing to obtain a superposition feature, and the superposition feature may be input into the image generator to obtain the target image.
The superposition processing may be based on a cross-attention mechanism or an affine transformation. For example, the first image feature and the second text feature may be superposed through a cross-attention mechanism or an affine transformation to obtain the superposition feature, as in the sketch below.
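As one possible instantiation of the cross-attention variant, the sketch below lets the first image feature attend to the second text feature. The tensor shapes, the head count, and the residual connection are illustrative assumptions; the disclosure only requires some superposition based on cross attention or an affine transformation.

```python
import torch
import torch.nn as nn

class CrossAttentionSuperposition(nn.Module):
    """Superpose a text feature onto an image feature via cross attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feature: torch.Tensor,
                text_feature: torch.Tensor) -> torch.Tensor:
        # image_feature: (batch, n_image_tokens, dim)
        # text_feature:  (batch, n_text_tokens, dim)
        # Queries come from the image feature; keys/values from the text feature.
        fused, _ = self.attn(query=image_feature, key=text_feature,
                             value=text_feature)
        # A residual connection keeps the original image content while
        # injecting the second text feature, yielding the superposition feature.
        return image_feature + fused
```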
In this manner, a target image with clear character information can be generated from the first text and the second text through the target image generation model.
FIG. 5 is a flow diagram illustrating a method for generating a target image generation model, according to an exemplary embodiment. As shown in fig. 5, the target image generation model may be generated by:
s501, obtaining a first sample set.
S502, training a pending image generation model according to the first sample set to obtain the target image generation model.
The first sample set may include a plurality of first sample images, and a first text sample and a second text sample corresponding to each first sample image, where the second text sample is used to describe sample text information included in the first sample image, and the first text sample is used to describe other image information in the first sample image except the sample text information.
In this embodiment, the first sample set may be obtained in multiple ways. For example, text may be added in a specified area of an image that contains no text to obtain a first sample image, with the added text used as the second text sample; or the original content in a specified area of such an image may be erased and text added there to obtain a first sample image, again with the added text used as the second text sample; or text recognition may be performed on a first sample image that already contains text to obtain the second text sample corresponding to that first sample image.
In one implementation, the first set of samples may be obtained by:
first, a second set of samples is obtained.
The second sample set comprises a plurality of second sample images and a third text sample corresponding to each second sample image, the third text sample is used for describing image information of the second sample images, and the second sample images are images without text information;
second, a fourth text sample is obtained.
It should be noted that the fourth text sample may be any text. For example, the text may be text input by the user or randomly generated text.
And thirdly, adding pending character information corresponding to the fourth text sample in the specified area of the second sample image.
And finally, taking the second sample image added with the pending character information as a first sample image, taking the third text sample as the first text sample corresponding to the first sample image, and taking the fourth text sample as the second text sample corresponding to the first sample image, to obtain the first sample set.
In this way, the first sample set can be generated from the second sample image that does not contain text information.
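For illustration, here is a minimal sketch of this construction using PIL to render the fourth text sample into a designated area of a text-free second sample image. The font file, font size, and region coordinates are assumptions made for the example.

```python
from PIL import Image, ImageDraw, ImageFont

def make_first_sample(second_sample_image_path: str, third_text_sample: str,
                      fourth_text_sample: str, region=(50, 50)):
    """Return (first_sample_image, first_text_sample, second_text_sample)."""
    image = Image.open(second_sample_image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype("DejaVuSans.ttf", size=32)  # assumed font file
    # Render the pending character information into the designated area.
    draw.text(region, fourth_text_sample, fill="black", font=font)
    # The caption stays the first text sample; the rendered text becomes
    # the second text sample.
    return image, third_text_sample, fourth_text_sample
```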
In another implementation, the first sample set may also be obtained by:
first, a third set of samples is obtained.
The third sample set may include a plurality of third sample images and a fifth text sample corresponding to each third sample image, where the fifth text sample is used to describe image information of the third sample image, and the third sample image is an image containing text information;
then, a sixth text sample is determined according to the character information in the third sample image.
And finally, taking the third sample image as a first sample image, taking the fifth text sample as a first text sample corresponding to the first sample image, and taking the sixth text sample as a second text sample corresponding to the first sample image to obtain a first sample set.
In this way, the first sample set can be generated from the third sample image containing the text information.
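A matching sketch for this second construction, assuming an off-the-shelf OCR engine (pytesseract here) stands in for the text recognition step; any recognizer that returns the characters contained in the image would serve.

```python
import pytesseract
from PIL import Image

def make_first_sample_from_text_image(third_sample_image_path: str,
                                      fifth_text_sample: str):
    """Return (first_sample_image, first_text_sample, second_text_sample)."""
    image = Image.open(third_sample_image_path)
    # Recognize the character information already present in the image.
    sixth_text_sample = pytesseract.image_to_string(image).strip()
    # The image is reused as the first sample image; the caption is the first
    # text sample; the recognized text is the second text sample.
    return image, fifth_text_sample, sixth_text_sample
```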
FIG. 6 is a schematic diagram illustrating a pending image generation model in accordance with an exemplary embodiment. As shown in fig. 6, the pending image generation model may comprise a first text encoder 301, a second text encoder 302, a feature converter 303, an image generator 304, an image encoder 311 and a text recognizer 312.
In the embodiment of the disclosure, the pending image generation model may be trained to obtain the target image generation model. The first text encoder, the second text encoder, the feature converter, and the image generator have the same structures as in the target image generation model generated after training, while the image encoder and the text recognizer may be used only for training; the target image generation model generated after training may not include the image encoder and the text recognizer.
It should be noted that the image encoder and the text recognizer may be pre-trained models, so as to improve the training of the pending image generation model; alternatively, the parameters of the image encoder and the text recognizer may also be optimized during training of the pending image generation model, to further improve the accuracy of the target image generation model obtained after training.
In step S502, training the pending image generation model according to the first sample set to obtain the target image generation model may include:
Firstly, pre-training the first text encoder in the pending image generation model according to the image encoder, the first sample image, and the first text sample, and updating the model parameters of the first text encoder.
For example, the pre-trained first text encoder may be used as a new first text encoder in the pending image generation model.
In some embodiments, a first sample image in the first sample set may be input into the image encoder to obtain a sample image feature vector, and the first text sample corresponding to that first sample image may be input into the first text encoder to obtain a sample text feature vector. During training, a loss function constrains the similarity between the sample image feature vector and the sample text feature vector, so that the text and image features become aligned and the parameters of the first text encoder are optimized or updated, improving the first text encoder's ability to describe an image scene.
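The disclosure only states that a loss function constrains the similarity between the paired feature vectors. One plausible instantiation, sketched below under that assumption, is a CLIP-style symmetric contrastive loss over a batch of matched image/text pairs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(sample_image_features: torch.Tensor,
                     sample_text_features: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matched image/text feature vectors together, push others apart."""
    image_features = F.normalize(sample_image_features, dim=-1)  # (batch, dim)
    text_features = F.normalize(sample_text_features, dim=-1)    # (batch, dim)
    logits = image_features @ text_features.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal matches
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```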
And then, training the updated pending image generation model according to the text recognizer and the first sample set, and determining the target image generation model according to the trained pending image generation model.
In some embodiments, the second text encoder, the feature converter, and the image generator in the pending image generation model may be trained based on the pre-trained first text encoder; for example, only the parameters of the second text encoder, the feature converter, and the image generator may be updated during training.
In some other embodiments, the second text encoder may also be obtained by pre-training, and the feature converter and the image generator in the pending image generation model may be trained based on the pre-trained first and second text encoders; for example, only the parameters of the feature converter and the image generator may be updated during training, as in the sketch below.
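In PyTorch-style code, restricting updates to a subset of modules is usually done by freezing the others' parameters; the sketch below assumes the hypothetical attribute names from the pipeline sketch above.

```python
def freeze_pretrained_encoders(pending_model):
    """Leave only the feature converter and image generator trainable."""
    for module in (pending_model.first_text_encoder,
                   pending_model.second_text_encoder):
        for param in module.parameters():
            param.requires_grad = False  # excluded from gradient updates
```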
In some embodiments of the present disclosure, the following model training steps may be performed in a loop until it is determined that the trained pending image generation model satisfies the preset iteration stop condition, and a target image generation model is determined according to the trained pending image generation model.
Wherein, the model training step may include:
s11, inputting the first text sample and the second text sample into an undetermined image generation model to obtain a prediction image output by the undetermined image generation model.
For example, a first text sample may be input to a first text encoder to obtain a first text sample characteristic output by the first text encoder, and the first text sample characteristic may be input to a characteristic converter to obtain a first image sample characteristic. Inputting the second text sample into a second text encoder to obtain a second text sample characteristic output by the second text encoder; and inputting the first image sample characteristic and the second text sample characteristic into an image generator to obtain a predicted image generated by the image generator.
And S12, determining an image loss value according to the predicted image and the sample image.
Wherein the image loss value is used to characterize a difference between the predicted image and the sample image.
And S13, inputting the predicted image into the text recognizer to obtain the recognized text output by the text recognizer.
And S14, determining a text loss value according to the recognized text and the second text sample.
Wherein the text loss value is used to characterize a difference between the recognized text and the second text sample.
S15, under the condition that it is determined, according to the image loss value and the text loss value, that the pending image generation model does not satisfy the preset iteration stop condition, updating parameters of the pending image generation model according to the image loss value and the text loss value to obtain a trained pending image generation model, and taking the trained pending image generation model as the new pending image generation model.
Illustratively, parameters of a target module in the pending image generation model may be updated according to the image loss value and the text loss value, where the target module may include at least one of the second text encoder, the feature converter, and the image generator.
In some embodiments, parameters of the image generator in the pending image generation model may be updated according to the image loss value and the text loss value, so that the image generator can express both the image information corresponding to the first text and the text information corresponding to the second text.
In other embodiments, parameters of the feature converter in the pending image generation model may be updated according to the image loss value, parameters of the second text encoder may be updated according to the text loss value, and parameters of the image generator may be updated according to both the image loss value and the text loss value.
It should be noted that the preset iteration stop condition may include that the target loss value is less than or equal to a preset loss threshold, or a change value of the target loss value within a certain number of iterations is less than a preset change threshold, or may be a condition for stopping iteration commonly used in the related art, which is not limited in this disclosure. The target loss value may include the image loss value and the text loss value described above.
In addition, if it is determined that the pending image generation model satisfies the preset iteration stop condition according to the target loss value, the model training step may be stopped.
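Putting steps S11 to S15 together, one training iteration might look like the sketch below. The concrete loss choices (MSE for the image loss, a recognizer-based loss passed in as text_loss_fn) and the optimizer are assumptions; the disclosure only requires losses that characterize the respective differences.

```python
import torch.nn.functional as F

def train_step(pending_model, text_recognizer, optimizer,
               first_text_sample, second_text_sample, first_sample_image,
               text_loss_fn):
    # S11: predicted image from the pending image generation model
    predicted_image = pending_model(first_text_sample, second_text_sample)
    # S12: image loss characterizes the difference from the sample image
    image_loss = F.mse_loss(predicted_image, first_sample_image)
    # S13: recognizer output for the predicted image (e.g. per-character
    # logits, kept differentiable so the text loss can backpropagate)
    recognized_text = text_recognizer(predicted_image)
    # S14: text loss characterizes the difference from the second text sample
    text_loss = text_loss_fn(recognized_text, second_text_sample)
    # S15: update the target modules with both losses
    target_loss = image_loss + text_loss
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    # The returned value is compared against the preset iteration stop
    # condition (loss threshold or change threshold) described above.
    return target_loss.item()
```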
In this manner, the pending image generation model can be trained to obtain the target image generation model.
To further illustrate the effects of the present disclosure, fig. 7 and 8 provide schematic diagrams of images generated from text, respectively. Wherein:
FIG. 7 is a diagram illustrating an image generated from text in the related art, according to an example embodiment. As shown in fig. 7, the text information in the image generated by the related art is unclear and cannot express an accurate semantic meaning.
FIG. 8 is a diagram illustrating a target image generated from text by a target image generation model, according to an example embodiment. As shown in fig. 8, by inputting the first text and the second text into the target image generation model by the method provided by the embodiment of the present disclosure, the text information "Happy Candy Shop" corresponding to the second text can be clearly shown in the generated target image.
Fig. 9 is a block diagram illustrating an apparatus 1100 for generating an image according to an exemplary embodiment.
As shown in fig. 9, the apparatus 1100 may include:
a first obtaining module 1101, configured to obtain a first text for describing a target object;
a second obtaining module 1102, configured to obtain a second text;
an image generation module 1103, configured to input the first text and the second text into a pre-generated target image generation model, so as to obtain a target image output by the target image generation model; the target image comprises the target object and word information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is used for generating the target image according to the first text feature and the second text feature.
According to one or more embodiments of the present disclosure, the target image generation model further includes a feature converter and an image generator, and the image generation module 1103 is configured to input the first text into the first text encoder, so as to obtain a first text feature output by the first text encoder; inputting the first text characteristic into the characteristic converter to obtain a first image characteristic; inputting the second text into the second text encoder to obtain a second text characteristic output by the second text encoder; and inputting the first image characteristic and the second text characteristic into the image generator to obtain the target image generated by the image generator.
According to one or more embodiments of the present disclosure, the image generating module 1103 is configured to perform overlay processing on the first image feature and the second text feature to obtain an overlay feature; and inputting the superposition characteristics into the image generator to obtain the target image.
Fig. 10 is a block diagram illustrating another apparatus 1100 for generating an image according to an example embodiment. As shown in fig. 10, the apparatus 1100 may further include:
a model generation module 1104 for obtaining a first set of samples, the first sample set comprising a plurality of first sample images, and a first text sample and a second text sample corresponding to each first sample image, wherein the second text sample is used for describing sample text information contained in the first sample images, and the first text sample is used for describing other image information except the sample text information in the first sample images; and for training a pending image generation model according to the first sample set to obtain the target image generation model.
According to one or more embodiments of the present disclosure, the model generation module 1104 is configured to obtain a second sample set, the second sample set comprising a plurality of second sample images and a third text sample corresponding to each second sample image, wherein the third text sample is used for describing image information of the second sample images and the second sample images are images without text information; acquire a fourth text sample; add pending character information corresponding to the fourth text sample in a designated area of the second sample image; and take the second sample image added with the pending character information as the first sample image, the third text sample as the first text sample corresponding to the first sample image, and the fourth text sample as the second text sample corresponding to the first sample image, to obtain the first sample set.
According to one or more embodiments of the present disclosure, the model generation module 1104 is configured to obtain a third set of samples; the third sample set comprises a plurality of third sample images and a fifth text sample corresponding to each third sample image, the fifth text sample is used for describing image information of the third sample images, and the third sample images are images containing text information; determining a sixth text sample according to the character information in the third sample image; and taking the third sample image as the first sample image, taking the fifth text sample as the first text sample corresponding to the first sample image, and taking the sixth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
According to one or more embodiments of the present disclosure, the pending image generation model comprises the first text encoder, the second text encoder, the feature converter, the image generator, an image encoder, and a text recognizer; the model generation module 1104 is configured to pre-train the first text encoder in the pending image generation model according to the image encoder, the first sample image, and the first text sample, and update model parameters of the first text encoder; and to train the updated pending image generation model according to the text recognizer and the first sample set to obtain the target image generation model.
According to one or more embodiments of the present disclosure, the model generation module 1104 is configured to perform the model training step in a loop until it is determined that the trained pending image generation model meets a preset iteration stop condition, and to determine the target image generation model according to the trained pending image generation model;
wherein the model training step comprises:
inputting the first text sample and the second text sample into the pending image generation model to obtain a predicted image output by the pending image generation model;
determining an image loss value from the predicted image and the sample image; the image loss value is used for characterizing the difference between the predicted image and the sample image;
inputting the predicted image into the text recognizer to obtain a recognition text output by the text recognizer;
determining a text loss value according to the recognition text and the second text sample; the text loss value is used to characterize a difference between the recognized text and the second text sample;
and under the condition that it is determined, according to the image loss value and the text loss value, that the pending image generation model does not satisfy the preset iteration stop condition, updating parameters of the pending image generation model according to the image loss value and the text loss value to obtain a trained pending image generation model, and taking the trained pending image generation model as the new pending image generation model.
According to one or more embodiments of the present disclosure, the model generation module 1104 is configured to update parameters of a target module in the pending image generation model according to the image loss value and the text loss value, the target module including at least one of the second text encoder, the feature converter, and the image generator.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring now to fig. 11, shown is a schematic diagram of an electronic device 2000 (e.g., a terminal device or a server) suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The server in the embodiments of the present disclosure may include, but is not limited to, devices such as a local server, a cloud server, a single server, a distributed server, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the electronic device 2000 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 2001, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 2002 or a program loaded from a storage device 2008 into a Random Access Memory (RAM) 2003. In the RAM 2003, various programs and data necessary for the operation of the electronic device 2000 are also stored. The processing device 2001, the ROM 2002, and the RAM 2003 are connected to each other by a bus 2004. An input/output (I/O) interface 2005 is also connected to the bus 2004.
Generally, the following devices may be connected to the input/output interface 2005: input devices 2006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 2007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 2008 including, for example, magnetic tape, hard disk, and the like; and a communication device 2009. The communication device 2009 may allow the electronic device 2000 to communicate wirelessly or by wire with other devices to exchange data. While fig. 11 illustrates an electronic device 2000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 2009, or installed from the storage device 2008, or installed from the ROM 2002. The computer program, when executed by the processing device 2001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a first text for describing a target object; acquiring a second text; inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model; the target image comprises the target object and word information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is used for generating the target image according to the first text feature and the second text feature.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not constitute a limitation to the module itself in some cases, for example, the first obtaining module may also be described as a "module that obtains a first text for describing a target object".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a method of generating an image, the method including:
acquiring a first text for describing a target object;
acquiring a second text;
inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model;
the target image comprises the target object and word information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is used for generating the target image according to the first text feature and the second text feature.
According to one or more embodiments of the present disclosure, the target image generation model further includes a feature converter and an image generator, and the inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model includes:
inputting the first text into the first text encoder to obtain a first text characteristic output by the first text encoder;
inputting the first text characteristic into the characteristic converter to obtain a first image characteristic;
inputting the second text into the second text encoder to obtain a second text characteristic output by the second text encoder;
and inputting the first image characteristic and the second text characteristic into the image generator to obtain the target image generated by the image generator.
According to one or more embodiments of the present disclosure, the inputting the first image feature and the second text feature into the image generator to obtain the target image generated by the image generator includes:
overlapping the first image characteristic and the second text characteristic to obtain an overlapping characteristic;
and inputting the superposition characteristics into the image generator to obtain the target image.
According to one or more embodiments of the present disclosure, the target image generation model is generated by:
acquiring a first sample set; the first sample set comprises a plurality of first sample images, and a first text sample and a second text sample corresponding to each first sample image, wherein the second text sample is used for describing sample text information contained in the first sample images, and the first text sample is used for describing other image information except the sample text information in the first sample images;
and training a to-be-trained image generation model according to the first sample set to obtain the target image generation model.
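One way to picture a training example in the first sample set is given below; the file paths and field names are illustrative, not taken from the disclosure.

```python
# Each entry pairs a first sample image with its two text samples.
first_sample_set = [
    {
        "image_path": "samples/0001.png",             # first sample image, contains rendered text
        "first_text": "a blue mug on a wooden desk",  # describes everything except the text
        "second_text": "HELLO",                       # the character information in the image
    },
    # ... further samples
]
```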
According to one or more embodiments of the present disclosure, the acquiring the first sample set includes:
obtaining a second sample set; the second sample set comprises a plurality of second sample images and a third text sample corresponding to each second sample image, the third text sample is used for describing image information of the second sample images, and the second sample images are images without text information;
acquiring a fourth text sample;
adding character information corresponding to the fourth text sample to a designated area of the second sample image;
and taking the second sample image with the added character information as the first sample image, taking the third text sample as the first text sample corresponding to the first sample image, and taking the fourth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
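A minimal sketch of this construction using Pillow follows; the designated area and the default font are arbitrary choices for illustration rather than anything the disclosure prescribes.

```python
from PIL import Image, ImageDraw

def make_first_sample(second_image_path: str, fourth_text: str,
                      area=(20, 20)) -> Image.Image:
    """Render the fourth text sample into a designated area of a
    text-free second sample image, yielding a first sample image."""
    img = Image.open(second_image_path).convert("RGB")
    ImageDraw.Draw(img).text(area, fourth_text, fill=(0, 0, 0))
    return img

# first_sample_image = make_first_sample("samples/clean_0001.png", "GRAND OPENING")
```

A practical pipeline would likely randomize the area, font, size, and color so the trained model does not overfit to one rendering style.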
According to one or more embodiments of the present disclosure, the acquiring the first sample set includes:
obtaining a third sample set; the third sample set comprises a plurality of third sample images and a fifth text sample corresponding to each third sample image, the fifth text sample is used for describing image information of the third sample images, and the third sample images are images containing text information;
determining a sixth text sample according to the character information in the third sample image;
and taking the third sample image as the first sample image, taking the fifth text sample as the first text sample corresponding to the first sample image, and taking the sixth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
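A sketch of this OCR route is shown below, with pytesseract standing in for whatever recognizer an implementation might use; the disclosure names no specific recognizer, and pytesseract additionally requires the Tesseract binary to be installed.

```python
import pytesseract
from PIL import Image

def derive_sixth_text_sample(third_image_path: str) -> str:
    """Read the character information already present in a third sample
    image and use it as the sixth text sample."""
    return pytesseract.image_to_string(Image.open(third_image_path)).strip()
```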
According to one or more embodiments of the present disclosure, the to-be-trained image generation model comprises the first text encoder, the second text encoder, the feature converter, the image generator, an image encoder, and a text recognizer; and the training the to-be-trained image generation model according to the first sample set to obtain the target image generation model comprises:
pre-training the first text encoder in the to-be-trained image generation model according to the image encoder, the first sample image, and the first text sample, and updating model parameters of the first text encoder;
and training the updated to-be-trained image generation model according to the text recognizer and the first sample set to obtain the target image generation model.
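The disclosure leaves the pre-training objective open; a CLIP-style contrastive loss between the image encoder and the first text encoder is one natural reading. A sketch under that assumption follows, where the mean-pooling, temperature, and the premise that both encoders map into one embedding size are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_pretrain_step(first_text_encoder, image_encoder,
                              text_ids, images, optimizer, temperature=0.07):
    """One pre-training step aligning first-text features with image features."""
    text_emb = F.normalize(first_text_encoder(text_ids).mean(dim=1), dim=-1)
    image_emb = F.normalize(image_encoder(images), dim=-1)   # assumed shape (B, dim)
    logits = text_emb @ image_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: matched image-text pairs lie on the diagonal.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # updates the first text encoder's model parameters
    return loss.item()
```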
According to one or more embodiments of the present disclosure, the training the updated to-be-trained image generation model according to the text recognizer and the first sample set to obtain the target image generation model includes:
cyclically executing a model training step until the trained image generation model meets a preset iteration-stopping condition, and determining the target image generation model according to the trained image generation model;
wherein the model training step comprises:
inputting the first text sample and the second text sample into the to-be-trained image generation model to obtain a predicted image output by the to-be-trained image generation model;
determining an image loss value according to the predicted image and the first sample image, the image loss value being used for characterizing a difference between the predicted image and the first sample image;
inputting the predicted image into the text recognizer to obtain a recognized text output by the text recognizer;
determining a text loss value according to the recognized text and the second text sample, the text loss value being used for characterizing a difference between the recognized text and the second text sample;
and in a case where it is determined, according to the image loss value and the text loss value, that the to-be-trained image generation model does not meet the preset iteration-stopping condition, updating parameters of the to-be-trained image generation model according to the image loss value and the text loss value to obtain an updated image generation model, and taking the updated image generation model as a new to-be-trained image generation model.
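Putting the two losses together, one training step might look like the following sketch. The choice of mean-squared error for the image loss, character-level cross-entropy for the text loss, and the loss weights are assumptions; the disclosure only states that each loss characterizes the respective difference.

```python
import torch.nn.functional as F

def model_training_step(model, text_recognizer, batch, optimizer,
                        w_image=1.0, w_text=0.1):
    predicted = model(batch["first_ids"], batch["second_ids"])
    # Image loss: difference between the predicted image and the first sample image.
    image_loss = F.mse_loss(predicted, batch["first_sample_image"])
    # Text loss: difference between the recognized text and the second text sample.
    char_logits = text_recognizer(predicted)                  # assumed shape (B, T, vocab)
    text_loss = F.cross_entropy(char_logits.flatten(0, 1),
                                batch["second_ids"].flatten())
    total = w_image * image_loss + w_text * text_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return image_loss.item(), text_loss.item()
```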
According to one or more embodiments of the present disclosure, the updating parameters of the to-be-trained image generation model according to the image loss value and the text loss value includes:
updating parameters of a target module in the to-be-trained image generation model according to the image loss value and the text loss value, the target module comprising at least one of the second text encoder, the feature converter, and the image generator.
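Restricting updates to the target modules can be done by freezing the pre-trained first text encoder and building the optimizer over only the remaining parameters. The attribute names below follow the earlier sketch and are illustrative.

```python
import torch

for p in model.first_text_encoder.parameters():
    p.requires_grad_(False)               # keep the pre-trained encoder fixed

optimizer = torch.optim.AdamW(
    list(model.second_text_encoder.parameters())
    + list(model.feature_converter.parameters())
    + list(model.image_generator.parameters()),
    lr=1e-4,
)
```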
According to one or more embodiments of the present disclosure, there is provided an apparatus for generating an image, the apparatus including:
the first acquisition module is used for acquiring a first text for describing a target object;
the second acquisition module is used for acquiring a second text;
the image generation module is used for inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model; wherein the target image comprises the target object and character information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is further used for generating the target image according to the first text feature and the second text feature.
According to one or more embodiments of the present disclosure, the target image generation model further includes a feature converter and an image generator, and the image generation module is configured to input the first text into the first text encoder to obtain the first text feature output by the first text encoder; input the first text feature into the feature converter to obtain a first image feature; input the second text into the second text encoder to obtain the second text feature output by the second text encoder; and input the first image feature and the second text feature into the image generator to obtain the target image generated by the image generator.
According to one or more embodiments of the present disclosure, the image generation module is configured to superimpose the first image feature and the second text feature to obtain a superimposed feature, and input the superimposed feature into the image generator to obtain the target image.
According to one or more embodiments of the present disclosure, the apparatus further comprises:
the model generation module is used for acquiring a first sample set, the first sample set comprising a plurality of first sample images, and a first text sample and a second text sample corresponding to each first sample image, wherein the second text sample is used for describing sample text information contained in the first sample images, and the first text sample is used for describing other image information except the sample text information in the first sample images; and training a to-be-trained image generation model according to the first sample set to obtain the target image generation model.
According to one or more embodiments of the present disclosure, the model generation module is configured to acquire a second sample set, the second sample set comprising a plurality of second sample images and a third text sample corresponding to each second sample image, wherein the third text sample is used for describing image information of the second sample images, and the second sample images are images without text information; acquire a fourth text sample; add character information corresponding to the fourth text sample to a designated area of the second sample image; and take the second sample image with the added character information as the first sample image, take the third text sample as the first text sample corresponding to the first sample image, and take the fourth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
According to one or more embodiments of the present disclosure, the model generation module is configured to acquire a third sample set, the third sample set comprising a plurality of third sample images and a fifth text sample corresponding to each third sample image, wherein the fifth text sample is used for describing image information of the third sample images, and the third sample images are images containing text information; determine a sixth text sample according to the character information in the third sample image; and take the third sample image as the first sample image, take the fifth text sample as the first text sample corresponding to the first sample image, and take the sixth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
According to one or more embodiments of the present disclosure, the to-be-trained image generation model comprises the first text encoder, the second text encoder, the feature converter, the image generator, an image encoder, and a text recognizer; and the model generation module is configured to pre-train the first text encoder in the to-be-trained image generation model according to the image encoder, the first sample image, and the first text sample, and update model parameters of the first text encoder; and train the updated to-be-trained image generation model according to the text recognizer and the first sample set to obtain the target image generation model.
According to one or more embodiments of the present disclosure, the model generation module is configured to perform a model training step in a loop until it is determined that the trained image generation model meets a preset iteration-stopping condition, and determine the target image generation model according to the trained image generation model;
wherein the model training step comprises:
inputting the first text sample and the second text sample into the to-be-trained image generation model to obtain a predicted image output by the to-be-trained image generation model;
determining an image loss value according to the predicted image and the first sample image, the image loss value being used for characterizing a difference between the predicted image and the first sample image;
inputting the predicted image into the text recognizer to obtain a recognized text output by the text recognizer;
determining a text loss value according to the recognized text and the second text sample, the text loss value being used for characterizing a difference between the recognized text and the second text sample;
and in a case where it is determined, according to the image loss value and the text loss value, that the to-be-trained image generation model does not meet the preset iteration-stopping condition, updating parameters of the to-be-trained image generation model according to the image loss value and the text loss value to obtain an updated image generation model, and taking the updated image generation model as a new to-be-trained image generation model.
According to one or more embodiments of the present disclosure, the model generation module is configured to update parameters of a target module in the to-be-trained image generation model according to the image loss value and the text loss value, the target module comprising at least one of the second text encoder, the feature converter, and the image generator.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (12)

1. A method of generating an image, the method comprising:
acquiring a first text for describing a target object;
acquiring a second text;
inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model;
the target image comprises the target object and character information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is further used for generating the target image according to the first text feature and the second text feature.
2. The method of claim 1, wherein the target image generation model further comprises a feature converter and an image generator, and the inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model comprises:
inputting the first text into the first text encoder to obtain the first text feature output by the first text encoder;
inputting the first text feature into the feature converter to obtain a first image feature;
inputting the second text into the second text encoder to obtain the second text feature output by the second text encoder;
and inputting the first image feature and the second text feature into the image generator to obtain the target image generated by the image generator.
3. The method of claim 2, wherein the inputting the first image feature and the second text feature into the image generator to obtain the target image generated by the image generator comprises:
superimposing the first image feature and the second text feature to obtain a superimposed feature;
and inputting the superimposed feature into the image generator to obtain the target image.
4. The method of claim 2, wherein the target image generation model is generated by:
acquiring a first sample set; the first sample set comprises a plurality of first sample images, and a first text sample and a second text sample corresponding to each first sample image, wherein the second text sample is used for describing sample text information contained in the first sample images, and the first text sample is used for describing other image information except the sample text information in the first sample images;
and training a to-be-trained image generation model according to the first sample set to obtain the target image generation model.
5. The method of claim 4, wherein the acquiring the first sample set comprises:
obtaining a second sample set; the second sample set comprises a plurality of second sample images and a third text sample corresponding to each second sample image, the third text sample is used for describing image information of the second sample images, and the second sample images are images without text information;
acquiring a fourth text sample;
adding character information corresponding to the fourth text sample to a designated area of the second sample image;
and taking the second sample image with the added character information as the first sample image, taking the third text sample as the first text sample corresponding to the first sample image, and taking the fourth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
6. The method of claim 4, wherein the acquiring the first sample set comprises:
obtaining a third sample set; the third sample set comprises a plurality of third sample images and a fifth text sample corresponding to each third sample image, the fifth text sample is used for describing image information of the third sample images, and the third sample images are images containing text information;
determining a sixth text sample according to the character information in the third sample image;
and taking the third sample image as the first sample image, taking the fifth text sample as the first text sample corresponding to the first sample image, and taking the sixth text sample as the second text sample corresponding to the first sample image, so as to obtain the first sample set.
7. The method of claim 6, wherein the to-be-trained image generation model comprises the first text encoder, the second text encoder, the feature converter, the image generator, an image encoder, and a text recognizer; and the training the to-be-trained image generation model according to the first sample set to obtain the target image generation model comprises:
pre-training the first text encoder in the to-be-trained image generation model according to the image encoder, the first sample image, and the first text sample, and updating model parameters of the first text encoder;
and training the updated to-be-trained image generation model according to the text recognizer and the first sample set to obtain the target image generation model.
8. The method of claim 7, wherein the training the updated to-be-trained image generation model according to the text recognizer and the first sample set to obtain the target image generation model comprises:
cyclically executing a model training step until the trained image generation model meets a preset iteration-stopping condition, and determining the target image generation model according to the trained image generation model;
wherein the model training step comprises:
inputting the first text sample and the second text sample into the to-be-trained image generation model to obtain a predicted image output by the to-be-trained image generation model;
determining an image loss value according to the predicted image and the first sample image, the image loss value being used for characterizing a difference between the predicted image and the first sample image;
inputting the predicted image into the text recognizer to obtain a recognized text output by the text recognizer;
determining a text loss value according to the recognized text and the second text sample, the text loss value being used for characterizing a difference between the recognized text and the second text sample;
and in a case where it is determined, according to the image loss value and the text loss value, that the to-be-trained image generation model does not meet the preset iteration-stopping condition, updating parameters of the to-be-trained image generation model according to the image loss value and the text loss value to obtain an updated image generation model, and taking the updated image generation model as a new to-be-trained image generation model.
9. The method of claim 8, wherein the updating parameters of the to-be-trained image generation model according to the image loss value and the text loss value comprises:
updating parameters of a target module in the to-be-trained image generation model according to the image loss value and the text loss value, the target module comprising at least one of the second text encoder, the feature converter, and the image generator.
10. An apparatus for generating an image, the apparatus comprising:
the first acquisition module is used for acquiring a first text for describing a target object;
the second acquisition module is used for acquiring a second text;
the image generation module is used for inputting the first text and the second text into a pre-generated target image generation model to obtain a target image output by the target image generation model; wherein the target image comprises the target object and character information corresponding to the second text, the target image generation model comprises a first text encoder and a second text encoder, the first text encoder is used for encoding the first text to obtain a first text feature corresponding to the first text, the second text encoder is used for encoding the second text to obtain a second text feature corresponding to the second text, and the target image generation model is further used for generating the target image according to the first text feature and the second text feature.
11. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processing device, implements the steps of the method according to any one of claims 1 to 9.
12. An electronic device, comprising:
a storage device having a computer program stored thereon;
a processing device for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 9.
CN202211668274.9A 2022-12-23 2022-12-23 Method and device for generating image, readable medium and electronic equipment Pending CN115908640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211668274.9A CN115908640A (en) 2022-12-23 2022-12-23 Method and device for generating image, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211668274.9A CN115908640A (en) 2022-12-23 2022-12-23 Method and device for generating image, readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115908640A true CN115908640A (en) 2023-04-04

Family

ID=86491320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211668274.9A Pending CN115908640A (en) 2022-12-23 2022-12-23 Method and device for generating image, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115908640A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188632A (en) * 2023-04-24 2023-05-30 之江实验室 Image generation method and device, storage medium and electronic equipment
CN116543076A (en) * 2023-07-06 2023-08-04 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN116543076B (en) * 2023-07-06 2024-04-05 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN116563673A (en) * 2023-07-10 2023-08-08 浙江华诺康科技有限公司 Smoke training data generation method and device and computer equipment
CN116563673B (en) * 2023-07-10 2023-12-12 浙江华诺康科技有限公司 Smoke training data generation method and device and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination