CN117197271A - Image generation method, device, electronic equipment and storage medium

Image generation method, device, electronic equipment and storage medium

Info

Publication number
CN117197271A
CN117197271A
Authority
CN
China
Prior art keywords
image
feature
noise
network
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311150182.6A
Other languages
Chinese (zh)
Inventor
叶虎 (Ye Hu)
韩骁 (Han Xiao)
张军 (Zhang Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311150182.6A
Publication of CN117197271A

Abstract

The application relates to an image generation method and apparatus, an electronic device, and a storage medium, which can be used in fields such as traffic, cloud computing, and various kinds of image processing. The method comprises the following steps: inputting the acquired target image features, target text features, and noise features to be processed into a noise processing network; performing feature cross processing on the target image features and the noise features to be processed by using an image attention network in the noise processing network; performing denoising processing on the noise features to be processed by using a denoising network in the noise processing network, based on the result of the feature cross processing and the target text features, to obtain target noise features; and performing image decoding processing on the target noise features by using an image decoder in the text-to-image model to obtain a generated image. With the technical solution provided by the application, a simple model structure can support multi-modal text-and-image conditions to guide image generation, and the accuracy of the generated image is improved.

Description

Image generation method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to an image generating method, an image generating device, an electronic device, and a storage medium.
Background
Current image generation methods are generally text-based, for example, generating images from text with a diffusion model. In the related art, multi-modal fusion generation is performed on the basis of a text-to-image model, but a multi-modal encoding model must be added to extract multi-modal features and adapt them to the diffusion model, and complex multi-modal data must be constructed. As a result, the image generation model has a complex structure, training data is difficult to construct, and a large amount of processing resources is consumed.
Disclosure of Invention
The application provides an image generation method and apparatus, an electronic device, and a storage medium, which at least solve the problem in the related art of how to improve image generation accuracy. The technical solution of the application is as follows:
according to a first aspect of an embodiment of the present application, there is provided an image generating method including:
acquiring target image characteristics, target text characteristics and noise characteristics to be processed;
inputting the target image feature, the target text feature, and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network, based on the result of the feature cross processing and the target text feature, to obtain a target noise feature; the denoising network is a denoising network in a pre-trained text-to-image model, and the image attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
and performing image decoding processing on the target noise feature by using an image decoder in the text-to-image model to obtain a generated image.
According to a second aspect of an embodiment of the present application, there is provided an image generating apparatus including:
the acquisition module is used for acquiring target image characteristics, target text characteristics and noise characteristics to be processed;
the denoising module is used for inputting the target image feature, the target text feature, and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network, based on a result of the feature cross processing and the target text feature, to obtain a target noise feature; the denoising network is a denoising network in a pre-trained text-to-image model, and the image attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
and the image generation module is used for performing image decoding processing on the target noise feature by using an image decoder in the text-to-image model to obtain a generated image.
According to a third aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any of the first aspects above.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any of the first aspects of embodiments of the present application.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions which, when executed by a processor, cause a computer to perform the method of any of the first aspects of embodiments of the present application.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
By newly embedding an image attention network into the denoising network of the pre-trained text-to-image model, the text-to-image model embedded with the image attention network can accept both target text features and target image features as input; that is, it can support multi-modal condition input of images and text. The target image features, the target text features, and the noise features to be processed are input into the noise processing network; feature cross processing is performed on the target image features and the noise features to be processed by using the image attention network in the noise processing network; and denoising processing is performed on the noise features to be processed by using the denoising network in the noise processing network, based on the result of the feature cross processing and the target text features, to obtain the target noise features. This achieves the aim of denoising the noise features to be processed with the target image features and the target text features serving as conditions that guide image generation. Further, an image decoder in the text-to-image model performs image decoding processing on the target noise features to obtain the generated image. When target image features and target text features jointly guide image generation, the target image features can express the image generation condition more accurately than the target text features alone, so the accuracy of the generated image can be improved. In addition, newly embedding the image attention network in the pre-trained text-to-image model keeps the model structure simple.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application; they do not constitute an undue limitation on the application.
FIG. 1 is a schematic diagram of an application environment, shown in accordance with an exemplary embodiment.
Fig. 2 is a flowchart illustrating an image generation method according to an exemplary embodiment.
FIG. 3 is a schematic diagram of an architecture of a multimodal image generation model, according to an example embodiment.
FIG. 4a is a schematic diagram of an architecture of an image attention network embedded in a Stable Diffusion model, according to an example embodiment.
FIG. 4b is a schematic diagram of an architecture of an image attention network embedded in a Stable Diffusion model, according to an example embodiment.
FIG. 5 is a training framework diagram illustrating a linear transformation layer and a normalization layer, according to an example embodiment.
Fig. 6 is a block diagram of an image generating apparatus according to an exemplary embodiment.
Fig. 7 is a block diagram of an electronic device for image generation, according to an example embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the application will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following description in order to provide a better illustration of the application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present application.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain optimal results. Artificial intelligence software technology mainly comprises computer image technology, speech processing technology, natural language processing technology, machine learning/deep learning, and other directions.
In recent years, with the research and progress of artificial intelligence technology, artificial intelligence has been widely applied in many fields. The solution provided by the embodiments of the application relates to technologies such as computer image technology and machine learning/deep learning, and is specifically described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic diagram of an application system according to an embodiment of the application. The application system may be used in the image generation method of the present application. As shown in fig. 1, the application system may include at least a server 01 and a terminal 02.
In the embodiment of the present application, the server 01 may be used for image generation processing. The server 01 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms.
In the embodiment of the present application, the terminal 02 may be used to trigger the image generation process, and so on. The terminal 02 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device, or another type of physical device. The physical device may also include software running in it, such as an application. The operating system running on the terminal 02 in the embodiment of the present application may include, but is not limited to, an Android system, an iOS system, Linux, Windows, and the like.
In addition, it should be noted that fig. 1 shows only one application environment of the image generation method provided by the present application.
In the embodiment of the present disclosure, the terminal 02 and the server 01 may be directly or indirectly connected through wired or wireless communication, which is not limited by the present disclosure.
In a specific embodiment, when the server 01 is a distributed system, the distributed system may be a blockchain system. In that case, the distributed system may be formed by a plurality of nodes (computing devices of any form in the access network, such as servers and user terminals) that form a peer-to-peer network, where the protocol used by the peer-to-peer network is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, may join to become a node; a node includes a hardware layer, an intermediate layer, an operating system layer, and an application layer. Specifically, the functions of each node in the blockchain system may include:
1) Routing: a basic function of a node, used to support communication between nodes.
Besides the routing function, a node may also have the following functions:
2) Application: deployed in a blockchain to implement a specific service according to actual service requirements, record data related to the implemented function to form recorded data, carry a digital signature in the recorded data to indicate the source of the task data, and send the recorded data to other nodes in the blockchain system, so that the other nodes add the recorded data to a temporary block when the source and integrity of the recorded data are verified.
It should be noted that the specific embodiments of the present application involve data related to users. When the following embodiments of the present application are applied to specific products or technologies, user permission or consent is required, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Fig. 2 is a flowchart illustrating an image generation method according to an exemplary embodiment. As shown in fig. 2, the following steps may be included.
S201, acquiring target image features, target text features and noise features to be processed.
In practical application, under the condition that image generation is required, the target image characteristics, the target text characteristics and the noise characteristics to be processed can be obtained. Wherein the noise characteristic to be processed may be a random noise characteristic.
The target text feature may refer to a text feature of a target text that is used to guide the image generation process, and may serve as the text condition of the image generation process. For example, the target text may be encoded to obtain the target text feature. For example, referring to fig. 3, a text encoding network may be used to encode the target text to obtain the target text feature. In one example, the text encoding network may be the text encoder in a CLIP (Contrastive Language-Image Pre-training) model, or may be a BLIP (Bootstrapping Language-Image Pre-training) model; the application is not limited in this regard.
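As an illustration of this text-condition path, the following is a minimal sketch assuming the text encoding network is the text encoder of a publicly available CLIP model loaded through the Hugging Face transformers library; the checkpoint name and prompt are illustrative, not taken from the patent.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative checkpoint; the patent only requires "a text encoder in a CLIP model".
CHECKPOINT = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(CHECKPOINT)
text_encoder = CLIPTextModel.from_pretrained(CHECKPOINT)

def encode_text(prompt: str) -> torch.Tensor:
    """Encode the target text into per-token target text features."""
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        # last_hidden_state: (1, 77, 768) token-level features for ViT-L/14
        return text_encoder(**tokens).last_hidden_state

target_text_features = encode_text("a cat in watercolor style")
```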
The target image feature may refer to an image feature of a target image, which is used to guide the image generation process, and may be an image condition of the image generation process. For example, the target image may be encoded to obtain the target image feature. For example, the target image may be encoded using a pre-trained image encoding model to obtain the target image features.
In one possible implementation, the image coding model may include an image coding network, a linear transformation layer, and a normalization layer, as shown in fig. 3, based on which the target image features may be obtained by:
Acquiring a target image; for example, an image for guiding the image generation process is acquired as the target image;
inputting the target image into an image coding network and performing image coding processing to obtain a first image feature; the first image feature may be a global feature or a fine-grained grid feature. In the case that the first image feature is a fine-grained image grid feature, the target image may be input into the image encoding network and image grid feature extraction processing performed to obtain the first image feature. Illustratively, the first image feature may be a 16×16 grid feature; such fine-grained feature extraction makes the first image feature more accurate.
Further, the first image feature may be input to a linear transformation layer and linear transformation processing performed to obtain the second image feature. Optionally, the linear transformation layer may be a Transformer model, which may improve feature fitting.
Moreover, the second image feature may be input into a normalization layer (LayerNorm layer) and normalized to obtain the target image feature.
The linear transformation layer and the normalization layer are obtained by training an initial linear transformation layer and an initial normalization layer based on sample images, with the parameters of the image coding network and the parameters of the text-to-image model fixed.
In one example, the image encoding network may be an image encoding module in a pre-trained multimodal model. For example, the image encoding network may be the image encoder in a CLIP model; the application is not limited in this regard. Because the image encoding network is an image encoding module in a pre-trained multimodal model, the extracted image features are more accurate and have stronger semantic expression. The size of the first image feature may be 1x768; the linear transformation layer may map the feature to 8x768, and normalization processing is then performed through the normalization layer to obtain the target image feature with a size of 8x768. The normalization layer may normalize the output of each neuron in the network so that the output of each layer in the network has a similar distribution. The linear transformation layer may be a fully connected neural network for mapping the input features to larger features; the application is not limited in this regard.
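As a concrete reading of the pipeline just described, here is a minimal sketch assuming the image encoding network is a frozen CLIP image encoder whose 1x768 global feature is mapped by a trainable linear layer to 8x768 and then normalized; the class name, checkpoint, and tensor shapes are illustrative, and the patent equally allows 16×16 grid features and a Transformer in place of the plain linear layer.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class ImageConditionEncoder(nn.Module):
    """Image-condition branch: frozen CLIP image encoder -> linear
    transformation layer (1x768 -> 8x768) -> LayerNorm."""
    def __init__(self, dim: int = 768, num_tokens: int = 8):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.clip.requires_grad_(False)               # encoder parameters stay fixed
        self.num_tokens = num_tokens
        self.proj = nn.Linear(dim, num_tokens * dim)  # maps 1x768 to 8x768
        self.norm = nn.LayerNorm(dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            # first image feature: global CLIP embedding, shape (batch, 768)
            first_feature = self.clip.get_image_features(pixel_values=pixel_values)
        second_feature = self.proj(first_feature)            # (batch, 8*768)
        second_feature = second_feature.view(-1, self.num_tokens, 768)
        return self.norm(second_feature)                     # target image features
```

Only proj and norm carry gradients here, matching the training arrangement described later in which the encoder and the text-to-image model stay frozen.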
S203, inputting the target image feature, the target text feature and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network based on a result of the feature cross processing and the target text feature to obtain the target noise feature.
In the embodiment of the present disclosure, the denoising network may be a denoising network in a pre-trained text-to-image model; on this basis, the noise processing network may include the denoising network in the pre-trained text-to-image model and an image attention network newly embedded in the denoising network. The image attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network; that is, the image attention network is embedded in the pre-trained denoising network for fine-tuning training. Illustratively, the dashed box in FIG. 3 may refer to the denoising network in the pre-trained text-to-image model (note that this denoising network does not include the embedded image attention network).
The text attention network is the cross attention network in the denoising network that performs feature cross processing on the input text features and the input noise features; the image attention network performs feature cross processing on the input image features and the input noise features. That is, the pre-trained text-to-image model supports the input of text conditions through the text attention network, and connecting the image attention network in parallel with the text attention network serves to embed the image attention network into the pre-trained text-to-image model, so that the model embedded with the image attention network can also support the input of image conditions and can therefore be regarded as a multimodal image generation model supporting both text and image input. As such, the multimodal image generation model may include the pre-trained text-to-image model, the image attention network, and the image encoding model; the pre-trained text-to-image model may include a text encoding network, a denoising network, and an image decoder. In this case, as shown in fig. 3, the image attention network supports the input of image conditions and is connected in parallel with the text attention network. If the text attention network and the image attention network are regarded as one overall attention network, the processing-flow architecture of the text-to-image model embedded with the image attention network is identical to that of the pre-trained text-to-image model; that is, the target image features are incorporated at the text attention network part, so not only is the model structure changed little, but the processing-flow architecture is also identical.
Alternatively, the output of the text attention network and the output of the image attention network may be input to the next processing module in parallel. Alternatively, the output of the text attention network and the output of the image attention network may be superimposed and input to the next processing module, such as the decoding submodule in fig. 3 or the image decoder shown in fig. 3.
In one possible implementation, the text-to-image model may be an image generation model whose image generation is guided by text conditions; the application is not limited in this regard.
In one example, the noise processing network may be obtained by training a preset noise network based on multi-modal sample data, which may include sample image features, sample text features, and sample noise features. The preset noise network may include the denoising network in the text-to-image model and an initial attention network newly embedded in the denoising network, the initial attention network being connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network; the initial attention network is used for performing feature cross processing on the input image features and the input noise features. During training of the noise processing network, the parameters of the text-to-image model are fixed, and the initial attention network is iteratively trained. The specific training process of the initial attention network is described below and is not detailed here.
In the embodiment of the present disclosure, the target image feature, the target text feature, and the noise feature to be processed may be input into a noise processing network, the target image feature and the noise feature to be processed are subjected to feature cross processing by using an image attention network in the noise processing network, and a denoising network in the noise processing network is used to denoise the noise feature to be processed based on a result of the feature cross processing and the target text feature, so as to obtain the target noise feature.
In one possible implementation manner, the target image feature, the target text feature and the noise feature to be processed can be input into a noise processing network, and the noise feature to be processed is subjected to coding processing to obtain a noise coding feature, so that the text attention network can be used for carrying out correlation processing on the noise coding feature and the target text feature to obtain a text noise correlation feature; and performing correlation processing, such as feature cross processing, on the noise coding features and the target image features by using an image attention network to obtain image noise correlation features. Therefore, decoding processing can be performed based on the text noise correlation characteristic and the image noise correlation characteristic, and the target noise characteristic is obtained.
In another possible implementation, the denoising network may further include an encoding submodule and a decoding submodule; the encoding submodule may include, for example, a convolution layer and a pooling layer, and the decoding submodule may include a decoding layer and a convolution layer. The text attention network comprises a text attention network corresponding to the encoding submodule and a text attention network corresponding to the decoding submodule, and the image attention network comprises an image attention network corresponding to the encoding submodule and an image attention network corresponding to the decoding submodule. As shown in fig. 3, the text attention network corresponding to the encoding submodule may refer to the text attention network connected after the encoding submodule, and the image attention network corresponding to the encoding submodule may refer to the image attention network connected after the encoding submodule; likewise, the text attention network corresponding to the decoding submodule may refer to the text attention network connected after the decoding submodule, and the image attention network corresponding to the decoding submodule may refer to the image attention network connected after the decoding submodule. The encoding submodule, together with its corresponding text attention network and image attention network, may form an encoding module in the noise processing network; the decoding submodule, together with its corresponding text attention network and image attention network, may form a decoding module in the noise processing network. It will be appreciated that the parameters of each text attention network may differ and may be pre-trained, and the parameters of each image attention network may differ and are obtained by training. The text attention networks and image attention networks in different encoding and decoding modules share the same names here only to indicate that they are functionally similar; the parameters of these networks are not limited.
As shown in fig. 3, inputting the target image feature, the target text feature, and the noise feature to be processed into the noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using the image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using the denoising network in the noise processing network based on the result of the feature cross processing and the target text feature, to obtain the target noise feature, may include:
performing coding processing on the noise characteristics to be processed based on the coding submodule to obtain first noise characteristics;
calling a text attention network corresponding to the coding submodule and an image attention network corresponding to the coding submodule, respectively carrying out feature cross processing on the target text feature and the first noise feature to obtain a first cross feature, and carrying out feature cross processing on the target image feature and the first noise feature to obtain a second cross feature;
and decoding and feature crossing processing are carried out on the first crossing feature and the second crossing feature based on the decoding submodule, the text attention network corresponding to the decoding submodule and the corresponding image attention network, so as to obtain the target noise feature. For example, feature stacking processing may be performed on the first cross feature and the second cross feature to obtain a target cross feature; the target cross characteristic can be input into a decoding submodule to be decoded, so that a second noise characteristic is obtained; further, a text attention network corresponding to the decoding submodule and an image attention network corresponding to the decoding submodule can be called, feature cross processing is conducted on the target text feature and the second noise feature to obtain a third cross feature, and feature cross processing is conducted on the target image feature and the second noise feature to obtain a fourth cross feature; therefore, the third cross feature and the fourth cross feature can be subjected to feature superposition processing to obtain the target noise feature.
In one example, the feature superposition processing may be represented as follows:
Output=CrossAttention(query,Wk*Ft,Wv*Ft)+CrossAttention(query,Wk’*Fi,Wv’*Fi)
wherein Output may represent the output of the feature superposition processing; query may represent the output of the previous module, and Wk and Wv may represent parameters of the text attention network; Ft may represent the target text feature; Wk' and Wv' may represent parameters of the image attention network; Fi may represent the target image feature. Taking Output as the target noise feature as an example: query may represent the second noise feature, where the previous module may refer to the decoding submodule; Wk and Wv may represent parameters of the text attention network corresponding to the decoding submodule; Ft may represent the target text feature; CrossAttention(query, Wk*Ft, Wv*Ft) may represent the third cross feature; Wk' and Wv' may represent parameters of the image attention network corresponding to the decoding submodule; Fi may represent the target image feature; CrossAttention(query, Wk'*Fi, Wv'*Fi) may represent the fourth cross feature.
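The following is a minimal sketch of this feature superposition, assuming single-head scaled dot-product cross attention with feature dimension 768; the module and parameter names (wk_text, wk_img, and so on) are illustrative stand-ins for Wk, Wv, Wk', and Wv' above.

```python
import math
import torch
import torch.nn as nn

class ParallelTextImageAttention(nn.Module):
    """Text cross attention and image cross attention in parallel,
    with their outputs superimposed, as in the formula above."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        # text branch: Wk, Wv (pre-trained, kept frozen)
        self.wk_text = nn.Linear(dim, dim, bias=False)
        self.wv_text = nn.Linear(dim, dim, bias=False)
        # image branch: Wk', Wv' (newly embedded, trained)
        self.wk_img = nn.Linear(dim, dim, bias=False)
        self.wv_img = nn.Linear(dim, dim, bias=False)
        self.scale = 1.0 / math.sqrt(dim)

    def cross_attention(self, q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

    def forward(self, noise_feat, text_feat, image_feat):
        q = self.to_q(noise_feat)   # query: output of the previous module
        text_out = self.cross_attention(q, self.wk_text(text_feat),
                                        self.wv_text(text_feat))
        image_out = self.cross_attention(q, self.wk_img(image_feat),
                                         self.wv_img(image_feat))
        return text_out + image_out   # feature superposition
```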
As one example, the text-to-image model may be a Stable Diffusion model. Based on this, the method may further comprise: acquiring a current time step;
accordingly, the step of inputting the target image feature, the target text feature, and the noise feature to be processed into the noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using the image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using the denoising network in the noise processing network based on the result of the feature cross processing and the target text feature, to obtain the target noise feature, may be replaced by the following steps:
Inputting the target image characteristics, the target text characteristics, the noise characteristics to be processed and the current time step into a noise processing network, performing characteristic cross processing on the target image characteristics and the noise characteristics to be processed by utilizing an image attention network in the noise processing network, and performing denoising processing on the noise characteristics to be processed by utilizing a denoising network in the noise processing network based on the characteristic cross processing result, the target text characteristics and the current time step to obtain the current noise characteristics;
and under the condition that the current time step does not meet the time step threshold, taking the current noise characteristic as the noise characteristic to be processed, repeating the denoising processing step until the time step threshold is met, and determining the current noise characteristic when the time step threshold is met as the target noise characteristic.
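A minimal sketch of this iterative loop follows; the network call signature and the fixed step count standing in for the time-step threshold are assumptions, since the patent leaves the sampler unspecified.

```python
import torch

@torch.no_grad()
def iterative_denoise(noise_processing_network, noise_features,
                      text_features, image_features, num_steps: int = 50):
    """Repeat the denoising step until the time-step threshold is met."""
    for t in reversed(range(num_steps)):        # current time step
        timestep = torch.tensor([t])
        current = noise_processing_network(noise_features, timestep,
                                           text_features, image_features)
        # the current noise features become the noise features to be
        # processed for the next iteration; a real sampler (e.g. DDIM)
        # would apply its own update coefficients here
        noise_features = current
    return noise_features                        # target noise features
```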
In an alternative embodiment, where the text-to-image model is a Stable Diffusion model, the denoising network may accordingly be a UNet. The UNet may contain multiple encoding submodules and decoding submodules, each with a corresponding text attention network. On this basis, the architecture of the multimodal image generation model may be as shown in fig. 4a; that is, the noise processing network may include a plurality of encoding modules and a plurality of decoding modules, each encoding module structured as in the dashed box in fig. 4a, and each decoding module composed, as shown in fig. 3, of a decoding submodule and the text attention network and image attention network corresponding to that decoding submodule. The processing of each encoding submodule may be as in the processing of the encoding submodule of fig. 3, with the difference that the target cross feature, obtained by superimposing the output of the text attention network with the output of the image attention network, is not only sent to the next encoding submodule but also passed across layers to the subsequent corresponding decoding submodule, as indicated by the dashed line in fig. 4a. Accordingly, the input of any decoding submodule may include the output of the module preceding that decoding submodule and the corresponding cross-layer-delivered target cross feature. Correspondence here may refer to symmetry in the structure of the UNet, with cross-layer transmission as illustrated by the dashed arrows in fig. 4b. As described above, if the UNet embedded with the image attention network is regarded as a whole, its denoising processing framework is identical to the original denoising processing framework of the UNet; the entire structure is not shown in fig. 4a.
In one example, referring to fig. 4b, the UNet after embedding the image attention network may be as in the dashed-box portion of fig. 4b, which may include the 4 encoding submodules of fig. 4a and 4 decoding submodules.
S205, performing image decoding processing on the target noise features by using an image decoder in the text-to-image model to obtain a generated image.
In the embodiment of the present specification, the image decoder in the text-to-image model may be used to perform image decoding processing on the target noise features to obtain the generated image. As shown in fig. 3, the target noise features may be input to the image decoder in the text-to-image model, subjected to image decoding processing, and output as the generated image. The attribute information of the generated image may match the semantics of the target text and the attribute information of the target image; attribute information may include, but is not limited to, content and style.
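As an illustration of step S205, here is a minimal sketch assuming the image decoder is the VAE decoder of a Stable Diffusion checkpoint loaded through the diffusers library; the checkpoint name and the scaling convention are assumptions tied to that library, not details given in the patent.

```python
import torch
from diffusers import AutoencoderKL

# Illustrative checkpoint; any Stable-Diffusion-style VAE decoder would do.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5",
                                    subfolder="vae")

@torch.no_grad()
def decode_to_image(target_noise_features: torch.Tensor) -> torch.Tensor:
    latents = target_noise_features / vae.config.scaling_factor
    image = vae.decode(latents).sample        # (batch, 3, H, W) in [-1, 1]
    return (image / 2 + 0.5).clamp(0, 1)      # map to [0, 1] for saving
```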
By newly embedding an image attention network into the denoising network of the pre-trained text-to-image model, the text-to-image model embedded with the image attention network can accept both target text features and target image features as input; that is, it can support multi-modal condition input of images and text. The target image features, the target text features, and the noise features to be processed are input into the noise processing network; feature cross processing is performed on the target image features and the noise features to be processed by using the image attention network in the noise processing network; and denoising processing is performed on the noise features to be processed by using the denoising network in the noise processing network, based on the result of the feature cross processing and the target text features, to obtain the target noise features. This achieves the aim of denoising the noise features to be processed with the target image features and the target text features serving as conditions that guide image generation. Further, an image decoder in the text-to-image model performs image decoding processing on the target noise features to obtain the generated image. When target image features and target text features jointly guide image generation, the target image features can express the image generation condition more accurately than the target text features alone, so the accuracy of the generated image can be improved. In addition, newly embedding the image attention network in the pre-trained text-to-image model keeps the model structure simple.
Referring to fig. 5, in one possible implementation, the linear transformation layer and the normalization layer may be trained by the following steps:
coding, transforming and normalizing the sample image by using an image coding network, an initial linear transformation layer and an initial normalization layer to obtain predicted image characteristics;
inputting the predicted image features into the pre-trained text-to-image model and performing image generation processing to obtain a first predicted image; here the text condition originally input to the text-to-image model is replaced with the image condition (the predicted image features), so that the image condition can be aligned with the feature space of the text-to-image model.
Determining first loss information based on the first predicted image and the label image corresponding to the sample image; the label image corresponding to the sample image may be a label annotated for the sample image in advance, i.e., the target of training and learning; in other words, the closer the first predicted image generated after learning is to the label image, the better.
Training the initial linear transformation layer and the initial normalization layer according to the first loss information until a training iteration condition is met, and determining the initial linear transformation layer and the initial normalization layer at the time the condition is met as the linear transformation layer and the normalization layer, respectively. That is, during the training process, the parameters of the image encoding network and the parameters of the text-to-image model are fixed. In one example, the image encoding network may be the image encoder in a CLIP model, and the text-to-image model may be a pre-trained Stable Diffusion model; the application is not limited in this regard. Illustratively, the training iteration condition may be a loss threshold, a threshold on the number of training iterations, etc.; the application is not limited in this regard. Training the initial linear transformation layer and the initial normalization layer according to the first loss information may include calculating first gradient information according to the first loss information and back-propagating it to adjust the parameters of the initial linear transformation layer and of the initial normalization layer.
Since the trainable parameters of the initial linear transformation layer and the initial normalization layer are relatively few, the risk of overfitting is low, so a relatively large learning rate may be chosen for this stage, and the optimizer may be Adam (Adaptive Moment Estimation, a gradient-descent-based optimization algorithm).
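The following is a minimal training-loop sketch for this first stage, reusing the hypothetical ImageConditionEncoder from the earlier sketch; the MSE loss, learning rate, and the frozen generator's call signature are assumptions, since the patent specifies only first loss information, a relatively large learning rate, and the Adam optimizer.

```python
import torch
import torch.nn.functional as F

def train_projection(cond_encoder, frozen_t2i_model, dataloader,
                     lr: float = 1e-4, num_epochs: int = 1):
    """Train only the linear transformation layer and the normalization
    layer; the image encoder and the text-to-image model stay frozen."""
    params = (list(cond_encoder.proj.parameters())
              + list(cond_encoder.norm.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)      # relatively large LR
    for _ in range(num_epochs):
        for sample_image, label_image in dataloader:
            pred_features = cond_encoder(sample_image)    # predicted image features
            first_pred = frozen_t2i_model(pred_features)  # image generation
            loss = F.mse_loss(first_pred, label_image)    # first loss information
            optimizer.zero_grad()
            loss.backward()          # first gradient information
            optimizer.step()         # adjust only the two trainable layers
```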
In one possible implementation, the training process of the image attention network may include the steps of:
acquiring multi-modal sample data, which may include sample image features, sample text features, and sample noise features; the sample image features may be obtained by encoding training sample images, the sample text features may be obtained by encoding training sample texts, and the sample noise features may be random noise features. For the specific implementation of this step, refer to step S201; details are not repeated here.
Inputting the sample image features, the sample text features, and the sample noise features into the preset noise network, performing feature cross processing on the sample image features and the sample noise features by using the initial attention network in the noise processing network, and performing denoising processing on the sample noise features by using the denoising network in the noise processing network, based on the feature cross result of the sample image features and the sample noise features together with the sample text features, to obtain predicted noise features. The denoising network may be the denoising network in the pre-trained text-to-image model, and the initial attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network. For the specific implementation of this step, refer to step S203; details are not repeated here. Illustratively, the initial attention network may be a copy of the text attention network; that is, the initial parameters of the initial attention network are the parameters of the text attention network, so that the output of the initial attention network can be aligned with the feature space of the text-to-image model. Alternatively, the initial attention network may be another cross attention network. The application is not limited in this regard.
Performing image decoding processing on the predicted noise features by using the image decoder in the text-to-image model to obtain a second predicted image; for the specific implementation of this step, refer to step S205; details are not repeated here.
Further, second loss information may be determined based on the second predicted image and the label image corresponding to the multi-modal sample data; the second loss information may be determined, for example, based on a preset loss function, which is not limited by the present application. The initial attention network can then be trained according to the second loss information until the training iteration condition is met, and the initial attention network at the time the condition is met is determined to be the image attention network. The label image corresponding to the multi-modal sample data may refer to a label annotated for the multi-modal sample data in advance, i.e., the target of training and learning; in other words, the closer the second predicted image generated after learning is to the label image, the better. An Adam optimizer may also be employed here, but with a relatively small learning rate.
Illustratively, the training iteration condition may be a loss threshold, a threshold on the number of training iterations, etc.; the application is not limited in this regard. Training the initial attention network according to the second loss information may include calculating second gradient information according to the second loss information and back-propagating it to adjust the parameters of the initial attention network.
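A minimal sketch of this second training stage follows, assuming the image attention parameters are initialized as copies of the corresponding text attention parameters (one of the options above) and that the loss is an MSE between the decoded second predicted image and the label image; the attribute names on the network are hypothetical.

```python
import copy
import torch
import torch.nn.functional as F

def embed_image_attention(noise_processing_network):
    """Initialize each image attention branch as a copy of the frozen
    text attention branch; only the copies receive gradients."""
    noise_processing_network.requires_grad_(False)
    for block in noise_processing_network.blocks:    # hypothetical attribute
        block.wk_img = copy.deepcopy(block.wk_text)
        block.wv_img = copy.deepcopy(block.wv_text)
        block.wk_img.requires_grad_(True)
        block.wv_img.requires_grad_(True)

def train_image_attention(network, image_decoder, dataloader, lr: float = 1e-5):
    params = [p for p in network.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)      # relatively small LR
    for img_feat, txt_feat, noise_feat, label_image in dataloader:
        pred_noise = network(noise_feat, txt_feat, img_feat)
        second_pred = image_decoder(pred_noise)      # second predicted image
        loss = F.mse_loss(second_pred, label_image)  # second loss information
        optimizer.zero_grad()
        loss.backward()        # second gradient information
        optimizer.step()
```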
It should be noted that, during training, both the content and the style in the sample images or training sample images may be learned; the present application is not limited in this regard.
Fig. 6 is a block diagram of an image generating apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus may include:
the acquiring module 601 is configured to acquire a target image feature, a target text feature, and a noise feature to be processed;
the denoising module 603 is configured to input the target image feature, the target text feature, and the noise feature to be processed into a noise processing network, perform feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and perform denoising processing on the noise feature to be processed by using a denoising network in the noise processing network, based on a result of the feature cross processing and the target text feature, to obtain a target noise feature; the denoising network is a denoising network in a pre-trained text-to-image model, and the image attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
and the image generating module 605 is configured to perform image decoding processing on the target noise feature by using an image decoder in the text-to-image model to obtain the generated image.
In a possible implementation manner, the denoising network further comprises an encoding submodule and a decoding submodule, the text attention network comprises a text attention network corresponding to the encoding submodule and a text attention network corresponding to the decoding submodule, and the image attention network comprises an image attention network corresponding to the encoding submodule and an image attention network corresponding to the decoding submodule; the denoising module 603 may include:
the coding unit is used for coding the noise characteristic to be processed based on the coding submodule to obtain a first noise characteristic;
the feature cross unit is used for calling the text attention network corresponding to the encoding submodule and the image attention network corresponding to the encoding submodule, performing feature cross processing on the target text feature and the first noise feature to obtain a first cross feature, and performing feature cross processing on the target image feature and the first noise feature to obtain a second cross feature;
and the decoding unit is used for decoding and feature crossing processing the first crossing feature and the second crossing feature based on the decoding submodule, a text attention network corresponding to the decoding submodule and a corresponding image attention network to obtain the target noise feature.
In one possible implementation manner, the decoding unit may include:
the characteristic superposition subunit is used for carrying out characteristic superposition processing on the first cross characteristic and the second cross characteristic to obtain a target cross characteristic;
the second noise characteristic acquisition subunit is used for inputting the target cross characteristic into the decoding submodule to carry out decoding processing to obtain a second noise characteristic;
the feature cross subunit is used for calling the text attention network corresponding to the decoding submodule and the image attention network corresponding to the decoding submodule, performing feature cross processing on the target text feature and the second noise feature to obtain a third cross feature, and performing feature cross processing on the target image feature and the second noise feature to obtain a fourth cross feature;
and the decoding subunit is used for carrying out feature superposition processing on the third cross feature and the fourth cross feature to obtain the target noise feature.
In one possible implementation manner, the acquiring module 601 may include:
a target image acquisition unit configured to acquire a target image;
the image coding unit is used for inputting the target image into an image coding network and carrying out image coding processing to obtain a first image characteristic;
the linear transformation unit is used for inputting the first image feature into the linear transformation layer and performing linear transformation processing to obtain the second image feature;
the normalization unit is used for inputting the second image features into a normalization layer and carrying out normalization processing to obtain the target image features;
wherein the linear transformation layer and the normalization layer are obtained by training an initial linear transformation layer and an initial normalization layer based on sample images, with the parameters of the image coding network and the parameters of the text-to-image model fixed.
In one possible implementation manner, the image encoding unit is further configured to input the target image into an image encoding network, and perform image grid feature extraction processing to obtain a first image feature.
In one possible implementation, the image coding network is an image encoding module in a pre-trained multimodal model, and the linear transformation layer is a Transformer model.
In one possible implementation, the text-to-image model is a Stable Diffusion model; the apparatus may further include:
the time step acquisition module is used for acquiring the current time step;
accordingly, the denoising module 603 includes:
The denoising unit is used for inputting target image characteristics, target text characteristics, noise characteristics to be processed and current time steps into a noise processing network, performing characteristic cross processing on the target image characteristics and the noise characteristics to be processed by using an image attention network in the noise processing network, and denoising the noise characteristics to be processed by using a denoising network in the noise processing network based on the characteristic cross processing result, the target text characteristics and the current time steps to obtain current noise characteristics;
and the iteration unit is used for taking the current noise characteristic as the noise characteristic to be processed under the condition that the current time step does not meet the time step threshold, repeating the denoising processing step until the time step threshold is met, and determining the current noise characteristic meeting the time step threshold as the target noise characteristic.
In one possible implementation manner, the apparatus may further include:
the predicted image characteristic acquisition module is used for carrying out coding, transformation and normalization processing on the sample image by utilizing the image coding network, the initial linear transformation layer and the initial normalization layer to obtain predicted image characteristics;
The first image prediction module is used for inputting the predicted image characteristics into the pre-trained text-to-text graph model, and performing image generation processing to obtain a first predicted image;
a first loss determination module, configured to determine first loss information based on the first prediction image and a label image corresponding to the sample image;
the first training module is configured to train the initial linear transformation layer and the initial normalization layer according to the first loss information until a training iteration condition is satisfied, and to determine the initial linear transformation layer and the initial normalization layer at the time the condition is satisfied as the linear transformation layer and the normalization layer, respectively.
In one possible implementation manner, the apparatus may further include:
the training data acquisition module is used for acquiring multi-mode sample data, wherein the multi-mode sample data comprises the sample image characteristics, the sample text characteristics and the sample noise characteristics;
the predicted noise feature acquisition module is used for inputting the sample image features, the sample text features, and the sample noise features into a preset noise network, performing feature cross processing on the sample image features and the sample noise features by using an initial attention network in the noise processing network, and performing denoising processing on the sample noise features by using a denoising network in the noise processing network, based on the feature cross result of the sample image features and the sample noise features together with the sample text features, to obtain predicted noise features; the denoising network is a denoising network in a pre-trained text-to-image model, and the initial attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
The second image prediction module is used for performing image decoding processing on the predicted noise features by using the image decoder in the text-to-image model to obtain a second predicted image;
a second loss determination module, configured to determine second loss information based on the second predicted image and a label image corresponding to the multi-mode sample data;
and the second training module is used for training the initial attention network according to the second loss information until the training iteration condition is met, and determining the initial attention network at the time the condition is met as the image attention network.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be detailed here.
Fig. 7 is a block diagram illustrating an electronic device for image generation, which may be a server, and an internal structure diagram thereof may be as shown in fig. 7, according to an exemplary embodiment. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of image generation.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of a portion of the structure associated with the present application and does not limit the electronic device to which the present application is applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the image generation method as in the embodiments of the present application.
In an exemplary embodiment, a computer readable storage medium is also provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the image generation method in the embodiments of the application. The computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising instructions which, when run on a computer, cause the computer to perform the image generation method in an embodiment of the application.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. An image generation method, comprising:
acquiring target image characteristics, target text characteristics and noise characteristics to be processed;
inputting the target image feature, the target text feature and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network based on the result of the feature cross processing and the target text feature, to obtain a target noise feature; the denoising network is a denoising network in a pre-trained text-to-image model, and the image attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
and performing image decoding processing on the target noise feature by using an image decoder in the text-to-image model to obtain a generated image.
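By way of illustration and not limitation, the parallel connection recited in claim 1 might be sketched as one denoising block in which an image cross-attention branch sits alongside the existing text cross-attention and their outputs are superposed; the class `DualConditionBlock` and all tensor shapes are assumptions made for exposition.

```python
import torch
from torch import nn

# Hypothetical sketch of one denoising block: the image attention network is
# wired in parallel with the pre-trained text attention network, and both
# cross-attention results are added back to the noise feature.
class DualConditionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # pre-trained branch
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # embedded branch

    def forward(self, noise_feat, text_feat, image_feat):
        # Query: the noise feature; keys/values: each condition modality.
        t_out, _ = self.text_attn(noise_feat, text_feat, text_feat)
        i_out, _ = self.image_attn(noise_feat, image_feat, image_feat)
        return noise_feat + t_out + i_out   # residual plus superposed cross features
```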
2. The image generation method according to claim 1, wherein the denoising network further comprises an encoding submodule and a decoding submodule, the text attention network comprises a text attention network corresponding to the encoding submodule and a text attention network corresponding to the decoding submodule, and the image attention network comprises an image attention network corresponding to the encoding submodule and an image attention network corresponding to the decoding submodule;
inputting the target image feature, the target text feature and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network based on a result of the feature cross processing and the target text feature to obtain a target noise feature, wherein the method comprises the following steps:
performing encoding processing on the noise feature to be processed based on the encoding submodule to obtain a first noise feature;
invoking the text attention network corresponding to the encoding submodule and the image attention network corresponding to the encoding submodule, respectively performing feature cross processing on the target text feature and the first noise feature to obtain a first cross feature, and performing feature cross processing on the target image feature and the first noise feature to obtain a second cross feature;
and performing decoding and feature cross processing on the first cross feature and the second cross feature based on the decoding submodule, the text attention network corresponding to the decoding submodule and the corresponding image attention network, to obtain the target noise feature.
3. The image generation method according to claim 2, wherein the performing decoding and feature cross processing on the first cross feature and the second cross feature based on the decoding submodule, the text attention network corresponding to the decoding submodule and the corresponding image attention network to obtain the target noise feature comprises:
performing feature superposition processing on the first cross feature and the second cross feature to obtain a target cross feature;
inputting the target cross feature into the decoding submodule for decoding processing to obtain a second noise feature;
invoking the text attention network corresponding to the decoding submodule and the image attention network corresponding to the decoding submodule, respectively performing feature cross processing on the target text feature and the second noise feature to obtain a third cross feature, and performing feature cross processing on the target image feature and the second noise feature to obtain a fourth cross feature;
and performing feature superposition processing on the third cross feature and the fourth cross feature to obtain the target noise feature.
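By way of illustration and not limitation, the encode-cross-decode-cross flow of claims 2 and 3 might be sketched as follows, assuming cross-attention modules like those in the previous sketch; the helper names (`encode`, `decode`, `enc_text_attn` and so on) are invented for exposition.

```python
import torch

# Hypothetical sketch of the claim-2/claim-3 data flow through the
# encoding submodule, both attention pairs, and the decoding submodule.
def denoise_once(net, noise_feat, text_feat, image_feat):
    first = net.encode(noise_feat)                              # first noise feature
    c1, _ = net.enc_text_attn(first, text_feat, text_feat)      # first cross feature
    c2, _ = net.enc_image_attn(first, image_feat, image_feat)   # second cross feature
    target_cross = c1 + c2                                      # feature superposition
    second = net.decode(target_cross)                           # second noise feature
    c3, _ = net.dec_text_attn(second, text_feat, text_feat)     # third cross feature
    c4, _ = net.dec_image_attn(second, image_feat, image_feat)  # fourth cross feature
    return c3 + c4                                              # target noise feature
```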
4. The image generation method according to claim 1, wherein the target image feature is acquired by:
acquiring a target image;
inputting the target image into an image coding network, and performing image encoding processing to obtain a first image feature;
inputting the first image feature into a linear transformation layer, and performing linear transformation processing to obtain a second image feature;
inputting the second image feature into a normalization layer, and performing normalization processing to obtain the target image feature;
wherein the linear transformation layer and the normalization layer are obtained by training an initial linear transformation layer and an initial normalization layer based on a sample image under the condition that the parameters of the image coding network and the parameters of the text-to-image model are fixed.
5. The image generating method according to claim 4, wherein the inputting the target image into an image encoding network, performing image encoding processing, and obtaining a first image feature, includes:
inputting the target image into the image coding network, and performing image grid feature extraction processing to obtain the first image feature.
6. The image generation method of claim 4, wherein the image coding network is an image coding module in a pre-trained multimodal model, and the linear transformation layer is a Transformer model.
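By way of illustration and not limitation, the conditioning path of claims 4 to 6 might be sketched as follows: a frozen image encoder yields grid features, and a small trainable projector (a Transformer encoder layer standing in for the linear transformation layer) followed by a LayerNorm maps them into the conditioning space. `ImageConditionEncoder` and its arguments are assumptions, not the claimed structure.

```python
import torch
from torch import nn

# Hypothetical sketch of the image conditioning path in claims 4-6.
class ImageConditionEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, cond_dim: int):
        super().__init__()
        self.backbone = backbone.eval()        # frozen image coding network
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.proj = nn.Sequential(             # trainable linear transformation layer
            nn.Linear(feat_dim, cond_dim),
            nn.TransformerEncoderLayer(cond_dim, nhead=8, batch_first=True),
        )
        self.norm = nn.LayerNorm(cond_dim)     # trainable normalization layer

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            grid = self.backbone(image)        # first image feature (grid features)
        return self.norm(self.proj(grid))      # target image feature
```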
7. The image generation method according to claim 1, wherein the text-to-image model is a Stable Diffusion model; the image generation method further includes:
acquiring a current time step;
inputting the target image feature, the target text feature and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network based on a result of the feature cross processing and the target text feature to obtain a target noise feature, wherein the method comprises the following steps:
inputting the target image feature, the target text feature, the noise feature to be processed and the current time step into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and performing denoising processing on the noise feature to be processed by using a denoising network in the noise processing network based on the result of the feature cross processing, the target text feature and the current time step, to obtain a current noise feature;
and under the condition that the current time step does not meet a time step threshold, taking the current noise feature as the noise feature to be processed and repeating the denoising processing step until the time step threshold is met, and determining the current noise feature obtained when the time step threshold is met as the target noise feature.
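By way of illustration and not limitation, the iterative denoising of claim 7 might be sketched as the following sampling loop; the call signature of `noise_net`, the step count and the latent shape are assumptions.

```python
import torch

# Hypothetical sketch of the claim-7 loop: the noise feature is denoised
# step by step until the time step threshold is reached, then decoded.
@torch.no_grad()
def generate(noise_net, decoder, img_feat, txt_feat, shape, num_steps=50):
    noise_feat = torch.randn(shape)                  # noise feature to be processed
    for t in reversed(range(num_steps)):             # current time step
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        noise_feat = noise_net(img_feat, txt_feat, noise_feat, t_batch)
    return decoder(noise_feat)                       # generated image
```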
8. The image generation method according to any one of claims 4 to 6, characterized in that the image generation method further comprises:
performing encoding, linear transformation and normalization processing on the sample image by using the image coding network, the initial linear transformation layer and the initial normalization layer, to obtain a predicted image feature;
inputting the predicted image feature into the pre-trained text-to-image model, and performing image generation processing to obtain a first predicted image;
determining first loss information based on the first predicted image and a label image corresponding to the sample image;
training the initial linear transformation layer and the initial normalization layer according to the first loss information until a training iteration condition is met, and determining the initial linear transformation layer and the initial normalization layer obtained when the training iteration condition is met as the linear transformation layer and the normalization layer, respectively.
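By way of illustration and not limitation, the first training stage of claim 8 might be sketched as follows, reusing the `ImageConditionEncoder` sketch above; the `generate` method on the text-to-image model and the loader format are assumptions.

```python
import torch

# Hypothetical sketch: only the linear transformation and normalization
# layers receive gradients; the image coding network and the text-to-image
# model stay fixed (frozen but still differentiable).
def train_projection(cond_enc, t2i_model, loader, steps, lr=1e-4):
    trainable = list(cond_enc.proj.parameters()) + list(cond_enc.norm.parameters())
    opt = torch.optim.AdamW(trainable, lr=lr)
    for _, (sample_img, label_img) in zip(range(steps), loader):
        pred_feat = cond_enc(sample_img)              # predicted image feature
        pred_img = t2i_model.generate(pred_feat)      # first predicted image
        loss = torch.nn.functional.mse_loss(pred_img, label_img)  # first loss information
        opt.zero_grad()
        loss.backward()
        opt.step()
    return cond_enc
```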
9. The image generation method according to claim 1, characterized in that the image generation method further comprises:
acquiring multi-modal sample data, the multi-modal sample data comprising the sample image features, the sample text features, and the sample noise features;
inputting the sample image feature, the sample text feature and the sample noise feature into a preset noise processing network, performing feature cross processing on the sample image feature and the sample noise feature by using an initial attention network in the noise processing network, and performing denoising processing on the sample noise feature by using a denoising network in the noise processing network based on the feature cross result of the sample image feature and the sample noise feature, together with the sample text feature, to obtain a prediction noise feature; the denoising network is a denoising network in a pre-trained text-to-image model, and the initial attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
performing image decoding processing on the prediction noise feature by using the image decoder in the text-to-image model to obtain a second predicted image;
determining second loss information based on the second predicted image and a label image corresponding to the multi-modal sample data;
training the initial attention network according to the second loss information until a training iteration condition is met, and determining the initial attention network obtained when the training iteration condition is met as the image attention network.
10. An image generating apparatus, comprising:
the acquisition module is used for acquiring target image characteristics, target text characteristics and noise characteristics to be processed;
the denoising module is used for inputting the target image feature, the target text feature and the noise feature to be processed into a noise processing network, performing feature cross processing on the target image feature and the noise feature to be processed by using an image attention network in the noise processing network, and denoising the noise feature to be processed by using a denoising network in the noise processing network based on the result of the feature cross processing and the target text feature, to obtain a target noise feature; the denoising network is a denoising network in a pre-trained text-to-image model, and the image attention network is connected in parallel with the text attention network in the denoising network so as to be embedded in the denoising network;
and the image generation module is used for performing image decoding processing on the target noise feature by using an image decoder in the text-to-image model to obtain a generated image.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image generation method of any of claims 1 to 9.
12. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image generation method of any one of claims 1 to 9.
CN202311150182.6A 2023-09-06 2023-09-06 Image generation method, device, electronic equipment and storage medium Pending CN117197271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311150182.6A CN117197271A (en) 2023-09-06 2023-09-06 Image generation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117197271A (en) 2023-12-08

Family

ID=88984441

Country Status (1)

Country Link
CN (1) CN117197271A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392260A (en) * 2023-12-13 2024-01-12 深圳须弥云图空间科技有限公司 Image generation method and device
CN117392260B (en) * 2023-12-13 2024-04-16 深圳须弥云图空间科技有限公司 Image generation method and device
CN117474796A (en) * 2023-12-27 2024-01-30 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and computer readable storage medium
CN117474796B (en) * 2023-12-27 2024-04-05 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication