CN117351299A - Image generation and model training method, device, equipment and storage medium

Image generation and model training method, device, equipment and storage medium

Info

Publication number
CN117351299A
CN117351299A
Authority
CN
China
Prior art keywords
parameter matrix
model
image
updated
generation
Prior art date
Legal status
Pending
Application number
CN202311184380.4A
Other languages
Chinese (zh)
Inventor
李弼
彭楠
希滕
张刚
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311184380.4A
Publication of CN117351299A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image generation and model training method, apparatus, device and storage medium, relating to the field of artificial intelligence, in particular to computer vision, deep learning, large models and the like, and applicable to scenarios such as image processing. The training method of the image generation model comprises the following steps: acquiring a first generation result of a teacher model; acquiring a second generation result of a student model, the student model being an image generation model to be trained; constructing a loss function based on the first and second generation results; and updating the first parameter matrix and the second parameter matrix based on the loss function, and obtaining an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix, where the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix. The present disclosure can reduce computing resource overhead.

Description

Image generation and model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to computer vision, deep learning, large models and the like; it can be applied to scenarios such as image processing, and relates in particular to an image generation and model training method, apparatus, device and storage medium.
Background
The diffusion model is a type of generative model that can be used to generate high-resolution images. It decomposes the image generation process into many denoising steps, i.e., the sampling process of the generative model. Because the number of sampling steps is large and a single sampling step requires two forward denoising (denoise) passes, sampling is slow. For example, with 50 sampling steps and two denoising passes per step, generating one image requires 100 forward passes through the denoising network.
Disclosure of Invention
The present disclosure provides an image generation and model training method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided a training method of an image generation model, including: acquiring a first generation result of a teacher model; acquiring a second generation result of a student model, where the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix; constructing a loss function based on the first and second generation results; and updating the first parameter matrix and the second parameter matrix based on the loss function, and obtaining an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
According to another aspect of the present disclosure, there is provided an image generation method, including: acquiring input features, where the input features are used to generate an image; and performing generation processing on the input features using an image generation model to generate an output image corresponding to the input features, where the image generation model is trained using the training method of any of the above aspects.
According to another aspect of the present disclosure, there is provided a training apparatus of an image generation model, including: a first acquisition module for acquiring a first generation result of a teacher model; a second acquisition module for acquiring a second generation result of a student model, where the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix; a construction module for constructing a loss function based on the first and second generation results; and an updating module for updating the first parameter matrix and the second parameter matrix based on the loss function and obtaining an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
According to another aspect of the present disclosure, there is provided an image generation apparatus, including: an acquisition module for acquiring input features, where the input features are used to generate an image; and a generation module for performing generation processing on the input features using an image generation model to generate an output image corresponding to the input features, where the image generation model is trained using the training method of any of the above aspects.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
The technical solution of the present disclosure can reduce computing resource overhead.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an overall architecture provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure;
fig. 8 is a schematic diagram of an electronic device for implementing a training method or image generation method of an image generation model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to increase the sampling speed, knowledge distillation can be adopted in the model training stage.
Under the knowledge distillation scheme, the overall architecture comprises a teacher model and a student model, and the teacher network and the student network are the same size. The teacher model performs two denoising processes: one is a conditional denoising (condition denoise) process, which yields a conditional generation result y_c; the other is an unconditional denoising (uncondition denoise) process, which yields an unconditional generation result y_uc. A weighting operation is performed on y_c and y_uc to obtain the final generation result y_t of the teacher model. The student model performs a single conditional denoising process to obtain its final generation result y_s. A loss function L is constructed based on the teacher model's final generation result y_t and the student model's final generation result y_s, and L is then used to adjust the model parameters of the student model.
In the related art, all model parameters of the student model need to be updated (learnable), which results in high computing resource overhead.
In order to reduce computing resource overhead, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a training method of an image generation model, which comprises the following steps:
101. Obtain a first generation result of a teacher model.
102. Obtain a second generation result of a student model; the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix.
103. Construct a loss function based on the first and second generation results.
104. Update the first parameter matrix and the second parameter matrix based on the loss function, and obtain an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
In order to accelerate sampling, a knowledge distillation architecture can be adopted for the image generation model, wherein the architecture comprises a teacher model and a student model.
The teacher model and the student model are both image generation models. They may use the same model structure and may be the same size.
The student model is the image generation model to be trained, that is, the image generation model that will ultimately be used. After training, the student model is used to generate images.
The output of the teacher model may be referred to as a first generation result and the output of the student model may be referred to as a second generation result, and in order to distill knowledge of the teacher model into the student model, a loss function is constructed based on the first generation result and the second generation result, and model parameters of the student model are updated with the loss function.
The student model is a deep neural network model, which typically has a multi-layer structure including, for example, convolution layers, pooling layers, attention layers, and the like. Each network layer has its own model parameters; for example, the model parameters of the student model include W_s^1, W_s^2, ....
In order to reduce the amount of computation and save computing resources, a subset of the parameter matrices can be selected from all model parameters for updating (learning); a parameter matrix to be updated is called a target parameter matrix.
The target parameter matrix may be updated through a learning process. Denote the target parameter matrix by W_s and its dimensions by d1 x d2. If W_s is learned as a whole, the number of parameters to be learned is d1 * d2; in general, d1 and d2 are both large values, so the amount of computation is large and the computing resource overhead is high.
In order to reduce computing resource overhead, in this embodiment the target parameter matrix is not learned directly as a whole; it can be decomposed into two parameter matrices, called the first parameter matrix and the second parameter matrix, and the target parameter matrix is computed based on these two matrices. The first and second parameter matrices are low-rank matrices, i.e., their ranks are smaller than the rank of the target parameter matrix. For example, denote the first and second parameter matrices by A and B, where A is d1 x r and B is r x d2; r is the rank of the two matrices and can be manually set to a small value, such as 32. The number of parameters to be learned is then d1 * r + r * d2 = r * (d1 + d2), which is significantly smaller than d1 * d2 and reduces the computing resource overhead.
In this embodiment, the first and second parameter matrices are updated based on the loss function, and the updated target parameter matrix is obtained from the updated first and second parameter matrices. The target parameter matrix is thus updated through the first and second parameter matrices. Because the ranks of these two matrices are small, the number of parameters to be learned is reduced compared with learning the target parameter matrix directly, which reduces computing resource overhead and improves model training efficiency.
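As an illustration of the parameter counting above, the following is a minimal PyTorch sketch, not taken from the patent: all names, shapes and initializations are assumptions. It expresses a d1 x d2 target matrix through two low-rank factors and compares the learnable parameter counts.

```python
import torch
import torch.nn as nn

class LowRankMatrix(nn.Module):
    """A d1 x d2 target parameter matrix expressed via two low-rank factors."""

    def __init__(self, d1: int, d2: int, r: int = 32):
        super().__init__()
        # Only A (d1 x r) and B (r x d2) are learnable, so the learnable
        # parameter count is r * (d1 + d2) instead of d1 * d2.
        self.A = nn.Parameter(torch.randn(d1, r) * 0.01)  # first parameter matrix
        self.B = nn.Parameter(torch.zeros(r, d2))         # second parameter matrix

    def weight(self) -> torch.Tensor:
        # Target parameter matrix computed from the two factors.
        return self.A @ self.B

d1, d2, r = 1024, 1024, 32
m = LowRankMatrix(d1, d2, r)
full = d1 * d2            # 1,048,576 parameters if W_s is learned as a whole
low_rank = r * (d1 + d2)  # 65,536 parameters with the factorization, a 16x reduction
print(full, low_rank, sum(p.numel() for p in m.parameters()))
```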
In order to better understand the embodiments of the present disclosure, application scenarios to which the embodiments of the present disclosure are applicable are described.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure. The scenario includes a user terminal 201 and a server 202. The user terminal 201 may include: personal computers (Personal Computer, PC), mobile phones, tablet computers, notebook computers, smart wearable devices, and the like. The server 202 may be a cloud server or a local server, and the user terminal 201 and the server 202 may communicate over a communication network, for example a wired network and/or a wireless network.
The image generation model may be trained by the server. The user terminal may send training data to the server, and the server trains on the training data to obtain the image generation model. Then, in the inference stage, an image can be generated by the server using the image generation model; alternatively, when the user terminal has offline image generation capability, the server may send the image generation model to the user terminal, and the image generation model can be used locally at the user terminal to generate images.
The image generation model may be a large model (Large Language Model, LLM).
LLMs have been a hot topic in the field of artificial intelligence in recent years. An LLM is a pre-trained language model that learns rich linguistic and world knowledge by pre-training on massive text data, and can achieve remarkable results on various natural language processing (Natural Language Processing, NLP) tasks. Applications such as ERNIE Bot and ChatGPT are developed on top of LLMs; they can generate fluent, logical and creative text content and even hold natural dialogues with humans. In a natural language processing scenario, the large model may be a Transformer-based Generative Pre-Training (GPT) model, an Enhanced Representation through Knowledge Integration (ERNIE) model, or the like.
In this embodiment, taking text-to-image generation as an example, the corresponding image generation model may be a diffusion model.
The diffusion model samples on the basis of a time step t. Its initial image is a noise image Z_T, and a denoised final image Z_0 is obtained by sampling a preset number of times T. Each sampling step (e.g., from time t to time t-1) may use a denoising network to process the current image Z_t and obtain the next image Z_{t-1}; sampling proceeds for t = T, T-1, ..., 1, yielding the final image Z_0.
Taking the Stable Diffusion model as an example, the number of sampling steps T is 50, i.e., 50 iterations are required. To accelerate sampling, a knowledge distillation architecture may be employed for model training.
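The sampling loop just described can be sketched as follows; this is a minimal illustration, assuming T = 50 as in the Stable Diffusion example, and `denoise` is a placeholder for the denoising network rather than the patent's actual model.

```python
import torch

def sample(denoise, shape=(1, 4, 64, 64), T: int = 50):
    """Iterative diffusion sampling: start from noise Z_T and denoise T times."""
    z = torch.randn(shape)        # Z_T: the initial noise image (latent)
    for t in range(T, 0, -1):     # t = T, T-1, ..., 1
        z = denoise(z, t)         # one sampling step: Z_t -> Z_{t-1}
    return z                      # Z_0: the denoised final image

# Dummy denoiser just to make the loop runnable end to end.
z0 = sample(lambda z, t: 0.98 * z)
```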
Thus, as shown in fig. 2, knowledge-distillation-based training may be performed on the training data to obtain a trained image generation model.
As shown in fig. 3, a teacher model and a student model are included under the knowledge distillation architecture.
Taking text-to-image generation as an example, the training data includes an image sample and a text sample. The image sample is processed by an image encoder to obtain image features; in fig. 3 the image features are represented as noisy latent features (latents). The text sample is processed by a text encoder to obtain text features.
Since the knowledge distillation architecture of this embodiment is meant to accelerate sampling rather than to reduce model size, the teacher model and the student model may have the same structure and the same size. For example, the teacher model and the student model are UNet-style networks of the same size.
The teacher model involves two denoising processes. One is a conditional denoising process, whose result is called the conditional generation result; the other is an unconditional denoising process, whose result is called the unconditional generation result. The input to the unconditional denoising process is only the noisy image, while the input to the conditional denoising process includes the text in addition to the noisy image.
Referring to fig. 3, the teacher model processes the input noisy latent features to obtain the unconditional generation result, and processes the input noisy latent features together with the text features to obtain the conditional generation result.
Then, a weighting operation (weighted sum) may be performed on the conditional generation result and the unconditional generation result to obtain the first generation result of the teacher model.
For the student model, to speed up sampling, there is only one denoising process instead of two. The result of this single denoising process is a conditional generation result.
Referring to fig. 3, the student model processes the input noisy latent features and the text features to obtain a conditional generation result, which serves as the second generation result of the student model.
A loss function, such as a mean square error (Mean Square Error, MSE) loss function, may then be constructed based on the first and second generation results.
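Putting the pieces of fig. 3 together, one distillation training step might look like the sketch below. This is an illustrative assumption, not the patent's code: the `model(z, text, t)` calling convention, the guidance weight `w = 7.5`, and all shapes are invented for the example, and which parameters the optimizer actually updates is the subject of the parameter-efficient scheme described next.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, z_t, text_feat, t, w=7.5):
    """One knowledge distillation step: two teacher passes, one student pass."""
    with torch.no_grad():
        y_uc = teacher(z_t, None, t)      # unconditional denoising pass
        y_c = teacher(z_t, text_feat, t)  # conditional denoising pass
        y_t = y_uc + w * (y_c - y_uc)     # weighted first generation result

    y_s = student(z_t, text_feat, t)      # single conditional pass (second result)

    loss = F.mse_loss(y_s, y_t)           # MSE loss between the two results
    optimizer.zero_grad()
    loss.backward()                       # gradients flow only into the student
    optimizer.step()
    return loss.item()

# Minimal runnable usage with stand-in networks (illustrative only).
net = torch.nn.Conv2d(4, 4, kernel_size=1)
opt = torch.optim.SGD(net.parameters(), lr=1e-4)
z = torch.randn(1, 4, 8, 8)
distill_step(lambda z, c, t: 0.9 * z, lambda z, c, t: net(z), opt, z, None, t=10)
```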
After the loss function is obtained, the model parameters of the student model may be updated with it. In order to reduce computing resource overhead, a parameter-efficient update scheme is adopted, i.e., the number of parameters that need to be learned is reduced.
Assume a certain model parameter matrix of the student model is W_s. In the related art, this matrix is learned as a whole, so the computing resource overhead is high; for example, the number of parameters to be learned is d1 * d2.
In addition, the student model comprises multiple network layers. In the related art, the model parameters of every network layer need to be learned and updated; for example, the parameter matrices of the multiple network layers, denoted W_s^1, W_s^2, ..., all require updating, further increasing the computing resource overhead.
In the diffusion model scene, the teacher model and the student model are both denoising networks and have the same size.
The denoising network may be a UNet-style network, which includes an encoder and a decoder, and cross-attention networks exist in the encoder and the decoder.
In order to reduce computing resource overhead, in this embodiment only the parameter matrices (attention weight matrices) in the cross-attention networks may be learned, while the parameter matrices of other layers are kept unchanged, instead of updating the parameter matrices of all layers. The model parameters that need to be updated may be called target parameter matrices. In addition, a target parameter matrix such as W_s is not learned as a whole; it can be decomposed into two low-rank matrices for learning, further reducing resource overhead.
In combination with the above application scenario, the present disclosure further provides the following embodiments.
Fig. 4 is a schematic diagram of a second embodiment of the present disclosure, where a training method of an image generation model is provided, as shown in fig. 4, and the method includes:
401. Perform generation processing on the image features of an image sample using a teacher model to obtain an unconditional generation result.
402. Perform generation processing on the image features and the text features of a text sample using the teacher model to obtain a conditional generation result.
403. Perform a weighting operation on the unconditional generation result and the conditional generation result to obtain a first generation result of the teacher model.
Referring to fig. 3, the image features are represented by noisy latent features. The image features may be obtained by extracting features from an image sample with an image encoder.
The text feature may be obtained by extracting features from a text sample using a text encoder.
Assume the conditional generation result is denoted y_c, the unconditional generation result is denoted y_uc, and the first generation result obtained after the weighting operation is denoted y_t. The weighting operation may be calculated as:
y_t = y_uc + w * (y_c - y_uc);
wherein w is a preset weight value.
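As a small worked example of this formula (the numbers, including the weight, are illustrative assumptions; the patent only states that w is preset):

```python
y_c, y_uc, w = 1.0, 0.2, 2.0       # illustrative scalar values only
y_t = y_uc + w * (y_c - y_uc)      # 0.2 + 2.0 * (1.0 - 0.2) = 1.8
```

With w > 1, the result is pushed beyond the conditional result y_c, away from the unconditional one.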
In this embodiment, the first generating result of the teacher model is obtained based on the unconditional generating result and the conditional generating result of the teacher model, and the generating result of the student model can be supervised by using the first generating result, so that the unconditional generating process and the conditional generating process of the teacher model can be distilled into a single generating process of the student model, and the sampling speed is further increased.
404. Perform generation processing on the image features of the image sample and the text features of the text sample using the student model to obtain a second generation result of the student model.
With reference to fig. 3, the student model has only one denoising process, and the denoising process is a conditional denoising process, that is, the image feature and the text feature are processed to obtain a conditional generation result of the student model, which is used as a second generation result.
In this embodiment, the student model processes the image features and the text features to obtain the second generation result; a single sampling pass of the student model can achieve the effect of the teacher model's two passes, increasing the sampling speed.
405. Construct a loss function based on the first and second generation results.
Referring to fig. 3, the loss function may be a mean square error (Mean Square Error, MSE) function. Assuming the second generation result is denoted y_s, an MSE loss function L can be constructed based on the first generation result y_t and the second generation result y_s, and the model parameters of the student model are updated based on L.
406. Update the first parameter matrix and the second parameter matrix based on the loss function, and obtain an updated target parameter matrix of the student model from the updated first parameter matrix and the updated second parameter matrix.
In order to reduce computing resource overhead, referring to fig. 3, the model parameters of the student model are updated in a parameter-efficient manner.
On the one hand, not all parameter matrices of the student model's network layers need to be updated: the parameter matrices of some network layers can be selected as target parameter matrices to be updated, while the parameter matrices of the remaining network layers are kept unchanged.
In this embodiment, the parameter matrices of some network layers are used as target parameter matrices for updating; only part of the parameter matrices, rather than all of them, are updated, which reduces the number of parameters to be updated and the resource overhead.
Further, the plurality of network layers includes an attention network layer, and the target parameter matrix is an attention weight matrix of the attention network layer.
For example, when the student model selects a UNet model that includes a cross-attention network, the attention weight matrix of the cross-attention network may be used as the target parameter matrix. Further, if the number of the cross-attention networks is plural, the attention weight matrix of one or more of the cross-attention networks may be used as the target parameter matrix.
Because the attention network plays an important role in improving the performance of the model, in the embodiment, the attention weight matrix is used as a target parameter matrix to update, so that the performance of the model can be improved, and the image generation effect can be improved.
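A sketch of this selection step under an assumed naming convention: here any parameter whose name contains "attn" is treated as a cross-attention weight matrix, which is purely an illustrative assumption, since real UNet implementations name their layers differently.

```python
import torch.nn as nn

def freeze_except_attention(student: nn.Module) -> None:
    """Learn only the attention weight matrices; keep all other layers fixed."""
    for name, param in student.named_parameters():
        # Parameters of (cross-)attention layers become the target parameter
        # matrices; every other parameter matrix stays unchanged.
        param.requires_grad = "attn" in name
```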
On the other hand, a single target parameter matrix is not learned as a whole; it is decomposed into two low-rank parameter matrices, the parameters in these two low-rank matrices are learned and updated, and the updated target parameter matrix is obtained from the updated low-rank matrices.
During decomposition, the target parameter matrix can be decomposed directly into the product of two low-rank parameter matrices; alternatively, the incremental part of the target parameter matrix can be decomposed into the product of two low-rank parameter matrices.
Accordingly, the product of the updated first parameter matrix and the updated second parameter matrix may be used as the updated target parameter matrix; or,
obtaining an increment part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix; and obtaining an updated target parameter matrix according to the target parameter matrix before updating and the increment part.
For example, for direct decomposition, denote the updated target parameter matrix by W_new and the two updated parameter matrices by A and B. The calculation formula may be:
W_new = A * B;
where W_new is d1 x d2, A is d1 x r, B is r x d2, and r is a manually selected small value, such as 32.
The first and second parameter matrices may be updated with a standard learning procedure, for example the stochastic gradient descent (Stochastic Gradient Descent, SGD) algorithm, in which gradients are calculated with the loss function L, and the post-update parameters are computed from the pre-update parameters and the gradients. The initial values of A and B may be randomly generated or fixed.
In this embodiment, the product of the updated first parameter matrix and the updated second parameter matrix is used as the updated target parameter matrix, so that the target parameter matrix can be simply, conveniently and efficiently updated, and the processing efficiency is improved.
As another example, when the incremental part is decomposed into two parameter matrices, denote the updated target parameter matrix by W_new, the pre-update target parameter matrix by W_old, the incremental part by ΔW, and the two updated parameter matrices by A and B. The calculation formula may be:
W_new = W_old + ΔW = W_old + A * B;
where W_new and W_old are d1 x d2, A is d1 x r, B is r x d2, and r is a manually selected small value, such as 32.
The first and second parameter matrices may be updated with a standard learning procedure, for example the stochastic gradient descent (Stochastic Gradient Descent, SGD) algorithm, in which gradients are calculated with the loss function L, and the post-update parameters are computed from the pre-update parameters and the gradients. The initial values of A and B may be randomly generated or fixed.
In this embodiment, the incremental part is calculated from the updated first and second parameter matrices, and the target parameter matrix is then updated according to that incremental part. Performing the low-rank decomposition only on the incremental part improves operation accuracy and thus model accuracy.
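A minimal sketch of the incremental variant, under the same illustrative assumptions as the earlier snippets (sizes, learning rate and initialization are invented; the direct variant would simply use A * B alone as W_new):

```python
import torch

d1, d2, r = 1024, 1024, 32
W_old = torch.randn(d1, d2)                        # frozen pre-update target matrix
A = torch.nn.Parameter(torch.randn(d1, r) * 0.01)  # first parameter matrix
B = torch.nn.Parameter(torch.zeros(r, d2))         # second matrix; zero init makes delta_W start at 0
opt = torch.optim.SGD([A, B], lr=1e-3)             # only A and B are learnable

# One illustrative SGD step on a stand-in loss.
target = torch.randn(d1, d2)
W_new = W_old + A @ B                  # updated target matrix: W_old + delta_W
loss = ((W_new - target) ** 2).mean()
loss.backward()                        # gradients reach only A and B
opt.step()

with torch.no_grad():
    W_new = W_old + A @ B              # recompute the updated target parameter matrix
```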
In addition, the model training process may be a multiple iteration process, in which the target parameter matrix is updated in the above manner in each iteration process, and a final target parameter matrix may be obtained through multiple iterations, that is, a final image generation model (student model) is obtained. The final student model may then be used for image generation.
In this embodiment, during knowledge distillation the target parameter matrix is updated via the two low-rank parameter matrices, which reduces the number of updated parameters and the consumption of computing resources. For downstream tasks of a diffusion generation model with little data, because the proportion of updated parameters is small, the capability of the pre-trained diffusion generation model is better preserved on the one hand, and a better distillation effect can be obtained on the downstream task on the other.
The above embodiments describe a model training process, and after training to obtain an image generation model, the image generation model may be used to generate an image.
Fig. 5 is a schematic diagram of a third embodiment of the present disclosure, which provides an image generating method, including:
501. input features are acquired, which are used to generate an image.
502. Perform generation processing on the input features using an image generation model to generate an output image corresponding to the input features.
Wherein the image generation model is trained using the training method of any one of the above.
In this embodiment, since an image generation model with low computational cost is adopted, resource overhead during image generation can be reduced and image generation efficiency can be improved.
In some embodiments, the input features include: image features of the noise image and text features of the prompt text; correspondingly, the output image is an image obtained after denoising the noise image.
Generally, a single sampling step includes two denoising processes; in this embodiment, only one denoising process is needed to achieve the effect of the original two. Therefore, on the basis of ensuring image quality, the sampling speed can be increased and image generation efficiency improved.
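For illustration, inference with the distilled student might look like the sketch below, carrying over the assumed interfaces from the training sketch: each of the T steps now needs only one conditional denoising pass instead of two.

```python
import torch

def generate(student, text_feat, shape=(1, 4, 64, 64), T: int = 50):
    """Generate an image with the distilled model: one denoise pass per step."""
    z = torch.randn(shape)            # image features of the noise image
    for t in range(T, 0, -1):
        z = student(z, text_feat, t)  # single conditional denoising pass
    return z                          # output image (denoised result)

img = generate(lambda z, c, t: 0.98 * z, text_feat=None)
```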
Fig. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. This embodiment provides a training apparatus for an image generation model. As shown in fig. 6, the apparatus 600 includes: a first acquisition module 601, a second acquisition module 602, a construction module 603 and an update module 604.
The first acquisition module 601 is configured to acquire a first generation result of a teacher model; the second acquisition module 602 is configured to acquire a second generation result of a student model, where the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix; the construction module 603 is configured to construct a loss function based on the first and second generation results; and the update module 604 is configured to update the first parameter matrix and the second parameter matrix based on the loss function and obtain an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
In this embodiment, the first and second parameter matrices are updated based on the loss function, and the updated target parameter matrix is obtained from the updated first and second parameter matrices. The target parameter matrix is thus updated through the first and second parameter matrices. Because the ranks of these two matrices are small, the number of parameters to be learned is reduced compared with learning the target parameter matrix directly, which reduces computing resource overhead and improves model training efficiency.
In some embodiments, the update module 604 is further configured to: and taking the product of the updated first parameter matrix and the updated second parameter matrix as the updated target parameter matrix.
In this embodiment, the product of the updated first parameter matrix and the updated second parameter matrix is used as the updated target parameter matrix, so that the target parameter matrix can be simply, conveniently and efficiently updated, and the processing efficiency is improved.
In some embodiments, the update module 604 is further configured to: obtaining an increment part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix; and obtaining an updated target parameter matrix according to the target parameter matrix before updating and the increment part.
In this embodiment, the incremental part is calculated from the updated first and second parameter matrices, and the target parameter matrix is then updated according to that incremental part. Performing the low-rank decomposition only on the incremental part improves operation accuracy and thus model accuracy.
In some embodiments, the student model includes a plurality of network layers, and the target parameter matrix is a parameter matrix of a portion of the plurality of network layers.
In this embodiment, the parameter matrices of some network layers are used as target parameter matrices for updating; only part of the parameter matrices, rather than all of them, are updated, which reduces the number of parameters to be updated and the resource overhead.
In some embodiments, the plurality of network layers includes an attention network layer, and the target parameter matrix is an attention weight matrix of the attention network layer.
Because the attention network plays an important role in improving the performance of the model, in the embodiment, the attention weight matrix is used as a target parameter matrix to update, so that the performance of the model can be improved, and the image generation effect can be improved.
In some embodiments, the first acquisition module 601 is further configured to: perform generation processing on the image features of an image sample using the teacher model to obtain an unconditional generation result; perform generation processing on the image features and the text features of a text sample using the teacher model to obtain a conditional generation result; and perform a weighting operation on the unconditional generation result and the conditional generation result to obtain the first generation result.
In this embodiment, the first generating result of the teacher model is obtained based on the unconditional generating result and the conditional generating result of the teacher model, and the generating result of the student model can be supervised by using the first generating result, so that the unconditional generating process and the conditional generating process of the teacher model can be distilled into a single generating process of the student model, and the sampling speed is further increased.
In some embodiments, the second acquisition module 602 is further configured to: perform generation processing on the image features of the image sample and the text features of the text sample using the student model to obtain the second generation result.
In this embodiment, the student model processes the image features and the text features to obtain the second generation result; a single sampling pass of the student model can achieve the effect of the teacher model's two passes, increasing the sampling speed.
Fig. 7 is a schematic diagram according to a fifth embodiment of the present disclosure. The present embodiment provides an image generating apparatus, as shown in fig. 7, the apparatus 700 includes: an acquisition module 701 and a generation module 702.
The acquisition module 701 is configured to acquire input features, where the input features are used to generate an image; the generation module 702 is configured to perform generation processing on the input features using an image generation model to generate an output image corresponding to the input features. The image generation model is trained using any of the training methods described above.
In this embodiment, since an image generation model with low computational cost is adopted, resource overhead during image generation can be reduced and image generation efficiency can be improved.
In some embodiments, the input features include: image features of the noise image and text features of the prompt text; correspondingly, the output image is an image obtained after denoising the noise image.
Generally, a single sampling step includes two denoising processes; in this embodiment, only one denoising process is needed to achieve the effect of the original two. Therefore, on the basis of ensuring image quality, the sampling speed can be increased and image generation efficiency improved.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of users' personal information comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 801 performs the respective methods and processes described above, for example, a training method of an image generation model or an image generation method. For example, in some embodiments, the training method of the image generation model or the image generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the image generation model or the image generation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a training method or an image generation method of the image generation model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A training method of an image generation model, comprising:
acquiring a first generation result of a teacher model;
obtaining a second generation result of a student model; the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix;
constructing a loss function based on the first and second generation results;
updating the first parameter matrix and the second parameter matrix based on the loss function, and obtaining an updated target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix.
2. The method of claim 1, wherein the obtaining the updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix comprises:
and taking the product of the updated first parameter matrix and the updated second parameter matrix as the updated target parameter matrix.
3. The method of claim 1, wherein the obtaining the updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix comprises:
obtaining an increment part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix;
and obtaining an updated target parameter matrix according to the target parameter matrix before updating and the increment part.
4. The method of claim 1, wherein the student model comprises a plurality of network layers, the target parameter matrix being a parameter matrix of a portion of the plurality of network layers.
5. The method of claim 4, wherein the plurality of network layers includes an attention network layer, the target parameter matrix being an attention weight matrix of the attention network layer.
6. The method of any of claims 1-5, wherein the obtaining a first generation of a teacher model comprises:
performing generation processing on image features of an image sample using the teacher model to obtain an unconditional generation result;
performing generation processing on the image features and text features of a text sample using the teacher model to obtain a conditional generation result;
and performing a weighting operation on the unconditional generation result and the conditional generation result to obtain the first generation result.
7. The method of any of claims 1-5, wherein the obtaining a second generation of the student model comprises:
performing generation processing on image features of an image sample and text features of a text sample using the student model to obtain the second generation result.
8. An image generation method, comprising:
acquiring input features, wherein the input features are used for generating images;
performing generation processing on the input features using an image generation model to generate an output image corresponding to the input features;
Wherein the image generation model is trained using the method of any of claims 1-6.
9. The method of claim 8, wherein the input features comprise: image features of the noise image and text features of the prompt text; correspondingly, the output image is an image obtained after denoising the noise image.
10. A training apparatus for an image generation model, comprising:
the first acquisition module is used for acquiring a first generation result of a teacher model;
the second acquisition module is used for acquiring a second generation result of a student model, wherein the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the ranks of the first parameter matrix and the second parameter matrix are both smaller than the rank of the target parameter matrix;
the construction module is used for constructing a loss function based on the first generation result and the second generation result; and
the update module is used for updating the first parameter matrix and the second parameter matrix based on the loss function, and obtaining the updated target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix.
11. The apparatus of claim 10, wherein the update module is further used for:
taking the product of the updated first parameter matrix and the updated second parameter matrix as the updated target parameter matrix.
12. The apparatus of claim 10, wherein the update module is further used for:
obtaining an incremental part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix; and
obtaining the updated target parameter matrix according to the target parameter matrix before updating and the incremental part.
13. The apparatus of claim 10, wherein the student model comprises a plurality of network layers, the target parameter matrix being a parameter matrix of a portion of the plurality of network layers.
14. The apparatus of claim 13, wherein the plurality of network layers comprises an attention network layer, the target parameter matrix being an attention weight matrix of the attention network layer.
15. The apparatus of any of claims 10-14, wherein the first acquisition module is further used for:
performing generation processing on image features of an image sample by using the teacher model to obtain an unconditional generation result;
performing generation processing on the image features and text features of a text sample by using the teacher model to obtain a conditional generation result; and
weighting the unconditional generation result and the conditional generation result to obtain the first generation result.
16. The apparatus of any of claims 10-14, wherein the second acquisition module is further used for:
performing generation processing on image features of an image sample and text features of a text sample by using the student model to obtain the second generation result.
17. An image generation apparatus, comprising:
the acquisition module is used for acquiring input features, wherein the input features are used for generating an image; and
the generation module is used for performing generation processing on the input features by using an image generation model to generate an output image corresponding to the input features;
wherein the image generation model is trained using the method of any of claims 1-6.
18. The apparatus of claim 17, wherein the input features comprise: image features of a noise image and text features of a prompt text; correspondingly, the output image is an image obtained by denoising the noise image.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
CN202311184380.4A 2023-09-13 2023-09-13 Image generation and model training method, device, equipment and storage medium Pending CN117351299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311184380.4A CN117351299A (en) 2023-09-13 2023-09-13 Image generation and model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117351299A 2024-01-05

Family

ID=89356494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311184380.4A Pending CN117351299A (en) 2023-09-13 2023-09-13 Image generation and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117351299A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination