CN115861462B - Training method and device for image generation model, electronic equipment and storage medium - Google Patents

Training method and device for image generation model, electronic equipment and storage medium

Info

Publication number
CN115861462B
CN115861462B CN202211268653.9A CN202211268653A
Authority
CN
China
Prior art keywords
image
sample image
text
generation model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211268653.9A
Other languages
Chinese (zh)
Other versions
CN115861462A (en)
Inventor
冯智达
张振宇
余欣彤
李岚欣
方晔玮
陈徐屹
刘佳祥
尹维冲
冯仕堃
孙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211268653.9A priority Critical patent/CN115861462B/en
Publication of CN115861462A publication Critical patent/CN115861462A/en
Application granted granted Critical
Publication of CN115861462B publication Critical patent/CN115861462B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a training method and apparatus for an image generation model, an electronic device and a storage medium, relates to the technical field of artificial intelligence, in particular to the technical fields of natural language processing, computer vision, deep learning and the like, and can be applied to scenes such as image denoising and image generation. The specific implementation scheme is as follows: the weight of each pixel unit is determined based on the region of interest of the sample image, the loss function is determined based on the weights, and the model parameters of the generation model are adjusted by adopting the loss function, so that the training effect of the generation model can be improved, and the image quality of images generated by the trained image generation model is improved.

Description

Training method and device for image generation model, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of natural language processing, computer vision, deep learning and the like, and can be applied to scenes such as image denoising, image generation and the like, and particularly relates to a training method, device, electronic equipment and storage medium of an image generation model.
Background
The task of generating an image based on text refers to inputting a text description in the form of natural language so that an image generation model outputs an image conforming to the text description. When an image is generated by an image generation model in this way, the quality of the image depends greatly on the training effect of the generation model. By improving the training effect of the model, the image quality of images generated by the image generation model can be improved.
Disclosure of Invention
The disclosure provides a training method and device for an image generation model, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a training method of an image generation model, including:
acquiring a sample image and acquiring a description text of the sample image; identifying the region of interest of the sample image so as to determine the weight of each pixel unit in the sample image according to whether each pixel unit in the sample image belongs to the region of interest; adopting an image generation model to perform noise reduction processing on the set noise map based on the description text so as to obtain a noise reduction image; determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image; and according to the loss function, carrying out model parameter adjustment on the image generation model to obtain a trained image generation model.
According to another aspect of the present disclosure, there is provided a training apparatus of an image generation model, including:
the acquisition module is used for acquiring a sample image and acquiring a description text of the sample image;
the first determining module is used for identifying the region of interest of the sample image so as to determine the weight of each pixel unit in the sample image according to whether each pixel unit in the sample image belongs to the region of interest;
the processing module is used for carrying out noise reduction processing on the set noise image by adopting an image generation model based on the description text so as to obtain a noise reduction image;
the second determining module is used for determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image;
and the training module is used for carrying out model parameter adjustment on the image generation model according to the loss function so as to obtain a trained image generation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in embodiments of the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a training method of an image generation model according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another method for training an image generation model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of another training method for an image generation model according to an embodiment of the present disclosure;
FIG. 4 is a schematic illustration of a diffusion model denoising process for a multi-frame sample image according to an embodiment of the disclosure;
FIG. 5 is a flowchart of another training method for an image generation model according to an embodiment of the present disclosure;
FIG. 6 is one of the schematic diagrams describing the text generation process provided by the present disclosure;
FIG. 7 is a flowchart of another training method for an image generation model according to an embodiment of the present disclosure;
FIG. 8 is a second schematic diagram of a generation process of descriptive text provided in the present disclosure;
fig. 9 is a schematic structural diagram of a training device for an image generation model according to an embodiment of the present disclosure;
fig. 10 is a block diagram of an example electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the continued development of deep learning, it is expected that, on the task of generating images based on text, the generated images can reach the quality of real photographs and human works of art. In order to improve the image quality of images produced by the generation model, the inventors found that existing generation models are deficient in their degree of understanding of the relation between the text description and the generated image; if the generation model's understanding of this relation can be enhanced during the training stage, the image quality of images generated by the trained generation model can be improved.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related image data and text data all conform to the provisions of relevant laws and regulations, and do not violate public order and good customs.
Fig. 1 is a flow chart of a training method of an image generation model according to an embodiment of the disclosure, as shown in fig. 1, the method includes:
step 101, acquiring a sample image and acquiring descriptive text of the sample image.
Wherein the descriptive text is a piece of text in natural-language form. The text is used to indicate the image content to be output by the image generation model to be trained. The image generation model may generate an image based on text; that is, the descriptive text may be preprocessed and then input into the image generation model, or may be input directly into the image generation model without modification, so that the image generation model generates an image based on the descriptive text.
The sample image is at least one frame and serves as the expected output of the image generation model to be trained. It is used to train the image generation model to output the corresponding sample image based on the descriptive text.
Step 102, identifying the region of interest of the sample image, so as to determine the weight of each pixel unit in the sample image according to whether each pixel unit in the sample image belongs to the region of interest.
The region of interest in the sample image is identified, so that the region of interest is marked out in the sample image. The sample image may be divided into a plurality of cells according to a set rule, each cell including at least one pixel, so that a cell may be referred to as a pixel unit. Those skilled in the art will appreciate that each pixel unit contains at least one pixel, and that the number of pixels contained in each pixel unit may be the same or different, without affecting the implementation of the technical scheme.
In order to determine whether the pixel unit belongs to the region of interest, the following manner may be adopted, which is not limited in this embodiment:
for any pixel unit, if each pixel contained in the pixel unit belongs to the region of interest, determining that the pixel unit belongs to the region of interest. Otherwise, the pixel unit does not belong to the region of interest.
Or for any pixel unit, if at least one pixel in the pixels contained in the pixel unit belongs to the region of interest, determining that the pixel unit belongs to the region of interest. Otherwise, the pixel unit does not belong to the region of interest.
Or, for any pixel unit, if more than a certain proportion of pixels in the pixels contained in the pixel unit belong to the region of interest, determining that the pixel unit belongs to the region of interest, otherwise, determining that the pixel unit does not belong to the region of interest.
After determining whether each pixel unit belongs to the region of interest in the above manner, corresponding weights are respectively configured for the regions of interest and the regions of non-interest. That is, the weight of the pixel unit belonging to any one region of interest in the sample image may be determined as the first value; and determining the weight of the pixel unit which is not in any region of interest in the sample image as a second value, wherein the first value can be larger than the second value.
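As a minimal sketch (not the disclosed implementation), the weight assignment described above could be realized roughly as follows; the pixel-unit size, the proportion threshold, and the values 1 + w_l and 1 for the first and second values are assumptions chosen for illustration.

```python
import numpy as np

def build_weight_map(image_hw, rois, unit=8, w_l=0.5, ratio=0.5):
    """Assign a weight to each pixel unit of a sample image.

    image_hw: (H, W) of the sample image.
    rois:     list of regions of interest, each as (top, left, bottom, right).
    unit:     side length (in pixels) of one pixel unit.
    w_l:      hyper-parameter; units inside a region of interest get 1 + w_l.
    ratio:    a unit belongs to a region of interest when more than this
              fraction of its pixels fall inside some region of interest.
    """
    h, w = image_hw
    # Pixel-level mask of the regions of interest.
    mask = np.zeros((h, w), dtype=bool)
    for top, left, bottom, right in rois:
        mask[top:bottom, left:right] = True

    # One weight per pixel unit (grid of unit x unit pixels).
    weights = np.ones((h // unit, w // unit), dtype=np.float32)
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            block = mask[i * unit:(i + 1) * unit, j * unit:(j + 1) * unit]
            if block.mean() > ratio:          # "more than a certain proportion"
                weights[i, j] = 1.0 + w_l     # first value
    return weights                            # remaining units keep the second value 1.0
```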
And 103, carrying out noise reduction processing on the set noise map by adopting an image generation model based on the descriptive text so as to obtain a noise reduction image.
The image generation model adopted in the embodiments of the disclosure may obtain a generated image by performing noise reduction on a set noise map with reference to the descriptive text, so that the noise-reduced image obtained after the noise reduction is taken as the final output image.
For example: the image generation model may specifically be a diffusion model. Diffusion models (diffusion models) are depth generation models, inspired by non-equilibrium thermodynamics. The diffusion model defines a Markov chain of diffusion steps, random noise is gradually added to the sample image through the diffusion process, and then the back diffusion process, i.e., the denoising process, is learned.
In the training process of the image generation model, the set noise image is subjected to noise reduction through the image generation model, so that the image generation model learns the process of denoising, and the difference between the noise reduction image and the sample image which is expected is minimized.
Step 104, determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image.
In the related art, in order to enable a loss function to characterize the difference between a sample image and a noise reduction image, the sample image is generally compared with a corresponding portion in the noise reduction image, and the loss function is determined based on the difference. In this embodiment, in order to enable the image generation model to better pay attention to the region of interest having higher relevance to the descriptive text, different weights are configured for the pixel units belonging to the region of interest and the pixel units not belonging to the region of interest.
As a possible implementation manner, for any one pixel unit, a difference may be made between a pixel value of the pixel unit in the sample image and a pixel value of a corresponding pixel unit in the noise reduction image, and the pixel value is multiplied by a weight of the pixel unit, and then the differences obtained by weighting the plurality of pixel units are summed to be used as a loss function.
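The possible implementation above can be sketched as follows, using a squared difference for illustration (the exact form of the content difference is not limited here); the tensor shapes and names are assumptions.

```python
import torch

def weighted_reconstruction_loss(sample, denoised, weights):
    """Weighted content difference between the sample image and the noise-reduced image.

    sample, denoised: tensors of shape (B, C, H, W).
    weights:          per-pixel-unit weights broadcastable to (B, 1, H, W);
                      pixel units in a region of interest carry a larger weight.
    """
    diff = (sample - denoised) ** 2              # per-pixel content difference
    weighted = weights * diff                    # multiply by the unit's weight
    return weighted.sum(dim=(1, 2, 3)).mean()    # sum the weighted differences, average over the batch
```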
And 105, performing model parameter adjustment on the image generation model according to the loss function to obtain a trained image generation model.
According to the gradient of the loss function, the model parameters are adjusted along the descent direction of the loss function, so as to obtain a trained image generation model.
It should be noted that the foregoing steps 101 to 105 need to be repeated multiple times, and a different sample image may be used each time; training is stopped when the loss function is smaller than a threshold value, or when the number of repetitions is greater than a threshold value. The image generation model obtained after the last model parameter adjustment is taken as the trained image generation model.
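A minimal sketch of this repeated execution and its stopping conditions might look as follows; the data loader, optimizer, threshold values and the reuse of the weighted_reconstruction_loss sketch above are assumptions for illustration.

```python
def train(model, optimizer, data_loader, loss_threshold=0.01, max_steps=100_000):
    """Repeat steps 101-105 until the loss or the repetition count reaches a threshold."""
    step = 0
    for sample, text, weights, noise in data_loader:     # a different sample each time
        denoised = model(noise, text)                     # step 103: denoise the set noise map
        loss = weighted_reconstruction_loss(sample, denoised, weights)  # step 104
        optimizer.zero_grad()
        loss.backward()                                   # step 105: adjust model parameters
        optimizer.step()
        step += 1
        if loss.item() < loss_threshold or step >= max_steps:
            break                                         # stop training
    return model                                          # model after the last adjustment
```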
According to the training method for the image generation model, because the relevance between the region of interest of the sample image and the descriptive text of the sample image is large, the weight of each pixel unit is determined on the basis of the region of interest for the sample image, and then the loss function is determined on the basis of the weight, so that the loss function can carry the relevant information of the relevance between the region of interest and the descriptive text, model parameters of the generation model are adjusted by adopting the loss function, the understanding degree of the generation model on the relation between the descriptive text and the generation image can be enhanced in the training stage, and the image quality of the image generated by the image generation model after training can be improved.
Fig. 2 is a flow chart of another training method of an image generation model according to an embodiment of the disclosure, as shown in fig. 2, the method includes:
step 201, a sample image is acquired, and descriptive text of the sample image is acquired.
Reference is made to the related descriptions in the foregoing embodiments, and details are not repeated in this embodiment.
At step 202, at least one region of interest is identified for the sample image.
In the embodiments of the present disclosure, the region of interest is generally a region containing a target object to be identified, which may be a person, an animal, a plant, an artwork, or the like, which is not listed here. There may be multiple regions of interest in the sample image, and in this step, there is no need to limit the number of regions of interest, and the identified regions of interest may be one or multiple.
In step 203, the weight of the pixel unit belonging to any region of interest in the sample image is determined as the first value.
If the pixel unit belongs to any region of interest, the weight of the pixel unit can be set to a first value.
And 204, determining the weight of the pixel unit which is not in any region of interest in the sample image as a second value.
Wherein the first value is greater than the second value.
If the pixel unit does not belong to any region of interest, the correlation between the image content of the pixel unit and the descriptive text is weak, so that the weight value of the pixel unit is reduced to a second value, and the influence of the difference value of the partial pixel unit on the whole value of the loss function is reduced in the subsequent calculation of the loss function. The value of the loss function is greatly influenced by the pixel units in the region of interest, so that the trained image generation model is more focused on the region of interest, and the correlation between the region of interest and the description text is stronger, thereby increasing the correlation between the image generated by the image generation model and the description text.
Alternatively, the second value may be inversely related to the minimum distance between the corresponding pixel unit and the region of interest. That is, for pixel units that do not belong to any region of interest, the greater the minimum distance to a region of interest, the smaller the weight. The degree of association between a pixel region and the descriptive text can thus be distinguished finely through the size of the weights.
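A minimal sketch of such a distance-dependent second value, assuming a simple reciprocal decay (the concrete decay form is not specified in the disclosure), could be:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_based_weights(mask, w_l=0.5, decay=0.05):
    """Weights that decrease with the minimum distance to the region of interest.

    mask: boolean (H, W) array, True inside any region of interest.
    Inside a region of interest the weight is 1 + w_l (the first value);
    outside, the weight is inversely related to the distance to the nearest ROI.
    """
    dist = distance_transform_edt(~mask)          # 0 inside the ROI, grows with distance outside
    weights = 1.0 / (1.0 + decay * dist)          # second value shrinks as the distance grows
    weights[mask] = 1.0 + w_l                     # first value inside the ROI
    return weights.astype(np.float32)
```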
Step 205, performing noise reduction processing on the set noise map based on the descriptive text by adopting the image generation model to obtain a noise reduction image.
Step 206, determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image.
Step 207, performing model parameter adjustment on the image generation model according to the loss function.
Step 205 to step 207 refer to the related descriptions in the foregoing embodiments, and are not repeated in this embodiment.
In the training method of the image generation model, because the relevance between the region of interest of the sample image and the description text of the sample image is large, the weight of each pixel unit is determined on the basis of the region of interest for the sample image, and then the loss function is determined on the basis of the weight, so that the loss function can carry the relevant information of the relevance between the region of interest and the description text, the model parameters of the generation model can be adjusted by adopting the loss function, the understanding degree of the generation model on the relation between the description text and the generation image can be enhanced in the training stage, and the image quality of the image generated by the image generation model after training can be improved.
Fig. 3 is a flow chart of another training method of an image generation model according to an embodiment of the disclosure, as shown in fig. 3, the method includes:
Step 301, obtaining a sample image and obtaining a description text of the sample image, wherein the sample image is a plurality of frames which are sequentially arranged, and a later frame of sample image is obtained by performing noise superposition on a previous frame of sample image.
At step 302, at least one region of interest is identified for at least one frame of sample images.
In step 303, the weight of the pixel unit belonging to any region of interest in at least one frame of sample image is determined as the first value.
And 304, determining the weight of the pixel unit which is not in any region of interest in at least one frame of sample image as a second value.
Wherein the first value is greater than the second value.
And 305, performing at least one noise reduction treatment on the set noise map by adopting an image generation model based on the descriptive text so as to obtain at least one frame of noise reduction image which is arranged in sequence.
The noise reduction image of the next frame is obtained by performing noise reduction processing on the noise reduction image of the previous frame.
Optionally, the descriptive text is encoded by an encoder to obtain a semantic vector of the descriptive text; the semantic vector is input into the image generation model, so that the image generation model performs at least one noise reduction process on the set noise map by adopting an attention mechanism based on the semantic vector, to obtain at least one frame of noise-reduced image arranged in sequence, where the noise-reduced image of a later frame is obtained by performing noise reduction on the noise-reduced image of the previous frame. Through the multi-frame noise-reduced images, the training difficulty of the model can be reduced, so that the model gradually learns the process of reverse noise diffusion.
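A hedged sketch of this step is given below; the encoder interface, the simplified update rule and the function names are illustrative placeholders, not the disclosed architecture.

```python
import torch

@torch.no_grad()
def denoise_with_text(eps_theta, encode_text, description, z_T, steps):
    """Iteratively denoise a set noise map, conditioned on the descriptive text.

    eps_theta:   callable (z_t, t, text_vec) -> predicted noise; assumed to apply
                 an attention mechanism over the text semantic vector internally.
    encode_text: callable mapping the descriptive text to its semantic vector.
    z_T:         the set noise map, shape (B, C, H, W).
    Returns the sequence of noise-reduced frames; the last frame is the output.
    """
    text_vec = encode_text(description)     # semantic vector of the descriptive text
    frames, z = [], z_T
    for t in reversed(range(1, steps + 1)):
        eps = eps_theta(z, t, text_vec)      # predicted noise at this time step
        z = z - eps / steps                  # simplified update; real samplers differ
        frames.append(z)                     # next frame = denoised previous frame
    return frames
```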
Step 306, determining a loss function according to the content difference between each pixel unit in at least one frame of sample image and the corresponding pixel unit in the corresponding noise reduction image, and the weight of each pixel unit in the sample image.
As a possible implementation manner, for any frame of sample image, a comparison is performed with a frame of noise reduction image corresponding to the sequence, so as to determine a loss component of each frame of sample image according to a content difference between each pixel unit in the sample image and a corresponding pixel unit in the noise reduction image and a weight of each pixel unit in the sample image. A loss function is determined based on the sum of the loss components of each frame of sample images. Because the difference between each frame of sample image and the corresponding noise reduction image is counted into a loss function, the training accuracy is improved, and the training effect is better.
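A brief sketch of this first implementation, reusing the weighted_reconstruction_loss sketch above (the frame pairing is illustrative), might be:

```python
def multi_frame_loss(sample_frames, denoised_frames, weights):
    """Sum the loss components of each frame; sample_frames[k] pairs with denoised_frames[k]."""
    total = 0.0
    for sample, denoised in zip(sample_frames, denoised_frames):
        total = total + weighted_reconstruction_loss(sample, denoised, weights)
    return total
```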
As another possible implementation, the loss function is determined according to the content difference between each pixel unit in the first frame of sample image and the corresponding pixel unit in the last frame of noise-reduced image, and the weight of each pixel unit in the sample image. Because the loss function only needs to be determined according to the difference between part of the sample images and the corresponding noise-reduced images, the amount of calculation is reduced, the calculation speed is improved, and the model training speed is improved.
Step 307, performing model parameter adjustment on the image generation model according to the loss function.
In order to clearly explain the manner of determining the loss function in the present embodiment, as shown in fig. 4, the noise reduction process of the diffusion model on the multi-frame sample image is schematically illustrated.
The text-to-image generation technology based on the diffusion model models the generation task as a continuous denoising process, that is, starting from a completely noisy image and taking the image obtained after repeatedly denoising for a plurality of steps as the final generation result.
As shown in FIG. 4, the images in the sequence are indexed by the subscript of Z, which is also referred to as the time step t; for example, FIG. 4 contains T time steps in total, identified as Z_1 to Z_T.
The loss function of the diffusion model is directly calculated in image space, and may be, for example:

$$\mathcal{L}_{simple} = \mathbb{E}_{t,\, z_0,\, \epsilon \sim \mathcal{N}(0,\, I)} \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert^2 \right]$$

where t represents the time step (t = 1...T), ε represents the added noise, ε ~ N(0, I) represents that the noise ε is sampled from a normal distribution, z_0 represents the original image without added noise, z_t represents the image after noise has been added repeatedly for t time steps, and ε_θ represents the neural network model, whose inputs are the noise-adding result z_t of the t-th time step and the time step t. The loss function uses the mean square error (MSE) to make the prediction result ε_θ(z_t, t) approximate the truly added noise ε, and it is desired that the mean square error of the model at all time steps t be as small as possible.

In the embodiments of the disclosure, a weight matrix W is further applied so that the pixel units in the region of interest contribute more to the loss, for example:

$$W_{ij} = \begin{cases} 1 + w_l, & (i, j) \in x_{key} \\ 1, & \text{otherwise} \end{cases}, \qquad \mathcal{L} = \mathbb{E}_{t,\, z_0,\, \epsilon} \left[ \sum_{i=1}^{M} \sum_{j=1}^{M} W_{ij} \cdot loss_{ij} \right]$$

wherein W represents the weight matrix, M represents the image length and width, w_l is the set hyper-parameter, and x_key represents the image area corresponding to the key object. The parameters i and j represent the lateral and longitudinal coordinates of the pixel unit in the image, and W_ij represents, in the weight matrix, the weight of the pixel-unit loss loss_ij at positions i and j.
According to the training method for the image generation model, because the relevance between the region of interest of the sample image and the descriptive text of the sample image is large, the weight of each pixel unit is determined on the basis of the region of interest for the sample image, and then the loss function is determined on the basis of the weight, so that the loss function can carry the relevant information of the relevance between the region of interest and the descriptive text, model parameters of the generation model are adjusted by adopting the loss function, the understanding degree of the generation model on the relation between the descriptive text and the generation image can be enhanced in the training stage, and the image quality of the image generated by the image generation model after training can be improved.
Fig. 5 is a flowchart of another training method for an image generation model according to an embodiment of the disclosure, as shown in fig. 5, where the method includes:
Step 501, a sample image is acquired.
Step 501 may refer to the explanation in the foregoing embodiments, and is not repeated here.
In step 502, object detection is performed on the sample image to identify the name of the key object and/or the attribute of the key object in the sample image.
Optionally, an object segmentation model is applied to the sample image, or target detection is performed on the sample image based on an object-detection model that outputs object boxes, so as to identify the names and attribute information of key objects presented in the sample image. Alternatively, the names of the key objects presented in the sample image are identified, and the corresponding attribute information is queried based on the names. As fine-grained knowledge, the key objects and attributes identified by target detection may well not appear in the original text, which hinders the image generation model from learning the correspondence between the text and the image. Therefore, the object-class labels and attribute-class labels corresponding to the recognition results are supplemented into the original text, so that the alignment relationship between the text and the image is enhanced and additional visual knowledge is better integrated into the training process.
The key object may be an object in the center position in the sample image, or an object with a larger occupied area, or an object that is emphasized by the sample image.
In step 503, descriptive text is generated according to the name of the key object and/or the attribute of the key object.
Optionally, acquiring an original text corresponding to the sample image; and performing text splicing on the original text and the names of the key objects and/or the attributes of the key objects to obtain descriptive text. Due to the splicing mode, the method is simple to realize, has small calculated amount and is beneficial to reducing the resources required by training.
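A minimal sketch of this text-splicing step is given below; the label format follows the "flower; red" / "vase; blue" example of FIG. 6 and is otherwise an assumption.

```python
def build_descriptive_text(original_text, detections):
    """Splice the original text with detected key-object names and attributes.

    detections: list of (name, attribute) pairs from target detection,
                e.g. [("flower", "red"), ("vase", "blue")].
    """
    parts = [original_text]
    for name, attribute in detections:
        parts.append(f"{attribute} {name}" if attribute else name)
    return ", ".join(parts)

# build_descriptive_text("a bundle of flowers", [("flower", "red"), ("vase", "blue")])
# -> "a bundle of flowers, red flower, blue vase"
```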
Further, taking an image title displayed in a page to which the sample image belongs as an original text; or reading an original text corresponding to the sample image from an image library to which the sample image belongs; alternatively, the label of the sample image is taken as the original text. By adopting a plurality of different ways to acquire the original text, the description angles of the original text to the image can be different, which is beneficial to the image generation model to align the description text with different dimensions with the sample image, thereby learning more mapping relations between the image and the text.
Step 504, performing region of interest identification on the sample image, so as to determine the weight of each pixel unit in the sample image according to whether each pixel unit in the sample image belongs to the region of interest.
And 505, carrying out noise reduction processing on the set noise map by adopting an image generation model based on the descriptive text so as to obtain a noise reduction image.
Step 506, determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image, and the weight of each pixel unit in the sample image.
And 507, performing model parameter adjustment on the image generation model according to the loss function.
The principles of steps 504-507 may be the same as those described in the previous embodiments, and are not repeated here.
In order to clearly illustrate the generation process of the descriptive text in this embodiment, FIG. 6 is one of the schematic diagrams of the descriptive-text generation process provided in this disclosure. As shown in FIG. 6, the original text corresponding to the sample image is "a bundle of flowers". In order to enrich the content of the original text and obtain a descriptive text, in the processing shown in FIG. 6, the sample image is input into a target detection model, so that two detection boxes are obtained, which are respectively marked with the labels "flower; red" and "vase; blue". The original text "a bundle of flowers" is spliced with the labels "flower; red" and "vase; blue", thus obtaining "a bundle of flowers, red flowers, blue flower vase", which is used as the descriptive text.
Meanwhile, as shown in FIG. 6, the areas corresponding to the two detection boxes are taken as regions of interest; the weight of pixel units belonging to a region of interest is determined to be 1 + w_l, and the weight of pixel units outside the regions of interest is determined to be 1.
According to the training method for the image generation model, because the relevance between the region of interest of the sample image and the descriptive text of the sample image is large, the weight of each pixel unit is determined on the basis of the region of interest for the sample image, and then the loss function is determined on the basis of the weight, so that the loss function can carry the relevant information of the relevance between the region of interest and the descriptive text, model parameters of the generation model are adjusted by adopting the loss function, the understanding degree of the generation model on the relation between the descriptive text and the generation image can be enhanced in the training stage, and the image quality of the image generated by the image generation model after training can be improved.
Fig. 7 is a flowchart of another training method for an image generation model according to an embodiment of the disclosure, as shown in fig. 7, where the method includes:
step 701, acquiring a sample image and corresponding original text.
The original text corresponding to the sample image generally contains relatively little semantic information.
In step 702, object detection is performed on the sample image to identify the names of key objects and/or the attributes of the key objects in the sample image.
Step 702 may refer to the explanation in the foregoing embodiments, and is not repeated here.
In step 703, the original text is text spliced with the name of the key object and/or the attribute of the key object, so as to obtain a first candidate text.
In the embodiments of the disclosure, the names of the key objects and/or the attributes of the key objects obtained by performing target detection on the sample image are spliced with the original text. Optionally, semantic de-duplication processing may be performed before splicing, which improves the splicing quality without reducing the amount of text information, so that the semantic information content of the obtained first candidate text is increased.
And step 704, carrying out semantic recognition on the sample image by adopting a text generation model to take the text output by the text generation model as a second candidate text.
Image description generation is the inverse of text generation images, i.e., a text generation model (otherwise known as an image description generation model) generates descriptive text given an image. The text generated by the text generation model is generally concise and clear, and can capture more complex and precise semantics. Thus, we use the text generation model to additionally generate one descriptive text, i.e., the second candidate text, for the sample images in all training data, thereby helping to further enrich the meaning of the descriptive text.
Step 705, selecting descriptive text from the first candidate text and the second candidate text.
Since the first candidate text and the second candidate text are generated in different ways but may contain overlapping content, the descriptive text is selected from the two at random, which helps the image generation model learn the mapping relation between texts generated in different dimensions and the image, without adding repeated information.
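A minimal sketch of this random selection (function and variable names are illustrative):

```python
import random

def select_descriptive_text(first_candidate, second_candidate):
    """Randomly pick the descriptive text from the two candidates (cf. text 2 / text 1 in FIG. 8)."""
    return random.choice([first_candidate, second_candidate])
```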
Step 706, performing region of interest identification on the sample image, so as to determine the weight of each pixel unit in the sample image according to whether each pixel unit in the sample image belongs to the region of interest.
Step 707, performing noise reduction processing on the set noise map based on the description text by using the image generation model to obtain a noise reduction image.
Step 708, determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image, and the weight of each pixel unit in the sample image.
Step 709, performing model parameter adjustment on the image generation model according to the loss function.
The principles of steps 706 to 708 may be the same as those of the previous embodiments, and are not repeated here.
For clarity of explanation of the descriptive-text generation process in this embodiment, FIG. 8 is a second schematic diagram of the descriptive-text generation process. As shown in FIG. 8, the original text corresponding to the sample image is "a bundle of flowers". In order to enrich the content of the original text and obtain a descriptive text, in the processing shown in FIG. 8, text 2 is obtained by inputting the sample image into the target detection model of FIG. 6, so that two detection boxes are obtained, which are respectively marked with the labels "flower; red" and "vase; blue". The original text "a bundle of flowers" is spliced with the labels "flower; red" and "vase; blue", so as to obtain "a bundle of flowers, red flowers and blue flower vase" as text 2. As shown in FIG. 8, the sample image is also input into an image description generation model which has learned the image-to-text mapping relationship, so that an expression "a bundle of red flowers in a blue vase" can be output as text 1 based on the sample image. As shown in FIG. 8, text 1 or text 2 is randomly selected as the descriptive text and paired with the sample image as a training sample for training the image generation model.
According to the training method for the image generation model, because the relevance between the region of interest of the sample image and the descriptive text of the sample image is large, the weight of each pixel unit is determined on the basis of the region of interest for the sample image, and then the loss function is determined on the basis of the weight, so that the loss function can carry the relevant information of the relevance between the region of interest and the descriptive text, model parameters of the generation model are adjusted by adopting the loss function, the understanding degree of the generation model on the relation between the descriptive text and the generation image can be enhanced in the training stage, and the image quality of the image generated by the image generation model after training can be improved.
Fig. 9 is a schematic structural diagram of a training device for an image generation model according to an embodiment of the present disclosure, as shown in fig. 9, where the device includes:
an obtaining module 91, configured to obtain a sample image, and obtain a description text of the sample image.
The first determining module 92 is configured to identify a region of interest of the sample image, so as to determine a weight of each pixel unit in the sample image according to whether each pixel unit in the sample image belongs to the region of interest.
And the processing module 93 is used for carrying out noise reduction processing on the set noise graph by adopting an image generation model based on the description text so as to obtain a noise reduction image.
A second determining module 94 is configured to determine a loss function according to a content difference between each pixel unit in the sample image and a corresponding pixel unit in the noise reduction image, and a weight of each pixel unit in the sample image.
And the training module 95 is configured to perform model parameter adjustment on the image generation model according to the loss function, so as to obtain a trained image generation model.
Further, in an implementation manner of the embodiment of the present disclosure, the first determining module is configured to:
identifying the sample image to obtain at least one region of interest;
determining the weight of a pixel unit belonging to any region of interest in the sample image as a first value;
determining the weight of the pixel unit which is not in any region of interest in the sample image as a second value; wherein the first value is greater than the second value.
In one implementation manner of the embodiment of the disclosure, the second value is in an inverse relationship with a minimum distance between the corresponding pixel unit and the region of interest.
In one implementation of the embodiment of the disclosure, the obtaining module 91 includes:
the detection unit is used for carrying out target detection on the sample image so as to identify and obtain the names of key objects and/or the attributes of the key objects in the sample image;
and the generation unit is used for generating the description text according to the name of the key object and/or the attribute of the key object.
In one implementation manner of the embodiment of the disclosure, the generating unit is configured to:
acquiring an original text corresponding to the sample image;
and performing text splicing on the original text and the names of the key objects and/or the attributes of the key objects to obtain the description text.
In one implementation manner of the embodiment of the disclosure, the generating unit is configured to:
taking an image title displayed in a page to which the sample image belongs as the original text; or,
reading an original text corresponding to the sample image from an image library to which the sample image belongs; or,
and taking the label of the sample image as the original text.
In one implementation of the embodiment of the disclosure, the obtaining module 91 is configured to:
and carrying out semantic recognition on the sample image by adopting a text generation model so as to determine the descriptive text according to the text output by the text generation model.
In one implementation of the embodiment of the disclosure, the obtaining module 91 is configured to:
performing target detection on the sample image to identify and obtain the names of key objects and/or the attributes of the key objects in the sample image;
generating a first candidate text according to the name of the key object and/or the attribute of the key object;
carrying out semantic recognition on the sample image by adopting a text generation model, and taking a text output by the text generation model as a second candidate text;
the descriptive text is selected from the first candidate text and the second candidate text.
In one implementation of the disclosed embodiment, the processing module 93 includes:
the coding unit is used for coding the description text by adopting an encoder so as to obtain a semantic vector of the description text;
the processing unit is used for inputting the semantic vector into the image generation model so that the image generation model carries out at least one noise reduction treatment on the set noise image by adopting an attention mechanism based on the semantic vector to obtain at least one frame of noise reduction image which is arranged in sequence; the noise reduction image of the next frame is obtained by performing noise reduction processing on the noise reduction image of the previous frame.
In an implementation manner of the embodiment of the disclosure, the sample images are sequentially arranged multi-frames, and a later frame of sample image is obtained by performing noise superposition on a previous frame of sample image;
a second determining module 94 for:
comparing any frame of sample image with a frame of noise reduction image corresponding to the sequence, so as to determine loss components of the sample image of each frame according to content differences between pixel units in the sample image and corresponding pixel units in the noise reduction image and weights of the pixel units in the sample image;
the loss function is determined based on a sum of loss components of the sample images for each frame.
In one implementation of the disclosed embodiment, the second determining module 94 is configured to:
and determining the loss function by using the content difference between each pixel unit in the sample image of the first frame and the corresponding pixel unit in the noise reduction image of the last frame and the weight of each pixel unit in the sample image.
It should be noted that the explanation of the foregoing method embodiments is also applicable to the apparatus of this embodiment, the principle being the same, and is not repeated here.
According to the training device for the image generation model, the weights of all pixel units are determined on the basis of the region of interest on the sample image, the loss function is determined on the basis of the weights, and the model parameters of the generation model are adjusted by adopting the loss function, so that the training effect of the generation model can be improved, and the image quality of the image generated after the image generation model is trained is improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 is a block diagram of an example electronic device provided by an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1002 or a computer program loaded from a storage unit 1008 into a RAM (Random Access Memory ) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An I/O (Input/Output) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a CPU (Central Processing Unit ), GPU (Graphic Processing Units, graphics processing unit), various dedicated AI (Artificial Intelligence ) computing chips, various computing units running machine learning model algorithms, DSP (Digital Signal Processor ), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, for example, a training method of an image generation model. For example, in some embodiments, the training method of the image generation model described above may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When a computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the method embodiments described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the training method of the image generation model described above in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit System, FPGA (Field Programmable Gate Array ), ASIC (Application-Specific Integrated Circuit, application-specific integrated circuit), ASSP (Application Specific Standard Product, special-purpose standard product), SOC (System On Chip ), CPLD (Complex Programmable Logic Device, complex programmable logic device), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, EPROM (Electrically Programmable Read-Only-Memory, erasable programmable read-Only Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display ) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network ), WAN (Wide Area Network, wide area network), internet and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology and the like.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

1. A training method of an image generation model, comprising:
acquiring a sample image and a description text of the sample image, wherein the sample image is at least one frame, is an expected value output by an image generation model to be trained, and is used for training the image generation model to output a corresponding sample image based on the description text, and the description text of the sample image is used for indicating image content output by the image generation model to be trained;
identifying the sample image to obtain at least one region of interest;
determining the weight of a pixel unit belonging to any region of interest in the sample image as a first value;
determining the weight of the pixel unit which is not in any region of interest in the sample image as a second value; wherein the first value is greater than the second value;
performing noise reduction processing on a set noise image based on the description text by adopting an image generation model, so as to obtain a noise reduction image;
determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image;
and adjusting model parameters of the image generation model according to the loss function to obtain a trained image generation model.
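By way of illustration only (this sketch is not part of the claims), the weighted loss of claim 1 might look roughly as follows in code. The use of PyTorch, the tensor shapes and the squared-error content difference are assumptions; the claim fixes neither a framework nor a particular distance measure.

    import torch

    def roi_weighted_loss(sample_image, denoised_image, roi_masks,
                          first_value=1.0, second_value=0.5):
        # sample_image, denoised_image: float tensors of shape (C, H, W)
        # roi_masks: boolean tensor of shape (N, H, W), one mask per region of interest
        # Pixel units inside any region of interest receive the larger first value,
        # all remaining pixel units receive the second value.
        in_any_roi = roi_masks.any(dim=0)
        weights = torch.where(
            in_any_roi,
            torch.full_like(in_any_roi, first_value, dtype=torch.float32),
            torch.full_like(in_any_roi, second_value, dtype=torch.float32),
        )
        # Content difference per pixel unit (squared error is an assumption).
        diff = (sample_image - denoised_image).pow(2).mean(dim=0)
        # Loss is the weighted average of the per-pixel content differences.
        return (weights * diff).sum() / weights.sum()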
2. The method of claim 1, wherein the second value is inversely related to a minimum distance of the corresponding pixel unit from the region of interest.
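A minimal sketch of one inverse relationship satisfying claim 2 is given below; the Euclidean distance transform and the 1/(1+d) decay are assumptions, since the claim only requires that the second value fall as the minimum distance to a region of interest grows.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def distance_based_weights(roi_mask, first_value=1.0):
        # roi_mask: boolean array of shape (H, W), True inside any region of interest.
        # Distance from each pixel unit to the nearest region-of-interest pixel
        # (zero for pixels inside a region of interest).
        dist = distance_transform_edt(~roi_mask)
        # The second value decreases as the minimum distance grows; the exact
        # 1 / (1 + d) form is only one possible choice.
        second_value = first_value / (1.0 + dist)
        return np.where(roi_mask, first_value, second_value)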
3. The method of claim 1, wherein the obtaining descriptive text of the sample image comprises:
performing target detection on the sample image to identify and obtain the names of key objects and/or the attributes of the key objects in the sample image;
and generating the description text according to the name of the key object and/or the attribute of the key object.
4. A method according to claim 3, wherein said generating said descriptive text from the name of said key object and/or the attribute of said key object comprises:
acquiring an original text corresponding to the sample image;
and performing text splicing on the original text and the names of the key objects and/or the attributes of the key objects to obtain the description text.
5. The method of claim 4, wherein the obtaining the original text corresponding to the sample image comprises:
taking an image title displayed in a page to which the sample image belongs as the original text; or,
reading an original text corresponding to the sample image from an image library to which the sample image belongs; or,
and taking the label of the sample image as the original text.
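Claims 3 to 5 obtain the description text by splicing detected key-object names and attributes onto an original text taken from a page title, an image library or a label. The toy sketch below assumes a detector that returns (name, attributes) pairs and uses simple comma splicing; neither detail is fixed by the claims.

    def build_description_text(original_text, detections):
        # original_text: title, library caption or label associated with the sample image.
        # detections: list of (name, attributes) pairs from a target detector,
        # e.g. [("dog", ["brown"]), ("ball", ["red"])].
        parts = [original_text] if original_text else []
        for name, attributes in detections:
            # Splice the key object's attributes and name, e.g. "brown dog".
            parts.append(" ".join(list(attributes) + [name]))
        return ", ".join(parts)

    # Example: an image title plus two detected key objects.
    # build_description_text("a dog playing in the park",
    #                        [("dog", ["brown"]), ("ball", ["red"])])
    # -> "a dog playing in the park, brown dog, red ball"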
6. The method of claim 1, wherein the obtaining descriptive text of the sample image comprises:
and carrying out semantic recognition on the sample image by adopting a text generation model so as to determine the descriptive text according to the text output by the text generation model.
7. The method of claim 1, wherein the obtaining descriptive text of the sample image comprises:
performing target detection on the sample image to identify and obtain the names of key objects and/or the attributes of the key objects in the sample image;
generating a first candidate text according to the name of the key object and/or the attribute of the key object;
carrying out semantic recognition on the sample image by adopting a text generation model, and taking a text output by the text generation model as a second candidate text;
the descriptive text is selected from the first candidate text and the second candidate text.
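Claim 7 does not say how the description text is chosen from the two candidates; one simple, purely illustrative heuristic is to keep the candidate that mentions more of the detected key objects.

    def select_description_text(first_candidate, second_candidate, key_object_names):
        # Coverage heuristic (an assumption, not specified by the claim):
        # count how many detected key-object names appear in each candidate.
        def coverage(text):
            lowered = text.lower()
            return sum(name.lower() in lowered for name in key_object_names)
        return max((first_candidate, second_candidate), key=coverage)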
8. The method according to any one of claims 1-7, wherein the performing noise reduction processing on the set noise map based on the descriptive text using the image generation model to obtain a noise reduced image includes:
encoding the description text by adopting an encoder to obtain a semantic vector of the description text;
inputting the semantic vector into the image generation model so that the image generation model carries out at least one noise reduction treatment on a set noise image by adopting an attention mechanism based on the semantic vector to obtain at least one frame of noise reduction image which is arranged in sequence; the noise reduction image of the next frame is obtained by performing noise reduction processing on the noise reduction image of the previous frame.
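Claim 8 encodes the description text into a semantic vector and then applies the image generation model repeatedly, each pass denoising the previous frame. The sketch below treats the encoder and the attention-conditioned denoiser as opaque callables with assumed interfaces; it is not a definitive implementation.

    def denoise_with_text(text_encoder, image_generation_model, description_text,
                          noise_image, num_steps=4):
        # text_encoder(description_text) -> semantic vector (assumed interface)
        # image_generation_model(image, semantic_vector, step) -> less noisy image
        # noise_image: the set noise image to be denoised.
        semantic_vector = text_encoder(description_text)
        frames = []
        current = noise_image
        for step in range(num_steps):
            # Each noise-reduced frame is obtained by denoising the previous one.
            current = image_generation_model(current, semantic_vector, step)
            frames.append(current)
        return frames  # at least one noise-reduced image, arranged in sequence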
9. The method according to claim 8, wherein the sample images are multiple frames arranged in sequence, and the sample image of a next frame is obtained by superposing noise on the sample image of a previous frame;
the determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image comprises:
comparing any frame of sample image with a frame of noise reduction image corresponding to the sequence, so as to determine loss components of the sample image of each frame according to content differences between pixel units in the sample image and corresponding pixel units in the noise reduction image and weights of the pixel units in the sample image;
the loss function is determined based on a sum of loss components of the sample images of each frame.
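For claim 9 the total loss is the sum of per-frame loss components, each frame of the noise-superposed sample sequence being compared with the noise-reduced frame at its corresponding position. Reusing the roi_weighted_loss sketch given after claim 1, and assuming the two sequences have equal length and are already aligned (the exact correspondence is not fixed here):

    def sequence_loss(sample_images, denoised_images, roi_masks,
                      first_value=1.0, second_value=0.5):
        # sample_images[k] is compared with denoised_images[k]; how the two
        # sequences are aligned in practice is an assumption of this sketch.
        assert len(sample_images) == len(denoised_images)
        return sum(
            roi_weighted_loss(s, d, roi_masks, first_value, second_value)
            for s, d in zip(sample_images, denoised_images)
        )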
10. The method of claim 8, wherein determining the loss function based on the content difference between each pixel element in the sample image and the corresponding pixel element in the noise reduction image, and the weight of each pixel element in the sample image, comprises:
and determining the loss function by using the content difference between each pixel unit in the sample image of the first frame and the corresponding pixel unit in the noise reduction image of the last frame and the weight of each pixel unit in the sample image.
11. A training apparatus for an image generation model, comprising:
an acquisition module, which is used for acquiring a sample image and a description text of the sample image, wherein the sample image is at least one frame, is an expected output of an image generation model to be trained, and is used for training the image generation model to output a corresponding sample image based on the description text, and the description text of the sample image is used for indicating the image content to be output by the image generation model to be trained;
a first determining module, which is used for identifying the sample image to obtain at least one region of interest; determining the weight of a pixel unit belonging to any region of interest in the sample image as a first value; and determining the weight of a pixel unit not belonging to any region of interest in the sample image as a second value, wherein the first value is greater than the second value;
a processing module, which is used for performing noise reduction processing on a set noise image based on the description text by adopting an image generation model, so as to obtain a noise reduction image;
a second determining module, which is used for determining a loss function according to the content difference between each pixel unit in the sample image and the corresponding pixel unit in the noise reduction image and the weight of each pixel unit in the sample image; and
a training module, which is used for adjusting model parameters of the image generation model according to the loss function to obtain a trained image generation model.
12. The apparatus of claim 11, wherein the second value is inversely related to a minimum distance of the corresponding pixel unit from the region of interest.
13. The apparatus of claim 11, wherein the acquisition module comprises:
the detection unit is used for carrying out target detection on the sample image so as to identify and obtain the names of key objects and/or the attributes of the key objects in the sample image;
and the generation unit is used for generating the description text according to the name of the key object and/or the attribute of the key object.
14. The apparatus of claim 13, wherein the generating unit is configured to:
acquiring an original text corresponding to the sample image;
and performing text splicing on the original text and the names of the key objects and/or the attributes of the key objects to obtain the description text.
15. The apparatus of claim 14, wherein the generating unit is configured to:
taking an image title displayed in a page to which the sample image belongs as the original text; or,
reading an original text corresponding to the sample image from an image library to which the sample image belongs; or,
and taking the label of the sample image as the original text.
16. The apparatus of claim 11, wherein the acquisition module is configured to:
and carrying out semantic recognition on the sample image by adopting a text generation model so as to determine the descriptive text according to the text output by the text generation model.
17. The apparatus of claim 11, wherein the acquisition module is configured to:
performing target detection on the sample image to identify and obtain the names of key objects and/or the attributes of the key objects in the sample image;
generating a first candidate text according to the name of the key object and/or the attribute of the key object;
carrying out semantic recognition on the sample image by adopting a text generation model, and taking a text output by the text generation model as a second candidate text;
the descriptive text is selected from the first candidate text and the second candidate text.
18. The apparatus according to any one of claims 11-17, wherein the processing module comprises:
the coding unit is used for coding the description text by adopting an encoder so as to obtain a semantic vector of the description text;
the processing unit is used for inputting the semantic vector into the image generation model so that the image generation model carries out at least one noise reduction treatment on the set noise image by adopting an attention mechanism based on the semantic vector to obtain at least one frame of noise reduction image which is arranged in sequence; the noise reduction image of the next frame is obtained by performing noise reduction processing on the noise reduction image of the previous frame.
19. The apparatus of claim 18, wherein the sample images are multiple frames arranged in sequence, and the sample image of a next frame is obtained by superposing noise on the sample image of a previous frame;
the second determining module is configured to:
comparing any frame of sample image with a frame of noise reduction image corresponding to the sequence, so as to determine loss components of the sample image of each frame according to content differences between pixel units in the sample image and corresponding pixel units in the noise reduction image and weights of the pixel units in the sample image;
the loss function is determined based on a sum of loss components of the sample images for each frame.
20. The apparatus of claim 18, wherein the second determining module is configured to:
and determining the loss function by using the content difference between each pixel unit in the sample image of the first frame and the corresponding pixel unit in the noise reduction image of the last frame and the weight of each pixel unit in the sample image.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-10.
CN202211268653.9A 2022-10-17 2022-10-17 Training method and device for image generation model, electronic equipment and storage medium Active CN115861462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211268653.9A CN115861462B (en) 2022-10-17 2022-10-17 Training method and device for image generation model, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115861462A CN115861462A (en) 2023-03-28
CN115861462B true CN115861462B (en) 2023-11-03

Family

ID=85661514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211268653.9A Active CN115861462B (en) 2022-10-17 2022-10-17 Training method and device for image generation model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115861462B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116363262B (en) * 2023-03-31 2024-02-02 北京百度网讯科技有限公司 Image generation method and device and electronic equipment
CN116939325A (en) * 2023-06-05 2023-10-24 阿里巴巴(中国)有限公司 Video generation method
CN117593595B (en) * 2024-01-18 2024-04-23 腾讯科技(深圳)有限公司 Sample augmentation method and device based on artificial intelligence and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105847703A (en) * 2016-03-28 2016-08-10 联想(北京)有限公司 Image processing method and electronic device
CN107451978A (en) * 2017-08-07 2017-12-08 上海东软医疗科技有限公司 A kind of image processing method, device and equipment
CN108345835A (en) * 2018-01-15 2018-07-31 上海大学 A kind of target identification method based on the perception of imitative compound eye
JP2020141867A (en) * 2019-03-06 2020-09-10 キヤノンメディカルシステムズ株式会社 Medical image processing device, learning method, x-ray diagnostic device, medical image processing method, and program
CN113614780A (en) * 2019-03-28 2021-11-05 国际商业机器公司 Learning detection models using loss functions
CN113298837A (en) * 2021-07-27 2021-08-24 南昌工程学院 Image edge extraction method and device, storage medium and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on image classification method based on wide residual network; 龚鸥波 (Gong Oubo); CNKI; full text *

Also Published As

Publication number Publication date
CN115861462A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
US10936919B2 (en) Method and apparatus for detecting human face
CN111860573B (en) Model training method, image category detection method and device and electronic equipment
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN112598643B (en) Depth fake image detection and model training method, device, equipment and medium
CN113378712B (en) Training method of object detection model, image detection method and device thereof
CN114820871B (en) Font generation method, model training method, device, equipment and medium
CN116596916B (en) Training of defect detection model and defect detection method and device
CN114882321A (en) Deep learning model training method, target object detection method and device
CN113947188A (en) Training method of target detection network and vehicle detection method
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN113177449A (en) Face recognition method and device, computer equipment and storage medium
JP2022185144A (en) Object detection method and training method and device of object detection model
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN114299366A (en) Image detection method and device, electronic equipment and storage medium
CN114549904A (en) Visual processing and model training method, apparatus, storage medium, and program product
CN114282258A (en) Screen capture data desensitization method and device, computer equipment and storage medium
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN116824609B (en) Document format detection method and device and electronic equipment
US20230206522A1 (en) Training method for handwritten text image generation mode, electronic device and storage medium
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
CN115937993A (en) Living body detection model training method, living body detection device and electronic equipment
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115239889A (en) Training method of 3D reconstruction network, 3D reconstruction method, device, equipment and medium
CN114863450A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant