CN117094365A - Training method and device for image-text generation model, electronic equipment and medium - Google Patents

Training method and device for image-text generation model, electronic equipment and medium Download PDF

Info

Publication number
CN117094365A
CN117094365A (Application No. CN202311101515.6A)
Authority
CN
China
Prior art keywords
image
training sample
text
sample pair
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311101515.6A
Other languages
Chinese (zh)
Inventor
罗龙强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202311101515.6A priority Critical patent/CN117094365A/en
Publication of CN117094365A publication Critical patent/CN117094365A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text


Abstract

The application discloses a training method and device for an image-text generation model, an electronic device and a medium, and belongs to the field of artificial intelligence. The method comprises the following steps: inputting a first training sample pair in a first training sample pair set into a first image-text generation model and outputting a second training sample pair, wherein the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; generating M training sample pairs based on the first training sample pair and the second training sample pair; replacing the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model.

Description

Training method and device for image-text generation model, electronic equipment and medium
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a training method and device for an image-text generation model, electronic equipment and a medium.
Background
At present, with the rise and continuous development of generative artificial intelligence (AI-generated content, AIGC), image-text generation models, such as text-to-image diffusion models in the field of AI drawing, are widely applied in fields such as wallpaper, avatars, games and cartoon design, and offer advantages such as high efficiency and a high degree of automation. In the related art, text may be input into an image-text generation model to output an image corresponding to the text.
However, the model training process of the above image-text generation model still suffers from low training accuracy.
Disclosure of Invention
The embodiments of the application aim to provide a training method and device for an image-text generation model, an electronic device and a medium, which can improve the model training accuracy of the image-text generation model.
In a first aspect, an embodiment of the present application provides a training method for an image-text generation model, the method comprising: inputting a first training sample pair in a first training sample pair set into a first image-text generation model and outputting a second training sample pair, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; replacing the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model.
In a second aspect, an embodiment of the present application provides a training device for an image-text generation model, the training device comprising: a processing module, a generating module, a replacing module, and a training module;
the processing module is configured to input a first training sample pair in a first training sample pair set into a first image-text generation model and output a second training sample pair, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; the generating module is configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; the replacing module is configured to replace the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs generated by the generating module; and the training module is configured to train the first image-text generation model based on the second training sample pair set obtained by the replacing module to obtain the target image-text generation model.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.
In the embodiments of the application, a first training sample pair in a first training sample pair set is input into a first image-text generation model and a second training sample pair is output, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; M training sample pairs are generated based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; the first training sample pair in the first training sample pair set is replaced with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and the first image-text generation model is trained based on the second training sample pair set to obtain a target image-text generation model. In this way, one or more training sample pairs in the training sample pair set are input into the image-text generation model for image-text conversion to obtain new training sample pairs; then, based on the image-text similarity between the text and the image in each training sample pair, the training sample pair set is continuously updated with the training sample pairs of higher image-text similarity, and the image-text generation model is trained on the updated training sample pair set. This improves the model training accuracy of the image-text generation model and, in turn, the consistency between the images and the text content generated by the image-text generation model.
Drawings
FIG. 1 is a schematic flow chart of a training method of an image-text generation model according to an embodiment of the present application;
fig. 2 is an example schematic diagram of a first image in a training method of an image-text generating model according to an embodiment of the present application;
FIG. 3 is a second flow chart of a training method of an image-text generating model according to an embodiment of the present application;
FIG. 4 is a third flow chart of a training method of an image-text generating model according to an embodiment of the present application;
FIG. 5 is a flowchart of a training method of an image-text generation model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training device for an image-text generating model according to an embodiment of the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
fig. 8 is a second schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description and the claims are used to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be implemented in orders other than those illustrated or described herein. The objects identified by "first", "second", etc. are generally of one type, and the number of such objects is not limited; for example, the first object may be one or more than one. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The following explains some concepts and/or terms related in the training method, device, electronic device and medium of the image-text generating model provided by the embodiment of the application.
The training method, the training device, the electronic equipment and the training medium of the image-text generation model provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof by combining the attached drawings.
At present, with the rise and continuous development of generative artificial intelligence (AI-generated content, AIGC), image-text generation models, such as text-to-image diffusion models in the field of AI drawing, are widely applied in fields such as wallpaper, avatars, games and cartoon design, and offer advantages such as high efficiency and a high degree of automation. The image-text generation model is used for image-text conversion, that is, converting an input text into an image corresponding to the text, or converting an input image into text corresponding to the image.
In the related art, during training of the image-text generation model, a training sample is input into the model so that a noise-added sample image is denoised to generate a denoised sample image, where texts with different text content correspond to different noise-added sample images. A first text-image alignment score is then obtained from a first representation vector of the denoised sample image and a second representation vector of the sample text, a first training sample is selected from the training samples of the current batch based on the first text-image alignment score, a first loss function of the image-text generation model is determined from the original sample image corresponding to the first training sample and the denoised sample image, and the image-text generation model is adjusted based on the first loss function. The model is then trained with the training samples of the next batch until training is finished.
However, since a conventional image-text generation model can only generate images from text content, its training process performs conversion in the text-to-image direction only, so the accuracy of the overall training process is low.
In the embodiments of the application, a first training sample pair in a first training sample pair set is input into a first image-text generation model and a second training sample pair is output, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; M training sample pairs are generated based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; the first training sample pair in the first training sample pair set is replaced with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and the first image-text generation model is trained based on the second training sample pair set to obtain a target image-text generation model. In this way, one or more training sample pairs in the training sample pair set are input into the image-text generation model for image-text conversion to obtain new training sample pairs; then, based on the image-text similarity between the text and the image in each training sample pair, the training sample pair set is continuously updated with the training sample pairs of higher image-text similarity, and the image-text generation model is trained on the updated training sample pair set. This improves the model training accuracy of the image-text generation model and, in turn, the consistency between the images and the text content generated by the image-text generation model.
The embodiment of the application provides a training method for an image-text generation model, and fig. 1 shows a schematic flow chart of this training method, which can be applied to a training device of the image-text generation model. In the embodiment of the application, the training device of the image-text generation model is described by taking an electronic device as an example.
As shown in fig. 1, the training method of the image-text generating model provided in the embodiment of the present application may include the following steps 201 to 204.
Step 201, the electronic device inputs a first training sample pair in the first training sample pair set into a first image-text generating model, and outputs a second training sample pair.
In the embodiment of the application, the first image-text generation model is obtained by training based on the first training sample pair set.
In an embodiment of the present application, the first image-text generating model may be a convolutional neural network model.
In the embodiment of the present application, the first training sample pair set may be automatically acquired by an electronic device, or may be selected by a user.
In an embodiment of the present application, the first training sample pair set includes N training sample pairs, and each training sample pair includes: an image and a text describing the image. Wherein N is a positive integer.
In an embodiment of the present application, the first training sample pair includes a first image and a first text for describing image content of the first image.
For example, if the image content of the first image is as shown in fig. 2, the first text may be "a cat", "a cat is fishing", or "fishing".
The first training sample pair is any training sample pair in the first training sample pair set.
In an embodiment of the present application, the second training sample pair includes a second image and a second text.
In the embodiment of the application, the second image is an image obtained by converting the first text into an image. For example, if the first text is "a kitten", the second image may be any image of a kitten.
In the embodiment of the application, the second text is a text obtained by converting the first image into text. For example, taking the first image shown in fig. 2 as an example, the second text may be "a cat", "a cat is fishing", or "fishing".
Step 202, the electronic device generates M training sample pairs based on the first training sample pair and the second training sample pair.
In the embodiment of the present application, the M training sample pairs at least include a first training sample pair and a second training sample pair, where M is an integer greater than 1.
In an embodiment of the present application, each of the M training sample pairs includes an image and text for describing image content of the image.
Optionally, in an embodiment of the present application, the M training sample pairs include: the first training sample pair, the second training sample pair, the fourth training sample pair, and the fifth training sample pair.
In an embodiment of the present application, the first training sample pair includes a first image and a first text, the second training sample pair includes a second image and a second text, the fourth training sample pair includes a first image and a second text, and the fifth training sample pair includes a second image and a first text.
Step 203, the electronic device replaces the first training sample pair in the first training sample pair set with the third training sample pair, so as to obtain a second training sample pair set.
In the embodiment of the present application, the third training sample pair is a training sample pair with highest image-text similarity among the M training sample pairs.
In the embodiment of the application, the electronic device calculates the image-text similarity between the text and the image in each of the M training sample pairs, and then, based on these similarities, selects the training sample pair with the highest image-text similarity from the M training sample pairs as the third training sample pair.
In the embodiment of the application, the image-text similarity between the text and the image in a training sample pair is calculated by the overall loss function of the image-text generation model.
Optionally, in an embodiment of the present application, the overall loss function in the first teletext generation model may be constructed based on at least one of: a text encoder model, an image encoder model, a diffusion model, a text decoder model, and the like in the first teletext generation model.
Optionally, in the embodiment of the present application, each time training of the image-text generating model is completed, a new overall loss function is constructed based on the trained image-text generating model, and when the image-text generating model is trained next time, the new overall loss function can be used to calculate the image-text similarity between the text and the image in the training sample pair.
Therefore, because the overall loss function is constructed from the characteristics of the different modules in the image-text generation model, the constructed overall loss function fits the image-text generation model, so that the image-text similarity of a training sample pair calculated by the electronic device using the overall loss function is more accurate.
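As an informal illustration of steps 202 and 203 (not part of the claimed method), the following is a minimal Python sketch of building the four candidate pairs from <T1, M1> and <T2, M2> and keeping the one with the highest image-text similarity; the function and variable names, and the `similarity` callable standing in for the overall-loss-based similarity, are assumptions.

```python
from itertools import product

def select_best_pair(first_pair, second_pair, similarity):
    """Build the M = 4 candidate pairs from <T1, M1> and <T2, M2> and
    return the one whose text and image are most similar.

    `similarity(text, image)` is assumed to return a scalar score derived
    from the overall loss function of the current image-text model.
    """
    t1, m1 = first_pair   # original text / image
    t2, m2 = second_pair  # model-generated text / image
    candidates = [(t, m) for t, m in product((t1, t2), (m1, m2))]
    # The highest-scoring pair becomes the "third training sample pair"
    # that replaces <T1, M1> in the training sample pair set.
    return max(candidates, key=lambda pair: similarity(pair[0], pair[1]))
```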
Step 204, the electronic device trains the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model.
Optionally, in the embodiment of the present application, the electronic device continuously constructs the overall loss function based on the second training sample pair set and performs gradient back-propagation for each training sample pair, thereby updating the parameters of the image-text generation model. When the entire second training sample pair set has been input into the first image-text generation model, one complete training period has been completed. If the number of completed training periods reaches a preset number, the target image-text generation model is output directly; if not, the image-text generation model is not yet mature, so the electronic device continues to replace samples in the current training sample pair set and to train on it until the number of training periods reaches the preset number, finally taking the resulting image-text generation model as the target image-text generation model.
Optionally, in the embodiment of the present application, the step 204 specifically includes steps 204a to 204e:
and 204a, the electronic equipment trains the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model.
In the embodiment of the present application, the second image-text generating model is obtained by training the set based on a second training sample.
In an embodiment of the present application, the second image-text generating model may be a convolutional neural network model.
The electronic device inputs the training sample pairs in the second training sample pair set to the first image-text generation model, constructs an overall loss function, performs gradient back propagation on each training sample pair in the second training sample pair set, and updates parameters of the first image-text generation model to obtain a second image-text generation model.
Step 204b, when the number of model training iterations has not reached a preset threshold, the electronic device inputs a sixth training sample pair in the second training sample pair set into the second image-text generation model and outputs a seventh training sample pair.
In an embodiment of the present application, the sixth training sample pair is any training sample pair in the second training sample pair set.
In the embodiment of the application, the preset threshold is set by the electronic equipment or set by a user.
Step 204c, the electronic device generates N training sample pairs based on the sixth training sample pair and the seventh training sample pair.
In the embodiment of the present application, the N training sample pairs at least include a sixth training sample pair and a seventh training sample pair, where N is an integer greater than 1.
Step 204d, the electronic device replaces the sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set.
In the embodiment of the present application, the eighth training sample pair is a training sample pair with highest image-text similarity among N training sample pairs.
Step 204e, the electronic device trains the second image-text generation model based on the third training sample pair set to obtain a third image-text generation model, and iterates the above process until the number of model training iterations reaches the preset threshold, taking the image-text generation model obtained in the last round of training as the target image-text generation model.
The model training method provided by the embodiment of the application is exemplified by taking the first image as M1, the first text as T1, the second image as M2 and the second text as T2. Specifically, the above model training method may include the following steps B1 to B6:
and B1, the electronic equipment selects training sample pairs < T1, M1> from the training sample pair set.
And B2, the electronic equipment adopts a graph and text generation model which has completed one period of training to perform graph and text conversion processing on the training sample pair < T1, M1>, and outputs the training sample pair < T2, M2>.
And B3, the electronic equipment calculates the image-text similarity of the images and the texts in the four training sample pairs of < T1, M1>, < T1, M2>, < T2, M1>, < T2, M2> by adopting the integral loss function, and simultaneously, gradient back propagation is carried out on the four training sample pairs by constructing a new integral loss function.
And B4, the electronic equipment selects a training sample pair with highest image-text similarity, and the training sample pair is used for updating and replacing < T1, M1> on the assumption of < T2, M1>.
And B5, the electronic equipment traverses the complete training sample pair set, and one training period is completed.
Step B6, the electronic equipment judges whether the number of the current training period reaches the number of the maximum training period, and if so, the electronic equipment directly outputs a final image-text generation model; if not, repeating steps B1 to B5.
Therefore, the image-text generating model is trained by continuously replacing the training sample pair set, so that the consistency of the image and the text output by the image-text generating model is improved.
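For illustration only, the following Python sketch puts steps B1 to B6 together as one training loop. The `model.convert`, `train_step` and `similarity` callables, and the representation of the dataset as a mutable list of (text, image) pairs, are assumptions rather than interfaces defined by the patent.

```python
def train_image_text_model(model, dataset, num_epochs, train_step, similarity):
    """Sketch of the iterative sample-replacement training in steps B1-B6.

    `model.convert(t, m)` returns the generated pair <T2, M2>, `train_step`
    performs one gradient update on a (text, image) pair, and `similarity`
    scores a pair using the overall loss function.
    """
    for epoch in range(num_epochs):                       # B6: stop at the maximum number of periods
        for idx, (t1, m1) in enumerate(dataset):          # B1/B5: traverse the training sample pair set
            t2, m2 = model.convert(t1, m1)                # B2: text-to-image and image-to-text conversion
            candidates = [(t1, m1), (t1, m2), (t2, m1), (t2, m2)]
            for t, m in candidates:                       # B3: score and back-propagate on each pair
                train_step(model, t, m)
            best = max(candidates, key=lambda p: similarity(*p))
            dataset[idx] = best                           # B4: replace <T1, M1> with the best pair
    return model
```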
In the training method of the image-text generation model described above, one or more training sample pairs in the training sample pair set are input into the image-text generation model for image-text conversion to obtain new training sample pairs; then, based on the image-text similarity between the text and the image in each training sample pair, the training sample pair set is continuously updated with the training sample pairs of higher image-text similarity, and the image-text generation model is trained on the updated training sample pair set. This improves the model training accuracy of the image-text generation model and, in turn, the consistency between the images and the text content generated by the image-text generation model.
Optionally, in an embodiment of the present application, the image-text generating model at least includes: a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model.
Illustratively, the above text encoder model is used for encoding text, i.e. for extracting text characteristic information of the text.
Illustratively, the text encoder model described above includes: a Tokenizer module, and a self-attention module.
In one example, the Tokenizer module is configured to convert text into a numeric vector corresponding to the text.
In one example, the self-attention module is used to extract feature information in text.
The above-described image encoder model is used for encoding an image, i.e. for extracting image characteristic information of the image, for example.
Illustratively, the image encoder model described above includes: the device comprises a first convolution module, a second convolution module and a third convolution module.
Illustratively, the diffusion model is used to convert text feature information into image feature information.
In one example, the first convolution module includes: a first convolution layer, a second convolution layer, and a third convolution layer; the second convolution module includes: a first convolution layer, a second convolution layer, and a third convolution layer; the third convolution module includes: a first convolution layer, a second convolution layer, and a third convolution layer.
In one example, the convolution kernel of the first convolution layer is 3×3, the step size is 2, and the number of output channels is 8; the convolution kernel of the second convolution layer is 3×3, the step length is 1, and the output channel number is 8; the convolution kernel of the third convolution layer is 3×3, the step size is 1, and the number of output channels is 8.
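As an illustrative sketch (not part of the patent), the image encoder described above could be written in PyTorch roughly as follows. The stated 3x3 kernels, stride-2 first layer and stride-1 following layers are taken from the text; the padding of 1, the ReLU after each layer in every module, and the intermediate channel widths (8 -> 16 -> 32) needed to reach a [32, 64, 64] output from a [3, 512, 512] input are assumptions.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """One encoder block: a stride-2 3x3 convolution followed by two
    stride-1 3x3 convolutions, each followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

class ImageEncoder(nn.Module):
    """Three stacked ConvModules mapping a [3, 512, 512] image to a
    [32, 64, 64] image feature encoding matrix (IEM)."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(ConvModule(3, 8), ConvModule(8, 16), ConvModule(16, 32))

    def forward(self, image):            # image: [B, 3, 512, 512]
        return self.blocks(image)        # IEM:   [B, 32, 64, 64]
```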
Illustratively, the diffusion model includes: diffusion model encoder, diffusion model intermediate layer, and diffusion model decoder.
In one example, the diffusion model encoder includes: the first coding module, the second coding module and the third coding module.
In one example, the diffusion model decoder includes: the first decoding module, the second decoding module and the third decoding module.
It should be noted that the first encoding module, the second encoding module, the third encoding module, the first decoding module, the second decoding module and the third decoding module each comprise: a cross-attention operator (CrossAttention), Add & Norm, and FeedForward.
Illustratively, the above-described image decoder model is used to convert the image characteristic information into a final image.
In one example, the image decoder model includes: the device comprises a first deconvolution module, a second deconvolution module and a third deconvolution module.
In one example, the first deconvolution module includes: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer; the second deconvolution module includes: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer; the third deconvolution module includes: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer.
In one example, the deconvolution kernel of the first deconvolution is 3×3, the step size is 2, and the number of output channels is 16; the deconvolution kernel of the second deconvolution layer is 3 multiplied by 3, the step length is 1, and the number of output channels is 8; the deconvolution kernel of the third deconvolution layer is 3×3, the step size is 1, and the number of output channels is 8.
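For comparison, a hypothetical sketch of one deconvolution module matching the layout just described is shown below; the padding and output_padding values needed to double the spatial size, and the default channel schedule, are assumptions.

```python
import torch.nn as nn

class DeconvModule(nn.Module):
    """One decoder block: a stride-2 3x3 transposed convolution (upsampling
    by 2) followed by two stride-1 3x3 transposed convolutions, each
    followed by ReLU."""
    def __init__(self, in_ch=32, mid_ch=16, out_ch=8):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose2d(in_ch, mid_ch, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(mid_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, x):                 # e.g. [B, 32, 64, 64] -> [B, 8, 128, 128]
        return self.layers(x)
```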
Illustratively, the text decoder is configured to convert the image characteristic information into text.
Optionally, in the embodiment of the present application, the step 201 specifically includes the following steps 201a to 201f:
step 201a, the electronic device inputs a first training sample pair in the first training sample pair set to a first image-text generating model.
Step 201b, the electronic device inputs the first text into the text encoder model for feature extraction to obtain first text feature information of the first text.
The first text feature information may be expressed in the form of a numerical vector, for example.
For example, as shown in fig. 3, taking the first text as an example, the electronic device first inputs the first text into the Tokenizer module to obtain a first numerical vector with vector dimension [128,768], where 128 is the maximum number of character tokens and 768 is the size of the representation vector of each character. The electronic device then inputs the [128,768]-dimensional first numerical vector into the self-attention module as the QKV values: the feature vectors in the first numerical vector are extracted by the self-attention operator (SelfAttention) to obtain the key text feature information of the first text, the mean and standard deviation corresponding to the key text feature information are calculated by the vector addition and normalization operator (Add & Norm), and the mean and standard deviation corresponding to the key text feature information are fused by the feed-forward operator (FeedForward). This calculation is repeated 12 times to obtain a [128,768]-dimensional text encoding vector (Text Encoder Vector, TCV), i.e. the more fully described first text feature information.
Illustratively, the QKV values are the three input vectors of the attention mechanism, denoted Query, Key and Value. In the self-attention module (SelfAttention), the three vectors come from mappings of the same input; in cross attention (CrossAttention), the Query is mapped from an independent input, while the Key and Value come from mappings of the same input.
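The following is a minimal PyTorch-style sketch of such a text encoder, given only for illustration: the tokenizer embedding, vocabulary size, head count, feed-forward width and GELU activation are assumptions; only the 12 blocks, the self-attention/Add & Norm/FeedForward structure, and the [128, 768] shape come from the description above.

```python
import torch
import torch.nn as nn

class TextEncoderBlock(nn.Module):
    """One of the 12 blocks: SelfAttention -> Add & Norm -> FeedForward -> Add & Norm."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: [B, 128, 768]
        # Query, Key and Value all come from the same input in self-attention.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)            # Add & Norm
        x = self.norm2(x + self.ffn(x))         # FeedForward + Add & Norm
        return x

class TextEncoder(nn.Module):
    """Tokenizer embedding followed by 12 blocks, producing the [128, 768]
    first text feature information."""
    def __init__(self, vocab_size=30000, dim=768, depth=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList([TextEncoderBlock(dim) for _ in range(depth)])

    def forward(self, token_ids):               # token_ids: [B, 128]
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        return x                                # [B, 128, 768]
```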
Step 201c, the electronic device inputs the first image into the image encoder model to perform feature extraction, so as to obtain first image feature information of the first image.
Illustratively, the first image may be represented as [X, Y, Z], where X is the number of color channels of the first image and Y and Z are the pixel dimensions of the image.
The first image characteristic information may be represented in the form of a vector matrix, for example.
For example, as shown in fig. 4, for the first convolution module in the image encoder model, the electronic device inputs the first image with the vector dimension of [3,512,512] into the first convolution module, calculates the feature matrix with the vector dimension of [8,256,256] through the first convolution layer, and then passes the feature matrix with the vector dimension of [8,256,256] through an activation function Relu to obtain the feature matrix with the dimension of [8,256,256 ]. Next, the electronic device sequentially inputs the [8,256,256] feature matrix into the second convolution layer and the third convolution layer, and the calculation process is the same as that of the first convolution layer, but the step size of convolution is changed to 1, so that the [8,256,256] feature matrix is finally obtained.
The electronic device sequentially inputs the [8,256,256] feature matrix obtained by the output of the first convolution module into the second convolution module and the third convolution module, and the calculation process is the same as that of the first convolution module, so as to finally obtain the [32,64,64] image feature encoding matrix (Image Encoder Matrix, IEM), namely the first image feature information.
Optionally, after obtaining the first text feature information, i.e. after step 201b, the electronic device may subject the first text feature information to a time mapping process to obtain a text condition control vector (Text Condition Vector, TCV).
For example, in connection with fig. 3, the electronic device first passes the [128,768]-dimensional first text feature information through the linear mapping layer Project to obtain a text projection vector (Text Project Vector, TPV) with dimension [320,768]. The initialized time embedding is then passed through the linear mapping layer Project to obtain a time mapping embedding with dimension [320,768]. Finally, the TPV and the time mapping embedding are added to obtain the TCV with dimension [320,768].
Optionally, after obtaining the first image feature information, i.e. after step 201c, the electronic device may perform noise addition processing on the first image feature information, and perform convolution calculation on the processed first image feature information to obtain an image coding mapping vector (Image Project Vector, IPV).
For example, the noise-adding process (AddNoise) may be to add a randomly sampled Gaussian noise matrix (Gaussian Noise Matrix, GNM) to the first image feature information to obtain a noised image latent matrix, denoted Latent.
Illustratively, the electronic device performs a Conv convolution operation on the above Latent to obtain a matrix with dimension [320,64,64], where the convolution kernel size is 3×3, the stride is 1, and the number of output channels is 320. The matrix is then reshaped by the reshape function to obtain the IPV with dimension [320,64,64].
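A hypothetical sketch of this noising and projection step is given below; the padding of 1 (to preserve the 64x64 spatial size) is an assumption, and the final reshape is omitted because the stated target shape coincides with the convolution output.

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 1, 320 output channels as stated above; padding is assumed.
project_conv = nn.Conv2d(32, 320, kernel_size=3, stride=1, padding=1)

def make_image_project_vector(iem, conv=project_conv):
    """Add randomly sampled Gaussian noise (GNM) to the [32, 64, 64] image
    feature encoding matrix to get the Latent, then project it with the
    convolution to obtain the IPV."""
    gnm = torch.randn_like(iem)          # random Gaussian noise matrix
    latent = iem + gnm                   # noised image latent matrix (Latent)
    ipv = conv(latent)                   # [B, 320, 64, 64]
    return latent, ipv
```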
It should be noted that the above two processes are not required to be performed in any particular order: the first text feature information may be processed first and then the first image feature information, the first image feature information may be processed first and then the first text feature information, or the two may be processed simultaneously. The application is not limited in this respect.
Step 201d, the electronic device inputs the first text feature information and the first image feature information into the diffusion model for cross-attention and convolution calculation to obtain second image feature information.
The second image feature information is, for example, image feature information corresponding to the first text.
For example, for a first encoding module in a diffusion model encoder in a diffusion model, the processing of the first encoding module may include the following S1 to S7:
S1) The electronic device takes the first image feature information as the Query and the first text feature information as the Key and Value, inputs them into the first encoding module, and performs cross-attention calculation: the associated image feature information in the first image feature information and the first text feature information is extracted by cross attention, the mean and standard deviation corresponding to the associated image feature information are calculated by the vector addition and normalization operator (Add & Norm), and the mean and standard deviation corresponding to the associated image feature information are fused by the feed-forward operator (FeedForward). After this calculation is repeated twice, a vector with vector dimension [320,64,64] is output. This vector then undergoes a convolution operation with a 3×3 kernel, a stride of 2 and 640 output channels to obtain a [640,32,32]-dimensional vector, which is used as the Query of the second encoding module and processed again in the same way as in the first encoding module; the operation is repeated twice to obtain a [320,64,64]-dimensional vector.
S2) The [320,64,64]-dimensional vector undergoes a convolution operation to output a vector with vector dimension [640,32,32], which is used as the Query of the third encoding module and processed again in the same way as in the first encoding module; the operation is repeated twice to obtain a [640,32,32]-dimensional vector. A further convolution operation then outputs a vector with vector dimension [1280,16,16].
S3) The vector with vector dimension [1280,16,16] is input into the diffusion model intermediate layer. The calculation process of the intermediate layer is the same as that of the first encoding module, and after the CrossAttention, Add & Norm and FeedForward calculations, a vector with vector dimension [1280,16,16] is output.
S4) The vector output by the third encoding module and the vector output by the diffusion model intermediate layer are averaged, and the averaged vector is input into the first decoding module as the Query, with the first text feature information as the Key and Value, for calculation. The calculation process is the same as that of the first encoding module and is repeated twice to obtain a [1280,16,16]-dimensional vector. A deconvolution with a 3×3 kernel, a stride of 2 and 640 channels is then applied to this vector, outputting a [640,32,32]-dimensional vector.
S5) The vector output by the first decoding module and the vector output by the second encoding module are averaged, and the averaged vector is input into the second decoding module as the Query, with the first text feature information as the Key and Value, for calculation. The calculation process, the same as that of the first encoding module, is repeated twice to obtain a [640,32,32]-dimensional vector. A deconvolution operation then outputs a vector with vector dimension [320,32,32].
S6) The vector output by the second decoding module and the vector output by the first encoding module are averaged, and the averaged vector is input into the third decoding module as the Query, with the first text feature information as the Key and Value, for calculation. The calculation process, the same as that of the first encoding module, is repeated twice, and a vector with vector dimension [320,64,64] is output.
S7) The above vector is passed through one convolution layer (Conv_out) to obtain a [32,64,64]-dimensional vector, which is recorded as the image prediction noise matrix, i.e. the image noise matrix predicted from the first text. Finally, this image noise matrix is subtracted from the Latent to obtain the final image reconstruction matrix (Image Reconstruction Matrix, IRM), i.e. the second image feature information.
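The repeated building block in S1 to S6 is cross attention with the image branch as Query and the text features as Key/Value, followed by Add & Norm and FeedForward. The sketch below illustrates one such block in PyTorch; the head count, the feed-forward width, and the assumption that both branches are projected to a common token dimension before attention are illustrative choices, not details given in the patent.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One diffusion-model block: CrossAttention (image as Query, text as
    Key/Value) -> Add & Norm -> FeedForward -> Add & Norm."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # Query comes from the image branch, Key and Value from the text branch.
        attn_out, _ = self.cross_attn(image_tokens, text_tokens, text_tokens)
        x = self.norm1(image_tokens + attn_out)      # Add & Norm
        x = self.norm2(x + self.ffn(x))              # FeedForward + Add & Norm
        return x

# At the end of the U-Net-style pass (S7), the predicted noise is removed:
#   irm = latent - predicted_noise   # image reconstruction matrix, [32, 64, 64]
```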
Step 201e, the electronic device inputs the second image feature information into the image decoder model to perform image decoding, obtains third image feature information, and outputs the second image based on the third image feature information.
For example, for the first deconvolution module, the electronic device may input the second image feature information with vector dimension [32,64,64] into the first deconvolution module, calculate a feature matrix with dimension [16,128,128] through the first deconvolution layer, and then pass the [16,128,128]-dimensional feature matrix through a ReLU activation function to obtain a [16,128,128]-dimensional feature matrix. The [16,128,128] feature matrix is then input into the second deconvolution layer and the third deconvolution layer in sequence; the calculation process is the same as that of the first deconvolution layer, except that the deconvolution stride is changed to 1 and the number of output channels to 8, and the [16,128,128] feature matrix is finally obtained.
It should be noted that the feature matrix in [16,128,128] dimension output by the first deconvolution module is sequentially input into the second deconvolution module and the third deconvolution module, the calculation process is the same as that of the first deconvolution module, and the third image feature information in [3,512,512] dimension is finally obtained.
The electronic device converts the feature vector corresponding to the third image feature information into the second image and outputs it.
Step 201f, the electronic device inputs the first image feature information into a text decoder model for text prediction, obtains text prediction parameters, and outputs a second text based on the text prediction parameters.
Illustratively, the first image feature information is input into a text decoder model, text codebook information having a mapping relation with the first image feature information is obtained, and prediction parameters are obtained based on the text codebook information.
Illustratively, the prediction parameters are the probabilities that the text describing the content represented by the first image feature information belongs to each of the different preset texts.
The text codebook information is illustratively a text codebook of the system.
The preset text is, for example, text in the text codebook information.
Illustratively, the text decoder model includes 12 text decoding modules, each of which comprises: CrossAttention, Add & Norm, and FeedForward.
For example, the electronic device inputs the first piece of text codebook information (dimension [1,768]) in the text codebook information as the Query, and the first image feature information as the Key and Value, into the first decoding module to perform cross-attention calculation and obtain the associated feature information between the text codebook information and the first image feature information. The mean and standard deviation corresponding to this associated feature information are calculated by Add & Norm, and the mean and standard deviation are fused by FeedForward. The fused vector is then input into the remaining 11 decoding modules in sequence, each performing the same calculation, to finally obtain a first text vector with dimension [1,768].
Then, the vocabulary vector matrix corresponding to the [1,768]-dimensional first text vector is obtained through the vocabulary (TokenEmbedding), and the text character with the highest probability is selected as the first text character of the second text through one layer of Softmax calculation. This text character is spliced onto the first image feature information to obtain a vector with dimension [33,768], which is used as the Key and Value; the second piece of text codebook information (dimension [1,768]) is again input into the first decoding module as the Query, the cross-attention calculation and the Add & Norm and FeedForward operations are performed, and after the calculation of the 12 decoding modules a second text vector with dimension [1,768] is finally obtained. The second text character of the second text is then obtained through the vocabulary and Softmax calculations. This word-by-word iterative calculation continues until the set maximum text length or a terminator is reached, at which point the iteration stops and the resulting character string is output as the second text.
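The word-by-word decoding just described can be summarized, purely for illustration, by the Python sketch below. The `decoder` callable (standing in for the 12 cross-attention decoding modules), the `codebook` and `token_embeddings` tensors, the terminator id and the greedy argmax choice are all assumptions used to make the sketch self-contained.

```python
import torch
import torch.nn.functional as F

def greedy_decode(decoder, codebook, token_embeddings, image_feats, max_len=128, eos_id=2):
    """At step j, the j-th [1, 768] codebook entry is the Query, while the
    [32, 768] image feature information concatenated with the embeddings of
    the characters generated so far is the Key/Value. The decoder output is
    projected onto the vocabulary and the highest-probability character kept."""
    generated = []
    key_value = image_feats                              # [32, 768]
    for j in range(max_len):
        query = codebook[j].unsqueeze(0)                 # [1, 768]
        out = decoder(query, key_value)                  # 12 cross-attention modules -> [1, 768]
        logits = out @ token_embeddings.t()              # map onto the vocabulary (TokenEmbedding)
        next_id = int(F.softmax(logits, dim=-1).argmax())
        if next_id == eos_id:                            # stop at the terminator
            break
        generated.append(next_id)
        next_emb = token_embeddings[next_id].unsqueeze(0)
        key_value = torch.cat([key_value, next_emb], 0)  # e.g. [33, 768] after the first step
    return generated
```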
Optionally, in the embodiment of the present application, after the electronic device obtains the first image feature information, the electronic device may map the first image feature information to the same dimension as the first text feature information, and then perform text prediction on the mapped input text decoder model to obtain the text prediction parameter.
For example, the first image feature information is a vector in [32,64,64] dimension, the electronic device converts the vector of the first image feature information into a vector in [32,4096] dimension through a reshaping function, and then maps the vector through a linear mapping layer Project to obtain a vector in [32,768] dimension, which is the same as the first text feature information, namely the fourth image feature information.
The fourth image feature information may be input as the first image feature information into the text decoder model.
Therefore, through various network models in the image-text generation model, the text and the image are subjected to fusion association training, so that the consistency between the generated text and the generated image can be improved.
Optionally, in the embodiment of the present application, before the step 202, the training method for the image-text generating model provided in the embodiment of the present application further includes steps 301 to 304:
step 301, the electronic device constructs a first loss function based on the first text feature information and the first image feature information.
Illustratively, the construction process of the first loss function includes the following steps A1 to A3:
Step A1, the electronic device passes the first text feature information through the linear mapping layer Project to obtain a [256,768]-dimensional text alignment vector TAV.
Step A2, the electronic device passes the first image feature information through a convolution layer Conv to obtain a [768,16,16]-dimensional image alignment matrix IAM, and then reshapes it into a [256,768]-dimensional image alignment vector IAV via the reshape function Reshape.
Step A3, the electronic device calculates the similarity between the text alignment vector TAV and the image alignment vector IAV to construct an image-text contrast loss function, recorded as C1 = 1 - Cos(TAV, IAV).
Step 302, the electronic device constructs a second loss function based on the second image feature information and the gaussian noise matrix.
Illustratively, the second loss function is constructed as follows: the electronic device constructs an MSE loss function, denoted C2, between the randomly sampled Gaussian noise matrix GNM and the image prediction noise matrix INM, which is derived from the first text and the feature information of the first image. In this way, the image prediction noise matrix INM is driven towards the randomly sampled Gaussian noise matrix GNM, so that the image can be reconstructed back to the original image as closely as possible.
Step 303, the electronic device constructs a third loss function based on the text prediction parameter.
The text prediction parameter may be a probability value for predicting a text character as a certain text character.
Illustratively, taking the i-th training sample pair as an example, let p_k^i denote the k-th embedding corresponding to the first image feature information of the i-th sample (32 in total), and let c_j^i denote the embedding of the j-th text character of the i-th sample (maximum text length 128). Assume that, after Softmax, the predicted 2nd text character corresponds to the embedding c_2^i with the largest Softmax value, i.e. the probability p_θ(c_2^i | p_1^i, …, p_32^i, c_1^i) is higher than the probability of every other character in the vocabulary. The third loss function is then constructed as:
C3 = -Σ_j log p_θ(c_j^i | p_1^i, …, p_32^i, c_1^i, …, c_(j-1)^i)
Step 304, the electronic device constructs an overall loss function based on the first loss function, the second loss function, and the third loss function.
In the embodiment of the present application, the overall loss function may be:
C = a×C1 + b×C2 + c×C3, where a + b + c = 1 (Formula 1)
where C1 represents the first loss function constructed based on the text encoder model and the image encoder model in the image-text generation model, C2 represents the second loss function constructed based on the diffusion model in the image-text generation model, and C3 represents the third loss function constructed based on the text decoder model in the image-text generation model. a, b and c respectively represent the weights of the three loss functions, and the three weights sum to 1.
It should be noted that the weights of the three loss functions may be preset weights.
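For illustration, a minimal sketch of Formula 1 is given below. The flattening of TAV and IAV for the cosine term, the cross-entropy form of C3, and the specific weight values are assumptions; the patent only requires that a + b + c = 1.

```python
import torch
import torch.nn.functional as F

def overall_loss(tav, iav, gnm, inm, char_logits, char_targets, a=0.4, b=0.3, c=0.3):
    """C = a*C1 + b*C2 + c*C3 (Formula 1).

    C1: image-text contrast loss 1 - Cos(TAV, IAV).
    C2: MSE between the sampled Gaussian noise GNM and the predicted noise INM.
    C3: negative log-likelihood of the target text characters.
    """
    c1 = 1 - F.cosine_similarity(tav.flatten(), iav.flatten(), dim=0)   # contrast loss
    c2 = F.mse_loss(inm, gnm)                                           # diffusion noise loss
    c3 = F.cross_entropy(char_logits, char_targets)                     # text decoding loss
    return a * c1 + b * c2 + c * c3
```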
Optionally, in the embodiment of the present application, in combination with the steps 301 to 304, after the step 202, the training method for the image-text generating model provided in the embodiment of the present application further includes step 305:
and 305, the electronic equipment calculates the image-text similarity of each training sample pair of the M training sample pairs by adopting an integral loss function.
It should be noted that, as shown in fig. 5, the image-text generation model may include four modules: an input module, an inference module, an output module, and a training loss function module. During training, the electronic device first inputs the training sample pair into the input module of the image-text generation model, and the final image and text are then output through the inference process of the inference module and the output module of the image-text generation model. Meanwhile, an overall loss function is constructed from the intermediate results of the inference module and the final output result to perform training and learning.
Optionally, in the embodiment of the present application, after the step 204, the training method for the image-text generating model provided in the embodiment of the present application further includes the following step 401 or step 402:
Step 401, the electronic device inputs the fourth image into the target image-text generating model to perform image-text conversion, and outputs a fourth text.
The fourth text is used to describe the image content of the fourth image.
Step 402, the electronic device inputs the fifth text into the target image-text generating model to perform image-text conversion, and outputs a fifth image.
Illustratively, the fifth image is generated based on the fifth text.
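As an illustration of steps 401 and 402, the sketch below wires the sub-models of the trained target model into the two inference directions; the attribute names (image_encoder, text_decoder, text_encoder, diffusion, image_decoder, scheduler_step, num_steps) are assumptions about how such a model might be organized, not an interface defined by the patent.

```python
import torch

def image_to_text(model, fourth_image: torch.Tensor) -> torch.Tensor:
    """Step 401: image-to-text conversion (image encoder -> text decoder)."""
    image_features = model.image_encoder(fourth_image)
    fourth_text_ids = model.text_decoder.generate(image_features)  # autoregressive character prediction
    return fourth_text_ids

def text_to_image(model, fifth_text_ids: torch.Tensor) -> torch.Tensor:
    """Step 402: text-to-image conversion (text encoder -> diffusion model -> image decoder)."""
    text_features = model.text_encoder(fifth_text_ids)
    latent = torch.randn(1, 4, 64, 64)              # start from randomly sampled Gaussian noise
    for t in reversed(range(model.num_steps)):      # iterative denoising, conditioned on the text features
        noise_pred = model.diffusion(latent, text_features, t)
        latent = model.scheduler_step(noise_pred, t, latent)
    fifth_image = model.image_decoder(latent)
    return fifth_image
```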
It should be noted that, in the training method of the image-text generating model provided by the embodiment of the present application, the execution subject may be a training device of the image-text generating model, an electronic device, or a functional module or entity in the electronic device. In the embodiment of the present application, the training device of the image-text generating model is described by taking, as an example, the case in which the training device of the image-text generating model executes the training method of the image-text generating model.
Fig. 6 shows a schematic diagram of a possible structure of a training device of the image-text generating model according to an embodiment of the present application. As shown in fig. 6, the training device 700 for the image-text generating model may include: a processing module 701, a generating module 702, a replacing module 703 and a training module 704;
the processing module 701 is configured to input a first training sample pair in a first training sample pair set to a first image-text generating model, and output a second training sample pair, where the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair includes a first image and a first text for describing an image content of the first image, the second training sample pair includes a second image and a second text, the second image is an image obtained by performing image-text conversion on the first text, and the second text is a text obtained by performing image-text conversion on the first image; the generating module 702 is configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, where the M training sample pairs at least include the first training sample pair and the second training sample pair, and M is an integer greater than 1; the replacing module 703 is configured to replace a first training sample pair in the first training sample pair set with a third training sample pair, to obtain a second training sample pair set, where the third training sample pair is a training sample pair with highest image-text similarity in the M training sample pairs generated by the generating module 702; the training module 704 is configured to train the first image-text generating model based on the second training sample pair set obtained through replacement by the replacing module 703, so as to obtain a target image-text generating model.
Optionally, in an embodiment of the present application, the first image-text generating model includes a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
the processing module 701 is specifically configured to:
inputting a first training sample pair in the first training sample pair set into a first image-text generation model; inputting the first text into a text encoder model for feature extraction to obtain first text feature information of the first text; inputting the first image into an image encoder model for feature extraction to obtain first image feature information of the first image; inputting the first text feature information and the first image feature information into a diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information; inputting the second image characteristic information into an image decoder model for image decoding to obtain third image characteristic information, and outputting a second image based on the third image characteristic information; and inputting the first image characteristic information into a text decoder model for text prediction to obtain text prediction parameters, and outputting a second text based on the text prediction parameters.
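The sketch below wires these five sub-models together in the order described above, producing the second image and second text from the first training sample pair; the module classes are placeholders standing in for whatever concrete encoder, diffusion and decoder networks an implementation of the patent would use.

```python
import torch
import torch.nn as nn

class TextImageGenerationModel(nn.Module):
    """Hypothetical composition of the five sub-models described above."""

    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module,
                 diffusion: nn.Module, image_decoder: nn.Module, text_decoder: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.diffusion = diffusion          # cross-attention and convolution over the two feature sets
        self.image_decoder = image_decoder
        self.text_decoder = text_decoder

    def forward(self, first_image: torch.Tensor, first_text_ids: torch.Tensor):
        first_text_features = self.text_encoder(first_text_ids)        # first text feature information
        first_image_features = self.image_encoder(first_image)         # first image feature information
        second_image_features = self.diffusion(first_text_features,
                                               first_image_features)   # second image feature information
        second_image = self.image_decoder(second_image_features)       # decode to obtain the second image
        text_pred_params = self.text_decoder(first_image_features)     # text prediction parameters (logits)
        second_text = text_pred_params.argmax(dim=-1)                  # second text (character indices)
        return second_image, second_text
```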
Optionally, in this embodiment of the present application, the processing module 701 is further configured to construct, before generating M training sample pairs based on the first training sample pair and the second training sample pair, a first loss function based on the first text feature information and the first image feature information; the processing module 701 is further configured to construct a second loss function based on the second image feature information and the gaussian noise matrix; the processing module 701 is further configured to construct a third loss function based on the text prediction parameter; the processing module 701 is further configured to construct an overall loss function based on the first loss function, the second loss function, and the third loss function; the processing module 701 is further configured to, after generating M training sample pairs based on the first training sample pair and the second training sample pair, calculate the image-text similarity of each training sample pair in the M training sample pairs by using the overall loss function.
Optionally, in an embodiment of the present application, the M training sample pairs include: a first training sample pair, a second training sample pair, a fourth training sample pair, and a fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text and the fifth training sample pair comprises the second image and the first text.
Optionally, in an embodiment of the present application, the training module 704 is specifically configured to:
training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model; under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into a second image-text generation model, and outputting a seventh training sample pair; generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1; replacing a sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is a training sample pair with highest image-text similarity in N training sample pairs; training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the training times of the model reach a preset threshold value, and taking the image-text generating model obtained by the last training as a target image-text generating model.
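A minimal sketch of this iterative procedure, assuming helper callables for one round of training, for converting a pair through the current model, for building the N candidate pairs, and for scoring image-text similarity; the callable names and the preset threshold value are assumptions used only to show the control flow.

```python
from typing import Callable, List, Tuple

Pair = Tuple[object, object]  # (image, text) placeholder

def train_target_model(model,
                       sample_set: List[Pair],
                       train_one_round: Callable,                 # trains the model on the current sample pair set
                       convert_pair: Callable,                    # runs a pair through the model to get a new pair
                       build_candidates: Callable,                # forms the N candidate pairs from old and new pair
                       similarity: Callable[[Pair], float],       # image-text similarity, e.g. from the overall loss
                       preset_threshold: int = 5):
    """Iterative training with training-sample replacement, as described above."""
    for round_idx in range(preset_threshold):
        model = train_one_round(model, sample_set)                 # e.g. first -> second image-text model
        if round_idx + 1 == preset_threshold:
            break                                                  # model from the last round is the target model
        source_pair = sample_set[0]                                # e.g. the sixth training sample pair
        generated_pair = convert_pair(model, source_pair)          # e.g. the seventh training sample pair
        candidates = build_candidates(source_pair, generated_pair) # the N training sample pairs
        best_pair = max(candidates, key=similarity)                # highest image-text similarity
        sample_set = [best_pair] + sample_set[1:]                  # replacement yields the next sample pair set
    return model
```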
Optionally, in the embodiment of the present application, the processing module 701 is further configured to, after the first image-text generating model is trained based on the second training sample pair set to obtain the target image-text generating model, input a fourth image into the target image-text generating model for image-text conversion and output a fourth text, where the fourth text is used to describe the image content of the fourth image; or input a fifth text into the target image-text generating model for image-text conversion and output a fifth image, where the fifth image is generated based on the fifth text.
In the training device of the image-text generating model provided by the embodiment of the application, a first training sample pair in a first training sample pair set is input into a first image-text generating model, a second training sample pair is output, the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image; generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; replacing a first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity in M training sample pairs; and training the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model. In this way, one or more training samples in the training sample set are subjected to image-text conversion on the input image-text generation model to obtain new training sample pairs, then, based on the image-text similarity between texts and images in each training sample, the training sample pair set is continuously updated by using the training sample pairs with higher image-text similarity, so that the image-text generation model is trained based on the updated training sample pair set, the model training precision of the image-text generation model is improved, and the consistency of images and text contents generated by the image-text generation model is further improved.
The training device of the image-text generating model in the embodiment of the application may be an electronic device, or may be a component in the electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, mobile internet device (Mobile Internet Device, MID), augmented reality (augmented reality, AR)/virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., but may also be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and the embodiments of the present application are not limited in particular.
The training device of the image-text generating model in the embodiment of the application may be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiment of the present application is not limited specifically.
The training device for the image-text generating model provided by the embodiment of the application can realize each process realized by the method embodiments of fig. 1 to 5, and in order to avoid repetition, the description is omitted here.
Optionally, as shown in fig. 7, the embodiment of the present application further provides an electronic device 800, including a processor 801 and a memory 802, where the memory 802 stores a program or an instruction that can be executed on the processor 801, and the program or the instruction implements each step of the training method embodiment of the image-text generating model when executed by the processor 801, and the steps can achieve the same technical effect, so that repetition is avoided, and no further description is given here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, and processor 110.
Those skilled in the art will appreciate that the electronic device 100 may further include a power source (e.g., a battery) for powering the various components, and that the power source may be logically coupled to the processor 110 via a power management system, so as to implement functions such as charging management, discharging management and power consumption management through the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components, which are not described in detail herein.
The processor 110 is configured to input a first training sample pair in a first training sample pair set to a first image-text generating model, and output a second training sample pair, where the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair includes a first image and a first text for describing an image content of the first image, the second training sample pair includes a second image and a second text, the second image is an image obtained by performing image-text conversion on the first text, and the second text is a text obtained by performing image-text conversion on the first image; the processor 110 is further configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, where the M training sample pairs at least include the first training sample pair and the second training sample pair, and M is an integer greater than 1; the processor 110 is further configured to replace a first training sample pair in the first training sample pair set with a third training sample pair, to obtain a second training sample pair set, where the third training sample pair is a training sample pair with highest image-text similarity in M training sample pairs; the processor 110 is further configured to train the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model.
Optionally, in an embodiment of the present application, the graphics context generating model includes a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
the processor 110 is specifically configured to:
inputting a first training sample pair in the first training sample pair set into a first image-text generation model; inputting the first text into a text encoder model for feature extraction to obtain first text feature information of the first text; inputting the first image into an image encoder model for feature extraction to obtain first image feature information of the first image; inputting the first text feature information and the first image feature information into a diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information; inputting the second image characteristic information into an image decoder model for image decoding to obtain third image characteristic information, and outputting a second image based on the third image characteristic information; and inputting the first image characteristic information into a text decoder model for text prediction to obtain text prediction parameters, and outputting a second text based on the text prediction parameters.
Optionally, in an embodiment of the present application, the processor 110 is further configured to construct a first loss function based on the first text feature information and the first image feature information before generating M training sample pairs based on the first training sample pair and the second training sample pair; the processor 110 is further configured to construct a second loss function based on the second image feature information and the gaussian noise matrix; the processor 110 is further configured to construct a third loss function based on the text prediction parameter; the processor 110 is further configured to construct an overall loss function based on the first loss function, the second loss function, and the third loss function; the processor 110 is further configured to calculate the image-text similarity of each of the M training sample pairs by using the overall loss function after generating the M training sample pairs based on the first training sample pair and the second training sample pair.
Optionally, in an embodiment of the present application, the M training sample pairs include: a first training sample pair, a second training sample pair, a fourth training sample pair, and a fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text and the fifth training sample pair comprises the second image and the first text.
Optionally, in an embodiment of the present application, the processor 110 is specifically configured to:
training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model; under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into a second image-text generation model, and outputting a seventh training sample pair; generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1; replacing a sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is a training sample pair with highest image-text similarity in N training sample pairs; training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the training times of the model reach a preset threshold value, and taking the image-text generating model obtained by the last training as a target image-text generating model.
Optionally, in the embodiment of the present application, the processor 110 is further configured to, after the first image-text generating model is trained based on the second training sample pair set to obtain the target image-text generating model, input a fourth image into the target image-text generating model for image-text conversion and output a fourth text, where the fourth text is used to describe the image content of the fourth image; or input a fifth text into the target image-text generating model for image-text conversion and output a fifth image, where the fifth image is generated based on the fifth text.
In the electronic device provided by the embodiment of the application, a first training sample pair in a first training sample pair set is input into a first image-text generation model, a second training sample pair is output, the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image; generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; replacing a first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity in M training sample pairs; and training the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model. In this way, one or more training samples in the training sample set are subjected to image-text conversion on the input image-text generation model to obtain new training sample pairs, then, based on the image-text similarity between texts and images in each training sample, the training sample pair set is continuously updated by using the training sample pairs with higher image-text similarity, so that the image-text generation model is trained based on the updated training sample pair set, the model training precision of the image-text generation model is improved, and the consistency of images and text contents generated by the image-text generation model is further improved.
It should be appreciated that in embodiments of the present application, the input unit 104 may include a graphics processor (Graphics Processing Unit, GPU) 1041 and a microphone 1042, and the graphics processor 1041 processes image data of still pictures or videos obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts: a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a first memory area storing programs or instructions and a second memory area storing data, wherein the first memory area may store an operating system, and application programs or instructions (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like. Further, the memory 109 may include volatile memory or nonvolatile memory, or the memory 109 may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synch-Link DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). Memory 109 in embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, etc., and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The embodiment of the application also provides a readable storage medium, wherein the readable storage medium stores a program or an instruction, and the program or the instruction realizes each process of the training method embodiment of the image-text generation model when being executed by a processor, and can achieve the same technical effect, so that repetition is avoided and redundant description is omitted.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The embodiment of the application further provides a chip, the chip comprises a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running a program or instructions, the processes of the training method embodiment of the image-text generation model can be realized, the same technical effects can be achieved, and the repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-level chips, system chips, chip systems, or system-on-chip chips, etc.
An embodiment of the present application provides a computer program product, which is stored in a storage medium, and the program product is executed by at least one processor to implement the respective processes of the training method embodiment of the image-text generating model, and the same technical effects can be achieved, so that repetition is avoided, and a detailed description is omitted herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Enlightened by the present application, those of ordinary skill in the art can make many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (14)

1. A training method of an image-text generation model, characterized by comprising the following steps:
inputting a first training sample pair in a first training sample pair set into a first image-text generation model, outputting a second training sample pair, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image;
generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1;
replacing the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with highest image-text similarity in the M training sample pairs;
And training the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model.
2. The method of claim 1, wherein the first image-text generation model comprises a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
inputting a first training sample pair in a first training sample pair set into the first image-text generating model, and outputting a second training sample pair, wherein the method comprises the following steps:
inputting the first training sample pair into the first image-text generating model;
inputting the first text into the text encoder model for feature extraction to obtain first text feature information of the first text;
inputting the first image into the image encoder model for feature extraction to obtain first image feature information of the first image;
inputting the first text feature information and the first image feature information into the diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information;
inputting the second image characteristic information into the image decoder model for image decoding to obtain third image characteristic information, and outputting the second image based on the third image characteristic information;
And inputting the first image characteristic information into the text decoder model for text prediction to obtain text prediction parameters, and outputting the second text based on the text prediction parameters.
3. The method of claim 2, wherein prior to generating M training sample pairs based on the first training sample pair and the second training sample pair, the method further comprises:
constructing a first loss function based on the first text feature information and the first image feature information;
constructing a second loss function based on the second image characteristic information and a Gaussian noise matrix;
constructing a third loss function based on the text prediction parameters;
constructing an overall loss function based on the first, second, and third loss functions;
after generating M training sample pairs based on the first training sample pair and the second training sample pair, the method further includes:
and respectively calculating the image-text similarity of each training sample pair of the M training sample pairs by adopting the integral loss function.
4. The method of claim 1, wherein the M training sample pairs comprise: the first training sample pair, the second training sample pair, the fourth training sample pair, and the fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text, and the fifth training sample pair comprises the second image and the first text.
5. The method according to claim 1, wherein training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model comprises:
training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model;
under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into the second image-text generating model, and outputting a seventh training sample pair;
generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1;
replacing the sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is the training sample pair with highest image-text similarity in the N training sample pairs;
and training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the model training times reach the preset threshold value, and taking the image-text generating model obtained by the last training as the target image-text generating model.
6. The method according to claim 1, wherein after training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model, the method further comprises:
inputting a fourth image into the target image-text generation model for image-text conversion, and outputting a fourth text, wherein the fourth text is used for describing the image content of the fourth image;
or,
and inputting a fifth text into the target image-text generation model for image-text conversion, and outputting a fifth image, wherein the fifth image is generated based on the fifth text.
7. A training device for an image-text generation model, characterized in that the device comprises: a processing module, a generating module, a replacing module and a training module;
the processing module is used for inputting a first training sample pair in a first training sample pair set into a first image-text generating model and outputting a second training sample pair, the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image;
The generating module is configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, where the M training sample pairs at least include the first training sample pair and the second training sample pair, and M is an integer greater than 1;
the replacing module is configured to replace the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, where the third training sample pair is a training sample pair with highest image-text similarity in the M training sample pairs generated by the generating module;
the training module is used for training the first image-text generating model based on the second training sample pair set replaced by the replacing module to obtain a target image-text generating model.
8. The apparatus of claim 7, wherein the first image-text generation model comprises a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
the processing module is specifically configured to:
inputting a first training sample pair in the first training sample pair set to the first image-text generating model;
Inputting the first text into the text encoder model for feature extraction to obtain first text feature information of the first text;
inputting the first image into the image encoder model for feature extraction to obtain first image feature information of the first image;
inputting the first text feature information and the first image feature information into the diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information;
inputting the second image characteristic information into the image decoder model for image decoding to obtain third image characteristic information, and outputting the second image based on the third image characteristic information;
and inputting the first image characteristic information into the text decoder model for text prediction to obtain text prediction parameters, and outputting the second text based on the text prediction parameters.
9. The apparatus of claim 8, wherein the processing module is further configured to, prior to generating M training sample pairs based on the first training sample pair and the second training sample pair,
constructing a first loss function based on the first text feature information and the first image feature information;
The processing module is further used for constructing a second loss function based on the second image characteristic information and a Gaussian noise matrix;
the processing module is further used for constructing a third loss function based on the text prediction parameters;
the processing module is further configured to construct an overall loss function based on the first loss function, the second loss function, and the third loss function;
the processing module is further configured to calculate, according to the overall loss function, the image-text similarity of each of the M training sample pairs after generating the M training sample pairs based on the first training sample pair and the second training sample pair.
10. The apparatus of claim 7, wherein the M training sample pairs comprise: the first training sample pair, the second training sample pair, the fourth training sample pair, and the fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text, and the fifth training sample pair comprises the second image and the first text.
11. The device according to claim 7, wherein the training module is specifically configured to:
Training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model;
under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into the second image-text generating model, and outputting a seventh training sample pair;
generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1;
replacing the sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is the training sample pair with highest image-text similarity in the N training sample pairs;
and training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the model training times reach the preset threshold value, and taking the image-text generating model obtained by the last training as the target image-text generating model.
12. The apparatus of claim 7, wherein the processing module is further configured to:
after the first image-text generating model is trained based on the second training sample pair set to obtain a target image-text generating model, inputting a fourth image into the target image-text generating model for image-text conversion, and outputting a fourth text, wherein the fourth text is used for describing the image content of the fourth image; or, inputting a fifth text into the target image-text generation model for image-text conversion, and outputting a fifth image, wherein the fifth image is generated based on the fifth text.
13. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the steps of the training method of the image-text generation model according to any one of claims 1 to 6.
14. A readable storage medium, wherein a program or instruction is stored on the readable storage medium, and the program or instruction, when executed by a processor, implements the steps of the training method of the image-text generation model according to any one of claims 1 to 6.
CN202311101515.6A 2023-08-29 2023-08-29 Training method and device for image-text generation model, electronic equipment and medium Pending CN117094365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311101515.6A CN117094365A (en) 2023-08-29 2023-08-29 Training method and device for image-text generation model, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN117094365A true CN117094365A (en) 2023-11-21

Family

ID=88780072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311101515.6A Pending CN117094365A (en) 2023-08-29 2023-08-29 Training method and device for image-text generation model, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117094365A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407518A (en) * 2023-12-15 2024-01-16 广州市省信软件有限公司 Information screening display method and system based on big data analysis
CN117407518B (en) * 2023-12-15 2024-04-02 广州市省信软件有限公司 Information screening display method and system based on big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination