CN113590800A

CN113590800A - Training method and device of image generation model and image generation method and device

Info

Publication number: CN113590800A
Application number: CN202110966233.7A
Authority: CN
Inventors: 牛天睿; 冯方向; 王小捷; 袁彩霞
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2021-11-02
Anticipated expiration: 2041-08-23
Also published as: CN113590800B

Abstract

The present application discloses a training method and device for an image generation model, and an image generation method and device, wherein the method includes: acquiring dialogue sample data, the dialogue sample data including dialogue text data, standard images, image description text and dialogue total rounds Based on the dialogue sample data, the interactive incremental image generation model is trained by using the dialogue sample data and the pre-trained heterogeneous recurrent neural network encoder by means of random replay training, so that the interactive The incremental incremental image model can generate interactive incremental images based on human-machine dialogue text and image description text; in which, using all dialogue text data obtained at the final moment of the dialogue and all dialogue text data obtained at the middle moment of the dialogue, the the training. Adopting the present application is beneficial to reasonably realize the task of dialogue-to-image generation.

Description

Training method and device for image generation model and image generation method and device

技术领域technical field

本发明涉及人工智能技术，特别是涉及一种图像生成模型的训练方法和设备以及图像生成方法和设备。The present invention relates to artificial intelligence technology, in particular to an image generation model training method and device, and an image generation method and device.

背景技术Background technique

对话是人与人交流的最自然的方式，在文本到图像生成任务中以对话的形式控制机器产生图像，也是一种较为理想的交互方式。对话是一种自由的交互过程：对于一句话内难以述说清楚的事情，可以通过增补对话轮次来补全，且主题不限。Dialogue is the most natural way for people to communicate, and it is also an ideal way to control the machine to generate images in the form of dialogue in the text-to-image generation task. Dialogue is a free interactive process: things that are difficult to say clearly in a sentence can be completed by adding dialogue rounds, and the subject is not limited.

对于对话到图像生成任务而言，为了提高人机对话的智能性，需要机器在每一轮对话结束后都生成图像并显示，以作为对人的及时反馈，而不仅是在整个对话结束之后产生图像。For the task of dialogue-to-image generation, in order to improve the intelligence of human-machine dialogue, it is necessary for the machine to generate and display images after each round of dialogue as a timely feedback to people, not only after the entire dialogue is over. image.

发明人在实现本申请的过程中，通过研究分析发现：在每轮对话结束后，直接利用现有的文本到图像生成方法生成图像，无法合理地实现对话到图像生成任务，具体原因如下：In the process of realizing this application, the inventor found through research and analysis that: after each round of dialogue is over, directly using the existing text-to-image generation method to generate images cannot reasonably realize the dialogue-to-image generation task. The specific reasons are as follows:

由于对话过程中，人输入的信息会随着对话的进行而递增，相应地，为了确保对话到图像生成的合理性和智能化，对于每一轮对话后机器产生的每帧图像，其包含的信息也应是递增的，该特性称为图像的“递增性”。这样，在理想情况下，对话到图像生成过程应以一张空白画布开始，并在每一轮对话结束后，由机器递增式地增补信息。机器不应在对话刚开始时就画好了一张包含了大量物体的复杂的图像，以避免向人传达错误的反馈。每一轮对话后生成的图像，其内容应该“刚好涵盖当前对话过程的所有信息”，不多不少。同时，后面生成的图像在结构上不应有巨大的改变，以维持对话的连贯性。抽象地，“递增性”要求包含如下几个方面：During the dialogue process, the information input by the person will increase with the progress of the dialogue. Accordingly, in order to ensure the rationality and intelligence of dialogue-to-image generation, for each frame of image generated by the machine after each round of dialogue, it contains Information should also be incremental, a property known as the "incrementality" of the image. Thus, ideally, the dialogue-to-image generation process should start with a blank canvas, with information incrementally supplemented by the machine after each round of dialogue. Machines should not draw a complex image of a large number of objects at the beginning of a conversation to avoid giving false feedback to humans. The content of the image generated after each round of dialogue should "just cover all the information of the current dialogue process", no more or less. At the same time, the images generated later should not have huge changes in structure to maintain the coherence of the dialogue. Abstractly, the "incremental" requirement includes the following aspects:

物体数量递增性：物体数量随对话过程单调递增，并与对话实际涉及的物体数量相等，不可多于或少于对话涉及的物体数量。Increase in the number of objects: The number of objects increases monotonically with the dialogue process, and is equal to the number of objects actually involved in the dialogue, and cannot be more or less than the number of objects involved in the dialogue.

属性与关系递增性：物体的属性与关系随着对话过程逐步确定，且可被记忆：在对话后期产生的图像中，不可丢失对话前期确立的属性与关系。Incremental properties and relationships: The properties and relationships of objects are gradually determined with the dialogue process, and can be memorized: in the images generated later in the dialogue, the properties and relationships established in the early stage of the dialogue cannot be lost.

前后连贯性：对话中相邻轮次下生成的图像，在结构上应该是相似的。图像在对话过程中发生的巨大变动，是一种影响对话者体验的错误反馈。Front-to-back coherence: Images generated in adjacent turns in a dialogue should be similar in structure. The dramatic change of the image during the dialogue is a kind of false feedback that affects the experience of the interlocutor.

虽然对话过程信息量天然是递增的，但由此产生的图像内容未必如此。因为，在文本到图像生成任务的实现方案中，图像模态与文本模态是不对等的，图像中包含的信息远大于文本模态，文本只能控制图像中的小部分信息，而另一部分图像信息(即图像特定信息)是由机器随机生成的，这就为图像信息带来了不确定性。由于图像特定信息的不确定性，使得信息量较多的文本产生的图像所包含的信息量可能比信息量较少的文本产生图像的信息量更少。这样，在每轮对话结束后，基于当前已获取的对话文本生成图像，就无法保证图像中的信息随着对话轮次的增加而增加，从而无法上述对话到图像生成任务的“递增性”要求，进而无法合理地实现对话到图像生成任务。Although the amount of information in the dialogue process is naturally increasing, the resulting image content is not necessarily so. Because, in the implementation of the text-to-image generation task, the image modality is not equal to the text modality, the information contained in the image is much larger than the text modality, and the text can only control a small part of the information in the image, while the other part Image information (ie, image-specific information) is randomly generated by machines, which brings uncertainty to image information. Due to the uncertainty of the specific information of the image, the image generated by the text with more information may contain less information than the image generated by the text with less information. In this way, after each round of dialogue is over, an image is generated based on the currently acquired dialogue text, and there is no guarantee that the information in the image will increase with the increase of dialogue rounds, so that the above-mentioned "incremental" requirement of the dialogue-to-image generation task cannot be achieved. , which in turn cannot reasonably achieve the task of dialogue-to-image generation.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本发明的主要目的在于提供一种编码器和图像生成模型的训练方法及图像生成方法和装置，有利于合理地实现对话到图像生成任务。In view of this, the main purpose of the present invention is to provide a training method for an encoder and an image generation model, and an image generation method and apparatus, which are beneficial to reasonably realize the task of generating dialogue to image.

为了达到上述目的，本发明实施例提出的技术方案为：In order to achieve the above purpose, the technical solutions proposed in the embodiments of the present invention are:

一种交互式递增图像生成模型的训练方法，包括：A training method for an interactive incremental image generation model, comprising:

获取对话样本数据，所述对话样本数据包括对话文本数据、标准图像、图像描述文本和对话总轮数；Obtaining dialogue sample data, the dialogue sample data includes dialogue text data, standard images, image description texts and the total number of dialogue rounds;

采用随机重放训练的方式，利用所述对话样本数据和预先训练的异构循环神经网络编码器，对交互式递增图像生成模型进行训练，以使所述交互式递增图像模型能够基于人机对话文本和图像描述文本生成具有交互递增性的图像；其中，利用在对话的最终时刻获得的所有对话文本数据以及在对话的中间时刻获得的所有对话文本数据，进行所述训练。The interactive incremental image generation model is trained by means of random replay training, using the dialogue sample data and the pre-trained heterogeneous recurrent neural network encoder, so that the interactive incremental image model can be based on human-machine dialogue Text and image description text generate images with interactive incrementality; wherein the training is performed using all dialogue text data obtained at the end of the dialogue and all dialogue text data obtained at the middle of the dialogue.

较佳地，所述训练包括：Preferably, the training includes:

采用随机采样的方式，确定当前采用的对话轮数t，其中，2≤t≤T，T为所述对话总论数；Using random sampling, determine the current number of dialogue rounds t, where 2≤t≤T, and T is the total number of dialogues;

将所述图像描述文本和所述对话文本数据输入至所述异构循环神经网络编码器进行编码，并将编码最后输出的特征表示作为第一文本表示X′_T；将所述第一文本表示输入至交互式递增图像生成模型的图像生成器进行图像生成，得到第一图像Y′_T；Inputting the image description text and the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and using the feature representation finally output from the encoding as a first text representation X′ _T ; using the first text representation Input to the image generator of the interactive incremental image generation model for image generation to obtain the first image Y′ _T ;

基于所述第一文本表示X′_T和所述第一图像Y′_T，利用所述交互式递增图像生成模型的判别器，计算主对抗损失；利用所述主对抗损失，更新所述交互式递增图像生成模型的图像生成器和判别器的累计梯度；所述主对抗损失包括图像生成器和判别器的损失函数值；Based on the first text representation X' _T and the first image Y' _T , using the discriminator of the interactive incremental image generation model, a main adversarial loss is calculated; using the main adversarial loss, the interactive The cumulative gradient of the image generator and the discriminator of the incremental image generation model; the main adversarial loss includes the loss function value of the image generator and the discriminator;

将所述图像描述文本和所述对话文本数据中的前t轮对话数据输入所述异构循环神经网络编码器进行编码，并将编码最后输出的特征表示作为第二文本表示X′_t；将所述第二文本表示输入至所述图像生成器进行图像生成，得到第二图像Y′_t；Inputting the first t rounds of dialogue data in the image description text and the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and using the feature representation finally output from the encoding as the second text representation X′ _t ; The second text representation is input to the image generator for image generation to obtain a second image Y′ _t ;

将所述图像描述文本和所述对话文本数据中的前t-1轮对话数据输入所述异构循环神经网络编码器进行编码，并将编码最后输出的特征表示作为第三文本表示X′_t-1；将所述第三文本表示输入至所述图像生成器进行图像生成，得到第三图像Y′_t-1；Input the image description text and the first t-1 rounds of dialogue data in the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and take the feature representation finally output from the encoding as the third text representation _X't -1; input the third text representation to the image generator for image generation to obtain a third image Y′ _t -1;

基于所述第二文本表示X′_t和所述第二图像Y′_t，构造第一正例；constructing a first positive example based on the second text representation _X't and the second image _Y't ;

基于所述第三文本表示X′_t-1和所述第三图像Y′_t-1，构造第二正例；constructing a second positive example based on the third text representation _X't -1 and the third image _Y't -1;

基于所述第一正例，利用所述判别器，计算第一辅对抗损失；基于所述第一辅对抗损失，更新所述图像生成器和所述判别器的累计梯度；所述第一辅对抗损失包括图像生成器和判别器的损失函数值；Based on the first positive example, using the discriminator, calculate a first auxiliary adversarial loss; based on the first auxiliary adversarial loss, update the cumulative gradient of the image generator and the discriminator; the first auxiliary adversarial loss The adversarial loss includes the loss function values of the image generator and the discriminator;

基于所述第二正例，利用所述判别器，计算第二辅对抗损失；基于所述第二辅对抗损失，更新所述图像生成器和所述判别器的累计梯度；所述第二辅对抗损失包括图像生成器和判别器的损失函数值；Based on the second positive example, using the discriminator, calculate a second auxiliary adversarial loss; based on the second auxiliary adversarial loss, update the cumulative gradient of the image generator and the discriminator; the second auxiliary adversarial loss The adversarial loss includes the loss function values of the image generator and the discriminator;

基于当前所述图像生成器的累计梯度，更新所述图像生成器的参数；基于当前所述判别器的累计梯度，更新所述判别器的参数。Based on the current cumulative gradient of the image generator, the parameters of the image generator are updated; based on the current cumulative gradient of the discriminator, the parameters of the discriminator are updated.

较佳地，所述训练方法进一步包括：Preferably, the training method further includes:

基于所述第三文本表示X′_t-1和所述第二图像Y′_t，构造第一负例；constructing a first negative example based on the third text representation _X't -1 and the second image _Y't ;

基于所述第二文本表示X′_t和所述第三图像Y′_t-1，构造第二负例；constructing a second negative example based on the second text representation _X't and the third image Y't _-1 ;

所述计算第一辅对抗损失包括：The calculating the first auxiliary adversarial loss includes:

基于所述第一正例和所述第一负例，利用所述判别器，计算所述第一辅对抗损失；Based on the first positive example and the first negative example, using the discriminator, calculate the first auxiliary adversarial loss;

所述计算第二辅对抗损失包括：The calculating the second auxiliary adversarial loss includes:

基于所述第二正例和所述第二负例，利用所述判别器，计算所述第二辅对抗损失。Using the discriminator, the second auxiliary adversarial loss is calculated based on the second positive example and the second negative example.

较佳地，所述异构循环神经网络编码器的训练包括：Preferably, the training of the heterogeneous RNN encoder includes:

获取编码训练样本数据，所述编码训练样本数据包括标准样本图像的图像描述文本和视觉对话文本；Obtaining coding training sample data, the coding training sample data includes image description text and visual dialogue text of standard sample images;

利用所述编码训练样本数据对异构循环神经网络编码器进行训练，使所述异构循环神经网络编码器能够将输入数据中视觉对话文本中的指代关系与图像描述文本中的对应内容相关联。The heterogeneous RNN encoder is trained by using the encoded training sample data, so that the heterogeneous RNN encoder can correlate the referential relationship in the visual dialogue text in the input data with the corresponding content in the image description text link.

较佳地，所述异构循环神经网络编码器由第一循环神经网络编码器、第二循环神经网络编码器和第三循环神经网络编码器组成；Preferably, the heterogeneous RNN encoder is composed of a first RNN encoder, a second RNN encoder and a third RNN encoder;

所述利用所述编码训练样本数据对异构循环神经网络编码器进行训练包括：The training of the heterogeneous recurrent neural network encoder using the encoded training sample data includes:

利用所述第一循环神经网络编码器，以单词为基本的编码单元，对所述图像描述文本进行编码，将编码得到的每个初级单词特征表示，输出至所述第三循环神经网络编码器；利用所述第二循环神经网络编码器，以句子为基本的编码单元，对所述视觉对话文本进行编码，将编码得到的每个初级句子特征表示，输出至所述第三循环神经网络编码器；Using the first cyclic neural network encoder, the image description text is encoded with words as the basic coding unit, and the feature representation of each primary word obtained by encoding is output to the third cyclic neural network encoder. Utilize the second cyclic neural network encoder, take the sentence as the basic coding unit, encode the visual dialogue text, and express each primary sentence feature obtained by encoding, and output it to the third cyclic neural network encoding. device;

所述第三循环神经网络编码器基于所述初级单词特征表示和所述初级句子特征表示进行编码，并将最后输出的特征表示，作为所述视觉对话文本和所述图像描述文本相关联的全局编码表示；The third recurrent neural network encoder encodes based on the primary word feature representation and the primary sentence feature representation, and uses the final output feature representation as a global image associated with the visual dialogue text and the image description text. encoded representation;

基于所述第三循环神经网络编码器在所述编码过程中输出的所有特征表示，采用深度注意力多模态相似度模型DAMSM损失函数，对所述第一循环神经网络编码器、第二循环神经网络编码器和所述第三循环神经网络编码器的权重参数进行调整。Based on all the feature representations output by the third recurrent neural network encoder in the encoding process, using the DAMSM loss function of the deep attention multimodal similarity model, the first recurrent neural network encoder, the second recurrent neural network encoder and the second recurrent neural network encoder are used. The weight parameters of the neural network encoder and the third recurrent neural network encoder are adjusted.

基于上述模型训练方法，本发明实施例还公开了一种交互式递增图像生成方法，包括：Based on the above model training method, an embodiment of the present invention further discloses an interactive incremental image generation method, including:

在视觉对话过程中，在每轮人机对话结束时，将已产生的所有轮人机对话文本和预设的图像描述文本，输入至预先训练的交互式递增图像生成模型进行图像生成，并显示所生成的图像；In the process of visual dialogue, at the end of each round of human-machine dialogue, input all the generated human-machine dialogue texts and preset image description texts into the pre-trained interactive incremental image generation model for image generation and display. the generated image;

其中，所述交互式递增图像生成模型基于如上所述的训练方法得到。Wherein, the interactive incremental image generation model is obtained based on the above-mentioned training method.

本发明实施例还公开了一种交互式递增图像生成模型的训练设备，包括：处理器，所述处理器用于：The embodiment of the present invention also discloses a training device for an interactive incremental image generation model, comprising: a processor, where the processor is used for:

本发明实施例还公开了一种非易失性计算机可读存储介质，所述非易失性计算机可读存储介质存储指令，所述指令在由处理器执行时使得所述处理器执行如上所述的任一训练方法的步骤。The embodiment of the present invention also discloses a non-volatile computer-readable storage medium, where the non-volatile computer-readable storage medium stores instructions, and when executed by a processor, the instructions cause the processor to execute the above-mentioned steps for any of the training methods described.

本发明实施例还公开了一种交互式递增图像生成设备，包括：处理器，所述处理器用于：The embodiment of the present invention also discloses an interactive incremental image generation device, comprising: a processor, and the processor is used for:

用于在视觉对话过程中，在每轮人机对话结束时，将已产生的所有轮人机对话文本和预设的图像描述文本，输入至预先训练的交互式递增图像生成模型进行图像生成，并显示所生成的图像；In the process of visual dialogue, at the end of each round of human-machine dialogue, input all the generated rounds of human-machine dialogue text and preset image description text into the pre-trained interactive incremental image generation model for image generation, and display the resulting image;

本发明实施例还公开了一种非易失性计算机可读存储介质，所述非易失性计算机可读存储介质存储指令，所述指令在由处理器执行时使得所述处理器执行如上所述的交互式递增图像生成方法的步骤。The embodiment of the present invention also discloses a non-volatile computer-readable storage medium, where the non-volatile computer-readable storage medium stores instructions, and when executed by a processor, the instructions cause the processor to execute the above-mentioned Steps of the described interactive incremental image generation method.

综上所述，本发明实施例提出的上述技术方案，通过采用随机重放训练的方式，对交互式递增图像生成模型进行训练，并且在进行训练时，不仅利用在对话的最终时刻获得的所有对话文本数据，还利用在对话的中间时刻获得的所有对话文本数据，进行所述训练。如此，通过在模型训练时加入中间时刻的文本，使得模型在训练过程中也能对中间时刻的对话文本有所感知，从而可以强化模型的交互递增性。另外，通过在模型训练时引入异构循环神经网络编码器，可以将对话文本数据与图像描述文本进行关联，使得编码后的文本特征向量能够准确表征用户语言描述的图像特征，从而有利于随机重放训练方法能够准确捕捉到交互过程中的递增性，进而可以使得训练后的模型能够生成具有交互递增性的图像。因此，采用本发明实施例的技术方案，可以合理地实现对话到图像生成任务。To sum up, the above technical solutions proposed by the embodiments of the present invention train the interactive incremental image generation model by adopting random replay training, and during training, not only use all the data obtained at the final moment of the dialogue For dialogue text data, the training is also performed using all dialogue text data obtained in the middle of the dialogue. In this way, by adding the text at the intermediate moment during model training, the model can also perceive the dialogue text at the intermediate moment during the training process, thereby enhancing the interactive incrementality of the model. In addition, by introducing a heterogeneous recurrent neural network encoder during model training, the dialogue text data can be associated with the image description text, so that the encoded text feature vector can accurately represent the image features described by the user's language, which is conducive to random reconstruction. The release training method can accurately capture the incrementality in the interaction process, which in turn enables the trained model to generate images with interactive incrementality. Therefore, by adopting the technical solutions of the embodiments of the present invention, the task of generating dialogues to images can be reasonably realized.

附图说明Description of drawings

图1为本发明实施例的方法流程示意图；1 is a schematic flowchart of a method according to an embodiment of the present invention;

图2为本发明实施例的异构循环神经网络编码器结构示意图。FIG. 2 is a schematic structural diagram of a heterogeneous cyclic neural network encoder according to an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面将结合附图及具体实施例对本发明作进一步地详细描述。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

图1为本发明实施例的交互式递增图像生成模型的训练方法流程示意图，如图1所示，该实施例主要包括：FIG. 1 is a schematic flowchart of a training method for an interactive incremental image generation model according to an embodiment of the present invention. As shown in FIG. 1 , the embodiment mainly includes:

步骤101、获取对话样本数据，所述对话样本数据包括对话文本数据、标准图像、图像描述文本和对话总轮数。Step 101: Acquire dialogue sample data, where the dialogue sample data includes dialogue text data, standard images, image description texts, and the total number of dialogue rounds.

步骤102、采用随机重放训练的方式，利用所述对话样本数据和预先训练的异构循环神经网络编码器，对交互式递增图像生成模型进行训练，以使所述交互式递增图像模型能够基于人机对话文本和图像描述文本生成具有交互递增性的图像；其中，利用在对话的最终时刻获得的所有对话文本数据以及在对话的中间时刻获得的所有对话文本数据，进行所述训练。Step 102, using the random replay training mode, using the dialogue sample data and the pre-trained heterogeneous recurrent neural network encoder to train the interactive incremental image generation model, so that the interactive incremental image model can be based on Human-machine dialogue text and image description text generate interactive incremental images; wherein the training is performed using all dialogue text data obtained at the end of the dialogue and all dialogue text data obtained in the middle of the dialogue.

本步骤中，为了训练模型能够生成具有交互式递增性的图像，采用了随机重放训练方式，进行模型训练，并且在不仅利用在对话的最终时刻获得的所有对话文本数据进行训练模型，还需要利用对话中间时刻获得的所有对话文本数据进行训练模型，以使模型在训练过程中也对中间时刻的文本有所感知，从而可以提高模型对图像递增性的实现能力。In this step, in order to train the model to generate interactive incremental images, the random replay training method is used to train the model, and not only use all the dialogue text data obtained at the final moment of the dialogue to train the model, but also need to Use all the dialogue text data obtained in the middle of the dialogue to train the model, so that the model can also perceive the text at the middle moment during the training process, so that the ability of the model to realize the incrementality of images can be improved.

这里需要说明的是，在现有基于对话生成图像的模型训练算法中，采用的完整对话过程的全部对话文本数据和对话最后时刻生成的图像作为一对训练样例，对模型进行训练。这种训练的目标是确保在对话最后时刻生成的图像的准确性。由于模型在训练过程中观测到的是全部对话文本与最终图像，并没有见过中间时刻的对话与图像样本，因此，模型基于中间时刻文本生成的图像可能是不正确的。这样，如果将这种以对话最终图像为训练目标的模型，应用于需要为每轮对话生成图像的任务，则不仅需要生成对话最终时刻的图像，还需要生成对话中间时刻的图像。而由于这种模型的训练目标是在对话最后时刻生成的最终图像的准确性，而非在对话中间过程生成图像的准确性，即模型的训练和应用目标不一致，使得采用现有方法所训练的图像生成模型无法满足交互递增性要求(即，在对话过程中，每次生成的图像信息能够随着交互轮次递增)。针对该问题，在本步骤中，通过采用随机重放训练方法进行模型训练，在训练时，不仅使用对话过程中最后时刻已获取的全部对话文本和最终图像去训练模型，还使用在训练中间时刻已获取的全部对话文本，使得模型在训练过程中也需要对中间时刻的文本有所感知，从而可以减小模型训练和应用的差异，强化了图像生成的递增性。It should be noted here that in the existing model training algorithm based on dialogue generation images, all dialogue text data of the complete dialogue process and the images generated at the last moment of the dialogue are used as a pair of training samples to train the model. The goal of this training is to ensure the accuracy of images generated at the end of the conversation. Since the model observes all the dialogue text and the final image during training, and has not seen the dialogue and image samples at the intermediate moments, the images generated by the model based on the text at the intermediate moments may be incorrect. In this way, if such a model with the final dialogue image as the training target is applied to a task that requires generating images for each round of dialogue, it is necessary to generate not only images at the end of the dialogue, but also images at the middle of the dialogue. Since the training goal of this model is the accuracy of the final image generated at the last moment of the dialogue, rather than the accuracy of the image generated in the middle of the dialogue, that is, the training and application goals of the model are inconsistent, so the existing methods trained The image generation model cannot satisfy the requirement of incrementality of interaction (that is, in the process of dialogue, the image information generated each time can increase with the interaction round). In view of this problem, in this step, the random replay training method is used for model training. During training, not only all the dialogue texts and final images obtained at the last moment of the dialogue process are used to train the model, but also the middle moment of training is used to train the model. All the dialogue texts that have been acquired make the model also need to perceive the text in the middle during the training process, which can reduce the difference between model training and application, and strengthen the incrementality of image generation.

另外，在步骤102中，在模型训练时引入了异构循环神经网络编码器，以利用该异构循环神经网络编码器，对对话文本数据与图像描述文本进行编码，使得编码后的文本特征向量能够准确表征用户语言描述的图像特征，从而有利于随机重放训练方法能够基于输入的图像文本特征，准确捕捉到交互过程中的递增性，进而可以使得训练后的模型能够生成具有交互递增性的图像。In addition, in step 102, a heterogeneous cyclic neural network encoder is introduced during model training, so as to use the heterogeneous cyclic neural network encoder to encode the dialogue text data and the image description text, so that the encoded text feature vector It can accurately represent the image features described by the user language, which is beneficial to the random replay training method. Based on the input image and text features, the incrementality in the interaction process can be accurately captured, so that the trained model can generate interactive incrementality. image.

在一种实施方式中，步骤102中具体可以采用下述方法，对交互式递增图像生成模型进行训练：In one embodiment, in step 102, the following method can be used to train the interactive incremental image generation model:

步骤1021、采用随机采样的方式，确定当前采用的对话轮数t。Step 1021: Determine the number of dialogue rounds t currently used by random sampling.

其中，2≤t≤T，T为对话样本数据中的对话总论数。Among them, 2≤t≤T, T is the total number of dialogues in the dialogue sample data.

本步骤用于确定随机采样的对话中间时刻，以便后续可以利用随机截取的中间时刻的对话数据进行训练。This step is used to determine the randomly sampled dialogue middle moment, so that the dialogue data at the randomly intercepted middle moment can be used for training subsequently.

步骤1022、将所述图像描述文本和所述对话文本数据输入至所述异构循环神经网络编码器进行编码，并将编码最后输出的特征表示作为第一文本表示X′_T；将所述第一文本表示输入至交互式递增图像生成模型的图像生成器进行图像生成，得到第一图像Y′_T。Step 1022: Input the image description text and the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and use the feature representation finally output from the encoding as the first text representation X'_T; A text representation is input to the image generator of the interactive incremental image generation model for image generation to obtain the first image Y' _T .

步骤1023、基于所述第一文本表示X′_T和所述第一图像Y′_T，利用所述交互式递增图像生成模型的判别器，计算主对抗损失；利用所述主对抗损失，更新所述交互式递增图像生成模型的图像生成器和判别器的累计梯度；所述主对抗损失包括图像生成器和判别器的损失函数值。Step 1023: Based on the first text representation X' _T and the first image Y' _T , use the discriminator of the interactive incremental image generation model to calculate the main adversarial loss; use the main adversarial loss to update the The cumulative gradient of the image generator and the discriminator of the interactive incremental image generation model; the main adversarial loss includes the loss function value of the image generator and the discriminator.

较佳地，本步骤中可以按照下述公式1，计算图像生成器的主对抗损失

Preferably, in this step, the main adversarial loss of the image generator can be calculated according to the following formula 1:

其中，

表示从图像生成器G中得到的图像Y _T′，D( )表示判别器输出的概率值。

表示利用图像生成器G生成的图像Y′_T计算非条件对抗损失函数；

表示利用图像生成器G生成的图像Y′_T和相应文本表示X′_T计算条件对抗损失函数；in,

represents the image Y _T ′ obtained from the image generator G, and D( ) represents the probability value output by the discriminator.

Indicates that the unconditional adversarial loss function is calculated using the image Y′ _T generated by the image generator G;

Represents the image Y′ _T generated by the image generator G and the corresponding text representation X′ _T to calculate the conditional adversarial loss function;

本步骤中，关于判别器的损失函数值的计算，可以采用与现有AttnGAN模型采用的方法实现，在此不再赘述。In this step, the calculation of the loss function value of the discriminator can be implemented by the method adopted by the existing AttnGAN model, which will not be repeated here.

步骤1024、将所述图像描述文本和所述对话文本数据中的前t轮对话数据输入所述异构循环神经网络编码器进行编码，并将编码最后输出的特征表示作为第二文本表示X′_t；将所述第二文本表示输入至所述图像生成器进行图像生成，得到第二图像Y′_t；将所述图像描述文本和所述对话文本数据中的前t-1轮对话数据输入所述异构循环神经网络编码器进行编码，并将编码最后输出的特征表示作为第三文本表示X′_t-1；将所述第三文本表示输入至所述图像生成器进行图像生成，得到第三图像Y′_t-1。Step 1024: Input the image description text and the first t rounds of dialogue data in the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and use the feature representation finally output from the encoding as the second text representation X'_t; inputting the second text representation to the image generator for image generation to obtain a second image Y′ _t ; inputting the first t-1 rounds of dialogue data in the image description text and the dialogue text data The heterogeneous recurrent neural network encoder performs encoding, and uses the feature representation finally output by the encoding as a third text representation X′ _t-1 ; inputting the third text representation into the image generator for image generation, obtaining The third image Y' _t-1 .

这里，在步骤1024，需要利用步骤1021随机采样的对话轮数t，随机截断对话过程，构造伪样本(即前t轮对话数据、前t-1轮对话数据)，以模拟对话过程中间时刻的输入信息，从而可以让模型捕获到数据中的“递增性”。Here, in step 1024, it is necessary to use the number of dialogue rounds t randomly sampled in step 1021 to randomly truncate the dialogue process to construct pseudo samples (that is, the dialogue data for the first t rounds and the dialogue data for the first t-1 rounds) to simulate the middle moment of the dialogue process. Enter information so that the model can capture the "incrementality" of the data.

步骤1025、基于所述第二文本表示X′_t和所述第二图像Y′_t，构造第一正例；基于所述第三文本表示X′_t-1和所述第三图像Y′_t-1，构造第二正例。Step 1025: Construct a first positive example based on the second text representation X' _t and the second image Y'_t; based on the third text representation X' _t-1 and the third image Y' _{t -1} , construct the second positive example.

进一步地，为了提高训练的准确性，本步骤中，还可以基于所述第二文本表示X′_t、所述第二图像Y′_t、所述第三文本表示X′_t-1和所述第三图像Y′_t-1，构造负例，具体如下：Further, in order to improve the accuracy of training, in this step, the second text representation X′ _t , the second image Y′ _t , the third text representation X′ _t-1 and the For the third image Y′ _t-1 , a negative example is constructed, as follows:

基于所述第三文本表示X′_t-1和所述第二图像Y′_t，构造第一负例；constructing a first negative example based on the third text representation _X't-1 and the second image _Y't ;

基于所述第二文本表示X′_t和所述第三图像Y′_t-1，构造第二负例。A second negative example is constructed based on the second textual representation _X't and the third image Y't _-1 .

步骤1026、基于所述第一正例，利用所述判别器，计算第一辅对抗损失；基于所述第一辅对抗损失，更新所述图像生成器和所述判别器的累计梯度；所述第一辅对抗损失包括图像生成器和判别器的损失函数值；基于所述第二正例，利用所述判别器，计算第二辅对抗损失；基于所述第二辅对抗损失，更新所述图像生成器和所述判别器的累计梯度；所述第二辅对抗损失包括图像生成器和判别器的损失函数值。Step 1026: Based on the first positive example, use the discriminator to calculate a first auxiliary adversarial loss; based on the first auxiliary adversarial loss, update the cumulative gradient of the image generator and the discriminator; the The first auxiliary adversarial loss includes loss function values of the image generator and the discriminator; based on the second positive example, the discriminator is used to calculate the second auxiliary adversarial loss; based on the second auxiliary adversarial loss, the second auxiliary adversarial loss is updated. The accumulated gradients of the image generator and the discriminator; the second auxiliary adversarial loss includes the loss function values of the image generator and the discriminator.

本步骤，具体可以按照公式

计算第一辅对抗损失中图像生成器损失函数值

按照

计算第二辅对抗损失中图像生成器损失函数值

In this step, the specific can be according to the formula

Calculate the value of the image generator loss function in the first auxiliary adversarial loss

according to

Calculate the image generator loss function value in the second auxiliary adversarial loss

进一步地，如果步骤1025中构造了所述第一负例和第二负例，则步骤1026中可以分别采用下述方法，计算所述第一辅对抗损失和所述第二辅对抗损失：Further, if the first negative example and the second negative example are constructed in step 1025, the following methods may be used in step 1026 to calculate the first auxiliary adversarial loss and the second auxiliary adversarial loss:

基于所述第一正例和所述第一负例，利用所述判别器，计算所述第一辅对抗损失。Using the discriminator, the first auxiliary adversarial loss is calculated based on the first positive example and the first negative example.

其中，第一辅对抗损失中的图像生成器的损失函数值

具体可以按照下述公式计算得到：Among them, the loss function value of the image generator in the first auxiliary adversarial loss

Specifically, it can be calculated according to the following formula:

其中，第一对抗损失中的图像生成器的损失函数值

具体可以按照下述公式计算得到：Among them, the loss function value of the image generator in the first adversarial loss

Specifically, it can be calculated according to the following formula:

在步骤1026中，得到第一辅对抗损失和第二辅对抗损失之后，可以按照公式

得到图像生成器的总损失函数；其中，w_RR为辅对抗损失函数的权重。In step 1026, after obtaining the first auxiliary adversarial loss and the second auxiliary adversarial loss, you can follow the formula

Obtain the total loss function of the image generator; where w _RR is the weight of the auxiliary adversarial loss function.

步骤1027、基于当前所述图像生成器的累计梯度，更新所述图像生成器的参数；基于当前所述判别器的累计梯度，更新所述判别器的参数。Step 1027: Update the parameters of the image generator based on the current cumulative gradient of the image generator; update the parameters of the discriminator based on the current cumulative gradient of the discriminator.

本步骤的具体实现为本领域技术人员所掌握，在此不再赘述。The specific implementation of this step is mastered by those skilled in the art, and details are not repeated here.

在一种实施方式中，为了使得编码器能够捕捉到更全面、更完整的图像文本特征信息，预先可以采用下述方法训练所述异构循环神经网络编码器：In one embodiment, in order to enable the encoder to capture more comprehensive and complete image and text feature information, the following method can be used to train the heterogeneous recurrent neural network encoder in advance:

步骤x1、获取编码训练样本数据，所述编码训练样本数据包括标准样本图像的图像描述文本和视觉对话文本。Step x1: Obtain coding training sample data, where the coding training sample data includes image description text and visual dialogue text of standard sample images.

步骤x2、利用所述编码训练样本数据对异构循环神经网络编码器进行训练，使所述异构循环神经网络编码器能够将输入数据中视觉对话文本中的指代关系与图像描述文本中的对应内容相关联。Step x2, using the coding training sample data to train the heterogeneous RNN encoder, so that the heterogeneous RNN encoder can compare the reference relationship in the visual dialogue text in the input data with the reference relationship in the image description text. Corresponding content is associated.

考虑到在视觉对话数据中，文本的语义也是不对称的。图像描述句说明了图像的主体内容，而对话文本是作为图像描述句的补充，提供了主体内容以外的附加信息。为此，在获取文本描述与对话文本的特征表示时，可以将二者区分对待，以提高文本特征表示的完整性、准确性。具体而言，图像描述句中的每个词都至关重要，因此，对于图像描述句应该在“词”的层面上进行建模；而对话文本中的信息相对稀少、冗余，故可将一轮对话中一轮对问答对应的句子和图像描述句中的一个词视作对等，在“句子”的层面上进行建模。同时，将图像描述视为视觉对话过程的开始，用同一个编码器同时对两种数据进行建模。基于此，可以设计如图2所示的异构循环神经网络编码器的结构。Considering that in visual dialogue data, the semantics of text is also asymmetric. The image description sentence explains the main content of the image, and the dialogue text is the supplement of the image description sentence, which provides additional information beyond the main content. For this reason, when obtaining the feature representation of text description and dialogue text, they can be treated differently to improve the completeness and accuracy of text feature representation. Specifically, each word in the image description sentence is crucial, so the image description sentence should be modeled at the level of "words"; and the information in the dialogue text is relatively sparse and redundant, so it can be In a round of dialogue, the sentence corresponding to the question and answer and a word in the image description sentence are regarded as equivalent, and the modeling is carried out at the "sentence" level. At the same time, the image description is regarded as the beginning of the visual dialogue process, and the same encoder is used to model both kinds of data simultaneously. Based on this, the structure of the heterogeneous RNN encoder as shown in Figure 2 can be designed.

如图2所示，在一种实施方式中，所述异构循环神经网络编码器由三个RNN层组成，具体为:第一循环神经网络编码器(即图中的文本RNN编码器)、第二循环神经网络编码器(即图中的对话RNN编码器)和第三循环神经网络编码器(即图中的融合RNN编码器)。较佳地，这三个RNN层在模型结构上均为双向门控循环单元(Bi-GRU)。为简洁起见，图中只画出了双向GRU的正向计算路径。As shown in Figure 2, in one embodiment, the heterogeneous cyclic neural network encoder consists of three RNN layers, specifically: the first cyclic neural network encoder (that is, the text RNN encoder in the figure), The second RNN encoder (i.e. the dialog RNN encoder in the figure) and the third RNN encoder (i.e. the fused RNN encoder in the figure). Preferably, the three RNN layers are all bidirectional gated recurrent units (Bi-GRU) in model structure. For the sake of brevity, only the forward computation path of the bidirectional GRU is drawn in the figure.

其中，文本RNN编码器接受图像描述的每个词的词向量为输入，视每个词输入为一个时刻，并在每个时刻输出一个特征表示，该特征表示可视作整个图像描述句中所有单词信息的简单融合，称为初级单词特征表示。设文本描述中单词数量为M₁，则文本RNN编码器可产生M₁个初级单词特征表示。Among them, the text RNN encoder accepts the word vector of each word described by the image as input, regards each word input as a moment, and outputs a feature representation at each moment, which can be regarded as the entire image description sentence. A simple fusion of word information, called primary word feature representation. Assuming that the number of words in the text description is M ₁ , the text RNN encoder can generate M ₁ primary word feature representations.

对话RNN编码器接受对话中的每个词的词向量为输入(对话的问句与答句拼接起来视作一句话)，但输出时只保留最后一个时刻的输出。最后时刻输出的特征表示编码了整轮对话文本的信息，定义为这一轮对话的初级句子特征表示。对话过程的每一轮文本都按照此处理方式输入对话RNN编码器，可视为整个对话过程共享同一个编码器模型权重。The dialogue RNN encoder accepts the word vector of each word in the dialogue as input (the question and answer sentences of the dialogue are spliced together as one sentence), but only the output of the last moment is retained when outputting. The feature representation output at the last moment encodes the information of the whole round of dialogue text, which is defined as the primary sentence feature representation of this round of dialogue. Each round of text in the dialogue process is input into the dialogue RNN encoder according to this processing method, which can be regarded as sharing the same encoder model weights throughout the dialogue process.

如图2所示，以双向的GRU的正向计算路径为例，异构循环神经网络编码器的编码过程为：As shown in Figure 2, taking the forward calculation path of the bidirectional GRU as an example, the encoding process of the heterogeneous recurrent neural network encoder is as follows:

假设对话一共有M₂轮，则对话RNN编码器可产生M₂个初级句子特征表示。在对话RNN编码器中，每一轮对话文本之间是独立的，即在编码每一轮对话文本时均使用随机初始化的隐状态。这样，可以使模型对所有对话轮次并行处理，提高训练与测试速度。文本RNN编码器和对话RNN编码器均视为整个异构循环神经网络编码器的第一层，可为图像描述以及对话文本产生低级特征表示。上层的融合RNN编码器对下层编码器的输出做进一步的处理：在前M₁时刻，以文本RNN编码器的输出为输入，产生高级单词特征表示；在后面的M₂时刻，以对话RNN编码器的输出为输入，产生高级句子特征表示。融合RNN编码器的Bi-GRU结构，使其在每个时刻均可观测到当前对话过程的所有信息。它一方面建立了对话中每一轮对话文本之间的关联，有利于处理对话过程的指代消解等跨轮次问题，抵消了下层对话RNN编码器中独立处理每一轮文本导致的过度简化；另一方面建立了对话与文本描述之间的关联，可以更好地按需融合二者的信息，产生更有表达能力的文本特征。Assuming a total of _M2 rounds of dialogue, the dialogue RNN encoder can generate _M2 primary sentence feature representations. In the dialogue RNN encoder, each round of dialogue text is independent, that is, a randomly initialized hidden state is used when encoding each round of dialogue text. In this way, the model can be processed in parallel for all dialogue turns, speeding up training and testing. Both the text RNN encoder and the dialogue RNN encoder are regarded as the first layer of the whole heterogeneous RNN encoder, which can produce low-level feature representations for image descriptions as well as dialogue texts. The upper-layer fused RNN encoder further processes the output of the lower-layer encoder: at the first M ₁ moment, the output of the text RNN encoder is used as input to generate high-level word feature representation; at the later M ₂ moment, the dialogue RNN encodes The output of the generator is the input, which produces high-level sentence feature representations. The Bi-GRU structure of the RNN encoder is fused so that it can observe all the information of the current dialogue process at every moment. On the one hand, it establishes the association between the texts of each round of dialogue in the dialogue, which is conducive to dealing with cross-round problems such as referential resolution in the dialogue process, and offsets the oversimplification caused by the independent processing of each round of text in the lower dialogue RNN encoder. On the other hand, the association between dialogue and text description is established, which can better integrate the information of the two as needed, and generate more expressive text features.

基于上述，在一种实施方式中，具体可以采用下述方法，利用所述编码训练样本数据对异构循环神经网络编码器进行训练：Based on the above, in one embodiment, the following method can be specifically adopted to train the heterogeneous RNN encoder by using the encoded training sample data:

步骤x21、利用所述第一循环神经网络编码器，以单词为基本的编码单元，对所述图像描述文本进行编码，将编码得到的每个初级单词特征表示，输出至所述第三循环神经网络编码器；利用所述第二循环神经网络编码器，以句子为基本的编码单元，对所述视觉对话文本进行编码，将编码得到的每个初级句子特征表示，输出至所述第三循环神经网络编码器。Step x21, use the first cyclic neural network encoder to encode the image description text with words as the basic coding unit, and output the feature representation of each primary word obtained by encoding to the third cyclic neural network. network encoder; using the second recurrent neural network encoder to encode the visual dialogue text with sentences as the basic coding unit, and output the feature representation of each primary sentence obtained by encoding to the third loop Neural Network Encoder.

步骤x22、所述第三循环神经网络编码器基于所述初级单词特征表示和所述初级句子特征表示进行编码，并将最后输出的特征表示，作为所述视觉对话文本和所述图像描述文本相关联的全局编码表示。Step x22, the third recurrent neural network encoder encodes based on the primary word feature representation and the primary sentence feature representation, and uses the final output feature representation as a correlation between the visual dialogue text and the image description text. Globally encoded representation of the link.

本步骤获得的所述全局编码表示，即为异构循环神经网络编码器的最终编码结果。The global encoding representation obtained in this step is the final encoding result of the heterogeneous recurrent neural network encoder.

步骤x23、基于所述第三循环神经网络编码器在所述编码过程中输出的所有特征表示，采用深度注意力多模态相似度模型(DAMSM)损失函数，对所述第一循环神经网络编码器、第二循环神经网络编码器和所述第三循环神经网络编码器的权重参数进行调整。Step x23, based on all the feature representations output by the third recurrent neural network encoder in the encoding process, using a deep attention multimodal similarity model (DAMSM) loss function to encode the first recurrent neural network. The weight parameters of the encoder, the second RNN encoder and the third RNN encoder are adjusted.

从上述编码器训练方法可以看出，利用基于上述方法训练得到的异构循环神经网络编码器，来生成每轮对话结束后用于生成图像的文本，一方面可以融合当前对话时刻已获得的所有对话信息，建立了已完成对话轮次中各轮对话文本之间的关联，有利于处理对话过程中的指代消解等跨轮次问题，从而有利于满足图像的递增性要求，另一方面建立了对话文本与图像描述文本之间的关联，更好地融合二者的信息，产生了表达力更强、更准确的文本特征。因此，采用利用上述方法获得的异构循环神经网络编码器，有利于合理地实现对话到图像生成任务。It can be seen from the above encoder training method that the heterogeneous recurrent neural network encoder trained based on the above method is used to generate the text used to generate images after each round of dialogue. The dialogue information establishes the association between the dialogue texts in the completed dialogue rounds, which is conducive to dealing with cross-round problems such as referential resolution in the dialogue process, thereby helping to meet the incremental requirements of images. The relationship between the dialogue text and the image description text is better integrated, resulting in more expressive and accurate text features. Therefore, adopting the heterogeneous RNN encoder obtained by the above method is beneficial to reasonably implement the dialogue-to-image generation task.

通过上述交互式递增图像生成模型的训练方法实施例可以看出，上述实施例中，在进行模型训练时，利用中间时刻的对话文本，并结合异构循环神经网络编码器，可以使得训练后的模型能够生成具有交互递增性的图像，从而有利于合理地实现对话到图像生成任务。It can be seen from the above-mentioned embodiment of the training method for the interactive incremental image generation model that in the above-mentioned embodiment, during model training, the dialogue text at the middle moment is used, combined with the heterogeneous recurrent neural network encoder, the trained model can be The model is able to generate images with interactive incrementality, which facilitates the rational implementation of dialogue-to-image generation tasks.

基于上述模型训练方法，本发明实施例还提供了一种交互式递增图像生成方法，包括：Based on the above model training method, an embodiment of the present invention also provides an interactive incremental image generation method, including:

这里，由于所采用的交互式递增图像生成模型是基于上述模型训练方法得到的，而在进行模型训练时考虑了模型对中间时刻的对话文本的感知能力，强化了模型的交互递增性，因此，可以提高对话过程中所生成图像的合理性。Here, since the interactive incremental image generation model used is obtained based on the above model training method, and the model's ability to perceive the dialogue text at the middle moment is considered during model training, the interactive incrementality of the model is strengthened. Therefore, The plausibility of the images generated during the dialogue can be improved.

基于上述模型训练方法实施例，本发明实施例还提供了一种交互式递增图像生成模型的训练设备，包括：处理器，所述处理器用于：Based on the above model training method embodiment, the embodiment of the present invention further provides a training device for an interactive incremental image generation model, including: a processor, where the processor is configured to:

基于上述交互式递增图像生成方法实施例，本发明实施例还公开了一种交互式递增图像生成设备，包括：处理器，所述处理器用于：Based on the above-mentioned embodiment of the interactive incremental image generation method, the embodiment of the present invention further discloses an interactive incremental image generation device, comprising: a processor, where the processor is configured to:

基于交互式递增图像生成模型的训练方法实施例，本申请实施例还实现了一种交互式递增图像生成模型的训练电子设备，包括处理器和存储器；所述存储器中存储有可被所述处理器执行的应用程序，用于使得所述处理器执行如上所述的交互式递增图像生成模型的训练方法。具体地，可以提供配有存储介质的系统或者装置，在该存储介质上存储着实现上述实施例中任一实施方式的功能的软件程序代码，且使该系统或者装置的计算机(或CPU或MPU)读出并执行存储在存储介质中的程序代码。此外，还可以通过基于程序代码的指令使计算机上操作的操作系统等来完成部分或者全部的实际操作。还可以将从存储介质读出的程序代码写到插入计算机内的扩展板中所设置的存储器中或者写到与计算机相连接的扩展单元中设置的存储器中，随后基于程序代码的指令使安装在扩展板或者扩展单元上的CPU等来执行部分和全部实际操作，从而实现上述交互式递增图像生成模型的训练方法实施方式中任一实施方式的功能。Based on the embodiment of the training method for the interactive incremental image generation model, the embodiment of the present application also implements a training electronic device for the interactive incremental image generation model, including a processor and a memory; The application program executed by the processor is used to cause the processor to execute the training method of the interactive incremental image generation model as described above. Specifically, it is possible to provide a system or device equipped with a storage medium on which software program codes for realizing the functions of any one of the above-described embodiments are stored, and make the computer (or CPU or MPU of the system or device) ) to read and execute the program code stored in the storage medium. In addition, a part or all of the actual operation can also be completed by an operating system or the like operating on the computer based on the instructions of the program code. The program code read from the storage medium can also be written into the memory provided in the expansion board inserted into the computer or into the memory provided in the expansion unit connected to the computer, and then the instructions based on the program code make the device installed in the computer. The CPU or the like on the expansion board or the expansion unit is used to perform part and all of the actual operations, so as to realize the function of any one of the above-mentioned embodiments of the training method for the interactive incremental image generation model.

基于交互式递增图像生成方法实施例，本申请实施例还实现了一种交互式递增图像生成电子设备，包括处理器和存储器；所述存储器中存储有可被所述处理器执行的应用程序，用于使得所述处理器执行如上所述的交互式递增图像生成方法。具体地，可以提供配有存储介质的系统或者装置，在该存储介质上存储着实现上述实施例中任一实施方式的功能的软件程序代码，且使该系统或者装置的计算机(或CPU或MPU)读出并执行存储在存储介质中的程序代码。此外，还可以通过基于程序代码的指令使计算机上操作的操作系统等来完成部分或者全部的实际操作。还可以将从存储介质读出的程序代码写到插入计算机内的扩展板中所设置的存储器中或者写到与计算机相连接的扩展单元中设置的存储器中，随后基于程序代码的指令使安装在扩展板或者扩展单元上的CPU等来执行部分和全部实际操作，从而实现上述交互式递增图像生成方法实施方式中任一实施方式的功能。Based on the embodiment of the interactive incremental image generation method, the embodiment of the present application further implements an interactive incremental image generation electronic device, including a processor and a memory; the memory stores an application program executable by the processor, for causing the processor to perform the interactive incremental image generation method as described above. Specifically, it is possible to provide a system or device equipped with a storage medium on which software program codes for realizing the functions of any one of the above-described embodiments are stored, and make the computer (or CPU or MPU of the system or device) ) to read and execute the program code stored in the storage medium. In addition, a part or all of the actual operation can also be completed by an operating system or the like operating on the computer based on the instructions of the program code. The program code read from the storage medium can also be written into the memory provided in the expansion board inserted into the computer or into the memory provided in the expansion unit connected to the computer, and then the instructions based on the program code make the device installed in the computer. The CPU or the like on the expansion board or the expansion unit is used to perform part and all of the actual operations, so as to realize the functions of any one of the above-mentioned implementations of the interactive incremental image generation method.

上述存储器具体可以实施为电可擦可编程只读存储器(EEPROM)、快闪存储器(Flash memory)、可编程程序只读存储器(PROM)等多种存储介质。处理器可以实施为包括一或多个中央处理器或一或多个现场可编程门阵列，其中现场可编程门阵列集成一或多个中央处理器核。具体地，中央处理器或中央处理器核可以实施为CPU或MCU。The above-mentioned memory may be specifically implemented as various storage media such as Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash Memory (Flash memory), Programmable Program Read-Only Memory (PROM). The processor may be implemented to include one or more central processing units or one or more field programmable gate arrays, wherein the field programmable gate arrays integrate one or more central processing unit cores. Specifically, a central processing unit or central processing unit core may be implemented as a CPU or an MCU.

需要说明的是，上述各流程和各结构图中不是所有的步骤和模块都是必须的，可以根据实际的需要忽略某些步骤或模块。各步骤的执行顺序不是固定的，可以根据需要进行调整。各模块的划分仅仅是为了便于描述采用的功能上的划分，实际实现时，一个模块可以分由多个模块实现，多个模块的功能也可以由同一个模块实现，这些模块可以位于同一个设备中，也可以位于不同的设备中。It should be noted that not all steps and modules in the above-mentioned processes and structural diagrams are necessary, and some steps or modules may be omitted according to actual needs. The execution order of each step is not fixed and can be adjusted as required. The division of each module is only to facilitate the description of the functional division used. In actual implementation, a module can be implemented by multiple modules, and the functions of multiple modules can also be implemented by the same module. These modules can be located in the same device. , or in a different device.

各实施方式中的硬件模块可以以机械方式或电子方式实现。例如，一个硬件模块可以包括专门设计的永久性电路或逻辑器件(如专用处理器，如FPGA或ASIC)用于完成特定的操作。硬件模块也可以包括由软件临时配置的可编程逻辑器件或电路(如包括通用处理器或其它可编程处理器)用于执行特定操作。至于具体采用机械方式，或是采用专用的永久性电路，或是采用临时配置的电路(如由软件进行配置)来实现硬件模块，可以根据成本和时间上的考虑来决定。The hardware modules in various embodiments may be implemented mechanically or electronically. For example, a hardware module may include specially designed permanent circuits or logic devices (eg, special purpose processors, such as FPGAs or ASICs) for performing specific operations. Hardware modules may also include programmable logic devices or circuits (eg, including general-purpose processors or other programmable processors) temporarily configured by software for performing particular operations. As for the specific use of a mechanical method, or a dedicated permanent circuit, or a temporarily configured circuit (for example, configured by software) to realize the hardware module, it can be decided according to cost and time considerations.

在本文中，“示意性”表示“充当实例、例子或说明”，不应将在本文中被描述为“示意性”的任何图示、实施方式解释为一种更优选的或更具优点的技术方案。为使图面简洁，各图中的只示意性地表示出了与本发明相关部分，而并不代表其作为产品的实际结构。另外，以使图面简洁便于理解，在有些图中具有相同结构或功能的部件，仅示意性地绘示了其中的一个，或仅标出了其中的一个。在本文中，“一个”并不表示将本发明相关部分的数量限制为“仅此一个”，并且“一个”不表示排除本发明相关部分的数量“多于一个”的情形。在本文中，“上”、“下”、“前”、“后”、“左”、“右”、“内”、“外”等仅用于表示相关部分之间的相对位置关系，而非限定这些相关部分的绝对位置。As used herein, "schematic" means "serving as an example, instance, or illustration" and any illustration, embodiment described herein as "schematic" should not be construed as a preferred or advantageous one Technical solutions. In order to make the drawings concise, only the relevant parts of the present invention are schematically shown in each drawing, and do not represent the actual structure of the product. In addition, in order to make the drawings concise and easy to understand, in some drawings, only one of the components having the same structure or function is schematically shown, or only one of them is marked. As used herein, "one" does not mean to limit the number of relevant parts of the invention to "only one", and "one" does not mean to exclude "more than one" number of relevant parts of the invention. In this article, "upper", "lower", "front", "rear", "left", "right", "inner", "outer", etc. are only used to indicate the relative positional relationship between related parts, and The absolute positions of these relative parts are not limited.

以上所述，仅为本发明的较佳实施例而已，并非用于限定本发明的保护范围。凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

Claims

1. A method for training an interactive incremental image generation model, comprising:

obtaining conversation sample data, wherein the conversation sample data comprises conversation text data, a standard image, an image description text and a total number of conversation rounds;

training an interactive incremental image generation model by using the dialogue sample data and a pre-trained heterogeneous cyclic neural network encoder in a random replay training mode so that the interactive incremental image generation model can generate an image with interactive incremental based on a man-machine dialogue text and an image description text; wherein the training is performed based on all dialog text data obtained at a final time of the dialog and all dialog text data obtained at an intermediate time of the dialog.

2. The method of claim 1, wherein the training comprises training

Determining the number T of currently adopted conversation rounds by adopting a random sampling mode, wherein T is more than or equal to 2 and less than or equal to T, and T is the total number of the conversation;

inputting the image description text and the dialogue text data into the heterogeneous cyclic neural network encoder for encoding, and taking the feature representation output at last of encoding as a first text representation X'_T(ii) a Inputting the first text representation to an image generator of an interactive incremental image generation model for image generation to obtain a first image Y'_T；

Representing X 'based on the first text'_TAnd the first image Y'_TComputing using said interactive incremental image generation model's discriminatorLoss of primary confrontation; updating the accumulated gradients of an image generator and a discriminator of the interactive incremental image generation model with the primary confrontation losses; the primary countermeasure loss includes a loss function value of an image generator and a discriminator;

inputting the image description text and the front t wheel in the dialogue text data into the heterogeneous cyclic neural network encoder for encoding, and using the feature representation output at last of encoding as a second text representation X'_t(ii) a Inputting the second text representation to the image generator for image generation to obtain a second image Y'_t；

Inputting the image description text and the first t-1 wheel in the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and using the feature representation output at last of encoding as a third text representation X'_t-1(ii) a Inputting the third text representation to the image generator for image generation to obtain a third image Y'_t-1；

Representing X 'based on the second text'_tAnd the second image Y'_tConstructing a first positive example;

representing X 'based on the third text'_t-1And the third image Y'_t-1, constructing a second positive example;

calculating a first secondary confrontation loss by using the discriminator based on the first positive example; updating a cumulative gradient of the image generator and the discriminator based on the first secondary confrontation loss; the first secondary countermeasure loss includes a loss function value of an image generator and a discriminator;

calculating a second secondary confrontation loss by using the discriminator based on the second positive example; updating the cumulative gradients of the image generator and the discriminator based on the second secondary confrontation loss; the second secondary countermeasure loss includes a loss function value of an image generator and a discriminator;

updating parameters of the image generator based on the current accumulated gradient of the image generator; and updating the parameters of the discriminator based on the current accumulated gradient of the discriminator.

3. The method of claim 2, wherein the training method further comprises:

representing X 'based on the third text'_t-1 and the second image Y'_tConstructing a first negative example;

representing X 'based on the second text'_tAnd the third image Y'_t-1, constructing a second negative example;

the calculating the first secondary confrontation loss comprises:

calculating the first secondary countermeasure loss using the discriminator based on the first positive example and the first negative example;

the calculating the second secondary confrontation loss comprises:

calculating, with the discriminator, the second secondary countermeasure loss based on the second positive example and the second negative example.

4. The method of claim 2, wherein the training of the heterogeneous cyclic neural network encoder comprises:

acquiring coding training sample data, wherein the coding training sample data comprises an image description text and a visual dialog text of a standard sample image;

and training a heterogeneous cyclic neural network encoder by using the encoding training sample data, so that the heterogeneous cyclic neural network encoder can associate the reference relationship in the visual dialog text in the input data with the corresponding content in the image description text.

5. The method of claim 4, wherein the heterogeneous recurrent neural network encoder consists of a first recurrent neural network encoder, a second recurrent neural network encoder, and a third recurrent neural network encoder;

the training a heterogeneous cyclic neural network encoder by using the encoded training sample data comprises:

encoding the image description text by using the first recurrent neural network encoder and an encoding unit taking words as basic, and outputting each primary word feature representation obtained by encoding to the third recurrent neural network encoder; coding the visual dialog text by using the second cyclic neural network coder and a sentence-based coding unit, and outputting each primary sentence feature representation obtained by coding to the third cyclic neural network coder;

the third recurrent neural network encoder encodes based on the primary word feature representation and the primary sentence feature representation and the last output feature representation as a global encoded representation in which the visual dialog text and the image description text are associated;

and adjusting the weight parameters of the first cyclic neural network encoder, the second cyclic neural network encoder and the third cyclic neural network encoder by adopting a deep attention multi-modal similarity model DAMSM loss function based on all feature representations output by the third cyclic neural network encoder in the encoding process.

6. An interactive incremental image generation method, comprising:

in the process of visual conversation, when each round of man-machine conversation is finished, inputting all the generated round of man-machine conversation texts and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and displaying the generated image;

wherein the interactive incremental image generation model is obtained based on any one of the training methods of claims 1 to 5.

7. An interactive incremental image generation model training apparatus, comprising: a processor to:

training an interactive incremental image generation model by using the dialogue sample data and a pre-trained heterogeneous cyclic neural network encoder in a random replay training mode so that the interactive incremental image generation model can generate an image with interactive incremental based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialog text data obtained at a final time of the dialog and all dialog text data obtained at an intermediate time of the dialog.

8. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of any of the training methods of claims 1 to 5.

9. An interactive incremental image generation apparatus, comprising: a processor to:

the interactive incremental image generation method comprises the steps of inputting all generated round man-machine conversation texts and preset image description texts into a pre-trained interactive incremental image generation model for image generation and displaying generated images when each round of man-machine conversation is finished in the visual conversation process;

10. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of the interactive incremental image generation method of claim 6.