CN113590800A - Training method and device of image generation model and image generation method and device - Google Patents

Training method and device of image generation model and image generation method and device

Info

Publication number
CN113590800A
CN113590800A (application CN202110966233.7A)
Authority
CN
China
Prior art keywords
image
text
training
neural network
image generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110966233.7A
Other languages
Chinese (zh)
Other versions
CN113590800B (en)
Inventor
牛天睿
冯方向
王小捷
袁彩霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110966233.7A
Publication of CN113590800A
Application granted
Publication of CN113590800B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F16/5846 - Retrieval characterised by using metadata automatically derived from the content, using extracted text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method and device for an image generation model, and an image generation method and device. The method comprises the following steps: obtaining dialogue sample data, wherein the dialogue sample data comprises dialogue text data, a standard image, an image description text and a total number of dialogue turns; and training an interactive incremental image generation model with the dialogue sample data and a pre-trained heterogeneous recurrent neural network encoder in a random replay training mode, so that the interactive incremental image generation model can generate images with interactive incrementality based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialogue text data obtained at the final time of the dialogue and all dialogue text data obtained at an intermediate time of the dialogue. By adopting the method and the device, the dialog-to-image generation task can be reasonably realized.

Description

Training method and device of image generation model and image generation method and device
Technical Field
The invention relates to artificial intelligence technology, and in particular to a training method and device for an image generation model, and an image generation method and device.
Background
Dialogue is the most natural way for people to communicate, and controlling a machine to generate images through dialogue is therefore an ideal form of interaction for the text-to-image generation task. A conversation is a free interactive process: things that are difficult to state clearly in a single sentence can be completed by adding further dialogue turns, and the topic is not restricted.
For the dialog-to-image generation task, in order to improve the intelligence of the human-machine dialogue, the machine is required to generate and display an image after each dialogue turn is finished, as timely feedback to the person, rather than generating an image only after the whole dialogue is finished.
In the process of implementing the present application, the inventors found through research and analysis that directly using existing text-to-image generation methods to generate an image after each dialogue turn cannot reasonably realize the dialog-to-image generation task, for the following specific reasons:
Since the information provided by the person increases gradually as the dialogue progresses, to ensure the rationality and intelligence of dialog-to-image generation, the information contained in each image generated by the machine after each dialogue turn should also increase; this characteristic is called the "incrementality" of the image. Ideally, the dialog-to-image generation process should therefore begin with a blank canvas that the machine supplements incrementally after each dialogue turn is completed. The machine should not draw a complex image containing a large number of objects right at the beginning of the conversation, to avoid conveying false feedback to the person. The content of the image generated after each turn should just cover all the information of the dialogue so far, no more and no less. At the same time, images generated later should not change greatly in structure, so as to maintain the continuity of the conversation. In the abstract, "incrementality" comprises the following aspects:
Incrementality of the number of objects: the number of objects increases monotonically as the conversation proceeds and equals the number of objects actually mentioned in the conversation, neither more nor fewer.
Incrementality of attributes and relationships: the attributes and relationships of the objects are determined step by step as the dialogue proceeds and are memorized: in images generated later in a conversation, attributes and relationships established earlier in the conversation must not be lost.
Continuity between successive images: images generated in adjacent turns of the dialogue should be similar in structure; a large change in the image during the conversation is false feedback that harms the interlocutor's experience.
Although the amount of information in the dialogue process naturally increases, the image content generated from it does not necessarily do so. In the text-to-image generation task the image modality is not equivalent to the text modality: an image contains much more information than the text, the text controls only a small part of the information in the image, and the remaining image information (i.e., image-specific information) is generated randomly by the machine, which introduces uncertainty into the image information. Due to this uncertainty of image-specific information, a text with more information may produce an image with less information than a text with less information. Therefore, if an image is generated after each dialogue turn simply from the dialogue text acquired so far, it cannot be guaranteed that the information in the images increases with the number of dialogue turns, so the "incrementality" requirement of the dialog-to-image generation task cannot be met and the dialog-to-image generation task cannot be reasonably realized.
Disclosure of Invention
In view of the above, the present invention is directed to an encoder, a training method and device for an image generation model, and an image generation method and device, which are conducive to reasonably realizing the dialog-to-image generation task.
In order to achieve the above purpose, the embodiment of the present invention provides a technical solution:
a method of training an interactive incremental image generation model, comprising:
obtaining dialogue sample data, wherein the dialogue sample data comprises dialogue text data, a standard image, an image description text and a total number of dialogue turns;
training an interactive incremental image generation model with the dialogue sample data and a pre-trained heterogeneous recurrent neural network encoder in a random replay training mode, so that the interactive incremental image generation model can generate images with interactive incrementality based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialogue text data obtained at the final time of the dialogue and all dialogue text data obtained at an intermediate time of the dialogue.
Preferably, the training comprises:
determining the currently adopted number t of dialogue turns by random sampling, wherein 2 ≤ t ≤ T, and T is the total number of dialogue turns;
inputting the image description text and the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a first text representation X'_T; inputting the first text representation to an image generator of the interactive incremental image generation model for image generation to obtain a first image Y'_T;
calculating a primary adversarial loss based on the first text representation X'_T and the first image Y'_T by using a discriminator of the interactive incremental image generation model; updating the accumulated gradients of the image generator and the discriminator of the interactive incremental image generation model with the primary adversarial loss; the primary adversarial loss includes loss function values of the image generator and the discriminator;
inputting the image description text and the first t turns of the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a second text representation X'_t; inputting the second text representation to the image generator for image generation to obtain a second image Y'_t;
inputting the image description text and the first t-1 turns of the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a third text representation X'_{t-1}; inputting the third text representation to the image generator for image generation to obtain a third image Y'_{t-1};
constructing a first positive example based on the second text representation X'_t and the second image Y'_t;
constructing a second positive example based on the third text representation X'_{t-1} and the third image Y'_{t-1};
calculating a first auxiliary adversarial loss with the discriminator based on the first positive example; updating the accumulated gradients of the image generator and the discriminator based on the first auxiliary adversarial loss; the first auxiliary adversarial loss includes loss function values of the image generator and the discriminator;
calculating a second auxiliary adversarial loss with the discriminator based on the second positive example; updating the accumulated gradients of the image generator and the discriminator based on the second auxiliary adversarial loss; the second auxiliary adversarial loss includes loss function values of the image generator and the discriminator;
updating parameters of the image generator based on the current accumulated gradient of the image generator; and updating the parameters of the discriminator based on the current accumulated gradient of the discriminator.
Preferably, the training method further comprises:
constructing a first negative example based on the third text representation X'_{t-1} and the second image Y'_t;
constructing a second negative example based on the second text representation X'_t and the third image Y'_{t-1};
the calculating of the first auxiliary adversarial loss comprises:
calculating the first auxiliary adversarial loss with the discriminator based on the first positive example and the first negative example;
the calculating of the second auxiliary adversarial loss comprises:
calculating the second auxiliary adversarial loss with the discriminator based on the second positive example and the second negative example.
Preferably, the training of the heterogeneous recurrent neural network encoder comprises:
acquiring encoding training sample data, wherein the encoding training sample data comprises an image description text and a visual dialogue text of a standard sample image;
and training the heterogeneous recurrent neural network encoder with the encoding training sample data, so that the heterogeneous recurrent neural network encoder can associate the referring relationships in the visual dialogue text of the input data with the corresponding content in the image description text.
Preferably, the heterogeneous recurrent neural network encoder is composed of a first recurrent neural network encoder, a second recurrent neural network encoder and a third recurrent neural network encoder;
the training of the heterogeneous recurrent neural network encoder with the encoding training sample data comprises:
encoding the image description text with the first recurrent neural network encoder, using words as the basic encoding unit, and outputting each primary word feature representation obtained by encoding to the third recurrent neural network encoder; encoding the visual dialogue text with the second recurrent neural network encoder, using sentences as the basic encoding unit, and outputting each primary sentence feature representation obtained by encoding to the third recurrent neural network encoder;
the third recurrent neural network encoder encodes based on the primary word feature representations and the primary sentence feature representations, and its last output feature representation serves as a global encoded representation in which the visual dialogue text and the image description text are associated;
and adjusting the weight parameters of the first recurrent neural network encoder, the second recurrent neural network encoder and the third recurrent neural network encoder with a deep attentional multi-modal similarity model (DAMSM) loss function, based on all feature representations output by the third recurrent neural network encoder during encoding.
Based on the model training method, the embodiment of the invention also discloses an interactive incremental image generation method, which comprises the following steps:
in the process of visual dialogue, when each turn of man-machine dialogue is finished, inputting all the man-machine dialogue texts of the turns generated so far and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and displaying the generated image;
wherein the interactive incremental image generation model is obtained based on the training method.
The embodiment of the invention also discloses a training device for the interactive incremental image generation model, which comprises a processor configured to:
obtaining dialogue sample data, wherein the dialogue sample data comprises dialogue text data, a standard image, an image description text and a total number of dialogue turns;
training an interactive incremental image generation model with the dialogue sample data and a pre-trained heterogeneous recurrent neural network encoder in a random replay training mode, so that the interactive incremental image generation model can generate images with interactive incrementality based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialogue text data obtained at the final time of the dialogue and all dialogue text data obtained at an intermediate time of the dialogue.
Also disclosed is a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform any of the training method steps described above.
The embodiment of the invention also discloses an interactive incremental image generation device, which comprises a processor configured to:
in the visual dialogue process, when each turn of man-machine dialogue is finished, input all the man-machine dialogue texts of the turns generated so far and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and display the generated image;
wherein the interactive incremental image generation model is obtained based on the training method.
Also disclosed is a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the interactive incremental image generation method described above.
In summary, in the technical solution proposed by the embodiments of the present invention, the interactive incremental image generation model is trained with a random replay training method: the training uses not only all the dialogue text data obtained at the final time of the dialogue but also all the dialogue text data obtained at intermediate times of the dialogue. Adding intermediate-time text during model training allows the model to perceive intermediate-time dialogue text in the training process, which enhances the interactive incrementality of the model. In addition, by introducing the heterogeneous recurrent neural network encoder during model training, the dialogue text data can be associated with the image description text, so that the encoded text feature vector accurately represents the image features described by the user's language; this helps the random replay training method accurately capture the incrementality of the interactive process, so that the trained model can generate images with interactive incrementality. Therefore, by adopting the technical solution of the embodiments of the present invention, the dialog-to-image generation task can be reasonably realized.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a heterogeneous recurrent neural network encoder according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic flowchart of a training method for an interactive incremental image generation model according to an embodiment of the present invention. As shown in Fig. 1, the embodiment mainly includes:
step 101, obtaining conversation sample data, wherein the conversation sample data comprises conversation text data, a standard image, an image description text and a total number of conversation rounds.
102, training an interactive incremental image generation model by using the dialogue sample data and a pre-trained heterogeneous cyclic neural network encoder in a random replay training mode so that the interactive incremental image generation model can generate an image with interactive incremental based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialog text data obtained at a final time of the dialog and all dialog text data obtained at an intermediate time of the dialog.
In this step, in order to train the model to generate images with interactive incrementality, a random replay training mode is adopted: the model is trained not only with all the dialogue text data obtained at the final time of the dialogue but also with all the dialogue text data obtained at intermediate times of the dialogue, so that the model perceives intermediate-time text during training, thereby improving its ability to realize image incrementality.
In conventional training algorithms for models that generate images from a dialogue, all dialogue text data of a complete dialogue process and the image generated at the final time of the dialogue are used as a pair of training examples. The goal of such training is to ensure the accuracy of the image generated at the end of the dialogue. Since the model observes only the whole dialogue text and the final image during training and never sees dialogue and image samples at intermediate times, the images generated by the model from intermediate-time text may be incorrect. As described above, if such a model, trained with the final dialogue image as the target, is applied to a task that requires generating an image after each dialogue turn, images must be generated not only at the final dialogue time but also at intermediate dialogue times. However, the training target of such a model is the accuracy of the final image generated at the last time of the dialogue, not the accuracy of the images generated during the middle of the dialogue; that is, the training and application targets of the model are inconsistent, so an image generation model trained with existing methods cannot meet the requirement of interactive incrementality (i.e., that the information in the generated images increases with the interaction turns during the dialogue). To address this problem, this step adopts the random replay training method: during training, the model is trained not only with all the dialogue text and the final image obtained at the last time of the dialogue process but also with all the dialogue text obtained at intermediate times, so the model must also perceive intermediate-time text during training, which reduces the gap between model training and application and enhances the incrementality of image generation.
In addition, in step 102 a heterogeneous recurrent neural network encoder is introduced during model training and is used to encode the dialogue text data together with the image description text, so that the encoded text feature vector can accurately represent the image features described by the user's language. This helps the random replay training method accurately capture the incrementality of the interactive process from the input text features, so that the trained model can generate images with interactive incrementality.
In one embodiment, the interactive incremental image generation model may be trained in step 102 by specifically using the following method:
step 1021, determining the currently adopted number t of the conversation rounds by adopting a random sampling mode.
Wherein T is more than or equal to 2 and less than or equal to T, and T is the total number of the dialogues in the dialog sample data.
The step is used for determining the randomly sampled dialogue intermediate time so that the dialogue data of the randomly intercepted intermediate time can be used for training in the following process.
Step 1022, inputting the image description text and the dialog text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at last of encoding as a first text representation X'T(ii) a Inputting the first text representation to an image generator of an interactive incremental image generation model for image generation to obtain a first image Y'T
Step 1023, representing X 'based on the first text'TAnd the first image Y'TCalculating a primary countermeasure loss by using a discriminator of the interactive incremental image generation model; updating the accumulated gradients of an image generator and a discriminator of the interactive incremental image generation model with the primary confrontation losses; the primary confrontation loss includes loss function values of an image generator and a discriminator.
Preferably, the primary adversarial loss of the image generator can be calculated according to the following Formula 1:
L_G^main = L_G^uncond(Y'_T) + L_G^cond(Y'_T, X'_T)    (Formula 1)
where L_G^uncond(Y'_T) is the unconditional adversarial loss function calculated for the image Y'_T generated by the image generator G, L_G^cond(Y'_T, X'_T) is the conditional adversarial loss function calculated for the generated image Y'_T and the corresponding text representation X'_T, and both terms are computed from the probability value D(·) output by the discriminator.
in this step, the calculation of the loss function value of the discriminator may be implemented by a method that is used in the conventional AttnGAN model, and is not described herein again.
Step 1024, inputting the image description text and the first t turns of the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a second text representation X'_t; inputting the second text representation to the image generator for image generation to obtain a second image Y'_t; inputting the image description text and the first t-1 turns of the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a third text representation X'_{t-1}; inputting the third text representation to the image generator for image generation to obtain a third image Y'_{t-1}.
Here, step 1024 uses the number of dialogue turns t randomly sampled in step 1021 to randomly truncate the dialogue process and construct pseudo samples (i.e., the first t turns and the first t-1 turns of dialogue data) that simulate the input available at intermediate times of the dialogue, so that the model can capture the "incrementality" in the data.
Step 1025, constructing a first positive example based on the second text representation X'_t and the second image Y'_t; and constructing a second positive example based on the third text representation X'_{t-1} and the third image Y'_{t-1}.
Furthermore, in order to improve the accuracy of training, this step may also construct negative examples from the second text representation X'_t and the second image Y'_t and from the third text representation X'_{t-1} and the third image Y'_{t-1} (this construction is also illustrated in the sketch below), as follows:
constructing a first negative example based on the third text representation X'_{t-1} and the second image Y'_t;
constructing a second negative example based on the second text representation X'_t and the third image Y'_{t-1}.
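As an illustrative sketch of steps 1024 and 1025, the truncated-prefix encodings and the positive/negative pairs might be assembled as follows; encoder and generator are hypothetical callables standing in for the heterogeneous recurrent neural network encoder and the image generator, and are not interfaces defined by the patent.

```python
import random

def build_replay_pairs(encoder, generator, caption, dialog_turns):
    """Hypothetical sketch of random-replay pseudo-sample construction."""
    T = len(dialog_turns)
    t = random.randint(2, T)                       # step 1021: 2 <= t <= T

    x_t  = encoder(caption, dialog_turns[:t])      # second text representation X'_t
    x_t1 = encoder(caption, dialog_turns[:t - 1])  # third text representation X'_{t-1}
    y_t, y_t1 = generator(x_t), generator(x_t1)    # second image Y'_t, third image Y'_{t-1}

    positives = [(x_t, y_t), (x_t1, y_t1)]         # first and second positive examples
    negatives = [(x_t1, y_t), (x_t, y_t1)]         # first and second negative examples
    return positives, negatives
```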
Step 1026, calculating a first auxiliary adversarial loss with the discriminator based on the first positive example, and updating the accumulated gradients of the image generator and the discriminator based on the first auxiliary adversarial loss, the first auxiliary adversarial loss including loss function values of the image generator and the discriminator; calculating a second auxiliary adversarial loss with the discriminator based on the second positive example, and updating the accumulated gradients of the image generator and the discriminator based on the second auxiliary adversarial loss, the second auxiliary adversarial loss including loss function values of the image generator and the discriminator.
Specifically, this step can calculate the image generator loss function value in the first auxiliary adversarial loss, denoted L_G^aux(t), from the first positive example (X'_t, Y'_t), and the image generator loss function value in the second auxiliary adversarial loss, denoted L_G^aux(t-1), from the second positive example (X'_{t-1}, Y'_{t-1}), according to the corresponding formulas.
Further, if the first negative example and the second negative example are constructed in step 1025, the first and second auxiliary adversarial losses may be calculated in step 1026 using the following methods, respectively:
calculating, with the discriminator, the first auxiliary adversarial loss based on the first positive example and the first negative example; the image generator loss function value L_G^aux(t) in the first auxiliary adversarial loss is calculated accordingly from these examples;
calculating, with the discriminator, the second auxiliary adversarial loss based on the second positive example and the second negative example; the image generator loss function value L_G^aux(t-1) in the second auxiliary adversarial loss is calculated accordingly from these examples.
After the first auxiliary adversarial loss and the second auxiliary adversarial loss are obtained in step 1026, the total loss function of the image generator can be obtained according to the formula
L_G^total = L_G^main + w_RR * (L_G^aux(t) + L_G^aux(t-1))
where w_RR is the weight of the auxiliary adversarial loss function.
In this step, the calculation of the loss function value of the discriminator may be implemented by a method that is used in the conventional AttnGAN model, and is not described herein again.
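Purely for illustration, the total generator objective described above could be assembled as in the following sketch, reusing the hypothetical generator_main_adv_loss from the earlier sketch. Since the auxiliary loss formulas are not reproduced in the text, the same unconditional-plus-conditional form is assumed for them, and the negative (mismatched) pairs are assumed to enter the discriminator's loss in the AttnGAN manner.

```python
def generator_total_loss(discriminator, w_rr, main_pair, aux_pairs):
    """Hypothetical sketch: L_G^total = L_G^main + w_RR * (L_G^aux(t) + L_G^aux(t-1))."""
    x_T, y_T = main_pair                                # (X'_T, Y'_T) from the full dialogue
    loss_main = generator_main_adv_loss(discriminator, y_T, x_T)

    loss_aux = sum(generator_main_adv_loss(discriminator, y, x)
                   for (x, y) in aux_pairs)             # (X'_t, Y'_t) and (X'_{t-1}, Y'_{t-1})
    return loss_main + w_rr * loss_aux
```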
Step 1027, updating parameters of the image generator based on the current accumulated gradient of the image generator; and updating the parameters of the discriminator based on the current accumulated gradient of the discriminator.
The specific implementation of this step is known to those skilled in the art and will not be described herein.
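The gradient accumulation and parameter update pattern of steps 1023, 1026 and 1027 could, for example, look like the following sketch for the generator side; it assumes the hypothetical helpers from the earlier sketches and the PyTorch behaviour that repeated backward() calls accumulate gradients until an optimizer step is applied. The discriminator would be updated analogously with its own accumulated gradients.

```python
def random_replay_training_step(encoder, generator, discriminator,
                                g_optimizer, caption, dialog_turns, w_rr):
    """Hypothetical sketch of one random-replay training step (generator side)."""
    g_optimizer.zero_grad()

    # Steps 1022-1023: full dialogue -> primary adversarial loss, accumulate gradients.
    x_T = encoder(caption, dialog_turns)
    y_T = generator(x_T)
    generator_main_adv_loss(discriminator, y_T, x_T).backward()

    # Steps 1024-1026: truncated prefixes -> weighted auxiliary losses, accumulate gradients.
    positives, _negatives = build_replay_pairs(encoder, generator, caption, dialog_turns)
    for (x, y) in positives:
        (w_rr * generator_main_adv_loss(discriminator, y, x)).backward()

    # Step 1027: apply the accumulated gradients to the generator parameters.
    g_optimizer.step()
```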
In one embodiment, in order to enable the encoder to capture more comprehensive and complete image-text feature information, the heterogeneous recurrent neural network encoder may be trained in advance by the following method:
Step x1, acquiring encoding training sample data, wherein the encoding training sample data comprises an image description text and a visual dialogue text of a standard sample image.
Step x2, training the heterogeneous recurrent neural network encoder with the encoding training sample data, so that the heterogeneous recurrent neural network encoder can associate the referring relationships in the visual dialogue text of the input data with the corresponding content in the image description text.
In visual dialogue data, the semantics of the text are asymmetric: the image description sentence states the main content of the image, while the dialogue text supplements the image description sentence and provides additional information beyond the main content. Therefore, when acquiring feature representations of the image description and the dialogue text, the two can be treated differently, which improves the completeness and accuracy of the text feature representation. In particular, every word in the image description sentence is crucial, so the image description sentence should be modeled at the "word" level; the information in the dialogue text is relatively sparse and redundant, so the question and answer of one dialogue turn can be regarded as roughly equivalent to a single word of the image description sentence and modeled at the "sentence" level. At the same time, the image description is regarded as the start of the visual dialogue process, and both kinds of data are modeled simultaneously with the same encoder. Based on this, the structure of the heterogeneous recurrent neural network encoder shown in Fig. 2 can be designed.
As shown in Fig. 2, in one embodiment, the heterogeneous recurrent neural network encoder consists of three RNN layers, specifically a first recurrent neural network encoder (the text RNN encoder in the figure), a second recurrent neural network encoder (the dialogue RNN encoder in the figure), and a third recurrent neural network encoder (the fusion RNN encoder in the figure). Preferably, all three RNN layers are bidirectional gated recurrent units (Bi-GRUs). For simplicity, only the forward computation path of the bidirectional GRU is shown.
The text RNN encoder takes as input the word vector of each word of the image description; each input word is treated as one time step, and a feature representation is output at each time step. Each such feature representation can be regarded as a simple fusion of all word information in the whole image description sentence, and is called a primary word feature representation. If the number of words in the text description is M1, the text RNN encoder generates M1 primary word feature representations.
The dialogue RNN encoder accepts as input the word vector of each word in a dialogue turn (the question and answer of the turn are concatenated and treated as one sentence), but only the output at the last time step is retained. The feature representation output at the last time step encodes the information of the whole turn of dialogue text and is defined as the primary sentence feature representation of this dialogue turn. Every turn of the dialogue process is input into the dialogue RNN encoder in this manner, which can be regarded as sharing the same encoder weights across the entire dialogue process.
As shown in Fig. 2, taking the forward computation path of the bidirectional GRU as an example, the encoding process of the heterogeneous recurrent neural network encoder is as follows:
Assume that the dialogue has a total of M2 turns; the dialogue RNN encoder then generates M2 primary sentence feature representations. In the dialogue RNN encoder each turn of dialogue text is independent, i.e., a randomly initialized hidden state is used when encoding each turn; the model can therefore process all dialogue turns in parallel, which improves training and testing speed. The text RNN encoder and the dialogue RNN encoder together form the first layer of the overall heterogeneous recurrent neural network encoder and generate low-level feature representations for the image description and the dialogue text. The upper-layer fusion RNN encoder further processes the outputs of the lower-layer encoders: at the first M1 time steps it takes the outputs of the text RNN encoder as input and generates high-level word feature representations; at the following M2 time steps it takes the outputs of the dialogue RNN encoder as input and generates high-level sentence feature representations. The Bi-GRU structure of the fusion RNN encoder allows it to observe all the information of the current dialogue at every time step. On the one hand, it establishes associations between the dialogue turns, which helps to handle cross-turn problems such as coreference resolution in the dialogue process and counteracts the over-simplification caused by the lower-layer dialogue RNN encoder processing each turn independently; on the other hand, it establishes the association between the dialogue and the text description, so that their information can be better fused as required and more expressive text features are generated.
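A minimal Python sketch of this three-layer structure is given below for illustration. The embedding and hidden dimensions, the use of nn.GRU modules, and the way the fusion encoder's last output is taken as the global representation are assumptions made for the sketch, not specifics taken from the patent or Fig. 2.

```python
import torch
import torch.nn as nn

class HeterogeneousRNNEncoder(nn.Module):
    """Hypothetical sketch: text RNN + dialogue RNN feeding a fusion RNN (all Bi-GRU)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.text_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dialog_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fusion_rnn = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, caption_ids, turn_ids_list):
        # Word-level encoding of the image description: one primary word feature per word (M1 steps).
        word_feats, _ = self.text_rnn(self.embed(caption_ids))          # (1, M1, 2*hidden)

        # Sentence-level encoding of each dialogue turn: keep only the last output (M2 turns).
        sent_feats = []
        for turn_ids in turn_ids_list:                                  # each turn encoded independently
            turn_out, _ = self.dialog_rnn(self.embed(turn_ids))
            sent_feats.append(turn_out[:, -1])                          # primary sentence feature
        sent_feats = torch.stack(sent_feats, dim=1)                     # assumes at least one turn

        # Fusion layer: word features first, then sentence features, over M1 + M2 steps.
        fused, _ = self.fusion_rnn(torch.cat([word_feats, sent_feats], dim=1))
        return fused[:, -1], fused      # global encoded representation, all high-level features
```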
Based on the above, in one embodiment, the heterogeneous recurrent neural network encoder may be trained with the encoding training sample data specifically as follows:
Step x21, encoding the image description text with the first recurrent neural network encoder, using words as the basic encoding unit, and outputting each primary word feature representation obtained by encoding to the third recurrent neural network encoder; and encoding the visual dialogue text with the second recurrent neural network encoder, using sentences as the basic encoding unit, and outputting each primary sentence feature representation obtained by encoding to the third recurrent neural network encoder.
Step x22, the third recurrent neural network encoder encodes based on the primary word feature representations and the primary sentence feature representations, and its final output feature representation serves as a global encoded representation in which the visual dialogue text and the image description text are associated.
The global encoded representation obtained in this step is the final encoding result of the heterogeneous recurrent neural network encoder.
Step x23, based on all feature representations output by the third recurrent neural network encoder in the encoding process, adopting a deep attention multi-modal similarity model (DAMSM) loss function to adjust the weight parameters of the first recurrent neural network encoder, the second recurrent neural network encoder and the third recurrent neural network encoder.
With the above encoder training method, the heterogeneous recurrent neural network encoder trained in this way is used to encode the text from which an image is generated after each dialogue turn. On the one hand, it can fuse all dialogue information obtained at the current dialogue time and establish associations between the dialogue turns completed so far, which helps handle cross-turn problems such as coreference resolution in the dialogue process and meets the incrementality requirement on the image; on the other hand, it establishes the association between the dialogue text and the image description text, so that their information can be better fused and more expressive and accurate text features are generated. Therefore, the heterogeneous recurrent neural network encoder obtained with this method helps to reasonably realize the dialog-to-image generation task.
As can be seen from the above embodiment of the training method for an interactive incremental image generation model, by using dialogue text at intermediate times in combination with a heterogeneous recurrent neural network encoder during model training, the trained model can generate images with interactive incrementality, which helps to reasonably realize the dialog-to-image generation task.
Based on the model training method, the embodiment of the invention also provides an interactive incremental image generation method, which comprises the following steps:
in the process of visual dialogue, when each turn of man-machine dialogue is finished, inputting all the man-machine dialogue texts of the turns generated so far and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and displaying the generated image;
wherein the interactive incremental image generation model is obtained based on the training method.
Here, the interactive incremental image generation model used is obtained with the above model training method; since the model's ability to perceive intermediate-time dialogue text is taken into account during training, its interactive incrementality is enhanced and the reasonableness of the images generated during the dialogue can be improved.
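For illustration only, per-turn generation at inference time might look like the following sketch; the turn_stream and display arguments are placeholders for the dialogue front-end and the user interface, not interfaces defined by the patent.

```python
def interactive_generation(encoder, generator, caption, turn_stream, display):
    """Hypothetical sketch: generate and show an image after every dialogue turn."""
    turns = []
    for turn in turn_stream:                  # one finished human-machine turn at a time
        turns.append(turn)
        text_repr = encoder(caption, turns)   # encode caption + all turns generated so far
        image = generator(text_repr)          # pre-trained interactive incremental model
        display(image)                        # timely feedback after the turn ends
```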
Based on the above embodiment of the model training method, an embodiment of the present invention further provides a training device for the interactive incremental image generation model, which comprises a processor configured to:
obtaining dialogue sample data, wherein the dialogue sample data comprises dialogue text data, a standard image, an image description text and a total number of dialogue turns;
training an interactive incremental image generation model with the dialogue sample data and a pre-trained heterogeneous recurrent neural network encoder in a random replay training mode, so that the interactive incremental image generation model can generate images with interactive incrementality based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialogue text data obtained at the final time of the dialogue and all dialogue text data obtained at an intermediate time of the dialogue.
Based on the above embodiment of the interactive incremental image generation method, the embodiment of the present invention further discloses an interactive incremental image generation device, which comprises a processor configured to:
in the visual dialogue process, when each turn of man-machine dialogue is finished, input all the man-machine dialogue texts of the turns generated so far and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and display the generated image;
wherein the interactive incremental image generation model is obtained based on the training method.
Based on the above embodiment of the training method for the interactive incremental image generation model, an embodiment also realizes an electronic device for training the interactive incremental image generation model, which comprises a processor and a memory; the memory stores an application program executable by the processor for causing the processor to perform the training method of the interactive incremental image generation model as described above. Specifically, a system or apparatus equipped with a storage medium may be provided, on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or a CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium. Further, part or all of the actual operations may be performed by an operating system or the like running on the computer, based on instructions of the program code. The functions of any of the above embodiments of the training method for the interactive incremental image generation model may also be realized by writing the program code read out from the storage medium into a memory provided in an expansion board inserted into the computer or in an expansion unit connected to the computer, and then causing a CPU or the like mounted on the expansion board or expansion unit to perform part or all of the actual operations based on the instructions of the program code.
Based on the above embodiment of the interactive incremental image generation method, the embodiment of the present application also realizes an electronic device for interactive incremental image generation, which comprises a processor and a memory; the memory stores an application program executable by the processor for causing the processor to perform the interactive incremental image generation method as described above. Specifically, a system or apparatus equipped with a storage medium may be provided, on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or a CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium. Further, part or all of the actual operations may be performed by an operating system or the like running on the computer, based on instructions of the program code. The functions of any of the above embodiments of the interactive incremental image generation method may also be realized by writing the program code read out from the storage medium into a memory provided in an expansion board inserted into the computer or in an expansion unit connected to the computer, and then causing a CPU or the like mounted on the expansion board or expansion unit to perform part or all of the actual operations based on the instructions of the program code.
The memory may be embodied as various storage media such as an Electrically Erasable Programmable Read Only Memory (EEPROM), a Flash memory (Flash memory), and a Programmable Read Only Memory (PROM). The processor may be implemented to include one or more central processors or one or more field programmable gate arrays, wherein the field programmable gate arrays integrate one or more central processor cores. In particular, the central processor or central processor core may be implemented as a CPU or MCU.
It should be noted that not all steps and modules in the above flows and structures are necessary, and some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and can be adjusted as required. The division of each module is only for convenience of describing adopted functional division, and in actual implementation, one module may be divided into multiple modules, and the functions of multiple modules may also be implemented by the same module, and these modules may be located in the same device or in different devices.
The hardware modules in the various embodiments may be implemented mechanically or electronically. For example, a hardware module may include a specially designed permanent circuit or logic device (e.g., a special purpose processor such as an FPGA or ASIC) for performing specific operations. A hardware module may also include programmable logic devices or circuits (e.g., including a general-purpose processor or other programmable processor) that are temporarily configured by software to perform certain operations. The implementation of the hardware module in a mechanical manner, or in a dedicated permanent circuit, or in a temporarily configured circuit (e.g., configured by software), may be determined based on cost and time considerations.
"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings are only schematic representations of the parts relevant to the invention, and do not represent the actual structure of the product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "a" does not mean that the number of the relevant portions of the present invention is limited to "only one", and "a" does not mean that the number of the relevant portions of the present invention "more than one" is excluded. In this document, "upper", "lower", "front", "rear", "left", "right", "inner", "outer", and the like are used only to indicate relative positional relationships between relevant portions, and do not limit absolute positions of the relevant portions.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for training an interactive incremental image generation model, comprising:
obtaining dialogue sample data, wherein the dialogue sample data comprises dialogue text data, a standard image, an image description text and a total number of dialogue turns;
training an interactive incremental image generation model with the dialogue sample data and a pre-trained heterogeneous recurrent neural network encoder in a random replay training mode, so that the interactive incremental image generation model can generate images with interactive incrementality based on a man-machine dialogue text and an image description text; wherein the training is performed using all dialogue text data obtained at the final time of the dialogue and all dialogue text data obtained at an intermediate time of the dialogue.
2. The method of claim 1, wherein the training comprises:
determining the currently adopted number t of dialogue turns by random sampling, wherein 2 ≤ t ≤ T, and T is the total number of dialogue turns;
inputting the image description text and the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a first text representation X'_T; inputting the first text representation to an image generator of the interactive incremental image generation model for image generation to obtain a first image Y'_T;
calculating a primary adversarial loss based on the first text representation X'_T and the first image Y'_T by using a discriminator of the interactive incremental image generation model; updating the accumulated gradients of the image generator and the discriminator of the interactive incremental image generation model with the primary adversarial loss; the primary adversarial loss includes loss function values of the image generator and the discriminator;
inputting the image description text and the first t turns of the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a second text representation X'_t; inputting the second text representation to the image generator for image generation to obtain a second image Y'_t;
inputting the image description text and the first t-1 turns of the dialogue text data into the heterogeneous recurrent neural network encoder for encoding, and taking the feature representation output at the last step of encoding as a third text representation X'_{t-1}; inputting the third text representation to the image generator for image generation to obtain a third image Y'_{t-1};
constructing a first positive example based on the second text representation X'_t and the second image Y'_t;
constructing a second positive example based on the third text representation X'_{t-1} and the third image Y'_{t-1};
calculating a first auxiliary adversarial loss with the discriminator based on the first positive example; updating the accumulated gradients of the image generator and the discriminator based on the first auxiliary adversarial loss; the first auxiliary adversarial loss includes loss function values of the image generator and the discriminator;
calculating a second auxiliary adversarial loss with the discriminator based on the second positive example; updating the accumulated gradients of the image generator and the discriminator based on the second auxiliary adversarial loss; the second auxiliary adversarial loss includes loss function values of the image generator and the discriminator;
updating parameters of the image generator based on the current accumulated gradient of the image generator; and updating the parameters of the discriminator based on the current accumulated gradient of the discriminator.
3. The method of claim 2, wherein the training method further comprises:
constructing a first negative example based on the third text representation X'_{t-1} and the second image Y'_t;
constructing a second negative example based on the second text representation X'_t and the third image Y'_{t-1};
the calculating of the first auxiliary adversarial loss comprises:
calculating the first auxiliary adversarial loss with the discriminator based on the first positive example and the first negative example;
the calculating of the second auxiliary adversarial loss comprises:
calculating the second auxiliary adversarial loss with the discriminator based on the second positive example and the second negative example.
4. The method of claim 2, wherein the training of the heterogeneous recurrent neural network encoder comprises:
acquiring encoding training sample data, wherein the encoding training sample data comprises an image description text and a visual dialogue text of a standard sample image;
and training the heterogeneous recurrent neural network encoder with the encoding training sample data, so that the heterogeneous recurrent neural network encoder can associate the referring relationships in the visual dialogue text of the input data with the corresponding content in the image description text.
5. The method of claim 4, wherein the heterogeneous recurrent neural network encoder consists of a first recurrent neural network encoder, a second recurrent neural network encoder, and a third recurrent neural network encoder;
the training of the heterogeneous recurrent neural network encoder with the encoding training sample data comprises:
encoding the image description text with the first recurrent neural network encoder, using words as the basic encoding unit, and outputting each primary word feature representation obtained by the encoding to the third recurrent neural network encoder; encoding the visual dialog text with the second recurrent neural network encoder, using sentences as the basic encoding unit, and outputting each primary sentence feature representation obtained by the encoding to the third recurrent neural network encoder;
the third recurrent neural network encoder encodes the primary word feature representations and the primary sentence feature representations, and its last output feature representation serves as a global encoded representation in which the visual dialog text and the image description text are associated;
and adjusting the weight parameters of the first recurrent neural network encoder, the second recurrent neural network encoder and the third recurrent neural network encoder by using a Deep Attentional Multimodal Similarity Model (DAMSM) loss function based on all feature representations output by the third recurrent neural network encoder during encoding.
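The three-encoder topology can be sketched in PyTorch as follows; the choice of GRUs, the embedding and hidden sizes, and the concatenation order are assumptions, since the claim only fixes that a word-level encoder and a sentence-level encoder feed a third recurrent encoder whose outputs drive a DAMSM-style loss.

```python
import torch
import torch.nn as nn

class HeterogeneousEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)  # first encoder: word level
        self.sent_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)  # second encoder: sentence level
        self.fuse_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)  # third encoder: fusion

    def forward(self, caption_ids, dialog_ids):
        # caption_ids: (B, Lc) word ids of the image description text
        # dialog_ids:  (B, S, Ls) word ids of S dialogue sentences
        word_feats, _ = self.word_rnn(self.embed(caption_ids))      # primary word features (B, Lc, H)

        B, S, Ls = dialog_ids.shape
        sent_in = self.embed(dialog_ids.reshape(B * S, Ls))
        _, sent_h = self.sent_rnn(sent_in)                          # last hidden state per sentence
        sent_feats = sent_h[-1].view(B, S, -1)                      # primary sentence features (B, S, H)

        fused_in = torch.cat([word_feats, sent_feats], dim=1)       # words first, then sentences
        all_feats, last_h = self.fuse_rnn(fused_in)                 # every output feeds the DAMSM-style loss
        return all_feats, last_h[-1]                                # last output = global encoded representation
```

In this reading, all per-step outputs of the third encoder would be scored by the DAMSM loss during pre-training, while the final hidden state serves as the global text representation passed to the image generator.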
6. An interactive incremental image generation method, comprising:
during a visual conversation, when each round of man-machine dialogue is finished, inputting all the man-machine dialogue text generated so far and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and displaying the generated image;
wherein the interactive incremental image generation model is obtained based on any one of the training methods of claims 1 to 5.
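A minimal sketch of the generation loop in claim 6; get_user_turn and show are hypothetical helpers standing in for the dialogue interface and display, and encoder and generator stand in for the pre-trained model:

```python
def interactive_session(encoder, generator, get_user_turn, show,
                        caption, max_rounds=10):
    """Regenerate and display the image after every completed dialogue round."""
    dialog = []
    for _ in range(max_rounds):
        turn = get_user_turn()                # one completed round of man-machine dialogue
        if turn is None:                      # the user ends the visual conversation
            break
        dialog.append(turn)
        text_repr = encoder(caption, dialog)  # caption + all dialogue rounds so far
        image = generator(text_repr)          # incremental regeneration of the image
        show(image)                           # display the updated image to the user
    return dialog
```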
7. An interactive incremental image generation model training apparatus, comprising: a processor configured to:
obtain dialogue sample data, wherein the dialogue sample data comprises dialogue text data, a standard image, an image description text and a total number of dialogue rounds;
train an interactive incremental image generation model in a random replay training mode by using the dialogue sample data and a pre-trained heterogeneous recurrent neural network encoder, so that the interactive incremental image generation model can generate images interactively and incrementally based on man-machine dialogue text and an image description text; wherein the training is performed using all dialogue text data obtained at a final moment of the dialogue and all dialogue text data obtained at an intermediate moment of the dialogue.
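One plausible reading of the "random replay" schedule is sketched below, under the assumption that the intermediate moment is drawn uniformly at random per sample; the claim does not fix how it is chosen.

```python
import random

def random_replay_views(caption, dialog_rounds, standard_image, total_rounds):
    """Yield the two training views used for one dialogue sample: the full
    dialogue at the final time T and the dialogue truncated at a randomly
    replayed intermediate time t."""
    T = total_rounds
    t = random.randint(1, max(T - 1, 1))                # replayed intermediate moment
    yield caption, dialog_rounds[:T], standard_image    # all dialogue text at the final time
    yield caption, dialog_rounds[:t], standard_image    # all dialogue text at time t
```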
8. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of any of the training methods of claims 1 to 5.
9. An interactive incremental image generation apparatus, comprising: a processor configured to:
during a visual conversation, when each round of man-machine dialogue is finished, input all the man-machine dialogue text generated so far and a preset image description text into a pre-trained interactive incremental image generation model for image generation, and display the generated image;
wherein the interactive incremental image generation model is obtained based on any one of the training methods of claims 1 to 5.
10. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of the interactive incremental image generation method of claim 6.
CN202110966233.7A 2021-08-23 2021-08-23 Training method and device for image generation model and image generation method and device Active CN113590800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110966233.7A CN113590800B (en) 2021-08-23 2021-08-23 Training method and device for image generation model and image generation method and device

Publications (2)

Publication Number Publication Date
CN113590800A true CN113590800A (en) 2021-11-02
CN113590800B CN113590800B (en) 2024-03-08

Family

ID=78239302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110966233.7A Active CN113590800B (en) 2021-08-23 2021-08-23 Training method and device for image generation model and image generation method and device

Country Status (1)

Country Link
CN (1) CN113590800B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824020A (en) * 2023-08-25 2023-09-29 北京生数科技有限公司 Image generation method and device, apparatus, medium, and program

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1694521A (en) * 2004-04-30 2005-11-09 株式会社东芝 Meta data for moving picture
US20110183302A1 (en) * 2010-01-28 2011-07-28 Harlow Robert W Situational Awareness Training System and Method
WO2020143130A1 (en) * 2019-01-08 2020-07-16 中国科学院自动化研究所 Autonomous evolution intelligent dialogue method, system and device based on physical environment game
CN110008365A (en) * 2019-04-09 2019-07-12 广东工业大学 A kind of image processing method, device, equipment and readable storage medium storing program for executing
CN112579759A (en) * 2020-12-28 2021-03-30 北京邮电大学 Model training method and task type visual dialogue problem generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓珍荣 (Deng Zhenrong); 张宝军 (Zhang Baojun); 蒋周琴 (Jiang Zhouqin); 黄文明 (Huang Wenming): "Image description model fusing word2vec and an attention mechanism" (融合word2vec和注意力机制的图像描述模型), Computer Science (计算机科学), no. 04 *

Also Published As

Publication number Publication date
CN113590800B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
EP4073787B1 (en) System and method for streaming end-to-end speech recognition with asynchronous decoders
US10846522B2 (en) Speaking classification using audio-visual data
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110326002B (en) Sequence processing using online attention
EP4390881A1 (en) Image generation method and related device
CN111898635A (en) Neural network training method, data acquisition method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN108665055B (en) Method and device for generating graphic description
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN111699497A (en) Fast decoding of sequence models using discrete latent variables
CN112579759B (en) Model training method and task type visual dialogue problem generation method and device
CN115311598A (en) Video description generation system based on relation perception
CN117152363A (en) Three-dimensional content generation method, device and equipment based on pre-training language model
CN115905485A (en) Common-situation conversation method and system based on common-sense self-adaptive selection
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN113590800A (en) Training method and device of image generation model and image generation method and device
CN109979461A (en) A kind of voice translation method and device
CN111414959B (en) Image recognition method, device, computer readable medium and electronic equipment
CN112330780A (en) Method and system for generating animation expression of target character
CN115712739B (en) Dance motion generation method, computer device and storage medium
CN117216223A (en) Dialogue text generation method and device, storage medium and electronic equipment
CN116601682A (en) Improved processing of sequential data via machine learning model featuring temporal residual connection
CN118397155B (en) Digital human animation generation and driving model training method and device and electronic equipment
CN111126479A (en) Image description generation method and system based on unsupervised uniqueness optimization
Mao et al. Vision and language navigation using multi-head attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant