CN117094365A - Training method and device for image-text generation model, electronic equipment and medium - Google Patents

Training method and device for image-text generation model, electronic equipment and medium Download PDF

Info

Publication number
CN117094365A
CN117094365A (Application No. CN202311101515.6A)
Authority
CN
China
Prior art keywords
image
training sample
text
sample pair
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311101515.6A
Other languages
Chinese (zh)
Inventor
罗龙强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202311101515.6A priority Critical patent/CN117094365A/en
Publication of CN117094365A publication Critical patent/CN117094365A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text


Abstract

The application discloses a training method and device for an image-text generation model, an electronic device and a medium, and belongs to the field of artificial intelligence. The method comprises the following steps: inputting a first training sample pair in a first training sample pair set into a first image-text generation model and outputting a second training sample pair, wherein the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; generating M training sample pairs based on the first training sample pair and the second training sample pair; replacing the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model.

Description

Training method and device for image-text generation model, electronic equipment and medium
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a training method and device for an image-text generation model, electronic equipment and a medium.
Background
At present, with the rise and continuous development of generative artificial intelligence (AI-generated content, AIGC), image-text generation models, such as text-to-image diffusion models in the field of AI drawing, are widely applied in fields such as wallpaper, avatars, games and cartoon design, and offer advantages such as high efficiency and a high degree of automation. In the related art, text may be input into an image-text generation model to output an image corresponding to the text.
However, the model training process of the above image-text generation model still suffers from low training accuracy.
Disclosure of Invention
The embodiments of the application aim to provide a training method and device for an image-text generation model, an electronic device and a medium, which can improve the model training accuracy of the image-text generation model.
In a first aspect, an embodiment of the present application provides a training method for an image-text generation model, the method comprising: inputting a first training sample pair in a first training sample pair set into a first image-text generation model and outputting a second training sample pair, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; replacing the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model.
In a second aspect, an embodiment of the present application provides a training device for an image-text generation model, the training device comprising: a processing module, a generating module, a replacing module, and a training module;
the processing module is configured to input a first training sample pair in a first training sample pair set into a first image-text generation model and output a second training sample pair, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; the generating module is configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; the replacing module is configured to replace the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs generated by the generating module; and the training module is configured to train the first image-text generation model based on the second training sample pair set obtained by the replacing module to obtain the target image-text generation model.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.
In the embodiments of the application, a first training sample pair in a first training sample pair set is input into a first image-text generation model and a second training sample pair is output, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; M training sample pairs are generated based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; the first training sample pair in the first training sample pair set is replaced with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and the first image-text generation model is trained based on the second training sample pair set to obtain a target image-text generation model. In this way, one or more training sample pairs in the training sample pair set are input into the image-text generation model for image-text conversion to obtain new training sample pairs; then, based on the image-text similarity between the text and the image in each training sample pair, the training sample pair set is continuously updated with the training sample pairs of higher image-text similarity, and the image-text generation model is trained on the updated training sample pair set. This improves the model training accuracy of the image-text generation model and, in turn, the consistency between the images and the text content generated by the image-text generation model.
Drawings
FIG. 1 is a schematic flow chart of a training method of an image-text generation model according to an embodiment of the present application;
fig. 2 is an example schematic diagram of a first image in a training method of an image-text generating model according to an embodiment of the present application;
FIG. 3 is a second flow chart of a training method of an image-text generating model according to an embodiment of the present application;
FIG. 4 is a third flow chart of a training method of an image-text generating model according to an embodiment of the present application;
FIG. 5 is a flowchart of a training method of an image-text generation model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training device for an image-text generating model according to an embodiment of the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
fig. 8 is a second schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description and the claims are used to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be implemented in orders other than those illustrated or described herein. The objects identified by "first", "second", etc. are generally of one type, and the number of such objects is not limited; for example, the first object may be one or more than one. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The following explains some concepts and/or terms related in the training method, device, electronic device and medium of the image-text generating model provided by the embodiment of the application.
The training method, the training device, the electronic equipment and the training medium of the image-text generation model provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof by combining the attached drawings.
At present, with the rise and continuous development of generative artificial intelligence (AI-generated content, AIGC), image-text generation models, such as text-to-image diffusion models in the field of AI drawing, are widely applied in fields such as wallpaper, avatars, games and cartoon design, and offer advantages such as high efficiency and a high degree of automation. The image-text generation model is used for image-text conversion, that is, converting an input text into an image corresponding to the text, or converting an input image into text corresponding to the image.
In the related art, during training of the image-text generation model, a training sample is input into the model so that a noise-added sample image is denoised to generate a denoised sample image, where texts with different text content correspond to different noise-added sample images. A first text-image alignment score is then obtained from a first representation vector of the denoised sample image and a second representation vector of the sample text, a first training sample is selected from the training samples of the current batch based on the first text-image alignment score, a first loss function of the image-text generation model is determined from the original sample image corresponding to the first training sample and the denoised sample image, and the image-text generation model is adjusted based on the first loss function. The model is then trained with the training samples of the next batch until training is finished.
However, since a conventional image-text generation model can only generate images from text content, its training process performs conversion in the text-to-image direction only, so the accuracy of the overall training process is low.
In the embodiments of the application, a first training sample pair in a first training sample pair set is input into a first image-text generation model and a second training sample pair is output, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is obtained by converting the first text into an image, and the second text is obtained by converting the first image into text; M training sample pairs are generated based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; the first training sample pair in the first training sample pair set is replaced with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity among the M training sample pairs; and the first image-text generation model is trained based on the second training sample pair set to obtain a target image-text generation model. In this way, one or more training sample pairs in the training sample pair set are input into the image-text generation model for image-text conversion to obtain new training sample pairs; then, based on the image-text similarity between the text and the image in each training sample pair, the training sample pair set is continuously updated with the training sample pairs of higher image-text similarity, and the image-text generation model is trained on the updated training sample pair set. This improves the model training accuracy of the image-text generation model and, in turn, the consistency between the images and the text content generated by the image-text generation model.
The embodiment of the application provides a training method for an image-text generation model, and fig. 1 shows a schematic flow chart of this training method, which can be applied to a training device of the image-text generation model. In the embodiment of the application, the training device of the image-text generation model is described by taking an electronic device as an example.
As shown in fig. 1, the training method of the image-text generating model provided in the embodiment of the present application may include the following steps 201 to 204.
Step 201, the electronic device inputs a first training sample pair in the first training sample pair set into a first image-text generating model, and outputs a second training sample pair.
In the embodiment of the application, the first image-text generation model is obtained by training based on the first training sample pair set.
In an embodiment of the present application, the first image-text generating model may be a convolutional neural network model.
In the embodiment of the present application, the first training sample pair set may be automatically acquired by an electronic device, or may be selected by a user.
In an embodiment of the present application, the first training sample pair set includes N training sample pairs, and each training sample pair includes: an image and a text describing the image. Wherein N is a positive integer.
In an embodiment of the present application, the first training sample pair includes a first image and a first text for describing image content of the first image.
For example, if the image content of the first image is as shown in fig. 2, the first text may be "a cat", "a cat is fishing", or "fishing".
The first training sample pair is any training sample pair in the first training sample pair set.
In an embodiment of the present application, the second training sample pair includes a second image and a second text.
In the embodiment of the application, the second image is an image obtained by converting the first text into an image. For example, if the first text is "a kitten", the second image may be any image of a kitten.
In the embodiment of the application, the second text is a text obtained by converting the first image into text. For example, taking the first image shown in fig. 2 as an example, the second text may be "a cat", "a cat is fishing", or "fishing".
Step 202, the electronic device generates M training sample pairs based on the first training sample pair and the second training sample pair.
In the embodiment of the present application, the M training sample pairs at least include a first training sample pair and a second training sample pair, where M is an integer greater than 1.
In an embodiment of the present application, each of the M training sample pairs includes an image and text for describing image content of the image.
Optionally, in an embodiment of the present application, the M training sample pairs include: the first training sample pair, the second training sample pair, the fourth training sample pair, and the fifth training sample pair.
In an embodiment of the present application, the first training sample pair includes a first image and a first text, the second training sample pair includes a second image and a second text, the fourth training sample pair includes a first image and a second text, and the fifth training sample pair includes a second image and a first text.
Step 203, the electronic device replaces the first training sample pair in the first training sample pair set with the third training sample pair, so as to obtain a second training sample pair set.
In the embodiment of the present application, the third training sample pair is a training sample pair with highest image-text similarity among the M training sample pairs.
In the embodiment of the application, the electronic device calculates the image-text similarity between the text and the image in each of the M training sample pairs, and then, based on these similarities, selects the training sample pair with the highest image-text similarity from the M training sample pairs as the third training sample pair.
In the embodiment of the application, the image-text similarity between the text and the image in a training sample pair is calculated by the overall loss function of the image-text generation model.
Optionally, in an embodiment of the present application, the overall loss function in the first teletext generation model may be constructed based on at least one of: a text encoder model, an image encoder model, a diffusion model, a text decoder model, and the like in the first teletext generation model.
Optionally, in the embodiment of the present application, each time training of the image-text generating model is completed, a new overall loss function is constructed based on the trained image-text generating model, and when the image-text generating model is trained next time, the new overall loss function can be used to calculate the image-text similarity between the text and the image in the training sample pair.
Therefore, because the overall loss function is constructed from the characteristics of the different modules in the image-text generation model, the constructed overall loss function fits the image-text generation model, so that the image-text similarity of a training sample pair calculated by the electronic device using the overall loss function is more accurate.
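As an informal illustration of steps 202 and 203 (not part of the claimed method), the following is a minimal Python sketch of building the four candidate pairs from <T1, M1> and <T2, M2> and keeping the one with the highest image-text similarity; the function and variable names, and the `similarity` callable standing in for the overall-loss-based similarity, are assumptions.

```python
from itertools import product

def select_best_pair(first_pair, second_pair, similarity):
    """Build the M = 4 candidate pairs from <T1, M1> and <T2, M2> and
    return the one whose text and image are most similar.

    `similarity(text, image)` is assumed to return a scalar score derived
    from the overall loss function of the current image-text model.
    """
    t1, m1 = first_pair   # original text / image
    t2, m2 = second_pair  # model-generated text / image
    candidates = [(t, m) for t, m in product((t1, t2), (m1, m2))]
    # The highest-scoring pair becomes the "third training sample pair"
    # that replaces <T1, M1> in the training sample pair set.
    return max(candidates, key=lambda pair: similarity(pair[0], pair[1]))
```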
Step 204, the electronic device trains the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model.
Optionally, in the embodiment of the present application, the electronic device continuously constructs the overall loss function based on the second training sample pair set and performs gradient back-propagation for each training sample pair, thereby updating the parameters of the image-text generation model. When the entire second training sample pair set has been input into the first image-text generation model, one complete training period has been completed. If the number of completed training periods reaches a preset number, the target image-text generation model is output directly; if not, the image-text generation model is not yet mature, so the electronic device continues to replace samples in the current training sample pair set and to train on it until the number of training periods reaches the preset number, finally taking the resulting image-text generation model as the target image-text generation model.
Optionally, in the embodiment of the present application, the step 204 specifically includes steps 204a to 204e:
and 204a, the electronic equipment trains the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model.
In the embodiment of the present application, the second image-text generating model is obtained by training the set based on a second training sample.
In an embodiment of the present application, the second image-text generating model may be a convolutional neural network model.
The electronic device inputs the training sample pairs in the second training sample pair set to the first image-text generation model, constructs an overall loss function, performs gradient back propagation on each training sample pair in the second training sample pair set, and updates parameters of the first image-text generation model to obtain a second image-text generation model.
Step 204b, when the number of model training iterations has not reached a preset threshold, the electronic device inputs a sixth training sample pair in the second training sample pair set into the second image-text generation model and outputs a seventh training sample pair.
In an embodiment of the present application, the sixth training sample pair is any training sample pair in the second training sample pair set.
In the embodiment of the application, the preset threshold is set by the electronic equipment or set by a user.
Step 204c, the electronic device generates N training sample pairs based on the sixth training sample pair and the seventh training sample pair.
In the embodiment of the present application, the N training sample pairs at least include a sixth training sample pair and a seventh training sample pair, where N is an integer greater than 1.
Step 204d, the electronic device replaces the sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set.
In the embodiment of the present application, the eighth training sample pair is a training sample pair with highest image-text similarity among N training sample pairs.
Step 204e, the electronic device trains the second image-text generation model based on the third training sample pair set to obtain a third image-text generation model, and iterates the above process until the number of model training iterations reaches the preset threshold, taking the image-text generation model obtained in the last round of training as the target image-text generation model.
The model training method provided by the embodiment of the application is exemplified by taking the first image as M1, the first text as T1, the second image as M2 and the second text as T2. Specifically, the above model training method may include the following steps B1 to B6:
and B1, the electronic equipment selects training sample pairs < T1, M1> from the training sample pair set.
And B2, the electronic equipment adopts a graph and text generation model which has completed one period of training to perform graph and text conversion processing on the training sample pair < T1, M1>, and outputs the training sample pair < T2, M2>.
And B3, the electronic equipment calculates the image-text similarity of the images and the texts in the four training sample pairs of < T1, M1>, < T1, M2>, < T2, M1>, < T2, M2> by adopting the integral loss function, and simultaneously, gradient back propagation is carried out on the four training sample pairs by constructing a new integral loss function.
And B4, the electronic equipment selects a training sample pair with highest image-text similarity, and the training sample pair is used for updating and replacing < T1, M1> on the assumption of < T2, M1>.
And B5, the electronic equipment traverses the complete training sample pair set, and one training period is completed.
Step B6, the electronic equipment judges whether the number of the current training period reaches the number of the maximum training period, and if so, the electronic equipment directly outputs a final image-text generation model; if not, repeating steps B1 to B5.
Therefore, the image-text generating model is trained by continuously replacing the training sample pair set, so that the consistency of the image and the text output by the image-text generating model is improved.
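For illustration only, the following Python sketch puts steps B1 to B6 together as one training loop. The `model.convert`, `train_step` and `similarity` callables, and the representation of the dataset as a mutable list of (text, image) pairs, are assumptions rather than interfaces defined by the patent.

```python
def train_image_text_model(model, dataset, num_epochs, train_step, similarity):
    """Sketch of the iterative sample-replacement training in steps B1-B6.

    `model.convert(t, m)` returns the generated pair <T2, M2>, `train_step`
    performs one gradient update on a (text, image) pair, and `similarity`
    scores a pair using the overall loss function.
    """
    for epoch in range(num_epochs):                       # B6: stop at the maximum number of periods
        for idx, (t1, m1) in enumerate(dataset):          # B1/B5: traverse the training sample pair set
            t2, m2 = model.convert(t1, m1)                # B2: text-to-image and image-to-text conversion
            candidates = [(t1, m1), (t1, m2), (t2, m1), (t2, m2)]
            for t, m in candidates:                       # B3: score and back-propagate on each pair
                train_step(model, t, m)
            best = max(candidates, key=lambda p: similarity(*p))
            dataset[idx] = best                           # B4: replace <T1, M1> with the best pair
    return model
```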
In the training method of the image-text generation model described above, one or more training sample pairs in the training sample pair set are input into the image-text generation model for image-text conversion to obtain new training sample pairs; then, based on the image-text similarity between the text and the image in each training sample pair, the training sample pair set is continuously updated with the training sample pairs of higher image-text similarity, and the image-text generation model is trained on the updated training sample pair set. This improves the model training accuracy of the image-text generation model and, in turn, the consistency between the images and the text content generated by the image-text generation model.
Optionally, in an embodiment of the present application, the image-text generating model at least includes: a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model.
Illustratively, the above text encoder model is used for encoding text, i.e. for extracting text characteristic information of the text.
Illustratively, the text encoder model described above includes: a Tokenizer module, and a self-attention module.
In one example, the Tokenizer module is configured to convert text into a numeric vector corresponding to the text.
In one example, the self-attention module is used to extract feature information in text.
The above-described image encoder model is used for encoding an image, i.e. for extracting image characteristic information of the image, for example.
Illustratively, the image encoder model described above includes: the device comprises a first convolution module, a second convolution module and a third convolution module.
Illustratively, the diffusion model is used to convert text feature information into image feature information.
In one example, the first convolution module includes: a first convolution layer, a second convolution layer, and a third convolution layer; the second convolution module includes: a first convolution layer, a second convolution layer, and a third convolution layer; the third convolution module includes: a first convolution layer, a second convolution layer, and a third convolution layer.
In one example, the convolution kernel of the first convolution layer is 3×3, the step size is 2, and the number of output channels is 8; the convolution kernel of the second convolution layer is 3×3, the step length is 1, and the output channel number is 8; the convolution kernel of the third convolution layer is 3×3, the step size is 1, and the number of output channels is 8.
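As an illustrative sketch (not part of the patent), the image encoder described above could be written in PyTorch roughly as follows. The stated 3x3 kernels, stride-2 first layer and stride-1 following layers are taken from the text; the padding of 1, the ReLU after each layer in every module, and the intermediate channel widths (8 -> 16 -> 32) needed to reach a [32, 64, 64] output from a [3, 512, 512] input are assumptions.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """One encoder block: a stride-2 3x3 convolution followed by two
    stride-1 3x3 convolutions, each followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

class ImageEncoder(nn.Module):
    """Three stacked ConvModules mapping a [3, 512, 512] image to a
    [32, 64, 64] image feature encoding matrix (IEM)."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(ConvModule(3, 8), ConvModule(8, 16), ConvModule(16, 32))

    def forward(self, image):            # image: [B, 3, 512, 512]
        return self.blocks(image)        # IEM:   [B, 32, 64, 64]
```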
Illustratively, the diffusion model includes: diffusion model encoder, diffusion model intermediate layer, and diffusion model decoder.
In one example, the diffusion model encoder includes: the first coding module, the second coding module and the third coding module.
In one example, the diffusion model decoder includes: the first decoding module, the second decoding module and the third decoding module.
It should be noted that the first encoding module, the second encoding module, the third encoding module, the first decoding module, the second decoding module and the third decoding module each comprise: a cross-attention operator (CrossAttention), Add & Norm, and FeedForward.
Illustratively, the above-described image decoder model is used to convert the image characteristic information into a final image.
In one example, the image decoder model includes: the device comprises a first deconvolution module, a second deconvolution module and a third deconvolution module.
In one example, the first deconvolution module includes: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer; the second deconvolution module includes: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer; the third deconvolution module includes: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer.
In one example, the deconvolution kernel of the first deconvolution is 3×3, the step size is 2, and the number of output channels is 16; the deconvolution kernel of the second deconvolution layer is 3 multiplied by 3, the step length is 1, and the number of output channels is 8; the deconvolution kernel of the third deconvolution layer is 3×3, the step size is 1, and the number of output channels is 8.
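For comparison, a hypothetical sketch of one deconvolution module matching the layout just described is shown below; the padding and output_padding values needed to double the spatial size, and the default channel schedule, are assumptions.

```python
import torch.nn as nn

class DeconvModule(nn.Module):
    """One decoder block: a stride-2 3x3 transposed convolution (upsampling
    by 2) followed by two stride-1 3x3 transposed convolutions, each
    followed by ReLU."""
    def __init__(self, in_ch=32, mid_ch=16, out_ch=8):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose2d(in_ch, mid_ch, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(mid_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, x):                 # e.g. [B, 32, 64, 64] -> [B, 8, 128, 128]
        return self.layers(x)
```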
Illustratively, the text decoder is configured to convert the image characteristic information into text.
Optionally, in the embodiment of the present application, the step 201 specifically includes the following steps 201a to 201f:
step 201a, the electronic device inputs a first training sample pair in the first training sample pair set to a first image-text generating model.
Step 201b, the electronic device inputs the first text into the text encoder model for feature extraction to obtain first text feature information of the first text.
The first text feature information may be expressed in the form of a numerical vector, for example.
For example, as shown in fig. 3, taking the first text as an example, the electronic device first inputs the first text into the Tokenizer module to obtain a first numerical vector with vector dimension [128,768], where 128 is the maximum number of character tokens and 768 is the size of the representation vector of each character. The electronic device then inputs the [128,768]-dimensional first numerical vector into the self-attention module as the QKV values: the feature vectors in the first numerical vector are extracted by the self-attention operator (SelfAttention) to obtain the key text feature information of the first text, the mean and standard deviation corresponding to the key text feature information are calculated by the vector addition and normalization operator (Add & Norm), and the mean and standard deviation corresponding to the key text feature information are fused by the feed-forward operator (FeedForward). This calculation is repeated 12 times to obtain a [128,768]-dimensional text encoding vector (Text Encoder Vector, TCV), i.e. the more fully described first text feature information.
Illustratively, the QKV values are the three input vectors of the attention mechanism, denoted Query, Key and Value. In the self-attention module (SelfAttention), the three vectors come from mappings of the same input; in cross attention (CrossAttention), the Query is mapped from an independent input, while the Key and Value come from mappings of the same input.
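The following is a minimal PyTorch-style sketch of such a text encoder, given only for illustration: the tokenizer embedding, vocabulary size, head count, feed-forward width and GELU activation are assumptions; only the 12 blocks, the self-attention/Add & Norm/FeedForward structure, and the [128, 768] shape come from the description above.

```python
import torch
import torch.nn as nn

class TextEncoderBlock(nn.Module):
    """One of the 12 blocks: SelfAttention -> Add & Norm -> FeedForward -> Add & Norm."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: [B, 128, 768]
        # Query, Key and Value all come from the same input in self-attention.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)            # Add & Norm
        x = self.norm2(x + self.ffn(x))         # FeedForward + Add & Norm
        return x

class TextEncoder(nn.Module):
    """Tokenizer embedding followed by 12 blocks, producing the [128, 768]
    first text feature information."""
    def __init__(self, vocab_size=30000, dim=768, depth=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList([TextEncoderBlock(dim) for _ in range(depth)])

    def forward(self, token_ids):               # token_ids: [B, 128]
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        return x                                # [B, 128, 768]
```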
Step 201c, the electronic device inputs the first image into the image encoder model to perform feature extraction, so as to obtain first image feature information of the first image.
Illustratively, the first image may be represented as [X, Y, Z], where X is the number of color channels of the first image and Y and Z are the pixel dimensions of the image.
The first image characteristic information may be represented in the form of a vector matrix, for example.
For example, as shown in fig. 4, for the first convolution module in the image encoder model, the electronic device inputs the first image with the vector dimension of [3,512,512] into the first convolution module, calculates the feature matrix with the vector dimension of [8,256,256] through the first convolution layer, and then passes the feature matrix with the vector dimension of [8,256,256] through an activation function Relu to obtain the feature matrix with the dimension of [8,256,256 ]. Next, the electronic device sequentially inputs the [8,256,256] feature matrix into the second convolution layer and the third convolution layer, and the calculation process is the same as that of the first convolution layer, but the step size of convolution is changed to 1, so that the [8,256,256] feature matrix is finally obtained.
The electronic device sequentially inputs the [8,256,256] feature matrix obtained by the output of the first convolution module into the second convolution module and the third convolution module, and the calculation process is the same as that of the first convolution module, so as to finally obtain the [32,64,64] image feature encoding matrix (Image Encoder Matrix, IEM), namely the first image feature information.
Optionally, after obtaining the first text feature information, i.e. after step 201b, the electronic device may subject the first text feature information to a time mapping process to obtain a text condition control vector (Text Condition Vector, TCV).
For example, in connection with fig. 3, the electronic device first passes the [128,768]-dimensional first text feature information through the linear mapping layer Project to obtain a text projection vector (Text Project Vector, TPV) with dimension [320,768]. The initialized time embedding is then passed through the linear mapping layer Project to obtain a time mapping embedding with dimension [320,768]. Finally, the TPV and the time mapping embedding are added to obtain the TCV with dimension [320,768].
Optionally, after obtaining the first image feature information, i.e. after step 201c, the electronic device may perform noise addition processing on the first image feature information, and perform convolution calculation on the processed first image feature information to obtain an image coding mapping vector (Image Project Vector, IPV).
For example, the noise-adding process (AddNoise) may be to add a randomly sampled Gaussian noise matrix (Gaussian Noise Matrix, GNM) to the first image feature information to obtain a noised image latent matrix, denoted Latent.
Illustratively, the electronic device performs a Conv convolution operation on the above Latent to obtain a matrix with dimension [320,64,64], where the convolution kernel size is 3×3, the stride is 1, and the number of output channels is 320. The matrix is then reshaped by the reshape function to obtain the IPV with dimension [320,64,64].
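A hypothetical sketch of this noising and projection step is given below; the padding of 1 (to preserve the 64x64 spatial size) is an assumption, and the final reshape is omitted because the stated target shape coincides with the convolution output.

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 1, 320 output channels as stated above; padding is assumed.
project_conv = nn.Conv2d(32, 320, kernel_size=3, stride=1, padding=1)

def make_image_project_vector(iem, conv=project_conv):
    """Add randomly sampled Gaussian noise (GNM) to the [32, 64, 64] image
    feature encoding matrix to get the Latent, then project it with the
    convolution to obtain the IPV."""
    gnm = torch.randn_like(iem)          # random Gaussian noise matrix
    latent = iem + gnm                   # noised image latent matrix (Latent)
    ipv = conv(latent)                   # [B, 320, 64, 64]
    return latent, ipv
```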
It should be noted that the above two processes are not required to be performed in any particular order: the first text feature information may be processed first and then the first image feature information, the first image feature information may be processed first and then the first text feature information, or the two may be processed simultaneously. The application is not limited in this respect.
Step 201d, the electronic device inputs the first text feature information and the first image feature information into the diffusion model for cross-attention and convolution calculation to obtain second image feature information.
The second image feature information is, for example, image feature information corresponding to the first text.
For example, for a first encoding module in a diffusion model encoder in a diffusion model, the processing of the first encoding module may include the following S1 to S7:
S1) The electronic device takes the first image feature information as the Query and the first text feature information as the Key and Value, inputs them into the first encoding module, and performs cross-attention calculation: the associated image feature information in the first image feature information and the first text feature information is extracted by cross attention, the mean and standard deviation corresponding to the associated image feature information are calculated by the vector addition and normalization operator (Add & Norm), and the mean and standard deviation corresponding to the associated image feature information are fused by the feed-forward operator (FeedForward). After this calculation is repeated twice, a vector with vector dimension [320,64,64] is output. This vector then undergoes a convolution operation with a 3×3 kernel, a stride of 2 and 640 output channels to obtain a [640,32,32]-dimensional vector, which is used as the Query of the second encoding module and processed again in the same way as in the first encoding module; the operation is repeated twice to obtain a [320,64,64]-dimensional vector.
S2) The [320,64,64]-dimensional vector undergoes a convolution operation to output a vector with vector dimension [640,32,32], which is used as the Query of the third encoding module and processed again in the same way as in the first encoding module; the operation is repeated twice to obtain a [640,32,32]-dimensional vector. A further convolution operation then outputs a vector with vector dimension [1280,16,16].
S3) The vector with vector dimension [1280,16,16] is input into the diffusion model intermediate layer. The calculation process of the intermediate layer is the same as that of the first encoding module, and after the CrossAttention, Add & Norm and FeedForward calculations, a vector with vector dimension [1280,16,16] is output.
S4) The vector output by the third encoding module and the vector output by the diffusion model intermediate layer are averaged, and the averaged vector is input into the first decoding module as the Query, with the first text feature information as the Key and Value, for calculation. The calculation process is the same as that of the first encoding module and is repeated twice to obtain a [1280,16,16]-dimensional vector. A deconvolution with a 3×3 kernel, a stride of 2 and 640 channels is then applied to this vector, outputting a [640,32,32]-dimensional vector.
S5) The vector output by the first decoding module and the vector output by the second encoding module are averaged, and the averaged vector is input into the second decoding module as the Query, with the first text feature information as the Key and Value, for calculation. The calculation process, the same as that of the first encoding module, is repeated twice to obtain a [640,32,32]-dimensional vector. A deconvolution operation then outputs a vector with vector dimension [320,32,32].
S6) The vector output by the second decoding module and the vector output by the first encoding module are averaged, and the averaged vector is input into the third decoding module as the Query, with the first text feature information as the Key and Value, for calculation. The calculation process, the same as that of the first encoding module, is repeated twice, and a vector with vector dimension [320,64,64] is output.
S7) The above vector is passed through one convolution layer (Conv_out) to obtain a [32,64,64]-dimensional vector, which is recorded as the image prediction noise matrix, i.e. the image noise matrix predicted from the first text. Finally, this image noise matrix is subtracted from the Latent to obtain the final image reconstruction matrix (Image Reconstruction Matrix, IRM), i.e. the second image feature information.
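The repeated building block in S1 to S6 is cross attention with the image branch as Query and the text features as Key/Value, followed by Add & Norm and FeedForward. The sketch below illustrates one such block in PyTorch; the head count, the feed-forward width, and the assumption that both branches are projected to a common token dimension before attention are illustrative choices, not details given in the patent.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One diffusion-model block: CrossAttention (image as Query, text as
    Key/Value) -> Add & Norm -> FeedForward -> Add & Norm."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # Query comes from the image branch, Key and Value from the text branch.
        attn_out, _ = self.cross_attn(image_tokens, text_tokens, text_tokens)
        x = self.norm1(image_tokens + attn_out)      # Add & Norm
        x = self.norm2(x + self.ffn(x))              # FeedForward + Add & Norm
        return x

# At the end of the U-Net-style pass (S7), the predicted noise is removed:
#   irm = latent - predicted_noise   # image reconstruction matrix, [32, 64, 64]
```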
Step 201e, the electronic device inputs the second image feature information into the image decoder model to perform image decoding, obtains third image feature information, and outputs the second image based on the third image feature information.
For example, for the first deconvolution module, the electronic device may input the second image feature information with vector dimension [32,64,64] into the first deconvolution module, calculate a feature matrix with dimension [16,128,128] through the first deconvolution layer, and then pass the [16,128,128]-dimensional feature matrix through a ReLU activation function to obtain a [16,128,128]-dimensional feature matrix. The [16,128,128] feature matrix is then input into the second deconvolution layer and the third deconvolution layer in sequence; the calculation process is the same as that of the first deconvolution layer, except that the deconvolution stride is changed to 1 and the number of output channels to 8, and the [16,128,128] feature matrix is finally obtained.
It should be noted that the feature matrix in [16,128,128] dimension output by the first deconvolution module is sequentially input into the second deconvolution module and the third deconvolution module, the calculation process is the same as that of the first deconvolution module, and the third image feature information in [3,512,512] dimension is finally obtained.
The electronic device converts the feature vector corresponding to the third image feature information into the second image and outputs it.
Step 201f, the electronic device inputs the first image feature information into a text decoder model for text prediction, obtains text prediction parameters, and outputs a second text based on the text prediction parameters.
Illustratively, the first image feature information is input into a text decoder model, text codebook information having a mapping relation with the first image feature information is obtained, and prediction parameters are obtained based on the text codebook information.
Illustratively, the prediction parameters are the probabilities that the text describing the content represented by the first image feature information belongs to each of the different preset texts.
The text codebook information is illustratively a text codebook of the system.
The preset text is, for example, text in the text codebook information.
Illustratively, the text decoder model includes 12 text decoding modules, each of which comprises: CrossAttention, Add & Norm, and FeedForward.
For example, the electronic device inputs the first piece of text codebook information (dimension [1,768]) in the text codebook information as the Query, and the first image feature information as the Key and Value, into the first decoding module to perform cross-attention calculation and obtain the associated feature information between the text codebook information and the first image feature information. The mean and standard deviation corresponding to this associated feature information are calculated by Add & Norm, and the mean and standard deviation are fused by FeedForward. The fused vector is then input into the remaining 11 decoding modules in sequence, each performing the same calculation, to finally obtain a first text vector with dimension [1,768].
Then, the vocabulary vector matrix corresponding to the [1,768]-dimensional first text vector is obtained through the vocabulary (TokenEmbedding), and the text character with the highest probability is selected as the first text character of the second text through one layer of Softmax calculation. This text character is spliced onto the first image feature information to obtain a vector with dimension [33,768], which is used as the Key and Value; the second piece of text codebook information (dimension [1,768]) is again input into the first decoding module as the Query, the cross-attention calculation and the Add & Norm and FeedForward operations are performed, and after the calculation of the 12 decoding modules a second text vector with dimension [1,768] is finally obtained. The second text character of the second text is then obtained through the vocabulary and Softmax calculations. This word-by-word iterative calculation continues until the set maximum text length or a terminator is reached, at which point the iteration stops and the resulting character string is output as the second text.
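The word-by-word decoding just described can be summarized, purely for illustration, by the Python sketch below. The `decoder` callable (standing in for the 12 cross-attention decoding modules), the `codebook` and `token_embeddings` tensors, the terminator id and the greedy argmax choice are all assumptions used to make the sketch self-contained.

```python
import torch
import torch.nn.functional as F

def greedy_decode(decoder, codebook, token_embeddings, image_feats, max_len=128, eos_id=2):
    """At step j, the j-th [1, 768] codebook entry is the Query, while the
    [32, 768] image feature information concatenated with the embeddings of
    the characters generated so far is the Key/Value. The decoder output is
    projected onto the vocabulary and the highest-probability character kept."""
    generated = []
    key_value = image_feats                              # [32, 768]
    for j in range(max_len):
        query = codebook[j].unsqueeze(0)                 # [1, 768]
        out = decoder(query, key_value)                  # 12 cross-attention modules -> [1, 768]
        logits = out @ token_embeddings.t()              # map onto the vocabulary (TokenEmbedding)
        next_id = int(F.softmax(logits, dim=-1).argmax())
        if next_id == eos_id:                            # stop at the terminator
            break
        generated.append(next_id)
        next_emb = token_embeddings[next_id].unsqueeze(0)
        key_value = torch.cat([key_value, next_emb], 0)  # e.g. [33, 768] after the first step
    return generated
```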
Optionally, in the embodiment of the present application, after the electronic device obtains the first image feature information, the electronic device may map the first image feature information to the same dimension as the first text feature information, and then perform text prediction on the mapped input text decoder model to obtain the text prediction parameter.
For example, the first image feature information is a vector in [32,64,64] dimension, the electronic device converts the vector of the first image feature information into a vector in [32,4096] dimension through a reshaping function, and then maps the vector through a linear mapping layer Project to obtain a vector in [32,768] dimension, which is the same as the first text feature information, namely the fourth image feature information.
The fourth image feature information may be input as the first image feature information into the text decoder model.
Therefore, through various network models in the image-text generation model, the text and the image are subjected to fusion association training, so that the consistency between the generated text and the generated image can be improved.
Optionally, in the embodiment of the present application, before the step 202, the training method for the image-text generating model provided in the embodiment of the present application further includes steps 301 to 304:
step 301, the electronic device constructs a first loss function based on the first text feature information and the first image feature information.
Illustratively, the construction process of the first loss function includes the following steps A1 to A3:
Step A1, the electronic device passes the first text feature information through the linear mapping layer Project to obtain a [256,768]-dimensional text alignment vector TAV.
Step A2, the electronic device passes the first image feature information through a convolution layer Conv to obtain a [768,16,16]-dimensional image alignment matrix IAM, and then reshapes it into a [256,768]-dimensional image alignment vector IAV via the reshape function Reshape.
Step A3, the electronic device calculates the similarity between the text alignment vector TAV and the image alignment vector IAV to construct an image-text contrast loss function, recorded as C1 = 1 - Cos(TAV, IAV).
Step 302, the electronic device constructs a second loss function based on the second image feature information and the gaussian noise matrix.
Illustratively, the second loss function is constructed as follows: the electronic device constructs an MSE loss function, denoted C2, between the randomly sampled Gaussian noise matrix GNM and the image prediction noise matrix INM, which is derived from the first text and the feature information of the first image. In this way, the image prediction noise matrix INM is driven towards the randomly sampled Gaussian noise matrix GNM, so that the image can be reconstructed back to the original image as closely as possible.
Step 303, the electronic device constructs a third loss function based on the text prediction parameter.
The text prediction parameter may be a probability value for predicting a text character as a certain text character.
Illustratively, taking the i-th training sample pair as an example, let p_k^i denote the k-th embedding corresponding to the first image feature information of the i-th sample (32 in total), and let c_j^i denote the embedding of the j-th text character of the i-th sample (maximum text length 128). Assume that, after Softmax, the predicted 2nd text character corresponds to the embedding c_2^i with the largest Softmax value, i.e. the probability p_θ(c_2^i | p_1^i, …, p_32^i, c_1^i) is higher than the probability of every other character in the vocabulary. The third loss function is then constructed as:
C3 = -Σ_j log p_θ(c_j^i | p_1^i, …, p_32^i, c_1^i, …, c_(j-1)^i)
Step 304, the electronic device constructs an overall loss function based on the first loss function, the second loss function, and the third loss function.
In the embodiment of the present application, the overall loss function may be:
C = a×C1 + b×C2 + c×C3, where a + b + c = 1 (Formula 1)
where C1 represents the first loss function constructed based on the text encoder model and the image encoder model in the image-text generation model, C2 represents the second loss function constructed based on the diffusion model in the image-text generation model, and C3 represents the third loss function constructed based on the text decoder model in the image-text generation model. a, b and c respectively represent the weights of the three loss functions, and the three weights sum to 1.
It should be noted that the weights of the three loss functions may be preset weights.
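For illustration, a minimal sketch of Formula 1 is given below. The flattening of TAV and IAV for the cosine term, the cross-entropy form of C3, and the specific weight values are assumptions; the patent only requires that a + b + c = 1.

```python
import torch
import torch.nn.functional as F

def overall_loss(tav, iav, gnm, inm, char_logits, char_targets, a=0.4, b=0.3, c=0.3):
    """C = a*C1 + b*C2 + c*C3 (Formula 1).

    C1: image-text contrast loss 1 - Cos(TAV, IAV).
    C2: MSE between the sampled Gaussian noise GNM and the predicted noise INM.
    C3: negative log-likelihood of the target text characters.
    """
    c1 = 1 - F.cosine_similarity(tav.flatten(), iav.flatten(), dim=0)   # contrast loss
    c2 = F.mse_loss(inm, gnm)                                           # diffusion noise loss
    c3 = F.cross_entropy(char_logits, char_targets)                     # text decoding loss
    return a * c1 + b * c2 + c * c3
```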
Optionally, in the embodiment of the present application, in combination with the steps 301 to 304, after the step 202, the training method for the image-text generating model provided in the embodiment of the present application further includes step 305:
and 305, the electronic equipment calculates the image-text similarity of each training sample pair of the M training sample pairs by adopting an integral loss function.
It should be noted that, as shown in fig. 5, the image-text generation model may include four modules: an input module, an inference module, an output module, and a training loss function module. During training, the electronic device first inputs the training sample pair into the input module of the image-text generation model, and the final image and text are then output through the inference process of the inference module and the output module of the image-text generation model. Meanwhile, an overall loss function is constructed from the intermediate results of the inference module and the final output result to perform training and learning.
Optionally, in the embodiment of the present application, after the step 204, the training method for the image-text generating model provided in the embodiment of the present application further includes the following step 401 or step 402:
Step 401, the electronic device inputs the fourth image into the target image-text generating model to perform image-text conversion, and outputs a fourth text.
The fourth text is used to describe the image content of the fourth image.
Step 402, the electronic device inputs the fifth text into the target image-text generating model to perform image-text conversion, and outputs a fifth image.
Illustratively, the fifth image is generated based on the fifth text.
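As an illustration of steps 401 and 402, the sketch below wires the sub-models of the trained target model into the two inference directions; the attribute names (image_encoder, text_decoder, text_encoder, diffusion, image_decoder, scheduler_step, num_steps) are assumptions about how such a model might be organized, not an interface defined by the patent.

```python
import torch

def image_to_text(model, fourth_image: torch.Tensor) -> torch.Tensor:
    """Step 401: image-to-text conversion (image encoder -> text decoder)."""
    image_features = model.image_encoder(fourth_image)
    fourth_text_ids = model.text_decoder.generate(image_features)  # autoregressive character prediction
    return fourth_text_ids

def text_to_image(model, fifth_text_ids: torch.Tensor) -> torch.Tensor:
    """Step 402: text-to-image conversion (text encoder -> diffusion model -> image decoder)."""
    text_features = model.text_encoder(fifth_text_ids)
    latent = torch.randn(1, 4, 64, 64)              # start from randomly sampled Gaussian noise
    for t in reversed(range(model.num_steps)):      # iterative denoising, conditioned on the text features
        noise_pred = model.diffusion(latent, text_features, t)
        latent = model.scheduler_step(noise_pred, t, latent)
    fifth_image = model.image_decoder(latent)
    return fifth_image
```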
It should be noted that, in the training method of the image-text generating model provided by the embodiment of the present application, the execution subject may be a training device of the image-text generating model, an electronic device, or a functional module or entity in the electronic device. In the embodiment of the present application, the training device of the image-text generating model is described by taking, as an example, the case in which the training device of the image-text generating model executes the training method of the image-text generating model.
Fig. 6 shows a schematic diagram of a possible structure of a training device of the image-text generating model according to an embodiment of the present application. As shown in fig. 6, the training device 700 for the image-text generating model may include: a processing module 701, a generating module 702, a replacing module 703 and a training module 704;
the processing module 701 is configured to input a first training sample pair in a first training sample pair set to a first image-text generating model, and output a second training sample pair, where the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair includes a first image and a first text for describing an image content of the first image, the second training sample pair includes a second image and a second text, the second image is an image obtained by performing image-text conversion on the first text, and the second text is a text obtained by performing image-text conversion on the first image; the generating module 702 is configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, where the M training sample pairs at least include the first training sample pair and the second training sample pair, and M is an integer greater than 1; the replacing module 703 is configured to replace a first training sample pair in the first training sample pair set with a third training sample pair, to obtain a second training sample pair set, where the third training sample pair is a training sample pair with highest image-text similarity in the M training sample pairs generated by the generating module 702; the training module 704 is configured to train the first image-text generating model based on the second training sample pair set obtained through replacement by the replacing module 703, so as to obtain a target image-text generating model.
Optionally, in an embodiment of the present application, the first image-text generating model includes a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
the processing module 701 is specifically configured to:
inputting a first training sample pair in the first training sample pair set into a first image-text generation model; inputting the first text into a text encoder model for feature extraction to obtain first text feature information of the first text; inputting the first image into an image encoder model for feature extraction to obtain first image feature information of the first image; inputting the first text feature information and the first image feature information into a diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information; inputting the second image characteristic information into an image decoder model for image decoding to obtain third image characteristic information, and outputting a second image based on the third image characteristic information; and inputting the first image characteristic information into a text decoder model for text prediction to obtain text prediction parameters, and outputting a second text based on the text prediction parameters.
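The sketch below wires these five sub-models together in the order described above, producing the second image and second text from the first training sample pair; the module classes are placeholders standing in for whatever concrete encoder, diffusion and decoder networks an implementation of the patent would use.

```python
import torch
import torch.nn as nn

class TextImageGenerationModel(nn.Module):
    """Hypothetical composition of the five sub-models described above."""

    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module,
                 diffusion: nn.Module, image_decoder: nn.Module, text_decoder: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.diffusion = diffusion          # cross-attention and convolution over the two feature sets
        self.image_decoder = image_decoder
        self.text_decoder = text_decoder

    def forward(self, first_image: torch.Tensor, first_text_ids: torch.Tensor):
        first_text_features = self.text_encoder(first_text_ids)        # first text feature information
        first_image_features = self.image_encoder(first_image)         # first image feature information
        second_image_features = self.diffusion(first_text_features,
                                               first_image_features)   # second image feature information
        second_image = self.image_decoder(second_image_features)       # decode to obtain the second image
        text_pred_params = self.text_decoder(first_image_features)     # text prediction parameters (logits)
        second_text = text_pred_params.argmax(dim=-1)                  # second text (character indices)
        return second_image, second_text
```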
Optionally, in this embodiment of the present application, the processing module 701 is further configured to construct, before generating M training sample pairs based on the first training sample pair and the second training sample pair, a first loss function based on the first text feature information and the first image feature information; the processing module 701 is further configured to construct a second loss function based on the second image feature information and the gaussian noise matrix; the processing module 701 is further configured to construct a third loss function based on the text prediction parameter; the processing module 701 is further configured to construct an overall loss function based on the first loss function, the second loss function, and the third loss function; the processing module 701 is further configured to, after generating M training sample pairs based on the first training sample pair and the second training sample pair, calculate the image-text similarity of each training sample pair in the M training sample pairs by using the overall loss function.
Optionally, in an embodiment of the present application, the M training sample pairs include: a first training sample pair, a second training sample pair, a fourth training sample pair, and a fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text and the fifth training sample pair comprises the second image and the first text.
Optionally, in an embodiment of the present application, the training module 704 is specifically configured to:
training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model; under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into a second image-text generation model, and outputting a seventh training sample pair; generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1; replacing a sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is a training sample pair with highest image-text similarity in N training sample pairs; training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the training times of the model reach a preset threshold value, and taking the image-text generating model obtained by the last training as a target image-text generating model.
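A minimal sketch of this iterative procedure, assuming helper callables for one round of training, for converting a pair through the current model, for building the N candidate pairs, and for scoring image-text similarity; the callable names and the preset threshold value are assumptions used only to show the control flow.

```python
from typing import Callable, List, Tuple

Pair = Tuple[object, object]  # (image, text) placeholder

def train_target_model(model,
                       sample_set: List[Pair],
                       train_one_round: Callable,                 # trains the model on the current sample pair set
                       convert_pair: Callable,                    # runs a pair through the model to get a new pair
                       build_candidates: Callable,                # forms the N candidate pairs from old and new pair
                       similarity: Callable[[Pair], float],       # image-text similarity, e.g. from the overall loss
                       preset_threshold: int = 5):
    """Iterative training with training-sample replacement, as described above."""
    for round_idx in range(preset_threshold):
        model = train_one_round(model, sample_set)                 # e.g. first -> second image-text model
        if round_idx + 1 == preset_threshold:
            break                                                  # model from the last round is the target model
        source_pair = sample_set[0]                                # e.g. the sixth training sample pair
        generated_pair = convert_pair(model, source_pair)          # e.g. the seventh training sample pair
        candidates = build_candidates(source_pair, generated_pair) # the N training sample pairs
        best_pair = max(candidates, key=similarity)                # highest image-text similarity
        sample_set = [best_pair] + sample_set[1:]                  # replacement yields the next sample pair set
    return model
```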
Optionally, in the embodiment of the present application, the processing module 701 is further configured to, after the first image-text generating model is trained based on the second training sample pair set to obtain the target image-text generating model, input a fourth image into the target image-text generating model for image-text conversion and output a fourth text, where the fourth text is used to describe the image content of the fourth image; or input a fifth text into the target image-text generating model for image-text conversion and output a fifth image, where the fifth image is generated based on the fifth text.
In the training device of the image-text generating model provided by the embodiment of the application, a first training sample pair in a first training sample pair set is input into a first image-text generating model, a second training sample pair is output, the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image; generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; replacing a first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity in M training sample pairs; and training the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model. In this way, one or more training samples in the training sample set are subjected to image-text conversion on the input image-text generation model to obtain new training sample pairs, then, based on the image-text similarity between texts and images in each training sample, the training sample pair set is continuously updated by using the training sample pairs with higher image-text similarity, so that the image-text generation model is trained based on the updated training sample pair set, the model training precision of the image-text generation model is improved, and the consistency of images and text contents generated by the image-text generation model is further improved.
The training device of the image-text generating model in the embodiment of the application may be an electronic device, or may be a component in the electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, mobile internet device (Mobile Internet Device, MID), augmented reality (augmented reality, AR)/virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., but may also be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and the embodiments of the present application are not limited in particular.
The training device of the image-text generating model in the embodiment of the application may be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiment of the present application is not limited specifically.
The training device for the image-text generating model provided by the embodiment of the application can realize each process realized by the method embodiments of fig. 1 to 5, and in order to avoid repetition, the description is omitted here.
Optionally, as shown in fig. 7, the embodiment of the present application further provides an electronic device 800, including a processor 801 and a memory 802, where the memory 802 stores a program or an instruction that can be executed on the processor 801, and the program or the instruction implements each step of the training method embodiment of the image-text generating model when executed by the processor 801, and the steps can achieve the same technical effect, so that repetition is avoided, and no further description is given here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, and processor 110.
Those skilled in the art will appreciate that the electronic device 100 may further include a power source (e.g., a battery) for powering the various components, and that the power source may be logically coupled to the processor 110 via a power management system, so as to implement functions such as charging management, discharging management and power consumption management through the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components, which are not described in detail herein.
The processor 110 is configured to input a first training sample pair in a first training sample pair set to a first image-text generating model, and output a second training sample pair, where the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair includes a first image and a first text for describing an image content of the first image, the second training sample pair includes a second image and a second text, the second image is an image obtained by performing image-text conversion on the first text, and the second text is a text obtained by performing image-text conversion on the first image; the processor 110 is further configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, where the M training sample pairs at least include the first training sample pair and the second training sample pair, and M is an integer greater than 1; the processor 110 is further configured to replace a first training sample pair in the first training sample pair set with a third training sample pair, to obtain a second training sample pair set, where the third training sample pair is a training sample pair with highest image-text similarity in M training sample pairs; the processor 110 is further configured to train the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model.
Optionally, in an embodiment of the present application, the graphics context generating model includes a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
the processor 110 is specifically configured to:
inputting a first training sample pair in the first training sample pair set into a first image-text generation model; inputting the first text into a text encoder model for feature extraction to obtain first text feature information of the first text; inputting the first image into an image encoder model for feature extraction to obtain first image feature information of the first image; inputting the first text feature information and the first image feature information into a diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information; inputting the second image characteristic information into an image decoder model for image decoding to obtain third image characteristic information, and outputting a second image based on the third image characteristic information; and inputting the first image characteristic information into a text decoder model for text prediction to obtain text prediction parameters, and outputting a second text based on the text prediction parameters.
Optionally, in an embodiment of the present application, the processor 110 is further configured to construct a first loss function based on the first text feature information and the first image feature information before generating M training sample pairs based on the first training sample pair and the second training sample pair; the processor 110 is further configured to construct a second loss function based on the second image feature information and the gaussian noise matrix; the processor 110 is further configured to construct a third loss function based on the text prediction parameter; the processor 110 is further configured to construct an overall loss function based on the first loss function, the second loss function, and the third loss function; the processor 110 is further configured to calculate the image-text similarity of each of the M training sample pairs by using the overall loss function after generating the M training sample pairs based on the first training sample pair and the second training sample pair.
Optionally, in an embodiment of the present application, the M training sample pairs include: a first training sample pair, a second training sample pair, a fourth training sample pair, and a fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text and the fifth training sample pair comprises the second image and the first text.
Optionally, in an embodiment of the present application, the processor 110 is specifically configured to:
training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model; under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into a second image-text generation model, and outputting a seventh training sample pair; generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1; replacing a sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is a training sample pair with highest image-text similarity in N training sample pairs; training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the training times of the model reach a preset threshold value, and taking the image-text generating model obtained by the last training as a target image-text generating model.
Optionally, in the embodiment of the present application, the processor 110 is further configured to, after the first image-text generating model is trained based on the second training sample pair set to obtain the target image-text generating model, input a fourth image into the target image-text generating model for image-text conversion and output a fourth text, where the fourth text is used to describe the image content of the fourth image; or input a fifth text into the target image-text generating model for image-text conversion and output a fifth image, where the fifth image is generated based on the fifth text.
In the electronic device provided by the embodiment of the application, a first training sample pair in a first training sample pair set is input into a first image-text generation model, a second training sample pair is output, the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image; generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1; replacing a first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with the highest image-text similarity in M training sample pairs; and training the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model. In this way, one or more training samples in the training sample set are subjected to image-text conversion on the input image-text generation model to obtain new training sample pairs, then, based on the image-text similarity between texts and images in each training sample, the training sample pair set is continuously updated by using the training sample pairs with higher image-text similarity, so that the image-text generation model is trained based on the updated training sample pair set, the model training precision of the image-text generation model is improved, and the consistency of images and text contents generated by the image-text generation model is further improved.
It should be appreciated that in embodiments of the present application, the input unit 104 may include a graphics processor (Graphics Processing Unit, GPU) 1041 and a microphone 1042, and the graphics processor 1041 processes image data of still pictures or videos obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts: a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a first memory area storing programs or instructions and a second memory area storing data, wherein the first memory area may store an operating system, and application programs or instructions (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like. Further, the memory 109 may include volatile memory or nonvolatile memory, or the memory 109 may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synch-Link DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). Memory 109 in embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, etc., and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The embodiment of the application also provides a readable storage medium, wherein the readable storage medium stores a program or an instruction, and the program or the instruction realizes each process of the training method embodiment of the image-text generation model when being executed by a processor, and can achieve the same technical effect, so that repetition is avoided and redundant description is omitted.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The embodiment of the application further provides a chip, the chip comprises a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running a program or instructions, the processes of the training method embodiment of the image-text generation model can be realized, the same technical effects can be achieved, and the repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-level chips, system chips, chip systems, or system-on-chip chips, etc.
An embodiment of the present application provides a computer program product, which is stored in a storage medium, and the program product is executed by at least one processor to implement the respective processes of the training method embodiment of the image-text generating model, and the same technical effects can be achieved, so that repetition is avoided, and a detailed description is omitted herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Enlightened by the present application, those of ordinary skill in the art can make many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (14)

1. A training method of an image-text generation model, characterized by comprising the following steps:
inputting a first training sample pair in a first training sample pair set into a first image-text generation model, outputting a second training sample pair, wherein the first image-text generation model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image;
generating M training sample pairs based on the first training sample pair and the second training sample pair, wherein the M training sample pairs at least comprise the first training sample pair and the second training sample pair, and M is an integer greater than 1;
replacing the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, wherein the third training sample pair is the training sample pair with highest image-text similarity in the M training sample pairs;
And training the first image-text generating model based on the second training sample pair set to obtain a target image-text generating model.
2. The method of claim 1, wherein the first image-text generation model comprises a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
inputting a first training sample pair in a first training sample pair set into the first image-text generating model, and outputting a second training sample pair, wherein the method comprises the following steps:
inputting the first training sample pair into the first image-text generating model;
inputting the first text into the text encoder model for feature extraction to obtain first text feature information of the first text;
inputting the first image into the image encoder model for feature extraction to obtain first image feature information of the first image;
inputting the first text feature information and the first image feature information into the diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information;
inputting the second image characteristic information into the image decoder model for image decoding to obtain third image characteristic information, and outputting the second image based on the third image characteristic information;
And inputting the first image characteristic information into the text decoder model for text prediction to obtain text prediction parameters, and outputting the second text based on the text prediction parameters.
3. The method of claim 2, wherein prior to generating M training sample pairs based on the first training sample pair and the second training sample pair, the method further comprises:
constructing a first loss function based on the first text feature information and the first image feature information;
constructing a second loss function based on the second image characteristic information and a Gaussian noise matrix;
constructing a third loss function based on the text prediction parameters;
constructing an overall loss function based on the first, second, and third loss functions;
after generating M training sample pairs based on the first training sample pair and the second training sample pair, the method further includes:
and respectively calculating the image-text similarity of each training sample pair of the M training sample pairs by adopting the integral loss function.
4. The method of claim 1, wherein the M training sample pairs comprise: the first training sample pair, the second training sample pair, the fourth training sample pair, and the fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text, and the fifth training sample pair comprises the second image and the first text.
5. The method according to claim 1, wherein training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model comprises:
training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model;
under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into the second image-text generating model, and outputting a seventh training sample pair;
generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1;
replacing the sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is the training sample pair with highest image-text similarity in the N training sample pairs;
and training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the model training times reach the preset threshold value, and taking the image-text generating model obtained by the last training as the target image-text generating model.
6. The method according to claim 1, wherein after training the first image-text generation model based on the second training sample pair set to obtain a target image-text generation model, the method further comprises:
inputting a fourth image into the target image-text generation model for image-text conversion, and outputting a fourth text, wherein the fourth text is used for describing the image content of the fourth image;
or,
and inputting a fifth text into the target image-text generation model for image-text conversion, and outputting a fifth image, wherein the fifth image is generated based on the fifth text.
7. A training device for an image-text generation model, characterized in that the device comprises: a processing module, a generating module, a replacing module and a training module;
the processing module is used for inputting a first training sample pair in a first training sample pair set into a first image-text generating model and outputting a second training sample pair, the first image-text generating model is obtained by training based on the first training sample pair set, the first training sample pair comprises a first image and a first text for describing the image content of the first image, the second training sample pair comprises a second image and a second text, the second image is an image obtained by converting the first text into the image, and the second text is a text obtained by converting the first image into the image;
The generating module is configured to generate M training sample pairs based on the first training sample pair and the second training sample pair, where the M training sample pairs at least include the first training sample pair and the second training sample pair, and M is an integer greater than 1;
the replacing module is configured to replace the first training sample pair in the first training sample pair set with a third training sample pair to obtain a second training sample pair set, where the third training sample pair is a training sample pair with highest image-text similarity in the M training sample pairs generated by the generating module;
the training module is used for training the first image-text generating model based on the second training sample pair set replaced by the replacing module to obtain a target image-text generating model.
8. The apparatus of claim 7, wherein the first image-text generation model comprises a text encoder model, an image encoder model, a diffusion model, an image decoder model, and a text decoder model;
the processing module is specifically configured to:
inputting a first training sample pair in the first training sample pair set to the first image-text generating model;
Inputting the first text into the text encoder model for feature extraction to obtain first text feature information of the first text;
inputting the first image into the image encoder model for feature extraction to obtain first image feature information of the first image;
inputting the first text feature information and the first image feature information into the diffusion model to perform a cross attention mechanism and convolution calculation to obtain second image feature information;
inputting the second image characteristic information into the image decoder model for image decoding to obtain third image characteristic information, and outputting the second image based on the third image characteristic information;
and inputting the first image characteristic information into the text decoder model for text prediction to obtain text prediction parameters, and outputting the second text based on the text prediction parameters.
9. The apparatus of claim 8, wherein the processing module is further configured to, prior to generating M training sample pairs based on the first training sample pair and the second training sample pair,
constructing a first loss function based on the first text feature information and the first image feature information;
The processing module is further used for constructing a second loss function based on the second image characteristic information and a Gaussian noise matrix;
the processing module is further used for constructing a third loss function based on the text prediction parameters;
the processing module is further configured to construct an overall loss function based on the first loss function, the second loss function, and the third loss function;
the processing module is further configured to calculate, according to the overall loss function, the image-text similarity of each of the M training sample pairs after generating the M training sample pairs based on the first training sample pair and the second training sample pair.
10. The apparatus of claim 7, wherein the M training sample pairs comprise: the first training sample pair, the second training sample pair, the fourth training sample pair, and the fifth training sample pair; wherein the fourth training sample pair comprises the first image and the second text, and the fifth training sample pair comprises the second image and the first text.
11. The device according to claim 7, wherein the training module is specifically configured to:
Training the first image-text generating model based on the second training sample pair set to obtain a second image-text generating model;
under the condition that the model training times do not reach a preset threshold value, inputting a sixth training sample pair in the second training sample pair set into the second image-text generating model, and outputting a seventh training sample pair;
generating N training sample pairs based on the sixth training sample pair and the seventh training sample pair, wherein the N training sample pairs at least comprise the sixth training sample pair and the seventh training sample pair, and N is an integer greater than 1;
replacing the sixth training sample pair in the second training sample pair set with an eighth training sample pair to obtain a third training sample pair set, wherein the eighth training sample pair is the training sample pair with highest image-text similarity in the N training sample pairs;
and training the second image-text generating model based on the third training sample pair set to obtain a third image-text generating model, iterating the process until the model training times reach the preset threshold value, and taking the image-text generating model obtained by the last training as the target image-text generating model.
12. The apparatus of claim 7, wherein the processing module is further configured to:
after the first image-text generating model is trained based on the second training sample pair set to obtain a target image-text generating model, inputting a fourth image into the target image-text generating model for image-text conversion, and outputting a fourth text, wherein the fourth text is used for describing the image content of the fourth image; or, inputting a fifth text into the target image-text generation model for image-text conversion, and outputting a fifth image, wherein the fifth image is generated based on the fifth text.
13. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the steps of the training method of the image-text generation model according to any one of claims 1 to 6.
14. A readable storage medium, wherein a program or instruction is stored on the readable storage medium, and the program or instruction, when executed by a processor, implements the steps of the training method of the image-text generation model according to any one of claims 1 to 6.
CN202311101515.6A 2023-08-29 2023-08-29 Training method and device for image-text generation model, electronic equipment and medium Pending CN117094365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311101515.6A CN117094365A (en) 2023-08-29 2023-08-29 Training method and device for image-text generation model, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN117094365A true CN117094365A (en) 2023-11-21

Family

ID=88780072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311101515.6A Pending CN117094365A (en) 2023-08-29 2023-08-29 Training method and device for image-text generation model, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117094365A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407518A (en) * 2023-12-15 2024-01-16 广州市省信软件有限公司 Information screening display method and system based on big data analysis
CN117407518B (en) * 2023-12-15 2024-04-02 广州市省信软件有限公司 Information screening display method and system based on big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination