CN114373028A - Method and device for generating picture and electronic equipment - Google Patents


Info

Publication number: CN114373028A
Application number: CN202111550114.XA
Authority: CN (China)
Prior art keywords: image, text, words, decoder, training
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 陈帅, 陈维强, 孙永良, 李建伟
Current Assignee: Hisense TransTech Co Ltd
Original Assignee: Hisense TransTech Co Ltd
Application filed by Hisense TransTech Co Ltd
Priority to: CN202111550114.XA
Publication of: CN114373028A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods


Abstract

The application discloses a method and an apparatus for generating a picture, and an electronic device. The method includes: acquiring input text information in response to an automatic picture generation instruction; inputting the text information into a pre-trained text-image joint generation model, extracting features of the text information with the model, and using a conversion layer to predict and output image features whose degree of association with the extracted text features is greater than a set threshold; and reconstructing an image from the output image features with a decoder, and outputting and displaying the reconstructed image on a display interface. The method addresses the problems that, when people chat using text, the emoticons already available on the network sometimes cannot fully express the speaker's point of view and cannot be generated dynamically according to the speaker's current scene.

Description

Method and device for generating picture and electronic equipment
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method and an apparatus for generating a picture, and an electronic device.
Background
With the proliferation of network chat tools, online chatting has become part of people's daily lives. While chatting with text, people increasingly use pictures to express their opinions, for example third-party emoticons in WeChat, classic emoticons circulated on the network, and the like. However, the emoticons already available on the network sometimes cannot fully express the speaker's point of view.
Disclosure of Invention
The application aims to provide a method and an apparatus for generating a picture, and an electronic device. The method addresses the following problems: when people chat using text, they increasingly use pictures to express their opinions, yet the emoticons already available on the network sometimes cannot fully express the speaker's point of view and cannot be generated dynamically according to the speaker's current scene.
In a first aspect, an embodiment of the present application provides a method for generating a picture, where the method includes:
acquiring input text information in response to an automatic picture generation instruction;
inputting the text information into a pre-trained text-image joint generation model, extracting features of the text information with the model, and using a conversion layer to predict and output image features whose degree of association with the extracted text features is greater than a set threshold;
and reconstructing an image from the output image features with a decoder, and outputting and displaying the reconstructed image on a display interface.
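For illustration only, the following minimal Python sketch shows how the three steps above could fit together at inference time. All component names (text_encoder, transformer, decoder) and the threshold value are assumptions introduced here; the application does not prescribe an API.

```python
# Hedged sketch of the inference flow described above; the model components
# are assumed attributes, not defined by this application.

def generate_picture(text, model, threshold=0.5):
    """Generate a reconstructed image for the input text information."""
    # 1. Feature extraction on the text information.
    text_features = model.text_encoder(text)

    # 2. Conversion layer (Transformer) predicts image features whose degree of
    #    association with the text features is greater than the set threshold.
    image_features, association = model.transformer.predict(text_features)
    if association <= threshold:
        return None  # no sufficiently associated image features were predicted

    # 3. Decoder reconstructs an image from the predicted image features.
    reconstructed_image = model.decoder(image_features)
    return reconstructed_image  # displayed on the display interface by the caller
```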
In some possible embodiments, pre-training the text-image joint generation model includes:
acquiring a training sample, where the training sample includes text information, an image, and an association degree label representing the degree of association between the text information and the image;
inputting the text information and the image into the text-image joint generation model, and using a feature extraction layer to extract features from the text information to obtain text features and to extract features from the image to obtain image features;
and using a conversion layer to predict and output the degree of association between the extracted text features and the image features, and to predict and output the image features whose degree of association with the extracted text features is greater than a set threshold; reconstructing and outputting an image from the output image features with a decoder; and adjusting the parameters of the text-image joint generation model with the targets that the conversion layer outputs the association degree label and that the decoder outputs the image in the training sample.
In some possible embodiments, the feature extraction layer includes an encoder, and extracting the image features with the feature extraction layer includes:
performing feature extraction on the input image with the encoder to obtain an image word matrix comprising a plurality of image words, where each image word is a vector of fixed length;
before the training sample is acquired, the method further includes:
acquiring an image sample and inputting the image sample into the encoder;
and performing feature extraction on the image sample with the encoder to obtain an image word matrix comprising a plurality of image words, inputting the image word matrix into a decoder, performing image reconstruction with the decoder according to the image word matrix, and adjusting the parameters of the encoder and the decoder with the target that the decoder outputs the image sample.
In some possible embodiments, during the training process of the encoder and the decoder, the method further includes:
when the encoder is utilized to obtain an image word matrix, determining a target image word which does not appear in a current image word matrix dictionary;
and storing the target image words into an image word matrix dictionary, wherein the number of the image words in the image word matrix dictionary does not exceed a set number.
In some possible embodiments, storing the target image words in the image word matrix dictionary includes:
using a memory module to cluster the target image words together with the image words already in the image word matrix dictionary;
when it is determined that the current stage is the first training stage, storing the target image words in the image word matrix dictionary, and if the number of image words exceeds the set number, deleting image words in order from earliest to latest storage time, or according to the number of clusters and the similarity of the image words, until the number of image words in the image word matrix dictionary reaches the set number;
and when it is determined that the current stage is the second training stage, storing the target image words in the image word matrix dictionary, and if it is judged that the number of image words exceeds the set number, deleting image words from the corresponding clusters in order from clusters containing fewer image words to clusters containing more, until the number of image words in the image word matrix dictionary reaches the set number.
In some possible embodiments, before training the text-image joint generation model, the method further includes:
resetting the parameters of the decoder obtained after the training of the encoder and the decoder is finished, and using it as the decoder during the initial training of the text-image joint generation model.
In some possible embodiments, in the training process of the text-image joint generation model, before performing image reconstruction on the output image features with the decoder, the method further includes:
for the image words in the image features output by the encoder, searching the current image word matrix dictionary for matching image words;
and outputting the found matching image words to the decoder.
In some possible embodiments, inputting the text information into a text joint generation model obtained by pre-training includes:
inputting the text information into a text joint generation model obtained by pre-training once, or repeatedly inputting the text information into the text joint generation model obtained by pre-training for multiple times;
performing image reconstruction on the output image features with a decoder, outputting a reconstructed image, and outputting and displaying the reconstructed image on a display interface includes:
when it is determined that the option for automatically generating a reconstructed image is turned on, the decoder performs image reconstruction once on the image features with the greatest degree of association output by the conversion layer and outputs one reconstructed image, which is output and displayed on the display interface; or the decoder performs image reconstruction multiple times on the image features with the greatest degree of association output by the conversion layer and outputs a plurality of reconstructed images, and the one reconstructed image selected by the user is displayed on the display interface according to the user's selection.
In a second aspect, an embodiment of the present application provides an apparatus for generating a picture, where the apparatus includes:
the information acquisition module is used for responding to an automatic picture generation instruction and acquiring input text information;
the conversion module is used for inputting the text information into the pre-trained text-image joint generation model, extracting features of the text information with the model, and using a conversion layer to predict and output image features whose degree of association with the extracted text features is greater than a set threshold;
and the display generation reconstructed image module is used for reconstructing an image from the output image features with a decoder, outputting a reconstructed image, and outputting and displaying the reconstructed image on a display interface.
In a third aspect, an embodiment of the present application provides an electronic device, including at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training a text-image joint generation model as provided in the first aspect above.
According to the embodiments of the application, in order to solve the problems that, when people chat using text and increasingly use pictures to express their opinions, the emoticons already available on the network sometimes cannot fully express the speaker's point of view and cannot be generated dynamically according to the speaker's current scene, a deep learning model structure that generates images from text information is provided, so that the generated images are more varied.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It is apparent that the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario according to one embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a method for training a text image joint generation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training process for an encoder and decoder according to one embodiment of the present application;
FIG. 4 is a schematic diagram of the overall training process of the text-image joint generation model according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for generating a picture according to an embodiment of the present application;
FIG. 6 is a display interface UI diagram according to one embodiment of the application;
FIG. 7 is a display interface UI diagram according to one embodiment of the application;
FIG. 8 is a display interface UI diagram according to one embodiment of the application;
FIG. 9 is a schematic structural diagram of a training apparatus for a text image joint generation model according to an embodiment of the present application;
FIG. 10 is a diagram illustrating an apparatus for generating a picture according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and in detail with reference to the accompanying drawings. In the description of the embodiments of the present application, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" only describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
In the description of the embodiments of the present application, the term "plurality" means two or more unless otherwise specified, and other similar terms should be understood in the same way. The preferred embodiments described herein are only for illustrating and explaining the present application and are not intended to limit it, and features in the embodiments and examples of the present application may be combined with each other without conflict.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings and specific embodiments. Although the embodiments of the present application provide the method steps shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive effort. For steps with no necessary logical causal relationship, the order of execution is not limited to that provided by the embodiments of the present application. In an actual process, or when executed by a control device, the method may be executed in the order shown in the embodiments or drawings, or in parallel.
Regarding the related art in network chat, referring to fig. 1, the application scenario of the present application includes a plurality of instant messaging terminals 101 and a server 102. While chatting with text, people increasingly use pictures to express their opinions, yet the emoticons already available on the network sometimes cannot fully express the speaker's point of view and cannot be generated dynamically according to the speaker's current scene. The present application provides a method and an apparatus for training a text-image joint generation model, and an electronic device, which can quickly generate images according to the text input by a user and meet the user's chat requirements.
In view of the above, the inventive concept of the present application is as follows: a text-image joint generation model is obtained by jointly training the model on a massive set of images and texts, thereby realizing the function of generating images from text. The text is used as the input of the model, and a picture with the same semantics as the input text is used as the output of the model; because the text matches the speaker's context, the semantics are accurate and the content is rich.
The following describes in detail a method for training a text image joint generation model in the embodiment of the present application with reference to fig. 2.
First, an application environment of an embodiment of the present application is exemplarily described:
in the process of chatting by using WeChat software, a user converts the voice input into a text or manually inputs the text in a chat box, such as inputting 'hello', the system receives the 'hello' text, automatically generates an emoticon related to the 'hello' and returns the emoticon to the user. And after obtaining the picture, the user clicks to confirm and sends the picture to the chat window.
The method is different from the method for searching the existing pictures by using the text 'hello', the pictures generated by the system are completely originally generated by the deep learning model, and the generated pictures are more diversified in change.
Fig. 2 is a flowchart illustrating a method for training a text-image joint generation model according to an embodiment of the present application, including:
the text image joint generation model can be trained in advance and placed on the terminal, and can also be trained with the server in an interactive mode.
Step 201: the method comprises the steps of obtaining a training sample, wherein the training sample comprises text information, images and association degree labels representing the association degree of the text information and the images.
The relevance degree label for representing the relevance degree of the text information and the image comprises the following steps: the first label of the text information matched with the image, and the second label of the text information not matched with the image.
Specifically, in the embodiment of the present invention, a large number of images together with text information that describes the content of each image need to be prepared as positive samples. A negative sample, in which the text does not match the image content, may also be obtained; the association degree label of a positive sample is 1, and that of a negative sample is 0. Specific data sources may be, but are not limited to, images in Wikipedia and the descriptive text below them.
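For illustration, a training sample could be organized as a (text, image, label) triple, with label 1 for a matching pair and 0 for a mismatched one. The sketch below builds negative samples by pairing each image with a randomly chosen non-matching caption; this pairing strategy is an assumption, not something the application specifies.

```python
import random

def build_training_samples(captioned_images):
    """captioned_images: list of (text, image) pairs, e.g. Wikipedia images with
    the descriptive text below them. Returns (text, image, label) triples where
    label 1 = text matches the image and label 0 = text does not match it."""
    samples = []
    for i, (text, image) in enumerate(captioned_images):
        samples.append((text, image, 1))             # positive sample
        j = random.randrange(len(captioned_images))  # pick some other caption
        if j != i:
            samples.append((captioned_images[j][0], image, 0))  # negative sample
    return samples
```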
Step 202: inputting the text information and the image into the text-image joint generation model, and using a feature extraction layer to extract features from the text information to obtain text features and to extract features from the image to obtain image features.
The text-image joint generation model is formed by combining a plurality of models, and includes a character encoder for extracting features of the text information, an encoder for extracting image features, a Transformer module for predicting the degree of association between the extracted text features and image features, and the like.
In implementation, a text content vector corresponding to the text information in the training sample can be input into an NLP model, and the NLP model is used to perform feature extraction on the text content to obtain the text features of the text information.
As an optional implementation, the feature extraction layer includes an encoder, and the extracting the image features by using the feature extraction layer includes:
and performing feature extraction on the input image by using the encoder to obtain an image word matrix comprising a plurality of image words, wherein the image words are vectors with fixed lengths.
In the embodiment of the application, the encoder is composed of a multilayer convolution structure and encodes an input image into an image word matrix, where the image word matrix comprises a plurality of image words and each image word is a floating-point vector of a certain length. The image word matrix of the present application may be, but is not limited to, composed of 32 × 32 = 1024 image words.
An image word is similar to the concept of a "word" in text; in this embodiment an image word can be, but is not limited to, a vector of length 2048. All possible image words constitute the dictionary of the image word matrix, and the embodiment may, but is not limited to, set the dictionary length of the image word matrix to 8192, i.e. it comprises at most 8192 image words.
In implementation, a matrix of pixel values describing an image in a training sample may be input to an encoder, and the encoder encodes the matrix to obtain an image word matrix.
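Using the example figures above (a 32 × 32 = 1024-word matrix, image words of length 2048, a dictionary of at most 8192 words), a convolutional encoder could be sketched as below. This is a PyTorch-style illustration only: the layer sizes, the assumed 256 × 256 RGB input resolution and the class name are assumptions, not part of the application.

```python
import torch.nn as nn

class ImageWordEncoder(nn.Module):
    """Sketch: multilayer convolutional encoder that turns an image into a
    32 x 32 matrix of image words, each a vector of length 2048 (assumed sizes)."""
    def __init__(self, word_dim=2048):
        super().__init__()
        self.conv = nn.Sequential(                          # assumed 256x256 RGB input
            nn.Conv2d(3, 128, 4, stride=2, padding=1),      # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),    # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(256, word_dim, 4, stride=2, padding=1),  # 64 -> 32
        )

    def forward(self, image):
        feats = self.conv(image)                 # (batch, 2048, 32, 32)
        # flatten the spatial grid into 32*32 = 1024 image words per image
        return feats.flatten(2).transpose(1, 2)  # (batch, 1024, 2048)
```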
Step 203: using a conversion layer to predict and output the degree of association between the extracted text features and the image features, and to predict and output the image features whose degree of association with the extracted text features is greater than a set threshold; reconstructing and outputting an image from the output image features with a decoder; and adjusting the parameters of the text-image joint generation model with the targets that the conversion layer outputs the association degree label and that the decoder outputs the image in the training sample.
The conversion layer in the application may specifically include a Transformer module, which is configured to perform feature extraction on the relationship between the extracted text features and image features to obtain the degree of association between them. The Transformer module adopts an Attention mechanism; the traditional CNN and RNN are abandoned, and the whole network structure is built entirely on the Attention mechanism. The core components of the Transformer module are a Multi-head Self-Attention part (hereinafter referred to as Self-Attention) and a Feed Forward neural network, together with residual connection and normalization (Add & Norm), of which Self-Attention is the most central part. The Transformer module in the embodiment of the invention applies Self-Attention to learn the text features and the image features and captures the relationship between them, obtaining the degree of association between the text features and the image features.
The role of the conversion layer is divided into two parts. The first part predicts and outputs the degree of association between the extracted text features and the image features, i.e. the first label (the text information matches the image) is 1, and otherwise the second label is 0. The second part predicts the image features whose degree of association with the extracted text features is greater than a set threshold; as an alternative implementation, it predicts the image features with the greatest degree of association with the extracted text features.
The decoder is composed of a multilayer convolution structure and decodes the image word matrix to generate a reconstructed image.
In the training process, the parameters of the text-image joint generation model are adjusted with the targets that the conversion layer outputs the association degree label in the sample and that the similarity between the reconstructed image output by the decoder and the image in the training sample is greater than a set threshold. A corresponding loss function is obtained from the difference between the association degree output by the conversion layer and the corresponding association degree label, and from the similarity between the reconstructed image and the image in the sample; when the loss function value reaches the set threshold, it is determined that the model training completion condition is met and the adjustment of the model parameters is finished.
After the adjustment of the model parameters is finished, the text-image joint generation model can be applied to generating images from text: text information is used as the input of the model, and an image consistent with the text information is used as the output. Consistency means that the latent semantics of the image agree with the input text information and with the speaker's context, so that the semantics are accurate and the content is rich. In this way, a deep learning model structure that generates images from text information is realized, and the generated images are more varied.
In the training process of the text-image joint generation model, the adjusted model parameters are specifically the parameters of the conversion layer and the decoder; the parameters of the feature extraction layer are not adjusted. The feature extraction layer can be trained in advance on other samples, and the NLP model training process is the same as in the prior art and is not detailed here.
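A minimal sketch of the parameter adjustment described above might combine a matching loss against the association degree label with a reconstruction loss against the sample image, while updating only the conversion layer and decoder. The particular loss choices (binary cross-entropy and mean-squared error), the component names and the PyTorch framing are assumptions for illustration, not prescribed by the application.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, optimizer, text, image, label):
    """One assumed training step for the text-image joint generation model.
    The optimizer is assumed to hold only the transformer (conversion layer)
    and decoder parameters; the feature extraction layer stays frozen."""
    with torch.no_grad():                        # feature extraction layer not adjusted
        text_feat = model.text_encoder(text)
        image_words = model.image_encoder(image)

    association, predicted_words = model.transformer(text_feat, image_words)
    reconstruction = model.decoder(predicted_words)

    match_loss = F.binary_cross_entropy(association, label)  # vs. association degree label
    recon_loss = F.mse_loss(reconstruction, image)           # vs. image in the sample
    loss = match_loss + recon_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```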
As an optional implementation manner, before obtaining the training sample, the method further includes:
acquiring an image sample, and inputting the image sample into an encoder;
and performing feature extraction on the image sample by using the encoder to obtain an image word matrix comprising a plurality of image words and outputting the image word matrix to the decoder, performing image reconstruction by using the decoder according to the image word matrix, and performing parameter adjustment of the encoder and the decoder by taking the image sample output by the decoder as a target.
Specifically, referring to fig. 3, image samples are input to the encoder and corresponding reconstructed images are output by the decoder. Assume that the content of an input image A is "a puppy sitting on a sofa". After image A is input into the encoder, the encoder performs feature extraction on image A, for example extracting image features of the puppy, the sofa, and the puppy on the sofa, and converts the extracted features into an image word matrix comprising a plurality of image words. The image word matrix is input into the decoder, and the decoder decodes it to perform image reconstruction and output a reconstructed image. In the training process, the parameters of the encoder and the decoder are adjusted with the target that the image output by the decoder matches the image sample; when the similarity between the image output by the decoder and the image sample is greater than the set threshold, the training of the encoder and the decoder is finished. After that, when an image is input into the encoder, the decoder can output an image with extremely high similarity to the input image.
The encoder obtained after this training is used as the encoder of the text-image joint generation model, while the decoder obtained after this training has its parameters reset and is used as the decoder during the initial training of the text-image joint generation model.
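Under the same assumptions as the earlier sketches, the encoder/decoder pre-training of fig. 3 can be viewed as a plain image-reconstruction loop; the optimizer, learning rate and mean-squared-error similarity measure below are illustrative choices, not requirements of the application.

```python
import torch
import torch.nn.functional as F

def pretrain_encoder_decoder(encoder, decoder, images, epochs=10, lr=1e-4):
    """Assumed sketch of the fig. 3 stage: the encoder maps each image sample to
    an image word matrix, the decoder reconstructs the image from it, and both
    are adjusted so the decoder output approaches the input image sample."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for image in images:
            image_words = encoder(image)              # image word matrix
            reconstruction = decoder(image_words)     # reconstructed image
            loss = F.mse_loss(reconstruction, image)  # target: reproduce the sample
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Afterwards the trained encoder is reused in the joint model, while the
    # decoder's parameters are reset before joint training (as described above).
    return encoder, decoder
```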
As an optional implementation manner, in the training process of the encoder and the decoder, the method further includes:
when the encoder is utilized to obtain an image word matrix, determining a target image word which does not appear in a current image word matrix dictionary;
and storing the target image words into an image word matrix dictionary, wherein the number of the image words in the image word matrix dictionary does not exceed a set number.
The image word matrix dictionary is generated in the training process of the encoder and the decoder. Its size grows from small to large as samples accumulate, and when the maximum dictionary size is reached, image words are removed with a corresponding update mechanism.
In the process of training with each sample, when an image word matrix is obtained with the encoder, it is judged whether each image word of the current image word matrix already appears in the image word matrix dictionary, i.e. whether it is a new image word; if not, the new image word is stored in the image word matrix dictionary. For example, suppose an image word representing an apple is already stored in the image word matrix dictionary. When a pear image appears, the encoder obtains an image word representing the pear from that image; traversing the image word matrix dictionary shows that this image word does not appear in it, so the image word representing the pear is the target image word and is stored in the image word matrix dictionary.
The image word matrix dictionary generated during the training of the encoder and the decoder is also used in the training of the text-image joint generation model. In the training process of the text-image joint generation model, before using the decoder to reconstruct an image from the output image features, the method further includes:
for the image words in the image features output by the encoder, searching the current image word matrix dictionary for matching image words;
and outputting the found matching image words to the decoder.
Specifically, in the training process of the text-image joint generation model, the character coding and the image coding are jointly input into the Transformer module. After the Transformer module predicts the image words of the image word matrix whose degree of association with the extracted text features is greater than the set threshold, matching image words are searched for in the current image word matrix dictionary and input into the decoder. For example, if the image represented by the image words obtained from the image coding is an apple, the image words representing the apple are matched against the image words in the image word matrix dictionary, the dictionary entries matching the encoder's apple image words are found, and finally the image words found in the image word matrix dictionary are output to the decoder, so that an image representing an apple is obtained.
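A minimal sketch of this dictionary lookup is given below. It assumes the dictionary is stored as a tensor of image-word vectors and that "matching" means the nearest neighbour by Euclidean distance, which is an assumed choice the application does not specify.

```python
import torch

def match_image_words(image_words, dictionary):
    """image_words: (num_words, word_dim) image words output by the encoder.
    dictionary: (dict_size, word_dim) image word matrix dictionary.
    Returns the matching dictionary entries that are passed to the decoder."""
    # pairwise Euclidean distances between image words and dictionary words
    distances = torch.cdist(image_words, dictionary)  # (num_words, dict_size)
    nearest = distances.argmin(dim=1)                 # index of the closest match
    return dictionary[nearest]                        # matched words, fed to the decoder
```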
As described above, the image word matrix dictionary is generated and updated during the training of the encoder and the decoder as samples accumulate. In the embodiment of the present application, the generation of the dictionary can be divided into two stages. The first training stage is the early stage, in which the image word matrix dictionary grows from few image words to the maximum number and has only recently reached it; at this stage, image words of as many content types as possible need to be stored in the dictionary. The second training stage is the late stage, in which the image word matrix dictionary has already accumulated a sufficient number of image words. If the dictionary is not yet full, new image words are stored directly; if the maximum number has been reached and an update is needed, different update mechanisms are used at the different stages. Storing the target image words in the image word matrix dictionary includes:
using a memory module to cluster the target image words together with the image words already in the image word matrix dictionary;
when it is determined that the current stage is the first training stage, storing the target image words in the image word matrix dictionary, and if the number of image words exceeds the set number, deleting image words in order from earliest to latest storage time, or according to the number of clusters and the similarity of the image words, until the number of image words in the image word matrix dictionary reaches the set number;
and when it is determined that the current stage is the second training stage, storing the target image words in the image word matrix dictionary, and if it is judged that the number of image words exceeds the set number, deleting image words from the corresponding clusters in order from clusters containing fewer image words to clusters containing more, until the number of image words in the image word matrix dictionary reaches the set number.
The memory module stores the image word matrix dictionary and provides a clustering operation to aggregate similar image words. As an example, the set number of the image word matrix dictionary, i.e. its length, does not exceed 8192, and the clustering operation can be divided into two training stages according to how many image words have been stored:
1. first training stage (number of target image words within 9000)
The principle for storing target image words in the first training stage is to ensure that image words corresponding to as many different types of image content as possible are stored in the image word matrix dictionary.
Specifically, when a target image word is stored in the image word matrix dictionary and the number of image words exceeds the set number, the redundant image words are handled in one of two ways:
Case 1: image words are deleted in order of storage time, from earliest to latest.
If the target image word belongs to a new cluster, it is stored as a new cluster, and then image words are deleted in order of the time at which they were stored in the image word matrix dictionary, from earliest to latest, until the number of image words in the image word matrix dictionary reaches the set number.
Case 2: image words are deleted according to the number of clusters and the image word similarity.
Specifically, when a cluster contains a particularly large number of image words, the image words of that cluster are abundant enough; several of the more similar image words in the cluster can be deleted according to their similarity, so as to reduce the redundancy of the image words.
2. Second training phase (number of target image words beyond 9000)
The principle of storing the target image words in the second training stage is as follows: some of the rarely used image words are deleted.
Specifically, when the target image word is stored in the image word matrix dictionary and it is determined that the number of image words exceeds the set number, the clusters are sorted from those containing the fewest image words to those containing the most, and the image words in the clusters with few image words are deleted first, until the number of image words in the image word matrix dictionary reaches the set number.
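The two update mechanisms above could be sketched as follows. The 8192 capacity and the 9000-word stage boundary follow the example figures in the text, while the data structure (a list of entries carrying a cluster id and an insertion time) and the step-based stage criterion are assumptions for illustration; the similarity-based deletion variant of the first stage is omitted for brevity.

```python
def update_dictionary(dictionary, target_word, cluster_id, step,
                      capacity=8192, stage_boundary=9000):
    """dictionary: list of {"word", "cluster", "time"} entries kept by the memory module.
    step: running count of target image words seen so far (assumed stage criterion)."""
    dictionary.append({"word": target_word, "cluster": cluster_id, "time": step})
    if len(dictionary) <= capacity:
        return dictionary

    if step <= stage_boundary:
        # First training stage: keep words of as many content types as possible,
        # here by deleting the earliest-stored words first (time-based variant).
        dictionary.sort(key=lambda e: e["time"])
        del dictionary[0:len(dictionary) - capacity]
    else:
        # Second training stage: delete rarely used words, i.e. words belonging
        # to the clusters that contain the fewest image words.
        counts = {}
        for e in dictionary:
            counts[e["cluster"]] = counts.get(e["cluster"], 0) + 1
        dictionary.sort(key=lambda e: counts[e["cluster"]], reverse=True)
        del dictionary[capacity:]
    return dictionary
```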
As an optional implementation manner, before training of the text image joint generation model, the method further includes:
resetting the parameters of the decoder obtained after the training of the encoder and the decoder is finished, using it as the decoder during the initial training of the text-image joint generation model, and adjusting its parameters again during the training of the text-image joint generation model.
When the text-image joint generation model is trained, the decoder used in the training of the encoder and the decoder is needed to output a predicted image when prediction is performed after the Transformer module. Therefore, the parameters of this decoder need to be reset before the training of the text-image joint generation model; the pre-trained encoder is used as the encoder in the text-image joint generation model, and the image word matrix output when prediction is performed after the Transformer module is decoded to obtain the predicted image.
The following describes the training process of the text-image joint generation model as a whole.
Referring to fig. 4, suppose the text information is "a puppy sits on a sofa" and the image is an image of a puppy sitting on a sofa; the text information and the image are used as the input of the text-image joint generation model. The text information is turned into a text code through character coding, using BERT pre-trained word vectors, where CLS marks the start position, SEP marks a segmentation position, and PAD stands for a filler (padding) word. The text coding and the image coding are input into the Transformer module at the same time. During training, the Transformer module feeds its output into a softmax function; softmax is used in multi-class classification and maps the outputs of several neurons into the (0,1) interval, which can be understood as probabilities. When the softmax output is 0, the character string does not match the image, i.e. the degree of association is low and the semantics of the text information have low similarity to the image; when the softmax output is 1, the character string matches the image, i.e. the degree of association is high and the semantics of the text information have high similarity to the image. Model parameters are adjusted with the targets that the association degree output by softmax equals the association degree label and that the reconstructed image output by the decoder equals the input image. When the semantics of the text information and the image are highly similar, the Transformer module inputs the image word matrix obtained from the image coding into the decoder whose parameters were reset, and the predicted image is output by the decoder. If the final association degree output value is 1, and both the predicted image and the image originally input into the text-image joint generation model accurately represent a puppy sitting on a sofa, the training of the text-image joint generation model is shown to be finished.
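Only as a schematic of fig. 4, the joint forward pass could look like the PyTorch-style sketch below: the BERT-style text encoding (with CLS/SEP/PAD positions) and the image word sequence are concatenated, fed to a Transformer, and a softmax head outputs the match probability over two classes (0 = mismatch, 1 = match). The module names, dimensions, and the assumption that both inputs have been projected to a common embedding size are illustrative, not specified by the application.

```python
import torch
import torch.nn as nn

class JointMatchHead(nn.Module):
    """Sketch of the fig. 4 arrangement: a Transformer consumes the concatenated
    text encoding (BERT-style, with CLS/SEP/PAD) and image word sequence, and a
    softmax over two classes says whether the text matches the image."""
    def __init__(self, dim=768, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(dim, 2)  # class 0 = no match, class 1 = match

    def forward(self, text_tokens, image_words):
        # text_tokens: (batch, text_len, dim) BERT-style embeddings incl. CLS/SEP/PAD
        # image_words: (batch, num_image_words, dim) image coding, assumed projected to dim
        joint = torch.cat([text_tokens, image_words], dim=1)
        hidden = self.transformer(joint)
        cls_state = hidden[:, 0]             # CLS position summarizes the text-image pair
        return torch.softmax(self.classifier(cls_state), dim=-1)
```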
Example 2
Based on the same inventive concept, and based on the text-image joint generation model trained with the method of Embodiment 1, the present application further provides a method for generating a picture. Referring to fig. 5, the flow of the method for generating a picture provided in an embodiment of the present application includes:
Step 501: acquiring input text information in response to an automatic picture generation instruction;
Step 502: inputting the text information into the text-image joint generation model, extracting features of the text information with the model, and using a conversion layer to predict and output image features whose degree of association with the extracted text features is greater than a set threshold;
Step 503: reconstructing an image from the output image features with a decoder, outputting a reconstructed image, and outputting and displaying the reconstructed image on a display interface.
Specifically, after the training of the text-image joint generation model is completed and the model is put into use, the text information input by the user is acquired, and the model starts predicting directly after performing feature extraction on the text information: the image features whose degree of association is greater than the set threshold, together with the image word matrix representing them, are obtained; the current image word matrix dictionary is searched for image words matching the image words in the image features output by the encoder; the found matching image words are output to the decoder; the decoder outputs a reconstructed image; and the reconstructed image is displayed on the display interface.
As an optional implementation, inputting the text information into a text joint generation model obtained by pre-training includes:
inputting the text information into a text joint generation model obtained by pre-training once, or repeatedly inputting the text information into the text joint generation model obtained by pre-training for multiple times;
performing image reconstruction on the output image features with a decoder, outputting a reconstructed image, and outputting and displaying the reconstructed image on a display interface includes:
when it is determined that the option for automatically generating a reconstructed image is turned on, the decoder performs image reconstruction once on the image features with the greatest degree of association output by the conversion layer and outputs one reconstructed image, which is output and displayed on the display interface; or the decoder performs image reconstruction multiple times on the image features with the greatest degree of association output by the conversion layer and outputs a plurality of reconstructed images, and the one reconstructed image selected by the user is displayed on the display interface according to the user's selection.
Specifically, in practical application, the text information can be input into the pre-trained text-image joint generation model once or multiple times according to the user's selection. When it is input once, the reconstructed image with the greatest degree of association is output and displayed on the display interface; when it is input multiple times, one reconstructed image with the greatest degree of association is obtained for each input, i.e. multiple reconstructed images are obtained after multiple inputs, and the reconstructed image to display is finally chosen according to the user's selection.
Refer to the display interface UI diagrams of figs. 6 to 8. As shown in fig. 6, when a user types or inputs text by voice on the display interface, the user may toggle an option to automatically generate a text-related image. When this option is turned on, as the user's text information is entered, the display interface inputs the text information into the pre-trained text-image joint generation model, features are extracted from the text information with the model, the conversion layer predicts and outputs image features whose degree of association with the extracted text features is greater than the set threshold, the decoder reconstructs an image from the output image features, and the reconstructed image is automatically output and displayed at the bottom of the display interface; if the user wants to send an image, the user may choose to send the displayed image.
As another optional implementation, as shown in fig. 7, after it is determined that the editing of the text information is finished, a pop-up window offering multiple sending modes appears on the display interface according to a trigger button selected by the user, and the pop-up contains an icon for converting the text to a picture. When the user selects the sending mode that uses a converted picture and clicks the icon, the system automatically inputs the text information into the pre-trained text-image joint generation model, extracts features from the text information with the model, predicts and outputs image features whose degree of association with the extracted text features is greater than the set threshold with the conversion layer, reconstructs an image from the output image features with the decoder, outputs and displays the reconstructed image on the display interface, and shows the generated reconstructed image in the instant messaging dialog box.
As another alternative, as shown in fig. 8, the text information may be input into the pre-trained text-image joint generation model once, or repeatedly input multiple times, with single or multiple conversion selected according to the user's instruction. When the option for automatically generating a reconstructed image is turned on and the user selects the single-conversion option, the decoder performs image reconstruction once on the image features with the greatest degree of association output by the conversion layer, and one reconstructed image is output and displayed on the display interface; when the user selects the multiple-conversion option, the decoder performs image reconstruction multiple times on the image features with the greatest degree of association output by the conversion layer and outputs a plurality of reconstructed images, and the one reconstructed image selected by the user is displayed on the display interface according to the user's selection.
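The single-conversion versus multiple-conversion behaviour described for figs. 6 to 8 could be wrapped as in the sketch below, where generate_picture is the hypothetical inference helper sketched earlier and the number of rounds is an arbitrary example; none of these names come from the application.

```python
def generate_for_user(text, model, rounds=1):
    """rounds=1: return the single reconstructed image with the greatest association.
    rounds>1: run the model several times and return all candidates, so that the
    user interface can let the user pick one to display and send."""
    candidates = []
    for _ in range(rounds):
        image = generate_picture(text, model)  # hypothetical helper sketched above
        if image is not None:
            candidates.append(image)
    if rounds == 1:
        return candidates[0] if candidates else None
    return candidates  # the UI shows these for the user's selection
```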
Example 3
Based on the same inventive concept, the present application further provides an apparatus for training a text image joint generation model, as shown in fig. 9, the apparatus includes:
a training sample obtaining module 901, configured to obtain a training sample, where the training sample includes text information, an image, and an association degree label representing association degrees of the text information and the image;
a feature extraction module 902, configured to input the text information and the image into the text-image joint generation model, and to use a feature extraction layer to extract features from the text information to obtain text features and to extract features from the image to obtain image features;
and a model parameter adjustment module 903, configured to use the conversion layer to predict and output the degree of association between the extracted text features and the image features and to predict and output the image features whose degree of association with the extracted text features is greater than a set threshold, to reconstruct and output an image from the output image features with a decoder, and to adjust the parameters of the text-image joint generation model with the targets that the conversion layer outputs the association degree label and that the decoder outputs the image in the training sample.
Optionally, the feature extraction layer includes an encoder, and the feature extraction module 902 is specifically configured to:
and performing feature extraction on the input image by using the encoder to obtain an image word matrix comprising a plurality of image words, wherein the image words are vectors with fixed lengths.
The acquire training sample module is further configured to, prior to acquiring the training sample:
acquiring an image sample, and inputting the image sample into an encoder;
and performing feature extraction on the image sample with the encoder to obtain an image word matrix comprising a plurality of image words, inputting the image word matrix into a decoder, performing image reconstruction with the decoder according to the image word matrix, and adjusting the parameters of the encoder and the decoder with the target that the decoder outputs the image sample.
Optionally, the feature extraction module 902 is further configured to, during the training of the encoder and the decoder:
when the encoder is utilized to obtain an image word matrix, determining a target image word which does not appear in a current image word matrix dictionary;
and storing the target image words into an image word matrix dictionary, wherein the number of the image words in the image word matrix dictionary does not exceed a set number.
Optionally, the feature extraction module 902 is specifically configured to:
using a memory module to cluster the target image words together with the image words already in the image word matrix dictionary;
when it is determined that the current stage is the first training stage, storing the target image words in the image word matrix dictionary, and if the number of image words exceeds the set number, deleting image words in order from earliest to latest storage time, or according to the number of clusters and the similarity of the image words, until the number of image words in the image word matrix dictionary reaches the set number;
and when it is determined that the current stage is the second training stage, storing the target image words in the image word matrix dictionary, and if it is judged that the number of image words exceeds the set number, deleting image words from the corresponding clusters in order from clusters containing fewer image words to clusters containing more, until the number of image words in the image word matrix dictionary reaches the set number.
Optionally, the model parameter adjustment module 903 is further configured to, before training of the text image joint generation model:
and resetting the parameters of the decoder obtained after the training of the encoder and the decoder is finished, and using it as the decoder during the initial training of the text-image joint generation model.
Optionally, the model parameter adjusting module 903 is further configured to, during training of the text image joint generation model, before performing image reconstruction on the output image features by using a decoder, further:
for the image words in the image features output by the encoder, search the current image word matrix dictionary for matching image words;
and output the found matching image words to the decoder.
Example 4
Based on the same inventive concept, the present application further provides an apparatus for generating a picture, as shown in fig. 10, the apparatus comprising:
an information acquisition module 1001 configured to acquire input text information in response to an automatic picture generation instruction;
the conversion module 1002 is configured to input the text information into a text-image joint generation model obtained by training with any one of the methods of the first aspect, perform feature extraction on the text information with the model, and use a conversion layer to predict and output image features whose degree of association with the extracted text features is greater than a set threshold;
and a display generation reconstructed image module 1003, configured to reconstruct an image from the output image features with a decoder, output a reconstructed image, and output and display the reconstructed image on a display interface.
Optionally, the conversion module 1002 is specifically configured to:
and inputting the text information into a text joint generation model obtained by pre-training once, or repeatedly inputting the text information into the text joint generation model obtained by pre-training for multiple times.
Optionally, the display generation reconstructed image module 1003 is specifically configured to:
when it is determined that the option for automatically generating a reconstructed image is turned on, the decoder performs image reconstruction once on the image features with the greatest degree of association output by the conversion layer and outputs one reconstructed image, which is output and displayed on the display interface; or the decoder performs image reconstruction multiple times on the image features with the greatest degree of association output by the conversion layer and outputs a plurality of reconstructed images, and the one reconstructed image selected by the user is displayed on the display interface according to the user's selection.
Having described the text image joint generation model training and picture generation method and apparatus according to an exemplary embodiment of the present application, an electronic device according to another exemplary embodiment of the present application is described next.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method, or program product. Accordingly, various aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
In some possible implementations, an electronic device according to the present application may include at least one processor, and at least one memory. The memory stores program code, and when the program code is executed by the processor, the processor executes the steps of the training method for training the text image joint generation model according to the various exemplary embodiments of the present application described above in the present specification, or executes the steps of the method for generating pictures according to the various exemplary embodiments of the present application described above in the present specification.
The electronic device 130 according to this embodiment of the present application is described below with reference to fig. 11. The electronic device 130 shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 11, the electronic device 130 is represented in the form of a general electronic device. The components of the electronic device 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132, and a bus 133 that connects the various system components (including the memory 132 and the processor 131).
Bus 133 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The memory 132 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 130, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur via input/output (I/O) interfaces 135. Also, the electronic device 130 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 136. As shown, network adapter 136 communicates with other modules for electronic device 130 over bus 133. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 130, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In some possible embodiments, aspects of a method for generating a picture provided herein may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of a text image joint generation model training and picture generation method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and may be run on an electronic device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic devices may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and block diagrams, and combinations of flows and blocks in the flow diagrams and block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of generating a picture, the method comprising:
in response to an automatic picture generation instruction, acquiring input text information;
inputting the text information into a pre-trained text joint generation model, performing feature extraction on the text information by using the text joint generation model, and using a conversion layer to predict and output image features whose relevance degree to the extracted text features is greater than a set threshold; and
performing image reconstruction on the output image features by using a decoder, outputting a reconstructed image, and outputting and displaying the reconstructed image on a display interface.
2. The method of claim 1, wherein pre-training to obtain the text joint generation model comprises:
acquiring a training sample, wherein the training sample comprises text information, an image and an association degree label representing the association degree of the text information and the image;
inputting the text information and the image into a text image joint generation model, extracting features of the text information by using a feature extraction layer to obtain text features, and extracting features of the image to obtain image features; and
using a conversion layer to predict and output the association degree between the extracted text features and the image features and to predict and output image features whose association degree with the extracted text features is greater than a set threshold, performing image reconstruction on the output image features by using a decoder and outputting the result, and adjusting parameters of the text image joint generation model by taking the association degree label as the target of the conversion layer output and the image in the training sample as the target of the decoder output.
3. The method of claim 2, wherein the feature extraction layer comprises an encoder, and wherein extracting the image features using the feature extraction layer comprises:
performing feature extraction on an input image by using the encoder to obtain an image word matrix comprising a plurality of image words, wherein the image words are vectors with fixed lengths;
before the training sample is obtained, the method further comprises the following steps:
acquiring an image sample, and inputting the image sample into an encoder;
and performing feature extraction on the image sample by using the encoder to obtain an image word matrix comprising a plurality of image words, inputting the image word matrix into a decoder, performing image reconstruction by using the decoder according to the image word matrix, and adjusting parameters of the encoder and the decoder by taking the image sample as the target of the decoder output.
4. The method of claim 3, further comprising, during the training of the encoder and the decoder:
when the encoder is utilized to obtain an image word matrix, determining a target image word which does not appear in a current image word matrix dictionary;
and storing the target image words into an image word matrix dictionary, wherein the number of the image words in the image word matrix dictionary does not exceed a set number.
5. The method of claim 4, wherein storing the target image word into an image word matrix dictionary comprises:
clustering the target image words together with the image words in the image word matrix dictionary by using a memory module;
when it is determined that training is currently in a first training stage, storing the target image words into the image word matrix dictionary, and when the number of image words exceeds the set number, deleting image words in order from earliest to latest storage time, or according to the cluster number and the similarity of the image words, until the number of image words remaining in the image word matrix dictionary reaches the set number; and
when it is determined that training is currently in a second training stage, storing the target image words into the image word matrix dictionary, and when it is judged that the number of image words exceeds the set number, deleting image words from the clusters in order from the cluster containing the fewest image words to the cluster containing the most, until the number of image words remaining in the image word matrix dictionary reaches the set number.
6. The method of claim 3, further comprising, prior to training the text image joint generation model:
resetting the parameters of the decoder obtained after the training of the encoder and the decoder is completed, and using them as the parameters of the encoder during initial training of the text image joint generation model.
7. The method of claim 3, wherein before performing image reconstruction on the output image features by using a decoder in the training process of the text image joint generation model, the method further comprises:
for the image words in the image features output by the encoder, searching a current image word matrix dictionary for image words matching those image words; and
outputting the found matched image words to the decoder.
8. The method of claim 1, wherein inputting the text information into a pre-trained text joint generation model comprises:
inputting the text information into the pre-trained text joint generation model once, or repeatedly inputting the text information into the pre-trained text joint generation model multiple times; and
wherein performing image reconstruction on the output image features by using a decoder, outputting a reconstructed image, and outputting and displaying the reconstructed image on a display interface comprises:
when it is determined that the option for automatically generating a reconstructed image is selected, either having the decoder perform image reconstruction once on the image features with the maximum relevance degree output by the conversion layer and output one reconstructed image, which is output and displayed on the display interface; or having the decoder perform image reconstruction multiple times on the image features with the maximum relevance degree output by the conversion layer and output a plurality of reconstructed images, of which the one selected by the user is displayed on the display interface.
9. An apparatus for generating a picture, the apparatus comprising:
an information acquisition module, configured to acquire input text information in response to an automatic picture generation instruction;
a conversion module, configured to input the text information into a pre-trained text joint generation model, perform feature extraction on the text information by using the text joint generation model, and use a conversion layer to predict and output image features whose relevance degree to the extracted text features is greater than a set threshold; and
a reconstructed-image generation and display module, configured to perform image reconstruction on the output image features by using a decoder, output a reconstructed image, and output and display the reconstructed image on a display interface.
10. An electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
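Purely as an illustration of the image word matrix dictionary maintenance recited in claims 4 and 5 above, and not as part of the claims, one possible Python sketch follows; the data layout, the clustering callback, and the choice of evicting only by storage time in the first stage are assumptions made where the claims leave the details open.

```python
from collections import Counter

def update_dictionary(dictionary, target_word, stage, cluster_of, max_size):
    """Insert a target image word and, if the set number is exceeded, evict
    image words until the dictionary size returns to the set number.

    dictionary:  list of (image_word, cluster_id) pairs in storage order (oldest first).
    target_word: an image word not yet present in the dictionary.
    stage:       1 for the first training stage, 2 for the second training stage.
    cluster_of:  callable assigning a cluster id to an image word
                 (the clustering performed by the memory module).
    max_size:    the set number of image words the dictionary may hold.
    """
    dictionary.append((target_word, cluster_of(target_word)))
    while len(dictionary) > max_size:
        if stage == 1:
            # First stage: delete image words from earliest to latest storage time
            # (the claim also allows deletion by cluster number and similarity,
            # which this sketch does not implement).
            dictionary.pop(0)
        else:
            # Second stage: delete from the cluster that currently holds the
            # fewest image words.
            counts = Counter(cluster for _, cluster in dictionary)
            smallest = min(counts, key=counts.get)
            index = next(i for i, (_, c) in enumerate(dictionary) if c == smallest)
            dictionary.pop(index)
    return dictionary
```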
CN202111550114.XA 2021-12-17 2021-12-17 Method and device for generating picture and electronic equipment Pending CN114373028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550114.XA CN114373028A (en) 2021-12-17 2021-12-17 Method and device for generating picture and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111550114.XA CN114373028A (en) 2021-12-17 2021-12-17 Method and device for generating picture and electronic equipment

Publications (1)

Publication Number Publication Date
CN114373028A true CN114373028A (en) 2022-04-19

Family

ID=81140186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550114.XA Pending CN114373028A (en) 2021-12-17 2021-12-17 Method and device for generating picture and electronic equipment

Country Status (1)

Country Link
CN (1) CN114373028A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880441A (en) * 2022-07-06 2022-08-09 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
CN114880441B (en) * 2022-07-06 2023-02-10 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
CN117173497A (en) * 2023-11-02 2023-12-05 腾讯科技(深圳)有限公司 Image generation method and device, electronic equipment and storage medium
CN117173497B (en) * 2023-11-02 2024-02-27 腾讯科技(深圳)有限公司 Image generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11394667B2 (en) Chatbot skills systems and methods
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
US20190103111A1 (en) Natural Language Processing Systems and Methods
US11087094B2 (en) System and method for generation of conversation graphs
CN108419094B (en) Video processing method, video retrieval method, device, medium and server
EP3835973A1 (en) Keyword extraction method, keyword extraction device and computer-readable storage medium
CN109874029A (en) Video presentation generation method, device, equipment and storage medium
CN111723295B (en) Content distribution method, device and storage medium
CN112348111B (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
CN114373028A (en) Method and device for generating picture and electronic equipment
CN109582825B (en) Method and apparatus for generating information
JP2022088304A (en) Method for processing video, device, electronic device, medium, and computer program
US20240078385A1 (en) Method and apparatus for generating text
US11876986B2 (en) Hierarchical video encoders
CN116720004A (en) Recommendation reason generation method, device, equipment and storage medium
CN114880441A (en) Visual content generation method, device, system, equipment and medium
CN111859953A (en) Training data mining method and device, electronic equipment and storage medium
US20230214423A1 (en) Video generation
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN115378890B (en) Information input method, device, storage medium and computer equipment
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN116978028A (en) Video processing method, device, electronic equipment and storage medium
US10910014B2 (en) Method and apparatus for generating video
CN114363664A (en) Method and device for generating video collection title
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination