WO2020227971A1 - Image generation - Google Patents

Image generation

Info

Publication number
WO2020227971A1
WO2020227971A1 (PCT/CN2019/087041)
Authority
WO
WIPO (PCT)
Prior art keywords
image
style
text
foreground
generating
Prior art date
Application number
PCT/CN2019/087041
Other languages
English (en)
French (fr)
Inventor
Yang Xiang
Bo Wang
Yu Shi
Xianchao WU
Xiaocheng ZHANG
Yuanchun XU
Lingling Zhang
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Priority to PCT/CN2019/087041 priority Critical patent/WO2020227971A1/en
Priority to CN201980044979.3A priority patent/CN112400186B/zh
Publication of WO2020227971A1 publication Critical patent/WO2020227971A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation

Definitions

  • Automatic image generation aims to simulate human beings’ art creation of images.
  • the automatic image generation may be implemented through techniques of machine learning, deep learning, etc.
  • an image generation model may be trained with a large number of text-image pairs.
  • the trained image generation model may generate an image associated with the text input.
  • Embodiments of the present disclosure propose a method and apparatus for image generation. At least one background element and at least one foreground element may be identified from a text. At least one background image corresponding to the at least one background element may be generated. At least one foreground image corresponding to the at least one foreground element may be generated. A merged image may be generated based on the at least one background image and the at least one foreground image. A style of a target image may be determined from the text. The merged image may be transferred into the target image in the determined style.
  • FIG. 1 illustrates an exemplary image generation flow according to an embodiment.
  • FIG. 2 illustrates an exemplary image generation flow according to an embodiment.
  • FIG. 3 illustrates an exemplary process of generating images according to an embodiment.
  • FIG. 4 illustrates an exemplary process of generating an initial image according to an embodiment.
  • FIG. 5 illustrates exemplary attention mechanism in an attention Generative Adversarial Network (GAN) model according to an embodiment.
  • FIG. 6 illustrates an exemplary process of generating a foreground image according to an embodiment.
  • FIG. 7 illustrates an exemplary training process of a Progressive Growing GAN (PG GAN) model according to an embodiment.
  • FIG. 8 illustrates an exemplary training process of a discriminator in a PG GAN model according to an embodiment.
  • FIG. 9 illustrates an exemplary training process of a generator in a PG GAN model according to an embodiment.
  • FIG. 10 illustrates an exemplary process of transferring style according to an embodiment.
  • FIG. 11 illustrates an exemplary training process of a cycle GAN model according to an embodiment.
  • FIG. 12 illustrates exemplary user interfaces for image generation according to an embodiment.
  • FIG. 13 illustrates exemplary user interfaces for image generation according to an embodiment.
  • FIG. 14 illustrates a flowchart of an exemplary method for image generation according to an embodiment.
  • FIG. 15 illustrates an exemplary apparatus for image generation according to an embodiment.
  • FIG. 16 illustrates an exemplary apparatus for image generation according to an embodiment.
  • the existing image generation approaches identify image elements from a text input, and generate an image including the identified image elements. That is, the existing processes of image generation merely consider “what to draw” and only aim to incorporate the image elements indicated in the text input into the generated image.
  • Embodiments of the present disclosure propose to generate an image which not only includes image elements indicated in a text, but also is drawn in an image style conforming to the text. That is, image generation according to the embodiments of the present disclosure considers both “what to draw” and “how to draw” .
  • “Image” refers to a visual representation of content in a creative manner, which is also known as, e.g., a drawing, painting, picture, etc.
  • Image elements refer to various visible objects that exist in the human life, in the nature or in the abstract world and can be visually expressed in the image, e.g., humans, animals, river, sea, boat, geometry, etc.
  • “Style” of an image refers to characteristics specific to the image in terms of art forms of expression including color, composition, etc., which is also known as, e.g., picture style, painting style, painting genre, etc. Image styles may be classified according to various criteria, e.g., time period, country or region, artists, painting ways, etc. For example, some common image styles may comprise impressionism, realism, fauvism, abstractionism, ukiyo-e, Chinese painting, etc.
  • the embodiments of the present disclosure may identify image elements explicitly indicated in the text, and may also determine image elements associated with an emotion factor or category expressed by the text.
  • the emotion category of the text may be further used for determining a style in which a generated image is to be drawn.
  • the embodiments of the present disclosure train and apply a series of models or networks that are based on machine learning, deep learning, neural networks, etc., to generate an image in a style conforming to a text.
  • various types of Generative Adversarial Network (GAN) models may be adopted in respective stages of the process of generating images, so as to achieve higher quality of generated images.
  • an image generated in this style would be more attractive to the user.
  • User experiences of the automatic image generation would be improved. All the elements in a generated image would be in a unified style. The generated image would also possess higher diversity.
  • FIG. 1 illustrates an exemplary image generation flow 100 according to an embodiment.
  • the image generation flow 100 may be performed for generating a target image based on a text.
  • a text 110 is obtained, which may comprise image elements that can be understood, utilized, processed or expressed directly by operations or models involved in the process of generating images.
  • the text 110 may be an input of a user, i.e., received from the user directly.
  • the user may input a sentence “beach in the sunshine” .
  • This input from the user describes understandable image elements, e.g., “beach” , “sunshine” , etc., and thus may be used as a basis of the following processes.
  • the user input may be in a format of text, or the user input may be in any other formats, e.g., audio, and can be converted into text through various existing format conversion techniques.
  • the text 110 may be derived from an input of a user, such that the text 110 may comprise more understandable image elements.
  • the user input may comprise topic words describing specific image elements, e.g., “hill” , “boat” , “horse” , etc.
  • Word extension may be performed to these topic words so as to obtain extended topic words that are semantically relevant to the original topic words, e.g., the word “hill” may be extended to “mountain” , the word “boat” may be extended to “ship” , “yacht” , “sailboat” , etc.
  • the topic words in the user input and/or the extended topic words may be included in the text 110.
  • the extension of topic words may be implemented through various approaches. For example, a topic-to-topic knowledge graph may be pre-established and used for extending from one topic to other topics. For example, a machine learning or deep learning model, e.g., a word2vec-based extension model, may be trained and applied for generating relevant topics based on an input topic.
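  • As a non-limiting illustration (not part of the disclosure), the following Python sketch shows topic-word extension through a pre-established topic-to-topic knowledge graph; the graph entries are merely examples, and a word2vec-based extension model could be substituted, e.g., by returning nearest neighbours of each topic in an embedding space.

        # Topic-to-topic knowledge graph; entries are illustrative only.
        TOPIC_GRAPH = {
            "hill": ["mountain"],
            "boat": ["ship", "yacht", "sailboat"],
        }

        def extend_topics(topic_words):
            # A word2vec-based model could replace this dictionary lookup.
            extended = list(topic_words)
            for word in topic_words:
                extended.extend(TOPIC_GRAPH.get(word, []))
            return extended

        print(extend_topics(["hill", "boat"]))
        # ['hill', 'boat', 'mountain', 'ship', 'yacht', 'sailboat']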
  • the user input may comprise emotion words describing emotion category that is desired to be expressed in a generated image, e.g., “sad” , “good mood” , “crazy” , etc.
  • image elements associated with the emotion words may be determined.
  • the emotion “sad” may be expressed in an image with some image elements like “dark clouds” , “light rain” , etc.
  • the emotion “good mood” may be expressed in an image with some image elements like “bright sun” , “smile on the face” , etc.
  • the image elements determined for the emotion words in the user input may be included in the text 110, such that the generated image may better meet with the user’s desires through including those image elements associated with the emotion words into the generated image.
  • the conversion from the emotion words to the image elements may be implemented through various approaches. For example, an emotion-to-element knowledge graph may be pre-established and used for extending from one emotion word or category to image elements.
  • a machine learning or deep learning model e.g., a word2vec-based extension model, may be trained and applied for generating relevant image elements based on an emotion word or category.
  • word extension may also be performed to the emotion words in the user input so as to obtain extended emotion words, and the extended emotion words may be further used for determining corresponding image elements.
  • the extension of emotion words may be implemented through various approaches, e.g., a pre-established emotion-to-emotion knowledge graph, a machine learning model trained for predicting relevant emotion words from an input emotion word, etc.
  • a topic word in the user input may also be extended to one or more emotion words through, e.g., a pre-established topic-to-emotion knowledge graph, a machine learning model trained for predicting relevant emotion words from an input topic word, etc., and the extended emotion words may be further used for determining image elements in the way as discussed above.
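  • The following Python sketch illustrates one possible form of the emotion-to-element conversion through a pre-established emotion-to-element knowledge graph; the entries follow the examples given above and are illustrative only.

        # Emotion-to-element knowledge graph; entries follow the examples above.
        EMOTION_TO_ELEMENTS = {
            "sad": ["dark clouds", "light rain"],
            "good mood": ["bright sun", "smile on the face"],
            "happiness": ["bright sun", "blue sky", "white clouds"],
        }

        def elements_for_emotion(emotion_word):
            return EMOTION_TO_ELEMENTS.get(emotion_word, [])

        print(elements_for_emotion("happiness"))
        # ['bright sun', 'blue sky', 'white clouds']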
  • the text 110 may be derived as “A boat is sailing in the sea, with bright sun, blue sky and white clouds” , wherein the “boat” in the text 110 may be obtained through extending from the topic words “sailing” and “sea” in the user input, and the “bright sun” , “blue sky” and “white clouds” in the text 110 may be converted from the emotion word “joyfully” or an emotion category “happiness” corresponding to the emotion word “joyfully” .
  • the derived text 110 comprises various image elements that may be further visually expressed in a generated image.
  • the background image 120 refers to an image corresponding to background elements in the text
  • the foreground image 130 refers to an image corresponding to foreground elements in the text.
  • the background elements may refer to those image elements usually expressed in a background part of an image, e.g., sky, grassland, sea, etc.
  • the foreground elements may refer to those image elements usually expressed in a foreground part of an image, e.g., human, animal, car, ship, etc.
  • the background elements and the foreground elements may be identified from the text 110 through various approaches.
  • a reference table may be pre-established, which includes a set of reference words classified as background elements, and another set of reference words classified as foreground elements. Through matching a word in the text with a corresponding reference word in the reference table, it may be determined whether the word in the text is a background element or a foreground element. Moreover, for example, a machine learning-based classifier may be trained and applied for classifying a word in the text as a background element or a foreground element.
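  • A minimal Python sketch of the reference-table approach described above is given below; the word lists are illustrative, and a trained machine learning-based classifier could replace the table lookup.

        # Reference table of background and foreground words; lists are illustrative.
        BACKGROUND_WORDS = {"sky", "grassland", "sea", "grass", "sunshine", "beach"}
        FOREGROUND_WORDS = {"human", "animal", "car", "ship", "boat", "goose", "horse"}

        def classify_element(word):
            if word in BACKGROUND_WORDS:
                return "background"
            if word in FOREGROUND_WORDS:
                return "foreground"
            return "unknown"

        print([(w, classify_element(w)) for w in ["goose", "grass"]])
        # [('goose', 'foreground'), ('grass', 'background')]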
  • the background image 120 and the foreground image 130 may be merged together to form a merged image 140 which would contain both background elements and foreground elements identified from the text 110.
  • An image style may be determined based on the text 110, in which the target image would be drawn.
  • the style may be associated with an emotion category of the text 110. Accordingly, the emotion category may be determined firstly based on content of the text 110, and then the determined emotion category may be further used for determining the style.
  • the style may be determined from the text 110 through various approaches. For example, a classification model may be trained and applied for classifying a text into one of a plurality of image styles, wherein the classification model may take at least emotion category of the text as a feature for the classification process.
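  • The following Python sketch shows one possible text-to-style classification model, assuming a small labelled corpus of (text, style) pairs is available; the corpus, the TF-IDF features and the library choice (scikit-learn) are assumptions for illustration rather than requirements of the disclosure.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Tiny illustrative corpus of (text, style) pairs; a real corpus would
        # also encode emotion-category features alongside the raw text.
        texts = [
            "water lily pond in soft morning light",
            "crazy passionate wild colors everywhere",
            "quiet mountain and river in the mist",
            "bright sun, blue sky and white clouds over the sea",
        ]
        styles = ["impressionism", "fauvism", "chinese_painting", "impressionism"]

        style_classifier = make_pipeline(TfidfVectorizer(),
                                         LogisticRegression(max_iter=1000))
        style_classifier.fit(texts, styles)

        print(style_classifier.predict(
            ["A boat is sailing in the sea, with bright sun and blue sky"]))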
  • a style transferring process may be performed to the merged image 140 based on the determined style, so as to transfer the merged image 140 to a target image 150.
  • the target image 150 comprises image elements indicated in the text 110, and is in the determined style which conforms to the text and accordingly conforms to the user input.
  • FIG. 2 illustrates an exemplary image generation flow 200 according to an embodiment.
  • the image generation flow 200 is an exemplary further implementation of the image generation flow 100 in FIG. 1.
  • the user may provide at least one additional input during the generating of a target image or after the generated target image has been presented to the user.
  • the user’s additional input may comprise additional information about what the generated image is desired to further include, how the generated image is desired to be drawn, modification comments on the generated image having been presented to the user, etc.
  • the image generation flow 200 may obtain a text 210, generate a background image 220 and a foreground image 230, form a merged image 240, determine an image style, and generate a target image 250 in the determined style.
  • the image generation flow 200 may return to obtain an additional text based on the additional user input, and generate an updated target image based at least on the additional text.
  • the background image and/or the foreground image may be updated according to the additional text
  • an updated merged image may be then formed from the updated background image and/or the updated foreground image
  • the image style may be updated based on the additional text
  • an updated target image may be generated in the updated image style. It should be appreciated that during the updating of the previous target image, the previous target image may be used as a reference for the updating process.
  • both the previous target image and the additional text would be considered in the process of generating the updated background image and/or the foreground image, in the process of determining the updated style, etc. If more than one additional user input is received, the image generation flow 200 will iteratively perform the above updating process of target image. If it is determined at 260 that no additional user input is received, the current updated target image 270 may be provided to the user.
  • FIG. 3 illustrates an exemplary process 300 of generating images according to an embodiment.
  • a text 302 is obtained.
  • the text 302 may be received from a user, or derived from the user’s input.
  • the text 302 may comprise image elements to be included in a generated target image.
  • the text may be “A goose standing on the grass” .
  • An initial image generator 304 may be applied for generating an initial image 306 based on the text 302.
  • the initial image 306 intends to include the image elements contained in the text 302.
  • an exemplary initial image 306-1 includes image elements, e.g., “goose” , “grass” , etc.
  • the initial image 306 generated by the initial image generator 304 may not sufficiently reflect an emotion category of the text 302 and the user input, and may lack an explicit image style, especially an image style associated with the text 302 and conforming to the user input.
  • the following operations in the process 300 may aim to add an appropriate image style into a generated image and to improve the quality of the generated image.
  • the initial image generator 304 may be a machine learning or deep learning model trained for generating an image based on an input text, e.g., an attention GAN model, etc. Moreover, the initial image generator 304 may also be trained for determining mapping relationship between each element in the input text and a corresponding image region in the generated image.
  • the process 300 may traverse all the words in the text 302, so as to generate corresponding foreground images and background images.
  • it may be determined whether the current word in the text 302 is a foreground element. If the current word in the text 302 is not a foreground element, the process 300 will proceed to step 324. If the current word in the text 302 is a foreground element, e.g., “goose” , a foreground image generator 310 may be applied for generating a foreground image 312 corresponding to the foreground element, e.g., an exemplary foreground image 312-1 which draws a goose.
  • the foreground image generator 310 may be a machine learning or deep learning model trained for generating an image based on an input word, e.g., a progressive growing GAN (PG GAN) model, etc.
  • the foreground image generator 310 may also generate the foreground image 312 in a style conforming to the text 302.
  • a classification model 314 may be trained and applied for classifying the text 302 into an image style 316 among a plurality of image styles.
  • elements in the text 302 may help determine an image style, e.g., an element “water lily” may be associated with the style of impressionism since the artist Monet has some famous paintings of water lilies.
  • emotion category of the text 302 or the user input may also help determine an image style, since the emotion category may reflect what style is preferred by the user, e.g., emotions like “passionate” , “crazy” , etc. may be associated with certain styles.
  • the classification model may be trained as classifying an input text into an image style based on at least one of features about elements in the input text, emotion category of the input text, etc.
  • the foreground image generator 310 may take both the foreground element and the style 316 as an input, and generate the foreground image 312 corresponding to the foreground element in the style 316.
  • the foreground image generator 310 may be trained with images in the style 316, and accordingly, the foreground image 312 generated by the foreground image generator 310 would be in the style 316.
  • a plurality of candidate foreground image generators may be trained for a plurality of image styles respectively, and the determined style 316 may be used for selecting a candidate foreground image generator corresponding to the style 316 as the foreground image generator 310.
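  • The selection among style-specific candidate generators may be as simple as a registry lookup, as in the following illustrative Python sketch (the generator objects are placeholders for trained models); an analogous lookup may be used later for selecting a candidate style transferring model.

        # Registry of candidate foreground image generators, one per image style.
        # StubGenerator is a placeholder for a trained PG GAN model.
        class StubGenerator:
            def __init__(self, style):
                self.style = style
            def generate(self, foreground_word):
                return f"<{foreground_word} drawn in {self.style}>"

        CANDIDATE_FOREGROUND_GENERATORS = {
            style: StubGenerator(style)
            for style in ("impressionism", "realism", "fauvism", "ukiyo-e")
        }

        def select_foreground_generator(style):
            return CANDIDATE_FOREGROUND_GENERATORS[style]

        print(select_foreground_generator("impressionism").generate("goose"))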
  • a corresponding background image may be generated through removing the foreground element from the initial image 306.
  • an image inpainting model 318 may be applied for removing the foreground element from the initial image 306 so as to generate a corresponding background image 320.
  • An exemplary background image 320-1 is shown as having removed the foreground element “goose” from the initial image 306.
  • the image inpainting model 318 may utilize the mapping relationship determined by the initial image generator 304 to locate an image region corresponding to the foreground element in the initial image 306 and further remove the image region corresponding to the foreground element.
  • the image inpainting model 318 may be a machine learning or deep learning model trained for taking an image, a word, mapping relationship between the word and image regions in the image, etc. as inputs, and outputting a result image in which the image region corresponding to the word is removed from the input image.
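  • As a rough illustration of removing a foreground region, the following Python sketch uses a classical OpenCV inpainting routine in place of the trained image inpainting model; the synthetic image and the circular mask stand in for the initial image 306 and the image region that would in practice be located via the attention mapping.

        import cv2
        import numpy as np

        # Synthetic stand-in for the initial image 306 (grey canvas with a dark blob).
        initial_image = np.full((300, 400, 3), 180, dtype=np.uint8)
        cv2.circle(initial_image, (200, 150), 60, (30, 30, 30), -1)

        # Mask of the image region corresponding to the foreground element.
        mask = np.zeros(initial_image.shape[:2], dtype=np.uint8)
        cv2.circle(mask, (200, 150), 65, 255, -1)

        # Classical inpainting fills the removed region from surrounding pixels.
        background_image = cv2.inpaint(initial_image, mask, 3, cv2.INPAINT_TELEA)
        print(background_image.shape)  # (300, 400, 3)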
  • the foreground image 312 and the background image 320 may be merged together to generate a merged image 322, e.g., an exemplary merged image 322-1 in FIG. 3.
  • At 324, the process 300 may determine whether there is any more word in the text 302 to be traversed. If yes, the process 300 will iteratively return to the step 308, so as to generate a further foreground image and a further background image, and accordingly obtain an updated merged image. During the iteration, the initial image 306 may be replaced by the current merged image 322 so as to generate the further background image.
  • the process 300 may further transfer the final merged image into a target image in the style 316 conforming to the text 302.
  • the process 300 may utilize an edge detection model 326 to generate a sketched image 328 based on the final merged image.
  • the edge detection model 326 may detect edges from the final merged image, wherein the edges may refer to junction parts in the final merged image where color changes or texture changes are relatively large, e.g., outline of a goose, outline of the goose’s nose, outline of grass, etc.
  • the edge detection model 326 may also remove element details from the final merged image.
  • the edge detection model 326 may output the sketched image 328 corresponding to the final merged image, e.g., the exemplary sketched image 328-1 in FIG. 3.
  • the edge detection model 326 may be a machine learning or deep learning model trained for generating a sketched image based on an input image.
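  • The following Python sketch illustrates the idea with classical Canny edge detection standing in for the trained edge detection model; the synthetic input image and the thresholds are assumptions for illustration.

        import cv2
        import numpy as np

        # Synthetic stand-in for the final merged image (grey canvas with a blob).
        merged_image = np.full((300, 400), 180, dtype=np.uint8)
        cv2.circle(merged_image, (200, 150), 60, 40, -1)

        # Canny keeps only strong color/texture transitions, i.e. the outlines.
        edges = cv2.Canny(merged_image, 100, 200)
        sketched_image = 255 - edges          # dark outlines on a white canvas
        print(sketched_image.shape)           # (300, 400)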
  • the process 300 may generate a target image through adding the style 316 to the sketched image 328.
  • a style transferring model 330 may be applied for generating a target image 332 in the style 316 based on the sketched image 328.
  • the style transferring model 330 may be a machine learning or deep learning model trained for generating an image from an input image under a certain image style, e.g., a cycle GAN model, etc.
  • For each image style, one style transferring model may be trained, and accordingly, a plurality of candidate style transferring models corresponding to respective image styles may be obtained.
  • a candidate style transferring model corresponding to this style 316 may be selected for generating the target image 332 in this style 316.
  • a merged image may be generated after the determination operation at 324. That is, through the multiple iterations corresponding to the words in the text 302, a plurality of foreground images respectively corresponding to a plurality of foreground elements may be obtained, and a final background image which removes the plurality of foreground elements from the initial image 306 and contains only background elements in the text 302 may also be obtained. Then, when it is determined that there is no more word in the text 302 to be traversed, a merged image may be generated based on the plurality of foreground images and the final background image.
  • each iteration may also start with a determination of whether the current word is a background word.
  • If the current word is identified as a background word, the iteration will go to step 324; else, if the current word is identified as not a background word but a foreground word, the iteration will proceed to generate the foreground image 312, the background image 320 and further the merged image 322.
  • Although the initial image generator 304 and the image inpainting model 318 are shown as separate modules or models, they may also be functionally combined together as a background image generator.
  • This background image generator may be used for generating a background image corresponding to background elements in the text 302.
  • the process 300 may comprise further operations for updating the target image 332 in the case that an additional user input is received and accordingly an additional text is obtained.
  • both the previous target image and the additional text may be considered in the process of generating an updated target image.
  • the foreground image generator 310 may generate a foreground image corresponding to this further foreground element, and the merged image 322 may also be updated with this foreground image, which is further used for generating an updated target image.
  • the initial image generator 304 and the image inpainting model 318 may cooperate to generate an updated background image containing the further background element.
  • the initial image generator 304 may generate an updated initial image based on both the original text 302 and the additional text, and the image inpainting model 318 may remove foreground elements from the updated initial image so as to generate an updated background image. Accordingly, the merged image 322 may also be updated with this updated background image, which is further used for generating an updated target image.
  • the style 316 may be updated based on the additional text through the classification model 314. The updated style may be further used by the style transferring model 330 for generating an updated target image. The updated style may also be used by the foreground image generator 310 for generating updated foreground images in the updated style.
  • FIG. 4 illustrates an exemplary process 400 of generating an initial image according to an embodiment.
  • an attention GAN model 420 is adopted for generating an initial image 430 based on a text 410.
  • the attention GAN model 420 is an exemplary implementation of the initial image generator 304 in FIG. 3.
  • the attention GAN 420 aims to establish a connection between a text and an image through an attention mechanism in a latent and high-dimension space.
  • the attention mechanism may reflect mapping relationship between words or elements in the text and image regions in the generated image.
  • the attention mechanism may be used for selecting a candidate image part associated with a word in the text and further determining at which position the selected candidate image part is placed in the image.
  • the attention GAN model 420 may be trained by a plurality of text-image pairs. When applying the trained attention GAN model 420, it may take a text as input and output an image containing elements in the text.
  • FIG. 5 illustrates exemplary attention mechanism 500 in an attention GAN model according to an embodiment.
  • the attention mechanism 500 may comprise various types of attention, e.g., text self-attention, image self-attention, text-image joint attention, etc.
  • text self-attention 512 may be performed to the text 510 for obtaining text vectors 514.
  • the text vectors 514 may comprise vector representations of words in the text 510, wherein a vector representation of each word reflects relevance or matching degrees with all other words in the text 510.
  • the text self-attention 512 may be in a form of multi-head attention.
  • the inputs to the multi-head attention may be denoted as query Q, key K and value V.
  • the multi-head attention may be formed through a stacking of a number h of scaled dot-product attentions.
  • the inputs to each scaled dot-product attention may also be Q, K and V.
  • each of the Q, K and V may be all the word embeddings of a number n of words in the text.
  • one word embedding is taken from Q at a time to check its matching degrees with all other word embeddings, and this process is performed n times.
  • linear transformation may be performed to Q, K and V respectively to obtain Q’ , K’ and V’ . Then a scaled dot-product attention may be calculated for Q’ , K’ and V’ , and this calculation may be repeated for h times.
  • the h calculation results may be concatenated together, and a linear transformation may then be performed on the concatenation result.
  • the result of the linear transformation is an output of the multi-head attention.
  • the output of the text self-attention is reshaped from [batch size, maximum sequence length, word embedding dimension] into [batch size, maximum sequence length, head number*head embedding dimension] . As an example, assuming that the head number h is 8, the output of the text self-attention may be reshaped from [64, 30, 512] into [64, 30, 8*64] .
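  • A minimal numpy sketch of the multi-head text self-attention described above is given below: h scaled dot-product attentions over linearly transformed Q, K and V (all taken from the word embeddings of the text), concatenated and linearly transformed again. The random weights and the dimensions (h = 8 heads, head dimension 64, embedding dimension 512) follow the example above and are illustrative only.

        import numpy as np

        def scaled_dot_product_attention(Q, K, V):
            d_k = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d_k)          # matching degrees between words
            scores -= scores.max(axis=-1, keepdims=True)
            weights = np.exp(scores)
            weights /= weights.sum(axis=-1, keepdims=True)   # softmax
            return weights @ V

        def multi_head_self_attention(X, h=8, d_head=64):
            n, d_model = X.shape                     # n words, d_model-dim embeddings
            rng = np.random.default_rng(0)
            heads = []
            for _ in range(h):                       # one scaled dot-product attention per head
                Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) * 0.02
                              for _ in range(3))
                heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
            concat = np.concatenate(heads, axis=-1)  # [n, h * d_head]
            Wo = rng.standard_normal((h * d_head, d_model)) * 0.02
            return concat @ Wo                       # final linear transformation

        word_embeddings = np.random.randn(30, 512)   # 30 words, 512-dim embeddings
        print(multi_head_self_attention(word_embeddings).shape)   # (30, 512)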
  • image self-attention 522 may be performed to the image 520 for obtaining image vectors 524.
  • the image vectors 524 may comprise vector representations of regions or pixels in the image 520, wherein a vector representation of each region reflects relevance or matching degrees with all other regions in the image 520.
  • the image self-attention 522 aims to establish relationship among various regions in an image, and may be used for, e.g., finding the most similar or relevant regions in the image for a current region.
  • the image self-attention 522 may also be in a form of multi-head attention, which is similar to the multi-head attention for the text self-attention 512 as discussed above.
  • linear/non-linear transformations may be performed respectively to a set of convolution feature maps x corresponding to vector representations of regions in an image, so as to obtain, e.g., a group of transformed x1, x2 and x3.
  • x1 may be transposed and matrix-multiplied with x2, and the multiplication result may be normalized through softmax to obtain an attention map.
  • the attention map may be matrix-multiplied with x3 on a basis of region or pixel to obtain a set of self-attention feature maps.
  • the text self-attention 512 and the image self-attention 522 may be trained separately, and text vectors and image vectors may be updated during respective training processes.
  • the text self-attention 512 and the image self-attention 522 may also be trained jointly in the attention mechanism 500, and text vectors and image vectors may be updated synchronously.
  • three fully-connected linear layers f(x), g(x) and h(x) are applied to the text vectors 514 and the image vectors 524 respectively so as to obtain converted text vectors 516, converted image vectors 526, and converted image vectors 528.
  • Matrix multiplication 530 may be performed to a transposition of the converted text vectors 516 and the converted image vectors 526, so as to calculate their distances in a high-dimension dense space.
  • the result of the matrix multiplication 530 is a weight matrix that expresses distances between regions in the image 520 and semantic meaning of words in the text 510, which further forms an attention map 540.
  • matrix multiplication 550 may be performed to the attention map 540 and the converted image vectors 528 so as to further identify the most relevant words for each region in the image, and a joint attention map 560 is finally obtained.
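  • The following numpy sketch illustrates the joint attention computation; the exact multiplication order and normalization axis are not fully specified above, so the sketch follows one plausible reading, and all dimensions are illustrative.

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        rng = np.random.default_rng(0)
        text_vectors = rng.standard_normal((30, 256))     # 30 words in the text
        image_vectors = rng.standard_normal((64, 256))    # 64 regions in the image

        # Three fully-connected linear layers f(x), g(x) and h(x).
        Wf, Wg, Wh = (rng.standard_normal((256, 128)) * 0.02 for _ in range(3))
        converted_text = text_vectors @ Wf                # converted text vectors 516
        converted_image = image_vectors @ Wg              # converted image vectors 526
        converted_image_2 = image_vectors @ Wh            # converted image vectors 528

        # Attention map relating words to image regions (cf. matrix multiplication 530).
        attention_map = softmax(converted_text @ converted_image.T)   # [words, regions]
        # Joint attention map (cf. matrix multiplication 550).
        joint_attention_map = attention_map @ converted_image_2       # [words, 128]
        print(joint_attention_map.shape)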
  • FIG. 6 illustrates an exemplary process 600 of generating a foreground image according to an embodiment.
  • a PG GAN model 620 is adopted for generating a foreground image 630 based on a foreground word or element 610.
  • the PG GAN model 620 is an exemplary implementation of the foreground image generator 310 in FIG. 3.
  • the PG GAN model 620 may generate the foreground image 630 in a predetermined image style.
  • the PG GAN model 620 may be trained by a plurality of word-image pairs, wherein the images are attached with, e.g., labels indicating whether being a human-created image or not, labels indicating styles of the images, etc.
  • the PG GAN model 620 may be trained such that a generated image approximates an existing style but still has sufficient diversity. This may be achieved through applying a min-max loss function during the training process.
  • FIG. 7 illustrates an exemplary training process 700 of a PG GAN model according to an embodiment.
  • the PG GAN model may comprise a generator 710 and a discriminator 720.
  • data 712 during training may comprise image, Label A, and Label B.
  • the Label A indicates whether the image is real or fake, i.e., whether the image is human-created or not.
  • the Label B is a vector describing probabilities of image categories, wherein categories may comprise various types of foreground elements, e.g., goose, horse, kangaroo, etc.
  • the generator 710 may have two inputs, e.g., white noise and Label B.
  • the target of the generator 710 is to generate an image that matches with the category as indicated in the Label B.
  • the Label B may be a 5-dimension vector in which each dimension corresponds to a category.
  • a dimension corresponding to the category “kangaroo” in the Label B may be set to 1 and all other dimensions are set to 0.
  • the image generated by the generator 710 may be further provided to the discriminator 720.
  • the discriminator 720 takes auto-generated <image, label> pairs from the generator 710 and reference <image, label> pairs as inputs.
  • the target of the discriminator 720 is to distinguish the reference/real images from the generated/fake images.
  • Cross-entropy loss may be used as loss function.
  • a unified label distribution may facilitate the generation of high-quality images: if the Label B of a fake image is set to be the same as the Label B of a real image, the loss computed for the generated fake image contains less information distinguishing it from real images, which makes it more difficult for the discriminator 720 to make the classification.
  • the generator 710 and the discriminator 720 may be trained jointly according to the framework in FIG. 7. Exemplary details of the training of the discriminator and the generator will be discussed below in connection with FIG. 8 and FIG. 9.
  • FIG. 8 illustrates an exemplary training process 800 of a discriminator in a PG GAN model according to an embodiment.
  • FIG. 8 shows forward and backward process of training the discriminator from step t to step t+1.
  • Inputs 810, e.g., white noise and Label B, may be provided to the generator 820.
  • the generator 820 may generate an image, e.g., a fake image 830.
  • a loss 860 of the version t of discriminator 850 may be calculated.
  • the loss 860 may be further used for updating the discriminator to obtain a version t+1 of discriminator 870.
  • FIG. 9 illustrates an exemplary training process 900 of a generator in a PG GAN model according to an embodiment.
  • FIG. 9 shows forward and backward process of training the generator from step t to step t+1.
  • Inputs 910, e.g., white noise and Label B, may be provided to the generator 920.
  • the generator 920 may generate an image, e.g., a fake image 930.
  • the discriminator 940 may give scores 950 of being Label A and Label B.
  • the scores may be further used for calculating a loss 960 of the version t of generator 920.
  • the loss 960 may be further used for updating the generator to obtain a version t+1 of generator 970.
  • the process 800 in FIG. 8 and the process 900 in FIG. 9 may be performed jointly so as to achieve the training of the whole PG GAN model.
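  • The following PyTorch sketch outlines such a joint training loop for a conditional GAN in the spirit of FIG. 7 to FIG. 9: the generator takes white noise plus a one-hot Label B, and the discriminator is updated to separate real from fake images (Label A) before the generator is updated in turn. The network sizes, the flat 64x64 image format and the five categories are assumptions, progressive growing is omitted, and the Label B classification branch of the discriminator is simplified to conditioning on the label.

        import torch
        import torch.nn as nn

        NOISE_DIM, NUM_CATEGORIES, IMG_DIM = 100, 5, 64 * 64

        generator = nn.Sequential(
            nn.Linear(NOISE_DIM + NUM_CATEGORIES, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh())
        discriminator = nn.Sequential(
            nn.Linear(IMG_DIM + NUM_CATEGORIES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))                        # real/fake logit (Label A)

        opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
        bce = nn.BCEWithLogitsLoss()                  # cross-entropy style loss

        def train_step(real_images, label_b):
            batch = real_images.size(0)
            noise = torch.randn(batch, NOISE_DIM)     # white noise
            fake_images = generator(torch.cat([noise, label_b], dim=1))

            # discriminator step t -> t+1
            d_real = discriminator(torch.cat([real_images, label_b], dim=1))
            d_fake = discriminator(torch.cat([fake_images.detach(), label_b], dim=1))
            loss_d = bce(d_real, torch.ones_like(d_real)) + \
                     bce(d_fake, torch.zeros_like(d_fake))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # generator step t -> t+1
            d_fake = discriminator(torch.cat([fake_images, label_b], dim=1))
            loss_g = bce(d_fake, torch.ones_like(d_fake))
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
            return loss_d.item(), loss_g.item()

        label_b = torch.eye(NUM_CATEGORIES)[torch.randint(0, NUM_CATEGORIES, (8,))]
        print(train_step(torch.randn(8, IMG_DIM), label_b))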
  • FIG. 10 illustrates an exemplary process 1000 of transferring style according to an embodiment.
  • a cycle GAN model 1020 is adopted for generating a target image 1030 based on a sketched image 1010.
  • the cycle GAN model 1020 is an exemplary implementation of the style transferring model 330 in FIG. 3.
  • the cycle GAN model 1020 may generate the target image 1030 in a predetermined image style.
  • the cycle GAN model 1020 may be trained with a plurality of images under the image style.
  • the cycle GAN model 1020 aims to enable all the elements in the generated target image 1030 to be in the unified image style.
  • a plurality of candidate cycle GAN models corresponding to respective image styles may be trained respectively.
  • a candidate cycle GAN model corresponding to this style may be selected for generating the target image in this style.
  • FIG. 11 illustrates an exemplary training process 1100 of a cycle GAN model according to an embodiment.
  • the cycle GAN model comprises two mirror-symmetric GANs which form a cycle network.
  • the two GANs share two generators, e.g., generator A-B and generator B-A, and have respective discriminators, e.g., discriminator A and discriminator B. That is, there are two generators and two discriminators in total in the cycle GAN model. Training with two unpaired image sets may be achieved for the cycle GAN model.
  • an input image A in a domain A may be obtained and provided to the generator A-B which is used for transferring an image in the domain A to an image in a domain B. Accordingly, the generator A-B may output a generated image B based on the input image A.
  • the generated image B is provided to the generator B-A, which is used for transferring an image in the domain B to an image in the domain A. Accordingly, the generator B-A may output a cyclic image A based on the generated image B.
  • the cyclic image A should be similar to the input image A, which defines meaningful mappings between the two unpaired data sets.
  • an input image B in the domain B may be provided to another GAN so as to finally generate a cyclic image B.
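  • The following PyTorch sketch illustrates the cycle-consistency objective described above: generator A-B and generator B-A map between the two unpaired domains, and the cyclic reconstruction is pulled back towards the input with an L1 loss alongside an adversarial loss. The fully-connected networks and the loss weight are placeholders for illustration; real cycle GANs use convolutional generators and discriminators.

        import torch
        import torch.nn as nn

        IMG_DIM = 64 * 64
        generator_ab = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, IMG_DIM))
        generator_ba = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, IMG_DIM))
        discriminator_b = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
                                        nn.Linear(256, 1))

        l1 = nn.L1Loss()
        bce = nn.BCEWithLogitsLoss()

        input_a = torch.randn(4, IMG_DIM)            # images in domain A (e.g. sketches)
        generated_b = generator_ab(input_a)          # transferred into domain B (target style)
        cyclic_a = generator_ba(generated_b)         # mapped back to domain A

        cycle_loss = l1(cyclic_a, input_a)           # cyclic image A should resemble input A
        adv_logits = discriminator_b(generated_b)
        adv_loss = bce(adv_logits, torch.ones_like(adv_logits))   # fool discriminator B
        total_generator_loss = adv_loss + 10.0 * cycle_loss        # weight is a common choice
        print(float(total_generator_loss))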
  • the trained cycle GAN model would then be used for generating target images in the predetermined image style.
  • an additional user input and accordingly an additional text may be obtained, which may trigger an updating of the current generated target image.
  • the attention GAN model may take both the original text and the additional text as an input, and may also take the current target image as an input, which may replace the white noise that was previously provided to a generator in the attention GAN model.
  • the attention GAN model may generate an updated initial image at least in consideration of the current target image.
  • the PG GAN model may generate new foreground images corresponding to the new foreground elements, while if the additional text results in an updated style, the PG GAN model may be retrained or reselected for generating foreground images in the updated style. In another aspect, if the additional text results in an updated style, the cycle GAN model may also be retrained or reselected for generating target images in the updated style.
  • the image generation schemes according to the embodiments of the present disclosure may be applied in various application scenarios or have various product forms.
  • the image generation schemes may be implemented in independent application software or platform which is specially designed for providing an image generation service for users.
  • the independent application software or platform may have user interfaces for collecting user inputs and presenting generated images.
  • the image generation schemes may be implemented in a third party application software or platform as an additional functionality module of the third party application software or platform for providing an image generation service.
  • the image generation schemes may be added into an AI chatbot as an additional feature, and thus during chatting with a user, the AI chatbot may collect the user’s input, generate images according to the user input, and provide or present generated images to the user.
  • FIG. 12 illustrates exemplary user interfaces for image generation according to an embodiment.
  • the user interfaces in FIG. 12 show a process of providing a generated image according to a user’s input.
  • a prompt requiring the user to input a description about what image is desired to be generated may be presented, e.g., “Please input some text for which a painting is to be drawn” .
  • the user may input a text “Sail the ocean alone” in an input box 1222 to indicate that the user wants to obtain an image about this text.
  • an image generation process may be performed according to the embodiments of the present disclosure.
  • a generated image 1232 may be presented to the user.
  • FIG. 13 illustrates exemplary user interfaces for image generation according to an embodiment.
  • the user interfaces in FIG. 13 show that image generation may be conducted through interactions with the user. For example, a generated image may be updated according to the user’s additional input.
  • the user interfaces in FIG. 13 may be construed as continuations to the user interfaces in FIG. 12.
  • a prompt requiring the user to input comments on the generated image 1312 may be provided in an interaction box 1314.
  • the user may input a text 1322 “I prefer a color deeper one” in the interaction box.
  • the generated image 1312 may be updated in response to the user’s input of the text 1322 according to the embodiments of the present disclosure.
  • an updated image 1332 may be presented to the user, which has been updated according to the user’s input of the text 1322.
  • FIG. 14 illustrates a flowchart of an exemplary method 1400 for image generation according to an embodiment.
  • At 1410, at least one background element and at least one foreground element may be identified from a text.
  • At 1420, at least one background image corresponding to the at least one background element may be generated.
  • At 1430, at least one foreground image corresponding to the at least one foreground element may be generated.
  • a merged image may be generated based on the at least one background image and the at least one foreground image.
  • a style of a target image may be determined from the text.
  • the merged image may be transferred into the target image in the determined style.
  • the generating the at least one background image may comprise: generating an initial image based on the text; and generating the at least one background image through removing the at least one foreground element from the initial image.
  • the at least one background image may be generated through a background image generator.
  • the background image generator may comprise: an attention GAN model, for generating the initial image based on the text; and an image inpainting model, for removing the at least one foreground element from the initial image.
  • the at least one foreground image may be generated through a foreground image generator.
  • the foreground image generator may comprise a PG GAN model.
  • the style may be determined based on the text through a classification model.
  • the style may be associated with an emotion category of the text.
  • the transferring the merged image into the target image in the determined style may comprise: generating a sketched image based on the merged image through an edge detection model; and transferring the sketched image into the target image in the determined style through a style transferring model.
  • the style transferring model may comprise a cycle GAN model.
  • the transferring the sketched image into the target image in the determined style through the style transferring model may comprise: selecting a candidate style transferring model corresponding to the determined style from a plurality of candidate style transferring models; and transferring the sketched image into the target image in the determined style through the selected candidate style transferring model.
  • the method 1400 may further comprise: receiving the text.
  • the method 1400 may further comprise: receiving an input; and deriving the text from the input.
  • the deriving the text may comprise: identifying an emotion word from the input; and determining at least one image element associated with the emotion word.
  • the method 1400 may further comprise: obtaining an additional text; and updating the target image based at least on the additional text.
  • the method 1400 may further comprise any steps/processes for image generation according to the embodiments of the present disclosure as mentioned above.
  • FIG. 15 illustrates an exemplary apparatus 1500 for image generation according to an embodiment.
  • the apparatus 1500 may comprise: an element identifying module 1510, for identifying at least one background element and at least one foreground element from a text; a background image generating module 1520, for generating at least one background image corresponding to the at least one background element; a foreground image generating module 1530, for generating at least one foreground image corresponding to the at least one foreground element; a merged image generating module 1540, for generating a merged image based on the at least one background image and the at least one foreground image; a style determining module 1550, for determining a style of a target image from the text; and a style transferring module 1560, for transferring the merged image into the target image in the determined style.
  • the background image generating module 1520 may be for: generating an initial image based on the text through an attention GAN model; and generating the at least one background image through removing the at least one foreground element from the initial image.
  • the foreground image generating module 1530 may comprise a PG GAN model.
  • the style may be associated with an emotion category of the text.
  • the style transferring module 1560 may be for: generating a sketched image based on the merged image through an edge detection model; and transferring the sketched image into the target image in the determined style through a cycle GAN model.
  • the apparatus 1500 may also comprise any other modules configured for image generation according to the embodiments of the present disclosure as mentioned above.
  • FIG. 16 illustrates an exemplary apparatus 1600 for image generation according to an embodiment.
  • the apparatus 1600 may comprise at least one processor 1610 and a memory 1620 storing computer-executable instructions.
  • the at least one processor 1610 may perform any operations of the methods for image generation according to the embodiments of the present disclosure as mentioned above.
  • the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for image generation according to the embodiments of the present disclosure as mentioned above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or another suitable platform.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip) , an optical disk, a smart card, a flash memory device, random access memory (RAM) , read only memory (ROM) , programmable ROM (PROM) , erasable PROM (EPROM) , electrically erasable PROM (EEPROM) , a register, or a removable disk.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
PCT/CN2019/087041 2019-05-15 2019-05-15 Image generation WO2020227971A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/087041 WO2020227971A1 (en) 2019-05-15 2019-05-15 Image generation
CN201980044979.3A CN112400186B (zh) 2019-05-15 2019-05-15 Image generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/087041 WO2020227971A1 (en) 2019-05-15 2019-05-15 Image generation

Publications (1)

Publication Number Publication Date
WO2020227971A1 true WO2020227971A1 (en) 2020-11-19

Family

ID=73289732

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087041 WO2020227971A1 (en) 2019-05-15 2019-05-15 Image generation

Country Status (2)

Country Link
CN (1) CN112400186B (zh)
WO (1) WO2020227971A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129117B (zh) * 2023-02-03 2023-07-14 中国人民解放军海军工程大学 基于多头注意力的声呐小目标半监督语义分割方法及系统
CN116433825B (zh) * 2023-05-24 2024-03-26 北京百度网讯科技有限公司 图像生成方法、装置、计算机设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765278B (zh) * 2018-06-05 2023-04-07 Oppo广东移动通信有限公司 一种图像处理方法、移动终端及计算机可读存储介质
CN109543159B (zh) * 2018-11-12 2023-03-24 南京德磐信息科技有限公司 一种文本生成图像方法及装置

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000089660A (ja) * 1998-09-09 2000-03-31 Matsushita Electric Ind Co Ltd 手話学習支援装置および手話学習支援プログラムを記録した記録媒体
CN102662961A (zh) * 2012-03-08 2012-09-12 北京百舜华年文化传播有限公司 一种语义与图像匹配处理方法、装置及终端设备
CN102662568A (zh) * 2012-03-23 2012-09-12 北京百舜华年文化传播有限公司 一种图画输入方法及装置
CN103927372A (zh) * 2014-04-24 2014-07-16 厦门美图之家科技有限公司 一种基于用户语义的图像处理方法
CN104902189A (zh) * 2015-06-24 2015-09-09 小米科技有限责任公司 图片处理方法及装置
CN105100491A (zh) * 2015-08-11 2015-11-25 努比亚技术有限公司 一种处理照片的装置和方法
CN106454086A (zh) * 2016-09-30 2017-02-22 维沃移动通信有限公司 一种图像处理方法和移动终端
CN108198162A (zh) * 2017-12-29 2018-06-22 努比亚技术有限公司 照片处理方法、移动终端、服务器、系统、存储介质

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884669A (zh) * 2021-02-25 2021-06-01 电子科技大学 基于多尺度内容注意力机制的图像修复方法、存储介质和终端
CN113343705A (zh) * 2021-04-26 2021-09-03 山东师范大学 一种基于文本语义的细节保持图像生成方法及系统
CN113343705B (zh) * 2021-04-26 2022-07-05 山东师范大学 一种基于文本语义的细节保持图像生成方法及系统
CN113205574A (zh) * 2021-04-30 2021-08-03 武汉大学 一种基于注意力机制的艺术字风格迁移系统
CN113468981A (zh) * 2021-06-10 2021-10-01 的卢技术有限公司 图像处理方法、装置、计算机设备和存储介质
CN114119811A (zh) * 2022-01-28 2022-03-01 北京智谱华章科技有限公司 图像的生成方法、装置和电子设备
CN114119811B (zh) * 2022-01-28 2022-04-01 北京智谱华章科技有限公司 图像的生成方法、装置和电子设备
CN116188632A (zh) * 2023-04-24 2023-05-30 之江实验室 一种图像的生成方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
CN112400186A (zh) 2021-02-23
CN112400186B (zh) 2023-08-01

Similar Documents

Publication Publication Date Title
WO2020227971A1 (en) Image generation
Voynov et al. Sketch-guided text-to-image diffusion models
US11670071B2 (en) Fine-grained image recognition
CN111488931B (zh) 文章质量评估方法、文章推荐方法及其对应的装置
WO2019075130A1 (en) IMAGE PROCESSING DEVICE AND METHOD
CN111709406B (zh) 文本行识别方法及装置、可读存储介质、电子设备
US11914841B2 (en) Automatic generation of stylized icons
CN113762309B (zh) 对象匹配方法、装置及设备
KR102084782B1 (ko) 적대적 생성 신경망 알고리즘을 기반으로 한 의인화 캐릭터 생성 방법
CN111694959A (zh) 基于面部表情和文本信息的网络舆情多模态情感识别方法及系统
Liu et al. Name your style: An arbitrary artist-aware image style transfer
CN112685582A (zh) 自动生成故事板
CN116958957A (zh) 多模态特征提取网络的训练方法及三维特征表示方法
CN114021646A (zh) 一种图像描述文本确定方法及其相关设备
CN115797706A (zh) 目标检测方法、目标检测模型训练方法及相关装置
CN114995729A (zh) 一种语音绘图方法、装置及计算机设备
Baraheem et al. Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook
CN114677402A (zh) 海报文本布局、海报生成方法及相关装置
CN114913590A (zh) 一种数据的情感识别方法、装置、设备及可读存储介质
CN117058275B (zh) 商品宣传图生成方法、装置、计算机设备及存储介质
Jin et al. Text2poster: Laying out stylized texts on retrieved images
CN114742991A (zh) 海报背景图像选取、模型训练、海报生成方法及相关装置
CN112419249B (zh) 一种特殊服饰图片转化方法、终端设备及存储介质
CN110516024A (zh) 地图搜索结果展现方法、装置、设备和存储介质
US20240201824A1 (en) Automatic generation of stylized icons

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929062

Country of ref document: EP

Kind code of ref document: A1