WO2024130753A1 - Multi-path parallel text-to-image generation method and system - Google Patents

Multi-path parallel text-to-image generation method and system Download PDF

Info

Publication number
WO2024130753A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
model
generation
different
Prior art date
Application number
PCT/CN2022/141736
Other languages
English (en)
French (fr)
Inventor
彭宇新
叶钊达
何相腾
Original Assignee
北京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学 filed Critical 北京大学
Publication of WO2024130753A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00: 2D [Two Dimensional] image generation
    • G06T11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformations in the plane of the image
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053: Scaling based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T3/4076: Super-resolution scaling using the original low-resolution images to iteratively correct the high-resolution images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10024: Color image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20212: Image combination
    • G06T2207/20221: Image fusion; Image merging

Definitions

  • the invention relates to the field of image generation, and in particular to a multi-path parallel text-to-image generation method and system.
  • Text-to-image generation enables computers to automatically generate, from scratch, semantically consistent, realistic and logically coherent image content based on natural-language text descriptions given by users (via speech recognition, image OCR, text input, etc.).
  • Reed et al. (Scott E. Reed, et al., Learning What and Where to Draw, Annual Conference on Neural Information Processing Systems, 2016) proposed a generative adversarial network method based on text manifold interpolation and image-text matching, using text semantic vectors as the input of the generative model, and through image-text matching constraints, the text and the generated image are kept semantically consistent, thus realizing the generation of text to image.
  • the present invention proposes a multi-path parallel text-to-image generation method, which can effectively reduce local distortion and deformation in the generation results by learning different generation parameters and strategies according to the characteristics of different image contents through a parallel generation structure.
  • a multi-path parallel text-to-image generation method comprises the following steps:
  • the pre-trained text-image association model is used to extract the cross-modal semantic features of the text, and the cross-modal semantic features of the text are decoupled through a recurrent neural network to obtain a text conditional vector sequence;
  • the spatial depth information of the generated image is predicted using the spatial depth prediction model, and corresponding weights are assigned to the images generated by the generative network modules of different branches, so that the images generated by the generative network modules of different branches are fused into one image.
  • a discriminant model is constructed to implement adversarial training: by distinguishing between images generated by the image generation model and real paired images, the generation quality of the image generation model is improved; by distinguishing between the spatial depth information of the image predicted by the spatial depth prediction model and the spatial depth information of the extracted real image, the accuracy of the spatial depth prediction model is improved.
  • the text cross-modal semantic features in step (1) are specifically intermediate features of the text encoding module of a pre-trained text-image association model (e.g., CLIP).
  • a deep model based on a recurrent neural network is adopted: the text cross-modal semantic features are input into the recurrent neural network, which outputs as many text conditional vectors as there are branches in the image generation model.
  • the generation network modules of different branches in the image generation model of step (2) adopt a multi-stage image generation model based on StyleGAN, taking the corresponding text condition vector and the image generated in the previous stage as input, and generating images stage by stage from low resolution to high resolution.
  • the image pixels produced by the multi-stage generation model are accumulated in residual form to obtain the final image:
  • img k = Upsample(img k-1) + RGB k
  • img k represents the image generated at the kth level
  • RGB k represents the content generated at the current stage
  • Upsample represents the upsampling operation.
  • step (3) uses the spatial depth prediction model to predict the spatial depth information of the generated image, and then uses the predicted spatial depth information to fuse the images generated by the generative network modules of different branches into one image.
  • n is the number of branches; the fusion method can be formulated as:
  • dhk is the spatial depth information of the k-th level image generated by the spatial depth prediction model
  • FC * is the weight mapping network
  • Cov2D is the convolution operation used to achieve image pixel fusion.
  • the discriminant model aims to distinguish the result of the image generation model from the real paired image-text pair data, and the loss function provided by the discriminant model is:
  • the first term is the loss function of the unconditional vector, which aims to evaluate the quality of image generation
  • the second term is the loss function based on the text conditional vector, which aims to evaluate the semantic consistency between the image and the text.
  • Di represents the image discriminator
  • Dt represents the image discriminator based on the text conditional vector
  • Isa represents the text conditional vector
  • x represents the image sample
  • Ex ⁇ Real represents the expectation when the image sample comes from the real image
  • Ex ⁇ G represents the expectation when the image sample comes from the generated image.
  • the discriminant model aims to distinguish the prediction results of the spatial depth prediction model from the spatial depth information extracted from the real image, and its loss function is:
  • depth represents the spatial depth prediction model
  • D dep represents the image depth discriminator
  • GT represents the depth information extracted based on the image.
  • the present invention proposes a multi-path parallel text-to-image generation system, which comprises:
  • the text feature extraction module is responsible for extracting cross-modal semantic features of text using the pre-trained text-image association model, and inputting them into the recurrent neural network, outputting the same number of text conditional vectors according to the number of branches in the image generation model;
  • the multi-channel image generation module is responsible for inputting different text condition vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches can generate images according to different generation strategies;
  • the image spatial depth prediction module is responsible for using the spatial depth prediction model to predict the spatial depth information of the generated image, assigning corresponding weights to the images generated by the generative network modules of different branches, and fusing the images generated by the generative network modules of different branches into one image.
  • the effect of the present invention is that compared with the existing methods, the present method takes into account the differences in the properties of the generated image content itself, learns different generation parameters and strategies according to the characteristics of different image contents, and can effectively reduce the local distortion and deformation in the image results of the text-to-image generation method.
  • this method learns different generation parameters and strategies according to different image content characteristics through a parallel generation structure.
  • the method introduces the spatial depth information of the image to help the generation model analyze and decouple the image content, providing a basis for the fusion of multi-channel image generation results, and further reducing the probability of local distortion and deformation in the generation process of the generation model.
  • FIG. 1 is a flow chart of the multi-path parallel text-to-image generation method of the present invention.
  • FIG. 2 is a detailed diagram of the network structure of the present invention, where Conv 3x3 represents a convolution operation with a convolution kernel size of 3x3, and AdaIn represents an affine transformation based on the mean and standard deviation of the image.
  • the multi-path parallel text-to-image generation method of the present invention comprises the following steps:
  • the cross-modal semantic features of the text are extracted using a pre-trained text-image association model and input into a recurrent neural network to generate the same number of text conditional vectors according to the number of branches in the generative network, i.e., the image generation model.
  • The different text condition vectors from step (1) are respectively input into the multi-level generation network modules of different branches, and the corresponding images are generated according to the different generation strategies learned.
  • the text conditional vector in step (2) is input into a multi-stage image generation model (e.g., StyleGAN); the image generated at the previous stage and the text conditional vector are used as input to generate the corresponding image, progressing stage by stage from low resolution to high resolution.
  • the image pixels produced by the multi-stage generation model are accumulated in residual form to obtain the final image.
  • img k = Upsample(img k-1) + RGB k
  • img k represents the image generated at the kth level
  • RGB k represents the content generated at the current stage
  • Upsample represents the upsampling operation.
  • the k-th level image pixel information generated by the different modules can be formulated as:
  • dhk is the spatial depth information of the k-th level image generated by the spatial depth prediction model
  • FC * is the weight mapping network
  • Cov2D is the convolution operation used to achieve image pixel fusion.
  • adversarial model training is achieved by constructing a discriminant model.
  • the discriminant model uses paired image-text data to improve the performance of the generative model by distinguishing the difference between the image generated by the image generation model and the real image.
  • the discriminant model aims to distinguish the results of the image generation model from the real image, and its training loss function is:
  • the first term is the loss function of the unconditional vector, which aims to evaluate the quality of image generation
  • the second term is the loss function based on the text conditional vector, which aims to evaluate the semantic consistency between the image and the text.
  • Di represents the image discriminator
  • Dt represents the image discriminator based on the text conditional vector
  • Isa represents the text conditional vector
  • x represents the image sample
  • Ex ⁇ Real represents the expectation when the image sample comes from the real image
  • Ex ⁇ G represents the expectation when the image sample comes from the generated image.
  • For the spatial depth prediction model, the spatial depth information extracted from the real image is used.
  • the discriminant model improves the accuracy of the spatial depth prediction model by distinguishing the prediction results of the spatial depth prediction model from the spatial depth information extracted from the real image.
  • the training loss function is:
  • depth represents the spatial depth prediction model
  • D dep represents the image depth discriminator
  • GT represents the depth information extracted based on the image.
  • the text encoding module based on the cross-modal text-image association model extracts the representation of the user input text, and uses the same method as steps 2 and 3 to generate an image that is semantically consistent with the user input text.
  • In the experiments, the method of this embodiment (the present invention) was compared with existing methods on the CUB dataset.
  • IS is often used to indicate the degree of distinguishability of image content, and the higher the score, the better.
  • For a clear image, the probability that it belongs to a certain class should be very large, and the probabilities that it belongs to other classes should be very small; the clearer the image, the greater the difference between its probability vector and the mean probability vector of all images.
  • FID is often used to indicate the generation quality of image content, and the lower the value, the better.
  • A high-quality generated image has a high degree of similarity to real images in its visual features; the higher the quality of the image, the closer the statistical distribution of its visual features is to the statistical distribution of the visual features of real images.
  • the IS and FID indicators of this method are improved compared with the comparison method, which shows that the method can effectively improve the quality of image generation by learning different generation strategies.
  • the three existing methods do not consider the nature of the generated content itself, and use a unified network structure and parameters to generate different image contents. Since the generation model fails to effectively model the content of some image regions, local distortion and warping exist in the generated images.
  • the present invention achieves better generation effect by learning different generation parameters and strategies according to the characteristics of different image contents through parallel generation structures.
  • Another embodiment of the present invention provides a multi-path parallel text-to-image generation system, comprising:
  • the text feature extraction module is responsible for extracting cross-modal semantic features of text using the pre-trained text-image association model, and inputting them into the recurrent neural network, outputting the same number of text conditional vectors according to the number of branches in the image generation model;
  • the multi-channel image generation module is responsible for inputting different text condition vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches can generate images according to different generation strategies;
  • the image spatial depth prediction module is responsible for using the spatial depth prediction model to predict the spatial depth information of the generated image, assigning corresponding weights to the images generated by the generative network modules of different branches, and fusing the images generated by the generative network modules of different branches into one image.
  • a computer device (a computer, server, smart phone, etc.) comprising a memory and a processor is provided;
  • the memory stores a computer program
  • the computer program is configured to be executed by the processor
  • the computer program includes instructions for executing each step in the method of the present invention.
  • Another embodiment of the present invention provides a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk), wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the steps of the method of the present invention are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a multi-path parallel text-to-image generation method and system. The method comprises the following steps: 1. A pre-trained model is used to extract cross-modal semantic representations of the text, which are input into a recurrent neural network to output a text conditional vector sequence whose length equals the number of branches in the image generation model. 2. The different text conditional vectors are input into the generation network modules of different branches, which generate corresponding images according to different generation strategies. 3. Based on the depth information output by an image spatial depth prediction model, different weights are assigned to the images generated by the different branches, and the branch results are fused. 4. In the training stage, a discriminant model is constructed to carry out adversarial training of the image generation model and the spatial depth prediction model. The present invention learns different generation parameters and strategies for different image content characteristics, and can effectively reduce local distortion and deformation in the image results of text-to-image generation.

Description

A Multi-Path Parallel Text-to-Image Generation Method and System
Technical Field
The present invention relates to the field of image generation, and in particular to a multi-path parallel text-to-image generation method and system.
Background Art
Text-to-image generation enables computers to automatically generate, from scratch, semantically consistent, realistic and logically coherent image content based on natural-language text descriptions given by users (via speech recognition, image OCR, text input, etc.).
In the prior art, some methods obtain visual content related to a text description through retrieval and produce visual content by combination and stitching. For example, Wang et al. proposed the Write-A-Video technique (Miao Wang, et al., Write-A-Video: Computational Video Montage from Themed Text, ACM Transactions on Graphics, 2019), which searches for candidate video shots matching the text entered by the user and automatically assembles and edits a video. Such retrieval-centered generation techniques mainly analyze the similarity between the text description and existing visual content and retrieve similar existing content, so it is difficult for them to meet users' personalized and diversified needs. Another class of methods maps text information into the image space by designing different generation network structures, thereby achieving text-to-image generation. For example, Reed et al. (Scott E. Reed, et al., Learning What and Where to Draw, Annual Conference on Neural Information Processing Systems, 2016) proposed a generative adversarial network method based on text manifold interpolation and image-text matching, which takes text semantic vectors as the input of the generative model and keeps the text and the generated image semantically consistent through image-text matching constraints, realizing text-to-image generation. Zhang et al. (Zizhao Zhang, et al., Photographic Text-to-Image Synthesis With a Hierarchically-Nested Adversarial Network, IEEE Conference on Computer Vision and Pattern Recognition, 2018) proposed a hierarchically structured generative adversarial network that can extend generated low-resolution images to high-resolution images.
However, none of the above methods considers the nature of the generated content itself; they use a unified network structure and parameters to generate different image contents, so the generation model fails to effectively model the content of some image regions, and the generated images exhibit local distortion and warping.
Summary of the Invention
In view of the above problems, the present invention proposes a multi-path parallel text-to-image generation method which, through a parallel generation structure, learns different generation parameters and strategies for different image content characteristics and can effectively reduce local distortion and deformation in the generation results.
To achieve the above objective, the present invention adopts the following technical solution:
A multi-path parallel text-to-image generation method, comprising the following steps:
(1) extracting cross-modal semantic features of the text using a pre-trained text-image association model, and decoupling the cross-modal semantic features through a recurrent neural network to obtain a text conditional vector sequence;
(2) inputting different text conditional vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches generate images according to different generation strategies;
(3) predicting the spatial depth information of the generated image using a spatial depth prediction model, assigning corresponding weights to the images generated by the generation network modules of different branches, and fusing the images generated by the different branches into one image.
Further, in the above method, in the training stage a discriminant model is constructed for adversarial training: by distinguishing images generated by the image generation model from real paired images, the generation quality of the image generation model is improved; by distinguishing the spatial depth information predicted by the spatial depth prediction model from the spatial depth information extracted from real images, the accuracy of the spatial depth prediction model is improved.
Further, in the above method, the cross-modal semantic features of the text in step (1) are specifically the intermediate features of the text encoding module of a pre-trained text-image association model (for example, CLIP). A deep model based on a recurrent neural network is adopted: the cross-modal semantic features are fed into the recurrent neural network, which outputs as many text conditional vectors as there are branches in the image generation model.
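As a point of reference, the sketch below illustrates how such a decoupling step could be wired up. It is not the filed implementation: the use of Hugging Face's CLIPTextModel, the pooled (rather than intermediate) text feature, the GRU design, and all dimensions are assumptions made purely for illustration.

```python
# Illustrative sketch only: decouple a pre-trained text feature into per-branch
# condition vectors with a recurrent network. Model names, dimensions and the GRU
# design are assumptions, not the implementation filed with the patent.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

class ConditionDecoupler(nn.Module):
    def __init__(self, feat_dim=512, cond_dim=256, n_branches=3):
        super().__init__()
        self.n_branches = n_branches
        self.rnn = nn.GRU(input_size=feat_dim, hidden_size=cond_dim, batch_first=True)
        self.proj = nn.Linear(cond_dim, cond_dim)

    def forward(self, text_feat):                      # text_feat: (B, feat_dim)
        # Repeat the text feature once per branch and let the GRU unroll a
        # distinct condition vector for each branch.
        steps = text_feat.unsqueeze(1).repeat(1, self.n_branches, 1)
        hidden, _ = self.rnn(steps)                    # (B, n_branches, cond_dim)
        return self.proj(hidden)

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a small red bird with black wings"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_out = text_encoder(**tokens)
# For simplicity the pooled text feature is used here; the patent text refers to
# intermediate features of the text encoding module.
pooled = text_out.pooler_output                        # (1, 512)
conds = ConditionDecoupler()(pooled)                   # (1, 3, 256): one vector per branch
```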
Further, in the above method, the generation network modules of the different branches in the image generation model of step (2) adopt a multi-stage image generation model based on StyleGAN, which takes the corresponding text conditional vector and the image generated at the previous stage as input and generates images stage by stage from low resolution to high resolution. The image pixels produced by the multi-stage generation model are accumulated in residual form to obtain the final image:
img_k = Upsample(img_(k-1)) + RGB_k
where img_k denotes the image generated at the k-th stage, RGB_k denotes the content generated at the current stage, and Upsample denotes the upsampling operation.
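A minimal sketch of this staged residual accumulation follows. Only the accumulation rule img_k = Upsample(img_(k-1)) + RGB_k is taken from the text; the per-stage blocks, channel counts and resolutions are illustrative assumptions rather than the StyleGAN-based modules actually used.

```python
# Illustrative sketch of staged residual accumulation: img_k = Upsample(img_{k-1}) + RGB_k.
# The stage blocks are stand-ins; only the accumulation rule follows the patent text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageBlock(nn.Module):
    """One generation stage: refines features and emits an RGB residual."""
    def __init__(self, feat_ch=64, cond_dim=256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, feat_ch)
        self.conv = nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1)
        self.to_rgb = nn.Conv2d(feat_ch, 3, kernel_size=1)

    def forward(self, feat, cond):
        feat = F.interpolate(feat, scale_factor=2, mode="nearest")   # move to the next resolution
        feat = F.leaky_relu(self.conv(feat + self.cond_proj(cond)[:, :, None, None]), 0.2)
        return feat, self.to_rgb(feat)                               # features and RGB_k

def generate(stages, init_feat, cond):
    img, feat = None, init_feat
    for stage in stages:
        feat, rgb_k = stage(feat, cond)
        # Residual accumulation: upsample the previous image and add the new content.
        if img is None:
            img = rgb_k
        else:
            img = F.interpolate(img, scale_factor=2, mode="bilinear", align_corners=False) + rgb_k
    return img

stages = nn.ModuleList([StageBlock() for _ in range(4)])             # 4x4 features -> 64x64 image
img = generate(stages, torch.randn(1, 64, 4, 4), torch.randn(1, 256))
print(img.shape)                                                     # torch.Size([1, 3, 64, 64])
```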
Further, in the above method, step (3) uses the spatial depth prediction model to predict the spatial depth information of the generated image, and then uses the predicted depth information to fuse the images generated by the generation network modules of the different branches into one image. For the k-th stage image pixel information generated by the generation network modules of the different branches
[Formula image PCTCN2022141736-appb-000001]
where n is the number of branches, the fusion can be formulated as:
[Formula image PCTCN2022141736-appb-000002]
where dh_k is the spatial depth information of the k-th stage image produced by the spatial depth prediction model, FC_* is the weight mapping network, and Cov2D is the convolution operation used to fuse image pixels.
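The fusion formula itself is filed as an equation image and is not reproduced above. The sketch below shows one plausible reading of the described mechanism, namely per-branch weights derived from the predicted depth through the weight mapping network FC_* followed by a fusing convolution Cov2D; the softmax normalization and all layer shapes are assumptions, not the patented formula.

```python
# Illustrative sketch of depth-guided fusion of per-branch outputs (assumed reading).
import torch
import torch.nn as nn

class DepthGuidedFusion(nn.Module):
    def __init__(self, n_branches=3, depth_ch=1):
        super().__init__()
        # FC_*: maps the predicted depth map to one weight map per branch.
        self.weight_net = nn.Conv2d(depth_ch, n_branches, kernel_size=1)
        # Cov2D: convolution that fuses the weighted branch pixels into one RGB image.
        self.fuse = nn.Conv2d(3 * n_branches, 3, kernel_size=3, padding=1)

    def forward(self, branch_imgs, depth_k):
        # branch_imgs: (B, n_branches, 3, H, W); depth_k: (B, 1, H, W)
        w = torch.softmax(self.weight_net(depth_k), dim=1)       # (B, n, H, W) branch weights
        weighted = branch_imgs * w.unsqueeze(2)                  # weight each branch's pixels
        b, n, c, h, width = weighted.shape
        return self.fuse(weighted.reshape(b, n * c, h, width))   # fused image (B, 3, H, W)

fusion = DepthGuidedFusion()
imgs = torch.randn(2, 3, 3, 64, 64)     # images generated by 3 branches
depth = torch.randn(2, 1, 64, 64)       # predicted spatial depth of the generated image
print(fusion(imgs, depth).shape)        # torch.Size([2, 3, 64, 64])
```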
Further, in the above method, in the training stage, for the image generation model, the discriminant model aims to distinguish the results of the image generation model from the real paired image-text data, and the loss function it provides is:
[Formula image PCTCN2022141736-appb-000003]
where the first term is the unconditional loss function, whose purpose is to evaluate the quality of image generation, and the second term is the loss function based on the text conditional vector, whose purpose is to evaluate the semantic consistency between the image and the text. D_i denotes the image discriminator, D_t denotes the image discriminator based on the text conditional vector, I_sa denotes the text conditional vector, x denotes an image sample, E_(x~Real) denotes the expectation when the image sample comes from real images, and E_(x~G) denotes the expectation when the image sample comes from generated images.
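The loss itself is filed as an equation image. Purely as an illustration of the structure described (an unconditional term for image quality plus a text-conditional term for image-text consistency, taken over real and generated samples), a standard GAN discriminator objective of this kind could be written as follows; this is an assumed form, not the exact patented loss:

```latex
\mathcal{L}_{D} \;=\;
\underbrace{\mathbb{E}_{x\sim \mathrm{Real}}\!\left[\log D_i(x)\right]
+ \mathbb{E}_{x\sim G}\!\left[\log\left(1 - D_i(x)\right)\right]}_{\text{unconditional: image quality}}
\;+\;
\underbrace{\mathbb{E}_{x\sim \mathrm{Real}}\!\left[\log D_t(x, I_{sa})\right]
+ \mathbb{E}_{x\sim G}\!\left[\log\left(1 - D_t(x, I_{sa})\right)\right]}_{\text{conditional: image-text consistency}}
```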
Further, in the above method, in the training stage, for the spatial depth prediction model, the discriminant model aims to distinguish the predictions of the spatial depth prediction model from the spatial depth information extracted from real images, and its loss function is:
[Formula image PCTCN2022141736-appb-000004]
where x denotes an image, depth denotes the spatial depth prediction model, D_dep denotes the image depth discriminator, and GT denotes the depth information extracted from the image.
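As above, the depth loss is filed as an equation image; under the same assumed GAN form, an objective matching the description (real depth maps GT(x) versus predictions depth(x)) could look like:

```latex
\mathcal{L}_{D_{dep}} \;=\;
\mathbb{E}_{x\sim \mathrm{Real}}\!\left[\log D_{dep}\!\left(\mathrm{GT}(x)\right)\right]
+ \mathbb{E}_{x\sim G}\!\left[\log\left(1 - D_{dep}\!\left(\mathrm{depth}(x)\right)\right)\right]
```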
Further, the present invention proposes a multi-path parallel text-to-image generation system, comprising:
a text feature extraction module, responsible for extracting cross-modal semantic features of the text using the pre-trained text-image association model, inputting them into the recurrent neural network, and outputting as many text conditional vectors as there are branches in the image generation model;
a multi-path image generation module, responsible for inputting different text conditional vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches generate images according to different generation strategies;
an image spatial depth prediction module, responsible for predicting the spatial depth information of the generated image using the spatial depth prediction model, assigning corresponding weights to the images generated by the generation network modules of different branches, and fusing the images generated by the different branches into one image.
The effect of the present invention is that, compared with existing methods, the present method takes into account differences in the properties of the generated image content itself, learns different generation parameters and strategies according to the characteristics of different image contents, and can effectively reduce local distortion and deformation in the image results of text-to-image generation.
The method achieves the above effect because it learns different generation parameters and strategies for different image content characteristics through a parallel generation structure. In addition, the method introduces the spatial depth information of the image to help the generation model analyze and decouple the image content, providing a basis for fusing the multi-path generation results and further reducing the probability of local distortion and deformation during generation.
Brief Description of the Drawings
Figure 1 is a flow chart of the multi-path parallel text-to-image generation method of the present invention.
Figure 2 is a detailed diagram of the network structure of the present invention, in which Conv 3x3 denotes a convolution with a 3x3 kernel and AdaIn denotes an affine transformation based on the mean and standard deviation of the image.
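For reference, the AdaIn operation mentioned above is usually the standard adaptive instance normalization: the feature map is normalized channel-wise and then re-scaled and re-shifted with statistics predicted from a conditioning input. A minimal sketch assuming this standard formulation (not code from the filing):

```python
# Minimal AdaIN sketch (standard formulation assumed): normalize each channel of the
# feature map, then re-scale and re-shift it with statistics predicted from the condition.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, feat_ch, cond_dim):
        super().__init__()
        self.affine = nn.Linear(cond_dim, 2 * feat_ch)    # predicts per-channel scale and bias

    def forward(self, feat, cond):
        scale, bias = self.affine(cond).chunk(2, dim=1)   # each (B, C)
        mean = feat.mean(dim=(2, 3), keepdim=True)
        std = feat.std(dim=(2, 3), keepdim=True) + 1e-6
        normalized = (feat - mean) / std
        return scale[:, :, None, None] * normalized + bias[:, :, None, None]

out = AdaIN(64, 256)(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
print(out.shape)   # torch.Size([2, 64, 32, 32])
```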
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
The multi-path parallel text-to-image generation method of the present invention, whose flow is shown in Figure 1, comprises the following steps:
(1) The cross-modal semantic features of the text are extracted using the pre-trained text-image association model and input into the recurrent neural network, which generates as many text conditional vectors as there are branches in the generation network, i.e., the image generation model.
(2) The different text conditional vectors from step (1) are respectively input into the multi-stage generation network modules of the different branches, and the corresponding images are generated according to the different generation strategies learned.
As shown in Figure 2, the text conditional vector in step (2) is input into a multi-stage image generation model (for example, StyleGAN); the image generated at the previous stage and the text conditional vector are used as input to generate the corresponding image, progressing stage by stage from low resolution to high resolution. The image pixels produced by the multi-stage generation model are accumulated in residual form to obtain the final image.
img_k = Upsample(img_(k-1)) + RGB_k
where img_k denotes the image generated at the k-th stage, RGB_k denotes the content generated at the current stage, and Upsample denotes the upsampling operation.
(3) The spatial depth prediction network is used to predict the depth information of the generated image, corresponding weights are assigned to the pixels generated by the different branch modules, and the results of the different branches are fused.
For the k-th stage image pixel information generated by the different modules
[Formula image PCTCN2022141736-appb-000005]
the fusion can be formulated as:
[Formula image PCTCN2022141736-appb-000006]
where dh_k is the spatial depth information of the k-th stage image produced by the spatial depth prediction model, FC_* is the weight mapping network, and Cov2D is the convolution operation used to fuse image pixels.
(4) In the training stage, adversarial model training is carried out by constructing a discriminant model.
Using paired image-text data, the discriminant model improves the performance of the generation model by distinguishing images produced by the image generation model from real images. For the generation model, the discriminant model aims to distinguish the results of the image generation model from real images, and its training loss function is:
[Formula image PCTCN2022141736-appb-000007]
where the first term is the unconditional loss function, whose purpose is to evaluate the quality of image generation, and the second term is the loss function based on the text conditional vector, whose purpose is to evaluate the semantic consistency between the image and the text. D_i denotes the image discriminator, D_t denotes the image discriminator based on the text conditional vector, I_sa denotes the text conditional vector, x denotes an image sample, E_(x~Real) denotes the expectation when the image sample comes from real images, and E_(x~G) denotes the expectation when the image sample comes from generated images.
For the spatial depth prediction model, the spatial depth information extracted from real images is used; the discriminant model improves the accuracy of the spatial depth prediction model by distinguishing its predictions from the spatial depth information extracted from real images, and its training loss function is:
[Formula image PCTCN2022141736-appb-000008]
where x denotes an image, depth denotes the spatial depth prediction model, D_dep denotes the image depth discriminator, and GT denotes the depth information extracted from the image.
(5) In the generation stage, the text encoding module of the cross-modal text-image association model extracts a representation of the user's input text, and the same procedure as in steps (2) and (3) is used to generate an image semantically consistent with the user's input text.
This embodiment uses the CUB dataset for its experiments; the dataset was introduced in "The CALTECH-UCSD Birds-200-2011 Dataset" (C. Wah et al.). The following three methods were tested for comparison:
Existing method 1: the StackGAN method from "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks" (Zhang H et al.).
Existing method 2: the AttnGAN method from "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks" (Xu T et al.).
Existing method 3: the LAFITE method from "LAFITE: Towards Language-Free Training for Text-to-Image Generation" (Zhou Y et al.).
The present invention: the method of this embodiment.
Regarding the evaluation metrics, IS is commonly used to indicate how distinguishable the image content is, with higher scores being better. For a clear image, the probability that it belongs to a certain class should be very large and the probabilities for other classes very small; the clearer the image, the greater the difference between its probability vector and the mean probability vector of all images. FID is commonly used to indicate the generation quality of image content, with lower values being better. A high-quality generated image is highly similar to real images in its visual features; the higher the quality, the closer the statistical distribution of its visual features is to that of real images.
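For reference, these two metrics are standardly defined as follows (definitions from the literature, not from the filing):

```latex
\mathrm{IS} = \exp\!\Big(\mathbb{E}_{x}\big[D_{\mathrm{KL}}\big(p(y\mid x)\,\|\,p(y)\big)\big]\Big),
\qquad
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
+ \mathrm{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

where p(y|x) is the class distribution that an Inception classifier assigns to image x, p(y) is its marginal over the evaluated images, and (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of Inception features computed on real and generated images, respectively.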
As can be seen from Table 1, this method improves on the comparison methods in both IS and FID, which shows that by learning different generation strategies the method can effectively improve the quality of image generation. None of the three existing methods considers the nature of the generated content itself; they use a unified network structure and parameters to generate different image contents. Because the generation model fails to effectively model the content of some image regions, the generated images exhibit local distortion and warping. By learning different generation parameters and strategies for different image content characteristics through a parallel generation structure, the present invention achieves better generation results.
Table 1. Experimental results of each method on the CUB dataset

Method               IS↑     FID↓
Existing method 1    4.04    15.30
Existing method 2    4.36    23.98
Existing method 3    5.97    10.48
Present invention    6.08    8.57
Another embodiment of the present invention provides a multi-path parallel text-to-image generation system, comprising:
a text feature extraction module, responsible for extracting cross-modal semantic features of the text using the pre-trained text-image association model, inputting them into the recurrent neural network, and outputting as many text conditional vectors as there are branches in the image generation model;
a multi-path image generation module, responsible for inputting different text conditional vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches generate images according to different generation strategies;
an image spatial depth prediction module, responsible for predicting the spatial depth information of the generated image using the spatial depth prediction model, assigning corresponding weights to the images generated by the generation network modules of different branches, and fusing the images generated by the different branches into one image.
For the specific implementation of each module, refer to the foregoing description of the method of the present invention.
Another embodiment of the present invention provides a computer device (a computer, server, smart phone, etc.) comprising a memory and a processor, wherein the memory stores a computer program, the computer program is configured to be executed by the processor, and the computer program includes instructions for executing each step of the method of the present invention.
Another embodiment of the present invention provides a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the steps of the method of the present invention are implemented.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass them.

Claims (10)

  1. A multi-path parallel text-to-image generation method, comprising the following steps:
    extracting cross-modal semantic features of the text using a pre-trained text-image association model, and decoupling the cross-modal semantic features through a recurrent neural network to obtain a text conditional vector sequence;
    inputting different text conditional vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches generate images according to different generation strategies;
    predicting the spatial depth information of the generated image using a spatial depth prediction model, assigning corresponding weights to the images generated by the generation network modules of different branches, and fusing the images generated by the different branches into one image.
  2. The method according to claim 1, wherein, in the training stage, a discriminant model is constructed for adversarial training: by distinguishing images generated by the image generation model from real paired images, the generation quality of the image generation model is improved; by distinguishing the spatial depth information of the image predicted by the spatial depth prediction model from the spatial depth information extracted from real images, the accuracy of the spatial depth prediction model is improved.
  3. The method according to claim 1, wherein the cross-modal semantic features of the text are specifically intermediate features of the text encoding module in the pre-trained text-image association model, and a deep model based on a recurrent neural network is adopted, the cross-modal semantic features being input into the recurrent neural network to generate as many text conditional vectors as there are branches in the image generation model.
  4. The method according to claim 1, wherein the generation network modules of different branches in the image generation model adopt a multi-stage image generation model which takes the corresponding text conditional vector and the image generated at the previous stage as input and generates images stage by stage from low resolution to high resolution; the pixels produced by the multi-stage generation model are accumulated in residual form to obtain the final image:
    img_k = Upsample(img_(k-1)) + RGB_k
    where img_k denotes the image generated at the k-th stage, RGB_k denotes the content generated at the current stage, and Upsample denotes the upsampling operation.
  5. The method according to claim 1, wherein, for the k-th stage image pixel information generated by the generation network modules of different branches
    [Formula image PCTCN2022141736-appb-100001]
    the spatial depth prediction model formulates the fusion as:
    [Formula image PCTCN2022141736-appb-100002]
    where dh_k is the spatial depth information of the k-th stage image produced by the spatial depth prediction model, FC_* is the weight mapping network, and Cov2D is the convolution operation used to fuse image pixels.
  6. The method according to claim 1, wherein, in the training stage, for the image generation model, the discriminant model aims to distinguish the results of the image generation model from real images, and the loss function used for training is:
    [Formula image PCTCN2022141736-appb-100003]
    where the first term is the unconditional loss function, whose purpose is to evaluate the quality of image generation; the second term is the loss function based on the text conditional vector, whose purpose is to evaluate the semantic consistency between the image and the text; D_i denotes the image discriminator, D_t denotes the image discriminator based on the text conditional vector, I_sa denotes the text conditional vector, x denotes an image sample, E_(x~Real) denotes the expectation when the image sample comes from real images, and E_(x~G) denotes the expectation when the image sample comes from generated images.
  7. The method according to claim 1, wherein, in the training stage, for the spatial depth prediction model, the discriminant model aims to distinguish the predictions of the spatial depth prediction model from the spatial depth information extracted from real images, and its loss function is:
    [Formula image PCTCN2022141736-appb-100004]
    where x denotes an image, depth denotes the spatial depth prediction model, D_dep denotes the image depth discriminator, and GT denotes the depth information extracted from the image.
  8. A multi-path parallel text-to-image generation system, comprising:
    a text feature extraction module, responsible for extracting cross-modal semantic features of the text using a pre-trained text-image association model, inputting them into a recurrent neural network, and outputting as many text conditional vectors as there are branches in the image generation model;
    a multi-path image generation module, responsible for inputting different text conditional vectors into the generation network modules of different branches in the image generation model, so that the generation network modules of different branches generate images according to different generation strategies;
    an image spatial depth prediction module, responsible for predicting the spatial depth information of the generated image using a spatial depth prediction model, assigning corresponding weights to the images generated by the generation network modules of different branches, and fusing the images generated by the different branches into one image.
  9. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, the computer program is configured to be executed by the processor, and the computer program includes instructions for executing the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the method according to any one of claims 1 to 7 is implemented.
PCT/CN2022/141736 2022-12-23 2022-12-25 Multi-path parallel text-to-image generation method and system WO2024130753A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211664553.8 2022-12-23
CN202211664553.8A CN116128998A (zh) Multi-path parallel text-to-image generation method and system

Publications (1)

Publication Number Publication Date
WO2024130753A1 true WO2024130753A1 (zh) 2024-06-27

Family

ID=86309314

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141736 WO2024130753A1 (zh) 2022-12-23 2022-12-25 一种多路并行的文本到图像生成方法和系统

Country Status (2)

Country Link
CN (1) CN116128998A (zh)
WO (1) WO2024130753A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883528A (zh) * 2023-06-12 2023-10-13 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.) Image generation method and apparatus

Also Published As

Publication number Publication date
CN116128998A (zh) 2023-05-16

Similar Documents

Publication Publication Date Title
Chen et al. Learning spatial attention for face super-resolution
Zhang et al. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization
WO2021223323A1 (zh) 一种中文视觉词汇表构建的图像内容自动描述方法
Zhang A survey of unsupervised domain adaptation for visual recognition
WO2020224405A1 (zh) 图像处理方法、装置、计算机可读介质及电子设备
Shen et al. Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval
CN113159023A (zh) 基于显式监督注意力机制的场景文本识别方法
WO2024130753A1 (zh) 一种多路并行的文本到图像生成方法和系统
CN115222998B (zh) 一种图像分类方法
CN114973222A (zh) 基于显式监督注意力机制的场景文本识别方法
CN110347853B (zh) 一种基于循环神经网络的图像哈希码生成方法
CN115187456A (zh) 基于图像强化处理的文本识别方法、装置、设备及介质
Xiang et al. Deep multimodal representation learning for generalizable person re-identification
CN114463552A (zh) 迁移学习、行人重识别方法及相关设备
Cao et al. Attention where it matters: Rethinking visual document understanding with selective region concentration
Papadimitriou et al. End-to-End Convolutional Sequence Learning for ASL Fingerspelling Recognition.
Ma et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos
Gan et al. GANs with multiple constraints for image translation
CN112528989A (zh) 一种图像语义细粒度的描述生成方法
Huang et al. Target-Oriented Sentiment Classification with Sequential Cross-Modal Semantic Graph
CN116958868A (zh) 用于确定文本和视频之间的相似度的方法和装置
CN113723111B (zh) 一种小样本意图识别方法、装置、设备及存储介质
Parvin et al. Image captioning using transformer-based double attention network
CN115049546A (zh) 样本数据处理方法、装置、电子设备及存储介质
Amiri Parian et al. Are you watching closely? content-based retrieval of hand gestures