CN118691923A - Method, device, computer equipment and storage medium for generating an image of a target theme

Method, device, computer equipment and storage medium for generating an image of a target theme

Info

Publication number
CN118691923A
Authority
CN
China
Prior art keywords
image
text
target
subject
description text
Prior art date
Legal status
Pending
Application number
CN202310335853.XA
Other languages
Chinese (zh)
Inventor
温泉
周智毅
王逸宇
衣景龙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310335853.XA
Publication of CN118691923A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to a method, apparatus, computer device, storage medium, and computer program product for generating an image of a target theme. The method comprises: obtaining a theme image generation model obtained by performing secondary training on a pre-trained model with sample images, where the pre-trained model is used to generate images from text and each sample image contains at least one theme element conforming to the target theme; based on the theme element description text carried by each sample image for describing the theme elements, obtaining text sets each containing theme element description texts of the same type; selecting theme element description texts from at least some of the text sets and combining them to obtain a target description text containing the selected texts; and performing image generation processing with the theme image generation model according to the target description text to obtain a target image matching the target theme. The method can generate high-quality images matching the target theme.

Description

Method, apparatus, computer device and storage medium for generating an image of a target theme

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, computer device, storage medium, and computer program product for generating an image of a target theme.

Background Art

With the development of artificial intelligence, intelligent image generation technology has emerged that can generate images from a user's input text description, and the generated images have relatively fine detail.

Text-to-image generation methods can switch to a specified theme style when generating images. However, producing images in a specified theme style requires a great deal of parameter tuning and trial and error; the quality of the generation process is unstable, and it is difficult to generate high-quality images that conform to the theme style.

Summary of the Invention

In view of this, to address the above technical problems, it is necessary to provide a method, apparatus, computer device, computer-readable storage medium, and computer program product for generating an image of a target theme that can generate, with high quality, images conforming to the theme style.

In a first aspect, the present application provides a method for generating an image of a target theme. The method comprises:

obtaining a theme image generation model obtained by performing secondary training on a pre-trained model with sample images, where the pre-trained model is used to generate images from text and each sample image contains at least one theme element conforming to the target theme;

based on the theme element description text carried by each sample image for describing the theme elements, obtaining text sets each containing theme element description texts of the same type;

selecting theme element description texts from at least some of the text sets, and combining them to obtain a target description text containing the selected theme element description texts; and

performing image generation processing with the theme image generation model according to the target description text to obtain a target image matching the target theme.

In a second aspect, the present application further provides an apparatus for generating an image of a target theme. The apparatus comprises:

a model acquisition module, configured to obtain a theme image generation model obtained by performing secondary training on a pre-trained model with sample images, where the pre-trained model is used to generate images from text and each sample image contains at least one theme element conforming to the target theme;

a text set determination module, configured to obtain, based on the theme element description text carried by each sample image for describing the theme elements, text sets each containing theme element description texts of the same type;

a description text determination module, configured to select theme element description texts from at least some of the text sets and combine them into a target description text containing the selected theme element description texts; and

a target image generation module, configured to perform image generation processing with the theme image generation model according to the target description text to obtain a target image matching the target theme.

In a third aspect, the present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above method for generating an image of a target theme.

In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the above method for generating an image of a target theme.

In a fifth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method for generating an image of a target theme.

With the above method, apparatus, computer device, storage medium, and computer program product, a theme image generation model produced by secondary training of a pre-trained model yields a model that can generate images conforming to the target theme from text. When determining the text to input into the model, the sample images used to train the theme image generation model each contain at least one theme element conforming to the target theme and each carry theme element description text describing those elements. Text sets containing theme element description texts of the same type therefore serve as the building blocks of the model input: by selecting theme element description texts from at least some of the text sets and combining them into a target description text containing the selected texts, the theme image generation model can perform image generation according to the target description text and produce a high-quality target image that differs from the sample images yet closely matches the target theme.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an application environment of a method for generating an image of a target theme in one embodiment;

FIG. 2 is a schematic flow chart of a method for generating an image of a target theme in one embodiment;

FIG. 3 is a schematic diagram of image screening by comparing a target image with different candidate images in one embodiment;

FIG. 4 is a schematic flow chart of acquiring sample images from a target theme video in one embodiment;

FIG. 5 is a schematic diagram of the structure of a theme image generation model in one embodiment;

FIG. 6 is a schematic diagram of the image encoding and decoding process in a theme image generation model in one embodiment;

FIG. 7 is a schematic diagram of the denoising process in a theme image generation model in one embodiment;

FIG. 8 is a schematic diagram of the structure of the Unet model in a theme image generation model in one embodiment;

FIG. 9 is a schematic diagram of the training and application processes of a theme image generation model in one embodiment;

FIG. 10 is a schematic diagram of a layout of segmented semantic objects in one embodiment;

FIG. 11 is a schematic diagram of an image generated according to a layout of semantic objects in one embodiment;

FIG. 12 is a schematic diagram of an image generated under the influence of a depth map in one embodiment;

FIG. 13 is a schematic diagram of different sample images in one embodiment;

FIG. 14 is a schematic diagram of different generated images in one embodiment;

FIG. 15 is a block diagram of an apparatus for generating an image of a target theme in one embodiment;

FIG. 16 is a diagram of the internal structure of a computer device in one embodiment;

FIG. 17 is a diagram of the internal structure of a computer device in one embodiment.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present application and are not intended to limit it.

The technical terms used in this application are defined below:

IP (Intellectual Property): in the context of Internet content, a cultural brand that can be developed in multiple dimensions, characterized by recognizable subject value (a character, a specific story, a story scene) and multiple carriers (novels, comics, animation, TV series, films, real people, virtual avatars, etc.).

Prompt (input text description): a text description from which the model generates a corresponding image. A prompt contains several elements: the object to be drawn (the main subject, usually the foreground), the details of that object (clothing, expression, color, action, facial and body details, etc.), the background, and the viewing angle and distance (angles include front, side, back, etc.; distances include close-up, half-body, full-body, etc.).

The method for generating an image of a target theme provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. A data storage system stores the data that the server 104 needs to process; it can be integrated on the server 104 or placed on a cloud or another server. The data storage system stores a pre-trained model for generating images from text. When the server 104 receives an image generation request for a target theme from the terminal 102, it obtains sample images each containing at least one theme element conforming to the target theme and performs secondary training on the pre-trained model with those sample images to obtain a theme image generation model. It can be understood that in other embodiments the server 104 may train in advance, with sample images of multiple different themes, multiple theme image generation models for those themes; after receiving the request from the terminal 102, the server 104 can then directly select the theme image generation model matching the target theme from the trained models.

After determining the theme image generation model matching the target theme, the server 104 can obtain, based on the theme element description text carried by each sample image for describing the theme elements, text sets each containing theme element description texts of the same type, select theme element description texts from at least some of the text sets, and combine them into a target description text containing the selected texts.

In one embodiment, the target description text may be randomly selected and combined by the server 104 from the text sets. Alternatively, the server 104 may feed the text sets back to the terminal 102, where the user selects theme element description texts from the sets to form a target description text containing the selected texts and returns it to the server 104; the server 104 then performs image generation with the theme image generation model according to the target description text and returns the resulting target image, which matches the target theme, to the terminal 102.

In another embodiment, the server 104 may also feed both the text sets and the theme image generation model back to the terminal 102, so that the terminal 102 obtains the target description text either by random selection and combination from the text sets or in response to a description text selection operation triggered by the user, and then performs image generation with the theme image generation model according to the target description text to obtain a target image matching the target theme.

The terminal 102 may be, but is not limited to, a desktop computer, a laptop, a smartphone, a tablet, an IoT device, or a portable wearable device; IoT devices include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, etc., and portable wearable devices include smart watches, smart bracelets, head-mounted devices, etc. The server 104 may be implemented as an independent server, a server cluster composed of multiple servers, or a cloud server.

In one embodiment, as shown in FIG. 2, a method for generating an image of a target theme is provided. The method can be applied to a computer device, which may be a terminal or a server. For ease of description, the following embodiments take the application of the method to a computer device as an example. The method specifically includes the following steps:

Step 202: obtain a theme image generation model obtained by performing secondary training on a pre-trained model with sample images.

The pre-trained model is a text-to-image model that can generate images from an input text description with relatively fine detail. Text-to-image techniques include the GAN (Generative Adversarial Network) framework, the autoregressive framework, and the diffusion framework. The GAN framework trains an image encoder, an image decoder, and a GAN discriminator jointly, using a CNN (Convolutional Neural Network) structure to capture local features and a ViT structure to capture global features. The autoregressive framework encodes images into discrete sequences so that both images and text are represented as sequences, and then uses an autoregressive language model from NLP (Natural Language Processing) to generate images from text and text from images. The diffusion framework encodes images into a two-dimensional latent space and performs the corresponding diffusion (noise-adding) process and text-conditioned training in that latent space. In this embodiment, the pre-trained model may be a model trained with any one of the above frameworks.

Secondary training is fine-tuning of an already pre-trained model so that it acquires better data processing capabilities. In some embodiments, the same pre-trained model can undergo different secondary trainings for different purposes, so that each resulting model implements a different function. For example, for different IP themes, sample images of each IP theme can be used to separately train the same text-to-image pre-trained model, so that the resulting models generate images of the respective IP themes.

The theme image generation model is obtained by performing secondary training on the pre-trained model with sample images conforming to the target theme, and can generate images conforming to the target theme from input text. It should be noted that, to ensure the generated images fit the target theme well, the input text of the theme image generation model should be text matching that theme.

A sample image is an image that contains at least one theme element conforming to the target theme and carries theme element description text describing those elements. It can be understood that multiple sample images are used for model training. The sample images can share the same image parameters, such as the same size and resolution, so that each carries the same amount of information when converted into vectors during secondary training; this also balances the model's computational load during processing and improves data processing efficiency.

The sample images may come from the same data source or from different ones. Data sources include image web searches and frame extraction from a single video. When the acquired images have different image parameters, their parameters can first be aligned so that they match. For example, when the sample images are for a specified IP theme such as a TV series, image frames containing at least one theme element conforming to the target theme can be selected from the TV series video and annotated with their theme elements, yielding theme element description texts matching each frame and thus the sample images for that IP theme.

A theme element conforming to the target theme is content in the image that is a characteristic element of that theme, such as a character, a specific story, or a story scene of the theme, for example the main characters or particular scenes of a TV series. Taking a costume drama as an example, the main characters may include the male lead A and the female lead B, and the characteristic scenes may be representative scenes from the series, such as widely known famous scenes.

The theme element description text describes the content of the theme elements in a sample image. For example, for a TV series sample image it may cover at least one of: the main character, the character's styling, the character's expression, the background the character is in, and the viewing angle at which the character appears. In some specific embodiments, the styling description can be divided into clothing and hairstyle; for example, descriptions of lead A's clothing include a light yellow-green collared brocade robe, a dark purple silk waistcoat, and a white cotton-linen suit, and descriptions of lead A's hairstyle include half-loose hair and tied-up hair. Expressions include calm, angry, puzzled, etc.; backgrounds include a carriage, an open square, etc.; and viewing angles include head close-up, half-body medium shot, full-body long shot, etc.

In some specific embodiments, in response to a training request for a theme image generation model of a target theme, the computer device can obtain sample images conforming to the target theme and perform secondary training on the text-to-image pre-trained model in real time to obtain the theme image generation model.

In other specific embodiments, the computer device may pre-train theme image generation models for different themes; in response to a request for the theme image generation model of a target theme, it looks up the model matching the target theme among the trained models.

Step 204: based on the theme element description text carried by each sample image for describing the theme elements, obtain text sets each containing theme element description texts of the same type.

Each sample image carries theme element description text describing its theme elements, and the theme elements of each sample image are described from multiple angles. For each description angle, many sample images have a corresponding theme element description text, and the description texts of that angle form one text set. For example, if the theme elements of every sample image are described from six angles, namely main character, character clothing, character hairstyle, character expression, character background, and viewing angle, six text sets of those types can be constructed accordingly. The description texts in a set can be obtained by deduplicating the description texts of that angle across all sample images and taking the result as the contents of that angle's text set, as in the sketch below.
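To make this concrete, the following Python sketch (a minimal illustration; the annotation field names are hypothetical, not names fixed by the application) builds deduplicated per-type text sets from annotated sample images:

```python
from collections import defaultdict

# Hypothetical annotations for two sample images; keys are description
# types (angles), values are the description texts of that type.
samples = [
    {"character": "lead A", "clothing": "brocade robe", "expression": "calm"},
    {"character": "lead B", "clothing": "brocade robe", "expression": "angry"},
]

def build_text_sets(samples):
    """Group description texts by type and deduplicate within each type."""
    text_sets = defaultdict(set)
    for annotations in samples:
        for element_type, text in annotations.items():
            text_sets[element_type].add(text)
    return {t: sorted(s) for t, s in text_sets.items()}

print(build_text_sets(samples))
# {'character': ['lead A', 'lead B'], 'clothing': ['brocade robe'],
#  'expression': ['angry', 'calm']}
```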

It should be noted that, in other embodiments, description texts of external elements beyond the target theme can also be added to the text sets. For example, adding external characters to the main-character text set allows those characters to be blended into the core scenes of the target IP, embedding them in the IP's setting. The added external elements may be ones whose image information can be obtained through a specific channel, or ones whose image information can be extracted from user-uploaded images; for instance, the main-character text set could be extended with the names of famous figures widely known on the Internet, or the names of people in uploaded photos.

Step 206: select theme element description texts from at least some of the text sets, and combine them to obtain a target description text containing the selected theme element description texts.

The text sets from which theme element description texts are selected may be all of the constructed text sets or only a subset of them. For example, if six text sets were constructed, the selection may draw from all six, or from five, four, and so on. The target description text may contain multiple theme element description texts selected from the text sets, or a combination of selected theme element description texts and object element description texts of external objects. Object element description texts of external objects can be added in advance to the corresponding text set obtained in step 204 so that they can be selected directly from it. It can be understood that in other embodiments they can instead be selected from a text set containing only external-object description texts; the specific way of obtaining them is not limited here.

In some specific implementations, the computer device can randomly select theme element description texts from at least some of the text sets and combine them into a target description text containing the selected texts. The computer device can also display the text sets on a display interface and, in response to the user's selection of theme element description texts in some of the sets, combine the selected description texts into the target description text. A minimal sketch of the random route is given below.
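The following sketch assembles a target description text (prompt) by random selection; the set names and the joining format are illustrative assumptions, not the application's exact scheme:

```python
import random

# Per-type text sets as built above; contents are illustrative.
text_sets = {
    "character": ["lead A", "lead B"],
    "clothing": ["brocade robe", "white cotton-linen suit"],
    "expression": ["calm", "angry"],
    "background": ["ancient pavilion", "open square"],
    "view": ["half-body medium shot", "full-body long shot"],
}

def build_target_prompt(text_sets, use_types=None):
    """Pick at most one description per chosen type and join them."""
    use_types = use_types or list(text_sets)
    picks = [random.choice(text_sets[t]) for t in use_types]
    return ", ".join(picks)

print(build_target_prompt(text_sets))
# e.g. "lead A, brocade robe, calm, open square, half-body medium shot"
```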

In some specific applications, taking theme elements that include main characters as an example: when there is one main character, at most one description text is selected from each text set, to avoid conflicting descriptions that would degrade the quality of the generated image. When there are two or more main characters, the number of description texts selected from each text set may be one, or any other number not exceeding the number of main characters.

Accordingly, when there are two or more main characters and one description text is selected from each text set, the characters in the generated image will have the same or similar appearance for the aspect that text describes. When there are N main characters and N description texts are selected from each text set, the N characters in the generated image can each have a different appearance for that aspect.

Step 208: perform image generation processing with the theme image generation model according to the target description text to obtain a target image matching the target theme.

Specifically, the target description text is the input to the theme image generation model. Because the model was obtained by secondary training of the pre-trained model on sample images carrying theme element description texts, and the target description text is a combination of description texts drawn from the text sets those sample images produced, the computer device can use the theme image generation model to perform image generation according to the target description text and obtain a high-quality target image matching the target theme.

In some embodiments, the theme image generation model is obtained by fine-tuning a diffusion-framework pre-trained model on sample images of the target theme. Data processing in a diffusion pre-trained model comprises a diffusion process and a denoising process, so the fine-tuned theme image generation model has both as well. The diffusion process gradually turns a data sample (the latent variable obtained by encoding an image) into random noise by adding noise. The denoising process uses a Unet model to learn the noise added at each step, making the difference between the predicted noise and the true noise as small as possible. Denoising starts from white noise; through noise estimation over many steps the image noise becomes smaller and smaller, and a high-definition image is finally sampled. A generic sketch of one such training step follows.
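As an illustration of this objective, the following sketch shows one generic diffusion training step in the style of the Hugging Face diffusers API; a diffusers-style Unet and noise scheduler are assumed, and this is not the application's exact training code:

```python
import torch
import torch.nn.functional as F

def training_step(unet, latents, text_emb, scheduler):
    """One step: noise a latent at a random timestep, predict the noise."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)  # diffusion (forward) process
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)  # minimize predicted-vs-true noise gap
```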

Specifically, in addition to the target description text, the computer device can obtain a random seed and input both into the theme image generation model. The random seed drives a random number generator that produces a random latent variable, and the target description text is encoded by a text encoder into a text vector. From the text vector and the random latent variable, the image latent variable is reconstructed and then decoded into an image, yielding a high-quality target image matching the target theme. A hedged inference sketch is given below.
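The following sketch runs seeded text-to-image inference with the Hugging Face diffusers pipeline; the checkpoint path is a hypothetical stand-in for a theme model fine-tuned as described above:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/theme-finetuned-model",  # hypothetical fine-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "lead A, brocade robe, calm, open square, half-body medium shot"
generator = torch.Generator(device="cuda").manual_seed(42)  # the random seed

# The pipeline encodes the prompt into a text vector, denoises a seeded
# random latent over multiple steps, and decodes the latent into an image.
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
image.save("target_image.png")
```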

With the above method for generating an image of a target theme, a theme image generation model produced by secondary training of a pre-trained model yields a model that can generate images conforming to the target theme from text. When determining the text to input into the model, the sample images used to train the theme image generation model each contain at least one theme element conforming to the target theme and each carry theme element description text describing those elements. Text sets containing theme element description texts of the same type therefore serve as the building blocks of the model input: by selecting theme element description texts from at least some of the text sets and combining them into a target description text containing the selected texts, the theme image generation model can perform image generation according to the target description text and produce a high-quality target image that differs from the sample images yet closely matches the target theme.

Next, the following embodiments describe how the sample images used for model training are obtained.

In some embodiments, the sample images can be obtained from a video. The specific process includes: obtaining a target video conforming to the target theme and extracting frames from it to obtain multiple candidate image frames whose scenes differ; performing theme element recognition on each candidate image frame to determine preferred image frames containing at least one theme element; and, based on the theme elements, annotating the preferred image frames with theme element description text to obtain the sample images.

The target video conforming to the target theme may be a video of a specific IP, such as a TV series, film, anime, or animation. It is a video presented around the target theme and has at least one of a specific character, a specific story, and a story scene, any of which can serve as a theme element. Taking a target video with a specific character as an example, the theme element may be the protagonist of the video.

A target video consists of many video frames. To train the model effectively, frames can be extracted from the target video, keeping only the better ones as sample images. In a specific embodiment, the frame rate of a video file is 30 frames/s and an episode lasts roughly 40-120 minutes, so on average 70k-210k images could be extracted per file; however, since the image content changes very little between consecutive frames, only frames with relatively large scene changes need to be extracted. With this method, a 40-minute episode generally yields only 2k-4k images. This effectively filters the video frames, improves the usefulness of the sample images, and in turn improves the training effect, shortens the training time needed to reach the desired result, and reduces the data processing resources consumed on the computer device used for training.

A scene characterizes the main content of an image frame. Frames whose scenes differ may differ in one or more of background, characters, viewing angle, character styling, and so on. Selecting multiple candidate frames with differing scenes through frame extraction makes the content of the frames used as sample images substantially different, improving their training effect.

Theme element recognition is the process of determining whether an image frame contains theme elements. In practice, a frame of the target video may contain one theme element, two, or none. During frame screening, frames containing no theme element are discarded so that every remaining frame contains at least one theme element, as with the preferred image frames obtained in this embodiment after removing candidate frames without theme elements.

Theme element description text annotation is the process of describing the theme elements in an image frame with text. The description may cover the main object (the subject of the image, usually the foreground), the image background, the viewing angle and distance, and some details (clothing, expression, color, action, facial and body details, etc.). The main object may be a character in the IP video, described by the character's name, for example Zhang Xiaoming, the protagonist of TV Series A. The image background may be a scene from the IP video, such as a building, a natural landscape, or a plot scene, for example the ancient pavilions, ancient temples, or famous flag-bearing scene in TV Series A. Clothing details cover clothing unique to the IP video, for example the leads' costumes in TV Series A described in natural text: a light yellow-green collared brocade robe, a blue cotton-linen suit, a blue-and-white checked cotton-linen suit, a dark purple silk waistcoat, a white cotton-linen suit. Expression details cover the characters' expressions, classified as calm, serious, happy, angry, puzzled, etc. Based on such descriptions, one or more text descriptions are produced for each frame, forming paired training data with the frame, such as the sample images carrying theme element description text in the above embodiments.

In some specific embodiments, when a theme element is a specific person, each theme element in a frame can be described in terms of the main character, clothing, hairstyle, expression, background, and viewing angle. When a theme element is a specific static object, each theme element can be described in terms of the object name, the object's background, and the viewing angle.

In this embodiment, the computer device obtains a target video conforming to the target theme and extracts frames to obtain multiple candidate frames whose scenes differ, achieving a preliminary filtering of similar frames; it then performs theme element recognition on each candidate frame to determine preferred frames containing at least one theme element, discarding invalid frames without theme elements; finally, based on the theme elements, it annotates the preferred frames with theme element description text, obtaining high-quality sample images conveniently and efficiently. It can be understood that in other embodiments the order of these steps can be rearranged as needed.

The frame extraction process is described in detail below; it can be understood that in other embodiments it can be implemented in other ways, which are not limited here. In some of these embodiments, extracting frames from the target video to obtain multiple candidate frames whose scenes differ includes:

splitting the target video into frames with the same pixel arrangement; for two of those frames, computing, from the pixel granularity difference at each identical pixel position, the sum of the absolute values of the pixel granularity differences over all positions; and, when the ratio of that sum of absolute values to the total pixel granularity value of the frame exceeds a scene change threshold, determining the two frames to be candidate frames whose scenes differ.

Pixel arrangement refers to the relative positions of the pixels that make up a frame. Since the frames are split from the same target video by the same processing, they share the same image attributes, including the same pixel arrangement. Pixel granularity is a parameter describing the content of each pixel, such as its RGB value or gray value. For two frames being compared, the computer device takes the difference of the pixel granularity values at each identical position to obtain a per-position difference, which may be positive or negative; to prevent positive and negative values from canceling out, it accumulates the absolute values of the differences over all positions to obtain a sum of absolute values characterizing the difference between the two frames.

Whether a scene difference exists between two frames is determined by comparing the ratio of the sum of absolute values to the frame's total pixel granularity value against the scene change threshold. In a specific embodiment, for ease of explanation, the frames are simplified to single-channel 3x3 images, shown in FIG. 3 as the target image, candidate image 1, and candidate image 2. Comparing the target image with candidate image 1: computing the pixel granularity difference at every pixel gives a sum of absolute differences of 20, the total pixel granularity of the target image is 44, and the ratio is 20/44 = 0.45. With the scene change threshold set to 0.3, i.e. requiring the change ratio to exceed 0.3, candidate image 1 satisfies the condition, and the target image and candidate image 1 are determined to be two images whose scenes differ.

Similarly, as shown in FIG. 3, comparing the target image with candidate image 2: the sum of the absolute pixel granularity differences is 4, the total pixel granularity of the target image is 44, and the ratio is 4/44 = 0.09. With the scene change threshold at 0.3, candidate image 2 does not satisfy the condition.

Further, when a candidate image does not satisfy the condition against the target image, it can be discarded and the next image compared against the same target image; when a candidate image does satisfy the condition, it becomes the new target image for comparison with the subsequent images. By judging scene changes over the frames in order, the frames can be filtered efficiently and accurately while keeping the number of comparisons low, improving the quality of the sample images. A sketch of this procedure follows.
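The following OpenCV sketch illustrates the sequential scene-change extraction under the stated 0.3 threshold, using grayscale intensity as the pixel granularity (an assumption; RGB values would work analogously):

```python
import cv2
import numpy as np

def extract_candidate_frames(video_path, threshold=0.3):
    """Keep only frames whose scene changes enough from the current target."""
    cap = cv2.VideoCapture(video_path)
    candidates, target = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int64)
        if target is None:
            target, candidates = gray, [frame]
            continue
        # Sum of absolute per-pixel differences vs. the total pixel
        # granularity value of the current target frame.
        diff = np.abs(gray - target).sum()
        if diff / max(target.sum(), 1) > threshold:
            candidates.append(frame)
            target = gray  # the accepted frame becomes the new target
    cap.release()
    return candidates
```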

In some embodiments, performing theme element recognition on each candidate frame to determine preferred frames containing at least one theme element includes: obtaining the attribute information and content information of each candidate frame; and discarding candidate frames for which at least one of the attribute information and content information fails the theme element recognition conditions, to obtain preferred frames containing at least one theme element.

The attribute information of a frame covers its properties within the target video, such as picture clarity and whether it is an invalid frame unsuitable as a sample image, for example an opening-credits frame, closing-credits frame, advertisement frame, or scene-transition frame. Whether a frame is invalid can be determined from pre-configured attribute flags, or by identifying the start and end time points of credits, advertisements, and transitions in the target video and comparing the frame's timestamp against them. In other embodiments, frame attributes can also be determined with image recognition tools, including credits recognition tools and advertisement recognition tools. Content information refers to what the frame contains, such as whether the person shown is a main character, whether the scene is recognizable, and whether the person is fully within the frame.

For an image frame to be used as a sample image, its attribute information and content information must satisfy the subject-element recognition conditions simultaneously. The subject-element recognition conditions may specifically include attribute-information screening conditions and content-information screening conditions. Only when the attribute information of a frame satisfies the attribute-information screening conditions and its content information satisfies the content-information screening conditions can the preferred image be guaranteed to contain at least one subject element, and only then is the frame judged to be a preferred image frame that satisfies the subject-element recognition conditions.

In a specific application, an IP video may include opening and closing credits, advertisements, and transition images; in addition, the plot may include supporting characters, unrecognizable scenes, frames with low picture clarity, and frames in which characters are captured incompletely, none of which are suitable as training corpus. The computer device therefore obtains the attribute information and content information of each candidate image frame, screens the two kinds of information separately, and discards every candidate image frame for which at least one of them fails the subject-element recognition condition, ensuring that the resulting preferred image frames contain at least one subject element.

In some embodiments, the subject elements include object elements that conform to the target subject; annotating the preferred image frames with subject-element description text based on the subject elements to obtain sample images includes:

For each preferred image frame, locating the centre point of the object element in the frame to obtain a located position; cropping the frame around that position, in accordance with the size condition of the sample images, to obtain a cropped preferred image frame; and annotating the cropped frame with subject-element description text to obtain a sample image.

Here, an object element is a subject element that exists in the form of a core object, for example a leading character in a television series or a main virtual character in an animation. Centre-point localization is the process of determining and fixing the centre position of the object element within the preferred image frame. The size condition of the sample images is the size that images input to the model are required to satisfy. Cropping the preferred image frame around the located position yields a cropped frame in which the object element is displayed effectively.

In some embodiments, the frames extracted from an IP video are generally widescreen 1920x1080 landscape images, whereas the input required for training images is smaller (e.g. 512x512 or 768x768). Directly scaling the images would lose picture clarity and leave the main elements with insufficient detail resolution. By locating the object element and cropping an image of the target size centred on it, the above embodiment improves the quality of the sample images while ensuring that they all share the same size.
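An illustrative sketch of the centre-based cropping step follows. It assumes the object element's centre has already been located (for example by a detector, which the source does not specify) and clamps the crop window to the frame borders; names and defaults are assumptions.

```python
import numpy as np

def center_crop_around(frame, center_xy, crop_size=512):
    """Crop a crop_size x crop_size window centred on the located object.

    frame: HxWxC numpy array (e.g. a 1080x1920x3 video frame).
    center_xy: (x, y) position of the object element's centre point.
    The window is shifted back inside the frame borders when necessary.
    """
    h, w = frame.shape[:2]
    x, y = center_xy
    half = crop_size // 2
    left = int(np.clip(x - half, 0, max(w - crop_size, 0)))
    top = int(np.clip(y - half, 0, max(h - crop_size, 0)))
    return frame[top:top + crop_size, left:left + crop_size]
```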

In a specific embodiment, as shown in Figure 4, obtaining the sample images may include: obtaining a video file that matches the target subject; computing the sum of the absolute values of the pixel-granularity differences between two consecutive frames and comparing the ratio of that sum to the pixel-granularity value of the target image against a preset threshold, thereby performing frame extraction on the video file; then recognizing the subject elements and judging image quality in order to filter out opening and closing credits, advertisements, and transition images, as well as frames showing supporting characters, unrecognizable scenes, low-clarity pictures, or incompletely captured characters, all of which are unsuitable as training corpus; then locating the subject element in each remaining frame and cropping an image of the target size centred on it; and finally annotating each resulting image with text. The text annotation describes in words the subject elements appearing in the image, including the main object (the subject of the image, generally the foreground), the image background, the shooting viewpoint and distance, and some detail descriptions (clothing, expression, colour, action, facial and body details, etc.).

Next, one possible implementation of determining the target text combination based on the text sets is introduced through the following embodiments.

In some embodiments, the text sets include an open text set and closed text sets. The open text set includes object-element description texts describing the object elements among the subject elements, as well as object-element description texts describing external object elements. External object elements are object elements that do not conform to the target subject. For example, taking main characters as the object elements, the open text set describes the main characters; besides the main characters within the IP domain, it may also cover famous people from the outside world, or people appearing in amateur photos uploaded by users. Taking "TV Series A" as an example, the characters within the IP domain include protagonist A, protagonist B, protagonist C, and so on. Famous people from the outside world can be any celebrity, such as a well-known film actor or singer; user-uploaded amateur photos generally show non-famous people from the outside world, for example the person in a portrait photo taken by the user with a camera.

A closed text set is a text set constructed from the subject-element description texts of the subject elements. In some embodiments, the element types corresponding to the closed text sets may differ from that of the open text set; for example, when the open text set corresponds to the object elements, the closed text sets correspond to the other subject elements. Specifically, the closed text sets may include a clothing text set: [light yellow-green collared brocade robe, blue cotton-linen suit, blue-and-white checked cotton-linen suit, dark purple silk waistcoat, white cotton-linen suit]; a hairstyle text set: [half-loose hair, tied hair]; an expression text set: [calm, serious, surprised, happy, angry, puzzled]; and a viewpoint text set: [head close-up, half-body medium shot, full-body long shot].

Further, selecting subject-element description texts from at least some of the text sets and combining them into a target description text containing the selected texts includes: selecting an object-element description text from the open text set; and combining that object-element description text with the subject-element description texts respectively selected from at least some of the closed text sets to obtain the target text combination.

Selecting the object-element description text from the open text set determines the object that should appear in the generated target image. Because the open text set includes description texts for external object elements in addition to those for the object elements among the subject elements, external famous people or ordinary people can conveniently be made the subject of the target image and embedded through text-to-image generation, fusing the external person with the core scenes of the target subject and thereby embedding them into the target subject's scenes.

In this embodiment, when the selected object-element description text refers to an external object element, the subject image generation model can process an image of that external object element as the initial feature of the latent variable, thereby fusing the external object element effectively with the rest of the target subject's content and broadening the application scenarios of the subject image generation model.
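The combination of the open text set and the closed text sets can be sketched as follows; the set contents mirror the examples above, and the function name and output format are illustrative assumptions.

```python
import random

# Illustrative text sets; the closed sets mirror the clothing / hairstyle /
# expression / viewpoint examples given above, the open set lists subjects.
OPEN_SET = ["protagonist A", "famous actor X", "user-uploaded person"]
CLOSED_SETS = {
    "viewpoint": ["head close-up", "half-body medium shot", "full-body long shot"],
    "hairstyle": ["half-loose hair", "tied hair"],
    "clothing": ["blue cotton-linen suit", "dark purple silk waistcoat"],
    "expression": ["calm", "serious", "happy"],
}

def build_target_description():
    """Pick one subject from the open set and one entry from each closed set,
    then combine them into a single target description text."""
    parts = [f"character: {random.choice(OPEN_SET)}"]
    for element_type, texts in CLOSED_SETS.items():
        parts.append(f"{element_type}: {random.choice(texts)}")
    return ", ".join(parts)

print(build_target_description())
```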

Next, the subject image generation model is introduced through the following embodiments. The subject image generation model is the result of fine-tuning the pre-trained model on a training corpus of a specific IP subject, and it has the same model structure as the pre-trained model.

In one embodiment, the model structure shared by the subject image generation model and the pre-trained model is introduced first. As shown in Figure 5, the model includes a control domain, a latent-variable domain, and an image domain. The image domain includes an image encoder and an image decoder, which connect the image domain with the latent-variable domain. The image encoder encodes and compresses an image vector of dimension [Channel, Height, Width] in the image domain into a latent variable of dimension [LatentChannel, LatentHeight, LatentWidth]. Typically the image is RGB-encoded, Channel is 3, and Height and Width are the height and width of the original image; LatentChannel is the number of channels of the latent-variable domain, for example 4, and LatentHeight and LatentWidth are obtained by shrinking the original image by a fixed ratio (scale_factor, a power of 2 such as 4, 8, or 16): LatentHeight = Height / scale_factor, LatentWidth = Width / scale_factor. To ensure that the latent height and width are integers, the input image is generally aligned, for example through image resize and pad operations, so that its width and height are divisible by the fixed ratio scale_factor.

The image decoder performs the inverse of the image encoder's processing and is responsible for restoring a latent variable of the latent-variable domain into an image vector of the image domain. For example, the image decoder can receive a latent variable of dimension [LatentChannel, LatentHeight, LatentWidth] from the latent-variable domain, restore it into an image vector of dimension [Channel, Height, Width], and then generate the corresponding image.

In one embodiment, image encoding and decoding can be implemented with a VAE framework (Variational Auto-Encoder), whose structure is shown in Figure 6. The image encoder (Encoder) encodes the image vector of the input image into a latent variable: it estimates the mean E(z) and variance V(z) of the Gaussian distribution of each dimension of the latent space, and then samples from the Gaussian distribution based on the per-dimension mean and variance to obtain the latent variable. The image decoder (Decoder) decodes the latent variable back into an image vector.
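A minimal sketch of this VAE sampling step follows, assuming (a common convention, not stated in the source) that the encoder outputs the per-dimension mean and log-variance:

```python
import numpy as np

def sample_latent(mean, log_var, rng=np.random.default_rng()):
    """Reparameterised sampling: z = E(z) + sqrt(V(z)) * eps, eps ~ N(0, 1).

    mean, log_var: per-dimension Gaussian parameters estimated by the encoder,
    e.g. arrays of shape [LatentChannel, LatentHeight, LatentWidth].
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps
```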

The control domain includes a text encoder for converting natural text into a text vector of fixed dimension. Specifically, the text encoder can adopt a transformer structure and proceed in the following steps: the natural text is turned into an ID sequence through tokenization and vocabulary lookup, and the ID sequence is then encoded by a multi-layer transformer into a sequence vector matrix of dimension [Seq_Len, HiddenSize], where Seq_Len is the sequence length and HiddenSize is the dimension of the hidden vectors. In some specific embodiments, the text encoder can be obtained from an open-source pre-trained model, including CLIPTextEncoder, BertTextEncoder, T5TextEncoder, and the like.

Data processing in the latent-variable domain includes a diffusion process and a denoising process. The diffusion process gradually turns a data sample (the latent variable obtained by VAE-encoding an image) into random noise by adding noise. The specific formula is as follows:

X_t = α_t · X_{t-1} + β_t · ε_t,   ε_t ~ N(0, 1)

where X_{t-1} is the latent variable of the previous step, α_t and β_t are preset noise weights, and ε_t is randomly sampled Gaussian noise. The initial latent variable undergoes a diffusion (noise-adding) process of more than 1000 steps; the noise weight is small in the early steps and larger in the later steps, and may specifically follow a fixed weight schedule.
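The forward diffusion recursion can be sketched as follows; the noise-weight schedules are left as inputs, since the source only states that they are preset and follow a fixed distribution:

```python
import numpy as np

def diffuse(x0, alphas, betas, rng=np.random.default_rng()):
    """Run the forward diffusion X_t = alpha_t * X_{t-1} + beta_t * eps_t.

    alphas, betas: preset noise-weight schedules, one value per step
    (more than 1000 steps in the description above).
    """
    x = x0
    for alpha_t, beta_t in zip(alphas, betas):
        eps_t = rng.standard_normal(x.shape)   # eps_t ~ N(0, 1)
        x = alpha_t * x + beta_t * eps_t
    return x
```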

As shown in Figure 7, the denoising process uses a Unet model to learn the noise added at each diffusion step, so that the difference between the estimate X'_{t-1} and the true X_{t-1} is as small as possible. The input to the denoising process is white noise; through multi-step noise estimation the noise in the image becomes progressively smaller, until a high-definition image is finally sampled. The loss function for training the Unet is the mean-squared error between the noise predicted at step t and the true noise: Loss = E[ || ε_t - ε_θ(X_t, t) ||² ].

By minimizing the difference between the noise estimated at time t and the true noise, the difference between the estimate X'_{t-1} and the true X_{t-1} is minimized. Through t denoising steps, the original image of X at time t = 0 is finally restored. The model used in the denoising process is a Unet structure; as shown in Figure 8, it includes a group of down-samplers, a group of up-samplers, and a middle layer of dimension-preserving samplers. The down-samplers can use a CNN + transformer structure, as can the up-samplers; the middle-layer samplers have the same structure as the up- and down-samplers except that no dimension scaling is performed.
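A minimal sketch of the training objective follows, with the Unet stood in by a placeholder callable; the mean-squared error on the predicted noise follows the description above, while the interface is an assumption:

```python
import numpy as np

def denoising_loss(unet_predict, x_t, t, true_eps):
    """One training step's objective: MSE between the noise predicted by the
    Unet at step t and the noise actually added during diffusion, so that the
    reconstructed X'_{t-1} stays close to the true X_{t-1}.

    unet_predict: callable (x_t, t) -> predicted noise; a placeholder
    signature standing in for the Unet model, not the patent's interface.
    """
    predicted_eps = unet_predict(x_t, t)
    return np.mean((predicted_eps - true_eps) ** 2)
```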

When the pre-trained model is fine-tuned on the training corpus of a specific IP subject, the training framework mainly trains the text encoder of the control domain and the Unet module of the denoising process. The other two parts of the overall framework, the image encoder and the image decoder, use a pre-trained VAE and only perform inference without participating in training; the diffusion process is a fixed multi-step procedure of adding Gaussian noise and likewise does not participate in training.

Next, the application process in which the subject image generation model generates images is introduced. It should be noted that the framework used during application differs from the framework used during training: since there is no input image during application, the encoding and diffusion processes that operate on an input image during training are not included; instead, a random seed is used to directly sample a random latent variable in the latent-variable domain.

In one embodiment, performing image generation according to the target description text through the subject image generation model to obtain a target image matching the target subject includes:

converting the target description text input from the control domain into a text vector; performing denoising on the white noise corresponding to the random latent variable of the latent-variable domain, under the guidance of the text vector, so as to reconstruct an image latent variable; and decoding the image latent variable in the image domain to obtain the target image matching the target subject.

Specifically, as shown in Figure 9, the computer device inputs the target description text through the control domain of the subject image generation model; the text encoder of the control domain converts the target description text into a text vector of the latent-variable domain, generating the image latent variable Z_0 corresponding to the text vector. A random seed, through a random-number generator, produces random Gaussian white noise as the random latent variable of the latent-variable domain; based on this random latent variable, a multi-step denoising process reconstructs the image latent variable Z_0 under the guidance of the text vector to obtain Z'_0; the image decoder of the image domain then produces the image vector, generating the target image. In this embodiment, by generating random latent variables and denoising them under the guidance of the text vector, the subject image generation model can effectively reconstruct the image latent variable and achieve fast, high-quality image generation.
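The application-time pipeline of Figure 9 can be sketched as follows; the encoder, Unet, and decoder are placeholders, and the per-step update rule is deliberately simplified (a real sampler such as DDPM or DDIM uses scheduled coefficients):

```python
import numpy as np

def generate(text_encoder, unet_predict, image_decoder, prompt,
             latent_shape=(4, 64, 64), steps=50, rng=np.random.default_rng()):
    """Sketch of the application-time pipeline: encode the prompt, sample a
    random latent from a random seed, denoise step by step under the guidance
    of the text vector, then decode the reconstructed latent into an image.

    All three callables are placeholders for the control-domain text encoder,
    the denoising Unet, and the image-domain decoder.
    """
    text_vec = text_encoder(prompt)
    z = rng.standard_normal(latent_shape)       # random Gaussian white noise
    for t in reversed(range(steps)):            # multi-step denoising
        predicted_eps = unet_predict(z, t, text_vec)
        z = z - predicted_eps / steps           # simplified update rule
    return image_decoder(z)                     # image vector of the target image
```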

During application of the subject image generation model, by inputting text related to the subject IP, the model can generate images that match the style of the subject IP, are consistent with the text description, and exhibit great diversity; that is, the generated images are not identical copies of the training data. To further improve the quality of the images generated by the subject image generation model, the following embodiments introduce, from different aspects, the ways in which this application improves image quality. It should be understood that image generation can be implemented through each embodiment alone or in combination with other embodiments, which is not limited here.

In some embodiments, the computer device can improve the quality of the generated image by weighting the text description. Specifically, in some of these embodiments, converting the target description text input from the control domain into a text vector includes:

for the target description text input from the control domain, obtaining weight data of the key description texts within the target description text; and, based on that weight data, applying vector up-weighting to the key description texts during vector conversion, to obtain the text vector corresponding to the target description text.

Specifically, a subject image generation model fine-tuned with the standard training process and model framework can exhibit weak control over some features, which manifests as generated images that are inconsistent with the text description. To solve this technical problem, in this embodiment the key features are up-weighted in the text-vector domain, ensuring that the key features carry higher weight than the other texts during image generation. A concrete implementation is to fuse the weights with the text during text construction, and to parse and apply the weights accordingly during text encoding.

For example, in a specific application, the target description text is "character: protagonist A, viewpoint: (half-body medium shot:1.1), hairstyle: tied hair, clothing: (blue cotton-linen suit:1.2)". The notation (subject-element description text:weight) identifies the text portions to be weighted, so that different weights can be applied to different texts; text without brackets defaults to a weight of 1.0.

In this embodiment, during application of the subject image generation model, the weight data of the key description texts in the target description text input from the control domain is determined according to the text format of the target description text; then, based on that weight data, the key description texts are vector up-weighted while the text encoder performs vector conversion, yielding the text vector corresponding to the target description text. This effectively strengthens the expression of the key features in the generated image and thus improves image quality.
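A minimal sketch of parsing the (text:weight) notation is given below; the regular expression and the default weight of 1.0 follow the example above, while the function name is illustrative. The resulting per-segment weights would then be multiplied onto the corresponding token embeddings during encoding.

```python
import re

WEIGHTED = re.compile(r"\(([^:()]+):([0-9.]+)\)")

def parse_prompt_weights(prompt):
    """Split a prompt such as
    'character: protagonist A, viewpoint: (half-body medium shot:1.1)'
    into (text, weight) pairs; unbracketed text defaults to weight 1.0."""
    pairs, last = [], 0
    for m in WEIGHTED.finditer(prompt):
        plain = prompt[last:m.start()].strip(" ,")
        if plain:
            pairs.append((plain, 1.0))
        pairs.append((m.group(1), float(m.group(2))))
        last = m.end()
    tail = prompt[last:].strip(" ,")
    if tail:
        pairs.append((tail, 1.0))
    return pairs

print(parse_prompt_weights(
    "character: protagonist A, viewpoint: (half-body medium shot:1.1)"))
```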

In some embodiments, the computer device can improve the quality of the generated image by fusing the vectors of different layers of the text encoder. Specifically, in some of these embodiments, the control domain includes a text encoder that converts text into vectors, and the text encoder includes at least two network layers; for example, the text encoder can be a multi-layer transformer model.

Further, converting the target description text input from the control domain into a text vector includes: performing vector conversion on the input target description text through each network layer of the text encoder in turn, obtaining the output features of every network layer; and fusing the output features of a target network layer of the text encoder with the output features of the last network layer to obtain the text vector corresponding to the target description text, where the target network layer is a network layer of the text encoder that improves the text-control capability.

Specifically, repeated experiments and study have shown that different layers of the text encoder differ in semantic-control capability and in the clarity of the final image. By identifying the target network layer of the text encoder that improves text-control capability and fusing its output features with the output features of the last layer, where the fusion can be a weighted interpolation, the resulting text vector exerts better semantic control during image generation.

Taking a transformer model as the text encoder as an example, the penultimate layer of the transformer can improve the control exerted by the text. The text vectors of the penultimate layer and the last layer of the transformer model are fused by interpolation, specifically as follows:

text_emb = alpha * text_emb(L-1) + (1 - alpha) * text_emb(L)

where text_emb(L-1) is the text vector of the penultimate layer, of dimension [Seq_Len, HiddenSize]; text_emb(L) is the text vector of the last layer, whose dimension matches text_emb(L-1), so linear interpolation fusion is possible; and alpha is the linear-interpolation weight, a decimal in [0, 1], with higher values giving more weight to the penultimate layer.

In this embodiment, by fusing features from different layers of the text encoder, a vector with better semantic-control capability is merged into the encoder's output layer, so that the text vector produced by the text encoder exerts better semantic control during image generation, effectively improving the quality of the generated image.
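The interpolation fusion can be sketched in a couple of lines; hidden_states stands for the list of per-layer transformer outputs, and the default alpha is an arbitrary illustrative value:

```python
def fuse_text_layers(hidden_states, alpha=0.5):
    """Interpolate the penultimate and last transformer layers:
    text_emb = alpha * text_emb(L-1) + (1 - alpha) * text_emb(L).

    hidden_states: list of per-layer outputs (numpy arrays or similar),
    each of dimension [Seq_Len, HiddenSize].
    """
    return alpha * hidden_states[-2] + (1 - alpha) * hidden_states[-1]
```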

In some embodiments, the computer device can improve the quality of the generated image by adding a negative description and weighting it. Specifically, in some of these embodiments, the input text of the control domain further includes negative description text. Negative description text is a textual description of content that should not appear in the image. For example, to raise the quality of image generation, descriptors of low-quality images such as "low quality, low resolution" can be added to the negative description text.

For the noise latent variable predicted by the denoising process of the subject image generation model, a guidance-free weighting method can be used to fuse the estimated noise of the input text with the estimated noise of the negative description text. During model training, the negative description text corresponds to the unconditional case with no input text. Further, the two estimated noises can specifically be fused by the following formula:

noise_predict = noise_uncond + scale * (noise_text - noise_uncond)

where noise_predict is the fusion result, noise_uncond is the estimated noise of the negative description text, noise_text is the estimated noise of the input text, and scale controls the amplification of the difference between noise_text and noise_uncond. The higher the value, the more the generated image is biased towards the input text and away from the noise_uncond text, thereby keeping the negatively described content from appearing.
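The fusion formula translates directly into code; the default scale of 7.5 is an illustrative value only, not taken from the source:

```python
def fuse_guidance(noise_text, noise_uncond, scale=7.5):
    """Guidance-free weighting from the formula above:
    noise_predict = noise_uncond + scale * (noise_text - noise_uncond).

    A higher scale pushes the generated image towards the input text and away
    from the negative description.
    """
    return noise_uncond + scale * (noise_text - noise_uncond)
```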

Further, performing denoising on the white noise corresponding to the random latent variable of the latent-variable domain under the guidance of the text vector to reconstruct the image latent variable includes:

taking the estimated noise of the negative description text as the fusion object during denoising and fusing it with the estimated noise of the target description text to obtain a fused estimated noise; and denoising the white noise corresponding to the random latent variable of the latent-variable domain according to the fused estimated noise, under the guidance of the text vector, so as to reconstruct the image latent variable.

Specifically, when performing the denoising process through the subject image generation model, the computer device first identifies whether the input text of the control domain includes negative description text; the target description text and the negative description text can be marked with different identifiers so that they can be distinguished. After identifying the negative description text, the computer device obtains its estimated noise and fuses it with the estimated noise of the target description text to obtain the fused estimated noise, so that the content of the negative description is better removed through the noise, reducing the probability that the content represented by the negative description appears in the generated image and improving the quality of the generated image.

In some embodiments, the computer device can improve the quality of the generated image by controlling when different texts take effect at different stages of the denoising process. Specifically, in some of these embodiments, the denoising process includes a first stage and a second stage that occur in sequence. The first stage can be the early phase of denoising and the second stage the late phase; the dividing point between the two can be set according to actual needs and is not limited here. The target description text includes subject description text and detail description text, where the subject description text can relate to the main character and the layout, and the detail description text can relate to hairstyle, clothing, and the like.

Repeated experiments and study of the denoising process have shown that the early denoising stage mainly establishes the composition and draws the main elements, while the late stage completes the local details. Based on this finding, decoding the image latent variable in the image domain to obtain the target image matching the target subject further includes:

in the first stage, masking the detail description text and decoding the subject-image latent variable of the subject description text to obtain a subject image matching the target subject; in the second stage, masking the subject description text and decoding the detail-image latent variable of the detail description text to obtain a detail image; and rendering the detail image into the corresponding regions of the subject image to obtain the target image matching the target subject.

Here, mask coverage means temporarily shielding the detail description text or the subject description text so that the subject image generation model does not attend to the masked text during denoising.

Specifically, during the early image-generation phase corresponding to the denoising process, the computer device lets only the subject description text relating to the main character and the layout take effect and disables the detail description text through mask coverage, ensuring that more of the image's attention in the early stage goes to the subject description text. In the late generation phase the opposite is done: the texts controlling large areas, such as the main character and the layout, are masked so that the detail description text can more easily control generation, ensuring that more of the image's attention in the late stage goes to the detail description text. Working together, the two stages further strengthen the control exerted by the input text and the fineness of the detail-related imagery.
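A deliberately simplified sketch of the two-stage switching follows; the source masks texts at the attention level, whereas this illustration switches whole prompt strings, and the 0.5 split point is an assumption:

```python
def active_prompt(step, total_steps, subject_text, detail_text, split=0.5):
    """Two-stage text control: in the first (early) denoising stage only the
    subject/layout text is effective and the detail text is masked out; in
    the second (late) stage the roles are swapped."""
    if step < total_steps * split:   # first stage: layout and main subject
        return subject_text
    return detail_text               # second stage: local details
```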

In some embodiments, the computer device can improve the quality of the generated image through semantic segmentation of the text, controlling which objects are generated in which regions. Specifically, in some of these embodiments, the image generation method for the target subject further includes: semantically segmenting the target description text and, based on semantic-object layout configuration parameters, determining a layout map specifying where each segmented semantic object appears in the image; and determining, for each semantic object, the influence weight of the attention mechanism over each layout region of the layout map.

Here, semantic segmentation of the text refers to recognizing the entity objects in the text and splitting the recognition results into semantic objects. Through semantic segmentation, the objects to be depicted in different regions of the image can be set in advance of generation; during generation, by influencing the attention weights of different text fragments over different regions, the goal of generating different objects in different regions can be achieved. The semantic-object layout configuration parameters determine the layout of the target image for the semantic objects, producing the semantic-object layout map, in which different semantic objects carry different influence weights; the specific weight values can be obtained from the configuration parameters.

Further, decoding the image latent variable in the image domain to obtain the target image matching the target subject includes: decoding, in the image domain, the image latent variable obtained based on the influence weights, to obtain a target image in which the semantic objects are laid out according to the layout map.

In a specific embodiment, as shown in Figure 10, the input text is: "a girl in a white dress with long hair, beside her a boy with a red scarf, tree, sky, grass, highly detailed, best quality". Through a configuration file, the following layout map of the semantically segmented objects can be generated: the white region is "a girl in a white dress with long hair", the black region is "a boy with a red scarf", the green region is "tree", the blue region is "sky", and the brown region is "grass". The influence weight of "a girl in a white dress with long hair" on the white region is 1.0, that of "a boy with a red scarf" on the black region is 1.4, that of "tree" on the green region is 1.2, that of "sky" on the blue region is 0.2, and that of "grass" on the brown region is 0.2. Further, based on the semantic-object layout configuration parameters and the input target description text, the generated image is as shown in Figure 11, ensuring that the different semantic objects appear in the different regions into which the layout map was segmented.
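Building the per-region influence maps from such a layout can be sketched as follows; representing the layout as an integer id map and applying the weights as multiplicative attention biases are illustrative assumptions:

```python
import numpy as np

def region_weight_maps(layout, influence):
    """Build per-object attention weight maps from a semantic layout.

    layout: HxW integer array, each value an object id (the coloured regions
    of the layout figure); influence: dict mapping object id -> weight, e.g.
    {1: 1.0, 2: 1.4, 3: 1.2, 4: 0.2, 5: 0.2} for the girl / boy / tree /
    sky / grass regions of the example. Returns a dict of HxW float maps used
    to bias each text fragment's attention towards its own region.
    """
    return {obj_id: np.where(layout == obj_id, w, 0.0)
            for obj_id, w in influence.items()}
```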

In this embodiment, controlling the objects generated in different regions through semantic segmentation effectively improves the accuracy of where the semantic objects are generated and achieves a reasonable layout of the different semantic objects in the generated image, which in turn effectively improves the quality of the generated image.

In some embodiments, the sample images further include depth maps. The computer device can improve the quality of the generated image through a control method based on depth segmentation. Specifically, depth segmentation introduces the depth maps of the training images during training, so that the black-and-white channel of a depth map can influence where content is placed in the finally generated image.

During application of the subject image generation model, the method further includes: obtaining a single-channel grayscale image of the same size as the target image to be generated, whose black and white channel values locate the foreground and background portions of the target image respectively. As shown in the left part of Figure 12, when the subject image generation model generates an image, a pre-selected alpha image is input in addition to the target description text. The alpha image is a single-channel grayscale image of the same size as the image to be generated: regions with alpha 1 (pixel value 255) are foreground, regions with alpha 0 (pixel value 0) are background, and regions between 0 and 1 (pixel values 0-255) are semi-transparent.

Correspondingly, during image generation, performing denoising on the white noise corresponding to the random latent variable of the latent-variable domain under the guidance of the text vector to reconstruct the image latent variable includes: constructing an initial latent variable from the white noise corresponding to the random latent variable and the black-and-white channel of the single-channel grayscale image; and denoising based on the initial latent variable, under the guidance of the text vector, so as to reconstruct the image latent variable.

The role of the grayscale image's black-and-white channel is to influence whether the subject elements of the target description text are rendered in the foreground or the background of the image. In a specific application, the alpha image channel and the random noise together form the initial latent variable, so that the initial latent variable and the input target description text act jointly to generate an image conforming to the target subject. Specifically, as shown in the right part of Figure 12, in the image generated from the input text "a pineapple standing in front of the lectern", the generated target object "pineapple" is confined neatly to the foreground portion of the depth map, which effectively improves the quality of the generated image.
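A minimal sketch of building the initial latent variable from the alpha image and random noise follows; the down-sampling method and the mixing coefficient are assumptions, since the source does not specify the exact combination rule:

```python
import numpy as np

def initial_latent(alpha_image, latent_shape, rng=np.random.default_rng(),
                   mix=0.5):
    """Combine the single-channel alpha map with random noise to build the
    initial latent variable, so the foreground/background split of the alpha
    image influences where the subject is generated.

    alpha_image: HxW uint8 grayscale (255 = foreground, 0 = background),
    assumed at least as large as the latent resolution.
    """
    c, lh, lw = latent_shape
    h, w = alpha_image.shape
    # nearest-neighbour downsample of the alpha map to the latent resolution
    small = alpha_image[::max(h // lh, 1), ::max(w // lw, 1)][:lh, :lw] / 255.0
    noise = rng.standard_normal(latent_shape)
    return noise + mix * small[None, :, :]   # broadcast over latent channels
```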

This application further provides an application scenario that applies the above method for generating an image of a target subject. Specifically, the method is applied in this scenario as follows.

Specifically, this solution starts from a pre-trained model under the Diffusion framework and finetunes it on data of a specific IP, solving the problem that the Diffusion framework cannot stably generate images in a specific IP style. Techniques adapted to the specific IP style are incorporated into several key steps, including training-corpus construction, model training and tuning, and text-conditioned generation inference, so that the resulting model can stably and reliably generate high-quality images matching the IP's content style from the input text.

Through the model's learning of the IP's core scenes and core features, external celebrities or ordinary people can be made the subject of the picture and embedded through text-to-image generation, fusing the external person with the IP's core scenes and thereby embedding them into the IP scenes. A user can also upload an image of themselves and, using the latent-variable encoding of that image as the initial condition, complete an IP-style transfer of the input image. In addition, training on the IP's limited material allows the model to learn the IP's core elements and, by combining features, to generate large amounts of related material for distribution on social media in support of the IP's promotion.

The model consists of the three parts described in the embodiments above: the control domain, the latent-variable domain, and the image domain, with the same structures, the same diffusion and denoising formulation, and the same fine-tuning setup, in which only the control-domain text encoder and the denoising Unet are trained while the pre-trained VAE image encoder and decoder perform inference only and the fixed diffusion process likewise does not participate in training.

During model application, the goal is to randomly generate, from an input Prompt, an image consistent with that prompt's description; the generation pipeline is the one described above for Figure 9: the target description text entered through the control domain is encoded into a text vector, a random seed supplies random Gaussian white noise as the random latent variable of the latent-variable domain, the multi-step denoising process reconstructs the image latent variable Z'_0 under the guidance of the text vector, and the image decoder of the image domain produces the image vector of the target image. By generating random latent variables and denoising them under the guidance of the text vector, the model effectively reconstructs the image latent variable and achieves fast, high-quality image generation.

During model training, the training corpus constructed consists of images related to the subject IP; the specific acquisition process includes the following.

Frames are extracted from the video files. The frame rate of a video file is 30 frames/s and one episode runs roughly 40-120 minutes, so on average 70k-210k frames could be extracted per file. However, because the image changes very little between consecutive frames of a video file, only frames with comparatively large scene changes need to be extracted; with this method, 2k-4k images are extracted from a 40-minute episode. Specifically, the frames can be compared in sequence: for the two frames under comparison, the computer device takes the difference between the pixel-granularity values at identical pixel positions to obtain a per-position pixel-granularity difference, and accumulates the absolute values of these differences into a sum that characterizes the difference between the two frames.

In addition, IP videos may include opening and closing credits, advertisements, and transition shots, and the plot may include supporting characters, unrecognizable scenes, low-definition images, and frames in which characters are cut off. None of these are suitable as training material, and they can be treated as invalid images; image recognition tools, including credits recognition, advertisement recognition, and an image clarity model, can be used to filter them out.

Furthermore, the frames extracted from IP videos are generally 1920x1080 widescreen images, whereas the required training input is smaller. Directly scaling the images would lose clarity and leave insufficient detail resolution for the main elements. The subject element therefore needs to be located, and a crop of the target size taken centered on it.
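Such a subject-centered crop might look like the sketch below; the 512-pixel target size and the clamping behavior at frame borders are assumptions for illustration:

```python
import numpy as np

def center_crop_on_subject(img: np.ndarray, cx: int, cy: int,
                           size: int = 512) -> np.ndarray:
    """Crop a size x size window centered on the located subject element,
    clamping the window so it stays inside the frame.
    Assumes size <= both frame dimensions."""
    h, w = img.shape[:2]
    half = size // 2
    x0 = min(max(cx - half, 0), w - size)  # clamp horizontally
    y0 = min(max(cy - half, 0), h - size)  # clamp vertically
    return img[y0:y0 + size, x0:x0 + size]
```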

Each cropped image is annotated with text that describes the subject elements appearing in it: the main object (the subject of the image, generally the foreground), the image background, the shooting perspective and distance, and detail descriptions (clothing, expression, color, action, facial and body details, etc.). The main object can be a character in the IP video, described by that character's name, e.g., the protagonist Zhang Xiaoming in "TV Series A". The image background can be a scene in the IP video, such as a building, a natural landscape, or a plot setting, e.g., the ancient pavilions, ancient temples, and the famous flag-bearing scene in "TV Series A". Detail description, clothing: clothing unique to the IP video, described in natural language; for example, the protagonists' outfits in "TV Series A": light yellow-green collar brocade robe, blue cotton-linen suit, blue-white checkered cotton-linen suit, dark purple silk waistcoat, white cotton-linen suit. Detail description, expression: the character's expression, classified as calm, serious, happy, angry, confused, etc. Based on these descriptions, one or more text descriptions are produced for each image frame, forming paired training data with the frame.

As shown in FIG. 13, taking "TV Series A" as an example, the following training data may be produced. For the left image in FIG. 13, the annotated text is: [Character: Protagonist A, Perspective: head close-up, Clothing: blue cotton-linen suit, Hairstyle: tied-up hair, Action: sitting in the carriage looking out, Expression: calm, Background: carriage]. For the right image in FIG. 13, the annotated text is: [Character: Protagonist A, Perspective: full-body long shot, Clothing: blue cotton-linen suit, Hairstyle: half-loose hair, Action: standing and gazing into the distance, Expression: serious, Background: open square].

Based on the textual descriptions for the IP, a closed text set can be constructed for each attribute as an attribute word list. Taking "TV Series A" above as an example, the attribute word lists can be built as follows:

For example, clothing text set: [light yellow-green collar brocade robe, blue cotton-linen suit, blue-white checkered cotton-linen suit, dark purple silk waistcoat, white cotton-linen suit]; hairstyle text set: [half-loose hair, tied-up hair]; expression text set: [calm, serious, surprised, happy, angry, confused]; perspective text set: [head close-up, half-body medium shot, full-body long shot].
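A sketch of how these closed attribute word lists and the open character field could be combined into a target description text; the vocabulary is transcribed from the example above, while the random sampling strategy is an assumption:

```python
import random

# Closed attribute word lists for the example "TV Series A" IP.
VOCAB = {
    "clothing": ["light yellow-green collar brocade robe", "blue cotton-linen suit",
                 "blue-white checkered cotton-linen suit",
                 "dark purple silk waistcoat", "white cotton-linen suit"],
    "hairstyle": ["half-loose hair", "tied-up hair"],
    "expression": ["calm", "serious", "surprised", "happy", "angry", "confused"],
    "perspective": ["head close-up", "half-body medium shot", "full-body long shot"],
}

def sample_prompt(character: str) -> str:
    """Pick one description from each closed set and combine it with the
    open-set character name into a target description text."""
    parts = [f"character: {character}"]
    parts += [f"{attr}: {random.choice(words)}" for attr, words in VOCAB.items()]
    return ", ".join(parts)

# e.g. sample_prompt("Protagonist A")
```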

When the model is applied, the text description fed to the model can draw its content from both open text descriptions and the closed attribute word lists. Open text descriptions are generally used for the main character: besides characters within the IP, this can be a well-known figure from the outside world, or a person shown in an amateur photo uploaded by a user. Taking "TV Series A" as an example, characters within the IP include Protagonist A, Protagonist B, Protagonist C, and so on. Well-known outside figures can be any celebrity, such as a famous film actor or singer; an uploaded amateur photo generally depicts a non-famous person, e.g., the person in a portrait photo the user took with a camera.

From these text-input elements, a free-form input text can be constructed and used as the input for model inference. For example, "Character: Protagonist A, Perspective: half-body medium shot, Hairstyle: half-loose hair, Clothing: blue cotton-linen suit" produces the image shown in FIG. 14.

During model application, to further improve generation quality and preserve the text's control over the output, the present scheme also adds several text-vector encoding and fine-tuning modules and adjusts the denoising process accordingly.

A subject image generation model fine-tuned with the standard training procedure and framework controls some features weakly, which manifests as generated images that are inconsistent with the text description. To solve this, in this embodiment key features are up-weighted in the text vector domain, ensuring that they carry more weight than the rest of the text during image generation. Concretely, the weight is fused with the text when the prompt is constructed, and parsed and applied during text encoding. In one application, the target description text is "Character: Protagonist A, Perspective: (half-body medium shot:1.1), Hairstyle: tied-up hair, Clothing: (blue cotton-linen suit:1.2)". The (subject element description text:weight) syntax marks the text spans to be weighted, so different spans can receive different weights; text without parentheses defaults to a weight of 1.0. Based on the weight data of the key description text, the text encoder applies vector up-weighting to the key description text during vector conversion, yielding the text vector for the target description text. This strengthens the expression of key features in the generated image and thus improves image quality.
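One possible way to parse the (text:weight) syntax before encoding is sketched below; the regular expression and the piece-wise output format are assumptions, not this application's actual parser:

```python
import re

# Matches "(text:1.2)" spans; everything else gets the default weight 1.0.
WEIGHTED = re.compile(r"\(([^:()]+):([0-9.]+)\)")

def parse_weighted_prompt(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (text, weight) pieces per the (text:weight) syntax."""
    pieces, pos = [], 0
    for m in WEIGHTED.finditer(prompt):
        if m.start() > pos:                      # unweighted stretch before match
            pieces.append((prompt[pos:m.start()], 1.0))
        pieces.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        pieces.append((prompt[pos:], 1.0))
    return pieces

# parse_weighted_prompt("character: A, perspective: (half-body medium shot:1.1)")
# -> [("character: A, perspective: ", 1.0), ("half-body medium shot", 1.1)]
```

The per-piece weights would then scale the corresponding token embeddings during encoding.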

Different layers of the text encoder differ in their semantic control ability and in the clarity of the final image. Taking a transformer text encoder as an example, the penultimate layer of the transformer can improve the text's control ability. The text vectors of the penultimate and last layers of the transformer are therefore fused by interpolation, as follows:

text_emb = alpha * text_emb(L-1) + (1 - alpha) * text_emb(L)

Here text_emb(L-1) is the text vector of the penultimate layer, with dimensions [Seq_Len, HiddenSize]; text_emb(L) is the text vector of the last layer, with the same dimensions, so the two can be fused by linear interpolation; and alpha is the interpolation weight, a value in [0, 1], with higher values giving the penultimate layer more weight. Fusing features from different layers of the text encoder folds a vector with stronger semantic control into the encoder's output layer, so the encoded text vector exerts better semantic control during image generation and effectively improves the quality of the generated images.
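In code, this fusion is a single interpolation over the encoder's per-layer hidden states; the sketch below assumes those states are available as a list, and alpha = 0.7 is only an illustrative value:

```python
import torch

def fuse_text_layers(hidden_states: list[torch.Tensor],
                     alpha: float = 0.7) -> torch.Tensor:
    """Linearly interpolate the penultimate and last encoder layers.

    Each tensor has shape [Seq_Len, HiddenSize]; alpha in [0, 1] is the
    weight of the penultimate layer (higher alpha gives it more influence,
    per the formula above)."""
    return alpha * hidden_states[-2] + (1.0 - alpha) * hidden_states[-1]
```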

The early stage of denoising lays out the composition and draws the main elements; the later stage completes the local details. The computer device therefore controls the subject image generation model so that, in the early generation stage corresponding to denoising, only the subject description text concerning the main character and layout takes effect, while the detail description text is disabled by a mask, keeping the image's attention on the subject description text early on. In the later generation stage the treatment is reversed: the large-area control text for the main character and layout is masked out so that the detail description text more easily steers generation, keeping the later stage's attention on the detail description text. The two stages together further strengthen the text's control over generation and the fineness with which the detail text is rendered in the image.
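A rough sketch of such a two-stage token mask follows; the half-way switch point and the boolean keep-mask representation are assumptions, since the application does not specify when or how the mask flips:

```python
import torch

def staged_token_mask(is_detail_token: torch.Tensor,
                      step_idx: int, num_steps: int,
                      switch_frac: float = 0.5) -> torch.Tensor:
    """Keep-mask over prompt tokens at a given denoising step.

    Early steps (layout phase): keep only subject/layout tokens, mask details.
    Late steps (detail phase): keep only detail tokens, mask subject/layout.
    switch_frac = 0.5 is an assumed changeover point."""
    in_early_phase = step_idx < int(num_steps * switch_frac)
    return ~is_detail_token if in_early_phase else is_detail_token
```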

The computer device can also improve image quality through semantic segmentation, controlling which objects are generated in which regions. With semantic segmentation, the objects to be depicted in different regions of the image can be fixed before generation; during generation, adjusting the weight that each text fragment contributes to the attention mechanism in different regions achieves the goal of generating different objects in different regions. Semantic object layout configuration parameters determine the target image's layout for each semantic object, giving a semantic object layout map, in which different semantic objects carry different influence weights. For example, for the input text "1 girl with long hair in a white dress, 1 boy with a red scarf beside her, tree, sky, grass, highly detailed, best quality", a configuration file can generate the following semantically segmented layout map: the white region is "1 girl with long hair in a white dress", the black region "1 boy with a red scarf", the green region "tree", the blue region "sky", and the brown region "grass". The influence weight of "1 girl with long hair in a white dress" on the white region is 1.0; of "1 boy with a red scarf" on the black region, 1.4; of "tree" on the green region, 1.2; of "sky" on the blue region, 0.2; and of "grass" on the brown region, 0.2. Based on the layout configuration parameters and the input target description text, the generated image is as shown in FIG. 11, ensuring that each semantic object appears in its own region of the layout map.
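The per-region weighting could be represented as per-phrase spatial maps like the following sketch; the phrase list and weights are transcribed from the example above, while the data layout and the neutral weight of 1.0 outside each region are assumptions:

```python
import numpy as np

# Per-phrase region assignment and influence weight, from the example above.
REGION_WEIGHTS = {
    "girl with long hair in a white dress": ("white", 1.0),
    "boy with a red scarf": ("black", 1.4),
    "tree": ("green", 1.2),
    "sky": ("blue", 0.2),
    "grass": ("brown", 0.2),
}

def build_attention_bias(region_map: np.ndarray,
                         region_ids: dict[str, int]) -> dict[str, np.ndarray]:
    """Per-phrase spatial weight map: the phrase's weight inside its region,
    1.0 (neutral) elsewhere. These maps would scale cross-attention scores."""
    biases = {}
    for phrase, (region, w) in REGION_WEIGHTS.items():
        mask = region_map == region_ids[region]
        biases[phrase] = np.where(mask, w, 1.0)
    return biases
```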

In addition, when the subject image generation model generates an image, a pre-selected alpha image can be input alongside the target description text. The alpha image is a single-channel grayscale image of the same size as the target image. The alpha image channel and random noise together form the initial latent variable, so that the initial latent variable and the input target description text jointly produce an image conforming to the target subject. In the image generated from the input text "a pineapple standing in front of a podium", the generated target object "pineapple" is well confined to the foreground part of the depth map, which effectively improves the quality of the generated image. This scheme trains the model specifically for the content style and scenes of a particular IP, such as a video IP, so that after finetune training the model can stably and controllably generate images matching the IP's content style according to user needs. A model trained and deployed following this scheme can also serve as a base model for other techniques, such as image-to-image generation or personalized customization.
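One plausible way to fold the alpha image into the starting latent is sketched below; the downsampling, the rescaling to [-1, 1], and the blending coefficient are all assumptions, as the application only states that the alpha channel and random noise jointly form the initial latent variable:

```python
import torch
import torch.nn.functional as F

def initial_latent_with_alpha(alpha_img: torch.Tensor,
                              latent_shape=(1, 4, 64, 64),
                              noise_scale: float = 0.8,
                              seed: int = 0) -> torch.Tensor:
    """Blend a single-channel alpha guide into the starting latent.

    alpha_img: [H, W] grayscale in [0, 1], same aspect as the target image.
    The mixing rule (scaled noise plus a broadcast, downsampled guide) is an
    assumption for illustration."""
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(latent_shape, generator=g)
    guide = F.interpolate(alpha_img[None, None], size=latent_shape[-2:],
                          mode="bilinear", align_corners=False)
    return noise_scale * noise + (1.0 - noise_scale) * (guide * 2.0 - 1.0)
```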

It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times, and whose order of execution need not be sequential: they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the present application further provides an image generation apparatus for a target subject, for implementing the image generation method for a target subject described above. The solution the apparatus provides is similar to that recorded in the above method, so for the specific limitations of the one or more apparatus embodiments below, reference may be made to the limitations of the method above, which are not repeated here.

In one embodiment, as shown in FIG. 15, an image generation apparatus for a target subject is provided, comprising a model acquisition module 1502, a text set determination module 1504, a description text determination module 1506 and a target image generation module 1508, wherein:

The model acquisition module 1502 is used to obtain a subject image generation model produced by secondary training of a pre-trained model on sample images; the pre-trained model is used to generate images from text; each sample image contains at least one subject element conforming to the target subject.

The text set determination module 1504 is used to obtain, based on the subject element description text carried by each sample image for describing its subject elements, text sets containing subject element description texts of the same type.

The description text determination module 1506 is used to select subject element description texts from at least some of the text sets and combine them into a target description text containing the selected subject element description texts.

The target image generation module 1508 is used to perform image generation processing according to the target description text through the subject image generation model, obtaining a target image that matches the target subject.

In some embodiments, the apparatus further includes a sample image acquisition module, used to acquire a target video conforming to the target subject and extract frames from it, obtaining multiple candidate image frames whose scenes differ; to perform subject element recognition on each candidate image frame, determining preferred image frames that contain at least one subject element; and to annotate the preferred image frames with subject element description text based on the subject elements, obtaining sample images.

In some embodiments, the sample image acquisition module is further used to split the target video into frames, obtaining multiple image frames with the same pixel arrangement; for two of these frames, to determine, from the pixel granularity difference at each shared pixel arrangement position, the sum of the absolute values of the pixel granularity differences over all positions; and, when the ratio of this sum to the frame's total pixel granularity value exceeds a scene change threshold, to determine the two frames to be candidate image frames whose scenes differ.

In some embodiments, the sample image acquisition module is further used to obtain the attribute information and content information of each candidate image frame, and to discard candidate frames for which at least one of the attribute information and content information fails the subject element recognition condition, obtaining preferred image frames containing at least one subject element.

In some embodiments, the sample image acquisition module is further used, for each preferred image frame, to locate the center point of the object element in the frame, obtaining a positioning position; to crop the frame centered on that position according to the sample image size condition, obtaining a cropped preferred frame; and to annotate the cropped frame with subject element description text, obtaining a sample image.

In some embodiments, the text sets include an open text set and closed text sets; the open text set includes object element description texts of object elements among the subject elements, as well as object element description texts of external object elements.

The description text determination module is further used to select an object element description text from the open text set, and to combine it with subject element description texts selected from at least some of the closed text sets, obtaining a target text combination.

In some embodiments, the subject image generation model includes a control domain, a latent variable domain and an image domain.

The target image generation module is further used to convert the target description text input through the control domain into a text vector; to perform denoising, guided by the text vector, on the white noise corresponding to a random latent variable of the latent variable domain, reconstructing an image latent variable; and to decode the image latent variable in the image domain, obtaining a target image matching the target subject.

In some embodiments, the target image generation module is further used to obtain, for the target description text input through the control domain, the weight data of the key description text within it, and, based on that weight data, to apply vector up-weighting to the key description text during vector conversion, obtaining the text vector corresponding to the target description text.

In some embodiments, the control domain includes a text encoder that converts text into vectors, the text encoder comprising at least two network layers. The target image generation module is further used to perform vector conversion on the input target description text through each network layer of the text encoder in turn, obtaining the output features of each layer, and to fuse the output features of a target network layer with those of the last layer, obtaining the text vector for the target description text; the target network layer is the layer of the text encoder used to improve text control ability.

In some embodiments, the input text of the control domain further includes negative description text. The target image generation module is further used to fuse the estimated noise of the negative description text, as the fusion object in the denoising process, with the estimated noise of the target description text, obtaining fused estimated noise, and to denoise the white noise corresponding to the random latent variable according to the fused estimated noise under the guidance of the text vector, reconstructing the image latent variable.

In some embodiments, the denoising comprises a first stage and a second stage occurring in sequence, and the target description text comprises subject description text and detail description text. The target image generation module is further used, in the first stage, to mask the detail description text and decode the subject image latent variable of the subject description text, obtaining a subject image matching the target subject; in the second stage, to mask the subject description text and decode the detail image latent variable of the detail description text, obtaining a detail image; and to render the detail image into the corresponding region of the subject image, obtaining the target image matching the target subject.

In some embodiments, the target image generation module is further used to semantically segment the target description text and, based on semantic object layout configuration parameters, determine the layout map of each segmented semantic object in the image; to determine the influence weight of each semantic object on the attention mechanism in each layout region of the map; and to decode, in the image domain, the image latent variable obtained from these influence weights, obtaining a target image in which the semantic objects are laid out according to the layout map.

In some embodiments, the sample images further include a depth map. The target image generation module is further used to obtain a single-channel grayscale image of the same size as the target image to be generated, whose black and white channels respectively locate the foreground and background parts of the target image; to construct an initial latent variable from the white noise corresponding to the random latent variable and the grayscale image's black and white channels; and to denoise from the initial latent variable under the guidance of the text vector, reconstructing the image latent variable.

With the above image generation apparatus for a target subject, obtaining the subject image generation model produced by secondary training of a pre-trained model yields a model that can generate, from text, images conforming to the target subject. When determining the model's input text, the apparatus exploits the fact that the sample images used to train the subject image generation model each contain at least one subject element conforming to the target subject and each carry subject element description text describing those elements: text sets containing subject element description texts of the same type can serve as the building blocks of the input. By selecting subject element description texts from at least some of these sets and combining them into a target description text, the subject image generation model can perform image generation according to that text and obtain a high-quality target image that differs from the sample images yet matches the target subject closely.

Each module in the above image generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, whose internal structure may be as shown in FIG. 16. The computer device includes a processor, a memory, an input/output (I/O) interface and a communication interface. The processor, memory and input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for running them. The database stores pre-trained models and sample images. The input/output interface exchanges information between the processor and external devices. The communication interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a method for generating an image of a target subject.

In one embodiment, a computer device is provided. The computer device may be a terminal, whose internal structure may be as shown in FIG. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit and an input device. The processor, memory and input/output interface are connected via a system bus; the communication interface, display unit and input device are connected to the system bus via the input/output interface. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The input/output interface exchanges information between the processor and external devices. The communication interface communicates with external terminals in a wired or wireless manner; the wireless manner may be implemented via WIFI, a mobile cellular network, NFC (near field communication) or other technologies. When executed by the processor, the computer program implements a method for generating an image of a target subject. The display unit forms a visually perceptible picture and may be a display screen, a projection device or a virtual reality imaging device; the display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen, a button, trackball or touchpad on the device housing, or an external keyboard, touchpad or mouse.

Those skilled in the art will understand that the structures shown in FIG. 16 and FIG. 17 are merely block diagrams of partial structures related to the solution of the present application and do not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange components differently.

In one embodiment, a computer device is provided, including a memory and a processor; the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing it.

In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the steps of the above method embodiments.

In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the above method embodiments.

It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data for analysis, stored data and displayed data) involved in the present application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of such data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

Those of ordinary skill in the art will understand that all or part of the processes of the above embodiments may be completed by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the above method embodiments. Any reference to memory, database or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and so on. Volatile memory may include random access memory (RAM) or an external cache, among others. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided herein may include at least one of a relational database and a non-relational database; non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these features that involves no contradiction should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be understood as limiting the scope of the patent application. It should be pointed out that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. The scope of protection of the present application shall therefore be subject to the appended claims.

Claims (17)

1. A method for generating an image of a target subject, characterized in that the method comprises: obtaining a subject image generation model produced by secondary training of a pre-trained model on sample images, the pre-trained model being used to generate images from text, and each sample image containing at least one subject element conforming to the target subject; obtaining, based on the subject element description text carried by each sample image for describing its subject elements, text sets containing subject element description texts of the same type; selecting subject element description texts from at least some of the text sets and combining them to obtain a target description text containing the selected subject element description texts; and performing image generation processing according to the target description text through the subject image generation model to obtain a target image matching the target subject.

2. The method according to claim 1, characterized in that the method further comprises: acquiring a target video conforming to the target subject and performing frame extraction on it to obtain a plurality of candidate image frames whose scenes differ; performing subject element recognition on each candidate image frame to determine preferred image frames containing at least one subject element; and annotating the preferred image frames with subject element description text based on the subject elements to obtain sample images.

3. The method according to claim 2, characterized in that performing frame extraction on the target video to obtain a plurality of candidate image frames whose scenes differ comprises: splitting the target video into a plurality of image frames having the same pixel arrangement; for two of the image frames, determining, based on the pixel granularity difference at each shared pixel arrangement position, the sum of the absolute values of the pixel granularity differences over all positions; and when the ratio of the sum of absolute values to the total pixel granularity value of the image frame is greater than a scene change threshold, determining the two image frames to be candidate image frames whose scenes differ.

4. The method according to claim 2, characterized in that performing subject element recognition on each candidate image frame to determine preferred image frames containing at least one subject element comprises: obtaining the attribute information and content information of each candidate image frame; and discarding candidate image frames for which at least one of the attribute information and the content information does not satisfy a subject element recognition condition, to obtain preferred image frames containing at least one subject element.

5. The method according to claim 4, characterized in that the subject elements include object elements conforming to the target subject, and annotating the preferred image frames with subject element description text based on the subject elements comprises: for each preferred image frame, locating the center point of the object element in the frame to obtain a positioning position; cropping the preferred image frame centered on the positioning position according to the size condition of the sample images to obtain a cropped preferred image frame; and annotating the cropped preferred image frame with subject element description text to obtain a sample image.

6. The method according to claim 1, characterized in that the text sets include an open text set and closed text sets, the open text set including object element description texts of object elements among the subject elements as well as object element description texts of external object elements; and selecting and combining subject element description texts comprises: selecting an object element description text from the open text set; and combining the object element description text with subject element description texts respectively selected from at least some of the closed text sets to obtain a target text combination.

7. The method according to any one of claims 1 to 6, characterized in that the subject image generation model includes a control domain, a latent variable domain and an image domain; and performing image generation processing according to the target description text comprises: converting the target description text input through the control domain into a text vector; performing denoising, under the guidance of the text vector, on white noise corresponding to a random latent variable of the latent variable domain to reconstruct an image latent variable; and decoding the image latent variable in the image domain to obtain a target image matching the target subject.

8. The method according to claim 7, characterized in that converting the target description text into a text vector comprises: obtaining, for the target description text input through the control domain, the weight data of key description text within the target description text; and applying, based on the weight data, vector up-weighting to the key description text during vector conversion to obtain the text vector corresponding to the target description text.

9. The method according to claim 7, characterized in that the control domain includes a text encoder converting text into vectors, the text encoder comprising at least two network layers; and converting the target description text into a text vector comprises: performing vector conversion on the input target description text through each network layer of the text encoder in turn to obtain the output features of each network layer; and fusing the output features of a target network layer of the text encoder with the output features of the last network layer to obtain the text vector corresponding to the target description text, the target network layer being a network layer of the text encoder used to improve text control ability.

10. The method according to claim 7, characterized in that the input text of the control domain further includes negative description text; and performing the denoising comprises: fusing the estimated noise of the negative description text, as the fusion object in the denoising process, with the estimated noise of the target description text to obtain fused estimated noise; and denoising the white noise corresponding to the random latent variable according to the fused estimated noise under the guidance of the text vector to reconstruct the image latent variable.

11. The method according to claim 7, characterized in that the denoising comprises a first stage and a second stage occurring in sequence, and the target description text comprises subject description text and detail description text; and decoding the image latent variable in the image domain comprises: in the first stage, masking the detail description text and decoding the subject image latent variable of the subject description text to obtain a subject image matching the target subject; in the second stage, masking the subject description text and decoding the detail image latent variable of the detail description text to obtain a detail image; and rendering the detail image into the corresponding region of the subject image to obtain the target image matching the target subject.

12. The method according to claim 7, characterized in that the method further comprises: semantically segmenting the target description text and determining, based on semantic object layout configuration parameters, the layout map of each segmented semantic object in the image; and determining the influence weight of each semantic object on the attention mechanism in each layout region of the layout map; wherein decoding the image latent variable in the image domain comprises decoding the image latent variable obtained based on the influence weights to obtain a target image in which the semantic objects are laid out according to the layout map.

13. The method according to claim 7, characterized in that the sample images further include a depth map, and the method further comprises: obtaining a single-channel grayscale image of the same size as the target image to be generated, the black and white channels of which respectively locate the foreground and background parts of the target image; and the denoising comprises: constructing an initial latent variable based on the white noise corresponding to the random latent variable of the latent variable domain and the black and white channels of the single-channel grayscale image; and denoising based on the initial latent variable under the guidance of the text vector to reconstruct the image latent variable.

14. An apparatus for generating an image of a target subject, characterized in that the apparatus comprises: a model acquisition module for obtaining a subject image generation model produced by secondary training of a pre-trained model on sample images, the pre-trained model being used to generate images from text, and each sample image containing at least one subject element conforming to the target subject; a text set determination module for obtaining, based on the subject element description text carried by each sample image, text sets containing subject element description texts of the same type; a description text determination module for selecting subject element description texts from at least some of the text sets and combining them into a target description text containing the selected subject element description texts; and a target image generation module for performing image generation processing according to the target description text through the subject image generation model to obtain a target image matching the target subject.

15. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 13 when executing the computer program.

16. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.

17. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
CN202310335853.XA 2023-03-24 2023-03-24 Method, device, computer equipment and storage medium for generating image of target subject Pending CN118691923A (en)

Priority Applications (1)

Application Number: CN202310335853.XA (publication CN118691923A); Priority Date: 2023-03-24; Filing Date: 2023-03-24; Title: Method, device, computer equipment and storage medium for generating image of target subject


Publications (1)

Publication Number: CN118691923A; Publication Date: 2024-09-24

Family

ID=92765287


Country Status (1)

CN: CN118691923A (en)


Legal Events

Date Code Title Description
PB01 Publication