CN108564126B - Specific scene generation method fusing semantic control - Google Patents
- Publication number
- CN108564126B (application CN201810353922.9A)
- Authority
- CN
- China
- Prior art keywords
- scene
- label
- specific scene
- graph
- generator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field

The present invention belongs to the field of machine learning algorithms, and specifically relates to a specific scene generation method that fuses semantic control.
Background

Specific scene generation with fused semantic control means using semantic control to make a computer generate the scene described by language. Faithfully depicting the world has always been a human pursuit: painting was born from the need to depict the world, and the pursuit of perfection produced art. The invention of the camera made recording the world easy, and after the advent of computers, humans began to let computers depict the real world themselves, giving rise to many generation algorithms. Traditional generation algorithms include the histogram of oriented gradients, the scale-invariant feature transform, and so on; these algorithms combine hand-crafted feature extraction with shallow models to achieve target generation. Their solutions basically follow four steps: image preprocessing → manual feature extraction → model building (classifier/regressor) → output. Deep learning, by contrast, approaches computer vision end to end: it goes directly from input to output, with a neural network learning features automatically in between, avoiding the tedium of manual feature extraction.

Deep learning is an important branch of machine learning that has received wide attention for its major breakthroughs in many fields in recent years. The Generative Adversarial Network (GAN) is a generative deep learning model proposed by Goodfellow et al. in 2014; as soon as it was proposed, it became one of the hot research directions in computer vision. Thanks to its excellent generation ability, the GAN has achieved remarkable results in sample generation, and it has also become a topic of great application value in image restoration and inpainting, image style transfer, mutual generation of text and images, and high-quality image generation. Many leading companies in industry, such as Facebook, Google, and Apple, have joined the wave of GAN development. Based on the above research, the GAN makes it possible to generate specific scenes with fused semantic control. However, no existing model can directly generate different specific scenes through semantic control.

To solve the above problems, people have been seeking an ideal technical solution.
Summary of the Invention

The purpose of the present invention is to address the deficiencies of the prior art by providing a specific scene generation method that fuses semantic control.
To achieve the above purpose, the technical solution adopted by the present invention is a specific scene generation method fusing semantic control, comprising the following steps:

Step 1. Select several item images and multiple pictures of different specific scenes containing the item.

Step 2. Create different attribute labels according to the characteristics of the specific scene in each scene picture; after cropping the scene pictures, obtain training samples, each comprising an item image, the corresponding specific scene image containing that item, and a label describing the scene.

Step 3. Construct a conditional generative adversarial network composed of a discriminator and a generator.

Step 4. Feed the item image together with the label into the generator to generate the specific scene image described by the label.

Step 5. Take the specific scene image containing the item as the target scene image; feed the generated scene image, the target scene image, the item image, and the label together into the discriminator, and train the model through the conditional adversarial network.

Step 6. Input an image of a similar item to be processed and the desired scene, in label form, into the trained model to obtain the corresponding scene image.
Based on the above, the labels are semantic labels in binary form.
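As a sketch of how such binary semantic labels might be constructed (the attribute names below are illustrative assumptions, not taken from the patent):

```python
# Hypothetical attribute vocabulary for a scene domain; the names are
# illustrative only, since the patent does not fix a vocabulary.
ATTRIBUTES = ["model_front", "model_side", "holding_bag", "outdoor", "studio"]

def encode_label(active_attrs):
    """Encode a scene description as a binary semantic label vector."""
    return [1 if attr in active_attrs else 0 for attr in ATTRIBUTES]

label = encode_label({"model_front", "holding_bag"})
# label == [1, 0, 1, 0, 0]: one bit per attribute of the target scene
```

Each training sample's scene description then reduces to one such fixed-length bit vector, which is what the network consumes as its domain label.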
Based on the above, in step 1, the item images are close-up item images crawled from shopping websites.
Based on the above, in step 3, the generative adversarial network is a GAN model, and its generator is expressed as ŷ = G(x, y, l), where y is the target scene image domain, x is the original input image, l is the target scene image domain label, and ŷ is the specific scene image described by the label.

The cost function of the conditional GAN is used as the adversarial loss of the model, where the cost function is

min_G max_D V(D, G) = E_{(x,y,l)}[log D(x, y, l)] + E_{(x,y,l)}[log(1 − D(x, G(x, y, l), l))]

where D is the discriminator and G is the generator.
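As a numeric illustration of this cost function, the sketch below evaluates the two terms for a single sample given scalar discriminator scores; the function name and score values are assumptions for illustration:

```python
import math

def cgan_objective(d_real, d_fake):
    """Per-sample value of the conditional GAN objective:
    log D(x, y, l) + log(1 - D(x, G(x, y, l), l)),
    where d_real = D(x, y, l) and d_fake = D(x, G(x, y, l), l)."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# An undecided discriminator scores 0.5 on both real and fake input,
# which gives the equilibrium value -log 4 of the minimax game.
value = cgan_objective(0.5, 0.5)  # == 2 * log(0.5) == -log(4)
```

A discriminator that separates real from fake well (e.g. d_real = 0.9, d_fake = 0.1) pushes the objective above this equilibrium value, which is what the max over D rewards.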
Compared with the prior art, the present invention has outstanding substantive features and represents remarkable progress. Specifically:

The present invention trains a model by constructing a conditional generative adversarial network, replacing repetitive labor with artificial intelligence and thereby greatly improving work efficiency: some simple scenes can be generated directly by the system without spending manpower on shooting and production. A specified scene is generated through semantic control; for different situations, one only needs to provide some training samples for that scene and create domain labels for them, and after training the model can generate images of the specified scene. The method has broad application prospects; in particular, the images showing product details on shopping websites can be generated by this method, saving labor and resources.
Brief Description of the Drawings

FIG. 1 is a schematic flow chart of the algorithm of the present invention.

FIG. 2 is a schematic design diagram of a specific scene generation method fusing semantic control according to the present invention.
Detailed Description of the Embodiments

The technical solution of the present invention is described in further detail below through specific embodiments.
As shown in FIG. 1 and FIG. 2, a specific scene generation method fusing semantic control comprises the following steps:

Step 1. Crawl several item images and multiple pictures of different specific scenes containing the item from shopping websites.

Step 2. Create different attribute labels according to the characteristics of the specific scene in each scene picture; the labels are semantic labels in binary form. After cropping the scene pictures, obtain training samples, each comprising an item image, the corresponding specific scene image containing that item, and a label describing the scene.

Step 3. Construct a conditional generative adversarial network composed of a discriminator and a generator.

Step 4. Feed the item image together with the label into the generator to generate the specific scene image described by the label.

Step 5. Take the specific scene image containing the item as the target scene image; feed the generated scene image, the target scene image, the item image, and the label together into the discriminator, and train the model through the conditional adversarial network.

Step 6. Input an image of a similar item to be processed and the desired scene, in label form, into the trained model to obtain the corresponding scene image.
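The data flow of steps 4 and 5 can be sketched as one alternating training step. G, D, and the update functions below are placeholders, since the patent does not specify network architectures or optimizers:

```python
def train_step(x, y, l, G, D, update_d, update_g):
    """One alternating update of the conditional GAN (steps 4-5).
    G generates the labelled scene; D is trained to separate the real
    target scene from the generated one, conditioned on the item image
    and label; then G is trained to fool D. update_d / update_g stand
    in for the actual gradient steps."""
    fake_scene = G(x, y, l)           # step 4: generate the scene the label describes
    update_d(D, x, y, fake_scene, l)  # step 5: discriminator sees real and fake
    update_g(G, D, x, y, l)           # then the generator learns to fool it
    return fake_scene

# Toy stand-ins just to show the calling convention:
G = lambda x, y, l: [0.0]             # "generated scene"
D = lambda *inputs: 0.5               # undecided discriminator
fake = train_step([0.1], [0.9], [1, 0], G, D,
                  lambda *a: None, lambda *a: None)
```

Iterating this step over the whole training set of (item image, target scene, label) samples is what drives the adversarial game described below.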
Specifically, in step 3, the generative adversarial network is a GAN model, and its generator is expressed as ŷ = G(x, y, l), where y is the target scene image domain, x is the original input image, l is the target scene image domain label, and ŷ is the specific scene image described by the label.

In the method of the present invention, each input item image corresponds to a paired target scene image domain y and label l, so that G can accurately learn to generate a specific scene. The discriminator learns to classify real images against generated images; the generator must learn to fool the discriminator. The discriminator produces a probability distribution over the input item image and label, so a label can be specified to semantically control what the generator produces. The goal of the generator is to convert the original item image into a real scene image described by the label, so the training dataset is given as a set of corresponding triplets (x, y, l), where x is the input item image, y is the corresponding target scene image, and l is the target scene image domain label.
The cost function of the conditional GAN is used as the adversarial loss of the algorithm model; this cost function is a minimax two-player zero-sum game:

min_G max_D V(D, G) = E_{(x,y,l)}[log D(x, y, l)] + E_{(x,y,l)}[log(1 − D(x, G(x, y, l), l))]

where D is the discriminator and G is the generator.

The first term of the function indicates that when a real scene image is input, the discriminator makes the objective as large as possible, judging the input to be real. The second term indicates that when a generated image is input, the discriminator drives D(x, G(x, y, l), l) as small as possible, making that term large; meanwhile the generator tries to fool the discriminator into mistaking its output for a real image while the discriminator tries to identify it as fake. The two models play this game until a Nash equilibrium is reached, by which point the generator has learned the semantic features of the labels and associated them with the item images.
In this generative adversarial network built on the GAN model, the generator takes as input the original item image, with the target domain image and label as condition variables, and generates a fake specific scene; the target domain image and target domain label are copied at input time and concatenated with the input image. The generator tries to reconstruct a new scene from the input image and the given domain label, aiming to produce a specific scene indistinguishable from a real one so that the discriminator cannot easily tell them apart. As the two play this adversarial game, the scenes produced by the generator become increasingly realistic, and it becomes ever harder for the discriminator to distinguish real scene images from fake ones, thereby achieving the purpose of training.
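The label-and-image concatenation described above can be sketched as follows. This is a common conditioning scheme and an assumption about the exact layout, since the patent does not give tensor shapes:

```python
def tile_label(label, h, w):
    """Broadcast each bit of a binary domain label to an h x w constant
    map, so the label can be stacked with image channels."""
    return [[[bit] * w for _ in range(h)] for bit in label]

def concat_channels(image_channels, label_maps):
    """Append the tiled label maps after the image channels."""
    return image_channels + label_maps

image = [[[0.1, 0.2], [0.3, 0.4]]]  # one 2x2 image channel
conditioned = concat_channels(image, tile_label([1, 0], 2, 2))
# conditioned now has 3 channels: the image plus one constant map per bit
```

The generator and discriminator then convolve over this stacked input, so every spatial location sees both the pixels and the full domain label.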
The overall structure of the invention is simple and reasonably designed, using a conditional GAN as the model framework. To realize the semantic control function, the algorithm model can accept training data from multiple domains and uses only one generator to learn the mappings among all available domains. Rather than learning a fixed generation (for example, only from clothes to a front-facing model), the model takes the item image and target information as input and learns to flexibly place the object from the input image into the corresponding scene. Domain information is represented with labels: during training, a target domain label is generated at random and the model learns to convert the input image into that target domain. The domain label can thus be controlled semantically, converting the input into any desired scene output; for example, asking for a model standing facing forward, holding a bag, with arms down, yields a model wearing the input clothes that satisfies those requirements.
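Drawing a random target-domain label per training example, as described above, might look like the following minimal sketch (the label length and seed are assumptions):

```python
import random

rng = random.Random(0)  # fixed seed only to make the sketch reproducible

def sample_target_label(n_attrs):
    """Draw a random target-domain label during training, so the single
    generator is exercised on every attribute combination instead of
    one fixed input-to-output mapping."""
    return [rng.randint(0, 1) for _ in range(n_attrs)]

labels = [sample_target_label(5) for _ in range(3)]
```

Each drawn label picks a different target domain for the same input image, which is what lets one generator cover all the domains.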
That is, given an input item image, the method generates a reasonable scene containing that item. This overcomes two major difficulties: first, multi-domain generation; second, generating plausible scene content that does not exist in the input. For the first, the present invention represents the label of each training sample as a vector and maps it to the input image and target scene; by randomly generating a target domain label during training, the model learns to flexibly convert the input image into the target domain. At inference time the domain label is controlled semantically: for the same input image, different labels yield different scenes, achieving multi-domain generation. For the second, the present invention provides the target scene image and the label describing that scene during training, learns the mapping between them through the generative adversarial network, and associates the image with the label text. During training, the generator learns an image representation of the text while the discriminator distinguishes real images from generated ones; through the adversarial game, the generator comes to produce specific scene images that the human eye cannot tell from real ones.
The algorithm model of the invention has a compact structure, is convenient to train, runs stably and reliably, is portable, and can be used in many specific scenarios.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified, or some technical features replaced with equivalents, without departing from the spirit of the technical solution of the present invention, and all such modifications should fall within the scope of the technical solution claimed by the present invention.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810353922.9A CN108564126B (en) | 2018-04-19 | 2018-04-19 | Specific scene generation method fusing semantic control |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN108564126A CN108564126A (en) | 2018-09-21 |
| CN108564126B true CN108564126B (en) | 2022-04-19 |
Family
ID=63535888
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810353922.9A Active CN108564126B (en) | 2018-04-19 | 2018-04-19 | Specific scene generation method fusing semantic control |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108564126B (en) |
Families Citing this family (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109447137B (en) * | 2018-10-15 | 2022-06-14 | 聚时科技(上海)有限公司 | Image local style migration method based on decomposition factors |
| CN109493417B (en) * | 2018-10-31 | 2023-04-07 | 深圳大学 | Three-dimensional object reconstruction method, device, equipment and storage medium |
| CN109584257B (en) * | 2018-11-28 | 2022-12-09 | 中国科学院深圳先进技术研究院 | An image processing method and related equipment |
| CN109726718B (en) * | 2019-01-03 | 2022-09-16 | 电子科技大学 | A system and method for visual scene graph generation based on relational regularization |
| CN109831352B (en) * | 2019-01-17 | 2022-05-17 | 柳州康云互联科技有限公司 | Detection sample generation system and method based on countermeasure generation network for Internet detection |
| CN109871898B (en) * | 2019-02-27 | 2020-04-07 | 南京中设航空科技发展有限公司 | Method for generating deposit training sample by using generated confrontation network |
| US10832450B2 (en) * | 2019-03-27 | 2020-11-10 | GM Global Technology Operations LLC | Semantic preserved style transfer |
| CN110414593B (en) * | 2019-07-24 | 2022-06-21 | 北京市商汤科技开发有限公司 | Image processing method and device, processor, electronic device and storage medium |
| CN110516577B (en) * | 2019-08-20 | 2022-07-12 | Oppo广东移动通信有限公司 | Image processing method, device, electronic device and storage medium |
| CN110766638A (en) * | 2019-10-31 | 2020-02-07 | 北京影谱科技股份有限公司 | Method and device for converting object background style in image |
| US12423931B2 (en) | 2019-11-22 | 2025-09-23 | Uisee (Shanghai) Automotive Technologies Ltd | Simulation scene image generation method, electronic device and storage medium |
| CN110738276A (en) * | 2019-12-19 | 2020-01-31 | 北京影谱科技股份有限公司 | Image material generation method and device, electronic device and computer-readable storage medium |
| CN111563482A (en) * | 2020-06-18 | 2020-08-21 | 深圳天海宸光科技有限公司 | Gas station dangerous scene picture generation method based on GAN |
| CN112966742A (en) * | 2021-03-05 | 2021-06-15 | 北京百度网讯科技有限公司 | Model training method, target detection method and device and electronic equipment |
| CN113487629B (en) * | 2021-07-07 | 2023-04-07 | 电子科技大学 | Image attribute editing method based on structured scene and text description |
| CN115086059B (en) * | 2022-06-30 | 2023-03-21 | 北京永信至诚科技股份有限公司 | Deception scene description file generation method and device based on specific language of deception domain |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9792821B1 (en) * | 2016-03-25 | 2017-10-17 | Toyota Jidosha Kabushiki Kaisha | Understanding road scene situation and semantic representation of road scene situation for reliable sharing |
| US11055537B2 (en) * | 2016-04-26 | 2021-07-06 | Disney Enterprises, Inc. | Systems and methods for determining actions depicted in media contents based on attention weights of media content frames |
| CN106096531B (en) * | 2016-05-31 | 2019-06-14 | 安徽省云力信息技术有限公司 | A kind of traffic image polymorphic type vehicle checking method based on deep learning |
| CN106878632B (en) * | 2017-02-28 | 2020-07-10 | 北京知慧教育科技有限公司 | Video data processing method and device |
| CN107743072B (en) * | 2017-07-04 | 2020-07-17 | 中国电力科学研究院 | An Efficient and Scalable Network Simulation Scenario Generation Method |
| CN107862293B (en) * | 2017-09-14 | 2021-05-04 | 北京航空航天大学 | Radar-generated color semantic image system and method based on adversarial generative network |
| CN107832558B (en) * | 2017-11-29 | 2021-12-03 | 闽江学院 | Intelligent generation method for creative scene of digital stage |
- 2018-04-19: application CN201810353922.9A filed; granted as CN108564126B, status Active
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108564126B (en) | Specific scene generation method fusing semantic control | |
| Lin et al. | Cross-domain complementary learning using pose for multi-person part segmentation | |
| Du et al. | Recurrent spatial-temporal attention network for action recognition in videos | |
| Ke et al. | End-to-end automatic image annotation based on deep CNN and multi-label data augmentation | |
| Zhao et al. | Hpiln: a feature learning framework for cross‐modality person re‐identification | |
| Bansal et al. | Recycle-gan: Unsupervised video retargeting | |
| CN108829855B (en) | Clothing wearing recommendation method, system and medium for generating confrontation network based on condition | |
| CN113704531A (en) | Image processing method, image processing device, electronic equipment and computer readable storage medium | |
| US9146941B2 (en) | Image tag pair graph for image annotation | |
| Wang et al. | Task-aware dual-representation network for few-shot action recognition | |
| CN107704838A (en) | The attribute recognition approach and device of destination object | |
| Ma et al. | Learning multiscale deep features and SVM regressors for adaptive RGB-T saliency detection | |
| Liu et al. | An indoor scene classification method for service robot Based on CNN feature | |
| CN112837215B (en) | Image shape transformation method based on generation countermeasure network | |
| Huynh et al. | Craft: Complementary recommendation by adversarial feature transform | |
| Nida et al. | Video augmentation technique for human action recognition using genetic algorithm | |
| Thomas et al. | Seeing behind the camera: Identifying the authorship of a photograph | |
| CN105843816A (en) | Method and device for determining display information of picture | |
| Qiu et al. | Hallucinating visual instances in total absentia | |
| CN108985298B (en) | Human body clothing segmentation method based on semantic consistency | |
| CN111488760A (en) | Few-shot person re-identification method based on deep multi-instance learning | |
| Gao et al. | Evaluation of local spatial–temporal features for cross-view action recognition | |
| Zhang et al. | Awesome multi-modal object tracking | |
| Chen et al. | A cross-modality sketch person re-identification model based on cross-spectrum image generation | |
| Zhuang et al. | Get in video: Add anything you want to the video |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |









