RU2735148C1

RU2735148C1 - Training GANs (generative adversarial networks) to create pixel-by-pixel annotations

Info

Publication number: RU2735148C1
Application number: RU2019140414A
Authority: RU
Inventors: Данил Фанильевич ГАЛЕЕВ; Константин Сергеевич СОФИЮК; Данила Дмитриевич РУХОВИЧ; Антон Сергеевич Конушин; Михаил Викторович Романов
Original assignee: Самсунг Электроникс Ко., Лтд.
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2020-10-28

Abstract

FIELD: physics.

SUBSTANCE: the invention relates to a method and a computer-readable medium for joint synthesis of images and per-pixel annotations for display. The method comprises: pre-training a GAN model on an unlabeled target dataset so that it generates images resembling that dataset from random vectors; constructing a decoder that maps the outputs of intermediate GAN layers to a semantic segmentation mask; having a human annotate several GAN-generated images with semantic segmentation masks; training the decoder with supervision on pairs of input features and corresponding annotated masks so that it produces synthetic images together with corresponding per-pixel annotations; creating a synthetic dataset consisting of pairs of images and corresponding segmentation masks by sampling a random vector from a normal distribution and feeding it to the GAN input, which maps it to a synthetic image; and training a separate semantic segmentation network with supervision on the created synthetic dataset.

EFFECT: the technical result is higher efficiency of deep learning algorithms.

6 cl, 2 tbl, 6 dwg

Description

The present invention can be used in machine learning, computer vision, and generative adversarial networks to create synthetic datasets in order to improve the quality of deep learning methods.

Description of the prior art

In recent years, the quality of neural generative models has improved dramatically. The first GANs [8] could only generate MNIST digits or low-resolution images. The most recent models [3, 16] can generate images of such high quality and resolution that people often find it hard to distinguish them from real ones.

With the advent of modern neural network-based techniques, high-quality solutions have also been obtained for semantic image segmentation, a task that is critical for scene understanding. A number of works show that these two approaches can be combined in a single training scheme, either to generate a realistic image from a segmentation mask [13, 12] or for semi-supervised segmentation.

Generative adversarial networks (GANs) usually consist of a generator network and a discriminator network. The generator, trained to fool the discriminator, generates an image from random noise, while the discriminator is trained to distinguish real images from generated ones. Iterative training of both networks makes the generator capable of producing images that are indistinguishable from real images. The main challenges facing the machine learning community include generating high-quality images, generating images with high diversity, and stable training.
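The adversarial game described above is commonly written as the minimax objective of the original GAN formulation [8]. The notation below is standard rather than taken from this document: G is the generator, D the discriminator, p_data the distribution of real images, and p_z the noise distribution.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes V by assigning high probability to real images and low probability to generated ones, while the generator minimizes it by making D(G(z)) large.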

The architecture of the first GAN [8] was very simple: the discriminator and the generator consisted only of fully connected layers. This GAN could only generate digits from the MNIST dataset or low-resolution images from CIFAR-10, and it did not work on more complex datasets. With the advent of DCGAN [21], the quality and variety of generated images improved further through the use of convolutional and transposed convolutional layers. The authors of [21] analyze the latent space by demonstrating interpolation between images. It was also one of the first works to propose using discriminator features to train a classifier. In [23], mini-batch discrimination, feature matching, and label smoothing were proposed. Since GANs have no explicit objective function, the quality of different models is difficult to compare. The authors of [23] proposed the Inception Score (IS) as an objective metric for assessing the quality of generated images and showed that IS correlates well with subjective human judgment. One drawback of IS is that it can misjudge quality if the GAN generates only one image per class. In [11], a new objective metric for GANs was proposed, the Fréchet Inception Distance (FID). FID is intended as an improvement on IS because, rather than evaluating the generated samples alone, it compares the statistics of the generated samples with those of real samples. Both scores (FID and IS) are commonly used to measure the quality of generated samples. ProGAN [15] showed impressive results in generating high-quality images at resolutions up to 1024x1024. The proposed progressive training strategy substantially stabilized training for both networks. BigGAN [3] showed state-of-the-art results on conditional generation using ImageNet. Other works proposed new losses [14, 18, 9], architectures, and ways of incorporating conditional information [19, 20].
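For reference, the two metrics mentioned above are usually defined as follows. The notation is standard and not taken from this document: p(y|x) is the class distribution predicted by an Inception network for a generated image x, p(y) is its marginal over generated images, and (mu_r, Sigma_r) and (mu_g, Sigma_g) are the mean and covariance of Inception features for real and generated samples.

```latex
\mathrm{IS} = \exp\Bigl(\mathbb{E}_{x \sim p_g}\,
  D_{\mathrm{KL}}\bigl(p(y \mid x)\,\|\,p(y)\bigr)\Bigr)

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
  + \operatorname{Tr}\Bigl(\Sigma_r + \Sigma_g
  - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2}\Bigr)
```

A higher IS and a lower FID indicate better samples. Only FID involves statistics of real data, which is the difference the paragraph above points out.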

A number of works have proposed methods for studying and manipulating the internal features of GANs. For example, in GAN Dissection [2] the authors presented an analytical framework for visualizing and understanding GAN properties at the unit, object, and scene levels. In [4], the authors introduced the Neural Photo Editor as an interface for exploring the learned latent space of generative models and for making specific semantic changes to natural images.

Semantic segmentation is the task of classifying each pixel of an image into one of a given set of categories. With the development of deep learning models, the quality of semantic segmentation has improved significantly compared with methods based on hand-crafted features.

The Fully Convolutional Network (FCN) [17] was the first architecture to support end-to-end training for image segmentation. A backbone network (AlexNet, VGG16) with its fully connected layers removed was adapted to accept images of arbitrary size. The features that the backbone extracts from the image are then upsampled, either by bilinear interpolation or by a series of transposed convolutions. The U-Net architecture [22] is an improvement over FCN: its encoder part extracts features from the image, and its decoder part gradually upsamples the feature maps and produces the final prediction. The main innovation of U-Net is the skip connections between corresponding blocks of the decoder and encoder parts. They improve gradient flow and the aggregation of information at different scales. The Pyramid Scene Parsing Network (PSPNet) [26] introduced the Pyramid Pooling Module (PPM) to incorporate information at different scales explicitly. This module applies pooling to the feature maps with several kernel sizes in parallel. The PPM outputs are then upsampled and concatenated to form feature maps that contain both global and local context. This idea was developed further in DeepLabV3 [5], where the PPM was replaced by Atrous Spatial Pyramid Pooling (ASPP).

ASPP applies atrous (dilated) convolutions with different dilation rates. In DeepLabV3, the last convolutional layers of the encoder backbone are replaced with dilated convolutions to avoid a significant loss of spatial resolution. DeepLabV3+ [6] achieves state-of-the-art results on widely used benchmarks. It improves on DeepLabV3 by adding a simple yet effective decoder module that yields sharper segmentation masks. The present invention uses DeepLabV3+ as the baseline.
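To make the idea of a dilated convolution concrete, the following is a minimal one-dimensional sketch in plain Python. The function name `dilated_conv1d` and the toy signal are illustrative assumptions, not part of the patent or of DeepLab. As in most deep learning libraries, the operation is a cross-correlation (the kernel is not flipped).

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D convolution whose kernel taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1  # receptive field of one output sample
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + k * dilation]
                       for k, w in enumerate(kernel)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # receptive field of 3 samples
atrous = dilated_conv1d(signal, kernel, dilation=2)  # receptive field of 5, same 3 weights
```

With the same three weights, the dilated version covers a wider context, which is why stacking several dilation rates in parallel captures information at several scales without extra downsampling.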

Summary of the invention

Since GANs can generate high-quality images from a random vector, this vector and the set of intermediate layer outputs contain high-level information about the generated image. A natural question is therefore whether the vector and the intermediate features can be projected into a semantic segmentation mask, so that images are generated together with a per-pixel annotation. A series of experiments showed that the answer to this question is yes.

StyleGAN [16] is a state-of-the-art unsupervised image generation method with the best IS and FID scores on the FFHQ, CelebA-HQ, and LSUN datasets. Applying a number of ideas from style transfer research, its authors proposed a new architecture in which the image synthesis process is controlled by Adaptive Instance Normalization (AdaIN). The generator starts from a learned constant tensor and adjusts the style in each convolutional block based on the latent code. FIG. 2 shows the generator architecture. The mapping network consists of 8 fully connected layers, and each SBlock has an upsampling layer, 2 convolutions, and 2 AdaIN layers.
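The AdaIN operation mentioned above normalizes each feature channel over its spatial positions and then applies a per-channel scale and bias derived from the style. The sketch below is a plain-Python illustration under simplifying assumptions: channels are flat lists, and the `scale` and `bias` values are passed in directly, whereas in StyleGAN they are computed from the latent code by learned layers.

```python
import math


def adain(feature_map, scale, bias, eps=1e-8):
    """Adaptive instance normalization on a single sample.

    feature_map: list of channels, each a flat list of spatial activations.
    scale, bias: one style value per channel (assumed given here).
    """
    out = []
    for channel, s, b in zip(feature_map, scale, bias):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Remove the channel's own statistics, then impose the style statistics.
        out.append([s * (v - mean) / std + b for v in channel])
    return out


features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
styled = adain(features, scale=[2.0, 0.5], bias=[0.0, 1.0])
```

After the call, each channel of `styled` has approximately the mean given by `bias` and the standard deviation given by `scale`, regardless of the statistics of the input channel.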

The generative adversarial network algorithm, which is taken as the basis for data generation, is known in the art. It can generate data from a given distribution, but it cannot produce annotations for that data. The proposed algorithm generates data and the corresponding annotations simultaneously. A method for joint synthesis of images and per-pixel annotations is proposed.

Brief description of the drawings

The above and/or other aspects will become more apparent from the description of exemplary embodiments with reference to the accompanying drawings, in which:

FIG. 1 shows an example image from StyleGAN-FFHQ and the corresponding generated annotation for hair segmentation;

FIG. 2 is a schematic representation of the network architecture;

FIG. 3 compares the proposed method with the baseline on LSUN interiors for different numbers of training samples (upper curve: the proposed method without ImageNet pre-training; lower curve: DeepLabV3+ without ImageNet pre-training);

FIG. 4 shows segmentation masks for cars from the LSUN dataset;

FIG. 5 shows, on the left, an arbitrary image; in the center, the result of applying the Image2StyleGAN and StyleGAN models in sequence; on the right, the segmentation mask produced by the proposed decoder trained on 20 annotated synthetic images (no real or unlabeled images);

FIG. 6 shows synthetic images from StyleGAN trained on the FFHQ dataset and the proposed segmentation masks for the front right tooth.

Detailed description

A method is proposed for joint synthesis of images and per-pixel annotations (segmentation masks) by means of a GAN. A GAN has a good high-level representation of the target data, which can easily be projected into semantic segmentation masks. The method can be used to create a training dataset for a separate semantic segmentation network. Experiments have shown that such a segmentation network generalizes successfully to real data. In addition, the method outperforms supervised training when the number of training samples is small, and it works on a wide variety of scenes and classes. The invention can be implemented in software and hardware for displaying images on a screen.

An object of the invention is the joint generation of data and corresponding synthetic annotations using generative adversarial networks, requiring only a small number of human-annotated examples. These annotations are then used to train a sub-network.

The proposed invention makes it possible to achieve the same quality of deep learning algorithms with fewer human-annotated samples.

This description demonstrates that a separate semantic segmentation network trained on a synthetic dataset generalizes to real images. It also shows that the proposed method outperforms conventional supervised training when the number of annotated images is small.

Suppose there is a GAN model that has been trained on some dataset. The GAN model is trained to generate images whose characteristics are similar to those of the target dataset. Since GAN training is time-consuming and is not the subject of this invention, all experiments used pre-trained models, in particular StyleGAN [16]. The GAN takes a random vector as input and outputs an image. The main idea of the proposed method is to add a lightweight decoder to this GAN. Lightweight here means a model with far fewer parameters than the GAN. The decoder is trained to produce a per-pixel annotation for the image generated by the GAN. To train the decoder, the GAN generates several images and a human annotator annotates them manually: for each GAN-generated image, the annotator draws a pixel mask of the object of interest. This annotation is referred to as a segmentation mask or pixel map. The decoder is trained on the masks from the previous step together with the corresponding intermediate GAN features. As in most research on semantic segmentation, the cross-entropy between the predicted mask and the ground truth is minimized. To reduce computational cost, the GAN remains fixed during training. The authors have demonstrated that, because the decoder is lightweight, only a few images are required to train it. The modified network is then used to generate a large dataset of images together with their annotations.
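The training objective described above, cross-entropy between the predicted mask and the human-drawn mask, can be sketched as follows. This is a plain-Python illustration with hypothetical names (`softmax`, `pixelwise_cross_entropy`) and a toy 2x2 mask with two classes. It shows only the loss computation and assumes the per-pixel class logits have already been produced by the decoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_cross_entropy(logits, target):
    """Mean cross-entropy over all pixels.

    logits: H x W x num_classes nested lists (decoder outputs).
    target: H x W nested lists of integer class indices (annotated mask).
    """
    losses = []
    for logit_row, target_row in zip(logits, target):
        for pixel_logits, cls in zip(logit_row, target_row):
            probs = softmax(pixel_logits)
            losses.append(-math.log(probs[cls]))
    return sum(losses) / len(losses)


# Toy example: 2x2 image, classes 0 = background and 1 = object.
logits = [[[2.0, 0.0], [0.0, 2.0]],
          [[0.0, 0.0], [3.0, -1.0]]]
mask = [[0, 1],
        [1, 0]]
loss = pixelwise_cross_entropy(logits, mask)
```

During decoder training only the decoder parameters would be updated to reduce this loss, since the GAN stays fixed.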

The generated synthetic dataset consists of pairs of images and predicted segmentation masks, which can be treated as ground-truth annotations. This dataset can therefore be used to train a segmentation network with supervision.

In the present invention, StyleGAN is used as the baseline image generation method, and DeepLabV3+ is used as the baseline image segmentation method.

In particular, hardware is proposed that contains software products performing a method for joint synthesis of images and per-pixel annotations by means of a GAN, the method including the following steps:

- pre-training a GAN model on an unlabeled target dataset; the GAN is assumed to be a mapping from a random vector to an image from the distribution of the target dataset (e.g., StyleGAN, DCGAN, BigGAN, etc.);

- extending the pre-trained GAN model by adding a decoder network; this network maps the features of the inner layers of the GAN model to a semantic segmentation mask corresponding to the image that the GAN generates from the same features;

- human annotation of several GAN-generated samples with semantic segmentation masks;

- supervised training of the decoder on pairs of input features and corresponding annotated masks; the GAN model remains fixed during this training;

- creating a large synthetic dataset from the GAN and the decoder by applying these models to random vectors;

- supervised training of a separate semantic segmentation network on the created synthetic dataset.
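The dataset-creation step in the sequence above can be sketched as follows. Everything in this block is a stand-in: `generator` and `decoder` are hypothetical placeholder functions, not StyleGAN or the patented decoder, and `LATENT_DIM` is an arbitrary small value. The sketch only shows the data flow: one random vector yields an image and, from the same intermediate features, its mask.

```python
import random

LATENT_DIM = 8  # hypothetical; real GANs use larger latent vectors


def generator(z):
    """Stand-in for a frozen, pre-trained GAN: returns (image, intermediate features)."""
    features = [abs(v) for v in z]                  # placeholder intermediate outputs
    image = [min(1.0, f / 3.0) for f in features]   # placeholder pixels in [0, 1]
    return image, features


def decoder(features, threshold=1.0):
    """Stand-in for the trained lightweight decoder: features -> per-pixel labels."""
    return [1 if f > threshold else 0 for f in features]


def synthesize_dataset(num_samples, seed=0):
    """Build (image, mask) pairs by pushing random vectors through GAN and decoder."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # z ~ N(0, I)
        image, features = generator(z)
        mask = decoder(features)  # annotation derived from the same features as the image
        dataset.append((image, mask))
    return dataset


dataset = synthesize_dataset(num_samples=100)
```

The resulting list of pairs plays the role of the synthetic training set on which a separate segmentation network would then be trained with ordinary supervision.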

As a result, a synthetic dataset suitable for training a segmentation network can be created from just a few human-annotated semantic segmentation masks. Normally, thousands of human-annotated masks are required.

The semantic segmentation algorithm taken as the baseline has been described above. The proposed algorithm was compared with the baseline in terms of accuracy for different numbers of training examples.

In contrast to the prior art, the present invention provides:

- an algorithm for jointly generating data and the associated annotations;

- the ability to train deep learning models for semantic segmentation with less annotated data;

- the ability to train on GAN-generated data, with the resulting model generalizing successfully to real data.

In general terms, the present invention is as follows.

One example of implementing the invention is its use to speed up interactive annotation for segmentation on a user's dataset. This means that fewer annotated semantic segmentation masks are required to train a competitive segmentation model that works well on random images from the Internet. Normally, hundreds or thousands of annotated semantic segmentation masks are required to achieve the same accuracy.

The main idea of the proposed method is to modify an already trained GAN model by adding a special lightweight semantic segmentation network. The architecture of this network and the proposed method for joint synthesis of images and per-pixel annotations are described below.

GAN pre-training: The GAN model is pre-trained on an unlabeled target dataset (experiments on several datasets, including FFHQ, LSUN interiors, and LSUN cars, are described below). In this first step, the GAN model is trained to generate images resembling those of the available dataset from random vectors. GAN training is a time-consuming and resource-intensive process that is beyond the scope of this document. All experiments assume that a pre-trained GAN model is available, so this work uses StyleGAN, although the ideas of the invention can be applied to any other architecture.

Decoder construction (graph assignment): FIG. 2 shows the architecture of the proposed decoder. The decoder maps the outputs of the intermediate GAN layers to a semantic segmentation mask. The StyleGAN mapping network consists of 8 fully connected layers, and each SBlock contains an upsampling layer, 2 convolutions and 2 AdaIN layers. Each CBlock of the decoder takes the features from the corresponding StyleGAN SBlock as input. A CBlock consists of a dropout layer, a convolution and a batch normalization layer. This block maps features from StyleGAN to the decoder, reducing their dimensionality.

The dropout probability in the dropout layers is set to 50%; this value was chosen experimentally. Each RBlock of the decoder contains one residual block with two convolutional layers. The number of feature maps in each convolutional layer of the decoder is set to 32, because initial experiments showed that increasing the number of feature maps does not improve quality.
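The channel reduction performed by a CBlock can be illustrated with a minimal pure-Python sketch. It is not the decoder of FIG. 2: the 1×1 kernel size, the toy tensor sizes and the function names `conv1x1` and `dropout` are assumptions made for illustration, and batch normalization is omitted for brevity.

```python
import random


def conv1x1(features, weights, bias):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.

    features: list of C_in channel maps, each H x W (nested lists).
    weights:  C_out rows of C_in coefficients.
    bias:     C_out offsets.
    Returns C_out channel maps of the same H x W size.
    """
    height, width = len(features[0]), len(features[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        channel = [[b + sum(w * features[c][y][x] for c, w in enumerate(w_row))
                    for x in range(width)]
                   for y in range(height)]
        out.append(channel)
    return out


def dropout(features, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    keep = 1.0 - p
    return [[[v / keep if rng.random() < keep else 0.0 for v in row]
             for row in channel]
            for channel in features]


rng = random.Random(0)
C_IN, C_OUT, H, W = 8, 4, 2, 2  # toy sizes; the description uses 32 decoder channels
feats = [[[rng.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C_IN)]
weights = [[rng.gauss(0, 0.1) for _ in range(C_IN)] for _ in range(C_OUT)]
bias = [0.0] * C_OUT

# Dropout with p = 0.5 (the value quoted in the text), then channel reduction.
reduced = conv1x1(dropout(feats, p=0.5, rng=rng), weights, bias)
print(len(feats), "->", len(reduced), "channels")
```

In a real decoder the same projection would be applied to each SBlock output, so that every resolution level contributes the same small number of feature maps.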

Annotating a few synthetic images: To train the constructed decoder, a small sample of synthetic images is annotated manually. For each image in the sample, the user defines the mask of the object of interest by drawing it with a computer mouse. The images are generated by passing a random vector drawn from a normal distribution through the GAN, and the intermediate features are stored for subsequent training.

Decoder training: The decoder is trained in a supervised manner (https://en.wikipedia.org/wiki/Supervised_learning) using the masks from the previous step together with the corresponding intermediate GAN features. As in most semantic segmentation work, the cross-entropy between the predicted mask and the ground truth is minimized. During training the GAN remains fixed to reduce computational cost.

Generating a large synthetic dataset: After the decoder has been trained, an arbitrary number of synthetic images with corresponding segmentation masks can be generated. To do this, a random vector is drawn from a normal distribution and fed to the input of the GAN, which maps it to a synthetic image. The outputs of the required generator blocks are passed to the decoder as described above. Finally, the decoder produces the segmentation mask corresponding to the generated synthetic image.
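The generation loop described here can be sketched as follows. This is only an outline under stated assumptions: `toy_generator` and `toy_decoder` are hypothetical stand-ins for the frozen GAN and the trained decoder, and the latent size of 512 is assumed rather than taken from the description.

```python
import random

LATENT_DIM = 512  # assumed latent size; not specified in the description


def sample_latent(rng, dim=LATENT_DIM):
    """Draw a random vector from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_generator(z):
    """Hypothetical stand-in for a frozen GAN: returns (image, features)."""
    features = [z[i:i + 4] for i in range(0, 16, 4)]  # fake intermediate features
    image = [sum(f) for f in features]                # fake 4-pixel "image"
    return image, features


def toy_decoder(features):
    """Hypothetical stand-in for the trained decoder: one label per pixel."""
    return [1 if sum(f) > 0 else 0 for f in features]


def generate_dataset(n, generator, decoder, seed=0):
    """Build n (image, mask) pairs from random latent vectors."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        z = sample_latent(rng)
        image, features = generator(z)  # GAN maps the vector to an image
        mask = decoder(features)        # decoder maps GAN features to a mask
        dataset.append((image, mask))
    return dataset


synthetic = generate_dataset(5, toy_generator, toy_decoder)
print(len(synthetic), "image/mask pairs")
```

Because the image and the mask come from the same forward pass, every generated image is paired with a mask at no additional annotation cost.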

Segmentation network training: The generated synthetic dataset consists of pairs of images and predicted segmentation masks, which can be treated as ground truth. This dataset can therefore be used to train a segmentation network in a supervised manner.

The steps of this method are considered in more detail below.

Decoder training

As shown in FIG. 1, at the training stage the decoder learns to synthesize an image 1 and a semantic segmentation mask 2 jointly. As shown in FIG. 2, the decoder receives features from the GAN and outputs a segmentation mask.

The decoder is trained in a supervised manner on pairs of input features and corresponding masks (see FIG. 2). Specifically, during training the cross-entropy loss is minimized by backpropagation, which is the standard procedure for training neural networks. Such pairs can be collected simply by annotating generated images and storing the corresponding intermediate features from the GAN. Note that the original GAN remains fixed (frozen). The intermediate features are taken after each StyleGAN block, before the upsampling layer, as shown in FIG. 2.
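For concreteness, the per-pixel cross-entropy that is minimized can be computed as in the sketch below. The function names and the nested-list data layout are assumptions made for illustration; a practical implementation would use the loss provided by a deep learning framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(logit_map, target_mask):
    """Mean cross-entropy over all pixels.

    logit_map:   H x W x K nested lists of class scores from the decoder.
    target_mask: H x W integer class labels from the annotator.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logit_map, target_mask):
        for logits, target in zip(logit_row, target_row):
            total -= math.log(softmax(logits)[target])
            count += 1
    return total / count


# Two classes (background = 0, object = 1) on a 1 x 2 mask.
target = [[0, 1]]
uninformed = [[[0.0, 0.0], [0.0, 0.0]]]   # equal scores for both classes
confident = [[[4.0, -4.0], [-4.0, 4.0]]]  # strongly favours the correct class

print(pixel_cross_entropy(uninformed, target))
print(pixel_cross_entropy(confident, target))
```

With equal scores for two classes the loss equals ln 2, and it approaches zero as the decoder becomes confident in the correct label at every pixel.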

It is well known that training time grows with the number of trainable parameters, so the GAN weights are kept fixed (frozen) to reduce computational cost. Training takes a few minutes, and the decoder is trained successfully on a small number of training examples. The training examples are selected at random from the generated images.

FIG. 2 shows a diagram of the decoder together with the original StyleGAN. StyleGAN takes as input a random vector from a normal distribution (a type of continuous probability distribution for a real-valued random variable, https://en.wikipedia.org/wiki/Normal_distribution) and outputs an image. The decoder takes features from StyleGAN as input and outputs a mask. The features are taken after each StyleGAN block.

Training a segmentation network on synthetic data

After the decoder has been trained, a large dataset is generated from pairs of GAN-generated images and the corresponding masks predicted by the decoder. DeepLabV3+ is then trained on this synthetic dataset. Experiments have shown that such a network generalizes successfully to real data.

Finding the segmentation mask without training a segmentation network

The proposed pipeline for training a segmentation model includes a separate neural network at the last stage. An attempt was made to check whether the generation of a large synthetic dataset and the training of a separate segmentation network can be removed from the pipeline, leaving only the trained decoder. Both the decoder and the segmentation model output a segmentation mask, but the former takes a set of intermediate GAN features as input, while the latter takes the image itself. The remaining step for an experiment in this formulation is to construct a mapping from an arbitrary image into the space of intermediate GAN features. This topic has already been studied in the literature. In particular, Image2StyleGAN [1] proposes an optimization procedure that maps an arbitrary image to a set of input vectors for the StyleGAN generator. Feeding the resulting vectors into the generator yields a set of intermediate features for the corresponding image, from which the pre-trained decoder produces a segmentation mask. One of the results is shown in FIG. 5. The left image shows an arbitrary photograph, the middle image shows its reconstruction after successive application of Image2StyleGAN and the StyleGAN generator, and the right image shows the segmentation mask produced by the decoder described in the previous experiment.
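The idea of embedding an image by optimization can be shown on a toy problem. The sketch below is not the Image2StyleGAN procedure of [1]: it assumes a linear stand-in generator `generate`, a plain squared-error objective and fixed-step gradient descent, all chosen only to make the example self-contained.

```python
def generate(w, A):
    """Toy linear 'generator': maps a latent vector w to an image A @ w."""
    return [sum(a * wi for a, wi in zip(row, w)) for row in A]


def embed(target, A, steps=500, lr=0.05):
    """Find a latent vector whose generated image matches the target.

    Minimizes 0.5 * ||generate(w) - target||^2 by gradient descent.
    """
    w = [0.0] * len(A[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(w, A), target)]
        # Gradient of the objective with respect to w is A^T @ residual.
        grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(len(w))]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3-pixel image, 2-dimensional latent
true_w = [0.5, -1.0]
target = generate(true_w, A)              # the "photograph" to embed

w_hat = embed(target, A)
print([round(v, 4) for v in w_hat])
```

Once such a latent vector is found, running the generator on it exposes the intermediate features that the decoder needs, which is what allows a mask to be produced for a real photograph.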

The ability of the present model to generalize can be used for a very specific part of the face. To demonstrate this, the inventors conducted an experiment in which only the upper right front tooth was labeled. Although more than 10 teeth are visible in a photograph of a human face, the proposed method showed nearly perfect results using only 5 annotated synthetic images, as shown in FIG. 6 (only 5 synthetic images were annotated for training; note that, as expected, the model segments only one of several identically textured teeth).

Experiments

LSUN cars ("LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop", https://www.yf.io/p/lsun).

The inventors randomly selected a subset of 100 images from the validation part of LSUN cars and annotated them with car masks. The dataset was then randomly split into training and test parts: 20 samples were used for training and 80 for testing. For the baseline, the inventors used these 20 training samples to train DeepLabV3+ [7]. For the proposed method, the inventors also annotated 20 random images generated by StyleGAN and used them to train the decoder. Then 10,000 synthetic samples were generated and DeepLabV3+ was trained on them. Both methods were tested on the 80 real samples. Accuracy and IoU are used as evaluation metrics to compare the baseline with the proposed method. The evaluation results are shown in Table 1. With an ImageNet-pretrained backbone, the proposed method improves accuracy by more than 2% and IoU by 5% over the baseline.

Examples of results are presented in FIG. 3. FIG. 3 shows a comparison of the proposed method with the baseline on LSUN interiors for different numbers of training samples (upper curve: the proposed method without ImageNet pre-training; lower curve: DeepLabV3+ without ImageNet pre-training): (a) backbone without ImageNet pre-training, accuracy; (b) backbone pre-trained on ImageNet, accuracy; (c) backbone without ImageNet pre-training, mean IoU; (d) backbone pre-trained on ImageNet, mean IoU.

Table 1 also compares the proposed approach with the baseline when the neural networks are trained from scratch, that is, without a backbone pre-trained on ImageNet. In this experiment only 20 annotated images were used for training, as opposed to the additional millions of ImageNet images with multi-class semantic segmentation masks used in the experiment with the pre-trained backbone. In this case the improvement over the baseline is even larger (specifically, more than 10% in accuracy and 20% in IoU).

FIG. 4 shows a visualization of the results. The masks in the top row, obtained with the proposed method, are more accurate than the masks in the bottom row, obtained with the baseline.

Table 1. Evaluation results on the LSUN cars dataset with two classes (background and car)

| Method | Pretrained backbone | Accuracy | IoU |
|---|---|---|---|
| DeepLabV3+ (Chen et al., 2018) | - | 0.8588 | 0.6983 |
| Proposed method (invention) | - | 0.9787 | 0.9408 |
| DeepLabV3+ (Chen et al., 2018) | ImageNet | 0.9641 | 0.9049 |
| Proposed method (invention) | ImageNet | 0.9862 | 0.9609 |

For the car segmentation task, the LSUN dataset [24] was used with a pre-trained StyleGAN model. LSUN is a million-scale image classification dataset with 10 scene categories and 20 object categories. Only images from the "cars" category were selected. 100 images of 512×384 pixels were annotated manually with car segmentation masks. Of these images, 20 samples were used to train the DeepLabV3+ baseline, and the remaining 80 were used to evaluate both the proposed method and the baseline. The whole training pipeline was then run, including the generation of 10,000 synthetic images at the dataset generation stage and the manual annotation of 20 synthetic images. Both methods were evaluated on the 80 test images. The results are shown in Table 1. The proposed method improved accuracy by more than 2% and IoU by 5% over the baseline.

Table 1 also compares the proposed approach with the baseline when the neural networks are trained from scratch, that is, without a backbone pre-trained on ImageNet. In this experiment only 20 annotated images were used for training, as opposed to the additional millions of ImageNet images with multi-class semantic segmentation masks used in the experiment with the pre-trained backbone. In this case the improvement over the baseline was even larger (specifically, more than 10% in accuracy and 20% in IoU). Further discussion is given in the subsection on the experiment with interior scenes.
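The improvements quoted for both settings can be checked directly against the values in Table 1. The short script below only performs that arithmetic; the dictionary layout and the `gains` helper are illustrative.

```python
# (accuracy, IoU) pairs copied from Table 1.
table1 = {
    "no pretraining": {"baseline": (0.8588, 0.6983), "proposed": (0.9787, 0.9408)},
    "ImageNet": {"baseline": (0.9641, 0.9049), "proposed": (0.9862, 0.9609)},
}


def gains(rows):
    """Absolute improvement of the proposed method over the baseline."""
    (acc_b, iou_b), (acc_p, iou_p) = rows["baseline"], rows["proposed"]
    return round(acc_p - acc_b, 4), round(iou_p - iou_b, 4)


for backbone, rows in table1.items():
    acc_gain, iou_gain = gains(rows)
    print(f"{backbone}: accuracy +{acc_gain:.4f}, IoU +{iou_gain:.4f}")
```

The differences are about 0.022 in accuracy and 0.056 in IoU with the ImageNet-pretrained backbone, and about 0.120 and 0.243 without pre-training, which agrees with the percentages stated in the text.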

FIG. 4 also shows the segmentation masks obtained for four images from the test subset, to illustrate the difference in quality between the proposed method and the baseline. FIG. 4 shows segmentation masks for cars from the LSUN dataset. Top row: the proposed method; bottom row: DeepLabV3+. Both methods were trained on 20 annotated images.

Evaluation protocol.

Two variants of the backbone for DeepLabV3+ [7] were tested: pre-trained on ImageNet and without pre-training. Pixel accuracy and intersection over union averaged over classes (mIoU) were measured. DeepLabV3+, like other segmentation models, uses a backbone pre-trained on ImageNet. ImageNet is a large-scale dataset containing 1 million human-annotated images from 1000 classes. Such methods therefore implicitly rely on classification annotation in addition to segmentation annotation. The inventors conducted experiments comparing the proposed method with the baseline under two protocols: with a backbone pre-trained on ImageNet and without pre-training.
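The two metrics can be written down as a minimal sketch. The function names and the nested-list mask format are assumptions, and the metrics are computed here for a single pair of masks; over a test set the counts would normally be accumulated across all images.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the reference label."""
    correct = total = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            correct += (p == t)
            total += 1
    return correct / total


def mean_iou(pred, target, num_classes):
    """Intersection over union, averaged over the classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Two classes: 0 = background, 1 = object. One pixel is mislabeled.
pred = [[0, 0, 1],
        [0, 1, 1]]
target = [[0, 1, 1],
          [0, 1, 1]]

print(pixel_accuracy(pred, target))  # 5 of 6 pixels agree
print(mean_iou(pred, target, 2))     # mean of 2/3 and 3/4
```

IoU penalizes the single wrong pixel more heavily than accuracy does, which is why the gap between methods in the tables is larger for IoU than for accuracy.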

The inventors performed the same experiments on the FFHQ dataset. Flickr-Faces-HQ (FFHQ) is a high-quality dataset of human face images, originally created as a benchmark for generative adversarial networks (GANs). The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution, with considerable variation in age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses and hats. To demonstrate the proposed method on this dataset, hair segmentation was used as the target task. The inventors used 20 human-annotated images for training and 80 human-annotated images for testing.

The results are presented in Table 2 (pixel accuracy and intersection over union averaged over classes, see the evaluation protocol; both are dimensionless quantities). The proposed method outperforms the DeepLabV3+ baseline by 7% in IoU and by 1% in accuracy.

The inventors also conducted experiments with Image2StyleGAN for StyleGAN-FFHQ. FIG. 5 shows an example of an embedding and a mask. The left image shows an example of a real photograph. The inventors applied the Image2StyleGAN algorithm to find its representation in the embedding space. The image reconstructed from this embedding is shown in the center. Finally, the inventors applied the trained decoder to the features of this reconstructed image to obtain a hair segmentation mask. This mask is shown in the right image.

Table 2. Evaluation results on the FFHQ dataset with two classes (background and hair)

| Method | Pretrained backbone | Accuracy | IoU |
|---|---|---|---|
| DeepLabV3+ (Chen et al., 2018) | ImageNet | 0.8831 | 0.7549 |
| Proposed method (invention) | ImageNet | 0.8967 | 0.8243 |

LSUN interiors is a subset of the LSUN dataset that contains photographs of indoor scenes. In this experiment the proposed method was compared with the baseline for different numbers of training samples in order to observe the trend. Since LSUN interiors have no semantic segmentation masks and annotation is rather tedious, the inventors used a segmentation network from the GluonCV package, pre-trained on ADE20K, to create the annotation. Of the 150 classes in ADE20K, only the 13 that correspond to interior scenes were used. The results are shown in FIG. 3.

The comparison of the proposed method with the baseline under different settings is shown in four plots. The inventors compared the IoU and accuracy metrics for different numbers of training examples. Plots (a) and (b) compare accuracy, and plots (c) and (d) compare mIoU. A backbone pre-trained on ImageNet was used for experiments (a) and (c). The segmentation network was trained from scratch for experiments (b) and (d). The experiments showed that the proposed method works well with a small number of training samples and in this case outperforms the baseline by a large margin.

Implementation details

The proposed algorithm was implemented with the MXNet Gluon framework [7]. Because StyleGAN training is time-consuming and is not the subject of this invention, all experiments used models pre-trained by the authors of the original work, converted to a format accepted by MXNet Gluon. The decoder was trained with the Adam optimizer with an initial learning rate of 1×10^-4. Different training parameters were used for the original DeepLabV3+: the SGD optimizer with momentum 0.9, an initial learning rate of 0.01 and a weight decay coefficient of 1×10^-4. In addition, ResNet-50 was used as the backbone for DeepLabV3+ and was pre-trained on ImageNet unless otherwise stated. In all experiments, pixel accuracy and intersection over union averaged over classes (IoU) were used to assess quality.
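To make the role of the SGD hyperparameters concrete, the sketch below applies one common formulation of SGD with momentum and weight decay to a toy quadratic objective. The exact update rule differs slightly between frameworks, so the function `sgd_momentum_step` is an illustrative assumption rather than the MXNet implementation; only the values 0.9, 0.01 and 1×10^-4 come from the description.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay.

    Returns the updated weights and the updated velocity buffers.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay folded into the gradient
        v = momentum * v - lr * g  # exponentially decaying sum of past steps
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity


# Toy objective f(w) = 0.5 * w^2, whose gradient is simply w.
weights, velocity = [1.0], [0.0]
first_step, _ = sgd_momentum_step(weights, weights, velocity)

for _ in range(200):
    weights, velocity = sgd_momentum_step(weights, weights, velocity)

print(first_step[0], weights[0])
```

The first step moves the weight by roughly the learning rate times the gradient, while later steps are amplified by the accumulated momentum, which is why the weight reaches the minimum far faster than plain gradient descent with the same learning rate would.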

The experiments showed that the proposed method works well with a small number of training examples and in this case outperforms conventional supervised training by a large margin. However, as the number of training examples grows, the difference in accuracy decreases (FIG. 3 (a), (c)). When a backbone pre-trained on ImageNet is used, the proposed method starts to perform worse from some point on (FIG. 3 (b), (d)). This can be explained by the fact that the GAN itself has limited capabilities: the quality of the generated images is not perfect, and the GAN is often unable to generate some rare objects. Consequently, such rare objects are absent from the synthetic dataset. In addition, the internal GAN representation from which the semantic masks are projected may differ slightly from the true high-level representation. For example, the same features are probably used to represent a person's hair and beard. As a result, the quality of hair segmentation deteriorates.

The inventors have proposed a method for generating images together with semantic segmentation masks using a pre-trained GAN. It can be used to train a separate segmentation network. The study has shown that such a segmentation network generalizes successfully to real data.
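The generation of image and mask pairs can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used in the experiments: `toy_generator` and `toy_decoder` are hypothetical stand-ins for a frozen pre-trained GAN and for the trained decoder, and the decoder weights are arbitrary placeholders rather than trained values.

```python
import random


def toy_generator(z, size=4):
    """Hypothetical stand-in for a frozen, pre-trained GAN generator.

    Returns a synthetic image and one intermediate feature map, both as
    size x size grids. A real generator would be a deep network.
    """
    features = [[[z[0] * (x + 1), z[1] * (y + 1)] for x in range(size)]
                for y in range(size)]
    image = [[sum(f) for f in row] for row in features]
    return image, features


def toy_decoder(features, weights):
    """Hypothetical decoder: per-pixel linear class scores, then argmax."""
    mask = []
    for row in features:
        mask_row = []
        for f in row:
            scores = [sum(w * v for w, v in zip(class_w, f))
                      for class_w in weights]
            mask_row.append(scores.index(max(scores)))
        mask.append(mask_row)
    return mask


def make_synthetic_dataset(n, weights, seed=0):
    """Sample z from a normal distribution, generate an image, decode its mask."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(2)]
        image, features = toy_generator(z)
        dataset.append((image, toy_decoder(features, weights)))
    return dataset


# In the described method the decoder is trained on a few annotated samples;
# these weights are arbitrary placeholders for three classes.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
dataset = make_synthetic_dataset(5, weights)
```

The resulting list of (image, mask) pairs plays the role of the synthetic dataset on which a separate segmentation network would then be trained under supervision.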

The limitations of the proposed method are due to two factors. The first is the insufficient diversity of GANs. The second is the imperfect internal representation of GANs.

The embodiments described above are exemplary and should not be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.

REFERENCES

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? arXiv preprint arXiv:1904.03189, 2019.

[2] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B Tenenbaum, William T Freeman, and Antonio Torralba. Gan dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.

[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[4] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.

[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.

[7] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

[10] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187, 2018.

[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626-6637, 2017.

[12] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.

[13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv, 2016.

[14] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734, 2018.

[15] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.

[18] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? arXiv preprint arXiv:1801.04406, 2018.

[19] Takeru Miyato and Masanori Koyama. cgans with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.

[20] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2642-2651. JMLR.org, 2017.

[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.

[23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234-2242, 2016.

[24] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[25] Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, and Mu Li. Bag of freebies for training object detection neural networks. arXiv preprint arXiv:1902.04103, 2019.

[26] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881-2890, 2017.

[27] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. arXiv preprint arXiv:1608.05442, 2016.

[28] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Claims

1. A method of joint synthesis of images and per-pixel annotations for display, the method comprising:

- pre-training a GAN model on an unlabeled target dataset to generate images of the available dataset from random vectors;

- constructing a decoder that maps the outputs of the intermediate layers of the GAN into a semantic segmentation mask;

- annotating, with human involvement, several images generated by the GAN with semantic segmentation masks;

- training the decoder in a supervised manner on pairs of input features and corresponding annotated masks to obtain pairs of synthetic images with corresponding per-pixel annotations;

- creating a synthetic dataset consisting of pairs of images with corresponding segmentation masks, by sampling a random vector from a normal distribution and feeding the random noise to the input of the GAN, which maps it to a synthetic image;

- training a separate semantic segmentation network under supervision on the created synthetic dataset.

2. The method of claim 1, wherein the GAN is a mapping from a random vector to an image from the target dataset distribution.

3. The method of claim 1, wherein the GAN model is a StyleGAN model.

4. The method of claim 1, wherein the decoder network maps features of the intermediate layers of the GAN model into a semantic segmentation mask corresponding to the image generated by the GAN from the same features.

5. The method of claim 1, wherein the GAN model remains fixed during training.

6. A computer-readable medium storing computer-executable instructions for implementing the method according to any one of claims 1-5.