RU2710659C1

RU2710659C1 - Simultaneous unsupervised object segmentation and inpainting

Info

Publication number: RU2710659C1
Application number: RU2019104710A
Authority: RU
Inventors: Павел Александрович ОСТЯКОВ; Роман Евгеньевич Суворов; Елизавета Михайловна ЛОГАЧЕВА; Олег Игоревич Хоменко; Сергей Игоревич НИКОЛЕНКО
Original assignee: Самсунг Электроникс Ко., Лтд.
Priority date: 2019-02-20
Filing date: 2019-02-20
Publication date: 2019-12-30

Abstract

FIELD: image processing means.

SUBSTANCE: the invention relates to image processing. A computing system for performing automatic image processing comprises: a first neural network for forming a coarse image z by segmenting an object O from an original image x, which contains the object O and a background B_x, using a segmentation mask, cutting the segmented object O out of image x with the mask, and pasting it into an image y containing only a background B_y; a second neural network for creating an enhanced version ŷ of the image with the pasted segmented object O by refining the coarse image z based on the original images x and y and the mask m; and a third neural network for restoring a background-only image x̂, without the removed segmented object O, by inpainting the image obtained by zeroing out pixels of image x using the mask m; wherein the first, second and third neural networks are combined into a common neural network architecture for sequential segmentation, enhancement and inpainting and for simultaneous training; and wherein the common neural network architecture receives images and outputs processed images of the same size.

EFFECT: higher efficiency of image processing.

26 cl, 5 dwg

Description

Technical field

The invention relates to image processing functions associated with finding the boundaries of objects, removing objects from an image, inserting objects into an image, and creating new images from a combination of existing ones.

Description of the prior art

The prior art discloses the following methods of unsupervised and weakly supervised object segmentation. In [18], the authors propose a GAN-based [30] method for creating object segmentation masks from bounding boxes. Their training pipeline takes two crops of the same image: one crop with an object and one without. Objects are detected with Faster R-CNN [19]. A GAN is then trained to produce a segmentation mask for merging the two crops into a plausible image. The authors use a combination of an adversarial loss, an existence loss (which checks that the object is present in the image) and a cut loss (which checks that no part of the object remains after it is cut out). Experiments were carried out on only some classes of the Cityscapes dataset [5] and on all classes of the MS COCO dataset [14]. The authors report that their method achieves higher mean Jaccard index values than the classical GrabCut algorithm [21] and the recent Simple-Does-It [12]. This approach requires a pre-trained Faster R-CNN and a special policy for selecting foreground and background patches. It also has difficulty segmenting some object classes correctly (for example, kite, giraffe, etc.). In addition, the method works well only on low-resolution images (28×28).
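The Jaccard index used in these comparisons is the ratio of the intersection to the union of the predicted and ground-truth masks. A minimal sketch in plain Python follows; the function name `jaccard_index`, the nested-list mask layout and the convention of returning 1.0 for two empty masks are illustrative assumptions and are not taken from [18].

```python
def jaccard_index(pred, truth):
    """Jaccard index (intersection over union) of two binary masks.

    Both masks are equally sized nested lists of 0/1 values.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(jaccard_index(pred, truth))  # 2 shared pixels / 4 pixels in the union = 0.5
```

A mean Jaccard index, as reported in such comparisons, would average this value over all evaluated masks.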

In [23], the authors propose an annotation-free framework for training a segmentation network for homogeneous objects. They use an adaptive synthetic data generation process to create a training dataset.

Although unsupervised image segmentation has traditionally been addressed by superpixel clustering, deep learning has recently been applied to it [9]. In the latter work, the authors propose to maximize the information between two clustered vectors produced by a fully convolutional network from neighboring patches of the same image. A similar method was proposed in [24], although it is constrained by a reconstruction loss. Its authors describe W-Net (an autoencoder with a U-Net-like encoder and decoder), which tries to cluster pixels at an inner layer and then reconstruct the image from the pixel clusters. The resulting segmentation carries no knowledge of object classes.

Visual grounding. Visual grounding methods aim at unsupervised or weakly supervised matching of arbitrary text queries to image regions. Supervision usually takes the form of (image, caption) pairs. Model performance is usually measured by the Jaccard index against ground-truth labels. The most popular datasets are Visual Genome [13], Flickr30k [17], Refer-It-Game [11] and MS COCO [14]. The general approach to grounding is to predict whether a given caption and image correspond to each other. Negative samples are obtained by shuffling captions and images independently. Text-to-image attention is the core feature of most visual grounding models [28]. Naturally, finer-grained supervision (e.g., region-level rather than image-level annotations) yields higher scores [29].

Trimap generation. Trimap generation is the task of segmenting an image into three classes: foreground, background and unknown (transparent foreground). Most algorithms require human input to propose a trimap, but superpixel-based and clustering-based approaches to automatic trimap generation have recently been proposed [7]. That method, however, requires several optimization steps for each image. Deep learning is used to produce an alpha matting mask given an image and a trimap [26]. There is also work on video matting and background replacement in video [8]. It uses per-frame superpixel segmentation and then optimizes the energy of a conditional random field of Gaussian mixture models to separate foreground from background frame by frame.

Generative adversarial networks. In recent years, generative adversarial networks (GANs) [6] have arguably been the most frequently used approach to training generative models. Despite their power, these networks suffer from unstable training and from poor performance on higher-resolution images. The more recently proposed CycleGAN [30] trains two GANs together to establish a bidirectional mapping between two domains. This method provides greater stability and consistency. On the other hand, it requires that the dataset represent some kind of reversible operation. Many modifications and applications of CycleGAN have been published, including semantic image manipulation [22], domain adaptation [2], unsupervised image-to-image translation [15], multi-domain translation [3] and many others. There is also the problem that such a mapping between domains may be ambiguous. BicycleGAN [31] and Augmented CycleGAN [1] address this problem by requiring that the mapping preserve latent representations.

In this work, the authors build on the ideas of Cut&Paste [18] and CycleGAN [6] and propose a new architecture and pipeline that solves a different problem (background swapping) and achieves better results in unsupervised object segmentation, inpainting and image blending.

SUMMARY OF THE INVENTION

The present invention provides a new approach to visual understanding by simultaneously learning to segment object masks and to remove objects from the background (i.e., cut and paste).

A computing system for performing automatic image processing is proposed, comprising: a first neural network for forming a coarse image z by segmenting an object O from an original image x, which contains the object O and a background B_x, using a segmentation mask, cutting the segmented object O out of image x with this mask, and pasting it into an image y containing only a background B_y; a second neural network for creating an enhanced version ŷ of the image with the pasted segmented object O by refining the coarse image z based on the original images x and y and the mask m; and a third neural network for restoring a background-only image x̂, without the removed segmented object O, by inpainting the image obtained by zeroing out pixels of image x using the mask m; wherein the first, second and third neural networks are combined into a common neural network architecture for sequentially performing segmentation, enhancement and inpainting and for simultaneous training; and wherein the common neural network architecture receives images and outputs processed images of the same size.

The first, second and third neural networks are generators, which create the images x̂ and ŷ and transform them. The system further comprises two neural networks configured as discriminators, which assess the plausibility of images. The first discriminator is a background discriminator, which tries to distinguish between a reference real background image and an inpainted background image; the second discriminator is an object discriminator, which tries to distinguish between a reference real image of the object O and an enhanced image of the object O. The first and second neural networks constitute a swap network. The swap network is configured for continuous (end-to-end) training with loss functions to create the enhanced version ŷ of the image with the pasted segmented object O.

One of the loss functions is an object reconstruction function, which ensures consistency and stability of training and is implemented as the mean absolute difference between image x and the reconstructed image. One of the loss functions is an object adversarial function, which increases the plausibility of image ŷ and is implemented with a dedicated discriminator network. One of the loss functions is a mask consistency function, which makes the first network invariant to the background and is implemented as the mean absolute distance between the mask extracted from image x and the mask extracted from image ŷ. One of the loss functions is an object enhancement identity function, which forces the second network to produce images closer to real images and is the mean absolute distance between G_enh(x) and x itself. One of the loss functions is a background identity function, which ensures that the common architecture does nothing to an image that contains no objects. One of the loss functions is a total loss function, which is a linear combination of the object reconstruction function, the object adversarial function, the mask consistency function, the object enhancement identity function and the background identity function. The segmentation mask is predicted by the first network from image x.
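The cut-and-paste step and the input to the third network can be illustrated with a short sketch. It assumes standard per-pixel alpha blending, z = m·x + (1 - m)·y, and erasure by (1 - m)·x, with single-channel images stored as nested lists and mask values in [0, 1]. The function names `blend` and `erase_object` are hypothetical, and the neural networks themselves are not modelled here.

```python
def blend(x, y, m):
    """Coarse composite z: object pixels come from x (m near 1), background from y (m near 0)."""
    return [[mv * xv + (1.0 - mv) * yv for xv, yv, mv in zip(xr, yr, mr)]
            for xr, yr, mr in zip(x, y, m)]


def erase_object(x, m):
    """Zero out the pixels of x covered by the mask (assumed input to the inpainting network)."""
    return [[(1.0 - mv) * xv for xv, mv in zip(xr, mr)]
            for xr, mr in zip(x, m)]


# Toy 2x2 example: the object occupies the left column of x.
x = [[0.9, 0.2],
     [0.8, 0.1]]
y = [[0.5, 0.5],
     [0.5, 0.5]]
m = [[1.0, 0.0],
     [1.0, 0.0]]

z = blend(x, y, m)
x_erased = erase_object(x, m)
print(z)         # [[0.9, 0.5], [0.8, 0.5]]
print(x_erased)  # [[0.0, 0.2], [0.0, 0.1]]
```

In the described system, z would then be refined by the second network and the erased image would be filled in by the third network; both outputs keep the input size, as the toy shapes above do.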

A method of automatic image processing is proposed, in which: a first neural network is used to form a coarse image z by segmenting an object O from an original image x, which contains the object O and a background B_x, using a segmentation mask, and the segmented object O is cut out of image x with this mask and pasted into an image y containing only a background B_y; a second neural network is used to build an enhanced version ŷ of the image with the pasted segmented object O by refining the coarse image z based on the original images x and y and the mask m; a third neural network is used to restore a background-only image x̂, without the removed segmented object O, by inpainting the image obtained by zeroing out pixels of image x using the mask m; and the images x̂ and ŷ are output at the same size.

The first, second and third neural networks are generators, which create the images x̂ and ŷ and transform them. The method further uses two neural networks configured as discriminators, which assess the authenticity of images. The first discriminator is a background discriminator, which tries to distinguish between a reference real background image and an inpainted background image; the second discriminator is an object discriminator, which tries to distinguish between a reference real image of the object O and an enhanced image of the object O. The first and second neural networks constitute a swap network. The swap network is configured for continuous (end-to-end) training with loss functions to create the enhanced version ŷ of the image.

One of the loss functions is an object adversarial function, which increases the plausibility of the image and is implemented with a dedicated discriminator network. One of the loss functions is a mask consistency function, which makes the first network invariant to the background and is implemented as the mean absolute distance between the mask extracted from image x and the mask extracted from image ŷ. One of the loss functions is an object enhancement identity function, which drives the second network to produce images closer to real images and is the mean absolute distance between G_enh(x) and x itself. One of the loss functions is a background identity function, which ensures that the common architecture does nothing to an image that contains no objects. One of the loss functions is a total loss function, which is a linear combination of the object reconstruction function, the object adversarial function, the mask consistency function, the object enhancement identity function and the background identity function. The segmentation mask is predicted by the first network from image x.
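As a rough illustration of how such terms might be combined, the sketch below computes a mean absolute difference between two masks and a weighted sum of named loss terms. The helper names, the unit weights and the placeholder values for the terms that would come from the networks are assumptions made for this example only.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized images or masks (nested lists)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += abs(va - vb)
            count += 1
    return total / count


def total_loss(terms, weights):
    """Linear combination of named loss terms with non-negative weights."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical masks extracted from image x and from the enhanced composite.
mask_from_x = [[1.0, 0.0], [1.0, 0.0]]
mask_from_y_hat = [[1.0, 0.0], [0.5, 0.0]]

terms = {
    "object_reconstruction": 0.20,        # placeholder value
    "object_adversarial": 0.70,           # placeholder value
    "mask_consistency": mean_abs_diff(mask_from_x, mask_from_y_hat),
    "object_enhancement_identity": 0.05,  # placeholder value
    "background_identity": 0.03,          # placeholder value
}
weights = {name: 1.0 for name in terms}   # assumed weights, not values from the source

print(terms["mask_consistency"])  # 0.125
print(total_loss(terms, weights))
```

Keeping the terms in a dictionary makes it easy to rebalance individual criteria by changing one weight without touching the rest of the computation.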

Description of drawings

The above and/or other aspects will become more apparent from the description of exemplary embodiments with reference to the accompanying drawings, in which:

Fig. 1 shows the neural network architecture, the data preparation scheme and the setting of its parameters. It gives a high-level overview of the SEIGAN (Segment-Enhance-Inpainting) pipeline for joint segmentation and inpainting: the swap operation is performed twice and optimized to reproduce the original images. Ellipses denote objects and data; solid rectangles denote neural networks; rounded rectangles denote loss functions; solid lines show data flows, and dashed lines show the flow of values to the loss functions.

Fig. 2 shows the architecture of the swap network (from Fig. 1), which cuts an object out of one image and pastes it into another.

Fig. 3 shows examples of images and masks generated by the proposed model.

Fig. 4 shows the residual network architecture used for the inpainting and/or segmentation networks.

Fig. 5 shows the U-Network architecture used for the segmentation and refinement networks.

Detailed description

The proposed invention can be useful in hardware comprising software products and devices that perform automatic or automated image processing, including:

- graphics editors;

- creative applications for creating graphic content;

- hardware systems (wearable devices, smartphones, robots) for which it is desirable to find objects in images;

- augmented reality simulation (virtual/augmented reality);

- data preparation for tuning machine learning methods (in any industry).

The symbols used in the application materials are explained below.

O - the object shown in an image.

B_x - the background shown in image x.

B_y - the background shown in image y.

x = (O, B_x) - an image containing the object O and the background B_x.

y = (∅, B_y) - an image containing only the background B_y (with no object in the foreground).

X - the set of all images x.

Y - the set of all images y.

x̂ = (∅, B̂_x) - image x with the object O removed (the image contains only the background B_x).

ŷ = (Ô, B̂_y) - image y with the object O pasted in.

B̂_x, B̂_y and Ô - transformed (approximate) versions of the backgrounds B_x and B_y and of the object O.

m = G_seg(x) - the segmentation mask for image x.

z - the coarse image built by blending images x and y with the blending mask m.

G_seg, G_inp, G_enh - neural networks used as generators for segmentation, inpainting and enhancement.

D_bg, D_obj - neural networks used as discriminators (D_bg separates images with a real background from images with an inpainted background; D_obj separates images with real objects from images with pasted objects).

Gram(i) - the Gram matrix built from a three-dimensional tensor representing the features of the pixels of an image.

VGG(i) - a function that computes a three-dimensional tensor representing the features of the pixels of an image.

The optimization criteria (loss functions) are used to tune the parameters of the neural networks.

The weighting coefficients are non-negative real numbers used to balance the importance of the different optimization criteria.
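The Gram(i) definition above can be made concrete with a small sketch. It assumes the usual style-transfer construction, in which a feature tensor laid out as channels × height × width is flattened per channel and entry (i, j) of the Gram matrix is the inner product of channels i and j; no normalization is applied. The function name `gram_matrix` and the hand-written feature values, which stand in for the output of VGG(i), are hypothetical.

```python
def gram_matrix(features):
    """Gram matrix of a feature tensor given as features[channel][row][col]."""
    # Flatten each channel's H x W feature map into a single vector.
    flat = [[v for row in channel for v in row] for channel in features]
    # Entry (i, j) is the inner product of the flattened channels i and j.
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in flat] for fi in flat]


# Toy tensor with 2 channels of size 2x2 (stand-in for real network features).
features = [
    [[1.0, 0.0], [2.0, 1.0]],  # channel 0
    [[0.0, 1.0], [1.0, 3.0]],  # channel 1
]
print(gram_matrix(features))  # [[6.0, 5.0], [5.0, 11.0]]
```

The result is a symmetric channels × channels matrix whose size does not depend on the spatial resolution of the image.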

The proposed image processing functions require less detailed human supervision than currently existing analogues.

The proposed invention can be implemented in software, which in turn can run on any device with sufficient computing power.

Throughout the description, images are denoted as object-background tuples; for example, $x = (O, B_x)$ means that image x contains the object O and the background B_x, and $y = (\emptyset, B_y)$ means that image y contains the background B_y and no objects.

The main problem addressed by the present invention can be formulated as follows. Given a dataset of background images $Y = \{(\emptyset, B_y)\}_{y \in Y}$ and a dataset of objects on various backgrounds $X = \{(O, B_x)\}_{x \in X}$ (unpaired, i.e. with no mapping between X and Y), a model is trained to take an object from an image $x \in X$ and paste it onto a new background defined by an image $y \in Y$, while at the same time removing it from the original background. In other words, the task is to transform a pair of images $x = (O, B_x)$ and $y = (\emptyset, B_y)$ into a new pair $\hat{x} = (\emptyset, \hat{B}_x)$ and $\hat{y} = (\hat{O}, \hat{B}_y)$, where $\hat{B}_x \approx B_x$, $\hat{B}_y \approx B_y$ and $\hat{O} \approx O$, but the object and both backgrounds are changed so that the new images look natural.

This general task can be divided into three subtasks:

- Segmentation: the object O is segmented from the source image $x = (O, B_x)$ by predicting the segmentation mask $m = \mathrm{Mask}(x)$; with this mask one can produce a coarse blend that simply cuts the segmented object out of x and pastes it into y: $z = m \odot x + (1 - m) \odot y$, where $\odot$ denotes componentwise multiplication. During training, the parameters of the neural network are adjusted so that, given an image with an object, the network outputs a correct mask in which that object is selected. The user does not take part in this process.

- Enhancement: from the source images x and y, the coarse image z and the segmentation mask m, an enhanced version $\hat{y} = (\hat{O}, \hat{B}_y)$ is created.

- Inpainting: given the segmentation mask m and the image $(1 - m) \odot x$ obtained by zeroing the pixels of x according to m, a background-only image $\hat{x} = (\emptyset, \hat{B}_x)$ is restored. In place of the removed segmented object O, the third neural network fills in that part of the image based on the rest of the image and a random signal. During training, the parameters of the third neural network are tuned so that it produces a plausible background fill from this partial information. The result is two images, $\hat{x}$ and $\hat{y}$. The main focus, however, is on the image $\hat{y}$, while the image $\hat{x}$ with the empty background is an intermediate result of the algorithm, although it can also be used.
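The coarse blend and the object removal used by the subtasks above can be illustrated with a short NumPy sketch. The function names `coarse_blend` and `remove_object` and the toy 2x2 images are illustrative assumptions; in the described system the mask would come from the segmentation network rather than being written by hand.

```python
import numpy as np

def coarse_blend(x, y, m):
    """Paste the masked object from x onto y: z = m * x + (1 - m) * y."""
    return m * x + (1.0 - m) * y

def remove_object(x, m):
    """Zero out the object pixels of x: (1 - m) * x."""
    return (1.0 - m) * x

# Toy single-channel images; the mask selects the top-left pixel.
x = np.array([[1.0, 0.2], [0.2, 0.2]])   # object (1.0) on a background of 0.2
y = np.array([[0.5, 0.5], [0.5, 0.5]])   # background-only image
m = np.array([[1.0, 0.0], [0.0, 0.0]])   # mask with values in [0, 1]

z = coarse_blend(x, y, m)        # object pixel from x, everything else from y
x_holed = remove_object(x, m)    # what an inpainting network would receive
```

Both operations are plain elementwise arithmetic, so they work unchanged for soft masks with values strictly between 0 and 1.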

For each of these tasks, a separate neural network can be built that takes an image or a pair of images and outputs a new image or images of the same size. However, the main hypothesis explored in this work is that, in the absence of large paired and labeled datasets (which is the normal state of affairs in most applications), it is highly beneficial to train all of these neural networks together.

The authors therefore present the SEIGAN (Segment-Enhance-Inpaint) architecture, which combines all three components in a new, previously unexplored way. In FIG. 1, boxes with a dashed outline denote data (images); ellipses denote objects contained in the data; rectangles with sharp corners denote subroutines that implement neural networks; rectangles with rounded corners denote subroutines that control the tuning of the neural network parameters during training; lines denote data flows during training (an arrow pointing from one block to another means that the output of the first block is passed to the second as input). FIG. 1 shows the overall flow of the proposed architecture; the "replacement network" module combines segmentation and enhancement. Since cutting and pasting is a partially reversible operation, it is natural to organize training similarly to CycleGAN [30]: the replacement and inpainting networks are applied twice in order to close the cycle and to be able to use the idempotency property in the loss functions. The results of the first application are denoted $\hat{x}$ and $\hat{y}$, and the results of the second application, which moves the object back from $\hat{y}$ to $\hat{x}$, are denoted $\hat{\hat{x}}$ and $\hat{\hat{y}}$ (see FIG. 1).

The architecture shown in FIG. 1 combines five different neural networks, three of which are used as generators that create and transform images, and two as discriminators that assess the plausibility of an image:

G_seg solves the segmentation task: given an image x, it predicts Mask(x), the segmentation mask of the object in the image;

G_inp solves the inpainting task: given m and $(1 - m) \odot x$, it predicts $\hat{x}$;

G_enh performs enhancement: given x, y and $z = m \odot x + (1 - m) \odot y$, it predicts $\hat{y}$;

D_bg is the background discriminator, which tries to distinguish real from fake (inpainted) background-only images: its output D_bg(x) should be close to 1 if x is a real image and close to 0 if x is fake;

D_obj is the object discriminator, which does the same for images of an object on a background: its output D_obj(x) should be close to 1 if x is a real image and close to 0 if x is fake.

The generators G_seg and G_enh form the so-called "replacement network", which is depicted in FIG. 1 as a single block and explained in detail in FIG. 2. FIG. 2 shows the architecture of the "replacement network" (the block labeled "replacement network" in FIG. 1) together with the minimal set of other components needed to describe how it is used. Boxes with a dashed outline denote data (images); ellipses denote objects contained in the data; rectangles with sharp corners denote subroutines that implement neural networks; rectangles with rounded corners denote subroutines that control the tuning of the neural network parameters during training; lines denote data flows during training (an arrow pointing from one block to another means that the output of the first block is passed to the second as input). The segmentation network is a neural network that takes an image and outputs a segmentation mask of the same size. The enhancement network takes an image and outputs an improved version of it (i.e. with more realistic colors, artifacts removed, etc.) of the same size.

Compared to [18], the training process in SEIGAN proved to be more stable and able to work at higher resolutions. In addition, the proposed architecture solves more tasks simultaneously (inpainting and blending) rather than only predicting segmentation masks. As is usual in GAN design, the distinguishing feature of the architecture is a good combination of different loss functions. SEIGAN uses a combination of adversarial, reconstruction and regularization losses.

The goal of the inpainting network G_inp is to produce a plausible background $\hat{x} = (\emptyset, \hat{B}_x)$ from the source image $(1 - m) \odot x$, which is the original image x with the object removed according to the segmentation mask $m = G_{seg}(x)$ obtained by applying the segmentation network; in practice, the pixels of $m \odot x$ are filled with white. The parameters of the inpainting network are optimized during end-to-end training according to the following loss functions (shown as rounded rectangles in FIG. 1).

The adversarial background loss is intended to improve the plausibility of the resulting image. It is implemented by a dedicated discriminator network D_bg. D_bg uses the same architecture as in the original CycleGAN [30] except for the number of layers; experiments showed that a deeper discriminator works better in the proposed setup. The loss function for D_bg is the MSE adversarial loss proposed in Least Squares GAN (LSGAN) [16], since in practice it is much more stable than other types of GAN loss functions:

$$\mathcal{L}^{GAN}_{bg} = (1 - D_{bg}(y))^2 + \tfrac{1}{2} D_{bg}(\hat{x})^2 + \tfrac{1}{2} D_{bg}(\hat{\hat{y}})^2,$$

where $y = (\emptyset, B_y)$ is the original background image, $\hat{x} = (\emptyset, \hat{B}_x)$ is the background image obtained from x after the first replacement, and $\hat{\hat{y}} = (\emptyset, \hat{\hat{B}}_y)$ is the background image obtained from $\hat{y}$ after the second replacement.
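A least-squares adversarial objective of the kind described above can be sketched as follows. This is an illustration in the spirit of LSGAN, not the patented implementation: the function name and the equal 0.5 weights on the two fake terms are assumptions, and the discriminator outputs are supplied directly as numbers instead of being produced by a network.

```python
import numpy as np

def lsgan_background_loss(d_real, d_fake_first, d_fake_second):
    """Least-squares adversarial objective for a background discriminator.

    d_real:        discriminator outputs on real background-only images
    d_fake_first:  outputs on backgrounds inpainted after the first replacement
    d_fake_second: outputs on backgrounds obtained after the second replacement
    The equal 0.5 weights on the two fake terms are an assumption.
    """
    real_term = np.mean((1.0 - np.asarray(d_real)) ** 2)        # real images should score 1
    fake_term = 0.5 * np.mean(np.asarray(d_fake_first) ** 2)    # fake images should score 0
    fake_term += 0.5 * np.mean(np.asarray(d_fake_second) ** 2)
    return float(real_term + fake_term)

# A discriminator that separates real from fake perfectly has zero loss;
# one that outputs 0.5 everywhere does not.
perfect = lsgan_background_loss([1.0, 1.0], [0.0, 0.0], [0.0, 0.0])
undecided = lsgan_background_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
```

Because the penalty is quadratic in the distance from the target score, the loss changes smoothly as the discriminator output moves, which is one common explanation for the stability of least-squares GAN training.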

The background reconstruction loss is intended to preserve information about the original background B_x. It is implemented as a texture loss [25], the mean absolute difference between the Gram matrices of the feature maps after the first five layers of the VGG-16 network:

$$\mathcal{L}^{rec}_{bg} = |\mathrm{Gram}(\mathrm{VGG}(y)) - \mathrm{Gram}(\mathrm{VGG}(\hat{\hat{y}}))|,$$

where VGG(y) is the feature matrix of a pretrained image classification neural network (for example VGG, but not limited to it) and $\mathrm{Gram}(A)_{ij} = \sum_k A_{ik} A_{jk}$ is the Gram matrix.
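The Gram-matrix texture loss can be illustrated with the sketch below. The feature tensors here are small hand-made arrays standing in for the output of a pretrained feature extractor; no actual VGG network is used, and the function names are illustrative assumptions.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (channels, height, width) feature tensor."""
    channels = features.shape[0]
    flat = features.reshape(channels, -1)   # A: one row per channel
    return flat @ flat.T                    # Gram(A)_ij = sum_k A_ik * A_jk

def texture_loss(feat_a, feat_b):
    """Mean absolute difference between the Gram matrices of two feature tensors."""
    return float(np.mean(np.abs(gram(feat_a) - gram(feat_b))))

# Stand-in features: 2 channels over a 2x2 spatial grid.
feat = np.arange(8.0).reshape(2, 2, 2)
flipped = feat[:, ::-1, ::-1]               # same values, different spatial layout

same_texture = texture_loss(feat, flipped)  # spatial rearrangement is not penalized
```

The Gram matrix sums over spatial positions, so it records which channels co-occur but not where. This is the degree of freedom that a per-pixel MAE or MSE loss would not allow.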

The authors chose this loss function because there are many plausible reconstructions of the background, so the loss must allow a certain degree of freedom that the mean absolute error (MAE) or the mean squared error (MSE) does not allow but the texture loss does. In the authors' experiments, optimizing MAE or MSE typically led to the generated image being filled with median or mean pixel values, without any objects or texture. Note that the background reconstruction loss is applied only to y, because no ground-truth background is available for x (see FIG. 1).

It is also important to note that, before the image is fed to the inpainting network G_inp, part of the image is removed according to the segmentation mask m, and this is done in a differentiable way without applying any threshold to m. Consequently, gradients can propagate back through the segmentation mask into the segmentation network G_seg. Jointly training inpainting and segmentation has a regularizing effect. First, the inpainting network G_inp favors a mask that is as accurate as possible: if the mask is too small, G_inp has to erase the remaining parts of the object, which is a harder problem, and if it is too large, G_inp has a larger empty area to inpaint. Second, G_inp favors a high-contrast segmentation mask m (with values close to 0 and 1) even without thresholding: if a large part of m is low-contrast (close to 0.5), G_inp has to learn to remove a "ghost" of the object (which is much harder than simply inpainting an empty area), and the discriminator D_bg will most likely find it much easier to decide that the resulting picture is fake.
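The effect of removing the object with a soft mask, without thresholding, can be seen in a small numerical sketch. The white fill value of 1.0 assumes images scaled to [0, 1], and the function names are hypothetical. The derivative is written out by hand here, whereas a deep learning framework would compute it automatically.

```python
import numpy as np

WHITE = 1.0  # assumption: images are scaled to [0, 1], so white is 1.0

def soft_erase(x, m, fill=WHITE):
    """Remove the object with a soft mask and no threshold: (1 - m) * x + m * fill."""
    return (1.0 - m) * x + m * fill

def soft_erase_grad_wrt_mask(x, fill=WHITE):
    """Elementwise derivative of soft_erase with respect to m: fill - x."""
    return fill - x

x = np.array([0.2, 0.2, 0.2])       # three pixels of the source image
m = np.array([1.0, 0.5, 0.0])       # confident object, uncertain, confident background

erased = soft_erase(x, m)           # the middle pixel is a half-erased "ghost"
grad = soft_erase_grad_wrt_mask(x)  # non-zero, so a training signal reaches the mask
```

A hard threshold such as `m > 0.5` would be piecewise constant in m and would pass no gradient back to the segmentation network, which is why the removal is kept as plain arithmetic.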

FIG. 3 shows an example of the data consumed and produced by the proposed method. The images are, from left to right and top to bottom:

1) the leftmost image in the top row is a real input image with an object (an example of "Original image 1" in FIG. 1);

2) the second image in the top row is a real input image without objects (an example of "Original image 2" in FIG. 1);

3) the mask predicted by the segmentation network for image 1;

4) a real input image with an object (another example of "Original image 1" in FIG. 1);

5) a real input image without objects (another example of "Original image 2" in FIG. 1);

6) the leftmost image in the bottom row is the network output in which the object from image 1 has been removed using the mask in image 3 (an example of "Created image 1" in FIG. 1);

7) the output of the enhancement network with the object from image 1 pasted onto the background from image 2 (an example of "Created image 2" in FIG. 1);

8) the mask predicted by the segmentation network for image 4;

9) the output of the inpainting network with the object from image 4 removed using the mask in image 8 (another example of "Created image 1" in FIG. 1);

10) the output of the enhancement network with the object from image 4 pasted onto the background from image 5 (another example of "Created image 2" in FIG. 1).

For G_inp, a neural network consisting of two residual blocks connected in series is used (see FIG. 4). Experiments were also carried out with ShiftNet [27]. FIG. 4 shows the architecture of the ResNet neural network used as the "inpainting network" and the "segmentation network". Ellipses denote data; rectangles denote neural network layers. The left part of the figure shows the overall architecture. The right part gives a more detailed description of the blocks used in the left part. Arrows denote data flow (i.e. the output of one block is fed as input to another block). Conv2d denotes a convolutional layer; BatchNorm2d, a batch normalization layer; ReLU, a rectified linear unit; ReflectionPad, padding by pixel reflection; ConvTranspose2d, a transposed convolution (deconvolution) layer.

The goal of the replacement network is to produce a new image $\hat{y} = (\hat{O}, \hat{B}_y)$ from two source images: $x = (O, B_x)$ with the object O and $y = (\emptyset, B_y)$ with a different background.

The replacement network performs two main operations: segmentation G_seg and enhancement G_enh (see FIG. 2).

The segmentation network G_seg produces an approximate segmentation mask $m = G_{seg}(x)$ from x. Using the mask m, the object O can be extracted from the source image x and pasted onto B_y to obtain a "coarse" version of the target image, $z = m \odot x + (1 - m) \odot y$. z is not the final result: it lacks anti-aliasing, color or lighting correction, and other refinements. Note that, ideally, inserting an object naturally may also require a deeper understanding of the target background; for example, if a dog is to be pasted onto a grassy field, some of the background grass should probably be placed in front of the dog, covering its paws, since in reality they would not be visible behind the grass.

To address this, the authors introduce the so-called enhancement neural network G_enh, whose goal is to produce a "smoother", more natural image $\hat{y} = (\hat{O}, \hat{B}_y)$ from the source images x and y and the segmentation mask m, which together yield the coarse result $z = m \odot x + (1 - m) \odot y$. Experiments were carried out with the enhancement network implemented in four different ways:

black-box enhancement: G_enh(x, y, m) outputs the final enhanced image;

mask enhancement: G_enh(x, y, m) outputs a new segmentation mask m' that better fits the object O and the new background B_y taken together;

color enhancement: G_enh(x, y, m) outputs per-pixel, per-channel multipliers γ that are applied to the coarse image: $\hat{y} = \gamma \odot z$; the weights γ are regularized to stay close to 1 by an additional MSE loss;

hybrid enhancement: G_enh(x, y, m) outputs both a new mask m' and multipliers γ.

In each case, $\hat{y}$ denotes the final enhanced image after all outputs of G_enh have been applied to z accordingly.
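How the outputs of the mask, color and hybrid variants might be applied to the coarse image is sketched below. The function names are illustrative, the enhancement outputs are given by hand and not predicted by a network, and applying the multipliers after re-blending with the refined mask in the hybrid case is an assumption of this sketch. The black-box variant needs no such step, since it outputs the final image directly.

```python
import numpy as np

def apply_enhancement(x, y, m, new_mask=None, gamma=None):
    """Turn the outputs of an enhancement network into the final image.

    new_mask: refined mask m' (mask enhancement), or None to keep m
    gamma:    per-pixel, per-channel multipliers (color enhancement), or None
    Passing both corresponds to the hybrid variant.
    """
    mask = m if new_mask is None else new_mask
    blended = mask * x + (1.0 - mask) * y      # coarse image rebuilt with the chosen mask
    return blended if gamma is None else gamma * blended

def gamma_regularizer(gamma):
    """MSE penalty that keeps the multipliers close to 1."""
    return float(np.mean((gamma - 1.0) ** 2))

x = np.array([[1.0, 0.2], [0.2, 0.2]])         # object on a background
y = np.full((2, 2), 0.5)                       # background-only image
m = np.array([[1.0, 0.0], [0.0, 0.0]])         # segmentation mask
gamma = np.full((2, 2), 1.1)                   # brighten every pixel by 10%

y_hat_color = apply_enhancement(x, y, m, gamma=gamma)
penalty = gamma_regularizer(gamma)
```

With `gamma` equal to 1 everywhere and no refined mask, the function returns the coarse blend unchanged and the regularizer is zero.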

The replacement network is trained end-to-end with the following loss functions (shown as rounded rectangles in FIG. 1).

The object reconstruction loss $\mathcal{L}^{rec}_{obj}$ is intended to ensure consistency and stability of training. It is implemented as the mean absolute difference between the source image $x = (O, B_x)$ and $\hat{\hat{x}} = (\hat{\hat{O}}, \hat{\hat{B}}_x)$:

$$\mathcal{L}^{rec}_{obj} = |x - \hat{\hat{x}}|,$$

where $\hat{\hat{x}}$ is the result of applying the replacement network to x and y twice: the first application moves the object from x onto y, producing $\hat{y}$, and the second moves it back from $\hat{y}$ onto the inpainted background $\hat{x}$.
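The two-pass cycle behind the object reconstruction loss can be sketched with toy stand-ins for the networks. Everything named `toy_*` is hypothetical and exists only to make the example runnable; in particular, the toy segmenter uses a brightness threshold, which the trained segmentation network described here does not.

```python
import numpy as np

def swap(src, dst, g_seg, g_enh):
    """One pass of the replacement network: move the object from src onto dst."""
    m = g_seg(src)
    z = m * src + (1.0 - m) * dst                 # coarse blend
    return g_enh(src, dst, m, z), m

def object_reconstruction_loss(x, y, g_seg, g_enh, g_inp):
    """Mean absolute difference between x and its reconstruction after two passes."""
    y_hat, m = swap(x, y, g_seg, g_enh)           # first pass: object goes onto y
    x_hat = g_inp(m, (1.0 - m) * x)               # background of x with the hole filled
    x_back, _ = swap(y_hat, x_hat, g_seg, g_enh)  # second pass: object goes back
    return float(np.mean(np.abs(x - x_back)))

# Toy stand-ins for trained networks (hypothetical, for illustration only).
def toy_seg(img):
    return (img > 0.9).astype(float)              # the "object" is any pixel brighter than 0.9

def toy_enh(src, dst, m, z):
    return z                                      # identity enhancement

def toy_inp(m, holed):
    return holed + m * 0.2                        # fill the hole with the background value 0.2

x = np.array([[1.0, 0.2], [0.2, 0.2]])            # object on a flat background
y = np.full((2, 2), 0.5)                          # background-only image
loss = object_reconstruction_loss(x, y, toy_seg, toy_enh, toy_inp)
```

With these idealized stand-ins the cycle closes exactly and the loss is zero; any distortion introduced by the enhancement step shows up as a positive loss.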

The adversarial object loss $\mathcal{L}^{GAN}_{obj}$ is intended to increase the plausibility of $\hat{y}$. It is implemented by a dedicated discriminator network D_obj. It also has the side effect of maximizing the area covered by the segmentation mask $m = \mathrm{Mask}(x)$. The authors apply this loss to all images with objects: the real image x and the "fake" images $\hat{y}$ and $\hat{\hat{x}}$. Here too, the discriminator has the same architecture as in CycleGAN [30] except for the number of layers, where a deeper discriminator was again found to work better. The MSE loss inspired by LSGAN [16] is used again:

$$\mathcal{L}^{GAN}_{obj} = (1 - D_{obj}(x))^2 + \tfrac{1}{2} D_{obj}(\hat{y})^2 + \tfrac{1}{2} D_{obj}(\hat{\hat{x}})^2.$$

The mask consistency loss is intended to make the segmentation network invariant to the background. It is implemented as the mean absolute distance between $m = G_{seg}(x)$, the mask extracted from $x = (O, B_x)$, and $\hat{m} = G_{seg}(\hat{y})$, the mask extracted from $\hat{y} = (\hat{O}, \hat{B}_y)$:

$$\mathcal{L}^{mask} = |m - \hat{m}|.$$

The mask is a black-and-white image of the same size as the image from which it was extracted. White pixels in the mask correspond to the selected regions of the image (here, the pixels that show the object), and black pixels correspond to the background. The mean absolute distance is the absolute difference of the pixel values, averaged over all pixels. The mask is extracted a second time to make sure that the neural network that extracts it responds precisely to the shape of the object and does not respond to the background behind it (in other words, the masks for the same object must always be the same).
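The mean absolute distance between two masks reduces to a one-line computation, sketched below with hand-written masks. The function name and the example values are assumptions, and the second mask stands in for one extracted from the image with the pasted object.

```python
import numpy as np

def mask_consistency_loss(m, m_hat):
    """Absolute difference of the mask values, averaged over all pixels."""
    return float(np.mean(np.abs(m - m_hat)))

m = np.array([[1.0, 0.0], [0.0, 0.0]])       # mask extracted from the source image
m_hat = np.array([[1.0, 0.0], [0.0, 1.0]])   # second mask that also fires on one background pixel

mismatch = mask_consistency_loss(m, m_hat)   # one of four pixels disagrees
```

The loss is zero only when the two masks agree at every pixel, which is the sense in which the segmentation is pushed to ignore the background.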

Finally, in addition to the loss functions described above, an identity loss was used, an idea introduced in CycleGAN [30]. Two different kinds of identity loss are introduced:

The object enhancement identity loss brings the output of the enhancement network G_enh on real images close to the identity: it is the mean distance between G_enh(x) and x itself, |G_enh(x) - x|;

The background identity loss tries to ensure that the cut-and-inpaint architecture does nothing to an image that contains no objects: for a background-only image y, the segmentation mask G_seg(y) is found and subtracted from y to obtain (1 - G_seg(y)) ⊙ y, the inpainting network G_inp is applied, and the mean distance between the original y and the result is minimized: |G_inp((1 - G_seg(y)) ⊙ y) - y|.
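The background identity computation can be sketched in a few lines of Python. This is a minimal illustration on single-channel images stored as nested lists; the stand-in functions empty_mask, spurious_mask and identity_inpaint are hypothetical placeholders for G_seg and G_inp, not the networks themselves.

```python
def background_identity_loss(y, segment, inpaint):
    """Mean |inpaint((1 - segment(y)) * y) - y| for a single-channel image y."""
    mask = segment(y)
    # Zero out whatever the segmentation function marks as object.
    masked = [[(1.0 - m) * p for m, p in zip(m_row, p_row)]
              for m_row, p_row in zip(mask, y)]
    restored = inpaint(masked)
    diffs = [abs(r - p) for r_row, p_row in zip(restored, y)
             for r, p in zip(r_row, p_row)]
    return sum(diffs) / len(diffs)


# Hypothetical stand-ins for G_seg and G_inp, for illustration only.
def empty_mask(img):
    return [[0.0 for _ in row] for row in img]


def spurious_mask(img):
    # Wrongly marks the top-left pixel as an object.
    mask = [[0.0 for _ in row] for row in img]
    mask[0][0] = 1.0
    return mask


def identity_inpaint(img):
    return [row[:] for row in img]


background = [[0.2, 0.4], [0.6, 0.8]]
loss_ok = background_identity_loss(background, empty_mask, identity_inpaint)
loss_bad = background_identity_loss(background, spurious_mask, identity_inpaint)
```

With an empty mask the image passes through unchanged and the loss is zero; a spurious detection on a background-only image is penalized.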

The overall SEIGAN loss function is a linear combination of all the loss functions described above, with empirically chosen coefficients.
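A linear combination of loss terms can be sketched as below. The term names, the numeric loss values and the weights are placeholders invented for this example; the actual coefficients are not reproduced in this section, and only three of the loss terms are shown.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; every term must have a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Placeholder values only (assumptions, not taken from the source).
example_losses = {
    "mask_consistency": 0.10,
    "enhancement_identity": 0.20,
    "background_identity": 0.05,
}
example_weights = {
    "mask_consistency": 1.0,
    "enhancement_identity": 2.0,
    "background_identity": 4.0,
}
combined = total_loss(example_losses, example_weights)
```

Keeping the weights in a separate mapping makes it easy to tune them empirically without touching the individual loss computations.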

During the experiments, the authors noticed several interesting effects. First, before they are combined, the source images x and y may have different scales and aspect ratios. Rescaling them to the same shape with bilinear interpolation would introduce significant differences in low-level textures, which the discriminator would very easily identify as fake, thereby preventing the GAN from converging.

The authors of [18] faced the same problem and solved it with a special procedure for creating training samples: they took foreground and background patches only from the same image to ensure the same scale and aspect ratio, which reduces the variety and number of images suitable for the training set. In the proposed scheme, this task is handled by a separate enhancement network, so there are fewer restrictions when looking for suitable training data.

Another interesting effect is low contrast in the segmentation masks when inpainting is optimized with respect to the MAE or MSE reconstruction loss. A low-contrast mask (i.e., m with many values around 0.5 rather than close to 0 or 1) lets information about the object "leak" from the original image and makes reconstruction easier. A similar effect has been noticed before by other researchers, and in the CycleGAN architecture it was even used for steganography [4]. The authors first solved this problem by converting the soft segmentation mask into a hard mask simply by applying a threshold. It was later found that optimizing the inpainting with respect to the texture loss is a more elegant solution that leads to better results than thresholding.
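The leakage through a low-contrast mask, and the thresholding workaround, can be illustrated with a small Python sketch. The helper names harden and erase_object, the threshold of 0.5 and the toy pixel values are assumptions made for this example.

```python
def harden(mask, threshold=0.5):
    """Turn a soft mask with values in [0, 1] into a hard 0/1 mask."""
    return [[1.0 if m >= threshold else 0.0 for m in row] for row in mask]


def erase_object(image, mask):
    """Multiply each pixel by (1 - mask); pixels where mask is 1 become 0."""
    return [[(1.0 - m) * p for m, p in zip(m_row, p_row)]
            for m_row, p_row in zip(mask, image)]


image = [[0.9, 0.1], [0.8, 0.3]]
soft_mask = [[0.5, 0.0], [0.5, 0.0]]  # low-contrast mask over the left column

leaky = erase_object(image, soft_mask)          # object pixels only dimmed
clean = erase_object(image, harden(soft_mask))  # object pixels fully removed
```

With the soft mask, the left column is merely scaled down, so an inpainting network could recover the object from it; after thresholding, those pixels carry no information about the object.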

For the segmentation network G_seg, the authors used the architecture from CycleGAN [30], which is itself an adaptation of the architecture from [10]. For better performance, the ConvTranspose layers were replaced with bilinear upsampling. In addition, a logistic sigmoid was used as the activation function after the last layer of the network.

For the enhancement network G_enh, the U-Net architecture [20] was used, since it can both work with high-resolution images and make small changes to the source image. This matters for the proposed scheme because the enhancement network is only meant to "smooth" the boundaries of the pasted image in a more intelligent way, without significantly changing the image content.

FIG. 5 shows the architecture of the U-Net neural network used as the "inpainting network" and the "enhancement network". Ellipses denote data; rectangles denote neural network layers. The left part of the figure shows the overall architecture. The right part gives a more detailed description of the blocks used in the left part. Arrows denote data flow (i.e., the output of one block is fed as input to another block). Conv2d denotes a convolutional layer; BatchNorm2d, a batch normalization layer; ReLU, a rectified linear unit; ReflectionPad, padding with reflected pixels; ConvTranspose2d, a transposed convolution layer.

Data preparation

Most of the experiments according to the invention were performed on images available on Flickr under a Creative Commons license. The query "dog" was used to collect the source images. A pre-trained Faster R-CNN was then used to detect all objects (including dogs) and all regions without objects. Two datasets were then created: one from regions with dogs and one, {(B₂)}, from regions without objects of any class. After the data were collected, a filtering procedure was applied to obtain image regions without extraneous objects.

The filtering procedure was performed as follows. First, Faster R-CNN [19] (pre-trained on MS COCO [14]) was used to detect all objects in the image. Crops of the input image were then obtained according to the following rules:

1. After scaling, the object is 64×64 in size and the final crop is 128×128;

2. The object is located at the center of the crop;

3. No other objects intersect the crop;

4. The original size of the object in the crop exceeds 60 (on the smaller side) and does not exceed 40 percent of the whole source image (on the larger side).
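The four rules above can be sketched as a single filtering function. This is only one possible reading of the rules, under several assumptions: bounding boxes are given in pixels, the crop is a square twice the larger object side (so that resizing it to 128×128 makes that side 64), and the 40 percent limit compares the larger object side with the larger image side. The function name crop_for_object and the sample boxes are hypothetical.

```python
def crop_for_object(box, other_boxes, image_size):
    """Return a square crop (left, top, right, bottom) or None if rejected.

    Boxes are (left, top, right, bottom) in pixels; image_size is (width, height).
    """
    left, top, right, bottom = box
    width, height = right - left, bottom - top
    small, large = min(width, height), max(width, height)

    # Rule 4 (as interpreted here): smaller side above 60 px, larger side
    # at most 40% of the larger side of the source image.
    if small <= 60 or large > 0.4 * max(image_size):
        return None

    # Rules 1-2 (assumed geometry): a centered square crop twice the larger
    # object side, so that resizing it to 128x128 makes that side 64 px.
    cx, cy = (left + right) / 2, (top + bottom) / 2
    crop = (cx - large, cy - large, cx + large, cy + large)

    # Rule 3: reject the crop if any other detected object overlaps it.
    for o_left, o_top, o_right, o_bottom in other_boxes:
        if (o_left < crop[2] and o_right > crop[0]
                and o_top < crop[3] and o_bottom > crop[1]):
            return None
    return crop


image_size = (1000, 800)
dog = (400, 300, 500, 400)          # 100x100 px box
far_object = (0, 0, 50, 50)         # does not touch the crop
near_object = (520, 320, 600, 380)  # overlaps the crop

accepted = crop_for_object(dog, [far_object], image_size)
rejected = crop_for_object(dog, [near_object], image_size)
too_small = crop_for_object((10, 10, 60, 90), [], image_size)
```

In this sketch the detector output is passed in as plain tuples, so the same function could be fed boxes from any object detector.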

The exemplary embodiments presented above are examples and should not be construed as limiting. Moreover, the description of these exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.

REFERENCES:

[1] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.

[2] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4243-4250. IEEE, 2018.

[3] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint, 1711, 2017.

[4] C. Chu, A. Zhmoginov, and M. Sandler. Cyclegan: a master of steganography. arXiv preprint arXiv:1712.02950, 2017.

[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[7] V. Gupta and S. Raman. Automatic trimap generation for image matting. In Signal and Information Processing (IConSIP), International Conference on, pages 1-5. IEEE, 2016.

[8] H. Huang, X. Fang, Y. Ye, S. Zhang, and P. L. Rosin. Practical automatic background substitution for live video. Computational Visual Media, 3(3):273-284, 2017.

[9] X. Ji, J. F. Henriques, and A. Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.

[10] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and super-resolution. CoRR, abs/1603.08155, 2016.

[11] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787-798, 2014.

[12] A. Khoreva, R. Benenson, J. H. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.

[13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.

[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[15] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700-708, 2017.

[16] X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2(5), 2016.

[17] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641-2649, 2015.

[18] T. Remez, J. Huang, and M. Brown. Learning to segment via cut-and-paste. arXiv preprint arXiv:1803.06414, 2018.

[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.

[20] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.

[21] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309-314. ACM, 2004.

[22] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.

[23] Z. Wu, R. Chang, J. Ma, C. Lu, and C.-K. Tang. Annotation-free and one-shot learning for instance segmentation of homogeneous object clusters. arXiv preprint arXiv:1802.00383, 2018.

[24] X. Xia and B. Kulis. W-net: A deep model for fully unsupervised image segmentation. arXiv preprint arXiv:1711.08506, 2017.

[25] W. Xian, P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Texturegan: Controlling deep image synthesis with texture patches. CoRR, abs/1706.02823, 2017.

[26] N. Xu, B. L. Price, S. Cohen, and T. S. Huang. Deep image matting. In CVPR, volume 2, page 4, 2017.

[27] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-net: Image inpainting via deep feature rearrangement. arXiv preprint arXiv:1801.09392, 2018.

[28] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[29] Y. Zhang, L. Yuan, Y. Guo, Z. He, I.-A. Huang, and H. Lee. Discriminative bimodal networks for visual localization and detection with natural language queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[30] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.

[31] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465-476, 2017.

Claims

1. A computing system for performing automatic image processing, comprising:

a first neural network configured to form a coarse image z by segmenting an object O from an original image x containing the object O and a background B_x using a segmentation mask, cutting the segmented object O out of image x using this mask, and pasting it into an image y containing only a background B_y;

a second neural network configured to create an enhanced version ŷ of the image with the segmented object O inserted, by enhancing the coarse image z based on the original images x and y and the mask m;

a third neural network configured to restore a background-only image without the removed segmented object O by inpainting the image obtained by zeroing out pixels of image x using the mask m;

wherein the first, second and third neural networks are combined into a common neural network architecture for sequentially performing segmentation, enhancement and inpainting and for simultaneous training, and wherein the common neural network architecture receives images and outputs processed images of the same size.

2. The system of claim 1, wherein the first, second, and third neural networks are generators that create images and transform them.

3. The system of claim 2, further comprising two neural networks configured as discriminators that evaluate the reliability of the images.

4. The system of claim 3, wherein the first discriminator is a background discriminator that attempts to distinguish between a reference real background image and an inpainted background image; the second discriminator is an object discriminator that attempts to distinguish between a reference real image of the object O and the enhanced image of the object O.

5. The system according to any one of claims 2-4, in which the first and second neural networks constitute a replacement network.

6. The system of claim 5, wherein the replacement network is configured to provide continuous learning with loss functions to create an enhanced version ŷ of the image with the segmented object O inserted.

7. The system of claim 6, in which one of the loss functions is an object reconstruction function that ensures consistency and stability of training and is implemented as the mean absolute difference between image x and the reconstructed image.

8. The system of claim 6, wherein one of the loss functions is an adversarial function of the object, which increases the plausibility of the image and is implemented using a dedicated discriminator network.

9. The system of claim 6, wherein one of the loss functions is a mask consistency function that makes the first network invariant to the background and is implemented as the mean absolute distance between the mask extracted from image x and the mask extracted from image ŷ.

10. The system according to claim 6, in which one of the loss functions is a function of improving the identity of the object, which forces the second network to create images closer to real images, and represents the average absolute distance between G_enh(x) and x itself.

11. The system of claim 6, wherein one of the loss functions is a background identity function to ensure that the overall architecture does not do anything with an image that does not contain objects.

12. The system according to claim 6, in which one of the loss functions is a general loss function, which is a linear combination of the reconstruction function of the object, the adversarial function of the object, the mask matching function, the function of improving the identity of the object, the background identity function.

13. The system of claim 1, wherein the segmentation mask is predicted by the first network based on image x .

14. A method of automatic image processing, comprising:

using a first neural network,

- forming a coarse image z by segmenting an object O from an original image x containing the object O and a background B_x using a segmentation mask, cutting the segmented object O out of image x using the mask, and pasting it into an image y containing only a background B_y;

using a second neural network,

- creating an enhanced version ŷ of the image;

using a third neural network,

- restoring a background-only image;

and outputting the images at the same size.

15. The method of claim 14, wherein the first, second and third neural networks are generators that create images and transform them.

16. The method of claim 15, further comprising two neural networks configured as discriminators that evaluate the reliability of the images.

17. The method of claim 16, wherein the first discriminator is a background discriminator that attempts to distinguish between a reference real background image and an inpainted background image; the second discriminator is an object discriminator that attempts to distinguish between a reference real image of the object O and the enhanced image of the object O.

18. The method according to any one of claims 15-17, in which the first and second neural networks constitute a replacement network.

19. The method of claim 18, in which the replacement network is configured to continuously learn with loss functions to create an enhanced version ŷ of the image with the segmented object O inserted.

20. The method of claim 19, in which one of the loss functions is an object reconstruction function that ensures consistency and stability of training and is implemented as the mean absolute difference between image x and the reconstructed image.

21. The method of claim 19, in which one of the loss functions is an adversarial function of the object, which increases the plausibility of the image and is implemented using a dedicated discriminator network.

22. The method of claim 19, in which one of the loss functions is a mask consistency function that makes the first network invariant to the background and is implemented as the mean absolute distance between the mask extracted from image x and the mask extracted from image ŷ.

23. The method of claim 19, in which one of the loss functions is a function of improving the identity of the object to improve the second network to obtain images closer to real images, and is the average absolute distance between G_enh(x) and x itself.

24. The method of claim 19, wherein one of the loss functions is a background identity function to ensure that the overall architecture does not do anything with an image that does not contain objects.

25. The method according to claim 19, in which one of the loss functions is a general loss function, which is a linear combination of the reconstruction function of the object, the adversarial function of the object, the mask consistency function, the function of improving the identity of the object, the background identity function.

26. The method of claim 14, wherein the segmentation mask is predicted by the first network based on image x .