RU2665273C2 - Trained visual markers and the method of their production - Google Patents


Info

Publication number: RU2665273C2
Application number: RU2016122082A
Authority: RU (Russia)
Prior art keywords: neural network, images, visual markers
Other languages: Russian (ru)
Other versions: RU2016122082A (en), RU2016122082A3 (en)
Inventor: Viktor Sergeevich Lempitsky
Original Assignee: Autonomous non-profit educational organization of higher education "Skolkovo Institute of Science and Technology"
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2016-06-03
Filing date: 2016-06-03
Application filed by: Autonomous non-profit educational organization of higher education "Skolkovo Institute of Science and Technology"
Publication of application RU2016122082A: 2017-12-07; publication of RU2016122082A3: 2018-07-13
Application granted; publication of patent RU2665273C2: 2018-08-28

Classifications

    • G06N3/0454: Computer systems based on biological models using neural network models; architectures, e.g. interconnection topology, using a combination of multiple neural nets
    • G05B13/027: Adaptive control systems, i.e. systems automatically adjusting themselves to have optimum performance according to a preassigned criterion, the criterion being a learning criterion, using neural networks only
    • G06K9/2063: Image acquisition; selective acquisition/locating/processing of specific regions, based on a marking or identifier characterising the document or the area
    • G06K9/6256: Design or setup of recognition systems and techniques; obtaining sets of training patterns, e.g. bagging, boosting
    • G06K9/74: Arrangements for recognition using optical reference masks
    • G06N3/006: Artificial life, i.e. computers simulating life, based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation
    • G06T1/0014: General purpose image data processing; image feed-back for automatic industrial control, e.g. robot with camera
    • G06T1/0021: Image watermarking
    • G06T11/001: 2D [two-dimensional] image generation; texturing; colouring; generation of texture or colour

Abstract

FIELD: computer engineering.

SUBSTANCE: the group of inventions relates to the field of computing technology, in particular to visual markers and methods for their production, which can be used in robotics and in virtual and augmented reality. The method comprises the steps of: forming a synthesizing neural network that translates a sequence of bits into images of visual markers; forming a rendering neural network that converts input images of visual markers into images containing visual markers; forming a recognizing neural network that translates images containing visual markers into a sequence of bits; jointly training the synthesizing, rendering and recognizing neural networks by minimizing a loss function that reflects the probability of correctly recognizing random bit sequences; synthesizing visual markers by passing bit sequences through the trained synthesizing neural network; receiving a set of images of visual markers from a video data source; and extracting the encoded bit sequences from the resulting set of visual marker images with the recognizing neural network.

EFFECT: increased accuracy of recognizing and localizing visual markers.

21 claims, 11 drawings

Description

FIELD OF TECHNOLOGY
This technical solution relates generally to the field of computing technology, and in particular to visual markers and methods for their production, which can be used in robotics and in virtual and augmented reality.
BACKGROUND
Currently, visual markers (also known as visual fiducials or visual codes) are used to make environments easier for humans and robots to deal with and to assist computer vision algorithms in scenarios that are resource-constrained and/or mission-critical. Visual markers known from the prior art include simple (linear) barcodes and their two-dimensional (matrix) counterparts, such as QR codes or Aztec codes, which are used to embed pieces of visual information in objects and scenes. AprilTags visual markers (Fig. 6) and similar designs, which are a popular way to simplify the identification of locations, objects and agents for robots, are widely used in robotics. In augmented reality, ARCodes visual markers and similar designs are used to provide camera position estimates with high accuracy and low latency on budget devices. In general, such markers can embed visual information into the environment more compactly and in a language-independent way, and they can be recognized and used by autonomous as well as human-operated devices.
However, all visual markers known in the prior art are designed heuristically, based on considerations of ease of recognition by computer (machine) vision algorithms. For each newly created family of markers, recognizer algorithms are then designed and tuned, whose purpose is to ensure reliable localization and interpretation of the visual markers. The creation of visual markers and of their recognizers is thus split into two stages, and this separation is not optimal (the specific appearance of the markers is not optimal from the point of view of the recognizer in the mathematical sense). In addition, the aesthetic aspect is neglected when creating visual markers, which leads to "annoying" visual markers that in many cases do not match the style of the environment in which they are placed, or of the goods to which they are applied, and make the appearance of this environment or these products "computer friendly" (easy to recognize) but not "human friendly".
SUMMARY OF THE INVENTION
This technical solution is aimed at eliminating the disadvantages inherent in solutions known from the prior art.
The technical problem posed in this technical solution is the creation of families of visual markers that are free from the above problems of the prior art.
The technical result is an increase in the accuracy of recognition of visual markers, achieved by taking into account perspective distortion, blending with the background, low resolution, image blur, etc. during the training of the neural network. All such effects are modeled during training as piecewise differentiable transformations.
An additional technical result obtained in solving the above technical problem is an increase in the similarity of the visual marker to the visual style of a room interior or a product design.
The specified technical result is achieved by implementing a method of producing a family of visual markers encoding information, in which a synthesizing neural network is formed that translates a sequence of bits into images of visual markers; a rendering neural network is formed that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations; a recognizing neural network is formed that translates images containing visual markers into a sequence of bits; the synthesizing, rendering and recognizing neural networks are trained jointly by minimizing a loss function that reflects the probability of correct recognition of random bit sequences; a set of images of visual markers is received from a video source; and the encoded bit sequences are extracted from the obtained set of images of visual markers by means of the recognizing neural network.
In some embodiments of the technical solution, the rendering neural network converts input images of visual markers into images containing the visual markers placed on top of background images.
In some embodiments of the technical solution, the synthesizing neural network consists of one linear layer, followed by an element-wise sigmoid function.
In some embodiments of the technical solution, the synthesizing and / or recognizing neural network has a convolutional form (being a convolutional neural network).
In some embodiments of the technical solution, during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
In some embodiments of the technical solution, during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
In some embodiments of the technical solution, minimization of the loss function is performed using a stochastic gradient descent algorithm.
In some embodiments of the technical solution, the bit sequences during training are sampled uniformly from the set of vertices of the Boolean cube.
In some embodiments of the technical solution, the synthesizing, rendering and recognizing neural networks are feedforward networks.
Also, the specified technical result is achieved by implementing a method of producing a family of visual markers encoding information, in which variables corresponding to the pixel values of the created visual markers are created; a rendering neural network is formed that converts the pixel values of visual markers into images containing visual markers by means of geometric and photometric transformations; a recognizing neural network is formed that translates images containing visual markers into a sequence of bits; the synthesizing, rendering and recognizing neural networks are trained jointly by minimizing a loss function that reflects the probability of correct recognition of random bit sequences; a set of images of visual markers is received from a video source; and marker class numbers are retrieved from the resulting set of visual marker images.
In some embodiments of the technical solution, the rendering neural network converts input images of visual markers into images containing the visual markers placed in the center of the background image.
In some embodiments of the technical solution, during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
In some embodiments of the technical solution, during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
In some embodiments of the technical solution, minimization of the loss function is performed using a stochastic gradient descent algorithm.
In some embodiments of the technical solution, the bit sequences during training are sampled uniformly from the set of vertices of the Boolean cube.
In some embodiments, the rendering and recognizing neural networks are feedforward networks.
Also, the indicated technical result is achieved by implementing a method of producing a family of visual markers encoding information, in which variables corresponding to the pixel values of the created visual marker are created; a rendering neural network is formed that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations; a localizing neural network is formed that translates images containing the marker into marker position parameters; the synthesizing, rendering and localizing neural networks are trained jointly by minimizing a loss function that reflects the probability of correctly finding the marker position in the image; a set of images of visual markers is received from a video source; and the encoded bit sequences are extracted from the obtained set of images of visual markers by means of a recognizing neural network.
In some embodiments of the technical solution, the rendering neural network converts input images of visual markers into images containing the visual markers placed in the center of the background image.
In some embodiments of the technical solution, during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
In some embodiments of the technical solution, during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
In some embodiments of the technical solution, minimization of the loss function is performed using a stochastic gradient descent algorithm.
In some embodiments of the technical solution, the bit sequences during training are sampled uniformly from the set of vertices of the Boolean cube.
In some embodiments of the technical solution, the localizing, rendering and recognizing neural networks are feedforward networks.
BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of this technical solution will become apparent from the following detailed description and the accompanying drawings, in which:
FIG. 1 shows an example implementation of a method for creating and recognizing a visual marker;
FIG. 2 shows the rendering neural network. The input marker M (left) is passed through several stages (all of them piecewise differentiable); the outputs T(M; φ) for several random nuisance parameters φ are shown on the right. Using piecewise differentiable transformations in T makes it possible to train with error backpropagation;
FIG. 3 shows visual markers obtained with this technical solution. The captions in the figure show the marker size, the capacity of the resulting encoding (in bits), and the accuracy achieved during training. In each case, six markers are shown: (1) a marker corresponding to the all-zeros bit sequence; (2) a marker corresponding to the all-ones bit sequence; (3) and (4) markers corresponding to two random bit sequences differing in one bit; (5) and (6) two markers corresponding to two more random bit sequences. Under many settings, a characteristic grid-like pattern emerges;
FIG. 4 shows examples of textured 64-bit marker families. The texture prototype is shown in the first column, while the remaining columns show markers for the following sequences: all zeros, all ones, 32 consecutive zeros, and finally two random bit sequences that differ in one bit;
FIG. 5 shows screenshots of markers reconstructed from a real-time video stream together with the correctly recognized bit sequence;
FIG. 6 shows AprilTags visual markers;
FIG. 7 shows the architecture of the rendering neural network: the network receives a batch of patterns (b × k × k × 3) and background images (b × s × s × 3). The network consists of rendering, affine transformation, color conversion and blur layers. The output has shape s × s × 3;
FIG. 8 shows the localizing neural network, in which the input image passes through three layers and the network predicts 4 point maps corresponding to the positions of the corners of the visual marker;
FIG. 9 shows a family of visual markers created with rendering, localizing and classifying (recognizing) neural networks. To a person these markers look the same, yet the recognizing neural network reaches 99% recognition accuracy;
FIG. 10 shows the architecture of a system for producing a family of visual markers encoding information;
FIG. 11 shows an example of determining the position of markers (from the family shown in FIG. 9) using the trained localizing neural network. The position of each marker is given by the coordinates of its four corners. The predictions of the localizing neural network for the corners are shown as white dots.
DETAILED DESCRIPTION OF THE INVENTION
The concepts and definitions necessary for a detailed disclosure of the present technical solution are described below.
The technical solution can be implemented as a distributed computer system.
In this solution, a system means a computer system, a computer (electronic computing machine), a CNC (computer numerical control) system, a PLC (programmable logic controller), computerized control systems, and any other device capable of performing a given, well-defined sequence of operations (actions, instructions).
By a command processing device is meant an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs).
The command processing device reads and executes machine instructions (programs) from one or more data storage devices. Storage devices may include, but are not limited to, hard disks (HDD), flash memory, ROM (read only memory), solid state drives (SSD), optical media (CD, DVD, etc.).
A program is a sequence of instructions intended for execution by a control device of a computer or a device for processing commands.
An artificial neural network (ANN) is a mathematical model, as well as its software or hardware implementation, built as a complex function that transforms input information by applying a sequence of simple operations (called layers) that depend on the trainable parameters of the neural network. The ANNs discussed below can be of any standard type (for example, a multilayer perceptron, a convolutional neural network, a recurrent neural network).
Training an artificial neural network is the process of adjusting the parameters of the layers of an artificial neural network so that the predictions of the neural network on the training data improve. The quality of the ANN predictions on the training data is measured by the so-called loss function, so the learning process corresponds to the mathematical minimization of the loss function.
The backpropagation method is an efficient method for computing the gradient of the loss function with respect to the parameters of the neural network layers, using recurrence relations and well-known analytical formulas for the partial derivatives of the individual layers. By the error backpropagation method we also mean the algorithm for training a neural network that uses this way of computing the gradient.
The learning-rate parameter of gradient-based methods for training neural networks is a parameter that controls the magnitude of the weight corrections at each iteration.
A visual marker is a physical object: a printed image placed on one of the surfaces of a physical scene and designed for efficient processing of digital photographs by machine vision algorithms. The result of processing a photograph of a marker can be either the extraction of an informational message (a bit sequence) encoded in the marker, or the determination of the position of the camera relative to the marker at the moment the digital photograph was taken. Examples of markers of the first type are QR codes; examples of markers of the second type are ArUco markers and AprilTags.
A recognizing neural network is a neural network that receives an image containing a visual marker as input and produces the informational message encoded in the marker as its result.
A localizing neural network is a neural network that receives an input image and outputs numerical information about the position of the visual marker in the image (for example, the positions of the marker's corners). As a rule, such information is sufficient to determine the position of the camera relative to the marker (provided calibration information is available).
A synthesizing neural network is a neural network that receives some numerical information, such as a bit sequence, and converts it into a color or black-and-white image.
A rendering neural network is a neural network that receives an input image and converts it into another image, such that the output image resembles a digital photograph of the printed input image.
A convolutional ANN is a type of artificial neural network that is widely used in pattern recognition, including computer vision. A characteristic feature of convolutional neural networks is the representation of data as sets of images (maps) and the use of local convolution operations that modify and combine the map data.
Let us consider in detail the method of creating a trained visual marker shown in FIG. 1. The main goal is to create a synthesizing neural network S(b; θS) with trainable parameters θS that can encode a bit sequence b = {b1, ..., bn} containing n bits. We define a visual marker (pattern) Mk(b) as an image of size (k, k, 3) corresponding to the bit sequence b. To simplify the notation in what follows, we assume that bi ∈ {−1, +1}.
For the recognition of visual markers created by the synthesizing neural network, a recognizing neural network R(I; θR) with trainable parameters θR is created and used. This neural network receives an image I containing a visual marker and outputs an estimated sequence τ = {τ1, ..., τn}. The recognizing neural network interacts with the synthesizing neural network so as to satisfy the condition sign τi = bi, i.e. the signs of the numbers output by the recognizing neural network correspond to the bits encoded by the synthesizing neural network. In particular, recognition success can be measured using a simple loss function based on a sigmoid curve:
$$L(b, \tau) = -\frac{1}{n}\sum_{i=1}^{n} \sigma(b_i \tau_i), \qquad (1)$$

where σ is the sigmoid function, so that the loss lies between −1 (perfect recognition) and 0.
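A minimal PyTorch sketch of loss (1) may clarify the computation; the function and tensor names here are illustrative assumptions, not taken from the patent:

```python
import torch

def recognition_loss(bits: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # bits: target bit sequences with values in {-1, +1}, shape (B, n)
    # scores: raw outputs tau of the recognizing network, shape (B, n)
    # Each term -sigmoid(b_i * tau_i) approaches -1 for a confident correct
    # prediction and 0 for a confident error, so the mean lies in (-1, 0).
    return -torch.sigmoid(bits * scores).mean()
```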
In real life, marker recognition algorithms do not receive the marker images directly. Instead, the visual markers are embedded in the environment (for example, by printing them and placing them on environmental objects, or by showing them on electronic displays), after which their images are captured by some camera operated by a human or a robot.
Therefore, during the training of the recognizing and synthesizing neural networks, the transformation between the visual marker created by the synthesizing neural network and the image of this marker is simulated using a special feedforward network (the rendering neural network) T(M; φ), where the parameters φ of the rendering network are sampled during training and correspond to the variability of the background, variability of lighting, oblique perspective, the blur kernel, color changes / white balance of the camera, etc. During training, φ is sampled from some distribution Φ, which should model the variability of the above effects under the conditions in which the visual markers are intended to be used.
In the case when the only goal is reliable marker recognition, the learning process can be implemented as the minimization of the following functional:
$$\min_{\theta_S,\, \theta_R} \; \mathbb{E}_{b \sim U(n),\; \varphi \sim \Phi}\; L\bigl(b,\, R(T(S(b; \theta_S); \varphi); \theta_R)\bigr) \qquad (2)$$
Here the bit sequence b is sampled uniformly from U(n) = {−1, +1}^n and passed through the synthesizing, rendering and recognizing neural networks, and the loss function (1) is used to measure recognition success. The parameters of the synthesizing neural network and of the recognizing neural network are optimized so as to minimize the expectation of the loss. The minimization of expression (2) can then be performed using a stochastic gradient descent algorithm, such as ADAM [1]. Each iteration of the algorithm samples a mini-batch of different bit sequences together with a set of random parameters of the rendering network layers, and updates the parameters of the synthesizing and recognizing neural networks so as to decrease the loss (1) on these samples.
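The following PyTorch sketch illustrates one way such a training loop could look; the network architectures, sizes, and the noise model standing in for T are assumptions made for the example, not the patent's actual configuration:

```python
import torch
import torch.nn as nn

n, k, batch = 32, 16, 64                                   # bits, marker side, batch size

S = nn.Sequential(nn.Linear(n, 3 * k * k), nn.Sigmoid())   # toy synthesizer
R = nn.Sequential(nn.Flatten(), nn.Linear(3 * k * k, n))   # toy recognizer

def T(markers: torch.Tensor) -> torch.Tensor:
    # Stand-in rendering network: additive noise plays the role of the
    # nuisance parameters phi; a real T would warp and composite (see the
    # render() sketch further below).
    return (markers + 0.1 * torch.randn_like(markers)).clamp(0, 1)

opt = torch.optim.Adam(list(S.parameters()) + list(R.parameters()), lr=1e-3)

for step in range(1000):
    # sample b uniformly from the vertices of the Boolean cube {-1, +1}^n
    b = torch.randint(0, 2, (batch, n)).float() * 2 - 1
    markers = S(b).view(batch, 3, k, k)      # synthesize marker images
    photos = T(markers)                      # simulate photographing
    tau = R(photos)                          # per-bit scores
    loss = -torch.sigmoid(b * tau).mean()    # loss (1)
    opt.zero_grad(); loss.backward(); opt.step()
```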
In some embodiments, a localizing neural network is also added to the learning process (Fig. 8); it detects instances of markers in the video stream and determines their position in the frame (for example, finds the coordinates of their corners). The coordinates are converted into binary maps whose dimensions equal the shape of the input images. A binary map is zero everywhere except at the corner locations, where its value is one. The localizing network is trained to predict these binary maps, which can then be used to rectify the marker before feeding it to the recognizing neural network (Fig. 10), or to estimate the position of the camera relative to the marker in applications where such an estimate is needed. When such a localizing neural network is added to training, the synthesizing neural network adapts to create markers that stand out from the background and have easily identifiable corners.
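A minimal sketch of such a localizer is given below; the text specifies three layers and four corner point maps, while the channel counts and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class Localizer(nn.Module):
    # Predicts 4 point maps, one per marker corner, at the input resolution.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1),   # one output map per corner
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> corner maps: (B, 4, H, W); trained against
        # the binary corner maps, e.g. with nn.BCEWithLogitsLoss.
        return self.net(image)
```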
In some embodiments, a single marker or a small number of markers is created, substantially smaller than the number of bit sequences of any substantial length. In such embodiments, a synthesizing network is not used; in the optimization, the parameters of the synthesizing network are replaced directly by the pixel values of the markers (or marker). In these cases, as a rule, a localizing neural network is used, and the recognizing neural network is either implemented as a classifier over a number of classes equal to the number of markers, or is not used at all (in the single-marker variant). An example of markers trained in this embodiment is shown in FIG. 9.
As noted above, the components of the architecture, namely the synthesizing neural network, the rendering neural network, the recognizing neural network and the localizing neural network, can be implemented, for example, as feedforward networks or as other architectures that can be trained with the error backpropagation method. The recognizing network can be implemented as a convolutional neural network [2] with n outputs. The synthesizing neural network may also have a convolutional architecture (being a convolutional neural network), and so may the localizing neural network.
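For instance, a convolutional recognizing network with n outputs could be sketched as follows; the depths and kernel sizes are illustrative assumptions:

```python
import torch.nn as nn

def make_recognizer(n_bits: int) -> nn.Module:
    # Convolutional recognizer producing one raw score tau_i per encoded bit.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, n_bits),
    )
```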
To implement the rendering neural network T(M; φ) shown in FIG. 2, custom layers are required. The rendering neural network is implemented as a chain of layers, each of which introduces some "interfering" transformation. A special layer is also implemented that superimposes the input image (pattern) over a background image taken from a random set of images simulating the appearance of the surfaces onto which the trained markers may be applied in use. To implement geometric distortions, a spatial transformer layer is used [5]. Color or intensity changes can be realized by differentiable elementwise transformations (linear, multiplicative, gamma conversion). The interfering transformation layers can be applied sequentially, forming a rendering neural network that can simulate complex geometric and photometric transformations (Fig. 2).
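A sketch of such a chain in PyTorch is given below: pasting the pattern onto a random background, a random affine warp through a spatial transformer, and a crude blur. The parameter ranges and the box blur are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def render(marker: torch.Tensor, background: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # marker: (B, 3, k, k); background: (B, 3, s, s) with s >= k
    B, _, k, _ = marker.shape
    s = background.shape[-1]
    # superimpose the marker over the centre of the background image
    canvas = background.clone()
    off = (s - k) // 2
    canvas[:, :, off:off + k, off:off + k] = marker
    # random affine warp around the identity (cf. [1,0,0,0,1,0] + N(0, sigma))
    theta = torch.eye(2, 3).repeat(B, 1, 1) + sigma * torch.randn(B, 2, 3)
    grid = F.affine_grid(theta, list(canvas.shape), align_corners=False)
    warped = F.grid_sample(canvas, grid, align_corners=False)
    # crude blur: a depthwise 3x3 box filter; every step stays differentiable
    box = torch.full((3, 1, 3, 3), 1.0 / 9)
    return F.conv2d(warped, box, padding=1, groups=3)
```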
Interestingly, under variable conditions, optimizing objective (2) leads to markers that have a consistent and interesting visual texture (Fig. 3). Despite this visual "interestingness", it is desirable to control the appearance of the resulting markers more specifically, for example, through the use of sample images.
For such control, in some embodiments, the training objective (2) is supplemented with a loss function that measures the difference between the textures of the obtained markers and the texture of a sample image [6]. We briefly describe this loss function, introduced in [6]. Consider a feedforward network C(M; γ) that computes the output of the t-th convolutional layer of a network trained for large-scale image classification, such as VGGNet [7]. For an image M, the output of the network C(M; γ) contains k two-dimensional channels (maps). The network C uses parameters γ that are pre-trained on a large dataset and are not part of this learning process. The style of an image M is then characterized by the following Gram matrix G(M; γ) of size k × k, each element of which is defined as:
$$G_{ij}(M; \gamma) = \langle C_i(M; \gamma),\, C_j(M; \gamma) \rangle, \qquad (3)$$
where Ci and Cj are the i-th and j-th maps, and the scalar product is taken over all spatial positions. Given a texture prototype M0, the training objective can be supplemented with the following expression:
$$L_{\text{style}} = \bigl\| G(S(b; \theta_S); \gamma) - G(M_0; \gamma) \bigr\|_F^2 \qquad (4)$$
The inclusion of expression (4) makes the markers S(b; θS) created by the synthesizing neural network visually similar to the texture instances defined by the prototype M0 [6].
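A sketch of this texture term in PyTorch, using a truncated pre-trained VGG-16 as the fixed network C(M; γ); the choice of layer (relu2_2) is an assumption, and the mechanism follows [6]:

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

C = vgg16(weights=VGG16_Weights.DEFAULT).features[:9].eval()  # up to relu2_2
for p in C.parameters():
    p.requires_grad_(False)        # gamma is pre-trained and frozen

def gram(x: torch.Tensor) -> torch.Tensor:
    # x: (B, C, H, W) feature maps -> (B, C, C) Gram matrices, eq. (3);
    # the scalar product runs over all spatial positions.
    B, ch, H, W = x.shape
    f = x.view(B, ch, H * W)
    return f @ f.transpose(1, 2) / (H * W)

def style_loss(markers: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    # Squared Frobenius distance between Gram matrices, eq. (4).
    return (gram(C(markers)) - gram(C(prototype))).pow(2).mean()
```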
For longer bit sequences, some embodiments use an error-correcting coding method. Since the recognizing neural network returns a confidence value for each bit of the reconstructed signal, the claimed technical solution is suitable for any probabilistic error-correcting coding scheme.
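As an illustration of how soft per-bit scores combine with error correction, here is a sketch using a simple repetition code as a stand-in for an arbitrary probabilistic code; the synthesizer would then encode the expanded n·r-bit codeword:

```python
import torch

def encode_repetition(bits: torch.Tensor, r: int = 3) -> torch.Tensor:
    # bits: (B, n) in {-1, +1} -> (B, n*r) codeword, each bit repeated r times
    return bits.repeat_interleave(r, dim=1)

def decode_repetition(scores: torch.Tensor, r: int = 3) -> torch.Tensor:
    # scores: (B, n*r) raw recognizer outputs -> (B, n) decoded bits.
    # Summing the scores within each group is soft-decision majority voting.
    B, nr = scores.shape
    return scores.view(B, nr // r, r).sum(dim=2).sign()
```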
In some embodiments, for the experiments without a texture loss, a simple synthesizing neural network is used, which consists of a single linear layer (with a 3m² × n matrix and a bias vector) followed by an element-wise sigmoid function. In other embodiments, the synthesizing neural network has a convolutional form, taking the binary code as input and transforming it with one or more multiplicative layers and sets of convolutional layers. In the latter case, convergence during training benefits greatly from adding batch normalization [8] after each convolutional layer.
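A sketch of this simplest synthesizer as a PyTorch module (the class name is illustrative):

```python
import torch.nn as nn

class LinearSynthesizer(nn.Module):
    # One linear layer (a 3*m^2 x n weight matrix plus a bias vector)
    # followed by an element-wise sigmoid, as described above.
    def __init__(self, n_bits: int, m: int):
        super().__init__()
        self.m = m
        self.fc = nn.Linear(n_bits, 3 * m * m)

    def forward(self, bits):
        # bits: (B, n) in {-1, +1} -> marker images in [0, 1], shape (B, 3, m, m)
        return self.fc(bits).sigmoid().view(-1, 3, self.m, self.m)
```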
In some embodiments, the parameters of the rendering network can be chosen as follows. The spatial transformation is performed as an affine transformation whose 6 affine parameters are sampled from [1, 0, 0, 0, 1, 0] + N(0, σ) (assuming the origin at the center of the marker). An example for σ = 1 is shown in FIG. 2. Given an image x, the color conversion layer can then be implemented, for example, as the elementwise affine transform
$$x \mapsto (1 + \alpha)\,x + \beta, \qquad (5)$$

where the parameters α and β are sampled from the uniform distribution U[−δ, δ]. Since it has been observed that printed visual markers tend to have reduced contrast, a contrast reduction layer is added that converts each value x to k·x + (1 − k)·0.5 for a random k.
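The photometric layers can be sketched as follows; the affine form of the colour change mirrors the elementwise transform (5) above, and all operations remain differentiable in x:

```python
import torch

def color_jitter(x: torch.Tensor, delta: float = 0.2) -> torch.Tensor:
    # Elementwise affine colour change, per-image parameters from U[-delta, delta].
    B = x.shape[0]
    a = 1 + (torch.rand(B, 1, 1, 1) * 2 - 1) * delta   # multiplicative term
    c = (torch.rand(B, 1, 1, 1) * 2 - 1) * delta       # additive term
    return (a * x + c).clamp(0, 1)

def reduce_contrast(x: torch.Tensor) -> torch.Tensor:
    # k*x + (1 - k)*0.5 for a random k, modelling the contrast loss of printing.
    k = torch.rand(x.shape[0], 1, 1, 1)
    return k * x + (1 - k) * 0.5
```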
In some embodiments of the technical solution, the recognizing and localizing neural networks may be convolutional.
From the results of this technical solution shown in FIG. 4 it can be seen that the technical solution successfully recovers the encoded signals with a small number of errors. The number of errors can be reduced further by applying a set (ensemble) of recognizing neural networks, or by applying the recognizing neural network to several distorted versions of the image (test-time data augmentation).
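Test-time augmentation of this kind can be sketched in a few lines; `distort` stands for any mild random perturbation (e.g. a small-sigma version of the render sketch above), which is an assumption of this example:

```python
import torch

def recognize_tta(recognizer, image: torch.Tensor, distort, n_aug: int = 8) -> torch.Tensor:
    # Average the per-bit scores over several randomly distorted copies of
    # the image, then take the sign to decode the bits.
    scores = torch.stack([recognizer(distort(image)) for _ in range(n_aug)])
    return scores.mean(dim=0).sign()
```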
In some embodiments, to improve accuracy, the marker may be aligned with a predefined square (shown as part of the user interface in FIG. 5). As can be seen, the results deteriorate as the alignment error increases.
USED INFORMATION SOURCES
1. D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
2. Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
3. A. Dosovitskiy, J.T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
4. M.D. Zeiler, G.W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. Int. Conf. on Computer Vision (ICCV), pp. 2018-2025, 2011.
5. M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. Advances in Neural Information Processing Systems (NIPS), pp. 2017-2025, 2015.
6. L. Gatys, A.S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pp. 262-270, 2015.
7. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
8. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. International Conference on Machine Learning (ICML), pp. 448-456, 2015.
9. E. Olson. AprilTag: A robust and flexible visual fiducial system. IEEE International Conference on Robotics and Automation (ICRA), pp. 3400-3407, 2011.

Claims (42)

1. A method of producing a family of visual markers encoding information, comprising the following steps:
- forming a synthesizing neural network that translates a sequence of bits into images of visual markers;
- forming a rendering neural network that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations;
- forming a recognizing neural network that translates images containing visual markers into a sequence of bits;
- jointly training the synthesizing, rendering and recognizing neural networks by minimizing a loss function that reflects the probability of correct recognition of random bit sequences;
- synthesizing visual markers by passing bit sequences through the trained synthesizing neural network;
- receiving a set of images of visual markers from a video source;
- extracting the encoded bit sequences from the obtained set of images of visual markers by means of the recognizing neural network.
2. The method according to claim 1, characterized in that the rendering neural network converts input images of visual markers into images containing visual markers placed in the center of the background image.
3. The method according to claim 1, characterized in that the synthesizing neural network consists of one linear layer followed by an element-wise sigmoid function.
4. The method according to claim 1, characterized in that the synthesizing and/or recognizing neural network has a convolutional form.
5. The method according to claim 1, characterized in that during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
6. The method according to claim 1, characterized in that during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
7. The method according to claim 1, characterized in that the minimization of the loss function is performed using a stochastic gradient descent algorithm.
8. The method according to claim 1, characterized in that during the learning process the bit sequences are sampled uniformly from the set of vertices of the Boolean cube.
9. The method according to claim 1, characterized in that the synthesizing, rendering and recognizing neural networks are feedforward networks.
10. A method of producing a family of visual markers encoding information, comprising the following steps:
- creating variables corresponding to the pixel values of the created visual markers;
- forming a rendering neural network that converts the pixel values of visual markers into images containing visual markers by means of geometric and photometric transformations;
- forming a recognizing neural network that translates images containing visual markers into a sequence of bits;
- jointly training the synthesizing, rendering and recognizing neural networks by minimizing a loss function that reflects the probability of correct recognition of random bit sequences;
- synthesizing visual markers by creating raster images with the pixel values found as a result of training;
- receiving a set of images of visual markers from a video source;
- retrieving marker class numbers from the resulting set of visual marker images.
11. The method according to claim 10, characterized in that the rendering neural network converts input images of visual markers into images containing visual markers placed in the center of the background image.
12. The method according to claim 10, characterized in that during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
13. The method according to claim 10, characterized in that during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
14. The method according to claim 10, characterized in that the minimization of the loss function is performed using a stochastic gradient descent algorithm.
15. The method according to claim 10, characterized in that the rendering and recognizing neural networks are feedforward networks.
16. A method of producing a family of visual markers encoding information, comprising the following steps:
- creating variables corresponding to the pixel values of the created visual marker;
- forming a rendering neural network that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations;
- forming a localizing neural network that translates images containing the marker into marker position parameters;
- jointly training the synthesizing, rendering and localizing neural networks by minimizing a loss function that reflects the probability of correctly finding the marker position in the image;
- synthesizing visual markers by creating raster images with the pixel values found as a result of training;
- receiving a set of images of visual markers from a video source;
- extracting the encoded bit sequences from the obtained set of images of visual markers by means of a recognizing neural network.
17. The method according to claim 16, characterized in that the rendering neural network converts input images of visual markers into images containing visual markers placed in the center of the background image.
18. The method according to claim 16, characterized in that during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
19. The method according to claim 16, characterized in that during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
20. The method according to claim 16, characterized in that the minimization of the loss function is performed using a stochastic gradient descent algorithm.
21. The method according to claim 16, characterized in that the localizing, rendering and recognizing neural networks are feedforward networks.
RU2016122082A 2016-06-03 2016-06-03 Trained visual markers and the method of their production RU2665273C2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
RU2016122082A RU2665273C2 (en) 2016-06-03 2016-06-03 Trained visual markers and the method of their production

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2016122082A RU2665273C2 (en) 2016-06-03 2016-06-03 Trained visual markers and the method of their production
PCT/RU2017/050048 WO2017209660A1 (en) 2016-06-03 2017-06-05 Learnable visual markers and method of their production

Publications (3)

Publication Number Publication Date
RU2016122082A RU2016122082A (en) 2017-12-07
RU2016122082A3 RU2016122082A3 (en) 2018-07-13
RU2665273C2 true RU2665273C2 (en) 2018-08-28

Family

ID=60478901

Family Applications (1)

Application Number Title Priority Date Filing Date
RU2016122082A RU2665273C2 (en) 2016-06-03 2016-06-03 Trained visual markers and the method of their production

Country Status (2)

Country Link
RU (1) RU2665273C2 (en)
WO (1) WO2017209660A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2721190C1 (en) * 2018-12-25 2020-05-18 Общество с ограниченной ответственностью "Аби Продакшн" Training neural networks using loss functions reflecting relationships between neighbouring tokens
WO2021038227A1 (en) * 2019-08-27 2021-03-04 Zeta Motion Ltd Determining object pose from image data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5373566A (en) * 1992-12-24 1994-12-13 Motorola, Inc. Neural network-based diacritical marker recognition system and method
US20100251169A1 (en) * 2009-03-31 2010-09-30 Microsoft Corporation Automatic generation of markers based on social interaction
RU139520U1 (en) * 2013-10-28 2014-04-20 Арташес Валерьевич Икономов Device for creating a graphic code
US20140355861A1 (en) * 2011-08-25 2014-12-04 Cornell University Retinal encoder for machine vision
US20150161522A1 (en) * 2013-12-06 2015-06-11 International Business Machines Corporation Method and system for joint training of hybrid neural networks for acoustic modeling in automatic speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003140B2 (en) * 2003-11-13 2006-02-21 Iq Biometrix System and method of searching for image data in a storage medium
JP4479478B2 (en) * 2004-11-22 2010-06-09 株式会社日立製作所 Pattern recognition method and apparatus
US20090003646A1 (en) * 2007-06-29 2009-01-01 The Hong Kong University Of Science And Technology Lossless visible watermarking
US8370759B2 (en) * 2008-09-29 2013-02-05 Ancestry.com Operations Inc Visualizing, creating and editing blending modes methods and systems
US20160098633A1 (en) * 2014-10-02 2016-04-07 Nec Laboratories America, Inc. Deep learning model for structured outputs with high-order interaction

Also Published As

Publication number Publication date
RU2016122082A3 (en) 2018-07-13
WO2017209660A1 (en) 2017-12-07
RU2016122082A (en) 2017-12-07
