RU2791587C1

RU2791587C1 - Method for providing computer vision

Info

Publication number: RU2791587C1
Application number: RU2022113324A
Authority: RU
Inventors: Данила Дмитриевич РУХОВИЧ; Анна Борисовна ВОРОНЦОВА; Антон Сергеевич Конушин
Original assignee: Самсунг Электроникс Ко., Лтд.
Filing date: 2022-05-18
Publication date: 2023-03-10

Abstract

FIELD: navigation of mobile robots for mobile applications.

SUBSTANCE: the method performs scene understanding and object recognition. In the proposed method, a point cloud is taken as input, either reconstructed or obtained from depth sensors. The point cloud is first converted to a voxel representation. The voxel representation is then processed by a 3D sparse convolutional "skeleton". Next, a 3D sparse convolutional "neck" extracts features at different levels of the "skeleton". The resulting features are processed by a 3D sparse convolutional "head" whose weights are shared across feature levels. At each feature level, the "head" is responsible for predicting objects of a certain scale and outputs candidate bounding boxes for each position. Finally, all predictions are filtered to select the most likely ones.

EFFECT: increasing the accuracy of object detection and reducing the scene processing time.

3 cl, 4 dwg, 4 tbl

Description

Technical field of the invention

The invention relates to computer vision, in particular to the navigation of mobile robots and to mobile applications that perform scene understanding and object recognition.

Description of the Prior Art

3D object detection from point clouds aims to simultaneously localize and recognize 3D objects in a given set of 3D points. It is widely used in autonomous driving, robotics, and augmented reality as a core technique for 3D scene understanding.

Whereas 2D methods ([26], [32]) operate on dense fixed-size arrays, 3D methods must handle irregular, unstructured 3D data of arbitrary size. 2D data processing techniques therefore cannot be applied directly to 3D object detection, and 3D object detection methods ([10], [22], [19]) rely on novel approaches to 3D data processing.

The problem with convolutional 3D object detection methods is scalability: processing large-scale scenes requires either an unreasonably large amount of computing resources or too much time. Other methods resort to voxel representations and sparse convolutions; however, they solve the scalability problem at the expense of detection accuracy. In other words, no existing 3D object detection method provides accurate estimates and scales well at the same time.

Modern 3D object detection methods are designed for either indoor or outdoor use.

Indoor and outdoor methods are developed almost independently of each other, using domain-specific data processing techniques. Many modern outdoor methods [30], [13], [35] project 3D points onto a bird's-eye-view plane, thereby reducing 3D object detection to 2D object detection. These methods naturally benefit from rapidly evolving 2D object detection algorithms. Given a bird's-eye-view projection, [14] processes it in a fully convolutional manner, while [31] uses an anchor-free 2D approach. Unfortunately, these approaches, which have proved effective for both 2D object detection and outdoor 3D object detection, are difficult to adapt to indoor scenes, as this would require an unreasonably large amount of memory and computing resources. Various 3D data processing strategies have been proposed to address these performance issues. Three approaches currently dominate 3D object detection: voting-based, transformer-based, and 3D convolutional. Each of them is discussed in detail below, followed by a brief overview of anchor-free methods.

Voting-Based Methods

VoteNet [22] was the first method to introduce point voting for 3D object detection. VoteNet processes 3D points with PointNet [23], assigns a group of points to each object candidate according to their voting center, and computes object features from each group of points. Among the many successors of VoteNet, most progress comes from advanced grouping and voting strategies applied to PointNet features. BRNet [4] refines voting results using representative points from the vote centers, which improves the capture of fine local structural features. MLCVNet [29] introduces three context modules into the voting and classification stages of VoteNet to encode contextual information at different levels. H3DNet [33] improves the point group generation procedure by predicting a hybrid set of geometric primitives.

VENet [28] incorporates an attention mechanism and introduces a vote weighting module trained with a novel vote attraction loss.

All VoteNet-like voting-based methods are limited by design. First, they scale poorly: since their performance depends on the amount of input data, they tend to slow down as scenes grow larger. In addition, many voting-based methods implement voting and grouping strategies as custom layers, which makes these methods difficult to reproduce, debug, or port to mobile devices.

Transformer-Based Methods

Recently introduced transformer-based methods use end-to-end learning and feed-forward inference instead of heuristics and optimization, which makes them less domain-specific. In GroupFree [16], the VoteNet "head" is replaced with a transformer module that iteratively updates object query locations and aggregates intermediate detection results. 3DETR [19] was the first 3D object detection method implemented as an end-to-end trainable transformer. However, even the more advanced transformer-based methods still have scalability issues similar to those of the earlier voting-based methods. In contrast, the proposed method is fully convolutional, so it is faster and easier to implement than both voting-based and transformer-based methods.

3D Convolutional Methods

A voxel representation allows efficient processing of cubically growing sparse 3D data. Voxel-based 3D object detection methods ([12], [18]) convert points into voxels and process them with 3D convolutional networks. However, dense volumetric features still consume a lot of memory, and 3D convolutions are computationally expensive. Overall, processing large scenes is resource-intensive and cannot be done in a single pass.
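The point-to-voxel conversion mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical input layout of one row per point, [x, y, z, r, g, b], and an arbitrary voxel size of 0.05; the function name `voxelize` and the choice to average colors per voxel are illustrative assumptions, not details taken from the text.

```python
import numpy as np

def voxelize(points, voxel_size=0.05):
    """Quantize an (N, 6) array of [x, y, z, r, g, b] points into sparse voxels.

    Returns integer voxel coordinates (M, 3) and mean per-voxel colors (M, 3),
    with M <= N. Only occupied voxels are stored, which keeps the result sparse.
    """
    points = np.asarray(points, dtype=np.float64)
    # Integer grid index of the voxel that contains each point.
    grid = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    # Unique occupied voxels, plus the voxel index of every point.
    coords, inverse = np.unique(grid, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(coords))
    # Average the colors of all points that fall into the same voxel.
    feats = np.zeros((len(coords), 3))
    np.add.at(feats, inverse, points[:, 3:6])
    feats /= counts[:, None]
    return coords, feats
```

Storing only occupied voxels is what makes sparse convolutions applicable: memory then grows with the number of occupied cells rather than with the full volume of the scene.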

GSDN [10] addresses performance issues with sparse 3D convolutions. It has an encoder-decoder architecture in which both the encoder and the decoder are built from sparse 3D convolutional blocks. Compared to standard convolutional, voting-based, and transformer-based approaches, GSDN is significantly more memory-efficient and scales to large scenes without sacrificing point density. The main weakness of GSDN is its accuracy: the method is comparable to VoteNet in quality but falls significantly behind the current state of the art [16].

GSDN uses 15 aspect ratios of 3D object bounding boxes as anchors. When GSDN is trained in an anchor-free setting with a single aspect ratio, its accuracy drops by 12%. Unlike GSDN, the proposed method is anchor-free while still taking advantage of sparse 3D convolutions.

RGB-Based Anchor-Free Object Detection

In 2D object detection, anchor-free methods are strong competitors to standard anchor-based methods. FCOS [26] performs 2D object detection by per-pixel prediction and shows a significant improvement over its anchor-based predecessor, RetinaNet [15]. FCOS3D [27] trivially adapts FCOS to monocular 3D object detection by adding extra targets. ImVoxelNet [24] solves the same problem with an FCOS-like "head" built from standard (non-sparse) 3D convolutional blocks. The present invention adapts the ideas of these anchor-free methods to sparse, irregular data.
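To make the per-pixel, anchor-free idea more concrete, the sketch below computes the 2D centerness score that FCOS attaches to each location inside a ground-truth box. The function name and the (x_min, y_min, x_max, y_max) box layout are assumptions made for illustration, and returning 0.0 for locations outside the box is a convenience of this sketch rather than part of FCOS.

```python
import math

def fcos_centerness(px, py, box):
    """FCOS-style centerness of location (px, py) inside a 2D box.

    box is (x_min, y_min, x_max, y_max). The score is 1.0 at the box center
    and decays toward 0 near the borders.
    """
    x_min, y_min, x_max, y_max = box
    # Distances from the location to the four sides of the box.
    left, right = px - x_min, x_max - px
    top, bottom = py - y_min, y_max - py
    if min(left, right, top, bottom) <= 0:
        return 0.0  # sketch convention: location is not strictly inside the box
    return math.sqrt(
        (min(left, right) / max(left, right))
        * (min(top, bottom) / max(top, bottom))
    )
```

Locations near the box center receive scores close to 1, so low-quality boxes predicted from locations far from the center can be down-weighted at inference time.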

Besides being scalable and accurate, an ideal 3D object detection method should handle objects of arbitrary shape and size without additional hacks or hand-tuned hyperparameters. Prior assumptions about 3D object bounding boxes (e.g. aspect ratios or absolute sizes) restrict generalization and increase the number of hyperparameters and trainable parameters.

SUMMARY OF THE INVENTION

Recently, 3D object detection from 3D point clouds (computer vision) has been receiving increasing attention owing to its potential applications in promising areas such as robotics and augmented reality.

An anchor-free method is proposed that imposes no priors on objects and performs 3D object detection in a purely data-driven way. In addition, a novel parameterization of the oriented bounding box (OBB), inspired by the Möbius strip, is introduced, which reduces the number of hyperparameters. To demonstrate the effectiveness of this parameterization, the authors conducted experiments on SUN RGB-D with several 3D object detection methods and report improved results for all of them.
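This passage does not spell out the OBB parameterization, so the following is only a sketch of the underlying issue under stated assumptions. Seen from above, the same box can be written as (w, l, θ), (l, w, θ + π/2), (w, l, θ + π), or (l, w, θ + 3π/2). The helper `heading_free_encoding` is a hypothetical encoding, ln(w/l)·sin 2θ and ln(w/l)·cos 2θ, chosen here because it takes the same value for all four descriptions; it should not be read as the exact formula of the invention.

```python
import math

def obb_corners_bev(cx, cy, w, l, theta):
    """Top-view corners of an oriented box as a set of rounded (x, y) pairs."""
    c, s = math.cos(theta), math.sin(theta)
    corners = set()
    for dx, dy in ((w / 2, l / 2), (w / 2, -l / 2), (-w / 2, l / 2), (-w / 2, -l / 2)):
        # Rotate the local offset by theta and shift it to the box center.
        x = cx + dx * c - dy * s
        y = cy + dx * s + dy * c
        corners.add((round(x, 6), round(y, 6)))
    return corners

def heading_free_encoding(w, l, theta):
    """Hypothetical encoding that is identical for all four equivalent OBBs."""
    q = math.log(w / l)
    return (q * math.sin(2 * theta), q * math.cos(2 * theta))

def equivalent_descriptions(w, l, theta):
    """The four (w, l, theta) triples that describe the same top-view box."""
    return [
        (w, l, theta),
        (l, w, theta + math.pi / 2),
        (w, l, theta + math.pi),
        (l, w, theta + 3 * math.pi / 2),
    ]
```

A regression target built this way does not force the network to pick one of the four equivalent descriptions. Note that when w equals l the encoding is (0, 0) for every θ, so the angle of a square footprint cannot be recovered from it.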

A method for providing computer vision for a robotic device is proposed. The method is implemented in a computing device of the robotic device, the robotic device having a camera with depth sensors and an RGB camera, a central processor, internal memory, and RAM. The method includes the following steps:

capturing a real scene containing objects, using the camera with depth sensors and the RGB camera;

representing the captured real scene as a cloud of N points, each of which is represented by its coordinates and color; and

feeding the cloud of N points as input data into a neural network consisting of a "skeleton" part, a "neck" part, and a "head" part,

where the "skeleton" part, the "neck" part, and the "head" part are 3D sparse convolutional parts of the neural network;

performing the following steps in the neural network:

by means of the "skeleton" part, representing the point cloud as a voxel (volumetric pixel) representation of the input data;

by means of the "skeleton" part, processing the voxel representation of the input data to obtain four-dimensional tensors;

by means of the convolutional "neck" part, processing the four-dimensional tensors to extract numerical features of the objects in the scene;

by means of the "head" part, processing the extracted features to obtain predictions of the positions and categories of the objects in the scene, wherein for each object in the scene the "head" part outputs predictions, each prediction containing:

a classification probability of the object, regression parameters of the object's bounding box, and the centerness of the object within its bounding box;

by means of the "head" part, filtering all the obtained predictions by comparing them by classification probability to select the most probable prediction, the most probable prediction being taken as the final estimate of the position and category of the object in the scene and being characterized by data on the position and category of the object in the scene;

outputting from the neural network the data on the position, orientation, and category of the object in the scene as a numerical representation of the scene, which constitutes computer vision for the robotic device.
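The filtering step in the list above compares predictions by classification probability and keeps the most probable one for each object. One common way to realize such a step is a score threshold followed by greedy non-maximum suppression, sketched below. The use of NMS, the axis-aligned box layout [x_min, y_min, z_min, x_max, y_max, z_max], the threshold values, and the function names are all assumptions of this sketch; the text does not state which procedure is actually used.

```python
import numpy as np

def iou_3d_axis_aligned(a, b):
    """IoU of two boxes given as [x_min, y_min, z_min, x_max, y_max, z_max]."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    # Overlap volume is zero as soon as the boxes are disjoint along any axis.
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def filter_predictions(boxes, scores, score_thr=0.3, iou_thr=0.5):
    """Return indices of kept predictions, highest classification score first."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    # Visit candidates from most to least probable, skipping weak ones.
    order = [int(i) for i in np.argsort(-scores) if scores[i] >= score_thr]
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a stronger kept box.
        if all(iou_3d_axis_aligned(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```

Oriented boxes would need a rotated-IoU routine in place of `iou_3d_axis_aligned`; the axis-aligned version is used here only to keep the sketch short.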

The method may further comprise an additional step of converting the numerical representation of the scene into an image of the scene, wherein the computer device further comprises a screen and the image of the scene is displayed on the screen for the user.

A computer-readable medium is also proposed, containing program code that carries out the method according to claim 1 when executed in a computer device.

The method described above, performed by an electronic device, may be carried out using artificial intelligence (AI). An AI-related function may be performed by means of non-volatile memory, volatile memory, and a processor.

The processor may include one or more processors. The one or more processors may be general-purpose processors such as a central processing unit (CPU) or an application processor (AP), graphics-only processing units such as a graphics processing unit (GPU) or a vision processing unit (VPU), and/or AI-dedicated processors such as a neural processing unit (NPU).

The one or more processors control the processing of input data in accordance with a predefined operating rule or artificial intelligence (AI) stored in the non-volatile memory and the volatile memory. The predefined operating rule or AI is provided through training.

Here, being provided through training means that a predefined operating rule or AI with the desired characteristics is created by applying a learning algorithm to a plurality of training data. The training may be performed in the very device in which the AI according to an embodiment runs, and/or may be implemented by a separate server or system.

The AI may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its layer operation through a computation between the output of the previous layer and the plurality of weights. Examples of neural networks include, without limitation, a convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks. The neural network may be implemented in hardware or in a combination of software and hardware.
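As a minimal illustration of a layer operation computed from the previous layer's output and the layer's weights, the sketch below implements a fully connected layer. The function name, the bias term, and the ReLU activation are assumptions chosen for brevity; the text does not prescribe any particular layer type.

```python
import numpy as np

def dense_layer(prev_output, weights, bias):
    """One layer: combine the previous layer's output with this layer's weights.

    Computes relu(prev_output @ weights + bias).
    """
    return np.maximum(prev_output @ weights + bias, 0.0)
```

Stacking such calls, with each layer consuming the output of the one before it, gives the multi-layer structure described above.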

A learning algorithm is a method of training a given target device (e.g. a robot) on a plurality of training data so as to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, without limitation, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

According to the invention, the object recognition method can obtain recognition output by using image data as input to the artificial intelligence. The artificial intelligence can be obtained through training. Here, "obtained through training" means that a predefined operating rule or artificial intelligence configured to perform the required function (or purpose) is obtained by training a basic artificial intelligence on a plurality of pieces of training data with a training algorithm. The artificial intelligence may include a plurality of neural network layers. Each of the neural network layers includes a plurality of weight values and performs a neural network computation between the result computed by the previous layer and the plurality of weight values.

Visual understanding is a technique for recognizing and processing things in the way human vision does, and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, and image enhancement.

According to the invention, the proposed method may use artificial intelligence that operates on data. The processor may pre-process the data to convert it into a form suitable for use as input to the artificial intelligence model. The artificial intelligence can be obtained through training. Here, "obtained through training" means that a predefined operating rule or artificial intelligence configured to perform the required function (or purpose) is obtained by training a basic artificial intelligence on several pieces of training data with a training algorithm. The artificial intelligence may include a plurality of neural network layers. Each of the neural network layers includes a plurality of weight values and performs a neural network computation between the result computed by the previous layer and the plurality of weight values.

Reasoning prediction is a technique for making logical inferences and predictions from determined information, and includes, for example, knowledge-based reasoning, optimization prediction, preference-based planning, and recommendation.

Brief description of the drawings

The above and/or other aspects will become more apparent from the description of exemplary embodiments with reference to the accompanying drawings, in which:

Fig. 1 is a general scheme of the proposed method.

Fig. 2 shows examples of objects with an ambiguous heading angle.

Fig. 3 shows the result of the proposed method on the ScanNet dataset.

Fig. 4 shows detection accuracy versus inference speed, measured in frames per second, for the original and modified FCAF3D in comparison with known 3D object detection methods.

Detailed description

The present invention relates to computer vision and is a first-in-class fully convolutional anchor-free method for indoor 3D object detection, called FCAF3D. FCAF3D is a simple, efficient, and scalable method for detecting 3D objects in point clouds.

A method for providing computer vision for a robotic device is proposed. The method can be implemented in the computing device of the robotic device. The robotic device comprises a camera with depth sensors and an RGB camera, a central processor, internal memory, and RAM. The depth camera and the RGB camera capture a real scene containing 3D objects. The proposed method detects the 3D objects in the captured scene and recognizes their category and position.

The proposed solution is designed for scene analysis and object recognition. Its results can be used in a wide range of tasks in which decisions are made based on the scene and its objects. For example, software based on the proposed method can provide mobile robotic devices (e.g., mobile navigation robots) with spatial information for trajectory planning and for grasping and manipulating objects. In addition, the proposed solution can be used in a mobile application to automatically generate hints about the scene.

The proposed solution is intended to detect and recognize 3D objects and estimate their spatial position. The problem statement corresponds to the classical formulation of the 3D object detection problem established by the computer vision research community.

The proposed solution is intended to be implemented in mobile robotic devices that have a computing device. The computing device of the robotic device comprises a camera with depth sensors and an RGB camera, a central processor, internal memory, RAM, and a screen.

The invention can also be implemented in smartphones as part of a mobile application. A medium, for example a computer-readable medium, may be used to implement the proposed method. The computer-readable medium contains program code that carries out the proposed method when executed on a computing device. In particular, the computer-readable medium must have sufficient RAM and computing resources. The required amount of resources depends on the function of the device, the camera parameters, and the performance requirements.

It should be noted that 3D object detection traditionally relies on a predefined set of 3D bounding boxes, called anchors. Such a set can be viewed as a set of a priori hypotheses about the positions of objects in the scene and their sizes. Using such a set of hypotheses makes it possible to detect objects in 3D space by selecting the most probable hypotheses and refining them. However, a priori hypotheses do not always describe the real objects in a particular scene well, so the use of anchors limits the applicability of an object detection method. Initially, all object detection methods used anchors. Recently, a new approach was described that avoids anchors when solving the object detection problem, which allows a more universal solution to be developed. Over the past few years, a whole class of methods that do not use anchors has emerged; they can be called "anchor-free". They now compete with traditional anchor-based methods.

The proposed convolutional anchor-free method for indoor 3D object detection is a simple but effective method that uses a voxel representation of the point cloud (the input data) and processes the voxels with sparse convolutions.

In accordance with the proposed method, a set of Npts RGB-colored points is taken as input, and a set of 3D object bounding boxes is output. The Npts RGB-colored points are a set of N points, each represented by its coordinates and color. Each point in three-dimensional space is defined by three spatial coordinates and by a color in the RGB palette. A set of points in three-dimensional space is also called a point cloud. This cloud of N points can be obtained by processing the captured real scene with the objects in it, and it is fed into the neural network as input data. The point coordinates are real numbers.

A voxel is the conventional abbreviation for a volumetric pixel, i.e. a three-dimensional pixel. A pixel in a 2D image is the basic "cell" of the image. A pixel is an element of the image discretization: the image is divided into equal sections (elements) by a regular grid, and each element is a square aligned with the sides of the image. By analogy with a pixel, a voxel is a part of three-dimensional space bounded by a parallelepiped aligned with the coordinate axes, obtained by dividing the space into grid elements. However, a voxel is defined more flexibly than a pixel: the space does not have to be divided by a regular three-dimensional grid, and the grid elements can be located in space arbitrarily. The center of a voxel is determined by averaging the coordinates of all points of the point cloud that fall within one grid element (i.e. the x, y, and z coordinates are each averaged). Thus each grid element corresponds to its own voxel. The result is a set of voxels that is as sparse and irregular as the original set of points in 3D space.
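The voxelization step described above can be illustrated with a minimal sketch. It assumes a regular cubic grid for simplicity (the description notes that the grid need not be regular); the function name `voxelize`, the parameter `voxel_size`, and the choice to average colors together with coordinates are illustrative assumptions, not details taken from the description.

```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size):
    """Group (x, y, z, r, g, b) points by grid cell and average each group.

    Assumption: a regular cubic grid with edge length `voxel_size`.
    Returns a dict mapping integer cell indices to the averaged point.
    """
    cells = defaultdict(list)
    for p in points:
        # Integer index of the grid element that contains the point.
        key = tuple(floor(c / voxel_size) for c in p[:3])
        cells[key].append(p)

    voxels = {}
    for key, members in cells.items():
        n = len(members)
        # The voxel center is the mean of the member coordinates;
        # averaging the colors as well is an assumption of this sketch.
        voxels[key] = tuple(sum(m[i] for m in members) / n for i in range(6))
    return voxels


cloud = [
    (0.1, 0.1, 0.1, 255, 0, 0),
    (0.3, 0.3, 0.3, 0, 0, 255),
    (1.2, 0.1, 0.1, 0, 255, 0),
]
voxels = voxelize(cloud, voxel_size=0.5)
```

Only occupied cells appear in the result, so the output stays as sparse as the input cloud.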

If the voxels are organized into a regular grid, one speaks of a voxel volume; in the absence of a regular structure, one speaks of a voxel representation. In addition, a voxel does not necessarily have the same spatial dimensions along all three axes: it may be an arbitrary parallelepiped rather than a cube, although cubic voxels are often used for computational convenience.

The proposed FCAF3D method is capable of processing large-scale scenes with minimal runtime and memory in a single fully convolutional feed-forward pass and does not require a heuristic post-processing step. Existing 3D object detection methods rely on prior assumptions about the geometry of objects, and any geometric priors limit the generalization ability of a method. Instead, the inventors propose a new parameterization of oriented bounding boxes (OBB) that yields better results without any priors. The proposed method achieves state-of-the-art 3D object detection results in terms of mAP@0.5 on the ScanNet V2 (+4.5), SUN RGB-D (+3.5), and S3DIS (+20.5) datasets. mAP@0.5 is a standard metric for evaluating the quality of 3D object detection. Its values range from 0 to 100, and a higher mAP@0.5 indicates higher quality. On the S3DIS dataset, FCAF3D outperforms competing methods by a large margin.
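As background for the metric: in 3D detection benchmarks, mAP@0.5 counts a predicted box as a correct detection when its intersection-over-union (IoU) with a ground-truth box of the same category is at least 0.5. The sketch below computes IoU for axis-aligned boxes only. The function name `iou_3d` and the corner-based box encoding are assumptions made for illustration, and rotated boxes would need a more involved intersection computation.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        # Length of the overlap interval along this axis.
        overlap = min(box_a[axis + 3], box_b[axis + 3]) - max(box_a[axis], box_b[axis])
        if overlap <= 0:
            return 0.0  # the boxes do not intersect
        intersection *= overlap

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


unit_cube = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
shifted_cube = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)
score = iou_3d(unit_cube, shifted_cube)  # below the 0.5 threshold
```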

Thus, the present invention makes the following contributions to the state of the art:

- a first-in-class fully convolutional anchor-free 3D object detection method (FCAF3D) for indoor scenes is proposed;

- a new OBB parameterization is presented and proven to improve the accuracy of several existing 3D object detection methods on SUN RGB-D;

- the proposed method significantly outperforms the prior art in terms of mAP on the challenging large-scale indoor datasets ScanNet, SUN RGB-D, and S3DIS, while also offering faster inference.

The task of 3D object detection (computer vision) is to detect and recognize three-dimensional objects and estimate their spatial position in the scene. 3D objects have complex, varied, and sometimes changeable shapes, which often cannot be described parametrically (by equations). Therefore, in the standard formulation of the object detection problem, the shape, size, and position of objects are modeled by a simple solid figure, a parallelepiped or "box". Such boxes are called 3D bounding boxes. They are defined by the 3D coordinates of the box center and by the width, height, and length of the box. For simplicity, it is assumed that all such parallelepipeds are aligned with the coordinate axes of three-dimensional space, i.e. each of their edges is parallel to one of the axes. Sometimes the problem is solved in a more complex formulation in which the parallelepipeds may be rotated in the horizontal plane; in that case, object detection additionally includes determining the orientation of the object, i.e. its rotation angle. Each object also has a category label.
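The box parameterization just described (center, three extents, an optional rotation in the horizontal plane, and a category label) can be written down as a small data structure. This is a sketch under stated assumptions: the class name `Box3D`, the field names, the choice of z as the vertical axis, and the mapping of length, width, and height to the x, y, and z axes are all illustrative and not specified in the description.

```python
from dataclasses import dataclass
from math import cos, sin, pi


@dataclass
class Box3D:
    cx: float
    cy: float
    cz: float
    length: float      # extent along x before rotation (assumed convention)
    width: float       # extent along y before rotation (assumed convention)
    height: float      # extent along the vertical z axis (assumed convention)
    yaw: float = 0.0   # rotation in the horizontal plane, in radians
    label: str = ""    # category label

    def corners(self):
        """Return the 8 corner points of the box as (x, y, z) tuples."""
        c, s = cos(self.yaw), sin(self.yaw)
        result = []
        for sx in (-0.5, 0.5):
            for sy in (-0.5, 0.5):
                for sz in (-0.5, 0.5):
                    dx, dy = sx * self.length, sy * self.width
                    # Rotate the horizontal offset about the vertical axis.
                    result.append((
                        self.cx + c * dx - s * dy,
                        self.cy + s * dx + c * dy,
                        self.cz + sz * self.height,
                    ))
        return result


axis_aligned = Box3D(0.0, 0.0, 0.0, length=2.0, width=1.0, height=1.0, label="table")
rotated = Box3D(0.0, 0.0, 0.0, length=2.0, width=1.0, height=1.0, yaw=pi / 2, label="table")
```

With `yaw` left at zero the box is axis-aligned, which corresponds to the simpler formulation of the problem.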

The category of an object is specified in the annotation (markup) of the datasets. The annotation is the ground truth, i.e. the reference information contained in the original datasets. In this case, the annotation is a set of 3D bounding boxes of objects together with the object categories for each point cloud in the dataset. The annotation is produced with the help of human annotators when the dataset is created by its authors. The annotation is used to train the neural network: to learn to make predictions from input data, the network must see a number of training examples of the form (input data, ground-truth output data contained in the annotation) and find a pattern that relates the input to the output.

The categories are obtained through expert assessment by an annotator who is given the 3D point clouds at the training data collection stage. These 3D point clouds are obtained from a set of RGB images and their corresponding depth maps, i.e. depth sensor measurements. Annotation is performed with special software that visualizes point clouds on the screen and allows them to be manipulated without changing the source data (rotated, moved, and scaled for viewing from different sides). The annotator marks the positions of objects in space, by clicking on the desired region of space or otherwise, as the parallelepipeds that enclose them, i.e. 3D bounding boxes.

The proposed FCAF3D architecture consists of a "skeleton" part, a "neck" part, and a "head" part; these are commonly accepted terms in this field of technology (the "skeleton" is usually called the backbone in the English-language literature). The parts of an object detection neural network are designated as "skeleton", "neck", and "head" in articles describing 2D/3D object detection methods such as FCOS, ATSS, ImVoxelNet, and FCOS3D. The "skeleton", "neck", and "head" parts are sparse 3D convolutional parts of the neural network.

The "skeleton" part of a neural network refers to a pre-trained neural network or a portion of one. Neural networks used as a "skeleton" are trained on large amounts of visual data, usually by solving an image classification task. As a result of this training, they acquire the ability to capture patterns in visual data. This ability can be used not only for image classification but also for many other computer vision tasks. The standard approach to designing a neural network for a computer vision task is to take a pre-trained "skeleton" and replace some of its layers, those intended for classification, with other layers intended for the target task.

The "neck" part of the neural network takes as input the output of the "skeleton", a four-dimensional tensor, and also returns four-dimensional tensors.

The "head" part of the neural network is the final layer of the neural network, which produces predictions of the position, orientation, and category of objects in the scene. The "head" outputs a prediction for each object in the scene. Each prediction contains the object's classification probability, the regression parameters of the object's bounding box, and the centerness of the object within its bounding box.

The division into the "neck" and "head" parts is conventional and formal, since each of these parts consists of neural network layers of similar type and purpose. In the present invention, these layers are sparse convolutional layers.

In developing the proposed FCAF3D, a sparse convolutional network of the GSDN type was chosen for scalability. To improve generalization, the number of hyperparameters in this network that must be tuned manually has been reduced; in particular, sparsity pruning in the "neck" is simplified. In addition, a simple multi-level location assignment is introduced in the "head" part. Finally, the limitations of existing 3D bounding box parameterizations are discussed, and a new parameterization is proposed that improves both accuracy and generalization ability.

ResNet is a family of neural network architectures widely used for computer vision tasks. The ResNet family includes both lightweight architectures and more powerful architectures with a larger number of trainable parameters. Lightweight architectures are intended for cases where computing resources are limited and/or speed is important. More powerful architectures with more trainable parameters solve the target task with better quality than the lightweight ones; they are chosen when quality is the top priority and substantial computing resources are available. All these architectures follow the same principle: they contain identical or similar computational blocks connected to one another in a particular way. The ResNet family thus consists of neural network architectures with different numbers of such computational blocks. Recently, a method was described for modifying a ResNet-family architecture so that it can process sparse three-dimensional data (such as point clouds), although the original ResNet architectures were designed to process two-dimensional data (images). The computational blocks of a ResNet-family architecture (like those of any other neural network architecture) consist of layers, and in ResNet these blocks contain two-dimensional convolutional layers. The modification consists in replacing all 2D convolutional layers with 3D convolutional layers. Applying this modification to every architecture of the ResNet family yields a family of three-dimensional sparse ResNet architectures. The proposed method is implemented in a modified neural network of the ResNet family.

To implement the computer vision method, a real scene with the objects in it is captured using a camera with depth sensors and an RGB camera. The computing device represents the captured real scene as a cloud of N points, each represented by its coordinates and color (a cloud of Npts RGB-colored points). This cloud of N points is fed into the neural network as input data.

The depth sensor (depth camera) measures the distance to points in the scene and outputs the measurements as a dense two-dimensional map, each pixel of which contains the distance from the depth camera to some point in the scene. Next, it is necessary to determine how pixel coordinates in the depth map relate to point coordinates in three-dimensional space, in other words, how the depth map is mapped into three-dimensional space. This requires knowing the parameters of the depth camera, which determine the form of this mapping. The camera parameters are specified explicitly in the datasets used for the experiments in this work. When the proposed 3D object detection method is applied in real conditions, the camera parameters can either be estimated separately by any camera parameter estimation method (a standard procedure also known as camera calibration) or be specified explicitly, for example as the characteristics of a particular depth camera model.
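The mapping from depth-map pixels to 3D points can be sketched as follows. The description only says that the mapping is determined by the camera parameters; this sketch assumes a pinhole camera model with focal lengths `fx`, `fy` and principal point `cx`, `cy`, and it assumes that each depth value is the distance along the optical axis. The function name `backproject` and the convention that non-positive depth marks a missing measurement are likewise illustrative assumptions.

```python
def backproject(depth, fx, fy, cx, cy):
    """Map a depth map (list of rows) to 3D points in the depth-camera frame.

    Assumes a pinhole model: pixel (u, v) with depth z maps to
    ((u - cx) * z / fx, (v - cy) * z / fy, z).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # assumed convention: no valid measurement at this pixel
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


depth_map = [
    [0.0, 2.0],
    [2.0, 2.0],
]
points = backproject(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the four intrinsic values would come from the dataset metadata or from camera calibration, as noted above.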

A real scene with the objects in it is captured by a camera with depth sensors and an RGB camera as a set of N points, each of which is additionally described by its coordinates and color. This requires that the pixels of the RGB image and the pixels of the depth map map to the same points in 3D space, so the RGB image pixels must be registered with the depth map pixels. To do this, the depth map pixels are mapped into 3D space (using the depth camera parameters) and then projected onto the image plane (using the RGB camera parameters). The result of this procedure is a depth map aligned pixel by pixel with the RGB image. Each point in 3D space mapped from a depth map pixel is assigned the RGB values recorded in the RGB image pixel corresponding to that depth map pixel.

The procedure described above obtains a point cloud from one RGB image and one depth map. The measurements of any depth sensor are inexact. However, accuracy can be improved when several measurements of the same region of 3D space, taken from different viewpoints, are available. In that case, information from several measurements can be aggregated to correct the measurements in individual depth maps or, for example, to identify some measurements as random outliers caused by the imperfection of the measuring device and remove them. Such aggregation also improves the consistency of measurements within one 3D scene and yields a single point cloud describing the entire scene rather than a set of separate point clouds, one per RGB image and depth map. A number of methods for aggregating RGB images and depth maps have been developed, including simultaneous localization and mapping (SLAM) methods, truncated signed distance function (TSDF) integration methods, and others. The captured point cloud of Npts RGB-colored points is fed into the neural network.

A neural network consists of layers, each of which takes a tensor as input, computes a function on it, and returns a result that is also a tensor. Layers may be connected sequentially and/or in parallel. The way the layers are assembled into a neural network is called the neural network architecture.

FIG. 1 shows the layers of the neural network. The point cloud of Npts RGB-colored points is fed into the layers of the neural network. In FIG. 1 the convolutional layer is labeled "Conv0"; this layer computes a function called "convolution" (a standard mathematical term). The pooling layer is labeled "Pooling" in FIG. 1; it computes a local maximum. The output of "Conv0" and "Pooling" is a sparse 3D tensor.

The sparse neural network for the proposed FCAF3D method is shown in FIG. 1. Its operation is described below.

The "skeleton" part

The "skeleton" part of FCAF3D is a sparse modification of ResNet [11] in which all 2D convolutions are replaced by sparse 3D convolutions. The family of sparse high-dimensional versions of ResNet was first presented in [5]; for brevity, the authors call them HDResNet.

The "skeleton" part performs the following:

representing the point cloud as a volumetric pixel (voxel) representation of the input data: the point cloud of Npts RGB-colored points is converted to a voxel representation in the "Conv0" and "Pooling" layers;

processing the voxel representation of the input data with residual blocks to obtain four-dimensional tensors.
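The voxel representation mentioned above can be pictured as a sparse mapping from integer voxel indices to features. The sketch below is only an illustration: the description says the conversion happens in the "Conv0" and "Pooling" layers, whereas this example uses a plain grid quantization with color averaging, and the name `voxelize` and the `voxel_size` parameter are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

ColoredPoint = Tuple[float, float, float, float, float, float]  # x, y, z, r, g, b
VoxelIndex = Tuple[int, int, int]


def voxelize(points: Iterable[ColoredPoint],
             voxel_size: float) -> Dict[VoxelIndex, Tuple[float, float, float]]:
    """Quantize points to a grid and average the colors falling in each voxel.

    Only occupied voxels are stored, which is what makes the result sparse.
    """
    sums: Dict[VoxelIndex, list] = {}
    for x, y, z, r, g, b in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        acc = sums.setdefault(key, [0.0, 0.0, 0.0, 0])
        acc[0] += r
        acc[1] += g
        acc[2] += b
        acc[3] += 1
    return {k: (a[0] / a[3], a[1] / a[3], a[2] / a[3]) for k, a in sums.items()}
```

Storing only occupied voxels keeps memory proportional to the number of points rather than to the volume of the scene.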

As shown in FIG. 1, block 1, block 2, block 3, and block 4 are residual blocks. A residual block is a computational unit of neural network architectures in the ResNet family. It consists of several layers of different types: convolutional layers, layers that perform normalization within a mini-batch (batch normalization), and an activation layer. The distinguishing feature of this unit is a specially organized connection between layers, called a skip connection or residual connection, which gave its name both to the unit (residual block) and to the ResNet family of architectures (residual networks).

Residual blocks are blocks with skip connections that learn residual functions with reference to the layer inputs instead of learning unreferenced functions. They were introduced as part of the ResNet architecture. Formally, denoting the desired underlying mapping as H(x), the stacked non-linear layers are made to fit another mapping, F(x) = H(x) - x. The original mapping is then recast as F(x) + x. F(x) acts as a residual, hence the name "residual block". The intuition is that it is easier to optimize the residual mapping than the original unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of non-linear layers. Skip connections make it easier for the network to learn identity-like mappings.
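The F(x) + x formulation can be shown in a few lines. This is a minimal sketch on plain lists, with a hypothetical `residual_block` helper; real residual blocks apply convolution, normalization, and activation layers inside F.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x: the residual branch output plus the skip connection."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]


# If the residual branch outputs zeros, the block reduces to the identity.
identity_like = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))
```

The last line illustrates the extreme case from the text: driving the residual to zero is enough to reproduce an identity mapping.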

The "neck" part

Features of the scene objects, expressed in numerical form, are extracted from different layers of the "skeleton" by the "neck" part. The "neck" processes the four-dimensional tensors produced by the "skeleton" to extract these numerical features. Features here mean descriptions, representations, or descriptors of objects in a format that a computer device can work with, that is, in numerical form, as a set of numbers. These numbers can be organized as multidimensional arrays, that is, tensors.

The values of these numbers are hard to interpret and are not intuitive: in general, one cannot point to a specific number and claim that it encodes a particular property of an object. In this format, the features extracted by the "neck" are four-dimensional tensors.

The proposed "neck" is a simplified GSDN decoder. The features at each level of the "neck" are processed by one sparse transposed 3D convolution and one sparse 3D convolution.

Each sparse transposed 3D convolution with a kernel size of 2 can increase the number of non-zero values by a factor of 2^3 = 8. To prevent memory consumption from growing rapidly, GSDN uses a pruning layer that filters its input with a probability mask.
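The factor of 2^3 can be checked by counting output coordinates. The sketch below assumes a stride equal to the kernel size, so that every active input voxel expands into a disjoint 2×2×2 block; the stride and the helper name `transposed_conv_coords` are assumptions for illustration only.

```python
from itertools import product
from typing import Iterable, Set, Tuple

VoxelIndex = Tuple[int, int, int]


def transposed_conv_coords(active: Iterable[VoxelIndex],
                           kernel: int = 2) -> Set[VoxelIndex]:
    """Output coordinates touched by a transposed convolution.

    Assumes stride == kernel, so each input voxel (i, j, k) produces the
    kernel**3 voxels of the block starting at (kernel*i, kernel*j, kernel*k).
    """
    out: Set[VoxelIndex] = set()
    for i, j, k in active:
        for di, dj, dk in product(range(kernel), repeat=3):
            out.add((kernel * i + di, kernel * j + dj, kernel * k + dk))
    return out
```

With two active input voxels the function returns 16 coordinates, that is, 8 per input voxel, which is why unpruned upsampling quickly becomes memory-hungry.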

In GSDN, the feature-level probabilities are computed by an additional convolutional scoring layer. This layer is trained with a special loss that enforces consistency between the predicted sparsity and the anchors. Specifically, the sparsity of a voxel is set to positive if any of the subsequent anchors associated with that voxel is positive. However, using this loss may be suboptimal, because distant voxels of an object may be assigned a low probability.

For simplicity, the scoring layer and its loss are removed, and the probabilities from the classification layer in the "head" are used instead. Rather than tuning a probability threshold, at most Nvox voxels are kept to control the sparsity level, where Nvox equals the number of input points Npts. This is a simple yet elegant way to keep the number of non-zero voxels from growing, since reusing the same hyperparameter makes the process more transparent and consistent.
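Keeping at most Nvox voxels amounts to a top-k selection by probability. The following is a minimal sketch, assuming that each voxel carries a single score (for example, its highest classification probability); the scoring rule and the name `prune_voxels` are hypothetical.

```python
from typing import Dict, Set, Tuple

VoxelIndex = Tuple[int, int, int]


def prune_voxels(scores: Dict[VoxelIndex, float], n_vox: int) -> Set[VoxelIndex]:
    """Keep the n_vox voxels with the highest score; keep all if there are fewer."""
    if len(scores) <= n_vox:
        return set(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_vox])
```

Because the budget is a count rather than a threshold, the memory used after pruning is bounded regardless of how the predicted probabilities are distributed.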

As shown in FIG. 1, each of the layers Conv1, Conv2, Conv3, and Conv4 is a convolutional layer. The convolutional layers of a neural network convolve their input and pass the result to the next layer. Each convolutional neuron processes data only within its receptive field. Convolutional neural networks are widely used for data with a grid-like topology (such as images), because convolution takes into account the spatial relationships between individual features.

As shown in FIG. 1, TransConv1, TransConv2, and TransConv3 are transposed convolutional layers. These are standard transposed convolutional layers: each takes a tensor as input, computes a convolution on it, and returns the result, which is also a tensor. Its parameters form the so-called convolution kernel, which is tuned during training of the neural network. In essence it is a convolutional layer, but one that can increase the spatial size of the input tensor, either by increasing its sparsity or by duplicating values.

As shown in FIG. 1, "Pruning" is a pruning layer. The pruning layer is a non-standard layer used in GSDN. It takes a sparse 3D tensor and filters it with a probability mask. In GSDN, the feature-level probabilities are computed by an additional convolutional scoring layer.

The shared "head" part

The "head" of the anchor-free FCAF3D consists of three parallel sparse convolutional layers (see FIG. 1) with weights shared across feature levels.

The features extracted at the previous stage are processed by the sparse 3D convolutional "head", whose weights are shared across feature levels. The "head" processes the extracted features of the scene objects to predict the positions, orientations, and categories of the objects in the scene, and outputs predictions for each object. A prediction includes the object classification probability, the regression parameters of the object's bounding box, and the centerness of the position within that bounding box. The predictions are then filtered by comparing their classification probabilities in order to select the most probable one. The selected prediction is taken as the final estimate of the position, orientation, and category of the object in the scene and is expressed as the corresponding position, orientation, and category data. This numerical representation of the scene is used by the robotic device.

At every stage of computation, the neural network processes a set of point coordinates in three-dimensional space. The input is a set of point coordinates, and the input and output of each layer are data represented as three-dimensional sparse tensors. All points lie in the same 3D space, but the number of points and their coordinates change during the computation. Positions are the coordinates of the points that arise during the computation. They are not exactly the coordinates of the input points, but they lie somewhere between them, in roughly the same region of space.

The coordinate system (coordinate grid) is defined by the coordinates of the points in three-dimensional space and by the annotation of the object bounding boxes: all coordinates are given in one coordinate system. In this coordinate system the y-axis must point in the same direction as the gravity vector; in that case the 3D bounding boxes of the OBB and AABB types are horizontal. AABB stands for Axis-Aligned Bounding Box, a general term for a 3D bounding box of an object whose edges and faces are all aligned with the coordinate axes of 3D space. OBB stands for Oriented Bounding Box, a general term for a 3D bounding box of an object that is horizontal in space and arbitrarily rotated in the horizontal plane.

For each position, the three parallel sparse convolutional layers of the "head" of this architecture output the classification probabilities, the bounding box regression parameters δ, and the centerness, respectively. This design is similar to the simple and lightweight "head" of FCOS [26], but adapted to 3D data.

The classification probabilities mean that, for each object category (table, chair, etc.), the probability is estimated that the given position lies inside the 3D bounding box of an object of that category ("this point belongs to an object of this category").

As for the bounding box regression parameters, it should be noted that none of the existing 3D object detection methods predicts the 3D bounding box of an object directly in explicit form. Instead, position-dependent parameters of the 3D bounding box are usually estimated, for example the distances from the given position to the 6 faces of the 3D bounding box. The bounding box here is the 3D bounding box of an object.
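For an axis-aligned box, the six face distances can be turned back into box corners as sketched below. The ordering of the distances and the name `decode_aabb` are assumptions for illustration; oriented boxes would additionally need a rotation parameter, which this example leaves out.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Distances = Tuple[float, float, float, float, float, float]


def decode_aabb(position: Vec3, d: Distances) -> Tuple[Vec3, Vec3]:
    """Recover the min and max corners of an axis-aligned 3D box.

    d holds the distances from the position to the faces in the assumed
    order (x_min, x_max, y_min, y_max, z_min, z_max).
    """
    x, y, z = position
    dx_min, dx_max, dy_min, dy_max, dz_min, dz_max = d
    return (x - dx_min, y - dy_min, z - dz_min), (x + dx_max, y + dy_max, z + dz_max)
```

Predicting distances relative to the position, rather than absolute corners, keeps the regression targets independent of where the object sits in the scene.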

Centerness describes how close a position is to the center of the reference (true) 3D bounding box of the object that contains it. Centerness is a relative value ranging from 0 to 1; the closer the position is to the center, the larger the value and the closer it is to 1.
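This section does not spell out how centerness is computed. The sketch below assumes an FCOS-style definition extended to three axes: the cube root of the product of the min/max ratios of the opposite face distances. Both the formula and the name `centerness_3d` should be read as assumptions, not as the definition used by the method.

```python
from typing import Tuple

Distances = Tuple[float, float, float, float, float, float]


def centerness_3d(d: Distances) -> float:
    """Assumed FCOS-style 3D centerness in the range (0, 1].

    d holds strictly positive distances from a position inside the box to its
    faces, in the assumed order (x_min, x_max, y_min, y_max, z_min, z_max).
    """
    dx_min, dx_max, dy_min, dy_max, dz_min, dz_max = d
    ratio = 1.0
    for near, far in ((dx_min, dx_max), (dy_min, dy_max), (dz_min, dz_max)):
        ratio *= min(near, far) / max(near, far)
    return ratio ** (1.0 / 3.0)
```

Under this definition a position exactly at the box center yields 1, and the value decreases as the position approaches any face.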

In other words, for each position the "head" returns the classification probabilities, the bounding box regression parameters, and the centerness score.

As shown in FIG. 1, the "head" part is made up of convolutional layers: it contains the regression, centerness, and classification layers. For each position, the classification layer outputs the classification probabilities, the regression layer outputs the bounding box regression parameters δ, and the centerness layer outputs the centerness score.

As shown in FIG. 1, "Pooling" refers to a pooling layer. A pooling layer is a standard layer that computes a local maximum or a local average. It takes a tensor as input and returns a spatially ordered set of local maxima or local averages of that tensor, which is also a tensor.

Pooling layers reduce the dimensionality of the data by combining the outputs of clusters of neurons in one layer into a single neuron in the next layer. Local pooling combines small clusters, typically tiles of size 2×2. Global pooling acts on all neurons of the feature map. Two types of pooling are in common use: max pooling and average pooling. Max pooling takes the maximum value of each local cluster of neurons in the feature map, while average pooling takes the mean value.
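Local pooling over 2×2 tiles can be illustrated on a small 2D grid. This is a minimal sketch with a hypothetical `pool2x2` helper operating on nested lists; the network described here pools sparse 3D tensors, which works on the same principle with 2×2×2 neighborhoods.

```python
from typing import Callable, List, Sequence

Grid = List[List[float]]


def pool2x2(grid: Grid, reduce: Callable[[Sequence[float]], float]) -> Grid:
    """Apply `reduce` to each non-overlapping 2x2 tile of a 2D grid."""
    return [
        [reduce([grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1]])
         for j in range(0, len(grid[0]) - 1, 2)]
        for i in range(0, len(grid) - 1, 2)
    ]


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean, used for average pooling."""
    return sum(values) / len(values)
```

Calling `pool2x2(grid, max)` gives max pooling and `pool2x2(grid, mean)` gives average pooling; in both cases a 4×4 grid is reduced to 2×2.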

During training, FCAF3D outputs positions for different feature levels, which must be assigned to the ground-truth boxes {b}.

The input consists of Npts RGB-colored points, and the output consists of the object classification probabilities, the bounding box regression parameters, and the centerness score for each position (that is, for a certain set of points in 3D space). At test time, the 3D bounding boxes of objects are computed from the classification probabilities, the bounding box regression parameters, and the centerness score. Examples of such 3D bounding boxes are shown in FIG. 3. The predicted position and orientation of an object in space, together with its category, are available in a numerical representation that serves as computer vision for a robotic device and can be used by it according to the task at hand. A way of visualizing the predictions obtained with the proposed method is also implemented: the user sees an image of the scene with the 3D bounding boxes of objects placed in it. These 3D bounding boxes are colored differently to encode different object categories. The color coding is fixed (the same color always corresponds to the same category) but arbitrary.

In the ImVoxelNet 3D object detection method, output positions (locations) are predicted at three levels, and for each level the distances from a position to the faces of the object's 3D bounding box that may be assigned to that position are predefined. For the three scales, these distances are up to 75 cm, from 75 cm to 1.5 m, and more than 1.5 m, respectively.

The present invention proposes a simplified strategy for sparse data that does not require tuning dataset-specific hyperparameters. For each bounding box (examples of 3D object bounding boxes are shown in FIG. 3), the last feature level at which this bounding box covers at least Nloc positions is selected. If there is no such feature level, the first one is selected. The positions are then filtered by center sampling [26], so that only points near the center of the bounding box are treated as positive matches.
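The level selection rule can be sketched as follows, assuming axis-aligned boxes and a list of candidate positions per feature level. The helper name `select_level` and the data layout are hypothetical, and the center sampling step is not shown.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def select_level(positions_per_level: Sequence[List[Vec3]],
                 box_min: Vec3, box_max: Vec3, n_loc: int) -> int:
    """Return the index of the last level where the box covers >= n_loc positions.

    Falls back to the first level (index 0) when no level qualifies.
    """
    chosen = 0
    for level, positions in enumerate(positions_per_level):
        covered = sum(
            1 for p in positions
            if all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))
        )
        if covered >= n_loc:
            chosen = level  # later qualifying levels override earlier ones
    return chosen
```

Since the only parameter is the count Nloc, the rule does not depend on metric size thresholds of the kind used in ImVoxelNet.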

Through assignment, some positions are paired with bounding boxes. The bounding box paired with a position is the reference (true) 3D bounding box of the object associated with that position. Reference bounding boxes of objects are contained in the dataset annotation or can be derived directly from it.

Accordingly, these positions are associated with true labels and 3D centerness values.

The true labels are the reference object categories, which are known or can be obtained directly for each object bounding box from the dataset annotation.

The 3D centerness describes how close a position is to the center of the reference (true) 3D bounding box of the object that contains this position. Centerness is a relative value between 0 and 1; the closer the position is to the center, the closer the value is to 1.

During inference, the classification scores are multiplied by the predicted 3D centerness immediately before non-maximum suppression (NMS), as suggested in [24].
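The inference step above can be illustrated with a small sketch. It assumes axis-aligned boxes, a single object category, and a plain greedy NMS; the function names and the box layout are hypothetical, and the rotated IoU needed for oriented boxes is not covered.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (x, y, z, w, l, h), center first."""
    inter = 1.0
    for c, s in ((0, 3), (1, 4), (2, 5)):
        lo = max(a[c] - a[s] / 2, b[c] - b[s] / 2)
        hi = min(a[c] + a[s] / 2, b[c] + b[s] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def nms_with_centerness(boxes, cls_scores, centerness, iou_thr=0.5):
    """Greedy NMS ranked by classification score times centerness.

    Returns the indices of the kept boxes and the combined scores.
    """
    scores = [p * c for p, c in zip(cls_scores, centerness)]
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep, scores
```

Multiplying by centerness lets a well-centered prediction outrank an off-center one that has a slightly higher classification score, so the better-localized box survives suppression.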

The overall loss function is formulated as follows:
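The formula itself is not reproduced in this text. A reconstruction consistent with the surrounding definitions is sketched here, assuming an FCOS-style sum over all positions in which the regression and centerness terms count only for matched positions. The symbols p (label, with 0 meaning "no object"), b (box), and c (centerness) are assumed notation; hats mark predictions, as stated in the text.

```latex
L = \frac{1}{N_{pos}} \sum_{\hat{x},\hat{y},\hat{z}}
    \Big(
      L_{cls}(\hat{p}, p)
      + \mathbb{1}_{\{p_{\hat{x},\hat{y},\hat{z}} \neq 0\}} \, L_{reg}(\hat{b}, b)
      + \mathbb{1}_{\{p_{\hat{x},\hat{y},\hat{z}} \neq 0\}} \, L_{cntr}(\hat{c}, c)
    \Big),
\qquad
N_{pos} = \sum_{\hat{x},\hat{y},\hat{z}} \mathbb{1}_{\{p_{\hat{x},\hat{y},\hat{z}} \neq 0\}}
```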

Here Npos is the number of matched positions. The classification loss Lcls is a focal loss, the regression loss Lreg is an IoU loss, and the centerness loss Lcntr is a binary cross-entropy. For each loss, the predicted values are denoted by a hat.

Focal loss, IoU loss, and binary cross-entropy are common names for penalty functions used to train neural networks.

Bounding box parameterization (FIG. 3 shows an example of 3D object bounding boxes for clarity).

3D object bounding boxes can be axis-aligned (AABB) or oriented (OBB).

Accordingly, an AABB is horizontal and not rotated, while an OBB is horizontal and rotated by an arbitrary angle. An AABB is defined by its center point (3 coordinates), length, width, and height. For an OBB, the rotation angle in the horizontal plane, the heading angle θ, must also be specified.

An AABB can be described as b^AABB=(x, y, z, w, l, h), whereas the description of an OBB includes the heading angle θ: b^OBB=(x, y, z, w, l, h, θ). In both formulas, x, y, z denote the coordinates of the center of the bounding box, and w, l, h denote its width, length, and height, respectively.

AABB parameterization. The parameterization for AABB was proposed in [24]. Specifically, for a true AABB (x, y, z, w, l, h) and a given position, δ can be formulated as a set of 6 elements:
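The six elements are not reproduced in this text. A reconstruction, assuming δ holds the distances from the position to the six faces of the box and writing the position as (x̂, ŷ, ẑ) (assumed notation), would be:

```latex
\begin{aligned}
\delta_1 &= x + \tfrac{w}{2} - \hat{x}, & \delta_2 &= \hat{x} - x + \tfrac{w}{2}, \\
\delta_3 &= y + \tfrac{l}{2} - \hat{y}, & \delta_4 &= \hat{y} - y + \tfrac{l}{2}, \\
\delta_5 &= z + \tfrac{h}{2} - \hat{z}, & \delta_6 &= \hat{z} - z + \tfrac{h}{2}.
\end{aligned}
```

Under this form δ1 + δ2 = w, δ3 + δ4 = l, and δ5 + δ6 = h, and all six values are positive whenever the position lies inside the box.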

The predicted AABB can be trivially obtained from δ.
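A short sketch of why the recovery is trivial, assuming δ is the set of distances from the position to the six faces of the box (two per axis). The function names and tuple layouts are hypothetical.

```python
def encode_aabb(box, pos):
    """Distances from pos = (px, py, pz) to the six faces of box = (x, y, z, w, l, h)."""
    x, y, z, w, l, h = box
    px, py, pz = pos
    return (x + w / 2 - px, px - x + w / 2,
            y + l / 2 - py, py - y + l / 2,
            z + h / 2 - pz, pz - z + h / 2)


def decode_aabb(delta, pos):
    """Invert encode_aabb: each size is a sum, each center offset a half-difference."""
    d1, d2, d3, d4, d5, d6 = delta
    px, py, pz = pos
    return (px + (d1 - d2) / 2, py + (d3 - d4) / 2, pz + (d5 - d6) / 2,
            d1 + d2, d3 + d4, d5 + d6)
```

Each box dimension is the sum of the two distances along its axis, and the center is the position shifted by half of their difference.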

The heading angle is the rotation angle of an object's 3D bounding box in the horizontal plane; it defines the orientation of the object ("which way the object is facing"). It is one of the parameters that define an oriented 3D bounding box (the OBB definition includes the heading angle θ).

All modern methods for detecting 3D objects from point clouds treat heading angle estimation as classification followed by regression. The range of heading angles is divided into bins, and the exact heading angle is then regressed within the selected bin. For indoor scenes, the range from 0 to 2π is usually divided into 12 equal bins [22], [21], [33], [19]. For outdoor scenes, usually only two bins are used [30], [13], since objects on the road are either parallel or perpendicular to the road.

The heading angle is thus estimated in two stages. First, a rough estimate is made: the interval of values into which the heading angle falls is determined. Then, at the second stage, the value of the heading angle within this interval is refined. These intervals are called bins.
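The two-stage scheme can be sketched as below. This is an illustration only: it assumes 12 equal bins starting at 0 and a residual measured from the bin center, whereas the cited methods may place the bin boundaries differently.

```python
import math


def angle_to_bin(theta, num_bins=12):
    """Stage 1 target (bin index) and stage 2 target (offset from the bin center)."""
    theta %= 2 * math.pi
    width = 2 * math.pi / num_bins
    idx = min(int(theta // width), num_bins - 1)  # clamp guards against rounding at 2*pi
    residual = theta - (idx + 0.5) * width
    return idx, residual


def bin_to_angle(idx, residual, num_bins=12):
    """Recover the heading angle from a bin index and a regressed residual."""
    width = 2 * math.pi / num_bins
    return ((idx + 0.5) * width + residual) % (2 * math.pi)
```

The classifier only has to pick one of 12 classes, and the regressor only has to predict a residual of magnitude at most half a bin width (π/12 here).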

Once the heading angle bin is selected, the heading angle value is estimated by regression. VoteNet and other voting-based methods estimate the value of θ directly. Outdoor methods explore more elaborate approaches, such as predicting the values of trigonometric functions. For example, SMOKE [17] estimates sin θ and cos θ and uses the predicted values to recover the heading angle.
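Recovering an angle from predicted sine and cosine values is typically done with the two-argument arctangent. The sketch below shows that general idea and is not taken from SMOKE's code; the function name is hypothetical.

```python
import math


def angle_from_sin_cos(sin_pred, cos_pred):
    """Heading angle in [0, 2*pi) from predicted sine and cosine values.

    atan2 depends only on the direction of (cos_pred, sin_pred), so the two
    predictions do not have to lie exactly on the unit circle.
    """
    return math.atan2(sin_pred, cos_pred) % (2 * math.pi)
```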

FIG. 2 shows indoor objects for which the heading angle cannot be defined unambiguously: objects that look the same from several sides, namely a square table and two round tables.

As noted above, the heading angle is the rotation angle of an object's 3D bounding box in the horizontal plane and is one of the parameters that define an OBB. "True" refers to the reference values of any object parameter estimated by the method. The values predicted by the method should be as close as possible to the true values of the object's parameters. In this context, the true angle is the true heading angle, i.e. the reference value of the heading angle known in advance from the dataset annotation.

Accordingly, the true angle annotations can be chosen arbitrarily for these objects, which makes classification of heading angle bins meaningless. To avoid penalizing correct predictions that do not match the annotations, the rotated IoU loss is used, since its value is the same for all equivalent choices of the heading angle. A parameterization of OBB that accounts for rotation ambiguity is therefore proposed. To clarify, the orientation of an object cannot always be determined unambiguously, since some objects look the same from several sides: for example, a round stool, a round table, or a square table (see FIG. 2). Any heading angle value taken as the reference for such an object is therefore chosen somewhat arbitrarily. This is called rotation ambiguity.

The parameterization for OBB is based on a mapping of the Möbius strip, so the authors refer to it as the Möbius parameterization of OBB.

Consider an OBB with parameters (x, y, z, w, l, h, θ) and denote q=w/l. If x, y, z, w+l, h are fixed, then the OBBs with (q, θ), (1/q, θ+π/2), (q, θ+π), (1/q, θ+3π/2) (Equation 3) define the same bounding box. The set of pairs (q, θ), where θ ∈ (0, 2π] and q ∈ (0, +∞), is topologically equivalent to a Möbius strip [20] up to this equivalence relation. The authors can therefore reformulate the task of estimating (q, θ) as the task of predicting a point on a Möbius strip. A natural way to embed the Möbius strip, which is a two-dimensional manifold, into Euclidean space is:
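The embedding formula is not reproduced in this text. The sketch below assumes the four-component map (q, θ) → (ln(q)sin(2θ), ln(q)cos(2θ), sin(4θ), cos(4θ)); the first two components appear in the text that follows, while the last two are an assumption. It also checks numerically that swapping width and length while turning the box by a quarter turn leaves the footprint unchanged. All function names are hypothetical.

```python
import math


def footprint(q, theta, size=3.0):
    """Corners of the horizontal footprint, centered at the origin.

    size is w + l and q is w / l, so l = size / (1 + q) and w = size - l.
    """
    l = size / (1 + q)
    w = size - l
    corners = []
    for sx, sy in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
        dx, dy = sx * w / 2, sy * l / 2
        corners.append((dx * math.cos(theta) - dy * math.sin(theta),
                        dx * math.sin(theta) + dy * math.cos(theta)))
    return corners


def same_footprint(a, b, tol=1e-9):
    """True if every corner of a coincides with some corner of b."""
    return all(any(math.dist(p, r) < tol for r in b) for p in a)


def equivalents(q, theta):
    """The four (q, theta) pairs that describe one and the same box."""
    return [(q, theta), (1 / q, theta + math.pi / 2),
            (q, theta + math.pi), (1 / q, theta + 3 * math.pi / 2)]


def embed(q, theta):
    """Assumed four-component embedding of the Moebius strip."""
    return (math.log(q) * math.sin(2 * theta), math.log(q) * math.cos(2 * theta),
            math.sin(4 * theta), math.cos(4 * theta))
```

Replacing q with 1/q flips the sign of ln(q), and adding π/2 to θ flips the signs of sin(2θ) and cos(2θ), so the products are unchanged; sin(4θ) and cos(4θ) have period π/2.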

It is easy to verify that the four points from Equation 3 are mapped to a single point in Euclidean space. However, experiments show that predicting only ln(q)sin(2θ) and ln(q)cos(2θ) gives better results than predicting all four values. The authors therefore opted for a pseudo-embedding of the Möbius strip into R². It is called "pseudo" because it maps the entire central circle of the Möbius strip, defined by ln(q)=0, to the point (0,0). Accordingly, points with ln(q)=0 cannot be distinguished. However, ln(q)=0 implies strict equality of w and l, which is rare in real scenarios. Moreover, the choice of angle has little effect on IoU when w=l, so this rare case is ignored for the sake of detection accuracy and simplicity of the method. As a result, a new OBB parameterization is obtained:
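The formula is not reproduced in this text. Based on the two retained components named above and on the additional targets δ7 and δ8 mentioned in the experiments, a plausible reconstruction is the following; keeping the first six elements as in the AABB case and the order of the two new targets are both assumptions.

```latex
\delta_7 = \ln\!\left(\frac{w}{l}\right) \sin(2\theta),
\qquad
\delta_8 = \ln\!\left(\frac{w}{l}\right) \cos(2\theta)
```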

In the standard parameterization of Equation 2, the predicted box is trivially derived from δ. In the proposed parameterization, w, l, and θ are non-trivial and are recovered from δ through the ratio q=w/l and the size w+l.
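The recovery formulas are not reproduced in this text. The sketch below assumes δ7=ln(q)sin(2θ) and δ8=ln(q)cos(2θ), takes the size w+l as given, and uses the two-argument arctangent to resolve the quadrant; function names are hypothetical. The decoder always returns the representative with q ≥ 1, which may differ from the annotated (w, l, θ) but describes the same box.

```python
import math


def encode_ratio_angle(w, l, theta):
    """Assumed targets: (ln(q) sin(2 theta), ln(q) cos(2 theta)) with q = w / l."""
    log_q = math.log(w / l)
    return log_q * math.sin(2 * theta), log_q * math.cos(2 * theta)


def decode_ratio_angle(size, d7, d8):
    """Recover (w, l, theta) from the size w + l and the two targets."""
    q = math.exp(math.hypot(d7, d8))  # |ln q| is the norm of (d7, d8), so q >= 1
    w = size * q / (1 + q)
    l = size / (1 + q)
    theta = 0.5 * math.atan2(d7, d8)  # half of the angle 2 theta, in (-pi/2, pi/2]
    return w, l, theta
```

For example, a box annotated with w=1, l=2, θ=0.3 decodes to w=2, l=1, θ=0.3−π/2, which is the same box under the equivalence of Equation 3.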

Finally, the position, orientation, and category of each object in the scene are output from the neural network as a numerical representation of the scene. The numerical representation of the scene can be presented as an image of the scene.

Robotic devices can use the position, orientation, and category of objects in the numerical representation to plan paths within the scene that avoid these objects. Object categories can be used to manipulate only objects of the required categories, for example to load and transport required objects according to the user's instructions, or to clean furniture of specified categories.

The position, orientation, and category of objects in the numerical representation can be used to automatically provide statistics about the objects present in a scene, for example to track the number of items currently available in a retail area, or to generate a textual description of the scene in assistant applications, for example to help blind people.

The numerical representation of the scene can be converted (using known methods) into an image of the scene; in this case the computing device additionally comprises a screen, and the image of the scene is displayed on the screen for the user.

The image of the scene can be used to show the results of the proposed method to a person in augmented reality (AR) applications, in applications for monitoring or inventory of objects, and so on. The consumer of results in the form of images is a person.

The position, orientation, and category of objects in the image representation can be used in AR to give the user information about the objects present in the scene and to supplement the captured image of the scene with generated annotations of those objects.

Experiments

The proposed method was evaluated on three benchmarks for 3D object detection: ScanNet V2 [7], SUN RGB-D [25], and S3DIS [1]. For all datasets, mean average precision (mAP) at IoU thresholds of 0.25 and 0.5 was used as the metric.

The ScanNet dataset contains 1513 reconstructed 3D indoor scans with point-wise instance and semantic labels for 18 object categories. Given this annotation, AABBs are computed by the standard approach [22]. The training subset consists of 1201 scans, and 312 scans are left for validation.

SUN RGB-D is a dataset for monocular 3D scene understanding containing more than 10,000 indoor RGB-D images. The annotation consists of point-wise semantic labels and OBBs for 37 object categories. As proposed in [22], experiments were carried out with objects of the 10 most common categories. The training and validation splits contain 5285 and 5050 point clouds, respectively.

S3DIS. The Stanford Large-Scale 3D Indoor Spaces dataset contains 3D scans of 272 rooms from 6 buildings with 3D instance and semantic annotations. Following [10], the proposed method was evaluated on furniture categories. AABBs are derived from the 3D semantics. The official split was used, in which the 68 rooms of Area 5 are reserved for validation and the remaining 204 rooms form the training subset.

The same hyperparameters were used for all datasets, with the following exceptions. First, the size of the output classification layer was equal to the number of object categories, which was 18, 10, and 5 for ScanNet, SUN RGB-D, and S3DIS, respectively.

Second, SUN RGB-D contains OBBs, so the additional targets δ7 and δ8 were predicted for this dataset; note that the loss function was not changed. Finally, ScanNet, SUN RGB-D, and S3DIS contain different numbers of scenes, so each scene was repeated 10, 3, and 13 times per epoch, respectively.

Similar to GSDN [10], a sparse 3D modification of ResNet34, named HDResNet34, was used as the "skeleton". The "neck" and the "head" used the outputs of the "skeleton" at all feature levels. In the initial voxelization of the point cloud, the voxel size was set to 0.01 m and the number of points Npts to 100000. Accordingly, Nvox is also 100000. Both ATSS [32] and FCOS [26] set Nloc to 3² = 9 for 2D object detection. Accordingly, a feature level was selected such that the bounding box covers at least Nloc = 3³ = 27 positions. 18 positions were selected by center sampling. The NMS IoU threshold is 0.5.
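The initial voxelization can be pictured with the following sketch. It is an illustration under stated assumptions and not the sparse-tensor code of the actual pipeline: points are sampled down to Npts, coordinates are quantized by the voxel size, and the first point that falls into a voxel supplies that voxel's color. The function name and the point layout (x, y, z, r, g, b) are hypothetical.

```python
import math
import random


def voxelize(points, voxel_size=0.01, n_pts=100_000, seed=0):
    """Map colored points (x, y, z, r, g, b) to a dict: voxel index -> color."""
    rng = random.Random(seed)
    if len(points) > n_pts:
        points = rng.sample(points, n_pts)  # random subset of at most n_pts points
    voxels = {}
    for x, y, z, *color in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels.setdefault(key, tuple(color))  # keep the first point per voxel
    return voxels
```

The number of occupied voxels can never exceed the number of sampled points, and with a voxel as small as 1 cm most sampled points occupy a voxel of their own.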

Training. The FCAF3D method was implemented using the MMdetection3D framework [6]. The training procedure followed the standard MMdetection scheme [3]: training took 12 epochs, and the learning rate was decreased at the 8th and 11th epochs. The Adam optimizer was used with an initial learning rate of 0.001 and a weight decay of 0.0001. All neural networks were trained on two NVidia V100 GPUs with a batch size of 8. Evaluation and performance tests were run on a single NVidia GTX1080Ti.
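The step schedule can be written out as a small function. The decay factor of 0.1 and the 0-based epoch counter are assumptions, since the text gives only the initial learning rate and the epochs at which it is decreased.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(8, 11), gamma=0.1):
    """Learning rate for a 0-based epoch index under a step schedule.

    gamma (the decay factor) is an assumed value, not stated in the text.
    """
    drops = sum(epoch >= m for m in milestones)  # number of milestones already passed
    return base_lr * gamma ** drops
```

Under these assumptions the 12 epochs run at 0.001 for epochs 0 to 7, at 0.0001 for epochs 8 to 10, and at 0.00001 for epoch 11.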

The present invention uses the evaluation protocol introduced in [16]. Training and evaluation are randomized, since the Npts input points are sampled randomly from the point cloud. To obtain statistically significant results, training is run 5 times, and each trained neural network is tested independently 5 times.

Both the best and the average metrics over the 5×5 trials are reported. This makes it possible to compare FCAF3D with 3D object detection methods that report either a single best value or an average value.
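The 5×5 protocol amounts to two nested loops. In the sketch below, `train_fn` and `test_fn` are hypothetical callables standing in for a full training run and an evaluation run that returns a metric such as mAP.

```python
import statistics


def evaluate_protocol(train_fn, test_fn, n_train=5, n_test=5):
    """Train n_train times, test each model n_test times.

    Returns (best score, mean score, all scores).
    """
    scores = []
    for run in range(n_train):
        model = train_fn(seed=run)  # each run samples its input points differently
        for trial in range(n_test):
            scores.append(test_fn(model, seed=trial))
    return max(scores), statistics.mean(scores), scores
```

Reporting both numbers keeps the comparison fair: the best of 25 trials is comparable with papers that publish a single best run, and the mean with papers that publish averages.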

Results

Comparison with state-of-the-art methods

Table 1 compares FCAF3D with known state-of-the-art methods on three indoor benchmarks.

Table 1 presents the results of FCAF3D and of existing indoor 3D object detection methods that take point clouds as input. The best metric values are shown in bold. FCAF3D outperforms the previous state-of-the-art methods: GroupFree (on ScanNet and SUN RGB-D) and GSDN (on S3DIS). The reported metric value is the best over 25 trials; the average is given in parentheses.

Table 1

The proposed method was evaluated on the ScanNet [7], SUN RGB-D [25], and S3DIS [1] datasets and showed a clear advantage over the prior art on all benchmarks (the average value is given in parentheses). On SUN RGB-D and ScanNet, the proposed method outperforms other methods by at least 3.5% mAP@0.5. On ScanNet it is 4.5 points higher than the best competing 3D object detection method, on SUN RGB-D it is 3.5 points higher, and on S3DIS it is 20.5 points higher. Similarly strong results are seen for the standard quality metric mAP@0.25.

FIG. 3 shows an example of a ScanNet point cloud with predicted bounding boxes (AABB). The color of a bounding box denotes the object category. Left: the estimate produced by the proposed method (FCAF3D); right: the ground truth. Each object category has its own bounding box color. The categories are color-coded as follows: blue (marked C) - chair, orange (marked O) - table, green (marked G) - door, red (marked R) - cabinet. The 3D bounding boxes predicted by the proposed method (left) are close to the true bounding boxes (right). This result illustrates the proposed method.

Similar results were obtained for a point cloud from SUN RGB-D with OBBs and for a point cloud from S3DIS with AABBs.

Для изучения геометрических приоров обучались и оценивались существующие методы с предлагаемыми модификациями. Проводились эксперименты с методами обнаружения 3D объектов, принимающими данные различных модальностей: облака точек, изображения RGB или и то, и другое, чтобы увидеть, заменили ли в FCAF3D вышеупомянутые потери повернутыми потерями IoU с параметризацией Мебиуса по уравнению 5. Для получения полной картины использовалась параметризация sin-cos, используемая в методе обнаружения 3D объектов вне помещений SMOKE [17].To study geometric priors, existing methods were trained and evaluated with the proposed modifications. Experiments were carried out with methods for detecting 3D objects that accept data of various modalities: point clouds, RGB images, or both, to see if in FCAF3D the above-mentioned losses were replaced by rotated IoU losses with the Möbius parameterization according to equation 5 . To obtain a complete picture, we used the sin-cos parameterization used in the SMOKE method for detecting 3D objects outdoors [17].

Повернутая потеря IoU уменьшает количество обучаемых параметров и гиперпараметров, включая геометрические приоры и веса потерь. Эта потеря уже использовалась при обнаружении 3D объектов вне помещений [34]. Недавно в работе [6] сообщалось о результатах обучения VoteNet с выровненными по осям координат потерями IoU на ScanNet.Rotated IoU loss reduces the number of trainable parameters and hyperparameters, including geometric priors and loss weights. This loss has already been used in the detection of 3D objects outdoors [34]. Recently, [6] reported the results of VoteNet training with axis-aligned IoU loss on ScanNet.

Table 2 shows that replacing the standard parameterization with the Möbius parameterization improves the mAP@0.5 of VoteNet and ImVoteNet by about 4%.

ImVoxelNet does not use a "classification + regression" scheme to estimate the heading angle; it predicts the value directly in a single step. Because the original ImVoxelNet already uses the rotated IoU loss, only the parameterization had to be changed, with no redundant losses to remove. Here too the Möbius parameterization gave better results, although the margin is small.

Table 2

Table 2 shows the results of several 3D object detection methods that accept inputs of different modalities, with different OBB parameterizations, on SUN RGB-D. For FCAF3D, the reported metric value is the best over 25 trials; the mean is given in parentheses. For the other methods, both the results from the original papers and the results of reimplementations based on MMdetection3D (denoted "Reimpl.") are given. "PC" denotes a point cloud. "RGB" denotes an RGB image or a set of RGB images. "RGB+PC" denotes an RGB image and a point cloud, or a set of RGB images and a point cloud.

VoteNet is a voting-based 3D object detection method that accepts a point cloud with RGB colors. ImVoteNet is a voting-based 3D object detection method that accepts an RGB image and a point cloud, or a set of RGB images and a point cloud. ImVoxelNet is a 3D object detection method that accepts an RGB image or a set of RGB images. VoteNet and ImVoteNet were reimplemented for the experiments because the source code of these methods was not published. The reimplemented methods reach accuracy comparable to that reported in the original papers, which supports the correctness of the reimplementation.

“w/naive param.” denotes the naive OBB parameterization, in which each OBB parameter is estimated directly. This parameterization is used in the original VoteNet.

“w/sin-cos param.” denotes the sin-cos OBB parameterization, which is formulated in the outdoor 3D object detection method SMOKE.

“w/Mobius param.” denotes the Möbius OBB parameterization.
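The idea behind a sin-cos angle encoding, as opposed to regressing the raw angle, can be sketched as follows. This is a generic illustration and not the exact SMOKE formulation or the Möbius parameterization of equation 5; the function names are hypothetical.

```python
import math
from typing import Tuple


def encode_heading(theta: float) -> Tuple[float, float]:
    """Encode a heading angle as a (sin, cos) pair, which is continuous in theta."""
    return math.sin(theta), math.cos(theta)


def decode_heading(sin_value: float, cos_value: float) -> float:
    """Recover the angle in (-pi, pi] from its (sin, cos) encoding."""
    return math.atan2(sin_value, cos_value)


# Two headings that are almost identical physically but sit on opposite
# sides of the +/-pi wrap-around of the raw angle.
theta_a = math.pi - 0.01
theta_b = -math.pi + 0.01

raw_gap = abs(theta_a - theta_b)  # close to 2*pi
encoded_gap = math.dist(encode_heading(theta_a), encode_heading(theta_b))  # close to 0.02
```

The raw angles differ by almost 2π even though the two boxes point in nearly the same direction, while the encoded pairs are close. This discontinuity is one reason direct angle regression is often avoided.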

As can be seen, all of the methods studied (VoteNet, ImVoteNet, ImVoxelNet, and FCAF3D) benefit from the proposed Möbius OBB parameterization. The results obtained with it are better than those obtained with either the naive parameterization or the sin-cos parameterization described by the authors of SMOKE. The improvement is consistent across 3D object detection methods that accept different types of input data.

GSDN anchors were then studied to show that anchor-based layers have limited ability to generalize. According to Table 3, mAP@0.5 drops sharply, by 12%, when GSDN is trained without anchors (which is equivalent to using a single anchor). In other words, GSDN performs poorly without domain-specific guidance in the form of anchors, so this method is inflexible and does not generalize. All FCAF3D entries in Table 3 correspond to the proposed method and differ in the ResNet modification used (HDResNet34, HDResNet34:3, HDResNet34:2) and in voxel size.

For comparison, FCAF3D with the same "skeleton" was evaluated; it outperformed GSDN by a wide margin, reaching twice the mAP.

Table 3

Table 3 presents the results of fully convolutional 3D object detection methods that accept point clouds, evaluated on ScanNet.

The GSDN results reported were obtained with 0.05 m voxels. Although smaller voxels give a more detailed representation of the data, the relationship between voxel size and accuracy is not straightforward. Changing the voxel size affects GSDN and FCAF3D differently because they use different assignment strategies. FCAF3D aims to densely cover 3D space with positions. The distance between positions is proportional to the voxel size, so the smaller the voxel, the denser the coverage and hence the higher the recall. GSDN covers 3D space with anchors whose linear dimensions are proportional to the voxel size. During assignment, anchors with a small overlap with the bounding boxes are ignored. As the voxel size decreases, the anchors shrink, so some anchors are ignored, which lowers recall.

Thus, FCAF3D generally benefits from smaller voxels, while GSDN does not. GSDN performance is expected to drop as the voxel size decreases from 0.05 m to 0.01 m.

Next, the design choices of the neural network in the proposed method (FCAF3D) are discussed, and the effect of each on the metrics, when applied independently, is examined in ablation studies. Experiments were conducted with different voxel sizes, different numbers of points in the point cloud Npts, different numbers of positions selected by center sampling, and with and without centerness. The results of the ablation studies are summarized in Table 4 for all benchmarks.

Table 4

Table 4 presents the results of ablation studies on voxel size, number of points (which equals the number of voxels Nvox used in pruning), centerness, and center sampling in FCAF3D. The best options are shown in bold (these are the default options used to obtain the results in Table 1 above). The reported metric value is the best over 25 trials; the mean is given in parentheses.
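The reporting convention used in the tables (best value over repeated trials, with the mean in parentheses) can be sketched as follows. The function name `report_best_and_mean` and the sample values are hypothetical; they are placeholders and not results from Table 4.

```python
from statistics import mean
from typing import Sequence


def report_best_and_mean(values: Sequence[float]) -> str:
    """Format a metric as 'best (mean)' over repeated trials."""
    if not values:
        raise ValueError("at least one trial is required")
    return f"{max(values):.1f} ({mean(values):.1f})"


# Placeholder scores for illustration only; not taken from the tables.
trial_scores = [50.0, 52.0, 51.0]
summary = report_best_and_mean(trial_scores)
```

Reporting both numbers makes it visible how much of a claimed improvement could be due to run-to-run variance.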

Voxel size. As expected, accuracy decreases as the voxel size increases. Voxels of 0.03 m, 0.02 m, and 0.01 m were used. The noticeable gap in mAP between voxel sizes of 0.01 m and 0.02 m is explained by the presence of nearly flat objects such as doors, pictures, and boards. In particular, with a voxel size of 2 cm the "head" outputs positions with a tolerance of 16 cm, whereas nearly flat objects can be thinner than 16 cm along one dimension.
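The arithmetic behind the 16 cm figure can be made explicit. The sketch below assumes that the tolerance equals the voxel size multiplied by a fixed factor of 8, a value inferred only from the 2 cm to 16 cm example above; the function name, the `stride` parameter, and the sample object thickness are hypothetical.

```python
def position_tolerance(voxel_size_m: float, stride: int = 8) -> float:
    """Spacing of output positions in metres, assuming a fixed stride (inferred value)."""
    return voxel_size_m * stride


# Hypothetical thickness of a nearly flat object such as a door, in metres.
door_thickness_m = 0.10

tolerances = {size: position_tolerance(size) for size in (0.01, 0.02, 0.03)}
# Voxel sizes at which the object is thinner than the spacing of positions.
too_coarse = [size for size, tol in tolerances.items() if tol > door_thickness_m]
```

Under these assumptions a 0.10 m thick object is resolved at a voxel size of 0.01 m (tolerance 0.08 m) but falls below the position spacing at 0.02 m and 0.03 m.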

Number of points. As with 2D images, subsampled point clouds are sometimes called low-resolution clouds; they contain less information than their high-resolution versions. As expected, the fewer the points, the lower the detection accuracy. In this series of experiments, 20 thousand, 40 thousand, and 100 thousand points were sampled from the entire point cloud, and the resulting metric values show a clear dependence of mAP on the number of points. Larger values of Npts were not considered, in order to stay comparable with existing methods (GSDN [10] uses all points in the point cloud, GroupFree [16] samples 50 thousand points, and VoteNet [22] samples 40 thousand points for ScanNet and 20 thousand for SUN RGB-D). Nvox = Npts is used to guide pruning in the "neck". When Nvox exceeds 100 thousand, inference time increases because of the growing sparsity in the "neck", while accuracy improves only marginally. The grid search over Npts was therefore limited to 100 thousand, which is used as the default value for the reported results.
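Subsampling a point cloud to a fixed budget of Npts points can be sketched as follows. The text does not state how points are selected, so uniform random sampling without replacement is an assumption here, and the function name and the synthetic cloud are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def subsample_points(points: Sequence[Point], n_pts: int, seed: int = 0) -> List[Point]:
    """Return at most n_pts points, chosen uniformly at random without replacement."""
    if len(points) <= n_pts:
        return list(points)  # cloud already fits the budget
    rng = random.Random(seed)
    return rng.sample(points, n_pts)


# Synthetic cloud for illustration: 1000 random points in a unit cube.
_rng = random.Random(42)
cloud = [(_rng.random(), _rng.random(), _rng.random()) for _ in range(1000)]
sampled = subsample_points(cloud, 100)
```

A fixed seed keeps the sketch reproducible; in training one would normally draw a fresh sample each time.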

Centerness. Using centerness improves mAP on the ScanNet and SUN RGB-D datasets. For S3DIS the results are mixed: a better mAP@0.5 is offset by a slight decrease in mAP@0.25. Taking all results together, centerness is considered a useful feature with a small positive effect on mAP, approaching 1% in mAP@0.5 on ScanNet.

Center sampling. Finally, the number of positions selected in center sampling was studied. Three settings were tried: 9 positions, as proposed in FCOS [26]; the entire set of 27 positions, as in ImVoxelNet [24]; and 18 positions. The last setting gave the best mAP on all benchmarks.
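One way to picture center sampling is as keeping only the k candidate positions closest to the center of a ground-truth box. The sketch below makes that assumption explicitly (Euclidean distance, a 3 x 3 x 3 grid of 27 candidates); it is an illustration and not the assignment procedure defined elsewhere in this description. All names are hypothetical.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]


def center_sample(positions: Sequence[Position], box_center: Position, k: int = 18) -> List[Position]:
    """Keep the k positions nearest to the box center (assumed Euclidean distance)."""
    return sorted(positions, key=lambda p: math.dist(p, box_center))[:k]


# A 3 x 3 x 3 grid of candidate positions around the origin: 27 in total.
grid = [tuple(float(c) for c in offset) for offset in itertools.product((-1, 0, 1), repeat=3)]
center = (0.0, 0.0, 0.0)

selected_9 = center_sample(grid, center, k=9)
selected_18 = center_sample(grid, center, k=18)
selected_27 = center_sample(grid, center, k=27)
```

With k = 27 every candidate is kept, while smaller k discards the positions farthest from the center, which are the corners of the grid in this example.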

Inference speed

Compared with standard convolutions, sparse convolutions are efficient in both time and memory. The authors of GSDN state that with sparse convolutions they process a scene of 78 million points covering about 14,000 m³ in a single fully convolutional feed-forward pass, using only 5 GB of GPU memory. FCAF3D uses the same sparse convolutions and the same "skeleton" as GSDN. However, as Table 3 shows, FCAF3D with default settings is slower than GSDN. This is explained by the smaller voxel size: 0.01 m is used for proper multi-level assignment, whereas GSDN uses 0.05 m.
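Why a smaller voxel size costs more can be seen by counting occupied voxels, since a sparse convolution only processes occupied voxels. The following is a minimal sketch using a synthetic cloud; the `voxelize` helper is hypothetical and does not reproduce the quantization of any particular sparse convolution library.

```python
import math
import random
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]


def voxelize(points: Iterable[Point], voxel_size: float) -> Set[Tuple[int, ...]]:
    """Map each point to the integer index of its voxel and keep unique voxels."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


# Synthetic cloud for illustration: 10,000 random points in a 1 m cube.
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(10_000)]

coarse = voxelize(cloud, 0.05)  # voxel size used by GSDN according to the text
fine = voxelize(cloud, 0.01)    # default voxel size of FCAF3D according to the text
```

On the same cloud, the 0.01 m grid yields more occupied voxels than the 0.05 m grid, so there is more work per scene even though the network layers are the same.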

To build the fastest method, HDResNet34:3 and HDResNet34:2 "skeletons" with only three and two feature levels, respectively, were used. With these modifications, FCAF3D has a higher inference speed than GSDN (FIG. 4). FIG. 4 plots mAP@0.5 on ScanNet against inference speed, measured in scenes per second, for the original and modified FCAF3D in comparison with existing 3D object detection methods. The FCAF3D modifications differ in the number of feature levels in the "skeleton". According to the plot, FCAF3D without the accelerating modifications is the most accurate of all the 3D object detection methods compared, although it is slower than some of them. Nevertheless, for each existing method there is an FCAF3D modification that surpasses it in both detection accuracy and inference speed.

FIG. 4 compares the key characteristics of the proposed FCAF3D method and its two modifications (FCAF3D with 3 levels and FCAF3D with 2 levels) with existing methods that solve a similar problem: GroupFree, H3DNet, BRNet, 3DETR, 3DETR-m, VoteNet, and GSDN. The comparison was performed on the ScanNet dataset. Each point on the plot corresponds to one 3D object detection method.

Y-axis: Performance (mAP), that is, mAP@0.5, a metric of 3D object detection accuracy. Higher values correspond to higher accuracy, so the higher a method is located on the plot, the better.

X-axis: Scenes per second, that is, the number of scenes processed per second. Higher values correspond to higher speed, so the further to the right a method is located on the plot, the better.

For each of the existing methods there is a modification of the proposed method that is both more accurate and faster (in other words, whose point lies higher and further to the right on the plot). For GroupFree, H3DNet, BRNet, 3DETR, 3DETR-m, and VoteNet, these are FCAF3D and FCAF3D with three levels; for GSDN, it is FCAF3D with two levels.
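The claim that a point lies "higher and further to the right" is a dominance check on two axes. The sketch below illustrates it with placeholder numbers; the method names and all values are hypothetical and are not read from FIG. 4.

```python
from typing import Dict, List, Tuple

# (scenes per second, mAP@0.5)
Measurement = Tuple[float, float]


def dominates(candidate: Measurement, baseline: Measurement) -> bool:
    """True if the candidate is strictly faster and strictly more accurate."""
    return candidate[0] > baseline[0] and candidate[1] > baseline[1]


def find_dominating(variants: Dict[str, Measurement], baseline: Measurement) -> List[str]:
    """Names of all variants that dominate the given baseline."""
    return [name for name, point in variants.items() if dominates(point, baseline)]


# Placeholder measurements for illustration only.
variants = {"variant_full": (8.0, 0.55), "variant_fast": (15.0, 0.45)}
baselines = {"baseline_slow": (5.0, 0.50), "baseline_quick": (12.0, 0.35)}

coverage = {name: find_dominating(variants, point) for name, point in baselines.items()}
```

Every baseline having a non-empty list in `coverage` corresponds to the statement that each existing method is surpassed by at least one modification on both axes.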

For a fair comparison, the inference speed of GSDN and of the voting-based methods was measured again, because point grouping operations and sparse convolutions have become much faster since those methods were introduced. For the performance tests, implementations based on the MMdetection3D framework [6] were chosen to reduce differences in the code base. The reported inference speed of all methods was measured on the same GPU, so the values can be compared directly.
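A scenes-per-second measurement of the kind described can be sketched with a simple timing harness. This is a generic illustration and not the benchmarking code used for the reported numbers; `process_scene`, the warm-up count, and the dummy workload are hypothetical, and a real GPU measurement would additionally need device synchronization before reading the clock.

```python
import time
from typing import Callable, Sequence, TypeVar

Scene = TypeVar("Scene")


def scenes_per_second(process_scene: Callable[[Scene], object],
                      scenes: Sequence[Scene],
                      warmup: int = 1) -> float:
    """Average throughput over all scenes after a few untimed warm-up calls."""
    for scene in scenes[:warmup]:
        process_scene(scene)  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for scene in scenes:
        process_scene(scene)
    elapsed = time.perf_counter() - start
    return len(scenes) / elapsed


# Dummy workload standing in for a detector: sum the values of a "scene".
dummy_scenes = [list(range(10_000)) for _ in range(20)]
throughput = scenes_per_second(sum, dummy_scenes)
```

Running every method through the same harness on the same hardware is what makes the resulting values directly comparable.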

FCAF3D is proposed, a first-in-class fully convolutional anchor-free 3D object detection method for indoor scenes. The proposed method significantly outperforms the state of the art on the challenging indoor benchmarks SUN RGB-D, ScanNet, and S3DIS in both mAP and inference speed. A new oriented bounding box parameterization is also proposed and shown to improve accuracy for several 3D object detection methods. In addition, the proposed parameterization removes the need for any prior assumptions about objects, thereby reducing the number of hyperparameters. Overall, FCAF3D with the proposed bounding box parameterization is an accurate, scalable, and generalizable method.

The proposed software solution is intended to be executed on mobile robotic devices or to run on smartphones as part of a mobile application. To implement the proposed method, the host device must meet certain technical requirements. In particular, it must have sufficient RAM and computing resources. The required amount of resources depends on the function of the device, the camera parameters, and the performance requirements.

The embodiments of the invention presented above are examples and should not be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.

References

1. Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, M., Savarese, S.: 3d semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1534-1543 (2016).1. Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, M., Savarese, S.: 3d semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1534-1543 (2016).

2. Chen, J., Lei, B., Song, Q., Ying, H., Chen, D.Z., Wu, J.: A hierarchical graph network for 3d object detection on point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 392-401 (2020).2. Chen, J., Lei, B., Song, Q., Ying, H., Chen, D.Z., Wu, J.: A hierarchical graph network for 3d object detection on point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 392-401 (2020).

3. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019).3. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J ., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019).

4. Cheng, B., Sheng, L., Shi, S., Yang, M., Xu, D.: Back-tracing representative points for voting-based 3d object detection in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8963-8972 (2021).4. Cheng, B., Sheng, L., Shi, S., Yang, M., Xu, D.: Back-tracing representative points for voting-based 3d object detection in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8963-8972 (2021).

5. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3075-3084 (2019).5. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3075-3084(2019).

6. Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. (2020)6. Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. (2020)

https://github.com/open-mmlab/mmdetection3d.https://github.com/open-mmlab/mmdetection3d.

7. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Niesner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828-5839 (2017).7. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Niesner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828-5839 (2017).

8. Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Niesner, M.: 3d-mpa: Multiproposal aggregation for 3d semantic instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9031-9040 (2020).8. Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Niesner, M.: 3d-mpa: Multiproposal aggregation for 3d semantic instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9031-9040 (2020).

9. Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J.: Ota: Optimal transport assignment for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 303-312 (2021).9. Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J.: Ota: Optimal transport assignment for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 303-312 (2021).

10. Gwak, J., Choy, C., Savarese, S.: Generative sparse detection networks for 3d single-shot object detection. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16. pp. 297-313. Springer (2020).10. Gwak, J., Choy, C., Savarese, S.: Generative sparse detection networks for 3d single-shot object detection. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16. pp. 297-313. Springer (2020).

11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770-778 (2016).11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770-778 (2016).

12. Hou, J., Dai, A., Niesner, M.: 3d-sis: 3d semantic instance segmentation of rgbd scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4421-4430 (2019).12. Hou, J., Dai, A., Niesner, M.: 3d-sis: 3d semantic instance segmentation of rgbd scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4421-4430 (2019).

13. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12697-12705 (2019).

14. Li, B.: 3d fully convolutional network for vehicle detection in point cloud. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1513-1518. IEEE (2017).

15. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980-2988 (2017).

16. Liu, Z., Zhang, Z., Cao, Y., Hu, H., Tong, X.: Group-free 3d object detection via transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 2949-2958 (2021).

17. Liu, Z., Wu, Z., Tóth, R.: Smoke: Single-stage monocular 3d object detection via keypoint estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 996-997 (2020).

18. Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for realtime object recognition. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 922-928. IEEE (2015).

19. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2906-2917 (2021).

20. Munkres, J.R.: Topology (2000).

21. Qi, C.R., Chen, X., Litany, O., Guibas, L.J.: Imvotenet: Boosting 3d object detection in point clouds with image votes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4404-4413 (2020).

22. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9277-9286 (2019).

23. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 652-660 (2017).

24. Rukhovich, D., Vorontsova, A., Konushin, A.: Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. arXiv preprint arXiv:2106.01178 (2021).

25. Song, S., Lichtenberg, S.P., Xiao, J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 567-576 (2015).

26. Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9627-9636 (2019).

27. Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: Fully convolutional one-stage monocular 3d object detection. arXiv preprint arXiv:2104.10956 (2021).

28. Xie, Q., Lai, Y.K., Wu, J., Wang, Z., Lu, D., Wei, M., Wang, J.: Venet: Voting enhancement network for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3712-3721 (2021).

29. Xie, Q., Lai, Y.K., Wu, J., Wang, Z., Zhang, Y., Xu, K., Wang, J.: Mlcvnet: Multilevel context votenet for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10447-10456 (2020).

30. Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018).

31. Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11784-11793 (2021).

32. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9759-9768 (2020).

33. Zhang, Z., Sun, B., Yang, H., Huang, Q.: H3dnet: 3d object detection using hybrid geometric primitives. In: European Conference on Computer Vision. pp. 311-329. Springer (2020).

34. Zhou, D., Fang, J., Song, X., Guan, C., Yin, J., Dai, Y., Yang, R.: Iou loss for 2d/3d object detection. In: 2019 International Conference on 3D Vision (3DV). pp. 85-94. IEEE (2019).

35. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4490-4499 (2018).

Claims

1. A method for providing computer vision for a robotic device, implemented in a computer device of the robotic device, wherein the robotic device has a camera with depth sensors, an RGB camera, a central processor, internal memory, and RAM, the method comprising the steps of:

capturing a real scene containing objects using the camera with depth sensors and the RGB camera;

representing the captured real scene as a point cloud of N points, each point being represented by its coordinates and color; and

inputting the point cloud of N points as data into a neural network, wherein the neural network consists of a "skeleton" part, a "neck" part, and a "head" part,

wherein the "skeleton" part, the "neck" part, and the "head" part are 3D sparse convolutional neural network parts;

wherein the following steps are performed in the neural network:

A) by means of the "skeleton" part, representing the point cloud as a volumetric pixel (voxel) representation of the input data;

B) by means of the "skeleton" part, processing the voxel representation of the input data to obtain four-dimensional tensors;

C) by means of the "neck" part, processing the four-dimensional tensors to extract numerical features of the objects in the scene;

D) by means of the "head" part, processing the extracted features of the objects in the scene to obtain predictions of the positions, orientations, and categories of the objects in the scene, wherein for each object in the scene the "head" part produces predictions, each prediction comprising:

an object classification probability, object bounding box regression parameters, and a measure of object centering (centerness) within the object bounding box;

E) by means of the "head" part, filtering all obtained predictions by comparing them by object classification probability to select the most likely prediction, wherein the most likely prediction is taken as the final estimate of the position, orientation, and category of the object in the scene and is characterized by data regarding the position, orientation, and category of the object in the scene; and

outputting from the neural network the data regarding the position, orientation, and category of the object in the scene, in the form of a numerical representation of the scene, which constitutes computer vision for the robotic device.

2. The method of claim 1, further comprising the step of:

converting the numerical representation of the scene into a scene image, wherein the computer device further comprises a screen, and the scene image is displayed on the screen for a user.

3. A computer-readable medium containing program code that, when executed in a computer device, carries out the method according to claim 1.
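
For illustration only, steps A and E of claim 1 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names voxelize and select_predictions, the voxel size of 0.01, the color averaging per voxel, and the score threshold of 0.5 are all hypothetical choices that the claims do not specify, and the sparse convolutional "skeleton", "neck" and "head" parts themselves are not modeled.

```python
import numpy as np


def voxelize(points, colors, voxel_size=0.01):
    """Step A sketch: group points into voxels and average their colors.

    voxel_size and the averaging rule are assumptions for illustration.
    """
    # Integer voxel index of every point along x, y, z
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Occupied voxels, plus the index of the voxel each point falls into
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; shape differs across NumPy versions
    counts = np.bincount(inverse, minlength=len(voxels))
    # Sum the colors of all points in each voxel, then divide by the count
    feats = np.zeros((len(voxels), colors.shape[1]), dtype=np.float64)
    np.add.at(feats, inverse, colors)
    feats /= counts[:, None]
    return voxels, feats


def select_predictions(boxes, class_probs, score_threshold=0.5):
    """Step E sketch: keep confident predictions, ordered by probability.

    A plain threshold is assumed here; the claim only states that
    predictions are compared by classification probability.
    """
    labels = class_probs.argmax(axis=1)   # most probable category per prediction
    scores = class_probs.max(axis=1)      # its classification probability
    keep = np.flatnonzero(scores >= score_threshold)
    order = keep[np.argsort(-scores[keep])]  # highest probability first
    return boxes[order], labels[order], scores[order]


# Hypothetical toy input: three colored points, two of them in the same voxel
pts = np.array([[0.001, 0.002, 0.003], [0.004, 0.005, 0.006], [0.025, 0.0, 0.0]])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
vox, feat = voxelize(pts, rgb)

# Hypothetical head output: three boxes (6 regression parameters), two classes
toy_boxes = np.arange(18, dtype=np.float64).reshape(3, 6)
toy_probs = np.array([[0.1, 0.9], [0.3, 0.2], [0.7, 0.1]])
sel_boxes, sel_labels, sel_scores = select_predictions(toy_boxes, toy_probs)
```

In this sketch each occupied voxel carries one averaged color vector, which is one common way to build the sparse input for a 3D sparse convolutional network; a production system would more likely rely on the quantization routine of its sparse convolution library and on a box-overlap-aware filter.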