WO2023080590A1 - Method for providing computer vision - Google Patents

Method for providing computer vision

Info

Publication number
WO2023080590A1
WO2023080590A1 (PCT/KR2022/016890, KR2022016890W)
Authority
WO
WIPO (PCT)
Prior art keywords
scene
points
neural network
predictions
bounding box
Prior art date
Application number
PCT/KR2022/016890
Other languages
French (fr)
Inventor
Danila Dmitrievich RUKHOVICH
Anna Borisovna VORONTSOVA
Anton Sergeevich Konushin
Original Assignee
Samsung Electronics Co., Ltd.
Priority date
Filing date
Publication date
Priority claimed from RU2022113324A external-priority patent/RU2791587C1/en
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2023080590A1 publication Critical patent/WO2023080590A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/255Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • AHUMAN NECESSITIES
    • A47FURNITURE; DOMESTIC ARTICLES OR APPLIANCES; COFFEE MILLS; SPICE MILLS; SUCTION CLEANERS IN GENERAL
    • A47LDOMESTIC WASHING OR CLEANING; SUCTION CLEANERS IN GENERAL
    • A47L2201/00Robotic cleaning machines, i.e. with automatic control of the travelling movement or the cleaning operation
    • A47L2201/06Control of the cleaning action for autonomous devices; Automatic detection of the surface condition before, during or after cleaning

Definitions

  • The disclosure relates to computer vision, namely to mobile robot navigation and to mobile apps that perform scene understanding and object recognition.
  • 3D object detection from point clouds aims at simultaneous localization and recognition of 3D objects given a 3D point set.
  • 3D scene understanding is widely applied in autonomous driving, robotics, and AR.
  • Convolutional 3D object detection methods have scalability issues: large-scale scenes either require an impractical amount of computational resources or take too much time to process. Other methods opt for voxel data representation and employ sparse convolutions; however, these methods solve scalability problems at the cost of detection accuracy. In other words, there is no 3D object detection method that provides precise estimates and scales well.
  • Recent 3D object detection methods are designed to be either indoor or outdoor.
  • VoteNet [22] was the first method that introduced point voting for 3D object detection.
  • VoteNet processes 3D points with PointNet [23], assigns a group of points to each object candidate according to their voted center, and computes object features from each point group.
  • the major progress is associated with advanced grouping and voting strategies applied to the PointNet features.
  • BRNet [4] refines voting results with the representative points from the vote centers, which improves capturing the fine local structural features.
  • MLCVNet [29] introduces three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels.
  • H3DNet [33] improves the point group generation procedure by predicting a hybrid set of geometric primitives.
  • VENet incorporates an attention mechanism and introduces a vote weighting module trained via a novel vote attraction loss.
  • VoteNet-like voting-based methods are limited by design. First, they show poor scalability: since their performance depends on the amount of input data, they tend to slow down as scenes become larger. Moreover, many voting-based methods implement voting and grouping strategies as custom layers, making it difficult to reproduce or debug these methods or port them to mobile devices.
  • transformer-based methods use end-to-end learning and a forward pass at inference instead of heuristics and optimization, which makes them less domain-specific.
  • GroupFree [16] replaces VoteNet head with a transformer module, updating object query locations iteratively and ensembling intermediate detection results.
  • 3DETR [19] was the first method of 3D object detection implemented as an end-to-end trainable transformer.
  • more advanced transformer-based methods still experience scalability issues similar to early voting-based methods. In contrast, the proposed method is fully convolutional, thus being faster and significantly easier to implement than both voting-based and transformer-based methods.
  • Voxel representation allows handling cubically growing sparse 3D data efficiently.
  • Voxel-based 3D object detection methods ([12], [18]) convert points into voxels and process them with 3D convolutional networks.
  • dense volumetric features still consume much memory, and 3D convolutions are computationally expensive. Overall, processing large scenes requires a lot of resources and cannot be done within a single pass.
  • GSDN tackles performance issues with sparse 3D convolutions. It has encoder-decoder architecture, with both encoder and decoder parts built from sparse 3D convolutional blocks. Compared to the standard convolutional voting based and transformer-based approaches, GSDN is significantly more memory efficient and scales to large scenes without sacrificing point density. The major weakness of GSDN is its accuracy: this method is comparable to VoteNet in terms of quality, being significantly inferior to the current state-of-the-art [16].
  • GSDN uses 15 aspect ratios for 3D object bounding boxes as anchors. If GSDN is trained in an anchor-free setting with a single aspect ratio, the accuracy decreases by 12%. Unlike GSDN, the proposed method is anchor-free while taking advantage of sparse 3D convolutions.
  • RGB-based anchor-free object detection
  • FCOS addresses 2D object detection in a per-pixel prediction manner and shows a robust improvement over its anchor-based predecessor RetinaNet [15].
  • FCOS3D trivially adapts FCOS by adding extra targets for monocular 3D object detection.
  • ImVoxelNet solves the same problem with an FCOS-like head built from standard (non-sparse) 3D convolutional blocks. The proposed disclosure adapts the ideas from the mentioned anchor-free methods to process sparse irregular data.
  • an ideal 3D object detection method should handle objects of arbitrary shapes and sizes without additional hacks and hand-tuned hyperparameters.
  • Prior assumptions on 3D object bounding boxes (e.g. aspect ratios or absolute sizes) restrict generalization and increase the number of hyperparameters and trainable parameters.
  • a method for computer vision of a robotic device may include obtaining a scene with at least one object.
  • a method for computer vision of a robotic device may include representing the obtained scene as a set of N points.
  • a method for computer vision of a robotic device may include inputting the set of N points as input data to a neural network.
  • the neural network performs the step of representing the set of N points as volumetric pixel representation.
  • the neural network performs the step of processing the volumetric pixel representation to obtain four-dimensional tensors.
  • the neural network performs the step of processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene.
  • the neural network performs the step of processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
  • a method for computer vision of a robotic device may include outputting, from the neural network, the predictions as numerical representation.
  • a robotic device for computer vision may include at least one memory storing one or more computer executable instructions, and at least one processor configured to execute the one or more instructions stored in the memory. The at least one processor is configured to execute the one or more instructions stored in the memory to obtain a scene with at least one object. The at least one processor is configured to execute the one or more instructions stored in the memory to represent the obtained scene as a set of N points.
  • the at least one processor is configured to execute the one or more instructions stored in the memory to input the set of N points as input data to a neural network, wherein the neural network performs the following steps: represent the set of N points as a volumetric pixel representation, process the volumetric pixel representation to obtain four-dimensional tensors, process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene, and process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
  • the at least one processor is configured to execute the one or more instructions stored in the memory to output, from the neural network, the predictions as a numerical representation.
  • a computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to obtain a scene with at least one object.
  • a computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to represent the obtained scene as a set of N points.
  • a computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to input the set of N points as input data to a neural network, wherein the neural network performs the following steps: represent the set of N points as volumetric pixel representation, process the volumetric pixel representation to obtain four-dimensional tensors, process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
  • a computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to output, from the neural network, the predictions as numerical representation.
  • Figure 1 illustrates the general scheme of the proposed method.
  • Figure 2 illustrates examples of objects with an ambiguous heading angle.
  • Figure 3 illustrates the result of the proposed method on the ScanNet dataset.
  • Figure 4 illustrates the detection accuracy against inference speed measured in scenes per second for the original and modified FCAF3D in comparison with the existing methods of 3D object detection.
  • Figure 5 is a schematic diagram of a robotic device for computer vision according to an embodiment of the disclosure.
  • Figure 6 is a flow diagram of a method for computer vision of a robotic device according to an embodiment of the disclosure.
  • the method performed by the electronic device may be performed using an artificial intelligence (AI).
  • a function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor.
  • the processor may include one or a plurality of processors.
  • one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
  • the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) stored in the non-volatile memory and the volatile memory.
  • the predefined operating rule or artificial intelligence is provided through training or learning.
  • learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI of a desired characteristic is made.
  • the learning may be performed in the device itself in which the AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
  • the AI may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs a layer operation through computation between the output of the previous layer and the plurality of weights.
  • Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
  • a neural network can be implemented in hardware or in a combination of software and hardware.
  • the learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction.
  • Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
  • a method for object recognition may obtain output data by using image data as input data for an artificial intelligence model.
  • the artificial intelligence may be obtained by training.
  • "obtained by training” means that a predefined operation rule or artificial intelligence configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence with multiple pieces of training data by a training algorithm.
  • the artificial intelligence may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
  • Visual understanding is a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
  • the proposed method may use an artificial intelligence model that is executed using the data.
  • the processor may perform a pre-processing operation on the data to convert it into a form appropriate for use as an input for the artificial intelligence model.
  • Reasoning prediction is a technique of logically reasoning and predicting by determining information and includes, e.g., knowledge-based reasoning, optimization prediction, preference-based planning, or recommendation.
  • FCAF3D is a simple, effective, and scalable method for detecting 3D objects from point clouds.
  • the disclosure is a method for providing computer vision of a robotic device.
  • the method can be implemented on a computer device of the robotic device.
  • the robotic device has a camera with depth sensors and an RGB camera, a CPU, internal memory storage, and RAM.
  • the camera with the depth sensors and the RGB camera are configured to capture a real scene with 3D objects in the scene.
  • the proposed method makes it possible to detect and recognize the category and location of 3D objects in the captured scene.
  • the proposed solution is designed for scene analysis and object recognition.
  • the obtained results can be employed in a wide range of tasks, where decisions are based on the scene and its objects.
  • the software based on the proposed method can supply mobile robotic devices (for example mobile robot navigation) with spatial information for planning a trajectory, gripping and manipulating objects.
  • the proposed solution can be used to generate prompts about the scene automatically.
  • the proposed solution is designed to detect and recognize three-dimensional objects and estimate their spatial position.
  • the task formulation follows the classical problem statement of 3D object detection, formulated by the scientific computer vision community.
  • the proposed solution is assumed to be executed in mobile robotic devices having a computer device.
  • the computer device of the robotic device has a camera with depth sensors and an RGB camera, a CPU, internal memory storage, RAM, and a screen.
  • the disclosure can be implemented on smartphones as a part of a mobile application.
  • a carrier device: for example, a computer-readable medium.
  • the computer-readable medium contains program code that reproduces the proposed method when implemented on the computer device.
  • the computer-readable medium should have a sufficient amount of RAM and computational resources. The required amount of resources depends on the device's function, camera parameters, and performance requirements.
  • When detecting 3D objects, it is traditional to use a predefined set of 3D object bounding boxes called anchors. Such a set can be considered as a set of a priori hypotheses about the location of objects in space (objects in the scene) and their sizes. The use of such a set of hypotheses makes it possible to detect objects in three-dimensional space by selecting the most probable hypotheses and refining them.
  • a priori hypotheses do not always describe real objects in a particular scene well, so the use of anchors limits the applicability of the object detection method.
  • Until recently, all object detection methods used anchors.
  • Anchor-free, i.e. "anchorless", methods now compete with traditional "anchor" methods.
  • The proposed convolutional anchor-free indoor 3D object detection method is a simple yet effective method that uses a voxel representation of a point cloud (input data) and processes voxels with sparse convolutions.
  • The captured input is Npts RGB-colored points, and the output is a set of 3D object bounding boxes.
  • in the proposed method, executed on the computer, the Npts RGB-colored points are a set of N points, each of which is represented by its coordinate and color.
  • Each point of three-dimensional space (point) is specified by three coordinates in space, as well as a color in the RGB palette (RGB-colored point).
  • a set of points in three-dimensional space is also called a point cloud.
  • This cloud of N points, each of which is represented by its coordinate and color, can be obtained by processing the captured real scene with the objects occurring in it.
  • the cloud of the set of N points is inputted as data into a neural network.
  • Point coordinates are real numbers.
  • Voxel is the generally accepted abbreviation for volumetric pixel, i.e. a three-dimensional pixel.
  • "2D" pixel in a 2D image is the basic "cell" of the image.
  • a pixel is an image discretization element: the image is divided into equal sections (elements) by a regular grid. Each such element has the shape of a square aligned along the sides of the image.
  • a voxel is a part of three-dimensional space bounded by a parallelepiped aligned with the coordinate axes and divided according to a grid into grid elements.
  • a voxel is defined more flexibly than a pixel.
  • the grid elements can be located in space in an arbitrary way. That is, the space is divided into grid elements arbitrarily.
  • the averaging of all point coordinates from the point cloud falling into one grid element (i.e. the x, y and z coordinates are averaged) defines the center of the voxel. That is, each grid element corresponds to its own voxel.
  • the result is a set of voxels that is just as sparse and irregular as the original set of points in three-dimensional space.
  • a voxel does not necessarily have the same spatial dimensions along all three axes: it may be not a cube but an arbitrary parallelepiped; however, cubic voxels are often used for the convenience of calculations.
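  • For illustration only, the following sketch (not the patented implementation; it assumes a NumPy array of shape (N, 6) holding x, y, z coordinates and RGB colors) shows how the voxelization described above can be performed by averaging the points that fall into each grid cell.

```python
# Illustrative sketch, not the patented implementation: voxelize an (N, 6) array of
# x, y, z, r, g, b points by averaging everything that falls into the same grid cell.
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.01):
    grid_idx = np.floor(points[:, :3] / voxel_size).astype(np.int64)  # grid cell of each point
    _, inverse = np.unique(grid_idx, axis=0, return_inverse=True)     # group points by cell
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, points.shape[1]))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    voxels = sums / counts[:, None]   # per-cell mean: voxel center (first 3) and mean color
    return voxels[:, :3], voxels[:, 3:]
```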
  • Proposed FCAF3D method can handle large-scale scenes with minimal runtime and memory through a single fully convolutional feed-forward pass and does not require the heuristic post-processing stage.
  • Existing methods of 3D object detection made prior assumptions on the geometry of objects. Any geometry priors limit the generalization ability of the method. Instead, authors of the disclosure propose a novel parametrization of oriented bounding boxes (OBB) that allows obtaining better results without any priors.
  • the proposed method achieves state-of-the-art 3D object detection results in terms of mAP@0.5 on ScanNet V2 (+4.5), SUN RGB-D (+3.5), and S3DIS (+20.5) datasets.
  • mAP@0.5 is a standard metric for assessing the quality of 3D object detection. Possible values of mAP@0.5 range from 0 to 100. The higher the value of mAP@0.5, the higher the quality. On S3DIS, FCAF3D outperforms the competitors by a huge margin.
  • FCAF3D: fully convolutional anchor-free 3D object detection method.
  • the proposed method significantly outperforms the previous state-of-the-art on challenging large-scale indoor ScanNet, SUN RGB-D, and S3DIS datasets in terms of mAP while being faster on inference.
  • 3D object detection is a computer vision task.
  • 3D objects have complex, varied and sometimes variable shapes that often cannot be described parametrically (in equations). Therefore, in the standard formulation of the object detection problem, the shape, size and location of objects are modeled by a simple three-dimensional figure - a parallelepiped, a "box".
  • Such boxes are called 3D bounding boxes. They are specified by the three-dimensional coordinates of the center of the box, as well as the width, height, and length of the box. For simplicity, it is assumed that all such parallelepipeds are oriented along the coordinate axes of three-dimensional space, i.e. axis-aligned.
  • the object detection problem additionally involves determining the orientation of the object, i.e. rotation angle. Also each object has a category label.
  • the category of the object is specified in the markup (annotation) of the datasets.
  • the annotation is the reference, true information contained in the original datasets.
  • annotation is a collection of three-dimensional bounding frames of objects with specified categories of objects for each point cloud contained in the data set.
  • the annotation was obtained with the help of expert assessors when creating a dataset by the authors of the dataset.
  • the annotation is used to teach a neural network: in order to learn how to make predictions based on input data, the neural network must see a number of training examples of the form (input data - reference true output data contained in the markup) and find a pattern that allows it to establish a relationship between input data and output data.
  • the categories are obtained as a result of expert evaluation by a human assessor who is provided with three-dimensional point clouds at the stage of the training data acquisition.
  • Such three-dimensional point clouds were obtained from a set of RGB images and their corresponding depth maps, i.e. depth sensor measurements.
  • special software was used, which allows visualizing point clouds on the screen, manipulating them without making changes to the original data (rotating, moving and zooming them in order to view them from different sides), and indicating, by clicking on the desired area of space or otherwise, the location of objects in space in the form of parallelepipeds enclosing them, i.e. 3D bounding boxes.
  • the proposed FCAF3D architecture consists of a backbone part, a neck part, and a head part, these terms are generally accepted terms in this field of technology. Designation of parts of the neural network of object detection as the backbone part, the neck part, the head part are used in articles describing such methods of two-dimensional/three-dimensional object detection as FCOS, ATSS, ImVoxelNet, FCOS3D.
  • the backbone part, the neck part and the head part are 3D sparse convolutional parts of the neural network.
  • the backbone part of the neural network means a pre-trained neural network or a part of a pre-trained neural network.
  • Neural networks used as a backbone are trained on large amounts of visual data, usually by solving an image classification problem. As a result of this training, they acquire the ability to capture patterns in visual data. This ability can be used not only to solve the problem of image classification, but also to solve many other computer vision problems.
  • the standard approach in designing a neural network for solving computer vision problems is to use a pretrained backbone, while replacing some of its layers designed to solve the classification problem with other layers designed to solve the target problem.
  • the neck part of the neural network takes as input the outputs of the backbone - four-dimensional tensor - and also returns four-dimensional tensors.
  • the head part of the neural network is the last, final layer of the neural network, for obtaining predictions of locations, orientations, and categories of the objects in the scene. For each of the objects in the scene the head part outputs the predictions.
  • Each prediction has object classification probability, object bounding box regression parameters, and centerness of the object inside the object bounding box.
  • neural network layers are sparse convolutional layers.
  • For scalability, a GSDN-like sparse convolutional network is selected for FCAF3D.
  • the head part is introduced with a simple multi-level location assignment.
  • ResNet is a family of neural network architectures widely used to solve computer vision problems.
  • the ResNet family has both lightweight architectures as well as more powerful architectures with a large number of customizable parameters.
  • the lightweight architectures are designed for use in cases where computing resources are limited and/or speed is important.
  • More powerful architectures with a large number of customizable parameters show a better quality of solving the target problem compared to lightweight ones.
  • Such powerful architectures are chosen when quality is a key priority and a significant amount of computing resources are available. All these architectures are arranged according to the same principle: they contain the same or similar computational units, interconnected in a certain way. As a result, the ResNet family is formed by neural network architectures with a different number of such computing units.
  • the captured real scene is represented by the computer device as a cloud of a set of N points, each of which is represented by its coordinate and color (Npts RGB point cloud).
  • the cloud of the set of N points is inputted as data into the neural network.
  • the depth sensor measures the distance to points in the scene and outputs the measurement results in the form of a dense two-dimensional map, each pixel of which contains the distance from the depth camera to some point in the scene.
  • the form of this mapping is determined by the parameters of the depth camera.
  • the camera parameters are given explicitly in the datasets used for the experiments in this work. Under real conditions of application of the proposed method of 3D object detection, the camera parameters can be separately estimated by any camera parameter estimation method (this is a standard procedure, also called camera calibration), or set directly in an explicit form - for example, as characteristics of a specific depth camera model.
  • the real scene with objects in it is captured by a camera with depth sensors and an RGB camera as a set of N points, each of which is further represented by its coordinate and color.
  • the depth map pixels are mapped into 3D space (using the depth camera settings) and then projected onto the image plane (using the RGB camera settings).
  • the result of this procedure is a depth map aligned pixel-by-pixel with the RGB image.
  • Each point in 3D space mapped from a depth map pixel is assigned the RGB values that were recorded in the RGB image pixel corresponding to that depth map pixel.
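  • As a hedged illustration of the procedure described above, the sketch below back-projects a depth map into a colored point cloud under a simple pinhole camera model with intrinsics fx, fy, cx, cy; these names are assumptions, and a real pipeline would additionally apply the extrinsic transform between the depth and RGB cameras.

```python
# Illustrative sketch assuming a pinhole model and a depth map already aligned with the
# RGB image; fx, fy, cx, cy are the (assumed) camera intrinsics.
import numpy as np

def depth_to_colored_points(depth: np.ndarray, rgb: np.ndarray, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx            # pinhole back-projection of each pixel
    y = (v - cy) * z / fy
    valid = z > 0                    # drop pixels without a depth measurement
    points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid]              # RGB of the corresponding image pixel
    return np.concatenate([points, colors], axis=-1)
```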
  • the procedure for obtaining a point cloud from a single RGB image and a single depth map is described above.
  • the measurements of any depth sensor are inaccurate.
  • the accuracy of measurements can be improved if there are several measurements of a 3D space area obtained from different points.
  • such aggregation of measurements makes it possible to increase the consistency of measurements in a single 3D scene and to get one point cloud that completely describes the entire scene, rather than a set of separate point clouds for each RGB image and depth map.
  • SLAM: simultaneous localization and mapping.
  • TSDF: truncated signed distance function.
  • the neural network consists of layers, each of which takes a tensor as input, calculates a function on it, and returns the result - also a tensor.
  • the layers may be sequential and/or parallel.
  • the method of "assembling" layers into a neural network is also called neural network architecture.
  • Figure 1 illustrates the neural network layers. The Npts RGB point cloud is input into the neural network. A convolutional layer, indicated as "Conv0" in figure 1, calculates a function called "convolution" (a common mathematical term). A pooling layer, indicated as "Pooling" in figure 1, calculates the local maximum. The output of "Conv0" and "Pooling" is a sparse 3D tensor.
  • The backbone part in FCAF3D is a sparse modification of ResNet [11] where all 2D convolutions are replaced with sparse 3D convolutions.
  • the family of sparse high-dimensional versions of ResNet was first introduced in [5]; for brevity, the authors refer to them as HDResNet.
  • the Npts RGB point cloud is converted into a voxel representation in the layers "Conv0" and "Pooling".
  • the Residual block is a computational unit of the neural network architecture of the ResNet family. It consists of several layers of various types - convolutional layers, layers that perform normalization inside a minibatch, an activation layer.
  • a distinctive feature of this computing block is a specially arranged connection between layers, called a skip connection, or residual connection (which gave the name both to the computing block - residual block - and to the ResNet family of neural network architectures - residual networks).
  • Residual Blocks are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the ResNet architecture. Formally, denoting the desired underlying mapping as H(x), authors let the stacked nonlinear layers fit another mapping of H(x) - x. The original mapping is recast into F(x) + x. The F(x) acts like a residual, hence the name 'residual block'. The intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings.
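  • A minimal dense (non-sparse) PyTorch sketch of a residual block computing F(x) + x is given below for illustration; the patented backbone uses sparse 3D convolutions instead of the standard layers shown here.

```python
# Minimal dense PyTorch sketch of a residual block; the disclosure uses sparse 3D
# convolutions, which are replaced here by standard nn.Conv3d for readability.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(                       # F(x): two convolutions with normalization
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)               # skip connection: F(x) + x
```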
  • the neck part processes the four-dimensional tensors from the backbone part to extract features, expressed in numerical form, of the objects in the scene.
  • the features mean descriptions, representations, or object descriptors in a format that a computing device can work with - that is, in numerical form, as a set of numbers. These numbers can be organized as multidimensional matrices, i.e. tensors.
  • the values of these numbers are hardly interpretable, they are not visual: in general, it is impossible to point to a specific number and claim that it encodes a certain property of an object.
  • the features extracted by neck are four-dimensional tensors.
  • the proposed neck part is a simplified GSDN decoder.
  • Features on each layer of the neck part are processed with one sparse transposed 3D convolution operation and one sparse 3D convolution operation.
  • Each transposed sparse 3D convolution operation with a kernel size of 2 might increase the number of non-zero values by 2×2×2 = 8 times.
  • GSDN uses the pruning layer that filters input with a probability mask.
  • In GSDN, feature level-wise probabilities are calculated with an additional convolutional scoring layer.
  • This layer is trained with a special loss encouraging consistency between the predicted sparsity and anchors. Specifically, voxel sparsity is set to be positive if any of the subsequent anchors associated with the current voxel is positive. However, using this loss may be suboptimal, as distant voxels of an object might get assigned with a low probability.
  • In the proposed method, the scoring layer with the corresponding loss is removed, and the probabilities from the classification layer in the head are used instead.
  • the probability threshold is not tuned; instead, at most Nvox voxels are kept to control the sparsity level, where Nvox equals the number of input points Npts. This is a simple yet elegant way to prevent sparsity growth, since reusing the same hyperparameter makes the process more transparent and consistent.
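  • The following sketch illustrates, under assumed tensor layouts, how keeping at most Nvox voxels with the highest classification probabilities can bound the sparsity level; it is an illustration, not the exact pruning used in the disclosure.

```python
# Illustrative sketch: bound the sparsity by keeping at most n_vox voxels with the
# highest classification probabilities (coords/feats/cls_probs layouts are assumptions).
import torch

def prune_voxels(coords, feats, cls_probs, n_vox: int):
    if coords.shape[0] <= n_vox:           # already sparse enough, keep everything
        return coords, feats
    keep = torch.topk(cls_probs, k=n_vox).indices
    return coords[keep], feats[keep]
```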
  • Conv1, Conv2, Conv3, and Conv4 are each convolutional layers. Convolutional layers of the neural network convolve the input and pass the result to the next layer. Each convolutional neuron processes data only for its receptive field. Convolutional neural networks are widely used to process data with a grid-like topology (such as images), since convolution considers spatial relations between separate features.
  • TransConv1, TransConv2, and TransConv3 are transposed convolutional layers.
  • the transposed convolutional layer is the standard transposed convolutional layer. It takes a tensor as input, calculates the convolution function over it, returns the results of the convolution - also a tensor. It has parameters - the so-called convolution kernel, customizable during neural network training. In essence, it is a convolutional layer, but it is able to increase the dimension of the input tensor by increasing its sparseness or duplicating values.
  • Pruning denotes the pruning layer, a non-standard layer used in GSDN. It accepts a sparse 3D tensor and filters it with a probability mask; feature level-wise probabilities are calculated with an additional convolutional scoring layer.
  • the anchor-free FCAF3D head part consists of three parallel sparse convolutional layers (see figure 1) with weights shared across feature levels.
  • Extracted features obtained at the previous stage are processed with a 3D sparse convolutional head with weights shared across different feature levels.
  • the head part processes the extracted features of the objects in the scene to obtain predictions of locations, orientations, and categories of the objects in the scene; for each of the objects in the scene, the head part outputs the predictions.
  • the prediction comprises object classification probability, object bounding box regression parameters, and centerness of the object inside the object bounding box.
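  • A dense PyTorch sketch of such a head with three parallel branches is shown below for illustration; the patented head uses sparse 3D convolutions with weights shared across feature levels, and applying exp to the regression branch is only one common way (an assumption here) to keep the predicted distances positive.

```python
# Illustrative dense sketch of an anchor-free head with three parallel 1x1x1 convolutions;
# the patented head is built from sparse 3D convolutions shared across feature levels.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, in_channels: int, n_classes: int, n_reg: int = 6):
        super().__init__()
        self.cls = nn.Conv3d(in_channels, n_classes, 1)   # classification probabilities (logits)
        self.reg = nn.Conv3d(in_channels, n_reg, 1)       # bounding box regression parameters
        self.cntr = nn.Conv3d(in_channels, 1, 1)          # centerness score (logit)

    def forward(self, x):
        # exp keeps regressed distances positive; this is an assumed design choice
        return self.cls(x), torch.exp(self.reg(x)), self.cntr(x)
```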
  • The obtained predictions are filtered by comparing them according to the object classification probability in order to choose the most probable prediction. The chosen most probable prediction is considered the final estimate of the location, orientation, and category of the object in the scene and is characterized by the data regarding the location, orientation, and category of the object in the scene.
  • the robotic device uses such numerical representation of the scene.
  • At any moment of computation within the neural network, a set of coordinates of points in three-dimensional space is operated on.
  • the set of coordinates of points is fed into the input; at the input and output of each layer of the neural network there is data represented as three-dimensional sparse tensors. All points are in the same 3D space, but the number of points and the coordinates of the points change during the computation.
  • Locations are coordinates of points that appear in the process of computation. These are not exactly the same coordinates of points that were given as input, but they are somewhere between the coordinates of the input points, approximately in the same region of space.
  • the coordinate system (coordinate grid) is specified through the coordinates of points in three-dimensional space and the marking of the bounding boxes of objects: all coordinates are specified in some coordinate system.
  • the y-axis must be co-directed with the gravity vector: in this case, the three-dimensional bounding boxes of objects of the form OBB and AABB will be located horizontally.
  • AABB is short for Axis-Aligned Bounding Box. This is a common term for a 3D bounding box of an object of some kind, all edges and planes of which are co-directed to the coordinate axes of 3D space.
  • OBB is short for Oriented Bounding Box. This is a common term meaning a three-dimensional bounding box of an object of a certain kind, located horizontally in space and arbitrarily rotated in the horizontal plane.
  • Classification probabilities mean that for each category of object (table, chair, ...) the probability that the location is inside the three-dimensional bounding box of an object of this category is estimated ("the point belongs to an object of this category").
  • Regarding bounding box regression parameters, it should be clarified that all existing methods of 3D object detection do not directly predict the 3D object bounding box in an explicit form.
  • location-specific 3D bounding box parameters are estimated instead: for example, distances from the location to 6 faces of the 3D bounding box.
  • Bounding box is the 3D bounding box of the object.
  • the centerness describes the proximity of a location to the center of the reference (ground truth) three-dimensional bounding box of an object that this location falls into. Centerness is a relative value that can take values from 0 to 1; the closer the location is to the center, the larger the value and the closer it is to 1.
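  • One plausible way to compute a 3D centerness value with these properties, extending the FCOS-style formula to three axes, is sketched below; the exact normalization used in the disclosure may differ.

```python
# Illustrative 3D centerness: the six arguments are the distances from a location to the
# two box faces along each axis; returns 1.0 at the center and approaches 0 near a face.
def centerness_3d(dx_min, dx_max, dy_min, dy_max, dz_min, dz_max):
    rx = min(dx_min, dx_max) / max(dx_min, dx_max)
    ry = min(dy_min, dy_max) / max(dy_min, dy_max)
    rz = min(dz_min, dz_max) / max(dz_min, dz_max)
    return (rx * ry * rz) ** (1.0 / 3.0)
```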
  • the head part returns classification probabilities, bounding box regression parameters, centerness score for each location.
  • Head part relates to a Convolution layer.
  • Head part contains layers of Regression, Centerness, Classification.
  • the Classification layer outputs classification probabilities,
  • the Regression layer outputs bounding box regression parameters,
  • and the Centerness layer outputs the centerness score, respectively.
  • Pooling relates to a Pooling layer.
  • the Pooling layer is a standard layer that calculates the local maximum or local average.
  • the Pooling layer takes a tensor as input, returns a spatially ordered set of local maxima/local averages of this tensor, which is also a tensor.
  • Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer.
  • Local pooling combines small clusters, tiling sizes such as 2 x 2 are commonly used.
  • Global pooling acts on all the neurons of the feature map. There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map, while average pooling takes the average value.
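  • A tiny NumPy illustration of 2 x 2 max pooling and average pooling on a 4 x 4 feature map:

```python
# 2x2 max pooling and average pooling on a small feature map.
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [5, 4, 1, 2],
                 [0, 1, 8, 6],
                 [2, 3, 4, 7]], dtype=float)
tiles = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)  # split into four 2x2 tiles
max_pooled = tiles.max(axis=(2, 3))                     # [[5., 2.], [3., 8.]]
avg_pooled = tiles.mean(axis=(2, 3))                    # [[3.25, 1.25], [1.5, 6.25]]
```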
  • FCAF3D outputs locations for different feature levels, which should be assigned to ground truth boxes {b}.
  • the input is Npts RGB-colored points
  • the output is object classification probability, bounding box regression parameters, centerness score for each location (that is, for some set of points in three-dimensional space).
  • three-dimensional bounding boxes of objects are calculated from the classification probabilities, bounding box regression parameters, centerness score. Examples of such three-dimensional bounding boxes are shown in the figure 3.
  • Predictions of the location, orientation of an object in space, as well as its category are available in numerical representation representing computer vision for the robotic device and can be used by the robotic device in accordance with the task.
  • a method for visualizing the predictions obtained using the proposed method is also implemented. The user can see an image of the scene with 3D bounding boxes of objects placed in it. These 3D bounding boxes are colored differently to encode different categories of objects.
  • the color coding is arbitrary but fixed (the same color always corresponds to the same category).
  • output locations are predicted at three levels, and for each level the maximum distance from the location to the faces of the 3D bounding box of an object that can be assigned to this location is predefined.
  • the thresholds are set to up to 75 cm, from 75 cm to 1.5 m, and more than 1.5 m, respectively.
  • the disclosure proposes a simplified strategy for sparse data that does not require tuning dataset-specific hyperparameters. For each bounding box (examples of 3D object bounding boxes are shown in figure 3), the last feature level at which this bounding box covers at least Nloc locations is selected. If there is no such feature level, the first one is selected. Locations are filtered via center sampling [26], considering only the points near the bounding box center as positive matches.
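  • The assignment strategy described above can be sketched as follows (illustrative only; the point_in_box helper and the level ordering are assumptions): for each ground truth box, the last feature level on which the box covers at least Nloc locations is kept, with a fallback to the first level.

```python
# Illustrative sketch of the multi-level assignment; boxes are (x_min, y_min, z_min,
# x_max, y_max, z_max), locations_per_level goes from the first (finest) level onward.
def point_in_box(p, box):
    return all(box[i] <= p[i] <= box[i + 3] for i in range(3))

def assign_level(box, locations_per_level, n_loc):
    chosen = 0                                            # fall back to the first level
    for level, locations in enumerate(locations_per_level):
        covered = sum(point_in_box(p, box) for p in locations)
        if covered >= n_loc:
            chosen = level                                # keep the last qualifying level
    return chosen
```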
  • ground truth labels means the reference object categories that are known or can be directly derived for each object bounding box from the markup of the dataset.
  • 3D centerness denotes centerness in three dimensions. Centerness describes the proximity of a location to the center of the reference (ground truth) three-dimensional bounding box of an object that this location falls into. It is a relative value that can take values from 0 to 1; the closer the location is to the center, the larger the value and the closer it is to 1.
  • the scores are multiplied by 3D centerness just before NMS as proposed in [24].
  • the overall loss function is formulated as the sum of a classification loss, a regression loss, and a centerness loss over the locations, normalized by the number of positive matches: L = (1/N_pos) * Σ (L_cls(p̂, p) + L_reg(b̂, b) + L_cntr(ĉ, c)), where the regression and centerness terms are computed only for positive locations.
  • Classification loss L_cls is a focal loss.
  • regression loss L_reg is an IoU loss.
  • centerness loss L_cntr is binary cross-entropy. For each loss, predicted values are denoted with a hat.
  • focal loss, IoU, binary cross-entropy are common terms for different penalty functions used to train neural networks.
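  • For illustration, the three loss terms named above can be combined as sketched below; the tensor layouts, the use of torchvision's sigmoid_focal_loss, and the averaging over positive locations are assumptions rather than the exact patented formulation.

```python
# Illustrative combination of the three losses; not the exact patented formulation.
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_logits, cls_targets, iou, cntr_logits, cntr_targets, pos_mask):
    """cls_targets are one-hot floats; iou holds the IoU between predicted and true boxes."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="sum")      # focal loss
    l_reg = (1.0 - iou[pos_mask]).sum()                                       # IoU loss, positives only
    l_cntr = F.binary_cross_entropy_with_logits(
        cntr_logits[pos_mask], cntr_targets[pos_mask], reduction="sum")       # centerness BCE
    n_pos = pos_mask.sum().clamp(min=1)
    return (l_cls + l_reg + l_cntr) / n_pos
```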
  • the 3D object bounding boxes can be axis-aligned (AABB) or oriented (OBB).
  • AABB is horizontal and not rotated
  • OBB is horizontal and arbitrarily rotated in the horizontal plane.
  • AABB can be specified by center point (3 coordinates), length, width and height.
  • for an OBB it is also necessary to set the angle of rotation in the horizontal plane - the heading angle θ.
  • An AABB can be described as (x, y, z, w, l, h), while the definition of an OBB includes a heading angle θ: (x, y, z, w, l, h, θ).
  • x, y, z denote the coordinates of the center of a bounding box, while w, l, h are its width, length, and height, respectively.
  • the bounding box regression target for a location can be formulated as a 6-tuple δ = (δ1, ..., δ6), e.g. the distances from the location to the six faces of the box.
  • the predicted AABB can be trivially obtained from δ.
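  • A minimal sketch of recovering an AABB from the six regressed distances is given below; the axis-to-dimension convention (w along x, l along y, h along z) is an assumption.

```python
# Illustrative decoding of an AABB from a location (x, y, z) and the six regressed
# distances d = (to x_min, to x_max, to y_min, to y_max, to z_min, to z_max).
def aabb_from_deltas(x, y, z, d):
    x_min, x_max = x - d[0], x + d[1]
    y_min, y_max = y - d[2], y + d[3]
    z_min, z_max = z - d[4], z + d[5]
    cx, cy, cz = (x_min + x_max) / 2, (y_min + y_max) / 2, (z_min + z_max) / 2
    w, l, h = x_max - x_min, y_max - y_min, z_max - z_min   # assumed axis convention
    return cx, cy, cz, w, l, h
```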
  • The heading angle is the angle of rotation of the 3D bounding box of the 3D object in the horizontal plane, which determines the orientation of the object ("where the object is facing"). It is one of the parameters that defines an oriented three-dimensional bounding box (the definition of an OBB includes a heading angle θ).
  • All state-of-the-art 3D object detection methods from point clouds address the heading angle estimation task as classification followed by regression.
  • the heading angle is classified into bins; then, the precise heading angle is regressed within a bin.
  • the range from 0 to 2π is typically divided into 12 equal bins [22], [21], [33], [19].
  • Estimating the value of the heading angle occurs in two stages. First, a rough estimate is made: the range of values that the heading angle falls within is determined. These intervals are called bins.
  • at the second stage, the heading angle value is refined within the bin through regression.
  • VoteNet and other voting-based methods estimate the value of the heading angle directly. Outdoor methods explore more elaborate approaches, e.g. predicting the values of trigonometric functions. For instance, SMOKE [17] estimates sin θ and cos θ and uses the predicted values to recover the heading angle.
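  • For illustration, a heading angle predicted via its sine and cosine (as in SMOKE-style parametrizations) can be recovered with atan2:

```python
# Recovering a heading angle from predicted sine and cosine values.
import math

def heading_from_sin_cos(sin_t: float, cos_t: float) -> float:
    return math.atan2(sin_t, cos_t)   # angle in (-pi, pi]
```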
  • Figure 2 depicts indoor objects whose heading angle cannot be determined unambiguously.
  • Figure 2 shows examples of objects that look the same from several sides: a square table, a round table, another round table.
  • the heading angle is the angle of rotation of the 3D bounding box of a 3D object in the horizontal plane, which determines the orientation of the object ("where the object is facing").
  • The heading angle is one of the parameters that specifies the oriented 3D bounding box (OBB). The ground truth angle is an angle that characterizes the reference, true value of the corresponding parameter of the object estimated by the method. It is desirable that the values of the parameters estimated by the method (predicted, estimated, output by the method) be as close as possible to the ground truth values of the object's parameters.
  • ground truth angle should be read as ground truth heading angle, i.e. the reference, true value of the heading angle, known in advance from the labeling of the dataset.
  • ground truth angle annotations can be chosen randomly for these objects, making heading angle bin classification meaningless.
  • The rotated IoU loss is used, as its value is the same for all possible choices of the heading angle.
  • Proposed is an OBB parametrization that considers the rotation ambiguity. It should be clarified that it is not always possible to unambiguously determine the orientation of an object, since some objects look the same from several sides: for example, a round stool, a round table, a square table (see figure 2). Thus, any value of the heading angle taken as a reference will be chosen randomly to some extent. This is called rotation ambiguity.
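  • The sketch below is an illustration only and is not necessarily the exact parametrization claimed in the disclosure: it encodes the ambiguous part of an OBB as two targets built from 2θ and a log width-to-length ratio, so that a box rotated by 90 degrees with its width and length swapped maps to the same target values.

```python
# Illustration only (the exact parametrization in the disclosure may differ): two extra
# regression targets built from 2*theta and ln(w/l) are invariant to rotating the box by
# 90 degrees while swapping its width and length, which removes that ambiguity.
import math

def obb_extra_targets(w: float, l: float, theta: float):
    q = math.log(w / l)
    return q * math.sin(2 * theta), q * math.cos(2 * theta)

# the ambiguous annotations map to identical targets:
a = obb_extra_targets(2.0, 1.0, 0.3)
b = obb_extra_targets(1.0, 2.0, 0.3 + math.pi / 2)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```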
  • data regarding the location, orientation and the category of the object in the scene are outputted from the neural network as numerical representation of the scene.
  • the numerical representation of the scene can be converted into an image of the scene.
  • Robotic devices may use the location, orientation and category of objects in numerical representation to plan paths inside the scene so that these objects are bypassed.
  • the objects' categories might be used to manipulate only the objects belonging to the desired categories, e.g., fetching and transporting the desired objects according to the user instructions, cleaning pieces of furniture of specified categories, etc.
  • the location, orientation and category of objects in numerical representation might be used to automatically provide statistics about the objects present in the scene, e.g. to monitor the amount of currently available items in a retail area, or to create a textual description of a given scene in assistant applications, e.g. for assisting blind people.
  • the numerical representation of the scene can be converted (by methods known from the prior art) into an image of the scene; the computer device further comprises a screen, and the image of the scene is displayed on the screen for a user.
  • the image of the scene can be used to visually demonstrate the results of applying the proposed method to a person in AR (augmented reality) applications, in applications for monitoring or accounting for objects, etc.
  • the consumer of the results in the form of images is a person.
  • the location, orientation, and category of objects in image representation might be used in AR to supply the user with information about the objects present in the scene and to enrich the captured image of the scene with generated annotations of the objects present in the scene.
  • the proposed method is evaluated on three 3D object detection benchmarks: ScanNet V2 [7], SUN RGB-D [25], and S3DIS [1].
  • mAP: mean average precision.
  • the ScanNet dataset contains 1513 reconstructed 3D indoor scans with per-point instance and semantic labels of 18 object categories. Given this annotation, AABBs are calculated via the standard approach [22].
  • the training subset is comprised of 1201 scans, while 312 scans are left for validation.
  • SUN RGB-D is a monocular 3D scene understanding dataset containing more than 10,000 indoor RGB-D images.
  • the annotation consists of per-point semantic labels and OBBs of 37 object categories.
  • experiments are conducted with objects of the 10 most common categories.
  • the training and validation splits contain 5285 and 5050 point clouds, respectively.
  • S3DIS, the Stanford Large-Scale 3D Indoor Spaces dataset, contains 3D scans of 272 rooms from 6 buildings, with 3D instance and semantic annotation. Following [10], the proposed method is evaluated on furniture categories. AABBs are derived from the 3D semantics. Used is the official split, where 68 rooms from Area 5 are intended for validation, while the remaining 204 rooms comprise the training subset.
  • the size of the output classification layer equals the number of object categories, which is 18, 10, and 5 for ScanNet, SUN RGB-D, and S3DIS, respectively.
  • SUN RGB-D contains OBBs, so additional targets δ7 and δ8 are predicted for this dataset; note that the loss function is not affected.
  • ScanNet, SUN RGB-D, and S3DIS contain different numbers of scenes, so each scene is repeated 10, 3, and 13 times per epoch, respectively.
  • Similar to GSDN [10], the sparse 3D modification of ResNet34, named HDResNet34, is used as a backbone.
  • the neck part and the head part use the outputs of the backbone part at all feature levels.
  • For the initial point cloud voxelization, the voxel size is set to 0.01 m and the number of points Npts to 100,000. Accordingly, Nvox equals 100,000.
  • the NMS IoU threshold is 0.5.
  • FCAF3D is trained using the MMdetection3D [6] framework.
  • the training procedure follows the default MMdetection [3] scheme: training takes 12 epochs and the learning rate decreases on the 8th and the 11th epochs.
  • Employed is the Adam optimizer with an initial learning rate of 0.001 and weight decay of 0.0001. All neural networks are trained on two NVidia V100 with a batch size of 8. Evaluation and performance tests are run on a single NVidia GTX1080Ti.
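  • The optimizer and schedule described above can be sketched in plain PyTorch as follows (the decay factor 0.1 and the placeholder model are assumptions; the actual training uses the MMdetection3D configuration system):

```python
# Illustrative optimizer and schedule matching the numbers above; the 0.1 decay factor
# and the placeholder model are assumptions, not values taken from the disclosure.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # placeholder standing in for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one training epoch over the (repeated) dataset would go here ...
    scheduler.step()
```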
  • the evaluation protocol introduced in [16] is used in the disclosure. Both training and evaluation are randomized, as the input Npts points are randomly sampled from the point cloud. To obtain statistically significant results, training is run 5 times, and each trained neural network is tested 5 times independently.
  • Table 1 indicates results of FCAF3D and existing indoor 3D object detection methods that accept point clouds. The best metric values are marked bold. FCAF3D outperforms previous state-of-the-art methods: GroupFree (on ScanNet and SUN RGB-D) and GSDN (on S3DIS). The reported metric value is the best one across 25 trials; the average value is given in brackets.
  • FIG. 3 illustrates the point cloud from ScanNet with AABBs.
  • the color of a bounding box denotes the object category.
  • Each object category has a different color bounding box.
  • the categories are color-coded, namely: blue (marked as C) is a chair, orange (marked as O) is a table, green (marked as G) is a door, red (marked as R) is a cupboard.
  • the 3D bounding boxes predicted by the proposed method are similar to the ground truth bounding boxes (right). This result clearly illustrates the accuracy of the proposed method.
  • the rotated IoU loss decreases the number of trainable parameters and hyperparameters, including geometry priors and loss weights. This loss has already been used in outdoor 3D object detection [34]. Recently, [6] reported results of VoteNet trained with axis-aligned IoU loss on ScanNet.
  • Table 2 shows that replacing the standard parametrization with Mobius one boosts VoteNet and ImVoteNet mAP@0.5 by approximately 4%.
  • ImVoxelNet does not use a classification+regression scheme to estimate heading angle but predicts its value directly in a single step. Since the original ImVoxelNet uses the rotated IoU loss, authors do not need to remove redundant losses, only to change the parametrization. Again, the Mobius parametrization helps to obtain the best results, even though the superiority is minor.
  • Table 2 illustrates results of several 3D object detection methods that accept inputs of different modalities, with different OBB parametrization on SUN RGB-D.
  • the FCAF3D metric value is the best across 25 trials; the average value is given in brackets.
  • results are reported from the original papers and also the results obtained through proposed experiments with MMdetection3D-based re-implementations (marked as Reimpl).
  • PC is a point cloud.
  • RGB is an RGB image or a set of RGB images.
  • RGB+PC is an RGB image and a point cloud or a set of RGB images and a point cloud.
  • VoteNet is a voting-based 3D object detection method that accepts RGB-colored point cloud.
  • ImVoteNet is a voting-based 3D object detection method that accepts an RGB image and a point cloud or a set of RGB images and a point cloud.
  • ImVoxelNet is a 3D object detection method that accepts an RGB image or a set of RGB images.
  • "Reimpl.” means reimplementation.
  • VoteNet and ImVoteNet were reimplemented for the experiments, as the source code of these methods has not been made publicly available. Reported are both the results provided in the original papers and the results obtained with the reimplemented methods. These results prove that the reimplementation is correct and provides accuracy comparable with the accuracy reported in the original papers.
  • w/naive param means with the naive OBB parametrization, where each parameter of an OBB is estimated directly. This parametrization is used in the original VoteNet.
  • w/sin-cos param means with the sin-cos OBB parametrization. This sin-cos parametrization is formulated in the outdoor 3D object detection method SMOKE.
  • w/Mobius param means with the Mobius OBB parametrization.
  • GSDN anchors To prove that the generalization ability of anchor-based layers is limited, GSDN is trained in an anchor-free setting, which is equivalent to using a single anchor. According to Table 3, mAP@0.5 drops dramatically by 12% in this setting. In other words, GSDN demonstrates poor performance without domain-specific guidance in the form of anchors; hence, this method is considered inflexible and poorly generalizable. All FCAF3D entries shown in Table 3 correspond to the proposed method with various modifications of ResNet (HDResNet34, HDResNet34:3, HDResNet34:2) and various voxel sizes.
  • ResNet HDResNet34, HDResNet34:3, HDResNet34:2
  • FCAF3D For comparison, FCAF3D with the same backbone is evaluated, and it outperforms GSDN by a huge margin, achieving a twice as large mAP value.
  • Table 3 illustrates results of fully convolutional 3D object detection methods that accept point clouds on ScanNet.
  • GSDN results are obtained with voxels of 0.05 m. While smaller voxels seem to ensure a more detailed data representation, the dependence between the voxel size and accuracy is not straightforward.
  • Changing the voxel size affects GSDN and FCAF3D differently due to their different assignment strategies.
  • FCAF3D In FCAF3D, 3D space should be densely "covered" with locations. The distance between locations is proportional to the voxel size, so the smaller the voxel size, the denser the coverage and, consequently, the higher the recall.
  • GSDN "covers" 3D space with anchors. The linear sizes of anchors are proportional to the voxel size. During assignment, anchors with a small intersection with object boxes are ignored. If the voxel size decreases, the anchors become smaller; accordingly, some anchors will be ignored, resulting in a lower recall.
  • FCAF3D benefits from smaller voxels while GSDN does not.
  • Overall, the GSDN performance is expected to drop if the voxel size is reduced from 0.05 to 0.01 m.
  • FCAF3D denotes the proposed method and its neural network design.
  • Table 4 illustrates results of ablation studies on the voxel size, the number of points (which equals the number of voxels Nvox in pruning), centerness, and center sampling in FCAF3D. The better options are marked bold (actually, these are the default options used to obtain the results in Tab. 1 above). The reported metric value is the best across 25 trials; the average value is given in brackets.
  • Voxel size Expectedly, accuracy goes down with an increasing voxel size. Voxels of 0.03, 0.02, and 0.01 m are used. The notable gap in mAP between voxel sizes of 0.01 and 0.02 m is attributed to the presence of almost flat objects, such as doors, pictures, and whiteboards. Namely, with a voxel size of 2 cm, the head would output locations with a 16 cm tolerance, but the almost flat objects could be less than 16 cm along one of the dimensions. A decrease in accuracy is also observed for larger voxel sizes.
  • Nvox Npts is used to guide pruning in the neck, so Nvox is set equal to Npts.
  • When Nvox exceeds 100k, the inference time increases due to growing sparsity in the neck, while the accuracy improvement is negligible. Accordingly, the grid search for Npts is restricted to 100k, which is used as the default value given the obtained results.
  • centerness improves mAP for the ScanNet and SUN RGB-D datasets.
  • S3DIS For S3DIS, the results are mixed: the better mAP@0.5 is balanced by a minor decrease of mAP@0.25. Nevertheless, the results are analyzed altogether, so centerness is considered a helpful feature with a small positive effect on the mAP, almost reaching 1% of mAP@0.5 on ScanNet.
  • FCAF3D uses the same sparse convolutions and the same backbone as GSDN. However, as can be seen in Tab.3, the default FCAF3D is slower than GSDN. This is due to the smaller voxel size: authors use 0.01m for a proper multi-level assignment while GSDN uses 0.05m.
  • FIG. 4 illustrates the detection accuracy against inference speed measured in scenes per second for the original and modified FCAF3D in comparison with the existing methods of 3D object detection.
  • GSDN is also included in the comparison shown in Figure 4.
  • FCAF3D modifications have different numbers of backbone feature levels.
  • The proposed FCAF3D method without accelerating modifications is the most accurate of all 3D object detection methods; however, it is slower than some other methods. Nevertheless, for each existing method, there is an FCAF3D modification surpassing that method in both detection accuracy and inference speed.
  • Figure 4 shows a comparison of the key characteristics of the proposed FCAF3D method and its two modifications (FCAF3D w/ 3 levels and FCAF3D w/ 2 levels) with existing methods that solve a similar problem: GroupFree, H3DNet, BRNet, 3DETR, 3DETR-m, VoteNet, GSDN.
  • the comparison is carried out on the ScanNet dataset.
  • Each point on the graph corresponds to some method of three-dimensional object detection.
  • Y-axis Performance (mAP) is mAP@0.5, the 3D object detection accuracy metric. Larger values correspond to higher accuracy; accordingly, the higher a method is located on the chart, the better.
  • FCAF3D is a first-in-class fully convolutional anchor-free 3D object detection method for indoor scenes.
  • The proposed method significantly outperforms the previous state-of-the-art on the challenging indoor SUN RGB-D, ScanNet, and S3DIS benchmarks in terms of both mAP and inference speed.
  • Also presented is a novel oriented bounding box parametrization, which is shown to improve accuracy for several 3D object detection methods.
  • the proposed parametrization allows avoiding any prior assumptions about objects, thus reducing the number of hyperparameters.
  • FCAF3D with proposed bounding box parametrization is accurate, scalable, and generalizable at the same time.
  • a carrier device should meet technical requirements. Specifically, it should have a sufficient amount of RAM and computational resources. The required amount of resources depends on the device's function, camera parameters, and performance requirements.
  • Figure 5 is a schematic diagram of a robotic device for computer vision according to an embodiment of the disclosure.
  • the robotic device for computer vision 100 may include a processor 110 and a memory 120.
  • the memory 120 may store programs necessary for processing or control operations performed by the processor 110.
  • the memory 120 may store data input to or output from the robotic device for computer vision 100.
  • the memory 120 may include at least one type of storage medium, i.e., at least one of a flash memory-type memory, a hard disk-type memory, a multimedia card micro-type memory, a card-type memory (e.g., an SD card or an XD memory), random access memory (RAM), static RAM (SRAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), PROM, a magnetic memory, a magnetic disc, or an optical disc.
  • the memory 120 may store one or more instructions executable by the processor 110.
  • the memory 120 may store various types of information input/output via an input/output (I/O) interface (not shown).
  • I/O input/output
  • the memory 120 may store instructions that cause the processor 110 to obtain a scene with at least one object, represent the obtained scene as a set of N points, input the set of N points as input data to a neural network, wherein the neural network performs the following steps of representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene and output, from the neural network, the predictions as numerical representation.
  • the processor 110 controls all operations of the robotic device for computer vision 100 and may be used in the same sense as a controller.
  • the processor 110 may control all the operations of the robotic device for computer vision 100 and a flow of signals between the internal components of the robotic device for computer vision 100 and perform a function of processing data.
  • the processor 110 may include RAM (not shown) that stores signals or data input from outside of the robotic device for computer vision 100 or is used as a storage area corresponding to various operations performed by the robotic device for computer vision 100, and ROM (not shown) that stores a control program for controlling the robotic device for computer vision 100.
  • the processor 110 may include a plurality of processors.
  • the processor 110 may be implemented as a main processor (not shown) and a sub processor (not shown) operating in a sleep mode.
  • the processor 110 may include at least one of a central processing unit (CPU), a graphic processing unit (GPU), or a video processing unit (VPU).
  • CPU central processing unit
  • GPU graphic processing unit
  • VPU video processing unit
  • the processor 110 may obtain a scene with at least one object.
  • the processor 110 may represent the obtained scene as a set of N points.
  • the processor 110 may input the set of N points as input data to a neural network.
  • the neural network may represent the set of N points as volumetric pixel representation.
  • the neural network may process the volumetric pixel representation to obtain four-dimensional tensors.
  • the neural network may process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene.
  • the neural network may process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
  • the processor 110 may output, from the neural network, the predictions as numerical representation.
  • Figure 6 is a flow diagram of method for computer vision of a robotic device according to an embodiment of the disclosure.
  • the robotic device for computer vision 100 may obtain a scene with at least one object(S610).
  • the robotic device for computer vision 100 may represent the obtained scene as a set of N points(S620).
  • the robotic device for computer vision 100 may input the set of N points as input data to a neural network(S630).
  • the neural network performs the steps of representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
  • the robotic device for computer vision 100 may output, from the neural network, the predictions as numerical representation (S640). A detailed description of each step is provided above with reference to Figures 1 to 4.
  • a method for computer vision of a robotic device comprising obtaining a scene with at least one object, representing the obtained scene as a set of N points, inputting the set of N points as input data to a neural network, wherein the neural network performs the steps of representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and outputting, from the neural network, the predictions as numerical representation.
  • the method may further comprise converting the numerical representation of the scene into an image of the scene and displaying the image.
  • the predictions comprise an object classification probability, object bounding box regression parameters and centerness of the at least one object inside an object bounding box.
  • the method may further comprise filtering the obtained predictions by comparing the predictions according to the object classification probability to choose the most probable prediction and outputting, from the neural network, the most probable prediction as numerical representation, wherein the most probable prediction is considered as a final estimate of the location, the orientation and the category of the at least one object in the scene.
  • Each of the set of N points is represented by a coordinate and a color.
  • the representing of the set of N points as volumetric pixel representation comprises dividing a 3D space of the scene into elements with a three-dimensional grid and determining a center of a volumetric pixel by averaging all point coordinates from the set of N points falling into one grid element.
  • the method may further comprise determining a 3D bounding box of the at least one object based on the object classification probability, the object bounding box regression parameters and the centerness.
  • a robotic device for computer vision comprises at least one memory(120) storing one or more computer executable instructions, and at least one processor(110) configured to execute the one or more instructions stored in the memory(120) to obtain a scene with at least one object, represent the obtained scene as a set of N points, input the set of N points as input data to a neural network, wherein the neural network performs the following steps: represent the set of N points as volumetric pixel representation, process the volumetric pixel representation to obtain four-dimensional tensors, process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and output, from the neural network, the predictions as numerical representation.
  • the robotic device further comprises a display, and the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to convert the numerical representation of the scene into an image of the scene and display the image on the display.
  • the predictions comprise an object classification probability, object bounding box regression parameters and centerness of the at least one object inside an object bounding box.
  • the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to filter the obtained predictions by comparing the predictions according to the object classification probability to choose the most probable prediction and output, from the neural network, the most probable prediction as numerical representation, wherein the most probable prediction is considered as a final estimate of the location, the orientation and the category of the at least one object in the scene.
  • Each of the set of N points is represented by a coordinate and a color.
  • the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to divide a 3D space of the scene into elements with a three-dimensional grid and determine a center of the volumetric pixel by averaging all point coordinates from the set of N points falling into one grid element.
  • the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to determine a 3D bounding box of the at least one object based on the object classification probability, the object bounding box regression parameters and the centerness.
  • a computer-readable storage medium configured to store instructions which when executed by at least one processor, cause the at least one processor to execute any one of the methods for computer vision discussed above.
  • Li, B. 3D fully convolutional network for vehicle detection in point cloud. IROS (Intelligent Robots and Systems).


Abstract

The disclosure relates to a method for computer vision of a robotic device, the method comprising obtaining a scene with at least one object, representing the obtained scene as a set of N points, inputting the set of N points as input data to a neural network, wherein the neural network performs the following steps: representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and outputting, from the neural network, the predictions as numerical representation.

Description

METHOD FOR PROVIDING COMPUTER VISION
The disclosure relates to computer vision, namely to mobile robot navigation and to mobile apps that perform scene understanding and object recognition.
3D object detection from point clouds aims at simultaneous localization and recognition of 3D objects given a 3D point set. As a core technique for 3D scene understanding, it is widely applied in autonomous driving, robotics, and AR.
While 2D methods ([26], [32]) work with dense fixed-size arrays, 3D methods are challenged by irregular unstructured 3D data of arbitrary volume. Consequently, 2D data processing techniques are not directly applicable to 3D object detection, so 3D object detection methods ([10], [22], [19]) employ inventive approaches to 3D data processing.
Convolutional 3D object detection methods have scalability issues: large-scale scenes either require an impractical amount of computational resources or take too much time to process. Other methods opt for voxel data representation and employ sparse convolutions; however, these methods solve scalability problems at the cost of detection accuracy. In other words, there is no 3D object detection method that provides precise estimates and scales well.
Recent 3D object detection methods are designed to be either indoor or outdoor.
Indoor and outdoor methods have been developing almost independently, applying domain-specific data processing techniques. Many modern outdoor methods [30], [13], [35] project 3D points onto a bird's-eye-view plane, thus reducing the task of 3D object detection to 2D object detection. Naturally, these methods take advantage of the fast-evolving algorithms for 2D object detection. Given a bird's-eye-view projection, [14] processes it in a fully convolutional manner, while [31] exploits a 2D anchor-free approach. Unfortunately, the approaches that proved to be effective for both 2D object detection and 3D outdoor object detection cannot be trivially adapted to indoor scenes, as it would require an impracticable amount of memory and computing resources. To address performance issues, different 3D data processing strategies have been proposed. Currently, three approaches dominate the field of 3D object detection: voting-based, transformer-based, and 3D convolutional. Each of these approaches is discussed in detail below; a brief overview of anchor-free methods is also provided.
Voting-based methods.
VoteNet [22] was the first method that introduced points voting for 3D object detection. VoteNet processes 3D points with Point-Net [23], assigns a group of points to each object candidate according to their voted center, and computes object features from each point group. Among the numerous successors of VoteNet, the major progress is associated with advanced grouping and voting strategies applied to the PointNet features. BRNet [4] refines voting results with the representative points from the vote centers, which improves capturing the fine local structural features. MLCVNet [29] introduces three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels. H3DNet [33] improves the point group generation procedure by predicting a hybrid set of geometric primitives.
VENet [28] incorporates an attention mechanism and introduces a vote weighting module trained via a novel vote attraction loss.
All VoteNet-like voting-based methods are limited by design. First, they show poor scalability: as their performance depends on the amount of input data, they tend to slow down if scenes become larger. Moreover, many voting-based methods implement voting and grouping strategies as custom layers, making it difficult to reproduce or debug these methods or to port them to mobile devices.
Transformer-based methods.
The recently emerged transformer-based methods use end-to-end learning and a forward pass on inference instead of heuristics and optimization, which makes them less domain-specific. GroupFree [16] replaces the VoteNet head with a transformer module, updating object query locations iteratively and ensembling intermediate detection results. 3DETR [19] was the first method of 3D object detection implemented as an end-to-end trainable transformer. However, more advanced transformer-based methods still experience scalability issues similar to early voting-based methods. In contrast, the proposed method is fully convolutional, thus being faster and significantly easier to implement than both voting-based and transformer-based methods.
3D convolutional methods.
Voxel representation allows handling cubically growing sparse 3D data efficiently. Voxel-based 3D object detection methods ([12], [18]) convert points into voxels and process them with 3D convolutional networks. However, dense volumetric features still consume much memory, and 3D convolutions are computationally expensive. Overall, processing large scenes requires a lot of resources and cannot be done within a single pass.
GSDN [10] tackles performance issues with sparse 3D convolutions. It has encoder-decoder architecture, with both encoder and decoder parts built from sparse 3D convolutional blocks. Compared to the standard convolutional voting based and transformer-based approaches, GSDN is significantly more memory efficient and scales to large scenes without sacrificing point density. The major weakness of GSDN is its accuracy: this method is comparable to VoteNet in terms of quality, being significantly inferior to the current state-of-the-art [16].
GSDN uses 15 aspect ratios for 3D object bounding boxes as anchors. If GSDN is trained in an anchor-free setting with a single aspect ratio, the accuracy decreases by 12%. Unlike GSDN, the proposed method is anchor-free while taking advantage of sparse 3D convolutions.
RGB-based anchor-free object detection.
In 2D object detection, anchor-free methods are competitors to the standard anchor-based methods. FCOS [26] addresses 2D object detection in a per-pixel prediction manner and shows a robust improvement over its anchor-based predecessor RetinaNet [15]. FCOS3D [27] trivially adapts FCOS by adding extra targets for monocular 3D object detection. ImVoxelNet [24] solves the same problem with an FCOS-like head built from standard (non-sparse) 3D convolutional blocks. The proposed disclosure adapts ideas from the mentioned anchor-free methods to process sparse irregular data.
Besides being scalable and accurate, an ideal 3D object detection method should handle objects of arbitrary shapes and sizes without additional hacks and hand-tuned hyperparameters. Prior assumptions on 3D object bounding boxes (e.g. aspect ratios or absolute sizes) restrict generalization and increase the number of hyperparameters and trainable parameters.
According to an aspect of the disclosure, a method for computer vision of a robotic device may include obtaining a scene with at least one object. A method for computer vision of a robotic device may include representing the obtained scene as a set of N points. A method for computer vision of a robotic device may include inputting the set of N points as input data to a neural network. The neural network performs the step of representing the set of N points as volumetric pixel representation. The neural network performs the step of processing the volumetric pixel representation to obtain four-dimensional tensors. The neural network performs the step of processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene. The neural network performs the step of processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene. A method for computer vision of a robotic device may include outputting, from the neural network, the predictions as numerical representation.
According to an aspect of the disclosure, a robotic device for computer vision may include at least one memory storing one or more computer executable instructions. And the robotic device for computer vision may include at least one processor configured to execute the one or more instructions stored in the memory. The at least one processor is configured to execute the one or more instructions stored in the memory to obtain a scene with at least one object. The at least one processor is configured to execute the one or more instructions stored in the memory to represent the obtained scene as a set of N points. The at least one processor is configured to execute the one or more instructions stored in the memory to input the set of N points as input data to a neural network, wherein the neural network performs the following steps: represent the set of N points as volumetric pixel representation, process the volumetric pixel representation to obtain four-dimensional tensors, process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene. The at least one processor is configured to execute the one or more instructions stored in the memory to output, from the neural network, the predictions as numerical representation.
According to an aspect of the disclosure, a computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to obtain a scene with at least one object. A computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to represent the obtained scene as a set of N points. A computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to input the set of N points as input data to a neural network, wherein the neural network performs the following steps: represent the set of N points as volumetric pixel representation, process the volumetric pixel representation to obtain four-dimensional tensors, process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene. A computer-readable storage medium may store instructions that, when executed by a processor, cause the processor to output, from the neural network, the predictions as numerical representation.
The above and/or other aspects will be more apparent by describing exemplary embodiments with reference to the accompanying drawings, in which:
Figure 1 illustrates the general scheme of the proposed method.
Figure 2 illustrates examples of objects with an ambiguous heading angle.
Figure 3 illustrates the result of the proposed method with dataset ScanNet.
Figure 4 illustrates the detection accuracy against inference speed measured in scenes per second for the original and modified FCAF3D in comparison with the existing methods of 3D object detection.
Figure 5 is a schematic diagram of a robotic device for computer vision according to an embodiment of the disclosure.
Figure 6 is a flow diagram of method for computer vision of a robotic device according to an embodiment of the disclosure.
The method performed by the electronic device may be performed using an artificial intelligence (AI). A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor.
The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence is provided through training or learning.
Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs a layer operation through calculation using the output of a previous layer and the plurality of weight values. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks. A neural network can be implemented in hardware or in a combination of software and hardware.
The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
According to the disclosure, a method for object recognition may obtain output data by using image data as input data for an artificial intelligence. The artificial intelligence may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence with multiple pieces of training data by a training algorithm. The artificial intelligence may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
Visual understanding is a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
According to the disclosure, the proposed method may use an artificial intelligence to execute by using data. The processor may perform a pre-processing operation on the data to convert it into a form appropriate for use as an input for the artificial intelligence model. The artificial intelligence may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence with multiple pieces of training data by a training algorithm. The artificial intelligence may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
Reasoning prediction is a technique of logically reasoning and predicting by determining information and includes, e.g., knowledge-based reasoning, optimization prediction, preference-based planning, or recommendation.
The proposed disclosure relates to computer vision; it presents a first-in-class fully convolutional anchor-free indoor 3D object detection method named FCAF3D. FCAF3D is a simple, effective, and scalable method for detecting 3D objects from point clouds.
The disclosure is a method for providing computer vision of a robotic device. The method can be implemented on a computer device of the robotic device. The robotic device has a camera with depth sensors and an RGB camera, a CPU, internal memory storage, and RAM. The camera with the depth sensors and the RGB camera are used to capture a real scene with 3D objects in the scene. The proposed method makes it possible to detect and recognize a category and location of 3D objects in the captured scene.
The proposed solution is designed for scene analysis and object recognition. The obtained results can be employed in a wide range of tasks where decisions are based on the scene and its objects. For instance, the software based on the proposed method can supply mobile robotic devices (for example, for mobile robot navigation) with spatial information for planning a trajectory, gripping and manipulating objects. Furthermore, in a mobile application, the proposed solution can be used to generate prompts about the scene automatically.
The proposed solution is designed to detect and recognize three-dimensional objects and estimate their spatial position. The task formulation follows the classical problem statement of 3D object detection, formulated by the scientific computer vision community.
The proposed solution is assumed to be executed in mobile robotic devices having a computer device. The computer device of the robotic device has a camera with depth sensors and an RGB camera, a CPU, internal memory storage, RAM, and a screen.
Also, the disclosure can be implemented on smartphones as a part of a mobile application. For execution of the proposed method, a carrier device, for example a computer-readable medium, can be used. The computer-readable medium contains program code that reproduces the proposed method when executed on the computer device. Specifically, the carrier device should have a sufficient amount of RAM and computational resources. The required amount of resources depends on the device's function, camera parameters, and performance requirements.
It should be noted that when detecting 3D objects, it is traditional to use a predefined set of 3D object bounding boxes called anchors. Such a set can be considered as a set of a priori hypotheses about the location of objects in space (objects in the scene) and their sizes. The use of such a set of hypotheses makes it possible to detect objects in three-dimensional space by selecting the most probable hypotheses and refining them. However, a priori hypotheses do not always describe real objects in a particular scene well, so the use of anchors limits the applicability of the object detection method. In the beginning, all object detection methods used anchors. Recently, a new approach has been described that allows not using anchors when solving the object detection problem, which makes it possible to develop a more universal solution. Over the past few years, a whole class of methods that do not use anchors has been formed; they can be called anchor-free, i.e. "anchorless". Now they are competing with traditional "anchor" methods.
The proposed convolutional anchor-free indoor 3D object detection method is a simple yet effective method that uses a voxel representation of a point cloud (input data) and processes voxels with sparse convolutions.
According to the proposed method, Npts RGB-colored points are captured and a set of 3D object bounding boxes is outputted. The Npts RGB-colored points are a set of N points, each of which is represented by its coordinate and color in the proposed method, executed on the computer. Each point of three-dimensional space is specified by three coordinates in space, as well as a color in the RGB palette (RGB-colored point). A set of points in three-dimensional space is also called a point cloud. This cloud of a set of N points, each of which is represented by its coordinate and color, can be obtained by processing the captured real scene with objects occurring in the scene. The cloud of the set of N points is inputted as data into a neural network. Point coordinates are real numbers. Voxel is a generally accepted abbreviation for volumetric pixel, i.e. a three-dimensional pixel. As usual, a 2D pixel in a 2D image is the basic "cell" of the image. A pixel is an image discretization element: the image is divided into equal sections (elements) by a regular grid. Each such element has the shape of a square aligned along the sides of the image. By analogy with a pixel, a voxel is a part of three-dimensional space bounded by a parallelepiped aligned with the coordinate axes and divided according to a grid into grid elements. However, a voxel is defined more flexibly than a pixel. It is not necessary to divide space into elements with a regular three-dimensional grid: the grid elements can be located in space in an arbitrary way; that is, the space may be divided into grid elements arbitrarily. Averaging all point coordinates from the point cloud falling into one grid element (i.e. the x coordinates, y coordinates and z coordinates are averaged) defines the center of the voxel. That is, each grid element corresponds to its own voxel. The result is a set of voxels that is just as sparse and irregular as the original set of points in three-dimensional space.
If the voxels are organized into a regular grid, one speaks of a voxel volume, and in the absence of a regular structure, one speaks of a voxel representation. In addition, a voxel does not necessarily have the same spatial dimensions along all three axes: it may not be a cube, but an arbitrary parallelepiped, however cubic voxels are often used for the convenience of calculations.
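A minimal sketch of the voxelization described above is given below. It assumes the point cloud and colors are numpy arrays and only illustrates how voxel centers are obtained by averaging the coordinates (and colors) of the points that fall into the same grid element; it is not the actual implementation.

```python
# Illustrative voxelization: points sharing a grid element form one voxel,
# whose center is the mean of their coordinates; colors are averaged the same way.
import numpy as np

def voxelize(points_xyz, points_rgb, voxel_size=0.01):
    # integer grid index of every point
    grid_idx = np.floor(points_xyz / voxel_size).astype(np.int64)
    # group points that share the same grid element
    uniq, inverse = np.unique(grid_idx, axis=0, return_inverse=True)
    n_vox = len(uniq)
    counts = np.bincount(inverse, minlength=n_vox).astype(np.float64)
    centers = np.zeros((n_vox, 3))
    colors = np.zeros((n_vox, 3))
    for dim in range(3):
        centers[:, dim] = np.bincount(inverse, weights=points_xyz[:, dim], minlength=n_vox) / counts
        colors[:, dim] = np.bincount(inverse, weights=points_rgb[:, dim], minlength=n_vox) / counts
    return uniq, centers, colors  # sparse voxel grid indices, centers, averaged colors
```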
The proposed FCAF3D method can handle large-scale scenes with minimal runtime and memory through a single fully convolutional feed-forward pass and does not require a heuristic post-processing stage. Existing methods of 3D object detection make prior assumptions on the geometry of objects. Any geometry priors limit the generalization ability of a method. Instead, the authors of the disclosure propose a novel parametrization of oriented bounding boxes (OBB) that allows obtaining better results without any priors. The proposed method achieves state-of-the-art 3D object detection results in terms of mAP@0.5 on the ScanNet V2 (+4.5), SUN RGB-D (+3.5), and S3DIS (+20.5) datasets. mAP@0.5 is a standard metric for assessing the quality of 3D object detection. Possible values of mAP@0.5 range from 0 to 100. The higher the value of mAP@0.5, the higher the quality. On S3DIS, FCAF3D outperforms the competitors by a huge margin.
Overall, the contribution of the disclosure in the art consists in:
- proposed is a first-in-class fully convolutional anchor-free 3D object detection method (FCAF3D) for indoor scenes;
- presented is a novel OBB parametrization, which is proved to boost the accuracy of several existing 3D object detection methods on SUN RGB-D;
- the proposed method significantly outperforms the previous state-of-the-art on challenging large-scale indoor ScanNet, SUN RGB-D, and S3DIS datasets in terms of mAP while being faster on inference.
The task of 3D object detection (a computer vision task) consists in detecting and recognizing three-dimensional objects and estimating their spatial position in a scene. 3D objects have complex, varied and sometimes variable shapes that often cannot be described parametrically (with equations). Therefore, in the standard formulation of the object detection problem, the shape, size and location of objects are modeled by a simple three-dimensional figure, a parallelepiped, or "box". Such boxes are called 3D bounding boxes. They are specified by the three-dimensional coordinates of the center of the box, as well as the width, height, and length of the box. For simplicity, it is assumed that all such parallelepipeds are oriented along the coordinate axes of three-dimensional space, i.e. all their edges and faces are co-directed with one of the axes. Sometimes the problem is solved in a more complex formulation, where the parallelepipeds are rotated in the horizontal plane; then the object detection problem additionally involves determining the orientation of the object, i.e. the rotation angle. Also, each object has a category label.
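The box description above can be illustrated with a short sketch that converts a center, sizes, and an optional heading angle (rotation in the horizontal plane) into the eight corners of the box. The function below is a hypothetical helper for illustration only, not part of the proposed method.

```python
# A 3D box given by center, sizes (w, l, h) and heading angle around the
# vertical axis; an axis-aligned box corresponds to heading = 0.
import numpy as np

def box_corners(center, size, heading=0.0):
    """Return the 8 corners of a 3D bounding box."""
    w, l, h = size
    # corners in the box's own frame
    x = np.array([ w,  w, -w, -w,  w,  w, -w, -w]) / 2.0
    y = np.array([ l, -l, -l,  l,  l, -l, -l,  l]) / 2.0
    z = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    corners = np.stack([x, y, z], axis=1)
    # rotate around the vertical (z) axis by the heading angle, then translate
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return corners @ rot.T + np.asarray(center)
```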
The category of the object is specified in the markup (annotation) of the datasets. The annotation is the reference, true information contained in the original datasets. In this case, the annotation is a collection of three-dimensional bounding boxes of objects with specified categories of objects for each point cloud contained in the dataset. The annotation was obtained with the help of expert assessors when creating a dataset by the authors of the dataset. The annotation is used to train a neural network: in order to learn how to make predictions based on input data, the neural network must see a number of training examples of the form (input data - reference true output data contained in the markup) and find a pattern that makes it possible to establish a relationship between the input data and the output data.
The categories are obtained as a result of expert evaluation by a human assessor who is provided with three-dimensional point clouds at the stage of training data acquisition. Such three-dimensional point clouds were obtained from a set of RGB images and their corresponding depth maps, i.e. depth sensor measurements. For markup, special software was used, which allows visualizing point clouds on the screen, manipulating them without making changes to the original data (rotating, moving and zooming them in order to view them from different sides), and indicating, by clicking on the desired area of space or otherwise, the location of objects in space in the form of parallelepipeds enclosing them, i.e. 3D bounding boxes.
The proposed FCAF3D architecture consists of a backbone part, a neck part, and a head part, these terms are generally accepted terms in this field of technology. Designation of parts of the neural network of object detection as the backbone part, the neck part, the head part are used in articles describing such methods of two-dimensional/three-dimensional object detection as FCOS, ATSS, ImVoxelNet, FCOS3D. The backbone part, the neck part and the head part are 3D sparse convolutional parts of the neural network.
The backbone part of the neural network means a pre-trained neural network or a part of a pre-trained neural network. Neural networks used as a backbone are trained on large amounts of visual data, usually by solving an image classification problem. As a result of this training, they acquire the ability to capture patterns in visual data. This ability can be used not only to solve the problem of image classification, but also to solve many other computer vision problems. The standard approach in designing a neural network for solving computer vision problems is to use a pretrained backbone, with some of its layers designed to solve the classification problem replaced by other layers designed to solve the target problem.
The neck part of the neural network takes as input the outputs of the backbone - four-dimensional tensor - and also returns four-dimensional tensors.
The head part of the neural network is the last, final layer of the neural network, for obtaining predictions of locations, orientations, and categories of the objects in the scene. For each of the objects in the scene the head part outputs the predictions. Each prediction has object classification probability, object bounding box regression parameters, and centerness of the object inside the object bounding box.
The division into the neck part and the head part is conditional and formal, since each of these parts consists of neural network layers similar in type and purpose. In the case of the disclosure, neural network layers are sparse convolutional layers.
While designing the proposed FCAF3D, for scalability, a GSDN-like sparse convolutional network is selected. For better generalization, the number of hyperparameters in this network that need to be manually tuned is reduced; specifically, sparsity pruning in the neck is simplified. Furthermore, a head part with a simple multi-level location assignment is introduced. Finally, the limitations of existing 3D bounding box parametrizations are discussed and a novel parametrization that improves both accuracy and generalization ability is proposed.
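The overall data flow through the backbone, neck, and head parts can be sketched as follows. The callables below are placeholders standing in for the actual network parts, so this is an illustration of the structure rather than the implementation.

```python
# Structural sketch of the pipeline: points -> voxels -> backbone -> neck -> head.
# All arguments are assumed callables; none of them is the actual FCAF3D code.
def detect_objects(point_cloud, voxelize, backbone, neck, head):
    voxels = voxelize(point_cloud)                 # sparse voxel representation of the input points
    multi_level_features = backbone(voxels)        # sparse 3D convolutions, several feature levels
    refined_features = neck(multi_level_features)  # simplified GSDN-like decoder with pruning
    predictions = head(refined_features)           # per-location: class probs, box regression, centerness
    return predictions
```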
ResNet is a family of neural network architectures widely used to solve computer vision problems. The ResNet family has both lightweight architectures and more powerful architectures with a large number of customizable parameters. The lightweight architectures are designed for use in cases where computing resources are limited and/or speed is important. More powerful architectures with a large number of customizable parameters show a better quality of solving the target problem compared to lightweight ones. Such powerful architectures are chosen when quality is a key priority and a significant amount of computing resources is available. All these architectures are arranged according to the same principle: they contain the same or similar computational units, interconnected in a certain way. As a result, the ResNet family is formed by neural network architectures with a different number of such computing units. Recently, a method has been described for modifying the neural network architectures of the ResNet family, allowing these architectures to be adapted to process sparse three-dimensional data (such as point clouds), while the original architectures of the ResNet family are designed to process two-dimensional data (images). Computing blocks of the neural network architectures of the ResNet family (like any other neural network architecture) consist of layers. In the ResNet family, the computational blocks contain two-dimensional convolutional layers. The modification method is to replace all 2D convolutional layers with 3D convolutional layers. If a similar modification is applied to all neural network architectures of the ResNet family, a family of three-dimensional sparse ResNet architectures is obtained. The proposed method is implemented in a modified neural network of the ResNet family.
To implement the computer vision method, a real scene with objects in the scene is captured using a camera with depth sensors and an RGB camera. The captured real scene is represented by a computer device as a cloud of a set of N points, each of which is represented by its coordinate and color (Npts RGB point cloud). The cloud of the set of N points is inputted as data into a neural network.
The depth sensor (depth camera) measures the distance to points in the scene and outputs the measurement results in the form of a dense two-dimensional map, each pixel of which contains the distance from the depth camera to some point in the scene. Next, it is necessary to determine how the coordinates of the pixels on the depth map and the coordinates of the points in three-dimensional space correlate: in other words, to determine how the depth map is mapped to three-dimensional space. To do this, it is necessary to know the parameters of the depth camera that determine the type of this display. The camera parameters are given explicitly in the datasets used for the experiments in this work. Under real conditions of application of the proposed method of 3D object detection, the camera parameters can be separately estimated by any camera parameter estimation method (this is a standard procedure, also called camera calibration), or set directly in an explicit form - for example, as characteristics of a specific depth camera model.
The real scene with objects in the scene are captured by a camera with depth sensors and RGB camera as a set of N points, each of which is further represented by its coordinate and color. This requires that the RGB image pixels and the depth map pixels map to the same points in 3D space. Accordingly, it is necessary to match the pixels of the RGB image and the pixels of the depth map. To do this, the depth map pixels are mapped into 3D space (using the depth camera settings) and then projected onto the image plane (using the RGB camera settings). The result of this procedure is a depth map aligned pixel-by-pixel with the RGB image. Each point in 3D space mapped from a depth map pixel is assigned the RGB values that were recorded in the RGB image pixel corresponding to that depth map pixel.
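A minimal sketch of this back-projection is given below. It assumes a simple pinhole camera model with known intrinsics (fx, fy, cx, cy) and a depth map already aligned pixel-by-pixel with the RGB image; it is illustrative and omits dataset-specific details such as lens distortion.

```python
# Back-project an aligned depth map to colored 3D points (pinhole model).
import numpy as np

def depth_to_colored_points(depth, rgb, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column (u) and row (v) indices
    z = depth
    valid = z > 0                                    # keep only pixels with a depth measurement
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid]                              # per-point RGB from the aligned image
    return points, colors
```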
The procedure for obtaining a point cloud from a single RGB image and a single depth map is described above. The measurements of any depth sensor are inaccurate. However, the accuracy of measurements can be improved if there are several measurements of a 3D space area obtained from different viewpoints. In this case, it is possible to aggregate information from several measurements and thereby correct the measurements on individual depth maps, or, for example, identify some measurements as random outliers due to the imperfection of the measuring device and remove these outliers. Also, such aggregation of measurements makes it possible to increase the consistency of measurements in a single 3D scene and to obtain one point cloud that completely describes the entire scene, rather than a set of separate point clouds for each RGB image and depth map. A number of methods for aggregating RGB images and depth maps have been developed: these are methods of simultaneous localization and mapping (SLAM), methods of integrating a truncated signed distance function (TSDF), and other methods. The captured Npts RGB point cloud is the input of the neural network.
The neural network consists of layers, each of which takes as input a tensor as input, calculates a function on it, returns the results - also a tensor. The layers may be sequential and/or parallel. The method of "assembling" layers into a neural network is also called neural network architecture.
Figure 1 illustrates the neural network layers. The Npts RGB point cloud is input into the neural network layers. A convolutional layer is indicated as "Conv0" in figure 1; this layer calculates a function called "convolution" (a common mathematical term). A pooling layer is indicated as "Pooling" in figure 1; the pooling layer calculates the local maximum. The output of "Conv0" and "Pooling" is a sparse 3D tensor.
The sparse neural network for the proposed FCAF3D is indicated in figure 1. Its operation is described below.
Backbone part.
The backbone part in FCAF3D is a sparse modification of ResNet [11] where all 2D convolutions are replaced with sparse 3D convolutions. The family of sparse high-dimensional versions of ResNet was first introduced in [5], for brevity, authors refer to them as to HDResNet.
In the backbone part, implemented are:
representing the point cloud as a volumetric pixel (voxel) representation of the input data; the Npts RGB point cloud is represented as a voxel representation in the layers "Conv0" and "Pooling";
processing the volumetric pixel representation of the input data to obtain four-dimensional tensors by the Residual blocks.
As follows from figure 1: block1, block2, block3, block4 are Residual blocks. The Residual block is a computational unit of the neural network architecture of the ResNet family. It consists of several layers of various types: convolutional layers, layers that perform normalization inside a minibatch, and an activation layer. A distinctive feature of this computing block is a specially arranged connection between layers, called a skip connection, or residual connection (which gave the name both to the computing block, the residual block, and to the ResNet family of neural network architectures, residual networks).
Residual Blocks are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the ResNet architecture. Formally, denoting the desired underlying mapping as H(x), authors let the stacked nonlinear layers fit another mapping of H(x) - x. The original mapping is recast into F(x) + x. The F(x) acts like a residual, hence the name 'residual block'. The intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings.
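The H(x) = F(x) + x structure can be illustrated with a plain (dense) 3D residual block. In the proposed method the convolutions are sparse, but the skip-connection logic is the same; the sketch below only illustrates the principle and is not the actual backbone code.

```python
# A dense 3D residual block: F(x) is two conv+norm stages, H(x) = F(x) + x.
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # F(x): two convolutions with normalization and an activation in between
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # skip connection: H(x) = F(x) + x
        return self.relu(residual + x)
```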
Neck part.
Features of the objects in the scene, expressed in numerical form, are extracted from different layers of the backbone by applying the neck part. The neck part processes the four-dimensional tensors from the backbone part to extract features of the objects in the scene expressed in numerical form. The features are descriptions, representations, object descriptors in a format that a computing device can work with, that is, in numerical form, as a set of numbers. These numbers can be organized as multidimensional matrices, i.e. tensors.
The values of these numbers are hardly interpretable, they are not visual: in general, it is impossible to point to a specific number and claim that it encodes a certain property of an object. According to the format, the features extracted by neck are four-dimensional tensors.
The proposed neck part is a simplified GSDN decoder. Features on each layer of the neck part are processed with one sparse transposed 3D convolution operation and one sparse 3D convolution operation.
Each transposed sparse 3D convolution operation with a kernel size of 2 might increase the number of non-zero values by 2³ = 8 times. To prevent rapid memory growth, GSDN uses a pruning layer that filters the input with a probability mask.
In GSDN, feature level-wise probabilities are calculated with an additional convolutional scoring layer. This layer is trained with a special loss encouraging consistency between the predicted sparsity and the anchors. Specifically, voxel sparsity is set to be positive if any of the subsequent anchors associated with the current voxel is positive. However, using this loss may be suboptimal, as distant voxels of an object might be assigned a low probability.
For simplicity, the scoring layer with the corresponding loss is removed, and the probabilities from the classification layer in the head are used instead. Instead of tuning a probability threshold, at most Nvox voxels are kept to control the sparsity level, where Nvox equals the number of input points Npts. This is a simple yet elegant way to prevent sparsity growth, since reusing the same hyperparameter makes the process more transparent and consistent.
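A sketch of this pruning rule is given below: at most Nvox voxels with the highest classification probability are kept; the tensor shapes and names are illustrative assumptions.

import torch

def prune_voxels(features, probabilities, n_vox):
    """features: (M, C) voxel features; probabilities: (M,) best class probability per voxel."""
    if probabilities.numel() <= n_vox:
        return features, torch.arange(probabilities.numel())
    keep = torch.topk(probabilities, n_vox).indices    # indices of the n_vox most probable voxels
    return features[keep], keep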
As follows from figure 1, Conv1, Conv2, Conv3, Conv4 are Convolution layer each. Convolutional layers of the neural network convolve the input and pass its result to the next layer. Each convolutional neuron processes data only for its receptive field. Convolutional neural networks are widely used to process data with a grid-like topology (such as images) since convolution considers spatial relations between separate features.
As follows from figure 1, TransConv1, TransConv2, TransConv3 are transposed convolutional layers. The transposed convolutional layer is a standard layer: it takes a tensor as input, calculates the transposed convolution function over it, and returns the result, which is also a tensor. It has parameters, the so-called convolution kernel, tuned during neural network training. In essence, it is a convolutional layer that is able to increase the dimension of the input tensor by increasing its sparseness or duplicating values.
As follows from figure 1, Pruning is a pruning layer. The pruning layer is a non-standard layer used in GSDN. It accepts a sparse 3D tensor and filters it with a probability mask. Feature level-wise probabilities are calculated with an additional convolutional scoring layer.
Shared Head part. The anchor-free FCAF3D head part consists of three parallel sparse convolutional layers (see figure 1) with weights shared across feature levels.
The extracted features obtained at the previous stage are processed with a 3D sparse convolutional head whose weights are shared across different feature levels. The head part processes the extracted features of the objects in the scene to obtain predictions of locations, orientations, and categories of the objects in the scene; for each of the objects in the scene the head part outputs such predictions. A prediction comprises the object classification probability, the object bounding box regression parameters, and the centerness of the object inside the object bounding box. The obtained predictions are filtered by comparing them according to the object classification probability to choose the most probable prediction. The most probable prediction is considered the final estimate of the location, orientation and category of the object in the scene and is characterized by the data regarding the location, orientation and category of the object in the scene. The robotic device uses such a numerical representation of the scene.
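A minimal sketch of this filtering step is given below, assuming the candidate predictions are stored as parallel arrays; the data layout is an assumption for illustration.

import numpy as np

def most_probable_prediction(class_probs, boxes, categories):
    """class_probs: (M,) best class probability per candidate prediction."""
    best = int(np.argmax(class_probs))                 # keep the most probable candidate
    return boxes[best], categories[best], float(class_probs[best])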
At any moment of computation within the neural network, a set of coordinates of points in three-dimensional space is operated on. The set of coordinates of points is fed into the input, and at the input and output of each layer of the neural network the data are represented as three-dimensional sparse tensors. All points are in the same 3D space, but the number of points and their coordinates change during the calculation. Locations are coordinates of points that appear in the process of computation. These are not exactly the same coordinates of points that were given as input, but they lie between the coordinates of the input points, approximately in the same region of space.
The coordinate system (coordinate grid) is specified through the coordinates of points in three-dimensional space and the marking of the bounding boxes of objects: all coordinates are specified in some coordinate system. In this coordinate system, the y-axis must be co-directed with the gravity vector: in this case, the three-dimensional bounding boxes of objects of the form OBB and AABB will be located horizontally. AABB is short for Axis-Aligned Bounding Box. This is a common term for a 3D bounding box of an object of some kind, all edges and planes of which are co-directed to the coordinate axes of 3D space. OBB is short for Oriented Bounding Box. This is a common term meaning a three-dimensional bounding box of an object of a certain kind, located horizontally in space and arbitrarily rotated in the horizontal plane.
For each location (x̂, ŷ, ẑ), the three parallel sparse convolutional layers of the head of the architecture output classification probabilities p̂, bounding box regression parameters δ, and centerness ĉ, respectively. This design is similar to the simple and light-weight head of FCOS [26] but adapted to 3D data.
Classification probabilities mean that for each category of object (table, chair, ...) the probability that the location is inside the three-dimensional bounding box of an object of this category is estimated ("the point belongs to an object of this category").
As for the bounding box regression parameters, it is necessary to clarify here that all existing methods of 3D object detection do not directly predict the 3D object bounding box in an explicit form. Typically, location-specific 3D bounding box parameters are estimated instead: for example, distances from the location to 6 faces of the 3D bounding box. Bounding box is the 3D bounding box of the object.
The centerness ĉ describes the proximity of a location to the center of the reference (ground truth) three-dimensional bounding box of the object that this location falls into. Centerness is a relative value that can take values from 0 to 1; the closer the location is to the center, the larger the value and the closer it is to 1.
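A sketch of one possible 3D centerness computation is given below, assuming the straightforward 3D analogue of the FCOS centerness: the geometric mean of the per-axis ratios between the smaller and the larger distance to the two opposite faces. This exact formula is an assumption for illustration.

import numpy as np

def centerness_3d(delta):
    """delta: six distances from a location to the box faces, two per axis."""
    d = np.asarray(delta, dtype=np.float64).reshape(3, 2)
    ratios = d.min(axis=1) / d.max(axis=1)      # 1.0 at the box center, tends to 0 near a face
    return float(np.cbrt(ratios.prod()))        # a value in (0, 1]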
In other words, the head part returns classification probabilities, bounding box regression parameters, centerness score for each location.
As follows from figure 1, the Head part relates to a Convolution layer. The Head part contains the Regression, Centerness, and Classification layers. For each location (x̂, ŷ, ẑ), the Classification layer outputs classification probabilities p̂, the Regression layer outputs bounding box regression parameters δ, and the Centerness layer outputs the centerness score ĉ, respectively.
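For illustration, a dense sketch of such a head is given below: three parallel convolutional layers that output classification logits, regression parameters and a centerness score for every location; the actual head uses sparse convolutions, and the channel sizes are assumptions.

import torch.nn as nn

class DetectionHead3D(nn.Module):
    """Three parallel 1x1x1 convolutions applied to the features of every level."""
    def __init__(self, in_channels, num_classes, num_reg_params=6):
        super().__init__()
        self.cls = nn.Conv3d(in_channels, num_classes, kernel_size=1)     # classification
        self.reg = nn.Conv3d(in_channels, num_reg_params, kernel_size=1)  # box regression
        self.cntr = nn.Conv3d(in_channels, 1, kernel_size=1)              # centerness

    def forward(self, x):
        # The same weights are reused for the features of every level.
        return self.cls(x), self.reg(x), self.cntr(x)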
As follows from figure 1, Pooling relates to a Pooling layer. The pooling layer is a standard layer that calculates the local maximum or local average. The pooling layer takes a tensor as input and returns a spatially ordered set of local maxima/local averages of this tensor, which is also a tensor.
Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters, tiling sizes such as 2 x 2 are commonly used. Global pooling acts on all the neurons of the feature map. There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map, while average pooling takes the average value.
During training, FCAF3D outputs locations for different feature levels, which should be assigned to ground truth boxes {b}.
The input is Npts RGB-colored points, and the output is the object classification probability, bounding box regression parameters, and centerness score for each location (that is, for some set of points in three-dimensional space). At the testing stage, three-dimensional bounding boxes of objects are calculated from the classification probabilities, bounding box regression parameters, and centerness scores. Examples of such three-dimensional bounding boxes are shown in figure 3. Predictions of the location and orientation of an object in space, as well as its category, are available in a numerical representation providing computer vision for the robotic device and can be used by the robotic device in accordance with its task. A method for visualizing the predictions obtained using the proposed method is also implemented. The user can see an image of the scene with 3D bounding boxes of objects placed in it. These 3D bounding boxes are colored differently to encode different categories of objects. The color coding is arbitrary but fixed (the same color always corresponds to the same category).
In ImVoxelNet's 3D object detection method, output locations are predicted at three levels, and for each level the maximum distances from the location to the faces of the 3D bounding box of the object that can be assigned to this location are predefined. For the three scales, the thresholds are set to at most 75 cm, from 75 cm to 1.5 m, and more than 1.5 m, respectively.
The disclosure proposes a simplified strategy for sparse data that does not require tuning dataset-specific hyperparameters. For each bounding box (examples of 3D object bounding boxes are shown in figure 3), the last feature level at which this bounding box covers at least Nloc locations is selected. If there is no such feature level, the first one is selected. Locations are filtered via center sampling [26], considering only the points near the bounding box center as positive matches; a sketch of this level selection is given below.
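The following sketch illustrates the feature-level selection: for every ground truth box, the last level on which the box covers at least n_loc locations is chosen, with the first level as a fallback; all names are illustrative assumptions.

import numpy as np

def select_feature_level(locations_per_level, box_min, box_max, n_loc):
    """locations_per_level: list of (Mi, 3) arrays of 3D locations, one per feature level."""
    chosen = 0                                                   # fall back to the first level
    for level, locations in enumerate(locations_per_level):
        inside = np.all((locations >= box_min) & (locations <= box_max), axis=1)
        if inside.sum() >= n_loc:
            chosen = level                                       # remember the last qualifying level
    return chosen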
Through assignment, some locations (x̂, ŷ, ẑ) are matched with ground truth bounding boxes b. Here, b means the reference (ground truth) three-dimensional bounding box of the object associated with the location (x̂, ŷ, ẑ). Reference object bounding boxes are contained in the markup of the dataset or can be obtained from this markup directly.
Accordingly, these locations get associated with ground truth labels p and 3D centerness values c. The ground truth labels p are the reference object categories that are known or can be directly derived for each object bounding box from the markup of the dataset. The 3D centerness c describes the proximity of a location to the center of the reference (ground truth) three-dimensional bounding box of the object that this location falls into. Centerness is a relative value that can take values from 0 to 1; the closer the location is to the center, the larger the value and the closer it is to 1.
During inference, the scores p̂ are multiplied by the 3D centerness ĉ just before NMS, as proposed in [24].
The overall loss function is formulated as follows:

L = (1/Npos) · Σ(x̂, ŷ, ẑ) [ Lcls(p̂, p) + 1matched · Lreg(b̂, b) + 1matched · Lcntr(ĉ, c) ],

where the sum runs over all locations (x̂, ŷ, ẑ) and 1matched equals 1 if the location is matched with a ground truth bounding box and 0 otherwise. Here, the number of matched locations Npos is Σ(x̂, ŷ, ẑ) 1matched, i.e. the total number of locations matched with ground truth bounding boxes.
The classification loss Lcls is a focal loss, the regression loss Lreg is an IoU loss, and the centerness loss Lcntr is binary cross-entropy. For each loss, predicted values are denoted with a hat. Focal loss, IoU loss, and binary cross-entropy are common terms for different penalty functions used to train neural networks.
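For illustration, sketches of the three penalty terms are given below under simplifying assumptions: a binary focal loss for classification, 1 − IoU of axis-aligned corner-format boxes for regression, and binary cross-entropy for centerness; the hyperparameters alpha and gamma are illustrative defaults, not values from the disclosure.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; targets are 0/1 tensors of the same shape as logits."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

def aabb_iou_loss(pred, gt):
    """pred, gt: (N, 6) boxes as (x_min, y_min, z_min, x_max, y_max, z_max)."""
    lo = torch.maximum(pred[:, :3], gt[:, :3])
    hi = torch.minimum(pred[:, 3:], gt[:, 3:])
    inter = (hi - lo).clamp(min=0).prod(dim=1)
    union = (pred[:, 3:] - pred[:, :3]).prod(dim=1) + (gt[:, 3:] - gt[:, :3]).prod(dim=1) - inter
    return (1.0 - inter / union.clamp(min=1e-6)).sum()

def centerness_loss(pred_logits, target):
    """Binary cross-entropy between predicted and ground truth centerness."""
    return F.binary_cross_entropy_with_logits(pred_logits, target, reduction="sum")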
Bounding Box Parametrization (figure 3 shows example of three-dimensional bounding boxes of objects for clarity).
The 3D object bounding boxes can be axis-aligned (AABB) or oriented (OBB).
Thus, an AABB is horizontal and not rotated, while an OBB is horizontal and arbitrarily rotated. An AABB can be specified by its center point (3 coordinates), length, width and height. For an OBB, it is also necessary to set the angle of rotation in the horizontal plane - the heading angle θ.
An AABB can be described as (x, y, z, w, l, h), while the definition of an OBB includes a heading angle θ: (x, y, z, w, l, h, θ). In both formulas, x, y, z denote the coordinates of the center of a bounding box, while w, l, h are its width, length, and height, respectively.
AABB parametrization. For AABBs, the parametrization proposed in [24] is used. Specifically, for a ground truth AABB (x, y, z, w, l, h) and a location (x̂, ŷ, ẑ), δ can be formulated as a 6-tuple of distances from the location to the six faces of the box:

δ = (δ1, ..., δ6) = (x̂ − x + w/2, x − x̂ + w/2, ŷ − y + l/2, y − ŷ + l/2, ẑ − z + h/2, z − ẑ + h/2).

The predicted AABB can be trivially obtained from δ.
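A sketch of this parametrization and its inverse is given below; the ordering of the six values is an assumption for illustration.

import numpy as np

def encode_aabb(location, box):
    """location: (xh, yh, zh); box: (x, y, z, w, l, h) with center coordinates and sizes."""
    xh, yh, zh = location
    x, y, z, w, l, h = box
    return np.array([xh - x + w / 2, x - xh + w / 2,
                     yh - y + l / 2, y - yh + l / 2,
                     zh - z + h / 2, z - zh + h / 2])   # distances to the six faces

def decode_aabb(location, delta):
    """Recovers the box (x, y, z, w, l, h) from the six distances."""
    xh, yh, zh = location
    d1, d2, d3, d4, d5, d6 = delta
    w, l, h = d1 + d2, d3 + d4, d5 + d6
    return np.array([xh - (d1 - d2) / 2, yh - (d3 - d4) / 2, zh - (d5 - d6) / 2, w, l, h])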
Heading angle - the angle of rotation of the 3D bounding box of the 3D object in the horizontal plane, which determines the orientation of the object ("where the object is facing"). It is one of the parameters that define an oriented three-dimensional bounding box (the definition of an OBB includes a heading angle θ).
All state-of-the-art 3D object detection methods from point clouds address the heading angle estimation task as classification followed by regression. The heading angle is classified into bins; then, the precise heading angle is regressed within a bin. For indoor scenes, the range from 0 to 2π is typically divided into 12 equal bins [22], [21], [33], [19]. For outdoor scenes, there are usually only two bins [30], [13], as the objects on the road can be either parallel or perpendicular to the road.
Estimating the value of the heading angle thus occurs in two stages. First, a rough estimate is made: the range of values that the heading angle falls within is determined. Then, at the second stage, the heading angle value is refined within this interval. These intervals are called bins.
When a heading angle bin is chosen, the heading angle value is estimated through regression. VoteNet and other voting-based methods estimate the value of θ directly. Outdoor methods explore more elaborate approaches, e.g. predicting the values of trigonometric functions. For instance, SMOKE [17] estimates sin(θ) and cos(θ) and uses the predicted values to recover the heading angle.
Figure 2 depicts indoor objects for which the heading angle is ambiguous. Figure 2 shows examples of objects that look the same from several sides: a square table, a round table, another round table.
It is necessary to note that the heading angle is the angle of rotation of the 3D bounding box of a 3D object in the horizontal plane, which determines the orientation of the object ("where the object is facing"). It is one of the parameters that specify an oriented bounding box (OBB). The ground truth angle is an angle that characterizes the reference, true value of a parameter of the object estimated by the method. It is desirable that the values of the parameters estimated by the method (predicted, estimated, output by the method) be as close as possible to the ground truth values of the object's parameters. Here, ground truth angle should be read as ground truth heading angle, i.e. the reference, true value of the heading angle, known in advance from the labeling of the dataset.
Accordingly, ground truth angle annotations can be chosen randomly for these objects, making heading angle bin classification meaningless. To avoid penalizing correct predictions that do not coincide with the annotations, a rotated IoU loss is used, as its value is the same for all possible choices of the heading angle. Thus, an OBB parametrization that considers the rotation ambiguity is proposed. It should be clarified that it is not always possible to unambiguously determine the orientation of an object, since some objects look the same from several sides: for example, a round stool, a round table, a square table, see figure 2. Thus, any value of the heading angle taken as a reference will be chosen randomly to some extent. This is called rotation ambiguity.
The parametrization for OBBs is based on a mapping of the Mobius strip, so it is referred to as the Mobius OBB parametrization.
Considering the OBB with parameters (x, y, z, w, l, h, θ), let q = w/l. If x, y, z, w + l, h are fixed, it turns out that the OBBs with (q, θ), (1/q, θ + π/2), (q, θ + π), and (1/q, θ + 3π/2) define the same bounding box. The set of (q, θ), where θ ∈ (0, 2π] and q ∈ (0, +∞), is topologically equivalent to a Mobius strip [20] up to this equivalence relation. Hence, the task of estimating (q, θ) can be reformulated as the task of predicting a point on a Mobius strip. A natural way to embed a Mobius strip, being a two-dimensional manifold, into Euclidean space is the following (Eq. 3):
(q, θ) → (ln(q)·sin(2θ), ln(q)·cos(2θ), sin(4θ), cos(4θ)).
It is easy to verify that the 4 equivalent points from Eq. 3 are mapped into a single point in Euclidean space. However, the experiments reveal that predicting only ln(q)·sin(2θ) and ln(q)·cos(2θ) provides better results than predicting all four values. Thereby, a pseudo embedding of a Mobius strip into R² is opted for. It is called pseudo since it maps the entire center circle of the Mobius strip, defined by ln(q) = 0, to (0, 0). Accordingly, it is impossible to distinguish points with ln(q) = 0. However, ln(q) = 0 implies strict equality of w and l, which is rare in real-world scenarios. Moreover, the choice of an angle has a minor effect on the IoU if w = l; thereby, this rare case is ignored for the sake of detection accuracy and simplicity of the method. Overall, a novel OBB parametrization is obtained:
δ = (δ1, ..., δ6, δ7, δ8), where δ1, ..., δ6 are defined as in the AABB case, δ7 = ln(w/l)·sin(2θ), and δ8 = ln(w/l)·cos(2θ).
In the standard parametrization of Eq. 2, the bounding box is trivially derived from δ. In the proposed parametrization, w, l, and θ are non-trivial and can be obtained as follows:

w = size·ratio / (1 + ratio),  l = size / (1 + ratio),  θ = arctan2(δ7, δ8) / 2,

where ratio = exp(√(δ7² + δ8²)) and size = δ1 + δ2 + δ3 + δ4 (the recovered value of w + l).
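A sketch of the Mobius parametrization and its inverse, following the formulas above, is given below; it is an illustrative reconstruction, and the decoded box may be the equivalent one with w and l swapped and the heading angle shifted by π/2.

import numpy as np

def encode_mobius(w, l, theta):
    """Packs the ratio q = w / l and the heading angle theta into (delta7, delta8)."""
    log_q = np.log(w / l)
    return log_q * np.sin(2 * theta), log_q * np.cos(2 * theta)

def decode_mobius(delta7, delta8, size):
    """size: the sum w + l; returns (w, l, theta) up to the rotation ambiguity."""
    theta = 0.5 * np.arctan2(delta7, delta8)
    q = np.exp(np.sqrt(delta7 ** 2 + delta8 ** 2))     # recovered ratio w / l
    return size * q / (1 + q), size / (1 + q), theta

d7, d8 = encode_mobius(2.0, 1.0, 0.3)
print(decode_mobius(d7, d8, 3.0))                      # approximately (2.0, 1.0, 0.3)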
Finally, data regarding the location, orientation and category of the object in the scene are output from the neural network as a numerical representation of the scene. The numerical representation of the scene can be converted into an image of the scene.
Robotic devices may use the location, orientation and category of objects in numerical representation to plan paths inside the scene so that these objects are bypassed. The objects' categories might be used to manipulate only the objects belonging to the desired categories, e.g., fetching and transporting the desired objects according to the user instructions, cleaning pieces of furniture of specified categories, etc.
The location, orientation and category of objects in numerical representation might be used to automatically provide statistics about the objects present in the scene, e.g. to monitor the amount of currently available items in a retail area, or to create a textual description of a given scene in assistant applications, e.g. for assistance to blind people.
The numerical representation of the scene can be converted (by methods known from the prior art) into an image of the scene; the computer device further comprises a screen, and the image of the scene is displayed on the screen for a user.
The image of the scene can be used to visually demonstrate the results of applying the proposed method to a person in AR (augmented reality) applications, in applications for monitoring or accounting for objects, etc. The consumer of the results in the form of images is a person.
The location, orientation, and category of objects in image representation might be used in AR to supply the user with information about the objects present in the scene and to enrich the captured image of the scene with generated annotations of the objects present in the scene.
Experiments
The proposed method is evaluated on three 3D object detection benchmarks: ScanNet V2 [7], SUN RGB-D [25], and S3DIS [1]. For all datasets, used is mean average precision (mAP) under IoU thresholds of 0.25 and 0.5 as a metric.
The ScanNet dataset contains 1513 reconstructed 3D indoor scans with per-point instance and semantic labels of 18 object categories. Given this annotation, AABBs are calculated via the standard approach [22]. The training subset is comprised of 1201 scans, while 312 scans are left for validation.
SUN RGB-D. SUN RGB-D is a monocular 3D scene understanding dataset containing more than 10,000 indoor RGB-D images. The annotation consists of per-point semantic labels and OBBs of 37 object categories. As proposed in [22], experiments are conducted with objects of the 10 most common categories. The training and validation splits contain 5285 and 5050 point clouds, respectively.
S3DIS. The Stanford Large-Scale 3D Indoor Spaces dataset contains 3D scans of 272 rooms from 6 buildings, with 3D instance and semantic annotation. Following [10], the proposed method is evaluated on furniture categories. AABBs are derived from the 3D semantics. Used is the official split, where 68 rooms from Area 5 are intended for validation, while the remaining 204 rooms comprise the training subset.
For all datasets, the same hyperparameters are used except for the following. First, the size of the output classification layer equals the number of object categories, which is 18, 10, and 5 for ScanNet, SUN RGB-D, and S3DIS, respectively.
Second, SUN RGB-D contains OBBs, so additional targets δ7 and δ8 are predicted for this dataset; note that the loss function is not affected. Last, ScanNet, SUN RGB-D, and S3DIS contain different numbers of scenes, so each scene is repeated 10, 3, and 13 times per epoch, respectively.
Similar to GSDN [10], the sparse 3D modification of ResNet34 named HDResNet34 is used as a backbone. The neck part and the head part use the outputs of the backbone part at all feature levels. In the initial point cloud voxelization, the voxel size is set to 0.01 m and the number of points Npts to 100,000. Accordingly, Nvox equals 100,000. Both ATSS [32] and FCOS [26] set Nloc to 32 for 2D object detection. Accordingly, a feature level is selected so that a bounding box covers at least Nloc = 33 locations. 18 locations are selected by center sampling. The NMS IoU threshold is 0.5.
Training. FCAF3D is implemented using the MMdetection3D [6] framework. The training procedure follows the default MMdetection [3] scheme: training takes 12 epochs, and the learning rate decreases on the 8th and the 11th epochs. The Adam optimizer is employed with an initial learning rate of 0.001 and weight decay of 0.0001. All neural networks are trained on two NVidia V100 GPUs with a batch size of 8. Evaluation and performance tests are run on a single NVidia GTX1080Ti.
The evaluation protocol introduced in [16] is used in the disclosure. Both training and evaluation are randomized, as the Npts input points are randomly sampled from the point cloud. To obtain statistically significant results, training is run 5 times and each trained neural network is tested 5 times independently.
Both the best and average metrics across the 5 × 5 trials are reported: this allows comparing FCAF3D to the 3D object detection methods that report either a single best or an average value.
Results.
Comparison with State-of-the-art Methods
FCAF3D is compared with the previous state-of-the-art methods on three indoor benchmarks in Table 1.
Table 1 indicates results of FCAF3D and existing indoor 3D object detection methods that accept point clouds. The best metric values are marked bold. FCAF3D outperforms previous state-of-the-art methods: GroupFree (on ScanNet and SUN RGB-D) and GSDN (on S3DIS). The reported metric value is the best one across 25 trials; the average value is given in brackets.
Table 1.
[Table 1 image: detection results (mAP) of FCAF3D and existing indoor 3D object detection methods on ScanNet, SUN RGB-D, and S3DIS.]
The proposed method is evaluated on the ScanNet [7], SUN RGB-D [25], and S3DIS [1] datasets, demonstrating solid superiority over the previous state-of-the-art on all benchmarks (the average value is given in brackets). On SUN RGB-D and ScanNet, the proposed method surpasses other methods by at least 3.5% mAP@0.5. On the ScanNet dataset, the proposed 3D object detection method is 4.5 points higher than the best competing 3D object detection method. On the SUN RGB-D dataset, it is 3.5 points higher. On the S3DIS dataset, it is 20.5 points higher. Similarly strong results are observed for the standard quality metric mAP@0.25.
Examples of ScanNet point clouds with predicted bounding boxes are depicted in figure 3. Figure 3 illustrates a point cloud from ScanNet with AABBs. The color of a bounding box denotes the object category. Left: estimated with the proposed method (FCAF3D); right: ground truth. Each object category has a bounding box of a different color. The categories are color-coded, namely: blue (marked as C) is a chair, orange (marked as O) is a table, green (marked as G) is a door, red (marked as R) is a cupboard. It can be seen that the 3D bounding boxes predicted by the proposed method (left) are similar to the ground truth bounding boxes (right). This result clearly illustrates the proposed method.
Similar results were obtained for the point cloud from SUN RGB-D with OBBs, and also for the point cloud from S3DIS with AABBs.
To study geometry priors, existing methods with the proposed modifications are trained and evaluated. Experiments are conducted with 3D object detection methods accepting data of different modalities: point clouds, RGB images, or both, to see whether the proposed parametrization generalizes. As in FCAF3D, the aforementioned losses are replaced with a rotated IoU loss with the Mobius parametrization of Eq. 5. To give a complete picture, the sin-cos parametrization used in the outdoor 3D object detection method SMOKE [17] is also evaluated.
The rotated IoU loss decreases the number of trainable parameters and hyperparameters, including geometry priors and loss weights. This loss has already been used in outdoor 3D object detection [34]. Recently, [6] reported results of VoteNet trained with axis-aligned IoU loss on ScanNet.
Table 2 shows that replacing the standard parametrization with Mobius one boosts VoteNet and ImVoteNet mAP@0.5 by approximately 4%.
ImVoxelNet does not use a classification+regression scheme to estimate heading angle but predicts its value directly in a single step. Since the original ImVoxelNet uses the rotated IoU loss, authors do not need to remove redundant losses, only to change the parametrization. Again, the Mobius parametrization helps to obtain the best results, even though the superiority is minor.
Table. 2
[Table 2 image: results of several 3D object detection methods with different OBB parametrizations on SUN RGB-D.]
Table 2 illustrates results of several 3D object detection methods that accept inputs of different modalities, with different OBB parametrizations, on SUN RGB-D. The FCAF3D metric value is the best across 25 trials; the average value is given in brackets. For other methods, results are reported from the original papers, together with the results obtained through the proposed experiments with MMdetection3D-based re-implementations (marked as Reimpl.). "PC" is a point cloud. "RGB" is an RGB image or a set of RGB images. "RGB+PC" is an RGB image and a point cloud or a set of RGB images and a point cloud. VoteNet is a voting-based 3D object detection method that accepts an RGB-colored point cloud. ImVoteNet is a voting-based 3D object detection method that accepts an RGB image and a point cloud or a set of RGB images and a point cloud. ImVoxelNet is a 3D object detection method that accepts an RGB image or a set of RGB images. "Reimpl." means reimplementation. VoteNet and ImVoteNet were reimplemented for the experiments, as the source code of these methods has not been made publicly available. Both the results provided in the original papers and the results obtained with the reimplemented methods are reported. These results prove that the reimplementation is correct and provides accuracy comparable with the accuracy reported in the original papers.
"w/naive param." means with the naive OBB parametrization, where each parameter of an OBB is estimated directly. This parametrization is used in the original VoteNet.
"w/sin-cos param." means with the sin-cos OBB parametrization. This sin-cos parametrization is formulated in the outdoor 3D object detection method SMOKE.
"w/Mobius param." means with the Mobius OBB parametrization.
As can be seen, all explored methods, specifically, VoteNet, ImVoteNet, ImVoxelNet, FCAF3D, benefit from using the proposed Mobius OBB parametrization. The results obtained with the Mobius parametrization are better than the ones obtained with both the "naive" parametrization and the sin-cos parametrization described by the authors of the SMOKE method. The observed improvement is consistent for different 3D object detection methods that accept different types of input data.
Next, the GSDN anchors are studied to prove that the generalization ability of anchor-based layers is limited. According to Table 3, mAP@0.5 drops dramatically by 12% if GSDN is trained in an anchor-free setting (which is equivalent to using one anchor). In other words, GSDN demonstrates poor performance without domain-specific guidance in the form of anchors; hence, this method is considered inflexible and non-generalizable. All FCAF3D entries shown in Table 3 correspond to the proposed method and have various modifications of ResNet (HDResNet34, HDResNet34:3, HDResNet34:2) and various voxel sizes.
For comparison, FCAF3D with the same backbone is evaluated, and it outperforms GSDN by a huge margin, achieving twice as large an mAP value.
Table 3
[Table 3 image: results of fully convolutional 3D object detection methods that accept point clouds on ScanNet.]
Table 3 illustrates results of fully convolutional 3D object detection methods that accept point clouds on ScanNet.
GSDN results obtained with voxels of 0.05 m are reported. While smaller voxels seem to ensure a more detailed data representation, the dependence between the voxel size and accuracy is not straightforward. Changing the voxel size affects GSDN and FCAF3D differently due to different assignment strategies. In FCAF3D, 3D space should be "covered" with locations densely. The distance between locations is proportional to the voxel size, so the smaller the voxel size, the denser the coverage and, consequently, the higher the recall. GSDN "covers" 3D space with anchors. The linear sizes of anchors are proportional to the voxel size. During assignment, anchors with a small intersection with object boxes are ignored. If the voxel size decreases, the anchors become smaller; accordingly, some anchors will be ignored, resulting in a lower recall.
So, in general, FCAF3D benefits from smaller voxels while GSDN does not. Overall, GSDN performance is expected to drop if the voxel size is reduced from 0.05 to 0.01.
Discussed below are the neural network design choices of the proposed method (FCAF3D), and it is investigated how they affect the metrics when applied independently in ablation studies. Experiments are run with varying voxel size, the number of points in a point cloud Npts, the number of locations selected by center sampling, and with and without centerness. The results of the ablation studies are aggregated in Table 4 for all benchmarks.
Table 4.
[Table 4 image: results of ablation studies on the voxel size, the number of points, centerness, and center sampling in FCAF3D.]
Table 4 illustrates results of ablation studies on the voxel size, the number of points (which equals the number of voxels Nvox in pruning), centerness, and center sampling in FCAF3D. The better options are marked bold (actually, these are the default options used to obtain the results in Tab. 1 above). The reported metric value is the best across 25 trials; the average value is given in brackets.
Voxel size. Expectedly, with an increasing voxel size, accuracy goes down. Used are voxels of 0.03, 0.02, and 0.01 m. Attributed is the notable gap in mAP between voxel sizes of 0.01 and 0.02 m to the presence of almost flat objects, such as doors, pictures, and whiteboards. Namely, with a voxel size of 2 cm, the head would output locations with 16 cm tolerance, but the almost flat objects could be less than 16 cm by one of the dimensions. Observed is a decrease in accuracy for larger voxel sizes.
Number of points. Similar to 2D images, subsampled point clouds are sometimes referred to as low-resolution ones. Accordingly, they contain less information than their high-resolution versions. As can be expected, the fewer the points, the lower the detection accuracy. In this series of experiments, 20k, 40k, and 100k points are sampled from the entire point cloud, and the obtained metric values reveal a clear dependency between the number of points and mAP. Larger Npts values are not considered, in order to stay on par with the existing methods (specifically, GSDN [10] uses all points in a point cloud, GroupFree [16] samples 50k points, VoteNet [22] selects 40k points for ScanNet and 20k for SUN RGB-D). Nvox = Npts is used to guide pruning in the neck. When Nvox exceeds 100k, the inference time increases due to growing sparsity in the neck, while the accuracy improvement is negligible. So the grid search for Npts is restricted to 100k, which is used as the default value given the obtained results.
Centerness. Using centerness improves mAP for the ScanNet and SUN RGB-D datasets. For S3DIS, the results are controversial: the better mAP@0.5 is balanced with a minor decrease of mAP@0.25. Nevertheless, authors analyze the results altogether, so authors can consider centerness a helpful feature with a small positive effect on the mAP, almost reaching 1% of mAP@0.5 on ScanNet.
Center sampling. Finally, authors study the number of locations selected in center sampling. Authors select 9 locations, as proposed in FCOS [26], the entire set of 27 locations, as in ImVoxelNet [24], and 18 locations. The latter appeared to be the best choice according to mAP on all the benchmarks.
Inference Speed
Compared to standard convolutions, sparse convolutions are time- and memory-efficient. The GSDN authors claim that with sparse convolutions, they process a scene with 78M points covering about 14,000 m³ within a single fully convolutional feed-forward pass, using only 5 GB of GPU memory. FCAF3D uses the same sparse convolutions and the same backbone as GSDN. However, as can be seen in Table 3, the default FCAF3D is slower than GSDN. This is due to the smaller voxel size: 0.01 m is used for a proper multi-level assignment while GSDN uses 0.05 m.
To build the fastest method, the HDResNet34:3 and HDResNet34:2 backbones with only three and two feature levels, respectively, are used. With these modifications, FCAF3D is faster on inference than GSDN (figure 4). Figure 4 illustrates the detection accuracy against inference speed, measured in scenes per second, for the original and modified FCAF3D in comparison with the existing methods of 3D object detection. Figure 4 plots mAP@0.5 scores on ScanNet against scenes per second. The FCAF3D modifications have different numbers of backbone feature levels. According to the plot, the proposed FCAF3D method without accelerating modifications is the most accurate of all 3D object detection methods; however, it is slower than some other methods. Nevertheless, for each existing method, there is a modification of FCAF3D surpassing this method in both detection accuracy and inference speed.
The figure 4 shows a comparison of the key characteristics of the proposed FCAF3D method and its two modifications (FCAF3D w/ 3 levels and FCAF3D w/ 2 levels) with existing methods that solve a similar problem: GroupFree, H3DNet, BRNet, 3DETR, 3DETR-m, VoteNet, GSDN. The comparison is carried out on the ScanNet dataset. Each point on the graph corresponds to some method of three-dimensional object detection.
Y-axis: Performance (mAP) - mAP@0.5, 3D object detection accuracy metric. Larger values correspond to higher precision; accordingly, the higher the methods are located on the chart, the better.
On the abscissa: Scenes per second - the number of scenes processed per second. Larger values correspond to higher speed; accordingly, the more to the right the methods are located on the chart, the better.
For each of the existing methods, there is a modification of the proposed method, which simultaneously shows better accuracy and is faster (in other words, whose point on the graph is located higher and to the right). For GroupFree, H3DNet, BRNet, 3DETR, 3DETR-m, VoteNet, these are FCAF3D and FCAF3D w/ 3 levels. For GSDN - FCAF3D w/ 2 levels.
For a fair comparison, the inference speed of GSDN and the voting-based methods is re-measured, as the point grouping operation and sparse convolutions have become much faster since the initial release of these methods. In the performance tests, implementations based on the MMdetection3D [6] framework are used to mitigate codebase differences. The reported inference speed for all methods is measured on the same single GPU so they can be directly compared.
Proposed is FCAF3D, a first-in-class fully convolutional anchor-free 3D object detection method for indoor scenes. The proposed method significantly outperforms the previous state-of-the-art on the challenging indoor SUN RGB-D, ScanNet, and S3DIS benchmarks in terms of both mAP and inference speed. Also proposed is a novel oriented bounding box parametrization, which is shown to improve accuracy for several 3D object detection methods. Moreover, the proposed parametrization allows avoiding any prior assumptions about objects, thus reducing the number of hyperparameters. Overall, FCAF3D with the proposed bounding box parametrization is accurate, scalable, and generalizable at the same time. The proposed software solution is assumed to be executed in mobile robotic devices or launched on smartphones as a part of a mobile application. To execute the proposed method, a carrier device should meet technical requirements. Specifically, it should have a sufficient amount of RAM and computational resources. The required amount of resources depends on the device's function, camera parameters, and performance requirements.
Figure 5 is a schematic diagram of a robotic device for computer vision according to an embodiment of the disclosure.
Referring to Figure 5, the robotic device for computer vision 100 may include a processor 110 and a memory 120.
The memory 120 may store programs necessary for processing or control operations performed by the processor 110.
Furthermore, the memory 120 may store data input to or output from the robotic device for computer vision 100.
The memory 120 may include at least one type of storage medium, i.e., at least one of a flash memory-type memory, a hard disk-type memory, a multimedia card micro-type memory, a card-type memory (e.g., an SD card or an XD memory), random access memory (RAM), static RAM (SRAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), PROM, a magnetic memory, a magnetic disc, or an optical disc.
The memory 120 may store one or more instructions executable by the processor 110.
In an embodiment of the disclosure, the memory 120 may store various types of information input/output via an input/output (I/O) interface (not shown).
In an embodiment of the disclosure, the memory 120 may store instructions that cause the processor 110 to obtain a scene with at least one object, represent the obtained scene as a set of N points, input the set of N points as input data to a neural network, wherein the neural network performs the following steps of representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene and output, from the neural network, the predictions as numerical representation.
The processor 110 controls all operations of the robotic device for computer vision 100 and may be used in the same sense as a controller.
The processor 110 may control all the operations of the robotic device for computer vision 100 and a flow of signals between the internal components of the robotic device for computer vision 100 and perform a function of processing data.
The processor 110 may include RAM (not shown) that stores signals or data input from outside of the robotic device for computer vision 100 or is used as a storage area corresponding to various operations performed by the robotic device for computer vision 100, and ROM (not shown) that stores a control program for controlling the robotic device for computer vision 100.
Furthermore, the processor 110 may include a plurality of processors.
For example, the processor 110 may be implemented as a main processor (not shown) and a sub processor (not shown) operating in a sleep mode.
In addition, the processor 110 may include at least one of a central processing unit (CPU), a graphic processing unit (GPU), or a video processing unit (VPU).
The processor 110 may obtain a scene with at least one object.
The processor 110 may represent the obtained scene as a set of N points.
The processor 110 may input the set of N points as input data to a neural network. The neural network may represent the set of N points as volumetric pixel representation. The neural network may process the volumetric pixel representation to obtain four-dimensional tensors. The neural network may process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene. The neural network may process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
The processor 110 may output, from the neural network, the predictions as numerical representation.
Figure 6 is a flow diagram of method for computer vision of a robotic device according to an embodiment of the disclosure.
Referring to Figure 6, the robotic device for computer vision 100 may obtain a scene with at least one object(S610).
The robotic device for computer vision 100 may represent the obtained scene as a set of N points(S620).
The robotic device for computer vision 100 may input the set of N points as input data to a neural network(S630).
The neural network performs the steps of representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene.
The robotic device for computer vision 100 may output, from the neural network, the predictions as numerical representation (S640). A detailed description of each step was provided above with reference to Figures 1 to 4.
In accordance with an aspect of the disclosure, a method for computer vision of a robotic device is provided, the method comprising obtaining a scene with at least one object, representing the obtained scene as a set of N points, inputting the set of N points as input data to a neural network, wherein the neural network performs the steps of representing the set of N points as volumetric pixel representation, processing the volumetric pixel representation to obtain four-dimensional tensors, processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and outputting, from the neural network, the predictions as numerical representation.
The method may further comprise converting the numerical representation of the scene into an image of the scene and displaying the image.
The predictions comprise an object classification probability, object bounding box regression parameters and centerness of the at least one object inside an object bounding box.
The method may further comprise filtering the obtained predictions by comparing the predictions according to the object classification probability to choose the most probable prediction and outputting, from the neural network, the most probable prediction as numerical representation, wherein the most probable prediction is considered as a final estimate of the location, the orientation and the category of the at least one object in the scene.
Each point of the set of N points is represented by a coordinate and a color.
The representing of the set of N points as a volumetric pixel representation comprises dividing a 3D space of the scene into elements with a three-dimensional grid and determining a center of the volumetric pixel by averaging all point coordinates from the set of N points falling into one grid element.
The method may further comprise determining a 3D bounding box of the at least one object based on the object classification probability, the object bounding box regression parameters and the centerness.
In accordance with an aspect of the disclosure, a robotic device for computer vision is provided. The robotic device comprises at least one memory(120) storing one or more computer executable instructions, and at least one processor(110) configured to execute the one or more instructions stored in the memory(120) to obtain a scene with at least one object, represent the obtained scene as a set of N points, input the set of N points as input data to a neural network, wherein the neural network performs the following steps: represent the set of N points as volumetric pixel representation, process the volumetric pixel representation to obtain four-dimensional tensors, process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene and process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and output, from the neural network, the predictions as numerical representation.
The robotic device further comprises a display, and the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to convert the numerical representation of the scene into an image of the scene and display the image on the display.
The predictions comprise an object classification probability, object bounding box regression parameters and centerness of the at least one object inside an object bounding box.
The at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to filter the obtained predictions by comparing the predictions according to the object classification probability to choose the most probable prediction and output, from the neural network, the most probable prediction as numerical representation, wherein the most probable prediction is considered as a final estimate of the location, the orientation and the category of the at least one object in the scene.
Each point of the set of N points is represented by a coordinate and a color.
The at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to divide a 3D space of the scene into elements with a three-dimensional grid and determine a center of the volumetric pixel by averaging all point coordinates from the set of N points falling into one grid element.
The at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to determine a 3D bounding box of the at least one object based on the object classification probability, the object bounding box regression parameters and the centerness.
According to an aspect to the disclosure, a computer-readable storage medium configured to store instructions which when executed by at least one processor, cause the at least one processor to execute any one of the methods for computer vision discussed above.
The foregoing exemplary embodiments are examples and are not to be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.
References
1. Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, M., Savarese, S.: 3d semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1534-1543 (2016)
2. Chen, J., Lei, B., Song, Q., Ying, H., Chen, D.Z., Wu, J.: A hierarchical graph network for 3d object detection on point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 392-401 (2020)
3. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
4. Cheng, B., Sheng, L., Shi, S., Yang, M., Xu, D.: Back-tracing representative points for voting-based 3d object detection in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8963-8972 (2021)
5. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3075-3084 (2019)
6. Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. (2020)
https://github.com/open-mmlab/mmdetection3d
7. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828-5839 (2017)
8. Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M.: 3d-mpa: Multiproposal aggregation for 3d semantic instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9031-9040 (2020)
9. Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J.: Ota: Optimal transport assignment for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 303-312 (2021)
10. Gwak, J., Choy, C., Savarese, S.: Generative sparse detection networks for 3d single-shot object detection. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16. pp. 297-313. Springer (2020)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770-778 (2016)
12. Hou, J., Dai, A., Nießner, M.: 3d-sis: 3d semantic instance segmentation of rgbd scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4421-4430 (2019)
13. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12697-12705 (2019)
14. Li, B.: 3d fully convolutional network for vehicle detection in point cloud. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1513-1518. IEEE (2017)
15. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980-2988 (2017)
16. Liu, Z., Zhang, Z., Cao, Y., Hu, H., Tong, X.: Group-free 3d object detection via transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 2949-2958 (2021)
17. Liu, Z., Wu, Z., Tóth, R.: Smoke: Single-stage monocular 3d object detection via keypoint estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 996-997 (2020)
18. Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for realtime object recognition. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 922-928. IEEE (2015)
19. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2906-2917 (2021)
20. Munkres, J.R.: Topology (2000)
21. Qi, C.R., Chen, X., Litany, O., Guibas, L.J.: Imvotenet: Boosting 3d object detection in point clouds with image votes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4404-4413 (2020)
22. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9277-9286 (2019)
23. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 652-660 (2017)
24. Rukhovich, D., Vorontsova, A., Konushin, A.: Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. arXiv preprint arXiv:2106.01178 (2021)
25. Song, S., Lichtenberg, S.P., Xiao, J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 567-576 (2015)
26. Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9627-9636 (2019)
27. Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: Fully convolutional one-stage monocular 3d object detection. arXiv preprint arXiv:2104.10956 (2021)
28. Xie, Q., Lai, Y.K., Wu, J., Wang, Z., Lu, D., Wei, M., Wang, J.: Venet: Voting enhancement network for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3712-3721 (2021)
29. Xie, Q., Lai, Y.K., Wu, J., Wang, Z., Zhang, Y., Xu, K., Wang, J.: Mlcvnet: Multilevel context votenet for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10447-10456 (2020)
30. Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
31. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11784-11793 (2021)
32. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9759-9768 (2020)
33. Zhang, Z., Sun, B., Yang, H., Huang, Q.: H3dnet: 3d object detection using hybrid geometric primitives. In: European Conference on Computer Vision. pp. 311-329. Springer (2020)
34. Zhou, D., Fang, J., Song, X., Guan, C., Yin, J., Dai, Y., Yang, R.: Iou loss for 2d/3d object detection. In: 2019 International Conference on 3D Vision (3DV). pp. 85-94. IEEE (2019)
35. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4490-4499 (2018)

Claims (15)

  1. A method for computer vision of a robotic device, the method comprising:
    obtaining a scene with at least one object;
    representing the obtained scene as a set of N points;
    inputting the set of N points as input data to a neural network, wherein the neural network performs the following steps:
    representing the set of N points as a volumetric pixel representation;
    processing the volumetric pixel representation to obtain four-dimensional tensors;
    processing the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene; and
    processing the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and
    outputting, from the neural network, the predictions as a numerical representation.
  2. The method of claim 1, further comprising:
    converting the numerical representation of the scene into an image of the scene; and
    displaying the image.
  3. The method of any one of the preceding claims, wherein the predictions comprise an object classification probability, object bounding box regression parameters, and a centerness of the at least one object inside an object bounding box.
  4. The method of claim 3, further comprising:
    filtering the obtained predictions by comparing them according to the object classification probability to select the most probable prediction; and
    outputting, from the neural network, the most probable prediction as a numerical representation,
    wherein the most probable prediction is considered as a final estimate of the location, the orientation and the category of the at least one object in the scene.
  5. The method of any one of the preceding claims, wherein each point of the set of N points is represented by a coordinate and a color.
  6. The method of any one of the preceding claims, wherein the representing of the set of N points as the volumetric pixel representation comprises:
    dividing a 3D space of the scene into elements with a three-dimensional grid; and
    determining a center of a volumetric pixel by averaging all point coordinates from the set of N points falling into one grid element.
  7. The method of claim 3, further comprising:
    determining a 3D bounding box of the at least one object based on the object classification probability, the object bounding box regression parameters and the centerness.
  8. A robotic device for computer vision, the robotic device comprising:
    at least one memory(120) storing one or more computer executable instructions, and
    at least one processor(110) configured to execute the one or more instructions stored in the memory(120) to:
    obtain a scene with at least one object;
    represent the obtained scene as a set of N points;
    input the set of N points as input data to a neural network, wherein the neural network is configured to:
    represent the set of N points as a volumetric pixel representation;
    process the volumetric pixel representation to obtain four-dimensional tensors;
    process the four-dimensional tensors to extract features, expressed in numerical form, of the at least one object in the scene; and
    process the extracted features to obtain predictions of a location, an orientation and a category of the at least one object in the scene, and
    output, from the neural network, the predictions as a numerical representation.
  9. The robotic device of claim 8, further comprising:
    a display;
    wherein the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to:
    convert the numerical representation of the scene into an image of the scene; and
    display the image on the display.
  10. The robotic device of any one of the preceding claims, wherein the predictions comprise an object classification probability, object bounding box regression parameters, and a centerness of the at least one object inside an object bounding box.
  11. The robotic device of claim 10, wherein the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to:
    filter the obtained predictions by comparing them according to the object classification probability to select the most probable prediction; and
    output, from the neural network, the most probable prediction as a numerical representation,
    wherein the most probable prediction is considered as a final estimate of the location, the orientation and the category of the at least one object in the scene.
  12. The robotic device of any one of the preceding claims, wherein each point of the set of N points is represented by a coordinate and a color.
  13. The robotic device of any one of the preceding claims, wherein the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to:
    divide a 3D space of the scene into elements with a three-dimensional grid; and
    determine a center of a volumetric pixel by averaging all point coordinates from the set of N points falling into one grid element.
  14. The robotic device of claim 10, wherein the at least one processor(110) is further configured to execute the one or more instructions stored in the memory(120) to:
    determine a 3D bounding box of the at least one object based on the object classification probability, the object bounding box regression parameters and the centerness.
  15. A computer-readable storage medium storing instructions which, when executed by at least one processor(110), cause the at least one processor(110) to execute the method of any one of claims 1 to 7.
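
The method claims above describe a concrete processing pipeline: voxelize a colored point set by averaging the points falling into each grid cell (claim 6), run a neural network over the resulting volumetric representation to obtain, for each location, a classification probability, bounding box regression parameters and a centerness (claims 1 and 3), and keep the most probable predictions as the final estimates (claims 4 and 7). The sketch below only illustrates that data flow and is not the claimed implementation: the neural network itself is omitted, the sketch produces axis-aligned boxes and ignores the orientation prediction, and the six-channel point layout (x, y, z, r, g, b), the voxel size, the score threshold, the box-delta convention and all function names are assumptions introduced for the example.

```python
import numpy as np

def voxelize(points, voxel_size=0.04):
    """Divide the 3D space of the scene into a regular grid and represent each
    occupied cell ("volumetric pixel") by the mean coordinate and mean color of
    the points falling into it (illustration of claim 6)."""
    coords = points[:, :3]          # x, y, z
    colors = points[:, 3:6]         # r, g, b (assumed point layout)
    cells = np.floor(coords / voxel_size).astype(np.int64)
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    counts = np.bincount(inverse, minlength=len(keys)).astype(np.float64)
    centers = np.zeros((len(keys), 3))
    features = np.zeros((len(keys), 3))
    for dim in range(3):
        centers[:, dim] = np.bincount(inverse, weights=coords[:, dim],
                                      minlength=len(keys)) / counts
        features[:, dim] = np.bincount(inverse, weights=colors[:, dim],
                                       minlength=len(keys)) / counts
    return keys, centers, features  # grid indices, voxel centers, averaged colors

def decode_and_filter(class_probs, box_deltas, centerness, locations,
                      score_threshold=0.3):
    """Combine per-location predictions (claim 3) into scored axis-aligned 3D
    boxes (claim 7) and keep only the most probable ones (claim 4). Box deltas
    are assumed to be distances from a location to the six box faces."""
    scores = class_probs.max(axis=1) * centerness   # centerness-weighted confidence
    labels = class_probs.argmax(axis=1)
    x, y, z = locations[:, 0], locations[:, 1], locations[:, 2]
    boxes = np.stack([x - box_deltas[:, 0], y - box_deltas[:, 2], z - box_deltas[:, 4],
                      x + box_deltas[:, 1], y + box_deltas[:, 3], z + box_deltas[:, 5]],
                     axis=1)                        # (x_min, y_min, z_min, x_max, y_max, z_max)
    keep = scores > score_threshold
    return boxes[keep], labels[keep], scores[keep]
```

In a complete system, the class_probs, box_deltas and centerness arrays would be produced by the neural network from the four-dimensional tensors recited in claims 1 and 8; the sketch merely shows how such outputs can be turned into the final location and category estimates once they are available.
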
PCT/KR2022/016890 2021-11-03 2022-11-01 Method for providing computer vision WO2023080590A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2021132104 2021-11-03
RU2021132104 2021-11-03
RU2022113324 2022-05-18
RU2022113324A RU2791587C1 (en) 2022-05-18 Method for providing computer vision

Publications (1)

Publication Number Publication Date
WO2023080590A1 true WO2023080590A1 (en) 2023-05-11

Family

ID=86241745

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/016890 WO2023080590A1 (en) 2021-11-03 2022-11-01 Method for providing computer vision

Country Status (1)

Country Link
WO (1) WO2023080590A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970518B1 (en) * 2017-11-14 2021-04-06 Apple Inc. Voxel-based feature learning network
US20190147250A1 (en) * 2017-11-15 2019-05-16 Uber Technologies, Inc. Semantic Segmentation of Three-Dimensional Data
US20190147372A1 (en) * 2017-11-15 2019-05-16 Uber Technologies, Inc. Systems and Methods for Object Detection, Tracking, and Motion Prediction
US20200160528A1 (en) * 2020-01-27 2020-05-21 Intel Corporation High fidelity interactive segmentation for video data with deep convolutional tessellations and context aware skip connections
US20210248377A1 (en) * 2020-02-06 2021-08-12 Shenzhen Malong Technologies Co., Ltd. 4d convolutional neural networks for video recognition

Similar Documents

Publication Publication Date Title
WO2020159232A1 (en) Method, apparatus, electronic device and computer readable storage medium for image searching
Huang et al. Feature-metric registration: A fast semi-supervised approach for robust point cloud registration without correspondences
Liu et al. Real-time tracking using trust-region methods
WO2019151735A1 (en) Vision inspection management method and vision inspection system
WO2022250408A1 (en) Method and apparatus for video recognition
US20210312655A1 (en) 3d pose estimation by a 2d camera
Asif et al. EnsembleNet: Improving Grasp Detection using an Ensemble of Convolutional Neural Networks.
WO2021125395A1 (en) Method for determining specific area for optical navigation on basis of artificial neural network, on-board map generation device, and method for determining direction of lander
EP4367628A1 (en) Image processing method and related device
WO2020091253A1 (en) Electronic device and method for controlling electronic device
WO2023080590A1 (en) Method for providing computer vision
Yang et al. Progressive domain adaptive network for crater detection
Castillo-Elizalde et al. Weighted node mapping and localisation on a pixel processor array
WO2023224430A1 (en) Method and apparatus for on-device personalised analysis using a machine learning model
US11350078B2 (en) 3D pose detection by multiple 2D cameras
US11554496B2 (en) Feature detection by deep learning and vector field estimation
WO2022092451A1 (en) Indoor location positioning method using deep learning
Istenic et al. Mission-time 3D reconstruction with quality estimation
WO2023055033A1 (en) Method and apparatus for enhancing texture details of images
WO2023038414A1 (en) Information processing method, apparatus, electronic device, storage medium and program product
WO2022124865A1 (en) Method, device, and computer program for detecting boundary of object in image
WO2020251151A1 Method and apparatus for estimating user's pose by using three-dimensional virtual space model
WO2021194105A1 (en) Expert simulation model training method, and device for training
WO2024106928A1 (en) Method and device for determining validity of camera pose by using visual localization
WO2024014777A1 (en) Method and device for generating data for visual localization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22890321

Country of ref document: EP

Kind code of ref document: A1