CN113487664A

CN113487664A - Three-dimensional scene perception method and device, electronic equipment, robot and medium

Info

Publication number: CN113487664A
Application number: CN202110838071.9A
Authority: CN
Inventors: 黄锐; 李�杰
Original assignee: Chinese University of Hong Kong Shenzhen
Current assignee: Shenzhen Institute of Artificial Intelligence and Robotics
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2021-10-08
Anticipated expiration: 2041-07-23
Also published as: CN113487664B

Abstract

The application discloses a three-dimensional scene perception method and device, a robot, electronic equipment and a readable storage medium. The method comprises the steps of performing two-dimensional semantic segmentation and monocular depth estimation on two-dimensional image data of RGB-D multi-modal data of a three-dimensional scene to be perceived respectively to obtain two-dimensional semantic features and two-dimensional structural features; respectively carrying out three-dimensional semantic segmentation and three-dimensional scene completion on three-dimensional depth data of the RGB-D multi-modal data to obtain three-dimensional semantic features and three-dimensional structural features; performing feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fusion semantic features, and performing feature fusion on the two-dimensional structural features and the three-dimensional structural features to obtain fusion structural features; and performing three-dimensional semantic scene completion on the three-dimensional scene to be perceived through a semantic structure parallel interactive iterative fusion mode based on the fusion semantic features and the fusion structure features to obtain semantic category information and three-dimensional scene structure information of the three-dimensional scene to be perceived, thereby realizing efficient and accurate three-dimensional scene perception.

Description

Three-dimensional scene perception method and device, electronic equipment, robot and medium

Technical Field

The present application relates to the field of computer vision technologies, and in particular, to a three-dimensional scene sensing method and apparatus, a robot, an electronic device, and a readable storage medium.

Background

Three-dimensional scene perception may also be referred to as understanding of a three-dimensional scene, including understanding semantic information and three-dimensional structures of a three-dimensional scene, i.e., geometric information, is a core problem of computer vision research. The conventional three-dimensional scene understanding is usually performed on semantic information and a three-dimensional structure of the three-dimensional scene separately, for example, the scene semantic understanding may be achieved by performing statistical learning and other methods on features of three-dimensional data such as two-dimensional images or videos, even point clouds and the like, and the three-dimensional structure understanding of the three-dimensional scene may be obtained by reasoning from multi-frame images or videos by a multi-view geometric method or directly obtained by a three-dimensional scanning sensor such as a laser radar and the like. During the course of research, researchers are increasingly aware that the two tasks of semantic information and three-dimensional structure are complementary and enlightening each other. For example, the three-dimensional shape of an object is a strong prior for identifying the class of the object, while understanding the scene semantics can help to distinguish different objects on the same structural depth or geometric plane. Therefore, combining the semantics of different objects in a three-dimensional scene with the understanding of the three-dimensional structure of the scene, or even joint learning and reasoning, is an important research direction. Based on this, three-dimensional Semantic Scene Completion (SSC) technology is applied. The technology completes incomplete data and identifies and labels semantics of all parts of a three-dimensional scene by analyzing input data. As described above, understanding of semantics can help to complement the holes of three-dimensional data and even partial structures that cannot be scanned by the three-dimensional sensor, and the three-dimensional structure can also provide shape information of objects in a scene, thereby realizing recognition of partial semantics. Although this research direction has made great progress due to the development of deep learning techniques, there are still many deficiencies. For example, the three-dimensional data acquisition cost is high, a large number of three-dimensional semantic category labels depended on by supervised learning are difficult to acquire, a deep neural network model is huge, training is time-consuming, the actual reasoning speed is difficult to meet the requirement of real-time application, and the semantic classification and structure completion precision is required to be improved; there is also an important deficiency, and it can be said that one of the main causes of such deficiencies is the lack of image data containing rich visual information.

In recent years, with the emergence and popularization of a new type of RGB-D camera capable of simultaneously acquiring two complementary information, a color (i.e., RGB) image and a Depth (i.e., Depth) image of the same three-dimensional scene, a new search direction is developed for solving the above problems. RGB-D cameras such as Microsoft Kinect, Intel RealSense, and the Astro series of light in Australian brand of China. Generally speaking, an RGB image obtained by a two-dimensional color camera has high resolution and very complete image features, and contains rich visual information such as colors and textures, depth images obtained by a three-dimensional depth sensor such as an infrared thermal imaging sensor and a binocular stereo vision imaging sensor are sparse, and some holes often exist, but each pixel point with a value contains a depth of an actual scene which is relatively accurate, so that data of the two modes have good description semantics and structural complementarity.

The three-dimensional scene sensing technology may include various technologies, such as a scene recognition technology, a three-dimensional object detection technology, a semantic segmentation technology, a three-dimensional reconstruction technology, a scene completion technology, and the like. The two technologies can respectively acquire semantic category information and three-dimensional scene structure information of each point in a three-dimensional environment, such as a voxel or a point in a three-dimensional point cloud, finally form complete voxel representation or three-dimensional point cloud representation with a semantic label of the whole scene, and apply the obtained three-dimensional structure information and the semantic information in a computer vision technology to assist in executing different tasks, for example, the information can help a mobile robot or an automatic driving automobile to quickly acquire results of other tasks. Fig. 1 lists the three-dimensional scene perception technology of the related art, and because the RGB and D different modalities and complementary data in the related art are not sufficiently and efficiently fused, and the relationship between the scene semantic understanding and the three-dimensional structure perception which are enlightened by each other is not accurately expressed and sufficiently utilized, the three-dimensional semantic scene completion technology based on the RGB-D multimodal data is not perfectly solved. Therefore, the accuracy of the three-dimensional scene perception result obtained by the existing three-dimensional perception method is not high enough, the speed is not high enough, and the difference with the requirement of an actual application scene is large.

Disclosure of Invention

The application provides a three-dimensional scene perception method and device, a robot, electronic equipment and a readable storage medium, which can realize efficient and accurate three-dimensional scene perception and meet the requirements of practical application scenes.

In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:

an embodiment of the present invention provides a three-dimensional scene sensing method, including:

respectively carrying out two-dimensional semantic segmentation and monocular depth estimation on two-dimensional image data of RGB-D multi-modal data of a three-dimensional scene to be perceived to obtain two-dimensional semantic features and two-dimensional structural features;

respectively carrying out three-dimensional semantic segmentation and three-dimensional scene completion on the three-dimensional depth data of the RGB-D multi-modal data to obtain three-dimensional semantic features and three-dimensional structural features;

performing feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fusion semantic features, and performing feature fusion on the two-dimensional structural features and the three-dimensional structural features to obtain fusion structural features;

and performing three-dimensional semantic scene completion on the three-dimensional scene to be perceived through a semantic structure parallel interactive iterative fusion mode based on the fusion semantic features and the fusion structural features to obtain semantic category information and three-dimensional scene structural information of the three-dimensional scene to be perceived.

Optionally, the process of performing feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fusion semantic features includes:

constructing a DCP module based on a deformable convolution neural network and a text pyramid structure in advance;

projecting the two-dimensional semantic features to a three-dimensional space to obtain three-dimensional projection features;

converting the three-dimensional projection feature into a three-dimensional standard projection feature with the same channel number as the three-dimensional semantic feature;

inputting the three-dimensional projection characteristics to the DCP module to obtain three-dimensional enhancement characteristics;

projecting the three-dimensional primary fusion semantic features synthesized by the three-dimensional enhancement features, the three-dimensional standard projection features and the three-dimensional semantic features to a plane space to obtain two-dimensional fusion semantic features;

and inputting the three-dimensional primary fusion semantic features into a depth attention module to obtain the three-dimensional fusion semantic features.

Optionally, the process of performing feature fusion on the two-dimensional structural feature and the three-dimensional structural feature to obtain a fusion structural feature includes:

projecting the three-dimensional structural feature to a plane space to obtain a two-dimensional projection feature;

synthesizing the two-dimensional projection feature and the two-dimensional structure feature to obtain a two-dimensional fusion structure feature;

and projecting the two-dimensional fusion structural feature to a three-dimensional space to obtain a three-dimensional primary fusion structural feature, and synthesizing the three-dimensional primary fusion structural feature and the three-dimensional structural feature to obtain a three-dimensional fusion structural feature.

Optionally, before performing two-dimensional semantic segmentation and monocular depth estimation on the two-dimensional image data of the RGB-D multimodal data of the three-dimensional scene to be perceived respectively to obtain two-dimensional semantic features and two-dimensional structural features, the method further includes:

constructing a semantic segmentation backbone network comprising a two-dimensional semantic segmentation network and a three-dimensional semantic segmentation network and a scene completion backbone network comprising a depth estimation network and a scene completion network based on a deep neural network model; decomposing the target three-dimensional convolution kernels of the semantic segmentation trunk network and the scene completion trunk network along three dimensions;

the two-dimensional semantic segmentation network performs two-dimensional semantic segmentation on two-dimensional image data to obtain two-dimensional semantic features; the three-dimensional semantic segmentation network carries out three-dimensional semantic segmentation on the three-dimensional depth data to obtain three-dimensional semantic features; the depth estimation network carries out monocular depth estimation on the two-dimensional image data to obtain two-dimensional structural characteristics; and the scene completion network performs three-dimensional scene completion on the three-dimensional depth data to obtain three-dimensional structural characteristics.

Optionally, the performing three-dimensional semantic scene completion on the three-dimensional scene to be perceived through a parallel interactive iterative fusion manner based on the fusion semantic features and the fusion structural features to obtain semantic category information and three-dimensional scene structural information of the three-dimensional scene to be perceived includes:

constructing a bidirectional iterative interaction enhancement network in advance based on a deep neural network model, wherein the bidirectional iterative interaction enhancement network comprises a semantic auxiliary structure module, a structure auxiliary semantic module and a semantic structure feature fusion module;

inputting the fusion semantic features and the fusion structural features into the bidirectional iterative interaction enhancement network, executing a scene completion task based on the fusion semantic features by using the semantic auxiliary structural module, and executing a semantic segmentation task based on the fusion structural features by using the structural auxiliary semantic module; and continuously and iteratively fusing the execution result of the scene completion task and the execution result of the semantic segmentation task by using the semantic feature fusion module until an iteration ending condition is met, so as to obtain the semantic category information and the three-dimensional scene structure information.

Optionally, the target three-dimensional convolution kernel of the bidirectional iterative interaction enhancement network is decomposed along three dimensions.

Another aspect of the embodiments of the present invention provides a three-dimensional scene sensing apparatus, including:

the two-dimensional feature extraction module is used for respectively carrying out two-dimensional semantic segmentation and monocular depth estimation on two-dimensional image data of RGB-D multi-modal data of a three-dimensional scene to be perceived to obtain two-dimensional semantic features and two-dimensional structural features;

the three-dimensional feature extraction module is used for respectively carrying out three-dimensional semantic segmentation and three-dimensional scene completion on the three-dimensional depth data of the RGB-D multi-modal data to obtain three-dimensional semantic features and three-dimensional structural features;

the semantic structure feature fusion module is used for performing feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fusion semantic features, and performing feature fusion on the two-dimensional structure features and the three-dimensional structure features to obtain fusion structure features;

and the semantic scene completion module is used for performing three-dimensional semantic scene completion on the three-dimensional scene to be perceived through a semantic structure iterative fusion mode based on the fusion semantic features and the fusion structural features to obtain semantic category information and three-dimensional scene structural information of the three-dimensional scene to be perceived.

The embodiment of the invention also provides a robot, which comprises a computer vision processor and an RGB-D camera;

the computer vision processor is connected with the RGB-D camera, and the RGB-D camera sends the collected RGB-D multi-mode data of the three-dimensional scene to be perceived to the computer vision processor;

the computer vision processor, when executing a computer program stored in a memory, carries out the steps of the three-dimensional scene perception method according to any of the previous claims.

An embodiment of the present invention further provides an electronic device, which includes a processor, and the processor is configured to implement the steps of the three-dimensional scene sensing method according to any one of the preceding items when executing the computer program stored in the memory.

Finally, an embodiment of the present invention provides a readable storage medium, where a computer program is stored, and when being executed by a processor, the computer program implements the steps of the three-dimensional scene sensing method according to any of the foregoing embodiments.

The technical scheme provided by the application has the advantages that the semantic features and the structural features for respectively extracting the RGB images and the D are fused, and the feature obtained by fusion not only keeps the uniqueness of two-dimensional and three-dimensional features, so that the 2D-3D feature can improve the precision of the task of the other party by utilizing complementary information, but also can realize feature enhancement by utilizing the commonality between different features, thereby realizing the full and efficient fusion of data which are mutually complemented and have two different modes of RGB and D. Based on the fused structural features and semantic features, the semantic segmentation and three-dimensional scene completion are combined by adopting a parallel interactive iterative learning mode, the parallel iterative interaction mode avoids semantic segmentation error accumulation and transmits the accumulated semantic segmentation error to a three-dimensional scene completion network, and the interaction mode with continuous iterative updating can ensure that the three-dimensional semantic scene completion task can apply the latest segmentation result every time, so that the mutual heuristic promotion relationship between the scene semantic understanding task and the three-dimensional structure perception task is accurately expressed and fully utilized, the data of different modes of RGB-D can be mutually utilized, the sparse three-dimensional semantic scene completion result is improved, the efficient and accurate three-dimensional scene perception is realized, and the actual application scene requirements can be met.

In addition, the embodiment of the invention also provides a corresponding implementation device, a robot, electronic equipment and a readable storage medium for the three-dimensional scene perception method, so that the method has higher practicability, and the device, the robot, the electronic equipment and the readable storage medium have corresponding advantages.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the related art, the drawings required to be used in the description of the embodiments or the related art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a prior art classification diagram of three-dimensional scene perception provided by an embodiment of the present invention;

fig. 2 is a schematic flowchart of a three-dimensional scene sensing method according to an embodiment of the present invention;

fig. 3 is a schematic flowchart of another three-dimensional scene sensing method according to an embodiment of the present invention;

fig. 4 is a schematic flow chart of a semantic feature fusion method according to an embodiment of the present invention;

FIG. 5 is a schematic flow chart of a conventional feature fusion method according to an embodiment of the present invention;

fig. 6 is a schematic flowchart of a feature fusion method according to this embodiment of the present invention;

FIG. 7 is a schematic diagram of a projection principle of different dimensional spaces provided by an embodiment of the present invention;

fig. 8 is a schematic flow chart of a structural feature fusion method according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of plane and line constraints in one illustrative example provided by an embodiment of the invention;

FIG. 10 is a diagram illustrating semantic category determination by structural shape prior in an exemplary embodiment of the present invention;

FIG. 11 is a diagram illustrating semantic feature construction assisted by structural features according to an exemplary embodiment of the present invention;

FIG. 12 is a schematic data processing flow diagram of a unidirectional cascading semantic assisted architecture network according to an embodiment of the present invention;

fig. 13 is a schematic data processing flow diagram of a bidirectional iterative interaction-enhanced network according to an embodiment of the present invention;

FIG. 14 is a diagram illustrating a sequential representation of an iterative interaction approach provided by an embodiment of the present invention;

FIG. 15 is a schematic diagram of a prior art planar decomposition along three different dimensions according to an embodiment of the present invention;

FIG. 16 is a diagram illustrating a standard residual error network structure in an exemplary embodiment of the present invention;

fig. 17 is a schematic diagram of a structure of a dimensionality reduction residual error network in an illustrative example provided by the embodiment of the present invention;

FIG. 18 is a graphical representation of experimental validation results in an illustrative example provided by an embodiment of the invention;

fig. 19 is a structural diagram of a specific embodiment of a three-dimensional scene sensing device according to an embodiment of the present invention;

fig. 20 is a block diagram of an embodiment of an electronic device according to an embodiment of the present invention.

Fig. 21 is a structural diagram of a specific embodiment of a robot according to an embodiment of the present invention;

Detailed Description

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.

Referring to fig. 2, fig. 2 is a schematic flow chart of a three-dimensional scene sensing method according to an embodiment of the present invention, where the embodiment of the present invention may include the following:

s201: and respectively carrying out two-dimensional semantic segmentation and monocular depth estimation on the two-dimensional image data of the RGB-D multi-modal data of the three-dimensional scene to be perceived to obtain two-dimensional semantic features and two-dimensional structural features.

The three-dimensional scene to be perceived in the step is a three-dimensional scene needing to be understood, the RGB-D multi-mode data is data obtained by utilizing an RGB-D camera to acquire images of the three-dimensional scene to be perceived, and the two-dimensional image data is an RGB image in the images acquired by the RGB-D camera. A person skilled in the art can flexibly adopt any semantic segmentation technology for performing semantic segmentation processing on two-dimensional image data, so as to obtain corresponding two-dimensional semantic features, which is not limited in this application. A person skilled in the art can flexibly adopt any depth estimation technology for performing depth estimation processing on depth information carried by two-dimensional image data, so as to obtain corresponding two-dimensional structural features, which is not limited in this application.

S202: and respectively carrying out three-dimensional semantic segmentation and three-dimensional scene completion on the three-dimensional depth data of the RGB-D multi-modal data to obtain three-dimensional semantic features and three-dimensional structural features.

The three-dimensional depth data in this step is the D data in the image collected by the RGB-D camera. The skilled person can flexibly adopt any semantic segmentation technology for implementing semantic segmentation processing on the three-dimensional data, so as to obtain corresponding three-dimensional semantic features, which is not limited in this application. A person skilled in the art can flexibly adopt any three-dimensional scene completion technology for performing scene completion processing on three-dimensional data, so as to obtain corresponding three-dimensional structural features, which is not limited in this application.

S203: and performing feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fusion semantic features, and performing feature fusion on the two-dimensional structural features and the three-dimensional structural features to obtain fusion structural features.

It can be understood that, in the currently popular method, RGB-D multimodal data are mostly used as joint input to train an end-to-end system, but the early fusion of the data of the two modalities causes that the network is difficult to capture the unique features of the respective modalities, increases the training difficulty, is difficult to converge, and has an increasing demand for data. Fig. 5 and 6 show two different fusion modes, early fusion and late fusion. The early fusion mode mainly means that before the network is input, the RGB image and the D depth feature are merged together and input into the network as a whole; and the later-stage fusion mode is that the RGB image and the D depth data are respectively input into a semantic segmentation network and a three-dimensional semantic scene completion network, then the characteristics obtained by the two networks are fused, and finally semantic completion information is output. In order to solve these technical drawbacks, the embodiment performs feature fusion before performing the three-dimensional semantic scene completion processing, and then performs S204. When fusing the features, a weighting method or an accumulation method in the prior art can be adopted. The simplest way for fusing the RGB data and the D data is to connect respective vectors, and then discover the internal relation of the RGB data and the D data by using a complex neural network and massive training data. The convergence rate of network training can be greatly improved due to the exquisite design containing the prior knowledge, and overfitting is avoided. The RGB data and D data can be used for two-dimensional and three-dimensional semantic segmentation, respectively, and are therefore completely identical on the semantic class labels; on the other hand, dense "relative" depth estimates based on RGB data and sparse "absolute" depth information contained by D data may also mutually validate. Thus, in addition to a simple concatenation approach, depth fusion of RGB data and D data may be considered from both aspects.

S204: and performing three-dimensional semantic scene completion on the three-dimensional scene to be perceived through a semantic structure parallel interactive iterative fusion mode based on the fusion semantic features and the fusion structure features to obtain semantic category information and three-dimensional scene structure information of the three-dimensional scene to be perceived.

The three-dimensional scene structure information refers to a geometric structure of each three-dimensional object in the three-dimensional scene to be perceived, the semantic category information refers to a category label to which each three-dimensional object in the three-dimensional scene to be perceived belongs or object information described by characters and/or letters, for example, the three-dimensional scene to be perceived includes walls, floors, tables and beds, the three-dimensional scene structure information outputs geometric structures corresponding to the walls, the floors, the tables and the beds, the semantic category information outputs walls, the floors, the tables and the beds, and if the labels corresponding to the walls, the floors, the tables and the beds are defined to be 1, 2, 3 and 4 in advance, the semantic category information outputs 1, 2, 3 and 4. The semantic structure parallel interactive iterative fusion mode is to execute two tasks of a semantic segmentation task and a scene completion task in parallel and perform iterative fusion on the two tasks until the iterative process is finished.

At present, the mainstream method does not deeply excavate the mutual relation between semantic segmentation and scene completion in the three-dimensional semantic scene completion joint task, the scene completion result is simply used for the three-dimensional semantic segmentation, and the scene completion result is not benefited by the improvement of the three-dimensional semantic segmentation and is further improved. The addition of RGB data only adopts the serial connection mode to take the output of two-dimensional semantic segmentation as the input of three-dimensional semantic scene completion. The series connection mode is simple in structure and rapid in training, but classification errors of two-dimensional semantic segmentation can be accumulated in a three-dimensional semantic scene completion task, and effective iteration and repeated promotion cannot be achieved. In order to solve the defects of the prior art, on the basis that the RGB-D full fusion respectively promotes semantic segmentation and scene completion, and in order to improve the accuracy of three-dimensional semantic scene completion, the mutual relation between semantics and structures is further researched, by establishing a support model of the semantics for the structures and the structures for the semantics, based on how mutual prior interaction exists between the semantic segmentation and the scene completion, the two tasks of the semantic segmentation and the scene completion are subjected to parallel interactive iteration to realize deep coupling, and the semantics and the structures promote each other, so that the high-accuracy three-dimensional semantic scene completion is finally realized.

In the technical scheme provided by the embodiment of the invention, the semantic features and the structural features of the RGB image and the D are respectively extracted and fused, and the fused features not only keep the uniqueness of two-dimensional and three-dimensional features, so that the 2D-3D features can improve the precision of tasks of the other party by utilizing complementary information, but also can realize feature enhancement by utilizing the commonality between different features, thereby realizing the full and efficient fusion of data which are mutually complemented and have two different modes of RGB and D. Based on the fused structural features and semantic features, the semantic segmentation and three-dimensional scene completion are combined by adopting a parallel interactive iterative learning mode, the parallel iterative interaction mode avoids semantic segmentation error accumulation and transmits the accumulated semantic segmentation error to a three-dimensional scene completion network, and the interaction mode with continuous iterative updating can ensure that the three-dimensional semantic scene completion task can apply the latest segmentation result every time, so that the mutual heuristic promotion relationship between the scene semantic understanding task and the three-dimensional structure perception task is accurately expressed and fully utilized, the data of different modes of RGB-D can be mutually utilized, the sparse three-dimensional semantic scene completion result is improved, the efficient and accurate three-dimensional scene perception is realized, and the actual application scene requirements can be met.

It should be noted that, in the present application, there is no strict sequential execution order among the steps, and as long as the logical order is met, the steps may be executed simultaneously or according to a certain preset order, and fig. 1 is only an exemplary manner, and does not represent that only the execution order is the order.

It will be appreciated that the RGB data and D data are different modalities and both can be used for semantic segmentation and depth estimation or three-dimensional scene completion. Although two-dimensional semantic segmentation on RGB data and three-dimensional semantic segmentation based on D data are problems in different dimensions, semantic class labels of the RGB data and three-dimensional semantic segmentation based on D data are completely the same, namely class information described by characters and/or letters and/or numbers, so that fusion of high-level features close to the semantic labels needs to be researched, two-dimensional semantic segmentation features obtained by RGB data training and three-dimensional semantic segmentation features obtained by D data training can be fused with each other, and results can be promoted mutually. On the other hand, depth estimation based on RGB data can yield dense relative depth information, for example, by an unsupervised depth learning method, but a single-frame two-dimensional image cannot infer absolute scale. And the D data contains sparse absolute depth information which can be mutually verified and complemented with the former. Based on the method, under the framework of deep learning, the optimal time and method are found for the network structure for fusing the RGB-D data at different stages, the promotion effects of two tasks of semantic segmentation and scene completion are achieved, the characteristics of the data under two modes are fully explored, the RGB-D multi-mode data fully extract respective unique features and are fused in a feature layer with positive correlation in height, and the fusion of the RGB and D multi-mode data is deepened through continuous iteration. Based on this, the present application provides a semantic feature fusion method and a structural feature fusion method under an implementation manner, respectively, which may include the following contents:

with reference to fig. 3 and 4, in order to make the dense two-dimensional semantic segmentation result generated by RGB data and the three-dimensional sparse semantic scene completion result generated by D data help each other, features of two different tasks need to be fused, so this embodiment designs a 2D-3D semantic feature fusion module, that is, module 1.1 in fig. 3, and performs feature fusion of two-dimensional semantic features and three-dimensional semantic features by using this module to obtain a task of fusing semantic features, where the execution process may include:

constructing a DCP module based on a deformable convolution neural network and a text pyramid structure in advance; projecting the two-dimensional semantic features to a three-dimensional space to obtain three-dimensional projection features; converting the three-dimensional projection characteristics into three-dimensional standard projection characteristics with the same channel number as the three-dimensional semantic characteristics; inputting the three-dimensional projection characteristics into a DCP module to obtain three-dimensional enhancement characteristics; projecting the three-dimensional primary fusion semantic features synthesized by the three-dimensional enhancement features, the three-dimensional standard projection features and the three-dimensional semantic features to a plane space to obtain two-dimensional fusion semantic features; and inputting the three-dimensional primary fusion semantic features into a depth attention module to obtain the three-dimensional fusion semantic features.

Specifically, the feature fusion module takes the three-dimensional semantic segmentation branch output in a two-dimensional semantic segmentation backbone network and a three-dimensional semantic scene completion backbone network as input, outputs a more optimal two-dimensional semantic and three-dimensional semantic segmentation result, and feeds back the result to a corresponding module. In the feature fusion process, the feature F is output_3dShould remain the original f_2dAnd f_3dThe uniqueness of the features and the commonality of different features are used to improve the precision, so that different feature fusion modes are explored for connection in the embodiment. Because the features of two-dimensional semantic segmentation and three-dimensional semantic scene completion have different dimensions, a 2D-3D projection layer needs to be designed so as to facilitate the low-dimensional feature f_2dProjecting the three-dimensional space to obtain projection characteristics

Then will be

With the initial three-dimensional feature f_3dFusion is performed. FIG. 7 is a schematic 2D-3D projection diagram showing the depth z of a pixelUnder known conditions, the image coordinates [ u, v ]]Internal reference matrix K that can pass through camera^3*3Conversion to camera coordinates [ x, y, z ]]. Next, the external parameter matrix [ R | t ] of the camera is utilized]^3*4The camera coordinates can be reconverted to three-dimensional world coordinates [ X, Y, Z ]]At this point, the 2D to 3D spatial conversion is completed. And finally, completing the feature fusion module by using key technologies such as a one-hot function, a deformable convolution, a text pyramid, a depth attention mechanism and the like. Projecting features by using one-hot function

Is converted into and_3dfeatures having the same number of channels

The classification result of the two-dimensional semantic segmentation is continuously maintained. The features are then generated using a Deformable text Pyramid Module (DCP)

The DCP module can aggregate geometric context information and can well adapt to the shape and scale change of an object. As an alternative implementation, a text pyramid structure with a hole convolution can be applied to expand the receptive field of the network and obtain richer local and global features so as to enhance the shape recovery capability of the deep learning network. In addition, in order to improve the geometric transformation modeling capability of the network, a deformable convolutional neural network can be utilized to maintain smooth consistency of the shape. In addition, objects belonging to the same semantic category typically have similar or consistent depths, while different objects typically have truncated depths. Therefore, the fusion characteristics can be further processed by using a Depth Attention Module (DAM) constructed by using the Attention mechanism in the prior art, and the measurement of the Depth importance of different classes can be realized. As an alternative embodiment, the 2D-3D fusion module can be expressed as:

in the formula, F_2dIs a two-dimensional semantic segmentation coarse result f_2dAnd three-dimensional semantic segmentation result F_3dThe updated semantic segmentation result generated after passing through the two-dimensional DCP module, p and h respectively represent a 2D-3D projection function and a one-hot function,

representing a sum-by-element operation, DCP represents a deformable text pyramid module. D represents the number of channels in the depth dimension, C is a two-dimensional convolution layer, R is a ReLU activation function, i is an ith dimension index,

is composed of

The feature vector at the position i is,

is composed of

Feature vector at position j, F_3d,jOutput signature F for DAM_3dThe feature vector at position j.

In order to facilitate the monocular RGB based 2D Depth estimation and Depth based 3D scene completion with each other, the present embodiment designs a 2D-3D structural feature fusion module, i.e. module 1.2 in fig. 3, in combination with fig. 3 and 8. Because the depth map collected by the depth camera such as the Kinect has large noise and the object made of the materials such as the mirror and the glass is difficult to collect the accurate depth of the depth map, the input depth map is often provided with holes, and the 3D scene directly generated by the depth map also has holes. And the depth value of each pixel point can be estimated by the RGB-based 2D depth estimation network, and the original 3D scene with holes can be completed after the depth value is projected to 3D. Conversely, after the completed 3D scene is projected into the 2D space, the result of the 2D depth estimation can be corrected. Therefore, iterating the RGB-D loop multiple times will finally obtain a better complemented 3D scene and a more accurate estimated 2D depth map, and the specific design is shown in fig. 8. The process of performing the feature fusion task of the two-dimensional structural feature and the three-dimensional structural feature by using the 2D-3D structural feature fusion module may include:

projecting the three-dimensional structural feature to a plane space to obtain a two-dimensional projection feature; synthesizing the two-dimensional projection characteristics and the two-dimensional structure characteristics to obtain two-dimensional fusion structure characteristics; and projecting the two-dimensional fusion structural features to a three-dimensional space to obtain three-dimensional primary fusion structural features, and synthesizing the three-dimensional primary fusion structural features and the three-dimensional structural features to obtain the three-dimensional fusion structural features.

As can be seen from the above, in the embodiment, the one-hot function, the deformable convolution, the text pyramid, the depth attention mechanism, and the iterative loop technology are utilized to complete the semantic feature fusion between 2D and 3D and the structural feature fusion between 2D and 3D, so as to realize the high fusion of RGB-D multimodal data. The uniqueness of two-dimensional and three-dimensional characteristics is kept, so that the 2D-3D characteristics can improve the precision of tasks of opposite parties by utilizing complementary information, and the characteristics can be enhanced by utilizing the commonness among different characteristics. By constructing an effective multi-task joint learning framework to fully fuse RGB-D data and deeply couple subtasks of semantic understanding and structure understanding, the mIoU (Mean Intersection over Union) precision of a three-dimensional semantic segmentation task can be higher than 50%.

At present, the research of three-dimensional semantic scene completion has made little progress, and semantic information can help structural completion and structural information is proved to help semantic segmentation. However, in the prior art, a three-dimensional semantic scene completion result is directly generated based on an end-to-end deep learning network model, and the problem of how semantic information and scene structure information are mutually promoted is not accurately modeled and clearly explained, which is still a great challenge to the research in this aspect at present. The applicant believes that the following two issues mainly need to be considered: 1) the semantics and the structure are specific methods which are mutually prior, namely how the semantics help the structure and how the structure helps the semantics; 2) the semantic segmentation and the structure complement the joint mode of two tasks, the execution sequence of the two tasks, unidirectional cascade or interactive iteration and the like. The embodiment starts from two directions of the first problem, and models semantic information auxiliary structure completion and structure information auxiliary semantic segmentation respectively; after solving how these two tasks affect the specific form of each other, the second problem, i.e. the different network architecture where the two are combined, will be explored.

Based on semantic segmentation in three-dimensional semantic scene completion, the method comprises two-dimensional semantic segmentation and three-dimensional semantic segmentation, and the relation between the two subtasks and the three-dimensional scene completion, a better joint learning mode between two theories of semantics and structure and a better connection method between the two modules on the basis of RGB-D full fusion is sought, so that the two tasks are fully promoted. As an optional implementation manner, based on the fusion semantic features and the fusion structural features, the process of performing three-dimensional semantic scene completion on the three-dimensional scene to be perceived in a parallel interactive iterative fusion manner to obtain semantic category information and three-dimensional scene structural information of the three-dimensional scene to be perceived may include:

constructing a bidirectional iterative interaction enhancement network in advance based on a deep neural network model, wherein the bidirectional iterative interaction enhancement network comprises a semantic auxiliary structure module, a structure auxiliary semantic module and a semantic structure feature fusion module; inputting the fusion semantic features and the fusion structural features into a bidirectional iterative interaction enhancement network, executing a scene completion task based on the fusion semantic features by using a semantic auxiliary structural module, and executing a semantic segmentation task based on the fusion structural features by using a structural auxiliary semantic module; and continuously and iteratively fusing the execution result of the scene completion task and the execution result of the semantic segmentation task by using a semantic feature fusion module until an iteration ending condition is met, so as to obtain semantic category information and three-dimensional scene structure information. The semantic feature fusion module can perform feature fusion by adopting any feature fusion mode, such as a weighting mode.

First, with the goal of executing the semantic auxiliary structure task by the semantic auxiliary structure module (i.e. module 2.1 in fig. 3), the embodiment considers that the semantic tags can be expressed as a priori knowledge of the structure inference. For a certain point or voxel in space, the geometrical information it contains at different spatial positions has a high variability. Often points in the same object are more homogenous, i.e. points inside the same object are likely to belong to the same semantic category as their surroundings. Meanwhile, points at the curved surface, the edge and the vertex have richer geometric information than points inside the object, and most of the points have different semantic labels with the surrounding environment. Therefore, the geometric properties of each point in the space can be deduced according to the semantic information of the point and the semantic labels of the surrounding points, and the geometric structure of the scene object is completed. Secondly, semantic information also contains higher-order geometric prior knowledge, and categories such as indoor wall surfaces, floors, ceilings, doors and windows necessarily contain a large number of plane and straight-line elements. In the embodiment, the superpixel segmentation algorithm and the traditional line extraction algorithm LSD can be used for carrying out plane and line segmentation preliminarily, and abundant plane and line information in the space can be obtained, so that local and global geometric constraints such as line or plane constraints can be established for a scene completion task according to the semantic information, and a more robust and smoother geometric structure can be completed. The specific plane and line constraints are shown in fig. 9. For the extracted planes, 4 pixel points (a; B; C; D) can be randomly selected and projected to the 3D to get four points (A; B; C; D) in space according to their depth values. According to the basic knowledge of the geometry, the method can be known,

and

should be perpendicular to the plane Δ ABC. If D and Δ ABC are coplanar, then

And

should be 0, i.e. the dot product of

Network learning can be supervised as an error term. Its loss function can be defined as:

wherein L is_PCRepresents the Plane Consistency constraint (Plane Consistency), N_PRepresents the number of combinations of four points extracted from the plane,

the vectors between points A, B in the ith group of points,

the vectors between points A, C in the ith group of points,

is the vector between points A, D in the ith group of points.

According to the embodiment, the outer points can be removed through plane constraint, so that the scene structure obtained through completion is more consistent.

Wherein L is_PCRepresents the Plane Consistency constraint (Plane Consistency), N_PTo represent

Since a Line segment is another ubiquitous element in an artificial environment, and a Line segment in a two-dimensional image always corresponds to a straight Line in a three-dimensional space, the present embodiment will also establish a straight Line Consistency constraint (Line Consistency). When constructing a straight line consistency loss, the strategy of planar consistency loss may still be followed. That is, the embodiment can randomly extract 3 pixels from the extracted line segment, i.e., (e; f; g). Their corresponding three-dimensional points E, F and G should be constrained to be on the same line, and

and

the cross product of (d) should be a zero vector. Therefore, the temperature of the molten metal is controlled,

the learning of the scene completion task can be supervised as an error item. This embodiment can select N from a straight line_lGroup data to construct a loss function L_lcIt is defined as follows:

in the formula (I), the compound is shown in the specification,

the vectors between points E, F in the ith group of points,

is the vector between points E, G in the ith group of points.

Secondly, with the goal of executing the structure-assisted semantic task by the structure-assisted semantic module (i.e., module 2.2 in fig. 3), the embodiment considers that the dense shape prior obtained by the scene completion task can better help to infer the semantic category of the object. In the previous work, when the sparse scene is gradually complemented into the dense scene, as shown in fig. 10, the shape contour of the truck at the circle is slowly revealed, and the technical solution provided by the embodiment is easier to identify that is a truck, but not other semantic categories. However, the mainstream method at present is to implicitly express the relationship between shape information and semantics in a scene structure by using an end-to-end neural network. The present embodiments may attempt to model semantic features with structural features, explicitly aiding the semantic segmentation task.

A typical idea is shown in fig. 11, for a point that needs to infer semantic category and complete structure, such as a point O in the number 5, a central point of a rough three-dimensional prediction voxel on the left side and a K-NN nearest neighbor algorithm may be used to search K nearest neighbor points from a set of original three-dimensional point cloud data, then a position weight between each point and the point O is calculated, then feature aggregation is performed by using a graph-based attention convolution operation, and finally a final aggregation feature is obtained. The method not only ensures consistency with surrounding semantics, but also ensures integrity and reasonability of the shape structure of the object with the same semantic category. The specific characteristic polymerization formula is as follows:

wherein, w_ijRepresenting the weight between the semantic features of point i and the near point j,

representing a sum-by-element operation, p_iRepresenting the structural features of point i, p_jStructural features of point j, f_iRepresenting semantic features of point i, f_jThe semantic features of the point j are represented,

indicates the final polymerization characteristic of point i, φ₁、φ₂Representing two different multi-level perceptrons, and n (i) representing the nearest neighbor set for the ith point.

Finally, the specific combination mode of the semantic segmentation and the scene completion can have different forms, such as a mode based on one-way cascade and a mode based on two-way iterative interaction, and the two multi-task combination methods are respectively introduced below.

A typical unidirectional cascaded network framework is shown in fig. 12. In the structure, the RGB-D multi-modal data is firstly subjected to semantic segmentation network to obtain the semantic segmentation characteristics, and in order to enable the semantic information and the structural information to be fused and promoted, the semantic and the structural characteristics can be simultaneously input into the semantic structure fusion module. In this module, since the semantic priors of the scene are obtained, the structure can be updated with the assistance of semantic information. Meanwhile, the structure after completion can further optimize the result of semantic segmentation. And finally, outputting the optimized 3D scene completion result through a semantic structure fusion module, and combining the result with semantic information to generate the final semantic scene completion. The unidirectional cascade network has a simple structure, and can eliminate the following modules as necessary by reasonably designing the semantic auxiliary structure feature fusion module to meet the real-time requirement.

Bidirectional iterative interaction network framework as shown in fig. 13, the network framework is referred to as a bidirectional iterative interaction enhanced network. The network can predict a completion scene based on a 3D voxel format from pairs of RGB and D images, and each voxel will be assigned a specific semantic class. Firstly, data of RGB and D channels are respectively sent into a semantic segmentation backbone network and a semantic scene completion backbone network, rough segmentation results are respectively generated, the rough segmentation results are simultaneously input into a feature fusion module for fusion optimization, and finally, a more optimal three-dimensional semantic scene completion result is output through continuous iterative learning. In order to fully mine more interactions between the semantic segmentation and the scene completion task, the embodiment represents the iterative learning process in a time dimension serialization manner, as shown in fig. 14. The design of iterative interaction is based on the assumption that better semantic segmentation results should help achieve better semantic scene completion performance and vice versa, as shown by the semantic structure iterative fusion module in fig. 13. SSCt and SSt respectively represent the three-dimensional scene completion and semantic segmentation results at the time t. The iterative interaction enhancement network assists the three-dimensional scene completion network by introducing a better semantic segmentation result each time so as to improve the performance of the network.

The technical scheme is to use a cross entropy loss function L to train a one-way cascade type or two-way iterative interaction enhancement network. It should be noted that, in the interactive iterative network training process, in order to reduce the training time and enable a task to fully utilize the result updated by another task, the network will update the semantic segmentation network and the three-dimensional scene completion network in turn, i.e. the parameters of the semantic segmentation network are kept unchanged while training the three-dimensional completion network, and vice versa. The cross entropy loss function L can be expressed as:

in the formula, N is the number of voxels, N is the nth voxel, C is the C-th category, C is the total number of semantic categories, w_cIs the weight of the c-th semantic class, y_ncIs the true probability that the nth voxel belongs to class c,

is the predicted probability that the nth voxel belongs to class c,

is the prediction probability that the nth voxel belongs to the c ' th class, and c ' is the c ' th semantic class.

In summary, for the modeling manner of the multi-task joint learning, the technical scheme provided by this embodiment explores advantages and disadvantages of the unidirectional cascade and bidirectional iterative connection manner, and combines the two-dimensional semantic segmentation and the three-dimensional scene completion by applying the interactive iterative learning manner for the first time, so that data of different modalities of RGB-D can be mutually utilized and the sparse three-dimensional semantic scene completion result is improved. Preliminary experiments show that on one hand, the parallel iterative interaction mode avoids semantic segmentation error accumulation and transmits the semantic segmentation error to the three-dimensional scene completion network, and on the other hand, the interaction mode with continuous iterative updating can ensure that the three-dimensional semantic scene completion task can apply the latest segmentation result every time, and IoU of the three-dimensional scene completion task exceeds 80%.

Although deep web learning achieves very good performance in the visual task, it presents very high challenges to the storage, power consumption, etc. of the system. For some application scenarios, such as mobile systems like robots, there are severe requirements in terms of storage, effort and power consumption. Therefore, in order to adapt to the actual requirements of such application scenarios and ensure that the research results of the technical scheme of the present application can be really and practically applied, the present embodiment is intended to research the model compression and acceleration method of the established three-dimensional semantic scene completion network, and find a balance point between high precision and low complexity. On one hand, in order to conveniently use the traditional convolution operation, most of the existing methods express the three-dimensional semantic scene completion characteristics based on the ordered voxel format, but the voxelized data format usually occupies high memory and consumes large resources, and only can output a prediction result with low resolution, which is not favorable for capturing fine structures and semantically segmenting small objects. Therefore, there is a need to study how to express the characteristics of three-dimensional space in a more concise and effective data format, such as exploring the possibility of expressing the result of three-dimensional semantic scene completion by using three-dimensional point cloud; on the other hand, the existing optimal algorithm has a deeper network layer, and the complexity of the model is also increased by the multiple iteration structures in the two research contents, so that a compression and acceleration method suitable for the joint learning model mentioned in this embodiment needs to be researched. For example, the core of the existing network pruning method is to determine parameter redundancy of different layers of the model so as to determine a reasonable layer-by-layer pruning strategy. The network structure is simplified according to the importance degree of each parameter of the artificially calculated network, unimportant nodes or branches are subtracted, and the important network structure is reserved. The latest reinforcement learning-based method can search network pruning strategies, model pruning is regarded as a special case of neural structure search, and the layer-by-layer compression strategies superior to artificial rules are automatically searched by utilizing the capability of well balancing 'utilization' (exploitation) and 'exploration' (exploration) of reinforcement learning. For another example, the conventional three-dimensional convolution layer is degraded into convolution kernels in each dimension, so as to achieve the purpose of reducing the network parameters; and meanwhile, how to control the calculation amount and the iteration number of the iterative optimization to achieve the balance of efficiency and precision is researched. The other idea is to quantize the full-precision model into a low-bit model, perform fixed-point quantization on the parameters and activation of each layer of the network, collect and analyze the data distribution of the parameters and activation of each layer of the network, and determine the bit number by automatic search in combination with the task type and the hardware platform; on the basis, determining a set of fixed point values after quantization, and quantizing the parameters and the activation to one element in the set of fixed point values by determining a quantization or random quantization method to realize the quantization of the weight and the activation of the neural network; and finally, recovering the precision loss in the quantization process through network fine adjustment. For the specific model provided by the embodiment, because the three-dimensional network uses a large number of standard three-dimensional convolution operations, on one hand, the parameter number and the calculation amount are increased sharply, so that the creation of a deeper network is restricted; on the other hand, the limitation of computational resources causes that the existing methods generally output low resolution results, so that object details and some small objects may be overlooked. To solve this problem, DDRNet first uses a dimension Decomposition Residual module (DDR) to change a standard three-Dimensional convolution kernel k × k × k into three consecutive dimension reduction convolution kernels [ (1 × 1 × k), (1 × k × 1), (k × 1 × 1) ], where k is the convolution kernel size, which changes the number of convolution kernel parameters from k × k × k to k + k + k. However, the three-dimensional indoor scene contains a large amount of plane information, such as walls, floors, desktops, etc., and the DDR operation may damage the plane structures, as shown in fig. 15, and planes with different dimensions contain completely different information. For this reason, the present embodiment proposes a new convolution operation to extract planar features, that is, this operation decomposes the conventional 3D convolution kernel along three dimensions, i.e., [ (1 × k × k), (k × k × 1), (k × 1 × k) ], as shown in fig. 16 and 17, which show a comparison graph between the reduced-dimension residual network and the standard residual network block when k is 5. The new convolution proposed by the embodiment not only can well retain the plane characteristics, but also can effectively reduce the parameter quantity, reduce the calculation complexity and realize the balance between the precision and the parameter quantity compared with the standard three-dimensional convolution operation.

In view of this, before performing two-dimensional semantic segmentation and monocular depth estimation on the two-dimensional image data of the RGB-D multimodal data of the three-dimensional scene to be perceived respectively to obtain the two-dimensional semantic features and the two-dimensional structural features, the method may further include:

constructing a semantic segmentation backbone network comprising a two-dimensional semantic segmentation network and a three-dimensional semantic segmentation network and a scene completion backbone network comprising a depth estimation network and a scene completion network based on a deep neural network model; decomposing a target three-dimensional convolution kernel of the semantic segmentation trunk network and the scene completion trunk network along three dimensions; one or more or all three-dimensional convolution kernels in the two-dimensional semantic segmentation network and the depth estimation network are decomposed along three dimensions; one or more or all three-dimensional convolution kernels in the three-dimensional semantic segmentation network and the scene completion network are decomposed along three dimensions. The number of target convolution kernels can be flexibly selected according to the actual application scenario, and the application does not limit the number of the target convolution kernels.

The two-dimensional semantic segmentation network is used for performing two-dimensional semantic segmentation on input two-dimensional image data to obtain two-dimensional semantic features; the three-dimensional semantic segmentation network is used for performing three-dimensional semantic segmentation on the input three-dimensional depth data to obtain three-dimensional semantic features; the depth estimation network is used for performing monocular depth estimation on input two-dimensional image data to obtain two-dimensional structural features; the scene completion network is used for performing three-dimensional scene completion on the input three-dimensional depth data to obtain three-dimensional structural characteristics. Correspondingly, the implementation process of performing two-dimensional semantic segmentation and monocular depth estimation on the two-dimensional image data of the RGB-D multimodal data of the three-dimensional scene to be perceived in step S201 of the above embodiment may include:

inputting two-dimensional image data into a two-dimensional semantic segmentation network to obtain two-dimensional semantic features; inputting two-dimensional image data into a depth estimation network to obtain two-dimensional structural features;

correspondingly, the implementation process of performing three-dimensional semantic segmentation and three-dimensional scene completion on the three-dimensional depth data of the RGB-D multimodal data in step S202 in the above embodiment may include:

inputting the three-dimensional depth data into a three-dimensional semantic segmentation network to obtain three-dimensional semantic features; and inputting the three-dimensional depth data into a scene completion network to obtain the three-dimensional structural characteristics.

As an optional implementation mode, the target three-dimensional convolution kernel of the bidirectional iterative interaction enhancement network can also be decomposed along three dimensions. The number of target convolution kernels can be flexibly selected according to the actual application scenario, and the application does not limit the number of the target convolution kernels.

As can be seen from the above, the lightweight model proposed by the embodiment of the invention will achieve a balance between high accuracy and low parameter, not only preserve the planar features of the space as much as possible, but also simplify the standard three-dimensional convolution operation, and reduce the computational complexity. Therefore, the problem of high calculation complexity of the three-dimensional data is solved, and the balance between high precision and low parameter quantity is realized. By simplifying the deep convolutional neural network model, the frame rate of the three-dimensional semantic scene completion algorithm in the inference stage reaches 10fps close to real time.

In addition, most of the existing methods express the three-dimensional semantic scene completion characteristics based on an ordered voxel format, but the voxelized data format usually occupies a high memory and consumes a large amount of resources, so that the embodiment also provides a more effective data expression form, that is, a three-dimensional point cloud is used for expressing the result of the three-dimensional semantic scene completion, so as to reduce the memory consumption, and finally, the method provided by the scheme realizes the balance between high precision and low computation complexity, as shown in fig. 18.

The embodiment of the invention also provides a corresponding device for the three-dimensional scene perception method, so that the method has higher practicability. Wherein the means can be described separately from the functional module point of view and the hardware point of view. In the following, the three-dimensional scene sensing device provided by the embodiment of the present invention is introduced, and the three-dimensional scene sensing device described below and the three-dimensional scene sensing method described above may be referred to in a corresponding manner.

Based on the angle of the functional module, referring to fig. 19, fig. 19 is a structural diagram of a three-dimensional scene sensing device according to an embodiment of the present invention, in a specific implementation manner, where the device may include:

the two-dimensional feature extraction module 191 is used for performing two-dimensional semantic segmentation and monocular depth estimation on the two-dimensional image data of the RGB-D multi-modal data of the three-dimensional scene to be perceived respectively to obtain two-dimensional semantic features and two-dimensional structural features;

the three-dimensional feature extraction module 192 is configured to perform three-dimensional semantic segmentation and three-dimensional scene completion on the three-dimensional depth data of the RGB-D multimodal data, respectively, to obtain three-dimensional semantic features and three-dimensional structural features;

the semantic structure feature fusion module 193 is configured to perform feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fusion semantic features, and perform feature fusion on the two-dimensional structure features and the three-dimensional structure features to obtain fusion structure features;

and the semantic scene completion module 194 is configured to perform three-dimensional semantic scene completion on the three-dimensional scene to be perceived through a semantic structure iterative fusion manner based on the fusion semantic features and the fusion structural features, so as to obtain semantic category information and three-dimensional scene structural information of the three-dimensional scene to be perceived.

Optionally, in some embodiments of this embodiment, the semantic structure feature fusion module 193 includes a semantic feature fusion unit and a structure feature fusion unit;

the semantic feature fusion unit may be configured to: constructing a DCP module based on a deformable convolution neural network and a text pyramid structure in advance; projecting the two-dimensional semantic features to a three-dimensional space to obtain three-dimensional projection features; converting the three-dimensional projection characteristics into three-dimensional standard projection characteristics with the same channel number as the three-dimensional semantic characteristics; inputting the three-dimensional projection characteristics into a DCP module to obtain three-dimensional enhancement characteristics; projecting the three-dimensional primary fusion semantic features synthesized by the three-dimensional enhancement features, the three-dimensional standard projection features and the three-dimensional semantic features to a plane space to obtain two-dimensional fusion semantic features; and inputting the three-dimensional primary fusion semantic features into a depth attention module to obtain the three-dimensional fusion semantic features.

The structural feature fusion unit may be configured to: projecting the three-dimensional structural feature to a plane space to obtain a two-dimensional projection feature; synthesizing the two-dimensional projection characteristics and the two-dimensional structure characteristics to obtain two-dimensional fusion structure characteristics; and projecting the two-dimensional fusion structural features to a three-dimensional space to obtain three-dimensional primary fusion structural features, and synthesizing the three-dimensional primary fusion structural features and the three-dimensional structural features to obtain the three-dimensional fusion structural features.

Optionally, in another implementation manner of this embodiment, the apparatus further includes a model building module, configured to build, based on the deep neural network model, a semantic segmentation backbone network including a two-dimensional semantic segmentation network and a three-dimensional semantic segmentation network, and a scene completion backbone network including a depth estimation network and a scene completion network; decomposing a target three-dimensional convolution kernel of the semantic segmentation trunk network and the scene completion trunk network along three dimensions; the two-dimensional semantic segmentation network performs two-dimensional semantic segmentation on two-dimensional image data to obtain two-dimensional semantic features; the three-dimensional semantic segmentation network carries out three-dimensional semantic segmentation on the three-dimensional depth data to obtain three-dimensional semantic features; the depth estimation network carries out monocular depth estimation on the two-dimensional image data to obtain two-dimensional structural characteristics; and the scene completion network performs three-dimensional scene completion on the three-dimensional depth data to obtain three-dimensional structural characteristics.

Optionally, in some other embodiments of this embodiment, the semantic scene completing module 194 may be further configured to: constructing a bidirectional iterative interaction enhancement network in advance based on a deep neural network model, wherein the bidirectional iterative interaction enhancement network comprises a semantic auxiliary structure module, a structure auxiliary semantic module and a semantic structure feature fusion module; inputting the fusion semantic features and the fusion structural features into a bidirectional iterative interaction enhancement network, executing a scene completion task based on the fusion semantic features by using a semantic auxiliary structural module, and executing a semantic segmentation task based on the fusion structural features by using a structural auxiliary semantic module; and continuously and iteratively fusing the execution result of the scene completion task and the execution result of the semantic segmentation task by using a semantic feature fusion module until an iteration ending condition is met, so as to obtain semantic category information and three-dimensional scene structure information.

As an optional implementation manner of this embodiment, the target three-dimensional convolution kernel of the bidirectional iterative interaction enhancement network is decomposed along three dimensions.

The functions of the functional modules of the three-dimensional scene sensing device according to the embodiments of the present invention may be specifically implemented according to the method in the above method embodiments, and the specific implementation process may refer to the description related to the above method embodiments, which is not described herein again.

The three-dimensional scene sensing device mentioned above is described from the perspective of functional modules, and further, the present application also provides an electronic device described from the perspective of hardware. Fig. 20 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 20, the electronic device includes a memory 20 for storing a computer program; a processor 21, configured to implement the steps of the three-dimensional scene perception method as mentioned in any of the above embodiments when executing the computer program.

The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing a calculation operation related to machine learning.

The memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, wherein after being loaded and executed by the processor 21, the computer program can implement the relevant steps of the three-dimensional scene perception method disclosed in any one of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, data corresponding to three-dimensional scene perception results, and the like.

In some embodiments, the electronic device may further include a display 22, an input/output interface 23, a communication interface 24, which may be referred to as a network interface, a power supply 25, and a communication bus 26. The display 22 and the input/output interface 23, such as a Keyboard (Keyboard), belong to a user interface, and the optional user interface may also include a standard wired interface, a wireless interface, and the like. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, as appropriate, is used for displaying information processed in the electronic device and for displaying a visualized user interface. The communication interface 24 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a bluetooth interface, etc., typically used to establish a communication connection between an electronic device and other electronic devices. The communication bus 26 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 20, but this is not intended to represent only one bus or type of bus.

Those skilled in the art will appreciate that the configuration shown in fig. 20 is not intended to be limiting of the electronic device and may include more or fewer components than those shown, such as sensors 27 that perform various functions.

The functions of the functional modules of the electronic device according to the embodiments of the present invention may be specifically implemented according to the method in the above method embodiments, and the specific implementation process may refer to the description related to the above method embodiments, which is not described herein again.

Therefore, the embodiment of the invention realizes efficient and accurate three-dimensional scene perception, and can meet the requirements of practical application scenes.

It is to be understood that, if the three-dimensional scene perception method in the above embodiments is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application may be substantially or partially implemented in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application, or all or part of the technical solutions. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and other various media capable of storing program codes.

Based on this, the embodiment of the present invention further provides a readable storage medium, which stores a computer program, and the computer program is executed by a processor, and the steps of the three-dimensional scene sensing method according to any one of the above embodiments are provided.

An embodiment of the present invention further provides a robot, please refer to fig. 21, which may include a computer vision processor 211 and an RGB-D camera 212. The computer vision processor 211 is connected with the RGB-D camera 212, and the RGB-D camera 212 sends the collected RGB-D multi-modal data of the three-dimensional scene to be perceived to the computer vision processor 211; the computer vision processor 211, when executing a computer program stored in a memory, implements the steps of the three-dimensional scene perception method as recited in any of the embodiments above.

The robot may be any type of robot, such as an indoor robot, a rail-mounted robot, a patrol robot, and the like. The RGB-D camera 212 may be any camera that simultaneously captures RGB images and depth images in the same scene. In this embodiment, a robot executes a target location task as an example to describe its implementation scheme, when the target location task is received, the robot sends an image acquisition instruction to the RGB-D camera 212, the RGB-D camera 212 sends an acquired image of the surrounding environment to the computer vision processor, the computer vision processor calls a three-dimensional scene perception computer program stored in the memory to obtain three-dimensional scene structure information and corresponding semantic information of the environment where the robot is located, generates a three-dimensional semantic map based on the three-dimensional scene structure information and the corresponding semantic information, searches for a target position in the three-dimensional semantic map, performs path planning based on the searched target position and a current position, and then moves to the target along the planned path. The embodiment combines deep learning and multi-task joint learning technologies, takes RGB-D multi-modal data as input, obviously improves the result of three-dimensional scene semantic completion, and can improve the three-dimensional scene understanding capability of the mobile robot, so that the mobile robot can adapt to environmental perception and semantic map construction tasks under indoor complex conditions.

The functions of the functional modules of the robot according to the embodiments of the present invention may be specifically implemented according to the method in the embodiments of the method, and the specific implementation process may refer to the description related to the embodiments of the method, which is not described herein again.

The robot can efficiently and accurately sense the three-dimensional scene of the space, and the task execution accuracy is improved.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. For hardware including devices and electronic equipment disclosed by the embodiment, the description is relatively simple because the hardware includes the devices and the electronic equipment correspond to the method disclosed by the embodiment, and the relevant points can be obtained by referring to the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The three-dimensional scene sensing method and device, the robot, the electronic device and the readable storage medium provided by the application are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present application.

Claims

1. A method for three-dimensional scene perception, comprising:

2. The three-dimensional scene perception method according to claim 1, wherein the process of performing feature fusion on the two-dimensional semantic features and the three-dimensional semantic features to obtain fused semantic features includes:

3. The three-dimensional scene perception method according to claim 1, wherein the process of performing feature fusion on the two-dimensional structural feature and the three-dimensional structural feature to obtain a fused structural feature includes:

4. The three-dimensional scene perception method according to claim 1, wherein before the two-dimensional image data of the RGB-D multi-modal data of the three-dimensional scene to be perceived is subjected to two-dimensional semantic segmentation and monocular depth estimation, respectively, to obtain two-dimensional semantic features and two-dimensional structural features, the method further comprises:

5. The three-dimensional scene sensing method according to any one of claims 1 to 4, wherein the obtaining of the semantic category information and the three-dimensional scene structure information of the three-dimensional scene to be sensed by performing three-dimensional semantic scene completion on the three-dimensional scene to be sensed through a parallel interactive iterative fusion mode based on the fusion semantic features and the fusion structure features comprises:

6. The three-dimensional scene awareness method of claim 5, wherein the target three-dimensional convolution kernel of the bi-directional iterative interaction enhancement network is decomposed along three dimensions.

7. A three-dimensional scene awareness apparatus, comprising:

8. A robot comprising a computer vision processor and an RGB-D camera;

the computer vision processor, when executing a computer program stored in a memory, carries out the steps of the three-dimensional scene perception method according to any of the claims 1 to 6.

9. An electronic device, characterized in that it comprises a processor and a memory, said processor being adapted to implement the steps of the three-dimensional scene perception method according to any of claims 1 to 6 when executing a computer program stored in said memory.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the three-dimensional scene perception method according to any one of claims 1 to 6.