CN115601506B - Reconstruction method of three-dimensional scene, electronic equipment and medium - Google Patents
Reconstruction method of three-dimensional scene, electronic equipment and medium
- Publication number
- CN115601506B (application CN202211382693.6A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- scene
- image
- point
- voxel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Geometry (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computer Graphics (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a three-dimensional scene reconstruction method in which, for each input point, the directed distance is first predicted from a depth image to obtain a voxel point, and the voxel point is then rendered by combining image features from an RGB image so as to reconstruct the three-dimensional scene. Because image features extracted from the RGB image are incorporated during reconstruction, objects in the scene can be instance-segmented and rendered with richer colors, finally yielding a reconstructed three-dimensional scene that is closer to the original three-dimensional scene.
Description
Technical Field
The present invention relates to the field of robotics, and in particular, to a three-dimensional scene reconstruction method, electronic device, and medium.
Background
In robotics tasks, a robot must acquire scene data through its sensors and then interpret that data in order to perceive and understand the task scene and complete the corresponding task. The space in which the robot executes its task is three-dimensional, but the image data collected by existing sensors, such as an RGB image, contains only two-dimensional plane information; relying on the RGB image alone, the robot can only complete tasks confined to a single plane. Therefore, the ability to reconstruct a three-dimensional model from two-dimensional information is of great help in executing robot tasks.
Three-dimensional space is generally represented in one of three ways: voxels, point clouds, or meshes. These representations are all discrete, while the real scene is continuous, and the memory they occupy grows sharply with increasing resolution. Therefore, in practical applications, geometric reconstruction is performed with an implicit-network-based method, which encodes the scene effectively at a low memory cost and, because the implicit network is continuous, can reconstruct a high-resolution three-dimensional model.
If a three-dimensional space within a certain range is voxelized, a voxel is considered occupied when it contains a surface boundary, so the three-dimensional space can be represented by the probability that each voxel is occupied. An occupancy network implicitly represents the three-dimensional surface as the continuous decision boundary of a deep neural network classifier. Based on this idea, one line of research proposes a three-dimensional reconstruction method that divides the robot operating space into an N_res x N_res x N_res grid along the x, y and z axes, stores in each cell the probability that the corresponding space is occupied as the three-dimensional grid feature of the operating space, projects the three-dimensional features onto the xz, yz and xy planes to convert them into plane features, decodes the grasp pose [qual, rot, width] from the plane features, and reconstructs the grasp scene. However, the input to this method is only the depth image (RGB-D), which contains only the geometric information of the scene; since the value of each voxel represents the probability that the voxel is occupied and lies in the range [0, 1], an occupancy-network-based method can only determine where objects are and are not present in the scene, and cannot distinguish the objects themselves. Moreover, when data is collected in simulation, the grasp sampling points are selected randomly, so points carrying valid task-execution labels are rare in any one scene: for example, a scene's point cloud may contain 3000 points of which only 0-3% carry a grasp success or failure label. Finally, during training the method predicts [qual, rot, width] at a single point only, without considering the local region around that point, which differs from the actual situation.
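As a rough illustration of the occupancy-style baseline described above (not the cited method's actual code), the sketch below voxelizes the workspace, pools the 3D grid features onto the xz, yz and xy planes, and decodes [qual, rot, width] from features sampled at a query point; the resolution, feature width, pooling operator and layer sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_RES = 40  # assumed resolution: the operating space is split into N_RES^3 voxels

def project_to_planes(grid):
    """grid: (C, N, N, N) voxel features -> (C, N, N) features on the xz, yz and xy planes.
    Mean pooling along one axis stands in for the projection step described above."""
    plane_xz = grid.mean(dim=2)  # collapse the y axis
    plane_yz = grid.mean(dim=1)  # collapse the x axis
    plane_xy = grid.mean(dim=3)  # collapse the z axis
    return plane_xz, plane_yz, plane_xy

class GraspDecoder(nn.Module):
    """Decodes the grasp pose [qual, rot, width] from plane features sampled at a point."""
    def __init__(self, feat_dim=3 * 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1 + 4 + 1))  # quality + quaternion + width
    def forward(self, point_feats):
        out = self.mlp(point_feats)
        qual = torch.sigmoid(out[..., :1])        # grasp quality in [0, 1]
        rot = F.normalize(out[..., 1:5], dim=-1)  # unit quaternion
        width = out[..., 5:]                      # gripper opening width
        return qual, rot, width
```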
Disclosure of Invention
In view of some or all of the problems in the prior art, a first aspect of the present invention provides a three-dimensional scene reconstruction method that takes an RGB image and a depth image (RGB-D) as inputs and resolves the multi-sensor data to obtain a three-dimensional model, wherein the reconstruction method comprises the following operations for each input point:
predicting the directed distance of the input point based on the depth image to obtain a voxel point; and
rendering the voxel point by combining the image features, thereby reconstructing the three-dimensional scene.
Further, the RGB image and the depth image each comprise a plurality of images, and each image is captured from a different shooting angle.
Further, predicting the directed distance comprises:
sampling the two-dimensional plane geometric features around the input point; and
predicting the directed distance of the input point from the two-dimensional plane geometric features through a pre-trained neural network.
Further, the two-dimensional plane geometric features are obtained from the depth image, which comprises:
Determining the directed distance of each point in the scene to be reconstructed according to the depth image to form point cloud data;
extracting three-dimensional grid features according to the point cloud data; and
projecting the three-dimensional grid features onto the xz, zy and xy planes respectively, and converting them into two-dimensional plane geometric features.
Further, the image features include color and texture information for each point.
Further, rendering the voxel point comprises:
predicting instance labels, colors and textures of the voxel points from the directed distance and the image features through a pre-trained neural network, so as to render the voxel points.
Based on the reconstruction method of a three-dimensional scene as described above, a second aspect of the invention provides an electronic device for three-dimensional scene reconstruction, comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run by the processor, performs the reconstruction method of a three-dimensional scene as described above.
The third aspect of the present invention also provides a computer readable storage medium storing a computer program which, when run on a processor, performs a method of reconstructing a three-dimensional scene as described above.
A fourth aspect of the invention further provides a grasping robot comprising the computer readable storage medium, which can identify the object to be grasped and its position information through the three-dimensional scene reconstruction method.
The three-dimensional scene reconstruction method provided by the invention takes RGB and RGB-D images acquired from a simulation environment as data input, performs the corresponding computations on the data, and reconstructs the simulated scene with an implicit network. Because image features extracted from the RGB image are incorporated during reconstruction, objects in the scene can be instance-segmented and rendered with richer colors, finally yielding a reconstructed three-dimensional scene that is closer to the original. Applied to a grasping robot, the reconstruction method enables the robot to understand geometric information such as the position and size of an object, to recognize the relationships between objects and distinguish multiple classes of objects in the scene, and finally to better complete the corresponding tasks in a real environment.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, for clarity, the same or corresponding parts will be designated by the same or similar reference numerals.
FIG. 1 is a flow chart of a method of reconstructing a three-dimensional scene according to an embodiment of the present invention; and
Fig. 2 shows a schematic workflow diagram of a grasping robot according to an embodiment of the invention.
Detailed Description
In the following description, the present invention is described with reference to various embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods or components. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention. Similarly, for purposes of explanation, specific numbers and configurations are set forth in order to provide a thorough understanding of embodiments of the present invention. However, the invention is not limited to these specific details.
Reference throughout this specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
It should be noted that the embodiments of the present invention describe the steps of the method in a specific order, however, this is merely for the purpose of illustrating the specific embodiments, and not for limiting the order of the steps. In contrast, in different embodiments of the present invention, the sequence of each step may be adjusted according to the adjustment of the actual requirement.
High-resolution three-dimensional scene representations help improve a robot's ability to perceive its environment, but they occupy a large amount of memory, and acquiring three-dimensional scene data from a real environment is complex and time-consuming. A representation based on implicit-network geometric modeling can encode the scene effectively at a low memory cost and, because the implicit network is continuous, can reconstruct a high-resolution three-dimensional model, enhancing the robot's ability to perceive scene geometry while reducing memory usage. Existing implicit-network geometric reconstruction mainly follows two approaches: directed distance fields and occupancy networks. A directed distance field (Signed Distance Field, SDF) represents the surface of an object, within a certain spatial range, by the nearest distance from any point in space to the object surface together with that point's direction relative to the surface. The distance and the direction together form the directed distance, and three-dimensional space can then be represented as a function between the directed distance and the boundary, which can be expressed by a deep directed-distance-field implicit network. In the prior art, however, usually only a depth image is used as input, so the resulting three-dimensional model contains only the geometric information of the scene, and it is difficult to distinguish the objects in the scene. On this basis, the invention provides a three-dimensional scene reconstruction method that takes RGB and RGB-D images acquired from a simulation environment as data input, performs the corresponding computations on the data, and reconstructs the simulated scene with an implicit network, so that the robot can understand geometric information such as the position and size of an object, can recognize the relationships between objects, and can finally complete the corresponding tasks better in a real environment.
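To make the directed-distance definition above concrete, the toy example below (an illustration only, not part of the invention) evaluates the signed distance field of a sphere: negative inside the surface, zero on it, positive outside, with the surface recovered as the zero level set.

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance from point p to the surface of a sphere."""
    p, center = np.asarray(p, dtype=float), np.asarray(center, dtype=float)
    return np.linalg.norm(p - center) - radius

print(sphere_sdf([0.0, 0.0, 0.0], [0.0, 0.0, 0.5], 0.3))  # ~0.2  -> outside the sphere
print(sphere_sdf([0.0, 0.0, 0.5], [0.0, 0.0, 0.5], 0.3))  # -0.3  -> inside the sphere
```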
The embodiments of the present invention will be further described with reference to the drawings.
Fig. 1 shows a schematic flow chart of a method for reconstructing a three-dimensional scene according to an embodiment of the invention. As shown in fig. 1, a method for reconstructing a three-dimensional scene includes:
First, in step 101, the directed distance is predicted. For a given input point, the directed distance is first predicted based on the depth image to obtain a voxel point, and the scene to be reconstructed is voxelized based on the depth image. In one embodiment of the invention, the voxelization comprises: first obtaining the two-dimensional plane geometric features of the input point, and then predicting the directed distance of the input point from those features through a pre-trained neural network. To avoid the inaccurate modeling of local geometry caused by sampling the feature of only a single point, in one embodiment of the invention the local geometric information around the input point is also taken into account during prediction; that is, the features of several point-cloud data points locally adjacent to the input point are fused, thereby modeling the local geometry around the input point. Specifically, during voxelization, in addition to the two-dimensional plane geometric features of the input point itself, the two-dimensional plane geometric features of the surrounding points are also sampled. In one embodiment of the invention, the two-dimensional plane geometric features are obtained from the depth image: first, the directed distance of each point in the scene to be reconstructed is determined from the depth image to form point cloud data; then three-dimensional grid features are extracted from the point cloud data; and finally the three-dimensional grid features are projected onto the xz, zy and xy planes respectively and converted into two-dimensional plane geometric features (a sketch of this step is given after step 102); and
Finally, in step 102, the voxel points are rendered. The voxel points obtained in the previous step are rendered by combining the image features, and the three-dimensional scene is reconstructed. In one embodiment of the present invention, image features, including at least textures and colors, are first extracted from the RGB image; then, through a pre-trained neural network, the instance label, texture and color of each voxel point in the scene to be reconstructed are predicted by combining the geometric features obtained in the previous step with the image features, and the voxel points are rendered, finally yielding the reconstructed scene model. Minimal sketches of both steps follow.
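The sketch below illustrates step 101 under assumed tensor layouts: for a query point, the three plane feature maps built from the depth image are sampled not only at the point's own projection but also at a few neighbouring offsets, the sampled features are fused, and a small network predicts the directed distance. The grid resolution, feature width, neighbourhood pattern and the names sample_plane and SDFHead are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_plane(plane, uv):
    """plane: (1, C, H, W) feature map; uv: (K, 2) coordinates in [-1, 1]. Returns (K, C)."""
    grid = uv.view(1, 1, -1, 2)
    feats = F.grid_sample(plane, grid, align_corners=True)  # (1, C, 1, K)
    return feats[0, :, 0, :].t()

class SDFHead(nn.Module):
    """Predicts the directed distance of a point from fused local plane features."""
    def __init__(self, c=32, n_neighbors=5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * c * n_neighbors, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, planes, p):
        """planes: dict of 'xz'/'zy'/'xy' feature maps; p: (3,) query point in [-1, 1]^3."""
        # sample the point's own projection plus four nearby offsets on each plane
        offsets = torch.tensor([[0., 0.], [.05, 0.], [-.05, 0.], [0., .05], [0., -.05]])
        proj = {"xz": p[[0, 2]], "zy": p[[2, 1]], "xy": p[[0, 1]]}
        feats = [sample_plane(planes[k], proj[k] + offsets) for k in ("xz", "zy", "xy")]
        fused = torch.cat([f.flatten() for f in feats])  # fuse the local neighbourhood
        return self.mlp(fused)                           # predicted directed distance
```

And a similarly hedged sketch of step 102: a second pre-trained network takes the fused geometric feature of a voxel point together with image features sampled from the RGB views and predicts an instance label, an RGB color and a texture descriptor used to render that voxel; the dimensions and instance count are assumed.

```python
import torch
import torch.nn as nn

class RenderHead(nn.Module):
    """Predicts instance label, color and texture for one voxel point."""
    def __init__(self, geo_dim=128, img_dim=64, n_instances=8, tex_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(geo_dim + img_dim, 256), nn.ReLU(),
                                      nn.Linear(256, 256), nn.ReLU())
        self.instance = nn.Linear(256, n_instances)  # per-voxel instance logits
        self.color = nn.Linear(256, 3)               # RGB color
        self.texture = nn.Linear(256, tex_dim)       # texture descriptor

    def forward(self, geo_feat, img_feat):
        h = self.backbone(torch.cat([geo_feat, img_feat], dim=-1))
        return self.instance(h), torch.sigmoid(self.color(h)), self.texture(h)
```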
In one embodiment of the present invention, the RGB images and depth images used as data input comprise a plurality of images, each captured from a different shooting angle.
Based on the reconstruction method of a three-dimensional scene as described above, the invention further provides an electronic device for three-dimensional scene reconstruction, comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run by the processor, performs the reconstruction method of a three-dimensional scene as described above.
The invention also provides a computer readable storage medium storing a computer program which when run on a processor performs a method of reconstructing a three-dimensional scene as described above.
The invention also provides a grasping robot which can identify the object to be grasped and its position information through the three-dimensional scene reconstruction method. The grasping robot decomposes the grasping task into two parts: grasp prediction and three-dimensional scene reconstruction. As shown in fig. 2, the method comprises three steps: data input, multi-sensor data calculation, and result output:
First, data input. A number of textured three-dimensional object models are placed randomly on a tabletop in a simulation environment, and RGB and RGB-D depth images of the same scene are recorded from multiple shooting angles. A large number of points are then selected randomly within the robot's operating space in each scene, the grasping robot attempts to execute the task at each of them in turn, and the task execution data and simulation results are stored. The RGB-D depth image is processed, the operating space is voxelized, and the occupancy probability and directed distance of each voxel point are labeled. The operating space is sampled densely; for each sampling point its position and instance label are recorded and its directed distance is computed. The color, texture and other image information of the sampling points is recorded from the RGB image (a sketch of one resulting training record is given after this list);
Next, multi-sensor data calculation. The point cloud, RGB image, RGB-D image and other information of the three-dimensional scene are read, where the point cloud is obtained from the RGB-D image; the operating space is voxelized to extract three-dimensional grid features, which are projected onto the xz, zy and xy planes respectively and converted into two-dimensional plane geometric features. For a given input point, the two-dimensional plane geometric features at and around the point are sampled, and from the sampled geometric features the directed distance, the grasp quality of the point and the six-degree-of-freedom grasp pose are predicted. Image features such as textures and colors are extracted from the input image data; the instance label, texture and color of each voxel point in the operating space are predicted from the geometric features and the image features, and the voxel points are rendered, finally yielding the reconstructed scene model; and
Finally, result output. Through the multi-sensor data calculation, the model predicts the six-degree-of-freedom grasp pose of the input point and assesses its quality, while rendering each voxel point of the operating space to obtain the reconstructed three-dimensional scene model.
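A hedged sketch of what one training record produced by the simulated data pipeline above might look like; the field names and types are assumptions for illustration, not the patent's actual storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class SamplePoint:
    position: np.ndarray                  # (3,) point sampled densely in the operating space
    directed_distance: float              # signed distance to the nearest object surface
    instance_label: int                   # which object the point belongs to (e.g. 0 = free space)
    color: np.ndarray                     # (3,) RGB value recorded from the color images
    grasp_success: Optional[bool] = None  # label present only for the sparse grasp trials

@dataclass
class SceneRecord:
    rgb_images: List[np.ndarray]          # multi-view RGB images of the scene
    depth_images: List[np.ndarray]        # matching RGB-D depth images
    occupancy: np.ndarray                 # (N, N, N) per-voxel occupancy probabilities
    samples: List[SamplePoint] = field(default_factory=list)
```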
The grasping robot combines implicit geometric modeling with scene voxel rendering to achieve accurate three-dimensional perception from multi-sensor data based on RGB and RGB-D images. The resulting three-dimensional reconstructed scene is geometrically accurate and adds image feature information, such as textures and colors, that existing methods do not contain, improving the robot's ability to perceive an unknown environment. Taking the multi-sensor data of RGB and RGB-D images as input, the method resolves the data with an implicit geometric modeling approach: the RGB-D image data is used mainly to compute the geometric features of the scene, while the supplementary RGB image data completes image feature information such as the textures and colors of the reconstructed scene. After the scene is voxelized, its voxels are rendered by combining the geometric and image features, finally yielding a reconstructed three-dimensional scene that is closer to the original.
In addition, the data input to the multi-sensor calculation contains information such as the spatial position and directed distance of the input points, so that in the more accurate three-dimensional reconstructed scene the robot can learn geometric information such as its distance to an object and the object's size, and can predict the instance label of every voxel to distinguish multiple categories of objects in the scene, thereby improving the success rate of the grasping task.
Finally, compared with existing methods that use only the features of a single grasp point, the present method takes the local geometric information around the grasp point into account during grasp prediction; that is, it fuses the features of several point-cloud data points locally adjacent to the grasp point, thereby modeling the local geometry around the grasp point. This effectively avoids the inaccurate modeling of local geometry caused by sampling the feature of only one point, and enables a more comprehensive assessment of the grasp quality at the manipulator's grasp point.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to those skilled in the relevant art that various combinations, modifications, and variations can be made therein without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention as disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims (6)
1. A method for reconstructing a three-dimensional scene, wherein for each input point the following operations are performed:
predicting the directed distance of the input point based on the depth image to obtain a voxel point, comprising:
sampling the two-dimensional plane geometric features of the input point and its surrounding points, wherein the two-dimensional plane geometric features are obtained from the depth image, comprising:
determining the directed distance of each point in the scene to be reconstructed according to the depth image to form point cloud data;
extracting three-dimensional grid features according to the point cloud data; and
projecting the three-dimensional grid features onto the xz, zy and xy planes respectively, and converting them into two-dimensional plane geometric features; and
predicting the directed distance, the grasp quality and the six-degree-of-freedom grasp pose of the input point from the two-dimensional plane geometric features through a pre-trained neural network; and rendering the voxel points according to image features in the RGB image, comprising: predicting, from the two-dimensional plane geometric features and the image features, the instance label, texture and color of each voxel point in the operating space, and rendering the voxel points to finally obtain a reconstructed scene model, wherein the image features comprise the color and texture information of each point, the RGB images and the depth images comprise a plurality of images, and the shooting angles of the images are different.
2. The reconstruction method according to claim 1, further comprising the step of:
and executing the robot task by using the three-dimensional scene.
3. The reconstruction method according to claim 1, wherein rendering the voxel points comprises:
predicting the instance labels, colors and textures of the voxel points from the directed distance and the image features through a pre-trained neural network, so as to render the voxel points.
4. An electronic device for three-dimensional scene reconstruction, comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run by the processor, performs the three-dimensional scene reconstruction method according to any one of claims 1 to 3.
5. A computer-readable storage medium, in which a computer program is stored, which computer program, when run on a processor, performs a method of reconstructing a three-dimensional scene as claimed in any one of claims 1 to 3.
6. A grasping robot comprising the computer-readable storage medium of claim 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211382693.6A CN115601506B (en) | 2022-11-07 | 2022-11-07 | Reconstruction method of three-dimensional scene, electronic equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211382693.6A CN115601506B (en) | 2022-11-07 | 2022-11-07 | Reconstruction method of three-dimensional scene, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115601506A CN115601506A (en) | 2023-01-13 |
CN115601506B (en) | 2024-05-28
Family
ID=84853551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211382693.6A Active CN115601506B (en) | 2022-11-07 | 2022-11-07 | Reconstruction method of three-dimensional scene, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115601506B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107680074A (en) * | 2016-08-02 | 2018-02-09 | 富士通株式会社 | The method and apparatus of geometrical reconstruction object |
CN108961390A (en) * | 2018-06-08 | 2018-12-07 | 华中科技大学 | Real-time three-dimensional method for reconstructing based on depth map |
CN111540045A (en) * | 2020-07-07 | 2020-08-14 | 深圳市优必选科技股份有限公司 | Mechanical arm and three-dimensional reconstruction method and device thereof |
CN114037811A (en) * | 2021-11-18 | 2022-02-11 | 北京优锘科技有限公司 | 3D temperature field graph rendering method, apparatus, medium, and device based on directed distance field |
CN115082609A (en) * | 2022-06-14 | 2022-09-20 | Oppo广东移动通信有限公司 | Image rendering method and device, storage medium and electronic equipment |
CN115222917A (en) * | 2022-07-19 | 2022-10-21 | 腾讯科技(深圳)有限公司 | Training method, device and equipment for three-dimensional reconstruction model and storage medium |
Non-Patent Citations (2)
Title |
---|
3D Reconstruction of Natural Scenes Based on Multi-View Depth Sampling; Jiang Hanqing et al.; Journal of Computer-Aided Design & Computer Graphics; 2015-10-31 (No. 10); pp. 31+44 *
Construction of a Voxel Phantom of a 1-Year-Old Child Based on CT Images of a Physical Phantom; Wang Dong et al.; Atomic Energy Science and Technology; 2016-12-31; pp. 757-732 *
Also Published As
Publication number | Publication date |
---|---|
CN115601506A (en) | 2023-01-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113706714B (en) | New view angle synthesizing method based on depth image and nerve radiation field | |
CN117422884A (en) | Three-dimensional target detection method, system, electronic equipment and storage medium | |
CN118071999B (en) | Multi-view 3D target detection method based on sampling self-adaption continuous NeRF | |
JP2024507727A (en) | Rendering a new image of a scene using a geometric shape recognition neural network conditioned on latent variables | |
CN111524232A (en) | Three-dimensional modeling method and device and server | |
CN115457492A (en) | Target detection method and device, computer equipment and storage medium | |
CN116721210A (en) | Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field | |
CN116993926B (en) | Single-view human body three-dimensional reconstruction method | |
US11544898B2 (en) | Method, computer device and storage medium for real-time urban scene reconstruction | |
CN117115786B (en) | Depth estimation model training method for joint segmentation tracking and application method | |
CN114494433A (en) | Image processing method, device, equipment and computer readable storage medium | |
CN113281779A (en) | 3D object rapid detection method, device, equipment and medium | |
CN115601506B (en) | Reconstruction method of three-dimensional scene, electronic equipment and medium | |
CN111860668A (en) | Point cloud identification method of deep convolution network for original 3D point cloud processing | |
CN116883770A (en) | Training method and device of depth estimation model, electronic equipment and storage medium | |
CN118226421B (en) | Laser radar-camera online calibration method and system based on reflectivity map | |
CN113989938B (en) | Behavior recognition method and device and electronic equipment | |
CN114708513B (en) | Edge building extraction method and system considering corner features | |
CN114241013B (en) | Object anchoring method, anchoring system and storage medium | |
KR102648938B1 (en) | Method and apparatus for 3D image reconstruction based on few-shot neural radiance fields using geometric consistency | |
CN118334255B (en) | High-resolution image three-dimensional reconstruction method, system and medium based on deep learning | |
CN117994414A (en) | Missing details of push point cloud rendering | |
CN117197482A (en) | Three-dimensional semantic occupation prediction method and device based on three-view-angle representation of point cloud cylinder | |
CN118840508A (en) | RGB-D scene reconstruction method based on normal guidance | |
CN115619936A (en) | Method, device and equipment for reconstructing three-dimensional scene and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||