CN115018989B - Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment - Google Patents

Info

Publication number
CN115018989B
CN115018989B (application CN202210704281.3A)
Authority
CN
China
Prior art keywords
information
result
rgb
mapping
dimensional dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210704281.3A
Other languages
Chinese (zh)
Other versions
CN115018989A (en)
Inventor
张举勇 (Zhang Juyong)
蔡泓锐 (Cai Hongrui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210704281.3A priority Critical patent/CN115018989B/en
Publication of CN115018989A publication Critical patent/CN115018989A/en
Application granted granted Critical
Publication of CN115018989B publication Critical patent/CN115018989B/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 — Image analysis
    • G06T7/30 — Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 — Determination of transform parameters for the alignment of images, i.e. image registration, using feature-based methods
    • G06T2215/00 — Indexing scheme for image rendering
    • G06T2215/06 — Curved planar reformation of 3D line structures
    • G06T2215/12 — Shadow map, environment map

Abstract

The invention discloses a three-dimensional dynamic reconstruction method based on an RGB-D sequence, which comprises the following steps: inputting the object to be reconstructed into the three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed. The invention also discloses a three-dimensional dynamic reconstruction training device based on the RGB-D sequence. The invention also discloses an electronic device and a storage medium suitable for the three-dimensional dynamic reconstruction method based on the RGB-D sequence.

Description

Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
Technical Field
The invention relates to the field of three-dimensional dynamic reconstruction, in particular to a three-dimensional dynamic reconstruction method based on an RGB-D sequence, a training device, electronic equipment and a storage medium.
Background
As a classical problem in computer vision, reconstructing a three-dimensional model of a dynamic target object from a video clip has broad prospects in fields such as the metaverse, augmented reality, and virtual reality. In particular, under a monocular setting, because the RGB sequences acquired by a color camera suffer from depth ambiguity and from weak feature variation in poorly textured regions, no existing algorithm can reconstruct a high-quality three-dimensional model of a dynamic object without a template prior. On the other hand, as the cost of depth cameras decreases, many dynamic fusion algorithms for depth maps have been developed; however, because they do not adequately combine color information with depth information, the fidelity of the resulting models is not high and many problems remain.
Disclosure of Invention
In view of the above problems, the present invention provides a three-dimensional dynamic reconstruction method based on RGB-D sequences, a training device, an electronic device, and a storage medium, so as to solve at least one of the above problems.
According to a first aspect of the present invention, there is provided a three-dimensional dynamic reconstruction method based on an RGB-D sequence, comprising:
inputting an object to be reconstructed into a three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed, wherein the three-dimensional dynamic reconstruction model after parameter optimization is obtained by training by the following method:
randomly initializing a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
the method comprises the steps of obtaining the sight direction of a depth camera and the normal vector of a sampling point on a target training object, and obtaining RGB information and depth information of the training object through the depth camera;
preprocessing RGB-D sequence information acquired by a depth camera to obtain mask information of a training object;
performing tensor operation on the mask information and deformation codes of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
mapping the tensor operation result into the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result;
processing the mapping result by using the implicit reference space, and rendering the processing result by using a signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object;
performing a composite operation on the sight direction, the appearance code of the three-dimensional dynamic reconstruction model, and the mapping result, and rendering the result of the composite operation by using a neural radiance field to obtain predicted RGB information of the training object;
inputting the sight direction, normal vector, mask information, RGB information, depth information, predicted mask information, predicted RGB information and predicted depth information into a loss function to obtain a loss value, and optimizing parameters of the three-dimensional dynamic reconstruction model according to the loss value;
and iterating the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimizing operation until the loss value meets the preset condition to obtain the three-dimensional dynamic reconstruction model after parameter optimization.
According to an embodiment of the present invention, mapping the tensor operation result into the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result includes:
performing coordinate transformation on the tensor operation result by using the coordinate transformation modules of the bijective mapping network to obtain a transformation result, wherein the number of coordinate transformation modules in the bijective mapping network is the same as the dimension of the tensor operation result, and each coordinate transformation module is used for performing one coordinate transformation on the tensor operation result;
and mapping the transformation result into the implicit reference space by using the topology-aware network to obtain the mapping result.
According to an embodiment of the present invention, performing coordinate transformation on the tensor operation result by using the coordinate transformation modules of the bijective mapping network to obtain a transformation result includes:
randomly selecting one coordinate axis in the tensor operation result as a transformation reference axis, and translating the tensor values on the transformation reference axis by using a coordinate transformation module to obtain a translation result;
according to the translation result, translating the other coordinate axes in the tensor operation result and rotating them around the transformation reference axis by using the coordinate transformation module, to obtain a one-step transformation result;
and iterating the translation operation and the rotation operation until all the coordinate transformation modules have completed coordinate transformation of the tensor operation result, to obtain the transformation result.
According to an embodiment of the present invention, each of the above transformations is continuously differentiable and strictly satisfies cycle consistency.
According to an embodiment of the invention, the above-mentioned loss functions include a free space loss function and a loss function on the surface.
According to an embodiment of the present invention, the free space loss function includes an RGB supervision loss function, a depth map supervision loss function, a mask map supervision loss function, and a regularization term loss function;
wherein the RGB supervision loss function is determined by equation (1):

$$\mathcal{L}_{rgb} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \big\| \hat{C}(r) - C(r) \big\|_1 \tag{1}$$

the depth map supervision loss function is determined by equation (2):

$$\mathcal{L}_{depth} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \big| \hat{D}(r) - D(r) \big| \tag{2}$$

the mask map supervision loss function is determined by equation (3):

$$\mathcal{L}_{mask} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathrm{BCE}\big( M(r), \hat{M}(r) \big) \tag{3}$$

the regularization term loss function (an Eikonal constraint on the SDF gradient) is determined by equation (4):

$$\mathcal{L}_{reg} = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \big( \| \nabla d(x) \|_2 - 1 \big)^2 \tag{4}$$

wherein $K_c$ denotes the intrinsic matrix of the RGB camera, $K_d$ denotes the intrinsic matrix of the depth camera, $[R \mid t]$ denotes the extrinsic matrix of the camera motion (these determine how the sampling rays are cast), $\mathcal{R}$ denotes the set of sampling rays, $C(r)$, $D(r)$ and $M(r)$ denote the observed RGB, depth and mask information, $\hat{C}(r)$, $\hat{D}(r)$ and $\hat{M}(r)$ denote the predicted RGB, depth and mask information, $\mathrm{BCE}$ is the binary cross-entropy function, and $\mathcal{X}$ is the set of sampling points.
According to an embodiment of the present invention, the loss functions on the surface include an SDF loss function and a visibility loss function;
wherein the SDF loss function is determined by equation (5):

$$\mathcal{L}_{sdf} = \frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \big| d(x_{p_i}) \big| \tag{5}$$

the visibility loss function is determined by equation (6):

$$\mathcal{L}_{vis} = \frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \max\big( 0,\; n_p \cdot v_p \big) \tag{6}$$

wherein $n_p$ denotes the normal at a surface sampling point, $v_p$ denotes the sight direction, $p_i$ denotes a sampling point on the observed surface, $d(x_{p_i})$ denotes the signed distance value at the reference-space point corresponding to $p_i$, and $\mathcal{P}$ denotes the set of surface sampling points; equation (5) constrains observed surface points to lie on the zero level set of the SDF, and equation (6) penalizes surface normals that point away from the camera.
According to a second aspect of the present invention, there is provided a three-dimensional dynamic reconstruction training apparatus based on RGB-D sequences, comprising:
the initialization module is used for randomly initializing a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
the acquisition module is used for acquiring the sight direction of the depth camera and the normal vector of the sampling point on the target training object, and acquiring RGB information and depth information of the training object through the depth camera;
the preprocessing module is used for preprocessing the RGB-D sequence information acquired by the depth camera to obtain mask information of a training object;
the tensor operation module is used for performing tensor operation on the mask information and the deformation codes of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
the mapping module is used for mapping the tensor operation result into the implicit reference space by utilizing the bijective mapping network and the topology-aware network to obtain a mapping result;
the first rendering module is used for processing the mapping result by utilizing the implicit reference space and rendering the processing result by utilizing a signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object;
the second rendering module is used for performing a composite operation on the sight direction, the appearance code and the mapping result of the three-dimensional dynamic reconstruction model, and rendering the result of the composite operation by utilizing the neural radiance field to obtain the predicted RGB information of the training object;
the optimizing module is used for inputting the sight direction, the normal vector, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into the loss function to obtain a loss value, and optimizing parameters of the three-dimensional dynamic reconstruction model according to the loss value;
the iteration module is used for carrying out iteration on the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimization operation until the loss value meets the preset condition, and obtaining the three-dimensional dynamic reconstruction model after parameter optimization.
According to a third aspect of the present invention, there is provided an electronic device comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above described RGB-D sequence based three-dimensional dynamic reconstruction method.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the above three-dimensional dynamic reconstruction method based on an RGB-D sequence.
According to the three-dimensional dynamic reconstruction method based on the RGB-D sequence provided by the invention, color and geometry are placed in the same framework for joint optimization through differentiable rendering and implicit representation techniques, so that all the observed RGB-D information participates in the optimization process together; meanwhile, a bijective transformation is used as part of the deformation field, so that cycle consistency among different observation frames is strictly satisfied; in addition, a topology-aware network is introduced to construct a topology-aware correspondence, so that the model can robustly process complex RGB-D sequences.
Drawings
FIG. 1 is a flow chart of a training method of a parameter-optimized three-dimensional dynamic reconstruction model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a three-dimensional dynamic reconstruction method based on RGB-D sequences according to an embodiment of the present invention;
FIG. 3 is a flow chart of obtaining a mapping result according to an embodiment of the invention;
FIG. 4 is a flow chart of obtaining a transformation result according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a bijective mapping network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the reconstruction and rendering effects of the three-dimensional dynamic reconstruction method based on RGB-D sequence according to the embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a three-dimensional dynamic reconstruction training device based on RGB-D sequences according to an embodiment of the present invention;
fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a three-dimensional dynamic reconstruction method based on RGB-D sequences, according to an embodiment of the disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to specific embodiments and the accompanying drawings.
Three-dimensional dynamic reconstruction is the task of recovering the three-dimensional geometry and motion information of a target object from visual signals (such as color maps and depth maps) acquired by a device. How to reconstruct the geometry of a prior-free object from an RGB-D (Red, Green, Blue, Depth) sequence is a current research hotspot. Most current RGB-D dynamic reconstruction algorithms are based on DynamicFusion: a TSDF volume (truncated signed distance function volume) stored in a reference space represents the geometry of the object, and each observation frame (depth information) is deformed into the reference frame by a pre-constructed, multi-level graph of deformation nodes. This fusion process is driven by ICP (iterative closest point) registration; the deformation process can be viewed as sparse control points directing the rotation and translation of points on a dense depth point cloud. Most subsequent related methods improve on DynamicFusion. Some model illumination, texture and other color information so that the color map can be used for registration; some add loss functions to address the close-to-open topology change problem; still others use deep learning modules combined with optical flow information (motion structure supervision) to make the estimated motion more accurate.
The previous methods are based on sparse node motion and geometric representations, and do not fully utilize RGB color information, so the resulting geometric accuracy is not very high. On the other hand, implicit representation methods, which have developed rapidly in recent years, are gradually replacing some traditional representations. Differentiable rendering techniques allow geometry and rendering to be considered simultaneously in one framework, so that color information can be better used to reconstruct geometry.
The invention aims to provide a dynamic reconstruction algorithm that fully combines color and depth information, so as to solve problems such as the insufficient reconstruction accuracy of existing RGB-D sequence-based algorithms.
According to a first aspect of the present invention, there is provided a three-dimensional dynamic reconstruction method based on an RGB-D sequence, comprising:
inputting the object to be reconstructed into the three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed.
FIG. 1 is a flow chart of a training method of a three-dimensional dynamic reconstruction model after parameter optimization according to an embodiment of the present invention.
As shown in FIG. 1, the training method for obtaining the three-dimensional dynamic reconstruction model after parameter optimization comprises operations S110 to S190.
In operation S110, a three-dimensional dynamic reconstruction model is randomly initialized, wherein the three-dimensional dynamic reconstruction model includes a bijective mapping network, a topology aware network, and an implicit reference space.
In operation S120, a line of sight direction of the depth camera and a normal vector of a sampling point on the target training object are obtained, and RGB information and depth information of the training object are obtained through the depth camera.
A depth camera, i.e., an RGB-D camera, can measure the distance from an object to the camera, similarly to a laser sensor, by actively emitting light toward the object and receiving the returned light, based on the infrared structured-light or Time-of-Flight (ToF) principle.
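For intuition only (this back-projection is not itself part of the claimed method), the following minimal Python sketch shows how a depth map from such a camera can be lifted to camera-space 3D points with the pinhole model; the intrinsic parameter names fx, fy, cx and cy are illustrative assumptions:

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, to camera-space points (H, W, 3).

    fx, fy are the focal lengths and cx, cy the principal point of the
    depth camera, in pixels (illustrative parameter names).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    z = depth
    x = (u - cx) / fx * z   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1)
```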
In operation S130, RGB-D sequence information acquired by the depth camera is preprocessed to obtain mask information of the training object.
In operation S140, tensor operation is performed on the mask information and the deformation code of the three-dimensional dynamic reconstruction model, so as to obtain a tensor operation result.
In operation S150, the tensor operation result is mapped to the implicit reference space by using the bijective mapping network and the topology aware network, to obtain a mapping result.
In operation S160, the mapping result is processed by using the implicit reference space, and the processing result is rendered by using the signed distance function, to obtain the predicted mask information of the training object and the predicted depth information of the training object.
In operation S170, a composite operation is performed on the sight direction, the appearance code of the three-dimensional dynamic reconstruction model, and the mapping result, and the result of the composite operation is rendered by using the neural radiance field, thereby obtaining the predicted RGB information of the training object.
In operation S180, a line of sight direction, a normal vector, mask information, RGB information, depth information, predicted mask information, predicted RGB information, and predicted depth information are input into a loss function to obtain a loss value, and parameters of a three-dimensional dynamic reconstruction model are optimized according to the loss value.
In operation S190, the obtaining operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimizing operation are iterated until the loss value meets the preset condition, and the three-dimensional dynamic reconstruction model after parameter optimization is obtained.
The preset conditions include, but are not limited to, the loss value converging to a certain value or the loss value oscillating back and forth within a certain interval, etc.
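As a hedged illustration of how operations S120 to S190 fit together, the following PyTorch-style skeleton (the function name, the choice of the Adam optimizer, and the convergence tolerance are assumptions, not specified by the invention) iterates sampling, prediction and optimization until a preset condition of this kind is met:

```python
import torch

def optimize(model, ray_batches, loss_fn, lr=5e-4, tol=1e-6, max_iters=100_000):
    """Iterate operations S120-S190 until the loss meets a preset condition.

    ray_batches: iterable yielding (inputs, observations) pairs per step.
    loss_fn: callable(predictions, observations) -> scalar loss tensor.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for step, (inputs, obs) in enumerate(ray_batches):
        if step >= max_iters:
            break
        preds = model(*inputs)      # deform -> reference space -> render
        loss = loss_fn(preds, obs)  # compare against observed RGB-D and mask
        opt.zero_grad()
        loss.backward()
        opt.step()
        if abs(prev_loss - loss.item()) < tol:  # preset condition: convergence
            break
        prev_loss = loss.item()
    return model
```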
According to the three-dimensional dynamic reconstruction method based on the RGB-D sequence provided by the invention, color and geometry are placed in the same framework for joint optimization through differentiable rendering and implicit representation techniques, so that all the observed RGB-D information participates in the optimization process together; meanwhile, a bijective transformation is used as part of the deformation field, so that cycle consistency among different observation frames is strictly satisfied; in addition, a topology-aware network is introduced to construct a topology-aware correspondence, so that the model can robustly process complex RGB-D sequences.
An implicit representation treats the object as an implicit surface, with each point in space associated with a physical property of the object (e.g., density, color, etc.). In this way, by sampling some points in space, the corresponding aggregated attribute values can be obtained through a mathematical-physical formula, and the acquired visual signals (such as RGB and depth maps) can be used to supervise the learning of the implicit surface's properties. For a dynamic sequence, a typical technical route models the motion and the geometric appearance information of the object separately: an implicit deformation field associates sampling points with the motion properties of the object, while a reference geometric appearance representation associates sampling points with the geometry and texture information of the object. The deformation field transforms the sampling points to corresponding points in the reference space, so that the corresponding color and depth information is obtained in the reference space for supervised learning with the RGB-D information. The method provided by the invention implicitly comprises the deformation field and the reference space, and jointly optimizes the acquired visual information of the object (all RGB-D observation frames) in the same framework. The coupled nature of the joint optimization must be considered: the non-rigid deformation information and the reference space information are iterated together during optimization, so the solving process easily falls into a locally optimal solution and accurate motion and geometric information cannot be obtained. The root of this problem is that it is not easy to establish cycle consistency between arbitrary observation frames. Previous approaches constrained the construction of the loss function by combining forward and backward deformations, but did not essentially solve this problem.
To solve the problem of cycle consistency between different observation frames, the invention proposes a bijective mapping module in the deformation field. The bijective representation regards each step of the deformation as a rotation and translation along and around a certain coordinate axis, strictly guarantees a one-to-one correspondence between the observation space and the reference space, and reasonably guides the convergence of the solution so that the reconstructed surface has higher fidelity. On the other hand, because bijections preserve topology while topology changes are quite common in real scenes, a representation of topological properties is introduced into the algorithm, so that the implicit deformation field can obtain a topology-aware mapping relation during reconstruction.
In summary, in the method provided by the invention, a sampling point in the observation space is deformed into the reference space through the implicit deformation field (composed of a bijective transformation and a topological transformation); after the corresponding physical properties are obtained in the reference space, the aggregated visual information (color map and depth map) is obtained through an integral formula (volume rendering), and finally the RGB-D sequence is used for supervision. Thus, all observation information can be jointly optimized in the same implicit differentiable framework, and the final algorithm can obtain more reasonable object motion information and geometric texture information in the scene.
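As a compact illustration of this pipeline, a single sampling point flows through it as sketched below; the function name and call signatures are assumptions, and any callables playing these roles will do:

```python
def reconstruct_point(p, phi, psi, v, F_h, F_q, F_d, F_c):
    """Trace one observation-space sampling point through the pipeline.

    p: observation-space point; phi / psi: deformation / appearance codes
    for the current frame; v: sight direction; F_h / F_q / F_d / F_c:
    bijective map, topology-aware network, SDF field and radiance field.
    """
    x = F_h(p, phi)          # bijective deformation into the reference space
    xq = F_q(x, phi)         # append topological coordinates (topology-aware)
    sdf = F_d(xq)            # geometric attribute: signed distance
    color = F_c(xq, psi, v)  # color attribute: view-dependent radiance
    return sdf, color
```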
Fig. 2 is a schematic structural diagram of a three-dimensional dynamic reconstruction method based on an RGB-D sequence according to an embodiment of the present invention.
As shown in fig. 2, assume that an RGB-D sequence $\{(C_i, D_i)\}_{i=1}^{N}$ is acquired with a depth camera, wherein $C_i$ denotes the $i$-th frame of RGB information and $D_i$ denotes the $i$-th frame of depth map information. First, the object region of each frame is segmented out by an existing video segmentation method, and the mask image is recorded as $M_i$. As can be seen from fig. 2, for each time $t_i$, the motion information of the object is represented by an implicit deformation field, and the geometric and color information of the object is represented by an implicit field in the reference space. The method provided by the invention first maps a sampling point $p_i = [x_i, y_i, z_i]$ in the observation space, through the bijective mapping network $F_h$ and the topology-aware network $F_q$, into a reference space containing topology information, where the corresponding reference-space point $x$, the geometric information $d(x)$ and the color information $c_i$ correspond to the physical attributes of the object. Finally, a color map and a depth map are obtained by rendering through an integral formula, and supervised optimization is performed using the observation frames. The deformation code $\phi_i$ associated with each moment also serves as one of the inputs of the deformation field, and the appearance code $\psi_i$ and the sight direction $v_i$ of each moment serve as inputs of the reference space.
The method provided by the invention treats the object as an implicit surface in the reference space, whose geometric attributes are represented by the implicit SDF (signed distance function) network $F_d$ and whose color attributes are obtained through the neural radiance field $F_c$. Sampling is performed along the ray passing through each pixel, and the physical properties of the sampling points are integrated to obtain the rendered color value $\hat{C}(r)$ and depth value $\hat{D}(r)$, which are supervised with the acquired RGB-D sequence.
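The integral formula can be made concrete with a short sketch. The following illustrative PyTorch function integrates per-sample SDF and color values along one ray into a rendered color and depth; the NeuS-style SDF-to-opacity conversion and the sharpness scale s are assumptions used for illustration, not necessarily the exact weighting of the invention:

```python
import torch

def render_ray(sdf_vals, colors, t_vals, s=64.0):
    """Integrate samples along one ray into a color and a depth value.

    sdf_vals: (N,) signed distances at the samples; colors: (N, 3);
    t_vals: (N,) sample depths along the ray; s: assumed sharpness scale.
    """
    # sigmoid(s * d) decreases as the ray crosses from outside (d > 0)
    # to inside (d < 0) the surface, so consecutive differences peak there.
    cdf = torch.sigmoid(s * sdf_vals)
    alpha = ((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-6)).clamp(0.0, 1.0)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-7]), dim=0)[:-1]
    weights = alpha * trans                               # (N-1,)
    color = (weights[:, None] * colors[:-1]).sum(dim=0)   # rendered color
    depth = (weights * t_vals[:-1]).sum(dim=0)            # rendered depth
    return color, depth
```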
Fig. 3 is a flow chart of obtaining a mapping result according to an embodiment of the present invention.
As shown in fig. 3, mapping the tensor operation result into the implicit reference space by using the bijective mapping network and the topology-aware network to obtain the mapping result includes operations S310 to S320.
In operation S310, the tensor operation result is subjected to coordinate transformation by using the coordinate transformation modules of the bijective mapping network to obtain a transformation result, wherein the number of coordinate transformation modules in the bijective mapping network is the same as the dimension of the tensor operation result, and each coordinate transformation module performs one coordinate transformation on the tensor operation result.
In operation S320, the transformation result is mapped into the implicit reference space using the topology aware network, resulting in a mapping result.
Fig. 4 is a flow chart of obtaining a transformation result according to an embodiment of the present invention.
As shown in fig. 4, performing coordinate transformation on the tensor operation result by using the coordinate transformation modules of the bijective mapping network to obtain a transformation result includes operations S410 to S430.
In operation S410, one coordinate axis of the tensor operation result is randomly selected as a transformation reference axis, and the tensor value on the transformation reference axis is translated by the coordinate transformation module, so as to obtain a translation result.
In operation S420, according to the translation result, the coordinate transformation module is used to translate the other coordinate axes in the tensor operation result and rotate them around the transformation reference axis, so as to obtain a one-step transformation result.
In operation S430, the translation operation and the rotation operation are iterated until all the coordinate transformation modules complete the coordinate transformation of the tensor operation result, and a transformation result is obtained.
According to an embodiment of the present invention, each of the above transformations is continuously differentiable and strictly satisfies cycle consistency.
In order to better illustrate the structure and function of the bijective mapping network provided by the invention, it will be described in further detail below with reference to fig. 5.
Fig. 5 is a schematic structural diagram of a bijective mapping network according to an embodiment of the present invention.
As shown in fig. 5, the bijective mapping network $F_h$ provided by the invention consists of a plurality of modules (namely the coordinate transformation modules described above). Each module models, for a certain coordinate axis, the motion of the point coordinates $[u, v, w]$ along and around this axis. During this process the point coordinates are split into two parts, and the transformation of one part is inferred from the other part. The map thus constructed is invertible, and its inverse is relatively easy to evaluate (with the same computational complexity as forward inference). Assuming coordinate axis $w$ is selected, the point coordinates $[u, v, w]$ are divided into the two parts $[u, v]$ and $[w]$. Module $l$ infers, from $[u, v]$ and the deformation code $\phi_i$, a translation $\delta_w$ along the coordinate axis $w$, converting $[w]$ into $[w']$; module $l$ then infers, from $[w']$ and the deformation code $\phi_i$, a rotation $R_{uv}$ about the coordinate axis $w$ and a translation $\delta_{uv}$, converting $[u, v]$ into $[u', v']$. Each module of the bijective representation models motion information along and around a certain coordinate axis; such deformations strictly satisfy cycle consistency between different observation frames and are continuously differentiable.
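To make this structure concrete, here is a minimal illustrative PyTorch sketch of one coordinate transformation module; the class name, the MLP widths and the fixed choice of axis w as the split axis are assumptions rather than the exact architecture of the invention:

```python
import torch
import torch.nn as nn

class DeformBlock(nn.Module):
    """One invertible coordinate transformation module (split axis: w).

    [u, v, w] is split into [u, v] and [w]; each half is transformed
    conditioned on the other half and the deformation code, so the map
    is bijective and its inverse costs the same as the forward pass.
    """
    def __init__(self, code_dim, hidden=128):
        super().__init__()
        # Predicts the translation delta_w along axis w from [u, v] + code.
        self.shift_w = nn.Sequential(
            nn.Linear(2 + code_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Predicts a rotation angle about axis w and a 2D translation
        # for [u, v] from the updated [w'] + code.
        self.rot_trans_uv = nn.Sequential(
            nn.Linear(1 + code_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, p, code):
        uv, w = p[..., :2], p[..., 2:]
        w_new = w + self.shift_w(torch.cat([uv, code], dim=-1))
        theta, t = torch.split(
            self.rot_trans_uv(torch.cat([w_new, code], dim=-1)), [1, 2], dim=-1)
        cos, sin = torch.cos(theta), torch.sin(theta)
        u, v = uv[..., :1], uv[..., 1:]
        uv_new = torch.cat([cos * u - sin * v, sin * u + cos * v], dim=-1) + t
        return torch.cat([uv_new, w_new], dim=-1)

    def inverse(self, p, code):
        # Undo the [u, v] rotation/translation first (w' is unchanged by
        # that step), then undo the w shift computed from the restored [u, v].
        uv_new, w_new = p[..., :2], p[..., 2:]
        theta, t = torch.split(
            self.rot_trans_uv(torch.cat([w_new, code], dim=-1)), [1, 2], dim=-1)
        cos, sin = torch.cos(theta), torch.sin(theta)
        uv_c = uv_new - t
        u, v = uv_c[..., :1], uv_c[..., 1:]
        uv = torch.cat([cos * u + sin * v, -sin * u + cos * v], dim=-1)
        w = w_new - self.shift_w(torch.cat([uv, code], dim=-1))
        return torch.cat([uv, w], dim=-1)
```

Stacking several such modules, e.g. one per coordinate axis, yields the bijective mapping network $F_h$; because each half of the coordinates is transformed conditioned only on the other half, the inverse is exact, which is what strictly guarantees cycle consistency between observation frames.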
Considering that topology changes are pervasive in the real world, while bijections are strictly topology-preserving, the invention introduces a topology-aware network $F_q$, which maps a sampling point $p_i$ of the observation space to an $m$-dimensional topological coordinate $q(p_i)$. This constructs a topology-aware deformation map that maps the sampling points of the observation space to corresponding points $x$ of the reference space.
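A hedged sketch of such a topology-aware network, assuming a plain MLP (the class name, layer sizes and the dimension m are illustrative):

```python
import torch
import torch.nn as nn

class TopologyNet(nn.Module):
    """Maps a point and the deformation code to m topological coordinates.

    The reference-space fields are then queried on the concatenation
    [x, q(p_i)], so the correspondence can account for topology changes
    even though the bijective map itself is topology-preserving.
    """
    def __init__(self, code_dim, m=2, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, m))

    def forward(self, x, code):
        # x: (..., 3) point; code: (..., code_dim) deformation code.
        q = self.mlp(torch.cat([x, code], dim=-1))
        return torch.cat([x, q], dim=-1)  # (..., 3 + m) query for F_d, F_c
```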
The invention adopts differentiable rendering technology so that the geometry and the color information are jointly optimized in the same framework, with the collected RGB-D information used as supervision to guide the modeling. An implicit deformation field is constructed between the observation space and the reference space so that the sampling points are mapped onto corresponding points of the reference space. An implicit SDF field and a neural radiance field are defined in the reference space to model geometry and color, respectively. The invention provides a bijective representation in the implicit deformation field, which strictly satisfies cycle consistency between different observation frames. Considering the topology problem, the invention simultaneously introduces a topology-aware network, so that the whole model can accurately model a topology-aware correspondence.
According to an embodiment of the invention, the above-mentioned loss functions include a free space loss function and a loss function on the surface.
According to an embodiment of the present invention, the free space loss function includes an RGB supervision loss function, a depth map supervision loss function, a mask map supervision loss function, and a regularization term loss function;
wherein the RGB supervision loss function is determined by equation (1):

$$\mathcal{L}_{rgb} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \big\| \hat{C}(r) - C(r) \big\|_1 \tag{1}$$

the depth map supervision loss function is determined by equation (2):

$$\mathcal{L}_{depth} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \big| \hat{D}(r) - D(r) \big| \tag{2}$$

the mask map supervision loss function is determined by equation (3):

$$\mathcal{L}_{mask} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathrm{BCE}\big( M(r), \hat{M}(r) \big) \tag{3}$$

the regularization term loss function (an Eikonal constraint on the SDF gradient) is determined by equation (4):

$$\mathcal{L}_{reg} = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \big( \| \nabla d(x) \|_2 - 1 \big)^2 \tag{4}$$

wherein $K_c$ denotes the intrinsic matrix of the RGB camera, $K_d$ denotes the intrinsic matrix of the depth camera, $[R \mid t]$ denotes the extrinsic matrix of the camera motion (these determine how the sampling rays are cast), $\mathcal{R}$ denotes the set of sampling rays, $C(r)$, $D(r)$ and $M(r)$ denote the observed RGB, depth and mask information, $\hat{C}(r)$, $\hat{D}(r)$ and $\hat{M}(r)$ denote the predicted RGB, depth and mask information, $\mathrm{BCE}$ is the binary cross-entropy function, and $\mathcal{X}$ is the set of sampling points.
According to an embodiment of the present invention, the loss functions on the surface include an SDF loss function and a visibility loss function;
wherein the SDF loss function is determined by equation (5):

$$\mathcal{L}_{sdf} = \frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \big| d(x_{p_i}) \big| \tag{5}$$

the visibility loss function is determined by equation (6):

$$\mathcal{L}_{vis} = \frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \max\big( 0,\; n_p \cdot v_p \big) \tag{6}$$

wherein $n_p$ denotes the normal at a surface sampling point, $v_p$ denotes the sight direction, $p_i$ denotes a sampling point on the observed surface, $d(x_{p_i})$ denotes the signed distance value at the reference-space point corresponding to $p_i$, and $\mathcal{P}$ denotes the set of surface sampling points; equation (5) constrains observed surface points to lie on the zero level set of the SDF, and equation (6) penalizes surface normals that point away from the camera.
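Putting equations (1) to (6) together, a hedged PyTorch-style sketch of the total objective follows; the loss weights w_* and the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_rgb, rgb, pred_depth, depth, pred_mask, mask,
               grad_sdf, surf_sdf, normals, view_dirs,
               w_rgb=1.0, w_d=1.0, w_m=0.1, w_reg=0.1, w_sdf=1.0, w_vis=0.01):
    """Combine the free-space and on-surface losses (weights are assumed).

    pred_* / rgb / depth / mask: per-ray renders and observations.
    grad_sdf: SDF gradients at free-space samples (for the Eikonal term).
    surf_sdf: SDF values at observed surface points (should be near 0).
    normals / view_dirs: per surface point, for the visibility term.
    """
    l_rgb = (pred_rgb - rgb).abs().mean()                         # eq. (1)
    l_depth = (pred_depth - depth).abs().mean()                   # eq. (2)
    l_mask = F.binary_cross_entropy(
        pred_mask.clamp(1e-4, 1.0 - 1e-4), mask)                  # eq. (3)
    l_reg = ((grad_sdf.norm(dim=-1) - 1.0) ** 2).mean()           # eq. (4)
    l_sdf = surf_sdf.abs().mean()                                 # eq. (5)
    # Penalize surface normals facing away from the camera.
    l_vis = torch.relu((normals * view_dirs).sum(dim=-1)).mean()  # eq. (6)
    return (w_rgb * l_rgb + w_d * l_depth + w_m * l_mask
            + w_reg * l_reg + w_sdf * l_sdf + w_vis * l_vis)
```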
The method provided by the invention treats the object as an implicit surface in the reference space, whose geometric attributes are represented by the implicit SDF (signed distance function) network $F_d$ and whose color attributes are obtained through the neural radiance field $F_c$. Along the ray passing through each pixel, the physical properties of the sampling points are integrated to obtain the rendered color value $\hat{C}(r)$ and depth value $\hat{D}(r)$, and the collected RGB-D sequence is used for supervision. The loss function is divided into two parts: the free-space loss function and the on-surface loss function. The free-space loss function comprises RGB supervision, depth map supervision, mask map supervision and a regularization term; the on-surface loss function comprises two parts, an SDF loss function and a visibility loss function.
Fig. 6 is a schematic diagram of reconstruction and rendering effects using the three-dimensional dynamic reconstruction method based on RGB-D sequence according to an embodiment of the invention.
As shown in fig. 6, the three-dimensional dynamic reconstruction method provided by the invention can be applied to the reconstruction and rendering of animals (including human figures), plants and man-made abstract objects, and achieves good reconstruction and rendering effects. Meanwhile, as can be seen from fig. 6, the method provided by the invention can reconstruct and render multiple poses of different types of objects, and has good robustness and generalization. Since color and geometry are placed in the same framework for joint optimization through differentiable rendering and implicit representation techniques, all observed RGB-D information jointly participates in the optimization process, and the introduced bijective mapping network and topology-aware network greatly improve the accuracy of reconstruction and rendering; as can be seen from fig. 6, reconstructing and rendering the target to be reconstructed with the method provided by the invention achieves a good effect.
Fig. 7 is a schematic structural diagram of a three-dimensional dynamic reconstruction training apparatus 700 based on RGB-D sequences according to an embodiment of the present invention.
As shown in fig. 7, the apparatus includes an initialization module 710, an acquisition module 720, a preprocessing module 730, a tensor operation module 740, a mapping module 750, a first rendering module 760, a second rendering module 770, an optimization module 780, and an iteration module 790.
An initialization module 710, configured to randomly initialize a three-dimensional dynamic reconstruction model, where the three-dimensional dynamic reconstruction model includes a bijective mapping network, a topology aware network, and an implicit reference space.
The obtaining module 720 is configured to obtain a line of sight direction of the depth camera and a normal vector of a sampling point on the target training object, and obtain RGB information and depth information of the training object through the depth camera.
The preprocessing module 730 is configured to preprocess RGB-D sequence information acquired by the depth camera, so as to obtain mask information of the training object.
The tensor operation module 740 is configured to perform tensor operation on the mask information and the deformation code of the three-dimensional dynamic reconstruction model, so as to obtain a tensor operation result.
The mapping module 750 is configured to map the tensor operation result to the implicit reference space by using the bijective mapping network and the topology aware network, so as to obtain a mapping result.
The first rendering module 760 is configured to process the mapping result using the implicit reference space, and render the processing result using the signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object.
The second rendering module 770 is configured to perform a composite operation on the sight direction, the appearance code and the mapping result of the three-dimensional dynamic reconstruction model, and render the result of the composite operation by using the neural radiance field to obtain the predicted RGB information of the training object.
The optimizing module 780 is configured to input the line of sight direction, the normal vector, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information, and the predicted depth information into the loss function, obtain a loss value, and optimize parameters of the three-dimensional dynamic reconstruction model according to the loss value.
The iteration module 790 is configured to iterate the obtaining operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation, and the optimizing operation until the loss value meets a preset condition, and obtain the parameter-optimized three-dimensional dynamic reconstruction model.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a three-dimensional dynamic reconstruction method based on RGB-D sequences, according to an embodiment of the disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, the input/output (I/O) interface 805 also being connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention and is not meant to limit the scope of the invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (10)

1. A three-dimensional dynamic reconstruction method based on RGB-D sequence includes:
inputting an object to be reconstructed into a three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed, wherein the three-dimensional dynamic reconstruction model after parameter optimization is obtained by training by the following method:
randomly initializing a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
acquiring the sight direction of a depth camera and the normal vector of a sampling point on a target training object, and acquiring RGB information and depth information of the training object through the depth camera;
preprocessing RGB-D sequence information acquired by the depth camera to obtain mask information of the training object;
performing tensor operation on the mask information and the deformation codes of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
mapping the tensor operation result into the implicit reference space by using the bijective mapping network and the topology aware network to obtain a mapping result;
processing the mapping result by utilizing the implicit reference space, and rendering the processing result by utilizing a signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object;
performing a composite operation on the sight direction, the appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and rendering the result of the composite operation by utilizing a neural radiance field to obtain predicted RGB information of the training object;
inputting the sight line direction, the normal vector, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and optimizing parameters of the three-dimensional dynamic reconstruction model according to the loss value;
and iterating the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimizing operation until the loss value meets the preset condition to obtain the three-dimensional dynamic reconstruction model after parameter optimization.
2. The method of claim 1, wherein the mapping the tensor operation result into the implicit reference space using the bijective mapping network and the topology aware network to obtain a mapping result comprises:
the tensor operation result is subjected to coordinate transformation by utilizing the coordinate transformation modules of the bijective mapping network to obtain a transformation result, wherein the number of coordinate transformation modules in the bijective mapping network is the same as the dimension of the tensor operation result, and each coordinate transformation module is used for performing one coordinate transformation on the tensor operation result;
and mapping the transformation result into an implicit reference space by using the topology aware network to obtain a mapping result.
3. The method of claim 2, wherein the transforming the tensor operation result by the coordinate transformation module of the bijective mapping network to obtain a transformed result comprises:
randomly selecting one coordinate axis in the tensor operation result as a transformation reference axis, and translating tensor values on the transformation reference axis by utilizing the coordinate transformation module to obtain a translation result;
according to the translation result, translating the other coordinate axes in the tensor operation result and rotating them around the transformation reference axis by utilizing the coordinate transformation module, to obtain a one-step transformation result;
and carrying out translation operation and rotation operation iteratively until all the coordinate transformation modules finish coordinate transformation of the tensor operation result to obtain a transformation result.
4. A method according to any one of claims 2-3, wherein each of said transformations is continuously differentiable and strictly satisfies cycle consistency.
5. The method of claim 1, wherein the loss function comprises a free space loss function and a loss function on a surface.
6. The method of claim 5, wherein the free space loss function comprises an RGB supervision loss function, a depth map supervision loss function, a mask map supervision loss function, and a regularization term loss function;
wherein the RGB supervision loss function is determined by equation (1):

$$\mathcal{L}_{rgb} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \big\| \hat{C}(r) - C(r) \big\|_1 \tag{1}$$

the depth map supervision loss function is determined by equation (2):

$$\mathcal{L}_{depth} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \big| \hat{D}(r) - D(r) \big| \tag{2}$$

the mask map supervision loss function is determined by equation (3):

$$\mathcal{L}_{mask} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathrm{BCE}\big( M(r), \hat{M}(r) \big) \tag{3}$$

the regularization term loss function is determined by equation (4):

$$\mathcal{L}_{reg} = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \big( \| \nabla d(x) \|_2 - 1 \big)^2 \tag{4}$$

wherein $K_c$ denotes the intrinsic matrix of the RGB camera, $K_d$ denotes the intrinsic matrix of the depth camera, $[R \mid t]$ denotes the extrinsic matrix of the camera motion, $\mathcal{R}$ denotes the set of sampling rays, $C(r)$, $D(r)$ and $M(r)$ denote the observed RGB, depth and mask information, $\hat{C}(r)$, $\hat{D}(r)$ and $\hat{M}(r)$ denote the predicted RGB, depth and mask information, $\mathrm{BCE}$ is the binary cross-entropy function, and $\mathcal{X}$ is the set of sampling points.
7. The method of claim 5, wherein the loss function on the surface comprises an SDF loss function and a visibility loss function;
wherein the SDF loss function is determined by equation (5):

$$\mathcal{L}_{sdf} = \frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \big| d(x_{p_i}) \big| \tag{5}$$

the visibility loss function is determined by equation (6):

$$\mathcal{L}_{vis} = \frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \max\big( 0,\; n_p \cdot v_p \big) \tag{6}$$

wherein $n_p$ denotes the normal at a surface sampling point, $v_p$ denotes the sight direction, $p_i$ denotes a sampling point on the observed surface, $d(x_{p_i})$ denotes the signed distance value at the reference-space point corresponding to $p_i$, and $\mathcal{P}$ denotes the set of surface sampling points.
8. A three-dimensional dynamic reconstruction training device based on RGB-D sequences, comprising:
the initialization module is used for randomly initializing a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
the acquisition module is used for acquiring the sight direction of the depth camera and the normal vector of the sampling point on the target training object, and acquiring RGB information and depth information of the training object through the depth camera;
the preprocessing module is used for preprocessing the RGB-D sequence information acquired by the depth camera to obtain mask information of the training object;
the tensor operation module is used for carrying out tensor operation on the mask information and the deformation codes of the three-dimensional dynamic reconstruction model to obtain tensor operation results;
the mapping module is used for mapping the tensor operation result into the implicit reference space by utilizing the bijective mapping network and the topology aware network to obtain a mapping result;
the first rendering module is used for processing the mapping result by utilizing the implicit reference space and rendering the processing result by utilizing a signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object;
the second rendering module is used for performing a composite operation on the sight direction, the appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and rendering the result of the composite operation by utilizing a neural radiance field to obtain the predicted RGB information of the training object;
the optimization module is used for inputting the sight direction, the normal vector, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and optimizing parameters of the three-dimensional dynamic reconstruction model according to the loss value;
the iteration module is used for carrying out iteration on the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimization operation until the loss value meets the preset condition, and obtaining the three-dimensional dynamic reconstruction model after parameter optimization.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.
CN202210704281.3A 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment Active CN115018989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210704281.3A CN115018989B (en) 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210704281.3A CN115018989B (en) 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115018989A CN115018989A (en) 2022-09-06
CN115018989B true CN115018989B (en) 2024-03-29

Family

ID=83076186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210704281.3A Active CN115018989B (en) 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115018989B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115409931B (en) * 2022-10-31 2023-03-31 苏州立创致恒电子科技有限公司 Three-dimensional reconstruction method based on image and point cloud data fusion
CN116704572A (en) * 2022-12-30 2023-09-05 荣耀终端有限公司 Eye movement tracking method and device based on depth camera

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020242170A1 (en) * 2019-05-28 2020-12-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
WO2021042277A1 (en) * 2019-09-03 2021-03-11 浙江大学 Method for acquiring normal vector, geometry and material of three-dimensional object employing neural network
CN113538682A (en) * 2021-07-19 2021-10-22 北京的卢深视科技有限公司 Model training method, head reconstruction method, electronic device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020242170A1 (en) * 2019-05-28 2020-12-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
WO2021042277A1 (en) * 2019-09-03 2021-03-11 浙江大学 Method for acquiring normal vector, geometry and material of three-dimensional object employing neural network
CN113538682A (en) * 2021-07-19 2021-10-22 北京的卢深视科技有限公司 Model training method, head reconstruction method, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Color three-dimensional reconstruction based on Kinect (基于Kinect的彩色三维重建); Lei Baoquan; Yao Jianmin; Yan Qun; Lin Zhixian; Chen Weiwei; Cable Television Technology (有线电视技术); 2019-12-15 (No. 12); full text *

Also Published As

Publication number Publication date
CN115018989A (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN115018989B (en) Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
CN106803267B (en) Kinect-based indoor scene three-dimensional reconstruction method
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN114648613B (en) Three-dimensional head model reconstruction method and device based on deformable nerve radiation field
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN110766746A (en) 3D driver posture estimation method based on combined 2D-3D neural network
Li et al. Vox-surf: Voxel-based implicit surface representation
CN114049434A (en) 3D modeling method and system based on full convolution neural network
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
CN116452752A (en) Intestinal wall reconstruction method combining monocular dense SLAM and residual error network
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
Wei et al. Depth-guided optimization of neural radiance fields for indoor multi-view stereo
CN112686202B (en) Human head identification method and system based on 3D reconstruction
Jung et al. Learning free-form deformation for 3D face reconstruction from in-the-wild images
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
CN114783039B (en) Motion migration method driven by 3D human body model
CN114049678B (en) Facial motion capturing method and system based on deep learning
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
CN115409949A (en) Model training method, visual angle image generation method, device, equipment and medium
CN115205463A (en) New visual angle image generation method, device and equipment based on multi-spherical scene expression
JP2022036075A (en) Method for training neural network to deliver viewpoints of objects using unlabeled pairs of images, and corresponding system
Ge et al. An improved U-net architecture for image dehazing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant