CN115018989A - Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment - Google Patents

Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment Download PDF

Info

Publication number
CN115018989A
Authority
CN
China
Prior art keywords
information
result
rgb
mapping
dimensional dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210704281.3A
Other languages
Chinese (zh)
Other versions
CN115018989B (en)
Inventor
张举勇 (Zhang Juyong)
蔡泓锐 (Cai Hongrui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210704281.3A priority Critical patent/CN115018989B/en
Publication of CN115018989A publication Critical patent/CN115018989A/en
Application granted granted Critical
Publication of CN115018989B publication Critical patent/CN115018989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2215/00 Indexing scheme for image rendering
    • G06T2215/06 Curved planar reformation of 3D line structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2215/00 Indexing scheme for image rendering
    • G06T2215/12 Shadow map, environment map

Abstract

The invention discloses a three-dimensional dynamic reconstruction method based on an RGB-D sequence, which comprises the following step: inputting the object to be reconstructed into a parameter-optimized three-dimensional dynamic reconstruction model to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed. The invention also discloses a three-dimensional dynamic reconstruction training device based on the RGB-D sequence, as well as an electronic device and a storage medium suitable for the RGB-D sequence-based three-dimensional dynamic reconstruction method.

Description

Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
Technical Field
The invention relates to the field of three-dimensional dynamic reconstruction, and in particular to a three-dimensional dynamic reconstruction method based on an RGB-D sequence, a training device, an electronic device and a storage medium.
Background
Reconstructing a three-dimensional model of a dynamic target object from video is a classic problem in computer vision, with broad prospects in fields such as the metaverse, augmented reality and virtual reality. In the monocular setting in particular, because RGB sequences captured by a color camera suffer from depth ambiguity and from insignificant feature changes in weakly textured regions, no existing algorithm can reconstruct a high-quality three-dimensional model of a dynamic object without a template prior. On the other hand, as the cost of depth cameras decreases, many dynamic fusion algorithms operating on depth maps have emerged; however, because they do not fully combine color information with depth information, the fidelity of the resulting models is also limited, and many problems remain.
Disclosure of Invention
In view of the above problems, the present invention provides a three-dimensional dynamic reconstruction method based on an RGB-D sequence, a training device, an electronic device, and a storage medium, so as to at least partially solve the above problems.
According to a first aspect of the present invention, there is provided a three-dimensional dynamic reconstruction method based on an RGB-D sequence, comprising:
inputting an object to be reconstructed into the three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed, wherein the three-dimensional dynamic reconstruction model after parameter optimization is obtained by training through the following method:
randomly initializing a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
acquiring the viewing direction of a depth camera and the normal vectors of sampling points on a target training object, and acquiring RGB information and depth information of the training object through the depth camera;
preprocessing the RGB-D sequence information acquired by the depth camera to obtain mask information of the training object;
performing a tensor operation on the mask information and the deformation code of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
mapping the tensor operation result to the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result;
processing the mapping result by using the implicit reference space, and rendering the processing result by using a signed distance function to obtain predicted mask information and predicted depth information of the training object;
composing the viewing direction, the appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and rendering the composed result by using a neural radiance field to obtain predicted RGB information of the training object;
inputting the viewing direction, the normal vectors, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and optimizing the parameters of the three-dimensional dynamic reconstruction model according to the loss value;
and iterating the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimization operation until the loss value meets a preset condition, obtaining the parameter-optimized three-dimensional dynamic reconstruction model (a minimal sketch of this training loop is given below).
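The following is a minimal, self-contained sketch of what one iteration of this training loop can look like. It is illustrative only: the plain MLP standing in for the bijective mapping network and topology-aware network, the SDF-to-opacity conversion, the synthetic observations, and all variable names are assumptions made for the sketch, not the patent's disclosed implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_FRAMES, CODE, N_RAYS, N_SAMP = 8, 16, 256, 32

def mlp(i, o, h=64):
    return nn.Sequential(nn.Linear(i, h), nn.ReLU(), nn.Linear(h, h), nn.ReLU(), nn.Linear(h, o))

# Toy stand-ins for the patent's components: a plain MLP replaces the bijective
# mapping network F_h and the topology-aware network F_q in this sketch.
deform  = mlp(3 + CODE, 3)        # observation-space point + deformation code -> reference-space point
sdf_net = mlp(3, 1)               # implicit SDF field
rgb_net = mlp(3 + 3 + CODE, 3)    # reference point + viewing direction + appearance code -> color
deform_codes = nn.Embedding(N_FRAMES, CODE)   # per-frame deformation codes
appear_codes = nn.Embedding(N_FRAMES, CODE)   # per-frame appearance codes

params = [*deform.parameters(), *sdf_net.parameters(), *rgb_net.parameters(),
          *deform_codes.parameters(), *appear_codes.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(200):
    f = torch.randint(0, N_FRAMES, (1,))
    # Synthetic "observations" standing in for a preprocessed RGB-D frame and its mask.
    rays_o = torch.zeros(N_RAYS, 3)
    rays_d = nn.functional.normalize(torch.randn(N_RAYS, 3), dim=-1)
    gt_rgb, gt_depth, gt_mask = torch.rand(N_RAYS, 3), 2 + torch.rand(N_RAYS, 1), torch.ones(N_RAYS, 1)

    t = torch.linspace(0.5, 4.0, N_SAMP)                         # sample depths along each ray
    pts = rays_o[:, None] + t[None, :, None] * rays_d[:, None]   # (rays, samples, 3) observation-space points

    dc = deform_codes(f).expand(N_RAYS, N_SAMP, -1)
    x = deform(torch.cat([pts, dc], -1))                         # map into the implicit reference space
    sdf = sdf_net(x).squeeze(-1)                                 # signed distance per sample
    ac = appear_codes(f).expand(N_RAYS, N_SAMP, -1)
    col = torch.sigmoid(rgb_net(torch.cat([x, rays_d[:, None].expand_as(x), ac], -1)))

    # Crude SDF-to-opacity conversion and alpha compositing (not the patent's exact formulation).
    alpha = torch.sigmoid(-sdf / 0.1)
    trans = torch.cumprod(torch.cat([torch.ones(N_RAYS, 1), 1 - alpha + 1e-7], -1), -1)[:, :-1]
    w = alpha * trans                                            # volume-rendering weights
    pred_rgb = (w[..., None] * col).sum(1)
    pred_depth = (w * t).sum(1, keepdim=True)
    pred_mask = w.sum(1, keepdim=True).clamp(1e-4, 1 - 1e-4)

    loss = (pred_rgb - gt_rgb).abs().mean() \
         + (pred_depth - gt_depth).abs().mean() \
         + nn.functional.binary_cross_entropy(pred_mask, gt_mask)
    opt.zero_grad(); loss.backward(); opt.step()                 # optimize model parameters
```

The sketch omits the normal-vector supervision and the regularization, SDF and visibility terms described later; it only shows how the per-frame codes, the deformation, the reference-space queries and the rendered predictions feed a loss that is iterated until a preset stopping condition.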
According to an embodiment of the present invention, mapping the tensor operation result to the implicit reference space by using the bijective mapping network and the topology-aware network to obtain the mapping result includes:
performing coordinate transformations on the tensor operation result by using the coordinate transformation modules of the bijective mapping network to obtain a transformation result, wherein the number of coordinate transformation modules in the bijective mapping network equals the dimensionality of the tensor operation result, and each coordinate transformation module performs one coordinate transformation on the tensor operation result;
and mapping the transformation result to the implicit reference space by using the topology-aware network to obtain the mapping result.
According to an embodiment of the present invention, performing the coordinate transformation on the tensor operation result by the coordinate transformation modules of the bijective mapping network to obtain the transformation result includes:
randomly selecting one coordinate axis in the tensor operation result as a transformation reference axis, and translating the tensor values on the transformation reference axis by using the coordinate transformation module to obtain a translation result;
according to the translation result, translating the other coordinate axes in the tensor operation result and rotating them around the transformation reference axis by using the coordinate transformation module, to obtain a primary transformation result;
and iterating the translation operation and the rotation operation until all the coordinate transformation modules have completed their coordinate transformations of the tensor operation result, obtaining the transformation result.
According to an embodiment of the invention, each transformation result is continuously differentiable and strictly cycle-consistent.
According to an embodiment of the invention, the above-mentioned loss function includes a free-space loss function and an on-surface loss function.
According to an embodiment of the present invention, the free-space loss function includes an RGB supervision loss function, a depth map supervision loss function, a mask map supervision loss function, and a regularization term loss function;
wherein the RGB supervision loss function is determined by equation (1), the depth map supervision loss function by equation (2), the mask map supervision loss function by equation (3), and the regularization term loss function by equation (4); equations (1)-(4) appear only as formula images in the original publication and are not reproduced here;
wherein the symbols in these equations denote, respectively: the intrinsic matrix of the RGB camera, the intrinsic matrix of the depth camera, the extrinsic matrix of the camera motion, and the set of sampled rays; C(r) denotes the observed RGB information, D(r) denotes the observed depth information, and M(r) denotes the observed mask information; further symbols denote the predicted RGB information, the predicted depth information and the predicted mask information; BCE is the binary cross-entropy function; and the final symbol denotes the set of sampled points.
According to an embodiment of the present invention, the above-mentioned on-surface loss function includes an SDF loss function and a visibility loss function;
wherein the SDF loss function is determined by equation (5) and the visibility loss function is determined by equation (6); these equations likewise appear only as formula images in the original publication;
wherein n_p denotes the normal direction, v_p denotes the viewing direction, p_i denotes a sampling point on the object surface in the observation space, and d(x) denotes the signed distance value at the corresponding point x in the reference space (a hedged reconstruction of these equations is sketched below).
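Because the formula images are not reproduced here, the following block sketches one plausible set of forms for equations (1)-(6), assuming the standard losses used in neural implicit RGB-D reconstruction: L1 photometric and depth terms, a binary cross-entropy mask term, an Eikonal regularizer, a zero-SDF constraint on observed surface points, and a normal/viewing-direction consistency term. The symbol names (the hatted predicted quantities, the ray set, the point sets) and the exact forms are assumptions, not the patent's disclosed equations.

```latex
% Hedged reconstruction of equations (1)-(6); norms, weights and symbols are assumed.
\begin{align*}
\mathcal{L}_{\mathrm{rgb}}   &= \sum_{r\in\mathcal{R}} \bigl\|\hat{C}(r)-C(r)\bigr\|_1, && (1)\\
\mathcal{L}_{\mathrm{depth}} &= \sum_{r\in\mathcal{R}} \bigl|\hat{D}(r)-D(r)\bigr|, && (2)\\
\mathcal{L}_{\mathrm{mask}}  &= \sum_{r\in\mathcal{R}} \mathrm{BCE}\bigl(\hat{M}(r),\,M(r)\bigr), && (3)\\
\mathcal{L}_{\mathrm{reg}}   &= \sum_{x\in\mathcal{P}} \bigl(\|\nabla_x d(x)\|_2-1\bigr)^2, && (4)\\
\mathcal{L}_{\mathrm{sdf}}   &= \sum_{p_i\in\mathcal{P}_{\mathrm{surf}}} \bigl|d(x_i)\bigr|, && (5)\\
\mathcal{L}_{\mathrm{vis}}   &= \sum_{p_i\in\mathcal{P}_{\mathrm{surf}}} \max\bigl(0,\; n_p\cdot v_p\bigr). && (6)
\end{align*}
```

Here $\mathcal{R}$ would be the set of sampled rays, $\mathcal{P}$ the set of free-space sample points, $\mathcal{P}_{\mathrm{surf}}$ the surface samples, and $x_i$ the reference-space point corresponding to the observed surface point $p_i$.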
According to a second aspect of the present invention, there is provided a three-dimensional dynamic reconstruction training apparatus based on RGB-D sequences, comprising:
an initialization module, configured to randomly initialize a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
an acquisition module, configured to acquire the viewing direction of a depth camera and the normal vectors of sampling points on a target training object, and to acquire RGB information and depth information of the training object through the depth camera;
a preprocessing module, configured to preprocess the RGB-D sequence information acquired by the depth camera to obtain mask information of the training object;
a tensor operation module, configured to perform a tensor operation on the mask information and the deformation code of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
a mapping module, configured to map the tensor operation result to the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result;
a first rendering module, configured to process the mapping result by using the implicit reference space and render the processing result by using a signed distance function to obtain predicted mask information and predicted depth information of the training object;
a second rendering module, configured to compose the viewing direction, the appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and to render the composed result by using a neural radiance field to obtain predicted RGB information of the training object;
an optimization module, configured to input the viewing direction, the normal vectors, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and to optimize the parameters of the three-dimensional dynamic reconstruction model according to the loss value;
and an iteration module, configured to iterate the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimization operation until the loss value meets a preset condition, obtaining the parameter-optimized three-dimensional dynamic reconstruction model.
According to a third aspect of the present invention, there is provided an electronic device comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above-described RGB-D sequence based three-dimensional dynamic reconstruction method.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described RGB-D sequence-based three-dimensional dynamic reconstruction method.
According to the three-dimensional dynamic reconstruction method based on the RGB-D sequence provided by the invention, color and geometry are placed in the same framework for joint optimization through differentiable rendering and implicit representation techniques, so that all observed RGB-D information participates in the optimization process together; meanwhile, a bijective transformation is used as part of the deformation field, so that cycle consistency between different observation frames is strictly satisfied; in addition, a topology-aware network is introduced to construct correspondences that account for topology, so that the model can robustly process complex RGB-D sequences.
Drawings
FIG. 1 is a flow chart of a method of training a parameter optimized three-dimensional dynamic reconstruction model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a three-dimensional dynamic reconstruction method based on an RGB-D sequence according to an embodiment of the invention;
FIG. 3 is a flow chart of obtaining a mapping result according to an embodiment of the invention;
FIG. 4 is a flow chart of obtaining a transformation result according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a bijective mapping network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a reconstruction and rendering effect using the RGB-D sequence-based three-dimensional dynamic reconstruction method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a three-dimensional dynamic reconstruction training device based on an RGB-D sequence according to an embodiment of the present invention;
fig. 8 schematically shows a block diagram of an electronic device adapted to implement a method for three-dimensional dynamic reconstruction based on RGB-D sequences according to an embodiment of the present disclosure.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Three-dimensional dynamic reconstruction is the task of recovering the three-dimensional geometry and motion information of a target object from visual signals (such as color images and depth images) acquired by a device. How to reconstruct the geometry of an object from an RGB-D (Red, Green, Blue, Depth) sequence without a prior is currently a research hotspot. Most current algorithms for RGB-D dynamic reconstruction are based on dynamic fusion: a TSDF volume (truncated signed distance function volume) is stored in a reference space to represent the geometry of the object, and each observation frame (depth information) is warped into the reference frame through a pre-constructed, hierarchical deformation node graph. This fusion process is carried out based on ICP (iterative closest point) registration. The deformation process can be regarded as sparse control points guiding the rotation and translation of points on the dense depth point cloud. Most subsequent related methods are improvements on DynamicFusion. Some methods model color information such as illumination and texture in order to register color images; some add loss functions to handle close-to-open topology changes; others use deep learning modules combined with optical flow information (motion and structure supervision) to make the estimated motion more accurate.
Previous methods are based on sparse node motion and explicit geometric representations, and do not fully exploit RGB color information, so the resulting geometric accuracy is limited. On the other hand, rapidly developing implicit representation methods have in recent years gradually replaced some of the conventional representations. Differentiable rendering allows geometry and appearance to be considered simultaneously in one framework, so that color information can be better used to reconstruct the geometry.
The invention aims to provide a dynamic reconstruction algorithm that fully combines color and depth information, so as to address the insufficient reconstruction accuracy and other problems of existing RGB-D sequence-based algorithms.
According to a first aspect of the present invention, there is provided a three-dimensional dynamic reconstruction method based on an RGB-D sequence, comprising:
and inputting the object to be reconstructed into the three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed.
Fig. 1 is a flowchart of a training method of a parameter-optimized three-dimensional dynamic reconstruction model according to an embodiment of the present invention.
As shown in fig. 1, the training method for obtaining the parameter-optimized three-dimensional dynamic reconstruction model includes operations S110 to S190.
In operation S110, a three-dimensional dynamic reconstruction model is randomly initialized, wherein the three-dimensional dynamic reconstruction model includes a bijective mapping network, a topology aware network, and an implicit reference space.
In operation S120, the viewing direction of the depth camera and the normal vectors of sampling points on the target training object are acquired, and RGB information and depth information of the training object are acquired through the depth camera.
A depth camera, i.e., an RGB-D camera, can, like a laser sensor, measure the distance from an object to the camera by actively emitting light toward the object and receiving the returned light, based on the infrared structured-light or time-of-flight (ToF) principle (see the back-projection sketch below for how a depth pixel relates to a 3D point).
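As background for how a depth pixel, the depth camera's intrinsic matrix (which also appears in the loss functions), and a 3D sampling point are related, the following sketch back-projects one depth pixel into a camera-space point using the standard pinhole model; the intrinsic values are made-up examples, not parameters from the patent.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) with depth z into a 3D camera-space point via the pinhole model."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Made-up intrinsics for a 640x480 depth camera.
K_depth = np.array([[525.0,   0.0, 319.5],
                    [  0.0, 525.0, 239.5],
                    [  0.0,   0.0,   1.0]])
p = backproject(320, 240, 1.25, K_depth)   # a point 1.25 m in front of the optical centre
print(p)                                    # -> approximately [0.001, 0.001, 1.25]
```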
In operation S130, RGB-D sequence information acquired by the depth camera is preprocessed to obtain mask information of the training object.
In operation S140, tensor operation is performed on the mask information and the deformation code of the three-dimensional dynamic reconstruction model to obtain a tensor operation result.
In operation S150, the tensor operation result is mapped to the implicit reference space by using the bijective mapping network and the topology aware network, so as to obtain a mapping result.
In operation S160, the mapping result is processed using the implicit reference space, and the processing result is rendered using a signed distance function, yielding predicted mask information and predicted depth information of the training object.
In operation S170, the viewing direction, the appearance code of the three-dimensional dynamic reconstruction model and the mapping result are composed, and the composed result is rendered using a neural radiance field, yielding the predicted RGB information of the training object.
In operation S180, the viewing direction, the normal vectors, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information are input into a loss function to obtain a loss value, and the parameters of the three-dimensional dynamic reconstruction model are optimized according to the loss value.
In operation S190, the obtaining operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation, and the optimizing operation are iterated until the loss value satisfies the preset condition, so as to obtain the parameter-optimized three-dimensional dynamic reconstruction model.
The preset condition includes, but is not limited to, convergence of the loss value to a certain value, oscillation of the loss value back and forth within a certain interval, and the like.
According to the RGB-D sequence-based three-dimensional dynamic reconstruction method provided by the invention, color and geometry are placed in the same framework for joint optimization through differentiable rendering and implicit representation techniques, so that all observed RGB-D information jointly participates in the optimization process; meanwhile, a bijective transformation is used as part of the deformation field, so that cycle consistency between different observation frames is strictly satisfied; in addition, a topology-aware network is introduced to construct correspondences that account for topology, so that the model can robustly process complex RGB-D sequences.
The implicit representation method treats an object as an implicit surface, and each point in space is associated with physical attributes of the object (such as density and color). Thus, when points are sampled in space, the corresponding aggregated attribute values can be obtained through mathematical and physical formulas, and the acquired visual signals (such as RGB images and depth maps) can be used to supervise the learning of the implicit surface's attributes. For dynamic sequences, the usual technical route models the motion and the geometric appearance information of the object separately: an implicit deformation field associates the sampling points with the motion attributes of the object, while a reference geometric appearance representation associates the sampling points with the geometry and texture information of the object. The deformation field transforms the sampling points to corresponding points in the reference space, so that the corresponding color and depth information is obtained in the reference space and the RGB-D information can be used for supervised learning. In the method provided by the invention, both the deformation field and the reference space are implicit, and the collected visual information of the object (all RGB-D observation frames) is jointly optimized in the same framework. The coupling introduced by joint optimization must be considered: the non-rigid deformation information and the reference-space information are iterated jointly during optimization, so the solving process easily falls into a local optimum, and accurate motion and geometric information cannot be obtained. The root of this problem is that it is not easy to enforce cycle consistency between arbitrary observation frames. Previous methods construct a loss function combining forward and backward warping as a constraint, but this does not fundamentally resolve the issue.
In order to solve the cycle consistency problem between different observation frames, the invention introduces a bijective mapping module into the deformation field. In this bijective representation, each deformation step is regarded as a translation along and a rotation around a certain coordinate axis, which strictly guarantees a one-to-one correspondence between the observation space and the reference space, thereby reasonably guiding the convergence of the solution and yielding a reconstructed surface with higher fidelity. On the other hand, bijections are topology-preserving, whereas topology changes are quite common in real scenes; therefore a representation of topological properties is introduced into the algorithm, so that the implicit deformation field can obtain a topology-aware mapping relation during reconstruction.
In summary, in the method provided by the present invention, a sampling point in the observation space is deformed into the reference space through an implicit deformation field (composed of a bijective transformation and a topological transformation); after the corresponding physical attributes are obtained in the reference space, the aggregated visual information (color image and depth image) is obtained through an integral formula (volume rendering, sketched below), and finally the RGB-D sequence is used for supervision. In this way, all observation information can be jointly optimized in the same implicit differentiable framework, and the algorithm finally obtains reasonable object motion information and geometric texture information for the scene.
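To make the "integral formula (volume rendering)" step concrete, the following self-contained sketch composites per-sample colors and depths along a single ray using standard alpha-compositing weights. The density values, and the way densities would be derived from the implicit SDF and radiance fields, are placeholders for illustration, not the patent's exact formulation.

```python
import torch

def render_ray(t, density, color):
    """Composite samples along one ray.
    t: (S,) sample depths; density: (S,) non-negative densities; color: (S, 3) per-sample RGB."""
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])   # distances between consecutive samples
    alpha = 1.0 - torch.exp(-density * delta)                   # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-7]), 0)[:-1]  # transmittance
    w = alpha * trans                                           # compositing weights
    rgb = (w[:, None] * color).sum(0)                           # rendered color value
    depth = (w * t).sum(0)                                      # rendered depth value
    mask = w.sum(0)                                             # accumulated opacity (rendered mask)
    return rgb, depth, mask

# Toy inputs: 64 samples between 0.5 m and 4 m, with density concentrated near 2 m (a "surface").
t = torch.linspace(0.5, 4.0, 64)
density = 50.0 * torch.exp(-((t - 2.0) ** 2) / 0.005)
color = torch.rand(64, 3)
rgb, depth, mask = render_ray(t, density, color)
print(depth, mask)   # depth should be close to 2.0 and mask close to 1.0
```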
Fig. 2 is a schematic structural diagram of a three-dimensional dynamic reconstruction method based on an RGB-D sequence according to an embodiment of the present invention.
As shown in fig. 2, assume that an RGB-D sequence is acquired with a depth camera, where each frame i consists of the RGB information of that frame and the corresponding depth map. First, the object information in each frame is extracted by an existing video segmentation method, and the resulting mask image (mask) is recorded for each frame.
As can be seen from fig. 2, for each time t_i the motion information of the object is represented by an implicit deformation field, and the geometric and color information of the object is represented by an implicit field in a reference space. The method provided by the invention first maps a sampling point p_i = [x_i, y_i, z_i] in the observation space, through the bijective mapping network F_h and the topology-aware network F_q, into a reference space containing topological information; the corresponding point x in the reference space is associated with the geometric information d(x), the color information c_i, and so on. Finally, a color image and a depth image are obtained by rendering through an integral formula, and the observed frames are used for supervised optimization. The deformation code associated with each time instant also serves as an input to the deformation field, while the appearance code ψ_i and the viewing direction v_i at each instant also serve as inputs to the reference space.
The method provided by the invention treats the object as an implicit surface in the reference space; the geometric attributes of the object are obtained through an implicit SDF (signed distance function) network F_d, and the color attributes are obtained through a neural radiance field F_c. Sampling is performed along the ray passing through each pixel, and the physical attributes of the sampling points are integrated to obtain the rendered color value and depth value, which are supervised using the captured RGB-D sequence.
Fig. 3 is a flow chart of obtaining a mapping result according to an embodiment of the present invention.
As shown in fig. 3, mapping the tensor operation result to the implicit reference space by using the bijective mapping network and the topology aware network, and obtaining the mapping result includes operations S310 to S320.
In operation S310, coordinate transformations are performed on the tensor operation result by using the coordinate transformation modules of the bijective mapping network to obtain a transformation result, where the number of coordinate transformation modules in the bijective mapping network is the same as the dimensionality of the tensor operation result, and each coordinate transformation module of the bijective mapping network performs one coordinate transformation on the tensor operation result.
In operation S320, the mapping result is obtained by mapping the transformation result to the implicit reference space using the topology aware network.
Fig. 4 is a flow chart of obtaining a transformation result according to an embodiment of the present invention.
As shown in fig. 4, the coordinate transformation of the tensor operation result by the coordinate transformation module of the bijective mapping network includes operations S410 to S430.
In operation S410, one coordinate axis in the tensor operation result is randomly selected as a transformation reference axis, and the coordinate transformation module is used to translate the tensor value on the transformation reference axis to obtain a translation result.
In operation S420, according to the translation result, the coordinate transformation module is used to translate other coordinate axes in the tensor operation result and rotate around the transformation reference axis, so as to obtain a primary transformation result.
In operation S430, the translation operation and the rotation operation are iterated until all the coordinate transformation modules complete the coordinate transformation of the tensor operation result, so as to obtain a transformation result.
According to an embodiment of the invention, each transformation result is continuously differentiable and strictly cycle-consistent.
To better illustrate the structure and function of the bijective mapping network of the present invention, the structure and function of the bijective mapping network provided by the present invention are further described in detail below with reference to fig. 5.
Fig. 5 is a schematic structural diagram of a bijective mapping network according to an embodiment of the present invention.
As shown in fig. 5, the bijective mapping network F_h provided by the present invention is composed of a plurality of modules (i.e., the coordinate transformation modules described above). Each module models, for a certain coordinate axis, the motion of the point coordinates [u, v, w] along and around that axis; in this process, the point coordinates are divided into two parts, and the transformation of one part is inferred from the other part. The mapping constructed in this way is invertible, and the inverse is relatively easy to compute (it has the same computational complexity as forward inference). Assuming the coordinate axis w is selected, the point coordinates [u, v, w] are split into the two parts [u, v] and [w]. Module l first obtains, from [u, v] and the deformation code, the translation δ_w along the coordinate axis w, which maps [w] to [w']; module l then obtains, from [w'] and the deformation code, a rotation R_uv about the coordinate axis w and a translation δ_uv, which map [u, v] to [u', v']. Each module of the bijective representation thus models motion information along and around one coordinate axis, and the cycle consistency of the mapping between different observation frames is strictly guaranteed. Such a deformation strictly satisfies cycle consistency between different observation frames and is continuous and differentiable (a minimal sketch of one such module is given below).
Considering that topology changes are very common in the real world and that bijections are strictly topology-preserving, the invention further introduces a topology-aware network F_q, which maps a sampling point p_i of the observation space to an m-dimensional topology coordinate q(p_i). In this way, a topology-aware deformation map is constructed, which maps the sampling points of the observation space to the corresponding points x of the reference space.
The differentiable rendering technique is adopted so that geometric and color information can be jointly optimized in the same framework, with the collected RGB-D information serving as supervision to guide the modeling. An implicit deformation field is constructed between the observation space and the reference space, so that the sampling points are mapped to corresponding points of the reference space. An implicit SDF field and a neural radiance field are defined in the reference space to model geometry and color, respectively. The invention introduces a bijective representation in the implicit deformation field, which strictly satisfies cycle consistency between different observation frames. To account for the topology problem, the invention also introduces a topology-aware network, so that the whole model can accurately model correspondences that involve topology changes.
According to an embodiment of the invention, the above-mentioned loss function includes a free-space loss function and an on-surface loss function.
According to an embodiment of the present invention, the free-space loss function includes an RGB supervision loss function, a depth map supervision loss function, a mask map supervision loss function, and a regularization term loss function;
wherein the RGB supervision loss function is determined by equation (1), the depth map supervision loss function by equation (2), the mask map supervision loss function by equation (3), and the regularization term loss function by equation (4); as above, equations (1)-(4) appear only as formula images in the original publication;
wherein the symbols in these equations denote, respectively: the intrinsic matrix of the RGB camera, the intrinsic matrix of the depth camera, the extrinsic matrix of the camera motion, and the set of sampled rays; C(r) denotes the observed RGB information, D(r) denotes the observed depth information, and M(r) denotes the observed mask information; further symbols denote the predicted RGB information, the predicted depth information and the predicted mask information; BCE is the binary cross-entropy function; and the final symbol denotes the set of sampled points.
According to an embodiment of the present invention, the above-mentioned on-surface loss function includes an SDF loss function and a visibility loss function;
wherein the SDF loss function is determined by equation (5) and the visibility loss function is determined by equation (6), which likewise appear only as formula images in the original publication;
wherein n_p denotes the normal direction, v_p denotes the viewing direction, p_i denotes a sampling point on the object surface in the observation space, and d(x) denotes the signed distance value at the corresponding point x in the reference space.
The method provided by the invention treats the object as an implicit surface in the reference space; the geometric attributes of the object are obtained through the implicit SDF (signed distance function) network F_d, and the color attributes are obtained through the neural radiance field F_c. Sampling is performed along the ray passing through each pixel, and the physical attributes of the sampling points are integrated to obtain the rendered color value and depth value, which are supervised using the captured RGB-D sequence. The loss function is divided into a free-space loss function and an on-surface loss function: the free-space loss function comprises RGB supervision, depth map supervision, mask map supervision and a regularization term, while the on-surface loss function includes an SDF loss function and a visibility loss function.
Fig. 6 is a schematic diagram of a reconstruction and rendering effect by using the RGB-D sequence-based three-dimensional dynamic reconstruction method according to an embodiment of the present invention.
As shown in fig. 6, the three-dimensional dynamic reconstruction method provided by the invention can be applied to the reconstruction and rendering of animals (including humans), plants and man-made abstract objects, and achieves good reconstruction and rendering results. Meanwhile, as can be seen from fig. 6, the method provided by the invention can reconstruct and render multiple poses of different types of objects, showing good robustness and generalization. Because color and geometry are jointly optimized in the same framework through differentiable rendering and implicit representation techniques, all observed RGB-D information participates in the optimization process together, and because the bijective mapping network and the topology-aware network are introduced, the reconstruction and rendering accuracy is greatly improved; as can be seen from fig. 6, the method provided by the invention reconstructs and renders the target to be reconstructed well.
Fig. 7 is a schematic structural diagram of a three-dimensional dynamic reconstruction training device 700 based on an RGB-D sequence according to an embodiment of the present invention.
As shown in fig. 7, the three-dimensional dynamic reconstruction training device 700 comprises an initialization module 710, an acquisition module 720, a preprocessing module 730, a tensor operation module 740, a mapping module 750, a first rendering module 760, a second rendering module 770, an optimization module 780, and an iteration module 790.
The initialization module 710 is configured to randomly initialize a three-dimensional dynamic reconstruction model, where the three-dimensional dynamic reconstruction model includes a bijective mapping network, a topology aware network, and an implicit reference space.
The acquisition module 720 is configured to acquire the viewing direction of the depth camera and the normal vectors of sampling points on the target training object, and to acquire RGB information and depth information of the training object through the depth camera.
The preprocessing module 730 is configured to preprocess the RGB-D sequence information acquired by the depth camera to obtain mask information of the training object.
And the tensor operation module 740 is configured to perform tensor operation on the mask information and the deformation code of the three-dimensional dynamic reconstruction model to obtain a tensor operation result.
The mapping module 750 is configured to map the tensor operation result to the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result.
The first rendering module 760 is configured to process the mapping result using the implicit reference space and render the processing result using a signed distance function to obtain predicted mask information and predicted depth information of the training object.
The second rendering module 770 is configured to compose the viewing direction, the appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and to render the composed result using a neural radiance field to obtain the predicted RGB information of the training object.
The optimization module 780 is configured to input the viewing direction, the normal vectors, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and to optimize the parameters of the three-dimensional dynamic reconstruction model according to the loss value.
And the iteration module 790 is configured to iterate the obtaining operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation, and the optimization operation until the loss value meets a preset condition, so as to obtain a parameter-optimized three-dimensional dynamic reconstruction model.
Fig. 8 schematically shows a block diagram of an electronic device adapted to implement a method for three-dimensional dynamic reconstruction based on RGB-D sequences according to an embodiment of the present disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to bus 804. Electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement a method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 802 and/or RAM 803 described above and/or one or more memories other than the ROM 802 and RAM 803.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A three-dimensional dynamic reconstruction method based on an RGB-D sequence comprises the following steps:
inputting an object to be reconstructed into a three-dimensional dynamic reconstruction model after parameter optimization to obtain a geometric reconstruction model and a color reconstruction model of the object to be reconstructed, wherein the three-dimensional dynamic reconstruction model after parameter optimization is obtained by training through the following method:
randomly initializing a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
acquiring a viewing direction of a depth camera and normal vectors of sampling points on a target training object, and acquiring RGB information and depth information of the training object through the depth camera;
preprocessing RGB-D sequence information acquired by the depth camera to obtain mask information of the training object;
performing a tensor operation on the mask information and a deformation code of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
mapping the tensor operation result to the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result;
processing the mapping result by using the implicit reference space, and rendering the processing result by using a signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object;
composing the viewing direction, an appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and rendering the composed result by using a neural radiance field to obtain predicted RGB information of the training object;
inputting the viewing direction, the normal vectors, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and optimizing parameters of the three-dimensional dynamic reconstruction model according to the loss value;
and iterating the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimization operation until the loss value meets a preset condition, to obtain the parameter-optimized three-dimensional dynamic reconstruction model.
2. The method of claim 1, wherein the mapping of the tensor operation result into the implicit reference space using the bijective mapping network and the topology-aware network comprises:
performing coordinate transformations on the tensor operation result by using coordinate transformation modules of the bijective mapping network to obtain a transformation result, wherein the number of coordinate transformation modules of the bijective mapping network is the same as the dimensionality of the tensor operation result, and each coordinate transformation module of the bijective mapping network is configured to perform one coordinate transformation on the tensor operation result;
and mapping the transformation result to the implicit reference space by using the topology-aware network to obtain the mapping result.
3. The method of claim 2, wherein the coordinate transforming the tensor operation result by the coordinate transforming module of the bijective mapping network to obtain a transformed result comprises:
randomly selecting one coordinate axis in the tensor operation result as a transformation reference axis, and translating the tensor value on the transformation reference axis by using the coordinate transformation module to obtain a translation result;
according to the translation result, the coordinate transformation module is utilized to translate other coordinate axes in the tensor operation result and rotate around the transformation reference axis to obtain a primary transformation result;
and iterating the translation operation and the rotation operation until all the coordinate transformation modules complete the coordinate transformation of the tensor operation result to obtain a transformation result.
4. The method according to any one of claims 2 to 3, wherein each of said transformation results is continuously differentiable and strictly cycle-consistent.
5. The method of claim 1, wherein the loss function comprises a free-space loss function and an on-surface loss function.
6. The method of claim 5, wherein the free-space loss function comprises an RGB supervision loss function, a depth map supervision loss function, a mask map supervision loss function, and a regularization term loss function;
wherein the RGB supervision loss function is determined by equation (1), the depth map supervision loss function by equation (2), the mask map supervision loss function by equation (3), and the regularization term loss function by equation (4); equations (1)-(4) appear only as formula images in the original publication;
wherein the symbols in these equations denote, respectively: the intrinsic matrix of the RGB camera, the intrinsic matrix of the depth camera, the extrinsic matrix of the camera motion, and the set of sampled rays; C(r) denotes the observed RGB information, D(r) denotes the observed depth information, and M(r) denotes the observed mask information; further symbols denote the predicted RGB information, the predicted depth information and the predicted mask information; BCE is the binary cross-entropy function; and the final symbol denotes the set of sampled points.
7. The method of claim 5, wherein the on-surface loss function comprises an SDF loss function and a visibility loss function;
wherein the SDF loss function is determined by equation (5) and the visibility loss function is determined by equation (6), which likewise appear only as formula images in the original publication;
wherein n_p denotes the normal direction, v_p denotes the viewing direction, p_i denotes a sampling point on the object surface in the observation space, and d(x) denotes the signed distance value at the corresponding point x in the reference space.
8. A three-dimensional dynamic reconstruction training device based on an RGB-D sequence comprises:
an initialization module, configured to randomly initialize a three-dimensional dynamic reconstruction model, wherein the three-dimensional dynamic reconstruction model comprises a bijective mapping network, a topology-aware network and an implicit reference space;
an acquisition module, configured to acquire a viewing direction of a depth camera and normal vectors of sampling points on a target training object, and to acquire RGB information and depth information of the training object through the depth camera;
a preprocessing module, configured to preprocess RGB-D sequence information acquired by the depth camera to obtain mask information of the training object;
a tensor operation module, configured to perform a tensor operation on the mask information and a deformation code of the three-dimensional dynamic reconstruction model to obtain a tensor operation result;
a mapping module, configured to map the tensor operation result to the implicit reference space by using the bijective mapping network and the topology-aware network to obtain a mapping result;
a first rendering module, configured to process the mapping result using the implicit reference space and render the processing result using a signed distance function to obtain predicted mask information of the training object and predicted depth information of the training object;
a second rendering module, configured to compose the viewing direction, an appearance code of the three-dimensional dynamic reconstruction model and the mapping result, and to render the composed result using a neural radiance field to obtain predicted RGB information of the training object;
an optimization module, configured to input the viewing direction, the normal vectors, the mask information, the RGB information, the depth information, the predicted mask information, the predicted RGB information and the predicted depth information into a loss function to obtain a loss value, and to optimize parameters of the three-dimensional dynamic reconstruction model according to the loss value;
and an iteration module, configured to iterate the acquisition operation, the preprocessing operation, the tensor operation, the mapping operation, the rendering operation and the optimization operation until the loss value meets a preset condition, to obtain a parameter-optimized three-dimensional dynamic reconstruction model.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
CN202210704281.3A 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment Active CN115018989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210704281.3A CN115018989B (en) 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210704281.3A CN115018989B (en) 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115018989A true CN115018989A (en) 2022-09-06
CN115018989B CN115018989B (en) 2024-03-29

Family

ID=83076186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210704281.3A Active CN115018989B (en) 2022-06-21 2022-06-21 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115018989B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115409931A (en) * 2022-10-31 2022-11-29 苏州立创致恒电子科技有限公司 Three-dimensional reconstruction method based on image and point cloud data fusion
CN116704572A (en) * 2022-12-30 2023-09-05 荣耀终端有限公司 Eye movement tracking method and device based on depth camera

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020242170A1 (en) * 2019-05-28 2020-12-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
WO2021042277A1 (en) * 2019-09-03 2021-03-11 浙江大学 Method for acquiring normal vector, geometry and material of three-dimensional object employing neural network
CN113538682A (en) * 2021-07-19 2021-10-22 北京的卢深视科技有限公司 Model training method, head reconstruction method, electronic device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020242170A1 (en) * 2019-05-28 2020-12-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
WO2021042277A1 (en) * 2019-09-03 2021-03-11 浙江大学 Method for acquiring normal vector, geometry and material of three-dimensional object employing neural network
CN113538682A (en) * 2021-07-19 2021-10-22 北京的卢深视科技有限公司 Model training method, head reconstruction method, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
雷宝全 (LEI Baoquan); 姚剑敏 (YAO Jianmin); 严群 (YAN Qun); 林志贤 (LIN Zhixian); 陈炜炜 (CHEN Weiwei): "基于Kinect的彩色三维重建" [Kinect-based color three-dimensional reconstruction], 有线电视技术 [Cable TV Technology], no. 12, 15 December 2019 (2019-12-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115409931A (en) * 2022-10-31 2022-11-29 苏州立创致恒电子科技有限公司 Three-dimensional reconstruction method based on image and point cloud data fusion
CN116704572A (en) * 2022-12-30 2023-09-05 荣耀终端有限公司 Eye movement tracking method and device based on depth camera

Also Published As

Publication number Publication date
CN115018989B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN112613609B (en) Nerve radiation field enhancement method based on joint pose optimization
CN115018989B (en) Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN112288851B (en) Three-dimensional face modeling method based on double branch flow network
CN114648613B (en) Three-dimensional head model reconstruction method and device based on deformable nerve radiation field
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN112614070B (en) defogNet-based single image defogging method
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
KR20230150867A (en) Multi-view neural person prediction using implicit discriminative renderer to capture facial expressions, body posture geometry, and clothing performance
CN117522990B (en) Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Jung et al. Learning free-form deformation for 3D face reconstruction from in-the-wild images
Zhang et al. Avatarstudio: High-fidelity and animatable 3d avatar creation from text
Tosi et al. How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: a Survey
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN114783039B (en) Motion migration method driven by 3D human body model
CN114049678B (en) Facial motion capturing method and system based on deep learning
CN115409949A (en) Model training method, visual angle image generation method, device, equipment and medium
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
CN115482481A (en) Single-view three-dimensional human skeleton key point detection method, device, equipment and medium
JP2022036075A (en) Method for training neural network to deliver viewpoints of objects using unlabeled pairs of images, and corresponding system
CN113034675A (en) Scene model construction method, intelligent terminal and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant