CN113112607B - Method and device for generating three-dimensional grid model sequence with any frame rate - Google Patents


Info

Publication number
CN113112607B
CN113112607B (application CN202110416920.1A)
Authority
CN
China
Prior art keywords: dimensional, sequence, point cloud, model, network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110416920.1A
Other languages
Chinese (zh)
Other versions
CN113112607A (en)
Inventor
付彦伟 (Yanwei Fu)
姜柏言 (Boyan Jiang)
张寅达 (Yinda Zhang)
薛向阳 (Xiangyang Xue)
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202110416920.1A
Publication of CN113112607A
Application granted
Publication of CN113112607B
Status: Active

Classifications

    • G06T 17/20: Finite element generation, e.g. wire-frame surface description, tesselation (under G06T 17/00, Three-dimensional [3D] modelling)
    • G06F 17/13: Differential equations (under G06F 17/11, Complex mathematical operations for solving equations and optimization problems)
    • G06N 3/045: Combinations of networks (under G06N 3/04, Neural network architecture)
    • G06N 3/08: Learning methods (under G06N 3/02, Neural networks)


Abstract

The invention provides a method and a device for generating a three-dimensional mesh model sequence at an arbitrary frame rate, belonging to the field of three-dimensional computer vision and used to process a sparse three-dimensional point cloud sequence into the three-dimensional mesh model corresponding to any time in the sequence. The method comprises the following steps: step S1, preprocessing a pre-acquired three-dimensional model data set comprising a plurality of three-dimensional models to obtain training samples; step S2, building three-dimensional point cloud feature extraction networks; step S3, randomly determining whether to exchange the identity codes of paired three-dimensional point cloud sequences; step S4, building a neural ordinary differential equation network; step S5, building a deep decoding network; step S6, constructing a loss function; step S7, training the three-dimensional model generation model based on the loss function; and step S8, inputting a single three-dimensional point cloud sequence and the query time T = t into the trained three-dimensional model generation model.

Description

Method and device for generating three-dimensional grid model sequence with any frame rate
Technical Field
The invention belongs to the field of three-dimensional computer vision, and in particular relates to a method and a device that process a sparse three-dimensional point cloud sequence, using a combined representation method based on neural ordinary differential equations, to obtain a three-dimensional mesh model sequence at an arbitrary frame rate.
Background
Shape representation is one of the core topics of three-dimensional computer vision, especially in the deep learning era. Recently, deep implicit representations have shown an encouraging ability to reconstruct accurate surface details. However, we humans live in a four-dimensional world with a time dimension, and most objects and scenes we see every day move or deform over time. Many existing applications also require machines capable of understanding or reconstructing 4D data, such as autonomous driving, robotics, and virtual/augmented reality. How to reconstruct four-dimensional data, i.e. three-dimensional objects that change over time, remains an open problem.
Some deep-learning-based three-dimensional reconstruction algorithms can be extended fairly directly to four-dimensional space. For example, a point cloud generation model that reconstructs the point cloud of an object surface from a single color image can be extended to predict the trajectories of the three-dimensional points instead of their static coordinates, thereby achieving four-dimensional point cloud reconstruction. Alternatively, the implicit surface of a three-dimensional object can be modeled with a neural network: query points are sampled in a given volume, the network predicts the probability that each point lies inside the object, and a surface extraction algorithm finally produces the three-dimensional mesh model. Such a method can sample query points directly in four-dimensional space and model the surfaces corresponding to different instants. However, these approaches are simple extensions of existing three-dimensional reconstruction algorithms and cannot accurately capture the motion of an object.
In addition, a velocity field can be constructed with a neural ordinary differential equation: the velocity of each three-dimensional point at a given instant is predicted, and an ordinary differential equation solver then integrates each point's position over time. During inference, the mesh model of the first frame is reconstructed with a deep implicit representation, and the coordinates of its points are transformed directly by the neural ordinary differential equation to obtain the mesh model at any instant. However, transforming points directly in three-dimensional space limits the model's representation capability, producing unreasonable movement of certain parts of the reconstructed object and a lack of surface detail.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method and a device for generating a three-dimensional mesh model sequence at an arbitrary frame rate, adopting the following technical scheme:
the invention provides a method for generating a three-dimensional grid model sequence with any frame rate, which is used for processing a sparse three-dimensional point cloud sequence to obtain a three-dimensional grid model corresponding to any time in the sequence, and is characterized by comprising the following steps of: step S1, preprocessing a three-dimensional model data set which is obtained in advance and comprises a plurality of three-dimensional models to obtain training samples, wherein the training samples comprise paired three-dimensional point cloud sequences and label information of three-dimensional sampling points which are positioned inside or outside the three-dimensional models; step S2, building three-dimensional point cloud feature extraction networks, extracting initial attitude features, global geometric features and global motion features of paired three-dimensional point cloud sequences respectively, and expressing the initial attitude features, the global geometric features and the global motion features as initial attitude codes, identity codes and motion codes respectively; step S3, randomly determining whether to exchange the identity codes of the paired three-dimensional point cloud sequences; step S4, building a neural ordinary differential equation network, and transforming the initial attitude code by taking the motion code corresponding to the three-dimensional point cloud sequence as a guide code to obtain the attitude code corresponding to the query time T; step S5, building a deep decoding network; step S6, constructing a loss function; step S7, training a three-dimensional model generation model composed of a three-dimensional point cloud feature extraction network, a neural ordinary differential equation network and a deep decoding network based on a loss function to obtain a trained three-dimensional model generation model; and step S8, inputting the single three-dimensional point cloud sequence and the query time T ═ T 
into the trained three-dimensional model generation model, extracting the surface of the three-dimensional surface implicit function by using a predetermined surface extraction algorithm, obtaining a three-dimensional mesh model corresponding to the query time T ═ T, and outputting the three-dimensional mesh model, where the query time T ═ T may be any scalar value in a specified time range.
The method for generating a three-dimensional mesh model sequence at an arbitrary frame rate according to the present invention may further have the feature that, in step S5, a three-dimensional surface implicit function is modeled: the pose code at time T = t corresponding to the three-dimensional point cloud sequence is concatenated with the identity code as a guide code, and the probability that a three-dimensional point sampled in a specified volume lies inside the three-dimensional model is predicted, thereby establishing the deep decoding network.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that, in step S5, the classic SMPL parameterized human body model is used, taking the pose code and the identity code as input to obtain the corresponding mesh vertex coordinates, thereby establishing the deep decoding network.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that the three-dimensional point cloud feature extraction networks are an initial pose feature extraction network, a global geometric feature extraction network and a global motion feature extraction network. Each consists of five cascaded residual blocks: the first four residual blocks each comprise two fully connected layers, one max pooling layer and one expansion connection operation, and the last residual block comprises three fully connected layers and one max pooling layer. The global geometric feature extraction network and the initial pose feature extraction network extract the global geometric features and initial pose features from the first frame point cloud of a three-dimensional point cloud sequence, while the global motion feature extraction network extracts the global motion features from the whole three-dimensional point cloud sequence.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that the expansion connection operation of a residual block copies the features output by the max pooling layer N times and then concatenates the copied features with the input of the max pooling layer along the feature dimension, where N is the number of points in the input three-dimensional point cloud sequence.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that the neural ordinary differential equation network comprises a vector field network and a predetermined ordinary differential equation solver, where the vector field network consists of five cascaded residual blocks each containing two fully connected layers.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that, in step S5, the deep decoding network consists of five cascaded residual blocks each containing two fully connected layers, and the sampled three-dimensional points are the sampling points in the training data.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that the preprocessing comprises the following steps: step T1, sampling the three-dimensional model to obtain three-dimensional point cloud coordinates; and step T2, normalizing the three-dimensional model into a unit cube with side length 1, sampling inside the unit cube, and determining with a predetermined algorithm whether each sampling point lies inside or outside the three-dimensional model, the sampling point receiving the label value 1 or 0 respectively.
The method for generating a three-dimensional mesh model sequence at any frame rate provided by the invention may also have the feature that the loss function is a binary cross-entropy loss function.
The invention also provides a device for generating a three-dimensional mesh model sequence at an arbitrary frame rate, characterized by comprising: a three-dimensional point cloud sequence data acquisition unit for acquiring three-dimensional point cloud sequence data of an object to be modeled; a mesh model generation unit that generates, from the three-dimensional point cloud sequence, the three-dimensional mesh model corresponding to an arbitrary time in the sequence by the method of generating a three-dimensional mesh model sequence at an arbitrary frame rate according to any one of claims 1 to 10; and a model output unit for outputting the three-dimensional mesh model.
Action and effects of the invention
According to the method and device for generating a three-dimensional mesh model sequence at an arbitrary frame rate, a three-dimensional point cloud sequence is used to generate the mesh model sequence. After four-dimensional data (a sparse point cloud sequence discrete in time) is obtained, three separate feature extraction networks extract an initial pose code, an identity code and a motion code from the whole sequence to jointly represent the four-dimensional data. The neural ordinary differential equation network then transforms the initial pose code in the latent space under the guidance of the motion code, yielding the pose code corresponding to any instant; this provides strong expressive power and obtains continuous pose information from discrete input data. Furthermore, because the identity code is independent of time, concatenating it with the pose codes at different instants as the guidance of the deep decoding network gives the three-dimensional mesh models reconstructed at different instants better geometric consistency. By decoupling three kinds of information (initial pose, identity and motion) and representing four-dimensional data with the combined representation method based on a neural ordinary differential equation, the method enables novel tasks such as three-dimensional motion transfer, four-dimensional temporal completion, four-dimensional spatial completion and future motion prediction.
Drawings
FIG. 1 is a block diagram of an apparatus for generating a three-dimensional mesh model sequence at an arbitrary frame rate according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of generating a sequence of three-dimensional mesh models at arbitrary frame rates in an embodiment of the invention;
FIG. 3 is a schematic structural diagram of a method for generating a three-dimensional mesh model sequence at an arbitrary frame rate according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a three-dimensional point cloud feature extraction network according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a vector field network in a neural ordinary differential equation in accordance with an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a deep three-dimensional surface implicit function network according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings.
< example >
This embodiment provides a method and a device for generating a three-dimensional mesh model sequence at an arbitrary frame rate from a three-dimensional point cloud sequence. Based on a combined representation method using a neural ordinary differential equation, a sparse three-dimensional point cloud sequence is processed to obtain the three-dimensional mesh model corresponding to any time in the sequence, for a modeling user to view and apply.
Fig. 1 is a block diagram showing the structure of an apparatus for generating a three-dimensional mesh model sequence at an arbitrary frame rate according to an embodiment of the present invention.
As shown in fig. 1, the apparatus 10 for generating a three-dimensional mesh model sequence at an arbitrary frame rate includes a three-dimensional point cloud sequence data acquisition unit 101, a mesh model generation unit 102, a model output unit 103, and a control unit 104.
The three-dimensional point cloud sequence data acquisition unit 101 acquires three-dimensional point cloud sequence data of the object to be modeled.
The mesh model generation unit 102 generates the three-dimensional mesh model corresponding to an arbitrary time in the three-dimensional point cloud sequence, based on the method of representing four-dimensional data with a combined neural-ordinary-differential-equation representation.
The method for generating a three-dimensional mesh model sequence at any frame rate by using the three-dimensional point cloud sequence is described in detail below.
The model output unit 103 can output a three-dimensional mesh model at any time for a modeling user to view or directly apply to the fields of industrial design, digital entertainment, and the like.
The control unit 104 controls the three-dimensional point cloud sequence data acquisition unit 101, the mesh model generation unit 102, and the model output unit 103 to realize the corresponding functions.
Fig. 2 is a flowchart of a method for generating a three-dimensional mesh model sequence at an arbitrary frame rate in an embodiment of the present invention, and fig. 3 is a schematic structural diagram of the method for generating a three-dimensional mesh model sequence at an arbitrary frame rate in an embodiment of the present invention.
As shown in fig. 2 and fig. 3, the method for generating a three-dimensional mesh model sequence at an arbitrary frame rate by using a three-dimensional point cloud sequence includes the following steps:
step S1, preprocessing a pre-acquired three-dimensional model data set to obtain a training sample, wherein the training sample comprises a paired three-dimensional point cloud sequence and label information of a three-dimensional point sampled in a specified volume positioned inside or outside the surface of the three-dimensional model in the three-dimensional model data set.
In this embodiment, two different three-dimensional model data sets are used. The first is a data augmentation of the public D-FAUST data set: 1065 SMPL human motion model sequences covering 10 different people and 14 different actions. All sequences are divided into subsequences of 17 frames, i.e. the three-dimensional point cloud sequences, with a sparse point cloud of 300 points in each frame; the points correspond across instants, e.g. the first point in the first frame's point cloud corresponds to the first point in the second frame's point cloud.
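The subsequence splitting described above can be sketched as follows; shapes and the helper name are illustrative stand-ins, not from the patent.

```python
import numpy as np

def split_into_subsequences(sequence, sub_len=17):
    """Split a (F, N, 3) point cloud sequence into (K, sub_len, N, 3) subsequences."""
    n_frames = sequence.shape[0]
    n_subs = n_frames // sub_len  # drop any incomplete tail
    return sequence[: n_subs * sub_len].reshape(n_subs, sub_len, *sequence.shape[1:])

# A toy "motion sequence": 51 frames, 300 corresponding points per frame.
seq = np.random.rand(51, 300, 3)
subs = split_into_subsequences(seq)
print(subs.shape)  # (3, 17, 300, 3)
```

Because the points correspond across frames, slicing along the frame axis preserves the per-point correspondence inside every subsequence.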
From these, 1040 three-dimensional point cloud sequences are selected as the training set, 5 sequences as the validation set, and the remaining sequences as the test set. The second data set is a CAD car model data set with non-rigid motion, constructed following the method proposed in the existing work Occupancy Flow; it contains 10 different cars with 10 different actions each, 1000 car mesh model sequences in total.
The preprocessing performed in step S1 comprises the following specific steps:
Step T1, sampling the three-dimensional model to obtain three-dimensional point cloud coordinates; in this embodiment, the SMPL and CAD mesh models are sampled.
Step T2, normalizing the three-dimensional model into a unit cube with side length 1, sampling inside the unit cube, and determining with a predetermined algorithm whether each sampling point lies inside or outside the three-dimensional model, assigning the label value 1 or 0 respectively.
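A minimal sketch of step T2, assuming a unit sphere as a stand-in for the mesh (the patent does not name the inside/outside test, so an analytic radius check serves as the assumed placeholder):

```python
import numpy as np

def normalize_to_unit_cube(points):
    """Translate and scale points so their bounding box fits a side-1 cube at the origin."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()
    return (points - center) / scale  # all coordinates now within [-0.5, 0.5]

rng = np.random.default_rng(0)
surface_pts = rng.normal(size=(1000, 3))
surface_pts /= np.linalg.norm(surface_pts, axis=1, keepdims=True)  # toy "mesh": a sphere
norm_pts = normalize_to_unit_cube(surface_pts)

# Sample query points uniformly inside the unit cube; label 1 if inside the
# shape (here approximated by a radius-0.5 ball), 0 if outside.
queries = rng.uniform(-0.5, 0.5, size=(2048, 3))
labels = (np.linalg.norm(queries, axis=1) <= 0.5).astype(np.int64)
```

In the patent's pipeline these (point, label) pairs are exactly the supervision consumed by the occupancy decoder.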
Step S2, building the three-dimensional point cloud feature extraction networks and extracting the initial pose features, global geometric features and global motion features from the paired three-dimensional point cloud sequences.
The three networks are an initial pose feature extraction network, a global geometric feature extraction network and a global motion feature extraction network, which extract the initial pose feature, the global geometric feature and the global motion feature respectively.
Fig. 4 is a schematic structural diagram of a three-dimensional point cloud feature extraction network according to an embodiment of the present invention.
In this embodiment, the three-dimensional point cloud feature extraction networks share the same network structure. As shown in fig. 4, the input of a feature extraction network is a point cloud sequence of size (B, L, N, 3), where B, L and N denote the batch size, the length of the input sequence, and the number of points in each frame's point cloud, respectively.
The initial pose feature extraction network, the global geometric feature extraction network and the global motion feature extraction network share the same structure, a variant of the public algorithm PointNet with five residual blocks. Each of the first four blocks has an additional max pooling layer that yields a pooled feature of size (B, 1, C), where C is the hidden layer dimension; the output of the fifth block passes through the max pooling layer and a fully connected layer to obtain the final 128-dimensional feature vector.
The expansion operation copies the features output by the max pooling layer N times, where N is the number of points in the input three-dimensional point cloud sequence, and concatenates the copied features with the input of the max pooling layer along the feature dimension. In this embodiment the pooled feature is expanded to size (B, 1, N, C) to make it suitable for concatenation.
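A minimal numpy sketch of this expansion connection, with the sequence axis omitted for brevity (shapes follow the description; the dimensions below are illustrative):

```python
import numpy as np

B, N, C = 2, 300, 64
per_point = np.random.rand(B, N, C)            # input of the max pooling layer
pooled = per_point.max(axis=1, keepdims=True)  # (B, 1, C) global feature

expanded = np.repeat(pooled, N, axis=1)        # copy N times -> (B, N, C)
fused = np.concatenate([per_point, expanded], axis=-1)  # (B, N, 2C)
print(fused.shape)  # (2, 300, 128)
```

The concatenation attaches the same global summary to every per-point feature, which is what lets later layers compare each point against the whole cloud.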
The pose feature extraction network processes the first frame of the input three-dimensional point cloud sequence to extract the initial pose feature, while the global geometric feature extraction network and the global motion feature extraction network first apply a feature extraction preprocessing to the three-dimensional point cloud sequence and then extract features from the whole sequence. This preprocessing concatenates all frames of the input three-dimensional point cloud sequence along the last dimension and sets the input dimension of the feature extraction network to 3L.
In this embodiment, the initial pose feature, the global geometric feature and the global motion feature are each represented by such a 128-dimensional feature vector, referred to respectively as the initial pose code, the identity code and the motion code.
Step S3, randomly determining whether to exchange the identity codes of paired three-dimensional point cloud sequences.
In this embodiment, the initial pose information and the global geometric information of the three-dimensional point cloud sequences must be decoupled, so paired three-dimensional point cloud sequences are used as input during training, and for each pair the corresponding identity codes are randomly exchanged. Specifically, if the input pair of three-dimensional point cloud sequences is "identity 1 + motion 1" and "identity 2 + motion 2" and their identity codes are chosen to be exchanged, the network is required to reconstruct the sequences "identity 2 + motion 1" and "identity 1 + motion 2".
Exchanging the identity codes of paired three-dimensional point cloud sequences forces the network to decouple the pose information and the identity information of the object during training.
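The random exchange can be sketched as below; the 50% swap probability and the helper name are assumptions for illustration:

```python
import numpy as np

def maybe_swap_identity(id_a, id_b, rng):
    """With probability 0.5, swap the identity codes of a paired sequence batch."""
    if rng.random() < 0.5:
        return id_b, id_a  # each sequence keeps its motion but gains the other's identity
    return id_a, id_b

rng = np.random.default_rng(42)
id1, id2 = np.ones(128), np.zeros(128)  # toy 128-d identity codes
out1, out2 = maybe_swap_identity(id1, id2, rng)
```

After a swap, the decoder is asked to reconstruct "identity 2 + motion 1" and "identity 1 + motion 2", which is only possible if identity and pose information are genuinely separated.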
Step S4, constructing the neural ordinary differential equation network and transforming the initial pose code, guided by the motion code corresponding to the three-dimensional point cloud sequence, to obtain the pose code corresponding to the required time T = t.
Fig. 5 is a schematic structural diagram of a vector field network in a neural ordinary differential equation according to an embodiment of the present invention.
In this embodiment, the motion code serves as guidance information, and the neural ordinary differential equation network transforms the initial pose code in the latent space, yielding the pose code corresponding to any instant. The network consists of a vector field network and a predetermined ordinary differential equation solver. As shown in fig. 5, the input of the vector field network contains the motion code and the vector obtained by concatenating the initial pose code with the required query time t.
The vector field network consists of five cascaded residual blocks each containing two fully connected layers, and the input of each residual block is summed with features extracted from the motion code. The pose code output for time T = t has the same dimension as the initial pose code.
The ordinary differential equation solver used in this embodiment is the well-known adaptive solver dopri5, with the absolute and relative error tolerances set to 1e-5 and 1e-3, respectively.
In this embodiment, the time range of the input three-dimensional point cloud sequence is normalized to [0, 1], i.e. the specified time range is [0, 1], and the query time fed to the neural ordinary differential equation network may be any scalar value within [0, 1].
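A self-contained sketch of the latent-space neural ODE: a toy vector field f(z, t; m), conditioned on a motion code m, is integrated from the initial pose code z(0) to any query time in [0, 1]. The tiny network and the fixed-step RK4 integrator are assumptions standing in for the trained vector field network and the adaptive dopri5 solver:

```python
import numpy as np

def vector_field(z, t, motion_code):
    # Toy stand-in for the five-residual-block vector field network:
    # the pose code's rate of change is modulated by the motion code.
    return np.tanh(z + motion_code) * 0.1

def odeint_rk4(f, z0, t_end, motion_code, n_steps=50):
    """Integrate dz/dt = f(z, t; m) from t = 0 to t = t_end with fixed-step RK4."""
    z, t = z0.copy(), 0.0
    h = t_end / n_steps
    for _ in range(n_steps):
        k1 = f(z, t, motion_code)
        k2 = f(z + h / 2 * k1, t + h / 2, motion_code)
        k3 = f(z + h / 2 * k2, t + h / 2, motion_code)
        k4 = f(z + h * k3, t + h, motion_code)
        z = z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z

pose0 = np.zeros(128)       # initial pose code
motion = np.full(128, 0.5)  # motion code as guidance
pose_t = odeint_rk4(vector_field, pose0, t_end=0.7, motion_code=motion)
print(pose_t.shape)  # (128,) -- same dimension as the initial pose code
```

Because the query time enters only as the integration endpoint, any scalar in [0, 1] yields a pose code, which is what makes the output frame rate arbitrary.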
Step S5, building the deep decoding network.
This step admits two schemes: 1) for human bodies or other general objects, a three-dimensional surface implicit function is modeled; the pose code at time T = t corresponding to the three-dimensional point cloud sequence is concatenated with the identity code as a guide code, and the probability that a three-dimensional point sampled in a specified volume lies inside the three-dimensional model is predicted; 2) for a more refined human body model, the classic SMPL parameterized human body model is used, taking the pose code and the identity code as input to obtain the corresponding mesh vertex coordinates.
In this embodiment, the deep decoding network is built on an implicit function of the three-dimensional object surface: the pose code at time T = t corresponding to the three-dimensional point cloud sequence and the identity code are concatenated as guidance, and the probability that a three-dimensional point sampled in a specified volume lies inside the surface is predicted.
Fig. 6 is a schematic structural diagram of a deep three-dimensional surface implicit function network according to an embodiment of the present invention.
When the guide code is fixed, the surface of the three-dimensional object is also fixed. In this embodiment, the guide code of the implicit-function-based deep decoding network is the vector obtained by concatenating the identity code with the pose code at time T = t.
As shown in fig. 6, the implicit-function-based deep decoding network has a network structure similar to the vector field network. It selects query points from a sampling point set S and, under the guide code, predicts for each query point a probability value indicating the probability that the queried point lies inside the object surface. This embodiment uses Conditional Batch Normalization (CBN) to inject the information extracted from the guide code. Specifically, at each batch normalization layer, two vectors β and γ are obtained by processing the guide code with two fully connected layers, and these serve as the affine transformation parameters of that layer, so that the guide code's information propagates better through the network.
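The CBN mechanism can be sketched as follows; the random (untrained) fully connected weights and the dimensions are illustrative placeholders, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 64, 256  # feature channels, guide-code dimension
W_gamma = rng.normal(scale=0.01, size=(D, C))  # stand-in for one FC layer
W_beta = rng.normal(scale=0.01, size=(D, C))   # stand-in for the other FC layer

def conditional_batch_norm(x, guide, eps=1e-5):
    """x: (B, N, C) point features; guide: (D,) concatenated identity + pose code."""
    gamma = 1.0 + guide @ W_gamma  # start near identity scaling
    beta = guide @ W_beta
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # affine parameters come from the guide code

feats = rng.normal(size=(4, 512, C))
guide = rng.normal(size=(D,))
out = conditional_batch_norm(feats, guide)
```

Replacing the learned per-channel affine parameters of batch normalization with functions of the guide code is what lets every normalization layer, not just the input layer, see the identity and pose information.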
The loss of the depth decoding network based on the implicit function of the surface of the three-dimensional object is computed by averaging, over all sampled three-dimensional points, the loss between the predicted probability value and the label. This includes supervision at both the initial frame T = 0 and the query time T = T; for the supervision at T = 0, the initial pose code and the identity code are connected as the guidance of the depth decoding network, and predictions are made for all three-dimensional points sampled at the time T = 0.
In step S6, a loss function is constructed.
In this embodiment, the loss function is the binary cross-entropy loss (Binary Cross-Entropy Loss). The deep decoding network computes the loss between the predicted probability value and the ground-truth label value of each query point and averages over all query points.
In this embodiment, for each three-dimensional point cloud sequence, in addition to the loss calculated at the time T = T, a loss is also calculated at the time T = 0, so as to ensure that the reconstruction result of the first frame is of sufficiently high quality.
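The two-term supervision can be sketched as a plain NumPy binary cross-entropy averaged over query points, evaluated once at t = 0 and once at t = T; the probability and label arrays below are toy values, not real network outputs.

```python
import numpy as np

def bce_loss(pred_prob, label, eps=1e-7):
    """Binary cross-entropy averaged over all sampled query points."""
    p = np.clip(pred_prob, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(label * np.log(p) + (1 - label) * np.log(1 - p))))

# supervise both the initial frame (t = 0) and the query time (t = T)
pred_t0, label_t0 = np.array([0.9, 0.2, 0.8]), np.array([1.0, 0.0, 1.0])
pred_tT, label_tT = np.array([0.7, 0.1, 0.6]), np.array([1.0, 0.0, 1.0])
total_loss = bce_loss(pred_t0, label_t0) + bce_loss(pred_tT, label_tT)
```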
In step S7, a three-dimensional model generation model composed of the three-dimensional point cloud feature extraction networks, the neural ordinary differential equation network and the deep decoding network is trained based on the loss function.
In step S7 of this embodiment, the back-propagation algorithm and the gradient descent algorithm are used to optimize the weight parameters of the three-dimensional model generation model, thereby completing the training. During training, an Adam optimizer with coefficients β = (0.9, 0.999) is used, and the learning rate is set to 1e-4.
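A single Adam update with the hyper-parameters stated above (β = (0.9, 0.999), learning rate 1e-4) can be sketched as follows; this is a generic NumPy illustration of the optimizer, not the embodiment's actual training code.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the coefficients used in this embodiment:
    beta = (0.9, 0.999), learning rate 1e-4."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adam_step(np.array([1.0]), np.array([0.5]), np.zeros(1), np.zeros(1), t=1)
```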
In step S8, a three-dimensional point cloud sequence and a query time T = T are input into the three-dimensional model generation model, the surface defined by the three-dimensional surface implicit function is extracted using a surface extraction algorithm, and the three-dimensional mesh model corresponding to the query time T = T is finally obtained and output.
In this embodiment, after the training of the three-dimensional model generation model is completed, given a sparse point cloud sequence discrete in the time dimension and any query time T = T in [0, 1], the model can directly output the three-dimensional mesh generation result corresponding to that time.
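Inference can be sketched as evaluating the implicit function on a dense grid inside the unit volume and thresholding at 0.5; the trained decoder is replaced below by a toy sphere occupancy function, and in practice the 0.5 level set would be turned into a mesh with a marching-cubes implementation such as `skimage.measure.marching_cubes` (an assumed third-party routine, not part of the embodiment's code).

```python
import numpy as np

def extract_occupancy_grid(occupancy_fn, resolution=32):
    """Evaluate the implicit function on a dense grid inside the unit cube;
    the 0.5 level set is the reconstructed surface, extracted in practice
    with a marching-cubes algorithm."""
    xs = np.linspace(-0.5, 0.5, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (R, R, R, 3)
    probs = occupancy_fn(grid.reshape(-1, 3))
    return probs.reshape(resolution, resolution, resolution)

# stand-in for the trained decoder at query time t = T: a sphere of radius 0.3
sphere = lambda pts: (np.linalg.norm(pts, axis=1) < 0.3).astype(float)
vol = extract_occupancy_grid(sphere)
inside = vol > 0.5  # voxels predicted to lie inside the surface
```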
Evaluated by the volume IoU and the chamfer distance of the generated results, the performance of the three-dimensional model generation model of this embodiment exceeds that of existing four-dimensional mesh reconstruction methods. The volume IoU has a maximum of 100%, and a larger value is better; for the chamfer distance, a smaller value is better. In this embodiment, a specific experimental comparison is performed on a test data set, and the relevant experimental data are as follows:
Extension of the PSGN method: volume IoU: not computable; chamfer distance: 0.108;
Extension of the ONet method: volume IoU: 77.9%; chamfer distance: 0.084;
OFlow method: volume IoU: 79.9%; chamfer distance: 0.073;
The method of this embodiment: volume IoU: 81.8%; chamfer distance: 0.068.
In addition, for convenience in practical use, the three-dimensional model generation model obtained through the training of steps S1 to S7 may be packaged into a three-dimensional mesh model generation unit, and this unit may be connected to a device serving as a three-dimensional point cloud sequence data acquisition unit (e.g., a three-dimensional point cloud scanner). After the acquisition unit acquires a sparse three-dimensional point cloud sequence at discrete times, the sequence is processed by the three-dimensional mesh model generation unit to generate a three-dimensional mesh model corresponding to any time in the sequence.
Effects and functions of the embodiments
The invention provides a method and a device for generating a three-dimensional mesh model sequence at an arbitrary frame rate, in which the sequence is generated from a three-dimensional point cloud sequence. After the four-dimensional data (a sparse point cloud sequence discrete in the time dimension) are obtained, three separate feature extraction networks extract an initial pose code, an identity code and a motion code from the whole sequence to jointly represent the four-dimensional data. The initial pose code is then transformed in the latent space by a neural ordinary differential equation network under the guidance of the motion code, so that the pose code corresponding to any time is obtained; this ensures strong expressive capability and achieves the breakthrough of obtaining continuous pose information from discrete input data. Furthermore, because the identity code is independent of time, it is connected with the pose codes at different times and then used as the guidance of the deep decoding network, so that the three-dimensional mesh models reconstructed at different times have better geometric consistency. The method decouples the three parts of information, namely initial pose, identity and motion, and represents the four-dimensional data by a combined representation method based on a neural ordinary differential equation, so that novel tasks such as three-dimensional motion transfer, four-dimensional temporal completion, four-dimensional spatial completion and future motion prediction can be realized.
Further, in the method for generating a three-dimensional mesh model sequence at an arbitrary frame rate provided by the present invention, different schemes can be adopted for different classes of objects during decoding. For human bodies and other general objects, because the identity code is independent of time, it is connected with the pose codes at different times and then used as the guidance of the depth decoding network based on the implicit function of the three-dimensional object surface, so that the three-dimensional mesh models reconstructed at different times have better geometric consistency. When only a human body is targeted, the SMPL human body parameterized model is used, and the same identity code together with the pose code corresponding to the time T = T serves as input to obtain the vertex coordinates of the human body mesh model at that time, likewise giving the reconstructed three-dimensional mesh models better geometric consistency across times. Furthermore, the method and the device use a combined representation method based on a neural ordinary differential equation to decouple the initial pose, the identity and the motion of the four-dimensional data, so that three-dimensional motion transfer can be realized, i.e., the motion of an object in one sequence can be transferred to another object. In addition, applications such as four-dimensional temporal completion, four-dimensional spatial completion and future motion prediction can be realized by means of a reverse optimization strategy.
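As an illustration of the SMPL-style decoding branch, the sketch below shows only the identity (shape-blendshape) step, with toy dimensions — the real SMPL model uses 6890 vertices and 10 shape coefficients, and additionally applies pose blendshapes and linear blend skinning driven by the pose code. All arrays here are random stand-ins, not SMPL's learned parameters.

```python
import numpy as np

# toy dimensions; the real SMPL model has 6890 vertices and 10 shape coefficients
n_verts, n_betas = 100, 10
rng = np.random.default_rng(1)
v_template = rng.standard_normal((n_verts, 3))           # mean template mesh
shapedirs = rng.standard_normal((n_verts, 3, n_betas))   # shape blendshape basis

def smpl_shape(betas):
    """Identity-dependent vertices: v = T + S(beta), where beta plays the role
    of the identity code. Pose blendshapes and linear blend skinning (driven by
    the pose code) are applied afterwards in the full SMPL model."""
    return v_template + shapedirs @ betas

verts = smpl_shape(np.zeros(n_betas))  # zero betas recover the template mesh
```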
Furthermore, the method and the device for generating a three-dimensional mesh model sequence at an arbitrary frame rate provided by the invention model the deep decoding network with the implicit function commonly used in three-dimensional reconstruction tasks; they can reconstruct not only human body three-dimensional model sequences but also three-dimensional model sequences of cars undergoing non-rigid motion, perform stably when processing data from entirely different domains, and have good robustness and algorithmic generality.
Furthermore, the three-dimensional point cloud feature extraction networks in the method provided by the invention are respectively an initial pose feature extraction network, a global geometric feature extraction network and a global motion feature extraction network. These three networks can extract robust three-dimensional spatial features from the input sparse point cloud sequence, are unaffected by the ordering of the points within the three-dimensional point cloud, and provide the information required by the subsequent decoding network to reconstruct the three-dimensional model.
Furthermore, in the method provided by the invention, the extended connection operation of the residual block copies the features output by the max pooling layer N times and then concatenates the copied features with the input of the max pooling layer along the feature dimension. This operation retains the features of each point while also extracting, through the pooling layer, the most prominent features among all points, thereby ensuring that sufficient information is available for decoding.
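The extended connection operation can be sketched directly in NumPy: max-pool over the N points, tile the pooled vector N times, and concatenate along the feature dimension; the toy feature matrix below is illustrative only.

```python
import numpy as np

def extended_connection(point_feats):
    """Max-pool over the N points, copy the pooled feature N times, and
    concatenate it with the per-point input along the feature dimension."""
    n = point_feats.shape[0]
    pooled = point_feats.max(axis=0, keepdims=True)  # most prominent feature over all points
    tiled = np.repeat(pooled, n, axis=0)             # copy N times
    return np.concatenate([point_feats, tiled], axis=1)  # (N, 2C)

feats = np.arange(12, dtype=float).reshape(4, 3)  # N=4 points, C=3 features
out = extended_connection(feats)                  # each row keeps its own feature
```

The output is order-invariant in its pooled half, which is why the extraction networks are unaffected by the ordering of points in the cloud.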
Further, in the method provided by the invention, the neural ordinary differential equation network comprises a vector field network and a predetermined ordinary differential equation solver, and can transform the initial pose code in the latent space under the guidance of the global motion code to obtain the pose codes corresponding to different times. This part may also be called a conditional implicit neural ordinary differential equation, which is proposed here for the first time.
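As a minimal sketch of this conditional latent ODE, the code below integrates a toy vector field with a fixed-step Euler solver from t = 0 to the query time. The embodiment's actual vector field network (five residual blocks) and its predetermined solver are replaced here by a two-layer stand-in with random placeholder weights.

```python
import numpy as np

def vector_field(pose_code, t, motion_code, W1, W2):
    """Toy stand-in for the vector field network: predicts d(pose)/dt
    conditioned on the motion code (used here as the guidance code)."""
    h = np.tanh(np.concatenate([pose_code, motion_code]) @ W1)
    return h @ W2

def odeint_euler(pose0, motion_code, t_query, W1, W2, steps=100):
    """Fixed-step Euler solver standing in for the predetermined ODE solver:
    integrates the initial pose code in latent space up to the query time."""
    z, dt = pose0.copy(), t_query / steps
    for i in range(steps):
        z = z + dt * vector_field(z, i * dt, motion_code, W1, W2)
    return z

rng = np.random.default_rng(2)
d_pose, d_motion = 8, 16
W1 = rng.standard_normal((d_pose + d_motion, 32)) * 0.1
W2 = rng.standard_normal((32, d_pose)) * 0.1
pose0 = rng.standard_normal(d_pose)
motion = rng.standard_normal(d_motion)
pose_T = odeint_euler(pose0, motion, 0.7, W1, W2)  # pose code at query time t = 0.7
```

Because the solver can stop at any scalar t in [0, 1], the pose code — and hence the reconstructed mesh — is defined at arbitrary, not just observed, times.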
Further, in step S5 of the method for generating a three-dimensional mesh model sequence at an arbitrary frame rate, the depth decoding network based on the three-dimensional surface implicit function is composed of five cascaded residual blocks each containing two fully-connected layers, and can predict for each three-dimensional spatial point a probability value representing the probability that the point lies inside the three-dimensional model. This network is a decoding network commonly used in the field of three-dimensional reconstruction, and, combined with the subsequent surface extraction algorithm, can produce a very fine and smooth mesh model.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
In the above embodiment, the noise added to the input three-dimensional point cloud sequence is random noise generated by a Gaussian function; in other aspects of the present invention, the noise may be generated by other functions.
In the above embodiment, the Marching Cubes algorithm is used as the three-dimensional mesh surface extraction algorithm in step S8; in other aspects of the present invention, other existing three-dimensional mesh surface extraction algorithms may also be used.

Claims (10)

1. A method for generating a three-dimensional grid model sequence with any frame rate is used for processing a sparse three-dimensional point cloud sequence to obtain a three-dimensional grid model corresponding to any time in the sequence, and is characterized by comprising the following steps:
step S1, preprocessing a three-dimensional model data set which is obtained in advance and comprises a plurality of three-dimensional models to obtain training samples, wherein the training samples comprise paired three-dimensional point cloud sequences and label information of three-dimensional sampling points which are positioned inside or outside the three-dimensional models;
step S2, building three-dimensional point cloud feature extraction networks, extracting initial attitude features, global geometric features and global motion features of the three-dimensional point cloud sequences in pairs respectively, and expressing the initial attitude features, the global geometric features and the global motion features as initial attitude codes, identity codes and motion codes respectively;
step S3, randomly determining whether to exchange the identity codes of the paired three-dimensional point cloud sequences;
step S4, a neural ordinary differential equation network is built, the motion code corresponding to the three-dimensional point cloud sequence is used as a guide code to transform the initial attitude code, and the attitude code corresponding to the query moment T = T is obtained;
step S5, building a deep decoding network;
step S6, constructing a loss function;
step S7, training a three-dimensional model generation model composed of the three-dimensional point cloud feature extraction network, the neural ordinary differential equation network and the deep decoding network based on the loss function to obtain the trained three-dimensional model generation model;
step S8, inputting the single three-dimensional point cloud sequence and the query time T = T into the trained three-dimensional model generation model, extracting the surface of the three-dimensional surface implicit function by using a preset surface extraction algorithm to obtain and output a three-dimensional mesh model corresponding to the query time T = T,
and the query time T = T is any scalar value in a specified time range.
2. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as set forth in claim 1, wherein:
in step S5, a three-dimensional surface implicit function is modeled, the pose coding and the identity coding at the time T = T corresponding to the three-dimensional point cloud sequence are connected as a guide coding, and the probability that a three-dimensional point sampled in a specified volume is located inside the three-dimensional model is predicted, so as to establish the deep decoding network.
3. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as set forth in claim 1, wherein:
in step S5, a classical SMPL human parameterized model is used, and the pose code and the identity code are used as inputs to obtain corresponding grid vertex coordinates, so as to establish the depth decoding network.
4. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as set forth in claim 1, wherein:
wherein the three-dimensional point cloud feature extraction networks are an initial attitude feature extraction network, a global geometric feature extraction network and a global motion feature extraction network respectively,
the initial pose feature extraction network, the global geometric feature extraction network and the global motion feature extraction network are all composed of five cascaded residual blocks,
of the five residual blocks, the first four residual blocks comprise two fully connected layers, one maximum pooling layer and one extended connection operation, the last residual block comprises three fully connected layers and one maximum pooling layer,
the global geometric feature extraction network and the initial pose extraction network extract global geometric features and the initial pose features for a first frame of point cloud of the three-dimensional point cloud sequence,
and the global motion feature extraction network extracts the global motion features from the whole three-dimensional point cloud sequence.
5. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as set forth in claim 4, wherein:
wherein the extended join operation of the residual block is to copy N times the features output by the maximum pooling layer, and then join the copied features with the input of the maximum pooling layer in a feature dimension,
and N is the number of points of the input three-dimensional point cloud sequence.
6. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as set forth in claim 1, wherein:
wherein the neural ordinary differential equation network comprises a vector field network and a predetermined ordinary differential equation solver,
the vector field network consists of five cascaded residual blocks containing two fully-connected layers.
7. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as claimed in claim 2, wherein:
wherein, in step S5, the deep decoding network is composed of five cascaded residual blocks containing two fully-connected layers,
and the three-dimensional points obtained by sampling are sampling points of the training samples.
8. The method of generating a sequence of three-dimensional mesh models at arbitrary frame rates as claimed in claim 1, wherein:
wherein the pretreatment comprises the following steps:
step T1, sampling the three-dimensional model to obtain a three-dimensional point cloud coordinate;
and step T2, normalizing the three-dimensional model into a unit cube with the side length of 1, sampling in the unit cube, and confirming whether a sampling point is positioned inside or outside the three-dimensional model by using a preset algorithm, wherein the corresponding label values are 1 and 0 respectively.
9. The method of generating a sequence of three-dimensional mesh models at an arbitrary frame rate as set forth in claim 1, wherein:
wherein the loss function is a binary cross entropy loss function.
10. An apparatus for generating a sequence of three-dimensional mesh models at an arbitrary frame rate, comprising:
a three-dimensional point cloud sequence data acquisition unit for acquiring three-dimensional point cloud sequence data of an object to be modeled;
a mesh model generation unit that generates a three-dimensional mesh model corresponding to an arbitrary time in a three-dimensional point cloud sequence based on the sequence by the method of generating a three-dimensional mesh model sequence at an arbitrary frame rate according to any one of claims 1 to 9; and
a model output unit for outputting the three-dimensional mesh model.
CN202110416920.1A 2021-04-19 2021-04-19 Method and device for generating three-dimensional grid model sequence with any frame rate Active CN113112607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110416920.1A CN113112607B (en) 2021-04-19 2021-04-19 Method and device for generating three-dimensional grid model sequence with any frame rate


Publications (2)

Publication Number Publication Date
CN113112607A CN113112607A (en) 2021-07-13
CN113112607B true CN113112607B (en) 2022-09-06

Family

ID=76718267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110416920.1A Active CN113112607B (en) 2021-04-19 2021-04-19 Method and device for generating three-dimensional grid model sequence with any frame rate

Country Status (1)

Country Link
CN (1) CN113112607B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706670B (en) * 2021-08-17 2023-12-08 复旦大学 Method and device for generating dynamic three-dimensional human body grid model sequence
WO2023082089A1 (en) * 2021-11-10 2023-05-19 中国科学院深圳先进技术研究院 Three-dimensional reconstruction method and apparatus, device and computer storage medium
CN114782634B (en) * 2022-05-10 2024-05-14 中山大学 Monocular image dressing human body reconstruction method and system based on surface hidden function
CN116740820B (en) * 2023-08-16 2023-10-31 南京理工大学 Single-view point cloud three-dimensional human body posture and shape estimation method based on automatic augmentation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443892A (en) * 2019-07-25 2019-11-12 北京大学 A kind of three-dimensional grid model generation method and device based on single image

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011014810A1 (en) * 2009-07-30 2011-02-03 Northwestern University Systems, methods, and apparatus for reconstruction of 3-d object morphology, position, orientation and texture using an array of tactile sensors
CN104484522B (en) * 2014-12-11 2017-10-27 西南科技大学 A kind of construction method of robot simulation's drilling system based on reality scene
CN109389671B (en) * 2018-09-25 2020-09-22 南京大学 Single-image three-dimensional reconstruction method based on multi-stage neural network
CN109685848B (en) * 2018-12-14 2023-06-09 上海交通大学 Neural network coordinate transformation method of three-dimensional point cloud and three-dimensional sensor
US11922134B2 (en) * 2019-09-25 2024-03-05 Siemens Aktiengesellschaft Physics informed neural network for learning non-Euclidean dynamics in electro-mechanical systems for synthesizing energy-based controllers
CN111797692B (en) * 2020-06-05 2022-05-17 武汉大学 Depth image gesture estimation method based on semi-supervised learning




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant