CN116188690A - Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information - Google Patents

Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information

Info

Publication number
CN116188690A
CN116188690A (application CN202310163381.4A)
Authority
CN
China
Prior art keywords
sketch
skeleton
encoder
self
hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310163381.4A
Other languages
Chinese (zh)
Inventor
孔德慧
马杨
李敬华
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310163381.4A priority Critical patent/CN116188690A/en
Publication of CN116188690A publication Critical patent/CN116188690A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for reconstructing a three-dimensional model from a sketch based on spatial skeleton information. It comprises a spatial-skeleton-guided encoder, a domain-adaptive encoder and a self-attention decoder. The spatial skeleton encoder extracts skeleton features from the sketch, and this skeleton information serves as prior knowledge supplying the auxiliary cues needed to reconstruct a complete three-dimensional model; the domain-adaptive encoder transfers knowledge learned on synthetic sketches to hand-drawn sketches; and the attention-based decoder eliminates ambiguity, thereby improving the accuracy of single-sketch three-dimensional reconstruction. The self-attention mechanism lets the model distinguish sketch inputs with highly similar contours. Compared with domain-adaptation techniques that train a discriminator with a gradient-reversal layer, whose value function is equivalent to minimizing the Jensen-Shannon divergence between the two distributions (a divergence that may not be continuous in the generator parameters), the domain-adaptation constraint function of the invention can be considered differentiable everywhere and is therefore more stable.

Description

Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information
Technical Field
The invention belongs to the technical field of computer vision and three-dimensional reconstruction, and particularly relates to a spatial-skeleton-guided, domain-adaptive sketch three-dimensional reconstruction method based on deep learning.
Background
Reconstructing a corresponding three-dimensional model from a two-dimensional image plays an extremely important role in human-computer interaction, virtual reality, augmented reality and related fields. Traditional methods require users to learn a number of complex editing tools and to master geometric knowledge such as spatial curved surfaces; they can only reconstruct relatively simple ellipsoid-like objects and do not generalize to complex ones. With the development of deep learning and the appearance of large databases, deep-learning-based methods are user-friendly and obtain good reconstruction results even for fairly complex objects. Multi-view reconstruction methods place high demands on the sketches: users must provide accurate sketches from several viewpoints, or draw under the guidance of auxiliary lines, which is unsuitable for untrained novices. Current mainstream methods therefore relax the restrictions on the input sketch and treat the problem as a single-view reconstruction task. However, the ambiguity caused by sparse sketch information, variable drawing styles and unclear perspective hinders single-view reconstruction. Some methods enrich the sketch's semantic representation with auxiliary information, such as viewpoint information or foreground masks, to address the sparsity problem, but this auxiliary information is not effective enough to yield more accurate reconstructions.
Addressing these problems left unsolved by the prior art, the invention provides a spatial-skeleton-guided, domain-adaptive hand-drawn sketch three-dimensional reconstruction method based on a deep learning network. Spatial skeleton information supplies prior knowledge of fine structures; a self-attention mechanism learns the dependencies among details within the sketch, eliminating ambiguity and improving reconstruction accuracy; and a domain-adaptation method further improves the model's performance on the hand-drawn sketch domain.
Disclosure of Invention
The invention provides a spatial-skeleton-guided, domain-adaptive sketch three-dimensional reconstruction method comprising a spatial-skeleton-guided encoder, a domain-adaptive encoder and a self-attention decoder.
In order to achieve the above purpose, the invention adopts the following technical scheme:
step 1, in the training stage, firstly, a synthetic sketch x is obtained from a rendered image by using an edge extractor, and the synthetic sketch x is passed through a sketch encoder E v-s To obtain its average shape characteristics, the encoder E is guided by the spatial skeleton ss Obtaining skeleton characteristics, after characteristic fusion, passing through self-attention layer of decoder, and finally passing through decoder D v And carrying out three-dimensional deconvolution reconstruction to obtain voxels, and calculating the loss with the real model. Wherein, in the feature extraction stage, the domain adaptive encoder E v-h Sketch of hand painting
Figure BDA0004094980270000021
Extracting features, and carrying out similarity constraint with the features of the synthetic sketch to realize domain migration.
Step 2, test stage. For a synthetic sketch input, feature vectors are obtained through E_v-s and E_ss, fused, and passed through D_v for prediction; for a hand-drawn sketch input, feature vectors are obtained through E_v-h and E_ss, fused, and passed through D_v for prediction.
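The two-stage routing of steps 1 and 2 can be sketched as follows. The four encoder/decoder callables below are toy placeholders (not the patented networks); they only illustrate how the hand-drawn and synthetic branches share the skeleton encoder and decoder:

```python
import numpy as np

def reconstruct(sketch, is_hand_drawn, E_vs, E_vh, E_ss, D_v):
    """Route a sketch through the appropriate encoder pair, fuse, and decode.

    E_vs / E_vh: synthetic / hand-drawn sketch encoders (1x1024 features)
    E_ss: spatial-skeleton-guided encoder (1x1024 skeleton features)
    D_v: self-attention decoder (returns a voxel occupancy grid)
    All four are placeholder callables in this sketch.
    """
    shape_feat = E_vh(sketch) if is_hand_drawn else E_vs(sketch)
    skel_feat = E_ss(sketch)
    fused = shape_feat + skel_feat          # element-wise addition (step 1.3.1)
    return D_v(fused)

# Toy stand-ins: constant-feature "encoders" and a thresholding "decoder".
E_vs = lambda s: np.ones((1, 1024)) * 0.1
E_vh = lambda s: np.ones((1, 1024)) * 0.2
E_ss = lambda s: np.ones((1, 1024)) * 0.3
D_v  = lambda z: (z.reshape(-1)[:8] > 0.45).astype(float)

v_syn  = reconstruct(None, False, E_vs, E_vh, E_ss, D_v)  # fused value 0.4
v_hand = reconstruct(None, True,  E_vs, E_vh, E_ss, D_v)  # fused value 0.5
```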
The step 1 comprises the following steps:
step 1.1, obtaining a synthetic sketch from a rendered image in a shape Net dataset by using a mode of first deriving and then binarizing as an input of training
Step 1.2, feature extraction stage
Step 1.2.1, once the synthetic sketch is obtained, an average shape feature z_v = E_v-s(x) and a skeleton feature z_p = E_ss(x) of the same dimension (1×1024) are computed. z_v tends to represent the average shape of the category and can be regarded as global information; z_p, guided by the skeleton data, can be regarded as local-structure information. A chamfer-distance constraint is imposed between the predicted skeleton P = E_ss(x) and the ground-truth skeleton P_gt; this loss is used only to train the parameters of the E_ss network.
Step 1.2.2, introduce the domain-adaptation method once the model parameters have stabilized. The hand-drawn sketches m are represented as a set of n different styles {m_1, m_2, …, m_i, …, m_n}, and the synthetic sketch is represented as a style m_n+1 outside this set. The goal of domain adaptation is to align the n styles to m_n+1.
Within each epoch, the hand-drawn sketches are trained once every 400 iterations. The input hand-drawn sketch is passed through the domain-adaptive encoder E_v-h to obtain the domain-adaptive feature ẑ_v = E_v-h(m). After obtaining ẑ_v and z_v, E_v-h is trained through the constraint function L_da, the L1 feature-distance constraint given in equation (7). Using the Earth Mover's (Wasserstein-1) distance makes network training more stable and avoids the vanishing-gradient problem that commonly arises when training ordinary GAN networks.
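The L1 feature-alignment constraint described here can be sketched in a few lines; the feature values below are synthetic placeholders, not outputs of the patented encoders:

```python
import numpy as np

def domain_alignment_loss(z_hand, z_syn):
    """Mean L1 distance between hand-drawn and synthetic sketch features.
    Minimizing this distance directly (instead of a GAN discriminator value
    function) is the stable constraint described in the text."""
    return np.abs(z_hand - z_syn).mean()

z_syn  = np.zeros((1, 1024))          # placeholder synthetic-sketch feature
z_hand = np.full((1, 1024), 0.5)      # placeholder hand-drawn-sketch feature
loss = domain_alignment_loss(z_hand, z_syn)
```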
Step 1.3, feature fusion and deconvolution reconstruction stage
Step 1.3.1, z_v and z_p are fused; to make full use of the skeleton information, corresponding elements are added:

z = z_v + z_p

Step 1.3.2, the decoder D_v applies a self-attention operation to the fused feature to obtain the voxel model V = D_v(z), and the voxel reconstruction loss L_vol trains E_v-s and D_v.
The overall network architecture of the invention is shown in fig. 1. Introducing spatial skeleton information yields higher reconstruction accuracy for fine local components. Compared with the priors used by other methods, such as viewpoint information or 2.5D representations, skeleton information preserves the global geometric structure while also providing local information in complex regions, so the completeness of the reconstructed object's spatial geometry can be guaranteed; this is shown intuitively in fig. 10. The self-attention mechanism lets the model distinguish sketch inputs with highly similar contours. In addition, in domain-adaptation methods that use a discriminator and a gradient-reversal layer, the trained value function is equivalent to minimizing the Jensen-Shannon divergence between the two distributions, which tends to make the gradient vanish when the discriminator saturates, because the minimized divergence may not be continuous in the generator parameters; by contrast, the domain-adaptation constraint function of the invention can be considered differentiable everywhere and is more stable.
Drawings
FIG. 1 is a network overall framework of the present invention;
FIG. 2 is a schematic diagram of the internal structure of a space skeleton encoder;
FIG. 3 is a schematic diagram of a self-attention mechanism in an encoder;
FIG. 4 is a schematic diagram of a specific network architecture of a self-attention decoder;
FIG. 5 is a schematic diagram of a network architecture of an encoder;
FIG. 6a is a schematic diagram of a detailed sketch;
FIG. 6b is a rough sketch schematic;
FIG. 7 is a comparative graph of the comparative experiment of the present invention;
FIG. 8 is a schematic diagram of a hand-drawn sketch reconstruction result according to the present invention;
FIG. 9 is a schematic diagram of a hand-drawn sketch reconstruction result according to the present invention;
FIG. 10 is a schematic diagram of the results of a spatial skeleton encoder (SS) ablation experiment;
fig. 11 is a schematic diagram of the results of a Domain Adaptation (DA) method ablation experiment.
Detailed Description
Introduction to the basic model: a two-dimensional convolutional encoder paired with a three-dimensional deconvolutional decoder is the classical single-view or multi-view three-dimensional voxel reconstruction network. The invention reconstructs voxels on this encoder-decoder architecture; the designed model is divided into three modules, namely a spatial-skeleton-guided encoder, a domain-adaptive encoder, and a decoder based on a self-attention mechanism.
Spatial-skeleton-guided encoder: spatial skeleton data is in fact a sampled, compact point cloud, a combination of curves and curved surfaces formed by points, and is often used in shape-completion work. While preserving the global shape, the skeleton represents the complex topology of thin regions well and has low learning complexity. The shape information in a sketch is often conveyed by simple lines, such as the legs of a chair, and other auxiliary information such as viewpoints, masks or surface normals is not well suited to guiding the reconstruction of fine parts. To address the sparsity of sketch information, the invention designs a spatial-skeleton-guided encoder that extracts skeleton features from the sketch, fuses them with the average shape features extracted by the sketch encoder, and reconstructs more complete and accurate voxels.
The input is a B×C×W×H synthetic sketch. The sketch is single-channel and binary; the invention replicates it three times, so C = 3. The input first passes through convolutional layer Conv1 and two residual blocks; a parallel branch through Conv2 and a pooling layer is downsampled to the same scale and aligned with the residual-block output. Skeleton features for fusion with the sketch's average shape features are obtained through fully connected layer Fc1, and the predicted skeleton is obtained by concatenating the features from fully connected layer Fc2 with the other same-dimension features. The specific structure is shown in fig. 2: the two convolutions of Conv1 have kernel sizes 7×7 and 3×3; the two convolutions of Conv2 both have kernel size 3×3, followed by a max-pooling layer; the residual blocks all use 3×3 kernels; the dimensions of fully connected layers Fc1 and Fc2 are 1×1024; and the activation function of the convolutional, pooling and fully connected layers is ReLU.
Decoder based on a self-attention mechanism: the ambiguity of sketches arises mainly from the lack of color and texture information; curved surfaces are left blank, and objects with similar outlines are hard to distinguish. Visual self-attention can learn the interrelationships between the elements of a feature; on this basis, the invention uses a self-attention mechanism to learn long-range dependencies in the sketch and eliminates ambiguity through the relationships between parts.
The encoded sketch features are fed into the decoder, where the voxel output V is obtained through a self-attention layer followed by deconvolution layers. The specific process is as follows: a shape transformation (a 1×1 convolution) produces a feature map m ∈ R^(c×n), where c is the number of channels and n is the number of feature locations. The attention coefficients β_j,i are obtained through the transfer functions (fully connected layers) q(·), k(·), v(·):

β_j,i = exp(s_ij) / Σ_i exp(s_ij), where s_ij = q(m_i)^T k(m_j) #(1)

The attention output o_j is then:

o_j = Σ_i β_j,i · v(m_i) #(2)

where · denotes matrix multiplication and o = (o_1, o_2, …, o_j, …, o_n). Combining the self-attention map with the input feature map gives the final output y_j:

y_j = α·o_j + m_j #(3)

where α is a learnable scale parameter. So that the network relies on self-attention only once learning is well under way, the invention sets the initial value of α to 0.
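Equations (1)-(3) can be checked with a small NumPy sketch. The linear maps Wq, Wk, Wv below stand in for the patent's fully connected transfer functions q, k, v, and alpha = 0 reproduces the initial training state in which the layer is an identity:

```python
import numpy as np

def self_attention(m, Wq, Wk, Wv, alpha=0.0):
    """Self-attention over a c x n feature map, following Eqs. (1)-(3).
    Wq/Wk/Wv are plain linear maps standing in for the fully connected
    transfer functions; alpha is the learnable scale, initialized to 0."""
    q, k, v = Wq @ m, Wk @ m, Wv @ m              # each: d x n
    s = q.T @ k                                   # s[i, j] = q(m_i)^T k(m_j)
    beta = np.exp(s - s.max(axis=0, keepdims=True))
    beta /= beta.sum(axis=0, keepdims=True)       # softmax over i, per column j
    o = v @ beta                                  # o_j = sum_i beta[i,j] v(m_i)
    return alpha * o + m                          # y_j = alpha*o_j + m_j

rng = np.random.default_rng(0)
m = rng.standard_normal((4, 6))                   # c=4 channels, n=6 locations
W = [rng.standard_normal((4, 4)) for _ in range(3)]
y0 = self_attention(m, *W, alpha=0.0)             # identity at initialization
y1 = self_attention(m, *W, alpha=0.5)
```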
The self-attention layer is followed by 9 three-dimensional deconvolution layers, each with kernel size 3×3×3 and ReLU activation; the final output uses a sigmoid activation. The network predicts the occupancy probability of each voxel cell, yielding a voxel model at 32×32×32 resolution; the channel counts are [128, 128, 128, 128, 64, 64, 32, 1]. The specific structure is shown in fig. 3.
Encoder based on the domain-adaptation method: the final input to the designed network is a hand-drawn sketch. For lack of a large database, synthetic sketches are used for training; but because the data distributions of hand-drawn and synthetic sketches differ substantially, directly applying a network trained on synthetic sketches to hand-drawn sketch reconstruction does not yield voxels consistent with the input. The invention's domain-adaptation method aligns the features of synthetic and hand-drawn sketches in feature space. Early in training, the sketch encoder E_v-s extracts the features of the synthetic sketch; after training stabilizes (after 10 epochs), the hand-drawn sketches already paired with synthetic sketches in the dataset are fed into the domain-adaptive encoder E_v-h to obtain hand-drawn sketch features. The parameters of the sketch encoder E_v-s and of the decoder are kept fixed, and an L1-distance constraint is imposed between the synthetic-sketch features and the hand-drawn-sketch features, thereby training the parameters of E_v-h so that its extracted features are suited as input to the self-attention decoder and can also be fused with the skeleton features.
By minimizing the feature distance directly, the domain-adaptation method speeds up network training, avoids the vanishing-gradient problem, and makes training more stable and its results more accurate.
For better inter-domain knowledge transfer, the sketch encoder E_v-s and the domain-adaptive encoder E_v-h designed by the invention share the same network structure: the first layer has a 7×7 convolution kernel, the remaining convolutional layers have 3×3 kernels, the activation function is ReLU, and the output dimension of the final fully connected layer is 1024, as shown in fig. 4.
Loss function: the network's loss function has two main parts, the reconstruction loss L_rec and the domain-adaptation constraint loss L_da. The reconstruction loss L_rec comprises the voxel reconstruction loss L_vol and the skeleton reconstruction loss L_ske. First, the voxel reconstruction loss L_vol, used to train the sketch encoder and the self-attention decoder, is the binary cross-entropy:

L_vol = -[ V_gt log V + (1 - V_gt) log(1 - V) ] #(4)

where V denotes the predicted voxels and V_gt the ground-truth voxels. Second, the skeleton reconstruction loss L_ske, used to train the skeleton reconstruction network, is the chamfer distance:

L_ske = Σ_(p∈P) min_(p_gt∈P_gt) ||p - p_gt||² + Σ_(p_gt∈P_gt) min_(p∈P) ||p_gt - p||² #(5)

where p ∈ P are points of the predicted point set P and p_gt ∈ P_gt are points of the ground-truth point set P_gt. The reconstruction loss is then:

L_rec = L_vol + α·L_ske #(6)
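The chamfer-distance constraint of Eq. (5) can be computed as below. This sketch averages over each point set (a common normalization), whereas the equation as written uses plain sums; the two differ only by a constant factor. The point sets are illustrative placeholders:

```python
import numpy as np

def chamfer_distance(P, P_gt):
    """Symmetric chamfer distance between predicted and ground-truth skeleton
    point sets (Eq. 5): for each point, the squared distance to its nearest
    neighbor in the other set, averaged over each set and summed over both
    directions."""
    # pairwise squared distances, shape (|P|, |P_gt|)
    d2 = ((P[:, None, :] - P_gt[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

P    = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # predicted skeleton points
P_gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])   # ground-truth points
loss = chamfer_distance(P, P_gt)
```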
The domain-adaptation constraint function L_da is the expected L1 distance between the hand-drawn and synthetic sketch features:

L_da = E[ ||ẑ_v - z_v||_1 ] #(7)

where E[·] denotes expectation. The overall objective function is thus:

L = L_rec + β·L_da #(8)
The present invention sets α to 1 and β to 0.001.
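Equations (4) and (6)-(8) combine as below. The loss values fed to total_loss are illustrative placeholders; the weights alpha = 1 and beta = 0.001 follow the text:

```python
import numpy as np

def voxel_bce(V, V_gt, eps=1e-7):
    """Binary cross-entropy over the occupancy grid (Eq. 4),
    clipped for numerical stability."""
    V = np.clip(V, eps, 1 - eps)
    return -(V_gt * np.log(V) + (1 - V_gt) * np.log(1 - V)).mean()

def total_loss(L_vol, L_ske, L_da, alpha=1.0, beta=0.001):
    """Overall objective L = L_rec + beta*L_da with L_rec = L_vol + alpha*L_ske
    (Eqs. 6-8), using alpha=1 and beta=0.001 as set in the text."""
    return (L_vol + alpha * L_ske) + beta * L_da

V_gt = np.array([1.0, 0.0, 1.0, 0.0])   # toy ground-truth occupancy
V    = np.array([0.9, 0.1, 0.8, 0.2])   # toy predicted probabilities
lv = voxel_bce(V, V_gt)
L = total_loss(lv, L_ske=0.5, L_da=2.0)  # placeholder L_ske and L_da values
```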
Experimental part
Experimental datasets: the invention trains and tests on a reconstruction-based dataset and a skeleton-based dataset.
The reconstruction-based dataset contains synthetic sketches, hand-drawn sketches and 3D voxels. The synthetic sketches are obtained by rendering 43783 objects from 13 categories under 20 views and then extracting the edges of the rendered images; they can be divided into coarse and fine synthetic sketches for comparison with other methods, as shown in fig. 5. The training and test sets are split in an 8:2 ratio. The hand-drawn sketches were produced by randomly selecting 100 rendered images from each class of the ShapeNet dataset and having volunteers draw them, so each can be paired with the synthetic sketch of the same image. There are 1300 hand-drawn sketches in varied styles. Sketch resolution is 224×224. This dataset is used to train and test the sketch encoder, the domain-adaptive encoder and the self-attention decoder. The skeleton-based dataset comprises a skeleton dataset and synthetic sketches, the latter likewise obtained by rendering the 43783 objects of 13 categories in the ShapeNet database under 20 views and extracting the edges of the rendered images. This dataset is used to train the spatial-skeleton-guided encoder.
Evaluation metric: the invention is evaluated with intersection-over-union (IoU), the common metric for three-dimensional reconstruction (higher indicates a better reconstruction). The IoU is computed as:

IoU = Σ_(i,j,k) I( pre(i,j,k) > t ) · I( gt(i,j,k) ) / Σ_(i,j,k) I( I( pre(i,j,k) > t ) + I( gt(i,j,k) ) ) #(9)

where pre(i,j,k) and gt(i,j,k) denote, respectively, the predicted occupancy probability and the ground-truth value at spatial coordinate (i,j,k), I(·) is the indicator function, and t is the binarization (denoising) threshold.
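Equation (9) reduces to thresholding and counting; a minimal sketch on a toy 4×4×4 grid (the grid contents are placeholders):

```python
import numpy as np

def voxel_iou(pre, gt, t=0.5):
    """Voxel IoU (Eq. 9): threshold the predicted occupancy probabilities
    at t, then intersection over union against the binary ground truth."""
    p = pre > t
    g = gt.astype(bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union else 1.0

gt  = np.zeros((4, 4, 4)); gt[:2] = 1     # 32 occupied ground-truth voxels
pre = np.zeros((4, 4, 4)); pre[:3] = 0.9  # prediction covers 48 voxels
iou = voxel_iou(pre, gt)                  # overlap 32 over union 48
```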
Experimental setup: the Adam optimizer is chosen to optimize the objective function. In all experiments the learning rate is set to 0.0001 and the batch size to 4; 30 epochs are trained in total, with the domain-adaptive encoder joining the training after 10 epochs. The whole model is built on the programming language Python 2.7 and the deep learning framework TensorFlow 1.4, and trained on an RTX 3090 (24 GB).
The model's performance is compared with several baselines and with state-of-the-art hand-drawn sketch three-dimensional reconstruction methods. Tables 1 and 2 show the comparison results over the 13 categories of the reconstruction-based dataset. Experiments were run on both synthetic and hand-drawn sketch inputs. For comparison with other methods, the model was also tested on coarse synthetic sketches; coarse sketches favor reconstruction methods that use contour constraints, which disadvantages our network, and no domain adaptation was used, yet the model still achieves the best quantitative results. Qualitative results are shown in fig. 7. In addition, reconstruction results for more hand-drawn sketch tests are provided in figs. 8 and 9.
Table 1. Accuracy comparison of different sketch three-dimensional reconstruction methods on the reconstruction-based dataset, with coarse synthetic sketches as input (bold indicates best)
Table 2. Accuracy comparison of different sketch three-dimensional reconstruction methods on the reconstruction-based dataset, with hand-drawn sketches as input
Ablation experiments: a series of ablation experiments were carried out on the reconstruction-based dataset using the fine synthetic sketches and the hand-drawn sketches, where SS denotes the spatial-skeleton-guided encoder, SA denotes the self-attention mechanism in the decoder, and DA denotes the domain-adaptive encoder; the results are shown in Table 3 and figs. 10 and 11. In addition, an ablation experiment was carried out on the fusion mode of the sketch skeleton feature and the average shape feature, where "-" denotes that no skeleton feature is added, "⊕" denotes feature concatenation, and "+" denotes element-wise feature addition; the results are shown in Table 4.
Table 3. Ablation experiments
Table 4. Ablation experiments on the bench category of the reconstruction-based dataset

Claims (2)

1. A hand-drawn sketch three-dimensional model reconstruction method based on spatial skeleton information, characterized by comprising the following steps:
step 1, training stage: first obtaining a synthetic sketch x from a rendered image using an edge extractor; passing the synthetic sketch x through a sketch encoder E_v-s to obtain its average shape feature, and through a spatial-skeleton-guided encoder E_ss to obtain its skeleton feature; after feature fusion, passing the fused feature through the decoder's self-attention layer and finally through the decoder D_v, which reconstructs voxels by three-dimensional deconvolution, the loss being computed against the ground-truth model; wherein, during the feature extraction stage, a domain-adaptive encoder E_v-h extracts features from the hand-drawn sketch m, and a similarity constraint with the synthetic-sketch features realizes domain transfer;
step 2, test stage: for a synthetic sketch input, obtaining feature vectors through E_v-s and E_ss, fusing them, and predicting through D_v; for a hand-drawn sketch input, obtaining feature vectors through E_v-h and E_ss, fusing them, and predicting through D_v.
2. The hand-drawn sketch three-dimensional model reconstruction method based on spatial skeleton information according to claim 1, characterized in that said step 1 comprises the following steps:
step 1.1, obtaining the synthetic sketch, used as the input for training, from a rendered image in the ShapeNet dataset by first taking image derivatives and then binarizing;
step 1.2, feature extraction stage:
step 1.2.1, after obtaining the synthetic sketch, computing an average shape feature z_v = E_v-s(x) and a skeleton feature z_p = E_ss(x) of the same dimension 1×1024, wherein z_v tends to represent the average shape of the category and can be regarded as global information, and z_p, guided by the skeleton data, can be regarded as local-structure information; imposing a chamfer-distance constraint between the predicted skeleton P = E_ss(x) and the ground-truth skeleton P_gt, this loss being used only to train the parameters of the E_ss network;
step 1.2.2, introducing the domain-adaptation method once the model parameters have stabilized; representing the hand-drawn sketches m as a set of n different styles {m_1, m_2, …, m_i, …, m_n} and the synthetic sketch as a style m_n+1 outside this set, the goal of domain adaptation being to align the n styles to m_n+1;
within each epoch, the hand-drawn sketches being trained once every 400 iterations: the input hand-drawn sketch is passed through the domain-adaptive encoder E_v-h to obtain the domain-adaptive feature ẑ_v = E_v-h(m); after obtaining ẑ_v and z_v, E_v-h is trained through the constraint function L_da = E[ ||ẑ_v - z_v||_1 ]; the Earth Mover's (Wasserstein-1) distance makes network training more stable and avoids the vanishing-gradient problem common in ordinary GAN training;
step 1.3, feature fusion and deconvolution reconstruction stage:
step 1.3.1, fusing z_v and z_p by adding corresponding elements:
z = z_v + z_p
step 1.3.2, applying, through the decoder D_v, a self-attention operation to the fused feature to obtain the voxel model V = D_v(z), and training E_v-s and D_v through the voxel reconstruction loss L_vol.
CN202310163381.4A 2023-02-24 2023-02-24 Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information Pending CN116188690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310163381.4A CN116188690A (en) 2023-02-24 2023-02-24 Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310163381.4A CN116188690A (en) 2023-02-24 2023-02-24 Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information

Publications (1)

Publication Number Publication Date
CN116188690A true CN116188690A (en) 2023-05-30

Family

ID=86448363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163381.4A Pending CN116188690A (en) 2023-02-24 2023-02-24 Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information

Country Status (1)

Country Link
CN (1) CN116188690A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116957991A (en) * 2023-09-19 2023-10-27 北京渲光科技有限公司 Three-dimensional model complement method and three-dimensional model complement model generation method
CN116957991B (en) * 2023-09-19 2023-12-15 北京渲光科技有限公司 Three-dimensional model completion method

Similar Documents

Publication Publication Date Title
CN109993825B (en) Three-dimensional reconstruction method based on deep learning
CN110544297A (en) Three-dimensional model reconstruction method for single image
CN102054296A (en) Grid deformation method based on local rigidity
CN112102303A (en) Semantic image analogy method for generating countermeasure network based on single image
CN112785526B (en) Three-dimensional point cloud restoration method for graphic processing
CN112686816A (en) Image completion method based on content attention mechanism and mask code prior
CN112837234A (en) Human face image restoration method based on multi-column gating convolution network
CN111583408A (en) Human body three-dimensional modeling system based on hand-drawn sketch
CN111967533A (en) Sketch image translation method based on scene recognition
Jin et al. Contour-based 3d modeling through joint embedding of shapes and contours
CN116188690A (en) Hand-drawn sketch three-dimensional model reconstruction method based on space skeleton information
CN112686817A (en) Image completion method based on uncertainty estimation
CN115034959A (en) High-definition image translation method based on cross-channel fusion space attention mechanism
CN103413351B (en) Three-dimensional face fast reconstructing method based on compressive sensing theory
CN113393546B (en) Fashion clothing image generation method based on clothing type and texture pattern control
CN110717978A (en) Three-dimensional head reconstruction method based on single image
Yuan et al. Interactive nerf geometry editing with shape priors
Zhu et al. Colorful 3d reconstruction from a single image based on deep learning
CN115661340B (en) Three-dimensional point cloud up-sampling method and system based on source information fusion
CN112581626A (en) Complex curved surface measurement system based on non-parametric and multi-attention force mechanism
CN117237663A (en) Point cloud restoration method for large receptive field
Fei et al. Progressive Growth for Point Cloud Completion by Surface-Projection Optimization
CN113593007B (en) Single-view three-dimensional point cloud reconstruction method and system based on variation self-coding
CN116228986A (en) Indoor scene illumination estimation method based on local-global completion strategy
Jiang et al. Tcgan: Semantic-aware and structure-preserved gans with individual vision transformer for fast arbitrary one-shot image generation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination