CN111882661B - Method for reconstructing three-dimensional scene of video - Google Patents

Method for reconstructing three-dimensional scene of video Download PDF

Info

Publication number
CN111882661B
CN111882661B (application CN202010727956.7A)
Authority
CN
China
Prior art keywords
frame
sequence
video
frames
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010727956.7A
Other languages
Chinese (zh)
Other versions
CN111882661A (en)
Inventor
高跃
李仁杰
赵曦滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010727956.7A priority Critical patent/CN111882661B/en
Publication of CN111882661A publication Critical patent/CN111882661A/en
Application granted granted Critical
Publication of CN111882661B publication Critical patent/CN111882661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video-based stereoscopic scene reconstruction method comprising the following steps: step 1, training sample processing, in which video frames and the sparse depth frames scanned by a lidar are combined into frame sequences of a fixed length for training; step 2, model training, in which the video frames and sparse depth frames of each preprocessed sample are input into the model in sequence, the loss against the ground truth of the sample is calculated, and the model parameters are updated by back propagation; and step 3, in which the test data are input into the model frame by frame to obtain the depth reconstruction result corresponding to each frame. By extracting temporally continuous features from the video, the invention obtains a more accurate reconstruction result.

Description

Method for reconstructing three-dimensional scene of video
Technical Field
The application relates to the technical field of stereoscopic scene reconstruction, in particular to a stereoscopic scene reconstruction method based on video.
Background
In recent years, technologies such as autonomous driving and virtual reality have found ever wider use in daily life, and reconstructing a stereoscopic scene from sensor data is a key step in realizing them. Existing approaches use conventional graphics methods, RGB image guidance, or the structural information of a scene to reconstruct a stereoscopic scene. These methods treat scenes at different moments as isolated objects and ignore an important property of real scenes: continuity in time. Exploiting the continuity information contained in the video modality can improve the quality of scene reconstruction.
Conventional ways of exploiting the continuity of video include adjacent-frame gradients, feature-point matching, pose estimation, and the like. A neural network can instead encode historical features in its hidden state through a recurrent structure; for image sequences, however, a naive recurrent structure brings large memory overhead and high training difficulty.
Disclosure of Invention
The invention aims to extract temporally continuous features from a video through a recurrent network structure and to obtain a better scene reconstruction result by exploiting scene continuity.
The technical solution of the invention provides a video-based stereoscopic scene reconstruction method, characterized by comprising the following steps:
step 1, training sample processing: resizing the video frames and the sparse depth frames scanned by the lidar to a suitable size, resizing the dense depth frames used as supervision data to a suitable size, and then cropping a sub-image of suitable size; dividing the preprocessed video frames, sparse depth frames and dense depth frames into frame sequences of a fixed length as training data;
step 2, model training: inputting each frame of every preprocessed frame sequence into the model in temporal order to obtain a prediction for each frame; feeding the prediction of each frame and its supervision data into a loss function, and updating the weight parameters of the model by back propagation;
step 3, predicting with the model trained in step 2.
Further, in step 1, 480 frame sequences from the KITTI dataset, about 21,000 frames in total, are used as the training dataset; they are divided into a plurality of sub-sequences of length 4, the video frames and the lidar-scanned sparse depth frames are then resized to 375 × 1242 pixels, the dense depth frames used as supervision data are resized to 375 × 1242 pixels, and 370 × 1242 pixel sub-images are then cropped.
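As an illustration of step 1, a minimal preprocessing sketch in PyTorch is given below; it assumes frames arrive as (C, H, W) tensors, that the sub-image is a bottom crop (the patent does not state the crop position), and the function names are chosen here for illustration only.

```python
import torch.nn.functional as F

SEQ_LEN = 4                      # length of each training sub-sequence
TARGET_H, TARGET_W = 375, 1242   # resize target
CROP_H, CROP_W = 370, 1242       # sub-image size

def preprocess_frame(rgb, sparse_depth, dense_depth):
    """Resize one RGB / sparse-depth / dense-depth triple to 375 x 1242 and crop a 370 x 1242 sub-image.

    Inputs are assumed to be float tensors shaped (C, H, W); the bottom crop is an assumption,
    since the patent does not state where the sub-image is taken.
    """
    rgb = F.interpolate(rgb.unsqueeze(0), size=(TARGET_H, TARGET_W),
                        mode="bilinear", align_corners=False).squeeze(0)
    sparse = F.interpolate(sparse_depth.unsqueeze(0), size=(TARGET_H, TARGET_W),
                           mode="nearest").squeeze(0)
    dense = F.interpolate(dense_depth.unsqueeze(0), size=(TARGET_H, TARGET_W),
                          mode="nearest").squeeze(0)

    def crop(x):
        return x[:, -CROP_H:, :CROP_W]   # keep the lower 370 rows

    return crop(rgb), crop(sparse), crop(dense)

def split_into_sequences(frames, seq_len=SEQ_LEN):
    """Split a time-ordered list of (rgb, sparse, dense) triples into non-overlapping length-4 sequences."""
    return [frames[i:i + seq_len] for i in range(0, len(frames) - seq_len + 1, seq_len)]
```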
Further, step 2 comprises:
step 2.1, randomly selecting a not-yet-selected training sequence and inputting it into the model;
step 2.2, for a training sequence ((I_1, d_1, g_1), (I_2, d_2, g_2), ..., (I_4, d_4, g_4)), where I_k, d_k and g_k denote the k-th input video frame, the input sparse depth frame and the semi-dense depth frame used for supervision, respectively, inputting each frame into the model in sequence to obtain a predicted frame sequence (p_1, p_2, ..., p_4); the loss is then calculated as

Loss = Σ_{k=1}^{4} w_k (λ_1 L_1(p_k, g_k) + λ_2 L_2(p_k, g_k))

where L_1 and L_2 are norm loss functions, L_1 being the mean absolute error and L_2 the mean squared error, λ_1 and λ_2 are the loss weights, and w_k is the sequence loss weight of the k-th frame;
step 2.3, calculating the gradient using the loss function described in step 2.2, and updating the network parameters with an ADAM optimizer using lr = 0.001, β_1 = 0.9 and β_2 = 0.999, where lr is the learning rate of the optimizer and β_1 and β_2 are decay hyperparameters of the optimizer;
step 2.4, repeating steps 2.1 to 2.3 until all sequences have been selected, and then marking all sequences as unselected;
step 2.5, repeating steps 2.1 to 2.4 until the model converges.
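A minimal sketch of the sequence loss of step 2.2, written in PyTorch, is given below; the masking of pixels without semi-dense ground truth, the tensor shapes and the function name are assumptions of this sketch rather than details given by the patent.

```python
def sequence_loss(preds, gts, seq_weights, lambda1, lambda2):
    """Weighted L1/L2 loss over a predicted frame sequence.

    preds, gts  : lists of tensors shaped (B, 1, H, W), one per frame of the sequence
    seq_weights : per-frame sequence weights w_k
    lambda1/2   : weights of the L1 (mean absolute error) and L2 (mean squared error) terms
    """
    total = 0.0
    for w_k, p_k, g_k in zip(seq_weights, preds, gts):
        mask = (g_k > 0).float()              # assumed: only supervise pixels that have ground truth
        n = mask.sum().clamp(min=1.0)
        l1 = (mask * (p_k - g_k).abs()).sum() / n
        l2 = (mask * (p_k - g_k) ** 2).sum() / n
        total = total + w_k * (lambda1 * l1 + lambda2 * l2)
    return total
```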
Further, step 3 comprises:
step 3.1, processing the input view video frames and sparse depth frames to the size used in step 1 and forming a frame sequence in temporal order;
step 3.2, inputting each frame of the frame sequence into the encoder to obtain temporally continuous features;
step 3.3, inputting the temporally continuous features obtained in step 3.2 into a dense depth decoder to obtain a reconstruction result of the current frame.
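As a minimal sketch of step 3, frame-by-frame prediction could look as follows in PyTorch; the model(rgb, sparse) call signature and the reset_state hook for clearing the recurrent state between videos are assumptions.

```python
import torch

@torch.no_grad()
def predict_video(model, frames):
    """frames: time-ordered list of (rgb, sparse_depth) tensor pairs for one video (step 3.1)."""
    model.eval()
    if hasattr(model, "reset_state"):      # assumed hook: clear the CLSTM hidden/cell state
        model.reset_state()
    # steps 3.2/3.3: each frame is encoded and decoded in order, so the recurrent state
    # accumulates temporal continuity features across the sequence
    return [model(rgb, sparse) for rgb, sparse in frames]
```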
The beneficial effect of this application is that a recurrent neural network structure extracts and maintains the historical information of the video, and a more accurate stereoscopic scene reconstruction result is obtained through the continuity features of the scene.
Drawings
Fig. 1 is a schematic flow chart of a method for reconstructing a video-based stereoscopic scene according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and detailed description.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.
As shown in fig. 1, this embodiment provides a video-based stereoscopic scene reconstruction method comprising the following steps:
Step 1, training sample processing: the video frames and the sparse depth frames scanned by the lidar are resized to a suitable size, the dense depth frames used as supervision data are resized to a suitable size, and a sub-image of suitable size is then cropped; the preprocessed video frames, sparse depth frames and dense depth frames are divided into frame sequences of a fixed length and used as training data.
In this step, 480 frame sequences from the KITTI dataset, about 21,000 frames, are used as the training dataset; they are divided into a plurality of sub-sequences of length 4, the video frames and the lidar-scanned sparse depth frames are resized to 375 × 1242 pixels, the dense depth frames used as supervision data are resized to 375 × 1242 pixels, and 370 × 1242 pixel sub-images are then cropped.
Step 2, model training: following the order of each preprocessed frame sequence, each frame of the sequence is input into the model to obtain a prediction for that frame. The prediction of each frame and its supervision data are fed into a loss function, and the weight parameters of the model are updated by back propagation.
In this step, the L_1 and L_2 loss weight hyperparameters are chosen as λ_1 = 0.2 and λ_2 = 0.8, and the sequence loss weights are set to (w_1, w_2, w_3, w_4) = (0.8, 0.9, 1.0, 1.0). The following steps are then carried out:
Step 2.1, randomly selecting a not-yet-selected training sequence and inputting it into the model.
Step 2.2, for a training sequence ((I_1, d_1, g_1), (I_2, d_2, g_2), ..., (I_4, d_4, g_4)), where I_k, d_k and g_k denote the k-th input video frame, the input sparse depth frame and the semi-dense depth frame used for supervision, respectively, each frame is input into the model in sequence to obtain a predicted frame sequence (p_1, p_2, ..., p_4). The loss is then calculated as

Loss = Σ_{k=1}^{4} w_k (λ_1 L_1(p_k, g_k) + λ_2 L_2(p_k, g_k))

where L_1 and L_2 are norm loss functions, L_1 being the mean absolute error and L_2 the mean squared error.
Step 2.3, the gradient is calculated using the loss function described in step 2.2, and the network parameters are updated with an ADAM optimizer using lr = 0.001, β_1 = 0.9 and β_2 = 0.999, where lr is the learning rate of the optimizer and β_1 and β_2 are decay hyperparameters of the optimizer.
Step 2.4, steps 2.1 to 2.3 are repeated until all sequences have been selected, and all sequences are then marked as unselected.
Step 2.5, steps 2.1 to 2.4 are repeated until the model converges.
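A compact sketch of the training loop of steps 2.1 to 2.5 with the hyperparameters of this embodiment follows, assuming a PyTorch model called as model(rgb, sparse) and an optional reset_state hook; the fixed epoch count stands in for the unspecified convergence criterion, and invalid-pixel masking (see the loss sketch above) is omitted for brevity.

```python
import random
import torch
import torch.nn.functional as F

LAMBDA1, LAMBDA2 = 0.2, 0.8                 # L1 / L2 loss weights from this embodiment
SEQ_WEIGHTS = (0.8, 0.9, 1.0, 1.0)          # per-frame sequence loss weights

def train(model, sequences, num_epochs=50):
    """sequences: list of length-4 lists of (rgb, sparse_depth, semi_dense_gt) tensors."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    for _ in range(num_epochs):             # step 2.5: repeat until convergence (placeholder criterion)
        random.shuffle(sequences)           # steps 2.1 / 2.4: visit every sequence once, in random order
        for seq in sequences:
            if hasattr(model, "reset_state"):
                model.reset_state()         # assumed hook: clear recurrent state between sequences
            loss = 0.0
            for w_k, (rgb, sparse, gt) in zip(SEQ_WEIGHTS, seq):   # step 2.2: frames in temporal order
                p_k = model(rgb, sparse)
                loss = loss + w_k * (LAMBDA1 * F.l1_loss(p_k, gt) + LAMBDA2 * F.mse_loss(p_k, gt))
            optimizer.zero_grad()           # step 2.3: backpropagation and ADAM update
            loss.backward()
            optimizer.step()
```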
Step 3, prediction is performed with the model trained in step 2.
Step 3.1, the input view video frames and sparse depth frames are processed to the size used in step 1 and formed into a frame sequence in temporal order.
Step 3.2, each frame of the frame sequence is input into the encoder to obtain temporally continuous features:
and 3.2.1, inputting the view video frame into a view characteristic encoder to obtain the encoded view characteristic. Specifically, the view feature encoder sequentially includes: : 7 × 7 convolution; 2 average pooling of 2 x 2; 1 × 1 convolution; two resblocks; 2 x 2 average pooling; a Resblock; 2 x 2 average pooling.
Step 3.2.2, inputting the sparse depth frame into a sparse depth feature encoder to obtain continuous sparse depth features, wherein the sparse depth feature encoder comprises the following steps: 7 × 7 convolution; 2 × 2 average pooling; 1 × 1 convolution; a Resblock; a CLSTM; a Resblock; a CLSTM; 2 average pooling of 2 x 2; a Resblock; a CLSTM; 2 x 2 average pooling. The method mainly comprises the steps that reblock mainly extracts the characteristics of a current frame, CLSTM mainly integrates the characteristics of the current frame and historical frames to obtain continuity characteristics, and the continuity characteristics are maintained.
And 3.2.3, connecting the view characteristics obtained in the step 3.2.1 with the continuous sparse depth characteristics obtained in the step 3.2.2 to obtain complete time continuity characteristics after coding.
Step 3.3, inputting the time continuity features obtained in the step 3.2 into a dense depth decoder to obtain a reconstruction result of the current frame, wherein the dense depth decoder specifically comprises the following steps: 1 × 1 convolution; bilinear upsampling; residual error connection; resblock; upsampling; residual error connection; resblock; residual error connection; upsampling; resblock; batch normalization; a ReLU activation function; 1 × 1 convolution;
the Resblock mentioned in step 3 sequentially comprises: wherein Resblock comprises in sequence: batch normalization; a ReLU activation function; 3 × 3 convolution; batch normalization; a ReLU activation function; 3 × 3 convolution; residual connection with the input features; 3 × 3 convolution;
the CLSTM mentioned in the step 3 is calculated in the following way:
r t =[x t ,h t-1 ]
Figure BDA0002598593280000061
Figure BDA0002598593280000062
Figure BDA0002598593280000063
Figure BDA0002598593280000064
Figure BDA0002598593280000065
h t =o t *tanh(C t )
wherein x, o, h and C are input characteristics, output characteristics, a network hiding state and a network cell state respectively; w is a group of fr ,W fc ,W ir ,W ic ,W c ,W or ,W oc Trainable weight parameters for the convolutions, respectively; b f ,b i ,b c ,b o Trainable bias parameters for the convolution, respectively; sum of
Figure BDA0002598593280000066
Respectively representing a hadamard product and a convolution. σ represents sigmoid function, [ alpha ] represents a]Representing a tensor connection.
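A hedged PyTorch sketch of a CLSTM cell consistent with these equations is given below; the 3 × 3 kernel size, the per-channel peephole parameterization and keeping the biases b_f, b_i, b_c, b_o inside the convolutions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """CLSTM cell: gates are computed by convolutions over r_t = [x_t, h_{t-1}]
    plus Hadamard-product peephole terms on the cell state."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2

        def conv():
            return nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                             kernel_size, padding=padding)

        self.w_fr, self.w_ir, self.w_c, self.w_or = conv(), conv(), conv(), conv()
        # peephole weights applied with the Hadamard product, broadcast over the spatial dimensions
        self.w_fc = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_ic = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_oc = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))

    def forward(self, x_t, h_prev, c_prev):
        r_t = torch.cat([x_t, h_prev], dim=1)                     # r_t = [x_t, h_{t-1}]
        f_t = torch.sigmoid(self.w_fr(r_t) + self.w_fc * c_prev)  # forget gate f_t
        i_t = torch.sigmoid(self.w_ir(r_t) + self.w_ic * c_prev)  # input gate i_t
        c_hat = torch.tanh(self.w_c(r_t))                         # candidate cell state
        c_t = f_t * c_prev + i_t * c_hat                          # C_t
        o_t = torch.sigmoid(self.w_or(r_t) + self.w_oc * c_t)     # output gate o_t
        h_t = o_t * torch.tanh(c_t)                               # h_t = o_t ∘ tanh(C_t)
        return h_t, c_t
```

In the sparse depth feature encoder of step 3.2.2, such a cell sits after each Resblock, and the pair (h_t, C_t) is carried from frame to frame of the sequence so that the continuity features are maintained.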
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be regarded as limited to the specific embodiments described. The scope of the present application is defined by the appended claims and may include various modifications, adaptations and equivalents of the invention without departing from the scope and spirit of the application.

Claims (3)

1. A method for reconstructing a stereoscopic scene based on video is characterized by comprising the following steps:
step 1, training sample processing, namely adjusting the sparse depth frames scanned by a video frame and a laser radar frame to a proper size, adjusting the dense depth frames used as supervision data to a proper size, and then taking a sub-image with a proper size; dividing the preprocessed video frames, the sparse depth frames and the dense depth frames into a frame sequence with a certain length, and taking the frame sequence as training data;
step 2, performing model training, and inputting each frame in the frame sequence into the model according to the sequence of each preprocessed frame sequence to obtain a prediction result of each frame; inputting the prediction result of each frame and the supervision data of each frame into a loss function, and updating the weight parameters of the model by using a back propagation method;
step 3, predicting by using the model trained in step 2:
step 3.1, processing the input view video frames and the sparse depth frames to the size used in step 1 and forming a frame sequence in temporal order;
step 3.2, inputting each frame of the frame sequence into an encoder to obtain temporally continuous features;
step 3.2.1, inputting the view video frame into a view feature encoder to obtain encoded view features;
step 3.2.2, inputting the sparse depth frame into a sparse depth feature encoder to obtain continuous sparse depth features;
step 3.2.3, concatenating the view features obtained in step 3.2.1 with the continuous sparse depth features obtained in step 3.2.2 to obtain complete encoded temporal continuity features;
step 3.3, inputting the temporal continuity features obtained in step 3.2 into a dense depth decoder to obtain a reconstruction result of the current frame, the dense depth decoder comprising, in order: a 1 × 1 convolution; bilinear upsampling; a residual connection; a Resblock; upsampling; a residual connection; a Resblock; a residual connection; upsampling; a Resblock; batch normalization; a ReLU activation function; a 1 × 1 convolution.
2. The method for reconstructing a video-based stereoscopic scene according to claim 1, wherein in step 1, 480 frame sequences from the KITTI dataset, 21,000 frames in total, are used as the training dataset; the frame sequences are divided into a plurality of sub-sequences of length 4, the video frames and the lidar-scanned sparse depth frames are then resized to 375 × 1242 pixels, the dense depth frames used as supervision data are resized to 375 × 1242 pixels, and 370 × 1242 pixel sub-images are then cropped.
3. The method for reconstructing a video-based stereoscopic scene according to claim 1, wherein step 2 comprises:
step 2.1, randomly selecting a not-yet-selected training sequence and inputting it into the model;
step 2.2, for a training sequence ((I_1, d_1, g_1), (I_2, d_2, g_2), ..., (I_4, d_4, g_4)), where I_k, d_k and g_k denote the k-th input video frame, the input sparse depth frame and the semi-dense depth frame used for supervision, respectively, inputting each frame into the model in sequence to obtain a predicted frame sequence (p_1, p_2, ..., p_4); the loss is then calculated as

Loss = Σ_{k=1}^{4} w_k (λ_1 L_1(p_k, g_k) + λ_2 L_2(p_k, g_k))

wherein L_1 and L_2 are norm loss functions, L_1 being the mean absolute error and L_2 the mean squared error, λ_1 and λ_2 are loss weights, and w_k is the sequence loss weight of the k-th frame;
step 2.3, calculating the gradient using the loss function described in step 2.2, and updating the network parameters with an ADAM optimizer using lr = 0.001, β_1 = 0.9 and β_2 = 0.999, where lr is the learning rate of the optimizer and β_1 and β_2 are decay hyperparameters of the optimizer;
step 2.4, repeating steps 2.1 to 2.3 until all sequences have been selected, and then marking all sequences as unselected;
step 2.5, repeating steps 2.1 to 2.4 until the model converges.
CN202010727956.7A 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video Active CN111882661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010727956.7A CN111882661B (en) 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010727956.7A CN111882661B (en) 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video

Publications (2)

Publication Number Publication Date
CN111882661A CN111882661A (en) 2020-11-03
CN111882661B true CN111882661B (en) 2022-07-26

Family

ID=73201398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010727956.7A Active CN111882661B (en) 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video

Country Status (1)

Country Link
CN (1) CN111882661B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546747B (en) * 2013-09-29 2016-11-23 北京航空航天大学 A kind of depth map sequence fractal coding based on color video encoding pattern
CN105225269B (en) * 2015-09-22 2018-08-17 浙江大学 Object modelling system based on motion
EP3349176B1 (en) * 2017-01-17 2021-05-12 Facebook, Inc. Three-dimensional scene reconstruction from set of two-dimensional images for consumption in virtual reality
CN107845134B (en) * 2017-11-10 2020-12-29 浙江大学 Three-dimensional reconstruction method of single object based on color depth camera
CN108416840B (en) * 2018-03-14 2020-02-18 大连理工大学 Three-dimensional scene dense reconstruction method based on monocular camera

Also Published As

Publication number Publication date
CN111882661A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN108596958B (en) Target tracking method based on difficult positive sample generation
CN102156875B (en) Image super-resolution reconstruction method based on multitask KSVD (K singular value decomposition) dictionary learning
CN111445476B (en) Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN108960059A (en) A kind of video actions recognition methods and device
CN110211045A (en) Super-resolution face image method based on SRGAN network
CN113658051A (en) Image defogging method and system based on cyclic generation countermeasure network
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN113177882A (en) Single-frame image super-resolution processing method based on diffusion model
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN114463218B (en) Video deblurring method based on event data driving
CN113205449A (en) Expression migration model training method and device and expression migration method and device
CN110706303A (en) Face image generation method based on GANs
CN112766062B (en) Human behavior identification method based on double-current deep neural network
CN114581304B (en) Image super-resolution and defogging fusion method and system based on circulation network
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN111462208A (en) Non-supervision depth prediction method based on binocular parallax and epipolar line constraint
CN116168067B (en) Supervised multi-modal light field depth estimation method based on deep learning
CN112686817A (en) Image completion method based on uncertainty estimation
CN110335299A (en) A kind of monocular depth estimating system implementation method based on confrontation network
CN109658508B (en) Multi-scale detail fusion terrain synthesis method
CN112184555B (en) Stereo image super-resolution reconstruction method based on deep interactive learning
CN111882661B (en) Method for reconstructing three-dimensional scene of video
CN116912727A (en) Video human behavior recognition method based on space-time characteristic enhancement network
CN114612305B (en) Event-driven video super-resolution method based on stereogram modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant