CN111882661B - Method for reconstructing three-dimensional scene of video - Google Patents

Method for reconstructing three-dimensional scene of video Download PDF

Info

Publication number
CN111882661B
CN111882661B (application CN202010727956.7A)
Authority
CN
China
Prior art keywords
frame
sequence
video
frames
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010727956.7A
Other languages
Chinese (zh)
Other versions
CN111882661A (en)
Inventor
高跃
李仁杰
赵曦滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010727956.7A priority Critical patent/CN111882661B/en
Publication of CN111882661A publication Critical patent/CN111882661A/en
Application granted granted Critical
Publication of CN111882661B publication Critical patent/CN111882661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video-based stereoscopic scene reconstruction method comprising the following steps: step 1, training sample processing, in which video frames and the sparse depth frames scanned by a lidar are combined into frame sequences of a fixed length for training; step 2, model training, in which the video frames and sparse depth frames of each preprocessed sample are input into the model in sequence, the loss against the ground truth of the sample is calculated, and the model parameters are updated by back propagation; and step 3, in which the test data are input into the model frame by frame to obtain the depth reconstruction result corresponding to each frame. By extracting temporally continuous features from the video, the invention obtains a more accurate reconstruction result.

Description

Method for reconstructing three-dimensional scene of video
Technical Field
The application relates to the technical field of stereoscopic scene reconstruction, in particular to a stereoscopic scene reconstruction method based on video.
Background
In recent years, technologies such as autonomous driving and virtual reality have found ever wider use in daily life, and reconstructing a stereoscopic scene from sensor data is a key step in realizing them. Existing approaches use conventional graphics methods, RGB image guidance, or the structural information of a scene to reconstruct a stereoscopic scene. These methods treat scenes at different moments as isolated objects and ignore an important property of real scenes: continuity in time. Exploiting the continuity information contained in the video modality can improve the quality of scene reconstruction.
Conventional ways of exploiting the continuity of video include adjacent-frame gradients, feature-point matching, pose estimation, and the like. A neural network can instead encode historical features in its hidden state through a recurrent structure; for image sequences, however, a naive recurrent structure brings large memory overhead and high training difficulty.
Disclosure of Invention
The invention aims to extract temporally continuous features from a video through a recurrent network structure and to obtain a better scene reconstruction result by exploiting scene continuity.
The technical solution of the invention provides a video-based stereoscopic scene reconstruction method, characterized by comprising the following steps:
step 1, training sample processing: resizing the video frames and the sparse depth frames scanned by the lidar to a suitable size, resizing the dense depth frames used as supervision data to a suitable size, and then cropping a sub-image of suitable size; dividing the preprocessed video frames, sparse depth frames and dense depth frames into frame sequences of a fixed length as training data;
step 2, model training: inputting each frame of every preprocessed frame sequence into the model in temporal order to obtain a prediction for each frame; feeding the prediction of each frame and its supervision data into a loss function, and updating the weight parameters of the model by back propagation;
step 3, predicting with the model trained in step 2.
Further, in step 1, 480 frame sequences from the KITTI dataset, about 21,000 frames in total, are used as the training dataset; they are divided into a plurality of sub-sequences of length 4, the video frames and the lidar-scanned sparse depth frames are then resized to 375 × 1242 pixels, the dense depth frames used as supervision data are resized to 375 × 1242 pixels, and 370 × 1242 pixel sub-images are then cropped.
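As an illustration of step 1, a minimal preprocessing sketch in PyTorch is given below; it assumes frames arrive as (C, H, W) tensors, that the sub-image is a bottom crop (the patent does not state the crop position), and the function names are chosen here for illustration only.

```python
import torch.nn.functional as F

SEQ_LEN = 4                      # length of each training sub-sequence
TARGET_H, TARGET_W = 375, 1242   # resize target
CROP_H, CROP_W = 370, 1242       # sub-image size

def preprocess_frame(rgb, sparse_depth, dense_depth):
    """Resize one RGB / sparse-depth / dense-depth triple to 375 x 1242 and crop a 370 x 1242 sub-image.

    Inputs are assumed to be float tensors shaped (C, H, W); the bottom crop is an assumption,
    since the patent does not state where the sub-image is taken.
    """
    rgb = F.interpolate(rgb.unsqueeze(0), size=(TARGET_H, TARGET_W),
                        mode="bilinear", align_corners=False).squeeze(0)
    sparse = F.interpolate(sparse_depth.unsqueeze(0), size=(TARGET_H, TARGET_W),
                           mode="nearest").squeeze(0)
    dense = F.interpolate(dense_depth.unsqueeze(0), size=(TARGET_H, TARGET_W),
                          mode="nearest").squeeze(0)

    def crop(x):
        return x[:, -CROP_H:, :CROP_W]   # keep the lower 370 rows

    return crop(rgb), crop(sparse), crop(dense)

def split_into_sequences(frames, seq_len=SEQ_LEN):
    """Split a time-ordered list of (rgb, sparse, dense) triples into non-overlapping length-4 sequences."""
    return [frames[i:i + seq_len] for i in range(0, len(frames) - seq_len + 1, seq_len)]
```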
Further, step 2 comprises:
step 2.1, randomly selecting a not-yet-selected training sequence and inputting it into the model;
step 2.2, for a training sequence ((I_1, d_1, g_1), (I_2, d_2, g_2), ..., (I_4, d_4, g_4)), where I_k, d_k and g_k denote the k-th input video frame, the input sparse depth frame and the semi-dense depth frame used for supervision, respectively, inputting each frame into the model in sequence to obtain a predicted frame sequence (p_1, p_2, ..., p_4); the loss is then calculated as

Loss = Σ_{k=1}^{4} w_k (λ_1 L_1(p_k, g_k) + λ_2 L_2(p_k, g_k))

where L_1 and L_2 are norm loss functions, L_1 being the mean absolute error and L_2 the mean squared error, λ_1 and λ_2 are the loss weights, and w_k is the sequence loss weight of the k-th frame;
step 2.3, calculating the gradient using the loss function described in step 2.2, and updating the network parameters with an ADAM optimizer using lr = 0.001, β_1 = 0.9 and β_2 = 0.999, where lr is the learning rate of the optimizer and β_1 and β_2 are decay hyperparameters of the optimizer;
step 2.4, repeating steps 2.1 to 2.3 until all sequences have been selected, and then marking all sequences as unselected;
step 2.5, repeating steps 2.1 to 2.4 until the model converges.
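A minimal sketch of the sequence loss of step 2.2, written in PyTorch, is given below; the masking of pixels without semi-dense ground truth, the tensor shapes and the function name are assumptions of this sketch rather than details given by the patent.

```python
def sequence_loss(preds, gts, seq_weights, lambda1, lambda2):
    """Weighted L1/L2 loss over a predicted frame sequence.

    preds, gts  : lists of tensors shaped (B, 1, H, W), one per frame of the sequence
    seq_weights : per-frame sequence weights w_k
    lambda1/2   : weights of the L1 (mean absolute error) and L2 (mean squared error) terms
    """
    total = 0.0
    for w_k, p_k, g_k in zip(seq_weights, preds, gts):
        mask = (g_k > 0).float()              # assumed: only supervise pixels that have ground truth
        n = mask.sum().clamp(min=1.0)
        l1 = (mask * (p_k - g_k).abs()).sum() / n
        l2 = (mask * (p_k - g_k) ** 2).sum() / n
        total = total + w_k * (lambda1 * l1 + lambda2 * l2)
    return total
```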
Further, step 3 comprises:
step 3.1, processing the input view video frames and sparse depth frames to the size used in step 1 and forming a frame sequence in temporal order;
step 3.2, inputting each frame of the frame sequence into the encoder to obtain temporally continuous features;
step 3.3, inputting the temporally continuous features obtained in step 3.2 into a dense depth decoder to obtain a reconstruction result of the current frame.
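As a minimal sketch of step 3, frame-by-frame prediction could look as follows in PyTorch; the model(rgb, sparse) call signature and the reset_state hook for clearing the recurrent state between videos are assumptions.

```python
import torch

@torch.no_grad()
def predict_video(model, frames):
    """frames: time-ordered list of (rgb, sparse_depth) tensor pairs for one video (step 3.1)."""
    model.eval()
    if hasattr(model, "reset_state"):      # assumed hook: clear the CLSTM hidden/cell state
        model.reset_state()
    # steps 3.2/3.3: each frame is encoded and decoded in order, so the recurrent state
    # accumulates temporal continuity features across the sequence
    return [model(rgb, sparse) for rgb, sparse in frames]
```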
The beneficial effect of this application is that a recurrent neural network structure extracts and maintains the historical information of the video, and a more accurate stereoscopic scene reconstruction result is obtained through the continuity features of the scene.
Drawings
Fig. 1 is a schematic flow chart of a method for reconstructing a video-based stereoscopic scene according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and detailed description.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.
As shown in fig. 1, this embodiment provides a video-based stereoscopic scene reconstruction method comprising the following steps:
Step 1, training sample processing: the video frames and the sparse depth frames scanned by the lidar are resized to a suitable size, the dense depth frames used as supervision data are resized to a suitable size, and a sub-image of suitable size is then cropped; the preprocessed video frames, sparse depth frames and dense depth frames are divided into frame sequences of a fixed length and used as training data.
In this step, 480 frame sequences from the KITTI dataset, about 21,000 frames, are used as the training dataset; they are divided into a plurality of sub-sequences of length 4, the video frames and the lidar-scanned sparse depth frames are resized to 375 × 1242 pixels, the dense depth frames used as supervision data are resized to 375 × 1242 pixels, and 370 × 1242 pixel sub-images are then cropped.
Step 2, model training: following the order of each preprocessed frame sequence, each frame of the sequence is input into the model to obtain a prediction for that frame. The prediction of each frame and its supervision data are fed into a loss function, and the weight parameters of the model are updated by back propagation.
In this step, the L_1 and L_2 loss weight hyperparameters are chosen as λ_1 = 0.2 and λ_2 = 0.8, and the sequence loss weights are set to (w_1, w_2, w_3, w_4) = (0.8, 0.9, 1.0, 1.0). The following steps are then carried out:
Step 2.1, randomly selecting a not-yet-selected training sequence and inputting it into the model.
Step 2.2, for a training sequence ((I_1, d_1, g_1), (I_2, d_2, g_2), ..., (I_4, d_4, g_4)), where I_k, d_k and g_k denote the k-th input video frame, the input sparse depth frame and the semi-dense depth frame used for supervision, respectively, each frame is input into the model in sequence to obtain a predicted frame sequence (p_1, p_2, ..., p_4). The loss is then calculated as

Loss = Σ_{k=1}^{4} w_k (λ_1 L_1(p_k, g_k) + λ_2 L_2(p_k, g_k))

where L_1 and L_2 are norm loss functions, L_1 being the mean absolute error and L_2 the mean squared error.
Step 2.3, the gradient is calculated using the loss function described in step 2.2, and the network parameters are updated with an ADAM optimizer using lr = 0.001, β_1 = 0.9 and β_2 = 0.999, where lr is the learning rate of the optimizer and β_1 and β_2 are decay hyperparameters of the optimizer.
Step 2.4, steps 2.1 to 2.3 are repeated until all sequences have been selected, and all sequences are then marked as unselected.
Step 2.5, steps 2.1 to 2.4 are repeated until the model converges.
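A compact sketch of the training loop of steps 2.1 to 2.5 with the hyperparameters of this embodiment follows, assuming a PyTorch model called as model(rgb, sparse) and an optional reset_state hook; the fixed epoch count stands in for the unspecified convergence criterion, and invalid-pixel masking (see the loss sketch above) is omitted for brevity.

```python
import random
import torch
import torch.nn.functional as F

LAMBDA1, LAMBDA2 = 0.2, 0.8                 # L1 / L2 loss weights from this embodiment
SEQ_WEIGHTS = (0.8, 0.9, 1.0, 1.0)          # per-frame sequence loss weights

def train(model, sequences, num_epochs=50):
    """sequences: list of length-4 lists of (rgb, sparse_depth, semi_dense_gt) tensors."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    for _ in range(num_epochs):             # step 2.5: repeat until convergence (placeholder criterion)
        random.shuffle(sequences)           # steps 2.1 / 2.4: visit every sequence once, in random order
        for seq in sequences:
            if hasattr(model, "reset_state"):
                model.reset_state()         # assumed hook: clear recurrent state between sequences
            loss = 0.0
            for w_k, (rgb, sparse, gt) in zip(SEQ_WEIGHTS, seq):   # step 2.2: frames in temporal order
                p_k = model(rgb, sparse)
                loss = loss + w_k * (LAMBDA1 * F.l1_loss(p_k, gt) + LAMBDA2 * F.mse_loss(p_k, gt))
            optimizer.zero_grad()           # step 2.3: backpropagation and ADAM update
            loss.backward()
            optimizer.step()
```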
Step 3, prediction is performed with the model trained in step 2.
Step 3.1, the input view video frames and sparse depth frames are processed to the size used in step 1 and formed into a frame sequence in temporal order.
Step 3.2, each frame of the frame sequence is input into the encoder to obtain temporally continuous features:
and 3.2.1, inputting the view video frame into a view characteristic encoder to obtain the encoded view characteristic. Specifically, the view feature encoder sequentially includes: : 7 × 7 convolution; 2 average pooling of 2 x 2; 1 × 1 convolution; two resblocks; 2 x 2 average pooling; a Resblock; 2 x 2 average pooling.
Step 3.2.2, inputting the sparse depth frame into a sparse depth feature encoder to obtain continuous sparse depth features, wherein the sparse depth feature encoder comprises the following steps: 7 × 7 convolution; 2 × 2 average pooling; 1 × 1 convolution; a Resblock; a CLSTM; a Resblock; a CLSTM; 2 average pooling of 2 x 2; a Resblock; a CLSTM; 2 x 2 average pooling. The method mainly comprises the steps that reblock mainly extracts the characteristics of a current frame, CLSTM mainly integrates the characteristics of the current frame and historical frames to obtain continuity characteristics, and the continuity characteristics are maintained.
And 3.2.3, connecting the view characteristics obtained in the step 3.2.1 with the continuous sparse depth characteristics obtained in the step 3.2.2 to obtain complete time continuity characteristics after coding.
Step 3.3, inputting the time continuity features obtained in the step 3.2 into a dense depth decoder to obtain a reconstruction result of the current frame, wherein the dense depth decoder specifically comprises the following steps: 1 × 1 convolution; bilinear upsampling; residual error connection; resblock; upsampling; residual error connection; resblock; residual error connection; upsampling; resblock; batch normalization; a ReLU activation function; 1 × 1 convolution;
the Resblock mentioned in step 3 sequentially comprises: wherein Resblock comprises in sequence: batch normalization; a ReLU activation function; 3 × 3 convolution; batch normalization; a ReLU activation function; 3 × 3 convolution; residual connection with the input features; 3 × 3 convolution;
the CLSTM mentioned in the step 3 is calculated in the following way:
r t =[x t ,h t-1 ]
Figure BDA0002598593280000061
Figure BDA0002598593280000062
Figure BDA0002598593280000063
Figure BDA0002598593280000064
Figure BDA0002598593280000065
h t =o t *tanh(C t )
wherein x, o, h and C are input characteristics, output characteristics, a network hiding state and a network cell state respectively; w is a group of fr ,W fc ,W ir ,W ic ,W c ,W or ,W oc Trainable weight parameters for the convolutions, respectively; b f ,b i ,b c ,b o Trainable bias parameters for the convolution, respectively; sum of
Figure BDA0002598593280000066
Respectively representing a hadamard product and a convolution. σ represents sigmoid function, [ alpha ] represents a]Representing a tensor connection.
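A hedged PyTorch sketch of a CLSTM cell consistent with these equations is given below; the 3 × 3 kernel size, the per-channel peephole parameterization and keeping the biases b_f, b_i, b_c, b_o inside the convolutions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """CLSTM cell: gates are computed by convolutions over r_t = [x_t, h_{t-1}]
    plus Hadamard-product peephole terms on the cell state."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2

        def conv():
            return nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                             kernel_size, padding=padding)

        self.w_fr, self.w_ir, self.w_c, self.w_or = conv(), conv(), conv(), conv()
        # peephole weights applied with the Hadamard product, broadcast over the spatial dimensions
        self.w_fc = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_ic = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_oc = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))

    def forward(self, x_t, h_prev, c_prev):
        r_t = torch.cat([x_t, h_prev], dim=1)                     # r_t = [x_t, h_{t-1}]
        f_t = torch.sigmoid(self.w_fr(r_t) + self.w_fc * c_prev)  # forget gate f_t
        i_t = torch.sigmoid(self.w_ir(r_t) + self.w_ic * c_prev)  # input gate i_t
        c_hat = torch.tanh(self.w_c(r_t))                         # candidate cell state
        c_t = f_t * c_prev + i_t * c_hat                          # C_t
        o_t = torch.sigmoid(self.w_or(r_t) + self.w_oc * c_t)     # output gate o_t
        h_t = o_t * torch.tanh(c_t)                               # h_t = o_t ∘ tanh(C_t)
        return h_t, c_t
```

In the sparse depth feature encoder of step 3.2.2, such a cell sits after each Resblock, and the pair (h_t, C_t) is carried from frame to frame of the sequence so that the continuity features are maintained.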
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be regarded as limited to the specific embodiments described. The scope of the present application is defined by the appended claims and may include various modifications, adaptations and equivalents of the invention without departing from the scope and spirit of the application.

Claims (3)

1. A method for reconstructing a stereoscopic scene based on video is characterized by comprising the following steps:
step 1, training sample processing, namely adjusting the sparse depth frames scanned by a video frame and a laser radar frame to a proper size, adjusting the dense depth frames used as supervision data to a proper size, and then taking a sub-image with a proper size; dividing the preprocessed video frames, the sparse depth frames and the dense depth frames into a frame sequence with a certain length, and taking the frame sequence as training data;
step 2, performing model training, and inputting each frame in the frame sequence into the model according to the sequence of each preprocessed frame sequence to obtain a prediction result of each frame; inputting the prediction result of each frame and the supervision data of each frame into a loss function, and updating the weight parameters of the model by using a back propagation method;
step 3, predicting by using the model trained in step 2:
step 3.1, processing the input view video frames and the sparse depth frames to the size used in step 1 and forming a frame sequence in temporal order;
step 3.2, inputting each frame of the frame sequence into an encoder to obtain temporally continuous features;
step 3.2.1, inputting the view video frame into a view feature encoder to obtain encoded view features;
step 3.2.2, inputting the sparse depth frame into a sparse depth feature encoder to obtain continuous sparse depth features;
step 3.2.3, concatenating the view features obtained in step 3.2.1 with the continuous sparse depth features obtained in step 3.2.2 to obtain complete encoded temporal continuity features;
step 3.3, inputting the temporal continuity features obtained in step 3.2 into a dense depth decoder to obtain a reconstruction result of the current frame, the dense depth decoder comprising, in order: a 1 × 1 convolution; bilinear upsampling; a residual connection; a Resblock; upsampling; a residual connection; a Resblock; a residual connection; upsampling; a Resblock; batch normalization; a ReLU activation function; a 1 × 1 convolution.
2. The method for reconstructing a video-based stereoscopic scene according to claim 1, wherein in step 1, 480 frame sequences from the KITTI dataset, 21,000 frames in total, are used as the training dataset; the frame sequences are divided into a plurality of sub-sequences of length 4, the video frames and the lidar-scanned sparse depth frames are then resized to 375 × 1242 pixels, the dense depth frames used as supervision data are resized to 375 × 1242 pixels, and 370 × 1242 pixel sub-images are then cropped.
3. The method for reconstructing a video-based stereoscopic scene according to claim 1, wherein step 2 comprises:
step 2.1, randomly selecting a not-yet-selected training sequence and inputting it into the model;
step 2.2, for a training sequence ((I_1, d_1, g_1), (I_2, d_2, g_2), ..., (I_4, d_4, g_4)), where I_k, d_k and g_k denote the k-th input video frame, the input sparse depth frame and the semi-dense depth frame used for supervision, respectively, inputting each frame into the model in sequence to obtain a predicted frame sequence (p_1, p_2, ..., p_4); the loss is then calculated as

Loss = Σ_{k=1}^{4} w_k (λ_1 L_1(p_k, g_k) + λ_2 L_2(p_k, g_k))

wherein L_1 and L_2 are norm loss functions, L_1 being the mean absolute error and L_2 the mean squared error, λ_1 and λ_2 are loss weights, and w_k is the sequence loss weight of the k-th frame;
step 2.3, calculating the gradient using the loss function described in step 2.2, and updating the network parameters with an ADAM optimizer using lr = 0.001, β_1 = 0.9 and β_2 = 0.999, where lr is the learning rate of the optimizer and β_1 and β_2 are decay hyperparameters of the optimizer;
step 2.4, repeating steps 2.1 to 2.3 until all sequences have been selected, and then marking all sequences as unselected;
step 2.5, repeating steps 2.1 to 2.4 until the model converges.
CN202010727956.7A 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video Active CN111882661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010727956.7A CN111882661B (en) 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010727956.7A CN111882661B (en) 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video

Publications (2)

Publication Number Publication Date
CN111882661A CN111882661A (en) 2020-11-03
CN111882661B true CN111882661B (en) 2022-07-26

Family

ID=73201398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010727956.7A Active CN111882661B (en) 2020-07-23 2020-07-23 Method for reconstructing three-dimensional scene of video

Country Status (1)

Country Link
CN (1) CN111882661B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546747B (en) * 2013-09-29 2016-11-23 北京航空航天大学 A kind of depth map sequence fractal coding based on color video encoding pattern
CN105225269B (en) * 2015-09-22 2018-08-17 浙江大学 Object modelling system based on motion
EP3349176B1 (en) * 2017-01-17 2021-05-12 Facebook, Inc. Three-dimensional scene reconstruction from set of two-dimensional images for consumption in virtual reality
CN107845134B (en) * 2017-11-10 2020-12-29 浙江大学 Three-dimensional reconstruction method of single object based on color depth camera
CN108416840B (en) * 2018-03-14 2020-02-18 大连理工大学 Three-dimensional scene dense reconstruction method based on monocular camera

Also Published As

Publication number Publication date
CN111882661A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN108596958B (en) Target tracking method based on difficult positive sample generation
CN102156875B (en) Image super-resolution reconstruction method based on multitask KSVD (K singular value decomposition) dictionary learning
CN111445476B (en) Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN108960059A (en) A kind of video actions recognition methods and device
CN110211045A (en) Super-resolution face image method based on SRGAN network
CN113658051A (en) Image defogging method and system based on cyclic generation countermeasure network
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN113177882A (en) Single-frame image super-resolution processing method based on diffusion model
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN114463218B (en) Video deblurring method based on event data driving
CN113205449A (en) Expression migration model training method and device and expression migration method and device
CN110706303A (en) Face image generation method based on GANs
CN112766062B (en) Human behavior identification method based on double-current deep neural network
CN114581304B (en) Image super-resolution and defogging fusion method and system based on circulation network
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN111462208A (en) Non-supervision depth prediction method based on binocular parallax and epipolar line constraint
CN116168067B (en) Supervised multi-modal light field depth estimation method based on deep learning
CN112686817A (en) Image completion method based on uncertainty estimation
CN110335299A (en) A kind of monocular depth estimating system implementation method based on confrontation network
CN109658508B (en) Multi-scale detail fusion terrain synthesis method
CN112184555B (en) Stereo image super-resolution reconstruction method based on deep interactive learning
CN111882661B (en) Method for reconstructing three-dimensional scene of video
CN116912727A (en) Video human behavior recognition method based on space-time characteristic enhancement network
CN114612305B (en) Event-driven video super-resolution method based on stereogram modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant