CN111105432A - Unsupervised end-to-end driving environment perception method based on deep learning - Google Patents

Unsupervised end-to-end driving environment perception method based on deep learning

Info

Publication number
CN111105432A
CN111105432A (application CN201911345900.9A)
Authority
CN
China
Prior art keywords
estimation network
pose
depth
flow
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911345900.9A
Other languages
Chinese (zh)
Other versions
CN111105432B (en)
Inventor
陈宗海
洪洋
王纪凯
戴德云
赵皓
包鹏
江建文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201911345900.9A priority Critical patent/CN111105432B/en
Publication of CN111105432A publication Critical patent/CN111105432A/en
Application granted granted Critical
Publication of CN111105432B publication Critical patent/CN111105432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/215 Physics; Computing; Image data processing or generation; Image analysis; Analysis of motion; Motion-based segmentation
    • G06T 7/269 Physics; Computing; Image data processing or generation; Image analysis; Analysis of motion using gradient-based methods
    • G06N 3/045 Physics; Computing; Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
    • G06N 3/08 Physics; Computing; Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T 2207/10016 Indexing scheme for image analysis or image enhancement; Image acquisition modality; Video; Image sequence
    • G06T 2207/10024 Indexing scheme for image analysis or image enhancement; Image acquisition modality; Color image
    • G06T 2207/20081 Indexing scheme for image analysis or image enhancement; Special algorithmic details; Training; Learning
    • G06T 2207/20084 Indexing scheme for image analysis or image enhancement; Special algorithmic details; Artificial neural networks [ANN]
    • Y02T 10/40 Climate change mitigation technologies related to transportation; Road transport of goods or passengers; Internal combustion engine [ICE] based vehicles; Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an unsupervised end-to-end driving environment perception method based on deep learning, which comprises the following steps: acquiring images with a binocular camera and preprocessing them to obtain training data; training an optical flow estimation network, a pose estimation network, a depth estimation network, and motion segmentation using two consecutive same-size stereo image pairs in the training data; carrying out rigid registration using the output results of the three networks to optimize the output of the pose estimation network; and calculating the rigid flow caused by camera motion from the output of the depth estimation network and the optimized pose, and performing a flow consistency check against the output of the optical flow estimation network to obtain the motion segmentation. The method adopts an unsupervised end-to-end framework that does not require ground-truth depth, pose, or optical flow as labels for supervised training, and it can obtain camera poses with absolute scale and dense depth map estimates, so that dynamic objects can be segmented with high precision.

Description

Unsupervised end-to-end driving environment perception method based on deep learning
Technical Field
The invention relates to the technical field of intelligent driving, in particular to an unsupervised end-to-end driving environment perception method based on deep learning.
Background
Learning three-dimensional scene geometry, scene flow, and robot motion relative to a rigid scene from video images is an important research topic in computer vision and has found widespread application in many different fields, including autonomous driving, robot navigation, and video analysis. However, current deep-learning-based environment perception methods are supervised learning frameworks, and obtaining the ground-truth labels needed for training is very difficult. In recent years, much progress has been made in unsupervised learning of depth, optical flow, and pose with convolutional neural networks. These methods have their own advantages and limitations. Unsupervised deep learning approaches exploit the geometry of the scene and decompose the problem into multiple orthogonal sub-problems, adding more constraints to the solution through additional temporal image frames or stereo image information. On the one hand, current deep-learning-based optical flow, depth, and pose estimation methods assume that the entire scene is static, so they have difficulty handling moving objects. On the other hand, optical flow methods can in principle handle moving objects, but they struggle in regions with complex structure and in occluded regions.
The Chinese patent "Method for estimating and optimizing depth of a monocular view in a video sequence by using deep learning" (publication number CN108765479A) estimates and optimizes the depth of a monocular view in a video sequence with deep learning, but monocular methods suffer from scale uncertainty, so the estimated depth has unknown scale and little practical value.
The Chinese patent "Binocular depth estimation method based on a deep convolutional network" (publication number CN109598754A) trains a deep convolutional neural network on binocular images for depth estimation, but ground-truth depth is required as labels during training, and obtaining ground-truth depth in real environments is very difficult and expensive.
The Chinese patent "Monocular visual positioning method based on unsupervised learning" (publication number CN109472830A) performs monocular visual positioning with unsupervised learning, but monocular visual positioning suffers from scale uncertainty and scale drift, its positioning accuracy is poor, and the scale uncertainty of the positioning has no engineering value in real environments.
Therefore, the current driving environment perception method based on deep learning still has the following problems:
1) depth estimation and pose estimation models trained on monocular image sequences are limited by monocular scale uncertainty and scale drift; the estimated depth and pose have unknown scale, so such models have little practical value;
2) current deep-learning-based depth, pose, and optical flow estimation methods require ground-truth supervision for training, but acquiring ground-truth data in real environments is very difficult and costly;
3) dynamic objects are very common in real driving environments, yet current deep-learning-based environment perception methods do not consider their influence, so accuracy still needs to be improved.
Disclosure of Invention
The invention aims to provide an unsupervised end-to-end driving environment perception method based on deep learning. It adopts an unsupervised end-to-end framework that does not require ground-truth depth, pose, or optical flow as labels for supervised training, and it can obtain camera poses with absolute scale and dense depth map estimates, so that dynamic objects can be segmented with high precision.
The purpose of the invention is realized by the following technical scheme:
an unsupervised end-to-end driving environment perception method based on deep learning comprises the following steps:
acquiring images by using a binocular camera, and preprocessing to obtain training data;
utilizing two continuous stereo images with the same size in training data to train an optical flow estimation network, a pose estimation network, a depth estimation network and motion segmentation;
after training is finished, carrying out rigid registration on two newly input continuous stereo image pairs with the same size by using output results of the three networks to optimize the output of a pose estimation network; and calculating rigid flow caused by the motion of the camera by using the output of the depth estimation network and the output of the optimized pose estimation network, and performing flow consistency check with the output of the optical flow estimation network so as to perform motion segmentation.
The technical solution provided by the invention requires only binocular RGB images as training data, so data acquisition is very simple. With a unified framework, optical flow, depth, pose, and motion segmentation can be learned at the same time; the training process of the model is simple and direct, very few parameters need to be tuned, and the scene transfer capability is strong. The model adapts well: it can learn optical flow together with geometric information of the environment such as absolute-scale depth and pose in an unsupervised, end-to-end manner, and because the estimated optical flow, pose, and depth are highly accurate, dynamic objects can be segmented with high precision.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of an unsupervised end-to-end driving environment sensing method based on deep learning according to an embodiment of the present invention;
fig. 2 is a framework diagram of an unsupervised end-to-end driving environment sensing method based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an unsupervised end-to-end driving environment perception method based on deep learning, and as shown in fig. 1-2, a flow chart and a frame chart of the method are respectively provided. The method mainly comprises the following steps:
1. and acquiring images by using a binocular camera, and preprocessing to obtain training data.
In the embodiment of the invention, the binocular camera is applied to driving environment perception, so that the binocular camera is installed on a vehicle and is used for acquiring an environment image.
Before being fed to the networks, the original images acquired by the binocular camera are scaled down to reduce training time, computation cost, and hardware consumption, and the corresponding camera intrinsic parameters are scaled accordingly.
In addition, data augmentation is applied to improve the generalization of the model and reduce overfitting. Training data are generated in this way, and at each training step two consecutive same-size stereo image pairs are extracted and fed to the networks for training. The two consecutive same-size stereo image pairs are denoted L1, R1, L2 and R2, where L1 and R1 denote the left and right images at time t1, and L2 and R2 denote the left and right images at time t2; the image width and height are denoted W and H.
In the embodiment of the invention, data augmentation is performed in one or more of the following ways:
randomly adjusting the brightness of the input images with a brightness factor γ;
scaling the images along the X and Y axes with scale factors s_x and s_y, and then randomly cropping them to a specified size;
randomly rotating the images by r degrees with nearest-neighbor interpolation;
random left-right flipping and random temporal order switching (exchanging t1 and t2).
Illustratively, the settings γ ∈ [0.7, 1.3], s_x ∈ [1.0, 1.2], s_y ∈ [1.0, 1.2], r ∈ [-5, 5] may be adopted; the specified crop size may be set to 832 × 256.
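For illustration, a minimal data-augmentation sketch consistent with the transformations listed above might look as follows; the function name, the use of NumPy/OpenCV, and the way the four images are handled together are assumptions of this sketch rather than part of the patented method:

```python
import random
import numpy as np
import cv2


def augment_pair(L1, R1, L2, R2, crop_w=832, crop_h=256):
    """Illustrative augmentation of two consecutive stereo pairs (H x W x 3, float32 in [0, 1])."""
    imgs = [L1, R1, L2, R2]

    # Random brightness factor gamma in [0.7, 1.3], applied to all four images.
    gamma = random.uniform(0.7, 1.3)
    imgs = [np.clip(im * gamma, 0.0, 1.0) for im in imgs]

    # Random scaling along x and y with s_x, s_y in [1.0, 1.2], then a random crop to crop_h x crop_w.
    sx, sy = random.uniform(1.0, 1.2), random.uniform(1.0, 1.2)
    imgs = [cv2.resize(im, None, fx=sx, fy=sy, interpolation=cv2.INTER_LINEAR) for im in imgs]
    h, w = imgs[0].shape[:2]
    y0, x0 = random.randint(0, h - crop_h), random.randint(0, w - crop_w)
    imgs = [im[y0:y0 + crop_h, x0:x0 + crop_w] for im in imgs]

    # Random rotation by r in [-5, 5] degrees with nearest-neighbour interpolation.
    r = random.uniform(-5.0, 5.0)
    M = cv2.getRotationMatrix2D((crop_w / 2, crop_h / 2), r, 1.0)
    imgs = [cv2.warpAffine(im, M, (crop_w, crop_h), flags=cv2.INTER_NEAREST) for im in imgs]

    # Random left-right flip and random temporal order switch (exchange t1 and t2).
    # A full implementation would also swap the left/right images on a horizontal flip
    # to keep the stereo geometry consistent.
    if random.random() < 0.5:
        imgs = [im[:, ::-1].copy() for im in imgs]
    if random.random() < 0.5:
        imgs = [imgs[2], imgs[3], imgs[0], imgs[1]]

    return imgs
```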
2. Training an optical flow estimation network, a pose estimation network, a depth estimation network, and motion segmentation using two consecutive same-size stereo image pairs in the training data.
In this step, the training of the optical flow estimation network, the pose estimation network, the depth estimation network, and the motion segmentation by using two consecutive same-size stereo images in the training data is mainly divided into the following two stages:
the first stage is as follows: and training an optical flow estimation network by using continuous stereo images with the same size in the training data, and simultaneously training a pose estimation network and a depth estimation network.
In this stage, the optical flow estimation network is first trained using the two consecutive left images L1 and L2 and the designed optical flow loss function ℓ_flow. Its output is the optical flow F12^flow between the two consecutive same-size left images L1 and L2, with the same dimensions as the input images.

The optical flow loss function ℓ_flow comprises an occlusion-aware reconstruction loss term ℓ_rec^flow and a smoothing loss term ℓ_sm^flow. ℓ_rec^flow is based on a weighted average between the structural similarity (SSIM) loss and the absolute photometric difference over the non-occluded area; ℓ_sm^flow is the mean absolute value of the edge-weighted second derivative of the optical flow over moving areas (the constraint on the optical flow over static areas is provided by the consistency loss described later):

$$\psi(I, \tilde{I}) = \alpha\,\frac{1 - \mathrm{SSIM}(I, \tilde{I})}{2} + (1 - \alpha)\,\big|I - \tilde{I}\big|$$

$$\ell_{rec}^{flow} = \frac{1}{N}\sum_{(i,j)} M_1(i,j)\, O_1(i,j)\, \psi\!\left(L_1(i,j), \tilde{L}_1^{flow}(i,j)\right)$$

$$\ell_{sm}^{flow} = \frac{1}{N}\sum_{(i,j)} \sum_{a \in \{x, y\}} M_1(i,j)\, \big|\partial_a^2 F_{12}^{flow}(i,j)\big|\, e^{-\beta\,\left|\partial_a L_1(i,j)\right|}$$

where ψ(·) denotes the occlusion-aware reconstruction loss function, α is an adjustment coefficient, O1 denotes the non-occluded area, M1 denotes the loss mask, and N is the normalization coefficient (the number of pixels in the moving area); L̃1^flow denotes the left image reconstructed from L2 according to the optical flow F12^flow between L1 and L2; e denotes the base of the natural logarithm, (i, j) denotes the pixel position, ∂_a denotes the derivative along direction a ∈ {x, y} of the image (its square denoting the second-order derivative), and β is a constant weight.
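A compact PyTorch-style sketch of these two loss terms is given below; it assumes the flow, the images, the occlusion mask O1, and the moving-area mask M1 are already available as tensors, and the SSIM helper, tensor layouts, and the default α and β values are illustrative assumptions rather than the patent's exact implementation:

```python
import torch
import torch.nn.functional as F


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM on (B, C, H, W) tensors using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sig_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sig_x + sig_y + c2)
    return torch.clamp(num / den, -1, 1)


def psi(img, recon, alpha=0.85):
    """Occlusion-aware reconstruction penalty: weighted SSIM plus absolute photometric difference."""
    ssim_term = (1 - ssim(img, recon)) / 2
    l1_term = (img - recon).abs()
    return alpha * ssim_term + (1 - alpha) * l1_term   # per-pixel, per-channel


def flow_losses(L1, L1_rec_flow, flow12, O1, M1, beta=10.0):
    """Reconstruction and edge-aware second-order smoothness terms for the optical flow.

    L1, L1_rec_flow: (B, 3, H, W); flow12: (B, 2, H, W); O1, M1: (B, 1, H, W) in {0, 1}.
    """
    N = M1.sum().clamp(min=1.0)

    # Occlusion-aware reconstruction loss over moving, non-occluded pixels.
    l_rec = (M1 * O1 * psi(L1, L1_rec_flow).mean(dim=1, keepdim=True)).sum() / N

    # Edge-weighted absolute second derivative of the flow, restricted to the moving area.
    def second_diff(t, dim):
        return t.diff(dim=dim).diff(dim=dim).abs()

    img_gx = L1.diff(dim=3).abs().mean(dim=1, keepdim=True)
    img_gy = L1.diff(dim=2).abs().mean(dim=1, keepdim=True)
    w_x = torch.exp(-beta * img_gx)[:, :, :, 1:]        # align widths with the second differences
    w_y = torch.exp(-beta * img_gy)[:, :, 1:, :]
    sm_x = (second_diff(flow12, 3) * w_x * M1[:, :, :, 2:]).sum()
    sm_y = (second_diff(flow12, 2) * w_y * M1[:, :, 2:, :]).sum()
    l_sm = (sm_x + sm_y) / N

    return l_rec, l_sm
```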
Then, the pose estimation network and the depth estimation network are trained simultaneously.

The pose estimation network is trained using the two consecutive left images L1 and L2 and the designed rigid flow loss function ℓ_rig; its output is the relative camera pose T12 between the two consecutive left images L1 and L2. The depth estimation network is trained using the two consecutive same-size stereo image pairs L1, R1, L2 and R2 and the stereo loss ℓ_st; its output is the disparity d between each stereo image pair. Using the stereo camera baseline B and the horizontal focal length f_x, the absolute-scale depth is computed from the disparity d as D = B·f_x / d, and the computed absolute-scale depths are denoted D1,2.

The stereo loss ℓ_st is the same as in monodepth.
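As a small illustration of the absolute-scale recovery, the conversion D = B·f_x / d can be written directly; the clamping value and the example numbers below are assumptions used only to avoid division by zero and to show the order of magnitude:

```python
import torch


def disparity_to_depth(disp, baseline, fx, min_disp=1e-6):
    """Convert predicted disparity (in pixels) to absolute-scale depth using D = B * fx / d."""
    return baseline * fx / disp.clamp(min=min_disp)


# Example: with a KITTI-like baseline of 0.54 m and fx of about 720 px,
# a disparity of 10 px corresponds to a depth of roughly 38.9 m.
```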
The rigid flow loss ℓ_rig is a reconstruction loss applied in the static area to L̃1^rig and L̃1^rrig:

$$\ell_{rig} = \sum_{(i,j)} \big(1 - M_1(i,j)\big)\, O_1(i,j)\left[\psi\!\left(L_1(i,j), \tilde{L}_1^{rig}(i,j)\right) + \psi\!\left(L_1(i,j), \tilde{L}_1^{rrig}(i,j)\right)\right]$$

where O1 denotes the non-occluded area and M1 denotes the loss mask (so that 1 − M1 selects the static background); L̃1^rig and L̃1^rrig are the two left images reconstructed from L2 according to the rigid flows F12^rig and F12^rrig, respectively. The rigid flow F12^rig is computed from the absolute-scale depth D1,2 and the pose T12 (assuming the entire scene is static), and the rigid flow F12^rrig is computed from the absolute-scale depth D1,2 and the optimized pose T'12 (the calculation of T'12 is given below).

F12^rrig is included in the loss because the rigid registration module is not differentiable; the reconstruction based on F12^rrig is therefore needed to supervise the training of the pose estimation network.
The second stage: simultaneously training the optical flow estimation network, the pose estimation network, the depth estimation network, and motion segmentation using consecutive same-size stereo image pairs in the training data.
At this stage, the two consecutive same-size stereo image pairs L1, R1, L2 and R2 are used, together with the optical flow loss ℓ_flow, the stereo loss ℓ_st, the rigid flow loss ℓ_rig, and the flow consistency loss ℓ_con, to train the optical flow estimation network, the pose estimation network, the depth estimation network, the rigid registration module, and the flow consistency check module simultaneously.
The optical flow, pose, and depth estimation networks are trained at this stage in the same way as in the first stage and produce the same outputs, so the description is not repeated. The difference is that motion segmentation is now trained simultaneously by combining the outputs of the three networks; since this part works identically in the training and test stages, it is described later to avoid redundancy. This two-stage training strategy avoids the vanishing-gradient problem during network training.
Alternatively, the optical flow estimation network may adopt the PWC-Net framework, which merges several classical optical flow estimation techniques, including image pyramids, warping, and cost volumes, into an end-to-end trainable deep neural network to achieve state-of-the-art results. The pose estimation network may adopt a framework based on a recurrent convolutional neural network (RCNN): features extracted by the CNN are fed into two convolutional LSTM (ConvLSTM) layers that output a 6-DoF pose consisting of a translation p = (t_x, t_y, t_z) and a rotation. The depth estimation network may adopt an encoder-decoder architecture based on ResNet50; the network can estimate a dense depth map of the same size as the input raw RGB image.
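For illustration, a minimal pose-network sketch with a CNN encoder followed by two ConvLSTM layers is given below; the layer sizes, the ConvLSTMCell implementation, and the global-average pose head are assumptions of this sketch, not the patent's exact architecture:

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates computed by a single convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c


class PoseNet(nn.Module):
    """CNN encoder followed by two ConvLSTM layers regressing a 6-DoF pose."""
    def __init__(self, feat_ch=256, hid_ch=256):
        super().__init__()
        self.encoder = nn.Sequential(            # takes a concatenated image pair (6 channels)
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.lstm1 = ConvLSTMCell(feat_ch, hid_ch)
        self.lstm2 = ConvLSTMCell(hid_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 6, 1)      # 3 translation + 3 rotation parameters

    def forward(self, img1, img2, state1=None, state2=None):
        feat = self.encoder(torch.cat([img1, img2], dim=1))
        b, _, h, w = feat.shape
        zeros = lambda ch: (feat.new_zeros(b, ch, h, w), feat.new_zeros(b, ch, h, w))
        state1 = state1 or zeros(self.lstm1.hid_ch)
        state2 = state2 or zeros(self.lstm2.hid_ch)
        h1, c1 = self.lstm1(feat, state1)
        h2, c2 = self.lstm2(h1, state2)
        pose = self.head(h2).mean(dim=[2, 3])    # global average -> [tx, ty, tz, rx, ry, rz]
        return pose, (h1, c1), (h2, c2)
```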
3. After training, rigid registration is carried out on two newly input consecutive same-size stereo image pairs using the output results of the three networks to optimize the output of the pose estimation network; the rigid flow caused by camera motion is then calculated from the output of the depth estimation network and the output of the optimized pose estimation network, and a flow consistency check is performed against the output of the optical flow estimation network to obtain the motion segmentation.
1) Rigid registration module.

The rigid registration module uses the optical flow F12^flow output by the optical flow estimation network and the absolute-scale depth D1,2 computed from the disparity d output by the depth estimation network to optimize the pose T12 output by the pose estimation network, obtaining an optimized pose T'12.
During rigid registration, points in 2D image space are converted into 3D point clouds according to:

$$Q_k(i,j) = D_k(i,j)\, K^{-1} P_k(i,j), \quad k = 1, 2$$

where P_k(i,j) is the homogeneous coordinate of the pixel at position (i, j) of image L_k, K is the camera intrinsic matrix, D_k(i,j) is the absolute-scale depth at position (i, j) of image L_k, and Q_k(i,j) is the corresponding 3D coordinate of the pixel at position (i, j) of image L_k.
The 3D point cloud Q1 is transformed with the pose T12 into the 3D point cloud Q̃1 (Q̃1 can be understood as the point cloud constructed from the 3D coordinates of the points of L1 at time t2). In addition, using bilinear sampling, the 3D point cloud Q2 is warped back to time t1 according to the optical flow F12^flow, giving the corresponding 3D point cloud Q̂1. The correspondence is established by this warping step so that Q̃1 corresponds to Q̂1:

$$\tilde{Q}_1(i,j) = T_{12}\, Q_1(i,j)$$

$$\hat{Q}_1(i,j) = Q_2\!\left(i + F_{12,x}^{flow}(i,j),\; j + F_{12,y}^{flow}(i,j)\right), \quad i \in [1, W],\; j \in [1, H]$$

where W and H denote the width and height of the image, and F_{12,x}^{flow} and F_{12,y}^{flow} denote the components of the optical flow F12^flow along the x and y axes.

If everything were perfectly accurate, Q̃1 would equal Q̂1 in the static, non-occluded areas of the scene. Therefore, the non-occluded area O1 is first estimated using the reverse optical flow F21^flow of F12^flow, and the pose estimate is then refined by tightly aligning the two non-occluded point clouds. Specifically, the pose correction ΔT is estimated by minimizing the distance between Q̃1 and Q̂1 over a selected region R:

$$\Delta T = \arg\min_{T} \sum_{(i,j) \in R} \left\| T\,\tilde{Q}_1(i,j) - \hat{Q}_1(i,j) \right\|$$

where the region R consists of the top R% (e.g., 25%) of the corresponding non-occluded points of Q̃1 and Q̂1 when sorted by smallest distance; this tends to exclude points in moving areas, since they usually have a larger distance between Q̃1 and Q̂1. Combining T12 and ΔT yields the optimized pose T'12:

$$T'_{12} = \Delta T \times T_{12}$$
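The rigid registration step can be sketched as follows; the helper names, the use of grid_sample for the flow-based warp, and the closed-form SVD-based (Kabsch) solution of the point-cloud alignment are illustrative assumptions, since the patent only specifies minimizing the point-cloud distance over the region R:

```python
import torch
import torch.nn.functional as F


def backproject(depth, K_inv, grid):
    """Lift pixels to 3D: Q(i, j) = D(i, j) * K^-1 * P(i, j). depth: (H, W); grid: (3, H, W) homogeneous pixels."""
    return depth.unsqueeze(0) * (K_inv @ grid.reshape(3, -1)).reshape(3, *depth.shape)


def warp_by_flow(Q2, flow12):
    """Bilinearly sample the point cloud Q2 (3, H, W) at locations displaced by flow12 (2, H, W)."""
    H, W = Q2.shape[1:]
    ys, xs = torch.meshgrid(torch.arange(H, device=Q2.device, dtype=Q2.dtype),
                            torch.arange(W, device=Q2.device, dtype=Q2.dtype), indexing="ij")
    x = (xs + flow12[0]) / (W - 1) * 2 - 1           # normalise to [-1, 1] for grid_sample
    y = (ys + flow12[1]) / (H - 1) * 2 - 1
    grid = torch.stack([x, y], dim=-1).unsqueeze(0)  # (1, H, W, 2)
    return F.grid_sample(Q2.unsqueeze(0), grid, align_corners=True).squeeze(0)


def refine_pose(Q1_tilde, Q1_hat, O1, top_ratio=0.25):
    """Estimate the correction dT aligning Q1_tilde to Q1_hat on the closest non-occluded points."""
    mask = O1.reshape(-1) > 0.5
    a = Q1_tilde.reshape(3, -1)[:, mask]             # (3, M)
    b = Q1_hat.reshape(3, -1)[:, mask]
    dist = (a - b).norm(dim=0)
    k = max(int(top_ratio * dist.numel()), 3)
    idx = dist.topk(k, largest=False).indices        # region R: the smallest distances
    a, b = a[:, idx], b[:, idx]

    # Closed-form rigid alignment (Kabsch): rotation from the SVD of the cross-covariance, then translation.
    ca, cb = a.mean(dim=1, keepdim=True), b.mean(dim=1, keepdim=True)
    U, S, Vt = torch.linalg.svd((a - ca) @ (b - cb).T)
    d = torch.sign(torch.det(Vt.T @ U.T)).item()
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d], dtype=a.dtype, device=a.device)) @ U.T
    t = cb - R @ ca
    dT = torch.eye(4, dtype=a.dtype, device=a.device)
    dT[:3, :3], dT[:3, 3] = R, t.squeeze(1)
    return dT                                         # optimized pose: T12_prime = dT @ T12
```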
2) Flow consistency and motion segmentation.
With the optimized pose T'12, the rigid flow caused by camera motion can be calculated as:

$$F_{12}^{rrig}(i,j) = K\, T'_{12}\, D_1(i,j)\, K^{-1} P_1(i,j) - P_1(i,j)$$

where K is the camera intrinsic matrix and P1 denotes the homogeneous coordinates of the pixels in L1.

If the rigid flow F12^rrig and the optical flow F12^flow are both accurate, their values should match in static areas and differ in moving areas. A consistency check is therefore performed between the rigid flow F12^rrig and the optical flow F12^flow: if the difference between the two flows is greater than a threshold δ, the corresponding region is marked as moving foreground M1, and the rest of the image is marked as static background M0, so that the loss mask is

$$M_1(i,j) = O_1(i,j)\cdot\Big[\big\| F_{12}^{flow}(i,j) - F_{12}^{rrig}(i,j) \big\| > \delta\Big]$$

$$M_0(i,j) = 1 - M_1(i,j)$$

Because O1 is computed from the estimated optical flow, which is less accurate in occluded areas and may lead to false positives, the estimated motion area is by default restricted to the non-occluded area.
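A sketch of the rigid-flow computation and the consistency-based segmentation mask might look as follows; the tensor shapes, the explicit perspective division, and the example threshold value are assumptions for illustration:

```python
import torch


def rigid_flow(depth1, T12, K, K_inv, grid):
    """Rigid flow induced by camera motion: project D1 * K^-1 * P1 through T12 and subtract P1.

    depth1: (H, W); T12: (4, 4); K, K_inv: (3, 3); grid: (3, H, W) homogeneous pixel coordinates.
    """
    H, W = depth1.shape
    pts = depth1.reshape(1, -1) * (K_inv @ grid.reshape(3, -1))        # 3D points in camera 1
    pts_h = torch.cat([pts, depth1.new_ones(1, H * W)], dim=0)         # homogeneous (4, H*W)
    proj = K @ (T12 @ pts_h)[:3]                                       # project into camera 2
    pix = proj[:2] / proj[2:3].clamp(min=1e-6)                         # perspective division
    return (pix - grid[:2].reshape(2, -1)).reshape(2, H, W)


def motion_mask(flow12, rrig12, O1, delta=3.0):
    """Mark pixels as moving foreground M1 where optical and rigid flow disagree by more than delta."""
    diff = (flow12 - rrig12).norm(dim=0, keepdim=True)                 # (1, H, W)
    M1 = ((diff > delta) & (O1 > 0.5)).float()                         # restricted to the non-occluded area
    M0 = 1.0 - M1                                                      # static background
    return M1, M0
```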
The rigid flow F12^rrig is more accurate than the optical flow F12^flow in static areas. Therefore, F12^rrig is used to guide the learning of F12^flow through the following flow consistency loss ℓ_con:

$$\ell_{con} = \frac{1}{N}\sum_{(i,j)} M_0(i,j)\, O_1(i,j)\, \big\| F_{12}^{flow}(i,j) - SG\big(F_{12}^{rrig}(i,j)\big) \big\|$$

where SG denotes the stop-gradient operation, F12^rrig is the rigid flow caused by camera motion, and N is a normalization coefficient.
Based on the above, the total loss for the model shown in Fig. 2 is:

$$\ell_{total} = \lambda_{flow}\,\ell_{flow} + \lambda_{st}\,\ell_{st} + \lambda_{rig}\,\ell_{rig} + \lambda_{con}\,\ell_{con}$$

where each λ is the weight coefficient of the corresponding loss term.
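Finally, the flow consistency loss and the total training objective can be combined as in the sketch below; the lambda values and the realisation of SG via detach() are assumptions of this sketch:

```python
import torch


def flow_consistency_loss(flow12, rrig12, M0, O1):
    """Guide the optical flow with the (stop-gradient) rigid flow over static, non-occluded pixels."""
    mask = M0 * O1                                                # (1, H, W)
    N = mask.sum().clamp(min=1.0)
    diff = (flow12 - rrig12.detach()).norm(dim=0, keepdim=True)   # detach() plays the role of SG
    return (mask * diff).sum() / N


def total_loss(l_flow, l_st, l_rig, l_con,
               lam_flow=1.0, lam_st=1.0, lam_rig=1.0, lam_con=0.2):
    """Weighted sum of the four loss terms; the lambda values here are placeholders."""
    return lam_flow * l_flow + lam_st * l_st + lam_rig * l_rig + lam_con * l_con
```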
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An unsupervised end-to-end driving environment perception method based on deep learning is characterized by comprising the following steps:
acquiring images by using a binocular camera, and preprocessing to obtain training data;
utilizing two continuous stereo images with the same size in training data to train an optical flow estimation network, a pose estimation network, a depth estimation network and motion segmentation;
after training is finished, carrying out rigid registration on two newly input continuous stereo image pairs with the same size by using output results of the three networks to optimize the output of a pose estimation network; and calculating rigid flow caused by the motion of the camera by using the output of the depth estimation network and the output of the optimized pose estimation network, and performing flow consistency check with the output of the optical flow estimation network so as to perform motion segmentation.
2. The unsupervised end-to-end driving environment perception method based on deep learning of claim 1, wherein the image acquisition using a binocular camera and the obtaining of the training data through preprocessing comprises:
firstly, zooming an original image acquired by a binocular camera, and simultaneously zooming internal parameters of the corresponding camera;
then, generating training data by a data enhancement method;
the data enhancement method performs data enhancement in one or more of the following ways:
randomly adjusting the brightness of the input images with a brightness factor γ;
scaling the image along the X and Y axes with scale factors s_x and s_y, and then randomly cropping it to a specified size;
randomly rotating the image by r degrees, and interpolating by using a nearest neighbor method;
random left-right turning and random time sequence switching.
3. The method as claimed in claim 1, wherein the training of the optical flow estimation network, the pose estimation network, the depth estimation network, and the motion segmentation by using two consecutive same-size stereo images in the training data comprises:
firstly, training an optical flow estimation network by using continuous stereo images with the same size in training data, and simultaneously training a pose estimation network and a depth estimation network;
then, the optical flow estimation network, the pose estimation network, the depth estimation network and the motion segmentation are trained simultaneously by using the continuous stereo image pairs with the same size in the training data.
4. The unsupervised end-to-end driving environment perception method based on deep learning of claim 3,
two consecutive stereo image pairs of the same size are denoted L1, R1, L2 and R2, where L1 and R1 denote the left and right images at time t1, and L2 and R2 denote the left and right images at time t2;
an optical flow estimation network is trained using the two consecutive left images L1 and L2 and a designed optical flow loss function ℓ_flow; its output is the optical flow F12^flow between the two consecutive same-size left images L1 and L2;
a pose estimation network and a depth estimation network are trained simultaneously:
the pose estimation network is trained using the two consecutive left images L1 and L2 and a designed rigid flow loss function ℓ_rig; its output is the relative camera pose T12 between the two consecutive left images L1 and L2; the depth estimation network is trained using the two consecutive same-size stereo image pairs L1, R1, L2 and R2 and a stereo loss ℓ_st; its output is the disparity d between the stereo image pairs; using the stereo camera baseline B and the horizontal focal length f_x, the absolute-scale depth is computed from the disparity d as D = B·f_x / d, and the computed absolute-scale depths are denoted D1,2.
5. The unsupervised end-to-end driving environment perception method based on deep learning of claim 4,
the optical flow loss function ℓ_flow comprises an occlusion-aware reconstruction loss term ℓ_rec^flow and a smoothing loss term ℓ_sm^flow:

$$\ell_{rec}^{flow} = \frac{1}{N}\sum_{(i,j)} M_1(i,j)\, O_1(i,j)\, \psi\!\left(L_1(i,j), \tilde{L}_1^{flow}(i,j)\right)$$

$$\ell_{sm}^{flow} = \frac{1}{N}\sum_{(i,j)} \sum_{a \in \{x, y\}} M_1(i,j)\, \big|\partial_a^2 F_{12}^{flow}(i,j)\big|\, e^{-\beta\,\left|\partial_a L_1(i,j)\right|}$$

where ψ(·) denotes the occlusion-aware reconstruction loss function, α denotes an adjustment coefficient, O1 denotes the non-occluded area, M1 denotes a loss mask, and N is a normalization coefficient; L̃1^flow denotes the left image reconstructed from L2 according to the optical flow F12^flow between L1 and L2; e denotes the base of the natural logarithm, (i, j) denotes the pixel position, ∂_a denotes the derivative along direction a ∈ {x, y} of the image (its square denoting the second-order derivative), and β is a weight.
6. The unsupervised end-to-end driving environment perception method based on deep learning of claim 4,
the rigid flow loss ℓ_rig is a reconstruction loss applied in the static area to L̃1^rig and L̃1^rrig:

$$\ell_{rig} = \sum_{(i,j)} \big(1 - M_1(i,j)\big)\, O_1(i,j)\left[\psi\!\left(L_1(i,j), \tilde{L}_1^{rig}(i,j)\right) + \psi\!\left(L_1(i,j), \tilde{L}_1^{rrig}(i,j)\right)\right]$$

where ψ(·) denotes the occlusion-aware reconstruction loss function, O1 denotes the non-occluded area, and M1 denotes the loss mask; L̃1^rig and L̃1^rrig denote the two left images reconstructed from L2 according to the rigid flows F12^rig and F12^rrig, respectively; F12^rig is computed from the absolute-scale depth D1,2 and the pose T12, and F12^rrig is computed from the absolute-scale depth D1,2 and the optimized pose.
7. The method as claimed in claim 3, wherein the simultaneous training of the optical flow estimation network, the pose estimation network, the depth estimation network and the motion segmentation by using consecutive same-size stereo image pairs in the training data comprises:
two consecutive stereo image pairs of the same size are denoted L1, R1, L2 and R2, where L1 and R1 denote the left and right images at time t1, and L2 and R2 denote the left and right images at time t2;
the two consecutive same-size stereo image pairs L1, R1, L2 and R2 are used, together with the optical flow loss ℓ_flow, the stereo loss ℓ_st, the rigid flow loss ℓ_rig, and the flow consistency loss ℓ_con, to train the optical flow estimation network, the pose estimation network, the depth estimation network, a rigid registration module, and a flow consistency check module simultaneously.
8. The method as claimed in claim 7, wherein a rigid registration module uses the optical flow F12^flow output by the optical flow estimation network and the absolute-scale depth D1,2 computed from the disparity d output by the depth estimation network to optimize the pose T12 output by the pose estimation network, obtaining an optimized pose T'12;

during rigid registration, points in 2D image space are converted into 3D point clouds according to:

$$Q_k(i,j) = D_k(i,j)\, K^{-1} P_k(i,j), \quad k = 1, 2$$

where P_k(i,j) is the homogeneous coordinate of the pixel at position (i, j) of image L_k, K is the camera intrinsic matrix, D_k(i,j) is the absolute-scale depth at position (i, j) of image L_k, and Q_k(i,j) is the corresponding 3D coordinate of the pixel at position (i, j) of image L_k;

the 3D point cloud Q1 is transformed with the pose T12 into the 3D point cloud Q̃1, and, using bilinear sampling, the 3D point cloud Q2 is warped back to time t1 according to the optical flow F12^flow, giving the corresponding 3D point cloud Q̂1; the correspondence is established by this warping step so that Q̃1 corresponds to Q̂1:

$$\tilde{Q}_1(i,j) = T_{12}\, Q_1(i,j)$$

$$\hat{Q}_1(i,j) = Q_2\!\left(i + F_{12,x}^{flow}(i,j),\; j + F_{12,y}^{flow}(i,j)\right), \quad i \in [1, W],\; j \in [1, H]$$

where W and H denote the width and height of the image, and F_{12,x}^{flow} and F_{12,y}^{flow} denote the components of the optical flow F12^flow along the x and y axes;

the pose correction ΔT is estimated by minimizing the distance between Q̃1 and Q̂1 over a selected region R:

$$\Delta T = \arg\min_{T} \sum_{(i,j) \in R} \left\| T\,\tilde{Q}_1(i,j) - \hat{Q}_1(i,j) \right\|$$

where the region R consists of the top R% of the corresponding non-occluded points of Q̃1 and Q̂1 when sorted by smallest distance;

the optimized pose T'12 is then obtained by:

$$T'_{12} = \Delta T \times T_{12}$$
9. The unsupervised end-to-end driving environment perception method based on deep learning of claim 7 or 8, wherein the rigid flow caused by camera motion is calculated as:

$$F_{12}^{rrig}(i,j) = K\, T'_{12}\, D_1(i,j)\, K^{-1} P_1(i,j) - P_1(i,j)$$

where K is the camera intrinsic matrix and P1 denotes the homogeneous coordinates of the pixels in L1;

the loss mask is estimated by:

$$M_1(i,j) = O_1(i,j)\cdot\Big[\big\| F_{12}^{flow}(i,j) - F_{12}^{rrig}(i,j) \big\| > \delta\Big]$$

$$M_0(i,j) = 1 - M_1(i,j)$$

where O1 denotes the non-occluded area and δ is the threshold.
10. The unsupervised end-to-end driving environment perception method based on deep learning of claim 7 or 8, wherein the flow consistency loss ℓ_con is expressed as:

$$\ell_{con} = \frac{1}{N}\sum_{(i,j)} M_0(i,j)\, O_1(i,j)\, \big\| F_{12}^{flow}(i,j) - SG\big(F_{12}^{rrig}(i,j)\big) \big\|$$

where SG denotes the stop gradient, (i, j) denotes the pixel position, F12^flow denotes the optical flow between L1 and L2, F12^rrig is the rigid flow caused by camera motion, computed from the absolute-scale depth D1,2 and the optimized pose T'12, and N is a normalization coefficient.
CN201911345900.9A 2019-12-24 2019-12-24 Unsupervised end-to-end driving environment perception method based on deep learning Active CN111105432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911345900.9A CN111105432B (en) 2019-12-24 2019-12-24 Unsupervised end-to-end driving environment perception method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911345900.9A CN111105432B (en) 2019-12-24 2019-12-24 Unsupervised end-to-end driving environment perception method based on deep learning

Publications (2)

Publication Number Publication Date
CN111105432A true CN111105432A (en) 2020-05-05
CN111105432B CN111105432B (en) 2023-04-07

Family

ID=70423494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911345900.9A Active CN111105432B (en) 2019-12-24 2019-12-24 Unsupervised end-to-end driving environment perception method based on deep learning

Country Status (1)

Country Link
CN (1) CN111105432B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190265712A1 (en) * 2018-02-27 2019-08-29 Nauto, Inc. Method for determining driving policy
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN110097109A (en) * 2019-04-25 2019-08-06 湖北工业大学 A kind of road environment obstacle detection system and method based on deep learning
CN110189278A (en) * 2019-06-06 2019-08-30 上海大学 A kind of binocular scene image repair method based on generation confrontation network
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
CN110490919A (en) * 2019-07-05 2019-11-22 天津大学 A kind of depth estimation method of the monocular vision based on deep neural network
CN110443843A (en) * 2019-07-29 2019-11-12 东北大学 A kind of unsupervised monocular depth estimation method based on generation confrontation network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周云成;许童羽;邓寒冰;苗腾;吴琼;: "Depth estimation method for tomato plant images based on self-supervised learning" (in Chinese) *
毕天腾;刘越;翁冬冬;王涌天;: "A survey of supervised-learning-based single-image depth estimation" (in Chinese) *
黄军;王聪;刘越;毕天腾;: "A survey of advances in monocular depth estimation" (in Chinese) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627056A (en) * 2020-05-14 2020-09-04 清华大学 Depth estimation-based driving visibility determination method and device
CN111627056B (en) * 2020-05-14 2023-09-01 清华大学 Driving visibility determination method and device based on depth estimation
CN111629194A (en) * 2020-06-10 2020-09-04 北京中科深智科技有限公司 Method and system for converting panoramic video into 6DOF video based on neural network
CN111629194B (en) * 2020-06-10 2021-01-26 北京中科深智科技有限公司 Method and system for converting panoramic video into 6DOF video based on neural network
CN113140011B (en) * 2021-05-18 2022-09-06 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related components
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN113838104B (en) * 2021-08-04 2023-10-27 浙江大学 Registration method based on multispectral and multimodal image consistency enhancement network
CN113838104A (en) * 2021-08-04 2021-12-24 浙江大学 Registration method based on multispectral and multi-mode image consistency enhancement network
CN113902807A (en) * 2021-08-19 2022-01-07 江苏大学 Electronic component three-dimensional reconstruction method based on semi-supervised learning
CN114187581A (en) * 2021-12-14 2022-03-15 安徽大学 Driver distraction fine-grained detection method based on unsupervised learning
CN114187581B (en) * 2021-12-14 2024-04-09 安徽大学 Driver distraction fine granularity detection method based on unsupervised learning
CN114359363A (en) * 2022-01-11 2022-04-15 浙江大学 Video consistency depth estimation method and device based on deep learning
CN114494332A (en) * 2022-01-21 2022-05-13 四川大学 Unsupervised estimation method for scene flow from synthesis to real LiDAR point cloud
CN114494332B (en) * 2022-01-21 2023-04-25 四川大学 Unsupervised synthesis to real LiDAR point cloud scene flow estimation method
GB2618775A (en) * 2022-05-11 2023-11-22 Continental Autonomous Mobility Germany GmbH Self-supervised learning of scene flow
WO2024051184A1 (en) * 2022-09-07 2024-03-14 南京逸智网络空间技术创新研究院有限公司 Optical flow mask-based unsupervised monocular depth estimation method

Also Published As

Publication number Publication date
CN111105432B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
Shu et al. Feature-metric loss for self-supervised learning of depth and egomotion
Mitrokhin et al. EV-IMO: Motion segmentation dataset and learning pipeline for event cameras
US11315266B2 (en) Self-supervised depth estimation method and system
Zhu et al. Unsupervised event-based learning of optical flow, depth, and egomotion
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
Zhan et al. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction
CN109377530B (en) Binocular depth estimation method based on depth neural network
US10818029B2 (en) Multi-directional structured image array capture on a 2D graph
CN110782490B (en) Video depth map estimation method and device with space-time consistency
TWI709107B (en) Image feature extraction method and saliency prediction method including the same
WO2018000752A1 (en) Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
CN108491763B (en) Unsupervised training method and device for three-dimensional scene recognition network and storage medium
WO2024051184A1 (en) Optical flow mask-based unsupervised monocular depth estimation method
CN115035171B (en) Self-supervision monocular depth estimation method based on self-attention guide feature fusion
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
Song et al. Self-supervised depth completion from direct visual-lidar odometry in autonomous driving
CN113065506B (en) Human body posture recognition method and system
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
Lee et al. Globally consistent video depth and pose estimation with efficient test-time training
CN112634331A (en) Optical flow prediction method and device
Su et al. Omnidirectional depth estimation with hierarchical deep network for multi-fisheye navigation systems

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant