CN111105432A - Unsupervised end-to-end driving environment perception method based on deep learning - Google Patents
Unsupervised end-to-end driving environment perception method based on deep learning
- Publication number
- CN111105432A (application CN201911345900.9A)
- Authority
- CN
- China
- Prior art keywords
- estimation network
- pose
- depth
- flow
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/215—Motion-based segmentation (Image analysis; Analysis of motion)
- G06T7/269—Analysis of motion using gradient-based methods (Image analysis)
- G06N3/045—Combinations of networks (Neural networks; Architecture, e.g. interconnection topology)
- G06N3/08—Learning methods (Neural networks)
- G06T2207/10016—Video; Image sequence (Image acquisition modality)
- G06T2207/10024—Color image (Image acquisition modality)
- G06T2207/20081—Training; Learning (Special algorithmic details)
- G06T2207/20084—Artificial neural networks [ANN] (Special algorithmic details)
- Y02T10/40—Engine management systems (Climate change mitigation technologies related to transportation)
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an unsupervised end-to-end driving environment perception method based on deep learning, which comprises the following steps: acquiring images with a binocular camera and preprocessing them to obtain training data; training an optical flow estimation network, a pose estimation network, a depth estimation network and motion segmentation using two consecutive stereo image pairs of the same size from the training data; performing rigid registration using the outputs of the three networks to optimize the output of the pose estimation network; and calculating the rigid flow caused by camera motion from the output of the depth estimation network and the optimized pose, then performing a flow consistency check against the output of the optical flow estimation network to obtain the motion segmentation. The method adopts an unsupervised end-to-end framework that requires no ground-truth depth, pose or optical flow as training labels, and can estimate camera pose with absolute scale and dense depth maps, thereby segmenting dynamic objects with high accuracy.
Description
Technical Field
The invention relates to the technical field of intelligent driving, in particular to an unsupervised end-to-end driving environment perception method based on deep learning.
Background
Learning three-dimensional scene geometry, scene flow, and robot motion relative to a rigid scene from video images is an important research topic in computer vision, with widespread applications in many fields, including autonomous driving, robot navigation, and video analysis. However, current environment perception methods based on deep learning are supervised learning frameworks, and obtaining the ground-truth labels needed for training is very difficult. In recent years, much progress has been made in unsupervised learning of depth, optical flow, and pose using convolutional neural networks. These methods have their own advantages and limitations. Unsupervised deep learning approaches exploit the geometry of the scene and decompose the problem into multiple orthogonal sub-problems, adding more constraints to the solution through additional temporal image frames or stereo image information. On the one hand, current deep-learning-based optical flow, depth and pose estimation methods assume that the entire scene is static, and therefore have difficulty handling moving objects. On the other hand, optical flow methods can in principle handle moving objects, but struggle in regions with complicated structure and in occluded regions.
The Chinese patent "Method for estimating and optimizing the depth of a monocular view in a video sequence by using deep learning" (publication number CN108765479A) estimates and optimizes the depth of a monocular view in a video sequence using deep learning, but monocular-vision-based methods suffer from scale ambiguity, so the scale of the estimated depth is unknown and of no practical value.
Chinese patent "a binocular depth estimation method based on a depth convolution network" (publication number: CN109598754A) trains a deep convolution neural network to perform depth estimation by using binocular images, but a true value depth is required to participate in training as a label in the training process, but it is very difficult and expensive to obtain the true value depth in an actual environment.
Chinese patent "a monocular vision positioning method based on unsupervised learning" (publication number: CN109472830A) utilizes the method of unsupervised learning to carry out monocular vision positioning, but monocular vision positioning has scale uncertainty and scale drift, positioning accuracy is poor, and positioning scale uncertainty has no engineering value in actual environment.
Therefore, the current driving environment perception method based on deep learning still has the following problems:
1) deep learning models for depth and pose estimation trained on monocular image sequences are limited by monocular scale ambiguity and scale drift; the scale of the estimated depth and pose is unknown, so such models have no practical application value;
2) current deep-learning-based depth, pose and optical flow estimation methods require supervised training with ground truth, but acquiring ground-truth data in a real environment is very difficult and costly;
3) dynamic objects are very common in real driving environments; current deep-learning-based environment perception methods do not consider the influence of dynamic objects, and their accuracy needs further improvement.
Disclosure of Invention
The invention aims to provide an unsupervised end-to-end driving environment perception method based on deep learning. It adopts an unsupervised end-to-end framework that requires no ground-truth depth, pose or optical flow as training labels, and can estimate camera pose with absolute scale and dense depth maps, thereby segmenting dynamic objects with high accuracy.
The purpose of the invention is realized by the following technical scheme:
an unsupervised end-to-end driving environment perception method based on deep learning comprises the following steps:
acquiring images by using a binocular camera, and preprocessing to obtain training data;
training an optical flow estimation network, a pose estimation network, a depth estimation network and motion segmentation using two consecutive stereo image pairs of the same size from the training data;
after training, performing rigid registration on two newly input consecutive stereo image pairs of the same size using the outputs of the three networks to optimize the output of the pose estimation network; and calculating the rigid flow caused by camera motion from the output of the depth estimation network and the optimized pose, then performing a flow consistency check against the output of the optical flow estimation network to obtain the motion segmentation.
According to the technical scheme provided by the invention, the training data require only binocular RGB images, so data acquisition is very simple. By adopting a unified framework, optical flow, depth, pose and motion segmentation can be learned simultaneously; the training process of the model is simple and direct, very few parameters need to be tuned, and the scene transfer capability is strong. The model has good adaptability and can learn optical flow and geometric information of the environment, such as absolute-scale depth and pose, in an unsupervised end-to-end manner; because the estimated optical flow, pose and depth are highly accurate, dynamic objects can be segmented with high accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of an unsupervised end-to-end driving environment sensing method based on deep learning according to an embodiment of the present invention;
fig. 2 is a framework diagram of an unsupervised end-to-end driving environment sensing method based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an unsupervised end-to-end driving environment perception method based on deep learning, and as shown in fig. 1-2, a flow chart and a frame chart of the method are respectively provided. The method mainly comprises the following steps:
1. and acquiring images by using a binocular camera, and preprocessing to obtain training data.
In the embodiment of the invention, the binocular camera is applied to driving environment perception, so that the binocular camera is installed on a vehicle and is used for acquiring an environment image.
Before being fed to the networks for training, the original images acquired by the binocular camera are scaled in order to reduce training time, computational cost and hardware consumption, and the corresponding camera intrinsic parameters are scaled accordingly.
In addition, a data enhancement method is applied to improve the generalization performance of the model and reduce overfitting. Training data are generated in this way, and two consecutive stereo image pairs of the same size are extracted for each training iteration and fed to the networks. The two consecutive stereo image pairs of the same size are denoted L1, R1, L2 and R2, where L1, R1 are the left and right images at time t1 and L2, R2 are the left and right images at time t2; the image width and height are denoted W and H.
In the embodiment of the invention, data enhancement is performed in one or more of the following ways:
randomly correcting the input monocular image with a brightness factor γ;
scaling the image along the X and Y axes with scale factors sx and sy, and then randomly cropping the image to a specified size;
randomly rotating the image by r degrees, with nearest-neighbour interpolation;
random left-right flipping and random temporal order switching (exchanging t1 and t2).
Illustratively, the following settings may be employed: γ ∈ [0.7, 1.3], sx ∈ [1.0, 1.2], sy ∈ [1.0, 1.2], r ∈ [-5, 5]; the specified size may be set to 832 × 256.
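For illustration, the sketch below is a minimal NumPy/OpenCV reading of this augmentation pipeline (the helper name augment_pair and the use of OpenCV are assumptions, and the corresponding adjustment of the camera intrinsics after scaling and cropping is omitted); the sampling ranges follow the illustrative settings above.

```python
import numpy as np
import cv2  # assumed available for resizing and rotation

def augment_pair(L1, R1, L2, R2, crop_hw=(256, 832)):
    """Hypothetical sketch of the augmentation described above.
    Inputs are H x W x 3 images from the binocular camera (larger than crop_hw)."""
    # 1) random brightness correction with factor gamma in [0.7, 1.3]
    gamma = np.random.uniform(0.7, 1.3)
    imgs = [np.clip(im.astype(np.float32) * gamma, 0, 255) for im in (L1, R1, L2, R2)]

    # 2) random scaling along X and Y (sx, sy in [1.0, 1.2]) followed by a random crop
    sx, sy = np.random.uniform(1.0, 1.2, size=2)
    imgs = [cv2.resize(im, None, fx=sx, fy=sy, interpolation=cv2.INTER_LINEAR) for im in imgs]
    ch, cw = crop_hw
    h, w = imgs[0].shape[:2]
    y0, x0 = np.random.randint(0, h - ch + 1), np.random.randint(0, w - cw + 1)
    imgs = [im[y0:y0 + ch, x0:x0 + cw] for im in imgs]

    # 3) random rotation by r degrees in [-5, 5], nearest-neighbour interpolation
    r = np.random.uniform(-5, 5)
    M = cv2.getRotationMatrix2D((cw / 2, ch / 2), r, 1.0)
    imgs = [cv2.warpAffine(im, M, (cw, ch), flags=cv2.INTER_NEAREST) for im in imgs]

    # 4) random left-right flip and random temporal order switch (exchange t1 and t2)
    if np.random.rand() < 0.5:
        imgs = [im[:, ::-1].copy() for im in imgs]
    if np.random.rand() < 0.5:
        imgs = [imgs[2], imgs[3], imgs[0], imgs[1]]
    return imgs  # augmented L1, R1, L2, R2
```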
2. Train an optical flow estimation network, a pose estimation network, a depth estimation network and motion segmentation using two consecutive stereo image pairs of the same size from the training data.
In this step, the training of the optical flow estimation network, the pose estimation network, the depth estimation network and the motion segmentation using two consecutive same-size stereo image pairs from the training data is divided into the following two stages.
The first stage: train an optical flow estimation network using consecutive same-size stereo image pairs from the training data, and simultaneously train a pose estimation network and a depth estimation network.
In this stage, an optical flow estimation network is first trained using the two consecutive left images L1 and L2 and a designed optical flow loss function. Its output is the optical flow F_flow between the two consecutive same-size left images L1 and L2, with the same dimensions as the input images.
The optical flow loss function comprises an occlusion-aware reconstruction loss term and a smoothing loss term. The occlusion-aware reconstruction loss is a weighted average of the structural similarity (SSIM) loss and the absolute photometric difference over the non-occluded region; the smoothing loss is the mean absolute value of the edge-weighted second derivative of the optical flow over moving regions. A constraint on the optical flow over static regions is provided later by the consistency loss.
Here ψ(·) denotes the occlusion-aware reconstruction loss function, α is an adjustment coefficient, O1 denotes the non-occluded region, M1 denotes the loss mask, and N is the normalization coefficient (i.e., the number of pixels in the moving region). The left image reconstructed by warping L2 with the optical flow between L1 and L2 is denoted L̃1. e denotes the natural logarithm, (i, j) denotes the pixel position, the derivative operator acts along the x or y direction of the image (its square denotes the second-order derivative), a refers to the x or y direction of the image and indicates the direction of derivation, and β is a constant weight.
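As a rough illustration of a loss of this shape, the PyTorch-style sketch below combines an SSIM/absolute-difference photometric term masked by the non-occluded region with an edge-weighted second-order smoothness term; it is a simplified reading of the description rather than the exact formulation, and the helper names, window size and default weights alpha and beta are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 windows, returned as a dissimilarity map in [0, 1]."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim) / 2, 0, 1)

def reconstruction_loss(L1, L1_rec, occ_mask, alpha=0.85):
    """Occlusion-aware photometric loss: weighted SSIM + absolute photometric
    difference, averaged over the non-occluded region (occ_mask = 1 where visible)."""
    photo = alpha * ssim_dissimilarity(L1, L1_rec) + (1 - alpha) * (L1 - L1_rec).abs()
    photo = photo.mean(1, keepdim=True)                       # average over colour channels
    return (photo * occ_mask).sum() / (occ_mask.sum() + 1e-7)

def smooth_loss(flow, image, region_mask, beta=10.0):
    """Edge-weighted second-order smoothness of the flow over a region
    (e.g. the moving region); weights decay with the image gradient."""
    def dx(t): return t[:, :, :, :-1] - t[:, :, :, 1:]
    def dy(t): return t[:, :, :-1, :] - t[:, :, 1:, :]
    w_x = torch.exp(-beta * dx(image).abs().mean(1, keepdim=True))
    w_y = torch.exp(-beta * dy(image).abs().mean(1, keepdim=True))
    d2x = dx(dx(flow)).abs() * w_x[:, :, :, 1:]
    d2y = dy(dy(flow)).abs() * w_y[:, :, 1:, :]
    n = region_mask.sum() + 1e-7
    return (d2x * region_mask[:, :, :, 2:]).sum() / n + (d2y * region_mask[:, :, 2:, :]).sum() / n
```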
Then, simultaneously training a pose estimation network and a depth estimation network:
using two successive left images L1And L2And designed rigid flow loss functionTraining a pose estimation network, outputting the pose estimation network as two continuous left images L1And L2Relative camera pose T therebetween12(ii) a Using two successive pairs L of stereo images of the same size1、R1、L2And R2And loss of stereoTraining a depth estimation network, the output of which is the disparity d between stereo image pairs, using a stereo camera baseline B and a horizontal focal length fxCalculating the absolute scale depth D ═ Bf through the parallax DxD, recording the calculated absolute scale depth as D1,2。
wherein, O1Representing non-occluded areas, M1Representing a loss mask;according to rigid flowAnd in combination with L2Two reconstructed left images, notedRigid flowBy absolute scale depth D1,2And pose T12Calculated (assuming the entire scene is static), rigid flowBy absolute scale depth D1,2And the optimized pose T'12Is calculated to obtain (T'12See below for the calculation of (c).
Will be provided withIs involved in the loss, since the rigid registration module is not differentiable, it is necessary to do soTo supervise the training pose estimation network.
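As a quick numeric illustration of the disparity-to-depth conversion D = B·fx/d (the baseline and focal length below are made-up example values, not taken from the patent):

```python
import numpy as np

def disparity_to_depth(disp, baseline, fx, eps=1e-6):
    """Absolute-scale depth from predicted disparity: D = B * fx / d."""
    return baseline * fx / np.maximum(disp, eps)

# Made-up example: a stereo rig with a 0.54 m baseline and fx = 721 px.
disp = np.array([[36.05, 7.21]])                 # disparity in pixels
print(disparity_to_depth(disp, 0.54, 721.0))     # ~[[10.8, 54.0]] metres
```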
The second stage: the optical flow estimation network, the pose estimation network, the depth estimation network and the motion segmentation are trained simultaneously using consecutive same-size stereo image pairs from the training data.
In this stage, the two consecutive same-size stereo image pairs L1, R1, L2 and R2 are used, together with the optical flow loss, the stereo loss, the rigid flow loss and the flow consistency loss, to simultaneously train the optical flow estimation network, the pose estimation network, the depth estimation network, the rigid registration module and the flow consistency check module.
The optical flow estimation network, pose estimation network and depth estimation network trained in this stage follow the same training process as in the first stage and produce the same outputs, so the description is not repeated. The difference is that motion segmentation is now trained simultaneously by combining the outputs of the three networks; since this part works in the same way in the training and test stages, it is described later to avoid redundancy. This training strategy avoids the vanishing-gradient problem during network training.
Alternatively, the optical flow estimation network may employ the PWC-Net framework, which merges several classical optical flow estimation techniques, including image pyramids, warping and cost volumes, in an end-to-end trainable deep neural network to achieve state-of-the-art results. The pose estimation network can adopt a framework based on a recurrent convolutional neural network (RCNN): features extracted by the CNN are fed into two convolutional LSTM (ConvLSTM) layers to output a 6-DoF pose consisting of a translation (tx, ty, tz) and rotation angles. The depth estimation network can employ an encoder-decoder architecture based on ResNet50 and estimates a dense depth map of the same size as the input RGB image.
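Downstream modules use this 6-DoF output as a rigid transformation; the sketch below converts a translation (tx, ty, tz) and Euler rotation angles into a 4×4 matrix T12. The X-Y-Z rotation order is an assumption, since the patent does not specify the angle convention.

```python
import numpy as np

def pose_vec_to_mat(t, angles):
    """Build a 4x4 rigid transform from translation t = (tx, ty, tz) and
    Euler angles (rx, ry, rz) in radians, using an assumed X-Y-Z rotation order."""
    rx, ry, rz = angles
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = t
    return T

T12 = pose_vec_to_mat((0.02, 0.0, 0.8), (0.001, -0.01, 0.002))  # made-up pose values
```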
3. After training, rigid registration is performed on two newly input consecutive stereo image pairs of the same size using the outputs of the three networks to optimize the output of the pose estimation network; the rigid flow caused by camera motion is then calculated from the output of the depth estimation network and the optimized pose, and a flow consistency check is performed against the output of the optical flow estimation network to obtain the motion segmentation.
1) A rigid registration module.
The rigid registration module uses the optical flow output by the optical flow estimation network and the absolute scale depth D1,2 computed from the disparity d output by the depth estimation network to optimize the pose T12 output by the pose estimation network, obtaining the optimized pose T'12.
During rigid registration, points in the 2D image space are converted into 3D point clouds using:
Qk(i, j) = Dk(i, j)·K⁻¹·Pk(i, j),  k = 1, 2
where Pk(i, j) is the homogeneous coordinate of the pixel at position (i, j) of image Lk, K is the camera intrinsic matrix, Dk(i, j) is the absolute scale depth at position (i, j) of image Lk, and Qk(i, j) is the corresponding 3D coordinate of the pixel at position (i, j) of image Lk.
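A minimal sketch of this back-projection, assuming a 3×3 intrinsic matrix K, a dense H×W absolute-scale depth map, and pixel homogeneous coordinates (u, v, 1):

```python
import numpy as np

def backproject(depth, K):
    """Q(i, j) = D(i, j) * K^-1 * P(i, j): lift every pixel to a 3D point.
    depth: H x W absolute-scale depth map, K: 3 x 3 camera intrinsics.
    Returns an H x W x 3 point cloud in camera coordinates."""
    H, W = depth.shape
    j, i = np.meshgrid(np.arange(W), np.arange(H))                      # pixel column/row grids
    P = np.stack([j, i, np.ones_like(i)], axis=-1).astype(np.float64)   # homogeneous (u, v, 1)
    rays = P @ np.linalg.inv(K).T                                       # K^-1 * P for every pixel
    return depth[..., None] * rays                                      # scale rays by depth
```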
The pose T12 is used to transform the 3D point cloud Q1 into a 3D point cloud Q̂2 (which can be understood as the point cloud constructed from the 3D coordinates of the points of L1 at time t2). In addition, using bilinear sampling, the 3D point cloud Q2 is warped back to time t1 according to the optical flow, yielding the corresponding 3D point cloud Q̃2; the correspondence established by this warping step makes Q̃2 correspond to Q̂2.
Here W and H denote the width and height of the image, respectively, and the two components of the optical flow along the x and y axes are used in the warping.
If all estimates were perfectly accurate, Q̃2 would equal Q̂2 in the static, non-occluded regions of the scene. Therefore, the backward optical flow is first used to estimate the non-occluded region O1, and the pose estimate is then refined by closely aligning the two point clouds over the non-occluded region. Specifically, the pose refinement ΔT is estimated by minimizing the distance between Q̂2 and Q̃2 over a selected region R:
ΔT = argmin_ΔT Σ_{(i,j) ∈ R} ‖ΔT·Q̂2(i, j) − Q̃2(i, j)‖
where the region R consists of the points in the corresponding non-occluded regions whose distances between Q̂2 and Q̃2 rank in the smallest R% (e.g., 25%). This is intended to exclude points in moving regions, since such points tend to have larger distances between Q̂2 and Q̃2. Combining T12 and ΔT yields the optimized pose T'12:
T'12 = ΔT × T12.
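To make the refinement step concrete, the sketch below selects the R% of non-occluded points with the smallest distances between Q̂2 and Q̃2 and estimates ΔT from them; solving the alignment with a closed-form SVD (Kabsch/Umeyama) step is an assumption, since the patent does not state how the minimization is carried out.

```python
import numpy as np

def refine_pose(Q2_hat, Q2_tilde, non_occluded, T12, keep_ratio=0.25):
    """Estimate dT aligning Q2_hat (T12-transformed Q1) to Q2_tilde (flow-warped Q2)
    over the keep_ratio fraction of non-occluded points with the smallest distances,
    then return the optimized pose T12' = dT @ T12. Point clouds are H x W x 3."""
    A = Q2_hat[non_occluded].reshape(-1, 3)
    B = Q2_tilde[non_occluded].reshape(-1, 3)
    d = np.linalg.norm(A - B, axis=1)
    keep = np.argsort(d)[: max(1, int(keep_ratio * len(d)))]   # region R: smallest R%
    A, B = A[keep], B[keep]

    # Closed-form rigid alignment (Kabsch/Umeyama) of A onto B.
    ca, cb = A.mean(0), B.mean(0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = cb - R @ ca

    dT = np.eye(4)
    dT[:3, :3], dT[:3, 3] = R, t
    return dT @ T12          # T12' = dT x T12
```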
2) Flow consistency and motion segmentation.
With the optimized pose T'12, the rigid flow caused by camera motion can be calculated as:
F_rig(i, j) = K·T'12·D1(i, j)·K⁻¹·P1(i, j) − P1(i, j)
where K is the camera intrinsic matrix and P1 denotes the homogeneous coordinates of the pixels in L1.
If both the rigid flow F_rig and the optical flow F_flow output by the optical flow estimation network are accurate, their values should match in static regions and differ in moving regions. A consistency check is therefore performed between the rigid flow and the optical flow: if the difference between the two flows is greater than a threshold δ, the corresponding region is marked as moving foreground M1, and the rest of the image is marked as static background M0, so that the image loss mask is
M1(i, j) = 1 if (i, j) ∈ O1 and ‖F_flow(i, j) − F_rig(i, j)‖ > δ, and M1(i, j) = 0 otherwise.
Because the optical flow is less accurate in occluded regions, which may lead to false positives, the estimated motion region is by default restricted to the non-occluded region O1.
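Under the reconstruction above, the check can be sketched as follows: the rigid flow is computed from the depth and the optimized pose, compared with the estimated optical flow, and pixels in the non-occluded region whose flow difference exceeds δ are marked as moving foreground M1 (the threshold value used below is a made-up example).

```python
import numpy as np

def rigid_flow(depth1, K, T12_opt):
    """Rigid flow induced by camera motion: reproject K*T*D*K^-1*P and subtract P."""
    H, W = depth1.shape
    j, i = np.meshgrid(np.arange(W), np.arange(H))
    P = np.stack([j, i, np.ones_like(i)], -1).astype(np.float64)       # (u, v, 1)
    Q1 = depth1[..., None] * (P @ np.linalg.inv(K).T)                  # 3D points at t1
    Q1_h = np.concatenate([Q1, np.ones((H, W, 1))], -1)
    Q2 = (Q1_h @ T12_opt.T)[..., :3]                                   # points moved to t2
    p2 = Q2 @ K.T
    p2 = p2[..., :2] / np.maximum(p2[..., 2:3], 1e-6)                  # reproject to pixels
    return p2 - P[..., :2]                                             # rigid flow (du, dv)

def motion_segmentation(flow_est, flow_rig, non_occluded, delta=3.0):
    """M1 = 1 where optical flow and rigid flow disagree by more than delta
    inside the non-occluded region; the rest is static background M0."""
    diff = np.linalg.norm(flow_est - flow_rig, axis=-1)
    M1 = (diff > delta) & non_occluded
    return M1
```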
The rigid flow is more accurate than the optical flow in static regions. Therefore, the rigid flow is used to guide the learning of the optical flow through the following flow consistency loss ℓ_con:
ℓ_con = (1/N) Σ_{(i,j) ∈ M0} ‖F_flow(i, j) − SG(F_rig(i, j))‖
where SG denotes the stop gradient, F_rig is the rigid flow caused by camera motion, and N is a normalization coefficient.
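A PyTorch-style sketch of a consistency loss of this shape is shown below, with tensor.detach() playing the role of the stop gradient SG; restricting the penalty to the static region follows the description, while the exact mask and norm are assumptions.

```python
import torch

def flow_consistency_loss(flow_est, flow_rig, static_mask):
    """l_con sketch: penalize the optical flow for deviating from the
    (stop-gradient) rigid flow over static pixels. flow_* are B x 2 x H x W,
    static_mask is B x 1 x H x W with 1 on the static background M0."""
    diff = (flow_est - flow_rig.detach()).abs().sum(dim=1, keepdim=True)  # SG via detach
    return (diff * static_mask).sum() / (static_mask.sum() + 1e-7)        # N = #static pixels
```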
Based on the above, the total loss of the model shown in fig. 2 is the weighted sum of the optical flow loss, the stereo loss, the rigid flow loss and the flow consistency loss:
ℓ_total = λ_flow·ℓ_flow + λ_stereo·ℓ_stereo + λ_rigid·ℓ_rigid + λ_con·ℓ_con
where each λ is the weight coefficient of the corresponding loss term.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. An unsupervised end-to-end driving environment perception method based on deep learning is characterized by comprising the following steps:
acquiring images by using a binocular camera, and preprocessing to obtain training data;
training an optical flow estimation network, a pose estimation network, a depth estimation network and motion segmentation using two consecutive stereo image pairs of the same size from the training data;
after training, performing rigid registration on two newly input consecutive stereo image pairs of the same size using the outputs of the three networks to optimize the output of the pose estimation network; and calculating the rigid flow caused by camera motion from the output of the depth estimation network and the optimized pose, then performing a flow consistency check against the output of the optical flow estimation network to obtain the motion segmentation.
2. The unsupervised end-to-end driving environment perception method based on deep learning of claim 1, wherein the image acquisition using a binocular camera and the obtaining of the training data through preprocessing comprises:
firstly, scaling the original images acquired by the binocular camera, and simultaneously scaling the corresponding camera intrinsic parameters;
then, generating training data by a data enhancement method;
the data enhancement method comprises the following steps of performing data enhancement in one or more ways:
randomly correcting the input monocular image by using a brightness factor gamma;
scaling the image along the X and Y axes with scale factors sx and sy, and then randomly cropping the image to a specified size;
randomly rotating the image by r degrees, and interpolating by using a nearest neighbor method;
random left-right turning and random time sequence switching.
3. The method as claimed in claim 1, wherein the training of the optical flow estimation network, the pose estimation network, the depth estimation network, and the motion segmentation by using two consecutive same-size stereo images in the training data comprises:
firstly, training an optical flow estimation network using consecutive same-size stereo image pairs from the training data, and simultaneously training a pose estimation network and a depth estimation network;
then, training the optical flow estimation network, the pose estimation network, the depth estimation network and the motion segmentation simultaneously using consecutive same-size stereo image pairs from the training data.
4. The unsupervised end-to-end driving environment perception method based on deep learning of claim 3,
two consecutive stereo image pairs of the same size are denoted L1, R1, L2 and R2, where L1, R1 are the left and right images at time t1, and L2, R2 are the left and right images at time t2;
an optical flow estimation network is trained using the two consecutive left images L1 and L2 and a designed optical flow loss function, and its output is the optical flow between the two consecutive same-size left images L1 and L2;
Training a pose estimation network and a depth estimation network simultaneously:
a pose estimation network is trained using the two consecutive left images L1 and L2 and a designed rigid flow loss function, and its output is the relative camera pose T12 between the two consecutive left images L1 and L2; a depth estimation network is trained using the two consecutive same-size stereo image pairs L1, R1, L2 and R2 and a stereo loss, and its output is the disparity d between the images of each stereo pair; using the stereo camera baseline B and the horizontal focal length fx, the absolute scale depth is computed from the disparity as D = B·fx/d, and the computed absolute scale depth is denoted D1,2.
5. The unsupervised end-to-end driving environment perception method based on deep learning of claim 4,
the optical flow loss function comprises an occlusion-aware reconstruction loss term and a smoothing loss term;
where ψ(·) denotes the occlusion-aware reconstruction loss function, α denotes an adjustment coefficient, O1 denotes the non-occluded region, M1 denotes the loss mask, and N is a normalization coefficient; the reconstructed left image is obtained by warping L2 with the optical flow between L1 and L2; e denotes the natural logarithm, (i, j) denotes the pixel position, the derivative operator acts along the x or y direction of the image (its square denotes the second-order derivative), a refers to the x or y direction of the image and indicates the direction of derivation, and β is a weight.
6. The unsupervised end-to-end driving environment perception method based on deep learning of claim 4,
where ψ(·) denotes the occlusion-aware reconstruction loss function, O1 denotes the non-occluded region, and M1 denotes the loss mask; the two reconstructed left images are obtained by warping L2 according to the two rigid flows, one of which is computed from the absolute scale depth D1,2 and the pose T12, and the other from the absolute scale depth D1,2 and the optimized pose.
7. The method as claimed in claim 3, wherein the simultaneous training of the optical flow estimation network, the pose estimation network, the depth estimation network and the motion segmentation by using consecutive same-size stereo image pairs in the training data comprises:
two consecutive stereo image pairs of the same size are denoted L1, R1, L2 and R2, where L1, R1 are the left and right images at time t1, and L2, R2 are the left and right images at time t2;
the two consecutive same-size stereo image pairs L1, R1, L2 and R2 are used, together with the optical flow loss, the stereo loss, the rigid flow loss and the flow consistency loss, to simultaneously train an optical flow estimation network, a pose estimation network, a depth estimation network, a rigid registration module and a flow consistency check module.
8. The method as claimed in claim 7, wherein a rigid registration module uses the optical flow output by the optical flow estimation network and the absolute scale depth D1,2 computed from the disparity d output by the depth estimation network to optimize the pose T12 output by the pose estimation network, obtaining the optimized pose T'12;
During rigid registration, points in 2D image space are converted into 3D point clouds, the formula:
Qk(i,j)=Dk(i,j)K-1Pk(i,j),k=1,2
where Pk(i, j) is the homogeneous coordinate of the pixel at position (i, j) of image Lk, K is the camera intrinsic matrix, Dk(i, j) is the absolute scale depth at position (i, j) of image Lk, and Qk(i, j) is the corresponding 3D coordinate of the pixel at position (i, j) of image Lk;
the pose T12 is used to transform the 3D point cloud Q1 into a 3D point cloud Q̂2, and, using bilinear sampling, the 3D point cloud Q2 is warped back to time t1 according to the optical flow, obtaining the corresponding 3D point cloud Q̃2; the correspondence established by this warping step makes Q̃2 correspond to Q̂2;
where W and H denote the width and height of the image, respectively, and the two components of the optical flow along the x and y axes are used in the warping;
the pose refinement ΔT is estimated by minimizing the distance between Q̂2 and Q̃2 over a selected region R:
ΔT = argmin_ΔT Σ_{(i,j) ∈ R} ‖ΔT·Q̂2(i, j) − Q̃2(i, j)‖
where the region R consists of the points in the corresponding non-occluded regions whose distances between Q̂2 and Q̃2 rank in the smallest R%;
so that the optimized pose T'12 is obtained by the following formula:
T'12 = ΔT × T12.
9. The unsupervised end-to-end driving environment perception method based on deep learning of claim 7 or 8, wherein the rigid flow caused by camera motion is calculated by the following formula:
F_rig(i, j) = K·T'12·D1(i, j)·K⁻¹·P1(i, j) − P1(i, j)
where K is the camera intrinsic matrix and P1 denotes the homogeneous coordinates of the pixels in L1;
the loss mask is estimated by:
M1(i, j) = 1 if (i, j) ∈ O1 and ‖F_flow(i, j) − F_rig(i, j)‖ > δ, and M1(i, j) = 0 otherwise,
where O1 denotes the non-occluded region and δ is the threshold.
10. The unsupervised end-to-end driving environment perception method based on deep learning of claim 7 or 8, wherein the flow consistency loss is expressed as:
ℓ_con = (1/N) Σ_{(i,j) ∈ M0} ‖F_flow(i, j) − SG(F_rig(i, j))‖
where SG denotes the stop gradient, F_rig is the rigid flow caused by camera motion, F_flow is the optical flow, M0 is the static background, and N is a normalization coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911345900.9A CN111105432B (en) | 2019-12-24 | 2019-12-24 | Unsupervised end-to-end driving environment perception method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911345900.9A CN111105432B (en) | 2019-12-24 | 2019-12-24 | Unsupervised end-to-end driving environment perception method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111105432A true CN111105432A (en) | 2020-05-05 |
CN111105432B CN111105432B (en) | 2023-04-07 |
Family
ID=70423494
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911345900.9A Active CN111105432B (en) | 2019-12-24 | 2019-12-24 | Unsupervised end-to-end driving environment perception method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111105432B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111627056A (en) * | 2020-05-14 | 2020-09-04 | 清华大学 | Depth estimation-based driving visibility determination method and device |
CN111629194A (en) * | 2020-06-10 | 2020-09-04 | 北京中科深智科技有限公司 | Method and system for converting panoramic video into 6DOF video based on neural network |
CN113140011A (en) * | 2021-05-18 | 2021-07-20 | 烟台艾睿光电科技有限公司 | Infrared thermal imaging monocular vision distance measurement method and related assembly |
CN113838104A (en) * | 2021-08-04 | 2021-12-24 | 浙江大学 | Registration method based on multispectral and multi-mode image consistency enhancement network |
CN113902807A (en) * | 2021-08-19 | 2022-01-07 | 江苏大学 | Electronic component three-dimensional reconstruction method based on semi-supervised learning |
CN114187581A (en) * | 2021-12-14 | 2022-03-15 | 安徽大学 | Driver distraction fine-grained detection method based on unsupervised learning |
CN114359363A (en) * | 2022-01-11 | 2022-04-15 | 浙江大学 | Video consistency depth estimation method and device based on deep learning |
CN114494332A (en) * | 2022-01-21 | 2022-05-13 | 四川大学 | Unsupervised estimation method for scene flow from synthesis to real LiDAR point cloud |
GB2618775A (en) * | 2022-05-11 | 2023-11-22 | Continental Autonomous Mobility Germany GmbH | Self-supervised learning of scene flow |
WO2024051184A1 (en) * | 2022-09-07 | 2024-03-14 | 南京逸智网络空间技术创新研究院有限公司 | Optical flow mask-based unsupervised monocular depth estimation method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097109A (en) * | 2019-04-25 | 2019-08-06 | 湖北工业大学 | A kind of road environment obstacle detection system and method based on deep learning |
US20190265712A1 (en) * | 2018-02-27 | 2019-08-29 | Nauto, Inc. | Method for determining driving policy |
CN110189278A (en) * | 2019-06-06 | 2019-08-30 | 上海大学 | A kind of binocular scene image repair method based on generation confrontation network |
CN110443843A (en) * | 2019-07-29 | 2019-11-12 | 东北大学 | A kind of unsupervised monocular depth estimation method based on generation confrontation network |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
CN110490919A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of depth estimation method of the monocular vision based on deep neural network |
WO2019223382A1 (en) * | 2018-05-22 | 2019-11-28 | 深圳市商汤科技有限公司 | Method for estimating monocular depth, apparatus and device therefor, and storage medium |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190265712A1 (en) * | 2018-02-27 | 2019-08-29 | Nauto, Inc. | Method for determining driving policy |
WO2019223382A1 (en) * | 2018-05-22 | 2019-11-28 | 深圳市商汤科技有限公司 | Method for estimating monocular depth, apparatus and device therefor, and storage medium |
CN110097109A (en) * | 2019-04-25 | 2019-08-06 | 湖北工业大学 | A kind of road environment obstacle detection system and method based on deep learning |
CN110189278A (en) * | 2019-06-06 | 2019-08-30 | 上海大学 | A kind of binocular scene image repair method based on generation confrontation network |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
CN110490919A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of depth estimation method of the monocular vision based on deep neural network |
CN110443843A (en) * | 2019-07-29 | 2019-11-12 | 东北大学 | A kind of unsupervised monocular depth estimation method based on generation confrontation network |
Non-Patent Citations (3)
Title |
---|
- 周云成; 许童羽; 邓寒冰; 苗腾; 吴琼: "Depth estimation method for tomato plant images based on self-supervised learning" *
- 毕天腾; 刘越; 翁冬冬; 王涌天: "A survey of single-image depth estimation based on supervised learning" *
- 黄军; 王聪; 刘越; 毕天腾: "A survey of advances in monocular depth estimation techniques" *
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111627056A (en) * | 2020-05-14 | 2020-09-04 | 清华大学 | Depth estimation-based driving visibility determination method and device |
CN111627056B (en) * | 2020-05-14 | 2023-09-01 | 清华大学 | Driving visibility determination method and device based on depth estimation |
CN111629194A (en) * | 2020-06-10 | 2020-09-04 | 北京中科深智科技有限公司 | Method and system for converting panoramic video into 6DOF video based on neural network |
CN111629194B (en) * | 2020-06-10 | 2021-01-26 | 北京中科深智科技有限公司 | Method and system for converting panoramic video into 6DOF video based on neural network |
CN113140011B (en) * | 2021-05-18 | 2022-09-06 | 烟台艾睿光电科技有限公司 | Infrared thermal imaging monocular vision distance measurement method and related components |
CN113140011A (en) * | 2021-05-18 | 2021-07-20 | 烟台艾睿光电科技有限公司 | Infrared thermal imaging monocular vision distance measurement method and related assembly |
CN113838104B (en) * | 2021-08-04 | 2023-10-27 | 浙江大学 | Registration method based on multispectral and multimodal image consistency enhancement network |
CN113838104A (en) * | 2021-08-04 | 2021-12-24 | 浙江大学 | Registration method based on multispectral and multi-mode image consistency enhancement network |
CN113902807A (en) * | 2021-08-19 | 2022-01-07 | 江苏大学 | Electronic component three-dimensional reconstruction method based on semi-supervised learning |
CN114187581A (en) * | 2021-12-14 | 2022-03-15 | 安徽大学 | Driver distraction fine-grained detection method based on unsupervised learning |
CN114187581B (en) * | 2021-12-14 | 2024-04-09 | 安徽大学 | Driver distraction fine granularity detection method based on unsupervised learning |
CN114359363A (en) * | 2022-01-11 | 2022-04-15 | 浙江大学 | Video consistency depth estimation method and device based on deep learning |
CN114494332A (en) * | 2022-01-21 | 2022-05-13 | 四川大学 | Unsupervised estimation method for scene flow from synthesis to real LiDAR point cloud |
CN114494332B (en) * | 2022-01-21 | 2023-04-25 | 四川大学 | Unsupervised synthesis to real LiDAR point cloud scene flow estimation method |
GB2618775A (en) * | 2022-05-11 | 2023-11-22 | Continental Autonomous Mobility Germany GmbH | Self-supervised learning of scene flow |
WO2024051184A1 (en) * | 2022-09-07 | 2024-03-14 | 南京逸智网络空间技术创新研究院有限公司 | Optical flow mask-based unsupervised monocular depth estimation method |
Also Published As
Publication number | Publication date |
---|---|
CN111105432B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
Shu et al. | Feature-metric loss for self-supervised learning of depth and egomotion | |
Mitrokhin et al. | EV-IMO: Motion segmentation dataset and learning pipeline for event cameras | |
US11315266B2 (en) | Self-supervised depth estimation method and system | |
Zhu et al. | Unsupervised event-based learning of optical flow, depth, and egomotion | |
CN111325794B (en) | Visual simultaneous localization and map construction method based on depth convolution self-encoder | |
Zhan et al. | Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
US10818029B2 (en) | Multi-directional structured image array capture on a 2D graph | |
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
TWI709107B (en) | Image feature extraction method and saliency prediction method including the same | |
WO2018000752A1 (en) | Monocular image depth estimation method based on multi-scale cnn and continuous crf | |
CN110689008A (en) | Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction | |
CN111783582A (en) | Unsupervised monocular depth estimation algorithm based on deep learning | |
CN108491763B (en) | Unsupervised training method and device for three-dimensional scene recognition network and storage medium | |
WO2024051184A1 (en) | Optical flow mask-based unsupervised monocular depth estimation method | |
CN115035171B (en) | Self-supervision monocular depth estimation method based on self-attention guide feature fusion | |
CN113850900B (en) | Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction | |
CN111950477A (en) | Single-image three-dimensional face reconstruction method based on video surveillance | |
Song et al. | Self-supervised depth completion from direct visual-lidar odometry in autonomous driving | |
CN113065506B (en) | Human body posture recognition method and system | |
CN116188550A (en) | Self-supervision depth vision odometer based on geometric constraint | |
Lee et al. | Globally consistent video depth and pose estimation with efficient test-time training | |
CN112634331A (en) | Optical flow prediction method and device | |
Su et al. | Omnidirectional depth estimation with hierarchical deep network for multi-fisheye navigation systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |