CN117079237A - Self-supervision monocular vehicle distance detection method - Google Patents
- Publication number
- CN117079237A CN117079237A CN202311049975.9A CN202311049975A CN117079237A CN 117079237 A CN117079237 A CN 117079237A CN 202311049975 A CN202311049975 A CN 202311049975A CN 117079237 A CN117079237 A CN 117079237A
- Authority
- CN
- China
- Prior art keywords
- network
- data set
- image
- depth
- self
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/467—Encoded features or binary features, e.g. local binary patterns [LBP]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a self-supervised monocular vehicle distance detection method, comprising the following steps. Step S1: download and process the KITTI data set, and randomly divide it into a training set and a testing set. Step S2: construct a self-encoder structure for feature extraction, input the original image, compute photometric errors on the extracted feature maps, and dynamically adjust parameters to minimize the error between the reconstructed image and the original image. Step S3: construct a multidimensional model, building a pose network and a depth prediction network, and add an attention mechanism. Step S4: perform scale recovery on the relative depth output by the model, converting the relative depth into absolute depth. By improving the Monodepth2 network structure, the invention adds a CRP (chained residual pooling) module and an attention module to the decoder network, so that the model focuses on important feature regions, thereby improving model performance.
Description
Technical Field
The invention belongs to the technical field of vehicle distance estimation, and particularly relates to a self-supervision monocular vehicle distance detection method.
Background
In the fields of vehicle driving and traffic safety, accurate estimation of the distance between a vehicle and a camera is critical to driving assistance systems and intelligent transportation systems. With the rapid development of computer vision and deep learning, vehicle distance detection based on monocular images has become a solution with wide application potential. Traditional supervised learning methods require a large amount of labeled data, and accurate vehicle distance information is difficult to acquire. In addition, the cost of data labeling and the time cost of training are among the factors limiting their application. Moreover, existing self-supervised algorithms cannot handle scenes containing transparent, reflective, and low-texture regions; such scenes lack explicit depth cues, making it difficult for depth estimation algorithms to accurately infer the depth of these regions. Therefore, a self-supervised monocular vehicle distance detection method is needed to achieve accurate vehicle distance estimation, thereby improving the performance of driving assistance systems and traffic safety, and further research and technical innovation are needed to address these challenges.
Disclosure of Invention
In order to overcome the above defects in the prior art, the invention provides a self-supervised monocular vehicle distance detection method.
In order to achieve the above object, the technical scheme adopted to solve the technical problem is as follows:
A self-supervised monocular vehicle distance detection method comprises the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a testing set;
step S2: constructing a self-encoder structure for extracting features, inputting an original image, calculating luminosity errors in the extracted feature image, and dynamically adjusting parameters to minimize errors between a reconstructed image and the original image;
step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
step S4: and (5) performing scale recovery on the relative depth output by the model, and converting the relative depth into absolute depth.
Further, step S1 includes the following:
and carrying out data enhancement on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples, constructing a vehicle distance detection training data set and a test data set by adopting a real shooting image of the KITTI data set, wherein the original size of the data set is 1242 x 375 pixels, preprocessing the image, compressing the image to 320 x 1024 pixels, and dividing the data set according to the ratio of train: val: test=8:1:1.
Further, step S1 includes the following:
The data set contains 389 stereo image and optical flow map pairs, a 29.2 km visual odometry sequence, 9300 RGBD training samples with depth maps, and more than 200K images of 3D-labeled objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
Further, step S2 includes the following:
step S2-1: in the traditional U-Net sampling network, the up-sampling part converts the original transpose convolution into deconvolution operation, and the number of sampling layers is increased by 16 times on the basis of 2 times, 4 times and 8 times of the number of the U-Net sampling layers;
step S2-2: adding a key module, adding a CRP block chain residue pooling module in a decoder network according to the structure described in the step S2-1, fusing residual connection and weight learning, and adding a maximum pooling layer in an Encoder part for restraining the size of a feature map;
step S2-3: calculating photometric losses, calculating photometric errors from the output feature map according to the structure described in step S2-2, using single view reconstruction to learn the feature representation will facilitate discrimination of non-textured areas as well as surfaces that are illuminated for reflection.
Further, step S3 includes the following:
An attention module is introduced into the existing Monodepth2 network model. The attention module is added at the tail of the Backbone network of Monodepth2; that is, a self-attention module is inserted between the last feature extraction module and the skip-connection module in the Backbone network, adaptively learning the relevance between different positions in the image.
Further, step S4 includes the following:
The depth map output by the model is stored in uint16 format. Scale recovery is performed by dividing the values read from the depth map by 256 to obtain the true distance values.
Compared with the prior art, the invention has the following advantages and positive effects due to the adoption of the technical scheme:
the invention discloses a self-supervision monocular distance detection method which is mainly used for solving the problem of inaccurate distance when a monocular camera is used for estimating the distance of a vehicle. Previous methods often have estimation errors for non-textured areas and surfaces that are reflective to illumination. When there are moving objects and motion blur in the image, the model may create a problem of inaccurate estimated depth. By adding additional codec structures to capture semantic information of the input image from multiple dimensions, optimal photometric errors are calculated, and the network learns to a consistent feature representation, thereby optimizing the conditions of non-texture and inaccurate distance estimation of the illumination-reflected surface. And an adaptive attention mechanism is added in the network structure for dynamically adjusting the attention weight, so that the network is focused on the tested vehicle, and the robustness and generalization capability of the model are improved. After the model is determined, the KITTI data set is downloaded for training the model, and the trained model is inferred and compared with the real depth data with the depth value, so that the model effect is further optimized to meet the requirements of practical application.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below. In the accompanying drawings:
FIG. 1 is a flow chart of the self-supervised monocular vehicle distance detection method of the present invention;
FIG. 2 is a schematic diagram of the network model structure of the present invention;
FIG. 3 is a schematic diagram of the CRP block structure of the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in FIG. 1, this embodiment discloses a self-supervised monocular vehicle distance detection method comprising the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a testing set;
further, step S1 includes the following:
and carrying out data enhancement on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples, constructing a vehicle distance detection training data set and a test data set by adopting a real shooting image of the KITTI data set, wherein the original size of the data set is 1242 x 375 pixels, preprocessing the image, compressing the image to 320 x 1024 pixels, and dividing the data set according to the ratio of train: val: test=8:1:1.
Further, step S1 includes the following:
The data set contains 389 stereo image and optical flow map pairs, a 29.2 km visual odometry sequence, 9300 RGBD training samples with depth maps, and more than 200K images of 3D-labeled objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
Step S2: constructing a self-encoder structure for extracting features, inputting an original image, calculating luminosity errors in the extracted feature image, and dynamically adjusting parameters to minimize errors between a reconstructed image and the original image;
further, step S2 includes the following:
step S2-1: in the traditional U-Net sampling network, the up-sampling part converts the original transpose convolution into deconvolution operation, and the number of sampling layers is increased by 16 times on the basis of 2 times, 4 times and 8 times of the number of the U-Net sampling layers;
step S2-2: adding a key module, adding a CRP block chain residue pooling module in a decoder network according to the structure described in the step S2-1, fusing residual connection and weight learning, and adding a maximum pooling layer in an Encoder part for restraining the size of a feature map;
step S2-3: calculating photometric losses, calculating photometric errors from the output feature map according to the structure described in step S2-2, using single view reconstruction to learn the feature representation will facilitate discrimination of non-textured areas as well as surfaces that are illuminated for reflection.
Step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
An attention module is introduced into the existing Monodepth2 network model. The attention module is added at the tail of the Backbone network of Monodepth2; that is, a self-attention module is inserted between the last feature extraction module and the skip-connection module in the Backbone network, adaptively learning the relevance between different positions in the image.
Further, step S3 includes the following:
the depth residual error network is used as a backhaul for feature extraction, and can realize jump connection and better transmit upper network information to a lower network.
The input picture is convolved by a 7×7 convolution layer with 64 channels and stride 2 to extract the picture's feature information, and the resulting feature map is then downsampled. First, a downsampling layer with 128 output channels is used to reduce the spatial dimension of the feature map while preserving important feature information. After this downsampling layer, 3 blocks are added, each comprising several residual blocks and 1 downsampling layer, gradually reducing the spatial size of the feature map.
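A minimal sketch of the encoder stem just described (PyTorch assumed; the batch-norm layers and the use of a strided 3×3 convolution as the 128-channel downsampling layer are illustrative simplifications of the residual blocks, not the exact network):

```python
import torch
import torch.nn as nn

# Stem matching the description: 7x7 conv, 64 channels, stride 2,
# followed by a downsampling layer with 128 output channels.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),  # downsampling layer
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 320, 1024)  # preprocessed KITTI resolution from step S1
feat = stem(x)                     # spatial size reduced 4x
```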
After the final convolution layer of the Encoder, a self-attention module is inserted to enhance the accuracy of the feature representation. It specifically comprises two stages:
Stage I: a convolution operation is performed on the original image with a 7×7 convolution kernel, and the result is projected through 3 separate 1×1 convolutions to obtain an intermediate feature set containing 3×N feature maps;
Stage II: the intermediate features are grouped into N groups, each containing 3 feature maps serving as the queries, keys and values, respectively, following the conventional multi-head self-attention model. A lightweight fully connected layer and grouped convolution are used for processing, finally producing N feature maps as part of the feature maps output by the Encoder.
Decoder section: comprises 4 blocks made up of several deconvolution layers, convolution layers and skip connections. The first block contains 1 deconvolution layer with 256 output channels; the remaining 3 blocks have 128, 64 and 32 output channels, respectively. The last layer is a 1×1 convolution layer with 1 output channel for predicting the depth map.
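The four-block decoder described above can be sketched as follows (PyTorch assumed). The 512-channel encoder output, the 4×4 stride-2 deconvolution kernels, the omission of skip connections, and the sigmoid activation producing a normalized relative depth are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Four blocks with 256/128/64/32 output channels, then a 1x1 conv to
# a single-channel depth prediction.
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),
    nn.Conv2d(32, 1, kernel_size=1),  # single-channel depth map
    nn.Sigmoid(),                     # relative (normalized) depth in [0, 1]
)

depth = decoder(torch.randn(1, 512, 10, 32))  # each deconv doubles H and W
```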
Step S4: performing scale recovery on the relative depth output by the model, and converting the relative depth into absolute depth.
Further, step S4 includes the following:
The depth map output by the model is stored in uint16 format. Scale recovery is performed by dividing the values read from the depth map by 256 to obtain the true distance values.
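The uint16 scale recovery of step S4 can be sketched as below, assuming the KITTI convention of storing depth × 256 in a 16-bit image (so dividing by 256 recovers metres, with 0 marking invalid pixels):

```python
import numpy as np

def recover_depth_m(depth_png: np.ndarray) -> np.ndarray:
    """Convert a uint16 depth map to float depth in metres by dividing
    the raw stored values by 256 (0 = no measurement)."""
    assert depth_png.dtype == np.uint16
    return depth_png.astype(np.float32) / 256.0
```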
Compared with the prior art, by improving the Monodepth2 network structure the invention adds a CRP (chained residual pooling) module and an attention (self-attention) module to the decoder network, so that the model focuses on important feature regions, thereby improving model performance. The introduced codec structures mitigate the estimation errors that occur on non-textured areas and specular (illumination-reflecting) surfaces, as well as the inaccurate depth estimates that arise when moving objects and motion blur are present in the image, ultimately improving the accuracy of the model's distance estimates and its robustness.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (6)
1. A self-supervised monocular vehicle distance detection method, characterized by comprising the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a testing set;
step S2: constructing a self-encoder structure for extracting features, inputting an original image, calculating photometric errors on the extracted feature maps, and dynamically adjusting parameters to minimize the error between the reconstructed image and the original image;
step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
step S4: performing scale recovery on the relative depth output by the model, and converting the relative depth into absolute depth.
2. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S1 comprises the following:
data enhancement is carried out on the downloaded KITTI data set: the data are transformed and expanded to generate diversified training samples; a vehicle distance detection training data set and test data set are constructed from the real captured images of the KITTI data set; the original image size is 1242×375 pixels; the images are preprocessed and compressed to 320×1024 pixels, and the data set is divided at the ratio train:val:test = 8:1:1.
3. The self-supervised monocular vehicle distance detection method according to claim 2, wherein step S1 comprises the following:
the data set contains 389 stereo image and optical flow map pairs, a 29.2 km visual odometry sequence, 9300 RGBD training samples with depth maps, and more than 200K images of 3D-labeled objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
4. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S2 comprises the following:
step S2-1: based on the traditional U-Net sampling network, the up-sampling part uses transposed convolution (deconvolution) operations, and a 16× sampling level is added on top of the 2×, 4× and 8× sampling levels of U-Net;
step S2-2: adding key modules: on the structure described in step S2-1, a CRP (chained residual pooling) module is added to the decoder network, fusing residual connections and weight learning, and a max-pooling layer is added to the Encoder part to constrain the size of the feature maps;
step S2-3: calculating photometric losses: with the structure described in step S2-2, photometric errors are calculated from the output feature maps; using single-view reconstruction to learn the feature representation facilitates the discrimination of non-textured areas as well as specular (illumination-reflecting) surfaces.
5. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S3 comprises the following:
an attention module is introduced into the existing Monodepth2 network model; the attention module is added at the tail of the Backbone network of Monodepth2, that is, a self-attention module is inserted between the last feature extraction module and the skip-connection module in the Backbone network, adaptively learning the relevance between different positions in the image.
6. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S4 comprises the following:
the depth map output by the model is stored in uint16 format; scale recovery is performed by dividing the values read from the depth map by 256 to obtain the true distance values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311049975.9A CN117079237A (en) | 2023-08-21 | 2023-08-21 | Self-supervision monocular vehicle distance detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117079237A true CN117079237A (en) | 2023-11-17 |
Family
ID=88711094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311049975.9A Pending CN117079237A (en) | 2023-08-21 | 2023-08-21 | Self-supervision monocular vehicle distance detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117079237A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422751A (en) * | 2023-12-19 | 2024-01-19 | 中科华芯(东莞)科技有限公司 | Non-motor vehicle safe driving auxiliary method, system and electronic equipment |
CN117422751B (en) * | 2023-12-19 | 2024-03-26 | 中科华芯(东莞)科技有限公司 | Non-motor vehicle safe driving auxiliary method, system and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112347859A (en) | Optical remote sensing image saliency target detection method | |
CN111832453B (en) | Unmanned scene real-time semantic segmentation method based on two-way deep neural network | |
CN112329780B (en) | Depth image semantic segmentation method based on deep learning | |
CN110020658B (en) | Salient object detection method based on multitask deep learning | |
CN114943963A (en) | Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network | |
CN116342596B (en) | YOLOv5 improved substation equipment nut defect identification detection method | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN117079237A (en) | Self-supervision monocular vehicle distance detection method | |
CN115713679A (en) | Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map | |
CN116229106A (en) | Video significance prediction method based on double-U structure | |
CN115035171A (en) | Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion | |
CN115908772A (en) | Target detection method and system based on Transformer and fusion attention mechanism | |
CN115861756A (en) | Earth background small target identification method based on cascade combination network | |
CN114092824A (en) | Remote sensing image road segmentation method combining intensive attention and parallel up-sampling | |
CN116205962A (en) | Monocular depth estimation method and system based on complete context information | |
CN116883912A (en) | Infrared dim target detection method based on global information target enhancement | |
CN116310916A (en) | Semantic segmentation method and system for high-resolution remote sensing city image | |
CN116612283A (en) | Image semantic segmentation method based on large convolution kernel backbone network | |
CN115797684A (en) | Infrared small target detection method and system based on context information | |
CN116703885A (en) | Swin transducer-based surface defect detection method and system | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN113920317B (en) | Semantic segmentation method based on visible light image and low-resolution depth image | |
CN115240163A (en) | Traffic sign detection method and system based on one-stage detection network | |
CN115424187B (en) | Auxiliary driving method for multi-angle camera collaborative importance ranking constraint | |
CN116612288B (en) | Multi-scale lightweight real-time semantic segmentation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||