CN117079237A - Self-supervised monocular vehicle distance detection method - Google Patents

Self-supervised monocular vehicle distance detection method

Info

Publication number
CN117079237A
Authority
CN
China
Prior art keywords
network
data set
image
depth
self
Prior art date
Legal status
Pending
Application number
CN202311049975.9A
Other languages
Chinese (zh)
Inventor
Wang Hao
Liu Pengyu
Current Assignee
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shanghai Institute of Technology
Priority to CN202311049975.9A
Publication of CN117079237A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/467 - Encoded features or binary features, e.g. local binary patterns [LBP]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Abstract

The invention discloses a self-supervised monocular vehicle distance detection method, which comprises the following steps. Step S1: download and process the KITTI data set, and randomly divide it into a training set and a test set. Step S2: construct a self-encoder structure for feature extraction, input the original image, compute the photometric error on the extracted feature maps, and dynamically adjust parameters to minimize the error between the reconstructed image and the original image. Step S3: construct a multidimensional model comprising a pose network and a depth prediction network, and add an attention mechanism. Step S4: perform scale recovery on the relative depth output by the model, converting it into absolute depth. By improving the Monodepth2 network structure, the invention adds a chained residual pooling (CRP) module and a Self-Attention module to the decoder network, so that the model focuses on important feature regions and its performance is improved.

Description

Self-supervised monocular vehicle distance detection method
Technical Field
The invention belongs to the technical field of vehicle distance estimation, and particularly relates to a self-supervised monocular vehicle distance detection method.
Background
In vehicle driving and traffic safety, accurate estimation of the distance between a vehicle and the camera is critical to driver assistance systems and intelligent transportation systems. With the rapid development of computer vision and deep learning, vehicle distance detection based on monocular images has become a solution with wide application potential. Traditional supervised learning methods require a large amount of labeled data, and accurate vehicle distance labels are difficult to acquire; the cost of data annotation and the time cost of training further limit their application. Existing self-supervised algorithms, however, cannot handle scenes containing transparent, reflective, and low-texture regions: such scenes lack explicit depth cues, so depth estimation algorithms struggle to infer depth accurately there. Therefore, a self-supervised monocular distance detection method is needed to achieve accurate vehicle distance estimation and thereby improve the performance of driver assistance systems and traffic safety; further research and technical innovation are required to address these challenges.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a self-supervised monocular vehicle distance detection method.
In order to achieve the above object, the technical scheme adopted to solve the technical problem is as follows:
A self-supervised monocular vehicle distance detection method comprises the following steps:
Step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a test set;
Step S2: constructing a self-encoder structure for feature extraction, inputting the original image, calculating the photometric error on the extracted feature maps, and dynamically adjusting parameters to minimize the error between the reconstructed image and the original image;
Step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
Step S4: performing scale recovery on the relative depth output by the model, converting the relative depth into absolute depth.
Further, step S1 includes the following:
Data enhancement is performed on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples. The vehicle distance detection training and test data sets are constructed from the real captured images of the KITTI data set; the original image size is 1242 × 375 pixels, and the images are preprocessed and resized to 320 × 1024 pixels. The data set is divided according to the ratio train:val:test = 8:1:1.
Further, step S1 includes the following:
The KITTI data set provides 389 pairs of stereo images and optical flow maps, a 29.2 km visual odometry sequence, 9,300 RGB-D training samples with depth maps, and over 200K images of 3D-annotated objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
Further, step S2 includes the following:
Step S2-1: in the traditional U-Net sampling network, the up-sampling part replaces the original transposed convolution with a deconvolution operation, and a 16× sampling level is added on top of U-Net's 2×, 4× and 8× levels;
Step S2-2: adding key modules: following the structure described in step S2-1, a chained residual pooling (CRP) module is added to the decoder network, fusing residual connections with weight learning, and a max pooling layer is added to the Encoder part to constrain the size of the feature map;
Step S2-3: calculating the photometric loss: following the structure described in step S2-2, the photometric error is computed from the output feature maps; learning feature representations through single-view reconstruction facilitates the discrimination of texture-less regions and reflective surfaces.
Further, step S3 includes the following:
An attention module is introduced into the existing Monodepth2 network model at the tail of its backbone network; that is, a self-attention module is inserted between the last feature extraction module and the skip connection module in the backbone, adaptively learning the correlations between different positions in the image.
Further, step S4 includes the following:
The depth map output by the model is stored in uint16 format; for scale recovery, the values read from the depth map are divided by 256 to obtain the real distance value.
Compared with the prior art, the invention has the following advantages and positive effects owing to the adopted technical scheme:
The invention discloses a self-supervised monocular vehicle distance detection method, mainly intended to solve the problem of inaccurate distance estimation when a monocular camera is used to measure vehicle distance. Previous methods often produce estimation errors on texture-less regions and reflective surfaces, and the estimated depth may become inaccurate when moving objects and motion blur appear in the image. By adding an additional encoder-decoder structure that captures semantic information of the input image from multiple dimensions and computing an optimal photometric error, the network learns a consistent feature representation, mitigating the inaccurate distance estimation on texture-less regions and reflective surfaces. An adaptive attention mechanism is added to the network structure to dynamically adjust attention weights, so that the network focuses on the measured vehicle, improving the robustness and generalization ability of the model. After the model is determined, the KITTI data set is downloaded to train it; the trained model's inferences are compared with ground-truth depth data, and the model is further optimized to meet the requirements of practical applications.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. In the accompanying drawings:
FIG. 1 is a flow chart of the self-supervised monocular vehicle distance detection method of the present invention;
FIG. 2 is a schematic diagram of a network model structure of the present invention;
FIG. 3 is a schematic representation of the CRP Block structure of the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in FIG. 1, this embodiment discloses a self-supervised monocular vehicle distance detection method, which comprises the following steps:
Step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a test set;
further, step S1 includes the following:
and carrying out data enhancement on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples, constructing a vehicle distance detection training data set and a test data set by adopting a real shooting image of the KITTI data set, wherein the original size of the data set is 1242 x 375 pixels, preprocessing the image, compressing the image to 320 x 1024 pixels, and dividing the data set according to the ratio of train: val: test=8:1:1.
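By way of illustration, the following is a minimal Python sketch of the resize and the random 8:1:1 split described above; the directory layout, file names and the preprocess_and_split helper are hypothetical, and the actual KITTI download is organized differently.

    import random
    from pathlib import Path

    from PIL import Image

    RAW_DIR = Path("kitti/raw")        # hypothetical location of the downloaded frames
    OUT_DIR = Path("kitti/processed")
    TARGET = (1024, 320)               # PIL takes (width, height): 320 x 1024 as above

    def preprocess_and_split(seed: int = 42) -> None:
        images = sorted(RAW_DIR.glob("*.png"))
        random.Random(seed).shuffle(images)      # random division of the data set
        n = len(images)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        splits = {
            "train": images[:n_train],
            "val": images[n_train:n_train + n_val],
            "test": images[n_train + n_val:],    # remaining ~10%
        }
        for name, files in splits.items():
            out_dir = OUT_DIR / name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in files:
                # Resize the original 1242 x 375 frame to 320 x 1024 pixels.
                Image.open(f).resize(TARGET, Image.BILINEAR).save(out_dir / f.name)

    preprocess_and_split()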
Further, step S1 includes the following:
The KITTI data set provides 389 pairs of stereo images and optical flow maps, a 29.2 km visual odometry sequence, 9,300 RGB-D training samples with depth maps, and over 200K images of 3D-annotated objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
Step S2: constructing a self-encoder structure for feature extraction, inputting the original image, calculating the photometric error on the extracted feature maps, and dynamically adjusting parameters to minimize the error between the reconstructed image and the original image;
Further, step S2 includes the following:
Step S2-1: in the traditional U-Net sampling network, the up-sampling part replaces the original transposed convolution with a deconvolution operation, and a 16× sampling level is added on top of U-Net's 2×, 4× and 8× levels;
Step S2-2: adding key modules: following the structure described in step S2-1, a chained residual pooling (CRP) module is added to the decoder network, fusing residual connections with weight learning, and a max pooling layer is added to the Encoder part to constrain the size of the feature map (see the CRP sketch after step S2-3);
Step S2-3: calculating the photometric loss: following the structure described in step S2-2, the photometric error is computed from the output feature maps; learning feature representations through single-view reconstruction facilitates the discrimination of texture-less regions and reflective surfaces (see the loss sketch below).
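By way of illustration of step S2-2, the following is a minimal PyTorch sketch of a chained residual pooling (CRP) block; the chain length, pooling kernel and channel width are assumptions, since the embodiment does not fix them.

    import torch
    import torch.nn as nn

    class CRPBlock(nn.Module):
        # Chained residual pooling: each stage pools and then convolves the
        # previous stage's output, and all stage outputs are summed back onto
        # the input, fusing residual connections with learned weights.
        def __init__(self, channels: int, n_stages: int = 4):
            super().__init__()
            self.stages = nn.ModuleList(
                nn.Sequential(
                    # Max pooling constrains/smooths the feature map without
                    # changing its spatial size (stride 1, padding 2).
                    nn.MaxPool2d(kernel_size=5, stride=1, padding=2),
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                )
                for _ in range(n_stages)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, path = x, x
            for stage in self.stages:
                path = stage(path)   # chain: each stage consumes the previous one
                out = out + path     # residual accumulation onto the input
            return out

    # e.g. applied to a 128-channel decoder feature map:
    # feat = CRPBlock(128)(feat)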
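For step S2-3, the photometric error can be illustrated with the weighted SSIM-plus-L1 form that is standard in self-supervised depth estimation, including Monodepth2; the 0.85/0.15 weighting and the 3×3 averaging window are conventional choices assumed here, not values taken from the embodiment.

    import torch
    import torch.nn.functional as F

    def ssim_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Simplified per-pixel SSIM distance computed over a 3x3 window.
        c1, c2 = 0.01 ** 2, 0.03 ** 2
        mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
        sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
        sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
        num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
        den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
        return torch.clamp((1 - num / den) / 2, 0, 1)

    def photometric_error(reconstructed: torch.Tensor, target: torch.Tensor,
                          alpha: float = 0.85) -> torch.Tensor:
        # Per-pixel error between the view reconstructed from the predicted
        # depth and pose and the original frame; minimized during training.
        l1 = (reconstructed - target).abs()
        return alpha * ssim_distance(reconstructed, target) + (1 - alpha) * l1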
Step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
An attention module is introduced into the existing Monodepth2 network model at the tail of its backbone network; that is, a self-attention module is inserted between the last feature extraction module and the skip connection module in the backbone, adaptively learning the correlations between different positions in the image.
Further, step S3 includes the following:
A deep residual network is used as the backbone for feature extraction; its skip connections allow upper-layer information to be transmitted to lower layers more effectively.
The input picture first passes through a 7×7 convolution layer with 64 channels and a stride of 2 to extract feature information, and the resulting feature map is then downsampled. First, a downsampling layer with 128 output channels reduces the spatial dimension of the feature map while preserving important feature information. After this downsampling layer, 3 blocks are added, comprising a number of residual blocks and 1 downsampling layer, gradually reducing the spatial size of the feature map; a sketch of such an encoder is given below.
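The following is a minimal sketch of such an encoder, assuming ResNet-style residual blocks; the number of blocks per stage and the deeper channel widths (256, 512) are assumptions chosen for consistency with the decoder sketch further on.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            # Skip connection: transmits upper-layer information downward.
            self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride, bias=False)
                         if (stride != 1 or in_ch != out_ch) else nn.Identity())

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.relu(self.body(x) + self.skip(x))

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            # 7x7 convolution, 64 channels, stride 2: extracts low-level features.
            self.stem = nn.Sequential(
                nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
                nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            )
            # First downsampling stage outputs 128 channels; the following
            # stages keep shrinking the spatial size of the feature map.
            self.stages = nn.ModuleList([
                ResidualBlock(64, 128, stride=2),
                ResidualBlock(128, 256, stride=2),
                ResidualBlock(256, 512, stride=2),
            ])

        def forward(self, x: torch.Tensor) -> list:
            feats = [self.stem(x)]
            for stage in self.stages:
                feats.append(stage(feats[-1]))
            return feats  # multi-scale features (64/128/256/512 channels)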
After the final convolution layer of the Encoder, a Self-Attention module is inserted to enhance the accuracy of the feature representation. It comprises two stages:
Stage I: a 7×7 convolution kernel is applied to the original image, and the result is projected through three 1×1 convolutions to obtain an intermediate feature set containing 3×N feature maps;
Stage II: the intermediate features are grouped into N groups, each containing 3 feature maps serving as query, key and value respectively, following the conventional multi-head self-attention model. A lightweight fully connected layer and a grouped convolution then process the result, finally yielding N feature maps that form part of the Encoder output. A sketch of this module follows.
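The following is a minimal sketch of this Self-Attention module, assuming standard scaled dot-product attention over all spatial positions; the 1×1 query/key/value projections and the grouped output convolution follow Stages I and II above, while the head count N and the residual connection are assumptions.

    import torch
    import torch.nn as nn

    class SelfAttention2d(nn.Module):
        # Stage I: three 1x1 projections produce a 3xN set of feature maps;
        # Stage II: the maps are grouped into N (query, key, value) triples,
        # attended per group, and fused with a grouped 1x1 convolution.
        def __init__(self, channels: int, n_heads: int = 8):
            super().__init__()
            assert channels % n_heads == 0
            self.n_heads = n_heads
            self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
            self.proj = nn.Conv2d(channels, channels, kernel_size=1,
                                  groups=n_heads, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            d = c // self.n_heads
            q, k, v = self.qkv(x).chunk(3, dim=1)
            # Reshape to (b, heads, h*w, d) so attention relates all positions.
            def split(t):
                return t.view(b, self.n_heads, d, h * w).transpose(2, 3)
            q, k, v = split(q), split(k), split(v)
            attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
            out = (attn @ v).transpose(2, 3).reshape(b, c, h, w)
            return x + self.proj(out)  # residual: original features preserved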
Decoder section: comprises 4 blocks made up of several deconvolution layers, convolution layers and skip connections. The first block contains 1 deconvolution layer with 256 output channels; the other 3 blocks have 128, 64 and 32 output channels respectively. The last layer is a 1×1 convolution layer with 1 output channel for predicting the depth map, as in the sketch below.
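The following is a minimal sketch of this decoder, consistent with the encoder sketch above; how the skip features are merged is not specified in the embodiment, so element-wise addition is assumed, and the final sigmoid producing a relative (disparity-like) depth map is likewise an assumption.

    import torch
    import torch.nn as nn

    class DecoderBlock(nn.Module):
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            # Deconvolution doubles the spatial size of the feature map.
            self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
            self.conv = nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ELU(inplace=True),
            )

        def forward(self, x, skip=None):
            x = self.up(x)
            if skip is not None:
                x = x + skip  # skip connection (addition assumed)
            return self.conv(x)

    class DepthDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            # Output channels per block: 256, 128, 64, 32, as described above.
            self.blocks = nn.ModuleList([
                DecoderBlock(512, 256), DecoderBlock(256, 128),
                DecoderBlock(128, 64), DecoderBlock(64, 32),
            ])
            self.head = nn.Conv2d(32, 1, kernel_size=1)  # 1-channel depth map

        def forward(self, feats):
            # feats: multi-scale encoder outputs, deepest last (see Encoder above).
            x, skips = feats[-1], list(feats[:-1])
            for block in self.blocks:
                skip = skips.pop() if skips else None
                x = block(x, skip)
            return torch.sigmoid(self.head(x))  # relative depth in (0, 1)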
Step S4: performing scale recovery on the relative depth output by the model, converting the relative depth into absolute depth.
Further, step S4 includes the following:
The depth map output by the model is stored in uint16 format; for scale recovery, the values read from the depth map are divided by 256 to obtain the real distance value. A sketch of this step follows.
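Since the scale recovery is a single division, it can be sketched as follows, assuming the prediction was saved as a 16-bit PNG (the file name is illustrative):

    import numpy as np
    from PIL import Image

    def read_depth_meters(path: str) -> np.ndarray:
        # The network output is stored as uint16 scaled by 256, so dividing
        # the raw values by 256 recovers the real distance in meters.
        raw = np.asarray(Image.open(path)).astype(np.float32)
        depth = raw / 256.0
        depth[raw == 0] = np.nan  # zero pixels treated as invalid (assumed convention)
        return depth

    # e.g. distance at the pixel (v, u) of a detected vehicle:
    # d = read_depth_meters("predictions/000042.png")[v, u]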
Compared with the prior art, the invention improves the Monodepth2 network structure by adding a chained residual pooling (CRP) module and a Self-Attention module to the decoder network, so that the model focuses on important feature regions and its performance improves. The introduction of multiple encoder-decoder structures mitigates the estimation errors that occur on texture-less regions and reflective surfaces, as well as the inaccurate depth estimates caused by moving objects and motion blur in the image, ultimately improving the accuracy of the model's distance estimates and its robustness.
The present invention is not limited to the above embodiments; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed herein are intended to fall within the scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A self-supervised monocular vehicle distance detection method, characterized by comprising the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a test set;
step S2: constructing a self-encoder structure for feature extraction, inputting the original image, calculating the photometric error on the extracted feature maps, and dynamically adjusting parameters to minimize the error between the reconstructed image and the original image;
step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
step S4: performing scale recovery on the relative depth output by the model, converting the relative depth into absolute depth.
2. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S1 comprises the following:
performing data enhancement on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples; constructing the vehicle distance detection training and test data sets from the real captured images of the KITTI data set, wherein the original image size is 1242 × 375 pixels; preprocessing the images by resizing them to 320 × 1024 pixels; and dividing the data set according to the ratio train:val:test = 8:1:1.
3. The self-supervised monocular vehicle distance detection method according to claim 2, wherein step S1 comprises the following:
the KITTI data set provides 389 pairs of stereo images and optical flow maps, a 29.2 km visual odometry sequence, 9,300 RGB-D training samples with depth maps, and over 200K images of 3D-annotated objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
4. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S2 comprises the following:
step S2-1: in the traditional U-Net sampling network, replacing the original transposed convolution of the up-sampling part with a deconvolution operation, and adding a 16× sampling level on top of U-Net's 2×, 4× and 8× levels;
step S2-2: adding key modules: following the structure described in step S2-1, adding a chained residual pooling (CRP) module to the decoder network, fusing residual connections with weight learning, and adding a max pooling layer to the Encoder part to constrain the size of the feature map;
step S2-3: calculating the photometric loss: following the structure described in step S2-2, computing the photometric error from the output feature maps, wherein learning feature representations through single-view reconstruction facilitates the discrimination of texture-less regions and reflective surfaces.
5. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S3 comprises the following:
an attention module is introduced into the existing Monodepth2 network model at the tail of its backbone network; that is, a self-attention module is inserted between the last feature extraction module and the skip connection module in the backbone, adaptively learning the correlations between different positions in the image.
6. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S4 comprises the following:
the depth map output by the model is stored in uint16 format; for scale recovery, the values read from the depth map are divided by 256 to obtain the real distance value.
CN202311049975.9A 2023-08-21 2023-08-21 Self-supervised monocular vehicle distance detection method Pending CN117079237A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311049975.9A CN117079237A (en) 2023-08-21 2023-08-21 Self-supervision monocular vehicle distance detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311049975.9A CN117079237A (en) 2023-08-21 2023-08-21 Self-supervision monocular vehicle distance detection method

Publications (1)

Publication Number Publication Date
CN117079237A true CN117079237A (en) 2023-11-17

Family

ID=88711094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311049975.9A Pending CN117079237A (en) 2023-08-21 2023-08-21 Self-supervision monocular vehicle distance detection method

Country Status (1)

Country Link
CN (1) CN117079237A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422751A (en) * 2023-12-19 2024-01-19 中科华芯(东莞)科技有限公司 Non-motor vehicle safe driving auxiliary method, system and electronic equipment
CN117422751B (en) * 2023-12-19 2024-03-26 中科华芯(东莞)科技有限公司 Non-motor vehicle safe driving auxiliary method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination