CN117079237A - Self-supervision monocular vehicle distance detection method - Google Patents
- Publication number
- CN117079237A CN117079237A CN202311049975.9A CN202311049975A CN117079237A CN 117079237 A CN117079237 A CN 117079237A CN 202311049975 A CN202311049975 A CN 202311049975A CN 117079237 A CN117079237 A CN 117079237A
- Authority
- CN
- China
- Prior art keywords
- network
- data set
- image
- depth
- self
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/467—Encoded features or binary features, e.g. local binary patterns [LBP]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a self-supervised monocular vehicle distance detection method, comprising the following steps. Step S1: download and process the KITTI data set, and randomly divide it into a training set and a testing set. Step S2: construct a self-encoder structure for feature extraction, input the original image, compute photometric errors on the extracted feature maps, and dynamically adjust parameters to minimize the error between the reconstructed image and the original image. Step S3: construct a multidimensional model, building a pose network and a depth prediction network, and add an attention mechanism. Step S4: perform scale recovery on the relative depth output by the model, converting the relative depth into absolute depth. By improving the Monodepth2 network structure, the invention adds a CRP (chained residual pooling) module and an attention module to the decoder network, so that the model focuses on important feature regions, thereby improving model performance.
Description
Technical Field
The invention belongs to the technical field of vehicle distance estimation, and particularly relates to a self-supervision monocular vehicle distance detection method.
Background
In the fields of vehicle driving and traffic safety, accurate estimation of the distance between a vehicle and a camera is critical to driving assistance systems and intelligent transportation systems. With the rapid development of computer vision and deep learning, vehicle distance detection based on monocular images has become a solution with wide application potential. Traditional supervised learning methods require a large amount of labeled data, and accurate vehicle distance information is difficult to acquire. In addition, the cost of data labeling and the time cost of training are among the factors limiting their application. Moreover, existing self-supervised algorithms cannot handle scenes containing transparent, reflective, and low-texture regions; such scenes lack explicit depth cues, making it difficult for depth estimation algorithms to accurately infer the depth of these regions. Therefore, a self-supervised monocular vehicle distance detection method is needed to achieve accurate vehicle distance estimation, thereby improving the performance of driving assistance systems and traffic safety, and further research and technical innovation are needed to address these challenges.
Disclosure of Invention
In order to overcome the above defects in the prior art, the invention provides a self-supervised monocular vehicle distance detection method.
In order to achieve the above object, the technical scheme adopted to solve the technical problem is as follows:
A self-supervised monocular vehicle distance detection method comprises the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a testing set;
step S2: constructing a self-encoder structure for extracting features, inputting an original image, calculating luminosity errors in the extracted feature image, and dynamically adjusting parameters to minimize errors between a reconstructed image and the original image;
step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
step S4: and (5) performing scale recovery on the relative depth output by the model, and converting the relative depth into absolute depth.
Further, step S1 includes the following:
and carrying out data enhancement on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples, constructing a vehicle distance detection training data set and a test data set by adopting a real shooting image of the KITTI data set, wherein the original size of the data set is 1242 x 375 pixels, preprocessing the image, compressing the image to 320 x 1024 pixels, and dividing the data set according to the ratio of train: val: test=8:1:1.
Further, step S1 includes the following:
The data set contains 389 stereo image and optical flow map pairs, a 29.2 km visual odometry sequence, 9300 RGBD training samples with depth maps, and more than 200K images of 3D-labeled objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
Further, step S2 includes the following:
step S2-1: in the traditional U-Net sampling network, the up-sampling part converts the original transpose convolution into deconvolution operation, and the number of sampling layers is increased by 16 times on the basis of 2 times, 4 times and 8 times of the number of the U-Net sampling layers;
step S2-2: adding a key module, adding a CRP block chain residue pooling module in a decoder network according to the structure described in the step S2-1, fusing residual connection and weight learning, and adding a maximum pooling layer in an Encoder part for restraining the size of a feature map;
step S2-3: calculating photometric losses, calculating photometric errors from the output feature map according to the structure described in step S2-2, using single view reconstruction to learn the feature representation will facilitate discrimination of non-textured areas as well as surfaces that are illuminated for reflection.
Further, step S3 includes the following:
An attention module is introduced into the existing Monodepth2 network model. The attention module is added at the tail of the Backbone network of Monodepth2; that is, a self-attention module is inserted between the last feature extraction module and the skip-connection module in the Backbone network, adaptively learning the relevance between different positions in the image.
Further, step S4 includes the following:
The depth map output by the model is stored in uint16 format. Scale recovery is performed by dividing the values read from the depth map by 256 to obtain the true distance values.
Compared with the prior art, the invention has the following advantages and positive effects due to the adoption of the technical scheme:
the invention discloses a self-supervision monocular distance detection method which is mainly used for solving the problem of inaccurate distance when a monocular camera is used for estimating the distance of a vehicle. Previous methods often have estimation errors for non-textured areas and surfaces that are reflective to illumination. When there are moving objects and motion blur in the image, the model may create a problem of inaccurate estimated depth. By adding additional codec structures to capture semantic information of the input image from multiple dimensions, optimal photometric errors are calculated, and the network learns to a consistent feature representation, thereby optimizing the conditions of non-texture and inaccurate distance estimation of the illumination-reflected surface. And an adaptive attention mechanism is added in the network structure for dynamically adjusting the attention weight, so that the network is focused on the tested vehicle, and the robustness and generalization capability of the model are improved. After the model is determined, the KITTI data set is downloaded for training the model, and the trained model is inferred and compared with the real depth data with the depth value, so that the model effect is further optimized to meet the requirements of practical application.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below. In the accompanying drawings:
FIG. 1 is a flow chart of the self-supervised monocular vehicle distance detection method of the present invention;
FIG. 2 is a schematic diagram of the network model structure of the present invention;
FIG. 3 is a schematic diagram of the CRP block structure of the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in FIG. 1, this embodiment discloses a self-supervised monocular vehicle distance detection method comprising the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a testing set;
further, step S1 includes the following:
and carrying out data enhancement on the downloaded KITTI data set, transforming and expanding the data to generate diversified training samples, constructing a vehicle distance detection training data set and a test data set by adopting a real shooting image of the KITTI data set, wherein the original size of the data set is 1242 x 375 pixels, preprocessing the image, compressing the image to 320 x 1024 pixels, and dividing the data set according to the ratio of train: val: test=8:1:1.
Further, step S1 includes the following:
The data set contains 389 stereo image and optical flow map pairs, a 29.2 km visual odometry sequence, 9300 RGBD training samples with depth maps, and more than 200K images of 3D-labeled objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
Step S2: constructing a self-encoder structure for extracting features, inputting an original image, calculating luminosity errors in the extracted feature image, and dynamically adjusting parameters to minimize errors between a reconstructed image and the original image;
further, step S2 includes the following:
step S2-1: in the traditional U-Net sampling network, the up-sampling part converts the original transpose convolution into deconvolution operation, and the number of sampling layers is increased by 16 times on the basis of 2 times, 4 times and 8 times of the number of the U-Net sampling layers;
step S2-2: adding a key module, adding a CRP block chain residue pooling module in a decoder network according to the structure described in the step S2-1, fusing residual connection and weight learning, and adding a maximum pooling layer in an Encoder part for restraining the size of a feature map;
step S2-3: calculating photometric losses, calculating photometric errors from the output feature map according to the structure described in step S2-2, using single view reconstruction to learn the feature representation will facilitate discrimination of non-textured areas as well as surfaces that are illuminated for reflection.
Step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
An attention module is introduced into the existing Monodepth2 network model. The attention module is added at the tail of the Backbone network of Monodepth2; that is, a self-attention module is inserted between the last feature extraction module and the skip-connection module in the Backbone network, adaptively learning the relevance between different positions in the image.
Further, step S3 includes the following:
the depth residual error network is used as a backhaul for feature extraction, and can realize jump connection and better transmit upper network information to a lower network.
The input picture is convolved by a 7×7 convolution layer with 64 channels and stride 2 to extract the picture's feature information, and the resulting feature map is then downsampled. First, a downsampling layer with 128 output channels is used to reduce the spatial dimension of the feature map while preserving important feature information. After this downsampling layer, 3 blocks are added, each comprising several residual blocks and 1 downsampling layer, gradually reducing the spatial size of the feature map.
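A minimal sketch of the encoder stem just described (PyTorch assumed; the batch-norm layers and the use of a strided 3×3 convolution as the 128-channel downsampling layer are illustrative simplifications of the residual blocks, not the exact network):

```python
import torch
import torch.nn as nn

# Stem matching the description: 7x7 conv, 64 channels, stride 2,
# followed by a downsampling layer with 128 output channels.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),  # downsampling layer
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 320, 1024)  # preprocessed KITTI resolution from step S1
feat = stem(x)                     # spatial size reduced 4x
```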
After the final convolution layer of the Encoder, a self-attention module is inserted to enhance the accuracy of the feature representation. It specifically comprises two stages:
Stage I: a convolution operation is performed on the original image with a 7×7 convolution kernel, and the result is projected through 3 separate 1×1 convolutions to obtain an intermediate feature set containing 3×N feature maps;
Stage II: the intermediate features are grouped into N groups, each containing 3 feature maps serving as the queries, keys and values, respectively, following the conventional multi-head self-attention model. A lightweight fully connected layer and grouped convolution are used for processing, finally producing N feature maps as part of the feature maps output by the Encoder.
Decoder section: comprises 4 blocks made up of several deconvolution layers, convolution layers and skip connections. The first block contains 1 deconvolution layer with 256 output channels; the remaining 3 blocks have 128, 64 and 32 output channels, respectively. The last layer is a 1×1 convolution layer with 1 output channel for predicting the depth map.
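The four-block decoder described above can be sketched as follows (PyTorch assumed). The 512-channel encoder output, the 4×4 stride-2 deconvolution kernels, the omission of skip connections, and the sigmoid activation producing a normalized relative depth are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Four blocks with 256/128/64/32 output channels, then a 1x1 conv to
# a single-channel depth prediction.
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),
    nn.Conv2d(32, 1, kernel_size=1),  # single-channel depth map
    nn.Sigmoid(),                     # relative (normalized) depth in [0, 1]
)

depth = decoder(torch.randn(1, 512, 10, 32))  # each deconv doubles H and W
```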
Step S4: performing scale recovery on the relative depth output by the model, and converting the relative depth into absolute depth.
Further, step S4 includes the following:
The depth map output by the model is stored in uint16 format. Scale recovery is performed by dividing the values read from the depth map by 256 to obtain the true distance values.
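The uint16 scale recovery of step S4 can be sketched as below, assuming the KITTI convention of storing depth × 256 in a 16-bit image (so dividing by 256 recovers metres, with 0 marking invalid pixels):

```python
import numpy as np

def recover_depth_m(depth_png: np.ndarray) -> np.ndarray:
    """Convert a uint16 depth map to float depth in metres by dividing
    the raw stored values by 256 (0 = no measurement)."""
    assert depth_png.dtype == np.uint16
    return depth_png.astype(np.float32) / 256.0
```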
Compared with the prior art, by improving the Monodepth2 network structure the invention adds a CRP (chained residual pooling) module and an attention (self-attention) module to the decoder network, so that the model focuses on important feature regions, thereby improving model performance. The introduced codec structures mitigate the estimation errors that occur on non-textured areas and specular (illumination-reflecting) surfaces, as well as the inaccurate depth estimates that arise when moving objects and motion blur are present in the image, ultimately improving the accuracy of the model's distance estimates and its robustness.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (6)
1. A self-supervised monocular vehicle distance detection method, characterized by comprising the following steps:
step S1: downloading and processing the KITTI data set, and randomly dividing the data set into a training set and a testing set;
step S2: constructing a self-encoder structure for extracting features, inputting an original image, calculating photometric errors on the extracted feature maps, and dynamically adjusting parameters to minimize the error between the reconstructed image and the original image;
step S3: constructing a multidimensional model, constructing a pose network and a depth prediction network, and adding an attention mechanism;
step S4: performing scale recovery on the relative depth output by the model, and converting the relative depth into absolute depth.
2. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S1 comprises the following:
data enhancement is carried out on the downloaded KITTI data set: the data are transformed and expanded to generate diversified training samples; a vehicle distance detection training data set and test data set are constructed from the real captured images of the KITTI data set; the original image size is 1242×375 pixels; the images are preprocessed and compressed to 320×1024 pixels, and the data set is divided at the ratio train:val:test = 8:1:1.
3. The self-supervised monocular vehicle distance detection method according to claim 2, wherein step S1 comprises the following:
the data set contains 389 stereo image and optical flow map pairs, a 29.2 km visual odometry sequence, 9300 RGBD training samples with depth maps, and more than 200K images of 3D-labeled objects, together with point cloud data sampled and synchronized at a frequency of 10 Hz.
4. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S2 comprises the following:
step S2-1: based on the traditional U-Net sampling network, the up-sampling part uses transposed convolution (deconvolution) operations, and a 16× sampling level is added on top of the 2×, 4× and 8× sampling levels of U-Net;
step S2-2: adding key modules: on the structure described in step S2-1, a CRP (chained residual pooling) module is added to the decoder network, fusing residual connections and weight learning, and a max-pooling layer is added to the Encoder part to constrain the size of the feature maps;
step S2-3: calculating photometric losses: with the structure described in step S2-2, photometric errors are calculated from the output feature maps; using single-view reconstruction to learn the feature representation facilitates the discrimination of non-textured areas as well as specular (illumination-reflecting) surfaces.
5. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S3 comprises the following:
an attention module is introduced into the existing Monodepth2 network model; the attention module is added at the tail of the Backbone network of Monodepth2, that is, a self-attention module is inserted between the last feature extraction module and the skip-connection module in the Backbone network, adaptively learning the relevance between different positions in the image.
6. The self-supervised monocular vehicle distance detection method according to claim 1, wherein step S4 comprises the following:
the depth map output by the model is stored in uint16 format; scale recovery is performed by dividing the values read from the depth map by 256 to obtain the true distance values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311049975.9A CN117079237A (en) | 2023-08-21 | 2023-08-21 | Self-supervision monocular vehicle distance detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117079237A true CN117079237A (en) | 2023-11-17 |
Family
ID=88711094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311049975.9A Pending CN117079237A (en) | 2023-08-21 | 2023-08-21 | Self-supervision monocular vehicle distance detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117079237A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422751A (en) * | 2023-12-19 | 2024-01-19 | 中科华芯(东莞)科技有限公司 | Non-motor vehicle safe driving auxiliary method, system and electronic equipment |
CN117422751B (en) * | 2023-12-19 | 2024-03-26 | 中科华芯(东莞)科技有限公司 | Non-motor vehicle safe driving auxiliary method, system and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112347859A (en) | Optical remote sensing image saliency target detection method | |
CN111832453B (en) | Unmanned scene real-time semantic segmentation method based on two-way deep neural network | |
CN112329780B (en) | Depth image semantic segmentation method based on deep learning | |
CN110020658B (en) | Salient object detection method based on multitask deep learning | |
CN114943963A (en) | Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network | |
CN116342596B (en) | YOLOv5 improved substation equipment nut defect identification detection method | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN117079237A (en) | Self-supervision monocular vehicle distance detection method | |
CN115713679A (en) | Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map | |
CN116229106A (en) | Video significance prediction method based on double-U structure | |
CN115035171A (en) | Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion | |
CN115908772A (en) | Target detection method and system based on Transformer and fusion attention mechanism | |
CN115861756A (en) | Earth background small target identification method based on cascade combination network | |
CN114092824A (en) | Remote sensing image road segmentation method combining intensive attention and parallel up-sampling | |
CN116205962A (en) | Monocular depth estimation method and system based on complete context information | |
CN116883912A (en) | Infrared dim target detection method based on global information target enhancement | |
CN116310916A (en) | Semantic segmentation method and system for high-resolution remote sensing city image | |
CN116612283A (en) | Image semantic segmentation method based on large convolution kernel backbone network | |
CN115797684A (en) | Infrared small target detection method and system based on context information | |
CN116703885A (en) | Swin transducer-based surface defect detection method and system | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN113920317B (en) | Semantic segmentation method based on visible light image and low-resolution depth image | |
CN115240163A (en) | Traffic sign detection method and system based on one-stage detection network | |
CN115424187B (en) | Auxiliary driving method for multi-angle camera collaborative importance ranking constraint | |
CN116612288B (en) | Multi-scale lightweight real-time semantic segmentation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||