WO2024060605A1 - Multi-task panoramic driving perception method and system based on improved YOLOv5 - Google Patents
Multi-task panoramic driving perception method and system based on improved YOLOv5
- Publication number
- WO2024060605A1 (PCT/CN2023/089631, CN2023089631W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature map
- network
- task
- module
- feature
- Prior art date
Classifications
- G06V20/56 — Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/588 — Recognition of the road, e.g. of lane markings; recognition of the vehicle driving pattern in relation to the road
- G06V10/20 — Image preprocessing
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/40 — Extraction of image or video features
- G06V10/454 — Local feature extraction; integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the invention belongs to the field of automatic driving technology, and specifically relates to a multi-task panoramic driving perception method and system based on improved YOLOv5.
- Deep learning is critical to recent advances in many fields, especially autonomous driving.
- Many deep learning applications in autonomous vehicles are in their perception systems.
- the perception system can extract visual information from images captured by a monocular camera placed on the vehicle and help the vehicle's decision-making system make good driving decisions to control vehicle behavior.
- in order for the vehicle to drive safely on the road while obeying traffic rules, the visual perception system should be able to process the surrounding scene information in real time and then help the decision-making system make judgments, including the location of obstacles, whether the road is drivable, the location of lanes, and so on. Therefore, the panoramic driving perception algorithm must cover the three most critical tasks of traffic target detection, drivable area segmentation and lane line detection.
- multi-task networks which can process multiple tasks at the same time instead of processing them one by one to speed up the image analysis process.
- the network can also share information between multiple tasks, which may improve the performance of each task.
- multi-task networks usually share the same backbone network for feature extraction.
- some researchers have proposed the instance segmentation algorithm Mask R-CNN, which jointly detects objects and segments instances and achieves state-of-the-art performance on each task. However, it cannot be directly applied to the field of intelligent transportation because the network cannot detect drivable areas and lane lines.
- some researchers have proposed the MultiNet network structure, which consists of a shared backbone network and three separate branch networks for classification, target detection and semantic segmentation. It performs well on these tasks and reaches the state of the art on the KITTI drivable area segmentation task. However, in a panoramic driving perception system the classification task is not as important as lane detection.
- some researchers have proposed the DLT-Net network structure, which combines traffic target detection, drivable area segmentation and lane line detection, and proposes a context tensor to fuse feature maps between the branch networks so as to share mutual information. Although its performance is competitive, it cannot achieve real-time operation.
- Some researchers have built an efficient multi-task network (YOLOP) for the panoramic driving perception system.
- the network includes target detection, drivable area segmentation and lane detection tasks. It can be deployed on the embedded device JetsonTX2 through TensorRT to achieve real-time performance. Although it has reached an advanced level in terms of real-time performance and high accuracy, its three-branch network is used to process three different tasks respectively, which increases the inference time of the network.
- the purpose of this invention is to provide a multi-task panoramic driving perception method and system based on improved YOLOv5 in view of the shortcomings of the existing technology, which can process the scene information around the vehicle in real time and with high precision, and help the vehicle's decision-making system make judgments. It can simultaneously complete the three tasks of traffic target detection, drivable area segmentation and lane line detection.
- the present invention is implemented using the following technical solutions.
- the present invention provides a multi-task panoramic driving perception method based on improved YOLOv5, including:
- the YOLOv4 image preprocessing method is used to preprocess each frame of the video captured by the vehicle camera to obtain an input image
- the features of the input image are extracted using the improved YOLOv5 backbone network to obtain a feature map;
- the improved YOLOv5 backbone network is obtained by replacing the C3 module in the YOLOv5 backbone network with an inverted residual bottleneck module, and the inverted residual bottleneck module consists of x inverted residual bottleneck component structures, where x is a natural number;
- the inverted residual bottleneck component structure consists of three layers, the first layer is a convolution component, which maps the low-dimensional space to high-dimensional space for dimension expansion;
- the second layer is a depth-separable convolution layer, which uses depth-separable convolution for spatial filtering;
- the third layer is a convolution component, which maps high-dimensional space to low-dimensional space;
- the feature map obtained by the improved YOLOv5 backbone network is input into the neck network, where the feature map obtained through the spatial pyramid pooling SPP network and the feature pyramid network FPN is fused with the feature map obtained by the improved YOLOv5 backbone network to obtain a fused feature map;
- the fused feature map is input to the detection head, and a multi-scale fused feature map is obtained through the path aggregation network PAN.
- the YOLOv4 anchor-based multi-scale detection scheme is used for the multi-scale fused feature map to detect traffic targets;
- the bottom feature map in the feature map obtained through the spatial pyramid pooling SPP network and the feature pyramid network FPN is input to the branch network, and the branch network is used to detect lane lines and segment drivable areas.
- the image preprocessing also includes adjusting each frame of the video captured by the vehicle camera from an image of width × height × number of channels of 1280 × 720 × 3 to an image of width × height × number of channels of 640 × 384 × 3.
- three inverted residual bottleneck modules are used in the backbone network of the improved YOLOv5;
- the first inverted residual bottleneck module is CSPI_1, which is composed of the convolution component Conv and an inverted residual bottleneck component structure through the Concat operation;
- the second inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through the Concat operation;
- the third inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through the Concat operation;
- the convolution component Conv consists of the conv function, the Bn function, and the SiLU function;
- the improved YOLOv5 backbone network is used to extract the features of the input image, and the obtained feature maps include feature map out1, feature map out2 and feature map out3;
- the feature map out1 is a feature map obtained after the preprocessed image undergoes the Focus operation and then the Conv and CSPI_1 operations, and then the Conv and CSPI_3 operations;
- the feature map out2 is the feature map obtained by the feature map out1 after Conv and CSPI_3 operations;
- the feature map out3 is the feature map obtained after the feature map out2 undergoes the Conv operation.
- in the feature pyramid network FPN, the feature map input from the spatial pyramid pooling SPP network passes through an inverted residual bottleneck module and then a Conv operation to obtain the high-level feature map f3, which is output to the detection head;
- the high-level feature map f3 is upsampled and then concatenated with the feature map out2; the resulting feature map passes through an inverted residual bottleneck module and then a Conv operation to obtain the mid-level feature map f2, which is output to the detection head;
- the mid-level feature map f2 is upsampled and then concatenated with the feature map out1 to obtain the bottom feature map f1, which is output to the detection head.
- the branch network consists of four convolution components, three BottleneckCSP modules and three upsampling layers;
- using the branch network for lane line detection and drivable area segmentation includes: after the bottom feature map f1 of the feature pyramid network FPN passes through the three upsampling layers in the branch network, it is restored to a feature map of size W × H × 4, where W is the width of the input image, H is the height of the input image, the feature points in the feature map correspond one-to-one to the pixels in the input image, and 4 means that each feature point in the feature map has four values;
- the branch network splits the feature map of size W × H × 4 into two feature maps of size W × H × 2, where one feature map of size W × H × 2 represents, for each pixel in the input image, the probability of the drivable area versus the background and is used to predict the drivable area, the predicted drivable area being taken as the result of drivable area segmentation; the other feature map of size W × H × 2 represents, for each pixel in the input image, the probability of the lane line versus the background and is used to predict the lane lines, the predicted lane lines being taken as the result of lane line detection.
- a nearest interpolation method is used in the upsampling layer to perform upsampling processing.
- the present invention also provides a multi-task panoramic driving perception system based on improved YOLOv5 to implement the above multi-task panoramic driving perception method based on improved YOLOv5, including:
- the human-computer interaction module is used to provide a reserved input interface and obtain input data in the correct format
- the multi-task detection module is used to complete the three tasks of traffic target detection, lane line detection and drivable area segmentation respectively based on the input data obtained by the human-computer interaction module, and output the results of traffic target detection, lane line detection and drivable area segmentation to the display module;
- the display module displays the input data and the results of traffic target detection, lane line detection and drivable area segmentation output by the multi-task detection module.
- the multi-task panoramic driving perception system based on improved YOLOv5 also includes:
- the traffic target detection module is used to complete the traffic target detection task and output the traffic target detection results, traffic target categories and traffic target detection accuracy to the display module;
- the lane line detection module is used to complete the lane line detection task and output the lane line detection results and lane line detection accuracy to the display module;
- the drivable area segmentation module is used to complete the drivable area segmentation task and output the drivable area segmentation results to the display module;
- the display module can display traffic target categories, traffic target detection accuracy, or lane line detection accuracy.
- the present invention also provides a multi-task panoramic driving perception device based on improved YOLOv5.
- the device includes a memory and a processor; the memory stores a computer program that implements the above multi-task panoramic driving perception method based on improved YOLOv5, and the processor executes the computer program to implement the steps of the above method.
- the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above method are implemented.
- the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention adopt the multi-task panoramic driving perception algorithm framework DP-YOLO (Driving perception-YOLO) based on the YOLOv5 network structure, and use an end-to-end network to achieve real-time, high-precision traffic target detection, drivable area segmentation and lane line detection.
- the present invention's multi-task panoramic driving perception method and system based on improved YOLOv5 designs an inverted residual bottleneck module (CSPI_x module), replacing the original C3 module in the YOLOv5 backbone network with an inverted residual bottleneck module.
- the inverted residual bottleneck module (CSPI_x module) is composed of x inverted residual bottleneck component structures, where x is a natural number.
- the CSPI_x module maps the features of the base layer into two parts, and then merges them through a cross-stage hierarchical structure. This can greatly reduce the calculation amount of the backbone network and improve the running speed of the backbone network, while the accuracy remains basically unchanged.
- for systems with high real-time requirements, the inverted residual bottleneck module allows a particularly memory-efficient management scheme, thereby improving the recognition accuracy of the network model.
- the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention design a branch network, which consists of four convolution components (Conv), three BottleneckCSP modules and three upsampling layers.
- This branch network can train the two tasks of drivable area segmentation and lane line detection at the same time.
- the BottleneckCSP module is used to strengthen the network's feature fusion ability and improve detection accuracy; the bottom feature map output by the feature pyramid network FPN is input into the drivable area segmentation branch network, and the bottom layer of the FPN has strong semantic information and high-resolution information that is beneficial to positioning.
- the nearest interpolation method is used in the upsampling layer to perform upsampling processing to reduce computational costs.
- the branch network of the present invention not only obtains high-precision output, but also reduces its inference time, thereby increasing the speed of feature extraction by the branch network while ensuring little impact on accuracy.
- the present invention provides a multi-task panoramic driving perception system based on improved YOLOv5, which facilitates display of the results of traffic target detection, lane line detection, and drivable area segmentation based on the multi-task panoramic driving perception method of improved YOLOv5.
- the multi-task panoramic driving perception method and system based on the improved YOLOv5 of the present invention can simultaneously perform the three tasks of traffic target detection, drivable area segmentation and lane line detection, and have higher inference speed and detection accuracy than other existing methods; they can better process the scene information around the vehicle and then help the vehicle's decision-making system make judgments, and have good practical feasibility.
- Figure 1 is a flow chart of the method of the present invention.
- Figure 2 is a schematic structural diagram of a network model according to an embodiment of the present invention.
- FIG 3 is a schematic structural diagram of the inverted residual bottleneck module according to the embodiment of the present invention, in which (a) is the inverted residual bottleneck module (CSPI_x module), and (b) is the inverted residual bottleneck component structure (Invert Bottleneck).
- Figure 4 is a schematic diagram of the changes in the size and channel number of the feature map when the input image passes through the backbone network according to the embodiment of the present invention.
- Figure 5 is a schematic diagram of the change in size and channel number of the feature map when it passes through the neck network according to the embodiment of the present invention.
- Figure 6 is a schematic structural diagram of a branch network model according to an embodiment of the present invention.
- One embodiment of the present invention is a panoramic driving perception method based on improved YOLOv5, which is a simple and efficient detection method (DP-YOLO, Driving perception-YOLO).
- the hardware conditions and related software configurations for this embodiment are as follows:
- the operating system version of the experimental machine is CentOS Linux release 7.6.1810
- the CPU model is HygonC86 7185 32-core Processor CPU@2.0GHz
- the GPU model is NVIDIA Tesla T4
- the video memory size is 16GB
- the memory size is 50GB.
- the program code is implemented using Python3.8 and Pytorch 1.9, and cuda 11.2 and cudnn 7.6.5 are used to accelerate the GPU.
- the number of model iterations is set to 200, and the input data amount of each batch is 24, which means that 24 training samples are taken from the training set for each training.
- the initial learning rate is 0.01, and the momentum and weight decay are set to 0.937 and 0.0005 respectively.
- the learning rate is adjusted through warm-up and cosine annealing to make the model converge faster and better.
- the panoramic driving perception method based on improved YOLOv5 in this embodiment includes the following steps:
- the present invention uses the picture preprocessing method of YOLOv4 to perform picture preprocessing on each frame of image in the video collected by the vehicle camera to obtain the input image.
- the image preprocessing method of YOLOv4 is used to eliminate irrelevant information in the original image, restore useful real information, enhance the detectability of relevant information and simplify the data to the greatest extent, thereby improving the reliability of feature extraction, image segmentation, matching and recognition.
- the BDD 100K data set is selected to train and evaluate the network model of the present invention.
- the BDD 100K data set is divided into three parts, namely a training set of 70K images, a verification set of 10K images, and a test set of 20K images. Since the labels of the test set are not public, the network model is evaluated on the validation set.
- to save memory, each frame of the image in the BDD 100K data set is also adjusted from an image of width × height × number of channels of 1280 × 720 × 3 to an image of width × height × number of channels of 640 × 384 × 3, where width and height are in pixels.
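As a concrete illustration of the resize step described above, the following is a minimal sketch; the use of OpenCV and the interpolation mode are illustrative assumptions, and the full YOLOv4 preprocessing pipeline (augmentation, normalization, etc.) is not reproduced here.

```python
import cv2
import numpy as np

def resize_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Resize a 1280x720x3 camera frame to the 640x384x3 network input.

    Sketch only: the patent specifies the 1280x720 -> 640x384 resize;
    the interpolation choice below is an assumption.
    """
    return cv2.resize(frame_bgr, (640, 384), interpolation=cv2.INTER_LINEAR)
```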
- Feature extraction that is, using the backbone network based on improved YOLOv5 to extract features of the input image.
- the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention adopt an improved YOLOv5 backbone network, in which the original C3 module in the YOLOv5 backbone network is replaced with an inverted residual bottleneck module (CSPI_x module).
- the inverted residual bottleneck module (CSPI_x module) consists of x inverted residual bottleneck component structures (InvertBottleneck), where x is a natural number.
- as shown in (a) of Figure 3, the CSPI_x module in the present invention maps the features of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which can greatly reduce the amount of network computation and improve the network's running speed while the accuracy remains basically unchanged.
- for systems with high real-time requirements, the inverted residual bottleneck module allows a particularly memory-efficient management scheme, thereby improving the recognition accuracy of the network model.
- Three CSPI_x modules are used in the backbone network of this embodiment, as shown in Figure 2.
- the first inverted residual bottleneck module is CSPI_1, which consists of the convolution component Conv and an inverted residual bottleneck component structure through the Concat operation.
- the second inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through the Concat operation.
- the third inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through the Concat operation.
- the convolution component Conv consists of conv function (convolution function), Bn function (normalization function), and SiLU function (activation function).
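A minimal PyTorch sketch of the Conv component just described (conv, then Bn, then SiLU); the kernel size, stride and padding defaults are assumptions, since only the layer composition is specified in the text.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Convolution component: conv -> BatchNorm -> SiLU (sketch)."""
    def __init__(self, c_in, c_out, k=1, s=1, p=None):
        super().__init__()
        p = k // 2 if p is None else p                      # 'same' padding for odd kernels
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```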
- the invert residual bottleneck component structure (Invert Bottleneck) in the CSPI_x module consists of three layers.
- the first layer is the convolutional component (Conv), which maps low-dimensional space to high-dimensional space for dimension expansion.
- the second layer is the depthwise separable convolution layer (DWConv layer), which uses depthwise separable convolution for spatial filtering.
- the third layer is the convolutional component (Conv), which maps the high-dimensional space back to a low-dimensional space. Comparing the network inference speed when the low-dimensional space is mapped to 2 times, 3 times and 4 times the dimensionality during dimension expansion:
- when the dimension is expanded 2 times, the inference speed can reach 7.9 milliseconds/frame, but the detection accuracy of the network is relatively low;
- when the dimension is expanded 3 times, the inference speed is 9.1 milliseconds/frame;
- when the dimension is expanded 4 times, the inference speed reaches 10.3 milliseconds/frame.
- preferably, the low-dimensional space is chosen to be mapped to 3 times the dimensionality; compared with expanding the dimension 4 times, the detection accuracy of the network is somewhat reduced, but the inference time and the amount of computation of the network are reduced.
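A hedged PyTorch sketch of the inverted residual bottleneck component and of a CSPI_x module built from it, assuming the Conv component sketched earlier is in scope. The expansion factor of 3 follows the preferred choice above; the 3x3 depthwise kernel, the residual add and the exact channel split of the two CSPI branches are assumptions.

```python
import torch
import torch.nn as nn

class InvertBottleneck(nn.Module):
    """Inverted residual bottleneck: 1x1 expand -> depthwise conv -> 1x1 project (sketch)."""
    def __init__(self, c, expand=3):
        super().__init__()
        c_mid = c * expand
        self.expand = Conv(c, c_mid, k=1)                   # low-dim -> high-dim
        self.dw = nn.Sequential(                            # depthwise spatial filtering
            nn.Conv2d(c_mid, c_mid, 3, 1, 1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        self.project = Conv(c_mid, c, k=1)                  # high-dim -> low-dim

    def forward(self, x):
        return x + self.project(self.dw(self.expand(x)))    # residual add (shapes match)

class CSPI(nn.Module):
    """CSPI_x: a Conv branch and x stacked InvertBottlenecks merged by Concat (sketch)."""
    def __init__(self, c_in, c_out, x=1):
        super().__init__()
        c_half = c_out // 2
        self.branch_conv = Conv(c_in, c_half, k=1)
        self.branch_bneck = nn.Sequential(
            Conv(c_in, c_half, k=1),
            *[InvertBottleneck(c_half) for _ in range(x)],
        )
        self.fuse = Conv(2 * c_half, c_out, k=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch_conv(x), self.branch_bneck(x)], dim=1))
```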
- the obtained feature maps include feature map out1, feature map out2 and feature map out3.
- the feature map out1 is the feature map obtained after the preprocessed image undergoes the Focus operation, the Conv and CSPI_1 operations, and then the Conv and CSPI_3 operations.
- the feature map out2 is the feature map obtained after the feature map out1 undergoes Conv and CSPI_3 operations.
- the feature map out3 is the feature map obtained after the feature map out2 undergoes the Conv operation.
- for example, the size of the preprocessed image (i.e. the input image) is 640 × 384 × 3, that is, the width, height and number of channels of the image are 640, 384 and 3 respectively; the preprocessed image is input into the backbone network, which finally outputs three feature maps: out1 (of size 80 × 48 × 128), out2 (of size 40 × 24 × 256) and out3 (of size 20 × 12 × 512).
- the size of the feature map and the number of channels change as follows:
- the input image i.e. the input image in Figure 2 and Figure 4 (size is 640 ⁇ 384 ⁇ 3), becomes a feature map of (320 ⁇ 192 ⁇ 32) after the Focus operation; becomes a feature map of (160 ⁇ 96 ⁇ 64) after the Conv and CSPI_1 operations; becomes a feature map of (80 ⁇ 48 ⁇ 128) after the Conv and CSPI_3 operations, as the first output out1; becomes a feature map of (40 ⁇ 24 ⁇ 256) after the Conv and CSPI_3 operations, as the second output out2; becomes a feature map of (20 ⁇ 12 ⁇ 512) after the Conv operation, as the third output out3. That is, the image of size (640 ⁇ 384 ⁇ 3) after preprocessing obtains a feature map of size 20 ⁇ 12 after passing through the backbone network.
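To make the Focus step in this progression concrete, the following is a small sketch (reusing the Conv component from the earlier sketch); the 3x3 kernel after the slicing is an assumption, and the shape check mirrors the 640x384x3 input listed above (PyTorch uses N x C x H x W ordering).

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """YOLOv5-style Focus: space-to-depth slicing followed by a Conv (sketch)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = Conv(4 * c_in, c_out, k=3)

    def forward(self, x):
        # take every second pixel in four phases and stack them along the channel axis
        return self.conv(torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1))

x = torch.randn(1, 3, 384, 640)                      # preprocessed 640x384x3 frame
assert Focus(3, 32)(x).shape == (1, 32, 192, 320)    # the 320x192x32 map listed above
```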
- Feature fusion that is, the features of the backbone network are input to the neck network (Neck).
- in the neck network, the feature map obtained through the spatial pyramid pooling SPP network and the feature pyramid network FPN is fused with the feature map obtained by the backbone network to obtain the fused feature map.
- the neck network of the present invention is composed of the spatial pyramid pooling SPP network and the feature pyramid network FPN.
- the primary function of the spatial pyramid pooling SPP network is to solve the problem of non-uniform size of the input image.
- the fusion of different size features in the SPP network is beneficial to situations where the target size in the image to be detected is greatly different.
- the main function of the feature pyramid network FPN is to solve the multi-scale problem in object detection; through simple changes in network connections and with essentially no increase in the computation of the original network model, it greatly improves the detection performance for small objects. Specifically:
- the feature map output by the backbone network is sent to the neck network, and through the SPP network and FPN in sequence, the obtained feature map is input into the Detect Head.
- the SPP network enables the convolutional neural network to input images of any size.
- a layer of SPP network is added after the last convolutional layer, which allows feature maps of different sizes to pass through the SPP network and output a fixed-length feature map.
- FPN is top-down. It fuses high-level features with low-level features through upsampling to obtain feature maps for prediction, and transfers strong semantic features from high levels to enhance the entire pyramid.
- the feature map output by the backbone network is of size (20 ⁇ 12 ⁇ 512) and is sent to the SPP network, and the resulting feature map is then sent to the FPN.
- as shown in Figure 5, in the feature pyramid network FPN, the feature map input from the spatial pyramid pooling SPP network passes through an inverted residual bottleneck module and then a Conv operation to obtain the high-level feature map f3, which is output to the detection head.
- the high-level feature map f3 is upsampled (UpSample) and then concatenated with the feature map out2 obtained by the backbone network; the resulting feature map passes through an inverted residual bottleneck module and then a Conv operation to obtain the mid-level feature map f2, which is output to the detection head.
- the middle-level feature map f2 is upsampled (UpSample), and then performs a Concat operation with the feature map out1 obtained from the backbone network to obtain the bottom-level feature map f1, which is output to the detection head.
- the feature map (size 20 ⁇ 12 ⁇ 512) input by the spatial pyramid pooling SPP network passes through the inverted residual bottleneck module (size 20 ⁇ 12 ⁇ 512), and then passes through Conv After the operation, the high-level feature map f3 (size 20 ⁇ 12 ⁇ 256) is obtained, which is finally output to the detection head.
- the above-mentioned high-level feature map f3 (size 20 ⁇ 12 ⁇ 256) is transformed into a feature map (size 40 ⁇ 24 ⁇ 256) after upsampling, and then combined with the feature map out2 (size 40 ⁇ 24 ⁇ ) in the backbone network 256)
- the middle layer feature map f2 (size is 40 ⁇ 24 ⁇ 128) is finally output to the detection head.
- the above-mentioned mid-level feature map f2 (size 40 ⁇ 24 ⁇ 128) is upsampled to become a feature map (size 80 ⁇ 48 ⁇ 128), and then combined with the feature map out1 (size 80 ⁇ 48 ⁇ 128) in the backbone network ) performs Concat operation to obtain the underlying feature map f1 (size 80 ⁇ 48 ⁇ 256), and finally outputs it to the detection head.
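The FPN part of the neck described above can be sketched as follows (reusing the Conv and CSPI sketches from earlier); the channel widths follow the sizes listed in the text, while the use of nearest-neighbor interpolation for these two upsamplings and the CSPI hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeckFPN(nn.Module):
    """FPN stage of the neck: f3 from the SPP output, then two upsample+Concat stages (sketch)."""
    def __init__(self):
        super().__init__()
        self.cspi_top = CSPI(512, 512, x=1)
        self.conv_f3 = Conv(512, 256, k=1)
        self.cspi_mid = CSPI(512, 256, x=1)     # input is Concat(f3_up, out2) = 512 channels
        self.conv_f2 = Conv(256, 128, k=1)

    def forward(self, out1, out2, spp_out):
        f3 = self.conv_f3(self.cspi_top(spp_out))                              # 20x12x256
        m = torch.cat([F.interpolate(f3, scale_factor=2, mode="nearest"), out2], dim=1)
        f2 = self.conv_f2(self.cspi_mid(m))                                    # 40x24x128
        f1 = torch.cat([F.interpolate(f2, scale_factor=2, mode="nearest"), out1], dim=1)
        return f1, f2, f3                                                      # f1: 80x48x256

# shape check with the sizes given in the text (N x C x H x W)
f1, f2, f3 = NeckFPN()(torch.randn(1, 128, 48, 80),
                       torch.randn(1, 256, 24, 40),
                       torch.randn(1, 512, 12, 20))
```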
- Traffic target detection that is, the fused feature map obtained through the neck network is input to the detection head, and the detection head uses the obtained features to predict traffic targets. Specifically include:
- the fused feature map is input to the detection head, and a multi-scale fused feature map is obtained through the path aggregation network PAN.
- the YOLOv4 anchor-based multi-scale detection scheme is used for the multi-scale fused feature map to detect traffic targets.
- the detection head of the present invention adopts a path aggregation network PAN.
- the path aggregation network is a bottom-up feature pyramid network.
- the semantic features are transmitted from top to bottom by FPN in the neck network, and the positioning features are transmitted from bottom to top by PAN, and the two are combined to obtain a better feature fusion effect. Then the multi-scale fusion feature map in PAN is directly used for detection.
- the YOLOv4 anchor-based multi-scale detection scheme assigns several (for example, 3) prior boxes with different aspect ratios to each grid cell of the multi-scale feature maps (for example, the feature maps of sizes (20 × 12 × 3 × 6), (40 × 24 × 3 × 6) and (80 × 48 × 3 × 6); for the feature map of size (20 × 12 × 3 × 6) the grid scale is (20 × 12), i.e. 240 grid cells in total); the detection head predicts the position offset, the height and width scaling, the probability of the corresponding traffic target and the prediction confidence. For example:
- first, the three feature maps output by the neck network are input into the PAN to obtain three feature maps of sizes (80 × 48 × 128), (40 × 24 × 256) and (20 × 12 × 512); after the Conv operation, three feature maps of sizes (20 × 12 × 18), (40 × 24 × 18) and (80 × 48 × 18) are obtained.
- each grid cell of each feature map is configured with 3 different prior boxes, and after the reshape operation in the detection head the sizes of the three feature maps are (20 × 12 × 3 × 6), (40 × 24 × 3 × 6) and (80 × 48 × 3 × 6); these three feature maps are the final detection results.
- the bounding-box position (4 dimensions), the detection confidence (1 dimension) and the class (1 dimension) add up to exactly 6, so the last dimension of each feature map is 6, which represents this information; in the other dimensions M × N × 3, M represents the number of rows of the feature matrix, N represents the number of columns of the feature matrix, and 3 represents the three prior boxes of different scales.
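The reshape step described above can be illustrated with a short sketch; the tensor layout (N x 18 x H x W before the reshape) is an assumption, and only the 20 x 12 scale is shown.

```python
import torch

head_out = torch.randn(1, 18, 12, 20)            # N x (3 priors * 6 values) x H x W
n, _, h, w = head_out.shape
pred = head_out.view(n, 3, 6, h, w).permute(0, 3, 4, 1, 2)   # N x H x W x 3 x 6
assert pred.shape == (1, 12, 20, 3, 6)           # the (20x12x3x6) map, in H x W order
# pred[..., :4] = box offsets/scales, pred[..., 4] = confidence, pred[..., 5] = class
```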
- Lane line detection and drivable area segmentation that is, using branch networks to detect lane lines and segment drivable areas.
- the bottom feature map in the feature map obtained by the spatial pyramid pooling SPP network and the feature pyramid network FPN is input to the branch network, and its size is (W /8) ⁇ (H/8) ⁇ 128, where W is the input image width of 640 (pixels), and H is the input image height of 384 (pixels).
- the branch network consists of four convolution components (Conv), three BottleneckCSP modules and three upsampling layers, as shown in Figure 6.
- the BottleneckCSP module can enhance the ability of network feature fusion and improve detection accuracy. Therefore, the branch network of the present invention can obtain high-precision output.
- using the nearest interpolation method for upsampling processing in the upsampling layer can reduce the computational cost, thereby reducing the inference time of the branch network.
- the bottom feature map f1 in the feature pyramid network FPN is restored to a feature map with a size of W ⁇ H ⁇ 4 after passing through three upsampling layers (that is, three upsampling processes) in the branch network, where W is the width of the input image. (for example, 640 pixels), H is the height of the input image (for example, 384 pixels), the feature points in the feature map correspond to the pixel points in the input image one-to-one, and 4 means that each feature point in the feature map has four values.
- the branch network of the present invention finally divides the feature map of size W ⁇ H ⁇ 4 into two feature maps of size W ⁇ H ⁇ 2.
- One of the feature maps with a size of W ⁇ H ⁇ 2 represents the probability of each pixel in the input image corresponding to the background of the drivable area, which is used to predict the drivable area, and the predicted drivable area is used as the result of drivable area segmentation;
- the other A feature map of size W ⁇ H ⁇ 2 represents the probability of each pixel in the input image corresponding to the lane line background, which is used to predict the lane line, and the predicted lane line is used as the result of lane line detection.
- where W is the width of the input image (for example, 640 pixels), H is the height of the input image (for example, 384 pixels), and 2 means that each feature point in the feature map has two values, which respectively represent the probability that the pixel corresponding to the feature point contains a target and the probability that it does not.
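A hedged sketch of the branch network just described, reusing the Conv and CSPI sketches from earlier (the CSPI module stands in for BottleneckCSP purely to keep the example short, and the intermediate channel widths are assumptions). It restores the FPN bottom feature map to a W x H x 4 output and splits it into the two W x H x 2 probability maps.

```python
import torch
import torch.nn as nn

class BranchNet(nn.Module):
    """Lane-line / drivable-area branch: 4 Conv components, 3 CSP-style blocks,
    3 nearest-neighbor upsamplings, then a split into two 2-channel maps (sketch)."""
    def __init__(self, c_in=128):
        super().__init__()
        self.layers = nn.Sequential(
            Conv(c_in, 64, k=3), CSPI(64, 64, x=1), nn.Upsample(scale_factor=2, mode="nearest"),
            Conv(64, 32, k=3),  CSPI(32, 32, x=1), nn.Upsample(scale_factor=2, mode="nearest"),
            Conv(32, 16, k=3),  CSPI(16, 16, x=1), nn.Upsample(scale_factor=2, mode="nearest"),
            Conv(16, 4, k=3),                       # W x H x 4 output
        )

    def forward(self, f1):
        out = self.layers(f1)
        da_seg, lane_seg = out.split(2, dim=1)      # two W x H x 2 probability maps
        return da_seg, lane_seg

# an 80x48x128 bottom feature map is restored to two 640x384x2 maps
da, lane = BranchNet()(torch.randn(1, 128, 48, 80))
assert da.shape == lane.shape == (1, 2, 384, 640)
```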
- This invention uses the intersection over union (IoU) to evaluate the segmentation of drivable areas and lane lines, and uses the average intersection over union (mIoU) to evaluate the segmentation performance of different models.
- the Intersection over Union (IoU) ratio is used to measure the pixel overlap between the predicted mask map and the real mask map.
- the formula is as follows.
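The formula itself is not reproduced in this text; a standard intersection-over-union formulation consistent with the surrounding definitions (with TP denoting positive samples predicted as positive by the model) is:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$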
- TN refers to the negative sample predicted by the model as the negative class
- FP refers to the negative sample predicted by the model as the positive class
- FN refers to the positive sample predicted by the model as the negative class.
- the average intersection-over-union ratio (mIoU) is used to average the IoU calculated for each prediction category (referring to lane line prediction and drivable area prediction).
- the formula is as follows.
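The formula is likewise not reproduced in this text; a standard mean-IoU formulation consistent with the definitions below, averaging the per-class IoU over the K predicted categories plus the background class, is:

$$\mathrm{mIoU} = \frac{1}{K+1}\sum_{i=0}^{K}\frac{TP_i}{TP_i + FP_i + FN_i}$$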
- K represents the number of predicted categories
- K+1 represents the number of predicted categories plus background classes
- TP refers to the positive samples predicted as positive classes by the model
- FP refers to the negative samples predicted as positive classes by the model
- FN refers to the positive samples predicted by the model as negative classes.
- the units of Recall (recall rate), AP (average precision), mIoU (mean intersection over union), Accuracy (lane line accuracy) and IoU (intersection over union) are (%), and the unit of Speed is milliseconds/frame. The data in Table 1 show that the recognition accuracy of the improved model is improved in every task: in the traffic target detection task the recall rate (Recall) reaches 89.3% and the AP value reaches 77.2%; in the drivable area segmentation task the mean intersection over union (mIoU) reaches 91.5%; in the lane line detection task the detection accuracy (Accuracy) reaches 71.1% and the intersection over union (IoU) reaches 26.0%; and the detection speed reaches 9.1 milliseconds/frame.
- the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention adopt the multi-task panoramic driving perception algorithm framework DP-YOLO (Driving perception-YOLO) based on the YOLOv5 network structure, and use an end-to-end network to achieve real-time, high-precision traffic target detection, drivable area segmentation and lane line detection.
- the present invention's multi-task panoramic driving perception method and system based on improved YOLOv5 designs an inverted residual bottleneck module (CSPI_x module) to replace the original C3 module in the YOLOv5 backbone network with the inverted residual bottleneck module.
- the inverted residual bottleneck module (CSPI_x module) is composed of x inverted residual bottleneck component structures, where x is a natural number.
- the CSPI_x module maps the features of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which can greatly reduce the amount of computation of the backbone network and improve the running speed of the backbone network while the accuracy remains basically unchanged.
- for systems with high real-time requirements, the inverted residual bottleneck module allows a particularly memory-efficient management scheme, thereby improving the recognition accuracy of the network model.
- the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention designs a branch network, which is composed of four layers of convolution components (Conv), three layers of BottleneckCSP modules and three layers of upsampling layers.
- This branch network can train the two tasks of drivable area segmentation and lane line detection at the same time.
- the BottleneckCSP module strengthens the network's feature fusion ability and improves detection accuracy; the bottom layer of the FPN, which has strong semantic information and high-resolution information beneficial to positioning, is input into the segmentation branch.
- the nearest interpolation method is used in the upsampling layer to perform upsampling processing to reduce computational costs.
- the branch network of the present invention not only obtains high-precision output, but also reduces its inference time, thereby increasing the speed of feature extraction by the branch network while ensuring little impact on accuracy.
- Another embodiment of the present invention is a multi-task panoramic driving perception system based on improved YOLOv5, including:
- the human-computer interaction module is used to provide a reserved input interface and obtain input data in the correct format.
- the multi-task detection module is used to complete the three tasks of traffic target detection, lane line detection and drivable area segmentation respectively based on the input data obtained by the human-computer interaction module, and output the results of traffic target detection, lane line detection and drivable area segmentation to the display module.
- the display module displays the input data and the results of traffic target detection, lane line detection and drivable area segmentation output by the multi-task detection module.
- the multi-task panoramic driving perception system based on improved YOLOv5 also includes:
- the traffic target detection module is used to complete the traffic target detection task and output the traffic target detection results, traffic target categories and traffic target detection precision to the display module; when only the vehicle category among the traffic target categories is detected, all vehicles are uniformly assigned to the vehicle category for detection.
- the lane line detection module is used to complete the lane line detection task and output the lane line detection results and lane line detection accuracy to the display module.
- the drivable area segmentation module is used to complete the drivable area segmentation task and output the drivable area segmentation results to the display module.
- the display module can also display traffic target categories, traffic target detection accuracy, or lane line detection accuracy.
- the present invention provides a multi-task panoramic driving perception system based on improved YOLOv5, which makes it convenient to display the detection results of the multi-task panoramic driving perception method based on improved YOLOv5 when traffic target detection, lane line detection and drivable area segmentation are performed separately or when multi-task detection is performed simultaneously.
- certain aspects of the above-mentioned techniques may be implemented by one or more processors of a processing system executing software.
- the software includes one or more executable instruction sets stored or otherwise tangibly implemented on a non-transitory computer-readable storage medium.
- the software may include instructions and certain data that manipulate one or more processors to perform one or more aspects of the above-mentioned techniques when executed by one or more processors.
- Non-transitory computer-readable storage media may include, for example, magnetic or optical disk storage devices, solid-state storage devices such as flash memory, cache, random access memory (RAM), or other non-volatile memory devices.
- the executable instructions stored on the non-transitory computer-readable storage medium may be source code, assembly language code, object code, or other instruction formats interpreted or otherwise executed by one or more processors.
- the multi-task panoramic driving perception method and system based on the improved YOLOv5 of the present invention can simultaneously perform the three tasks of traffic target detection, drivable area segmentation and lane line detection, and have higher inference speed and detection accuracy than other existing methods; they can better process the scene information around the vehicle and then help the vehicle's decision-making system make judgments, and have good practical feasibility.
- Computer-readable storage media may include any storage medium or combination of storage media that can be accessed by the computer system during use to provide instructions and/or data to the computer system.
- Such storage media may include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-ray disc), magnetic media (e.g., floppy disk, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or flash memory), or microelectromechanical systems (MEMS)-based storage media.
- Computer-readable storage media can be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or universal serial bus (USB)-based flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Abstract
The present invention belongs to the field of autonomous driving technology and discloses a multi-task panoramic driving perception method and system based on improved YOLOv5. The method of the present invention includes: preprocessing the images in a data set to obtain an input image; extracting features of the input image with an improved YOLOv5 backbone network to obtain feature maps, the backbone network being obtained by replacing the C3 modules in the YOLOv5 backbone network with inverted residual bottleneck modules; inputting the feature maps into a neck network and fusing the feature maps obtained there with the feature maps obtained by the backbone network; inputting the fused feature maps into a detection head for traffic target detection; and inputting the feature maps of the neck network into a branch network for lane line detection and drivable area segmentation. The present invention can process the scene information around the vehicle in real time and with high precision, help the vehicle's decision-making system make judgments, and can simultaneously perform the three tasks of traffic target detection, drivable area segmentation and lane line detection.
Description
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 20, 2022, with application number 202211141578.X and the invention title "Multi-task panoramic driving perception method and system based on improved YOLOv5", the entire contents of which are incorporated by reference into this application.
The present invention belongs to the field of autonomous driving technology, and specifically relates to a multi-task panoramic driving perception method and system based on improved YOLOv5.
Deep learning is critical to recent advances in many fields, especially autonomous driving. Many deep learning applications in autonomous vehicles are found in their perception systems, because the perception system can extract visual information from images captured by a monocular camera mounted on the vehicle and help the vehicle's decision-making system make good driving decisions to control vehicle behavior. For a vehicle to drive safely on the road while obeying traffic rules, the visual perception system should be able to process the surrounding scene information in real time and then help the decision-making system make judgments, including the location of obstacles, whether the road is drivable, the location of lanes, and so on. Therefore, a panoramic driving perception algorithm must cover the three most critical tasks of traffic target detection, drivable area segmentation and lane line detection.
Many researchers have therefore proposed multi-task networks, which can process multiple tasks at the same time instead of one by one to speed up the image analysis process, and which can also share information between tasks, potentially improving the performance of each task, because multi-task networks usually share the same backbone network for feature extraction. Among them, some researchers have proposed the instance segmentation algorithm Mask R-CNN, which jointly detects objects and segments instances and achieves state-of-the-art performance on each task; however, it cannot be directly applied to the field of intelligent transportation because the network cannot detect drivable areas and lane lines. Some researchers have proposed the MultiNet network structure, which consists of a shared backbone network and three separate branch networks for classification, target detection and semantic segmentation; it performs well on these tasks and reaches the state of the art on the KITTI drivable area segmentation task, but in a panoramic driving perception system the classification task is not as important as lane detection. Some researchers have proposed the DLT-Net network structure, which combines traffic target detection, drivable area segmentation and lane line detection, and proposes a context tensor to fuse feature maps between the branch networks so as to share mutual information; although its performance is competitive, it cannot achieve real-time operation. Some researchers have built an efficient multi-task network (YOLOP) for the panoramic driving perception system, which includes target detection, drivable area segmentation and lane detection tasks and can be deployed on the embedded device Jetson TX2 through TensorRT to achieve real-time performance; although it reaches an advanced level in both real-time performance and accuracy, its three branch networks, used to process three different tasks respectively, increase the inference time of the network.
In summary, in existing panoramic driving perception algorithms the drivable area segmentation and lane line detection tasks use separate branch networks for network inference, which increases the inference time of the network, so there is room for improvement.
Summary of the Invention
The purpose of the present invention is to address the shortcomings of the prior art by providing a multi-task panoramic driving perception method and system based on improved YOLOv5, which can process the scene information around the vehicle in real time and with high precision, help the vehicle's decision-making system make judgments, and simultaneously complete the three tasks of traffic target detection, drivable area segmentation and lane line detection.
Specifically, the present invention is implemented using the following technical solutions.
In one aspect, the present invention provides a multi-task panoramic driving perception method based on improved YOLOv5, including:
using the image preprocessing method of YOLOv4 to preprocess each frame of the video captured by the vehicle-mounted camera to obtain an input image;
extracting features of the input image with the improved YOLOv5 backbone network to obtain feature maps; the improved YOLOv5 backbone network is obtained by replacing the C3 modules in the YOLOv5 backbone network with inverted residual bottleneck modules, an inverted residual bottleneck module consisting of x inverted residual bottleneck component structures, where x is a natural number; the inverted residual bottleneck component structure consists of three layers: the first layer is a convolution component, which maps a low-dimensional space to a high-dimensional space for dimension expansion; the second layer is a depthwise separable convolution layer, which uses depthwise separable convolution for spatial filtering; and the third layer is a convolution component, which maps the high-dimensional space back to a low-dimensional space;
inputting the feature maps obtained by the improved YOLOv5 backbone network into the neck network, where the feature maps obtained through the spatial pyramid pooling (SPP) network and the feature pyramid network (FPN) are fused with the feature maps obtained by the improved YOLOv5 backbone network to obtain fused feature maps;
inputting the fused feature maps into the detection head, obtaining multi-scale fused feature maps through the path aggregation network (PAN), and applying the YOLOv4 anchor-based multi-scale detection scheme to the multi-scale fused feature maps for traffic target detection;
inputting the bottom feature map among the feature maps obtained through the SPP network and the FPN into a branch network, and using the branch network for lane line detection and drivable area segmentation.
Further, the image preprocessing also includes resizing each frame of the video captured by the vehicle-mounted camera from an image of width × height × number of channels of 1280 × 720 × 3 to an image of width × height × number of channels of 640 × 384 × 3.
Further, three inverted residual bottleneck modules are used in the improved YOLOv5 backbone network;
the first inverted residual bottleneck module is CSPI_1, which is composed of the convolution component Conv and one inverted residual bottleneck component structure through a Concat operation;
the second inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through a Concat operation;
the third inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through a Concat operation;
wherein the convolution component Conv consists of a conv function, a Bn function and a SiLU function;
the feature maps obtained by extracting features of the input image with the improved YOLOv5 backbone network include feature map out1, feature map out2 and feature map out3;
the feature map out1 is the feature map obtained after the preprocessed image undergoes the Focus operation, then the Conv and CSPI_1 operations, and then the Conv and CSPI_3 operations;
the feature map out2 is the feature map obtained after the feature map out1 undergoes the Conv and CSPI_3 operations;
the feature map out3 is the feature map obtained after the feature map out2 undergoes the Conv operation.
Further, in the feature pyramid network FPN, the feature map input from the spatial pyramid pooling SPP network passes through an inverted residual bottleneck module and then a Conv operation to obtain the high-level feature map f3, which is output to the detection head;
the high-level feature map f3 is upsampled and then concatenated with the feature map out2; the resulting feature map passes through an inverted residual bottleneck module and then a Conv operation to obtain the mid-level feature map f2, which is output to the detection head;
the mid-level feature map f2 is upsampled and then concatenated with the feature map out1 to obtain the bottom feature map f1, which is output to the detection head.
Further, the branch network consists of four convolution components, three BottleneckCSP modules and three upsampling layers;
using the branch network for lane line detection and drivable area segmentation includes: after the bottom feature map f1 of the feature pyramid network FPN passes through the three upsampling layers in the branch network, it is restored to a feature map of size W × H × 4, where W is the width of the input image, H is the height of the input image, the feature points in the feature map correspond one-to-one to the pixels in the input image, and 4 means that each feature point in the feature map has four values; the branch network splits the feature map of size W × H × 4 into two feature maps of size W × H × 2, where one feature map of size W × H × 2 represents, for each pixel in the input image, the probability of the drivable area versus the background and is used to predict the drivable area, the predicted drivable area being taken as the result of drivable area segmentation; the other feature map of size W × H × 2 represents, for each pixel in the input image, the probability of the lane line versus the background and is used to predict the lane lines, the predicted lane lines being taken as the result of lane line detection; here W is the width of the input image, H is the height of the input image, and 2 means that each feature point in the feature map has two values, which respectively represent the probability that the pixel corresponding to the feature point contains a target and the probability that it does not.
Further, a nearest-neighbor interpolation method is used in the upsampling layers for upsampling.
In another aspect, the present invention also provides a multi-task panoramic driving perception system based on improved YOLOv5 for implementing the above multi-task panoramic driving perception method based on improved YOLOv5, including:
a human-computer interaction module, used to provide a reserved input interface and obtain input data in the correct format;
a multi-task detection module, used to complete the three tasks of traffic target detection, lane line detection and drivable area segmentation respectively according to the input data obtained by the human-computer interaction module, and to output the results of traffic target detection, lane line detection and drivable area segmentation to the display module;
a display module, which displays the input data and the results of traffic target detection, lane line detection and drivable area segmentation output by the multi-task detection module.
Further, the multi-task panoramic driving perception system based on improved YOLOv5 also includes:
a traffic target detection module, used to complete the traffic target detection task and output the traffic target detection results, traffic target categories and traffic target detection precision to the display module;
a lane line detection module, used to complete the lane line detection task and output the lane line detection results and lane line detection precision to the display module;
a drivable area segmentation module, used to complete the drivable area segmentation task and output the drivable area segmentation results to the display module;
the display module being able to display the traffic target categories, the traffic target detection precision or the lane line detection precision.
In a further aspect, the present invention also provides a multi-task panoramic driving perception device based on improved YOLOv5, the device including a memory and a processor; the memory stores a computer program implementing the above multi-task panoramic driving perception method based on improved YOLOv5, and the processor executes the computer program to implement the steps of the above method.
In yet another aspect, the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above method are implemented.
The beneficial effects of the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention are as follows:
The multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention adopt the multi-task panoramic driving perception algorithm framework DP-YOLO (Driving perception-YOLO) based on the YOLOv5 network structure, and use an end-to-end network to achieve real-time, high-precision traffic target detection, drivable area segmentation and lane line detection.
The multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention design an inverted residual bottleneck module (CSPI_x module) and replace the original C3 module in the YOLOv5 backbone network with the inverted residual bottleneck module. The inverted residual bottleneck module (CSPI_x module) is composed of x inverted residual bottleneck component structures, where x is a natural number. The CSPI_x module maps the features of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which greatly reduces the amount of computation of the backbone network and improves its running speed while the accuracy remains basically unchanged. For systems with high real-time requirements, the inverted residual bottleneck module allows a particularly memory-efficient management scheme, thereby improving the recognition accuracy of the network model.
The multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention design a branch network consisting of four convolution components (Conv), three BottleneckCSP modules and three upsampling layers. This branch network can train the two tasks of drivable area segmentation and lane line detection at the same time. Using the BottleneckCSP module strengthens the network's feature fusion capability and improves detection accuracy; the bottom feature map output by the feature pyramid network FPN is input into the drivable area segmentation branch network, and the bottom layer of the FPN has strong semantic information and high-resolution information that is beneficial to positioning. Further, a nearest-neighbor interpolation method is used in the upsampling layers to reduce the computational cost. The branch network of the present invention not only obtains high-precision output but also reduces its inference time, thereby increasing the speed of feature extraction by the branch network while ensuring little impact on accuracy.
The present invention provides a multi-task panoramic driving perception system based on improved YOLOv5, which makes it convenient to display the results of traffic target detection, lane line detection and drivable area segmentation obtained by the multi-task panoramic driving perception method based on improved YOLOv5.
The multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention can simultaneously perform the three tasks of traffic target detection, drivable area segmentation and lane line detection, and have higher inference speed and detection accuracy than other existing methods; they can better process the scene information around the vehicle and then help the vehicle's decision-making system make judgments, and have good practical feasibility.
Brief Description of the Drawings
Figure 1 is a flow chart of the method of the present invention.
Figure 2 is a schematic structural diagram of the network model of an embodiment of the present invention.
Figure 3 is a schematic structural diagram of the inverted residual bottleneck module of an embodiment of the present invention, in which (a) is the inverted residual bottleneck module (CSPI_x module) and (b) is the inverted residual bottleneck component structure (Invert Bottleneck).
Figure 4 is a schematic diagram of the changes in size and channel number of the feature maps as the input image passes through the backbone network in an embodiment of the present invention.
Figure 5 is a schematic diagram of the changes in size and channel number of the feature maps as they pass through the neck network in an embodiment of the present invention.
Figure 6 is a schematic structural diagram of the branch network model of an embodiment of the present invention.
The present invention is described in further detail below with reference to embodiments and the accompanying drawings.
Embodiment 1:
One embodiment of the present invention is a panoramic driving perception method based on improved YOLOv5, a simple and efficient detection method (DP-YOLO, Driving perception-YOLO). The hardware conditions and related software configuration for this embodiment are as follows:
The operating system of the experimental machine is CentOS Linux release 7.6.1810, the CPU model is Hygon C86 7185 32-core Processor CPU@2.0GHz, the GPU model is NVIDIA Tesla T4, the video memory size is 16 GB, and the memory size is 50 GB.
The program code is implemented with Python 3.8 and PyTorch 1.9, and CUDA 11.2 and cuDNN 7.6.5 are used to accelerate the GPU. The number of model iterations is set to 200 and the input data amount of each batch is 24, meaning that 24 training samples are taken from the training set for each training step. The initial learning rate is 0.01, and the momentum and weight decay are set to 0.937 and 0.0005 respectively. During training, the learning rate is adjusted through warm-up and cosine annealing so that the model converges faster and better.
As shown in Figure 1, the panoramic driving perception method based on improved YOLOv5 of this embodiment includes the following steps:
1. Image preprocessing
The present invention uses the image preprocessing method of YOLOv4 to preprocess each frame of the video captured by the vehicle-mounted camera to obtain the input image. The image preprocessing method of YOLOv4 is used to eliminate irrelevant information in the original image, restore useful real information, enhance the detectability of relevant information and simplify the data as much as possible, thereby improving the reliability of feature extraction, image segmentation, matching and recognition.
In this embodiment the BDD 100K data set is selected to train and evaluate the network model of the present invention. The BDD 100K data set is divided into three parts, namely a training set of 70K images, a validation set of 10K images and a test set of 20K images. Since the labels of the test set are not public, the network model is evaluated on the validation set.
Preferably, in another embodiment, to save memory, each frame of the BDD 100K data set is also resized from an image of width × height × number of channels of 1280 × 720 × 3 to an image of width × height × number of channels of 640 × 384 × 3, where width and height are in pixels.
2. Feature extraction, i.e. extracting features of the input image with the backbone network based on improved YOLOv5.
As shown in Figure 2, the multi-task panoramic driving perception method and system based on improved YOLOv5 of the present invention adopt an improved YOLOv5 backbone network in which the original C3 modules of the YOLOv5 backbone network are replaced with inverted residual bottleneck modules (CSPI_x modules). An inverted residual bottleneck module (CSPI_x module) consists of x inverted residual bottleneck component structures (InvertBottleneck), where x is a natural number. As shown in Figure 3(a), the CSPI_x module of the present invention maps the features of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which greatly reduces the amount of computation of the network and improves its running speed while the accuracy remains basically unchanged. For systems with high real-time requirements, the inverted residual bottleneck module allows a particularly memory-efficient management scheme, thereby improving the recognition accuracy of the network model.
Three CSPI_x modules are used in the backbone network of this embodiment, as shown in Figure 2.
The first inverted residual bottleneck module is CSPI_1, which is composed of the convolution component Conv and one inverted residual bottleneck component structure through a Concat operation.
The second inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through a Concat operation.
The third inverted residual bottleneck module is CSPI_3, which is composed of the convolution component Conv and three inverted residual bottleneck component structures through a Concat operation.
The convolution component Conv consists of a conv function (convolution function), a Bn function (normalization function) and a SiLU function (activation function).
As shown in Figure 3(b), the inverted residual bottleneck component structure (Invert Bottleneck) in the CSPI_x module consists of three layers. The first layer is a convolution component (Conv), which maps a low-dimensional space to a high-dimensional space for dimension expansion. The second layer is a depthwise separable convolution layer (DWConv layer), which uses depthwise separable convolution for spatial filtering. The third layer is a convolution component (Conv), which maps the high-dimensional space back to a low-dimensional space. Comparing the network inference speed when the low-dimensional space is mapped to 2 times, 3 times and 4 times the dimensionality during dimension expansion: when the dimension is expanded 2 times, the inference speed can reach 7.9 ms/frame, but the detection accuracy of the network is relatively low; when the dimension is expanded 3 times, the inference speed is 9.1 ms/frame; when the dimension is expanded 4 times, the inference speed reaches 10.3 ms/frame. Preferably, in another embodiment, the low-dimensional space is mapped to 3 times the dimensionality; compared with expanding the dimension 4 times, the detection accuracy of the network is somewhat reduced, but the inference time and amount of computation of the network are reduced.
As shown in Figures 2 and 4, after the features of the input image are extracted with the improved YOLOv5 backbone network, the obtained feature maps include feature map out1, feature map out2 and feature map out3.
The feature map out1 is the feature map obtained after the preprocessed image undergoes the Focus operation, then the Conv and CSPI_1 operations, and then the Conv and CSPI_3 operations.
The feature map out2 is the feature map obtained after the feature map out1 undergoes the Conv and CSPI_3 operations.
The feature map out3 is the feature map obtained after the feature map out2 undergoes the Conv operation.
For example, the size of the preprocessed image (i.e. the input image) is 640 × 384 × 3, that is, the width, height and number of channels of the image are 640, 384 and 3 respectively. The preprocessed image is input into the backbone network, which finally outputs three feature maps, namely out1 (of size 80 × 48 × 128), out2 (of size 40 × 24 × 256) and out3 (of size 20 × 12 × 512). In the backbone network, the size and number of channels of the feature maps change as follows:
The input image, i.e. the input image in Figures 2 and 4 (of size 640 × 384 × 3), becomes a feature map of (320 × 192 × 32) after the Focus operation; becomes a feature map of (160 × 96 × 64) after the Conv and CSPI_1 operations; becomes a feature map of (80 × 48 × 128) after the Conv and CSPI_3 operations, as the first output out1; becomes a feature map of (40 × 24 × 256) after further Conv and CSPI_3 operations, as the second output out2; and becomes a feature map of (20 × 12 × 512) after the Conv operation, as the third output out3. That is, the preprocessed image of size (640 × 384 × 3) yields a feature map at the 20 × 12 scale after passing through the backbone network.
3. Feature fusion, i.e. the features from the backbone network are input into the neck network (Neck), where the feature maps obtained through the spatial pyramid pooling SPP network and the feature pyramid network FPN are fused with the feature maps obtained by the backbone network to obtain fused feature maps.
The neck network of the present invention is composed of the spatial pyramid pooling SPP network and the feature pyramid network FPN. The primary function of the SPP network is to deal with non-uniform input image sizes; the fusion of features of different sizes in the SPP network is beneficial when the sizes of the targets in the image to be detected differ greatly. The main function of the feature pyramid network FPN is to solve the multi-scale problem in object detection; through simple changes in network connections and with essentially no increase in the computation of the original network model, it greatly improves the detection performance for small objects. Specifically:
The feature maps output by the backbone network are sent to the neck network and pass through the SPP network and the FPN in sequence, and the resulting feature maps are input into the detection head (Detect Head).
The SPP network enables the convolutional neural network to accept images of any size. A layer of SPP network is added after the last convolutional layer of the convolutional neural network, so that feature maps of different sizes can pass through the SPP network and a feature map of fixed length is output.
The FPN is top-down; it fuses high-level features with low-level features through upsampling to obtain feature maps for prediction, and passes the strong semantic features of the high levels downward, thereby enhancing the entire pyramid.
For example, as shown in Figure 2, the feature map of size (20 × 12 × 512) output by the backbone network is sent into the SPP network, and the resulting feature map is then sent into the FPN.
As shown in Figure 5, in the feature pyramid network FPN, the feature map input from the spatial pyramid pooling SPP network passes through an inverted residual bottleneck module and then a Conv operation to obtain the high-level feature map f3, which is output to the detection head.
The high-level feature map f3 is upsampled (UpSample) and then concatenated with the feature map out2 obtained by the backbone network; the resulting feature map passes through an inverted residual bottleneck module and then a Conv operation to obtain the mid-level feature map f2, which is output to the detection head.
The mid-level feature map f2 is upsampled (UpSample) and then concatenated with the feature map out1 obtained by the backbone network to obtain the bottom feature map f1, which is output to the detection head.
For example, in the feature pyramid network FPN, the feature map (of size 20 × 12 × 512) input from the spatial pyramid pooling SPP network passes through an inverted residual bottleneck module (giving a size of 20 × 12 × 512) and then a Conv operation to obtain the high-level feature map f3 (of size 20 × 12 × 256), which is finally output to the detection head.
The above high-level feature map f3 (of size 20 × 12 × 256) is upsampled into a feature map of size 40 × 24 × 256, which is then concatenated with the feature map out2 (of size 40 × 24 × 256) in the backbone network to give a feature map of size 40 × 24 × 512; this passes through an inverted residual bottleneck module (CSPI_1 module) to give a feature map of size 40 × 24 × 256, and then through a Conv operation to obtain the mid-level feature map f2 (of size 40 × 24 × 128), which is finally output to the detection head.
The above mid-level feature map f2 (of size 40 × 24 × 128) is upsampled into a feature map of size 80 × 48 × 128, which is then concatenated with the feature map out1 (of size 80 × 48 × 128) in the backbone network to obtain the bottom feature map f1 (of size 80 × 48 × 256), which is finally output to the detection head.
四、交通目标检测,即经过颈部网络得到的融合的特征图输入到检测头,检测头利用获得到的特征对交通目标进行预测。具体包括:
将所述融合的特征图输入到检测头,经路径聚合网络PAN得到多尺度融合特征图,对所述多尺度融合特征图采用YOLOv4基于锚定的多尺度检测方案,进行交通目标检测。
本发明的检测头中采用了路径聚合网络PAN。路径聚合网络是一种自下而上的特征金字塔网络。颈部网络中的FPN自上而下传递语义特征,PAN自下而上传递定位特征,将二者结合起来可获得更好的特征融合效果,然后直接使用PAN中的多尺度融合特征图进行检测。所述YOLOv4基于锚定的多尺度检测方案,是指为多尺度特征图(例如大小为(20×12×3×6)、(40×24×3×6)、(80×48×3×6)的三个特征图,因尺度不同而称为多尺度特征图)的每个网格(例如大小为(20×12×3×6)的特征图,其尺度为20×12,共计20×12=240个网格)分配若干个(例如3个)不同长宽比的先验框,检测头对位置偏移、高度和宽度的缩放,以及交通目标对应的概率和预测的置信度进行预测。例如:
首先,颈部网络输出的三个特征图输入到PAN之后,得到大小为(80×48×128)、(40×24×256)和(20×12×512)的三个特征图,再经过Conv操作后得到大小分别为(20×12×18)、(40×24×18)、(80×48×18)的三个特征图。每个特征图的每个网格中都配置3个不同的先验框,经过检测头中的reshape操作后,三张特征图大小分别为(20×12×3×6)、(40×24×3×6)、(80×48×3×6),这三张特征图就是最终输出的检测结果。检测框位置(4维)、检测置信度(1维)、类别(1维)加起来正好是6维,特征图最后一个维度为6,代表的就是这些信息;特征图其余维度M×N×3中,M代表特征矩阵的行数,N代表特征矩阵的列数,3代表3个不同尺度的先验框。
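下面给出对检测头单个尺度输出做reshape并解析各维含义的示意代码(其中18个通道按“先验框×6维输出”排列的假设,以及对置信度、类别取sigmoid的写法,均沿用YOLO系列的常见做法,仅供参考):

```python
import torch

num_anchors, num_out = 3, 6                  # 每个网格3个先验框,每个先验框输出6维
pred = torch.randn(1, 18, 12, 20)            # 检测头经Conv后的一张特征图,对应20×12×18

# reshape为(批次, 网格行, 网格列, 先验框数, 6),即文中的20×12×3×6
pred = pred.view(1, num_anchors, num_out, 12, 20).permute(0, 3, 4, 1, 2)
print(pred.shape)                            # torch.Size([1, 12, 20, 3, 6])

box_offsets = pred[..., 0:4]                 # 检测框位置:中心偏移与宽高缩放(4维)
objectness  = pred[..., 4].sigmoid()         # 检测置信度(1维)
class_prob  = pred[..., 5].sigmoid()         # 交通目标类别概率(1维,例如vehicle)
```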
五、车道线检测和可行驶区域分割,即利用分支网络进行车道线检测和可行驶区域分割。
因为FPN的底层具有较强的语义信息和利于定位的高分辨率信息,将经空间金字塔池SPP网络和特征金字塔网络FPN得到的特征图中的底层特征图输入到分支网络,其大小为(W/8)×(H/8)×128,其中,W为输入图像宽度640(像素),H为输入图像高度384(像素)。
分支网络由四层卷积组件(Conv)、三层BottleneckCSP模块和三层
上采样层组成,如图6所示。BottleneckCSP模块能加强网络特征融合的能力,提高检测精度,因此本发明的分支网络能够获得高精度的输出。优选的,在另一个实施例中,在上采样层中使用最近插值方法进行上采样处理,可以减少计算成本,从而减少分支网络的推理时间。
特征金字塔网络FPN中底层特征图f1在分支网络中经过三层上采样层(即经过了三次上采样处理)后,恢复成大小为W×H×4的特征图,其中,W为输入图像宽度(例如640像素),H为输入图像高度(例如384像素),特征图中特征点与输入图像中像素点一一对应,4表示特征图中每个特征点有四个取值。
本发明的分支网络最终将大小为W×H×4的特征图切分成两个大小为W×H×2特征图。其中一个大小为W×H×2的特征图表示输入图像中每个像素对于可行驶区域对应背景的概率,用来预测可行驶区域,预测所得的可行驶区域作为可行驶区域分割的结果;另一个大小为W×H×2的特征图表示输入图像中每个像素对于车道线对应背景的概率,用来预测车道线,预测所得的车道线作为车道线检测的结果。其中,W为输入图像宽度(例如640像素),H为输入图像高度(例如384像素),2表示该特征图中每个特征点有两个取值,用这两个取值分别表示该特征点相应像素点有目标的概率、该特征点相应像素点无目标的概率。
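对分支网络恢复出的W×H×4特征图,可按如下示意代码切分为两个W×H×2特征图并得到可行驶区域与车道线的预测掩码(张量按PyTorch的N×C×H×W排列,通道顺序为假设,仅作后处理示例):

```python
import torch

W, H = 640, 384
seg_out = torch.randn(1, 4, H, W)            # 分支网络输出,对应W×H×4的特征图

da_map, ll_map = seg_out.split(2, dim=1)     # 切分为两个W×H×2的特征图

# 文中每个特征点的两个取值分别表示“有目标”与“无目标”的概率,
# 此处假设通道0对应“有目标”、通道1对应“背景”
da_mask = da_map.argmax(dim=1) == 0          # True表示该像素属于可行驶区域
ll_mask = ll_map.argmax(dim=1) == 0          # True表示该像素属于车道线
print(da_mask.shape, ll_mask.shape)          # 均为(1, 384, 640)
```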
为了验证对YOLOv5进行改进后网络模型的性能,需要选用恰当的评价指标对网络模型进行评价。本发明采用交并比(IoU)来评估可行驶区域和车道线分割,采用平均交并比(mIoU)来评估不同模型的分割性能。
交并比(IoU)用于衡量预测掩码图与真实掩码图之间的像素重叠,公式如下:
IoU = TP/(TP+FP+FN)
其中,TP是指被模型预测为正类的正样本,FP是指被模型预测为正类的负样本,FN是指被模型预测为负类的正样本。
平均交并比(mIoU)是对每个预测类别(指车道线预测、可行驶区域预测)计算出的IoU求和取平均,公式如下:
mIoU = (1/(K+1)) × Σ_{i=0}^{K} TP_i/(TP_i+FP_i+FN_i)
其中,K表示预测类别的数量,K+1表示加上背景类后的预测类别数量,TP是指被模型预测为正类的正样本,FP是指被模型预测为正类的负样本,FN是指被模型预测为负类的正样本。
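上述IoU与mIoU可按如下示意代码由预测掩码与真实掩码直接统计得到(以二值掩码、单一类别为例,写法仅供参考):

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """按 IoU = TP/(TP+FP+FN) 计算预测掩码与真实掩码的交并比。"""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    return float(tp) / max(tp + fp + fn, 1)

def miou(ious):
    """mIoU:对包含背景类在内的各类IoU取平均。"""
    return sum(ious) / len(ious)

pred = np.random.rand(384, 640) > 0.5        # 模拟预测掩码
gt = np.random.rand(384, 640) > 0.5          # 模拟真实掩码
print(iou(pred, gt))
```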
原始模型与改进模型的性能指标对比参见下表。
表1原始模型与改进模型的性能指标对比表
其中,Recall(召回率)、AP(平均精度)、mIoU(平均交并比)、Accuracy(车道线的精确度)、IoU(交并比)的单位为%,Speed(检测速度)的单位为毫秒/帧。从表1数据可以看出,改进模型在各个任务中的识别精度都有较好的提升:在交通目标检测任务中召回率(Recall)达到89.3%,AP值达到77.2%;在可行驶区域分割任务中平均交并比(mIoU)达到91.5%;在车道线检测任务中检测精度(Accuracy)达到71.1%,交并比(IoU)达到26.0%;检测速度达到9.1毫秒/帧。实验数据结果表明,本发明提出的基于改进YOLOv5的多任务全景驾驶感知方法对全景驾驶感知任务有着较好的提升作用,并且满足实时性的要求。
本发明的基于改进YOLOv5的多任务全景驾驶感知方法与系统,采用基于YOLOv5网络结构的多任务全景驾驶感知算法框架DP-YOLO(Driving perception-YOLO),使用端到端的网络实现实时、高精度的交通目标检测、可行驶区域分割和车道线检测。
本发明的基于改进YOLOv5的多任务全景驾驶感知方法与系统,设计了一种反转残差瓶颈模块(CSPI_x模块),把YOLOv5主干网络中原有的C3模块用反转残差瓶颈模块进行替换。反转残差瓶颈模块(CSPI_x模块)由x个反转残差瓶颈组件结构组成,x为自然数。CSPI_x模块把基础层的特征映射分为两部分,然后通过跨阶段层次结构将它们合并,这
样可以大大减少主干网络的计算量,提高主干网络的运行速度,同时精度基本保持不变。对于实时性要求很高的系统,反转残差瓶颈模块还支持一种高效的内存管理方式,有利于提升网络模型的识别精度。
本发明的基于改进YOLOv5的多任务全景驾驶感知方法与系统,设计了一种分支网络,由四层卷积组件(Conv)、三层BottleneckCSP模块和三层上采样层组成。该分支网络可以同时对可行驶区域分割和车道线检测这两个任务进行训练,采用BottleneckCSP模块,能加强网络特征融合的能力,提高检测精度;将FPN的底层输入到分割分支,FPN的底层具有较强的语义信息和利于定位的高分辨率信息。进一步的,在上采样层中使用最近插值方法进行上采样处理,以减少计算成本。本发明的分支网络不仅获得了高精度的输出,而且减少了其推理时间,从而在保证对精度影响不大的前提下提高了分支网络提取特征的速度。
实施例2:
本发明的另一个实施例,为一种基于改进YOLOv5的多任务全景驾驶感知系统,包括:
人机交互模块,用于提供预留输入接口,获得格式正确的输入数据。
多任务检测模块,用于根据所述人机交互模块获得的输入数据,分别完成交通目标检测、车道线检测和可行驶区域分割这三个任务,将交通目标检测、车道线检测和可行驶区域分割的结果输出给显示模块。
显示模块,显示所述输入数据,和多任务检测模块输出的交通目标检测、车道线检测和可行驶区域分割的结果。
优选的,在另一个实施例中,基于改进YOLOv5的多任务全景驾驶感知系统还包括:
交通目标检测模块,用于完成交通目标检测任务,将交通目标检测结果、交通目标类别和交通目标检测精确率输出给显示模块;当只对交通目标类别中的车辆这一类别进行检测时,把所有车辆统一归入vehicle这一类别进行检测。
车道线检测模块,用于完成车道线检测任务,将车道线检测结果和车道线检测精确率输出给显示模块。
可行驶区域分割模块,用于完成可行驶区域分割任务,将可行驶区域
分割结果输出给显示模块。
所述显示模块还能够显示交通目标类别、交通目标检测精确率或车道线检测精确率。
本发明提供一种基于改进YOLOv5的多任务全景驾驶感知系统,方便展示基于改进YOLOv5的多任务全景驾驶感知方法分别进行交通目标检测、车道线检测、可行驶区域分割,或者同时进行多任务检测的检测结果。
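下面给出上述各模块协作方式的一个极简示意(detect_fn代表多任务检测模块的推理接口、display_fn代表显示模块,均为笔者假设的接口形式,不代表本发明限定的系统实现):

```python
class MultiTaskPerceptionSystem:
    """示意性的系统组装:人机交互模块获取输入数据,多任务检测模块输出三个任务的结果,
    显示模块负责展示输入数据与检测结果。"""
    def __init__(self, detect_fn, display_fn=print):
        self.detect = detect_fn        # 多任务检测模块:输入图像 -> 三个任务的结果
        self.display = display_fn      # 显示模块

    def run(self, frame):
        objects, lanes, drivable = self.detect(frame)   # 交通目标、车道线、可行驶区域
        self.display({"输入": frame, "交通目标检测": objects,
                      "车道线检测": lanes, "可行驶区域分割": drivable})
        return objects, lanes, drivable

# 用法示例(用返回空结果的假检测函数代替真实网络)
system = MultiTaskPerceptionSystem(lambda img: ([], [], []))
system.run("frame_0001.jpg")
```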
在一些实施例中,上述技术的某些方面可以由执行软件的处理系统的一个或多个处理器来实现。该软件包括存储或以其他方式有形实施在非暂时性计算机可读存储介质上的一个或多个可执行指令集合。软件可以包括指令和某些数据,这些指令和某些数据在由一个或多个处理器执行时操纵一个或多个处理器以执行上述技术的一个或多个方面。非暂时性计算机可读存储介质可以包括例如磁或光盘存储设备,诸如闪存、高速缓存、随机存取存储器(RAM)等的固态存储设备或其他非易失性存储器设备。存储在非临时性计算机可读存储介质上的可执行指令可以是源代码、汇编语言代码、目标代码或被一个或多个处理器解释或以其他方式执行的其他指令格式。
本发明的基于改进YOLOv5的多任务全景驾驶感知方法与系统,能够对交通目标检测、可行驶区域分割和车道线检测这三个任务同时进行检测,与其他现有方法相比,具有更高的推理速度和检测精确度;本发明的基于改进YOLOv5的多任务全景驾驶感知方法和系统可以更好地处理车辆周围的场景信息,然后来帮助车辆的决策系统做出判断,具有较好的实际可行性。
计算机可读存储介质可以包括在使用期间可由计算机系统访问以向计算机系统提供指令和/或数据的任何存储介质或存储介质的组合。这样的存储介质可以包括但不限于光学介质(例如,光盘(CD)、数字多功能光盘(DVD)、蓝光光盘)、磁介质(例如,软盘、磁带或磁性硬盘驱动器)、易失性存储器(例如,随机存取存储器(RAM)或高速缓存)、非易失性存储器(例如,只读存储器(ROM)或闪存)或基于微机电系统(MEMS)的存储介质。计算机可读存储介质可以嵌入计算系统(例如,
系统RAM或ROM)中,固定地附接到计算系统(例如,磁性硬盘驱动器),可移除地附接到计算系统(例如,光盘或通用基于串行总线(USB)的闪存),或者经由有线或无线网络(例如,网络可访问存储(NAS))耦合到计算机系统。
请注意,并非上述一般性描述中的所有活动或要素都是必需的,特定活动或设备的一部分可能不是必需的,并且除了描述的那些之外可以执行一个或多个进一步的活动或包括的要素。更进一步,活动列出的顺序不必是执行它们的顺序。而且,已经参考具体实施例描述了这些概念。然而,本领域的普通技术人员认识到,在不脱离权利要求书中阐述的本公开的范围的情况下,可以进行各种修改和改变。因此,说明书和附图被认为是说明性的而不是限制性的,并且所有这样的修改被包括在本公开的范围内。
上面已经关于具体实施例描述了益处、其他优点和问题的解决方案。然而,可能导致任何益处、优点或解决方案发生或变得更明显的益处、优点、问题的解决方案以及任何特征都不应被解释为任何或其他方面的关键、必需或任何或所有权利要求的基本特征。此外,上面公开的特定实施例仅仅是说明性的,因为所公开的主题可以以受益于这里的教导的本领域技术人员显而易见的不同但等同的方式进行修改和实施。除了在权利要求书中描述的以外,没有意图限制在此示出的构造或设计的细节。因此明显的是,上面公开的特定实施例可以被改变或修改,并且所有这样的变化被认为在所公开的主题的范围内。
Claims (10)
- 一种基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,包括:采用YOLOv4的图片预处理方法对车载摄像头采集的视频中每一帧图像进行图片预处理,得到输入图像;利用改进YOLOv5的主干网络提取所述输入图像的特征,得到特征图;所述改进YOLOv5的主干网络,由将YOLOv5的主干网络中C3模块替换为反转残差瓶颈模块得到,所述反转残差瓶颈模块由x个反转残差瓶颈组件结构组成,其中,x为自然数;所述反转残差瓶颈组件结构由三层组成,第一层是卷积组件,该层将低维空间映射到高维空间进行维度扩展;第二层是深度可分离卷积层,采用深度可分离卷积进行空间过滤;第三层是卷积组件,该层将高维空间映射到低维空间;将所述改进YOLOv5的主干网络得到的特征图输入到颈部网络,在颈部网络中经空间金字塔池SPP网络和特征金字塔网络FPN得到的特征图与所述改进YOLOv5的主干网络得到的特征图融合,得到融合的特征图;将所述融合的特征图输入到检测头,经路径聚合网络PAN得到多尺度融合特征图,对所述多尺度融合特征图采用YOLOv4基于锚定的多尺度检测方案,进行交通目标检测;将所述经空间金字塔池SPP网络和特征金字塔网络FPN得到的特征图中底层特征图输入到分支网络,利用分支网络进行车道线检测和可行驶区域分割。
- 根据权利要求1所述的基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,所述图片预处理还包括将所述车载摄像头采集的视频中每一帧图像从宽度×高度×通道数为1280×720×3的图像调整成宽度×高度×通道数为640×384×3的图像。
- 根据权利要求1所述的基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,所述改进YOLOv5的主干网络中采用三个反转残差瓶颈模块;第一个反转残差瓶颈模块为CSPI_1,由卷积组件Conv和一个反转残差瓶颈组件结构经过Concat操作组成;第二个反转残差瓶颈模块为CSPI_3,由卷积组件Conv和三个反转残差瓶颈组件结构经过Concat操作组成;第三个反转残差瓶颈模块为CSPI_3,由卷积组件Conv和三个反转残差瓶颈组件结构经过Concat操作组成;其中,卷积组件Conv由conv函数、Bn函数、SiLU函数三者组成;所述利用改进YOLOv5的主干网络提取所述输入图像的特征,得到的特征图包括特征图out1、特征图out2和特征图out3;所述特征图out1,为预处理图片经过Focus操作后又经过Conv、CSPI_1操作,再经过Conv、CSPI_3操作后得到的特征图;所述特征图out2,为所述特征图out1经过Conv、CSPI_3操作后得到的特征图;所述特征图out3,为所述特征图out2经过Conv操作后得到的特征图。
- 根据权利要求3所述的基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,在所述特征金字塔网络FPN中,由空间金字塔池SPP网络输入的特征图经过反转残差瓶颈模块,再经过Conv操作后得到高层特征图f3,输出到检测头;所述高层特征图f3经过上采样,再与所述特征图out2进行Concat操作得到的特征图,经过反转残差瓶颈模块,再经过Conv操作后得到中层特征图f2,输出到检测头;所述中层特征图f2经过上采样,再与所述特征图out1进行Concat操作得到底层特征图f1,输出到检测头。
- 根据权利要求4所述的基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,所述分支网络由四层卷积组件、三层BottleneckCSP模块和三层上采样层组成;所述利用分支网络进行车道线检测和可行驶区域分割包括:将所述特征金字塔网络FPN中底层特征图f1在分支网络中经过三层上采样层后,恢复成大小为W×H×4的特征图,其中,W为输入图像宽度,H为输入图像高度,特征图中特征点与输入图像中像素点一一对应,4表示特征图中每个特征点有四个取值;所述分支网络将所述大小为W×H×4的特征图切分成两个大小为W×H×2的特征图,其中一个大小为W×H×2的 特征图表示输入图像中每个像素点对于可行驶区域对应背景的概率,用来预测可行驶区域,预测所得的可行驶区域作为可行驶区域分割的结果;另一个大小为W×H×2的特征图表示输入图像中每个像素点对于车道线对应背景的概率,用来预测车道线,预测所得的车道线作为车道线检测的结果;其中,W为输入图像宽度,H为输入图像高度,2表示该特征图中每个特征点有两个取值,用这两个取值分别表示该特征点相应像素点有目标的概率、该特征点相应像素点无目标的概率。
- 根据权利要求5所述的基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,在所述上采样层中使用最近插值方法进行上采样处理。
- 一种基于改进YOLOv5的多任务全景驾驶感知系统,实现根据权利要求1至6任一所述的基于改进YOLOv5的多任务全景驾驶感知方法,其特征在于,包括:人机交互模块,用于提供预留输入接口,获得格式正确的输入数据;多任务检测模块,用于根据所述人机交互模块获得的输入数据,分别完成交通目标检测、车道线检测和可行驶区域分割这三个任务,将交通目标检测、车道线检测和可行驶区域分割的结果输出给显示模块;显示模块,显示所述输入数据,和多任务检测模块输出的交通目标检测、车道线检测和可行驶区域分割的结果。
- 根据权利要求7所述的基于改进YOLOv5的多任务全景驾驶感知系统,其特征在于,还包括:交通目标检测模块,用于完成交通目标检测任务,将交通目标检测结果、交通目标类别和交通目标检测精确率输出给显示模块;车道线检测模块,用于完成车道线检测任务,将车道线检测结果和车道线检测精确率输出给显示模块;可行驶区域分割模块,用于完成可行驶区域分割任务,将可行驶区域分割结果输出给显示模块;所述显示模块,能够显示交通目标类别、交通目标检测精确率或车道线检测精确率。
- 一种基于改进YOLOv5的多任务全景驾驶感知设备,其特征在于,所述设备包括存储器和处理器;所述存储器存储有实现基于改进YOLOv5 的多任务全景驾驶感知方法的计算机程序,所述处理器执行所述计算机程序,以实现根据权利要求1-6任一所述方法的步骤。
- 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述的计算机程序被处理器执行时实现根据权利要求1-6任一所述方法的步骤。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2403166.8A GB2624812A (en) | 2022-09-20 | 2023-04-21 | Multi-task panoptic driving perception method and system based on improved YOLOv5 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211141578.XA CN115223130B (zh) | 2022-09-20 | 2022-09-20 | 基于改进YOLOv5的多任务全景驾驶感知方法与系统 |
CN202211141578.X | 2022-09-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024060605A1 true WO2024060605A1 (zh) | 2024-03-28 |
Family
ID=83617185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/089631 WO2024060605A1 (zh) | 2022-09-20 | 2023-04-21 | 基于改进YOLOv5的多任务全景驾驶感知方法与系统 |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN115223130B (zh) |
GB (1) | GB2624812A (zh) |
WO (1) | WO2024060605A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118447370A (zh) * | 2024-05-22 | 2024-08-06 | 沈阳工业大学 | 一种基于改进YOLOv5s的轻量化隧道裂缝检测方法 |
CN118470576A (zh) * | 2024-07-09 | 2024-08-09 | 齐鲁空天信息研究院 | 一种无人机图像的小目标检测方法及系统 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223130B (zh) * | 2022-09-20 | 2023-02-03 | 南京理工大学 | 基于改进YOLOv5的多任务全景驾驶感知方法与系统 |
CN115376093A (zh) * | 2022-10-25 | 2022-11-22 | 苏州挚途科技有限公司 | 智能驾驶中的对象预测方法、装置及电子设备 |
CN115797881A (zh) * | 2022-12-26 | 2023-03-14 | 江苏大学 | 一种用于交通道路路面信息的多任务联合感知网络模型及检测方法 |
CN116152345B (zh) * | 2023-04-19 | 2023-07-14 | 盐城数智科技有限公司 | 一种嵌入式系统实时物体6d位姿和距离估计方法 |
TWI846614B (zh) * | 2023-04-27 | 2024-06-21 | 旺宏電子股份有限公司 | 全景感知系統、方法及其非暫態電腦可讀取媒體 |
CN117372983B (zh) * | 2023-10-18 | 2024-06-25 | 北京化工大学 | 一种低算力的自动驾驶实时多任务感知方法及装置 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200293891A1 (en) * | 2019-04-24 | 2020-09-17 | Jiangnan University | Real-time target detection method deployed on platform with limited computing resources |
CN112686225A (zh) * | 2021-03-12 | 2021-04-20 | 深圳市安软科技股份有限公司 | Yolo神经网络的训练方法、行人检测方法和相关设备 |
CN114612835A (zh) * | 2022-03-15 | 2022-06-10 | 中国科学院计算技术研究所 | 一种基于YOLOv5网络的无人机目标检测模型 |
CN114863379A (zh) * | 2022-05-17 | 2022-08-05 | 安徽蔚来智驾科技有限公司 | 多任务目标检测方法、电子设备、介质及车辆 |
CN115223130A (zh) * | 2022-09-20 | 2022-10-21 | 南京理工大学 | 基于改进YOLOv5的多任务全景驾驶感知方法与系统 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112699859B (zh) * | 2021-03-24 | 2021-07-16 | 华南理工大学 | 目标检测方法、装置、存储介质及终端 |
CN114005020B (zh) * | 2021-11-05 | 2024-04-26 | 河北工业大学 | 一种基于M3-YOLOv5的指定移动目标检测方法 |
CN114299405A (zh) * | 2021-12-28 | 2022-04-08 | 重庆大学 | 一种无人机图像实时目标检测方法 |
- 2022
  - 2022-09-20 CN CN202211141578.XA patent/CN115223130B/zh active Active
- 2023
  - 2023-04-21 GB GB2403166.8A patent/GB2624812A/en active Pending
  - 2023-04-21 WO PCT/CN2023/089631 patent/WO2024060605A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200293891A1 (en) * | 2019-04-24 | 2020-09-17 | Jiangnan University | Real-time target detection method deployed on platform with limited computing resources |
CN112686225A (zh) * | 2021-03-12 | 2021-04-20 | 深圳市安软科技股份有限公司 | Yolo神经网络的训练方法、行人检测方法和相关设备 |
CN114612835A (zh) * | 2022-03-15 | 2022-06-10 | 中国科学院计算技术研究所 | 一种基于YOLOv5网络的无人机目标检测模型 |
CN114863379A (zh) * | 2022-05-17 | 2022-08-05 | 安徽蔚来智驾科技有限公司 | 多任务目标检测方法、电子设备、介质及车辆 |
CN115223130A (zh) * | 2022-09-20 | 2022-10-21 | 南京理工大学 | 基于改进YOLOv5的多任务全景驾驶感知方法与系统 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118447370A (zh) * | 2024-05-22 | 2024-08-06 | 沈阳工业大学 | 一种基于改进YOLOv5s的轻量化隧道裂缝检测方法 |
CN118470576A (zh) * | 2024-07-09 | 2024-08-09 | 齐鲁空天信息研究院 | 一种无人机图像的小目标检测方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN115223130A (zh) | 2022-10-21 |
CN115223130B (zh) | 2023-02-03 |
GB2624812A (en) | 2024-05-29 |
GB202403166D0 (en) | 2024-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024060605A1 (zh) | 基于改进YOLOv5的多任务全景驾驶感知方法与系统 | |
WO2021218786A1 (zh) | 一种数据处理系统、物体检测方法及其装置 | |
JP2019061658A (ja) | 領域判別器訓練方法、領域判別装置、領域判別器訓練装置及びプログラム | |
WO2022110611A1 (zh) | 一种面向平面交叉口的行人过街行为预测方法 | |
US11783588B2 (en) | Method for acquiring traffic state, relevant apparatus, roadside device and cloud control platform | |
CN112446292B (zh) | 一种2d图像显著目标检测方法及系统 | |
CN110781980B (zh) | 目标检测模型的训练方法、目标检测方法及装置 | |
CN114202743A (zh) | 自动驾驶场景下基于改进faster-RCNN的小目标检测方法 | |
CN111428664A (zh) | 一种基于人工智能深度学习技术的计算机视觉的实时多人姿态估计方法 | |
CN114519819A (zh) | 一种基于全局上下文感知的遥感图像目标检测方法 | |
CN113297956A (zh) | 一种基于视觉的手势识别方法及系统 | |
CN113870160A (zh) | 一种基于变换器神经网络的点云数据处理方法 | |
CN117409412A (zh) | 一种基于细节增强的双分辨率实时语义分割方法 | |
CN117975418A (zh) | 一种基于改进rt-detr的交通标识检测方法 | |
WO2024175099A1 (zh) | 图像处理方法、装置和存储介质 | |
Biswas et al. | Halsie: Hybrid approach to learning segmentation by simultaneously exploiting image and event modalities | |
CN113869144A (zh) | 目标检测方法、装置、电子设备及计算机可读存储介质 | |
Zhang et al. | Dense pedestrian detection method based on improved YOLOv5 | |
CN118115934A (zh) | 密集行人检测方法及系统 | |
CN117197472A (zh) | 基于鼻出血内窥镜影像的高效师生半监督分割方法及装置 | |
CN117198056A (zh) | 路口交通指挥模型的构建方法及相关装置、应用 | |
CN117372991A (zh) | 基于多视角多模态融合的自动驾驶方法及系统 | |
CN117078591A (zh) | 道路缺陷实时检测方法、系统、设备及存储介质 | |
CN117011932A (zh) | 一种奔跑行为检测方法、电子设备及存储介质 | |
Zheng et al. | A method of detect traffic police in complex scenes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| ENP | Entry into the national phase | Ref document number: 202403166; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20230421 |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23866902; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 18711367; Country of ref document: US |