CN116958910A - Attention mechanism-based multi-task traffic scene detection algorithm - Google Patents
- Publication number: CN116958910A
- Application number: CN202310696843.9A
- Authority
- CN
- China
- Prior art keywords: module, feature, convolution, network, input
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/54—Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/0499—Feedforward networks
- G06N3/08—Learning methods
- G06V10/26—Segmentation of patterns in the image field
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
- G06V10/764—Recognition using classification, e.g. of video objects
- G06V10/806—Fusion of extracted features
- G06V10/82—Recognition using neural networks
- G06V20/588—Recognition of the road, e.g. of lane markings
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V2201/07—Target detection
- Y02T10/40—Engine management systems
Abstract
The invention relates to an attention mechanism-based multi-task traffic scene detection algorithm comprising a shared encoder and three decoders. The shared encoder consists of a backbone network and a neck network; the three decoders respectively complete the traffic target detection, drivable area detection and lane line detection tasks. The invention fuses, for the first time, a large convolution kernel attention mechanism with the ELAN structure proposed in YOLOv7 to form a brand-new backbone network, and combines the large convolution kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module for the lane line segmentation task. This module is inserted after the backbone network and before the lane line segmentation head, further improving the accuracy of the lane line detection task.
Description
Technical Field
The invention relates to the technical field of multi-task traffic scene detection, in particular to a multi-task traffic scene detection algorithm based on an attention mechanism.
Background
In the past decade, tremendous advances have been made in computer vision and deep learning, yet vision-based tasks (e.g., vehicle object detection, drivable area detection, lane line detection) remain challenging in low-cost, high-precision traffic scene applications. In recent years, methods based on multi-task learning have shown excellent performance on traffic scene perception problems, providing high-precision and high-efficiency solutions. Object detection plays an important role in providing the position and size of traffic obstacles, helping autonomous vehicles and road monitoring personnel make accurate and timely decisions; drivable area detection and lane line segmentation provide rich information for route planning and driving safety. Research on traffic target detection, drivable area detection and lane line detection tasks is therefore critical.
Each of the three tasks has its own representative networks: for object detection, networks including but not limited to the SSD, R-CNN and YOLO series; for drivable area detection, networks such as ENet and PSPNet; for lane line segmentation, networks such as SAD-ENet and SCNN. Although these networks individually handle traffic target detection, drivable area detection and lane line segmentation well, passing images sequentially through three cascaded networks introduces considerable latency and prolongs task processing time.
The invention patent application No. CN202211141578.X provides a multi-task panoramic driving perception method and system based on improved YOLOv5. First, the images in the dataset are preprocessed to obtain input images; the backbone network of the improved YOLOv5 then extracts features from the input image to obtain a feature map, the backbone being obtained by replacing the C3 module in the YOLOv5 backbone with an inverted residual bottleneck module. The feature map is input into the neck network and fused with the feature map obtained from the backbone; the fused feature map is input to a detection head for traffic target detection, while the neck network feature map is input to a branch network for lane line detection and drivable area segmentation.
However, that traffic scene detection algorithm is based on a plain convolutional neural network and lacks an attention mechanism, so the network cannot focus on the important parts of the input. Moreover, conventional neural networks are built on small convolution kernels, which have small receptive fields and cannot capture the global information of an object, so the algorithm performs poorly.
Disclosure of Invention
The invention aims to further improve the precision of traffic scene multi-task perception models, and provides a multi-task traffic scene detection algorithm based on an attention mechanism.
The invention adopts the following technical scheme to realize the aim:
A multi-task traffic scene detection algorithm based on an attention mechanism comprises a shared encoder and three decoders; the shared encoder consists of a backbone network and a neck network; the three decoders respectively complete the traffic target detection, drivable area detection and lane line detection tasks;
the backbone network extracts features from the input image and comprises a convolution module, a feature extraction module and a downsampling module. The convolution module consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function. The feature extraction module fuses a large-kernel attention mechanism with the ELAN structure and is used to build the backbone for feature extraction. The downsampling module adds a downsampling layer alongside the convolution module to form two branches, whose outputs are finally fused by concatenation along the channel dimension; the output feature map has twice the channels of the input and 1/2 its height and width;
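The channel and spatial arithmetic of the two-branch downsampling module can be sketched as plain shape bookkeeping (a minimal illustration of the stated rule, not the module itself; the assumption that each branch keeps the input channel count follows from the 2x-channels output):

```python
def mp_downsample_shape(c, h, w):
    """Shape rule of the two-branch downsampling module: each branch
    halves height and width, and concatenating the two branches along
    the channel axis doubles the channel count."""
    branch_c = c  # assumed: each branch preserves the input channel count
    return branch_c * 2, h // 2, w // 2

# A 128-channel 160x160 feature map becomes 256 channels at 80x80.
print(mp_downsample_shape(128, 160, 160))  # (256, 80, 80)
```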
the neck network comprises a cross-stage spatial pyramid pooling module, a feature pyramid network and a path aggregation network. The cross-stage spatial pyramid pooling module expands the receptive field, fuses information from feature maps of different scales and completes feature fusion. During feature map propagation, deep feature maps carry strong semantic features but weak position information, while shallow feature maps carry strong position information but weak semantic features; the feature pyramid network transmits deep semantic features to the shallow layers, enhancing semantic expression at multiple scales, while the path aggregation network transmits shallow position information to the deep layers, enhancing localization capability at multiple scales;
the specific process of the detection algorithm is as follows: the network input is a 640×640×3 RGB picture. The picture first enters the convolution modules for feature transfer; the 2nd and 4th convolution modules each reduce the height and width of the feature map by 1/2, so the output feature map is 1/4 the size of the input. The feature map then enters the feature extraction modules and downsampling modules for feature extraction; after three downsampling operations its height and width are reduced from 1/4 to 1/32 of the original image. The extracted feature maps are then sent to the neck network for multi-scale feature fusion. The traffic target detection branch transmits the feature maps to three traffic target detection heads of different sizes, finally outputting three feature maps of sizes (W/8, H/8, 256), (W/16, H/16, 512) and (W/32, H/32, 1024). The inputs of the drivable area detection branch and the lane line detection branch both have size (W/8, H/8, 128). The drivable area detection branch comprises a BottleneckCSP module for feature extraction and three upsampling modules; after information transfer through the branch, the output feature map is restored to size (W, H, 2). The lane line detection branch has the same structure, except that its input first passes through the segmentation enhancement module, so that richer semantic information is available for the subsequent lane line segmentation.
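The shape trace described above can be verified with simple arithmetic (a sketch of the stated pipeline, assuming a 640×640 input):

```python
def backbone_spatial_scale():
    """Total spatial reduction of the backbone: two stride-2
    convolution modules (the 2nd and 4th) give 1/4, then three
    further downsampling stages give 1/32."""
    scale = 1
    for _ in range(2):   # 2nd and 4th convolution modules
        scale *= 2
    for _ in range(3):   # three downsampling modules
        scale *= 2
    return scale

def detection_head_shapes(w, h):
    """Output sizes of the three traffic target detection heads."""
    return [(w // 8, h // 8, 256), (w // 16, h // 16, 512), (w // 32, h // 32, 1024)]

print(backbone_spatial_scale())          # 32, i.e. 1/32 of the original image
print(detection_head_shapes(640, 640))   # [(80, 80, 256), (40, 40, 512), (20, 20, 1024)]
```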
A YOLOv7 backbone network is selected as the basic network structure, and on this basis the original ELAN structure is replaced by the feature extraction module to construct an improved YOLOv7 backbone network.
The ELAN structure is an efficient layer aggregation network that improves the network's learning capability without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups. Because the LKA mechanism combines the advantage of self-attention in modeling long-range dependencies with the advantage of convolution in exploiting local context information, it is fused with the ELAN structure in YOLOv7 to form the feature extraction module;
the feature extraction Module comprises four convolution modules and two LKA-Module layers, wherein the input sequentially passes through the two convolution modules, the two convolution modules and the LKA-Module layer cascade structure, feature graphs with the number of channels o=i/2 are respectively output, wherein o is the number of output channels, i is the number of input channels, and finally dimension addition is carried out on the output feature graphs.
The feature extraction module takes two forms. In one form, the number of output channels of the first two convolution modules is 1/2 of the number of input channels, and the last two convolution modules keep the channel count unchanged; in the other form, the number of output channels of the first two convolution modules is 1/4 of the number of input channels, with the last two again unchanged.
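The channel bookkeeping of the two variants can be sketched as follows. The assumption that four branch outputs are concatenated (as in YOLOv7's ELAN) is an illustration, not stated explicitly in the text:

```python
def lka_elan_channels(i, form=1):
    """Per-branch and concatenated channel counts for the two
    feature extraction module variants: form 1 reduces the input
    channels i by 1/2 in the first two convolution modules,
    form 2 by 1/4; later modules keep channel counts unchanged.
    Concatenation of four branches is an assumed ELAN-style layout."""
    reduce = 2 if form == 1 else 4
    branch = i // reduce
    return branch, 4 * branch  # (channels per branch, after concatenation)

print(lka_elan_channels(256, form=1))  # (128, 512)
print(lka_elan_channels(256, form=2))  # (64, 256)
```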
The LKA-Module layer comprises a BatchNorm batch normalization layer together with the attention module and feed-forward neural network module of the Transformer structure; the attention module and feed-forward neural network module perform feature extraction in cascade. To prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then enters the attention module and feed-forward neural network module;
the attention module consists of a 1×1 convolution, a GELU activation function and the LKA module, where the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the feed-forward neural network module consists of a 1×1 ordinary convolution, a 3×3 depthwise dilated convolution and a GELU activation function, where the dilation rate of the depthwise dilated convolution is equal to 3.
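The effect of dilation on the receptive field can be checked with the standard effective-kernel formula (a sketch of why a 3×3 kernel with dilation 3 behaves like a much larger one):

```python
def effective_kernel(k, d):
    """Effective receptive field of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

# The FFN's 3x3 depthwise convolution with dilation 3 covers a 7x7 window
# while keeping only 3x3 = 9 weights per channel.
print(effective_kernel(3, 3))  # 7
```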
Convolution kernels of different sizes (7×7, 11×11 and 21×21) are added to the segmentation enhancement module for multi-scale information interaction; meanwhile, each K×K convolution kernel is decomposed into a 1×K horizontal kernel and a K×1 vertical kernel, further reducing the computational complexity.
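The cost reduction of the stripe decomposition is simple to quantify: a K×K kernel has K² weights per channel, while the 1×K plus K×1 pair has only 2K (a sketch of the complexity argument, counting weights per channel only):

```python
def dense_kernel_weights(k):
    """Weights per channel of a full K x K kernel."""
    return k * k

def striped_kernel_weights(k):
    """Weights per channel of the 1 x K plus K x 1 decomposition."""
    return 2 * k

for k in (7, 11, 21):
    print(k, dense_kernel_weights(k), striped_kernel_weights(k))
# For K = 21, 441 weights shrink to 42 per channel.
```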
A gating mechanism is added to the segmentation enhancement module so that the model selectively learns important information features by recalibrating the weights of different channels. In terms of data flow, the input feature map first passes through a 1×1 ordinary convolution, then through 7×7, 11×11 and 21×21 depthwise convolutions in parallel to learn multi-scale information features; the outputs are summed element-wise with the original input feature map to obtain a new output feature map, which is then multiplied by the per-channel weights obtained through global average pooling, adding an attention mechanism that selectively emphasizes important channels. A BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting during training.
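The channel-gating step described above can be sketched in NumPy. The sigmoid squashing of the pooled weights is an assumption for illustration; the text only states that the feature map is multiplied by the pooled channel weights:

```python
import numpy as np

def channel_gate(x):
    """Gating step of the segmentation enhancement module (sketch).
    Per-channel weights from global average pooling recalibrate a
    (C, H, W) feature map so informative channels are emphasised."""
    weights = x.mean(axis=(1, 2))              # global average pooling -> (C,)
    weights = 1.0 / (1.0 + np.exp(-weights))   # assumed squashing to (0, 1)
    return x * weights[:, None, None]          # rescale each channel

x = np.ones((4, 8, 8))
y = channel_gate(x)
print(y.shape)  # (4, 8, 8)
```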
The beneficial effects of the invention are as follows: the invention fuses, for the first time, a large convolution kernel attention mechanism with the ELAN structure proposed in YOLOv7 to form a brand-new backbone network, and combines the large convolution kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module for the lane line segmentation task; this module is inserted after the backbone network and before the lane line segmentation head, further improving the accuracy of the lane line detection task.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the algorithm of the present invention;
FIG. 2 is a flow chart of the detection in the present invention;
FIG. 3 is a schematic diagram of one form of an LKA-ELAN module according to the present invention;
FIG. 4 is a schematic diagram of another form of an LKA-ELAN module according to the present invention;
FIG. 5 is a schematic diagram of an LKA-Module structure according to the present invention;
FIG. 6 is a schematic diagram of a SegMod module structure according to the present invention;
the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are provided to illustrate the invention and are not to be construed as limiting its scope. The invention is more particularly described by way of example in the following paragraphs with reference to the drawings, and its advantages and features will become more apparent from the description. It should be noted that the drawings are in a very simplified form and are not to precise scale, serving merely to aid in conveniently and clearly describing embodiments of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
Currently, many researchers have designed multi-task learning networks (MultiNet, DLT-Net, YOLOP) that adopt an encoder-decoder architecture in which the decoders of the three detection tasks share the same encoder. Marvin Teichmann et al. introduced multi-task learning into the field of traffic scene perception for the first time with the MultiNet network: a VGG16 backbone serves as the encoder, the feature maps generated by the encoder are fused, and the result is sent to a Classification Decoder, a Detection Decoder and a Segmentation Decoder to complete the three tasks of target classification, target detection and lane line detection. Qian et al., in the DLT-Net network, first fixed the detection tasks as traffic target detection, drivable area detection and lane line detection, and provided a context tensor that shares the information of the Drivable Area Decoder with the Traffic Object Decoder and the Lane Line Decoder, significantly improving overall performance without increasing computational overhead. Wu et al. first introduced the YOLO series into a multi-task algorithm, using YOLOv5 as the main structure to complete the target detection task and an ENet decoding structure to obtain the feature maps for lane line detection and drivable area detection, thereby achieving a lightweight, portable model and bringing multi-task learning in traffic scene perception into the YOLO era. Although many excellent networks have been proposed, the detection accuracy and other indexes of these algorithms still leave room for improvement.
In order to further improve the precision of the traffic scene multi-task perception model, after in-depth study the invention provides a multi-task traffic scene detection algorithm based on an attention mechanism, comprising a shared encoder and three decoders; the shared encoder consists of a backbone network (Backbone) and a neck network (Neck); the three decoders respectively complete the traffic target detection, drivable area detection and lane line detection tasks;
the backbone network extracts features from the input image and comprises a convolution module (CBS), a feature extraction module (LKA-ELAN) and a downsampling module (MP). The CBS consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function. The LKA-ELAN fuses a large-kernel attention mechanism (Large Kernel Attention, LKA) with the ELAN structure and is used to build the backbone for feature extraction. The MP adds a downsampling layer (MaxPooling) alongside the CBS to form two branches, whose outputs are finally fused by concatenation along the channel dimension; the output feature map has twice the channels of the input and 1/2 its height and width;
the neck network includes a cross-stage spatial pyramid pooling module (Spatial Pyramid Pooling Cross Stage Partial, SPPCSP), a feature pyramid network (Feature Pyramid Network, FPN) and a path aggregation network (Path Aggregation Network, PAN);
the SPPCSP expands the receptive field, fuses information from feature maps of different scales and completes feature fusion. During feature map transmission, deep feature maps carry stronger semantic features and weaker position information, while shallow feature maps carry stronger position information and weaker semantic features; the FPN transmits deep semantic features to the shallow layers, enhancing semantic expression at multiple scales, while the PAN transmits shallow position information to the deep layers, enhancing localization capability at multiple scales;
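The top-down FPN fusion step can be sketched in NumPy (a minimal illustration assuming channel counts already match; real FPNs use 1×1 convolutions for that, and learned convolutions replace the plain addition shown here):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(deep, shallow):
    """FPN-style fusion: deep semantic features are upsampled and
    added to the shallower, finer-resolution map."""
    return shallow + upsample2x(deep)

deep = np.ones((8, 10, 10))      # coarse grid, strong semantics
shallow = np.zeros((8, 20, 20))  # fine grid, strong localization
fused = fpn_top_down(deep, shallow)
print(fused.shape)  # (8, 20, 20)
```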
the specific process of the detection algorithm is as shown in fig. 1 and fig. 2, the input of the network is 640 x 3 RGB pictures, firstly, the CBS is entered for feature transfer, the length and width of the 2 nd and 4 th CBS feature maps are respectively reduced by 1/2, the length and width of the output feature maps are 1/4 of the input, the feature maps enter LKA-ELAN and MP for feature extraction, the length and width of the output feature maps are reduced from 1/4 of the original image to 1/32 of the original image after three downsampling, then the extracted feature maps are sent to the neg for multi-scale feature fusion, the traffic target detection module transmits the feature maps to three traffic target detection heads with different sizes, finally, the three feature maps with the sizes of (W/8,H/8,256), (W/16, h/16,512), (W/32, h/32,1024) are respectively output, the input sizes of the drivable area detection module and the lane line detection module are (W/8,H/8,128), the drivable area detection module comprises a bolleckkcsp module for feature extraction and three MPs, the input information is transmitted to the network after the large/small input module is the same as the large/small input area detection module (W/8,H), and the traffic area detection module is subjected to the subsequent semantic information is subjected to the enhancement of the network structure.
A YOLOv7 backbone network is selected as the basic network structure, and on this basis the original ELAN structure is replaced with LKA-ELAN to construct an improved YOLOv7 backbone network.
The ELAN structure is an efficient layer aggregation network that improves the network's learning capability without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups. Because the LKA mechanism combines the advantage of self-attention in modeling long-range dependencies with the advantage of convolution in exploiting local context information, it is fused with the ELAN structure in YOLOv7 to form the LKA-ELAN;
the LKA-ELAN comprises four CBS layers and two LKA-Module layers. The input passes in sequence through two CBS layers and then through a cascade of two CBS layers and an LKA-Module layer; each branch outputs a feature map with o = i/2 channels, where o is the number of output channels (Output Channels) and i the number of input channels (Input Channels), and the output feature maps are finally concatenated along the channel dimension;
the LKA-ELAN only aggregates all the previous layers in the last layer of the structure, and the structure not only inherits the advantages of representing multiple characteristics by the multiple receptive fields of DenseNet, but also solves the problem of low dense connection efficiency, and meanwhile, compared with VoVNet, the structure adds a large-core attention mechanism, so that the network performance can be further improved.
The feature extraction module (LKA-ELAN) takes two forms. In one form, the number of output channels of the first two convolution modules (CBS) is 1/2 of the number of input channels, and the last two convolution modules (CBS) keep the channel count unchanged, as shown in fig. 3; in the other form, the number of output channels of the first two convolution modules (CBS) is 1/4 of the number of input channels, with the last two again unchanged, as shown in fig. 4.
As shown in fig. 5, similar to the DETR and VAN algorithms, the LKA-Module layer contains a batch normalization layer together with the Attention module (Attention) and the feed-forward neural network module (FeedForwardNetwork, FFN) of the Transformer structure; the Attention module and the FFN perform feature extraction in cascade. To prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then fed into the Attention module and the FFN;
the Attention module is composed of a 1*1 convolution, a GELU activation function and an LKA module, where the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the FFN consists of a 1*1 ordinary convolution, a 3*3 depthwise dilated convolution and a GELU activation function, where the dilation rate (d) of the depthwise dilated convolution is equal to 3.
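The span covered by the FFN's dilated convolution can be checked with the standard effective-kernel formula k_eff = (k - 1)·d + 1; a 3×3 kernel with dilation 3 covers the same window as an ordinary 7×7 kernel (a minimal arithmetic sketch, not code from the invention):

```python
def effective_kernel(k, d):
    """Effective (receptive-field) size of a k x k convolution with dilation d."""
    return (k - 1) * d + 1

# The FFN's 3x3 depthwise convolution with dilation rate 3 spans a 7x7 window,
# while using only 9 weights per channel instead of 49.
```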
The attention mechanism can be seen as an adaptive selection process that selects discriminative features from the input and automatically ignores noise responses; its key step is the generation of attention feature maps, which represent the importance of the different parts of the input.
Currently, there are two methods to learn the relationship between different features.
The first is to employ a self-attention mechanism to capture long-range dependencies. Although self-attention is very effective in natural language processing, it still has three drawbacks when handling computer vision tasks: 1) it treats images as one-dimensional sequences and ignores their two-dimensional structure; 2) its computational complexity grows quadratically with the input resolution, making high-resolution images expensive to process; 3) it achieves only spatial adaptability and ignores adaptability in the channel dimension.
The second, which is the method adopted by the present invention, constructs the attention feature map with large convolution kernels. As shown in fig. 4, directly adding large convolution kernels (17×17, 21×21, etc.) to the network would increase the computation amount and hinder increasing the model depth, so in the LKA module the K×K convolution kernel is replaced by a (2d-1)×(2d-1) depthwise convolution, a (K/d)×(K/d) depthwise dilated convolution, and a 1*1 ordinary convolution, where both the depthwise convolution and the depthwise dilated convolution are grouped convolutions with the number of groups equal to the number of input channels. This operation enlarges the receptive field while reducing the parameter count, so more global features can be obtained; the input is then multiplied element-wise by the output of this large-kernel processing to add the attention mechanism, so that the input features can be selectively learned.
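The parameter saving of this decomposition can be verified with a small calculation. For K = 21 and d = 3 it yields a (2d-1)×(2d-1) = 5×5 depthwise convolution, a ⌈K/d⌉×⌈K/d⌉ = 7×7 depthwise dilated convolution (dilation d), and a 1*1 pointwise convolution; the channel count below is illustrative (a sketch of the counting, not the patented code):

```python
import math

def lka_params(channels, k, d):
    """Weights of the LKA decomposition of a K x K kernel (bias terms ignored)."""
    dw = channels * (2 * d - 1) ** 2               # depthwise convolution
    dw_dilated = channels * math.ceil(k / d) ** 2  # depthwise dilated convolution
    pointwise = channels * channels                # 1x1 conv mixes channels
    return dw + dw_dilated + pointwise

def standard_conv_params(channels, k):
    """Weights of an ordinary K x K convolution with equal in/out channels."""
    return channels * channels * k * k

def lka_receptive_field(k, d):
    """Receptive field of the stacked depthwise + dilated pair; covers at least K."""
    return (2 * d - 1) + (math.ceil(k / d) - 1) * d

# For 64 channels, K = 21, d = 3: the ordinary 21x21 convolution needs
# 1,806,336 weights, the decomposition only 8,832, and the stacked pair
# still spans a 23x23 window, covering the 21x21 target.
```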
In the multi-task traffic scene detection algorithm, two detection tasks involve segmentation: the drivable region detection task and the lane line detection task. Although both downstream segmentation tasks improve after the backbone is replaced with the large-kernel attention backbone network, the improvement in lane line detection accuracy is small; therefore a segmentation enhancement module comprising a large convolution kernel and a multi-scale information interaction mechanism is proposed to improve the lane line segmentation effect.
Comparison of several classical semantic segmentation models (DeepLabV3+, SETR, SegNeXt) shows that a successful semantic segmentation model should, first, be equipped with a strong backbone network; since the backbone network is shared by all detection tasks of the multi-task traffic scene detection algorithm, it cannot be changed solely to improve lane line segmentation performance. Second, the model should support multi-scale information interaction: unlike image classification, which mainly identifies a single object, semantic segmentation is a dense prediction task that must handle detection objects of different sizes within a single image; therefore convolution kernels of three different sizes, 7*7, 11×11 and 21×21, are added to the segmentation enhancement module for multi-scale information interaction, and each K×K convolution kernel is decomposed into a 1×K transverse convolution kernel and a K×1 longitudinal convolution kernel to further reduce computational complexity. Third, the model should include an attention mechanism to better select input features.
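The saving from the transverse/longitudinal decomposition is easy to quantify: a K×K depthwise kernel costs K² weights per channel, while the 1×K plus K×1 pair costs only 2K (a minimal arithmetic sketch; per-channel counting, bias terms ignored):

```python
def square_weights(k):
    """Per-channel weights of a full K x K depthwise kernel."""
    return k * k

def stripe_weights(k):
    """Per-channel weights of a 1 x K plus K x 1 depthwise kernel pair."""
    return 2 * k

# For K = 21, the 441 weights of the square kernel shrink to 42,
# roughly a 10x reduction per channel.
```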
Similar to SENet, a gating mechanism is added to the segmentation enhancement module, allowing the model to selectively learn important information features by recalibrating the weights of the different channels. As shown in fig. 6, from the data-flow perspective, the input feature map first passes through a 1*1 ordinary convolution, then through the 7*7, 11×11 and 21×21 depthwise convolutions respectively to learn multi-scale information features; the outputs are numerically added to the original input feature map to obtain a new output feature map. To add the attention mechanism, this feature map is multiplied by the per-channel weights obtained through Global Average Pooling (GAP), achieving the effect of selectively learning important channels. During training, a BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting.
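The gating step described above, one GAP weight per channel rescaling that channel, can be sketched in plain Python on a tiny C×H×W feature map; this is an illustrative sketch of the recalibration idea only (a real SE-style gate would additionally pass the pooled vector through learned layers and a sigmoid, omitted here):

```python
def gap_gate(feature_map):
    """Rescale each channel of a C x H x W feature map (nested lists)
    by its global-average-pooled value, as in a simple channel gate."""
    gated = []
    for channel in feature_map:
        # global average pooling: one scalar weight per channel
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        weight = total / count
        # recalibrate: multiply every activation by the channel weight
        gated.append([[v * weight for v in row] for row in channel])
    return gated

# Channels with large mean activations are amplified relative to weak ones,
# which is the "selectively learn important channels" effect described above.
```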
The invention fuses a large convolution kernel attention mechanism with the ELAN structure proposed in YOLOv7 for the first time, serving as a brand-new backbone network.
Meanwhile, aiming at the lane line segmentation task, the invention combines the large-kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module, which is added after the backbone network and before the lane line segmentation head to further improve the accuracy of the lane line detection task.
While the invention has been described above with reference to the accompanying drawings, it will be apparent that the invention is not limited to the above embodiments; various modifications made using the method concepts and technical solutions of the invention, or their direct application to other fields without modification, all fall within the scope of the invention.
Claims (7)
1. A multi-task traffic scene detection algorithm based on an attention mechanism is characterized in that,
comprises a shared encoder and three decoders; the shared encoder consists of a backbone network and a neck network; the three decoders respectively complete the traffic target detection, drivable region detection and lane line detection tasks;
the backbone network is used to extract features of an input image and comprises a convolution module, a feature extraction module and a downsampling module; the convolution module consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function; the feature extraction module fuses a large-kernel attention mechanism with the ELAN structure and builds the backbone network for feature extraction; the downsampling module adds a downsampling layer on the basis of the convolution module to form two branches and finally performs feature fusion through dimension addition, so that the number of channels of the output feature map is twice that of the input feature map while its length and width are 1/2 of the input;
the neck network comprises a cross-stage spatial pyramid pooling module, a feature pyramid network and a path aggregation network; the cross-stage spatial pyramid pooling module is used to expand the receptive field and fuse information from feature maps of different scales to complete feature fusion; during feature map transmission, deep feature maps carry strong semantic features and weak position information while shallow feature maps carry strong position information and weak semantic features, so the feature pyramid network transmits deep semantic features to the shallow layers to enhance semantic expression at multiple scales, and the path aggregation network transmits shallow position information to the deep layers to enhance localization capability at multiple scales;
the specific process of the detection algorithm is as follows: the input of the network is a 640×640×3 RGB picture, which first enters the convolution modules for feature transfer; the 2nd and 4th convolution modules each reduce the length and width of the feature map by 1/2, so the output feature map is 1/4 of the input in length and width; the feature map then enters the feature extraction modules and downsampling modules for feature extraction, and after three downsamplings its length and width are reduced from 1/4 to 1/32 of the original image; the extracted feature maps are then sent to the neck network for multi-scale feature fusion; the traffic target detection module transmits the fused feature maps to three traffic target detection heads of different sizes, finally outputting three feature maps of sizes (W/8, H/8, 256), (W/16, H/16, 512) and (W/32, H/32, 1024) respectively; the input size of both the drivable region detection module and the lane line detection module is (W/8, H/8, 128); the drivable region detection module comprises a BottleneckCSP module and three upsampling modules for feature extraction, and after feature transfer the obtained output feature map is restored to the same size (W, H) as the input image, on which segmentation is performed to complete the drivable region detection task; the lane line detection module completes the lane line detection task in the same manner.
2. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 1, wherein,
the backbone network of YOLOv7 is selected as the basic network structure, and on this basis the feature extraction module replaces the original ELAN structure to construct an improved YOLOv7 backbone network.
3. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 2, wherein,
the ELAN structure is an efficient layer aggregation network that improves the network's learning capacity without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups; because the LKA mechanism combines the advantage of self-attention in modeling long-range dependence with the advantage of convolution in exploiting local context information, the LKA mechanism is fused with the ELAN structure in YOLOv7 to form the feature extraction module;
the feature extraction module comprises four convolution modules and two LKA-Module layers; the input passes sequentially through the cascade structure of two convolution modules, then two convolution modules with LKA-Module layers, each stage outputting a feature map with channel number o = i/2, where o is the number of output channels and i is the number of input channels; finally the output feature maps are subjected to dimension addition.
4. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 3, wherein,
the feature extraction module comprises two forms, wherein one form is that the number of output channels of the first two convolution modules is 1/2 of the number of input channels, and the number of the input channels and the number of the output channels of the second two convolution modules are the same; the other form is that the number of output channels of the first two convolution modules is 1/4 of the number of input channels, and the number of input channels and the number of output channels of the second two convolution modules are the same.
5. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 4, wherein,
the LKA-Module layer comprises a BatchNorm batch normalization layer together with the attention module and the feed-forward neural network module of the Transformer structure, wherein the attention module and the feed-forward neural network module are cascaded to perform feature extraction; to prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then enters the attention module and the feed-forward neural network module;
the attention module consists of a 1*1 convolution, a GELU activation function and an LKA module, wherein the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the feed-forward neural network module is composed of a 1*1 ordinary convolution, a 3*3 depthwise dilated convolution and a GELU activation function, wherein the dilation rate of the depthwise dilated convolution is equal to 3.
6. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 5, wherein,
convolution kernels of three different sizes, 7*7, 11×11 and 21×21, are added to the segmentation enhancement module to perform multi-scale information interaction, and meanwhile each K×K convolution kernel is decomposed into a 1×K transverse convolution kernel and a K×1 longitudinal convolution kernel to further reduce computational complexity.
7. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 6, wherein,
a gating mechanism is added to the segmentation enhancement module so that the model selectively learns important information features by recalibrating the weights of the different channels; from the data-flow perspective, the input feature map of the model first passes through a 1*1 ordinary convolution, then through the 7*7, 11×11 and 21×21 depthwise convolutions respectively to learn multi-scale information features; the outputs are numerically added to the original input feature map to obtain a new output feature map, which, to add the attention mechanism, is multiplied by the per-channel weights obtained through global average pooling, achieving the effect of selectively learning important channels; during training, a BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310696843.9A CN116958910A (en) | 2023-06-13 | 2023-06-13 | Attention mechanism-based multi-task traffic scene detection algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116958910A true CN116958910A (en) | 2023-10-27 |
Family
ID=88445172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310696843.9A Pending CN116958910A (en) | 2023-06-13 | 2023-06-13 | Attention mechanism-based multi-task traffic scene detection algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116958910A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830981A (en) * | 2023-12-19 | 2024-04-05 | 杭州长望智创科技有限公司 | Automatic driving perception system and method based on multitask learning |
CN117934473A (en) * | 2024-03-22 | 2024-04-26 | 成都信息工程大学 | Highway tunnel apparent crack detection method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113486726B (en) | Rail transit obstacle detection method based on improved convolutional neural network | |
Han et al. | Yolopv2: Better, faster, stronger for panoptic driving perception | |
CN116958910A (en) | Attention mechanism-based multi-task traffic scene detection algorithm | |
CN112560656A (en) | Pedestrian multi-target tracking method combining attention machine system and end-to-end training | |
CN112417973A (en) | Unmanned system based on car networking | |
CN113269133A (en) | Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning | |
CN112489072B (en) | Vehicle-mounted video perception information transmission load optimization method and device | |
Mukherjee et al. | Interacting vehicle trajectory prediction with convolutional recurrent neural networks | |
CN114973199A (en) | Rail transit train obstacle detection method based on convolutional neural network | |
CN117292128A (en) | STDC network-based image real-time semantic segmentation method and device | |
CN115527096A (en) | Small target detection method based on improved YOLOv5 | |
CN117593623A (en) | Lightweight vehicle detection method based on improved YOLOv8n model | |
Wang et al. | Object detection algorithm based on improved Yolov3-tiny network in traffic scenes | |
Jing et al. | Lightweight Vehicle Detection Based on Improved Yolox-nano. | |
CN111639563B (en) | Basketball video event and target online detection method based on multitasking | |
Li et al. | An improved lightweight network based on yolov5s for object detection in autonomous driving | |
CN115331460A (en) | Large-scale traffic signal control method and device based on deep reinforcement learning | |
Pan et al. | Video Surveillance Vehicle Detection Method Incorporating Attention Mechanism and YOLOv5 | |
CN112364720A (en) | Method for quickly identifying and counting vehicle types | |
Tong et al. | Robust Depth Estimation Based on Parallax Attention for Aerial Scene Perception | |
Adnan et al. | Traffic congestion prediction using deep convolutional neural networks: A color-coding approach | |
Shao et al. | Research on yolov5 vehicle object detection algorithm based on attention mechanism | |
Guo et al. | An Effective Module CA-HDC for Lane Detection in Complicated Environment | |
CN114913328B (en) | Semantic segmentation and depth prediction method based on Bayes depth multitask learning | |
Wu et al. | Efficiently Learning a Robust Self-Driving Model with Neuron Coverage Aware Adaptive Filter Reuse |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||