CN116958910A - Attention mechanism-based multi-task traffic scene detection algorithm - Google Patents
- Publication number: CN116958910A
- Application number: CN202310696843.9A
- Authority
- CN
- China
- Prior art keywords: module, feature, convolution, network, input
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/54—Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/0499—Feedforward networks
- G06N3/08—Learning methods
- G06V10/26—Segmentation of patterns in the image field
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
- G06V10/764—Recognition using classification, e.g. of video objects
- G06V10/806—Fusion of extracted features
- G06V10/82—Recognition using neural networks
- G06V20/588—Recognition of the road, e.g. of lane markings
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V2201/07—Target detection
- Y02T10/40—Engine management systems
Abstract
The invention relates to an attention mechanism-based multi-task traffic scene detection algorithm comprising a shared encoder and three decoders. The shared encoder consists of a backbone network and a neck network; the three decoders respectively complete the traffic target detection, drivable area detection and lane line detection tasks. The invention fuses, for the first time, a large convolution kernel attention mechanism with the ELAN structure proposed in YOLOv7 to form a brand-new backbone network, and combines the large convolution kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module for the lane line segmentation task. This module is inserted after the backbone network and before the lane line segmentation head, further improving the accuracy of the lane line detection task.
Description
Technical Field
The invention relates to the technical field of multi-task traffic scene detection, in particular to a multi-task traffic scene detection algorithm based on an attention mechanism.
Background
In the past decade, tremendous advances have been made in computer vision and deep learning, yet vision-based tasks (e.g., vehicle object detection, drivable area detection, lane line detection) remain challenging in low-cost, high-precision traffic scene applications. In recent years, methods based on multi-task learning have shown excellent performance on traffic scene perception problems, providing high-precision and high-efficiency solutions. Object detection plays an important role in providing the position and size of traffic obstacles, helping autonomous vehicles and road monitoring personnel make accurate and timely decisions; drivable area detection and lane line segmentation provide rich information for route planning and driving safety. Research on traffic target detection, drivable area detection and lane line detection tasks is therefore critical.
Each of the three tasks has its own representative networks: for object detection, networks including but not limited to the SSD, R-CNN and YOLO series; for drivable area detection, networks such as ENet and PSPNet; for lane line segmentation, networks such as SAD-ENet and SCNN. Although these networks individually handle traffic target detection, drivable area detection and lane line segmentation well, passing images sequentially through three cascaded networks introduces considerable latency and prolongs task processing time.
The invention patent application No. CN202211141578.X provides a multi-task panoramic driving perception method and system based on improved YOLOv5. First, the images in the dataset are preprocessed to obtain input images; the backbone network of the improved YOLOv5 then extracts features from the input image to obtain a feature map, the backbone being obtained by replacing the C3 module in the YOLOv5 backbone with an inverted residual bottleneck module. The feature map is input into the neck network and fused with the feature map obtained from the backbone; the fused feature map is input to a detection head for traffic target detection, while the neck network feature map is input to a branch network for lane line detection and drivable area segmentation.
However, that traffic scene detection algorithm is based on a plain convolutional neural network and lacks an attention mechanism, so the network cannot focus on the important parts of the input. Moreover, conventional neural networks are built on small convolution kernels, which have small receptive fields and cannot capture the global information of an object, so the algorithm performs poorly.
Disclosure of Invention
The invention aims to further improve the precision of traffic scene multi-task perception models, and provides a multi-task traffic scene detection algorithm based on an attention mechanism.
The invention adopts the following technical scheme to realize the aim:
A multi-task traffic scene detection algorithm based on an attention mechanism comprises a shared encoder and three decoders; the shared encoder consists of a backbone network and a neck network; the three decoders respectively complete the traffic target detection, drivable area detection and lane line detection tasks;
the backbone network extracts features from the input image and comprises a convolution module, a feature extraction module and a downsampling module. The convolution module consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function. The feature extraction module fuses a large-kernel attention mechanism with the ELAN structure and is used to build the backbone for feature extraction. The downsampling module adds a downsampling layer alongside the convolution module to form two branches, whose outputs are finally fused by concatenation along the channel dimension; the output feature map has twice the channels of the input and 1/2 its height and width;
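The channel and spatial arithmetic of the two-branch downsampling module can be sketched as plain shape bookkeeping (a minimal illustration of the stated rule, not the module itself; the assumption that each branch keeps the input channel count follows from the 2x-channels output):

```python
def mp_downsample_shape(c, h, w):
    """Shape rule of the two-branch downsampling module: each branch
    halves height and width, and concatenating the two branches along
    the channel axis doubles the channel count."""
    branch_c = c  # assumed: each branch preserves the input channel count
    return branch_c * 2, h // 2, w // 2

# A 128-channel 160x160 feature map becomes 256 channels at 80x80.
print(mp_downsample_shape(128, 160, 160))  # (256, 80, 80)
```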
the neck network comprises a cross-stage spatial pyramid pooling module, a feature pyramid network and a path aggregation network. The cross-stage spatial pyramid pooling module expands the receptive field, fuses information from feature maps of different scales and completes feature fusion. During feature map propagation, deep feature maps carry strong semantic features but weak position information, while shallow feature maps carry strong position information but weak semantic features; the feature pyramid network transmits deep semantic features to the shallow layers, enhancing semantic expression at multiple scales, while the path aggregation network transmits shallow position information to the deep layers, enhancing localization capability at multiple scales;
the specific process of the detection algorithm is as follows: the network input is a 640×640×3 RGB picture. The picture first enters the convolution modules for feature transfer; the 2nd and 4th convolution modules each reduce the height and width of the feature map by 1/2, so the output feature map is 1/4 the size of the input. The feature map then enters the feature extraction modules and downsampling modules for feature extraction; after three downsampling operations its height and width are reduced from 1/4 to 1/32 of the original image. The extracted feature maps are then sent to the neck network for multi-scale feature fusion. The traffic target detection branch transmits the feature maps to three traffic target detection heads of different sizes, finally outputting three feature maps of sizes (W/8, H/8, 256), (W/16, H/16, 512) and (W/32, H/32, 1024). The inputs of the drivable area detection branch and the lane line detection branch both have size (W/8, H/8, 128). The drivable area detection branch comprises a BottleneckCSP module for feature extraction and three upsampling modules; after information transfer through the branch, the output feature map is restored to size (W, H, 2). The lane line detection branch has the same structure, except that its input first passes through the segmentation enhancement module, so that richer semantic information is available for the subsequent lane line segmentation.
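The shape trace described above can be verified with simple arithmetic (a sketch of the stated pipeline, assuming a 640×640 input):

```python
def backbone_spatial_scale():
    """Total spatial reduction of the backbone: two stride-2
    convolution modules (the 2nd and 4th) give 1/4, then three
    further downsampling stages give 1/32."""
    scale = 1
    for _ in range(2):   # 2nd and 4th convolution modules
        scale *= 2
    for _ in range(3):   # three downsampling modules
        scale *= 2
    return scale

def detection_head_shapes(w, h):
    """Output sizes of the three traffic target detection heads."""
    return [(w // 8, h // 8, 256), (w // 16, h // 16, 512), (w // 32, h // 32, 1024)]

print(backbone_spatial_scale())          # 32, i.e. 1/32 of the original image
print(detection_head_shapes(640, 640))   # [(80, 80, 256), (40, 40, 512), (20, 20, 1024)]
```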
A YOLOv7 backbone network is selected as the basic network structure, and on this basis the original ELAN structure is replaced by the feature extraction module to construct an improved YOLOv7 backbone network.
The ELAN structure is an efficient layer aggregation network that improves the network's learning capability without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups. Because the LKA mechanism combines the advantage of self-attention in modeling long-range dependencies with the advantage of convolution in exploiting local context information, it is fused with the ELAN structure in YOLOv7 to form the feature extraction module;
the feature extraction Module comprises four convolution modules and two LKA-Module layers, wherein the input sequentially passes through the two convolution modules, the two convolution modules and the LKA-Module layer cascade structure, feature graphs with the number of channels o=i/2 are respectively output, wherein o is the number of output channels, i is the number of input channels, and finally dimension addition is carried out on the output feature graphs.
The feature extraction module takes two forms. In one form, the number of output channels of the first two convolution modules is 1/2 of the number of input channels, and the last two convolution modules keep the channel count unchanged; in the other form, the number of output channels of the first two convolution modules is 1/4 of the number of input channels, with the last two again unchanged.
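The channel bookkeeping of the two variants can be sketched as follows. The assumption that four branch outputs are concatenated (as in YOLOv7's ELAN) is an illustration, not stated explicitly in the text:

```python
def lka_elan_channels(i, form=1):
    """Per-branch and concatenated channel counts for the two
    feature extraction module variants: form 1 reduces the input
    channels i by 1/2 in the first two convolution modules,
    form 2 by 1/4; later modules keep channel counts unchanged.
    Concatenation of four branches is an assumed ELAN-style layout."""
    reduce = 2 if form == 1 else 4
    branch = i // reduce
    return branch, 4 * branch  # (channels per branch, after concatenation)

print(lka_elan_channels(256, form=1))  # (128, 512)
print(lka_elan_channels(256, form=2))  # (64, 256)
```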
The LKA-Module layer comprises a BatchNorm batch normalization layer together with the attention module and feed-forward neural network module of the Transformer structure; the attention module and feed-forward neural network module perform feature extraction in cascade. To prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then enters the attention module and feed-forward neural network module;
the attention module consists of a 1×1 convolution, a GELU activation function and the LKA module, where the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the feed-forward neural network module consists of a 1×1 ordinary convolution, a 3×3 depthwise dilated convolution and a GELU activation function, where the dilation rate of the depthwise dilated convolution is equal to 3.
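The effect of dilation on the receptive field can be checked with the standard effective-kernel formula (a sketch of why a 3×3 kernel with dilation 3 behaves like a much larger one):

```python
def effective_kernel(k, d):
    """Effective receptive field of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

# The FFN's 3x3 depthwise convolution with dilation 3 covers a 7x7 window
# while keeping only 3x3 = 9 weights per channel.
print(effective_kernel(3, 3))  # 7
```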
Convolution kernels of different sizes (7×7, 11×11 and 21×21) are added to the segmentation enhancement module for multi-scale information interaction; meanwhile, each K×K convolution kernel is decomposed into a 1×K horizontal kernel and a K×1 vertical kernel, further reducing the computational complexity.
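The cost reduction of the stripe decomposition is simple to quantify: a K×K kernel has K² weights per channel, while the 1×K plus K×1 pair has only 2K (a sketch of the complexity argument, counting weights per channel only):

```python
def dense_kernel_weights(k):
    """Weights per channel of a full K x K kernel."""
    return k * k

def striped_kernel_weights(k):
    """Weights per channel of the 1 x K plus K x 1 decomposition."""
    return 2 * k

for k in (7, 11, 21):
    print(k, dense_kernel_weights(k), striped_kernel_weights(k))
# For K = 21, 441 weights shrink to 42 per channel.
```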
A gating mechanism is added to the segmentation enhancement module so that the model selectively learns important information features by recalibrating the weights of different channels. In terms of data flow, the input feature map first passes through a 1×1 ordinary convolution, then through 7×7, 11×11 and 21×21 depthwise convolutions in parallel to learn multi-scale information features; the outputs are summed element-wise with the original input feature map to obtain a new output feature map, which is then multiplied by the per-channel weights obtained through global average pooling, adding an attention mechanism that selectively emphasizes important channels. A BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting during training.
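The channel-gating step described above can be sketched in NumPy. The sigmoid squashing of the pooled weights is an assumption for illustration; the text only states that the feature map is multiplied by the pooled channel weights:

```python
import numpy as np

def channel_gate(x):
    """Gating step of the segmentation enhancement module (sketch).
    Per-channel weights from global average pooling recalibrate a
    (C, H, W) feature map so informative channels are emphasised."""
    weights = x.mean(axis=(1, 2))              # global average pooling -> (C,)
    weights = 1.0 / (1.0 + np.exp(-weights))   # assumed squashing to (0, 1)
    return x * weights[:, None, None]          # rescale each channel

x = np.ones((4, 8, 8))
y = channel_gate(x)
print(y.shape)  # (4, 8, 8)
```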
The beneficial effects of the invention are as follows: the invention fuses, for the first time, a large convolution kernel attention mechanism with the ELAN structure proposed in YOLOv7 to form a brand-new backbone network, and combines the large convolution kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module for the lane line segmentation task; this module is inserted after the backbone network and before the lane line segmentation head, further improving the accuracy of the lane line detection task.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the algorithm of the present invention;
FIG. 2 is a flow chart of the detection in the present invention;
FIG. 3 is a schematic diagram of one form of an LKA-ELAN module according to the present invention;
FIG. 4 is a schematic diagram of another form of an LKA-ELAN module according to the present invention;
FIG. 5 is a schematic diagram of an LKA-Module structure according to the present invention;
FIG. 6 is a schematic diagram of a SegMod module structure according to the present invention;
the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are provided to illustrate the invention and are not to be construed as limiting its scope. The invention is more particularly described by way of example in the following paragraphs with reference to the drawings, and its advantages and features will become more apparent from the description. It should be noted that the drawings are in a very simplified form and are not to precise scale, serving merely to aid in conveniently and clearly describing embodiments of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
Currently, many researchers have designed multi-task learning networks (MultiNet, DLT-Net, YOLOP) that adopt an encoder-decoder architecture in which the decoders of the three detection tasks share the same encoder. Marvin Teichmann et al. introduced multi-task learning into the field of traffic scene perception for the first time with the MultiNet network: a VGG16 backbone serves as the encoder, the feature maps generated by the encoder are fused, and the result is sent to a Classification Decoder, a Detection Decoder and a Segmentation Decoder to complete the three tasks of target classification, target detection and lane line detection. Qian et al., in the DLT-Net network, first fixed the detection tasks as traffic target detection, drivable area detection and lane line detection, and provided a context tensor that shares the information of the Drivable Area Decoder with the Traffic Object Decoder and the Lane Line Decoder, significantly improving overall performance without increasing computational overhead. Wu et al. first introduced the YOLO series into a multi-task algorithm, using YOLOv5 as the main structure to complete the target detection task and an ENet decoding structure to obtain the feature maps for lane line detection and drivable area detection, thereby achieving a lightweight, portable model and bringing multi-task learning in traffic scene perception into the YOLO era. Although many excellent networks have been proposed, the detection accuracy and other indexes of these algorithms still leave room for improvement.
In order to further improve the precision of the traffic scene multi-task perception model, after in-depth study the invention provides a multi-task traffic scene detection algorithm based on an attention mechanism, comprising a shared encoder and three decoders; the shared encoder consists of a backbone network (Backbone) and a neck network (Neck); the three decoders respectively complete the traffic target detection, drivable area detection and lane line detection tasks;
the backbone network extracts features from the input image and comprises a convolution module (CBS), a feature extraction module (LKA-ELAN) and a downsampling module (MP). The CBS consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function. The LKA-ELAN fuses a large-kernel attention mechanism (Large Kernel Attention, LKA) with the ELAN structure and is used to build the backbone for feature extraction. The MP adds a downsampling layer (MaxPooling) alongside the CBS to form two branches, whose outputs are finally fused by concatenation along the channel dimension; the output feature map has twice the channels of the input and 1/2 its height and width;
the neck network includes a cross-stage spatial pyramid pooling module (Spatial Pyramid Pooling Cross Stage Partial, SPPCSP), a feature pyramid network (Feature Pyramid Network, FPN) and a path aggregation network (Path Aggregation Network, PAN);
the SPPCSP expands the receptive field, fuses information from feature maps of different scales and completes feature fusion. During feature map transmission, deep feature maps carry stronger semantic features and weaker position information, while shallow feature maps carry stronger position information and weaker semantic features; the FPN transmits deep semantic features to the shallow layers, enhancing semantic expression at multiple scales, while the PAN transmits shallow position information to the deep layers, enhancing localization capability at multiple scales;
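The top-down FPN fusion step can be sketched in NumPy (a minimal illustration assuming channel counts already match; real FPNs use 1×1 convolutions for that, and learned convolutions replace the plain addition shown here):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(deep, shallow):
    """FPN-style fusion: deep semantic features are upsampled and
    added to the shallower, finer-resolution map."""
    return shallow + upsample2x(deep)

deep = np.ones((8, 10, 10))      # coarse grid, strong semantics
shallow = np.zeros((8, 20, 20))  # fine grid, strong localization
fused = fpn_top_down(deep, shallow)
print(fused.shape)  # (8, 20, 20)
```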
the specific process of the detection algorithm is as shown in fig. 1 and fig. 2, the input of the network is 640 x 3 RGB pictures, firstly, the CBS is entered for feature transfer, the length and width of the 2 nd and 4 th CBS feature maps are respectively reduced by 1/2, the length and width of the output feature maps are 1/4 of the input, the feature maps enter LKA-ELAN and MP for feature extraction, the length and width of the output feature maps are reduced from 1/4 of the original image to 1/32 of the original image after three downsampling, then the extracted feature maps are sent to the neg for multi-scale feature fusion, the traffic target detection module transmits the feature maps to three traffic target detection heads with different sizes, finally, the three feature maps with the sizes of (W/8,H/8,256), (W/16, h/16,512), (W/32, h/32,1024) are respectively output, the input sizes of the drivable area detection module and the lane line detection module are (W/8,H/8,128), the drivable area detection module comprises a bolleckkcsp module for feature extraction and three MPs, the input information is transmitted to the network after the large/small input module is the same as the large/small input area detection module (W/8,H), and the traffic area detection module is subjected to the subsequent semantic information is subjected to the enhancement of the network structure.
A YOLOv7 backbone network is selected as the basic network structure, and on this basis the original ELAN structure is replaced with LKA-ELAN to construct an improved YOLOv7 backbone network.
The ELAN structure is an efficient layer aggregation network that improves the network's learning capability without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups. Because the LKA mechanism combines the advantage of self-attention in modeling long-range dependencies with the advantage of convolution in exploiting local context information, it is fused with the ELAN structure in YOLOv7 to form the LKA-ELAN;
the LKA-ELAN comprises four CBS layers and two LKA-Module layers. The input passes in sequence through two CBS layers and then through a cascade of two CBS layers and an LKA-Module layer; each branch outputs a feature map with o = i/2 channels, where o is the number of output channels (Output Channels) and i the number of input channels (Input Channels), and the output feature maps are finally concatenated along the channel dimension;
the LKA-ELAN only aggregates all the previous layers in the last layer of the structure, and the structure not only inherits the advantages of representing multiple characteristics by the multiple receptive fields of DenseNet, but also solves the problem of low dense connection efficiency, and meanwhile, compared with VoVNet, the structure adds a large-core attention mechanism, so that the network performance can be further improved.
The feature extraction module (LKA-ELAN) takes two forms. In one form, the number of output channels of the first two convolution modules (CBS) is 1/2 of the number of input channels, and the last two convolution modules (CBS) keep the channel count unchanged, as shown in fig. 3; in the other form, the number of output channels of the first two convolution modules (CBS) is 1/4 of the number of input channels, with the last two again unchanged, as shown in fig. 4.
As shown in fig. 5, similar to the DETR and VAN algorithms, the LKA-Module layer contains a batch normalization layer together with the Attention module (Attention) and the feed-forward neural network module (FeedForwardNetwork, FFN) of the Transformer structure; the Attention module and the FFN perform feature extraction in cascade. To prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then fed into the Attention module and the FFN;
the Attention module is composed of a 1*1 convolution, a GELU activation function and an LKA module, where the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the FFN consists of a 1*1 ordinary convolution, a 3*3 depthwise dilated convolution and a GELU activation function, where the dilation rate (d) of the depthwise dilated convolution is equal to 3.
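The span covered by the FFN's dilated convolution can be checked with the standard effective-kernel formula k_eff = (k - 1)·d + 1; a 3×3 kernel with dilation 3 covers the same window as an ordinary 7×7 kernel (a minimal arithmetic sketch, not code from the invention):

```python
def effective_kernel(k, d):
    """Effective (receptive-field) size of a k x k convolution with dilation d."""
    return (k - 1) * d + 1

# The FFN's 3x3 depthwise convolution with dilation rate 3 spans a 7x7 window,
# while using only 9 weights per channel instead of 49.
```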
The attention mechanism can be seen as an adaptive selection process that selects discriminative features from the input and automatically ignores noise responses; its key step is the generation of attention feature maps, which represent the importance of the different parts of the input.
Currently, there are two methods to learn the relationship between different features.
The first is to employ a self-attention mechanism to capture long-range dependencies. Although self-attention is very effective in natural language processing, it still has three drawbacks when handling computer vision tasks: 1) it treats images as one-dimensional sequences and ignores their two-dimensional structure; 2) its computational complexity grows quadratically with the input resolution, making high-resolution images expensive to process; 3) it achieves only spatial adaptability and ignores adaptability in the channel dimension.
The second, which is the method adopted by the present invention, constructs the attention feature map with large convolution kernels. As shown in fig. 4, directly adding large convolution kernels (17×17, 21×21, etc.) to the network would increase the computation amount and hinder increasing the model depth, so in the LKA module the K×K convolution kernel is replaced by a (2d-1)×(2d-1) depthwise convolution, a (K/d)×(K/d) depthwise dilated convolution, and a 1*1 ordinary convolution, where both the depthwise convolution and the depthwise dilated convolution are grouped convolutions with the number of groups equal to the number of input channels. This operation enlarges the receptive field while reducing the parameter count, so more global features can be obtained; the input is then multiplied element-wise by the output of this large-kernel processing to add the attention mechanism, so that the input features can be selectively learned.
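The parameter saving of this decomposition can be verified with a small calculation. For K = 21 and d = 3 it yields a (2d-1)×(2d-1) = 5×5 depthwise convolution, a ⌈K/d⌉×⌈K/d⌉ = 7×7 depthwise dilated convolution (dilation d), and a 1*1 pointwise convolution; the channel count below is illustrative (a sketch of the counting, not the patented code):

```python
import math

def lka_params(channels, k, d):
    """Weights of the LKA decomposition of a K x K kernel (bias terms ignored)."""
    dw = channels * (2 * d - 1) ** 2               # depthwise convolution
    dw_dilated = channels * math.ceil(k / d) ** 2  # depthwise dilated convolution
    pointwise = channels * channels                # 1x1 conv mixes channels
    return dw + dw_dilated + pointwise

def standard_conv_params(channels, k):
    """Weights of an ordinary K x K convolution with equal in/out channels."""
    return channels * channels * k * k

def lka_receptive_field(k, d):
    """Receptive field of the stacked depthwise + dilated pair; covers at least K."""
    return (2 * d - 1) + (math.ceil(k / d) - 1) * d

# For 64 channels, K = 21, d = 3: the ordinary 21x21 convolution needs
# 1,806,336 weights, the decomposition only 8,832, and the stacked pair
# still spans a 23x23 window, covering the 21x21 target.
```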
In the multi-task traffic scene detection algorithm, two detection tasks involve segmentation: the drivable region detection task and the lane line detection task. Although both downstream segmentation tasks improve after the backbone is replaced with the large-kernel attention backbone network, the improvement in lane line detection accuracy is small; therefore a segmentation enhancement module comprising a large convolution kernel and a multi-scale information interaction mechanism is proposed to improve the lane line segmentation effect.
Comparison of several classical semantic segmentation models (DeepLabV3+, SETR, SegNeXt) shows that a successful semantic segmentation model should, first, be equipped with a strong backbone network; since the backbone network is shared by all detection tasks of the multi-task traffic scene detection algorithm, it cannot be changed solely to improve lane line segmentation performance. Second, the model should support multi-scale information interaction: unlike image classification, which mainly identifies a single object, semantic segmentation is a dense prediction task that must handle detection objects of different sizes within a single image; therefore convolution kernels of three different sizes, 7*7, 11×11 and 21×21, are added to the segmentation enhancement module for multi-scale information interaction, and each K×K convolution kernel is decomposed into a 1×K transverse convolution kernel and a K×1 longitudinal convolution kernel to further reduce computational complexity. Third, the model should include an attention mechanism to better select input features.
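The saving from the transverse/longitudinal decomposition is easy to quantify: a K×K depthwise kernel costs K² weights per channel, while the 1×K plus K×1 pair costs only 2K (a minimal arithmetic sketch; per-channel counting, bias terms ignored):

```python
def square_weights(k):
    """Per-channel weights of a full K x K depthwise kernel."""
    return k * k

def stripe_weights(k):
    """Per-channel weights of a 1 x K plus K x 1 depthwise kernel pair."""
    return 2 * k

# For K = 21, the 441 weights of the square kernel shrink to 42,
# roughly a 10x reduction per channel.
```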
Similar to SENet, a gating mechanism is added to the segmentation enhancement module, allowing the model to selectively learn important information features by recalibrating the weights of the different channels. As shown in fig. 6, from the data-flow perspective, the input feature map first passes through a 1*1 ordinary convolution, then through the 7*7, 11×11 and 21×21 depthwise convolutions respectively to learn multi-scale information features; the outputs are numerically added to the original input feature map to obtain a new output feature map. To add the attention mechanism, this feature map is multiplied by the per-channel weights obtained through Global Average Pooling (GAP), achieving the effect of selectively learning important channels. During training, a BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting.
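The gating step described above, one GAP weight per channel rescaling that channel, can be sketched in plain Python on a tiny C×H×W feature map; this is an illustrative sketch of the recalibration idea only (a real SE-style gate would additionally pass the pooled vector through learned layers and a sigmoid, omitted here):

```python
def gap_gate(feature_map):
    """Rescale each channel of a C x H x W feature map (nested lists)
    by its global-average-pooled value, as in a simple channel gate."""
    gated = []
    for channel in feature_map:
        # global average pooling: one scalar weight per channel
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        weight = total / count
        # recalibrate: multiply every activation by the channel weight
        gated.append([[v * weight for v in row] for row in channel])
    return gated

# Channels with large mean activations are amplified relative to weak ones,
# which is the "selectively learn important channels" effect described above.
```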
The invention fuses a large convolution kernel attention mechanism with the ELAN structure proposed in YOLOv7 for the first time, serving as a brand-new backbone network.
Meanwhile, aiming at the lane line segmentation task, the invention combines the large-kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module, which is added after the backbone network and before the lane line segmentation head to further improve the accuracy of the lane line detection task.
While the invention has been described above with reference to the accompanying drawings, it will be apparent that the invention is not limited to the above embodiments; various modifications made using the method concepts and technical solutions of the invention, or their direct application to other fields without modification, all fall within the scope of the invention.
Claims (7)
1. A multi-task traffic scene detection algorithm based on an attention mechanism is characterized in that,
comprises a shared encoder and three decoders; the shared encoder consists of a backbone network and a neck network; the three decoders respectively complete the traffic target detection, drivable region detection and lane line detection tasks;
the backbone network is used to extract features of an input image and comprises a convolution module, a feature extraction module and a downsampling module; the convolution module consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function; the feature extraction module fuses a large-kernel attention mechanism with the ELAN structure and builds the backbone network for feature extraction; the downsampling module adds a downsampling layer on the basis of the convolution module to form two branches and finally performs feature fusion through dimension addition, so that the number of channels of the output feature map is twice that of the input feature map while its length and width are 1/2 of the input;
the neck network comprises a cross-stage spatial pyramid pooling module, a feature pyramid network and a path aggregation network; the cross-stage spatial pyramid pooling module is used to expand the receptive field and fuse information from feature maps of different scales to complete feature fusion; during feature map transmission, deep feature maps carry strong semantic features and weak position information while shallow feature maps carry strong position information and weak semantic features, so the feature pyramid network transmits deep semantic features to the shallow layers to enhance semantic expression at multiple scales, and the path aggregation network transmits shallow position information to the deep layers to enhance localization capability at multiple scales;
the specific process of the detection algorithm is as follows: the input of the network is a 640×640×3 RGB picture, which first enters the convolution modules for feature transfer; the 2nd and 4th convolution modules each reduce the length and width of the feature map by 1/2, so the output feature map is 1/4 of the input in length and width; the feature map then enters the feature extraction modules and downsampling modules for feature extraction, and after three downsamplings its length and width are reduced from 1/4 to 1/32 of the original image; the extracted feature maps are then sent to the neck network for multi-scale feature fusion; the traffic target detection module transmits the fused feature maps to three traffic target detection heads of different sizes, finally outputting three feature maps of sizes (W/8, H/8, 256), (W/16, H/16, 512) and (W/32, H/32, 1024) respectively; the input size of both the drivable region detection module and the lane line detection module is (W/8, H/8, 128); the drivable region detection module comprises a BottleneckCSP module and three upsampling modules for feature extraction, and after feature transfer the obtained output feature map is restored to the same size (W, H) as the input image, on which segmentation is performed to complete the drivable region detection task; the lane line detection module completes the lane line detection task in the same manner.
2. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 1, wherein,
the backbone network of YOLOv7 is selected as the basic network structure, and on this basis the feature extraction module replaces the original ELAN structure to construct an improved YOLOv7 backbone network.
3. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 2, wherein,
the ELAN structure is an efficient layer aggregation network that improves the network's learning capacity without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups; because the LKA mechanism combines the advantage of self-attention in modeling long-range dependence with the advantage of convolution in exploiting local context information, the LKA mechanism is fused with the ELAN structure in YOLOv7 to form the feature extraction module;
the feature extraction module comprises four convolution modules and two LKA-Module layers; the input passes sequentially through the cascade structure of two convolution modules, then two convolution modules with LKA-Module layers, each stage outputting a feature map with channel number o = i/2, where o is the number of output channels and i is the number of input channels; finally the output feature maps are subjected to dimension addition.
4. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 3, wherein,
the feature extraction module comprises two forms, wherein one form is that the number of output channels of the first two convolution modules is 1/2 of the number of input channels, and the number of the input channels and the number of the output channels of the second two convolution modules are the same; the other form is that the number of output channels of the first two convolution modules is 1/4 of the number of input channels, and the number of input channels and the number of output channels of the second two convolution modules are the same.
5. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 4, wherein,
the LKA-Module layer comprises a BatchNorm batch normalization layer together with the attention module and the feed-forward neural network module of the Transformer structure, wherein the attention module and the feed-forward neural network module are cascaded to perform feature extraction; to prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then enters the attention module and the feed-forward neural network module;
the attention module consists of a 1*1 convolution, a GELU activation function and an LKA module, wherein the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the feed-forward neural network module is composed of a 1*1 ordinary convolution, a 3*3 depthwise dilated convolution and a GELU activation function, wherein the dilation rate of the depthwise dilated convolution is equal to 3.
6. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 5, wherein,
convolution kernels of three different sizes, 7*7, 11×11 and 21×21, are added to the segmentation enhancement module to perform multi-scale information interaction, and meanwhile each K×K convolution kernel is decomposed into a 1×K transverse convolution kernel and a K×1 longitudinal convolution kernel to further reduce computational complexity.
7. The attention mechanism-based multi-task traffic scene detection algorithm according to claim 6, wherein,
a gating mechanism is added to the segmentation enhancement module so that the model selectively learns important information features by recalibrating the weights of the different channels; from the data-flow perspective, the input feature map of the model first passes through a 1*1 ordinary convolution, then through the 7*7, 11×11 and 21×21 depthwise convolutions respectively to learn multi-scale information features; the outputs are numerically added to the original input feature map to obtain a new output feature map, which, to add the attention mechanism, is multiplied by the per-channel weights obtained through global average pooling, achieving the effect of selectively learning important channels; during training, a BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310696843.9A CN116958910A (en) | 2023-06-13 | 2023-06-13 | Attention mechanism-based multi-task traffic scene detection algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116958910A true CN116958910A (en) | 2023-10-27 |
Family
ID=88445172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310696843.9A Pending CN116958910A (en) | 2023-06-13 | 2023-06-13 | Attention mechanism-based multi-task traffic scene detection algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116958910A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830981A (en) * | 2023-12-19 | 2024-04-05 | 杭州长望智创科技有限公司 | Automatic driving perception system and method based on multitask learning |
CN117934473A (en) * | 2024-03-22 | 2024-04-26 | 成都信息工程大学 | Highway tunnel apparent crack detection method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113486726B (en) | Rail transit obstacle detection method based on improved convolutional neural network | |
Han et al. | Yolopv2: Better, faster, stronger for panoptic driving perception | |
CN116958910A (en) | Attention mechanism-based multi-task traffic scene detection algorithm | |
CN112560656A (en) | Pedestrian multi-target tracking method combining attention machine system and end-to-end training | |
CN112417973A (en) | Unmanned system based on car networking | |
CN113269133A (en) | Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning | |
CN112489072B (en) | Vehicle-mounted video perception information transmission load optimization method and device | |
Mukherjee et al. | Interacting vehicle trajectory prediction with convolutional recurrent neural networks | |
CN114973199A (en) | Rail transit train obstacle detection method based on convolutional neural network | |
CN117292128A (en) | STDC network-based image real-time semantic segmentation method and device | |
CN115527096A (en) | Small target detection method based on improved YOLOv5 | |
CN117593623A (en) | Lightweight vehicle detection method based on improved YOLOv8n model | |
Wang et al. | Object detection algorithm based on improved Yolov3-tiny network in traffic scenes | |
Jing et al. | Lightweight Vehicle Detection Based on Improved Yolox-nano. | |
CN111639563B (en) | Basketball video event and target online detection method based on multitasking | |
Li et al. | An improved lightweight network based on yolov5s for object detection in autonomous driving | |
CN115331460A (en) | Large-scale traffic signal control method and device based on deep reinforcement learning | |
Pan et al. | Video Surveillance Vehicle Detection Method Incorporating Attention Mechanism and YOLOv5 | |
CN112364720A (en) | Method for quickly identifying and counting vehicle types | |
Tong et al. | Robust Depth Estimation Based on Parallax Attention for Aerial Scene Perception | |
Adnan et al. | Traffic congestion prediction using deep convolutional neural networks: A color-coding approach | |
Shao et al. | Research on yolov5 vehicle object detection algorithm based on attention mechanism | |
Guo et al. | An Effective Module CA-HDC for Lane Detection in Complicated Environment | |
CN114913328B (en) | Semantic segmentation and depth prediction method based on Bayes depth multitask learning | |
Wu et al. | Efficiently Learning a Robust Self-Driving Model with Neuron Coverage Aware Adaptive Filter Reuse |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||