CN116229406A - Lane line detection method, system, electronic equipment and storage medium - Google Patents

Lane line detection method, system, electronic equipment and storage medium

Info

Publication number
CN116229406A
CN116229406A (application CN202310511812.1A)
Authority
CN
China
Prior art keywords
lane line
map
offset
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310511812.1A
Other languages
Chinese (zh)
Other versions
CN116229406B (en)
Inventor
唐洪
夏军
邓锋
喻璟怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202310511812.1A priority Critical patent/CN116229406B/en
Publication of CN116229406A publication Critical patent/CN116229406A/en
Application granted granted Critical
Publication of CN116229406B publication Critical patent/CN116229406B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a lane line detection method, a lane line detection system, electronic equipment and a storage medium, wherein the lane line detection method comprises the following steps: acquiring a current road image with a lane line; extracting lane line image features with context information from the current road image by using ResNet-ECA-PSPNet; inputting the lane line image features into a CNN network and a visual Transformer network respectively to obtain a confidence map based on key points, a key point map and an offset map; and carrying out key point association on the confidence map and key point map with the offset map to distinguish and construct each lane line, so as to acquire lane line detection information of the current road image. With the method and the device, the context feature representation capability of the lane lines can be enhanced, long-range offsets can be captured without reducing the resolution of the output feature map, and regression errors of key point offsets at dense lane lines are avoided.

Description

Lane line detection method, system, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a lane line detection method, a lane line detection system, electronic equipment and a storage medium.
Background
In recent years, autonomous driving has received extensive attention in both academia and industry. One of its most basic and challenging tasks is lane line detection in real scenes to assist driving. However, due to severe scenes such as occlusion, haze, darkness and strong light reflection, accurately detecting lane lines is very challenging. The traditional image-processing approach segments the lane line area by means of edge detection, filtering and the like, and then detects the lane lines by combining algorithms such as the Hough transform and RANSAC; such algorithms require manually designed filtering operators, and parameters must be tuned by hand for the street scenes targeted by the algorithm, so the workload is large, the robustness is poor, and the lane line detection effect deteriorates when the driving environment changes significantly. With the rapid development of deep learning, lane line detection has shifted from sensor-based detection to detecting lane lines in RGB images captured by a front-view camera using deep learning techniques.
As shown in fig. 1, fig. 1 (a) shows an input original lane line image; deep-learning lane line detection methods are roughly classified into three types: segmentation-based, anchor-based and point-based. 1. As shown in (b) of fig. 1, the segmentation-based lane line detection method is the most direct: it models lane line detection as a pixel-level classification task and classifies the pixels in the area where the lane line is located; however, the number of annotated lane pixels in the image data is far smaller than the number of background pixels, and learning from such subtle and sparse annotations is challenging; moreover, classifying all pixels of the area where the lane is located to describe the lane line is not efficient. 2. As shown in fig. 1 (c), the anchor-based lane line detection method uses a plurality of predefined anchor lines, and the positions of the lane lines are regressed on the basis of the predefined anchor lines, so this approach has the highest detection speed; but it is inflexible due to the limitations of the predefined anchor line shape. 3. As shown in (d) of fig. 1, the point-based lane line detection method describes the shape of the lane line with key points and then distinguishes each lane line by key point clustering: the network outputs a probability map of the lane line key points and the offsets from the key points to the starting point of the same lane, and clusters the key points belonging to the same lane line by using the key points and their offsets; this way of modeling lane lines has greater flexibility and faster detection speed. However, the point-based lane line detection method still has the following difficulties:
(1) Traditional Convolutional Neural Networks (CNNs) tend to have difficulty in efficiently extracting the semantic features of lane lines because the elongated structure of lane lines makes it difficult to establish efficient links with context information during the feature extraction stage;
(2) The lane line span is large, which is a challenge for the convolution-based neural network, because convolution operation can only capture limited local information and cannot effectively capture long-distance offset information of the lane line; this results in that in lane line detection tasks, the model often has difficulty capturing global context information for the lane line;
(3) Due to the denseness of the lane lines, traditional lane line detection networks often need to reduce the resolution of the output feature map to obtain a larger receptive field for regressing long-distance offsets; however, this makes it difficult for the network to distinguish key points in dense lane lines, resulting in reduced accuracy.
Therefore, how to solve the problems of the lane line detection method based on the point in the prior art is always a problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a lane line detection method, a system, electronic equipment and a storage medium. By adopting a hybrid convolution-Transformer architecture for lane line detection, the model's ability to extract low-level features of lane lines can be enhanced, thereby improving the capability of representing the context features of the lane lines; and long-range dependency relationships between lane line contexts are established through an attention mechanism, so that the global context information of the lane lines is better captured, long-range offsets can be captured without reducing the resolution of the output feature map, and the problem of regression errors of key point offsets at dense lane lines is avoided.
In a first aspect, the present invention provides a lane line detection method, including:
acquiring a current road image with a lane line;
extracting lane line image features with context information from the current road image by using ResNet-ECA-PSPNet;
inputting the lane line image features into a CNN network and a visual Transformer network respectively to obtain a confidence map based on key points, a key point map and an offset map;
and carrying out key point association on the confidence map and the key point map and the offset map to distinguish and construct each lane line so as to acquire lane line detection information of the current road image.
Preferably, the step of extracting the lane line image feature with the context information from the current road image by using the ResNet-ECA-PSPNet specifically includes:
extracting a high-dimensional feature map from the current road image by adopting a ResNet network as a backbone network;
inputting the high-dimensional feature map into an FPN through a 1×1 convolution kernel for training to obtain a dimension-reduction feature map;
and carrying out ECA up-sampling on the dimension reduction feature map by using ECA-PSPNet to fuse the dimension reduction feature map with the previous level feature map, so as to obtain the lane line image feature with the context information and fused with the multi-scale information.
Preferably, the ECA up-sampling specifically includes:
performing row rearrangement and column rearrangement on the tensor of the dimension reduction feature map to obtain a row-rearranged vector and a column-rearranged vector;
performing group convolution transformation on the row-rearranged vector and the column-rearranged vector by adopting a first preset model and a second preset model to obtain a transformed row and a transformed column;
respectively performing row rearrangement and column rearrangement on the transformed row and the transformed column, aligning the rearranged transformed row and transformed column with the spatial positions of the original feature map, and fusing the aligned transformed row and transformed column with the row-rearranged vector and the column-rearranged vector respectively to obtain a dimension reduction feature map fused with row and column information;
inputting the dimension reduction feature map fused with the row and column information into an improved coordinate attention mechanism so as to capture the context information of the lane lines at multiple scales, and obtaining the lane line image feature with context information fused with the multi-scale information.
Preferably, the improved coordinate attention mechanism refers to encoding lateral and longitudinal position information into channel attention so that the network accurately focuses on the spatial structure information of interest; the process of coding the channel relation and the long-distance relation of the improved coordinate attention mechanism comprises the following steps:
using pooling kernels with sizes (1, W') and (H', 1) to encode the horizontal-direction and vertical-direction features of the dimension reduction feature map, obtaining a horizontal-direction c-th channel spatial feature and a vertical-direction c-th channel spatial feature;
connecting the horizontal-direction c-th channel spatial feature and the vertical-direction c-th channel spatial feature along the horizontal-vertical direction, and performing feature conversion by convolution with a 1×1 convolution kernel, BN and nonlinear activation to obtain a first conversion feature;
decomposing the first conversion feature into two independent decomposition features along the vertical-horizontal direction, and using two 1×1 convolutions with sigmoid functions as activation functions to perform feature conversion on the two decomposition features respectively while keeping the dimensions unchanged;
and generating a coordinate attention matrix from the two decomposition features after feature conversion through a broadcasting mechanism, and multiplying the attention matrix by the dimension reduction feature map so that the newly generated feature map accurately focuses on the spatial structure information of interest.
Preferably, the step of inputting the lane line image features into a CNN network and a visual Transformer network to obtain a confidence map based on key points, a key point map and an offset map specifically includes:
sampling K key points on each lane line as confidence labels of the key points, and placing the confidence labels in a CNN network;
finding the offsets from the K key points to the corresponding starting points to serve as offset labels, and placing the offset labels in a visual Transformer network;
respectively inputting the lane line image features into the CNN network with the confidence labels and the visual Transformer network with the offset labels to perform corresponding supervised learning, so as to synchronously obtain the key points of the lane lines and the offsets from the key points to the corresponding starting points;
and respectively constructing a confidence map, a key point map and an offset map according to the key points and the offsets.
Preferably, the confidence map is generated by the key point layer of the CNN network structure using two convolution layers, and the offset map is generated by the offset layer of the visual Transformer network structure using a convolutional vision Transformer combined with two convolution layers.
Preferably, the step of associating the confidence map and the key point map with the offset map to distinguish and construct each lane line so as to obtain lane line detection information of the current road image specifically includes:
defining points with probability values larger than a preset threshold value of the confidence coefficient map as key points;
Extracting the corresponding offset of each key point according to the offset graph, and taking out the key points smaller than a preset offset threshold value to determine the unique starting point of the lane line;
acquiring key points after offset according to the offset of other key points except the unique starting point;
and counting the distance between each shifted key point and the unique starting point so as to distinguish the lane line where each key point belongs to the unique starting point and construct each lane line, so that the lane line detection information of the current road image is acquired.
In a second aspect, the invention provides a lane line detection system comprising:
the acquisition module is used for acquiring a current road image with a lane line;
the extraction module is used for extracting lane line image features with context information from the current road image by utilizing ResNet-ECA-PSPNet;
the construction module is used for inputting the lane line image features into a CNN network and a visual Transformer network respectively so as to obtain a confidence map based on key points, a key point map and an offset map;
and the association module is used for carrying out key point association on the confidence map and the key point map and the offset map so as to distinguish and construct each lane line, so that lane line detection information of the current road image is acquired.
Preferably, the extraction module includes:
an extraction unit for extracting a high-dimensional feature map from the current road image by using a ResNet network as a backbone network;
the training unit is used for inputting the high-dimensional feature map into the FPN through a 1×1 convolution kernel to train so as to obtain a dimension-reduction feature map;
and the fusion unit is used for carrying out ECA up-sampling on the dimension reduction feature map by using the ECA-PSPNet so as to fuse the dimension reduction feature map with the previous level feature map, and obtaining the lane line image feature with the context information and fused with the multi-scale information.
Preferably, the construction module includes:
the sampling unit is used for sampling K key points on each lane line as confidence labels of the key points and placing the confidence labels in the CNN network;
the searching unit is used for searching the offsets from the K key points to the corresponding starting points to be used as offset labels and placing them in a visual Transformer network;
the supervision unit is used for respectively inputting the lane line image features into a CNN network with confidence labels and a visual Transformer network with offset labels to perform corresponding supervised learning, so as to synchronously obtain the key points of the lane lines and the offsets from the key points to the corresponding starting points;
And the construction unit is used for respectively constructing a confidence map, a key point map and an offset map according to the key points and the offsets.
Preferably, the association module includes:
a definition unit, configured to define points with probability values greater than a preset threshold of the confidence map as key points;
the determining unit is used for extracting the offset corresponding to each key point according to the offset graph, and taking out the key points smaller than a preset offset threshold value to determine the unique starting point of the lane line;
an obtaining unit, configured to obtain the shifted key points according to the shift of the key points other than the unique start point;
and the distinguishing unit is used for counting the distance between each shifted key point and the unique starting point so as to distinguish the lane line where each key point belongs to the unique starting point and construct each lane line, so that the lane line detection information of the current road image is acquired.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the lane line detection method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, implements the lane line detection method according to the first aspect.
Compared with the prior art, the lane line detection method, the lane line detection system, the electronic equipment and the storage medium provided by the application work as follows: firstly, image features are extracted by using a ResNet network as a backbone network, then the image features are up-sampled from top to bottom by ECA-PSPNet, the multi-scale features are fused, and the lane line context information is fused into the feature map to enhance the overall representation of the lane lines. Then, the features fused with the lane line context information are respectively sent into a key point extraction network and an offset extraction network to obtain a key point confidence map and a map of offsets from the key points to the corresponding starting points. The offset extraction network is built on a vision Transformer module, so the network has a global receptive field when calculating the offset of each key point, and the offset regression capability of the network is not limited by the offset length; meanwhile, the features in the feature extraction stage are kept at a larger resolution, which greatly relieves the problem of key point offset regression errors at dense lane lines. Finally, through the key point association post-processing, the key points can be unambiguously classified into a plurality of groups, each group of key points representing one lane line; even lane lines of different shapes can be flexibly represented, the accuracy of lane line detection is effectively improved, and the difficulties of the point-based lane line detection methods in the prior art can be solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram showing the detection effect of a lane line detection technique according to the prior art;
fig. 2 is a flowchart of a lane line detection method according to embodiment 1 of the present invention;
fig. 3 is a flowchart of ECA upsampling according to embodiment 1 of the present invention;
fig. 4 is a schematic diagram of a lane line detection correlation process provided in embodiment 1 of the present invention;
FIG. 5 is a block diagram of the lane line detection system of embodiment 2 of the present invention, which corresponds to the method of embodiment 1;
fig. 6 is a schematic hardware structure of an electronic device according to embodiment 3 of the present invention.
Reference numerals illustrate:
10-an acquisition module;
a 20-extraction module, a 21-extraction unit, a 22-training unit and a 23-fusion unit;
30-construction modules, 31-sampling units, 32-finding units, 33-supervision units and 34-construction units;
40-association module, 41-definition unit, 42-determination unit, 43-acquisition unit, 44-discrimination unit;
50-bus, 51-processor, 52-memory, 53-communication interface.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. I.e. the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Example 1
Specifically, fig. 2 is a schematic flow chart of a lane line detection method according to the present embodiment.
As shown in fig. 2, the lane line detection method of the present embodiment includes the steps of:
s101, acquiring a current road image with lane lines.
Specifically, the current road image is obtained through terminal equipment, and the terminal equipment can shoot the road image of the current running scene of the vehicle through a camera; the captured road image includes road images in front of, behind, or to the side of the vehicle. In specific practice, the terminal device may be a vehicle-mounted device, which is respectively in communication connection with the camera and a control system for automatic driving of the vehicle. Further, the terminal device can control the camera to acquire road images of the scene where the vehicle is located in real time or according to a preset period, according to the requirements of the running scene of the vehicle. In this embodiment, experiments were performed on two datasets, CULane and TuSimple. CULane contains 88,880 training images and 34,680 test images covering urban and highway scenes, and its test images are divided into 9 different scenarios; TuSimple is a real road dataset comprising 3,626 images for training and 2,782 images for testing.
S102, extracting lane line image features with context information from the current road image by utilizing ResNet-ECA-PSPNet.
Specifically, ResNet and a pooling pyramid structure are utilized, and an enhanced coordinate attention layer is introduced to extract features with spatial structure perception from the lane line image. When processing the lane line image, the classical ResNet is first used as the backbone network to extract features, then the feature maps of each stage are input into the FPN, and high-level semantic feature maps are extracted through 1×1 convolution. Then the current feature map is up-sampled by a factor of 2 and fused with the feature map of the previous level, so that the feature maps of each stage are fused step by step; meanwhile, an enhanced coordinate attention layer is used during feature extraction to strengthen the expressive power of the features. Finally, a feature map fused with multi-scale lane line context information is output.
Further, the specific steps of step S102 include:
s1021, extracting a high-dimensional feature map from the current road image by adopting a ResNet network as a backbone network.
Specifically, in the present embodiment, ResNet is used as the backbone network to extract features. The backbone is built from ResNet with its global average pooling layer and linear classification layer removed, while the feature map output of each stage is preserved for subsequent feature extraction and processing.
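For illustration, the following is a minimal Python sketch of such a backbone that keeps the per-stage feature maps; the choice of ResNet-18 from torchvision and the input size are assumptions made for the example and are not prescribed by this embodiment.

```python
# A minimal sketch of a ResNet backbone that keeps the per-stage feature maps.
# ResNet-18 from torchvision is an illustrative assumption; the embodiment only
# specifies "ResNet" and that the classification head is not used.
import torch
import torchvision

class ResNetBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Keep the stem and the four residual stages; the global average pooling
        # and linear classification layers are simply not used.
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = torch.nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # preserve the last feature map of every stage
        return feats

if __name__ == "__main__":
    feats = ResNetBackbone()(torch.randn(1, 3, 320, 800))
    print([tuple(f.shape) for f in feats])
```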
S1022, inputting the high-dimensional feature map into the FPN through a 1×1 convolution kernel for training to obtain a dimension-reduction feature map.
Specifically, the FPN adopts a multi-scale feature fusion method, which can remarkably improve the scale robustness of the feature representation without greatly increasing the amount of computation. The network structure of the FPN is essentially a fully convolutional network that takes an image of arbitrary scale as input, and the scale of the feature map output by each convolution stage is kept at a fixed ratio to the scale of the original image. The FPN architecture includes three processes: bottom-up, top-down and lateral connections; the bottom-up process is essentially the forward propagation of the convolutional network in this embodiment. During forward propagation of the ResNet network, consecutive feature maps of the same scale belong to the same network stage, and for each stage the last feature map contains the most expressive features of that stage.
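A rough sketch of this top-down fusion is given below. It is an illustration only: the output channel count, bilinear interpolation and element-wise addition are assumptions, and in this embodiment the up-sampling/fusion step is actually performed by the ECA up-sampling described in S1023.

```python
# Sketch of FPN-style fusion: 1x1 lateral convolutions reduce each stage to a common
# channel count, then deeper maps are up-sampled and fused with the previous level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=64):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, stage_feats):
        laterals = [conv(f) for conv, f in zip(self.laterals, stage_feats)]
        fused = laterals[-1]                  # start from the deepest, most semantic map
        outputs = [fused]
        for lat in reversed(laterals[:-1]):   # top-down pathway
            fused = lat + F.interpolate(fused, size=lat.shape[-2:], mode="bilinear", align_corners=False)
            outputs.append(fused)
        return outputs[::-1]                  # shallow-to-deep order

if __name__ == "__main__":
    feats = [torch.randn(1, c, h, w) for c, h, w in
             [(64, 80, 200), (128, 40, 100), (256, 20, 50), (512, 10, 25)]]
    print([tuple(p.shape) for p in SimpleFPN()(feats)])
```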
S1023, performing ECA up-sampling on the dimension reduction feature map by using ECA-PSPNet to fuse the dimension reduction feature map with the previous level feature map, so as to obtain the lane line image feature with the context information and fused with the multi-scale information.
Specifically, referring to fig. 3, ECA up-sampling specifically includes the following steps:
step one, tensor X-R of the dimension reduction feature map is calculated H'×W'×C' To perform row rearrangement and column arrangementReordered to obtain reordered vector X r €R 1×W'×H'C' And rearranging vector X c €R H'×1×W'C'
Where H ' represents the length of the tensor X, W ' represents the width of the tensor X, and C ' represents the number of channels of the tensor X.
Step two, a first preset model and a second preset model are adopted to perform group convolution transformation on the row-rearranged vector X_r and the column-rearranged vector X_c respectively, obtaining a transformed row Z_r ∈ R^(1×W'×H'C') and a transformed column Z_c ∈ R^(H'×1×W'C');
wherein X_r uses a group convolution with H' groups and X_c uses a group convolution with W' groups; the convolution transformation can be regarded as a process of assigning row and column index information to the features.
The corresponding first and second preset convolution models can be written as Z_r = F_r(X_r) and Z_c = F_c(X_c), where F_r denotes the group convolution with H' groups acting on the rows and F_c denotes the group convolution with W' groups acting on the columns, with h ∈ [0, 1, …, H'-1] and ω ∈ [0, 1, …, W'-1].
step three, aiming at the transformation line Z respectively r The transformation column Z c Performing row rearrangement and column rearrangement, aligning the rearranged transformation rows and transformation columns with the original feature map space positions, and respectively fusing the aligned transformation rows and transformation columns with the rearranged row vectors and rearranged column vectors to obtain the dimension reduction feature map fused with row and column information;
wherein row rearrangement and column rearrangement are performed on Z_r and Z_c respectively, and the H' groups of row features and the W' groups of column features aligned with the spatial positions of the original feature map are fused, so that the original features carry explicit row and column position information.
Step four, the dimension reduction feature map fused with the row and column information is input into an improved coordinate attention mechanism so as to capture the context information of the lane lines at multiple scales, obtaining the lane line image feature with context information fused with the multi-scale information;
the context and the spatial position information in the feature map can be obviously enhanced through an improved coordinate attention mechanism (ECA), so that the position information of the lane line slender structure can be more accurately learned, the region of interest can be accurately positioned, and the context information of the lane line can be effectively captured. After feature map fusion, lane line context information is fused into the feature map and up-sampled 2 times by linear interpolation to facilitate subsequent different scale feature fusion.
Further, the coordinate attention mechanism (CA) refers to encoding lateral and longitudinal position information into channel attention to accurately focus the network on spatial structure information of interest. It should be noted that, the attention mechanism of the lightweight network at present mostly adopts channel attention (SE), which only considers the information between channels and ignores the position information; while the Bottleneck Attention (BAM) and convolution attention (CBAM) later attempt to extract position attention information by large convolution after reducing the number of channels, the actual extraction of local relationship information is very limited and still lacks the ability to model long-distance relationships. For this embodiment, a coordinate attention mechanism is proposed to encode the lateral and longitudinal position information into the channel attention with an improved attention mechanism, enabling the network to focus on the spatial structure information of interest without introducing excessive computation. The process of coding channel relationships and long distance relationships for improved attention mechanisms comprises the steps of:
Step one, a pooling kernel with the sizes of (1, W ') and (H', 1) is used for encoding the horizontal direction and the vertical direction aiming at the dimension reduction feature map, so as to obtain a horizontal direction c channel space feature and a vertical direction c channel space feature;
where channel attention is often used to globally pool coded global spatial information, compressing global information into a scalar is difficult to preserve important spatial information. The improved coordinate attention mechanism of this embodiment pools globally for this purposeThe pooling operation into two directions, horizontal-vertical, in this example, in the feature map X ∈R H'×W'×C' In the above, the spatial features in the horizontal direction are polymerized by using the pooling window of (1, w '), and the spatial features in the vertical direction are polymerized by using the pooling window of (H', 1):
P(h, c) = (1/W') Σ_{0≤i<W'} X(h, i, c)

Q(ω, c) = (1/H') Σ_{0≤j<H'} X(j, ω, c)

where P(h, c) represents the pooling result of the horizontal pooling window over the h-th row of the feature map X at the c-th channel, and Q(ω, c) represents the pooling result of the vertical pooling window over the ω-th column of the feature map X at the c-th channel.
The above formula uses a method of integrating features from the row and column directions to output a pair of feature maps whose directions can be known. Compared with global pooling operation, the method not only can capture long-distance relation in one direction, but also can reserve spatial information in the other direction, thereby being beneficial to the network to more accurately position the target and improving the precision and the robustness of the network.
It should be noted that, in order to better utilize the information of the coordinates, the improved coordinate attention mechanism proposes to generate a coordinate attention operation, which is mainly designed based on the following three points: (1) sufficiently compact; (2) the position information is fully utilized; (3) the treatment is efficient.
Step two, the horizontal-direction c-th channel spatial feature and the vertical-direction c-th channel spatial feature are connected along the horizontal-vertical direction, and feature conversion is performed by convolution with a 1×1 convolution kernel, BN and nonlinear activation to obtain a first conversion feature; the first conversion feature of the present embodiment adopts the following formula:

f = δ(F₁(Cat(P, Q)))

where Cat represents the concatenation operation, F₁ represents a 1×1 convolution, δ represents a nonlinear activation function, and f ∈ R^((H'+W')×C').
Step three, the first conversion feature is decomposed along the vertical-horizontal direction into two independent decomposition features f^h ∈ R^(H'×C') and f^ω ∈ R^(W'×C'), and two 1×1 convolutions with sigmoid functions as activation functions are used to perform feature conversion on the two decomposition features respectively while keeping the dimensions unchanged;
wherein the converted decomposition features are respectively:

g^h = σ(F_h(f^h)), g^ω = σ(F_ω(f^ω))

where F_h and F_ω represent the two 1×1 convolutions and σ represents the sigmoid function.
Step four, a coordinate attention matrix is generated from the two decomposition features after feature conversion through a broadcasting mechanism, and the attention matrix is multiplied with the dimension reduction feature map so that the newly generated feature map accurately focuses on the spatial structure information of interest.
The newly generated feature map is obtained by the following formula:

Y = X × g^h × g^ω

where the broadcasting mechanism makes g^h × g^ω ∈ R^(H'×W'×C'), so that finally Y ∈ R^(H'×W'×C') is consistent with the dimensions of X.
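For illustration, a minimal PyTorch sketch of this coordinate attention is given below; the channel reduction ratio and the use of ReLU as the nonlinearity are assumptions and are not prescribed by the formulas above.

```python
# Coordinate attention: directional pooling, shared 1x1 conv + BN + activation,
# split into horizontal/vertical sigmoid gates, broadcast multiplication (sketch).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)          # channel reduction is an assumption
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        p = x.mean(dim=3, keepdim=True)                        # (1, W') pooling -> N x C x H x 1
        q = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (H', 1) pooling -> N x C x W x 1
        f = self.act(self.bn(self.conv1(torch.cat([p, q], dim=2))))  # concatenate, 1x1 conv, BN, activation
        f_h, f_w = torch.split(f, [h, w], dim=2)               # decompose into two directional features
        g_h = torch.sigmoid(self.conv_h(f_h))                  # N x C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w)).permute(0, 1, 3, 2)  # N x C x 1 x W
        return x * g_h * g_w                                   # broadcast attention onto the feature map

if __name__ == "__main__":
    y = CoordinateAttention(64)(torch.randn(2, 64, 20, 50))
    print(tuple(y.shape))   # (2, 64, 20, 50)
```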
And S103, inputting the lane line image features into a CNN network and a visual Transformer network respectively to acquire a confidence map based on key points, a key point map and an offset map.
Specifically, the confidence map is generated by the key point layer of the CNN network structure using two convolution layers, and the offset map is generated by the offset layer of the visual Transformer network structure using a convolutional vision Transformer combined with two convolution layers. It should be noted that since ViT introduced the Transformer into computer vision tasks, the vision Transformer has become the most competitive architecture for resolving long-distance dependencies on image data. The convolutional vision Transformer preserves the computational properties of the vision Transformer layer and introduces the desirable properties of convolutional neural networks into the Transformer by replacing the linear projection in the self-attention layer with squeezed convolutional projections. The convolutional vision Transformer uses a 3×3 convolution with stride 2 to project Key and Value, and a 3×3 convolution with stride 1 to project Query. This reduces the number of Key and Value tokens by a factor of 4, thereby reducing the computational cost of the subsequent multi-head self-attention operation. Furthermore, this approach has minimal impact on model performance because adjacent tokens in an image tend to be redundant in appearance (semantics).
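The following is a minimal sketch of such convolutional projections in front of multi-head self-attention; the use of depthwise 3×3 convolutions and the number of heads are assumptions made for the example.

```python
# Convolutional projections for self-attention: stride-1 3x3 conv for Query,
# stride-2 3x3 convs for Key/Value, which cuts the Key/Value token count by 4x.
import torch
import torch.nn as nn

class ConvProjectionAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.proj_q = nn.Conv2d(dim, dim, 3, stride=1, padding=1, groups=dim)
        self.proj_k = nn.Conv2d(dim, dim, 3, stride=2, padding=1, groups=dim)
        self.proj_v = nn.Conv2d(dim, dim, 3, stride=2, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                   # x: N x C x H x W
        n, c, h, w = x.shape
        q = self.proj_q(x).flatten(2).transpose(1, 2)       # N x (H*W) x C
        k = self.proj_k(x).flatten(2).transpose(1, 2)       # N x (H*W/4) x C
        v = self.proj_v(x).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, k, v)                         # every query token attends globally
        return out.transpose(1, 2).reshape(n, c, h, w)

if __name__ == "__main__":
    y = ConvProjectionAttention(64)(torch.randn(2, 64, 20, 50))
    print(tuple(y.shape))   # (2, 64, 20, 50)
```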
Further, the specific steps of step S103 include:
s1031, sampling K key points on each lane line as confidence labels of the key points, and placing the confidence labels on the CNN network.
S1032, finding the offsets from the K key points to the corresponding starting points as offset labels, and placing the offset labels on a visual Transformer network.
S1033, inputting the lane line image features to a CNN network with confidence labels and a visual Transformer network with offset labels respectively for corresponding supervised learning, so as to synchronously obtain the key points of the lane lines and the offsets from the key points to the corresponding starting points.
Specifically, in a keypoint detection network, by using two 3×3 convolution kernels, the output resolution can be maintained and a keypoint confidence map generated. In order to solve the imbalance problem between the keypoint region and the non-keypoint region, the learning of the keypoint confidence may be performed using a focal loss as a supervisory signal. The focal loss is a loss function for solving the problem of class imbalance, and has the following formula:
L_point = -(1/N) Σ_{x,y} ℓ(x, y), with

ℓ(x, y) = (1 - Ŷ_(x,y))^α · log(Ŷ_(x,y)),   if Y_(x,y) = 1,

ℓ(x, y) = (1 - Y_(x,y))^β · (Ŷ_(x,y))^α · log(1 - Ŷ_(x,y)),   otherwise,

where Ŷ represents the output of the key point confidence map, Y represents the key point confidence label, α and β represent the hyperparameters of the focal loss, N represents the number of key points, Ŷ_(x,y) represents the value of Ŷ at coordinates (x, y), and Y_(x,y) represents the value of Y at coordinates (x, y).
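For illustration, a minimal sketch of such a key point focal loss is given below; the specific α and β values and the Gaussian-smoothed label assumption are illustrative and not taken from this embodiment.

```python
# Penalty-reduced focal loss for key point confidence maps (illustrative sketch).
import torch

def keypoint_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """pred: predicted confidence map in (0, 1); target: key point confidence labels."""
    pred = pred.clamp(eps, 1 - eps)
    pos = target.eq(1).float()                    # exact key point locations
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - target) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)            # normalize by the number of key points
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

if __name__ == "__main__":
    loss = keypoint_focal_loss(torch.rand(2, 1, 40, 100), torch.zeros(2, 1, 40, 100))
    print(loss.item())
```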
further, in an offset network, the model stacks multiple convolutions of the vision converter modules and one convolution reduces the number of channels of the output feature map to 2, which also ensures that the resolution of the feature map is unchanged. Finally, the offset learning is supervised by using a loss function, the offset learning is only applied to the positions of the lane lines, and the pavement parts of other non-lane lines are not adopted, and the loss function is expressed as follows:
L_offset = Σ_{p ∈ Ω} ‖O_p - Ô_p‖₁

where O_p represents the predicted offset of the key point at lane line position p, Ô_p represents the offset label of that key point, and Ω denotes the set of lane line positions; positions on the road surface that are not lane lines do not contribute to the loss.
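The sketch below illustrates one way such a masked offset regression can be implemented; the L1 form, the mask layout and the normalization by the number of lane pixels are assumptions consistent with the description above, not details taken from this embodiment.

```python
# Offset regression loss applied only at lane line positions (illustrative sketch).
import torch

def offset_loss(pred_offset, gt_offset, lane_mask):
    """pred_offset, gt_offset: N x 2 x H x W; lane_mask: N x 1 x H x W, 1 on lane line pixels."""
    diff = (pred_offset - gt_offset).abs() * lane_mask      # ignore non-lane road surface
    return diff.sum() / lane_mask.sum().clamp(min=1.0)

if __name__ == "__main__":
    loss = offset_loss(torch.randn(2, 2, 40, 100), torch.randn(2, 2, 40, 100),
                       (torch.rand(2, 1, 40, 100) > 0.95).float())
    print(loss.item())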
S1034, respectively constructing a confidence map, a key point map and an offset map according to the key points and the offsets.
And S104, carrying out key point association on the confidence map and the key point map and the offset map to distinguish and construct each lane line so as to acquire lane line detection information of the current road image.
Specifically, through the post-processing of the key point association, the key points can be unambiguously classified into a plurality of groups, each group of key points represents one lane line, and even if the shapes of the lane lines are different, the key points can be flexibly represented.
Further, the specific steps of step S104 include:
And S1041, defining points with probability values larger than a preset threshold value of the confidence coefficient map as key points.
Specifically, in lane line detection, all points with a confidence level greater than the threshold σ are first selected as key points, as shown in fig. 4 (a). For a lane line keypoint, its probability value in the confidence map must be greater than the threshold σ. Points marked with light gray circles such as in (a) of fig. 4 are selected as key points because the confidence in the detection is not less than a threshold. For points with confidence less than the threshold σ, such as the points marked by white open circles in (a) of fig. 4, no subsequent processing is required.
S1042, extracting the corresponding offset of each key point according to the offset graph, and taking out the key points smaller than a preset offset threshold value to determine the unique starting point of the lane line.
Specifically, in lane line detection, processing is performed according to the key point offset O extracted by the offset network. If the offset O corresponding to a certain key point is smaller than the offset threshold μ, the key point is selected as a starting point of the lane line and is used as a unique identifier of the lane line. For example, in fig. 4 (b), the offset O corresponding to the dark gray circle point (x 5, y 5) is smaller than μ, and is selected as the start point of the lane line. And the key point with the offset O being larger than or equal to mu is not selected as a starting point, and the generation of the subsequent lane lines is not participated.
S1043, obtaining the key points after the offset according to the offsets of the other key points except the unique starting point.
Specifically, for all other key points except the starting point, extracting the corresponding offset O and applying the offset O to the corresponding key point to obtain the offset key point. The light gray circle point in (c) in fig. 4 is obtained by applying the offset O based on the dark gray circle point.
S1044, counting the distance between each shifted key point and the unique starting point to distinguish the lane line where each key point belongs to the unique starting point and construct each lane line, so as to obtain the lane line detection information of the current road image.
Specifically, when the position mapping of the key points is performed, the new key points are obtained after the key points except the starting point are shifted. And then counting the distance between each new key point and the starting point, and if the distance is smaller than the threshold value mu, attributing the new key point to the lane line where the starting point is located. By this process, the detection of the lane line is completed and the final result is obtained as shown in (d) of fig. 4, in which the connected black circles represent the detection result of the lane line.
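The association procedure of S1041-S1044 can be illustrated with the following rough Python sketch; the threshold values σ and μ, the tensor layout and the nearest-start-point assignment are illustrative assumptions.

```python
# Key point association post-processing (illustrative sketch of S1041-S1044).
import torch

def associate_keypoints(confidence, offsets, sigma=0.5, mu=3.0):
    """confidence: H x W; offsets: 2 x H x W (offset from each key point to its lane start point)."""
    ys, xs = torch.nonzero(confidence > sigma, as_tuple=True)     # S1041: keep confident points
    points = torch.stack([xs, ys], dim=1).float()
    offs = offsets[:, ys, xs].t()                                 # per-key-point offsets (assumed dx, dy)
    is_start = offs.norm(dim=1) < mu                              # S1042: near-zero offset -> lane start point
    starts = points[is_start]
    lanes = [[tuple(p.tolist())] for p in starts]
    shifted = points + offs                                       # S1043: move each key point toward its start
    for p, s in zip(points[~is_start], shifted[~is_start]):
        if len(starts) == 0:
            break
        d = (starts - s).norm(dim=1)                              # S1044: distance to each start point
        if d.min() < mu:
            lanes[int(d.argmin())].append(tuple(p.tolist()))      # assign the point to the nearest lane
    return lanes

if __name__ == "__main__":
    lanes = associate_keypoints(torch.rand(40, 100), torch.randn(2, 40, 100) * 2)
    print(len(lanes))
```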
In summary, in the lane line detection scheme based on the hybrid CNN and vision Transformer architecture provided in this embodiment, the prior knowledge of the elongated lane line structure is exploited: the enhanced coordinate attention is designed to effectively perceive the elongated lane line structure and fuse the lane line context to strengthen the overall representation of the lane line features, so that the continuity of the lane line key points is maintained. In addition, in the design of the offset network, a convolutional vision Transformer with a global feature view is adopted as the main module of the network; thanks to the advantages of the Transformer architecture, the offset network can relate local features at arbitrary distances and thus effectively capture long-range offsets, so the problem of erroneous key point offsets at dense lane lines can be effectively alleviated without reducing the resolution of the output feature map.
Example 2
This embodiment provides a block diagram of a system corresponding to the method described in embodiment 1. Fig. 5 is a block diagram of a lane line detection system according to an embodiment of the present application, as shown in fig. 5, the system includes:
an acquisition module 10 for acquiring a current road image having a lane line;
an extraction module 20 for extracting lane line image features with context information from the current road image using ResNet-ECA-PSPNet;
The construction module 30 is configured to input the lane line image features into a CNN network and a visual Transformer network respectively, so as to obtain a confidence map based on the key points, a key point map and an offset map;
and the association module 40 is used for associating the confidence map and the key point map with the offset map to distinguish and construct each lane line so as to acquire lane line detection information of the current road image.
Preferably, the extraction module 20 includes:
an extracting unit 21 for extracting a high-dimensional feature map from the current road image using a res net network as a backbone network;
a training unit 22, configured to input the high-dimensional feature map into the FPN through a convolution kernel with a size of 1×1 for training to obtain a dimension-reduced feature map;
and the fusion unit 23 is configured to perform ECA up-sampling on the dimension reduction feature map by using the ECA-PSPNet to fuse the dimension reduction feature map with the previous level feature map, so as to obtain a lane line image feature with context information, in which the multi-scale information is fused.
Wherein, the ECA up-sampling specifically includes:
performing row rearrangement and column rearrangement on the tensor of the dimension reduction feature map to obtain a row-rearranged vector and a column-rearranged vector;
performing group convolution transformation on the row-rearranged vector and the column-rearranged vector by adopting a first preset model and a second preset model to obtain a transformed row and a transformed column;
respectively performing row rearrangement and column rearrangement on the transformed row and the transformed column, aligning the rearranged transformed row and transformed column with the spatial positions of the original feature map, and fusing the aligned transformed row and transformed column with the row-rearranged vector and the column-rearranged vector respectively to obtain a dimension reduction feature map fused with row and column information;
inputting the dimension reduction feature map fused with the row and column information into an improved coordinate attention mechanism so as to capture the context information of the lane lines at multiple scales, and obtaining the lane line image feature with context information fused with the multi-scale information.
Further, the improved coordinate attention mechanism refers to encoding lateral and longitudinal position information into channel attention so that the network accurately focuses on the spatial structure information of interest; the process of coding the channel relation and the long-distance relation of the improved coordinate attention mechanism comprises the following steps:
using pooling kernels with sizes (1, W') and (H', 1) to encode the horizontal-direction and vertical-direction features of the dimension reduction feature map, obtaining a horizontal-direction c-th channel spatial feature and a vertical-direction c-th channel spatial feature;
connecting the horizontal-direction c-th channel spatial feature and the vertical-direction c-th channel spatial feature along the horizontal-vertical direction, and performing feature conversion by convolution with a 1×1 convolution kernel, BN and nonlinear activation to obtain a first conversion feature;
decomposing the first conversion feature into two independent decomposition features along the vertical-horizontal direction, and using two 1×1 convolutions with sigmoid functions as activation functions to perform feature conversion on the two decomposition features respectively while keeping the dimensions unchanged;
and generating a coordinate attention matrix from the two decomposition features after feature conversion through a broadcasting mechanism, and multiplying the attention matrix by the dimension reduction feature map so that the newly generated feature map accurately focuses on the spatial structure information of interest.
Preferably, the construction module 30 includes:
the sampling unit 31 is configured to sample K key points on each lane line as confidence labels of the key points, and place the confidence labels in the CNN network;
a searching unit 32, configured to find the offsets from the K key points to the corresponding start points as offset labels, and place them in the visual Transformer network;
the supervision unit 33 is configured to input the lane line image features to a CNN network with confidence labels and a visual Transformer network with offset labels, respectively, to perform corresponding supervised learning, so as to obtain the key points of the lane lines and the offsets from the key points to the corresponding starting points;
And a construction unit 34, configured to construct a confidence map, a key point map and an offset map according to the key points and the offsets, respectively. The confidence map is generated by the key point layer of the CNN network structure using two convolution layers, and the offset map is generated by the offset layer of the visual Transformer network structure using a convolutional vision Transformer combined with two convolution layers.
Preferably, the association module 40 includes:
a definition unit 41, configured to define points with probability values of the confidence map greater than a preset threshold as key points;
a determining unit 42, configured to extract an offset corresponding to each of the key points according to the offset map, and extract the key points smaller than a preset offset threshold to determine a unique start point of the lane line;
an obtaining unit 43, configured to obtain the shifted key points according to the shift of the key points other than the unique start point;
and a distinguishing unit 44, configured to count the distance between each of the shifted key points and the unique start point, so as to distinguish the lane line where each of the key points belongs to the unique start point, and construct each lane line, so as to obtain lane line detection information of the current road image.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
Example 3
The lane line detection method described in connection with fig. 2 may be implemented by an electronic device. Fig. 6 is a schematic diagram of the hardware structure of the electronic device according to the present embodiment.
The electronic device may comprise a processor 51 and a memory 52 storing computer program instructions.
In particular, the processor 51 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 52 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 52 may comprise a Hard Disk Drive (HDD), floppy Disk Drive, solid state Drive (Solid State Drive, SSD), flash memory, optical Disk, magneto-optical Disk, tape, or universal serial bus (Universal Serial Bus, USB) Drive, or a combination of two or more of the foregoing. Memory 52 may include removable or non-removable (or fixed) media, where appropriate. The memory 52 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 52 is a Non-Volatile memory. In particular embodiments, memory 52 includes Read-Only Memory (ROM) and random access Memory (Random Access Memory, RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (Programmable Read-Only Memory, abbreviated PROM), an erasable PROM (Erasable Programmable Read-Only Memory, abbreviated EPROM), an electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, abbreviated EEPROM), an electrically rewritable ROM (Electrically Alterable Read-Only Memory, abbreviated EAROM), or a FLASH Memory (FLASH), or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or dynamic Random-Access Memory (Dynamic Random Access Memory DRAM), where the DRAM may be a fast page mode dynamic Random-Access Memory (Fast Page Mode Dynamic Random Access Memory FPMDRAM), extended data output dynamic Random-Access Memory (Extended Date Out Dynamic Random Access Memory EDODRAM), synchronous dynamic Random-Access Memory (Synchronous Dynamic Random-Access Memory SDRAM), or the like, as appropriate.
Memory 52 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 51.
The processor 51 reads and executes the computer program instructions stored in the memory 52 to realize the lane line detection method of embodiment 1 described above.
In some of these embodiments, the electronic device may also include a communication interface 53 and a bus 50. As shown in fig. 6, the processor 51, the memory 52, and the communication interface 53 are connected to each other through the bus 50 and perform communication with each other.
The communication interface 53 is used to implement communication between modules, devices, units, and/or units in the embodiments of the present application. The communication interface 53 may also enable communication with other components such as: and the external equipment, the image/data acquisition equipment, the database, the external storage, the image/data processing workstation and the like are used for data communication.
Bus 50 includes hardware, software, or both, coupling the components of the device to one another. Bus 50 includes, but is not limited to, at least one of: data Bus (Data Bus), address Bus (Address Bus), control Bus (Control Bus), expansion Bus (Expansion Bus), local Bus (Local Bus). By way of example, and not limitation, bus 50 may include a graphics acceleration interface (Accelerated Graphics Port), abbreviated AGP, or other graphics Bus, an enhanced industry standard architecture (Extended Industry Standard Architecture, abbreviated EISA) Bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an industry standard architecture (Industry Standard Architecture, ISA) Bus, a wireless bandwidth (InfiniBand) interconnect, a Low Pin Count (LPC) Bus, a memory Bus, a micro channel architecture (Micro Channel Architecture, abbreviated MCa) Bus, a peripheral component interconnect (Peripheral Component Interconnect, abbreviated PCI) Bus, a PCI-Express (PCI-X) Bus, a serial advanced technology attachment (Serial Advanced Technology Attachment, abbreviated SATA) Bus, a video electronics standards association local (Video Electronics Standards Association Local Bus, abbreviated VLB) Bus, or other suitable Bus, or a combination of two or more of the foregoing. Bus 50 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
The electronic device may execute the lane line detection method of embodiment 1 of the present application based on the obtained lane line detection system.
In addition, in combination with the lane line detection method of the above embodiment 1, an embodiment of the present application may be implemented by providing a storage medium. The storage medium stores computer program instructions; when the computer program instructions are executed by a processor, the lane line detection method of embodiment 1 described above is implemented.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are enumerated; however, any combination of these technical features that contains no contradiction should be considered to fall within the scope of this description.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. A lane line detection method, characterized by comprising:
acquiring a current road image with a lane line;
extracting lane line image features with context information from the current road image by using ResNet-ECA-PSPNet;
inputting the lane line image features into a CNN network and a vision Transformer network respectively, to obtain a confidence map based on key points, a key point map and an offset map;
and carrying out key point association on the confidence map and the key point map and the offset map to distinguish and construct each lane line so as to acquire lane line detection information of the current road image.
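For orientation only, the following is a minimal Python sketch of the pipeline in claim 1. The names detect_lane_lines, backbone, keypoint_head, offset_head and associate_keypoints are hypothetical placeholders for the components described above, not identifiers taken from the patent.

def detect_lane_lines(image, backbone, keypoint_head, offset_head, associate_keypoints):
    """image: the current road image, e.g. a (1, 3, H, W) tensor."""
    # Steps 1-2: extract lane line image features with context information (ResNet-ECA-PSPNet).
    features = backbone(image)
    # Step 3: two parallel branches produce the three maps.
    confidence_map, keypoint_map = keypoint_head(features)   # CNN branch
    offset_map = offset_head(features)                        # vision Transformer branch
    # Step 4: key point association distinguishes and constructs each lane line.
    return associate_keypoints(confidence_map, keypoint_map, offset_map)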
2. The lane line detection method according to claim 1, wherein the step of extracting lane line image features having context information from the current road image using ResNet-ECA-PSPNet specifically comprises:
extracting a high-dimensional feature map from the current road image by adopting a ResNet network as a backbone network;
inputting the high-dimensional feature map into an FPN through a 1×1 convolution kernel for training, to obtain a dimension reduction feature map;
and carrying out ECA up-sampling on the dimension reduction feature map by using ECA-PSPNet to fuse it with the previous-level feature map, so as to obtain the lane line image features that carry the context information and fuse the multi-scale information.
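As a rough illustration of claim 2, the sketch below (assuming PyTorch) shows the 1×1 dimension reduction and the fusion with the previous-level feature map; the ECA up-sampling itself (claim 3) is abstracted away, and the module name LateralFusion and the channel sizes are assumptions.

import torch.nn as nn
import torch.nn.functional as F

class LateralFusion(nn.Module):
    """Reduce a backbone feature map with a 1x1 convolution and fuse it with the previous level."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # 1x1 dimension reduction

    def forward(self, high_level_feat, previous_level_feat):
        # previous_level_feat is assumed to already have `out_channels` channels.
        reduced = self.reduce(high_level_feat)
        upsampled = F.interpolate(reduced, size=previous_level_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return upsampled + previous_level_feat  # multi-scale fusion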
3. The lane line detection method according to claim 2, wherein the ECA up-sampling specifically includes:
rearranging the tensor of the dimension reduction feature map, namely its row vectors and column vectors, to obtain rearranged row vectors and rearranged column vectors;
performing grouped convolution transformation on the rearranged row vectors and the rearranged column vectors by adopting a first preset model and a second preset model, to obtain transformed rows and transformed columns;
respectively carrying out row rearrangement and column rearrangement on the transformed rows and the transformed columns, aligning them with the spatial positions of the original feature map, and respectively fusing the aligned transformed rows and transformed columns with the rearranged row vectors and rearranged column vectors, to obtain a dimension reduction feature map fused with row and column information;
inputting the dimension reduction feature map fused with the row and column information into an improved coordinate attention mechanism, so as to capture the context information of the lane lines over the multi-scale information and obtain the lane line image features that fuse the multi-scale information and carry the context information.
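The row/column processing in claim 3 can be pictured with the loose sketch below (PyTorch assumed). The grouped 1×1 convolutions stand in for the "first preset model" and "second preset model"; the module name RowColumnMixing, the group count and the additive fusion are assumptions rather than the patented design.

import torch.nn as nn

class RowColumnMixing(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        # `channels` is assumed to be divisible by `groups`.
        self.row_conv = nn.Conv2d(channels, channels, kernel_size=1, groups=groups)
        self.col_conv = nn.Conv2d(channels, channels, kernel_size=1, groups=groups)

    def forward(self, x):                                   # x: (N, C, H, W) dimension reduction feature map
        rows = x.permute(0, 1, 3, 2)                        # rearrange so rows and columns swap roles
        t_rows = self.row_conv(rows).permute(0, 1, 3, 2)    # transform, then realign with the original layout
        t_cols = self.col_conv(x)
        return x + t_rows + t_cols                          # fuse row and column information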
4. The lane line detection method according to claim 3, wherein the improved coordinate attention mechanism encodes lateral and longitudinal position information into the channel attention, so that the network accurately focuses on the spatial structure information of interest; the process by which the improved coordinate attention mechanism encodes the channel relationship and the long-range relationship comprises the following steps:
using pooling kernels with sizes of (1, W′) and (H′, 1) to encode the horizontal-direction and vertical-direction features of the dimension reduction feature map, to obtain the c-th channel spatial feature in the horizontal direction and the c-th channel spatial feature in the vertical direction;
concatenating the c-th channel spatial feature in the horizontal direction and the c-th channel spatial feature in the vertical direction along the horizontal-vertical direction, and performing feature conversion by convolution with a 1×1 convolution kernel, BN and nonlinear activation, to obtain a first conversion feature;
decomposing the first conversion feature into two independent decomposition features along the vertical and horizontal directions, and respectively performing feature conversion on the two decomposition features by two 1×1 convolutions with sigmoid functions as the activation functions, keeping the dimension unchanged;
and generating a coordinate attention matrix from the two converted decomposition features through a broadcasting mechanism, and multiplying the attention matrix by the dimension reduction feature map, so that the newly generated feature map accurately focuses on the spatial structure information of interest.
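Claim 4 follows the general shape of a coordinate attention block. The sketch below (assuming PyTorch) shows the (1, W′) and (H′, 1) pooling, the shared 1×1 convolution with BN and a non-linearity, the split back into the two directions, and the sigmoid attention weights; the reduction ratio and class name are assumptions.

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (H', 1) pooling: vertical-direction descriptor
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (1, W') pooling: horizontal-direction descriptor
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                               # x: (N, C, H, W) dimension reduction feature map
        _, _, h, w = x.shape
        x_h = self.pool_h(x)                            # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (N, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))  # first conversion feature
        y_h, y_w = torch.split(y, [h, w], dim=2)        # decompose along the two directions
        a_h = torch.sigmoid(self.conv_h(y_h))                            # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))        # (N, C, 1, W)
        return x * a_h * a_w                            # broadcast the attention onto the feature map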
5. The lane line detection method according to claim 1, wherein the step of inputting the lane line image features into a CNN network and a vision Transformer network respectively, to obtain a confidence map based on key points, a key point map and an offset map, specifically comprises:
sampling K key points on each lane line as confidence labels of the key points, and placing the confidence labels in a CNN (convolutional neural network);
finding the offsets from the K key points to the corresponding starting points to serve as offset labels, and placing the offset labels in a vision Transformer network;
respectively inputting the lane line image features into the CNN network with the confidence labels and the vision Transformer network with the offset labels to perform corresponding supervised learning, so as to synchronously obtain the key points of the lane lines and the offsets from the key points to the corresponding starting points;
and respectively constructing a confidence map, a key point map and an offset map according to the key points and the offsets.
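The label construction in claim 5 can be summarized as below, assuming NumPy, each lane given as an ordered array of (x, y) points, and the starting point taken to be the first sampled key point; the function name and K handling are illustrative only.

import numpy as np

def build_labels(lanes, K):
    keypoints, offsets = [], []
    for lane in lanes:                                    # lane: (M, 2) ordered points along one lane line
        lane = np.asarray(lane, dtype=np.float32)
        idx = np.linspace(0, len(lane) - 1, K).round().astype(int)
        pts = lane[idx]                                   # K sampled key points (confidence labels)
        start = pts[0]                                    # assumed starting point of this lane line
        keypoints.append(pts)
        offsets.append(pts - start)                       # offset labels: key point minus starting point
    return np.stack(keypoints), np.stack(offsets)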
6. The lane line detection method of claim 5, wherein the confidence map is generated by a two-layer convolution of the key-point layer of the CNN network structure, and the offset map is generated by a convolutional vision Transformer combined with a two-layer convolution of the offset layer of the vision Transformer network structure.
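A schematic reading of claim 6, assuming PyTorch: the key-point branch is a plain two-layer convolution, while the offset branch first passes the features through a convolutional vision Transformer, left here as an injected placeholder module that is assumed to preserve the channel count. Channel widths and names are assumptions.

import torch
import torch.nn as nn

def two_layer_conv(in_ch, out_ch, mid_ch=64):
    # Two convolution layers with a non-linearity in between.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1),
    )

class DetectionHeads(nn.Module):
    def __init__(self, in_ch, conv_transformer):
        super().__init__()
        self.keypoint_head = two_layer_conv(in_ch, 1)     # confidence / key-point map (1 channel)
        self.conv_transformer = conv_transformer          # placeholder convolutional vision Transformer
        self.offset_head = two_layer_conv(in_ch, 2)       # offset map with (dx, dy) channels

    def forward(self, feat):
        confidence = torch.sigmoid(self.keypoint_head(feat))
        offsets = self.offset_head(self.conv_transformer(feat))
        return confidence, offsets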
7. The lane line detection method according to claim 1, wherein the step of correlating the confidence map and the key point map with the offset map to distinguish and construct each lane line so as to obtain lane line detection information of the current road image specifically includes:
defining points of the confidence map whose probability values are larger than a preset threshold value as key points;
extracting the offset corresponding to each key point according to the offset map, and taking out the key points whose offsets are smaller than a preset offset threshold value to determine the unique starting point of each lane line;
acquiring the shifted key points according to the offsets of the other key points except the unique starting points;
and computing the distance between each shifted key point and each unique starting point, so as to distinguish which starting point's lane line each key point belongs to and construct each lane line, whereby the lane line detection information of the current road image is acquired.
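Claim 7's decoding step can be pictured as follows, assuming NumPy, a confidence map of shape (H, W), and an offset map of shape (2, H, W) holding (dx, dy); the thresholds and the function name are illustrative.

import numpy as np

def associate(conf_map, offset_map, conf_thr=0.5, start_thr=2.0):
    ys, xs = np.where(conf_map > conf_thr)                # key points: probability above the preset threshold
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    offs = offset_map[:, ys, xs].T                        # per-key-point (dx, dy) offsets
    is_start = np.linalg.norm(offs, axis=1) < start_thr   # small offset -> unique starting point of a lane
    starts = pts[is_start]
    if len(starts) == 0:
        return []
    lanes = [[tuple(s)] for s in starts]
    for p, o in zip(pts[~is_start], offs[~is_start]):
        shifted = p - o                                    # key point after applying its offset
        d = np.linalg.norm(starts - shifted, axis=1)       # distance to each starting point
        lanes[int(d.argmin())].append(tuple(p))            # assign the key point to the closest lane
    return lanes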
8. A lane line detection system, comprising:
the acquisition module is used for acquiring a current road image with a lane line;
the extraction module is used for extracting lane line image features with context information from the current road image by utilizing ResNet-ECA-PSPNet;
the construction module is used for inputting the lane line image features into a CNN network and a vision Transformer network respectively, so as to obtain a confidence map based on key points, a key point map and an offset map;
and the association module is used for carrying out key point association on the confidence map and the key point map and the offset map so as to distinguish and construct each lane line, so that lane line detection information of the current road image is acquired.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the lane line detection method of any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, which when executed by a processor implements the lane line detection method according to any one of claims 1 to 7.
CN202310511812.1A 2023-05-09 2023-05-09 Lane line detection method, system, electronic equipment and storage medium Active CN116229406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310511812.1A CN116229406B (en) 2023-05-09 2023-05-09 Lane line detection method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310511812.1A CN116229406B (en) 2023-05-09 2023-05-09 Lane line detection method, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116229406A true CN116229406A (en) 2023-06-06
CN116229406B CN116229406B (en) 2023-08-25

Family

ID=86587688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310511812.1A Active CN116229406B (en) 2023-05-09 2023-05-09 Lane line detection method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116229406B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576649A (en) * 2023-12-26 2024-02-20 华东师范大学 Lane line detection method and system based on segmentation points and dual-feature enhancement

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926548A (en) * 2021-04-14 2021-06-08 北京车和家信息技术有限公司 Lane line detection method and device, electronic equipment and storage medium
US20210256680A1 (en) * 2020-02-14 2021-08-19 Huawei Technologies Co., Ltd. Target Detection Method, Training Method, Electronic Device, and Computer-Readable Medium
CN114913498A (en) * 2022-05-27 2022-08-16 南京信息工程大学 Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN115082888A (en) * 2022-08-18 2022-09-20 北京轻舟智航智能技术有限公司 Lane line detection method and device
CN115205681A (en) * 2022-07-14 2022-10-18 南京信息工程大学 Lane line segmentation method, lane line segmentation device, and storage medium storing lane line segmentation program
CN115713740A (en) * 2022-11-04 2023-02-24 际络科技(上海)有限公司 Lane line detection method and system based on key points, electronic equipment and vehicle
CN115909241A (en) * 2022-11-15 2023-04-04 浙江鼎奕科技发展有限公司 Lane line detection method, system, electronic device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210256680A1 (en) * 2020-02-14 2021-08-19 Huawei Technologies Co., Ltd. Target Detection Method, Training Method, Electronic Device, and Computer-Readable Medium
CN112926548A (en) * 2021-04-14 2021-06-08 北京车和家信息技术有限公司 Lane line detection method and device, electronic equipment and storage medium
CN114913498A (en) * 2022-05-27 2022-08-16 南京信息工程大学 Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN115205681A (en) * 2022-07-14 2022-10-18 南京信息工程大学 Lane line segmentation method, lane line segmentation device, and storage medium storing lane line segmentation program
CN115082888A (en) * 2022-08-18 2022-09-20 北京轻舟智航智能技术有限公司 Lane line detection method and device
CN115713740A (en) * 2022-11-04 2023-02-24 际络科技(上海)有限公司 Lane line detection method and system based on key points, electronic equipment and vehicle
CN115909241A (en) * 2022-11-15 2023-04-04 浙江鼎奕科技发展有限公司 Lane line detection method, system, electronic device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAKHBOZ ABDIGAPPOROV ET AL.: "Joint Multiclass Object Detection and Semantic Segmentation for Autonomous Driving", 《IEEE》 *
吕川: "Research on Lane Line Recognition and Tracking Methods Based on Key Point Detection", 《知网》 (CNKI) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576649A (en) * 2023-12-26 2024-02-20 华东师范大学 Lane line detection method and system based on segmentation points and dual-feature enhancement
CN117576649B (en) * 2023-12-26 2024-04-30 华东师范大学 Lane line detection method and system based on segmentation points and dual-feature enhancement

Also Published As

Publication number Publication date
CN116229406B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN108647665B (en) Aerial photography vehicle real-time detection method based on deep learning
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
JP2021119506A (en) License-number plate recognition method, license-number plate recognition model training method and device
CN112132145B (en) Image classification method and system based on model extended convolutional neural network
Xiang et al. Lightweight fully convolutional network for license plate detection
Chao et al. Multi-lane detection based on deep convolutional neural network
Ma et al. Mdcn: Multi-scale, deep inception convolutional neural networks for efficient object detection
CN111860496A (en) License plate recognition method, device, equipment and computer readable storage medium
CN116229406B (en) Lane line detection method, system, electronic equipment and storage medium
CN114038004A (en) Certificate information extraction method, device, equipment and storage medium
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN113095152A (en) Lane line detection method and system based on regression
CN113297959A (en) Target tracking method and system based on corner attention twin network
Pultar Improving the hardnet descriptor
Tang et al. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection
CN114399737A (en) Road detection method and device, storage medium and electronic equipment
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Jin et al. Loop closure detection with patch-level local features and visual saliency prediction
CN115359487A (en) Rapid railcar number identification method, equipment and storage medium
Liu et al. IR ship target saliency detection based on lightweight non-local depth features
CN114445916A (en) Living body detection method, terminal device and storage medium
Wang et al. Rethinking low-level features for interest point detection and description
CN112487994A (en) Smoke and fire detection method and system, storage medium and terminal
Wei et al. DRCNet: Road Extraction From Remote Sensing Images Using DenseNet With Recurrent Criss-Cross Attention and Convolutional Block Attention Module
CN113408559B (en) Vehicle brand identification method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant