CN117058640A - 3D lane line detection method and system integrating image features and traffic target semantics - Google Patents

3D lane line detection method and system integrating image features and traffic target semantics

Info

Publication number
CN117058640A
CN117058640A (application CN202311039603.8A)
Authority
CN
China
Prior art keywords
image
features
lane line
top view
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311039603.8A
Other languages
Chinese (zh)
Inventor
徐林海
鲁志瑶
张皓霖
王若彤
陈仕韬
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202311039603.8A priority Critical patent/CN117058640A/en
Publication of CN117058640A publication Critical patent/CN117058640A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/588: Recognition of the road, e.g. of lane markings; recognition of the vehicle driving pattern in relation to the road (context of the image exterior to a vehicle, using sensors mounted on the vehicle)
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to road transport)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a 3D lane line detection method and system integrating image features and traffic target semantics. The method is based on Gen-LaneNet. For the captured image, a segmentation network is used to obtain the road semantic information in the image; the captured image and the road semantic information are each fed into a downsampling network to obtain the road semantic features and the image visual features, and a fusion sub-network fuses the image features with the road semantic information to obtain fused features; the fused features are projected into a virtual top view, and a lane line detection head network predicts the lane lines of the top-view space; the top-view lane lines are geometrically transformed to obtain the lane lines in 3D space. By integrating semantic and visual features with the feature fusion module, predicting the top-view lane lines from the fused feature map, and obtaining the real three-dimensional lane points directly through geometric projection, lane line prediction at the far end of the image becomes more accurate: compared with the model without semantic fusion, the prediction accuracy is nearly doubled.

Description

3D lane line detection method and system integrating image features and traffic target semantics
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a 3D lane line detection method and system integrating image features and traffic target semantics.
Background
Lane detection has become an important problem in the field of automatic driving in recent years. As a key part of intelligent driving assistance, accurate recognition of lane lines plays a vital role in advanced driving systems such as lane departure warning (LDW), blind spot monitoring (BSM) and adaptive cruise control (ACC).
Most lane detection methods treat lane detection as a 2D lane segmentation task. To convert the 2D detection results to 3D space, inverse perspective mapping (IPM) is typically used as a post-processing step. However, IPM assumes a flat ground plane, while real traffic scenes often contain uphill and downhill stretches, so this approach is unreliable in practical traffic environments.
In the existing field of 3D lane perception, inspired by the success of CNNs in monocular depth estimation, 3D-LaneNet designed an end-to-end framework that unifies image encoding, top-view transformation and 3D curve extraction, predicting 3D lane lines directly from the front view. However, the end-to-end framework is strongly affected by visual variation. Gen-LaneNet therefore decouples the image segmentation and three-dimensional feature extraction sub-networks, forming a two-stage network. In the first stage, the input image is encoded and the fused features are decoded into a lane segmentation map. In the second stage, 3D-GeoNet projects the segmentation map into a virtual top view, predicts the lanes with a lane detection head, and obtains the points of the real three-dimensional lane lines directly through a geometric transformation.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a 3D lane line detection method fusing image features and traffic target semantics. Based on Gen-LaneNet, it uses a segmentation network to obtain the road semantic information in the image, encodes the input image features and the road semantic information separately, and fuses them with a fusion sub-network. By fusing the embedded semantic features with the visual features, the method can effectively mine the valuable relations between the objects around the lane lines and the lane lines themselves, yielding better prediction results.
In addition, a computer device is provided, which comprises a processor and a memory, wherein the memory is used for storing a computer executable program, the processor reads the computer executable program from the memory and executes the computer executable program, and the processor can realize the 3D lane line detection method for fusing the image characteristics and the traffic target semantics when executing the program.
The invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program can realize the 3D lane line detection method for fusing the image features and the traffic target semantics when being executed by a processor.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows. A 3D lane line detection method integrating image features and traffic target semantics comprises the following steps: based on Gen-LaneNet, for the captured image, obtaining the road semantic information in the image by using a segmentation network;
respectively inputting the acquired image and the road semantic information into a downsampling network for processing to obtain road semantic information features and image visual features, and fusing the image features and the road semantic information by using a fusion sub-network to obtain fused features;
projecting the fused features into the virtual top view, and predicting lane lines of the top view space by a lane line detection head network;
and carrying out geometric transformation on the lane lines in the top view space to obtain lane lines in the 3D space.
Further, when the road semantic information in an image is obtained using a segmentation network, two visual semantic segmentation networks, FCN and DeepLabV3+, are adopted. Based on a ResNet-50 backbone network, the semantic segmentation network classifies each pixel in the image and generates an image mask partitioned by class; the segmented road object mask is used as the road semantic information for analyzing and understanding the scene.
Further, the downsampling network consists of several convolutional layers; the input image is an H × W × 3 tensor, which the downsampling network maps to a lower-resolution feature tensor.
Further, when the semantic features and the visual features are fused, either the semantic features and the visual features are added element-wise, or the semantic features are used as attention weights and multiplied element-wise with the visual features, so that the semantic information modulates the influence of the visual features.
Further, projecting the fused features into the virtual top view and predicting the lane lines of the top-view space by the lane line detection head network comprises: processing the fused features through an upsampling network to obtain a feature tensor, projecting the feature tensor into the virtual top view through inverse perspective mapping, and then predicting and outputting the lane lines of the top-view space by the lane line detection head network.
Further, the stage-one fusion sub-network feeding the lane line detection head network outputs a feature tensor, which is projected via IPM into a top-view feature tensor and then passed through several pooling and convolution layers to obtain the output tensor encoding the lane line parameter representation.
Further, when the lane lines in the top-view space are geometrically transformed to obtain the lane lines in 3D space, a 3D lane line point (x, y, z) in the ego-vehicle coordinate system is projected to the 2D image pixel point (u, v), and a point (x̄, ȳ) in the top-view coordinate system is mapped to the same 2D image pixel point (u, v) through the homography matrix, formulated as:

$$\lambda\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=K\!\left(R\begin{bmatrix}x\\ y\\ z\end{bmatrix}+T\right),\qquad \bar{\lambda}\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=K\!\left(R\begin{bmatrix}\bar{x}\\ \bar{y}\\ 0\end{bmatrix}+T\right)$$

where R is the rotation matrix, T is the translation vector, and K is the camera intrinsic matrix,

$$R=\begin{bmatrix}1&0&0\\ 0&-s&-c\\ 0&c&-s\end{bmatrix},\qquad T=\begin{bmatrix}0\\ ch\\ sh\end{bmatrix}$$

where θ is the camera pitch angle and h is the camera height; s and c are used to replace sin θ and cos θ. Expanding the two projections and eliminating K yields the following system of equations:

$$\frac{x}{c\,y+s\,(h-z)}=\frac{\bar{x}}{c\,\bar{y}+s\,h},\qquad \frac{-s\,y+c\,(h-z)}{c\,y+s\,(h-z)}=\frac{-s\,\bar{y}+c\,h}{c\,\bar{y}+s\,h}$$

Further processing of the second equation gives $\bar{y}(h-z)=hy$, i.e. $y=\alpha\bar{y}$ with $\alpha=(h-z)/h$. Substituting α into the system gives:

$$x=\alpha\,\bar{x}=\frac{h-z}{h}\,\bar{x},\qquad y=\alpha\,\bar{y}=\frac{h-z}{h}\,\bar{y}$$

whereby points within the top-view space are transformed directly to points in 3D space by this geometric relation.
Based on the conception of the method, the invention provides a 3D lane line detection system for fusing image features and traffic target semantics, which comprises a feature acquisition module, a feature fusion module, a top view lane line acquisition module and a geometric transformation module;
the feature acquisition module is based on Gen-Lananenet, and uses a segmentation network to acquire road semantic information in the image aiming at the acquired image; respectively inputting the acquired image and the road semantic information into a downsampling network for processing to acquire road semantic information characteristics and image visual characteristics;
the feature fusion module uses a fusion sub-network to fuse the image features and the road semantic information to obtain fused features;
the top view lane line acquisition module projects the fused features into a virtual top view, and a lane line detection head network predicts lane lines of a top view space;
the geometric transformation module is used for carrying out geometric transformation on the lane lines in the top view space to obtain lane lines in the 3D space.
The invention also provides computer equipment, which comprises a processor and a memory, wherein the memory is used for storing a computer executable program, the processor reads the computer executable program from the memory and executes the computer executable program, and the processor can realize the 3D lane line detection method for fusing the image characteristics and the traffic target semantics when executing the program.
Meanwhile, a computer readable storage medium is provided, and a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the 3D lane line detection method for fusing the image features and the traffic target semantics can be realized.
Compared with the prior art, the invention has at least the following beneficial effects:
the method comprises the steps of integrating semantic information from a road traffic object into a three-dimensional visual lane detection model based on deep learning, integrating semantic masks of the road moving object into a new branch, integrating semantic features and visual features by using a feature fusion module, predicting a top view lane line by using a fused feature map, directly obtaining a real three-dimensional lane point through geometric projection, and improving the prediction precision of the whole lane line on an Apollo data set by testing the method; the lane line prediction at the far end of the image is more accurate, and compared with the unused prediction precision, the prediction precision is improved by nearly one time; the fusion mechanism provided by the invention is effective and good in universality for the 3D lane detection method, and the proposed target segmentation algorithm can be successfully applied to practical application.
In the image segmentation sub-network of the invention, the RGB image and the vehicle mask are input into the encoder simultaneously; by embedding the vehicle mask, valuable information can be mined from the detected surrounding objects. Furthermore, the invention provides different segmentation networks to obtain the road semantic information for auxiliary detection, and uses two different fusion methods to fuse the semantic and visual features.
Drawings
Fig. 1 is a two-stage framework of the present invention.
Fig. 2 shows semantic mask extraction and its impact on lane detection. Since a car obstructs the leftmost lane line in the two-dimensional image, direct prediction based on the image alone leads to significant deviations at long distances. The invention, which fuses the road semantic information, achieves accurate prediction of the lane lines, demonstrating the auxiliary effect of object semantics.
Fig. 3 shows a representation of the lane lines (anchor) and the principle of geometrical projection.
Fig. 4 shows a specific composition of a one-stage downsampling network.
Fig. 5 shows a specific composition of a one-stage up-sampling network.
FIG. 6 shows a specific configuration of a two-stage lane line detector network.
FIG. 7 is a comparison of the prediction results of the present invention with the prior-art method, wherein the red lines indicate the predicted lane lines and the blue lines indicate the ground-truth lane lines.
Detailed Description
As shown in fig. 1, the present invention includes two stages, divided into four steps.
In the first stage, the subnets are converged.
In the first step, to obtain the road semantic information in the image, the invention adopts two mainstream visual semantic segmentation methods, FCN and DeepLabV3+, corresponding to the semantic segmentation network of FIG. 2(a). The semantic segmentation network uses a ResNet-50 backbone to classify each pixel in the image, producing a class-partitioned image mask. The invention mainly uses the segmented road object mask as the road semantic information for further analyzing and understanding the scene.
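As an illustrative embodiment of this first step, the following minimal sketch extracts a road-object mask with an off-the-shelf ResNet-50-based segmentation network; the particular torchvision weights and the class indices kept as road objects are assumptions for illustration, not fixed by the description above.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Pretrained ResNet-50-based segmentation model (a stand-in for the FCN /
# DeepLabV3+ networks named above; the exact weights are an assumption).
model = deeplabv3_resnet50(weights="DEFAULT").eval()

@torch.no_grad()
def road_semantic_mask(image: torch.Tensor, road_classes=(6, 7, 14)) -> torch.Tensor:
    """image: (1, 3, H, W) normalized RGB; returns a (1, 1, H, W) binary mask.

    road_classes are hypothetical ids of traffic objects (e.g. bus, car,
    motorbike in the VOC label set); adjust to the label set actually used.
    """
    logits = model(image)["out"]                  # (1, num_classes, H, W)
    labels = logits.argmax(dim=1, keepdim=True)   # per-pixel class id
    mask = torch.zeros_like(labels, dtype=torch.float32)
    for c in road_classes:                        # keep only road-object classes
        mask += (labels == c).float()
    return mask.clamp(max=1.0)
```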
Secondly, in the fusion network of FIG. 2(a), the road semantic information obtained in the first step and the captured image are fed simultaneously into a downsampling network composed of several convolutional layers, yielding the road semantic features and the image visual features respectively. The specific structure of the downsampling network is shown in FIG. 4: the input image is an H × W × 3 tensor, which the downsampling network maps to a lower-resolution feature tensor. To fuse the semantic features with the visual features, the invention adopts two methods: one adds the semantic and visual features element-wise; the other uses the semantic features as attention weights, multiplied element-wise with the visual features, so that the semantic information modulates the influence of the visual features. Specifically:

Let $F_I^{ij}$ denote the element in row $i$, column $j$ of the image visual feature tensor $F_I$, and $F_M^{ij}$ the corresponding element of the road semantic feature tensor $F_M$; $H$ and $W$ are the height and width of the feature map.

Additive fusion:

$$F^{ij} = F_I^{ij} + F_M^{ij}$$

Multiplicative fusion:

$$F = F_I \odot F_M,\qquad F^{ij} = F_I^{ij} \cdot F_M^{ij}$$

where $\odot$ denotes element-wise multiplication.
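A minimal sketch of the two fusion rules above, assuming the visual features F_I and the semantic features F_M have already been brought to the same shape by the two downsampling branches (module and tensor names are illustrative):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses image visual features with road semantic features."""
    def __init__(self, mode: str = "mul"):
        super().__init__()
        self.mode = mode  # "add": element-wise sum; "mul": semantics as attention

    def forward(self, f_img: torch.Tensor, f_sem: torch.Tensor) -> torch.Tensor:
        if self.mode == "add":
            return f_img + f_sem   # additive fusion: F^{ij} = F_I^{ij} + F_M^{ij}
        return f_img * f_sem       # multiplicative fusion: F = F_I (.) F_M

fuse = FeatureFusion("mul")
fused = fuse(torch.randn(1, 128, 45, 60), torch.randn(1, 128, 45, 60))
```

In the multiplicative variant the semantic tensor acts as a per-element attention weight, so regions flagged as traffic objects rescale the visual response at those locations.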
The fused features then pass through an upsampling network consisting of several deconvolution layers (FIG. 2(b)); its specific structure is shown in FIG. 5. The upsampling network maps the fused low-resolution feature tensor back to a higher-resolution decoded feature tensor.
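The encoder-decoder pair might look as follows; the channel widths and the number of stride-2 stages are assumptions, since the exact dimensions shown in FIG. 4 and FIG. 5 do not reproduce in this text:

```python
import torch.nn as nn

def down_block(c_in: int, c_out: int) -> nn.Sequential:
    """One stride-2 convolution stage of the downsampling network."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# H x W x 3 image -> lower-resolution feature tensor (one such branch each for
# the RGB image and the semantic mask, with 3 and 1 input channels respectively).
downsample = nn.Sequential(down_block(3, 32), down_block(32, 64), down_block(64, 128))

# Fused features -> higher-resolution decoded features via deconvolutions.
upsample = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
)
```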
Second stage, 3D geometry subnetwork.
Third, as shown in FIG. 2(c), the decoded features are first projected into the virtual top view through inverse perspective mapping (IPM), and a lane line detection head network consisting of several pooling and convolution layers then predicts and outputs the lane lines of the top-view space. The specific composition of the lane line detection head network is shown in FIG. 6: the stage-one fusion sub-network outputs a feature tensor, which is projected via IPM into a top-view feature tensor and then passed through several pooling and convolution layers to produce the output tensor encoding the lane line parameter representation.
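The IPM step can be sketched as a homography warp of the feature map; here H_ipm, assumed to be precomputed from K, θ and h, maps top-view pixel coordinates to image pixel coordinates, and the output resolution is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def ipm_warp(feat: torch.Tensor, H_ipm: torch.Tensor, out_hw=(208, 128)) -> torch.Tensor:
    """feat: (1, C, H, W) image-space features -> (1, C, out_h, out_w) top-view features.

    H_ipm: (3, 3) float tensor taking top-view pixels to image pixels (assumed given).
    """
    _, _, H, W = feat.shape
    out_h, out_w = out_hw
    ys, xs = torch.meshgrid(torch.arange(out_h, dtype=torch.float32),
                            torch.arange(out_w, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)  # top-view pixels, homogeneous
    src = (H_ipm @ grid.T).T                                   # map into the image plane
    src = src[:, :2] / src[:, 2:3]                             # dehomogenize
    src[:, 0] = src[:, 0] / (W - 1) * 2 - 1                    # normalize x to [-1, 1]
    src[:, 1] = src[:, 1] / (H - 1) * 2 - 1                    # normalize y to [-1, 1]
    return F.grid_sample(feat, src.reshape(1, out_h, out_w, 2), align_corners=True)
```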
Finally, referring to the geometric transformation of FIG. 1(d) and FIG. 3(a), the lane line points of the top-view space are projected directly into 3D space, yielding the lane line points of the 3D space. The geometric transformation is as follows.

As shown in FIG. 3(b), a 3D lane line point (x, y, z) in the ego-vehicle coordinate system is projected to the 2D image pixel point (u, v), and a point (x̄, ȳ) in the top-view coordinate system is mapped to the same 2D image pixel point (u, v) through the homography matrix, formulated as:

$$\lambda\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=K\!\left(R\begin{bmatrix}x\\ y\\ z\end{bmatrix}+T\right),\qquad \bar{\lambda}\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=K\!\left(R\begin{bmatrix}\bar{x}\\ \bar{y}\\ 0\end{bmatrix}+T\right)$$

where R is the rotation matrix, T is the translation vector, and K is the camera intrinsic matrix,

$$R=\begin{bmatrix}1&0&0\\ 0&-s&-c\\ 0&c&-s\end{bmatrix},\qquad T=\begin{bmatrix}0\\ ch\\ sh\end{bmatrix}$$

where θ is the camera pitch angle and h is the camera height; s and c are used to replace sin θ and cos θ. Expanding the two projections and eliminating K yields the following system of equations:

$$\frac{x}{c\,y+s\,(h-z)}=\frac{\bar{x}}{c\,\bar{y}+s\,h},\qquad \frac{-s\,y+c\,(h-z)}{c\,y+s\,(h-z)}=\frac{-s\,\bar{y}+c\,h}{c\,\bar{y}+s\,h}$$

Cross-multiplying the second equation and simplifying gives $\bar{y}(h-z)=hy$, i.e. $y=\alpha\bar{y}$ with $\alpha=(h-z)/h$. Substituting α into the system gives:

$$x=\alpha\,\bar{x}=\frac{h-z}{h}\,\bar{x},\qquad y=\alpha\,\bar{y}=\frac{h-z}{h}\,\bar{y}$$

whereby points in the top-view space can be transformed directly to points in 3D space by this geometric transformation.
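The closing step then reduces to a one-line scaling per lane point; a minimal sketch under the derivation above:

```python
def topview_to_3d(x_bar: float, y_bar: float, z: float, h: float) -> tuple:
    """Map a top-view point (x_bar, y_bar) with predicted height z to the real
    3D lane point (x, y, z) in the ego-vehicle frame; h is the camera height."""
    alpha = (h - z) / h   # points above the ground plane appear farther out in IPM
    return (x_bar * alpha, y_bar * alpha, z)

# Example: a top-view point at (2.0 m, 40.0 m) with z = 0.5 m and camera height
# h = 1.6 m corresponds to the real 3D point (1.375 m, 27.5 m, 0.5 m).
print(topview_to_3d(2.0, 40.0, 0.5, 1.6))
```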
A top-view lane line is predicted from the fused feature map, and the real three-dimensional lane points are obtained directly through geometric projection. As shown in FIG. 7, the invention predicts more accurately than the original model in long-range detection. In addition, the various object segmentation algorithms can be applied successfully in practice.
The invention provides a 3D lane line detection system integrating image features and traffic target semantics, which comprises a feature acquisition module, a feature fusion module, a top-view lane line acquisition module and a geometric transformation module;
the feature acquisition module is based on Gen-Lananenet, and uses a segmentation network to acquire road semantic information in the image aiming at the acquired image; respectively inputting the acquired image and the road semantic information into a downsampling network for processing to acquire road semantic information characteristics and image visual characteristics;
the feature fusion module uses a fusion sub-network to fuse the image features and the road semantic information to obtain fused features;
the top view lane line acquisition module projects the fused features into a virtual top view, and a lane line detection head network predicts lane lines of a top view space;
the geometric transformation module is used for carrying out geometric transformation on the lane lines in the top view space to obtain lane lines in the 3D space.
The invention also provides computer equipment, which comprises a processor and a memory, wherein the memory is used for storing a computer executable program, the processor reads the computer executable program from the memory and executes the computer executable program, and the processor can realize the 3D lane line detection method for fusing the image characteristics and the traffic target semantics when executing the computer executable program.
On the other hand, the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the 3D lane line detection method for fusing the image features and the traffic target semantics can be realized.
The computer device may be a notebook computer, a desktop computer, a vehicle computer, or a workstation.
The processor of the present invention may be a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory can be an internal storage unit of a notebook computer, desktop computer, vehicle-mounted computer or workstation, such as main memory or a hard disk; external storage units such as a removable hard disk or a flash memory card may also be used.
Computer readable storage media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. A computer readable storage medium may include read-only memory (ROM), random access memory (RAM), solid state drives (SSD), optical disks, etc. Random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM), among others.
It should be noted that the above description only illustrates specific embodiments of the present invention, and the scope of the present invention is not limited thereto; those skilled in the art should understand that modifications or variations made according to the technical solution of the present invention and its inventive concept are covered by the scope of the present invention.

Claims (10)

1. A 3D lane line detection method integrating image features and traffic target semantics, characterized by comprising the following steps: based on Gen-LaneNet, for the captured image, obtaining the road semantic information in the image by using a segmentation network;
respectively inputting the acquired image and the road semantic information into a downsampling network for processing to obtain road semantic information features and image visual features, and fusing the image features and the road semantic information by using a fusion sub-network to obtain fused features;
projecting the fused features into the virtual top view, and predicting lane lines of the top view space by a lane line detection head network;
and carrying out geometric transformation on the lane lines in the top view space to obtain lane lines in the 3D space.
2. The 3D lane line detection method for fusing image features and traffic target semantics as claimed in claim 1, wherein when the road semantic information in an image is obtained by using a segmentation network, two visual semantic segmentation networks, FCN and DeepLabV3+, are adopted; based on a ResNet-50 backbone network, the semantic segmentation network classifies each pixel in the image and generates an image mask partitioned by class, and the segmented road object mask is used as the road semantic information for analyzing and understanding the scene.
3. The method for detecting 3D lane lines by combining image features and traffic target semantics according to claim 1, wherein the downsampling network is composed of several convolutional layers, the input image is an H × W × 3 tensor, and the downsampling network maps it to a lower-resolution feature tensor.
4. The 3D lane line detection method of fusing image features and traffic target semantics according to claim 3, wherein when the semantic features and the visual features are fused, the semantic features and the visual features are added element-wise; or the semantic features are used as attention weights and multiplied element-wise with the visual features, the semantic information modulating the influence of the visual features.
5. The method for detecting 3D lane lines by fusing image features and traffic target semantics according to claim 1, wherein projecting the fused features into the virtual top view and predicting the lane lines of the top-view space by the lane line detection head network comprises: processing the fused features through an upsampling network to obtain a feature tensor, projecting the feature tensor into the virtual top view through inverse perspective mapping, and then predicting and outputting the lane lines of the top-view space by the lane line detection head network.
6. The method for 3D lane line detection based on the fusion of image features and traffic target semantics according to claim 5, wherein the stage-one fusion sub-network feeding the lane line detection head network outputs a feature tensor, which is projected via IPM into a top-view feature tensor and then passed through several pooling and convolution layers to obtain the output tensor encoding the lane line parameter representation.
7. The method for detecting 3D lane lines by fusing image features and traffic target semantics as claimed in claim 1, wherein when the lane lines in the top-view space are geometrically transformed to obtain the lane lines in 3D space, a 3D lane line point (x, y, z) in the ego-vehicle coordinate system is projected to the 2D image pixel point (u, v), and a point (x̄, ȳ) in the top-view coordinate system is mapped to the same 2D image pixel point (u, v) through the homography matrix, formulated as:

$$\lambda\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=K\!\left(R\begin{bmatrix}x\\ y\\ z\end{bmatrix}+T\right),\qquad \bar{\lambda}\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=K\!\left(R\begin{bmatrix}\bar{x}\\ \bar{y}\\ 0\end{bmatrix}+T\right)$$

where R is the rotation matrix, T is the translation vector, and K is the camera intrinsic matrix,

$$R=\begin{bmatrix}1&0&0\\ 0&-s&-c\\ 0&c&-s\end{bmatrix},\qquad T=\begin{bmatrix}0\\ ch\\ sh\end{bmatrix}$$

where θ is the camera pitch angle and h is the camera height; s and c are used to replace sin θ and cos θ. Expanding the two projections and eliminating K yields the following system of equations:

$$\frac{x}{c\,y+s\,(h-z)}=\frac{\bar{x}}{c\,\bar{y}+s\,h},\qquad \frac{-s\,y+c\,(h-z)}{c\,y+s\,(h-z)}=\frac{-s\,\bar{y}+c\,h}{c\,\bar{y}+s\,h}$$

Further processing of the second equation gives $\bar{y}(h-z)=hy$, i.e. $y=\alpha\bar{y}$ with $\alpha=(h-z)/h$. Substituting α into the system gives:

$$x=\alpha\,\bar{x}=\frac{h-z}{h}\,\bar{x},\qquad y=\alpha\,\bar{y}=\frac{h-z}{h}\,\bar{y}$$

whereby points within the top-view space are transformed directly to points in 3D space by this geometric relation.
8. A 3D lane line detection system integrating image features and traffic target semantics, characterized by comprising a feature acquisition module, a feature fusion module, a top-view lane line acquisition module and a geometric transformation module;
the feature acquisition module is based on Gen-Lananenet, and uses a segmentation network to acquire road semantic information in the image aiming at the acquired image; respectively inputting the acquired image and the road semantic information into a downsampling network for processing to acquire road semantic information characteristics and image visual characteristics;
the feature fusion module uses a fusion sub-network to fuse the image features and the road semantic information to obtain fused features;
the top view lane line acquisition module projects the fused features into a virtual top view, and a lane line detection head network predicts lane lines of a top view space;
the geometric transformation module is used for carrying out geometric transformation on the lane lines in the top view space to obtain lane lines in the 3D space.
9. A computer device comprising a processor and a memory, the memory storing a computer executable program, the processor reading the computer executable program from the memory and executing the program, the processor executing the program implementing the 3D lane line detection method of fusing image features and traffic target semantics according to any one of claims 1-7.
10. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the 3D lane line detection method of fusing image features and traffic target semantics according to any one of claims 1 to 7 can be implemented.
CN202311039603.8A 2023-08-17 2023-08-17 3D lane line detection method and system integrating image features and traffic target semantics Pending CN117058640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311039603.8A CN117058640A (en) 2023-08-17 2023-08-17 3D lane line detection method and system integrating image features and traffic target semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311039603.8A CN117058640A (en) 2023-08-17 2023-08-17 3D lane line detection method and system integrating image features and traffic target semantics

Publications (1)

Publication Number Publication Date
CN117058640A 2023-11-14

Family

ID=88654908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311039603.8A Pending CN117058640A (en) 2023-08-17 2023-08-17 3D lane line detection method and system integrating image features and traffic target semantics

Country Status (1)

Country Link
CN (1) CN117058640A (en)


Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination