CN114842085A - Full-scene vehicle attitude estimation method - Google Patents

Full-scene vehicle attitude estimation method

Info

Publication number
CN114842085A
Authority
CN
China
Prior art keywords
image
vehicle
layer
key point
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210780438.0A
Other languages
Chinese (zh)
Other versions
CN114842085B (en)
Inventor
刘寒松
王永
王国强
刘瑞
翟贵乾
李贤超
焦安健
谭连胜
董玉超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sonli Holdings Group Co Ltd
Original Assignee
Sonli Holdings Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sonli Holdings Group Co Ltd filed Critical Sonli Holdings Group Co Ltd
Priority to CN202210780438.0A priority Critical patent/CN114842085B/en
Publication of CN114842085A publication Critical patent/CN114842085A/en
Application granted granted Critical
Publication of CN114842085B publication Critical patent/CN114842085B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention belongs to the technical field of vehicle attitude estimation, and relates to a full-scene vehicle attitude estimation method.

Description

Full-scene vehicle attitude estimation method
Technical Field
The invention belongs to the technical field of vehicle attitude estimation, and relates to a full-scene vehicle attitude estimation method.
Background
Autonomous driving has broad prospects and is the development trend of future automobiles. Its development requires vehicles to be able to clearly judge the surrounding environment, select correct driving routes and driving behaviors, and assist the driver in controlling the vehicle. Real driving scenes are complex and changeable, and every complex scene calls for different countermeasures. Vehicle attitude estimation, an important task in autonomous driving technology, aims to locate the key points of vehicles in images or videos and helps to judge the driving states of surrounding vehicles.
At present, the main challenge of vehicle attitude estimation is occlusion. Occlusion exists in every driving scene, for example occlusion between vehicles, between pedestrians and vehicles, and between other objects and vehicles, yet existing vehicle attitude estimation methods have difficulty recognizing the vehicle attitude in occluded scenes. A vehicle attitude estimation method oriented to the full scene is therefore urgently needed.
Convolutional neural networks have achieved excellent performance in the field of attitude estimation, but most work treats the deep convolutional neural network as a powerful black-box predictor, and how it captures the spatial relationships between components remains unclear. From the viewpoint of both science and practical application, the interpretability of a model helps to understand how the model relates variables to reach the final prediction and how the attitude estimation algorithm processes various input images. In the vehicle attitude estimation task, the Transformer can capture long-distance relationships and thereby reveal the dependency relationships between key points.
Since the advent of the Transformer, its high computational efficiency and scalability have made it dominant in natural language processing. It is a deep neural network based mainly on the self-attention mechanism, and owing to its powerful performance, researchers have been looking for ways to apply the Transformer to computer vision tasks; the performance of Transformer-based models on various vision benchmarks is similar to or better than that of other types of networks (such as convolutional and recurrent networks). However, no report or use of such models in vehicle attitude estimation is known at present.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a full-scene vehicle attitude estimation method that achieves efficient vehicle attitude estimation. The method takes Swin Transformer as the backbone network for feature extraction, uses a Transformer encoder to encode the feature map information into position representations of key points, obtains key point dependency terms by calculating attention scores, and predicts the final key point positions, thereby effectively handling vehicle occlusion and realizing full-scene vehicle attitude estimation.
To achieve this purpose, the invention introduces Swin Transformer as the backbone network, optimizes the network structure according to the characteristics of the vehicle attitude estimation task, compresses the original image information into a compact position sequence of key points so that the vehicle attitude estimation task is converted into an encoding task, obtains key point dependency terms by calculating attention scores, and predicts the final key point positions. The specific process comprises the following steps:
(1) data set construction:
selecting vehicle images from open source data sets, collecting images of various vehicles from traffic monitoring and parking lot scenes, constructing a vehicle data set, and dividing the vehicle data set into a training set, a verification set and a test set;
(2) image segmentation: each image in the vehicle data set is segmented into non-overlapping image slices by a slice segmentation module; each image slice is regarded as a token and is characterized by the concatenated raw RGB values of the input image;
(3) hierarchical feature extraction by the backbone network: the image slice tokens obtained in step (2) first pass through the linear embedding layer of the first stage of the backbone network, which maps the feature dimension to an arbitrary dimension C; hierarchical feature extraction is then carried out through the two Swin Transformer blocks and the second stage to obtain a feature map;
(4) position coding: the feature map obtained in step (3) is input into the position coding layer for position coding; the feature map first passes through a 1×1 convolution or a linear layer and is flattened into (H/8)×(W/8) vectors of dimension 2C, and these vectors pass through four attention layers and a feed-forward neural network to output feature vectors, where H and W are the height and width of the image respectively;
(5) key point heat map generation: the feature vectors obtained in step (4) are reshaped back into an (H/8)×(W/8)×2C feature map, the channel dimension is then reduced from 2C to K, and the predicted K key point heat maps are generated, where K is the number of key points of each vehicle and takes the value 78;
(6) result output: the key point heat maps are converted into key point coordinates through non-maximum suppression, and the key point positions are marked in the original image, realizing full-scene vehicle attitude estimation.
Further, 78 key points are defined for each vehicle in the vehicle images in step (1), and the bounding box and category of the vehicle, i.e. the minimum bounding rectangle of the vehicle, are labeled.
Further, the size of each image slice in step (2) is 4×4 pixels, with a feature dimension of 4×4×3 = 48.
Further, the backbone network in step (3) adopts a Swin Transformer backbone network. The first stage comprises a linear embedding layer and two Swin Transformer blocks, and the number of tokens in the two Swin Transformer blocks is (H/4)×(W/4), where H and W are the height and width of the input image; the second stage comprises a linear merging layer and two Swin Transformer blocks, the image slice tokens after the first-stage feature extraction are reduced by the linear merging layer, which concatenates the features of each group of 2×2 adjacent blocks and applies a linear layer on the resulting 4C-dimensional features, so that the number of tokens is reduced by a factor of 4 and the output dimension becomes 2C; feature transformation is then carried out through the two Swin Transformer blocks, and the resolution of the obtained feature map is (H/8)×(W/8), realizing hierarchical feature extraction.
Further, the position coding layer in step (4) adopts an encoder with the standard Transformer architecture. The position coding layer regards the feature map as dynamic weights determined by the specific image content and re-weights the information flow in forward propagation; key point dependency items are obtained by calculating the scores of the last attention layer, where a higher attention score at a position in the image indicates a larger contribution to predicting the key point, and occluded key points are predicted through these key point dependencies, as illustrated by the sketch below.
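A minimal sketch of this dependency extraction, under assumed tensor shapes and names (the patent itself gives no code): the attention row of the last encoder layer belonging to a key point token is read out, and the image positions with the highest scores are taken as its dependency items.

```python
import torch

def keypoint_dependencies(last_attn: torch.Tensor, k: int, top_n: int = 5):
    """Pick the positions that contribute most to key point k.

    last_attn: attention weights of the final encoder layer, averaged over heads,
               shape (num_tokens, num_tokens) -- an assumed layout.
    k:         index of the token associated with key point k.
    """
    scores = last_attn[k]                          # attention row for key point k
    top_scores, top_positions = torch.topk(scores, top_n)
    return top_positions, top_scores               # dependency positions and their scores

# usage sketch: average a dummy 8-head attention map and query key point 0
attn = torch.softmax(torch.randn(8, 784, 784), dim=-1).mean(dim=0)
positions, scores = keypoint_dependencies(attn, k=0)
```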
Compared with the prior art, the invention replaces the traditional convolutional neural network with Swin Transformer and adopts a hierarchical Transformer as the backbone network, which improves computational efficiency and keeps the computational complexity linear. A standard Transformer encoder is used to capture long-distance relationships in the image and reveal the dependency relationships of the predicted key points; the final attention layer collects the dependency items that contribute most to each key point to form the final predicted key point position, which addresses the occlusion problem. The method achieves a good balance between detection accuracy and speed and has high practical application value.
Drawings
Fig. 1 is a schematic structural framework diagram of a vehicle attitude estimation system provided by the present invention.
Fig. 2 is a schematic structural diagram of a first stage of the backbone network according to the present invention.
Fig. 3 is a schematic structural diagram of a second stage of the backbone network according to the present invention.
FIG. 4 is a structural diagram of a single coding layer according to the present invention.
FIG. 5 is a block flow diagram of a vehicle attitude estimation method according to the present invention.
Detailed Description
The invention will be further described below by way of examples and with reference to the accompanying drawings, without in any way limiting the scope of the invention.
Example:
This embodiment provides a full-scene vehicle attitude estimation method based on a Transformer backbone and a position encoder. The method introduces Swin Transformer as the backbone network, converts the vehicle attitude estimation task into an encoding task by compressing the original image information into a compact position sequence of key points, obtains key point dependency terms by calculating attention scores, and predicts the final key point positions, so that the positions of occluded vehicle key points can be effectively predicted and full-scene vehicle attitude estimation is realized. As shown in FIGS. 1-5, the method specifically comprises the following steps:
(1) data set construction:
vehicle images are selected from open source data sets and images containing various vehicles are collected from real scenes such as traffic monitoring and parking lots to construct a vehicle data set; 78 key points are defined on each vehicle, taking a car as an example, mainly points with strong local texture feature information, such as corner points (the 4 corner points of the car lamps, the 4 corner points of the front and rear windshields, and the like); the bounding box and category of the vehicle, i.e. the minimum bounding rectangle of the vehicle, are labeled; finally the data set is divided into a training set, a verification set and a test set; a possible annotation layout is sketched after this paragraph;
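A minimal sketch of what one annotation record could look like under the description above (the field names and file layout are assumptions; the patent does not prescribe a storage format):

```python
# one hypothetical annotation record for a single vehicle instance
annotation = {
    "image": "parking_lot/000123.jpg",   # assumed relative path to the source frame
    "category": "car",                   # vehicle category label
    "bbox": [412, 208, 655, 374],        # minimum bounding rectangle (x1, y1, x2, y2) in pixels
    "keypoints": [[0.0, 0.0]] * 78,      # 78 [x, y] key points, e.g. lamp and windshield corners
}
assert len(annotation["keypoints"]) == 78
```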
(2) image segmentation:
the vehicle image is segmented into non-overlapping image slices by the slice segmentation module; the size of each image slice is 4×4 pixels and its feature dimension is 4×4×3 = 48; each image slice is regarded as a token and is characterized by the concatenated raw RGB values of the input image, as sketched in the code below;
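A minimal sketch of this patch partition step (tensor layout and function name are assumptions, following a plain 4×4 non-overlapping split):

```python
import torch

def partition_patches(image: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Split an image of shape (3, H, W) into non-overlapping patch tokens.

    Each 4x4 patch is flattened into a 4*4*3 = 48-dimensional token,
    giving (H/4)*(W/4) tokens in total.
    """
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by the patch size"
    tokens = image.unfold(1, patch, patch).unfold(2, patch, patch)     # (3, H/4, W/4, 4, 4)
    tokens = tokens.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return tokens                                                      # ((H/4)*(W/4), 48)

# usage sketch on a dummy 224x224 RGB image
tokens = partition_patches(torch.rand(3, 224, 224))
print(tokens.shape)   # torch.Size([3136, 48])
```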
(3) hierarchical feature extraction by the backbone network:
the backbone network is divided into two stages, and the image slice tokens first pass through the first stage; as shown in fig. 2, the first stage comprises a linear embedding layer and two Swin Transformer blocks, the linear embedding layer is applied to the raw-value features of the image slices and maps them to an arbitrary dimension C, and the number of tokens in the Swin Transformer blocks is (H/4)×(W/4), where H and W are the height and width of the input image; this is followed by the second stage, as shown in fig. 3, in which the tokens are reduced by a linear merging layer as the network deepens: the linear merging layer concatenates the features of each group of 2×2 adjacent blocks and applies a linear layer on the resulting 4C-dimensional features, so that the number of tokens is reduced by a factor of 4 and the output dimension becomes 2C; feature transformation is then carried out through two Swin Transformer blocks, and the resolution of the obtained feature map is (H/8)×(W/8), realizing hierarchical feature extraction (a sketch of the embedding and merging operations follows below);
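A minimal sketch of the linear embedding and token merging operations described above (the Swin Transformer blocks themselves are omitted; layer names, the default C = 96, and the token-grid layout are assumptions consistent with the standard Swin Transformer design):

```python
import torch
import torch.nn as nn

class LinearEmbedding(nn.Module):
    """Stage 1 entry: project 48-dimensional patch tokens to an arbitrary dimension C."""
    def __init__(self, c: int = 96):
        super().__init__()
        self.proj = nn.Linear(48, c)

    def forward(self, tokens):                       # tokens: ((H/4)*(W/4), 48)
        return self.proj(tokens)                     # ((H/4)*(W/4), C)

class PatchMerging(nn.Module):
    """Stage 2 entry: concatenate each 2x2 group of tokens (4C) and project to 2C."""
    def __init__(self, c: int = 96):
        super().__init__()
        self.reduce = nn.Linear(4 * c, 2 * c)

    def forward(self, x, h, w):                      # x: (h*w, C) tokens on an h x w grid
        c = x.shape[-1]
        x = x.view(h, w, c)
        groups = torch.cat([x[0::2, 0::2], x[1::2, 0::2],
                            x[0::2, 1::2], x[1::2, 1::2]], dim=-1)    # (h/2, w/2, 4C)
        return self.reduce(groups.view(-1, 4 * c))   # ((h/2)*(w/2), 2C), 4x fewer tokens

# usage sketch for a 224x224 input: 56x56 tokens of dim C -> 28x28 tokens of dim 2C
x = LinearEmbedding()(torch.rand(56 * 56, 48))
x = PatchMerging()(x, 56, 56)
print(x.shape)   # torch.Size([784, 192])
```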
(4) position coding:
the feature map output by the backbone network is input into the coding layers; this embodiment has 4 coding layers, each structured as shown in fig. 4: the feature map first passes through a 1×1 convolution or a linear layer and is flattened into (H/8)×(W/8) vectors of dimension 2C, and these vectors pass through the four attention layers and a feed-forward neural network to obtain the feature vectors, as sketched below;
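A minimal sketch of the flatten-and-encode step (a stock PyTorch Transformer encoder is used as a stand-in for the coding layers; channel size 2C = 192, 4 layers, and 8 heads are assumptions based on the description):

```python
import torch
import torch.nn as nn

class PositionCodingHead(nn.Module):
    """Flatten the backbone feature map and pass it through Transformer encoder layers."""
    def __init__(self, channels: int = 192, num_layers: int = 4, heads: int = 8):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)      # the 1x1 convolution
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           dim_feedforward=4 * channels,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feat):                        # feat: (B, 2C, H/8, W/8)
        x = self.proj(feat)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)            # (B, (H/8)*(W/8), 2C) token sequence
        return self.encoder(x)                      # attention layers + feed-forward network

# usage sketch for a 224x224 input (feature map 28x28, 2C = 192)
out = PositionCodingHead()(torch.rand(1, 192, 28, 28))
print(out.shape)   # torch.Size([1, 784, 192])
```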
(5) generating a keypoint heat map:
the coding layers output the feature vectors, which are first reshaped back into an (H/8)×(W/8)×2C feature map; the channel dimension is then reduced from 2C to K (K being the number of key points per vehicle, with the value 78), and the predicted K key point heat maps are generated, as sketched below;
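A minimal sketch of this heat map head (the channel reduction is shown here as a 1×1 convolution, which is an assumption; the patent only states that the channel dimension is reduced from 2C to K):

```python
import torch
import torch.nn as nn

class HeatmapHead(nn.Module):
    """Reshape encoder tokens back to a feature map and predict K = 78 heat maps."""
    def __init__(self, channels: int = 192, num_keypoints: int = 78):
        super().__init__()
        self.to_heatmaps = nn.Conv2d(channels, num_keypoints, kernel_size=1)

    def forward(self, tokens, h, w):                 # tokens: (B, h*w, 2C)
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)   # back to (B, 2C, H/8, W/8)
        return self.to_heatmaps(feat)                        # (B, 78, H/8, W/8)

# usage sketch continuing the previous example (28x28 token grid)
heatmaps = HeatmapHead()(torch.rand(1, 784, 192), 28, 28)
print(heatmaps.shape)   # torch.Size([1, 78, 28, 28])
```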
(6) result output: non-maximum suppression is applied to the key point heat maps generated in step (5) to obtain the key point coordinates, and the key point positions are marked in the original image, realizing full-scene vehicle attitude estimation; a sketch of the coordinate extraction is given below.
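A minimal sketch of turning heat maps into coordinates (simple per-channel peak picking is used here as a stand-in for non-maximum suppression; the ×8 rescaling assumes the (H/8)×(W/8) heat map size from the previous steps):

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor, stride: int = 8):
    """Take the peak of each of the K heat maps and map it back to image coordinates.

    heatmaps: (K, h, w) tensor of predicted key point heat maps.
    Returns (K, 2) pixel coordinates (x, y) and (K,) peak scores.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.view(k, -1)
    scores, idx = flat.max(dim=1)                                    # peak response per key point
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    coords = torch.stack([xs, ys], dim=1).float() * stride           # back to original image scale
    return coords, scores

# usage sketch on the 78-channel heat maps from the previous step
coords, scores = heatmaps_to_keypoints(torch.rand(78, 28, 28))
print(coords.shape)   # torch.Size([78, 2])
```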
Structures, algorithms, and computational processes not described in detail herein are all common in the art.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; rather, the scope of the invention is defined by the appended claims.

Claims (5)

1. A full-scene vehicle attitude estimation method, characterized by comprising the following steps:
(1) data set construction:
selecting vehicle images from open source data sets, collecting images of various vehicles from traffic monitoring and parking lot scenes, constructing a vehicle data set, and dividing the vehicle data set into a training set, a verification set and a test set;
(2) image segmentation: each image in the vehicle data set is segmented into non-overlapping image slices by a slice segmentation module; each image slice is regarded as a token and is characterized by the concatenated raw RGB values of the input image;
(3) hierarchical feature extraction by the backbone network: the image slice tokens obtained in step (2) first pass through the linear embedding layer of the first stage of the backbone network, which maps the feature dimension to an arbitrary dimension C; hierarchical feature extraction is then carried out through the two Swin Transformer blocks and the second stage to obtain a feature map;
(4) position coding: the feature map obtained in step (3) is input into the position coding layer for position coding; the feature map first passes through a 1×1 convolution or a linear layer and is flattened into (H/8)×(W/8) vectors of dimension 2C, and these vectors pass through four attention layers and a feed-forward neural network to output feature vectors, where H and W are the height and width of the image respectively;
(5) key point heat map generation: the feature vectors obtained in step (4) are reshaped back into an (H/8)×(W/8)×2C feature map, the channel dimension is then reduced from 2C to K, and the predicted K key point heat maps are generated, where K is the number of key points of each vehicle and takes the value 78;
(6) result output: the key point heat maps are converted into key point coordinates through non-maximum suppression, and the key point positions are marked in the original image, realizing full-scene vehicle attitude estimation.
2. The full-scene vehicle attitude estimation method according to claim 1, wherein 78 key points are defined for each vehicle in the vehicle images in step (1), and the bounding box and category of the vehicle are labeled.
3. The full-scene vehicle attitude estimation method according to claim 2, wherein the size of each image slice in step (2) is 4×4 pixels, with a feature dimension of 4×4×3 = 48.
4. The full-scene vehicle attitude estimation method according to claim 3, wherein the backbone network in step (3) adopts a Swin Transformer backbone network, the first stage comprises a linear embedding layer and two Swin Transformer blocks, and the number of tokens in the two Swin Transformer blocks is (H/4)×(W/4), where H and W are the height and width of the input image; the second stage comprises a linear merging layer and two Swin Transformer blocks, the image slice tokens after the first-stage feature extraction are reduced by the linear merging layer, which concatenates the features of each group of 2×2 adjacent blocks and applies a linear layer on the resulting 4C-dimensional features, so that the number of tokens is reduced by a factor of 4 and the output dimension becomes 2C; feature transformation is then carried out through the two Swin Transformer blocks, and the resolution of the obtained feature map is (H/8)×(W/8), realizing hierarchical feature extraction.
5. The full-scene vehicle attitude estimation method according to claim 4, wherein the position coding layer in step (4) adopts an encoder with the standard Transformer architecture; the position coding layer regards the feature map as dynamic weights determined by the specific image content, re-weights the information flow in forward propagation, obtains key point dependency items by calculating the scores of the last attention layer, and predicts occluded key points through the key point dependencies, where a higher attention score at a position in the image indicates a larger contribution to predicting the key point.
CN202210780438.0A 2022-07-05 2022-07-05 Full-scene vehicle attitude estimation method Active CN114842085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210780438.0A CN114842085B (en) 2022-07-05 2022-07-05 Full-scene vehicle attitude estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210780438.0A CN114842085B (en) 2022-07-05 2022-07-05 Full-scene vehicle attitude estimation method

Publications (2)

Publication Number Publication Date
CN114842085A true CN114842085A (en) 2022-08-02
CN114842085B CN114842085B (en) 2022-09-16

Family

ID=82574897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210780438.0A Active CN114842085B (en) 2022-07-05 2022-07-05 Full-scene vehicle attitude estimation method

Country Status (1)

Country Link
CN (1) CN114842085B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200020117A1 (en) * 2018-07-16 2020-01-16 Ford Global Technologies, Llc Pose estimation
CN109598339A (en) * 2018-12-07 2019-04-09 电子科技大学 A kind of vehicle attitude detection method based on grid convolutional network
CN113591936A (en) * 2021-07-09 2021-11-02 厦门市美亚柏科信息股份有限公司 Vehicle attitude estimation method, terminal device and storage medium
CN113792669A (en) * 2021-09-16 2021-12-14 大连理工大学 Pedestrian re-identification baseline method based on hierarchical self-attention network
CN114663917A (en) * 2022-03-14 2022-06-24 清华大学 Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIHONG WU等: "DST3D: DLA-Swin Transformer for Single-Stage Monocular 3D Object Detection", 《2022 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV)》 *
ZINAN XIONG 等: "SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION", 《ARXIV:2201.07384V1》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272992A (en) * 2022-09-30 2022-11-01 松立控股集团股份有限公司 Vehicle attitude estimation method
CN116758341A (en) * 2023-05-31 2023-09-15 北京长木谷医疗科技股份有限公司 GPT-based hip joint lesion intelligent diagnosis method, device and equipment
CN116758341B (en) * 2023-05-31 2024-03-19 北京长木谷医疗科技股份有限公司 GPT-based hip joint lesion intelligent diagnosis method, device and equipment
CN117352120A (en) * 2023-06-05 2024-01-05 北京长木谷医疗科技股份有限公司 GPT-based intelligent self-generation method, device and equipment for knee joint lesion diagnosis
CN116740714A (en) * 2023-06-12 2023-09-12 北京长木谷医疗科技股份有限公司 Intelligent self-labeling method and device for hip joint diseases based on unsupervised learning
CN116740714B (en) * 2023-06-12 2024-02-09 北京长木谷医疗科技股份有限公司 Intelligent self-labeling method and device for hip joint diseases based on unsupervised learning
CN116894973A (en) * 2023-07-06 2023-10-17 北京长木谷医疗科技股份有限公司 Integrated learning-based intelligent self-labeling method and device for hip joint lesions
CN116894973B (en) * 2023-07-06 2024-05-03 北京长木谷医疗科技股份有限公司 Integrated learning-based intelligent self-labeling method and device for hip joint lesions

Also Published As

Publication number Publication date
CN114842085B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN114842085B (en) Full-scene vehicle attitude estimation method
CN108399419B (en) Method for recognizing Chinese text in natural scene image based on two-dimensional recursive network
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
CN111401436B (en) Streetscape image segmentation method fusing network and two-channel attention mechanism
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN111126359A (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN113486726A (en) Rail transit obstacle detection method based on improved convolutional neural network
CN113743269B (en) Method for recognizing human body gesture of video in lightweight manner
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112801027A (en) Vehicle target detection method based on event camera
CN110688905A (en) Three-dimensional object detection and tracking method based on key frame
CN112990065A (en) Optimized YOLOv5 model-based vehicle classification detection method
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN112163447B (en) Multi-task real-time gesture detection and recognition method based on Attention and Squeezenet
CN111696038A (en) Image super-resolution method, device, equipment and computer-readable storage medium
CN115588126A (en) GAM, CARAFE and SnIoU fused vehicle target detection method
CN111881914B (en) License plate character segmentation method and system based on self-learning threshold
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
Yu et al. Intelligent corner synthesis via cycle-consistent generative adversarial networks for efficient validation of autonomous driving systems
CN112581423A (en) Neural network-based rapid detection method for automobile surface defects
CN117037119A (en) Road target detection method and system based on improved YOLOv8
CN111626298B (en) Real-time image semantic segmentation device and segmentation method
CN114187569A (en) Real-time target detection method integrating Pearson coefficient matrix and attention
CN114693951A (en) RGB-D significance target detection method based on global context information exploration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant