CN115965969A

CN115965969A - Bird's-eye view semantic segmentation method, device, equipment and medium based on geometric prior

Info

Publication number: CN115965969A
Application number: CN202310041741.3A
Authority: CN
Inventors: 赫然; 黄怀波; 樊齐航; 周晓强
Original assignee: Institute of Automation, Chinese Academy of Sciences
Current assignee: Institute of Automation, Chinese Academy of Sciences
Priority date: 2023-01-11
Filing date: 2023-01-11
Publication date: 2023-04-14

Abstract

The invention provides a bird's-eye view semantic segmentation method, device, equipment and medium. The method comprises: acquiring an image to be processed and bird's-eye view query features; inputting the image to be processed into a feature extractor to obtain image features output by the feature extractor; inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model; and performing semantic segmentation on the image to be processed based on the bird's-eye view features. The feature extraction model comprises a self-attention module, which performs cross-view self-attention conversion on the tokens of the image to be processed and the bird's-eye view query features. The method, device, electronic equipment and storage medium provided by the invention can reduce the computational complexity of the self-attention mechanism, thereby reducing the complexity of image feature extraction and improving the accuracy and reliability of subsequent semantic segmentation.

Description

Bird's-eye view semantic segmentation method, device, equipment and medium based on geometric prior

Technical Field

The invention relates to the technical field of autonomous driving perception, and in particular to a geometric prior-based bird's-eye view semantic segmentation method, device, equipment and medium.

Background

The ability to perceive the surroundings is essential for autonomous driving. In particular, autonomous vehicles need capabilities such as 3D object detection and map segmentation. In recent years, the Bird's-Eye View (BEV) has attracted the attention of many researchers as a type of map that facilitates both a vehicle's perception of its surroundings and the planning of downstream tasks, and much work has been directed at perceiving objects in the bird's-eye view.

In the prior art, bird's-eye view semantic segmentation methods mainly include methods based on geometric projection and methods based on the Transformer.

In the Transformer-based bird's-eye view semantic segmentation method, the attention mechanism is global, and the computational complexity is proportional to the number of input views, the resolution of the feature map and the resolution of the bird's-eye view query features. For scenes requiring a high-resolution bird's-eye view, the computational overhead increases greatly and exceeds the upper limit of the computational load of many devices.

Disclosure of Invention

The invention provides a geometric prior-based bird's-eye view semantic segmentation method, device, equipment and medium, which address the defect in the prior art that, for scenes requiring a high-resolution bird's-eye view, the computational overhead of semantic segmentation increases greatly and exceeds the upper limit of the computational load of many devices.

The invention provides a geometric prior-based bird's-eye view semantic segmentation method, which comprises the following steps:

acquiring an image to be processed and bird's-eye view query features;

inputting the image to be processed into a feature extractor to obtain the image features output by the feature extractor;

inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model;

performing semantic segmentation on the image to be processed based on the bird's-eye view features;

wherein the feature extraction model comprises a self-attention module, and the self-attention module is used for performing cross-view self-attention conversion on the tokens of the image to be processed and the bird's-eye view query features.

According to the geometric prior-based bird's-eye view semantic segmentation method provided by the invention, the feature extraction model comprises a plurality of cascaded feature extraction modules;

the step of inputting the image features and the bird's-eye view query features into a feature extraction model to obtain the bird's-eye view features output by the feature extraction model comprises:

inputting a previous token of the image to be processed and previous bird's-eye view query features into a current feature extraction module to obtain the current bird's-eye view query features output by the current feature extraction module, wherein the previous bird's-eye view query features are output by the feature extraction module before the current feature extraction module;

and taking the bird's-eye view query features output by the tail-most feature extraction module as the bird's-eye view features.

According to the geometric prior-based bird's-eye view semantic segmentation method provided by the invention, the step of inputting the previous token of the image to be processed and the previous bird's-eye view query features into the current feature extraction module to obtain the current bird's-eye view query features output by the current feature extraction module comprises:

inputting the previous token of the image to be processed and the previous bird's-eye view query features into a self-attention module of the current feature extraction module, sampling the bird's-eye view query features by the self-attention module to obtain sampled bird's-eye view query features, and then performing inverse sampling on the sampled bird's-eye view query features to obtain inverse-sampled bird's-eye view query features output by the self-attention module;

and inputting the inverse-sampled bird's-eye view query features into a feed-forward propagation network of the current feature extraction module to obtain the current bird's-eye view query features output by the feed-forward propagation network.

According to the bird's-eye view semantic segmentation method based on the geometric prior, the self-attention module comprises an image block self-attention module, an image self-attention module and a scene self-attention module which are cascaded, the image block self-attention module is used for carrying out cross-view self-attention conversion on an image block of the bird's-eye view query feature to obtain a first bird's-eye view query feature, the image self-attention module is used for carrying out cross-view self-attention conversion on the first bird's-eye view query feature to obtain a second bird's-eye view query feature, and the scene self-attention module is used for carrying out cross-view self-attention conversion on the second bird's-eye view query feature.

According to the bird's-eye view semantic segmentation method based on the geometric prior, the feature extractor comprises a first convolution layer, a first activation layer and a first normalization layer which are connected in sequence, and the feature extractor is used for outputting multi-scale image features.

According to the geometric prior-based bird's-eye view semantic segmentation method provided by the invention, the step of performing semantic segmentation on the image to be processed based on the bird's-eye view features comprises:

inputting the bird's-eye view features into a decoder to obtain the semantic segmentation result of the image to be processed output by the decoder, wherein the decoder comprises a second convolution layer, a second normalization layer and a second activation layer which are connected in sequence.

According to the geometric prior-based bird's-eye view semantic segmentation method provided by the invention, the step of inputting the image features and the bird's-eye view query features into a feature extraction model to obtain the bird's-eye view features output by the feature extraction model comprises:

acquiring an intrinsic matrix and an extrinsic matrix of a camera, wherein the intrinsic matrix and the extrinsic matrix of the camera serve as the geometric prior;

and obtaining the bird's-eye view features output by the feature extraction model based on the intrinsic matrix and the extrinsic matrix of the camera, the image features and the bird's-eye view query features.

The invention also provides a geometric prior-based bird's-eye view semantic segmentation device, which comprises:

an acquisition unit, used for acquiring the image to be processed and the bird's-eye view query features;

an image feature extraction unit, used for inputting the image to be processed into a feature extractor to obtain image features output by the feature extractor;

a bird's-eye view feature extraction unit, used for inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model;

a semantic segmentation unit, used for performing semantic segmentation on the image to be processed based on the bird's-eye view features.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the bird's-eye view semantic segmentation method based on geometric prior as mentioned in any one of the above.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for geometric prior-based bird's eye view semantic segmentation as described in any of the above.

The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the geometric prior-based bird's-eye view semantic segmentation method as described in any one of the above.

According to the bird's-eye view semantic segmentation method, device, equipment and medium based on geometric prior, the self-attention module in the feature extraction model performs cross-view self-attention conversion on the tokens of the images to be processed and the bird's-eye view query features, so that the calculation complexity of a self-attention mechanism can be reduced, the complexity of image feature extraction is further reduced, the accuracy and reliability of the obtained bird's-eye view features are improved, and the accuracy and reliability of subsequent semantic segmentation are further improved.

Drawings

In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a first schematic flow chart of the geometric prior-based bird's-eye view semantic segmentation method provided by the present invention;

FIG. 2 is a schematic flow chart illustrating sampling of bird's-eye view query features according to the present invention;

FIG. 3 is a schematic flow chart illustrating inverse sampling of a sampled bird's eye view query feature according to the present invention;

FIG. 4 is a schematic structural diagram of a self-attention module provided in the present invention;

FIG. 5 is a second schematic flowchart of the bird's eye view semantic segmentation method based on geometric prior provided by the present invention;

FIG. 6 is a schematic structural diagram of a bird's-eye view semantic segmentation device based on geometric prior provided by the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms first, second and the like in the description and in the claims of the present invention are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the objects identified as "first", "second", etc. are generally one class.

In the related art, the Bird's-Eye View (BEV) has in recent years attracted the attention of many researchers as a type of map that facilitates both a vehicle's perception of its surroundings and the planning of downstream tasks, and much work has been directed at perceiving objects in the bird's-eye view.

In view of the high cost of radar equipment and the low resolution of point cloud data, many recent efforts have used inexpensive, high-resolution vehicle-mounted multi-view cameras to obtain bird's-eye views. The most direct method is to project the images from the multi-view cameras directly onto the bird's-eye view plane. Although this method is simple and effective, objects above the bird's-eye view plane are severely distorted because of its planar assumption. Therefore, many recent works first obtain a feature map of the bird's-eye view and then obtain the bird's-eye view from that feature map.

Bird's-eye view semantic segmentation methods mainly include methods based on geometric projection and methods based on the Transformer. Transformer-based methods can model long-range dependencies, but their global attention mechanism incurs a very large amount of computation, and they do not make effective use of the geometric prior provided by the camera parameter matrices.

Based on the above problem, the present invention provides a geometric prior-based bird's-eye view semantic segmentation method. FIG. 1 is a first schematic flow chart of the geometric prior-based bird's-eye view semantic segmentation method provided by the present invention. As shown in FIG. 1, the method includes:

Step 110, acquiring the image to be processed and the bird's-eye view query features.

Specifically, the image to be processed and the bird's-eye view query features may be acquired. The image to be processed is the image on which semantic segmentation is subsequently performed, and it may be acquired by an image acquisition device such as a vehicle-mounted camera, a mobile phone, a camera or a tablet. For example, the image to be processed may be multiple images captured from multiple viewing angles at the same vehicle position, where the viewing angles may be front, rear, front-left, rear-left, front-right and rear-right. Accordingly, the image to be processed may include a front view image, a rear view image, a front-left view image, a rear-left view image, a front-right view image and a rear-right view image, which is not specifically limited in this embodiment of the present invention.

The size of the image to be processed may be 224 × 480, etc. The bird's-eye view query features may be preset features and may be expressed as $Q \in \mathbb{R}^{d \times H \times W}$.
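As a minimal sketch of the shape convention above, the following standard-library Python builds a d × H × W query and rearranges it into H·W tokens of dimension d. The helper names, the random initialization and the small sizes are illustrative assumptions; the source only states that the query features may be preset.

```python
import random


def make_bev_query(d, H, W, seed=0):
    """Return a d x H x W nested list of small random values.

    This is only a stand-in for a preset query; the initialization
    scheme is an assumption made for illustration.
    """
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.02) for _ in range(W)]
             for _ in range(H)]
            for _ in range(d)]


def flatten_to_tokens(Q):
    """Rearrange a d x H x W query into H*W tokens, each of dimension d.

    Token index j corresponds to grid cell (y, x) with j = y * W + x.
    """
    d, H, W = len(Q), len(Q[0]), len(Q[0][0])
    return [[Q[c][y][x] for c in range(d)]
            for y in range(H)
            for x in range(W)]


Q = make_bev_query(d=8, H=4, W=6)
tokens = flatten_to_tokens(Q)
print(len(tokens), len(tokens[0]))
```

Flattening in row-major order keeps a fixed mapping between a token index and its position on the bird's-eye view grid, which any later sampling step can rely on.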

Step 120, inputting the image to be processed into a feature extractor to obtain the image features output by the feature extractor.

Specifically, after the image to be processed is acquired, it may be input to the feature extractor to obtain the image features output by the feature extractor. The feature extractor may include a first convolution layer, a first activation layer and a first normalization layer connected in sequence. The first convolution layer may use a multilayer Convolutional Neural Network (CNN) with a cascade structure, a Deep Neural Network (DNN), a combined structure of CNN and DNN, and the like, which is not specifically limited in this embodiment of the present invention.

The first activation layer may use a GELU (Gaussian Error Linear Unit) activation function, a Sigmoid activation function or a ReLU (Rectified Linear Unit) activation function, and the first normalization layer may be LN (Layer Normalization), BN (Batch Normalization), IN (Instance Normalization) and the like, which is not specifically limited in this embodiment of the present invention.
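For reference, the activation functions named above and a simplified layer normalization can be sketched as follows. This is an illustrative sketch only: the GELU uses the exact erf form, and the normalization omits the learnable scale and shift that practical implementations usually include.

```python
import math


def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    """Gaussian Error Linear Unit, exact form: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(values, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance.

    Learnable scale and shift parameters are omitted in this sketch.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


normalized = layer_norm([1.0, 2.0, 3.0])
print(relu(-1.0), sigmoid(0.0), gelu(1.0), normalized)
```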

The feature extractor here may also be an EfficientNet-B4 model, and the like, which is not particularly limited in this embodiment of the present invention.

Here, the image features output by the feature extractor may be multi-scale, for example, 1/4, 1/8, 1/16, 1/32, and the like, which is not specifically limited in this embodiment of the present invention.
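To make the scales concrete, the short sketch below computes the spatial size of each feature map for the 224 × 480 input mentioned earlier. It assumes that a scale of 1/s means both height and width are divided by s with floor division, which is an assumption rather than something the source states.

```python
def multiscale_sizes(height, width, strides=(4, 8, 16, 32)):
    """Map each downsampling stride to the (height, width) of its feature map.

    Floor division is an assumption; real backbones may pad differently.
    """
    return {s: (height // s, width // s) for s in strides}


sizes = multiscale_sizes(224, 480)
for stride, (h, w) in sizes.items():
    print(f"1/{stride}: {h} x {w}")
```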

Step 130, inputting the image features and the bird's-eye view query features into a feature extraction model to obtain the bird's-eye view features output by the feature extraction model; the feature extraction model comprises a self-attention module, and the self-attention module is used for performing cross-view self-attention conversion on the tokens of the image to be processed and the bird's-eye view query features.

Specifically, after obtaining the image features, the image features and the bird's-eye view query features may be input to the feature extraction model, and the bird's-eye view features output by the feature extraction model may be obtained.

The feature extraction model may include a plurality of cascaded feature extraction modules, each of which may include a self-attention module. A feature extraction module may be a Transformer model, an LSTM (Long Short-Term Memory) network, an RNN (Recurrent Neural Network), or the like, which is not specifically limited in this embodiment of the present invention.

The self-attention module is used for performing cross-view self-attention conversion on the tokens of the image to be processed and the bird's-eye view query features, and the self-attention module may be a cross-attention module.

The cross-view self-attention conversion is to calculate the similarity between the token of the image to be processed and the bird's-eye view query feature, construct an attention matrix, and change the bird's-eye view query feature to obtain the feature optimized by the self-attention mechanism.
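The computation described above (similarity, attention matrix, updated query) can be sketched as a plain scaled dot-product cross-attention. This sketch is not the patented module: it uses the image tokens directly as keys and values, has no learned projections, no multiple heads and no residual connection, and all function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Update each query as an attention-weighted sum of the tokens.

    queries: list of N_q vectors of dimension d (BEV query features).
    tokens:  list of N_t vectors of dimension d (image tokens), used
             here as both keys and values (an assumption of this sketch).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q in queries:
        # Similarity between this query and every token.
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        # One row of the attention matrix.
        weights = softmax(scores)
        # Weighted sum of token values replaces the query.
        updated.append([sum(w * t[k] for w, t in zip(weights, tokens))
                        for k in range(d)])
    return updated


out = cross_attention([[1.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

In this toy call the query is aligned with the first token, so almost all of the attention weight falls on it and the updated query is close to that token.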

The bird's-eye view feature here reflects feature information of the bird's-eye view layer.

Here, when the bird's-eye view features are output, the intrinsic matrix and the extrinsic matrix of the camera are also taken into account, and they serve as the geometric prior.

It can be understood that in the Transformer-based bird's-eye view semantic segmentation method, the attention mechanism used is global, and the computational complexity is proportional to the number of input views, the resolution of the feature map and the resolution of the bird's-eye view query features. Performing cross-view self-attention conversion on the tokens of the image to be processed and the bird's-eye view query features can reduce the computational complexity of the self-attention mechanism, thereby reducing the complexity of image feature extraction, improving the accuracy and reliability of the obtained bird's-eye view features, and in turn improving the accuracy and reliability of subsequent semantic segmentation.

Step 140, performing semantic segmentation on the image to be processed based on the bird's-eye view features.
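The proportionality stated above can be illustrated by counting query-token pairs. The sketch below compares global attention with a purely hypothetical grouping in which the bird's-eye view queries are split evenly across views and each group attends only to the tokens of its own view. The grouping, the 120 × 120 query resolution and the function names are assumptions for illustration and do not describe the sampling scheme of the invention.

```python
def global_attention_pairs(n_views, feat_h, feat_w, bev_h, bev_w):
    """Pairs when every BEV query attends to every token of every view."""
    return (bev_h * bev_w) * (n_views * feat_h * feat_w)


def grouped_attention_pairs(groups):
    """Pairs when attention is restricted to (n_queries, n_tokens) groups."""
    return sum(n_q * n_t for n_q, n_t in groups)


n_views, feat_h, feat_w = 6, 56, 120   # six views, 1/4-scale features of 224 x 480
bev_h = bev_w = 120                    # hypothetical BEV query resolution

global_cost = global_attention_pairs(n_views, feat_h, feat_w, bev_h, bev_w)

# Hypothetical grouping: an equal share of queries per view.
queries_per_view = (bev_h * bev_w) // n_views
grouped_cost = grouped_attention_pairs(
    [(queries_per_view, feat_h * feat_w)] * n_views)

print(global_cost, grouped_cost, global_cost / grouped_cost)
```

Under this toy grouping the pair count drops by a factor equal to the number of views; finer groups, such as per image block, would reduce it further.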

Specifically, after the bird's-eye view features are obtained, the image to be processed may be semantically segmented based on the bird's-eye view features.

It can be understood that the bird's-eye view features output by the feature extraction model are features after cross-view self-attention conversion, and semantic segmentation is performed based on the bird's-eye view features obtained thereby, so that the accuracy and reliability of the semantic segmentation are further improved.

According to the method provided by the embodiment of the invention, the self-attention module in the feature extraction model performs cross-view self-attention conversion on the token of the image to be processed and the bird's-eye view query feature, so that the calculation complexity of the self-attention mechanism can be reduced, the complexity of image feature extraction is further reduced, the accuracy and reliability of the obtained bird's-eye view feature are improved, and the accuracy and reliability of subsequent semantic segmentation are further improved.

Based on the above embodiment, the feature extraction model includes a plurality of cascaded feature extraction modules;

Step 130 comprises:

Step 131, inputting a previous token of the image to be processed and previous bird's-eye view query features into a current feature extraction module to obtain the current bird's-eye view query features output by the current feature extraction module, wherein the previous bird's-eye view query features are output by the feature extraction module before the current feature extraction module;

Step 132, taking the bird's-eye view query features output by the tail-most feature extraction module as the bird's-eye view features.

Specifically, the feature extraction model includes a plurality of cascaded feature extraction modules, where a feature extraction module may be a Transformer model, an LSTM model, an RNN model, and the like, which is not specifically limited in this embodiment of the present invention.

Here, the spatial resolution of the cascaded feature extraction modules may be arranged from high to low.

In the process of extracting the bird's-eye view features, first, the feature extraction module arranged first in the feature extraction model may be used as the current feature extraction module, and the process of extracting the bird's-eye view features may be performed:

the previous token of the image to be processed and the previous bird's-eye view query features may be input to the current feature extraction module to obtain the current bird's-eye view query features output by the current feature extraction module, where the previous bird's-eye view query features are output by the feature extraction module before the current feature extraction module, and the previous token may be an image feature output by a convolutional encoder, for example the 1/4-scale image feature output by the convolutional encoder.

After the current bird's-eye view query features output by the current feature extraction module are obtained, the next feature extraction module, that is, the feature extraction module arranged second, may be used as the current feature extraction module, and the process of extracting the bird's-eye view features is repeated:

that is, after the current bird's-eye view query features output by the first feature extraction module are obtained, they may be input, together with the current token of the image to be processed, into the current feature extraction module to obtain the current bird's-eye view query features output by the current feature extraction module, where the current token may be an image feature output by the convolutional encoder, for example the 1/8-scale image feature output by the convolutional encoder.

The process of extracting the bird's-eye view features by using the feature extraction module arranged at the third position as the current feature extraction module is similar to the process of extracting the bird's-eye view features by using the feature extraction module arranged at the second position as the current feature extraction module, and details are not repeated here.

By analogy, the process continues until the current feature extraction module is the tail-most feature extraction module, that is, the last feature extraction module in the feature extraction model.

After the current bird's-eye view query feature is extracted by the tailmost feature extraction module, the current bird's-eye view query feature extracted by the tailmost feature extraction module may be used as the bird's-eye view feature.

According to the method provided by the embodiment of the invention, the feature extraction model comprises a plurality of cascaded feature extraction modules, and the acquired bird's-eye view features are subjected to feature extraction operations successively executed by the plurality of feature extraction modules, so that the accuracy and reliability of the bird's-eye view features are improved, and the accuracy and reliability of semantic segmentation on the image to be processed based on the bird's-eye view features are further improved.
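The cascade described above can be sketched as a simple loop in which each module receives the token set for its scale together with the query produced by the previous module. The function names and the toy module are hypothetical; a real module would contain the self-attention and feed-forward steps.

```python
def run_cascade(modules, tokens_per_scale, initial_query):
    """Pass the BEV query through cascaded modules, one token set per module.

    modules:          list of callables module(tokens, query) -> new query.
    tokens_per_scale: list of token sets, e.g. 1/4-scale first, then 1/8.
    Returns the query output by the tail-most module.
    """
    if len(modules) != len(tokens_per_scale):
        raise ValueError("need exactly one token set per module")
    query = initial_query
    for module, tokens in zip(modules, tokens_per_scale):
        query = module(tokens, query)
    return query


def toy_module(tokens, query):
    """Toy stand-in: shift every query value by the mean of the tokens."""
    mean = sum(tokens) / len(tokens)
    return [q + mean for q in query]


bev_features = run_cascade(
    [toy_module, toy_module, toy_module],
    [[1.0, 3.0], [2.0], [4.0, 6.0]],
    [0.0, 1.0],
)
print(bev_features)
```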

Based on the above embodiment, step 131 includes:

Step 1311, inputting the previous token of the image to be processed and the previous bird's-eye view query features into a self-attention module of the current feature extraction module, sampling the bird's-eye view query features by the self-attention module to obtain sampled bird's-eye view query features, and performing inverse sampling on the sampled bird's-eye view query features to obtain inverse-sampled bird's-eye view query features output by the self-attention module;

Step 1312, inputting the inverse-sampled bird's-eye view query features into a feed-forward propagation network of the current feature extraction module to obtain the current bird's-eye view query features output by the feed-forward propagation network.

Specifically, FIG. 2 is a schematic flow chart of sampling the bird's-eye view query features provided by the present invention. As shown in FIG. 2, the previous token of the image to be processed and the previous bird's-eye view query features may be input into the self-attention module of the current feature extraction module, and the self-attention module samples the bird's-eye view query features to obtain the sampled bird's-eye view query features. In FIG. 2, Q11 denotes the first token of the bird's-eye view query features Q1 corresponding to the 1st view, and so on up to Q14, which denotes the fourth token of Q1; Qi1 denotes the first token of the bird's-eye view query features Qi corresponding to the i-th view, and so on up to Qi4, which denotes the fourth token of Qi.

The self-Attention module may be a Patch Attention module (PA), an Image Attention module (IA), or a Scene Attention module (SA), which is not limited in this embodiment of the present invention.

For example, the bird's-eye view query features may be projected onto the imaging plane according to the intrinsic matrix, the extrinsic matrix and the anchor coordinates of the bird's-eye view query features. In this projection, I denotes the image coordinate system, W denotes the world coordinate system, x denotes the coordinates themselves, $K_i$ denotes the intrinsic matrix and $R_i$ denotes the extrinsic matrix. According to the coordinate values obtained by projection, the coordinates closest to the center point of the projection plane are sampled, and the sampled bird's-eye view query features are obtained from these coordinates, and the formula is as follows:

$Q_i = \{\, q_j \mid j \in \mathrm{Idx}_i \,\}$

where $Q_i$ denotes the bird's-eye view query features corresponding to the i-th view of the image to be processed, obtained when the self-attention module samples the bird's-eye view query features, and $c_i$ denotes the coordinates of the center point of the i-th view of the image to be processed.
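The projection and sampling step can be sketched as follows under explicit assumptions: a pinhole camera whose extrinsics are given as a rotation R and a translation t, so that a world point x maps to K(Rx + t) followed by division by depth, and a hypothetical parameter k giving the number of anchors kept per view. The source names only the matrices $K_i$ and $R_i$ and does not reproduce the exact projection formula or state how many coordinates are sampled, so this is not the formula of the invention.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def project(K, R, t, x_world):
    """Project a 3D world point to pixel coordinates (assumed pinhole model).

    Returns None when the point lies behind the camera.
    """
    cam = [c + ti for c, ti in zip(matvec(R, x_world), t)]
    if cam[2] <= 0:
        return None
    pix = matvec(K, cam)
    return (pix[0] / pix[2], pix[1] / pix[2])


def sample_indices(anchors, K, R, t, center, k):
    """Indices of the k anchors whose projections are nearest to `center`.

    `k` is a hypothetical parameter introduced for this sketch.
    """
    scored = []
    for j, x in enumerate(anchors):
        uv = project(K, R, t, x)
        if uv is not None:
            scored.append((math.dist(uv, center), j))
    scored.sort()
    return sorted(j for _, j in scored[:k])


K = [[100, 0, 240], [0, 100, 112], [0, 0, 1]]   # toy intrinsics for a 224 x 480 image
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]           # identity rotation
t = [0, 0, 0]
anchors = [(0, 0, 10), (5, 0, 10), (0, 0, -10), (1, 0, 10)]

idx_i = sample_indices(anchors, K, R, t, center=(240, 112), k=2)
print(idx_i)
```

The returned index list plays the role of $\mathrm{Idx}_i$ in the formula above: gathering the query tokens at those indices yields the sampled query features for view i.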

In addition, the sampled bird's-eye view query features can be sampled once more, and the formula is as follows:

$Q_{ip} = \{\, q_j \mid j \in \mathrm{Idx}_{ip} \,\}$

where $Q_{ip}$ denotes the sampled bird's-eye view query features corresponding to the p-th image block of the i-th view of the image to be processed, that is, to the previous token of the image to be processed, obtained when the self-attention module samples the sampled bird's-eye view query features, and $c_{ip}$ denotes the coordinates of the center point of the p-th image block of the i-th view of the image to be processed, that is, the coordinates of the center point of the previous token of the image to be processed.

FIG. 3 is a schematic flow chart of inverse sampling of the sampled bird's-eye view query features provided by the present invention. As shown in FIG. 3, after the sampled bird's-eye view query features are obtained, inverse sampling may be performed on them to obtain the inverse-sampled bird's-eye view query features output by the self-attention module, where the inverse-sampled bird's-eye view query features reflect feature information at the bird's-eye view level.

The self-attention module may be an image self-attention module, a scene self-attention module, or both an image self-attention module and a scene self-attention module, which is not specifically limited in this embodiment of the present invention.

That is, the updated sampled bird's-eye view query features may be propagated to the entire bird's-eye view query features by using the image self-attention module and the scene self-attention module in the self-attention module, so as to obtain the inverse sampled bird's-eye view query features output by the self-attention module.
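One simple way to picture inverse sampling is as a scatter operation that writes the updated sampled features back to their positions in the full query. The sketch below is an assumption-laden illustration: positions covered by several groups are averaged and uncovered positions keep their old value, neither of which is stated in the source, which instead propagates updates through the image and scene self-attention modules.

```python
def inverse_sample(query, groups):
    """Write updated sampled features back into the full BEV query.

    query:  list of N vectors (the full BEV query before the update).
    groups: list of (indices, updated_vectors) pairs, one per view or block.
    Overlapping positions are averaged (an assumption of this sketch);
    positions that were never sampled keep their original value.
    """
    d = len(query[0])
    sums = [[0.0] * d for _ in query]
    counts = [0] * len(query)
    for indices, updated in groups:
        for j, vec in zip(indices, updated):
            counts[j] += 1
            for k in range(d):
                sums[j][k] += vec[k]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(query[j])
        for j in range(len(query))
    ]


full_query = [[0.0], [1.0], [2.0], [3.0]]
groups = [
    ([0, 1], [[10.0], [20.0]]),   # updated features sampled for one view
    ([1, 2], [[40.0], [50.0]]),   # a second view overlapping at index 1
]
restored = inverse_sample(full_query, groups)
print(restored)
```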

After the inverse sampling bird's-eye view query feature is obtained, the inverse sampling bird's-eye view query feature may be input to a Feed Forward Network (FFN) of the current feature extraction module, so as to obtain the current bird's-eye view query feature output by the Feed Forward Network.
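A feed-forward network of the kind commonly used in Transformer blocks can be sketched as two linear layers with an activation in between. The ReLU activation, the residual connection and the toy weights below are assumptions for illustration; the source only names the Feed Forward Network without specifying its internals.

```python
def linear(x, W, b):
    """Affine map: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def feed_forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network applied to one query token.

    Uses ReLU and adds a residual connection (both are assumptions).
    """
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    out = linear(hidden, W2, b2)
    return [xi + oi for xi, oi in zip(x, out)]


W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # expand 2 -> 3 (toy weights)
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # project 3 -> 2 (toy weights)
b2 = [0.5, 0.5]

token = [1.0, -2.0]
updated_token = feed_forward(token, W1, b1, W2, b2)
print(updated_token)
```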

Based on the foregoing embodiment, fig. 4 is a schematic structural diagram of a self-attention module provided by the present invention, and as shown in fig. 4, the self-attention module includes a cascaded image block self-attention module, an image self-attention module and a scene self-attention module, the image block self-attention module is configured to perform cross-view self-attention conversion on an image block of the bird's-eye view query feature to obtain a first bird's-eye view query feature, the image self-attention module is configured to perform cross-view self-attention conversion on the first bird's-eye view query feature to obtain a second bird's-eye view query feature, and the scene self-attention module is configured to perform cross-view self-attention conversion on the second bird's-eye view query feature.

Specifically, the self-attention module may include a cascaded image block self-attention module, image self-attention module, and scene self-attention module. The image block self-attention module performs cross-view self-attention conversion on an image block of the bird's-eye view query features to obtain the first bird's-eye view query features; the image self-attention module performs cross-view self-attention conversion on the first bird's-eye view query features to obtain the second bird's-eye view query features; and the scene self-attention module performs cross-view self-attention conversion on the second bird's-eye view query features to obtain the bird's-eye view features.

Here, a feed-forward network is further connected after the image block self-attention module, and the first bird's-eye view query features are output by this feed-forward network. A feed-forward network is likewise connected after the image self-attention module, and the second bird's-eye view query features are output by this feed-forward network. A feed-forward network is also connected after the scene self-attention module, and the bird's-eye view features are output by this feed-forward network.

The image block self-attention module, the image self-attention module, and the scene self-attention module may each be a cross-attention module or the like, which is not particularly limited in this embodiment of the present invention.
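The cascade of three attention stages, each followed by a feed-forward network, can be sketched as below. This is only an illustration under stated assumptions: each stage is modeled as plain single-head dot-product cross-attention without learned projections or residual connections, the feed-forward network is reduced to a ReLU, and the three token groups (`block_tokens`, `image_tokens`, `scene_tokens`) are hypothetical inputs, since the text does not specify how tokens are grouped at each level.

```python
import math


def cross_attention(queries, tokens):
    """Single-head dot-product attention: each query attends over all tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        peak = max(scores)                                # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) / total
                    for d in range(dim)])
    return out


def feed_forward(features):
    """Toy stand-in for the FFN that follows each attention stage (ReLU only)."""
    return [[max(0.0, v) for v in f] for f in features]


def cascaded_self_attention(queries, block_tokens, image_tokens, scene_tokens):
    """Image block stage -> image stage -> scene stage, each followed by an FFN."""
    first = feed_forward(cross_attention(queries, block_tokens))    # first BEV query features
    second = feed_forward(cross_attention(first, image_tokens))     # second BEV query features
    return feed_forward(cross_attention(second, scene_tokens))      # BEV features


bev_features = cascaded_self_attention(
    queries=[[0.5, -0.5]],
    block_tokens=[[1.0, 2.0], [1.0, 2.0]],
    image_tokens=[[3.0, 1.0]],
    scene_tokens=[[0.0, 4.0], [0.0, 4.0]],
)
```

Because every token within a stage is identical in this toy input, each stage simply returns that token, which makes the data flow through the three stages easy to follow.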

Based on the above embodiment, the feature extractor includes a first convolution layer, a first activation layer, and a first normalization layer connected in sequence, and the feature extractor is configured to output multi-scale image features.

Specifically, the feature extractor includes a first convolution layer, a first activation layer, and a first normalization layer connected in sequence, where the first convolution layer may use a deep fully convolutional network (Deep-ConvNet), a fully convolutional network (FCN), or the like, which is not particularly limited in this embodiment of the present invention.

The first activation layer may use a GELU, Sigmoid, or ReLU activation function, and the first normalization layer may be layer normalization (LN), batch normalization (BN), instance normalization (IN), or the like, which is not specifically limited in this embodiment of the present invention.

The feature extractor is configured to output multi-scale image features, for example at scales of 1/4, 1/8, 1/16, and 1/32, which is not specifically limited in this embodiment of the present invention.

For example, the image to be processed x ∈ R^{6×3×H×W} is input into the feature extractor to obtain the image features at four scales output by the feature extractor, z_i ∈ R^{6×c×h×w}, i = 1, 2, 3, 4.
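To make the tensor shapes concrete, the following sketch computes the shape of each z_i for a given input size. The strides (4, 8, 16, 32) follow the example scales mentioned above; the channel counts and the 224×480 input size are hypothetical values chosen only for illustration, and `multi_scale_shapes` is not a name from the original.

```python
def multi_scale_shapes(num_cameras, height, width, channels, strides=(4, 8, 16, 32)):
    """Return the (cameras, c, h, w) shape of each feature map z_i."""
    return [(num_cameras, c, height // s, width // s)
            for c, s in zip(channels, strides)]


# Six camera images of size 224x480; channel counts are assumed, not from the source.
shapes = multi_scale_shapes(6, 224, 480, channels=(64, 128, 256, 512))
for i, shape in enumerate(shapes, start=1):
    print(f"z_{i}: {shape}")
```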

Based on the above embodiment, step 140 includes:

inputting the bird's-eye view features into a decoder, and outputting the semantic segmentation result of the image to be processed by the decoder, wherein the decoder includes a second convolution layer, a second normalization layer, and a second activation layer connected in sequence.

Specifically, the bird's-eye view features may be input into a decoder, and the decoder outputs the semantic segmentation result of the image to be processed. The decoder may be a convolutional decoder and may include a second convolution layer, a second normalization layer, and a second activation layer connected in sequence, where the second convolution layer may use a deep fully convolutional network, a fully convolutional network, or the like, which is not specifically limited in this embodiment of the present invention.

The second activation layer may use a GELU, Sigmoid, or ReLU activation function, and the second normalization layer may be layer normalization (LN), batch normalization (BN), instance normalization (IN), or the like, which is not specifically limited in this embodiment of the present invention.

Here, the second convolution layer may be the same as or different from the first convolution layer, the second activation layer may be the same as or different from the first activation layer, and the second normalization layer may be the same as or different from the first normalization layer, which is not specifically limited in this embodiment of the present invention.
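A decoder with the convolution, normalization, and activation order described above can be sketched per bird's-eye view cell as follows. This is a simplified illustration, assuming a 1×1 convolution (a per-cell linear map), layer normalization over channels, ReLU, and a final argmax to pick a class; the argmax step and all function names are assumptions rather than details stated in the text.

```python
import math


def conv1x1(cell, weights, bias):
    """1x1 convolution applied to one cell: a linear map over its channels."""
    return [sum(w * v for w, v in zip(row, cell)) + b for row, b in zip(weights, bias)]


def layer_norm(values, eps=1e-5):
    """Normalize one cell's channel values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    return [max(0.0, v) for v in values]


def decode(bev_features, weights, bias):
    """Return one class index per BEV cell (convolution -> normalization -> activation -> argmax)."""
    labels = []
    for row in bev_features:
        label_row = []
        for cell in row:
            scores = relu(layer_norm(conv1x1(cell, weights, bias)))
            label_row.append(max(range(len(scores)), key=scores.__getitem__))
        labels.append(label_row)
    return labels


# A 1x2 BEV grid with two channels per cell and two output classes.
bev = [[[1.0, 0.0], [0.0, 1.0]]]
labels = decode(bev, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```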

Based on the above embodiment, step 130 includes:

Step 310: acquiring the intrinsic matrix and the extrinsic matrix of the camera, wherein the intrinsic matrix and the extrinsic matrix of the camera serve as the geometric prior;

Step 320: obtaining the bird's-eye view features output by the feature extraction model based on the intrinsic matrix and the extrinsic matrix of the camera, the image features, and the bird's-eye view query features.

Specifically, the intrinsic matrix and the extrinsic matrix of the camera may be obtained, where the intrinsic matrix and the extrinsic matrix of the camera serve as the geometric prior.

The camera may be a vehicle-mounted camera in an autonomous driving perception application, which is not particularly limited in this embodiment of the present invention.

The intrinsic matrix of the camera reflects the properties of the camera itself. Each camera has a different intrinsic matrix, and its parameters can only be determined by calibration. The intrinsic matrix describes the relationship between points on an object and the corresponding image points.

The extrinsic matrix of the camera describes the transformation from the world coordinate system to the camera coordinate system.

After the intrinsic matrix and the extrinsic matrix of the camera are obtained, the bird's-eye view features output by the feature extraction model may be obtained based on the intrinsic matrix and the extrinsic matrix of the camera, the image features, and the bird's-eye view query features.

In the projection from the world coordinate system to the image coordinate system, I denotes the image coordinate system, W denotes the world coordinate system, x denotes the coordinates, K_i denotes the intrinsic matrix, and R_i denotes the extrinsic matrix. Based on the coordinate values obtained by projection, the coordinates closest to the center point of the projection plane are sampled, and the bird's-eye view features are obtained from these coordinates.
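The projection formula itself is not reproduced in this text. As a hedged illustration, the sketch below uses the standard pinhole camera model, in which a world point is first mapped into the camera frame by the extrinsic matrix and then into pixel coordinates by the intrinsic matrix. The 3×4 `[R | t]` layout of the extrinsic matrix, the numeric values, and the function names are assumptions, not details taken from the original.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def project_to_image(point_world, intrinsic, extrinsic):
    """Project a 3D world point to pixel coordinates; return None if it lies behind the camera."""
    x, y, z = point_world
    cam = mat_vec(extrinsic, [x, y, z, 1.0])   # world -> camera frame, extrinsic assumed 3x4 [R | t]
    if cam[2] <= 0:
        return None                            # not in front of the camera
    u, v, w = mat_vec(intrinsic, cam)          # camera frame -> homogeneous pixel coordinates
    return (u / w, v / w)


# Hypothetical calibration: focal length 1000 px, principal point (640, 360), camera at the world origin.
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
RT = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0]]

pixel = project_to_image((1.0, 0.5, 10.0), K, RT)
behind = project_to_image((0.0, 0.0, -5.0), K, RT)
```

In a multi-camera setup, the same bird's-eye view location would be projected with each camera's own K_i and R_i, and only projections that land inside an image contribute features.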

Based on any of the above embodiments, Fig. 5 is a second schematic flow chart of the geometric prior-based bird's-eye view semantic segmentation method provided by the present invention. As shown in Fig. 5, the method includes:

First, an image to be processed and bird's-eye view query features are acquired.

Second, the image to be processed is input into a feature extractor to obtain the image features output by the feature extractor, wherein the feature extractor includes a first convolution layer, a first activation layer, and a first normalization layer connected in sequence, and the feature extractor is configured to output multi-scale image features.

Third, a previous token of the image to be processed and the previous bird's-eye view query features are input into the self-attention module of the current feature extraction module in the feature extraction model. The self-attention module samples the bird's-eye view query features to obtain sampled bird's-eye view query features, and then inversely samples the sampled bird's-eye view query features to obtain the inverse-sampled bird's-eye view query features output by the self-attention module.

The inverse-sampled bird's-eye view query features are input into the feed-forward network of the current feature extraction module to obtain the current bird's-eye view query features output by the feed-forward network. The previous bird's-eye view query features here are output by the feature extraction module preceding the current feature extraction module.

The bird's-eye view query features output by the last feature extraction module are taken as the bird's-eye view features.

Fourth, semantic segmentation is performed on the image to be processed based on the bird's-eye view features.

The self-attention module here may include a cascaded image block self-attention module, an image self-attention module, and a scene self-attention module, where the image block self-attention module is configured to perform cross-view self-attention transformation on an image block of the bird's-eye view query feature to obtain a first bird's-eye view query feature, where the image self-attention module is configured to perform cross-view self-attention transformation on the first bird's-eye view query feature to obtain a second bird's-eye view query feature, and where the scene self-attention module is configured to perform cross-view self-attention transformation on the second bird's-eye view query feature.
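The chaining of feature extraction modules in the flow above, where each module consumes the previous module's output and the last module's output becomes the bird's-eye view features, can be sketched as a simple loop. This is an illustrative sketch only: the assumption that each module returns both updated tokens and updated queries, and the toy module that shifts each query by a scaled token mean, are hypothetical stand-ins for the attention and feed-forward computation.

```python
def run_feature_extraction(modules, tokens, bev_queries):
    """Pass tokens and BEV queries through cascaded modules; return the last module's queries."""
    for module in modules:
        tokens, bev_queries = module(tokens, bev_queries)
    return bev_queries


def make_toy_module(step):
    """Build a toy module that shifts every query by `step` times the mean token value."""
    def module(tokens, bev_queries):
        mean = sum(tokens) / len(tokens)
        return tokens, [q + step * mean for q in bev_queries]
    return module


modules = [make_toy_module(1.0), make_toy_module(0.5)]
bev_features = run_feature_extraction(modules, tokens=[2.0, 4.0], bev_queries=[0.0, 1.0])
```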

The bird's-eye view semantic segmentation device based on geometric prior provided by the present invention is described below. The device described below and the bird's-eye view semantic segmentation method based on geometric prior described above may be referred to in correspondence with each other.

Based on any of the embodiments, the present invention provides a bird's-eye view semantic segmentation device based on geometric prior, fig. 6 is a schematic structural diagram of the bird's-eye view semantic segmentation device based on geometric prior provided by the present invention, as shown in fig. 6, the device includes:

an acquisition unit 610, configured to acquire the image to be processed and the bird's-eye view query features;

an image feature extraction unit 620, configured to input the image to be processed to a feature extractor, so as to obtain an image feature output by the feature extractor;

a bird's-eye view feature extraction unit 630, configured to input the image features and the bird's-eye view query features into a feature extraction model, so as to obtain the bird's-eye view features output by the feature extraction model;

a semantic segmentation unit 640, configured to perform semantic segmentation on the image to be processed based on the bird's eye view feature;

the feature extraction model comprises a self-attention module, and the self-attention module is used for performing cross-view self-attention conversion on the tokens of the images to be processed and the bird's-eye view query features.

According to the device provided by the embodiment of the invention, the self-attention module in the feature extraction model performs cross-view self-attention conversion on the token of the image to be processed and the bird's-eye view query feature, so that the calculation complexity of the self-attention mechanism can be reduced, the complexity of image feature extraction is further reduced, the accuracy and reliability of the obtained bird's-eye view feature are improved, and the accuracy and reliability of subsequent semantic segmentation are further improved.
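The division of the device into units can be illustrated as a small composition of callables. The class name `BevSegmentationDevice`, its field names, and the toy lambdas are hypothetical; they only show how the outputs of units 620, 630, and 640 feed into one another, not how any unit is actually implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class BevSegmentationDevice:
    feature_extractor: Callable[[Any], Any]              # used by image feature extraction unit 620
    feature_extraction_model: Callable[[Any, Any], Any]  # used by bird's-eye view feature extraction unit 630
    decoder: Callable[[Any], Any]                        # used by semantic segmentation unit 640

    def segment(self, image, bev_queries):
        """Run the units in order on inputs supplied by the acquisition unit 610."""
        image_features = self.feature_extractor(image)
        bev_features = self.feature_extraction_model(image_features, bev_queries)
        return self.decoder(bev_features)


# Toy stand-ins so the data flow can be executed end to end.
device = BevSegmentationDevice(
    feature_extractor=lambda image: [2 * p for p in image],
    feature_extraction_model=lambda feats, queries: [q + sum(feats) for q in queries],
    decoder=lambda bev: [int(v > 15) for v in bev],
)
labels = device.segment([1, 2, 3], [0, 5])
```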

Based on any embodiment, the feature extraction model comprises a plurality of cascaded feature extraction modules;

the bird's-eye view feature extraction unit specifically includes:

a current feature extraction subunit, configured to input a previous token of the image to be processed and the previous bird's-eye view query features into a current feature extraction module to obtain the current bird's-eye view query features output by the current feature extraction module, wherein the previous bird's-eye view query features are output by the feature extraction module preceding the current feature extraction module;

and a bird's-eye view feature extraction subunit, configured to take the bird's-eye view query features output by the last feature extraction module as the bird's-eye view features.

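Based on any of the above embodiments, the current feature extraction subunit is specifically configured to:

input a previous token of the image to be processed and the previous bird's-eye view query features into the self-attention module of the current feature extraction module, where the self-attention module samples the bird's-eye view query features to obtain sampled bird's-eye view query features and then inversely samples the sampled bird's-eye view query features to obtain the inverse-sampled bird's-eye view query features output by the self-attention module;

and input the inverse-sampled bird's-eye view query features into the feed-forward network of the current feature extraction module to obtain the current bird's-eye view query features output by the feed-forward network.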

Based on any of the embodiments, the self-attention module includes a cascaded image block self-attention module, an image self-attention module, and a scene self-attention module, the image block self-attention module is configured to perform cross-view self-attention conversion on an image block of the bird's-eye view query feature to obtain a first bird's-eye view query feature, the image self-attention module is configured to perform cross-view self-attention conversion on the first bird's-eye view query feature to obtain a second bird's-eye view query feature, and the scene self-attention module is configured to perform cross-view self-attention conversion on the second bird's-eye view query feature.

According to any one of the above embodiments, the feature extractor comprises a first convolution layer, a first activation layer and a first normalization layer which are connected in sequence, and the feature extractor is configured to output multi-scale image features.

Based on any one of the above embodiments, the performing semantic segmentation on the image to be processed based on the bird's-eye view features includes:

inputting the bird's-eye view features into a decoder, and outputting the semantic segmentation result of the image to be processed by the decoder, wherein the decoder includes a second convolution layer, a second normalization layer, and a second activation layer connected in sequence.

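Based on any one of the above embodiments, the inputting the image features and the bird's-eye view query features into a feature extraction model to obtain the bird's-eye view features output by the feature extraction model includes:

acquiring the intrinsic matrix and the extrinsic matrix of the camera, wherein the intrinsic matrix and the extrinsic matrix of the camera serve as the geometric prior;

and obtaining the bird's-eye view features output by the feature extraction model based on the intrinsic matrix and the extrinsic matrix of the camera, the image features, and the bird's-eye view query features.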

Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 7, the electronic device may include: a processor 710, a communications interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform a geometric prior-based bird's-eye view semantic segmentation method comprising: acquiring an image to be processed and bird's-eye view query features; inputting the image to be processed into a feature extractor to obtain image features output by the feature extractor; inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model; and performing semantic segmentation on the image to be processed based on the bird's-eye view features; wherein the feature extraction model comprises a self-attention module, and the self-attention module is used for performing cross-view self-attention conversion on the token of the image to be processed and the bird's-eye view query features.

In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer is capable of executing the geometric prior-based bird's-eye view semantic segmentation method provided by the above methods, where the method includes: acquiring an image to be processed and bird's-eye view query features; inputting the image to be processed into a feature extractor to obtain image features output by the feature extractor; inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model; and performing semantic segmentation on the image to be processed based on the bird's-eye view features; wherein the feature extraction model comprises a self-attention module, and the self-attention module is used for performing cross-view self-attention conversion on the token of the image to be processed and the bird's-eye view query features.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the geometric prior-based bird's-eye view semantic segmentation method provided by the above methods, the method including: acquiring an image to be processed and bird's-eye view query features; inputting the image to be processed into a feature extractor to obtain the image features output by the feature extractor; inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model; and performing semantic segmentation on the image to be processed based on the bird's-eye view features; wherein the feature extraction model comprises a self-attention module, and the self-attention module is used for performing cross-view self-attention conversion on the token of the image to be processed and the bird's-eye view query features.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A bird's-eye view semantic segmentation method based on geometric prior is characterized by comprising the following steps:

acquiring an image to be processed and bird's-eye view query features;

inputting the image to be processed into a feature extractor to obtain image features output by the feature extractor;

inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model;

performing semantic segmentation on the image to be processed based on the bird's-eye view features;

2. The bird's eye view semantic segmentation method based on geometric priors according to claim 1, wherein the feature extraction model comprises a cascade of a plurality of feature extraction modules;

3. The bird's-eye view semantic segmentation method based on geometric priors according to claim 2, wherein the inputting a previous token and a previous bird's-eye view query feature of the image to be processed into a current feature extraction module to obtain the current bird's-eye view query feature output by the current feature extraction module comprises:

inputting a last token and a last bird's-eye view query feature of the image to be processed into a self-attention module of a current feature extraction module, sampling the bird's-eye view query feature by the self-attention module to obtain a sampled bird's-eye view query feature, and then performing inverse sampling on the sampled bird's-eye view query feature to obtain an inverse sampled bird's-eye view query feature output by the self-attention module;

and inputting the inverse-sampled bird's-eye view query features into a feed-forward network of the current feature extraction module to obtain the current bird's-eye view query features output by the feed-forward network.

4. The bird's-eye view semantic segmentation method based on geometric priors according to any one of claims 1 to 3, wherein the self-attention module comprises a cascaded image block self-attention module, an image self-attention module and a scene self-attention module, the image block self-attention module is used for performing cross-view self-attention conversion on an image block of the bird's-eye view query feature to obtain a first bird's-eye view query feature, the image self-attention module is used for performing cross-view self-attention conversion on the first bird's-eye view query feature to obtain a second bird's-eye view query feature, and the scene self-attention module is used for performing cross-view self-attention conversion on the second bird's-eye view query feature.

5. The bird's-eye view semantic segmentation method based on geometric priors according to claim 1, wherein the feature extractor comprises a first convolution layer, a first activation layer and a first normalization layer which are connected in sequence, and the feature extractor is configured to output multi-scale image features.

6. The bird's-eye view semantic segmentation method based on geometric priors according to claim 1, wherein the semantic segmentation of the image to be processed based on the bird's-eye view features comprises:

7. The bird's-eye view semantic segmentation method based on geometric priors according to claim 1, wherein the step of inputting the image features and the bird's-eye view query features into a feature extraction model to obtain bird's-eye view features output by the feature extraction model comprises the steps of:

8. A bird's-eye view semantic segmentation device based on geometric prior is characterized by comprising the following components:

a bird's-eye view feature extraction unit, configured to input the image features and the bird's-eye view query features into a feature extraction model to obtain the bird's-eye view features output by the feature extraction model;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the geometric prior-based bird's eye view semantic segmentation method according to any one of claims 1 to 7 when executing the program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the bird's eye view semantic segmentation method based on geometric priors according to any one of claims 1 to 7.