CN115588175A - Bird's-eye view feature generation method based on vehicle-mounted surround-view image


Info

Publication number
CN115588175A
Authority
CN
China
Prior art keywords
bird's-eye view, feature, image, vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211290357.9A
Other languages
Chinese (zh)
Inventor
缪文俊
李雪
陈禹行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202211290357.9A
Publication of CN115588175A
Legal status: Pending


Classifications

    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/40: Extraction of image or video features
    • G06V 10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion of extracted features (combining data from various sources at the feature extraction level)
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06T 2207/30252: Vehicle exterior; vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bird's-eye view (BEV) feature generation method based on a vehicle-mounted surround-view image, which comprises the following steps: extracting the images acquired by all onboard cameras of the vehicle at the current time; obtaining the image features of each acquired image at the current time; obtaining the initialized bird's-eye view feature for the current time based on bird's-eye view features at historical times; performing spatial feature sampling in the image features of each acquired image based on the initialized bird's-eye view feature for the current time to obtain sampled features; performing spatial cross attention between the initialized bird's-eye view feature for the current time and the sampled features to generate a bird's-eye view feature for the current time centered on the host vehicle; and establishing relationships among different channels of the current bird's-eye view feature based on a separable convolutional neural network to obtain an enhanced bird's-eye view feature for the current time. The present disclosure also provides an electronic device, a readable storage medium, and a program product.

Description

Bird's-eye view feature generation method based on vehicle-mounted surround-view image
Technical Field
The present disclosure relates to the fields of automatic driving and computer vision, and in particular to a method and an apparatus for generating a bird's-eye view feature based on a vehicle-mounted surround-view image, as well as an electronic device, a storage medium, and a program product.
Background
A Bird's Eye View (BEV) is a top-down representation. It helps build downstream tasks such as 3D object detection and lane-line segmentation on a unified feature representation, enhancing the reusability of the model; it facilitates the move from post-sensor fusion to pre-sensor fusion, so that the fusion process is entirely data-driven; it represents the spatial relationships among objects well, which benefits prediction and planning; and it is well compatible with previous 2D object detection methods, making deployment convenient.
Using the bird's-eye view feature as a unified feature representation for downstream tasks is an important new technical means in autonomous-driving perception and plays an important role in spatial perception for autonomous driving. Most existing bird's-eye view feature generation techniques are based on lidar. Although coordinate transformation makes lidar-based bird's-eye view feature generation convenient, the high price of high-precision lidar makes it difficult to deploy at scale on autonomous vehicles. Generating bird's-eye view features from surround-view images is inexpensive, and its perception principle, purely visual perception, is the same as that of a human driver, so it suits real autonomous-driving scenarios and has great research value. The quality of bird's-eye view feature generation directly affects the perception accuracy of downstream tasks and, in turn, subsequent prediction and planning. Image-based bird's-eye view feature generation still has problems: temporal features are difficult to extract and exploit effectively, high-quality spatial features are difficult to sample reasonably, and it is difficult to use spatial features efficiently to generate high-quality bird's-eye view features.
Some technical solutions in the prior art are as follows.
Technical solution 1: DETR (End-to-End Object Detection with Transformers) first introduced the Transformer model into object detection. Instead of the paradigm of first proposing prediction boxes and then judging whether they contain content and what that content is, it proposed a query paradigm: a learnable vector is used as an Object Query to ask whether objects exist at different positions of the image and what those objects are. This solution establishes a query-based mechanism for querying targets, but convergence is very slow because of bipartite-graph matching, and using the Encoder for global attention adds a large amount of computational overhead.
Technical solution 2: DETR3D (3D Object Detection from Multi-view Images via 3D-to-2D Queries) introduced the DETR family into 3D object detection for the first time. It uses a CNN backbone to extract high-dimensional feature representations of the images and Object Queries as query vectors; a Transformer Decoder iteratively predicts the three-dimensional center point of each object and regresses the object box, and the predicted center points are mapped by geometric projection onto the high-dimensional features extracted by the CNN backbone to extract the corresponding features for the next iteration. This solution provides a very effective way to generate 3D feature representations from 2D feature representations, namely a query mechanism in which the Object Query predicts the projection of the 3D object center into the 2D feature space. However, the method is only suitable for 3D object detection, can hardly support dense feature representation predictions such as lane-line detection, and cannot unify the downstream perception tasks of autonomous driving.
Technical solution 3: this solution first changes the Object Query into a BEV Query: a fixed-size bird's-eye view map (BEV Map) is initialized, and each position point on the BEV Map, acting as a BEV Query, is projected onto the 2D feature maps extracted by the CNN backbone to extract features, so that introducing the bird's-eye view map provides a dense feature representation. Second, a temporal attention module is introduced: using the speed and pose change of the ego vehicle, the BEV Map of the previous time and the BEV Map of the current time are mixed by cross attention, thereby introducing temporal information. However, the stacked and mixed use of temporal and spatial information makes it difficult to clearly and effectively distinguish temporal offsets from spatial offsets during learning, and it is difficult to construct a high-quality bird's-eye view feature.
Technical solution 4: Unified Multi-View Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View uses a model architecture similar to that of technical solution 3 but handles temporal information differently: instead of cross attention between the BEV Map of the previous time and that of the current time, only the BEV Map of the current time is used for the self-attention computation of the BEV Query, while the convolutional feature maps extracted at previous times are retained and the BEV Query is projected onto the feature maps of multiple times to extract features. Extracting features directly from historical images gathers more raw temporal information and eases analysis by the model, but it significantly increases GPU memory overhead, making the model difficult to deploy in practical applications.
Technical solution 5: Translating Images into Maps uses the physical relationship between the image view and the bird's-eye view: each vertical line of pixel features in the image view corresponds, at its geometric position in the real world, to a polar-ray feature of the ego-centered bird's-eye view. The vertical-line pixel features of the image view are first mapped by an attention mechanism into memory-cell features of the same size, and the memory-cell features are then mapped by an attention mechanism into the polar-ray features of the bird's-eye view. This makes good use of real-world physical relationships to model the relationship between the image view and the bird's-eye view, but the correspondence is a mapping from image features to bird's-eye view features rather than a sampling from bird's-eye view features to image features, and it can only be applied to bird's-eye view feature generation from a single image.
Technical solution 1 constructs an object detection paradigm based on a query mechanism and lays the groundwork, but it only performs 2D object detection. Technical solution 2 introduces the query mechanism into 3D object detection and builds a bridge from 2D feature representations to 3D feature representations, but it only performs 3D object detection. Technical solution 3 uses the query mechanism to generate a BEV Map, but its use of temporal information is limited and no enhancement module is designed for the generated BEV Map. Technical solution 4 extracts temporal information directly from historical image features, but significantly increases GPU memory usage. Technical solution 5 uses the geometric relationship between the image view and the bird's-eye view for joint feature correspondence, which improves the robustness of bird's-eye view feature construction, but it maps from image features to bird's-eye view features rather than sampling from bird's-eye view features to image features, and it can only be applied to bird's-eye view feature generation from a single image.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a bird's-eye view feature generation method and apparatus based on a vehicle-mounted surround-view image, as well as an electronic device, a storage medium, and a program product.
According to one aspect of the present disclosure, a method for generating a bird's-eye view feature based on a vehicle-mounted surround-view image is provided, which includes:
extracting the images acquired by all onboard cameras of the vehicle at the current time;
obtaining the image features of each acquired image at the current time;
obtaining the initialized bird's-eye view feature for the current time based on bird's-eye view features at historical times;
performing spatial feature sampling in the image features of each acquired image based on the initialized bird's-eye view feature for the current time to obtain sampled features;
performing spatial cross attention between the initialized bird's-eye view feature for the current time and the sampled features to generate a bird's-eye view feature for the current time centered on the host vehicle;
establishing relationships among different channels of the current bird's-eye view feature based on a separable convolutional neural network to obtain an enhanced bird's-eye view feature for the current time.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure, obtaining the image features of each acquired image at the current time includes:
performing feature extraction on each acquired image at the current time based on a convolutional neural network to obtain the image features of the image acquired by each onboard camera at the current time.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure, the convolutional neural network comprises a standard ResNet-101 backbone network, an FPN network, and an additional convolutional layer;
the ResNet-101 backbone network is a four-stage ResNet-101 backbone network pre-trained on ImageNet.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure, obtaining the initialized bird's-eye view feature for the current time based on bird's-eye view features at historical times includes:
randomly initializing a learnable bird's-eye view feature centered on the host vehicle as the bird's-eye view feature for the current time;
adding a position code to the learnable bird's-eye view feature to obtain the position-encoded bird's-eye view of the current frame;
extracting the historical-frame bird's-eye views (BEV Maps) within a preset time length before the current time;
obtaining the relative pose between the position-encoded bird's-eye view of the current frame and each historical-frame bird's-eye view;
for the bird's-eye view query vector of each position point of the position-encoded current-frame bird's-eye view, searching the historical-frame bird's-eye views, based on the relative pose, for the corresponding historical-frame bird's-eye view query vector(s) as temporal prior information;
adding the temporal prior information to the position-encoded bird's-eye view of the current frame to obtain the initialized bird's-eye view feature for the current time.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure, extracting the historical-frame bird's-eye views (BEV Maps) within a preset time length before the current time includes:
extracting one frame, or two or more frames, of historical bird's-eye views.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of at least one embodiment of the present disclosure, if two or more frames of historical bird's-eye views are extracted, the temporal prior information obtained from each frame of historical bird's-eye view is weighted and summed to obtain the final temporal prior information.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of at least one embodiment of the present disclosure, performing spatial feature sampling in the image features of each acquired image based on the initialized bird's-eye view feature for the current time to obtain sampled features includes:
performing the spatial feature sampling based on the geometric correspondence between the bird's-eye view and the image view.
According to at least one embodiment of the present disclosure, in the method for generating a bird's-eye view feature based on a vehicle-mounted surround-view image, performing the spatial feature sampling based on the geometric correspondence between the bird's-eye view and the image view includes:
taking each position point of the initialized bird's-eye view at the current time as a bird's-eye view query vector (BEV Query) and geometrically projecting it onto the feature maps of the images acquired by the onboard cameras to perform spatial feature sampling.
According to at least one embodiment of the present disclosure, in the method for generating a bird's-eye view feature based on a vehicle-mounted surround-view image, performing the spatial feature sampling based on the geometric correspondence between the bird's-eye view and the image view includes:
taking each vertical line of the feature maps representing the image features of each acquired image as a reference line, and obtaining the polar-ray features of the bird's-eye view centered on the host vehicle, and hence the corresponding bird's-eye view query vectors (BEV Query), to perform reference-line-based spatial feature sampling.
According to at least one embodiment of the present disclosure, in the method for generating a bird's-eye view feature based on a vehicle-mounted surround-view image, performing spatial cross attention between the initialized bird's-eye view feature for the current time and the sampled features to generate a bird's-eye view feature for the current time centered on the host vehicle includes:
performing sampling-point cross attention on the features sampled by projecting each bird's-eye view query vector (BEV Query) in the bird's-eye view (BEV Map) from the BEV view to the image view;
performing reference-line sampling for each host-vehicle-centered polar ray on the bird's-eye view (BEV Map) and performing reference-line cross attention;
outputting the bird's-eye view feature for the current time centered on the host vehicle.
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure, performing sampling-point cross attention on the features sampled by projecting each bird's-eye view query vector (BEV Query) in the bird's-eye view (BEV Map) from the BEV view to the image view includes:
obtaining the m projection points of each bird's-eye view query vector, each projection point serving as a feature sampling point;
predicting k position offsets from the features of each projection point and adding them to the coordinates of the projection point to obtain k additional feature sampling points, so that each projection point yields (k + 1) feature sampling points and each bird's-eye view query vector yields m × (k + 1) feature sampling points;
taking the bird's-eye view query vector (BEV Query) as the cross-attention query vector (Query), and taking the sampling points and their features as the keys (Key) and values (Value) of the cross attention, respectively, and performing the cross-attention computation to update the bird's-eye view query vector (BEV Query).
According to at least one embodiment of the present disclosure, in the method for generating a bird's-eye view feature based on a vehicle-mounted surround-view image, performing reference-line sampling for each host-vehicle-centered polar ray on the bird's-eye view (BEV Map) and performing reference-line cross attention includes:
taking each polar ray together with its neighboring bird's-eye view query vectors (BEV Query) as the cross-attention query vector (Query), and taking the left-to-right vertical lines on the image features and the vertical-line features as the keys (Key) and values (Value), and performing the cross-attention computation to update the bird's-eye view query vector (BEV Query).
According to the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure, establishing relationships among different channels of the current bird's-eye view feature based on the separable convolutional neural network to obtain the enhanced bird's-eye view feature for the current time includes:
keeping the channel dimension of the bird's-eye view feature unchanged, so that the number of convolution kernels of the separable convolutional neural network is the same as the channel dimension of the bird's-eye view feature.
According to another aspect of the present disclosure, there is provided a bird's-eye view feature generation device based on a vehicle-mounted panoramic image, including:
an image acquisition module that extracts the images acquired by all onboard cameras of the vehicle at the current time;
an image feature extraction module that obtains the image features of each acquired image at the current time;
a temporal prior module that obtains the initialized bird's-eye view feature for the current time based on bird's-eye view features at historical times;
a sampling module that performs spatial feature sampling in image features of each of the captured images based on the bird's eye view feature after the initialization at the present time to obtain sampled features;
a spatial cross attention module that performs spatial cross attention on the bird's-eye view feature after initialization at the current time and the sampled feature to generate a bird's-eye view feature at the current time with the host vehicle as a center;
a separable convolution module that establishes a relationship between different channels of the bird's-eye view feature at the current time to obtain an enhanced bird's-eye view feature at the current time.
According to another aspect of the present disclosure, there is provided an electronic device including:
a memory storing execution instructions;
a processor that executes the execution instructions stored in the memory, so that the processor performs the bird's-eye view feature generation method based on the vehicle-mounted surround-view image according to any one of the embodiments of the present disclosure.
According to still another aspect of the present disclosure, a readable storage medium is provided, in which execution instructions are stored; the execution instructions, when executed by a processor, implement the bird's-eye view feature generation method based on the vehicle-mounted surround-view image according to any one of the embodiments of the present disclosure.
According to still another aspect of the present disclosure, a computer program product is provided, comprising a computer program/instructions which, when executed by a processor, implement the bird's-eye view feature generation method based on the vehicle-mounted surround-view image according to any one of the embodiments of the present disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a schematic flowchart of a bird's-eye view feature generation method based on a vehicle-mounted surround view image according to an embodiment of the present disclosure.
Fig. 2 is a schematic overall model architecture diagram for implementing the bird's-eye view feature generation method based on the vehicle-mounted all-around image according to one embodiment of the present disclosure.
Fig. 3 is an architecture diagram of an image feature extraction module of one embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating acquiring the bird's eye view feature after initialization at the current time according to one embodiment of the present disclosure.
FIG. 5 is a diagram of the same real-world location correspondence for a current frame BEV Query in a history frame, in accordance with an embodiment of the present disclosure.
Fig. 6 is a projection view of BEV view and image view position for one embodiment of the present disclosure.
FIG. 7 is a schematic diagram of a reference line sampling mode of one embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of a spatial attention module according to one embodiment of the present disclosure.
Fig. 9 is a schematic block diagram of a structure of an overhead view feature generation device based on an on-vehicle surround view image, which is implemented by hardware using a processing system according to an embodiment of the present disclosure.
Description of the reference numerals
1000. Bird's-eye view feature generation device
1002. Image acquisition module
1004. Image feature extraction module
1006. Temporal prior module
1008. Sampling module
1010. Space cross attention module
1012. Separable convolution module
1100. Bus line
1200. Processor
1300. Memory
1400. Other circuits.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant matter and not restrictive of the disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The use of cross-hatching and/or shading in the drawings is generally used to clarify the boundaries between adjacent components. As such, unless otherwise specified, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for a particular material, material property, size, proportion, commonality among the illustrated components and/or any other characteristic, attribute, property, etc., of a component. Further, in the drawings, the size and relative sizes of components may be exaggerated for clarity and/or descriptive purposes. While example embodiments may be practiced differently, the specific process sequence may be performed in a different order than that described. For example, two consecutively described processes may be performed substantially simultaneously or in an order reverse to the order described. In addition, like reference numerals denote like parts.
When an element is referred to as being "on" or "over," "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present. For purposes of this disclosure, the term "connected" may refer to physically connected, electrically connected, and the like, with or without intervening components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, the stated features, integers, steps, operations, elements, components and/or groups thereof are stated to be present but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximate terms and not as degree terms, and as such, are used to interpret inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
The method and apparatus for generating a bird's-eye view feature based on a vehicle-mounted surround-view image of the present disclosure are described in detail below with reference to fig. 1 to 9.
Fig. 1 is a schematic flowchart of a bird's-eye view feature generation method based on a vehicle-mounted surround-view image according to an embodiment of the present disclosure.
Referring to fig. 1, a bird's-eye view feature generation method S100 based on a vehicle-mounted surround-view image of the present disclosure includes:
S102, extracting the images acquired by all onboard cameras of the vehicle at the current time (the images acquired by all onboard cameras form the vehicle-mounted surround-view image);
S104, obtaining the image features of each acquired image at the current time (illustratively, the image features are represented by 3 feature maps);
S106, obtaining the initialized bird's-eye view feature for the current time based on bird's-eye view features at historical times (temporal prior module);
S108, performing spatial feature sampling (projection sampling from the BEV view to the image view plus reference-line sampling) in the image features (high-dimensional image features) of each acquired image based on the initialized bird's-eye view feature for the current time, to obtain sampled features (i.e., the BEV Map extracts the corresponding features from image space);
S110, performing spatial cross attention between the initialized bird's-eye view feature for the current time and the sampled features to generate a bird's-eye view feature for the current time centered on the host vehicle;
S112, establishing relationships among different channels of the current bird's-eye view feature based on a separable convolutional neural network to obtain an enhanced bird's-eye view feature for the current time.
The bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure is a high-quality bird's-eye view feature generation method: bird's-eye view features from a top-down viewing angle are generated in real time from the vehicle-mounted surround-view images around the vehicle body, so that subsequent downstream tasks can conveniently be built on top of the bird's-eye view features.
To address the technical problem that temporal features are difficult to extract and exploit effectively, the present disclosure provides a temporal prior module, which achieves effective temporal information fusion by fusing features of the same real-world position on the bird's-eye view. The temporal information is used only to initialize the bird's-eye view feature of the current time and is not involved in the iterative sampling and generation of the bird's-eye view feature. This separated spatio-temporal feature fusion helps strengthen the learning of temporal and spatial feature information, whereas prior work stacks temporal cross attention and spatial cross attention on top of each other.
To address the technical problem that high-quality spatial features are difficult to sample reasonably, the present disclosure provides a spatial feature sampling method from bird's-eye view features to image features based on reference-line sampling, in which the projection sampling from the BEV view to the image view and the reference-line sampling are used together to sample high-quality spatial features. Most feature sampling methods in the prior art are based on point-to-point projection sampling, which loses the geometric correspondence from the image view to the bird's-eye view; the addition of reference-line sampling therefore helps sample high-quality spatial features.
To address the technical problem that it is difficult to use spatial features efficiently to generate high-quality bird's-eye view features, the present disclosure provides a spatial attention module that efficiently fuses the sampled spatial features with cross attention and uses a separable convolution to generate high-quality bird's-eye view features. With cross attention, the fusion of different spatial features is learned automatically in a data-driven way, and a layer of separable convolutional neural network after the spatial cross attention establishes links among the different channels of the bird's-eye view feature, so that high-quality bird's-eye view features can be generated.
Fig. 2 is a schematic diagram of the overall model architecture (i.e., the bird's-eye view feature generation device) for implementing the bird's-eye view feature generation method based on the vehicle-mounted surround-view image of the present disclosure. Referring to fig. 2, to generate the bird's-eye view feature at a given time, the images captured by all onboard surround-view cameras at the current time are first extracted, and a convolutional neural network then extracts high-dimensional features from all the images for use in spatial feature sampling. Next, a learnable bird's-eye view feature (BEV Map) centered on the host vehicle is randomly initialized, the position code is added, and the result is sent to the temporal prior module. In the temporal prior module, the initialized BEV Map is fused with the historically generated BEV Maps according to real-world position, so that the current BEV Map is initialized a second time using the historically generated BEV Maps. Spatial feature sampling is then performed on the temporally initialized BEV Map according to the geometric correspondence between the bird's-eye view and the image view, using the projection sampling from the BEV view to the image view together with the reference-line sampling, and the spatial cross attention is computed to generate the current bird's-eye view feature centered on the host vehicle. A separable convolution is then used to improve the quality of the generated bird's-eye view feature. The spatial attention module and the separable convolution module are applied iteratively, and the BEV Map generated after each iteration is used subsequently to compute the loss for a specific downstream task.
Obtaining the image features of each acquired image at the current time described above (in some embodiments of the present disclosure, the image features are represented by 3 feature maps) includes:
performing feature extraction on each acquired image at the current time based on a convolutional neural network to obtain the image features of the image acquired by each onboard camera at the current time.
In the bird's-eye view feature generation method S100 based on the vehicle-mounted surround-view image, the convolutional neural network preferably includes a standard ResNet-101 backbone network, an FPN network, and an additional convolutional layer;
wherein the ResNet-101 backbone network is a four-stage ResNet-101 backbone network pre-trained on ImageNet.
In some embodiments of the present disclosure, it is exemplarily assumed that the host vehicle has N onboard cameras operating, and N images are captured simultaneously at each time, so that each time will result in the input of N images.
In some embodiments of the present disclosure, the convolutional neural network of the present disclosure includes a standard ResNet-101 backbone network, an FPN network, and an additional convolutional layer.
Assuming that the input at time T (the current time) consists of the N images, they are fed into the standard ResNet-101 backbone network and then pass through the FPN network and the additional convolutional layer to obtain the image features at time T, $F_T = \{F_T^1, F_T^2, \ldots, F_T^N\}$, where $F_T^i$ denotes the image features of the image acquired by the i-th onboard camera at time T. Illustratively, the present disclosure assumes by default that all acquired images are 800 × 1600 pixels.
The present disclosure preferably uses a four-stage ResNet-101 pre-trained on ImageNet as the backbone network, freezes the weights of the first stage of ResNet-101 during model training, and updates the parameters of only the following three stages. Meanwhile, a single-group deformable convolution is applied to the output of the last block of the last two stages; the feature maps output by the last block of the last three stages are then fed into the FPN network, an additional convolution operation (i.e., the additional convolutional layer) is applied to all feature maps output by the FPN network, and the feature maps after the FPN and the additional convolution are used as the image features for spatial feature sampling.
Fig. 3 is an architecture diagram of an image feature extraction module of one embodiment of the present disclosure.
Referring to fig. 3, for the image input to the image feature extraction module by each onboard camera at time T (the current time), 3 feature maps are obtained to represent the image as the original features to be sampled by the spatial attention module (described in detail below).
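A rough PyTorch sketch of this extraction pipeline is given below: an ImageNet-pre-trained four-stage ResNet-101 whose first stage is frozen, the last three stages fed into an FPN, and an additional 3 x 3 convolution applied to each FPN output, returning 3 feature maps per image. The torchvision-based construction and the omission of the single-group deformable convolution are simplifying assumptions of this sketch.

```python
from collections import OrderedDict

import torch
import torch.nn as nn
from torchvision.models import resnet101
from torchvision.ops import FeaturePyramidNetwork

class ImageFeatureExtractor(nn.Module):
    """Sketch: ResNet-101 (stage 1 frozen) -> FPN -> extra 3x3 conv per level."""
    def __init__(self, out_channels=256):
        super().__init__()
        backbone = resnet101(weights="IMAGENET1K_V1")          # ImageNet pre-training
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stage1, self.stage2 = backbone.layer1, backbone.layer2
        self.stage3, self.stage4 = backbone.layer3, backbone.layer4
        for p in list(self.stem.parameters()) + list(self.stage1.parameters()):
            p.requires_grad = False                             # freeze the first stage
        self.fpn = FeaturePyramidNetwork([512, 1024, 2048], out_channels)
        self.extra = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in range(3)])         # additional convolutional layer

    def forward(self, img):                                     # img: (B, 3, 800, 1600)
        c1 = self.stage1(self.stem(img))
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        fpn_out = self.fpn(OrderedDict(c2=c2, c3=c3, c4=c4))
        # 3 feature maps per image, later used for spatial feature sampling
        return [conv(f) for conv, f in zip(self.extra, fpn_out.values())]
```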
Fig. 4 is a flowchart of obtaining the initialized bird's-eye view feature for the current time according to an embodiment of the present disclosure. Referring to fig. 4, in some embodiments of the present disclosure, S106 described above, obtaining the initialized bird's-eye view feature for the current time based on bird's-eye view features at historical times (temporal prior module), includes:
S1061, randomly initializing a learnable bird's-eye view feature centered on the host vehicle as the bird's-eye view feature for the current time;
S1062, adding a position code to the learnable bird's-eye view feature to obtain the position-encoded bird's-eye view of the current frame;
S1063, extracting the historical-frame bird's-eye views (BEV Maps) within a preset time length before the current time;
S1064, obtaining the relative pose (i.e., the pose change) between the position-encoded bird's-eye view of the current frame and each historical-frame bird's-eye view;
S1065, for the bird's-eye view query vector of each position point of the position-encoded current-frame bird's-eye view, searching the historical-frame bird's-eye views, based on the relative pose, for the corresponding historical-frame bird's-eye view query vector(s) as temporal prior information;
S1066, adding the temporal prior information to the position-encoded bird's-eye view of the current frame to obtain the initialized bird's-eye view feature for the current time (temporal prior module).
In the bird's-eye view feature generation method S100 based on the vehicle-mounted surround-view image according to some embodiments of the present disclosure, extracting the historical-frame bird's-eye views (BEV Maps) within a preset time length before the current time includes: extracting one frame of historical bird's-eye view, or extracting two or more frames of historical bird's-eye views.
In some embodiments of the present disclosure, if two or more frames of historical bird's-eye views are extracted, the temporal prior information obtained from each frame of historical bird's-eye view is weighted and summed to obtain the final temporal prior information.
In more detail, for the bird's-eye view feature generation at the current time, the present disclosure first initializes a learnable BEV Map $\in \mathbb{R}^{H \times W \times C}$ centered on the host vehicle, where H is the height of the generated bird's-eye view, W is its width, and C is the dimension of the generated bird's-eye view feature. Illustratively, the present disclosure sets H = W = 200 and C = 256.
The present disclosure initializes the BEV Map (bird's-eye view) of the current frame from the BEV Maps of historical frames using the temporal prior module. Specifically, a learnable BEV Map centered on the host vehicle is first randomly initialized for the current frame, and a two-dimensional relative position code is added. Then, within the past 0.5 second (for example), four different frames (i.e., multiple frames) are randomly extracted as historical frames, and the relative poses (including position offset and/or heading offset) between the current-frame bird's-eye view of the host vehicle and the bird's-eye views of the other historical frames are obtained from IMU (Inertial Measurement Unit) information.
Each position point on the BEV Map of the current frame is taken as a BEV Query (bird's-eye view query vector), so that H × W BEV Queries are obtained, each of dimension $\mathbb{R}^{1 \times C}$. Then, for each BEV Query of the current frame, the corresponding BEV Query is found in the historical frames through the pose information (i.e., the relative pose described above): according to the position offset and/or heading offset, for the real-world position coordinate corresponding to the BEV Query of the current frame, the BEV Query corresponding to that real-world position coordinate is searched for in the historical frame.
It should be noted that the real-world position point corresponding to a BEV Query of the current frame may fall among several BEV Queries of the historical frame. The present disclosure therefore preferably performs a weighted summation of the historical-frame Queries within a neighboring 3 × 3 (exemplarily) Query range, with weights given by the ratio of the distances from that real-world position point to the different BEV Queries; if the corresponding position point in the historical frame lies on the boundary of the BEV Map (bird's-eye view), the weighted summation is performed only within a neighboring 2 × 3 Query range, and if it lies at one of the four corners of the BEV Map, only within a neighboring 2 × 2 Query range.
For each historical frame, the BEV Query of the current frame extracts the corresponding temporal prior information according to the relative pose described above, and the temporal prior information from the multiple historical frames is then weighted and summed to obtain the final temporal prior information. In the present disclosure, the same weight is preferably used for the different historical frames. Finally, the obtained temporal prior information is added to the corresponding BEV Query of the current frame, yielding the BEV Map $\in \mathbb{R}^{H \times W \times C}$ that has passed through the temporal prior module.
Fig. 5 is a schematic diagram of the same real-world position correspondence of a current frame BEV Query in a history frame according to an embodiment of the disclosure.
In some embodiments of the present disclosure, it is worth noting that a Query that has no corresponding temporal information in the historical BEV Maps receives no prior information from the time sequence. If the current frame is the first frame, the learnable BEV Map only incorporates the two-dimensional relative position code and skips the temporal prior module described above. If fewer than four historical frames are available before the current frame, only the existing historical frames are used for the temporal prior fusion.
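A simplified sketch of this temporal prior initialization follows: each historical BEV Map is warped into the current ego frame from the relative pose, and the equally weighted average is added to the position-encoded current BEV. Bilinear grid sampling stands in for the 3 x 3 neighbourhood weighting described above, and the axis and yaw conventions of the ego-motion transform are assumptions of this sketch.

```python
import math
import torch
import torch.nn.functional as F

def warp_grid(H, W, dx, dy, dyaw, cell_m=0.5):
    """Sampling grid mapping current-frame BEV cells into the history ego frame
    (assumed conventions: x right, y forward, yaw counter-clockwise)."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    px = (xs.float() - W / 2 + 0.5) * cell_m          # cell centres in metres, ego origin
    py = (ys.float() - H / 2 + 0.5) * cell_m
    c, s = math.cos(dyaw), math.sin(dyaw)
    hx = c * px - s * py + dx                         # rigid transform into the history frame
    hy = s * px + c * py + dy
    gx = hx / (W / 2 * cell_m)                        # normalise to [-1, 1] for grid_sample
    gy = hy / (H / 2 * cell_m)
    return torch.stack((gx, gy), dim=-1)[None]

def temporal_prior(curr_bev, hist_bevs, rel_poses, cell_m=0.5):
    """curr_bev: (C, H, W) position-encoded current BEV; hist_bevs: list of (C, H, W);
    rel_poses: list of (dx, dy, dyaw) transforms from the current to each history frame."""
    C, H, W = curr_bev.shape
    priors = []
    for hist, (dx, dy, dyaw) in zip(hist_bevs, rel_poses):
        grid = warp_grid(H, W, dx, dy, dyaw, cell_m)
        priors.append(F.grid_sample(hist[None], grid, align_corners=False)[0])
    if priors:                                        # first frame: no temporal prior
        curr_bev = curr_bev + torch.stack(priors).mean(0)   # equal weight per history frame
    return curr_bev
```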
In some embodiments of the present disclosure, S108 described above, performing spatial feature sampling (projection sampling from the BEV view to the image view plus reference-line sampling) in the image features (high-dimensional image features) of each acquired image based on the initialized bird's-eye view feature for the current time to obtain sampled features (i.e., the bird's-eye view (BEV Map) extracts the corresponding features from image space), includes: performing the spatial feature sampling based on the geometric correspondence between the bird's-eye view and the image view.
Preferably, the spatial feature sampling based on the geometric correspondence between the bird's-eye view and the image view described above includes:
taking each position point of the initialized bird's-eye view at the current time as a bird's-eye view query vector (BEV Query) and geometrically projecting it onto the feature maps of the images acquired by the onboard cameras to perform spatial feature sampling.
In some embodiments of the present disclosure, for the initialized BEV Map $\in \mathbb{R}^{H \times W \times C}$ that has passed through the temporal prior module, each position point on the BEV Map is likewise taken as a BEV Query, so that H × W BEV Queries are obtained, each of dimension $\mathbb{R}^{1 \times C}$.
Each BEV Query is geometrically projected, according to the intrinsic and extrinsic parameters of the onboard cameras, from its position on the BEV Map into the feature maps extracted by the convolutional neural network for spatial feature sampling. Illustratively, the present disclosure assumes that each grid cell of the BEV Map corresponds to S meters in the real world. By default, the feature center of the BEV Map corresponds to the position of the host vehicle; then, for any BEV Query with coordinates (x, y) on the BEV Map, its coordinates (x', y') projected onto the real-world top view are
$x' = (x - W/2) \cdot S, \quad y' = (y - H/2) \cdot S,$
where the coordinate system of (x, y) has its origin at the lower-left corner of the BEV Map and the coordinate system of (x', y') has its origin at the host vehicle on the real-world top view. The present disclosure illustratively sets S = 0.5.
Illustratively, for each coordinate point (x', y'), R sampling points are taken uniformly over (-5 m, 5 m) in the z-axis direction; that is, for each BEV Query, R real-world 3D coordinate points $(x', y', z_j), j = 1, \ldots, R$, are obtained. Each 3D coordinate point can then be projected, through the intrinsic and extrinsic camera parameters, onto the images acquired by one or more onboard cameras, yielding a number of image coordinate points $(p_x, p_y)$, i.e., image projection coordinate points. The present disclosure illustratively sets R = 8.
For each BEV Query, a plurality of corresponding image projection coordinate points may be obtained, referring to fig. 6. Fig. 6 is a projection view of BEV view and image view position for one embodiment of the present disclosure.
It should be noted that the R real-world 3D coordinate points generated by each BEV Query are projected according to the intrinsic and extrinsic parameters of each onboard camera; however, since at most two cameras have overlapping fields of view, each BEV Query is projected onto at most 2 onboard images, and the projection points for the other images fall outside the image range. In addition, since the value range in the z-axis direction is fixed, the R projection points on an image do not necessarily all fall within the image, and projection points that fall outside the image are discarded.
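The BEV-to-image projection just described can be sketched as follows, assuming a combined intrinsic/extrinsic 4 x 4 projection matrix per camera (called lidar2img here purely as a naming convenience) and the values S = 0.5 and R = 8 mentioned in the text; points behind the camera or outside the image are masked out.

```python
import torch

def project_bev_queries(xy_idx, lidar2img, S=0.5, H=200, W=200, R=8,
                        img_h=800, img_w=1600):
    """xy_idx: (Q, 2) integer BEV grid coordinates; lidar2img: (4, 4) projection
    matrix for one camera (an assumed input). Returns pixel coordinates (Q, R, 2)
    and a validity mask marking projection points that land inside the image."""
    x = (xy_idx[:, 0].float() - W / 2) * S                 # metres, host-vehicle-centred
    y = (xy_idx[:, 1].float() - H / 2) * S
    z = torch.linspace(-5.0, 5.0, R)                       # R reference heights on the z-axis
    pts = torch.stack(torch.broadcast_tensors(x[:, None], y[:, None], z[None, :]), dim=-1)
    pts_h = torch.cat([pts, torch.ones_like(pts[..., :1])], dim=-1)   # homogeneous (Q, R, 4)
    cam = pts_h @ lidar2img.T                              # camera/image space (Q, R, 4)
    uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-5)      # perspective division
    valid = (cam[..., 2] > 0) & (uv[..., 0] >= 0) & (uv[..., 0] < img_w) \
            & (uv[..., 1] >= 0) & (uv[..., 1] < img_h)
    return uv, valid
```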
According to a preferred embodiment of the present disclosure, the spatial feature sampling based on the geometric correspondence between the bird's-eye view and the image view described above further includes:
taking each vertical line of the feature maps representing the image features of each acquired image as a reference line, and obtaining the polar-ray features of the bird's-eye view centered on the host vehicle, and hence the corresponding bird's-eye view query vectors (BEV Query), to perform reference-line-based spatial feature sampling.
Considering that the real-world geometric relationship cannot be fully exploited by projecting corresponding points alone, some preferred embodiments of the present disclosure also introduce a reference-line-based sampling projection scheme.
In the geometric relationship between the image view and the BEV view, each vertical-line feature of the image view corresponds, in the real world, to a host-vehicle-centered polar ray of the BEV view. The vertical lines on the image are therefore used as reference lines for reference-line feature sampling, introducing real-world geometric prior information that helps generate a more robust BEV Map.
In general, the bird's-eye view feature is generated with the host vehicle as the center point, so the host-vehicle-centered polar rays of the bird's-eye view feature cover a 360° viewing range, whereas the viewing range of an onboard camera is narrow; within the viewing range shared by an onboard camera and the host vehicle, each vertical-line feature of the image view corresponds to the same host-vehicle-centered polar ray of the BEV view.
That is, within the line-of-sight range of the corresponding camera, the vertical-line features from left to right on the image features correspond to the host-vehicle-centered polar rays from right to left on the BEV Map; see fig. 7, which is a schematic diagram of the reference-line sampling mode according to an embodiment of the present disclosure.
It is noted that, since the feature correspondingly extracted on the BEV Map is a polar-ray feature, the corresponding feature extraction range may fall between multiple BEV Queries. The corresponding feature sampling method is therefore: each vertical-line feature of the image features is kept unchanged, and its corresponding bird's-eye view feature takes, on each horizontal line of the BEV Map, the BEV Query nearest to the landing point of the ray together with the BEV Queries adjacent to it on the left and right as the corresponding features.
Therefore, with the reference-line-based feature sampling method, each vertical-line feature on each feature map corresponds to H/2 or W/2 BEV Queries on the H × W bird's-eye view feature. Whether the number is H/2 or W/2 depends on whether the corresponding polar ray runs along the horizontal or the vertical direction of the BEV Map, and the division by 2 arises because the host vehicle sits at the center point of the bird's-eye view.
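The following sketch illustrates one way such reference-line sampling could be indexed, assuming the host vehicle sits at the center cell of the H × W BEV grid and that the camera's horizontal field of view maps linearly onto image columns (left-to-right columns matching right-to-left polar rays, as described above); all names and the linear yaw mapping are illustrative assumptions, not the disclosure's exact procedure:

```python
import numpy as np

def reference_line_bev_queries(bev_h, bev_w, cam_yaw_left, cam_yaw_right, img_cols):
    """For each image column (reference line), collect the BEV grid indices lying
    on the corresponding polar ray centered on the host vehicle.

    The host vehicle is assumed to sit at the BEV center (bev_h // 2, bev_w // 2),
    and the camera's horizontal field of view [cam_yaw_left, cam_yaw_right] (radians)
    is assumed to map linearly onto the image columns.
    """
    cy, cx = bev_h // 2, bev_w // 2
    max_range = min(cy, cx)                       # ray length limited by the BEV grid
    column_to_cells = []
    for col in range(img_cols):
        frac = col / max(img_cols - 1, 1)
        # image columns left -> right correspond to BEV polar rays right -> left
        yaw = cam_yaw_right + frac * (cam_yaw_left - cam_yaw_right)
        cells = []
        for r in range(1, max_range):             # walk outward along the polar ray
            i = int(round(cy - r * np.sin(yaw)))  # nearest BEV row
            j = int(round(cx + r * np.cos(yaw)))  # nearest BEV column
            if 0 <= i < bev_h and 0 <= j < bev_w:
                # nearest cell plus its left/right neighbours on the same BEV row
                cells.extend([(i, max(j - 1, 0)), (i, j), (i, min(j + 1, bev_w - 1))])
        column_to_cells.append(cells)
    return column_to_cells
```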
The above describes the BEV view-to-image view projection sampling and reference line sampling approaches, both of which will help the BEV Map extract corresponding features from the image space.
The above-described spatial cross-attention of the bird's-eye view feature after the initialization of the current time and the sampled feature to generate the bird's-eye view feature of the current time centered on the own vehicle preferably includes:
performing sampling-point cross attention on the features obtained by projection sampling, from the BEV view to the image view, of each bird's-eye view query vector (BEV Query) in the bird's-eye view (BEV Map);
performing reference-line sampling of each polar ray centered on the host vehicle on the bird's-eye view (BEV Map), and performing reference-line cross attention; and
outputting the bird's-eye view feature at the current time with the vehicle as the center.
In some embodiments of the present disclosure, the sampled features are cross-attention fused with the BEV Map, preferably by a spatial attention module, to generate a reliable current frame BEV Map.
In some preferred embodiments of the present disclosure, the spatial attention module includes two parts: sampling-point cross attention and reference-line cross attention. For one bird's-eye view feature input (the bird's-eye view feature after initialization at the current time), projection sampling from the BEV view to the image view is first performed for each BEV Query in the bird's-eye view (BEV Map), and sampling-point cross attention is performed on the sampled features; then, reference-line sampling is performed for each polar ray centered on the host vehicle on the BEV Map, and reference-line cross attention is performed, yielding the final output of the spatial attention module. Referring to fig. 8, fig. 8 shows a schematic structural diagram of the spatial attention module of an embodiment of the present disclosure.
The sampling-point cross attention described above, performed on the features obtained by projection sampling, from the BEV view to the image view, of each bird's-eye view query vector (BEV Query) in the bird's-eye view (BEV Map), preferably includes:
obtaining m projection points of each aerial view query vector, wherein each projection point is used as a characteristic sampling point;
predicting k position offsets based on the feature of each projection point, and adding them to the coordinates of the projection point to obtain k additional feature sampling points, so that each projection point yields (k + 1) feature sampling points and each bird's-eye view query vector yields m × (k + 1) feature sampling points;
and taking the bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking the sampling points and the characteristics of the sampling points as keys (Key) and values (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query).
In a preferred embodiment of the present disclosure, the sampling-point cross attention updates each BEV Query as follows. Suppose one BEV Query projects onto m projection points on the image features. For each projection point, k offset displacements are predicted from the feature at that projection point, and each offset is added to the coordinates of the projection point to obtain k additional feature sampling points; each projection point therefore yields (k + 1) feature sampling points, and each BEV Query yields m × (k + 1) feature sampling points. Then, the BEV Query is taken as the cross attention Query, and the features of the sampling points are taken as Key and Value to perform the cross attention calculation and obtain the updated BEV Query:
BEV Query′ = softmax(Q·Kᵀ / √d) · V
wherein Q = Query · W_q, K = Key · W_k, V = Value · W_v, d is the feature dimension, and W_q, W_k, W_v are all learnable matrices. It is noted that the number of projection points each BEV Query projects onto the image features is not fixed, i.e., m is not fixed, whereas the number of offset displacements predicted for each projection point is fixed; the present disclosure sets k = 4 (adjustable).
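A minimal PyTorch-style sketch of this sampling-point cross attention for a single BEV Query is given below; it assumes bilinear sampling of the image feature map at normalized coordinates and single-head attention, and all module and variable names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingPointCrossAttention(nn.Module):
    """Sketch of the sampling-point cross attention for a single BEV Query."""

    def __init__(self, dim, k_offsets=4):
        super().__init__()
        self.k = k_offsets
        self.offset_pred = nn.Linear(dim, 2 * k_offsets)   # k (du, dv) offsets per projection point
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, bev_query, feat_map, proj_uv):
        # bev_query: (dim,)        one BEV Query vector
        # feat_map:  (dim, Hf, Wf) image feature map of one camera
        # proj_uv:   (m, 2)        valid projection points, normalized to [-1, 1]
        dim, m = bev_query.shape[0], proj_uv.shape[0]

        # features at the m projection points (bilinear sampling)
        grid = proj_uv.view(1, m, 1, 2)
        point_feat = F.grid_sample(feat_map[None], grid, align_corners=False)
        point_feat = point_feat[0, :, :, 0].t()                                   # (m, dim)

        # k predicted offsets per projection point -> k extra sampling points each
        offsets = self.offset_pred(point_feat).view(m, self.k, 2)
        sample_uv = torch.cat([proj_uv[:, None], proj_uv[:, None] + offsets], dim=1)  # (m, k+1, 2)
        grid = sample_uv.view(1, m * (self.k + 1), 1, 2)
        sample_feat = F.grid_sample(feat_map[None], grid, align_corners=False)
        sample_feat = sample_feat[0, :, :, 0].t()                                  # (m*(k+1), dim)

        # cross attention: BEV Query as Query, sampled features as Key and Value
        q = self.w_q(bev_query)                                                    # (dim,)
        k = self.w_k(sample_feat)                                                  # (m*(k+1), dim)
        v = self.w_v(sample_feat)
        attn = torch.softmax(k @ q / dim ** 0.5, dim=0)                            # (m*(k+1),)
        return attn @ v                                                            # updated BEV Query, (dim,)
```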
Preferably, performing reference-line sampling of each polar ray centered on the host vehicle on the bird's-eye view (BEV Map) and performing reference-line cross attention includes:
and taking each polar ray and the bird's-eye view Query vector (BEV Query) adjacent to the polar ray as a cross attention Query vector (Query), and taking vertical lines from left to right on the image features (feature map) as Key (Key) and Value (Value) to perform cross attention calculation so as to update the bird's-eye view Query vector (BEV Query).
In the preferred embodiment of the present disclosure, reference-line cross attention is performed after the sampling-point cross attention. Here the cross attention Query is not a single BEV Query but a polar ray together with its adjacent BEV Queries, while the Key and Value are the vertical-line features, from left to right, on the image features (feature maps); the cross attention calculation itself is the same as that of the sampling-point cross attention described above.
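A corresponding sketch of the reference-line cross attention for one polar ray, under the same single-head assumptions and with hypothetical names, might look as follows:

```python
import torch

def reference_line_cross_attention(ray_queries, column_feats, w_q, w_k, w_v):
    """Sketch of reference-line cross attention for one polar ray.

    ray_queries:  (n, dim)  BEV Queries on one polar ray plus their left/right neighbours
    column_feats: (hf, dim) features of the matching vertical reference line (one image column)
    w_q, w_k, w_v: (dim, dim) learnable projection matrices
    """
    q = ray_queries @ w_q                                            # polar-ray BEV Queries as Query
    k = column_feats @ w_k                                           # vertical-line features as Key
    v = column_feats @ w_v                                           # and as Value
    attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)     # (n, hf)
    return ray_queries + attn @ v                                    # updated BEV Queries on this ray
```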
Preferably, establishing the relationship between different channels of the bird's-eye view feature at the current time based on the separable convolutional neural network, so as to obtain the enhanced bird's-eye view feature at the current time, includes:
and keeping the channel dimension of the aerial view features unchanged, so that the number of convolution kernels of the separable convolution neural network is the same as the channel dimension of the aerial view features.
For the downstream perception tasks built on the bird's-eye view feature, different channels are generally considered to store different information; therefore, a single separable convolutional layer placed after the spatial cross attention can generate a high-quality bird's-eye view feature.
When generating the high-quality bird's-eye view feature, the channel dimension of the feature must be kept unchanged, so the number of convolution kernels equals the channel dimension, and the layer is placed after the spatial attention module.
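As a concrete (but assumed) reading of this layer, the sketch below uses a depthwise convolution whose number of kernels equals the channel dimension, followed by a 1×1 pointwise convolution that connects the channels; the class and parameter names are illustrative only:

```python
import torch.nn as nn

class BEVChannelMixer(nn.Module):
    """Separable convolution over the BEV feature that preserves its channel dimension."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # depthwise convolution: one kernel per channel (number of kernels == channel dimension)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # pointwise 1x1 convolution: establishes the relationship between different channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, bev_feat):
        # bev_feat: (batch, channels, H, W) bird's-eye view feature
        return self.pointwise(self.depthwise(bev_feat))
```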
The present disclosure also provides a bird's-eye view feature generation device (i.e., a model architecture) based on the vehicle-mounted all-around image.
Referring to fig. 9, the bird's eye view feature generation device 1000 based on the vehicle-mounted surround view image of the present disclosure includes:
the image acquisition module 1002, the image acquisition module 1002 extracts the collected images of all the vehicle-mounted cameras of the vehicle at the current moment;
the image feature extraction module 1004 is used for acquiring the image features of each acquired image at the current moment by the image feature extraction module 1004;
a time sequence prior module 1006, wherein the time sequence prior module 1006 acquires the initialized bird's-eye view feature at the current moment based on the bird's-eye view feature at the historical moment;
the sampling module 1008 is used for sampling spatial features in the image features of each acquired image based on the bird's eye view features after initialization at the current moment so as to obtain the sampled features;
a spatial cross attention module 1010, the spatial cross attention module 1010 spatially cross-pays attention to the bird's eye view feature after initialization of the current time and the sampled feature to generate a bird's eye view feature of the current time with the host vehicle as the center;
a separable convolution module 1012, the separable convolution module 1012 establishing a relationship between different channels of the bird's-eye view feature at the current time to obtain an enhanced bird's-eye view feature at the current time.
In the model architecture of the bird's-eye view feature generation device 1000 based on the vehicle-mounted surround view image according to the present disclosure, the bird's-eye view feature is generated by iterative sampling: the temporally initialized bird's-eye view feature (produced by the time-sequence prior module) first passes through the spatial attention module and the separable convolution module to give the first output.
In order to enhance the ability of the bird's-eye view feature to learn spatial features, the generated bird's-eye view feature is fed into the spatial cross attention module and the separable convolution module again, spatial feature sampling is performed again and spatial attention is recomputed; the present disclosure iterates this spatial feature sampling 6 times (adjustable).
Each time an iteration of spatial attention sampling generates the bird's-eye view feature, a multi-task loss can be computed on it, and existing task heads such as 2D/3D detection and segmentation heads can be inherited directly.
In some embodiments of the present disclosure, only the 3D object detection head of DETR3D is used in model training, with a 6-layer Decoder serving as the 3D object detection head; each Object Query predicts an object center point on the BEV Map, thereby implementing 3D object detection. The 3D object detection head is used to compute the task loss on the bird's-eye view feature generated in each iteration.
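The iterative generation described above can be sketched, under assumed module interfaces (the spatial attention, separable convolution, and detection-head objects are placeholders), roughly as follows:

```python
import torch.nn as nn

class IterativeBEVGenerator(nn.Module):
    """Sketch of the iterative BEV feature refinement with a per-iteration task loss."""

    def __init__(self, spatial_attention, separable_conv, detection_head, num_iters=6):
        super().__init__()
        self.spatial_attention = spatial_attention   # spatial cross attention module
        self.separable_conv = separable_conv         # channel-mixing separable convolution
        self.detection_head = detection_head         # e.g. a DETR3D-style 3D detection head
        self.num_iters = num_iters                   # 6 (adjustable) iterations

    def forward(self, bev_feat, image_feats, targets=None):
        losses = []
        for _ in range(self.num_iters):
            # re-sample spatial features from the image space and update the BEV feature
            bev_feat = self.spatial_attention(bev_feat, image_feats)
            bev_feat = self.separable_conv(bev_feat)
            if targets is not None:
                # a task loss is computed on the BEV feature produced by every iteration
                losses.append(self.detection_head(bev_feat, targets))
        total_loss = sum(losses) if losses else None
        return bev_feat, total_loss
```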
The bird's-eye view feature generation device 1000 based on the vehicle-mounted surround view image of the present disclosure may be implemented based on a computer software architecture.
Fig. 9 is a block diagram schematically illustrating a structure of a bird's eye view feature generation device 1000 based on an on-vehicle surround view image, which is implemented by hardware using a processing system according to an embodiment of the present disclosure.
The device 1000 for generating the bird's eye view feature based on the vehicle-mounted all-around view image may include corresponding modules for executing each or several steps in the above-mentioned flowcharts. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
The hardware architecture may be implemented with a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 1100 couples various circuits including the one or more processors 1200, the memory 1300, and/or the hardware modules together. The bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one connection line is shown, but this does not indicate that there is only one bus or one type of bus.
Any process or method descriptions in flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Alternative implementations, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, are also included within the scope of the preferred embodiments of the present disclosure, as would be understood by those reasonably skilled in the art. The processor performs the various methods and processes described above. For example, method embodiments of the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via the memory and/or a communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
The present disclosure also provides an electronic device, comprising: a memory storing execution instructions; and the processor executes the execution instructions stored in the memory, so that the processor executes the bird's eye view feature generation method based on the vehicle-mounted all-around image according to any one of the embodiments of the disclosure.
The present disclosure also provides a readable storage medium, in which execution instructions are stored, and the execution instructions are executed by a processor to implement the method for generating the bird's-eye view feature based on the vehicle-mounted all-around image according to any one of the embodiments of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the method for generating a bird's eye view feature based on a vehicle-mounted surround view image according to any of the embodiments of the present disclosure.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in the memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the method implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a readable storage medium, and when executed, the program may include one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
In the description herein, reference to the description of the terms "one embodiment/implementation," "some embodiments/implementations," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, the schematic representations of the terms described above are not necessarily the same embodiment/mode or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. Furthermore, the various embodiments/aspects or examples and features of the various embodiments/aspects or examples described in this specification can be combined and combined by one skilled in the art without conflicting therewith.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may be made to those skilled in the art, based on the above disclosure, and still be within the scope of the present disclosure.

Claims (10)

1. A bird's-eye view image feature generation method based on a vehicle-mounted panoramic image is characterized by comprising the following steps:
extracting collected images of all vehicle-mounted cameras of the vehicle at the current moment;
acquiring image characteristics of each acquired image at the current moment;
acquiring the aerial view characteristic after the initialization of the current moment based on the historical moment aerial view characteristic;
performing spatial feature sampling in image features of each acquired image based on the bird's eye view feature after the initialization at the current time to obtain sampled features;
performing spatial cross attention on the bird's-eye view feature after initialization of the current moment and the sampled feature to generate a bird's-eye view feature of the current moment with the vehicle as the center; and
establishing a relationship between different channels of the bird's-eye view feature at the current time based on a separable convolutional neural network to obtain an enhanced bird's-eye view feature at the current time.
2. The method for generating the bird's eye view feature based on the vehicle-mounted all-around image according to claim 1, wherein the step of obtaining the image feature of each acquired image at the current time comprises the steps of:
and performing feature extraction on each acquired image at the current moment based on the convolutional neural network to obtain the image features of the acquired images of each vehicle-mounted camera at the current moment.
3. The method according to claim 2, wherein the convolutional neural network comprises a standard ResNet-101 backbone network, an FPN network and an additional convolutional layer;
the ResNet-101 backbone network is a four-stage ResNet-101 backbone network pre-trained by ImageNet.
4. The method according to claim 1, wherein acquiring the bird's-eye view feature after initialization of the current time based on the bird's-eye view feature at the historical time comprises:
randomly initializing a learnable aerial view feature taking the vehicle as a center as an aerial view feature at the current moment;
adding a position code into the learnable aerial view feature to obtain a position-coded aerial view of the current frame;
extracting a historical frame aerial view (BEV Map) within a preset time length before the current time;
acquiring the relative pose of the aerial view of the current frame after position coding and the aerial view of the historical frame;
searching corresponding historical frame aerial view query vectors in the historical frame aerial view as time sequence prior information based on the relative pose for the aerial view query vectors of each position point of the aerial view after the position of the current frame is encoded; and
and adding time sequence prior information to the bird's-eye view after the position coding of the current frame to obtain the bird's-eye view characteristics after the initialization of the current time.
5. The method according to claim 4, wherein extracting the bird's-eye view (BEV Map) of the historical frames within a preset time period before the current time comprises:
and extracting one frame of historical frame aerial view or extracting more than two frames of historical frame aerial view.
6. The bird's-eye view feature generation method based on the vehicle-mounted all-around image according to claim 5, characterized in that if more than two frames of the historical frame bird's-eye views are extracted, the time sequence prior information obtained from the historical frame bird's-eye views of each frame is weighted and summed to obtain the final time sequence prior information;
preferably, the spatial feature sampling is performed in the image features of each of the acquired images based on the bird's eye view features after the initialization at the current time to obtain sampled features, including:
sampling the spatial features based on the geometric corresponding relation between the bird's-eye view visual angle and the image visual angle;
preferably, the spatial feature sampling is performed based on a geometric correspondence between the bird's-eye view angle and the image angle, and includes:
taking each position point of the initialized aerial view at the current moment as an aerial view Query vector (BEV Query) so as to perform spatial feature sampling of a feature map of an acquired image based on geometric projection to each vehicle-mounted camera;
preferably, the spatial feature sampling is performed based on a geometric correspondence between the bird's-eye view angle and the image angle, and includes:
taking each vertical line of a feature map which is used for representing image features of each collected image as a reference line, and acquiring polar ray features of the aerial view by taking the own vehicle as a center so as to obtain an aerial view Query vector (BEV Query) for spatial feature sampling based on the reference line;
preferably, the spatially intersecting the bird's eye view feature after initialization of the current time with the sampled feature to generate the bird's eye view feature of the current time centered on the host vehicle includes:
performing sampling point cross attention on the characteristics sampled by performing projection sampling from a bird's-eye view to an image view on each bird's-eye view Query vector (BEV Query) in the bird's-eye view Map (BEV Map);
sampling a reference line of each polar ray on a bird's-eye view Map (BEV Map) by taking the vehicle as a center, and performing reference line cross attention; and
outputting the bird's-eye view characteristic of the current moment with the vehicle as the center;
preferably, the sampling point cross attention is given to a feature sampled by projection sampling of a bird's-eye view to an image view based on each bird's-eye view Query vector (BEV Query) in the bird's-eye view Map (BEV Map), including:
obtaining m projection points of each bird's-eye view query vector, wherein each projection point is used as a characteristic sampling point;
predicting k position offsets based on the feature of each projection point, and adding them to the coordinates of the projection point to obtain k additional feature sampling points, so that each projection point yields (k + 1) feature sampling points and each bird's-eye view query vector yields m × (k + 1) feature sampling points; and
taking a bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking a sampling point and characteristics of the sampling point as a Key (Key) and a Value (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query);
preferably, sampling a reference line for each polar line centered on the host vehicle on the bird's eye view Map (BEV Map), performing reference line cross attention, comprising:
taking each polar ray and the bird's-eye view Query vector (BEV Query) adjacent to the polar ray as a cross attention Query vector (Query), and taking the vertical-line features from left to right on the image features (feature maps) as keys (Key) and values (Value) to perform cross attention calculation so as to update the bird's-eye view Query vector (BEV Query);
preferably, the establishing of the relationship between different channels of the bird's-eye view feature at the current time based on the separable convolutional neural network to obtain the enhanced bird's-eye view feature at the current time comprises:
and keeping the channel dimension of the aerial view features unchanged, so that the number of convolution kernels of the separable convolution neural network is the same as the channel dimension of the aerial view features.
7. A bird's-eye view feature generation device based on a vehicle-mounted panoramic image is characterized by comprising:
the vehicle-mounted camera comprises an image acquisition module, a display module and a control module, wherein the image acquisition module extracts acquired images of all vehicle-mounted cameras of a vehicle at the current moment;
the image feature extraction module acquires the image features of each acquired image at the current moment;
the time sequence prior module is used for acquiring the aerial view characteristics of the current moment after initialization based on the aerial view characteristics of the historical moment;
a sampling module that performs spatial feature sampling in image features of each of the acquired images based on the bird's-eye view feature after the initialization at the present time to obtain sampled features;
a spatial cross attention module that performs spatial cross attention on the bird's-eye view feature after initialization at the current time and the sampled feature to generate a bird's-eye view feature at the current time with the host vehicle as a center; and
a separable convolution module that establishes a relationship between different channels of the bird's-eye view feature at the current time to obtain an enhanced bird's-eye view feature at the current time.
8. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing the execution instructions stored in the memory to cause the processor to execute the method for generating the bird's eye view feature based on the vehicle-mounted all-around image according to any one of claims 1 to 6.
9. A readable storage medium, wherein the readable storage medium stores therein execution instructions, and the execution instructions are executed by a processor to implement the method for generating the bird's eye view feature based on the vehicle-mounted all-around image according to any one of claims 1 to 6.
10. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the bird's eye view feature generation method based on vehicle-mounted panoramic images of any of claims 1 to 6.
CN202211290357.9A 2022-10-21 2022-10-21 Aerial view characteristic generation method based on vehicle-mounted all-around image Pending CN115588175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211290357.9A CN115588175A (en) 2022-10-21 2022-10-21 Aerial view characteristic generation method based on vehicle-mounted all-around image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211290357.9A CN115588175A (en) 2022-10-21 2022-10-21 Aerial view characteristic generation method based on vehicle-mounted all-around image

Publications (1)

Publication Number Publication Date
CN115588175A true CN115588175A (en) 2023-01-10

Family

ID=84780230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211290357.9A Pending CN115588175A (en) 2022-10-21 2022-10-21 Aerial view characteristic generation method based on vehicle-mounted all-around image

Country Status (1)

Country Link
CN (1) CN115588175A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880555A (en) * 2023-02-07 2023-03-31 北京百度网讯科技有限公司 Target detection method, model training method, device, equipment and medium
CN115965944A (en) * 2023-03-09 2023-04-14 安徽蔚来智驾科技有限公司 Target information detection method, device, driving device, and medium
CN115965944B (en) * 2023-03-09 2023-05-09 安徽蔚来智驾科技有限公司 Target information detection method, device, driving device and medium
CN116012805A (en) * 2023-03-24 2023-04-25 深圳佑驾创新科技有限公司 Object perception method, apparatus, computer device, storage medium, and program product
CN116012805B (en) * 2023-03-24 2023-08-29 深圳佑驾创新科技有限公司 Target perception method, device, computer equipment and storage medium
CN116259042A (en) * 2023-03-31 2023-06-13 斯润天朗(北京)科技有限公司 Method and device for detecting circular image parking space based on image attention

Similar Documents

Publication Publication Date Title
CN115588175A (en) Aerial view characteristic generation method based on vehicle-mounted all-around image
Shin et al. Roarnet: A robust 3d object detection based on region approximation refinement
Won et al. Sweepnet: Wide-baseline omnidirectional depth estimation
US11064178B2 (en) Deep virtual stereo odometry
CN109215067A (en) High-resolution 3-D point cloud is generated based on CNN and CRF model
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN115578705A (en) Aerial view feature generation method based on multi-modal fusion
CN112734765A (en) Mobile robot positioning method, system and medium based on example segmentation and multi-sensor fusion
CN115619928A (en) Training method for three-dimensional scene reconstruction device of multi-camera system
US20230099521A1 (en) 3d map and method for generating a 3d map via temporal and unified panoptic segmentation
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
Shi et al. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review
CN113012191B (en) Laser mileage calculation method based on point cloud multi-view projection graph
Pfeiffer The stixel world
CN115131407B (en) Robot target tracking method, device and equipment oriented to digital simulation environment
Rother et al. Seeing 3D objects in a single 2D image
CN115933652A (en) Lunar vehicle direct-drive teleoperation driving method based on sequence image splicing and fusion
CN115359067A (en) Continuous convolution network-based point-by-point fusion point cloud semantic segmentation method
WO2022078828A1 (en) Method for determining a motion model of an object in the surroundings of a motor vehicle, computer program product, computer-readable storage medium, as well as assistance system
US20200202140A1 (en) Method and device for evaluating images, operating assistance method, and operating device
CN117765226B (en) Track prediction method, track prediction device and storage medium
CN115619958B (en) Target aerial view generation method and device, electronic device and storage medium
EP4287141A1 (en) Sensor virtualization
CN116740681B (en) Target detection method, device, vehicle and storage medium
Christie et al. Training object detectors with synthetic data for autonomous uav sampling applications

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination