CN115578705A - Aerial view feature generation method based on multi-modal fusion - Google Patents

Aerial view feature generation method based on multi-modal fusion

Info

Publication number
CN115578705A
CN115578705A (application CN202211290367.2A)
Authority
CN
China
Prior art keywords
feature
modal
bird
eye view
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211290367.2A
Other languages
Chinese (zh)
Inventor
缪文俊
李雪
陈禹行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yihang Yuanzhi Technology Co Ltd filed Critical Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202211290367.2A
Publication of CN115578705A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 - Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention provides a bird's-eye view feature generation method based on multi-modal fusion, which comprises the following steps: extracting the collected information of all vehicle-mounted sensors of the vehicle at the current moment so as to obtain at least first-modality collected information and second-modality collected information; extracting a first modal feature from the first-modality collected information with a first modal feature extraction network, and a second modal feature from the second-modality collected information with a second modal feature extraction network; obtaining the initialized bird's-eye view feature of the current moment based on the bird's-eye view features of historical moments; obtaining a bird's-eye view feature having first modal feature information based on the first modal feature and the initialized bird's-eye view feature of the current moment; obtaining a bird's-eye view feature having second modal feature information based on the bird's-eye view feature having the first modal feature information and the second modal feature; and generating a bird's-eye view feature with multi-modal fusion information. The present disclosure also provides an electronic device, a readable storage medium, and a program product.

Description

Aerial view feature generation method based on multi-modal fusion
Technical Field
The present disclosure relates to the field of automotive technologies and computer vision technologies, and in particular, to a bird's-eye view feature generation method and apparatus based on multi-modal fusion, an electronic device, a storage medium, and a program product.
Background
A Bird's Eye View (BEV) is a top-down representation. It allows downstream tasks such as 3D object detection and lane line segmentation to be built on a unified feature representation, improving the reusability of a model; it facilitates moving from post-sensor (late) fusion to pre-sensor (early) fusion, so that the fusion process is driven entirely by data; it represents the spatial relationships among objects in the scene well, which benefits prediction and planning; and it is well compatible with existing 2D object detection methods, which eases deployment.
Using bird's-eye view features as a unified feature representation for downstream tasks is an important new technical means in autonomous-driving perception and plays an important role in the spatial perception of automatic driving. In existing bird's-eye view feature generation work, the effectiveness of multi-modal BEV generation has been strongly demonstrated: on public datasets such as nuScenes and Waymo, the accuracy of multi-modal perception methods is far higher than that of single-modal methods. However, most existing methods fuse multi-modal data directly and do not generate bird's-eye view features to build a unified feature representation; some recent methods do focus on generating bird's-eye view features, but they can hardly accommodate inputs of different modalities and can only be built for a preset set of specific modal inputs, which greatly limits their compatibility and effectiveness in practical applications.
The quality of the generated bird's-eye view features directly influences the perception accuracy of downstream tasks, and in turn the subsequent prediction and planning. A bird's-eye view multi-modal fusion generation method based on extensible modalities also faces the problems of how to effectively mine the feature information of each modality and how to make information from different modalities interact and fuse effectively.
Some technical solutions in the prior art are as follows.
Technical solution 1: FUTR3D (A Unified Sensor Fusion Framework for 3D Detection) uses the 3D reference-point idea of DETR3D and, for the first time, exploits information of different modalities for 3D object detection within a unified framework, realizing the idea of extensible modal feature fusion. Similar to DETR3D, it builds feature spaces for the different modalities with different backbone networks, samples in those feature spaces with 3D reference points, iteratively predicts the three-dimensional center point of an object and regresses an object box with a Transformer decoder, and re-projects the predicted center point back into the high-dimensional feature representations of the different modalities to extract the corresponding features for the next iteration. This solution offers an effective idea for fusing information from arbitrary modalities, namely iteratively extracting information from the feature spaces of different modalities through a 3D reference-point query mechanism; however, it does not build bird's-eye view features as a unified feature representation, cannot unify the downstream perception tasks of autonomous driving well, and does not construct an effective multi-modal feature fusion mode, so it is difficult for it to fully exploit multi-modal information fusion.
Technical solution 2: a Transformer is used to fuse image and lidar information. An image-guided query initialization strategy applies image guidance to the BEV (bird's-eye view) features constructed from the lidar, helping to generate object queries that are hard to detect in sparse lidar point clouds; a local inductive bias is then introduced to help the network focus on the relevant image regions. The image is used as a guiding query to optimize the lidar BEV features, and the lidar features are then used to reinforce the image information in an iterative manner, which avoids having to search for the lidar points corresponding to image features while fusing the two kinds of features well. The problem with this method is that the constructed bird's-eye view features are only used to help the image features fuse the lidar features better; a Query mechanism is still ultimately used to iteratively regress the target boxes, and no unified bird's-eye view feature representation is constructed. Moreover, the image-guided query initialization strategy requires the input modalities to be determined in advance, so truly extensible arbitrary modalities are not achieved.
Technical scheme 3: the first step of changing Object Query to BEV Query is to initialize a Bird's-Eye-View image via spatial adaptive transforms (BEV Map) with a fixed size, project each position point on the BEV Map as BEV Query to a 2D feature Map extracted from CNN Backbone to extract features, and provide dense feature Representation by introducing the Bird's-Eye View. And secondly, introducing a time sequence attention module, performing cross attention mixing by using the BEV Map at the previous moment and the BEV Map at the current moment through the speed and the pose change of the self-vehicle, and introducing time sequence information. However, the BEVFormer only utilizes information of an image modality, does not well integrate laser radar features and millimeter wave radar features, and lacks fusion interaction of multi-modality information, so that although a good effect is achieved in the field of only image input, compared with the existing high-performance multi-modality method, the accuracy of the BEVFormer has a small difference.
Technical solution 1 constructs a unified framework to effectively use information of different modalities and realizes the use of arbitrarily extensible modal information, but it neither constructs a unified bird's-eye view feature representation nor an efficient multi-modal information interaction mode.
Technical solution 2 proposes an image-guided Query initialization strategy and a multi-modal iterative information interaction strategy that fuses multi-modal information well, but it does not construct a unified bird's-eye view feature representation and only uses the bird's-eye view features to assist multi-modal fusion; furthermore, because the image-guided Query initialization strategy requires the input modality to be determined in advance and the guiding Query is initialized from the multi-modal features, truly optional and extensible modalities are not achieved.
Technical solution 3 generates the BEV Map through a Query mechanism, realizing bird's-eye view feature generation based on a Query mechanism for the first time and constructing a unified bird's-eye view feature representation, but it does not enhance the model with multi-modal information, so its accuracy is inferior to models built on multi-modal inputs.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a method, an apparatus, an electronic device, a storage medium, and a program product for generating an aerial view feature based on multi-modal fusion.
According to one aspect of the disclosure, a bird's-eye view feature generation method based on multi-modal fusion is provided, and includes:
extracting the acquisition information of all vehicle-mounted sensors of the vehicle at the current moment, wherein the vehicle-mounted sensors at least comprise a first modal sensor and a second modal sensor so as to at least obtain first modal acquisition information and second modal acquisition information;
extracting a first modal feature of the first modal acquisition information by using a first modal feature extraction network, and extracting a second modal feature of the second modal acquisition information by using a second modal feature extraction network;
acquiring the aerial view characteristic after the initialization of the current moment based on the historical moment aerial view characteristic;
obtaining a bird's-eye view feature having first modal feature information based on the first modal feature and the bird's-eye view feature after initialization of the current time;
acquiring a bird's-eye view characteristic with second modal characteristic information based on the bird's-eye view characteristic with the first modal characteristic information and the second modal characteristic;
and accessing, from the bird's-eye view feature (BEV Map) having the first modal feature information and in the form of a bird's-eye view query vector (BEV Query), the bird's-eye view feature having the second modal feature information, and fusing the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information, thereby generating a bird's-eye view feature (BEV Map) having multi-modal fusion information.
According to the bird's-eye view feature generation method based on multi-modal fusion of at least one embodiment of the present disclosure, the feature extraction network of each modality performs feature extraction on the collected information of all sensors of the modality to obtain the feature space of the modality, and the feature extraction networks of different modalities are independent of each other.
According to the bird's-eye view feature generation method based on multi-modal fusion of at least one embodiment of the present disclosure, the first modal sensor is a vehicle-mounted camera, and the vehicle-mounted sensors further include a third modal sensor for obtaining third modal information; a third modal feature of the third-modality collected information is further extracted with a third modal feature extraction network; a bird's-eye view feature having third modal feature information is further obtained based on the bird's-eye view feature having the first modal feature information and the third modal feature; and the bird's-eye view feature (BEV Map) having the first modal feature information further performs the access in the form of a bird's-eye view query vector (BEV Query), so that the second modal feature information and the third modal feature information are fused into the bird's-eye view feature (BEV Map) having the first modal feature information, generating a bird's-eye view feature (BEV Map) having multi-modal fusion information.
According to the method for generating the aerial view characteristics based on the multi-modal fusion, the aerial view characteristics after the current time is initialized are acquired based on the aerial view characteristics at the historical time, and the method comprises the following steps:
randomly initializing a learnable aerial view feature taking the vehicle as a center as an aerial view feature at the current moment;
adding a position code in the learnable bird's-eye view feature to obtain a position-coded bird's-eye view of the current frame;
extracting a historical frame aerial view (BEV Map) within a preset time length before the current time;
acquiring the relative pose of the aerial view of the current frame after position coding and the aerial view of the historical frame;
searching corresponding historical frame aerial view query vectors in the historical frame aerial view as time sequence prior information based on the relative pose for the aerial view query vectors of each position point of the aerial view after the position of the current frame is encoded;
and adding time sequence prior information to the bird's-eye view after the position coding of the current frame to obtain the bird's-eye view characteristics after the initialization of the current time.
According to the aerial view feature generation method based on multi-modal fusion, the first modal feature is an image feature;
obtaining a bird's-eye view feature having first modal feature information based on the first modal feature and the bird's-eye view feature after initialization of the current time, including:
and carrying out spatial feature sampling on the first modal feature based on the geometric corresponding relation between the bird's-eye view visual angle and the image visual angle so as to obtain the bird's-eye view feature with the first modal feature information.
According to the aerial view feature generation method based on multi-modal fusion, the second modal feature is a BEV feature generated by a laser radar point cloud and obtained through a second modal feature extraction network;
obtaining a bird's-eye view feature having second modal feature information based on the bird's-eye view feature having first modal feature information and the second modal feature, comprising:
copying a bird's-eye view characteristic with first modal characteristic information as a bird's-eye view characteristic (BEV Map) for acquiring the modal information of the laser radar;
finding a corresponding feature in a second modal feature by the aerial view feature (BEV Map) of the acquired laser radar modal information in a projection mode;
and fusing the laser radar characteristic information which is the corresponding characteristic found in the second modal characteristic to a bird's-eye view characteristic (BEV Map) of the collected laser radar modal information to generate the bird's-eye view characteristic (BEV Map) with the laser radar modal characteristic information.
According to the bird's-eye view feature generation method based on multi-modal fusion of at least one embodiment of the present disclosure, the method for generating a bird's-eye view feature (BEV Map) having laser radar modal feature information by fusing laser radar feature information, which is the corresponding feature found in the second modal feature, to the bird's-eye view feature (BEV Map) of the collected laser radar modal information includes:
predicting k offset displacements of the projection point positions of the corresponding features through an MLP network, and adding each offset displacement to the projection point positions to obtain additional k feature sampling points, so that k +1 feature sampling points are obtained by each aerial view Query vector (BEV Query) of the aerial view features (BEV Map) of the collected laser radar modal information;
and taking the bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking the sampling points and the characteristics of the sampling points as keys (Key) and values (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query).
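To make the offset-sampling and cross-attention step above concrete, here is a minimal sketch, assuming 2D offsets predicted by a small MLP, projection coordinates already normalized to [-1, 1], and a grid_sample-based sampler; the module and argument names and the value of k are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch of offset sampling followed by cross attention (assumed shapes and modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSamplingCrossAttention(nn.Module):
    def __init__(self, dim=256, k=4, heads=8):
        super().__init__()
        self.k = k
        self.offset_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2 * k))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev_query, proj_xy, modal_feat):
        """
        bev_query:  (B, N, C)    BEV Queries of the copied BEV Map
        proj_xy:    (B, N, 2)    projection point of each query in the modal feature map, in [-1, 1]
        modal_feat: (B, C, H, W) second-modality (e.g. lidar) BEV feature map
        """
        B, N, C = bev_query.shape
        # k offsets per query, expressed directly in normalized grid units (an assumption)
        offsets = self.offset_mlp(bev_query).view(B, N, self.k, 2)
        sample_xy = torch.cat([proj_xy.unsqueeze(2), proj_xy.unsqueeze(2) + offsets], dim=2)  # k+1 points
        sampled = F.grid_sample(modal_feat, sample_xy, align_corners=False)   # (B, C, N, k+1)
        sampled = sampled.permute(0, 2, 3, 1).reshape(B * N, self.k + 1, C)
        q = bev_query.reshape(B * N, 1, C)
        out, _ = self.attn(q, sampled, sampled)   # Query = BEV Query, Key/Value = sampled features
        return out.view(B, N, C)                  # updated BEV Queries
```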
According to the aerial view feature generation method based on multi-modal fusion, the second modal feature is a millimeter wave radar point cloud feature obtained through a second modal feature extraction network;
obtaining a bird's-eye view feature having second modal feature information based on the bird's-eye view feature having first modal feature information and the second modal feature, comprising:
copying a bird's-eye view feature with first modal feature information as a bird's-eye view feature (BEV Map) for acquiring millimeter wave radar modal information;
finding a corresponding feature in a second modal feature by the aerial view feature (BEV Map) of the acquired millimeter wave radar modal information in a projection mode;
and fusing millimeter wave radar characteristic information which is the corresponding characteristic found in the second modal characteristic to a bird's-eye view characteristic (BEV Map) of the collected millimeter wave radar modal information to generate the bird's-eye view characteristic (BEV Map) with the millimeter wave radar modal characteristic information.
According to the bird's-eye view feature generation method based on multi-modal fusion of at least one embodiment of the present disclosure, the millimeter wave radar feature information, which is the corresponding feature found in the second modal feature, is fused to the bird's-eye view feature (BEV Map) of the collected millimeter wave radar modal information to generate the bird's-eye view feature (BEV Map) having the millimeter wave radar modal feature information, including:
predicting K offset displacements at the projection point positions of the corresponding features through an MLP network, and adding each offset displacement to the projection point position to obtain K additional feature sampling points, so that K +1 feature sampling points are obtained by each bird's-eye view Query vector (BEV Query) of the bird's-eye view features (BEV Map) for acquiring the millimeter wave radar modal information;
and taking the bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking the sampling points and the characteristics of the sampling points as keys (Key) and values (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query).
According to the bird's-eye view feature generation method based on multi-mode fusion, the third mode is a laser radar mode or a millimeter wave radar mode.
According to at least one embodiment of the present disclosure, a method for generating a bird's-eye view feature based on multi-modal fusion, in which a bird's-eye view feature (BEV Map) having first modal feature information is accessed as a bird's-eye view Query vector (BEV Query) to a bird's-eye view feature having second modal feature information, and the second modal feature information is fused to the bird's-eye view feature (BEV Map) having the first modal feature information, generates a bird's-eye view feature (BEV Map) having multi-modal fusion information, includes:
using the BEV Query with the bird's-eye view feature (BEV Map) of the first modal feature information as a cross-attention Query, using the BEV Query with the bird's-eye view feature (BEV Map) of the second modal feature information as a cross-attention Key and Value, and for each BEV Query with the bird's-eye view feature (BEV Map) of the first modal feature information, finding a BEV Query corresponding to the bird's-eye view feature (BEV Map) of the second modal feature information;
predicting k offset displacements from the corresponding BEV Query, and adding each offset to the coordinates of the corresponding BEV Query to obtain k additional BEV Queries on the bird's-eye view feature (BEV Map) having the second modal feature information, so that each BEV Query of the bird's-eye view feature (BEV Map) having the first modal feature information obtains k+1 corresponding BEV Queries on the bird's-eye view feature (BEV Map) having the second modal feature information;
and performing cross attention calculation by taking the BEV Query with the bird's-eye view feature (BEV Map) of the first mode feature information as the cross attention Query and taking the BEV Query corresponding to the bird's-eye view feature (BEV Map) of the second mode feature information as Key and Value to obtain the BEV Query which is finally output, namely the BEV Query with the bird's-eye view feature (BEV Map) of the multi-mode fusion information.
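The BEV-to-BEV fusion described above can be pictured with the following hedged sketch: both BEV Maps are assumed to share the same H × W grid, the "corresponding" BEV Query is taken at the same grid location, and the k offset positions are rounded to grid indices. The helper names and the index arithmetic are assumptions for illustration.

```python
# Hedged sketch: each BEV Query of the image-modality BEV Map attends to the co-located
# BEV Query of another modality's BEV Map plus k offset neighbors.
import torch
import torch.nn as nn

def fuse_bev_maps(bev_img, bev_other, offset_mlp, attn, H=200, W=200, k=4):
    # bev_img, bev_other: (H*W, C) BEV Queries of two modality-specific BEV Maps on the same grid
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).view(-1, 2).float()          # grid coordinate of each query
    offsets = offset_mlp(bev_other).view(-1, k, 2)                    # k offsets from the corresponding query
    coords = torch.cat([base.unsqueeze(1), base.unsqueeze(1) + offsets], dim=1)  # (H*W, k+1, 2)
    coords = coords.round().long()
    coords[..., 0].clamp_(0, W - 1)                                   # keep offset points on the BEV grid
    coords[..., 1].clamp_(0, H - 1)
    idx = coords[..., 1] * W + coords[..., 0]                         # flatten to BEV Query indices
    kv = bev_other[idx]                                               # (H*W, k+1, C) Keys / Values
    q = bev_img.unsqueeze(1)                                          # (H*W, 1, C)  Queries
    out, _ = attn(q, kv, kv)                                          # cross attention
    return out.squeeze(1)                                             # fused BEV Queries
```

Here offset_mlp is assumed to map C channels to 2·k offset values, and attn to be an nn.MultiheadAttention(C, num_heads, batch_first=True) instance.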
The bird's-eye view feature generation method based on multi-modal fusion according to at least one embodiment of the present disclosure further includes:
the bird's eye view features (BEV Map) of the multi-modal fusion information are input to the FFN and separable convolution module to output enhanced bird's eye view features (BEV Map) of the multi-modal fusion information at the current time.
According to another aspect of the present disclosure, there is provided a bird's-eye view feature generation device based on multi-modal fusion, including:
the vehicle-mounted sensor acquisition information acquisition module extracts acquisition information of all vehicle-mounted sensors of the vehicle at the current moment, wherein the vehicle-mounted sensors at least comprise a first modal sensor and a second modal sensor so as to at least acquire first modal acquisition information and second modal acquisition information;
a first modality feature extraction network that extracts a first modality feature of the first modality acquisition information;
a second modality feature extraction network that extracts a second modality feature of the second modality acquisition information;
a temporal attention module that acquires, based on the historical time aerial view feature, an aerial view feature after initialization at a current time;
a first modal aerial view feature acquisition module that acquires an aerial view feature having first modal feature information based on the first modal feature and the aerial view feature after initialization of the current time;
a second modal aerial view feature acquisition module that acquires an aerial view feature having second modal feature information based on the aerial view feature having first modal feature information and the second modal feature;
and a multi-modal information fusion module that accesses, from the bird's-eye view feature (BEV Map) having the first modal feature information and in the form of a bird's-eye view query vector (BEV Query), the bird's-eye view feature (BEV Map) having the second modal feature information, and fuses the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information to generate the bird's-eye view feature (BEV Map) having the multi-modal fusion information.
The bird's-eye view feature generation device based on multi-modal fusion according to at least one embodiment of the present disclosure further includes:
an FFN and separable convolution module that processes a bird's eye view feature (BEV Map) of the multi-modal fusion information to output an enhanced bird's eye view feature (BEV Map) of the multi-modal fusion information at a current time.
According to still another aspect of the present disclosure, there is provided an electronic device including: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory to enable the processor to execute the aerial view feature generation method based on multi-modal fusion according to any one of the embodiments of the present disclosure.
According to still another aspect of the present disclosure, a readable storage medium is provided, wherein an execution instruction is stored in the readable storage medium, and the execution instruction is executed by a processor to implement the method for generating the aerial view feature based on multi-modal fusion according to any one of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program/instructions which, when executed by a processor, implements the method for generating a bird's eye view feature based on multi-modal fusion of any of the embodiments of the present disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a bird's-eye view feature generation method based on multi-modal fusion according to an embodiment of the disclosure.
Fig. 2 is a flowchart of a bird's-eye view feature generation method based on multi-modal fusion according to another embodiment of the disclosure.
Fig. 3 is a schematic diagram of an overall model framework for implementing the bird's-eye view feature generation method based on multi-modal fusion according to an embodiment of the present disclosure.
Fig. 4 is a schematic flowchart of acquiring the bird's-eye view feature after initialization at the current time based on the bird's-eye view feature at the historical time in the bird's-eye view feature generation method based on multi-modal fusion according to the embodiment of the disclosure.
Fig. 5 is a schematic diagram of the same real-world position correspondence of a current frame BEV Query in a history frame according to an embodiment of the disclosure.
Fig. 6 is a projection view of BEV view and image view position for one embodiment of the present disclosure.
FIG. 7 is a schematic diagram of a reference line sampling mode of one embodiment of the present disclosure.
Fig. 8 shows a flowchart for obtaining a bird's eye view feature having second modal characteristic information based on the bird's eye view feature having first modal characteristic information and the second modal characteristic according to an embodiment of the present disclosure.
Figure 9 illustrates a flowchart for obtaining a bird's eye view feature having second modal characteristic information based on a bird's eye view feature having first modal characteristic information and the second modal characteristic of yet another embodiment of the present disclosure.
Fig. 10 is a flowchart of a bird's-eye view feature generation method based on multi-modal fusion according to still another embodiment of the disclosure.
Fig. 11 to 12 are schematic block diagrams of structures of a bird's-eye view feature generation device based on multi-modal fusion, which adopts a hardware implementation of a processing system according to an embodiment of the present disclosure.
Description of the reference numerals
1000. Bird's-eye view feature generation device
1002. Vehicle-mounted sensor acquisition information acquisition module
1004. First modality feature extraction network
1006. Second modality feature extraction network
1008. Time attention module
1010. First-mode aerial view feature acquisition module
1012. Second-mode aerial view characteristic acquisition module
1014. Multi-modal information fusion module
1016 FFN and separable convolution module
1100. Bus line
1200. Processor
1300. Memory device
1400. Other circuits.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The use of cross-hatching and/or shading in the drawings is generally used to clarify the boundaries between adjacent components. As such, unless otherwise noted, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for a particular material, material property, size, proportion, commonality between the illustrated components and/or any other characteristic, attribute, property, etc., of a component. Further, in the drawings, the size and relative sizes of components may be exaggerated for clarity and/or descriptive purposes. While example embodiments may be practiced differently, the specific process sequence may be performed in a different order than that described. For example, two processes described consecutively may be performed substantially simultaneously or in reverse order to that described. In addition, like reference numerals denote like parts.
When an element is referred to as being "on" or "over," "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present. For purposes of this disclosure, the term "connected" may refer to physically, electrically, etc., and may or may not have intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, the presence of stated features, integers, steps, operations, elements, components and/or groups thereof are stated but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximate terms and not as degree terms, and as such, are used to interpret inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
The bird's-eye view feature generation method of the present disclosure is a multi-modal fusion generation method for the bird's-eye view that supports extensible modal inputs: it mainly takes camera-modality input and additionally supports, as extensible modal inputs, lidar sensors, millimeter-wave radar sensors and the like of arbitrary different models, in order to generate high-quality bird's-eye view features.
A bird's-eye view generating method, a device, and the like based on multi-modal fusion according to the present disclosure will be described in detail below with reference to fig. 1 to 12.
Fig. 1 is a flowchart of a bird's eye view feature generation method based on multi-modal fusion according to an embodiment of the present disclosure.
Referring to fig. 1, a bird's-eye view feature generation method S100 based on multi-modal fusion of the present disclosure includes:
S102, extracting the collected information of all vehicle-mounted sensors of the vehicle at the current moment, wherein the vehicle-mounted sensors at least comprise a first modal sensor (an image sensor, namely a vehicle-mounted camera) and a second modal sensor, so as to obtain at least first-modality collected information and second-modality collected information;
S104, extracting a first modal feature of the first-modality collected information with a first modal feature extraction network, and a second modal feature of the second-modality collected information with a second modal feature extraction network;
S106, obtaining the initialized bird's-eye view feature of the current moment based on the bird's-eye view features of historical moments (temporal attention module);
S108, obtaining a bird's-eye view feature having first modal feature information based on the first modal feature and the initialized bird's-eye view feature of the current moment;
S110, obtaining a bird's-eye view feature having second modal feature information based on the bird's-eye view feature having the first modal feature information and the second modal feature;
and S112, accessing, from the bird's-eye view feature (BEV Map) having the first modal feature information and in the form of a bird's-eye view query vector (BEV Query), the bird's-eye view feature (BEV Map) having the second modal feature information, and fusing the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information to generate the bird's-eye view feature (BEV Map) having multi-modal fusion information (a code-level sketch of these steps is given below).
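Purely to illustrate how steps S102 to S112 fit together, the sketch below wires the stages with generic attention blocks; the inputs are assumed to be already token-shaped modal features, and every module name and shape is an assumption of the example rather than the patented implementation.

```python
import torch
import torch.nn as nn

class BEVFusionSketch(nn.Module):
    def __init__(self, H=200, W=200, C=256, heads=8):
        super().__init__()
        self.bev_embed = nn.Parameter(torch.randn(H * W, C))       # learnable ego-centered BEV Map (S106)
        self.temporal_attn = nn.MultiheadAttention(C, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(C, heads, batch_first=True)         # S108
        self.second_modal_attn = nn.MultiheadAttention(C, heads, batch_first=True)  # S110
        self.fusion_attn = nn.MultiheadAttention(C, heads, batch_first=True)        # S112

    def forward(self, img_feat, second_feat, history_bev=None):
        # img_feat:    (B, N_img, C)  first-modality (camera) features from S104
        # second_feat: (B, N_2nd, C)  second-modality features from S104
        # history_bev: (B, H*W, C)    historical BEV features, if available
        bev = self.bev_embed.unsqueeze(0).expand(img_feat.size(0), -1, -1)
        if history_bev is not None:                                   # S106: temporal prior
            bev = bev + self.temporal_attn(bev, history_bev, history_bev)[0]
        bev_img = bev + self.image_attn(bev, img_feat, img_feat)[0]   # BEV Map with image prior
        bev_2nd = bev_img + self.second_modal_attn(bev_img, second_feat, second_feat)[0]
        fused = bev_img + self.fusion_attn(bev_img, bev_2nd, bev_2nd)[0]  # multi-modal fused BEV Map
        return fused
```

In this sketch, bev_img plays the role of the BEV Map with first modal feature information, bev_2nd the copied BEV Map updated with second-modality features, and the final cross attention the fusion of step S112.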
In the bird's-eye view feature generation method S100 based on multi-modal fusion according to the present disclosure, the feature extraction network of each modality performs feature extraction on the collected information of all sensors of that modality to obtain the feature space of the modality (i.e., the first modal feature, the second modal feature, and so on), and the feature extraction networks of different modalities are independent of each other.
In some embodiments of the present disclosure, the first modality sensor described above in the present disclosure is an in-vehicle camera, the in-vehicle sensor further includes a third modality sensor for obtaining third modality information, a third modality feature of third modality acquisition information is further extracted using a third modality feature extraction network, a bird's eye view feature having third modality feature information is further obtained based on the bird's eye view feature having the first modality feature information and the third modality feature, the bird's eye view feature (BEV Map) having the first modality feature information is further accessed in the form of a bird's eye view Query vector (BEV Query), the second modality feature information and the third modality information are fused into the bird's eye view feature (BEV Map) having the first modality feature information, and the bird's eye view feature (BEV Map) having multi-modality fused information is generated.
Fig. 2 is a flowchart of a bird's eye view feature generation method based on multi-modal fusion according to another embodiment of the present disclosure.
According to the method, the raw information corresponding to each sensor of a different modality is first extracted, such as images captured by the vehicle-mounted cameras, lidar point clouds captured by the lidar, and millimeter-wave radar point clouds captured by the millimeter-wave radar; the raw features of each modality are then extracted through pre-trained backbone networks (an image backbone network, a lidar backbone network, a millimeter-wave radar backbone network, and so on) for use in generating the bird's-eye view features.
Fig. 3 is a schematic diagram of an overall model framework for implementing the bird's-eye view feature generation method based on multi-modal fusion according to an embodiment of the present disclosure.
Referring to fig. 3, in some embodiments of the present disclosure, a learnable bird's-eye view feature (BEV Map) centered on the ego vehicle is randomly initialized, a position code is added, and the result is sent to a temporal attention module (temporal prior module), which performs cross attention between bird's-eye view features of different moments according to the same real-world positions, so as to obtain temporal information.
In the disclosure, each position point of the initialized bird's-eye view map (BEV Map) of the current moment is used as a bird's-eye view query vector (BEV Query) to query the image features (first modal features): the corresponding position of the same real-world point of each BEV Query in the image feature space (first modal feature space) is found by projection through the intrinsic and extrinsic parameters of the camera (image sensor), and the queried image features (first modal features) are merged, using spatial cross attention, into the BEV Map that has already absorbed the temporal information (i.e., the initialized bird's-eye view of the current moment), yielding a BEV Map with image prior information (i.e., a bird's-eye view feature with first modal feature prior information).
In the present disclosure, a BEV Map carrying the feature information of each modality is constructed separately for that modality. When BEV Maps are constructed for modalities other than the camera modality (first modality), the BEV Map with image feature prior is copied once per modality, so that every modality starts from a BEV Map with image prior information. For each other modality (for example, the second modality), the copied BEV Map with image feature prior is related to the feature space of that modality through real-world positional relationships and is then updated with that modality's features. In this way every modality generates a BEV Map carrying its own feature information, which facilitates feature fusion.
In the present disclosure, if one modality, for example the first modality, receives input from several sensors, only one BEV Map is generated for it: when the feature space of a modality is constructed, the raw information of the different sensors of the same modality is passed through a shared backbone network to produce raw features, and the raw features produced by the different sensors of the same modality are then concatenated to obtain the feature space unique to that modality (for example, the first modal features).
The BEV Maps of all modalities are then sent into a multi-modal information interaction and fusion module. The BEV Map generated by the camera modality (i.e., the BEV Map with first modal feature information) accesses, in the form of BEV Queries, the BEV Maps generated by the other modalities (the BEV Map with second modal feature information, the BEV Map with third modal feature information, and so on), and the information of the other modalities is fused into the BEV Map of the image modality by cross attention to generate a BEV Map fused with multi-modal information.
Therefore, when a new modal input is introduced into the overall model, the backbone networks that extract the feature spaces of the other modalities do not need to be updated; only the BEV Map generation and fusion module (multi-modal information interaction and fusion module) needs to be retrained, so the overall model can be adapted to real tasks in different scenarios. When a new sensor is introduced into an existing modality, its raw information only needs to pass through the feature extractor shared by that modality, and the extracted features are concatenated onto the original features; the modal information is thereby enriched without changing the overall model. The feature extraction backbone of any modality can also be replaced conveniently, which makes updating and iterating the overall model very easy and allows the latest research results from different fields to be absorbed quickly. Moreover, when the camera modality is working normally, the loss or damage of any other modal input, or of any sensor within any other modality, does not affect the generation of the final bird's-eye view feature, so the method is highly robust in practical applications.
The feature extraction network (pre-trained backbone network) of each modality of the present disclosure is exemplarily explained below.
In the present disclosure, the characteristics of each modality are encoded independently. Since the overall model framework of the present disclosure does not make assumptions about the modality used or its model architecture, the overall model of the present disclosure is applicable to any optional feature encoder (feature extraction network).
Referring to fig. 3, the present disclosure exemplifies three types of data: laser radar point clouds, millimeter wave radar point clouds, and multi-view camera images.
(1) Lidar point cloud. For the lidar point cloud, the disclosure uses a VoxelNet with a voxel size of 0.1 m or a PointPillars with a pillar size of 0.2 m as the lidar feature extraction backbone; the point cloud passes through this 3D backbone network and an FPN network, and the FPN output is used as the multi-scale BEV Map.
(2) Millimeter-wave radar point cloud. For each point of the millimeter-wave radar point cloud, the radar point cloud features are extracted through a parameter-shared MLP network.
(3) Multi-view camera images. The feature extraction is the same as in DETR3D, BEVFormer and similar methods: for the N images input by the N vehicle-mounted surround-view cameras at a given moment, image features are extracted through a ResNet network and an FPN network, and the FPN output of each image is used as the multi-scale image features extracted from the multi-view camera images.
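For illustration only, the stand-ins below mirror the independence of the per-modality encoders; the lidar branch (VoxelNet or PointPillars plus FPN) is omitted because it relies on external implementations, and the image branch is a toy placeholder for ResNet plus FPN, so all layers and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class RadarPointEncoder(nn.Module):
    """Parameter-shared MLP applied independently to each millimeter-wave radar point."""
    def __init__(self, in_dim=5, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points):              # points: (B, N_points, in_dim)
        return self.mlp(points)             # (B, N_points, dim): one feature per radar point

class TinyImageEncoder(nn.Module):
    """Stand-in for the ResNet + FPN image branch: returns two feature scales per image."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        self.down = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, images):              # images: (B * N_cam, 3, H, W)
        c1 = self.stem(images)
        return [c1, self.down(c1)]          # "multi-scale" image features

# The lidar branch (VoxelNet or PointPillars followed by an FPN) would plug in alongside these;
# each modality's encoder is trained and applied independently of the others.
```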
Under the teaching of the technical solution disclosed in the present disclosure, those skilled in the art may adjust the architectures/types of the lidar point cloud feature extraction network, the millimeter wave radar point cloud feature extraction network, and the image feature extraction network, which all fall within the protection scope of the present disclosure.
In the present disclosure, the specific multi-modal feature extraction pre-training backbone network may be the same as the feature coding network in FUTR 3D.
In some embodiments of the present disclosure, when the input of a new modality needs to be added, for example when a camera modality and a lidar modality already exist and a millimeter-wave radar modality is to be added, the feature extractors (feature extraction networks) of the camera and lidar modalities do not need to be retrained; only the newly added millimeter-wave radar branch needs to be trained. When a sensor needs to be added to an existing modality, the already-trained feature extractor of that modality is used to extract features, and the extracted features are concatenated onto the original modality's features according to the position of the new sensor; the overall model does not need to be retrained, which facilitates rapid deployment. When the overall model is missing a certain sensor of a certain modality, the construction of that modality's feature space and the overall framework of the model are unaffected; only the richness of that modality's features is reduced. When a non-image modality is missing entirely, the feature fusion branch of that modality is simply skipped when constructing the BEV Map of the current moment; the generation of the final BEV Map is not affected, and only the information of the missing modality is absent from it.
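A hedged sketch of the "add a sensor to an existing modality" idea follows: features of the new sensor are extracted with the modality's already-trained (frozen) encoder and concatenated onto the modality's existing feature tokens. The function and argument names are assumed for the example.

```python
import torch

def extend_modality_features(existing_feat, new_sensor_raw, shared_encoder):
    # existing_feat:  (B, N_old, C) feature tokens already produced for this modality
    # new_sensor_raw: raw data of the newly added sensor, in the encoder's input format
    # shared_encoder: the modality's frozen, pre-trained feature extractor
    with torch.no_grad():                                 # no retraining of the shared encoder
        new_feat = shared_encoder(new_sensor_raw)         # (B, N_new, C)
    return torch.cat([existing_feat, new_feat], dim=1)    # spliced feature space (B, N_old + N_new, C)
```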
Fig. 4 is a schematic flow chart of acquiring the bird's-eye view feature after initialization at the current time based on the bird's-eye view feature at the historical time in the bird's-eye view feature generation method based on multi-modal fusion according to the embodiment of the disclosure.
In the above S106, obtaining the initialized bird's-eye view feature of the current moment based on the bird's-eye view features of historical moments (temporal attention module) comprises the following steps:
S1061, randomly initializing a learnable bird's-eye view feature centered on the ego vehicle as the bird's-eye view feature of the current moment;
S1062, adding a position code to the learnable bird's-eye view feature to obtain the position-coded bird's-eye view of the current frame;
S1063, extracting historical-frame bird's-eye views (BEV Maps) within a preset time length before the current moment;
S1064, obtaining the relative pose (i.e., pose change) between the position-coded bird's-eye view of the current frame and the historical-frame bird's-eye views;
S1065, for the bird's-eye view query vector of each position point of the position-coded bird's-eye view of the current frame, searching, based on the relative pose, the corresponding historical-frame bird's-eye view query vectors (at least one) in the historical-frame bird's-eye views as temporal prior information; and
S1066, adding the temporal prior information to the position-coded bird's-eye view of the current frame to obtain the initialized bird's-eye view feature of the current moment (temporal attention module).
In some embodiments of the present disclosure, extracting a bird's eye view (BEV Map) of a historical frame within a preset time length before a current time includes:
and extracting one frame of historical frame aerial view or extracting more than two frames of historical frame aerial view.
In some embodiments of the present disclosure, if more than two frames of historical bird's-eye views are extracted, the temporal prior information obtained from each historical bird's-eye view is fused by cross attention to obtain the final temporal prior information.
In more detail, for bird's-eye view feature generation at the current moment, the disclosure first initializes a learnable BEV Map ∈ R^{H×W×C} centered on the ego vehicle, where H is the height of the generated bird's-eye view, W is its width, and C is the dimension of the generated bird's-eye view feature. Illustratively, the present disclosure sets H = W = 200 and C = 256.
The present disclosure initializes the BEV Map (bird's-eye view) of the current frame from the BEV Maps of historical frames using a temporal attention module. Specifically, a learnable BEV Map centered on the ego vehicle is first randomly initialized for the current frame, and a two-dimensional relative position code is added. Then, within the last 0.5 second (for example), four different frames (i.e., multiple frames) are randomly extracted as historical frames, and the relative poses (including position offset and/or heading offset) between the current-frame bird's-eye view of the ego vehicle and the bird's-eye views of the historical frames are obtained from IMU (Inertial Measurement Unit) information.
Each position point on the BEV Map of the current frame is used as a BEV Query (bird's-eye view query vector), which yields H × W BEV Queries, each of dimension R^{1×C}. Then, for each BEV Query of the current frame, the corresponding BEV Query is found in the historical frames through the pose information (i.e., the relative pose described above): according to the position offset and/or heading offset, for the real-world position coordinate corresponding to the BEV Query of the current frame, the BEV Query corresponding to that real-world position coordinate is searched for in the historical frame.
It should be noted that the same real-world position point corresponding to a BEV Query of the current frame may fall within several BEV Queries of a historical frame. The present disclosure therefore preferably performs cross attention with the Queries of the historical frame within the neighboring 3 × 3 (exemplarily) Query range, weighted according to the ratio of the distances of that real-world position point to the different BEV Queries; if the corresponding position point in the historical frame lies on a boundary of the BEV Map (bird's-eye view), cross attention is performed only within the neighboring 2 × 3 Query range, and if it lies at one of the four corner points of the BEV Map, only within the neighboring 2 × 2 Query range.
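One possible reading of this neighborhood rule is sketched below: for the real-world point of a current BEV Query mapped into a historical BEV Map at fractional grid coordinates, the neighboring historical Queries are gathered in a 3 × 3 window (2 × 3 at edges, 2 × 2 at corners), here with inverse-distance weights as a stand-in for the distance-ratio weighting; all of this is illustrative, not the patented computation.

```python
import torch

def historical_neighbours(px, py, H=200, W=200):
    # (px, py): fractional grid coordinates of the current Query's real-world point in the history frame
    cx, cy = int(round(px)), int(round(py))
    idx, dist = [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            x, y = cx + dx, cy + dy
            if 0 <= x < W and 0 <= y < H:                  # window shrinks to 2x3 / 2x2 at borders / corners
                idx.append(y * W + x)
                dist.append(((x - px) ** 2 + (y - py) ** 2) ** 0.5)
    w = torch.tensor([1.0 / (d + 1e-3) for d in dist])
    return torch.tensor(idx), w / w.sum()                  # historical Query indices and distance-based weights
```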
For each historical frame, the corresponding prior information is extracted for the BEV Queries of the current frame according to the relative pose; the prior information from the multiple frames is then concatenated, the BEV Query of the current frame is used as the Query, and the corresponding BEV Queries of the historical frames are used as the Keys and Values, so that a single cross attention yields the final temporal prior information. This produces the BEV Map ∈ R^{H×W×C} output by the temporal attention module.
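The temporal prior step can be sketched as follows, assuming the priors gathered from each history frame (for example with the neighbor gathering above) are already aligned to the current frame; shapes and module choices are assumptions.

```python
import torch
import torch.nn as nn

class TemporalPrior(nn.Module):
    def __init__(self, C=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(C, heads, batch_first=True)

    def forward(self, current_bev, history_priors):
        """
        current_bev:    (B, H*W, C) position-coded BEV Queries of the current frame
        history_priors: list of (B, H*W, C) priors, one per history frame, already aligned
                        to the current frame via the relative pose
        """
        if not history_priors:                       # first frame: skip the temporal module
            return current_bev
        B, N, C = current_bev.shape
        kv = torch.stack(history_priors, dim=2).reshape(B * N, len(history_priors), C)
        q = current_bev.reshape(B * N, 1, C)
        out, _ = self.attn(q, kv, kv)                # Query = current BEV Query, Key/Value = history
        return current_bev + out.view(B, N, C)       # add the temporal prior to the current BEV Map
```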
Fig. 5 is a schematic diagram of the same real-world position correspondence of a current frame BEV Query in a history frame according to an embodiment of the disclosure.
In some embodiments of the present disclosure, it is worth noting that a Query that has no corresponding temporal information in the historical BEV Maps obtains no prior information from the time sequence. If the current frame is the first frame, the learnable BEV Map only incorporates the two-dimensional relative position code and skips the temporal attention module described above. If fewer than four historical frames are available for the current frame, only the existing historical frames are used for temporal prior fusion.
In the bird's-eye view feature generation method S100 based on multi-modal fusion of the present disclosure, preferably, the first modal feature is an image feature; obtaining the bird's-eye view feature having the first modal feature information based on the first modal feature and the initialized bird's-eye view feature of the current moment in S108 described above includes:
and carrying out spatial feature sampling on the first modal feature based on the geometric corresponding relation between the bird's-eye view visual angle and the image visual angle so as to obtain the bird's-eye view feature with the first modal feature information.
In some embodiments of the present disclosure, each position point of the initialized bird's-eye view of the current moment is taken as a bird's-eye view query vector (BEV Query) and is geometrically projected onto the feature map (first modal feature) of the image captured by each vehicle-mounted camera to perform spatial feature sampling.
In this disclosure, for the BEV Map ∈ R^{H×W×C} that has passed through the temporal attention module, similarly, each position point on the BEV Map is used as a bird's-eye view query vector (BEV Query), which yields H × W BEV Queries, each of dimension R^{1×C}.
For each BEV Query, its position on the BEV Map is geometrically projected, according to the intrinsic and extrinsic parameters of the camera of the first modal sensor (the image sensor), into the feature map extracted by the convolutional neural network for spatial feature sampling. Illustratively, the present disclosure assumes that each grid cell of the BEV Map corresponds to S meters in the real world. By default, the feature center of the BEV Map corresponds to the position of the ego vehicle; then for any BEV Query with coordinates (x, y) on the BEV Map, the coordinates (x', y') projected onto the real-world top view are

x' = (x - W/2) · S,  y' = (y - H/2) · S,

where (x, y) takes the lower-left corner of the BEV Map as the coordinate origin and (x', y') takes the ego vehicle as the coordinate origin on the real-world top view. The present disclosure illustratively sets S = 0.5.
Illustratively, for each coordinate point (x', y'), R sampling points are placed uniformly along the z-axis in the range (-5 m, 5 m); that is, each BEV Query corresponds to R real-world 3D coordinate points (x', y', z_j), j = 1, ..., R. Each 3D coordinate point can then be projected, through the intrinsic and extrinsic parameters of the cameras, onto one or more images collected by the vehicle-mounted cameras, yielding several image coordinate points (p_x, p_y). Exemplarily, the present disclosure sets R = 8.
For each BEV Query, a plurality of corresponding image projection coordinate points may be obtained, and referring to fig. 6, fig. 6 is a projection diagram of BEV viewing angles and image viewing angle positions according to an embodiment of the present disclosure.
It should be noted that the R real-world 3D coordinate points generated by each BEV Query can compute projection points according to the intrinsic and extrinsic parameters of every camera (i.e., every vehicle-mounted camera); however, since the fields of view of at most two cameras overlap, each BEV Query projects onto at most 2 vehicle-mounted images, and for the other images the projection points fall outside the image range.
Through the process, the BEV Map with the image characteristic prior information is obtained.
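To make the projection sampling above concrete, the following PyTorch sketch (illustrative only, not the patent's implementation) builds the R real-world 3D points for every BEV Query using the grid-to-world conversion given earlier (S = 0.5, R = 8, z ∈ (-5 m, 5 m)) and projects them into one camera given its intrinsics and extrinsics; the tensor layouts and function names are assumptions.

```python
import torch

def bev_query_to_3d_points(H, W, S=0.5, R=8, z_range=(-5.0, 5.0)):
    """Return the R real-world 3D points for every BEV Query, shape (H, W, R, 3)."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x_w = (xs.float() - W / 2) * S                    # BEV grid -> metres, ego at centre
    y_w = (ys.float() - H / 2) * S
    z_w = torch.linspace(z_range[0], z_range[1], R)   # R evenly spaced heights
    return torch.stack([x_w[..., None].expand(H, W, R),
                        y_w[..., None].expand(H, W, R),
                        z_w.view(1, 1, R).expand(H, W, R)], dim=-1)

def project_to_image(pts, K, T_cam_from_ego, img_hw):
    """Project world points into one camera; K is a (3,3) intrinsic matrix,
    T_cam_from_ego a (4,4) extrinsic matrix, img_hw = (H_img, W_img).
    Returns pixel coordinates and a mask of points landing inside this image."""
    n = pts.reshape(-1, 3)
    cam = (T_cam_from_ego[:3, :3] @ n.t() + T_cam_from_ego[:3, 3:4]).t()
    uv = (K @ cam.t()).t()
    px = uv[:, :2] / uv[:, 2:3].clamp(min=1e-5)
    valid = (uv[:, 2] > 0) & (px[:, 0] >= 0) & (px[:, 0] < img_hw[1]) \
            & (px[:, 1] >= 0) & (px[:, 1] < img_hw[0])
    return px, valid
```

In practice, each BEV Query keeps only the projections marked valid, which matches the observation above that at most two cameras see the same 3D point.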
In some preferred embodiments of the present disclosure, obtaining the bird's-eye view feature having the first modal feature information based on the first modal feature and the bird's-eye view feature after initialization of the current time further includes: taking each vertical line of the feature map representing the image features of each image as a reference line, acquiring the vehicle-centered polar ray features of the bird's-eye view, thereby obtaining bird's-eye view Query vectors (BEV Query), and performing spatial feature sampling based on the reference lines to acquire the bird's-eye view feature with the first modal feature information.
Considering that the real-world geometric relationship cannot be well utilized only by adopting the corresponding projection point projection method, a reference line-based sampling projection method is also introduced in some preferred embodiments of the present disclosure.
In the geometric relationship between the image view angle and the BEV view angle, each vertical line feature of the image view corresponds to a vehicle-centered polar ray of the BEV view. Therefore, the vertical lines on the image are used as reference lines for feature sampling, which introduces geometric prior information of the real world and helps generate a more robust BEV Map.
In general, the bird's-eye view feature is generated with the host vehicle as the center point, so the vehicle-centered polar rays of the bird's-eye view feature cover a 360° range, while the field of view of a vehicle-mounted camera is narrower; within the field of view shared by a camera and the vehicle-centered polar rays, each vertical line feature of the image view corresponds to one vehicle-centered polar ray of the BEV view.
That is, within the corresponding camera's field of view, the vertical line features from left to right on the image feature correspond to the vehicle-centered polar rays from right to left on the BEV Map; referring to fig. 7, fig. 7 is a schematic diagram of the reference line sampling manner according to an embodiment of the present disclosure.
It is noted that, since the feature extracted on the BEV Map is a polar ray feature, the corresponding extraction range may fall between multiple BEV Queries; the corresponding feature sampling method is therefore: each vertical line feature of the image features is kept unchanged, and its corresponding bird's-eye view feature takes, on each horizontal line of the BEV Map, the BEV Query nearest to the drop point together with its left and right neighboring BEV Queries as the corresponding features.
Therefore, with the reference-line-based feature sampling method, each vertical line feature on each feature map corresponds to H/2 or W/2 BEV Queries on the H × W bird's-eye view feature. Whether the number is H/2 or W/2 depends on whether the corresponding polar ray direction is horizontal or vertical, and the division by 2 is because the host vehicle is at the center point of the bird's-eye view.
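One way to realize the reference-line correspondence is sketched below: each image column is mapped to an ego-centred polar angle inside the camera's horizontal field of view (columns left to right giving rays right to left), and the ray is marched over the BEV grid, collecting the nearest cell on each step together with its left and right neighbours. The yaw/FOV parametrisation, the marching step, and all names are assumptions for illustration, not the patent's implementation.

```python
import math

def column_to_bev_ray_cells(u, img_w, cam_yaw, cam_fov, H, W):
    """Return the BEV grid cells (row, col) lying on the ego-centred polar ray
    that corresponds to image column u of a camera with yaw cam_yaw and
    horizontal FOV cam_fov (both in radians)."""
    # columns left -> right map to rays right -> left inside the camera's FOV
    ang = cam_yaw + cam_fov / 2 - (u / max(img_w - 1, 1)) * cam_fov
    cx, cy = W // 2, H // 2                      # ego vehicle at the BEV centre
    cells = []
    for r in range(max(H, W) // 2):              # march outwards along the ray
        x = cx + r * math.cos(ang)
        y = cy + r * math.sin(ang)
        ix, iy = int(round(x)), int(round(y))
        if not (0 <= ix < W and 0 <= iy < H):
            break
        # nearest cell plus its left/right neighbours on the same row, since the
        # extraction range can fall between BEV Queries
        for dx in (-1, 0, 1):
            if 0 <= ix + dx < W:
                cells.append((iy, ix + dx))
    return sorted(set(cells))
```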
The above describes the BEV view-to-image view projection sampling and reference line sampling approaches, both of which will help the BEV Map extract corresponding features from the image space.
In the bird's-eye view feature generation method S100 based on multi-modal fusion of the present disclosure, preferably, the second modal feature described above is a BEV feature (bird's-eye view feature) generated from a laser radar point cloud and obtained by the second modal feature extraction network (laser radar point cloud feature extraction network); obtaining the bird's-eye view feature having the second modal feature information based on the bird's-eye view feature having the first modal feature information and the second modal feature at S110 described above includes:
s1102, copying a bird 'S-eye view feature with first modal feature information as a bird' S-eye view feature (BEV Map) for acquiring laser radar modal information;
s1104, finding corresponding features in second modal features (BEV features generated by laser radar point cloud) by means of projection of bird' S-eye view features (BEV maps) for collecting laser radar modal information;
and S1106, fusing laser radar characteristic information which is the corresponding characteristic found in the second modality characteristic (BEV characteristic generated by the laser radar point cloud) to a bird 'S eye view characteristic (BEV Map) for collecting the laser radar modality information to generate the bird' S eye view characteristic (BEV Map) with the laser radar modality characteristic information.
Fig. 8 shows a schematic flow chart for obtaining a bird's-eye view feature having second modal feature information based on a bird's-eye view feature having first modal feature information and the second modal feature according to an embodiment of the present disclosure.
Referring to fig. 8, in some embodiments of the present disclosure, when sampling the lidar point cloud features of the lidar modality using the BEV Map with the image feature prior (i.e., the bird's-eye view feature with the first modal feature information), the present disclosure first obtains the BEV features generated from the lidar point cloud through the lidar feature extraction network, and then copies the BEV Map with the image feature prior as the BEV Map for acquiring lidar modal information. Each position point on the copied BEV Map is used as a BEV Query, and the corresponding feature is found in the lidar-generated BEV features by projection; the corresponding lidar feature is obtained by bilinear interpolation of the neighboring position features. The lidar information is then fused, by means of deformable attention, into the copied BEV Map with the image feature prior, generating the BEV Map with lidar modal information.
In some embodiments of the present disclosure, preferably, the fusing the laser radar feature information, which is the corresponding feature found in the second modality feature (the laser radar point cloud generated BEV feature), to the bird's eye view feature (BEV Map) that collects the laser radar modality information, described above, to generate the bird's eye view feature (BEV Map) with the laser radar modality feature information, includes:
predicting k offset displacements at the projection point position of the corresponding feature through an MLP network, and adding each offset displacement to the projection point position to obtain k additional feature sampling points, so that each bird's-eye view Query vector (BEV Query) of the bird's-eye view feature (BEV Map) acquiring laser radar modal information obtains k + 1 feature sampling points;
and taking the bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking the sampling points and the characteristics of the sampling points as keys (Key) and values (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query).
Specifically, each BEV Query on the BEV Map with the image feature prior is first projected and bilinearly interpolated to obtain the corresponding BEV feature generated from the lidar point cloud; then k offset displacements are predicted from this feature through an MLP network, and adding each offset to the coordinates of the projection point gives k feature sampling points, so that each projection point obtains k additional feature sampling points and each BEV Query obtains k + 1 feature sampling points in total. The BEV Query is then used as the cross attention Query, and the features of the sampling points are used as the Key and Value, to perform cross attention calculation and obtain the updated BEV Query:
Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q = Query · W_q, K = Key · W_k, V = Value · W_v; W_q, W_k, and W_v are all learnable matrices, and d_k is the feature dimension.
The present disclosure exemplarily sets k =4.
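As a concrete illustration of the deformable-attention sampling above, the following PyTorch sketch (under assumptions, not the patent's implementation) bilinearly samples the lidar BEV feature at each projected point with `grid_sample`, predicts k offsets with an MLP, samples k additional points, and updates each BEV Query with single-head cross attention over the k + 1 sampled features; the class and argument names, the single attention head, and the 0.1 offset scale are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LidarBevSampler(nn.Module):
    """Sketch of the lidar-modality sampling step described above."""
    def __init__(self, c, k=4):
        super().__init__()
        self.k = k
        self.offset_mlp = nn.Linear(c, 2 * k)             # k (dx, dy) offsets
        self.w_q = nn.Linear(c, c, bias=False)
        self.w_k = nn.Linear(c, c, bias=False)
        self.w_v = nn.Linear(c, c, bias=False)

    def forward(self, bev_query, ref_xy, lidar_bev):
        # bev_query: (N, C)  BEV Queries copied from the image-prior BEV Map
        # ref_xy:    (N, 2)  projected positions in [-1, 1] grid_sample coordinates
        # lidar_bev: (1, C, Hl, Wl)  BEV features produced from the lidar point cloud
        n, c = bev_query.shape
        base = F.grid_sample(lidar_bev, ref_xy.view(1, n, 1, 2),
                             align_corners=False).view(c, n).t()        # (N, C)
        offs = self.offset_mlp(base).view(n, self.k, 2) * 0.1           # small offsets
        pts = torch.cat([ref_xy.unsqueeze(1), ref_xy.unsqueeze(1) + offs], dim=1)
        feats = F.grid_sample(lidar_bev, pts.view(1, n, self.k + 1, 2),
                              align_corners=False)[0].permute(1, 2, 0)  # (N, k+1, C)
        q = self.w_q(bev_query).unsqueeze(1)                            # (N, 1, C)
        k_, v = self.w_k(feats), self.w_v(feats)
        attn = F.softmax((q * k_).sum(-1) / c ** 0.5, dim=-1)           # (N, k+1)
        return bev_query + (attn.unsqueeze(-1) * v).sum(1)              # updated Query
```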
Fig. 9 shows a schematic flow chart for obtaining a bird's-eye view feature having second modal feature information based on a bird's-eye view feature having first modal feature information and the second modal feature according to yet another embodiment of the present disclosure.
Referring to fig. 9, in the present embodiment, the second modal feature is a millimeter wave radar point cloud feature obtained by the second modal feature extraction network (millimeter wave radar point cloud feature extraction network); obtaining the bird's-eye view feature having the second modal feature information based on the bird's-eye view feature having the first modal feature information and the second modal feature at S110 described above includes:
s1102, copying a bird 'S-eye view feature with first modal feature information as a bird' S-eye view feature (BEV Map) for acquiring millimeter wave radar modal information;
s1104, finding a corresponding feature in a second modal feature (millimeter wave radar point cloud feature) by a projection mode according to a bird' S eye view feature (BEV Map) for collecting millimeter wave radar modal information;
and S1106, fusing millimeter wave radar feature information which is the corresponding feature found in the second modality features (millimeter wave radar point cloud features) into a bird 'S-eye view feature (BEV Map) for collecting the millimeter wave radar modality information to generate the bird' S-eye view feature (BEV Map) with the millimeter wave radar modality feature information.
In some embodiments of the present disclosure, it is preferable that the above-described fusion of millimeter wave radar feature information, which is a corresponding feature found in the second modality feature (millimeter wave radar point cloud feature), to a bird's eye view feature (BEV Map) that acquires millimeter wave radar modality information to generate a bird's eye view feature (BEV Map) having millimeter wave radar modality feature information includes:
predicting K offset displacements at the projection point position of the corresponding feature through an MLP network, and adding each offset displacement to the projection point position to obtain K additional feature sampling points, so that each bird's-eye view Query vector (BEV Query) of the bird's-eye view feature (BEV Map) acquiring millimeter wave radar modal information obtains K + 1 feature sampling points;
and (3) taking the bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), taking the sampling points and the characteristics of the sampling points as keys (Key) and values (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query).
In some embodiments of the present disclosure, the first modality is an image modality, and the second modality may also be a millimeter wave radar modality. When sampling the features of the millimeter wave radar mode by using the BEV Map with the image feature prior (namely, the bird's-eye view feature with the first mode feature information), the method firstly obtains the point cloud features generated by the millimeter wave radar point cloud through the millimeter wave radar feature extraction network (second mode feature extraction network), and then copies a copy of the BEV Map with the image feature prior as the BEV Map for collecting the millimeter wave radar mode information.
The millimeter wave radar BEV Map is likewise updated per BEV Query in a cross attention manner: for each BEV Query, the K nearest points in the millimeter wave radar point cloud are found, the BEV Query is then used as the cross attention Query, and the point cloud features obtained from the K points through an MLP are used as the Key and Value for cross attention, yielding the updated BEV Query (the update formula is the same as above). Exemplarily, the present disclosure sets K = 5.
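A minimal sketch of this K-nearest-point cross attention follows, assuming the radar points and BEV Queries are expressed in the same metric BEV frame and the per-point MLP features are precomputed; the names and the single attention head are illustrative, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def radar_knn_cross_attention(bev_query, query_xy, radar_xy, radar_feat,
                              w_q, w_k, w_v, K=5):
    """bev_query: (N, C) BEV Queries; query_xy: (N, 2) positions in metres;
    radar_xy: (M, 2) radar point positions; radar_feat: (M, C) per-point features
    already produced by an MLP; w_q/w_k/w_v: (C, C) learnable projections."""
    d = torch.cdist(query_xy, radar_xy)               # (N, M) pairwise distances
    idx = d.topk(K, largest=False).indices            # K nearest radar points
    neigh = radar_feat[idx]                           # (N, K, C)
    q = (bev_query @ w_q).unsqueeze(1)                # (N, 1, C)
    k = neigh @ w_k
    v = neigh @ w_v
    attn = F.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)
    return bev_query + (attn.unsqueeze(-1) * v).sum(1)
```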
For the third mode described above in this disclosure, it may be a lidar mode or a millimeter wave radar mode.
As can be seen from the above description, in the present disclosure an independent BEV Map with the corresponding modal information is obtained for each modality. The present disclosure requires the overall model to include at least the camera modality (first modality), so at least one BEV Map with image modal information is obtained; the overall model may support three modalities, namely the camera modality, the lidar modality, and the millimeter wave modality, in which case three BEV Maps with modal information are obtained. The specific types of the second modality and the third modality can be adjusted by those skilled in the art in light of the technical solutions of the present disclosure, and all such adjustments fall within the protection scope of the present disclosure.
Moreover, mining modal information into separate BEV Maps in the manner of the present disclosure allows a sensor of a certain modality to be added, removed, or damaged during testing of the overall model without affecting the generation of the final BEV Map.
With regard to the method S100 for generating a bird's-eye view feature based on multi-modal fusion of the present disclosure, the above-described S112 of generating the bird's-eye view feature (BEV Map) having multi-modal fusion information, by accessing the bird's-eye view feature (BEV Map) having the first modal feature information to the bird's-eye view feature having the second modal feature information in the form of bird's-eye view Query vectors (BEV Query) and fusing the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information, preferably includes:
using the BEV Queries of the bird's-eye view feature (BEV Map) having the first modal feature information as the cross attention Query, and the BEV Queries of the bird's-eye view feature (BEV Map) having the second modal feature information as the cross attention Key and Value; for each BEV Query of the bird's-eye view feature (BEV Map) having the first modal feature information, finding the corresponding BEV Query on the bird's-eye view feature (BEV Map) having the second modal feature information;

predicting k offset displacements from the corresponding BEV Query, and adding each offset displacement to the coordinates of the corresponding BEV Query to obtain k additional BEV Queries on the bird's-eye view feature (BEV Map) having the second modal feature information, so that each BEV Query of the bird's-eye view feature (BEV Map) having the first modal feature information obtains k + 1 corresponding BEV Queries on the bird's-eye view feature (BEV Map) having the second modal feature information;

performing cross attention calculation with the BEV Query of the bird's-eye view feature (BEV Map) having the first modal feature information as the cross attention Query and the corresponding BEV Queries on the bird's-eye view feature (BEV Map) having the second modal feature information as the Key and Value, to obtain the finally output BEV Query, that is, the BEV Query of the bird's-eye view feature (BEV Map) with multi-modal fusion information.
In some embodiments of the present disclosure, when one or more BEV maps are obtained (the bird's eye view feature having the first modality feature information, the bird's eye view feature having the second modality feature information, the bird's eye view feature having the third modality feature information, and the like), the multi-modality information fused BEV Map may be generated by the multi-modality information interaction fusion module.
In some embodiments of the present disclosure, if the multimodal information interaction fusion module has only an input of a camera modality, the multimodal information interaction fusion module is skipped.
When the input of the multi-modal information interaction fusion module includes the camera modality (first modality) and other modalities, deformable attention is used to fuse the information of the other modalities into the camera modality. Specifically, the BEV Queries of the camera-modality BEV Map are used as the cross attention Query, and the BEV Queries of the BEV Maps of the other modalities are used as the cross attention Key and Value. For each camera-modality BEV Query, the corresponding BEV Query on the BEV Map of another modality is first found; then, as before, k offset displacements are predicted from the corresponding BEV Query, and adding each offset to the coordinates of the corresponding BEV Query gives k additional BEV Queries on the BEV Map of that modality, so that each camera-modality BEV Query obtains k + 1 corresponding BEV Queries on each other modality (with two other modalities, 2(k + 1) corresponding BEV Queries are obtained). Finally, the image-modality BEV Query is used as the cross attention Query, the corresponding BEV Queries of the other modalities are used as the Key and Value, and cross attention calculation is performed to output the final BEV Query. Exemplarily, the present disclosure sets k = 4.
Thus, in the present disclosure, if any modality is absent and part of the Key and Value is therefore missing, only the sequence input to the deformable attention becomes shorter, which does not affect the update and generation of the final BEV Query. If all Keys and Values are missing, the multi-modal information interaction fusion module is skipped entirely.
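The following is a minimal sketch of this scalable fusion step, assuming a hypothetical helper `sample_fn` that returns the k + 1 corresponding BEV Queries per camera Query for one other modality; note how a missing modality simply shortens the Key/Value sequence, matching the robustness described above. All names and the single attention head are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_modalities(cam_bev, other_bevs, sample_fn, w_q, w_k, w_v):
    """cam_bev: (N, C) camera-modality BEV Queries; other_bevs: list of (N, C)
    BEV Maps from other modalities (possibly empty if sensors are missing);
    sample_fn(cam_bev, other) -> (N, k+1, C) corresponding BEV Queries."""
    if not other_bevs:                 # camera-only input: skip the fusion module
        return cam_bev
    kv = torch.cat([sample_fn(cam_bev, o) for o in other_bevs], dim=1)
    q = (cam_bev @ w_q).unsqueeze(1)                     # (N, 1, C)
    k = kv @ w_k                                         # (N, M*(k+1), C)
    v = kv @ w_v
    attn = F.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)
    return cam_bev + (attn.unsqueeze(-1) * v).sum(1)     # fused BEV Queries
```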
Fig. 10 is a schematic flowchart of a method for generating a bird 'S-eye view feature based on multi-modal fusion according to still another embodiment of the present disclosure, and the method for generating a bird' S-eye view feature based on multi-modal fusion S100 according to the present embodiment further includes, in addition to the schematic flowcharts shown in fig. 1 or 2:
s114, inputting the bird 'S-eye view feature (BEV Map) of the multi-modal fusion information to the FFN and separable convolution module (composed of the FFN network (feed-forward neural network) and the separable convolution network) to output the bird' S-eye view feature (BEV Map) of the enhanced multi-modal fusion information at the present time.
Following common practice with attention mechanisms, in some embodiments of the present disclosure an FFN whose hidden layer is expanded 2-fold is applied after the spatial cross attention. For downstream perception tasks on bird's-eye view features, it is generally believed that different channels store different information, so a separable convolutional neural network layer placed after the spatial cross attention helps generate high-quality bird's-eye view features. Since the channel dimension of the bird's-eye view feature must remain unchanged, the number of convolution kernels of the separable convolution network equals the channel dimension, and the network is placed after the spatial attention module (i.e., the multi-modal information fusion module described above).
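A possible realization of this block is sketched below, assuming residual connections, a two-layer FFN whose hidden width is twice the channel dimension, and a depthwise-separable convolution that keeps the channel dimension C unchanged; these choices and the class name are illustrative, not the patent's implementation.

```python
import torch.nn as nn

class FfnSepConvBlock(nn.Module):
    """Post-attention block: channel-wise FFN, then depthwise-separable conv."""
    def __init__(self, c):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(c, 2 * c), nn.ReLU(), nn.Linear(2 * c, c))
        self.sep_conv = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c),  # depthwise
            nn.Conv2d(c, c, kernel_size=1),                       # pointwise
        )

    def forward(self, bev):                       # bev: (B, H, W, C)
        x = bev + self.ffn(bev)                   # FFN applied per BEV location
        x = x.permute(0, 3, 1, 2)                 # -> (B, C, H, W)
        x = x + self.sep_conv(x)                  # channel dimension C is preserved
        return x.permute(0, 2, 3, 1)              # back to (B, H, W, C)
```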
As can be seen from the above description, the bird's-eye view feature generation method based on multi-modal fusion of the present disclosure is a high-quality, modality-expandable method for generating multi-modal-fused bird's-eye view features, which facilitates expanding subsequent downstream tasks on the bird's-eye view feature.
Aiming at the technical problem that it is difficult in the prior art to effectively mine the corresponding feature information of different modalities, the present disclosure proposes a multi-modal feature sampling method based on an image feature prior: the learnable bird's-eye view feature of the current time is first randomly initialized, the bird's-eye view feature is then updated with the image features, and cross attention fusion is then performed between the bird's-eye view feature space and the features at the same real-world positions in the feature spaces constructed by the other modalities, thereby realizing effective multi-modal feature information mining. Visual information contains abundant geometric features and is also the main perception means of a person driving a car; fusing the image features into the bird's-eye view feature first, as prior information, enables the information of the other modalities to be effectively mined and extracted.
Aiming at the technical problem that it is difficult in the prior art to effectively and interactively fuse information between different modalities, the present disclosure proposes a scalable multi-modal information interaction fusion module based on cross attention. For the bird's-eye view features sampled from each modality, the bird's-eye view feature based on the camera modality interacts with the features of the other modalities by means of cross attention, and the bird's-eye view feature generated from the camera modality is updated with the bird's-eye view feature information of the other modalities. Through cross attention, the effective information mined from different modalities can be integrated into the camera-modality bird's-eye view feature; at the same time, this interaction design effectively reduces the computation of information fusion between different modalities and makes effective use of the information of the different modalities.
The bird's-eye view feature generation method based on multi-modal fusion provides a unified framework for the modality-expandable multi-modal generation of bird's-eye view features. Under the unified representation of bird's-eye view features, introducing a new modal input into the model does not require updating the backbone networks that extract the feature spaces of the other modalities, and introducing a new sensor into an existing modality does not require updating the model, achieving plug-and-play: the feature extraction backbone of any modality can be conveniently replaced, and model update and iteration are very convenient. As long as the camera modality is normal, the loss or damage of any other modal input does not affect the generation of the final bird's-eye view feature, giving the method strong robustness in practical applications.
The present disclosure also provides a bird's-eye view feature generation device (i.e., the global model described above) based on multi-modal fusion.
Referring to fig. 11, the bird's-eye view feature generation device 1000 based on multi-modal fusion of the present disclosure includes:
the vehicle-mounted sensor acquisition information acquisition module 1002 is configured to extract acquisition information of all vehicle-mounted sensors of the vehicle at the current moment, where the vehicle-mounted sensors at least include a first modality sensor and a second modality sensor so as to obtain at least first modality acquisition information and second modality acquisition information;
a first modality feature extraction network 1004, the first modality feature extraction network 1004 extracting a first modality feature of the first modality acquisition information;
a second modality feature extraction network 1006, the second modality feature extraction network 1006 extracting a second modality feature of the second modality acquisition information;
a time attention module 1008, wherein the time attention module 1008 acquires the bird's-eye view feature after the initialization of the current moment based on the bird's-eye view feature at the historical moment;
a first-modality bird's-eye view feature acquisition module 1010, wherein the first-modality bird's-eye view feature acquisition module 1010 acquires a bird's-eye view feature having first-modality feature information based on the first-modality feature and the bird's-eye view feature after initialization at the current time;
a second-modality bird's-eye view feature acquisition module 1012, wherein the second-modality bird's-eye view feature acquisition module 1012 acquires a bird's-eye view feature having second-modality feature information based on the bird's-eye view feature having the first-modality feature information and the second-modality feature;
the multi-modality information fusion module 1014 accesses the bird's-eye view feature (BEV Map) having the first modality feature information to the bird's-eye view feature (BEV Query) having the second modality feature information in the form of a bird's-eye view Query vector (BEV Query), and fuses the second modality feature information to the bird's-eye view feature (BEV Map) having the first modality feature information to generate the bird's-eye view feature (BEV Map) having the multi-modality fusion information.
Referring to fig. 12, in other embodiments of the present disclosure, the bird's-eye view feature generation device 1000 based on multi-modal fusion of the present disclosure further includes:
FFN and separable convolution module 1016, FFN and separable convolution module 1016 processes the bird's eye view feature (BEV Map) of the multimodal fusion information to output an enhanced bird's eye view feature (BEV Map) of the multimodal fusion information at the current time.
It should be noted that the bird's-eye view feature generation apparatus 1000 based on multi-modal fusion according to the present disclosure may further include a third modal feature extraction network, and the like, which is not described in detail in the present disclosure.
In the overall model architecture of the bird's-eye view feature generation device 1000 based on multi-modal fusion of the present disclosure, the bird's-eye view feature is preferably generated by iterative sampling: the initialized bird's-eye view feature passes through the temporal attention module, the multi-modal information interaction fusion module, and the FFN and separable convolution module to obtain the first output of the bird's-eye view feature.
In order to improve the learning effect of the bird's-eye view features on the multi-modal information and the time sequence information, the generated bird's-eye view features may be sent to the temporal attention module, the multi-modal information interaction fusion module, the FFN and the separable convolution module again, and the multi-modal feature corresponding sampling calculation attention may be performed again, and in some embodiments of the present disclosure, the iteration may be performed 6 times (refer to fig. 3).
The multi-task loss is calculated on the bird's-eye view feature output at the end of the iterations; thanks to the good compatibility of the bird's-eye view feature form, existing task heads such as 2D/3D detection and segmentation can be readily inherited.
In some embodiments of the present disclosure, only the 3D target detection head of DETR3D is adopted during whole-model training: a 6-layer Decoder is used as the 3D target detection head, and each Object Query predicts the center point of an object on the BEV Map to realize 3D target detection. The task loss is calculated with the 3D target detection head on the bird's-eye view feature generated in each iteration.
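Putting the pieces together, the iterative generation could look like the following sketch; the layer names, the dictionary interface, and the assumption that a DETR3D-style detection head computes a loss on every iteration's output are all illustrative and not the patent's implementation.

```python
def generate_bev(init_bev, hist_bevs, modal_feats, layers, num_iters=6):
    """Apply the temporal attention, image sampling, multi-modal fusion and
    FFN + separable-convolution layers iteratively (6 times here, matching the
    exemplary setting above) and return every iteration's BEV feature."""
    bev = init_bev
    outputs = []
    for _ in range(num_iters):
        bev = layers["temporal"](bev, hist_bevs)
        bev = layers["image_sampling"](bev, modal_feats["camera"])
        bev = layers["modal_fusion"](bev, modal_feats.get("lidar"),
                                     modal_feats.get("radar"))
        bev = layers["ffn_sepconv"](bev)
        outputs.append(bev)            # a detection head computes a loss per iteration
    return outputs
```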
The aerial view feature generation device based on multi-modal fusion can be realized based on a computer software architecture.
Fig. 11 to 12 are schematic block diagrams of the structures of the bird's-eye view feature generation device 1000 based on multi-modal fusion, which adopts a hardware implementation of the processing system according to one embodiment of the present disclosure.
The bird's-eye view feature generation device 1000 based on multi-modal fusion can comprise corresponding modules for executing each or several steps in the flow chart. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
The hardware architecture may be implemented with a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 1100 couples various circuits including the one or more processors 1200, the memory 1300, and/or the hardware modules together. The bus 1100 may also connect various other circuits 1400 such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one connection line is shown, but this does not indicate only one bus or one type of bus.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the preferred embodiment of the present disclosure in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the implementation of the present disclosure. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via memory and/or a communication interface. When the software program is loaded into memory and executed by a processor, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Further, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the method implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a readable storage medium, and when executed, the program may include one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The present disclosure also provides an electronic device, comprising: a memory storing execution instructions; and the processor executes the execution instructions stored in the memory, so that the processor executes the aerial view feature generation method based on multi-modal fusion according to any one of the embodiments of the disclosure.
The present disclosure also provides a readable storage medium, in which execution instructions are stored, and when executed by a processor, the execution instructions are used to implement the bird's-eye view feature generation method based on multi-modal fusion according to any one of the embodiments of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program/instructions that when executed by a processor implement the method for generating a bird's eye view feature based on multi-modal fusion of any of the embodiments of the present disclosure.
In the description herein, reference to the description of the terms "one embodiment/implementation," "some embodiments/implementations," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, the schematic representations of the terms described above are not necessarily the same embodiment/mode or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. Furthermore, the various embodiments/aspects or examples and features of the various embodiments/aspects or examples described in this specification can be combined and combined by those skilled in the art without being mutually inconsistent.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are provided merely for clarity of explanation and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (10)

1. A bird's-eye view feature generation method based on multi-modal fusion is characterized by comprising the following steps:
extracting the acquisition information of all vehicle-mounted sensors of the vehicle at the current moment, wherein the vehicle-mounted sensors at least comprise a first modal sensor and a second modal sensor so as to at least obtain first modal acquisition information and second modal acquisition information;
extracting a first modal feature of the first modal acquisition information by using a first modal feature extraction network, and extracting a second modal feature of the second modal acquisition information by using a second modal feature extraction network;
acquiring the bird's-eye view feature after initialization of the current time based on the bird's-eye view feature at the historical time;
obtaining a bird's-eye view feature having first modal feature information based on the first modal feature and the bird's-eye view feature after initialization of the current time;
obtaining a bird's-eye view feature with second modal feature information based on the bird's-eye view feature with first modal feature information and the second modal feature; and
accessing the bird's-eye view feature (BEV Map) having the first modal feature information to the bird's-eye view feature having the second modal feature information in the form of a bird's-eye view query vector, and fusing the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information to generate a bird's-eye view feature (BEV Map) with multi-modal fusion information.
2. The bird's eye view feature generation method based on multi-modal fusion according to claim 1, wherein the feature extraction network of each modality performs feature extraction on the collected information of all sensors of the modality to obtain a feature space of the modality, and the feature extraction networks of different modalities are independent of each other.
3. The bird's-eye view feature generation method based on multi-modal fusion according to claim 1 or 2, wherein the first modal sensor is a vehicle-mounted camera, the vehicle-mounted sensor further includes a third modal sensor for obtaining third modal information, further third modal features of the third modal acquisition information are extracted using a third modal feature extraction network, further bird's-eye view features having third modal feature information are obtained based on the bird's-eye view features having the first modal feature information and the third modal features, further bird's-eye view features having third modal feature information are accessed from the bird's-eye view features (BEV Map) having the first modal feature information in the form of a bird's-eye view Query vector (BEV Query), and the bird's-eye view features (BEV Map) having the third modal feature information are fused into the bird's-eye view features (BEV Map) having the first modal feature information, and the bird's-eye view features (BEV Map) having the multi-modal fusion information are generated.
4. The method of generating a bird's-eye view feature based on multi-modal fusion according to claim 1, wherein acquiring the bird's-eye view feature after initialization of the current time based on the bird's-eye view feature at the historical time comprises:
randomly initializing a learnable aerial view feature taking the vehicle as a center as an aerial view feature at the current moment;
adding a position code into the learnable aerial view feature to obtain a position-coded aerial view of the current frame;
extracting a historical frame aerial view (BEV Map) within a preset time length before the current time;
acquiring the relative pose of the aerial view of the current frame after position coding and the aerial view of the historical frame;
searching corresponding historical frame aerial view query vectors in the historical frame aerial view as time sequence prior information based on the relative pose for the aerial view query vectors of each position point of the aerial view after the position of the current frame is encoded; and
and adding time sequence prior information to the position-coded aerial view of the current frame to obtain the aerial view characteristics after initialization of the current moment.
5. The bird's eye view feature generation method based on multi-modal fusion of claim 1, wherein the first modal feature is an image feature;
obtaining a bird's-eye view feature having first modal feature information based on the first modal feature and the bird's-eye view feature after initialization of the current time, including:
and carrying out spatial feature sampling on the first modal feature based on the geometric corresponding relation between the bird's-eye view visual angle and the image visual angle so as to obtain the bird's-eye view feature with the first modal feature information.
6. The bird's-eye view feature generation method based on multi-modal fusion of claim 5, wherein the second modal features are BEV features generated by a lidar point cloud obtained through a second modal feature extraction network;
obtaining a bird's-eye view feature having second modal feature information based on the bird's-eye view feature having first modal feature information and the second modal feature, comprising:
copying a bird's-eye view feature with first modal feature information as a bird's-eye view feature (BEV Map) for acquiring the modal information of the laser radar;
finding a corresponding feature in a second modal feature by the aerial view feature (BEV Map) of the acquired laser radar modal information in a projection mode; and
fusing laser radar characteristic information, which is the corresponding characteristic found in the second modal characteristic, to a bird's-eye view characteristic (BEV Map) of the collected laser radar modal information to generate a bird's-eye view characteristic (BEV Map) with laser radar modal characteristic information;
optionally, fusing the laser radar feature information, which is the corresponding feature found in the second modality feature, to a bird's eye view feature (BEV Map) of the collected laser radar modality information to generate a bird's eye view feature (BEV Map) with laser radar modality feature information, including:
predicting k offset displacements at the projection point positions of the corresponding features through an MLP network, and adding each offset displacement to the projection point position to obtain k additional feature sampling points, so that k +1 feature sampling points are obtained by each aerial view Query vector (BEV Query) of the aerial view features (BEV Map) for acquiring laser radar modal information;
taking a bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking a sampling point and characteristics of the sampling point as a Key (Key) and a Value (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query);
optionally, the second modality features are millimeter wave radar point cloud features obtained through a second modality feature extraction network;
obtaining a bird's-eye view feature having second modal feature information based on the bird's-eye view feature having first modal feature information and the second modal feature, comprising:
copying a bird's-eye view feature with first modal feature information as a bird's-eye view feature (BEV Map) for acquiring millimeter wave radar modal information;
finding a corresponding feature in a second modal feature by the aerial view feature (BEV Map) of the acquired millimeter wave radar modal information in a projection mode; and
fusing millimeter wave radar characteristic information, which is the corresponding characteristic found in the second modal characteristic, to a bird's-eye view characteristic (BEV Map) of the collected millimeter wave radar modal information to generate a bird's-eye view characteristic (BEV Map) with millimeter wave radar modal characteristic information;
optionally, fusing millimeter wave radar feature information, which is the corresponding feature found in the second-modality features, to the bird's eye view feature (BEV Map) of the collected millimeter wave radar modality information to generate a bird's eye view feature (BEV Map) with millimeter wave radar modality feature information, including:
predicting K offset displacements for the projection point positions of the corresponding features through an MLP network, and adding each offset displacement to the projection point positions to obtain additional K feature sampling points, so that K +1 feature sampling points are obtained by each bird's-eye view Query vector (BEV Query) of the bird's-eye view features (BEV Map) of the millimeter wave radar modal information; and
taking a bird's-eye view Query vector (BEV Query) as a cross attention Query vector (Query), and taking a sampling point and characteristics of the sampling point as a Key (Key) and a Value (Value) of cross attention respectively, and performing cross attention calculation to update the bird's-eye view Query vector (BEV Query);
optionally, the third mode is a laser radar mode or a millimeter wave radar mode;
optionally, the generating a bird's-eye view feature (BEV Map) having multi-modal fusion information by accessing the bird's-eye view feature (BEV Map) having first modal feature information to the bird's-eye view feature (BEV Map) having second modal feature information in the form of a bird's-eye view Query vector (BEV Query), and fusing the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information, includes:
using BEV Query with bird's-eye view feature (BEV Map) of the first modal feature information as cross attention Query, using BEV Query with bird's-eye view feature (BEV Map) of the second modal feature information as cross attention Key and Value, and for each BEV Query with bird's-eye view feature (BEV Map) of the first modal feature information, finding a corresponding BEV Query on the bird's-eye view feature (BEV Map) of the second modal feature information;
predicting k offset displacements with the corresponding BEV Query, each offset position plus the coordinates of the corresponding BEV Query to obtain k additional BEV queries on the bird's-eye view feature (BEV Map) having the second modality feature information, and then each BEV Query having the bird's-eye view feature (BEV Map) having the first modality feature information obtaining k +1 corresponding BEV queries on the bird's-eye view feature (BEV Map) having the second modality feature information; and
taking the BEV Query with the bird's-eye view feature (BEV Map) of the first modal feature information as a cross attention Query, taking the corresponding BEV Query on the bird's-eye view feature (BEV Map) of the second modal feature information as a Key and a Value to perform cross attention calculation, and obtaining the BEV Query which is finally output, namely the BEV Query with the bird's-eye view feature (BEV Map) of the multi-modal fusion information;
optionally, the method further comprises:
the aerial view feature (BEV Map) of the multimodal fusion information is input to the FFN and separable convolution module to output an enhanced aerial view feature (BEV Map) of the multimodal fusion information at the current time.
7. A bird's-eye view feature generation device based on multi-modal fusion is characterized by comprising:
the vehicle-mounted sensor acquisition information acquisition module extracts acquisition information of all vehicle-mounted sensors of the vehicle at the current moment, wherein the vehicle-mounted sensors at least comprise a first modal sensor and a second modal sensor so as to at least obtain first modal acquisition information and second modal acquisition information;
a first modality feature extraction network that extracts a first modality feature of the first modality acquisition information;
a second modality feature extraction network that extracts a second modality feature of the second modality acquisition information;
a temporal attention module that acquires, based on the historical time aerial view feature, an aerial view feature after initialization at a current time;
a first modal aerial view feature acquisition module that acquires an aerial view feature having first modal feature information based on the first modal feature and the aerial view feature after initialization of the current time;
a second modal aerial view feature acquisition module that acquires an aerial view feature having second modal feature information based on the aerial view feature having first modal feature information and the second modal feature; and
a multi-modal information fusion module that accesses the bird's-eye view feature (BEV Map) having the first modal feature information to the bird's-eye view feature (BEV Map) having the second modal feature information in the form of a bird's-eye view Query vector (BEV Query), and that fuses the second modal feature information into the bird's-eye view feature (BEV Map) having the first modal feature information to generate the bird's-eye view feature (BEV Map) having the multi-modal fusion information;
optionally, the method further comprises:
an FFN and separable convolution module that processes a bird's eye view feature (BEV Map) of the multi-modal fusion information to output an enhanced bird's eye view feature (BEV Map) of the multi-modal fusion information at a current time.
8. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing the execution instructions stored in the memory to cause the processor to execute the bird's-eye view feature generation method based on multi-modal fusion of any one of claims 1 to 6.
9. A readable storage medium, wherein the readable storage medium has stored therein executable instructions, when executed by a processor, for implementing the bird's-eye view feature generation method based on multi-modal fusion according to any one of claims 1 to 6.
10. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the bird's-eye view feature generation method based on multi-modal fusion of any of claims 1 to 6.
CN202211290367.2A 2022-10-21 2022-10-21 Aerial view feature generation method based on multi-modal fusion Pending CN115578705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211290367.2A CN115578705A (en) 2022-10-21 2022-10-21 Aerial view feature generation method based on multi-modal fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211290367.2A CN115578705A (en) 2022-10-21 2022-10-21 Aerial view feature generation method based on multi-modal fusion

Publications (1)

Publication Number Publication Date
CN115578705A true CN115578705A (en) 2023-01-06

Family

ID=84587202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211290367.2A Pending CN115578705A (en) 2022-10-21 2022-10-21 Aerial view feature generation method based on multi-modal fusion

Country Status (1)

Country Link
CN (1) CN115578705A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965788A (en) * 2023-01-12 2023-04-14 黑龙江工程学院 Point cloud semantic segmentation method based on multi-view image structural feature attention convolution
CN115965788B (en) * 2023-01-12 2023-07-28 黑龙江工程学院 Point cloud semantic segmentation method based on multi-view image structural feature attention convolution
CN116880462A (en) * 2023-03-17 2023-10-13 北京百度网讯科技有限公司 Automatic driving model, training method, automatic driving method and vehicle
CN116012805A (en) * 2023-03-24 2023-04-25 深圳佑驾创新科技有限公司 Object perception method, apparatus, computer device, storage medium, and program product
CN116012805B (en) * 2023-03-24 2023-08-29 深圳佑驾创新科技有限公司 Target perception method, device, computer equipment and storage medium
CN116363615A (en) * 2023-03-27 2023-06-30 小米汽车科技有限公司 Data fusion method, device, vehicle and storage medium
CN116363615B (en) * 2023-03-27 2024-02-23 小米汽车科技有限公司 Data fusion method, device, vehicle and storage medium

Similar Documents

Publication Publication Date Title
Yang et al. Pass: Panoramic annular semantic segmentation
CN115578705A (en) Aerial view feature generation method based on multi-modal fusion
US20200184718A1 (en) Multi-modal data fusion for enhanced 3d perception for platforms
CN110910453B (en) Vehicle pose estimation method and system based on non-overlapping view field multi-camera system
Chen et al. Surrounding vehicle detection using an FPGA panoramic camera and deep CNNs
US20190387209A1 (en) Deep Virtual Stereo Odometry
JP2021089724A (en) 3d auto-labeling with structural and physical constraints
Yang et al. Can we pass beyond the field of view? panoramic annular semantic segmentation for real-world surrounding perception
JP7209115B2 (en) Detection, 3D reconstruction and tracking of multiple rigid objects moving in relatively close proximity
CN112419494A (en) Obstacle detection and marking method and device for automatic driving and storage medium
CN112465970B (en) Navigation map construction method, device, system, electronic device and storage medium
CN115588175A (en) Aerial view characteristic generation method based on vehicle-mounted all-around image
CN114913506A (en) 3D target detection method and device based on multi-view fusion
CN115953535A (en) Three-dimensional reconstruction method and device, computing equipment and storage medium
CN115619928A (en) Training method for three-dimensional scene reconstruction device of multi-camera system
CN115035240B (en) Real-time three-dimensional scene reconstruction method and device
CN114782865B (en) Intersection vehicle positioning method and system based on multi-view and re-recognition
CN116194951A (en) Method and apparatus for stereoscopic based 3D object detection and segmentation
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
Zhang et al. Sst: Real-time end-to-end monocular 3d reconstruction via sparse spatial-temporal guidance
CN116012805B (en) Target perception method, device, computer equipment and storage medium
CN116978010A (en) Image labeling method and device, storage medium and electronic equipment
Huang et al. Overview of LiDAR point cloud target detection methods based on deep learning
CN114648639B (en) Target vehicle detection method, system and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination