CN115965961B - Local-global multi-mode fusion method, system, equipment and storage medium - Google Patents

Local-global multi-mode fusion method, system, equipment and storage medium

Info

Publication number
CN115965961B
CN115965961B (application CN202310160693.XA)
Authority
CN
China
Prior art keywords
features
grid
module
fusion
modal
Prior art date
Legal status
Active
Application number
CN202310160693.XA
Other languages
Chinese (zh)
Other versions
CN115965961A (en)
Inventor
侯跃南
李鑫
马涛
石博天
杨雨辰
刘有权
李怡康
乔宇
Current Assignee
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202310160693.XA
Publication of CN115965961A
Application granted
Publication of CN115965961B

Landscapes

  • Image Processing (AREA)

Abstract

Embodiments of the present application relate to the technical field of autonomous driving, and in particular to a local-to-global multi-modal fusion method, system, device and storage medium. The method comprises the following steps: first, taking the centroid point within each voxel as a reference point, fusing the aggregated image features and the voxel features to obtain cross-modal features, and obtaining grid features for downstream based on the cross-modal features, the centroid points being computed from the original point cloud; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling the image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; and finally, performing multi-modal fusion processing on the grid features for downstream, the grid features, and the locally fused grid features. The local-to-global multi-modal fusion method improves the accuracy of 3D object detection.

Description

Local-global multi-mode fusion method, system, equipment and storage medium
Technical Field
Embodiments of the present application relate to the technical field of autonomous driving, and in particular to a local-to-global multi-modal fusion method, system, device and storage medium.
Background
3D object detection aims to locate and classify objects in 3D space; it is a fundamental perception task and plays a key role in autonomous driving. Lidar and cameras are two of the most widely used sensors. Since lidar provides accurate depth and geometry information, 3D object detection methods typically rely on point cloud data collected by lidar, and lidar-based detectors have achieved competitive performance on various benchmarks.
However, due to the inherent limitations of lidar sensors, point clouds are often sparse and do not provide enough context to distinguish distant or occluded regions, which degrades performance. A natural remedy for improving 3D object detection is to supplement the point cloud with the rich semantic and texture information of images. Existing methods typically enhance the point cloud through global fusion, i.e., fusing point cloud features with image features over the entire scene. However, global fusion lacks fine-grained local information: for 3D object detection, foreground objects occupy only a small part of the scene, so global fusion alone brings only marginal benefit.
Disclosure of Invention
Embodiments of the present application provide a local-to-global multi-modal fusion method, system, device and storage medium, which can improve the accuracy of 3D object detection.
To solve the above technical problems, in a first aspect, an embodiment of the present application provides a local-to-global multi-modal fusion method, comprising the following steps: first, taking the centroid point within each voxel as a reference point, fusing the aggregated image features and the voxel features to obtain cross-modal features, and obtaining grid features for downstream based on the cross-modal features, the centroid points being computed from the original point cloud; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling the image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; and finally, performing multi-modal fusion processing on the grid features for downstream, the grid features, and the locally fused grid features.
In some exemplary embodiments, fusing the aggregated image features and the voxel features with the intra-voxel centroid point as a reference point to obtain the cross-modal features comprises: calculating the centroid points of the non-empty voxel features to obtain the voxel features; projecting the voxel features onto the image plane and weighting a set of image features around each reference point to generate the aggregated image features; and fusing the aggregated image features and the voxel features to obtain the cross-modal features.
In some exemplary embodiments, the voxel features comprise a plurality of voxel features, each represented as a query feature $Q_i$, and the aggregated image features are calculated by the following formula:

$$\tilde{F}_I^{\,i} = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

where $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $F_I$ and $p_i$ denote the image features and the reference point of the current voxel on the image feature plane, respectively; $\Delta p_{mik}$ and $A_{mik}$ denote the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I(p_i + \Delta p_{mik})$ is the $k$-th sampled image feature; and $\tilde{F}_I^{\,i}$ is the aggregated image feature. Both $\Delta p_{mik}$ and $A_{mik}$ are obtained by linear projection of the query feature $Q_i$.
In some exemplary embodiments, deriving the grid features for downstream based on the cross-modal features comprises: performing region-of-interest pooling on the cross-modal features to obtain the grid features for downstream.
In some exemplary embodiments, a self-attention module is employed to perform internal aggregation and enhancement of the grid features for downstream, the grid features, and the locally fused grid features.
In some exemplary embodiments, employing the self-attention module to perform internal aggregation and enhancement of the grid features for downstream, the grid features, and the locally fused grid features comprises: summing the grid features for downstream, the grid features, and the locally fused grid features to obtain a total feature; employing the self-attention module so that interactions between the total feature and the residual connection module are established on the non-empty grid point features, thereby obtaining a bounding box; and refining the bounding box based on the shared flattened features generated by the feature dynamic enhancement module.
In a second aspect, embodiments of the present application further provide a local-to-global multi-modal fusion system, comprising a global fusion module, a local fusion module and a feature dynamic enhancement module connected in sequence. The global fusion module is configured to fuse the aggregated image features and the voxel features, taking the centroid point within each voxel as a reference point, to obtain cross-modal features, and to obtain grid features for downstream based on the cross-modal features, the centroid points being computed from the original point cloud. The local fusion module is configured to encode the position information of the reference points to generate grid features, project the grid centroids onto the image plane and sample the image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features. The feature dynamic enhancement module is configured to perform multi-modal fusion processing on the grid features for downstream, the grid features, and the locally fused grid features.
In some exemplary embodiments, the global fusion module comprises a centroid point dynamic fusion processing module and a pooling processing module. The centroid point dynamic fusion processing module is configured to calculate the centroid points of the non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around each reference point to generate the aggregated image features, and fuse the aggregated image features and the voxel features to obtain the cross-modal features. The pooling processing module is configured to obtain the grid features for downstream from the cross-modal features. The local fusion module comprises a grid dynamic fusion processing module and a position information processing module; the position information processing module is configured to encode the position information of the reference points to generate the grid features, and the grid dynamic fusion processing module is configured to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features. The feature dynamic enhancement module comprises a self-attention module and a residual connection module; the self-attention module is configured to establish interactions between the total feature and the residual connection module on the non-empty grid point features to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module.
In addition, the present application further provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the local-to-global multi-modal fusion method.
In addition, the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the local-to-global multi-modal fusion method.
The technical scheme provided by the embodiment of the application has at least the following advantages:
the embodiment of the application provides a local-global multi-mode fusion method, a system, equipment and a storage medium, wherein the method comprises the following steps: firstly, fusing the aggregate image features and voxel features by taking the mass center point in the voxel as a reference point to obtain cross-modal features; based on the cross-modal characteristics, grid characteristics for downstream are obtained; the centroid point comprises an origin point cloud; next, encoding the position information of the reference points to generate grid features; projecting the grid centroid to an image plane, and sampling the image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features; and finally, carrying out multi-mode fusion processing on the grid features, the grid features and the locally fused grid features used for downstream.
The local-to-global multi-modal fusion method provided by the present application takes the centroid point of the original point cloud contained in each voxel, i.e., the original contour and geometric information, as the reference point, thereby achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. Meanwhile, since foreground objects occupy only a small proportion of the whole scene, the semantic consistency of instance targets can serve as a natural guide for cross-modal fusion, and instance-level target fusion is proposed to provide stronger semantic features for box refinement. In addition, adaptive complementary enhancement of local and global features is performed at the instance level: a self-attention-based dynamic feature aggregation module combines multi-modal global and local feature fusion to produce more accurate results and improve 3D object detection performance.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, which are not to be construed as limiting the embodiments unless specifically indicated otherwise.
FIG. 1 is a flow chart of a local-to-global multi-modal fusion method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a local-to-global multi-modal fusion system according to an embodiment of the present application;
FIG. 3 is a block diagram of a local-to-global multi-modal fusion system according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a global fusion module according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a local fusion module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a feature dynamic aggregation module according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As described in the background, existing methods that enhance the point cloud with image features through global fusion lack fine-grained local information. For 3D object detection, foreground objects occupy only a small part of the entire scene, so global fusion alone brings only marginal benefit.
Lidar-camera fusion methods have shown impressive performance in 3D object detection. Current multi-modal methods mainly perform global fusion, in which image features and point cloud features are fused over the entire scene. This approach lacks fine-grained region-level information, resulting in suboptimal fusion performance.
To solve the above technical problems, an embodiment of the present application provides a local-to-global multi-modal fusion method, comprising the following steps: first, taking the centroid point within each voxel as a reference point, fusing the aggregated image features and the voxel features to obtain cross-modal features, and obtaining grid features for downstream based on the cross-modal features, the centroid points being computed from the original point cloud; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling the image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; and finally, performing multi-modal fusion processing on the grid features for downstream, the grid features, and the locally fused grid features.
Because existing methods ignore the original contour and geometric information during multi-modal fusion, the present application takes the centroid point of the point cloud contained in each voxel as the reference point, achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. Because foreground objects occupy a low proportion of the whole scene and current research pays little attention to target-level feature enhancement, the semantic consistency of target instances can serve as a natural guide for cross-modal fusion; the present application therefore uses instance-level target fusion to provide stronger semantic features for detecting objects at different distances. Existing methods also typically perform multi-modal global feature fusion and local feature fusion separately and inefficiently; the present application proposes self-attention-based adaptive complementary enhancement that combines local and global features at the instance level, producing more accurate results.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. However, as will be appreciated by those of ordinary skill in the art, in the various embodiments of the present application, numerous technical details have been set forth in order to provide a better understanding of the present application. However, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides a local-to-global multi-modal fusion method, including the following steps:
s1, fusing the aggregate image features and the voxel features by taking the mass center point in the voxel as a reference point to obtain cross-modal features; based on the cross-modal characteristics, grid characteristics for downstream are obtained; the centroid point includes an origin point cloud.
S2, encoding the position information of the reference points to generate grid features; projecting the grid centroid to an image plane, and sampling the image features to obtain sampled image features; and fusing the grid features and the sampled image features to obtain locally fused grid features.
And S3, performing multi-modal fusion processing on the grid features for downstream, the grid features and the locally fused grid features.
The local-to-global multi-modal fusion method is used to complete 3D object detection in the corresponding 3D spatial scene. On the one hand, because existing methods ignore the original contour and geometric information during multi-modal fusion, the present application proposes taking the centroid point of the point cloud contained in each voxel, i.e., the original contour and geometric information, as the reference point and fusing the aggregated image features with the voxel features to obtain the cross-modal features, so that the point-to-pixel cross-modal alignment and fusion are more accurate. On the other hand, since foreground objects occupy only a low proportion of the whole scene, the semantic consistency of instance targets can serve as a natural guide for cross-modal fusion, and instance-level target fusion is proposed to provide stronger semantic features for box refinement. In addition, the present application performs adaptive complementary enhancement of local and global features at the instance level: a self-attention-based dynamic feature aggregation module combines multi-modal global and local feature fusion to produce more accurate results.
Referring to fig. 2, embodiments of the present application also provide a local-to-global multi-modal fusion system, comprising a global fusion module 101, a local fusion module 102 and a feature dynamic enhancement module 103 connected in sequence. The global fusion module 101 is configured to fuse the aggregated image features and the voxel features, taking the centroid point within each voxel as a reference point, to obtain cross-modal features, and to obtain grid features for downstream based on the cross-modal features, wherein the centroid points are computed from the original point cloud. The local fusion module 102 is configured to encode the position information of the reference points to generate grid features, project the grid centroids onto the image plane and sample the image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features. The feature dynamic enhancement module 103 is configured to perform multi-modal fusion processing on the grid features for downstream, the grid features, and the locally fused grid features.
In some embodiments, the global fusion module 101 comprises a centroid point dynamic fusion processing module 1011 and a pooling processing module 1012. The centroid point dynamic fusion processing module 1011 is configured to calculate the centroid points of the non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around each reference point to generate the aggregated image features, and fuse the aggregated image features and the voxel features to obtain the cross-modal features. The pooling processing module 1012 is configured to obtain the grid features for downstream from the cross-modal features. The local fusion module 102 comprises a grid dynamic fusion processing module 1021 and a position information processing module 1022; the position information processing module 1022 is configured to encode the position information of the reference points to generate the grid features, and the grid dynamic fusion processing module 1021 is configured to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features. The feature dynamic enhancement module 103 comprises a self-attention module 1031 and a residual connection module 1032; the self-attention module 1031 is configured to establish interactions between the total feature and the residual connection module 1032 on the non-empty grid point features to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module 103.
In this application, we propose a novel local-to-global fusion network (LoGoNet) that performs lidar-camera fusion at both the local and global levels. The application analyzes in detail the causes of suboptimal multi-modal fusion performance, proposes a new network and a new multi-modal fusion scheme based on this analysis, and achieves state-of-the-art performance on relevant 3D object detection benchmarks. The local-to-global fusion method and system provided in the present application are described in detail below.
As shown in fig. 3, the present application achieves accurate 3D object detection through combined local-to-global multi-modal fusion. Two modalities, the lidar point cloud (Point Cloud in fig. 3) and multi-view camera images (Multi-camera Images in fig. 3), are taken as input, and a corresponding multi-modal fusion processing module is designed, comprising a global fusion module (GoF), a local fusion module (LoF) and a feature dynamic aggregation module (FDA). As shown in fig. 3, the multi-modal fusion processing module completes object detection in the corresponding 3D spatial scene. The global fusion module (GoF) mainly comprises two processes: centroid dynamic fusion (CDF) and region-of-interest pooling. The local fusion module (LoF) mainly comprises two processing modules: grid dynamic fusion (GDF) and a position information encoder (PIE). The feature dynamic aggregation module (FDA) mainly comprises a self-attention module and a residual connection block (RCB).
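For orientation, the following minimal PyTorch-style sketch shows one possible way the three modules could be wired together. The class names, argument lists and tensor layouts are illustrative assumptions introduced here for clarity, not the exact interfaces of the implementation described above.

```python
# Illustrative wiring of the three fusion stages (hypothetical names and shapes).
import torch
import torch.nn as nn

class LocalGlobalFusionHead(nn.Module):
    def __init__(self, gof: nn.Module, lof: nn.Module, fda: nn.Module):
        super().__init__()
        self.gof = gof  # global fusion: centroid dynamic fusion + ROI pooling
        self.lof = lof  # local fusion: grid dynamic fusion + position information encoder
        self.fda = fda  # feature dynamic aggregation: self-attention + residual block

    def forward(self, voxel_feats, voxel_coords, image_feats, proposals, points):
        # 1) Global fusion: centroid-guided cross-modal fusion, then ROI pooling,
        #    yields the "grid features for downstream".
        roi_grid_feats = self.gof(voxel_feats, voxel_coords, image_feats, proposals, points)
        # 2) Local fusion: grid points inside each proposal are position-encoded (PIE),
        #    projected to the image and fused (GDF) into locally fused grid features.
        grid_feats, local_grid_feats = self.lof(proposals, points, image_feats)
        # 3) Feature dynamic aggregation: sum the three grid features and refine the
        #    boxes with self-attention plus a residual connection.
        refined_boxes = self.fda(roi_grid_feats, grid_feats, local_grid_feats)
        return refined_boxes
```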
In some embodiments, fusing the aggregated image features and the voxel features with the voxel-internal centroid point as a reference point to obtain the cross-modal features comprises: calculating the centroid points of the non-empty voxel features to obtain the voxel features; projecting the voxel features onto the image plane and weighting a set of image features around each reference point to generate the aggregated image features; and fusing the aggregated image features and the voxel features to obtain the cross-modal features.
It should be noted that a cross-attention module is used to fuse the aggregated image features and the voxel features to obtain the cross-modal features.
Fig. 4 shows the structure of the global fusion module (GoF). As shown in fig. 4, the global fusion module (GoF) comprises two flows: centroid dynamic fusion (CDF) and region-of-interest pooling. The point centroids of the non-empty voxel features are first calculated; these centroid points are then projected onto the image plane (Centroid Point Projection), and a learnable dynamic offset is applied to weight a set of image features around each reference point, generating the aggregated image features. The aggregated image features and the voxel features are then fused by a cross-attention module to produce the cross-modal features.
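As a concrete illustration of the centroid computation and projection step, the sketch below averages the raw points assigned to each non-empty voxel and maps the resulting centroids onto the image plane. The tensor layouts and the 3x4 pinhole projection matrix `proj` are assumptions made for this example.

```python
import torch

def voxel_centroids(points: torch.Tensor, voxel_ids: torch.Tensor, num_voxels: int):
    """Average the raw points assigned to each voxel.

    points:    (N, 3) xyz coordinates of the raw point cloud
    voxel_ids: (N,) long tensor giving the voxel index of each point
    Returns (num_voxels, 3) centroids; voxels with no points stay at zero.
    """
    sums = torch.zeros(num_voxels, 3).index_add_(0, voxel_ids, points)
    counts = torch.zeros(num_voxels).index_add_(0, voxel_ids, torch.ones(len(points)))
    return sums / counts.clamp(min=1).unsqueeze(-1)

def project_to_image(centroids: torch.Tensor, proj: torch.Tensor):
    """Project 3D centroids to 2D pixel coordinates with an assumed 3x4 projection matrix."""
    homo = torch.cat([centroids, torch.ones(len(centroids), 1)], dim=1)  # (V, 4) homogeneous
    cam = homo @ proj.T                                                  # (V, 3)
    return cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)                      # (V, 2) pixel (u, v)
```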
In some embodiments, the voxel features comprise a plurality of voxel features, each represented as a query feature $Q_i$, and the aggregated image feature used to form the cross-modal feature is calculated by the following formula:

$$\tilde{F}_I^{\,i} = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

where $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $F_I$ and $p_i$ denote the image features and the reference point of the current voxel on the image feature plane, respectively; $\Delta p_{mik}$ and $A_{mik}$ denote the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I(p_i + \Delta p_{mik})$ is the $k$-th sampled image feature; and $\tilde{F}_I^{\,i}$ is the aggregated image feature. Both $\Delta p_{mik}$ and $A_{mik}$ are obtained by linear projection (Linear) of the query feature $Q_i$.
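A minimal single-image sketch of this deformable-attention-style aggregation is given below: offsets and attention weights are predicted from the query by linear layers, and bilinear sampling stands in for $F_I(p_i + \Delta p_{mik})$. The shared value/output projections, the offset scaling, the normalized-coordinate convention and the use of `grid_sample` are simplifying assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableImageAggregation(nn.Module):
    """Aggregate image features around projected reference points, following the formula above."""
    def __init__(self, dim: int, num_heads: int = 4, num_points: int = 4):
        super().__init__()
        self.m, self.k = num_heads, num_points
        self.offset = nn.Linear(dim, num_heads * num_points * 2)  # predicts dp_mik from Q_i
        self.weight = nn.Linear(dim, num_heads * num_points)      # predicts A_mik from Q_i
        self.value_proj = nn.Linear(dim, dim)                     # stands in for W'_m (shared across heads)
        self.out_proj = nn.Linear(dim, dim)                       # stands in for W_m  (shared across heads)

    def forward(self, query, ref_uv, image_feat):
        # query:      (V, C) voxel/query features Q_i
        # ref_uv:     (V, 2) reference points in [0, 1] normalized image coordinates
        # image_feat: (C, H, W) image feature map F_I
        V, C = query.shape
        offsets = self.offset(query).view(V, self.m * self.k, 2).tanh() * 0.1   # small learned offsets
        attn = self.weight(query).view(V, self.m, self.k).softmax(dim=-1)       # A_mik, normalized over k
        loc = ((ref_uv.unsqueeze(1) + offsets).clamp(0, 1)) * 2 - 1             # to [-1, 1] for grid_sample
        sampled = F.grid_sample(image_feat[None], loc[None], align_corners=False)  # (1, C, V, M*K)
        sampled = self.value_proj(sampled[0].permute(1, 2, 0)).reshape(V, self.m, self.k, C)
        aggregated = (attn.unsqueeze(-1) * sampled).sum(dim=(1, 2))              # sum over heads and points
        return self.out_proj(aggregated)                                         # aggregated image feature
```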
The local fusion module (LoF) is shown in fig. 5 and mainly comprises two processing modules: grid dynamic fusion (GDF) and the position information encoder (PIE). It uniformly samples grid points in each 3D proposal box and encodes the position information of the original point cloud through the position information encoder (PIE) to generate the grid features. The calculated grid centroids are then projected onto the image plane (Grid Point Projection), and the image features are sampled with learned offsets. Finally, similar to GoF, the grid features and the sampled image features are fused by a cross-attention module to produce the locally fused grid features.
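The sketch below illustrates one way the uniform grid sampling and position information encoding could look: grid points are placed uniformly inside each proposal box and an MLP encodes their positions relative to the box center. Box heading, the exact position descriptor and the grid resolution are simplifying assumptions for this example.

```python
import torch
import torch.nn as nn

def uniform_grid_points(boxes: torch.Tensor, grid_size: int = 6):
    """Place grid_size**3 uniformly spaced points inside each 3D proposal box.

    boxes: (B, 6) as (cx, cy, cz, dx, dy, dz); heading is ignored in this sketch.
    Returns (B, grid_size**3, 3) grid point coordinates.
    """
    steps = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size - 0.5  # in (-0.5, 0.5)
    gx, gy, gz = torch.meshgrid(steps, steps, steps, indexing="ij")
    unit = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)          # (G, 3) in the unit box
    centers, dims = boxes[:, None, :3], boxes[:, None, 3:6]
    return centers + unit[None] * dims                                # (B, G, 3)

class PositionInformationEncoder(nn.Module):
    """Hypothetical PIE: encode grid point positions (relative to the box center) with an MLP."""
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, grid_points: torch.Tensor, boxes: torch.Tensor):
        rel = grid_points - boxes[:, None, :3]   # position relative to the proposal center
        return self.mlp(rel)                     # (B, G, out_dim) grid features
```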
In some embodiments, deriving the grid features for downstream based on the cross-modal features comprises: performing region-of-interest (ROI) pooling on the cross-modal features to obtain the grid features for downstream; that is, the grid features for downstream result from an ROI pooling operation.
In some embodiments, a self-attention module is employed to perform internal aggregation and enhancement of the grid features, the locally fused grid features and the grid features for downstream. This comprises: summing the grid features for downstream, the grid features and the locally fused grid features to obtain the total feature $F_S$; employing the self-attention module so that interactions between the total feature $F_S$ and the residual connection module are established on the non-empty grid point features, thereby obtaining a bounding box; and refining the bounding box based on the shared flattened features generated by the feature dynamic aggregation module.
Specifically, the feature dynamic enhancement module 103 is also referred to as the feature dynamic aggregation module (FDA). As shown in fig. 6, it mainly comprises a self-attention (Self Attention) module and a residual connection block (RCB). Since the individual grid features are mutually independent, a self-attention module is used to perform internal aggregation and enhancement so as to better aggregate the internal features and establish enhancing connections among them. First, the three grid features (Grid Features), namely the grid features, the locally fused grid features and the grid features for downstream, are summed to obtain the total feature $F_S$.
The self-attention module comprises a standard Transformer encoder layer, which enables interactions between the total feature and the residual connection block (RCB) to be established on the non-empty grid point features. Finally, the bounding box is refined using the shared flattened features generated by the feature dynamic aggregation module.
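A compact sketch of this aggregation step is shown below: the three grid feature tensors are summed, a standard Transformer encoder layer provides self-attention over the grid points (empty grid points can be masked out), a residual connection preserves the input, and a single linear layer over the flattened features stands in for the box refinement head. The masking, the output parameterization and the default sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureDynamicAggregation(nn.Module):
    """Sum three grid features, apply self-attention with a residual path, refine the box."""
    def __init__(self, dim: int = 64, num_heads: int = 4, num_grid_points: int = 216):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        # Hypothetical refinement head over the shared flattened features:
        # 7 outputs for (x, y, z, dx, dy, dz, heading) residuals.
        self.refine = nn.Linear(dim * num_grid_points, 7)

    def forward(self, roi_grid_feats, grid_feats, local_grid_feats, empty_mask=None):
        # All inputs: (B, G, C) per-proposal grid features; empty_mask: (B, G), True for empty grid points.
        total = roi_grid_feats + grid_feats + local_grid_feats                 # F_S
        attended = self.encoder(total, src_key_padding_mask=empty_mask)        # self-attention over grid points
        fused = total + attended                                               # residual connection
        shared = fused.flatten(start_dim=1)                                    # shared flattened features (B, G*C)
        return self.refine(shared)                                             # (B, 7) refined box parameters
```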
In summary, the present application uses the original contour and geometric information, i.e., the centroid point of the original point cloud within each voxel, as the reference point and as guidance for global fusion, achieving more accurate point-to-pixel cross-modal alignment and fusion; local instance-level target fusion provides stronger semantic features for target objects at different distances; and the self-attention-based dynamic feature aggregation module combines multi-modal global and local feature fusion and performs adaptive complementary enhancement at the instance level to produce more accurate results.
Compared with the prior art, the invention has the following advantages: (1) the centroid point of the point cloud contained in each voxel, i.e., the original contour and geometric information, is used as the reference point for global feature fusion; (2) local instance-level target fusion provides stronger semantic features for box refinement; and (3) a self-attention-based dynamic feature aggregation module is proposed to combine multi-modal global and local feature fusion and produce more accurate results.
The invention has been shown to be feasible through experiments, simulation and practical use. Compared with existing 3D object detection methods, using the centroid point of the in-voxel point cloud as the reference point for global feature fusion and providing stronger semantic features for box refinement through local instance-level target fusion bring performance gains to object detection, while the proposed self-attention-based dynamic feature aggregation module, which combines multi-modal global and local feature fusion, further improves 3D object detection performance. The method achieves state-of-the-art performance on both public datasets, the Waymo Open Dataset and KITTI; in particular, the method provided by the invention is the first to exceed 80 mAP (L2) across all classes on the Waymo Open Dataset.
Referring to fig. 7, another embodiment of the present application provides an electronic device, including: at least one processor 110; and a memory 111 communicatively coupled to the at least one processor; the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any one of the method embodiments described above.
Where the memory 111 and the processor 110 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 110 and the memory 111 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 110 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps in implementing the methods of the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps in the methods of the embodiments described above. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Through the above technical solution, embodiments of the present application provide a local-to-global multi-modal fusion method, system, device and storage medium. The method comprises the following steps: first, taking the centroid point within each voxel as a reference point, fusing the aggregated image features and the voxel features to obtain cross-modal features, and obtaining grid features for downstream based on the cross-modal features, the centroid points being computed from the original point cloud; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling the image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; and finally, performing multi-modal fusion processing on the grid features for downstream, the grid features, and the locally fused grid features.
The local-to-global multi-modal fusion method provided by the present application takes the centroid point of the original point cloud contained in each voxel, i.e., the original contour and geometric information, as the reference point, thereby achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. Meanwhile, since foreground objects occupy only a small proportion of the whole scene, the semantic consistency of instance targets can serve as a natural guide for cross-modal fusion, and instance-level target fusion is proposed to provide stronger semantic features for box refinement. In addition, adaptive complementary enhancement of local and global features is performed at the instance level: a self-attention-based dynamic feature aggregation module combines multi-modal global and local feature fusion to produce more accurate results and improve 3D object detection performance.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of implementing the present application and that various changes in form and details may be made therein without departing from the spirit and scope of the present application. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention shall be defined by the appended claims.

Claims (10)

1. A local-to-global multi-modal fusion method, comprising:
taking the centroid point within the voxel as a reference point, fusing the aggregated image features and the voxel features to obtain cross-modal features;
obtaining grid features for downstream based on the cross-modal features; the centroid point is computed from the original point cloud;
encoding the position information of the reference points to generate grid features; projecting the grid centroid to an image plane, and sampling the image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features;
and performing multi-modal fusion processing on the grid features for downstream, the grid features and the locally fused grid features.
2. The local-global multi-modal fusion method according to claim 1, wherein the fusing the aggregated image features and the voxel features with the voxel internal centroid point as a reference point to obtain cross-modal features includes:
calculating centroid points of non-empty voxel features to obtain voxel features;
projecting the voxel features on an image plane, and weighting a group of image features around the reference point to generate an aggregate image feature;
and fusing the aggregate image features and the voxel features to obtain the cross-modal features.
3. The local-to-global multi-modal fusion method of claim 2, wherein the voxel features comprise a plurality of voxel features, each represented as a query feature $Q_i$, and the aggregated image features are calculated by the following formula:

$$\tilde{F}_I^{\,i} = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

where $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $F_I$ and $p_i$ denote the image features and the reference point of the current voxel on the image feature plane, respectively; $\Delta p_{mik}$ and $A_{mik}$ denote the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I(p_i + \Delta p_{mik})$ is the $k$-th sampled image feature; and $\tilde{F}_I^{\,i}$ is the aggregated image feature. Both $\Delta p_{mik}$ and $A_{mik}$ are obtained by linear projection of the query feature $Q_i$.
4. The local-to-global multi-modal fusion method according to claim 1, wherein the deriving mesh features for downstream based on the cross-modal features comprises:
and carrying out region-of-interest pooling processing on the cross-modal characteristics to obtain grid characteristics for downstream.
5. The local-to-global multi-modal fusion method of claim 1, wherein the grid features for downstream, the grid features, and the locally fused grid features are internally aggregated enhanced with a self-attention module.
6. The local-to-global multi-modal fusion method of claim 5, wherein the employing a self-attention module to internally aggregate enhancements to the grid features for downstream, the grid features, and the locally fused grid features, comprises:
summing the grid features used for downstream, the grid features and the locally fused grid features to obtain total features;
adopting a self-attention module to establish interactions between the total feature and the residual connection module on the non-empty grid point features so as to obtain a bounding box;
refining the bounding box based on the shared flattened features generated by the feature dynamic enhancement module.
7. A local-to-global multimodal fusion system, comprising: the system comprises a global fusion module, a local fusion module and a characteristic dynamic enhancement module which are sequentially connected;
the global fusion module is configured to fuse the aggregated image features and the voxel features, taking the centroid point within the voxel as a reference point, to obtain cross-modal features, and to obtain grid features for downstream based on the cross-modal features; the centroid point is computed from the original point cloud;
the local fusion module is used for encoding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling the image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features;
the feature dynamic enhancement module is used for carrying out multi-mode fusion processing on the grid features used for downstream, the grid features and the locally fused grid features.
8. The local-to-global multi-modal fusion system of claim 7, wherein the global fusion module includes a centroid point dynamic fusion processing module and a pooling processing module;
the centroid point dynamic fusion processing module is used for calculating centroid points of non-empty voxel characteristics to obtain voxel characteristics; projecting the voxel features on an image plane, and weighting a group of image features around the reference point to generate an aggregate image feature; fusing the aggregate image features and the voxel features to obtain cross-modal features; the pooling processing module is used for obtaining grid characteristics for downstream according to the cross-modal characteristics;
the local fusion module comprises a grid dynamic fusion processing module and a position information processing module;
the position information processing module is used for encoding the position information of the reference points to generate grid features; the grid dynamic fusion processing module is used for fusing the grid characteristics and the sampling image characteristics based on the cross attention module so as to generate locally fused grid characteristics;
the feature dynamic enhancement module comprises a self-attention module and a residual connection module, wherein the self-attention module is configured to establish interactions between the total feature and the residual connection module on the non-empty grid point features so as to obtain a bounding box; and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module.
9. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the local-to-global multi-modal fusion method of any one of claims 1 to 6.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the local to global multimodal fusion method of any of claims 1 to 6.
CN202310160693.XA 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium Active CN115965961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160693.XA CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160693.XA CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115965961A CN115965961A (en) 2023-04-14
CN115965961B true CN115965961B (en) 2024-04-05

Family

ID=87358666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160693.XA Active CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115965961B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433992A (en) * 2023-06-14 2023-07-14 电子科技大学中山学院 Image classification method, device, equipment and medium based on global feature completion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene
CN115393677A (en) * 2021-05-05 2022-11-25 动态Ad有限责任公司 End-to-end system training using fused images

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11651052B2 (en) * 2020-11-20 2023-05-16 Shenzhen Deeproute.Ai Co., Ltd Methods for extracting point cloud feature
US11380112B2 (en) * 2020-11-20 2022-07-05 Shenzhen Deeproute.Ai Co., Ltd Methods for encoding point cloud feature

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393677A (en) * 2021-05-05 2022-11-25 动态Ad有限责任公司 End-to-end system training using fused images
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A millimeter-wave radar-assisted method for removing moving targets from lidar point clouds; Li Xin et al.; Proceedings of the 14th National Conference on DSP Application Technology; 2022-12-11; pp. 118-121 *
A visual SLAM method fusing semantic maps and loop closure detection; Zheng Bingqing et al.; Journal of Chinese Inertial Technology; 2020-10-15; Vol. 28, No. 5; pp. 629-637 *

Also Published As

Publication number Publication date
CN115965961A (en) 2023-04-14

Similar Documents

Publication Publication Date Title
KR102126724B1 (en) Method and apparatus for restoring point cloud data
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
US8199977B2 (en) System and method for extraction of features from a 3-D point cloud
CN110414526B (en) Training method, training device, server and storage medium for semantic segmentation network
Chen et al. Transforming a 3-d lidar point cloud into a 2-d dense depth map through a parameter self-adaptive framework
CN110073362A (en) System and method for lane markings detection
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
WO2023164845A1 (en) Three-dimensional reconstruction method, device, system, and storage medium
CN105160702A (en) Stereoscopic image dense matching method and system based on LiDAR point cloud assistance
CN112967345B (en) External parameter calibration method, device and system of fish-eye camera
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN115880555B (en) Target detection method, model training method, device, equipment and medium
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN113989758A (en) Anchor guide 3D target detection method and device for automatic driving
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN116402876A (en) Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN115410167A (en) Target detection and semantic segmentation method, device, equipment and storage medium
CN112907573A (en) Depth completion method based on 3D convolution
CN113592015B (en) Method and device for positioning and training feature matching network
CN116363615B (en) Data fusion method, device, vehicle and storage medium
CN115952248B (en) Pose processing method, device, equipment, medium and product of terminal equipment
CN115866229B (en) Viewing angle conversion method, device, equipment and medium for multi-viewing angle image
CN112489097A (en) Stereo matching method based on mixed 2D convolution and pseudo 3D convolution

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant