CN115965961A - Local-to-global multi-modal fusion method, system, device and storage medium - Google Patents

Local-to-global multi-modal fusion method, system, device and storage medium

Info

Publication number
CN115965961A
Authority
CN
China
Prior art keywords
features
grid
fusion
module
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310160693.XA
Other languages
Chinese (zh)
Other versions
CN115965961B (en)
Inventor
侯跃南
李鑫
马涛
石博天
杨雨辰
刘有权
李怡康
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202310160693.XA priority Critical patent/CN115965961B/en
Publication of CN115965961A publication Critical patent/CN115965961A/en
Application granted granted Critical
Publication of CN115965961B publication Critical patent/CN115965961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the application relates to the technical field of automatic driving, and in particular to a local-to-global multi-modal fusion method, system, device and storage medium. The method comprises the following steps: first, fusing aggregated image features and voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; finally, performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features. The local-to-global multi-modal fusion method provided by the embodiment of the application improves the accuracy of 3D object detection.

Description

Local-to-global multi-modal fusion method, system, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of automatic driving, in particular to a local-to-global multi-modal fusion method, system, device and storage medium.
Background
3D object detection aims to locate and classify objects in 3D space; it is a fundamental perception task and plays a key role in automatic driving. Lidar and cameras are two of the most widely used sensors. Because lidar provides accurate depth and geometric information, 3D object detection methods usually rely on point cloud data acquired by lidar, and lidar-based detectors achieve competitive performance on various benchmarks.
However, due to the inherent limitations of lidar sensors, the point cloud is typically sparse and does not provide sufficient context to distinguish distant or occluded regions, resulting in poor performance in such cases. To improve the performance of 3D object detection, a natural remedy is to supplement the point cloud with the rich semantic and texture information of images. Global fusion is typically employed to enhance the point cloud with image features, i.e., point cloud features are fused with image features over the entire scene. However, methods that use global fusion to enhance the point cloud with image features lack fine-grained local information. For 3D object detection, foreground objects occupy only a small portion of the entire scene, and global fusion alone brings only marginal benefits.
Disclosure of Invention
The embodiments of the application provide a local-to-global multi-modal fusion method, system, device and storage medium, which improve the accuracy of 3D object detection.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a local-to-global multi-modal fusion method, including the following steps: first, fusing aggregated image features and voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; finally, performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
In some exemplary embodiments, fusing the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain the cross-modal features includes: calculating the centroid points of the non-empty voxel features to obtain the voxel features; projecting the voxel features onto the image plane and weighting a set of image features around the reference point to generate the aggregated image features; and fusing the aggregated image features with the voxel features to obtain the cross-modal features.
In some exemplary embodiments, there are a plurality of voxel features, and each voxel feature is represented as a query feature $Q_i$. The sampled image features and the aggregated image features are calculated by the following formula:

$$\hat{F}_I(i) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

wherein $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $\Delta p_{mik}$ and $A_{mik}$ respectively represent the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I$ denotes the image features, sampled at the offset locations $p_i + \Delta p_{mik}$ around the projected reference point $p_i$; $\hat{F}_I(i)$ is the aggregated image feature; and both $\Delta p_{mik}$ and $A_{mik}$ are obtained from the query feature $Q_i$ by linear projection.
In some exemplary embodiments, obtaining the downstream grid features based on the cross-modal features comprises: performing region-of-interest pooling on the cross-modal features to obtain the downstream grid features.
In some exemplary embodiments, a self-attention module is employed to perform internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features.
In some exemplary embodiments, performing internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features using the self-attention module comprises: summing the downstream grid features, the grid features and the locally fused grid features to obtain a total feature; using the self-attention module together with a residual connection module to establish interactions among the non-empty grid point features of the total feature, so as to obtain a bounding box; and refining the bounding box based on the shared flattened features generated by the feature dynamic enhancement module.
In a second aspect, an embodiment of the present application further provides a local-to-global multi-modal fusion system, including: a global fusion module, a local fusion module and a feature dynamic enhancement module that are connected in sequence. The global fusion module is used to fuse the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and to obtain the downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel. The local fusion module is used to encode the position information of the reference points to generate grid features, project the grid centroids onto the image plane and sample image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features. The feature dynamic enhancement module is used to perform multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
In some exemplary embodiments, the global fusion module includes a centroid dynamic fusion processing module and a pooling processing module. The centroid dynamic fusion processing module is used to calculate the centroid points of the non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around the reference point to generate the aggregated image features, and fuse the aggregated image features with the voxel features to obtain the cross-modal features; the pooling processing module is used to obtain the downstream grid features from the cross-modal features. The local fusion module includes a grid dynamic fusion processing module and a position information processing module. The position information processing module is used to encode the position information of the reference points to generate the grid features; the grid dynamic fusion processing module is used to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features. The feature dynamic enhancement module includes a self-attention module and a residual connection module; the self-attention module, together with the residual connection module, establishes interactions among the non-empty grid point features of the total feature to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module.
In addition, the present application also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described local-to-global multimodal fusion method.
In addition, the present application also provides a computer readable storage medium storing a computer program, which when executed by a processor implements the above local-to-global multimodal fusion method.
The technical scheme provided by the embodiment of the application has at least the following advantages:
the embodiment of the application provides a local-to-global multi-modal fusion method, a system, equipment and a storage medium, wherein the method comprises the following steps: firstly, fusing aggregate image features and voxel features by taking a voxel internal center of mass point as a reference point to obtain cross-modal features; obtaining a grid characteristic used for the downstream based on the cross-modal characteristic; the centroid points comprise an original point cloud; next, coding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features; finally, multi-modal fusion processing is performed on the mesh features for downstream, mesh features, and locally fused mesh features.
According to the local-to-global multi-modal fusion method, original contour geometric information, namely a centroid point containing original point clouds in a voxel is taken as a reference point, and more accurate cross-modal alignment fusion between the point cloud points and pixel points is achieved. Meanwhile, aiming at the problem that the foreground object occupies a low proportion of the whole scene, the semantic consistency of the example target can be used as a natural guide for cross-modal fusion, and the example target-level fusion provided by the application provides stronger semantic features for frame refinement. In addition, the application aims at the self-adaptive complementary enhancement of local and global features at an instance level, and provides a dynamic feature aggregation module based on self-attention to combine multi-modal global features with local features for fusion so as to generate a more accurate result and improve the 3D target detection performance.
Drawings
One or more embodiments are illustrated by corresponding figures in the drawings, which are not to be construed as limiting the embodiments, unless expressly stated otherwise, and the drawings are not to scale.
FIG. 1 is a schematic flow chart of a local-to-global multimodal fusion method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a local-to-global multimodal fusion system according to an embodiment of the present application;
FIG. 3 is a block diagram of an overall framework of a local-to-global multimodal fusion system according to an embodiment of the present application;
fig. 4 is a schematic diagram of a global fusion module according to an embodiment of the present application;
fig. 5 is a schematic diagram of a local fusion module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a feature dynamic aggregation module provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As can be seen from the background art, existing methods that enhance a point cloud with image features through global fusion lack fine-grained local information. For 3D object detection, foreground objects occupy only a small portion of the entire scene, and global fusion alone brings only marginal benefits.
Lidar-camera fusion methods have shown impressive performance in 3D object detection. Current multi-modal approaches mainly perform global fusion, in which image features and point cloud features are fused over the whole scene. This approach lacks fine-grained region-level information, resulting in suboptimal fusion performance.
In order to solve the above technical problem, an embodiment of the present application provides a local-to-global multi-modal fusion method, including the following steps: first, fusing aggregated image features and voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; finally, performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
Because existing methods ignore the original contour geometric information in the process of multi-modal fusion, the application takes the centroid point of the point cloud contained in each voxel (Voxel) as the reference point, achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. Because foreground objects occupy a low proportion of the whole scene, current research pays little attention to enhancing target-level features; the semantic consistency of target instances can serve as a natural guide for cross-modal fusion, and the target-instance-level fusion proposed by the application provides stronger semantic features for detecting objects at different distances. Existing methods generally perform multi-modal global feature fusion and local feature fusion separately and are inefficient; the application instead proposes self-attention-based adaptive complementary enhancement that combines local and global features at the instance level to produce more accurate results.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. It will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in the examples of the present application in order to provide a better understanding of the application; however, the technical solution claimed in the present application can be implemented without these technical details, and various changes and modifications may be made based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides a local-to-global multimodal fusion method, including the following steps:
s1, fusing aggregate image features and voxel features by taking a voxel internal center of mass point as a reference point to obtain cross-modal features; obtaining a grid characteristic used for the downstream based on the cross-modal characteristic; the centroid points comprise the original point cloud.
S2, coding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling image characteristics to obtain sampled image characteristics; and fusing the grid features and the sampled image features to obtain locally fused grid features.
And S3, carrying out multi-mode fusion processing on the grid features, the grid features and the local fusion grid features used for the downstream.
The local-to-global multi-modal fusion method provided by the application is used to complete 3D object detection in a given 3D spatial scene. On the one hand, because existing methods ignore the original contour geometric information in the multi-modal fusion process, the method provided by the application uses this original contour geometric information, namely the centroid point of the point cloud contained in each voxel, as the reference point, and fuses the aggregated image features with the voxel features to obtain cross-modal features, thereby achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. On the other hand, considering that foreground objects occupy a low proportion of the whole scene and that the semantic consistency of instance targets can serve as a natural guide for cross-modal fusion, the application proposes instance-target-level fusion to provide stronger semantic features for box refinement (Box & Score Refinement). In addition, for adaptive complementary enhancement of local and global features at the instance level, the application proposes a self-attention-based dynamic feature aggregation module that combines multi-modal global features with local features for fusion, so as to produce more accurate results.
Referring to fig. 2, an embodiment of the present application further provides a local-to-global multi-modal fusion system, including: a global fusion module 101, a local fusion module 102 and a feature dynamic enhancement module 103 that are connected in sequence. The global fusion module 101 is configured to fuse the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and to obtain the downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel. The local fusion module 102 is configured to encode the position information of the reference points to generate grid features, project the grid centroids onto the image plane and sample image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features. The feature dynamic enhancement module 103 is configured to perform multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
In some embodiments, the global fusion module 101 includes a centroid dynamic fusion processing module 1011 and a pooling processing module 1012. The centroid dynamic fusion processing module 1011 is used to calculate the centroid points of the non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around the reference point to generate the aggregated image features, and fuse the aggregated image features with the voxel features to obtain the cross-modal features; the pooling processing module 1012 is used to obtain the downstream grid features from the cross-modal features. The local fusion module 102 includes a grid dynamic fusion processing module 1021 and a position information processing module 1022. The position information processing module 1022 is configured to encode the position information of the reference points to generate the grid features; the grid dynamic fusion processing module 1021 is used to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features. The feature dynamic enhancement module 103 includes a self-attention module 1031 and a residual connection module 1032; the self-attention module 1031 is configured to establish, together with the residual connection module 1032, interactions among the non-empty grid point features of the total feature to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module 103.
In this application, we propose a novel local-to-global fusion network (LoGoNet) that performs lidar-camera fusion at both the local and global levels. The application analyzes in detail the reasons for the suboptimal performance of multi-modal fusion, proposes a new network and a new multi-modal fusion scheme based on this analysis, and achieves the best performance on the relevant 3D object detection benchmarks. The local-to-global fusion method and system provided by the present application are described in detail below.
As shown in fig. 3, the present application achieves accurate 3D object detection through combined local-to-global multi-modal fusion. Two modalities, a lidar point cloud (Point Cloud in fig. 3) and multi-view camera images (Multi-camera Images in fig. 3), are used as input to a dedicated multi-modal fusion processing module, which comprises a global fusion module (GoF), a local fusion module (LoF) and a feature dynamic enhancement module (FDA). As shown in fig. 3, object detection in the relevant 3D spatial scene is completed through this multi-modal fusion processing module. The global fusion module (GoF) mainly comprises two flows: centroid dynamic fusion (CDF) processing and region-of-interest pooling; the local fusion module (LoF) mainly comprises two processing modules: grid dynamic fusion (GDF) and a position information encoder (PIE); and the feature dynamic enhancement module (FDA) mainly comprises a self-attention (Self Attention) module and a residual connection block (RCB).
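To make the module composition concrete, the sketch below shows one way the three modules could be wired together in PyTorch-style code; the class names, argument lists and tensor layouts are illustrative assumptions, not the reference implementation of the application.

```python
import torch.nn as nn


class LocalToGlobalFusion(nn.Module):
    """Illustrative composition of the three modules described above:
    GoF (global fusion), LoF (local fusion) and FDA (feature dynamic aggregation)."""

    def __init__(self, gof: nn.Module, lof: nn.Module, fda: nn.Module):
        super().__init__()
        self.gof = gof  # centroid dynamic fusion + region-of-interest pooling
        self.lof = lof  # position information encoder + grid dynamic fusion
        self.fda = fda  # self-attention + residual connection block

    def forward(self, voxel_feats, voxel_coords, image_feats, proposals, calib):
        # Step S1: fuse aggregated image features with voxel features at the
        # voxel centroids, then RoI-pool the cross-modal features.
        downstream_grid = self.gof(voxel_feats, voxel_coords, image_feats, proposals, calib)
        # Step S2: encode grid-point positions, sample image features at the
        # projected grid centroids, and fuse them locally.
        grid, local_grid = self.lof(proposals, image_feats, calib)
        # Step S3: aggregate the three grid features and refine the boxes.
        return self.fda(downstream_grid, grid, local_grid)
```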
In some embodiments, fusing the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain the cross-modal features includes: calculating the centroid points of the non-empty voxel features to obtain the voxel features; projecting the voxel features onto the image plane and weighting a set of image features around the reference point to generate the aggregated image features; and fusing the aggregated image features with the voxel features to obtain the cross-modal features.
It should be noted that the cross-modal features are obtained by fusing the aggregated image features and the voxel features with a cross-attention module.
Fig. 4 shows a schematic structural diagram of the global fusion module (GoF). As shown in fig. 4, the global fusion module (GoF) includes two flows: centroid dynamic fusion (CDF) processing and region-of-interest pooling. First, the centroid points of the non-empty voxel features are calculated, and these centroid points are then projected onto the image plane (Centroid Point Projection). Through learnable dynamic offsets, a set of image features $F_I$ around the reference point is weighted to generate the aggregated image features $\hat{F}_I$; these image features are produced by applying the learned offsets to the image feature map $F_I$. The aggregated image features and the voxel features are then fused by a cross-attention module to produce the cross-modal features.
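As an illustration of the centroid step described above, the following sketch computes the per-voxel centroids of the raw points and projects them onto the image plane; the 3x4 projection-matrix layout and the helper names are assumptions made for this example, not part of the patent.

```python
import torch


def voxel_centroids(points: torch.Tensor, voxel_ids: torch.Tensor, num_voxels: int):
    """Mean of the raw (x, y, z) points falling in each voxel.
    points: (N, >=3) float, voxel_ids: (N,) long index of the voxel each point belongs to."""
    sums = torch.zeros(num_voxels, 3).index_add_(0, voxel_ids, points[:, :3])
    counts = torch.zeros(num_voxels).index_add_(0, voxel_ids, torch.ones(len(points)))
    return sums / counts.clamp(min=1).unsqueeze(-1)  # empty voxels stay at the origin


def project_to_image(centroids: torch.Tensor, proj: torch.Tensor):
    """Project 3D centroids with an assumed 3x4 camera projection matrix."""
    homo = torch.cat([centroids, torch.ones(len(centroids), 1)], dim=1)  # (N, 4)
    uvw = homo @ proj.T                                                  # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                      # pixel (u, v)
```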
In some embodiments, there are a plurality of voxel features, and each voxel feature is represented as a query feature $Q_i$. The cross-modal features are calculated based on the following formula:

$$\hat{F}_I(i) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

wherein $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $\Delta p_{mik}$ and $A_{mik}$ respectively represent the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I$ denotes the image features, sampled at the offset locations $p_i + \Delta p_{mik}$ around the projected reference point $p_i$; $\hat{F}_I(i)$ is the aggregated image feature; and both $\Delta p_{mik}$ and $A_{mik}$ are obtained from the query feature $Q_i$ by linear projection (Linear).
As shown in fig. 5, the local fusion module (LoF) is composed of two main processing modules: grid dynamic fusion (GDF) and a position information encoder (PIE). Grid points (Grid Points) are uniformly sampled within each 3D proposal box, and the position information of the original point cloud is encoded by the position information encoder (PIE) to generate the grid features $F_G$. Then, the calculated grid centroids are projected onto the image plane (Grid Point Projection), and image features are sampled through learned offsets. Finally, similar to the GoF, a cross-attention-based module fuses the grid features and the sampled image features to produce the locally fused grid features $F_G^{LoF}$.
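The position-encoding step could look like the sketch below: grid points are uniformly sampled inside a 3D proposal box and passed through a small MLP to produce the grid features $F_G$. The grid resolution, the omission of box rotation and the MLP sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


def uniform_grid_points(box: torch.Tensor, size: int = 6) -> torch.Tensor:
    """Uniformly sample size**3 grid points inside an axis-aligned 3D box
    given as (cx, cy, cz, dx, dy, dz); box rotation is omitted in this sketch."""
    center, dims = box[:3], box[3:6]
    steps = (torch.arange(size) + 0.5) / size - 0.5          # fractions in (-0.5, 0.5)
    gx, gy, gz = torch.meshgrid(steps, steps, steps, indexing="ij")
    offsets = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)
    return center + offsets * dims                           # (size**3, 3)


class PositionInformationEncoder(nn.Module):
    """Encode grid-point coordinates into grid features F_G (illustrative PIE)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, grid_points: torch.Tensor) -> torch.Tensor:
        return self.mlp(grid_points)                         # (size**3, dim)
```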
In some embodiments, obtaining the downstream grid features based on the cross-modal features comprises: performing region-of-interest pooling on the cross-modal features to obtain the downstream grid features $F_G^{GoF}$. That is, the downstream grid features $F_G^{GoF}$ result from a region-of-interest (ROI) pooling operation (Pooling).
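One simple way to realize the region-of-interest pooling described here is to average, for each grid point, the cross-modal features of the voxels that fall within a fixed radius; the radius-based neighborhood below is an assumption for illustration rather than the pooling operator actually claimed.

```python
import torch


def roi_grid_pool(grid_points: torch.Tensor, voxel_centers: torch.Tensor,
                  voxel_feats: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Average cross-modal voxel features within `radius` of each grid point.
    grid_points: (G, 3), voxel_centers: (V, 3), voxel_feats: (V, C) -> (G, C)."""
    dists = torch.cdist(grid_points, voxel_centers)          # (G, V) pairwise distances
    mask = (dists < radius).float()                          # neighborhood indicator
    counts = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return (mask @ voxel_feats) / counts                     # mean-pooled grid features
```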
In some embodiments, a self-attention module is employed to perform internal aggregation enhancement on the grid features $F_G$, the locally fused grid features $F_G^{LoF}$ and the downstream grid features $F_G^{GoF}$, which includes: summing the downstream grid features $F_G^{GoF}$, the grid features $F_G$ and the locally fused grid features $F_G^{LoF}$ to obtain the total feature $F_S$; using the self-attention module together with the residual connection module to establish interactions among the non-empty grid point features of the total feature, so as to obtain a bounding box; and refining the bounding box based on the shared flattened features generated by the feature dynamic enhancement module.
Specifically, the feature dynamic enhancement module 103 is also referred to as the feature dynamic aggregation module (FDA). As shown in fig. 6, it mainly consists of a self-attention (Self Attention) module and a residual connection block (RCB). Since the internal features of the grid features are independent of one another, a self-attention module is used to perform internal aggregation enhancement in order to better aggregate these features and establish enhanced connections among them. First, the three grid features (Grid Features) are summed to obtain the feature $F_S$, as shown in the following formula:

$$F_S = F_G + F_G^{LoF} + F_G^{GoF}$$

wherein $F_G$ is the grid feature, $F_G^{LoF}$ is the locally fused grid feature, and $F_G^{GoF}$ is the downstream grid feature.
Then, the self-attention module, which consists of a standard Transformer encoder layer, is introduced together with the residual connection block (RCB) to establish interactions among the non-empty grid point features. Finally, the bounding box is refined using the shared flattened features generated by the feature dynamic aggregation module.
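A minimal sketch of the feature dynamic aggregation step under the description above: the three grid features are summed, the grid points interact through a standard Transformer encoder layer acting as the self-attention module, a residual connection block produces a shared flattened feature, and lightweight heads predict the refined box and score. Layer sizes are assumptions, and the restriction to non-empty grid points is omitted for brevity.

```python
import torch
import torch.nn as nn


class FeatureDynamicAggregation(nn.Module):
    """Sum the three grid features, let grid points interact via self-attention,
    and produce a shared flattened feature for box/score refinement (illustrative)."""

    def __init__(self, dim: int = 128, num_grid: int = 216):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.box_head = nn.Linear(dim * num_grid, 7)    # (x, y, z, dx, dy, dz, yaw)
        self.score_head = nn.Linear(dim * num_grid, 1)

    def forward(self, grid_gof, grid_pie, grid_lof):
        # F_S = F_G^GoF + F_G + F_G^LoF, all shaped (B, num_grid, dim)
        total = grid_gof + grid_pie + grid_lof
        attended = self.encoder(total)                          # self-attention module
        shared = (total + self.residual(attended)).flatten(1)   # residual connection block
        return self.box_head(shared), self.score_head(shared)   # box & score refinement
```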
In summary, the application uses the original contour geometric information, namely the centroid point of the original point cloud contained in each voxel, as the reference point and as a guide for global fusion, thereby achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels; local instance-target-level fusion provides stronger semantic features for target objects at different distances; and the self-attention-based dynamic feature aggregation module combines multi-modal global features with local features for fusion, performing adaptive complementary enhancement at the instance level and producing more accurate results.
Compared with the prior art, the invention has the following advantages: (1) global feature fusion is performed by taking the original contour geometric information, namely the centroid point of the point cloud contained in each voxel, as the reference point; (2) local instance-target-level fusion is proposed to provide stronger semantic features for box refinement; (3) a self-attention-based dynamic feature aggregation module is proposed to combine multi-modal global features with local features for fusion, producing more accurate results.
The feasibility of the invention has been verified through experiments, simulation and use. Compared with existing 3D object detection methods, performing global feature fusion and local instance-target-level fusion with the centroid point of the point cloud contained in each voxel as the reference point provides stronger semantic features for box refinement and brings performance gains to object detection; meanwhile, the proposed self-attention-based dynamic feature aggregation module, which combines multi-modal global features with local feature fusion, further improves 3D object detection performance considerably. Compared with existing 3D object detection methods, the best performance is obtained on both public datasets, the Waymo Open Dataset and KITTI; in particular, the method provided by the invention is the first to exceed 80 mAP (L2) across all categories on the Waymo Open Dataset.
Referring to fig. 7, another embodiment of the present application provides an electronic device, including: at least one processor 110; and, a memory 111 communicatively coupled to the at least one processor; wherein the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any of the method embodiments described above.
The memory 111 and the processor 110 are connected by a bus, which may include any number of interconnected buses and bridges linking together various circuits of the one or more processors 110 and the memory 111. The bus may also connect various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 110 is transmitted over a wireless medium through an antenna, which further receives the data and transmits it to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
According to the technical scheme, the embodiment of the application provides a local-to-global multi-modal fusion method, a system, equipment and a storage medium, and the method comprises the following steps: firstly, fusing aggregate image features and voxel features by taking a voxel internal centroid as a reference point to obtain cross-modal features; obtaining a grid characteristic used for the downstream based on the cross-modal characteristic; the centroid points comprise the original point cloud; next, encoding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features; and finally, performing multi-modal fusion processing on the grid features, the grid features and the local fusion grid features used for the downstream.
According to the local-to-global multi-modal fusion method, original contour geometric information, namely a centroid point containing original point clouds in a voxel is taken as a reference point, and more accurate cross-modal alignment fusion between the point cloud points and pixel points is achieved. Meanwhile, aiming at the problem that the foreground object occupies a low proportion of the whole scene, the semantic consistency of the example target can be used as a natural guide for cross-modal fusion, and the example target-level fusion provided by the application provides stronger semantic features for frame refinement. In addition, the application aims at the situation level of the local and global features to perform self-adaptive supplementary enhancement, and provides a dynamic feature aggregation module based on self-attention to combine the multi-modal global features with the local features for fusion so as to generate a more accurate result and improve the 3D target detection performance.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the present application, and that various changes in form and detail may be made therein without departing from the spirit and scope of the present application; the scope of the present disclosure should therefore be defined only by the appended claims.

Claims (10)

1. A method of local-to-global multimodal fusion, comprising:
fusing aggregated image features and voxel features by taking centroid points inside voxels as reference points to obtain cross-modal features;
obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in the voxels;
encoding position information of the reference points to generate grid features; projecting grid centroids onto an image plane and sampling image features to obtain sampled image features; fusing the grid features with the sampled image features to obtain locally fused grid features; and
performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
2. The local-to-global multi-modal fusion method according to claim 1, wherein fusing the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain the cross-modal features comprises:
calculating the centroid points of non-empty voxel features to obtain the voxel features;
projecting the voxel features onto the image plane and weighting a set of image features around the reference point to generate the aggregated image features; and
fusing the aggregated image features with the voxel features to obtain the cross-modal features.
3. The local-to-global multi-modal fusion method according to claim 2, wherein there are a plurality of voxel features, each voxel feature being represented as a query feature $Q_i$, and the image features and the aggregated image features are calculated by the following formula:

$$\hat{F}_I(i) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

wherein $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $\Delta p_{mik}$ and $A_{mik}$ respectively represent the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I$ denotes the image features; $\hat{F}_I(i)$ is the aggregated image feature; and both $\Delta p_{mik}$ and $A_{mik}$ are obtained from the query feature $Q_i$ by linear projection.
4. The local-to-global multi-modal fusion method according to claim 1, wherein obtaining the downstream grid features based on the cross-modal features comprises:
performing region-of-interest pooling on the cross-modal features to obtain the downstream grid features.
5. The local-to-global multi-modal fusion method according to claim 1, wherein a self-attention module is employed to perform internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features.
6. The local-to-global multi-modal fusion method according to claim 5, wherein employing the self-attention module to perform internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features comprises:
summing the downstream grid features, the grid features and the locally fused grid features to obtain a total feature;
using the self-attention module together with a residual connection module to establish interactions among the non-empty grid point features of the total feature, so as to obtain a bounding box; and
refining the bounding box based on shared flattened features generated by a feature dynamic enhancement module.
7. A local-to-global multi-modal fusion system, comprising: a global fusion module, a local fusion module and a feature dynamic enhancement module that are connected in sequence;
the global fusion module being configured to fuse aggregated image features and voxel features by taking centroid points inside voxels as reference points to obtain cross-modal features, and to obtain downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in the voxels;
the local fusion module being configured to encode position information of the reference points to generate grid features, project grid centroids onto an image plane and sample image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features; and
the feature dynamic enhancement module being configured to perform multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
8. The local-to-global multi-modal fusion system according to claim 7, wherein the global fusion module comprises a centroid dynamic fusion processing module and a pooling processing module;
the centroid dynamic fusion processing module is configured to calculate centroid points of non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around the reference point to generate the aggregated image features, and fuse the aggregated image features with the voxel features to obtain the cross-modal features; the pooling processing module is configured to obtain the downstream grid features from the cross-modal features;
the local fusion module comprises a grid dynamic fusion processing module and a position information processing module;
the position information processing module is configured to encode the position information of the reference points to generate the grid features; the grid dynamic fusion processing module is configured to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features; and
the feature dynamic enhancement module comprises a self-attention module and a residual connection module, wherein the self-attention module is configured to establish, together with the residual connection module, interactions among the non-empty grid point features of the total feature to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the local-to-global multimodal fusion method of any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the local-to-global multimodal fusion method of any one of claims 1 to 6.
CN202310160693.XA 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium Active CN115965961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160693.XA CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160693.XA CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115965961A true CN115965961A (en) 2023-04-14
CN115965961B CN115965961B (en) 2024-04-05

Family

ID=87358666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160693.XA Active CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115965961B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433992A (en) * 2023-06-14 2023-07-14 电子科技大学中山学院 Image classification method, device, equipment and medium based on global feature completion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
US20220164597A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for extracting point cloud feature
US20220164566A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for encoding point cloud feature
US20220358328A1 (en) * 2021-05-05 2022-11-10 Motional Ad Llc End-to-end system training using fused images
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164597A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for extracting point cloud feature
US20220164566A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for encoding point cloud feature
US20220358328A1 (en) * 2021-05-05 2022-11-10 Motional Ad Llc End-to-end system training using fused images
CN115393677A (en) * 2021-05-05 2022-11-25 动态Ad有限责任公司 End-to-end system training using fused images
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李鑫 et al.: "A Millimeter-Wave-Radar-Assisted Method for Removing Moving Targets from Lidar Point Clouds", Proceedings of the 14th National Conference on DSP Application Technology, 11 December 2022 (2022-12-11), pages 118-121 *
郑冰清 et al.: "A Visual SLAM Method Fusing Semantic Maps and Loop Closure Detection", Journal of Chinese Inertial Technology, vol. 28, no. 5, 15 October 2020 (2020-10-15), pages 629-637 *

Also Published As

Publication number Publication date
CN115965961B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN113159151B (en) Multi-sensor depth fusion 3D target detection method for automatic driving
CN113936139B (en) Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
US7489812B2 (en) Conversion and encoding techniques
CN110414526B (en) Training method, training device, server and storage medium for semantic segmentation network
CN105160702A (en) Stereoscopic image dense matching method and system based on LiDAR point cloud assistance
CN110276768B (en) Image segmentation method, image segmentation device, image segmentation apparatus, and medium
CN109598754A (en) A kind of binocular depth estimation method based on depth convolutional network
CN109726739A (en) A kind of object detection method and system
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN111028281A (en) Depth information calculation method and device based on light field binocular system
CN115880555B (en) Target detection method, model training method, device, equipment and medium
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN113989758A (en) Anchor guide 3D target detection method and device for automatic driving
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN117496312A (en) Three-dimensional multi-target detection method based on multi-mode fusion algorithm
CN116402876A (en) Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN110276801B (en) Object positioning method and device and storage medium
CN114187208B (en) Semi-global stereo matching method based on fusion cost and self-adaptive penalty term coefficient
CN113269823A (en) Depth data acquisition method and device, storage medium and electronic equipment
CN112489097A (en) Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
CN112508996A (en) Target tracking method and device for anchor-free twin network corner generation
CN116310326A (en) Multi-mode point cloud segmentation method, system, equipment and storage medium
CN116912645A (en) Three-dimensional target detection method and device integrating texture and geometric features
CN114842287B (en) Monocular three-dimensional target detection model training method and device of depth-guided deformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant