CN115965961A - Local-to-global multi-modal fusion method, system, device and storage medium - Google Patents

Local-to-global multi-modal fusion method, system, device and storage medium

Info

Publication number
CN115965961A
Authority
CN
China
Prior art keywords
features
grid
fusion
module
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310160693.XA
Other languages
Chinese (zh)
Other versions
CN115965961B (en)
Inventor
侯跃南
李鑫
马涛
石博天
杨雨辰
刘有权
李怡康
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202310160693.XA priority Critical patent/CN115965961B/en
Publication of CN115965961A publication Critical patent/CN115965961A/en
Application granted granted Critical
Publication of CN115965961B publication Critical patent/CN115965961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the application relates to the technical field of automatic driving, and in particular to a local-to-global multi-modal fusion method, system, device and storage medium. The method comprises the following steps: first, fusing aggregated image features and voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; finally, performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features. The local-to-global multi-modal fusion method provided by the embodiment of the application improves the accuracy of 3D object detection.

Description

Local-to-global multi-modal fusion method, system, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of automatic driving, in particular to a local-to-global multi-modal fusion method, system, device and storage medium.
Background
3D object detection aims to locate and classify objects in 3D space; it is a fundamental perception task and plays a key role in automatic driving. Lidar and cameras are two of the most widely used sensors. Because lidar provides accurate depth and geometric information, 3D object detection methods usually rely on point cloud data acquired by lidar, and lidar-based detectors achieve competitive performance on various benchmarks.
However, due to the inherent limitations of lidar sensors, the point cloud is typically sparse and does not provide sufficient context to distinguish distant or occluded regions, resulting in poor performance in such cases. To improve the performance of 3D object detection, a natural remedy is to supplement the point cloud with the rich semantic and texture information of images. Global fusion is typically employed to enhance the point cloud with image features, i.e., point cloud features are fused with image features over the entire scene. However, methods that use global fusion to enhance the point cloud with image features lack fine-grained local information. For 3D object detection, foreground objects occupy only a small portion of the entire scene, and global fusion alone brings only marginal benefits.
Disclosure of Invention
The embodiments of the application provide a local-to-global multi-modal fusion method, system, device and storage medium, which improve the accuracy of 3D object detection.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a local-to-global multi-modal fusion method, including the following steps: first, fusing aggregated image features and voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; finally, performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
In some exemplary embodiments, fusing the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain the cross-modal features includes: calculating the centroid points of the non-empty voxel features to obtain the voxel features; projecting the voxel features onto the image plane and weighting a set of image features around the reference point to generate the aggregated image features; and fusing the aggregated image features with the voxel features to obtain the cross-modal features.
In some exemplary embodiments, there are a plurality of voxel features, and each voxel feature is represented as a query feature $Q_i$. The sampled image features and the aggregated image features are calculated by the following formula:

$$\hat{F}_I(i) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

wherein $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $\Delta p_{mik}$ and $A_{mik}$ respectively represent the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I$ denotes the image features, sampled at the offset locations $p_i + \Delta p_{mik}$ around the projected reference point $p_i$; $\hat{F}_I(i)$ is the aggregated image feature; and both $\Delta p_{mik}$ and $A_{mik}$ are obtained from the query feature $Q_i$ by linear projection.
In some exemplary embodiments, obtaining the downstream grid features based on the cross-modal features comprises: performing region-of-interest pooling on the cross-modal features to obtain the downstream grid features.
In some exemplary embodiments, a self-attention module is employed to perform internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features.
In some exemplary embodiments, performing internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features using the self-attention module comprises: summing the downstream grid features, the grid features and the locally fused grid features to obtain a total feature; using the self-attention module together with a residual connection module to establish interactions among the non-empty grid point features of the total feature, so as to obtain a bounding box; and refining the bounding box based on the shared flattened features generated by the feature dynamic enhancement module.
In a second aspect, an embodiment of the present application further provides a local-to-global multi-modal fusion system, including: a global fusion module, a local fusion module and a feature dynamic enhancement module that are connected in sequence. The global fusion module is used to fuse the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and to obtain the downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel. The local fusion module is used to encode the position information of the reference points to generate grid features, project the grid centroids onto the image plane and sample image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features. The feature dynamic enhancement module is used to perform multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
In some exemplary embodiments, the global fusion module includes a centroid dynamic fusion processing module and a pooling processing module. The centroid dynamic fusion processing module is used to calculate the centroid points of the non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around the reference point to generate the aggregated image features, and fuse the aggregated image features with the voxel features to obtain the cross-modal features; the pooling processing module is used to obtain the downstream grid features from the cross-modal features. The local fusion module includes a grid dynamic fusion processing module and a position information processing module. The position information processing module is used to encode the position information of the reference points to generate the grid features; the grid dynamic fusion processing module is used to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features. The feature dynamic enhancement module includes a self-attention module and a residual connection module; the self-attention module, together with the residual connection module, establishes interactions among the non-empty grid point features of the total feature to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module.
In addition, the present application also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described local-to-global multimodal fusion method.
In addition, the present application also provides a computer readable storage medium storing a computer program, which when executed by a processor implements the above local-to-global multimodal fusion method.
The technical scheme provided by the embodiment of the application has at least the following advantages:
the embodiment of the application provides a local-to-global multi-modal fusion method, a system, equipment and a storage medium, wherein the method comprises the following steps: firstly, fusing aggregate image features and voxel features by taking a voxel internal center of mass point as a reference point to obtain cross-modal features; obtaining a grid characteristic used for the downstream based on the cross-modal characteristic; the centroid points comprise an original point cloud; next, coding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features; finally, multi-modal fusion processing is performed on the mesh features for downstream, mesh features, and locally fused mesh features.
According to the local-to-global multi-modal fusion method, original contour geometric information, namely a centroid point containing original point clouds in a voxel is taken as a reference point, and more accurate cross-modal alignment fusion between the point cloud points and pixel points is achieved. Meanwhile, aiming at the problem that the foreground object occupies a low proportion of the whole scene, the semantic consistency of the example target can be used as a natural guide for cross-modal fusion, and the example target-level fusion provided by the application provides stronger semantic features for frame refinement. In addition, the application aims at the self-adaptive complementary enhancement of local and global features at an instance level, and provides a dynamic feature aggregation module based on self-attention to combine multi-modal global features with local features for fusion so as to generate a more accurate result and improve the 3D target detection performance.
Drawings
One or more embodiments are illustrated by corresponding figures in the drawings, which are not to be construed as limiting the embodiments, unless expressly stated otherwise, and the drawings are not to scale.
FIG. 1 is a schematic flow chart of a local-to-global multimodal fusion method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a local-to-global multimodal fusion system according to an embodiment of the present application;
FIG. 3 is a block diagram of an overall framework of a local-to-global multimodal fusion system according to an embodiment of the present application;
fig. 4 is a schematic diagram of a global fusion module according to an embodiment of the present application;
fig. 5 is a schematic diagram of a local fusion module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a feature dynamic aggregation module provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As can be seen from the background art, existing methods that enhance a point cloud with image features through global fusion lack fine-grained local information. For 3D object detection, foreground objects occupy only a small portion of the entire scene, and global fusion alone brings only marginal benefits.
Lidar-camera fusion methods have shown impressive performance in 3D object detection. Current multi-modal approaches mainly perform global fusion, in which image features and point cloud features are fused over the whole scene. This approach lacks fine-grained region-level information, resulting in suboptimal fusion performance.
In order to solve the above technical problem, an embodiment of the present application provides a local-to-global multi-modal fusion method, including the following steps: first, fusing aggregated image features and voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel; next, encoding the position information of the reference points to generate grid features, projecting the grid centroids onto the image plane and sampling image features to obtain sampled image features, and fusing the grid features with the sampled image features to obtain locally fused grid features; finally, performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
Because existing methods ignore the original contour geometric information in the process of multi-modal fusion, the application takes the centroid point of the point cloud contained in each voxel (Voxel) as the reference point, achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. Because foreground objects occupy a low proportion of the whole scene, current research pays little attention to enhancing target-level features; the semantic consistency of target instances can serve as a natural guide for cross-modal fusion, and the target-instance-level fusion proposed by the application provides stronger semantic features for detecting objects at different distances. Existing methods generally perform multi-modal global feature fusion and local feature fusion separately and are inefficient; the application instead proposes self-attention-based adaptive complementary enhancement that combines local and global features at the instance level to produce more accurate results.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. It will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in the examples of the present application in order to provide a better understanding of the application; however, the technical solution claimed in the present application can be implemented without these technical details, and various changes and modifications may be made based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides a local-to-global multimodal fusion method, including the following steps:
s1, fusing aggregate image features and voxel features by taking a voxel internal center of mass point as a reference point to obtain cross-modal features; obtaining a grid characteristic used for the downstream based on the cross-modal characteristic; the centroid points comprise the original point cloud.
S2, coding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling image characteristics to obtain sampled image characteristics; and fusing the grid features and the sampled image features to obtain locally fused grid features.
And S3, carrying out multi-mode fusion processing on the grid features, the grid features and the local fusion grid features used for the downstream.
The local-to-global multi-modal fusion method provided by the application is used to complete 3D object detection in a given 3D spatial scene. On the one hand, because existing methods ignore the original contour geometric information in the multi-modal fusion process, the method provided by the application uses this original contour geometric information, namely the centroid point of the point cloud contained in each voxel, as the reference point, and fuses the aggregated image features with the voxel features to obtain cross-modal features, thereby achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels. On the other hand, considering that foreground objects occupy a low proportion of the whole scene and that the semantic consistency of instance targets can serve as a natural guide for cross-modal fusion, the application proposes instance-target-level fusion to provide stronger semantic features for box refinement (Box & Score Refinement). In addition, for adaptive complementary enhancement of local and global features at the instance level, the application proposes a self-attention-based dynamic feature aggregation module that combines multi-modal global features with local features for fusion, so as to produce more accurate results.
Referring to fig. 2, an embodiment of the present application further provides a local-to-global multi-modal fusion system, including: a global fusion module 101, a local fusion module 102 and a feature dynamic enhancement module 103 that are connected in sequence. The global fusion module 101 is configured to fuse the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain cross-modal features, and to obtain the downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in each voxel. The local fusion module 102 is configured to encode the position information of the reference points to generate grid features, project the grid centroids onto the image plane and sample image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features. The feature dynamic enhancement module 103 is configured to perform multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
In some embodiments, the global fusion module 101 includes a centroid dynamic fusion processing module 1011 and a pooling processing module 1012. The centroid dynamic fusion processing module 1011 is used to calculate the centroid points of the non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around the reference point to generate the aggregated image features, and fuse the aggregated image features with the voxel features to obtain the cross-modal features; the pooling processing module 1012 is used to obtain the downstream grid features from the cross-modal features. The local fusion module 102 includes a grid dynamic fusion processing module 1021 and a position information processing module 1022. The position information processing module 1022 is configured to encode the position information of the reference points to generate the grid features; the grid dynamic fusion processing module 1021 is used to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features. The feature dynamic enhancement module 103 includes a self-attention module 1031 and a residual connection module 1032; the self-attention module 1031 is configured to establish, together with the residual connection module 1032, interactions among the non-empty grid point features of the total feature to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module 103.
In this application, we propose a novel local-to-global fusion network (LoGoNet) that performs lidar-camera fusion at both the local and global levels. The application analyzes in detail the reasons for the suboptimal performance of multi-modal fusion, proposes a new network and a new multi-modal fusion scheme based on this analysis, and achieves the best performance on the relevant 3D object detection benchmarks. The local-to-global fusion method and system provided by the present application are described in detail below.
As shown in fig. 3, the present application achieves accurate 3D object detection through combined local-to-global multi-modal fusion. Two modalities, a lidar point cloud (Point Cloud in fig. 3) and multi-view camera images (Multi-camera Images in fig. 3), are used as input to a dedicated multi-modal fusion processing module, which comprises a global fusion module (GoF), a local fusion module (LoF) and a feature dynamic enhancement module (FDA). As shown in fig. 3, object detection in the relevant 3D spatial scene is completed through this multi-modal fusion processing module. The global fusion module (GoF) mainly comprises two flows: centroid dynamic fusion (CDF) processing and region-of-interest pooling; the local fusion module (LoF) mainly comprises two processing modules: grid dynamic fusion (GDF) and a position information encoder (PIE); and the feature dynamic enhancement module (FDA) mainly comprises a self-attention (Self Attention) module and a residual connection block (RCB).
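To make the module composition concrete, the sketch below shows one way the three modules could be wired together in PyTorch-style code; the class names, argument lists and tensor layouts are illustrative assumptions, not the reference implementation of the application.

```python
import torch.nn as nn


class LocalToGlobalFusion(nn.Module):
    """Illustrative composition of the three modules described above:
    GoF (global fusion), LoF (local fusion) and FDA (feature dynamic aggregation)."""

    def __init__(self, gof: nn.Module, lof: nn.Module, fda: nn.Module):
        super().__init__()
        self.gof = gof  # centroid dynamic fusion + region-of-interest pooling
        self.lof = lof  # position information encoder + grid dynamic fusion
        self.fda = fda  # self-attention + residual connection block

    def forward(self, voxel_feats, voxel_coords, image_feats, proposals, calib):
        # Step S1: fuse aggregated image features with voxel features at the
        # voxel centroids, then RoI-pool the cross-modal features.
        downstream_grid = self.gof(voxel_feats, voxel_coords, image_feats, proposals, calib)
        # Step S2: encode grid-point positions, sample image features at the
        # projected grid centroids, and fuse them locally.
        grid, local_grid = self.lof(proposals, image_feats, calib)
        # Step S3: aggregate the three grid features and refine the boxes.
        return self.fda(downstream_grid, grid, local_grid)
```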
In some embodiments, fusing the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain the cross-modal features includes: calculating the centroid points of the non-empty voxel features to obtain the voxel features; projecting the voxel features onto the image plane and weighting a set of image features around the reference point to generate the aggregated image features; and fusing the aggregated image features with the voxel features to obtain the cross-modal features.
It should be noted that the cross-modal features are obtained by fusing the aggregated image features and the voxel features with a cross-attention module.
Fig. 4 shows a schematic structural diagram of the global fusion module (GoF). As shown in fig. 4, the global fusion module (GoF) includes two flows: centroid dynamic fusion (CDF) processing and region-of-interest pooling. First, the centroid points of the non-empty voxel features are calculated, and these centroid points are then projected onto the image plane (Centroid Point Projection). Through learnable dynamic offsets, a set of image features $F_I$ around the reference point is weighted to generate the aggregated image features $\hat{F}_I$; these image features are produced by applying the learned offsets to the image feature map $F_I$. The aggregated image features and the voxel features are then fused by a cross-attention module to produce the cross-modal features.
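As an illustration of the centroid step described above, the following sketch computes the per-voxel centroids of the raw points and projects them onto the image plane; the 3x4 projection-matrix layout and the helper names are assumptions made for this example, not part of the patent.

```python
import torch


def voxel_centroids(points: torch.Tensor, voxel_ids: torch.Tensor, num_voxels: int):
    """Mean of the raw (x, y, z) points falling in each voxel.
    points: (N, >=3) float, voxel_ids: (N,) long index of the voxel each point belongs to."""
    sums = torch.zeros(num_voxels, 3).index_add_(0, voxel_ids, points[:, :3])
    counts = torch.zeros(num_voxels).index_add_(0, voxel_ids, torch.ones(len(points)))
    return sums / counts.clamp(min=1).unsqueeze(-1)  # empty voxels stay at the origin


def project_to_image(centroids: torch.Tensor, proj: torch.Tensor):
    """Project 3D centroids with an assumed 3x4 camera projection matrix."""
    homo = torch.cat([centroids, torch.ones(len(centroids), 1)], dim=1)  # (N, 4)
    uvw = homo @ proj.T                                                  # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                      # pixel (u, v)
```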
In some embodiments, there are a plurality of voxel features, and each voxel feature is represented as a query feature $Q_i$. The cross-modal features are calculated based on the following formula:

$$\hat{F}_I(i) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

wherein $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $\Delta p_{mik}$ and $A_{mik}$ respectively represent the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I$ denotes the image features, sampled at the offset locations $p_i + \Delta p_{mik}$ around the projected reference point $p_i$; $\hat{F}_I(i)$ is the aggregated image feature; and both $\Delta p_{mik}$ and $A_{mik}$ are obtained from the query feature $Q_i$ by linear projection (Linear).
As shown in fig. 5, the local fusion module (LoF) is composed of two main processing modules: grid dynamic fusion (GDF) and a position information encoder (PIE). Grid points (Grid Points) are uniformly sampled within each 3D proposal box, and the position information of the original point cloud is encoded by the position information encoder (PIE) to generate the grid features $F_G$. Then, the calculated grid centroids are projected onto the image plane (Grid Point Projection), and image features are sampled through learned offsets. Finally, similar to the GoF, a cross-attention-based module fuses the grid features and the sampled image features to produce the locally fused grid features $F_G^{LoF}$.
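The position-encoding step could look like the sketch below: grid points are uniformly sampled inside a 3D proposal box and passed through a small MLP to produce the grid features $F_G$. The grid resolution, the omission of box rotation and the MLP sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


def uniform_grid_points(box: torch.Tensor, size: int = 6) -> torch.Tensor:
    """Uniformly sample size**3 grid points inside an axis-aligned 3D box
    given as (cx, cy, cz, dx, dy, dz); box rotation is omitted in this sketch."""
    center, dims = box[:3], box[3:6]
    steps = (torch.arange(size) + 0.5) / size - 0.5          # fractions in (-0.5, 0.5)
    gx, gy, gz = torch.meshgrid(steps, steps, steps, indexing="ij")
    offsets = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)
    return center + offsets * dims                           # (size**3, 3)


class PositionInformationEncoder(nn.Module):
    """Encode grid-point coordinates into grid features F_G (illustrative PIE)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, grid_points: torch.Tensor) -> torch.Tensor:
        return self.mlp(grid_points)                         # (size**3, dim)
```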
In some embodiments, obtaining the downstream grid features based on the cross-modal features comprises: performing region-of-interest pooling on the cross-modal features to obtain the downstream grid features $F_G^{GoF}$. That is, the downstream grid features $F_G^{GoF}$ result from a region-of-interest (ROI) pooling operation (Pooling).
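One simple way to realize the region-of-interest pooling described here is to average, for each grid point, the cross-modal features of the voxels that fall within a fixed radius; the radius-based neighborhood below is an assumption for illustration rather than the pooling operator actually claimed.

```python
import torch


def roi_grid_pool(grid_points: torch.Tensor, voxel_centers: torch.Tensor,
                  voxel_feats: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Average cross-modal voxel features within `radius` of each grid point.
    grid_points: (G, 3), voxel_centers: (V, 3), voxel_feats: (V, C) -> (G, C)."""
    dists = torch.cdist(grid_points, voxel_centers)          # (G, V) pairwise distances
    mask = (dists < radius).float()                          # neighborhood indicator
    counts = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return (mask @ voxel_feats) / counts                     # mean-pooled grid features
```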
In some embodiments, a self-attention module is employed to perform internal aggregation enhancement on the grid features $F_G$, the locally fused grid features $F_G^{LoF}$ and the downstream grid features $F_G^{GoF}$, which includes: summing the downstream grid features $F_G^{GoF}$, the grid features $F_G$ and the locally fused grid features $F_G^{LoF}$ to obtain the total feature $F_S$; using the self-attention module together with the residual connection module to establish interactions among the non-empty grid point features of the total feature, so as to obtain a bounding box; and refining the bounding box based on the shared flattened features generated by the feature dynamic enhancement module.
Specifically, the feature dynamic enhancement module 103 is also referred to as the feature dynamic aggregation module (FDA). As shown in fig. 6, it mainly consists of a self-attention (Self Attention) module and a residual connection block (RCB). Since the internal features of the grid features are independent of one another, a self-attention module is used to perform internal aggregation enhancement in order to better aggregate these features and establish enhanced connections among them. First, the three grid features (Grid Features) are summed to obtain the feature $F_S$, as shown in the following formula:

$$F_S = F_G + F_G^{LoF} + F_G^{GoF}$$

wherein $F_G$ is the grid feature, $F_G^{LoF}$ is the locally fused grid feature, and $F_G^{GoF}$ is the downstream grid feature.
Then, the self-attention module, which consists of a standard Transformer encoder layer, is introduced together with the residual connection block (RCB) to establish interactions among the non-empty grid point features. Finally, the bounding box is refined using the shared flattened features generated by the feature dynamic aggregation module.
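A minimal sketch of the feature dynamic aggregation step under the description above: the three grid features are summed, the grid points interact through a standard Transformer encoder layer acting as the self-attention module, a residual connection block produces a shared flattened feature, and lightweight heads predict the refined box and score. Layer sizes are assumptions, and the restriction to non-empty grid points is omitted for brevity.

```python
import torch
import torch.nn as nn


class FeatureDynamicAggregation(nn.Module):
    """Sum the three grid features, let grid points interact via self-attention,
    and produce a shared flattened feature for box/score refinement (illustrative)."""

    def __init__(self, dim: int = 128, num_grid: int = 216):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.box_head = nn.Linear(dim * num_grid, 7)    # (x, y, z, dx, dy, dz, yaw)
        self.score_head = nn.Linear(dim * num_grid, 1)

    def forward(self, grid_gof, grid_pie, grid_lof):
        # F_S = F_G^GoF + F_G + F_G^LoF, all shaped (B, num_grid, dim)
        total = grid_gof + grid_pie + grid_lof
        attended = self.encoder(total)                          # self-attention module
        shared = (total + self.residual(attended)).flatten(1)   # residual connection block
        return self.box_head(shared), self.score_head(shared)   # box & score refinement
```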
In summary, the application uses the original contour geometric information, namely the centroid point of the original point cloud contained in each voxel, as the reference point and as a guide for global fusion, thereby achieving more accurate cross-modal alignment and fusion between point cloud points and image pixels; local instance-target-level fusion provides stronger semantic features for target objects at different distances; and the self-attention-based dynamic feature aggregation module combines multi-modal global features with local features for fusion, performing adaptive complementary enhancement at the instance level and producing more accurate results.
Compared with the prior art, the invention has the following advantages: (1) global feature fusion is performed by taking the original contour geometric information, namely the centroid point of the point cloud contained in each voxel, as the reference point; (2) local instance-target-level fusion is proposed to provide stronger semantic features for box refinement; (3) a self-attention-based dynamic feature aggregation module is proposed to combine multi-modal global features with local features for fusion, producing more accurate results.
The feasibility of the invention has been verified through experiments, simulation and use. Compared with existing 3D object detection methods, performing global feature fusion and local instance-target-level fusion with the centroid point of the point cloud contained in each voxel as the reference point provides stronger semantic features for box refinement and brings performance gains to object detection; meanwhile, the proposed self-attention-based dynamic feature aggregation module, which combines multi-modal global features with local feature fusion, further improves 3D object detection performance considerably. Compared with existing 3D object detection methods, the best performance is obtained on both public datasets, the Waymo Open Dataset and KITTI; in particular, the method provided by the invention is the first to exceed 80 mAP (L2) across all categories on the Waymo Open Dataset.
Referring to fig. 7, another embodiment of the present application provides an electronic device, including: at least one processor 110; and, a memory 111 communicatively coupled to the at least one processor; wherein the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any of the method embodiments described above.
The memory 111 and the processor 110 are connected by a bus, which may include any number of interconnected buses and bridges linking together various circuits of the one or more processors 110 and the memory 111. The bus may also connect various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 110 is transmitted over a wireless medium through an antenna, which further receives the data and transmits it to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
According to the technical scheme, the embodiment of the application provides a local-to-global multi-modal fusion method, a system, equipment and a storage medium, and the method comprises the following steps: firstly, fusing aggregate image features and voxel features by taking a voxel internal centroid as a reference point to obtain cross-modal features; obtaining a grid characteristic used for the downstream based on the cross-modal characteristic; the centroid points comprise the original point cloud; next, encoding the position information of the reference point to generate grid characteristics; projecting the grid centroid to an image plane, and sampling image features to obtain sampled image features; fusing the grid features and the sampled image features to obtain locally fused grid features; and finally, performing multi-modal fusion processing on the grid features, the grid features and the local fusion grid features used for the downstream.
According to the local-to-global multi-modal fusion method, original contour geometric information, namely a centroid point containing original point clouds in a voxel is taken as a reference point, and more accurate cross-modal alignment fusion between the point cloud points and pixel points is achieved. Meanwhile, aiming at the problem that the foreground object occupies a low proportion of the whole scene, the semantic consistency of the example target can be used as a natural guide for cross-modal fusion, and the example target-level fusion provided by the application provides stronger semantic features for frame refinement. In addition, the application aims at the situation level of the local and global features to perform self-adaptive supplementary enhancement, and provides a dynamic feature aggregation module based on self-attention to combine the multi-modal global features with the local features for fusion so as to generate a more accurate result and improve the 3D target detection performance.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the present application, and that various changes in form and detail may be made therein without departing from the spirit and scope of the present application; the scope of the present disclosure should therefore be defined only by the appended claims.

Claims (10)

1. A method of local-to-global multimodal fusion, comprising:
fusing aggregated image features and voxel features by taking centroid points inside voxels as reference points to obtain cross-modal features;
obtaining downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in the voxels;
encoding position information of the reference points to generate grid features; projecting grid centroids onto an image plane and sampling image features to obtain sampled image features; fusing the grid features with the sampled image features to obtain locally fused grid features; and
performing multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
2. The local-to-global multi-modal fusion method according to claim 1, wherein fusing the aggregated image features and the voxel features by taking the centroid points inside the voxels as reference points to obtain the cross-modal features comprises:
calculating the centroid points of non-empty voxel features to obtain the voxel features;
projecting the voxel features onto the image plane and weighting a set of image features around the reference point to generate the aggregated image features; and
fusing the aggregated image features with the voxel features to obtain the cross-modal features.
3. The local-to-global multi-modal fusion method according to claim 2, wherein there are a plurality of voxel features, each voxel feature being represented as a query feature $Q_i$, and the image features and the aggregated image features are calculated by the following formula:

$$\hat{F}_I(i) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mik} \cdot W_m' \, F_I\!\left(p_i + \Delta p_{mik}\right) \right]$$

wherein $W_m$ and $W_m'$ are learnable weights; $M$ is the number of attention heads; $K$ is the total number of sampling points; $\Delta p_{mik}$ and $A_{mik}$ respectively represent the sampling offset and the attention weight of the $k$-th sampling point in the $m$-th attention head; $F_I$ denotes the image features; $\hat{F}_I(i)$ is the aggregated image feature; and both $\Delta p_{mik}$ and $A_{mik}$ are obtained from the query feature $Q_i$ by linear projection.
4. The local-to-global multi-modal fusion method according to claim 1, wherein obtaining the downstream grid features based on the cross-modal features comprises:
performing region-of-interest pooling on the cross-modal features to obtain the downstream grid features.
5. The local-to-global multi-modal fusion method according to claim 1, wherein a self-attention module is employed to perform internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features.
6. The local-to-global multi-modal fusion method according to claim 5, wherein employing the self-attention module to perform internal aggregation enhancement on the downstream grid features, the grid features and the locally fused grid features comprises:
summing the downstream grid features, the grid features and the locally fused grid features to obtain a total feature;
using the self-attention module together with a residual connection module to establish interactions among the non-empty grid point features of the total feature, so as to obtain a bounding box; and
refining the bounding box based on shared flattened features generated by a feature dynamic enhancement module.
7. A local-to-global multi-modal fusion system, comprising: a global fusion module, a local fusion module and a feature dynamic enhancement module that are connected in sequence;
the global fusion module being configured to fuse aggregated image features and voxel features by taking centroid points inside voxels as reference points to obtain cross-modal features, and to obtain downstream grid features based on the cross-modal features, wherein the centroid points are computed from the original point cloud contained in the voxels;
the local fusion module being configured to encode position information of the reference points to generate grid features, project grid centroids onto an image plane and sample image features to obtain sampled image features, and fuse the grid features with the sampled image features to obtain locally fused grid features; and
the feature dynamic enhancement module being configured to perform multi-modal fusion processing on the downstream grid features, the grid features and the locally fused grid features.
8. The local-to-global multi-modal fusion system according to claim 7, wherein the global fusion module comprises a centroid dynamic fusion processing module and a pooling processing module;
the centroid dynamic fusion processing module is configured to calculate centroid points of non-empty voxel features to obtain the voxel features, project the voxel features onto the image plane and weight a set of image features around the reference point to generate the aggregated image features, and fuse the aggregated image features with the voxel features to obtain the cross-modal features; the pooling processing module is configured to obtain the downstream grid features from the cross-modal features;
the local fusion module comprises a grid dynamic fusion processing module and a position information processing module;
the position information processing module is configured to encode the position information of the reference points to generate the grid features; the grid dynamic fusion processing module is configured to fuse the grid features and the sampled image features based on a cross-attention module to generate the locally fused grid features; and
the feature dynamic enhancement module comprises a self-attention module and a residual connection module, wherein the self-attention module is configured to establish, together with the residual connection module, interactions among the non-empty grid point features of the total feature to obtain a bounding box, and the bounding box is refined based on the shared flattened features generated by the feature dynamic enhancement module.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the local-to-global multimodal fusion method of any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the local-to-global multimodal fusion method of any one of claims 1 to 6.
CN202310160693.XA 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium Active CN115965961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160693.XA CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160693.XA CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115965961A true CN115965961A (en) 2023-04-14
CN115965961B CN115965961B (en) 2024-04-05

Family

ID=87358666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160693.XA Active CN115965961B (en) 2023-02-23 2023-02-23 Local-global multi-mode fusion method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115965961B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433992A (en) * 2023-06-14 2023-07-14 电子科技大学中山学院 Image classification method, device, equipment and medium based on global feature completion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
US20220164597A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for extracting point cloud feature
US20220164566A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for encoding point cloud feature
US20220358328A1 (en) * 2021-05-05 2022-11-10 Motional Ad Llc End-to-end system training using fused images
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164597A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for extracting point cloud feature
US20220164566A1 (en) * 2020-11-20 2022-05-26 Shenzhen Deeproute.Ai Co., Ltd Methods for encoding point cloud feature
US20220358328A1 (en) * 2021-05-05 2022-11-10 Motional Ad Llc End-to-end system training using fused images
CN115393677A (en) * 2021-05-05 2022-11-25 动态Ad有限责任公司 End-to-end system training using fused images
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李鑫 et al.: "A Millimeter-Wave-Radar-Assisted Method for Removing Moving Targets from Lidar Point Clouds", Proceedings of the 14th National Conference on DSP Application Technology, 11 December 2022 (2022-12-11), pages 118-121 *
郑冰清 et al.: "A Visual SLAM Method Fusing Semantic Maps and Loop Closure Detection", Journal of Chinese Inertial Technology, vol. 28, no. 5, 15 October 2020 (2020-10-15), pages 629-637 *

Also Published As

Publication number Publication date
CN115965961B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN113159151B (en) Multi-sensor depth fusion 3D target detection method for automatic driving
CN113936139B (en) Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
US7489812B2 (en) Conversion and encoding techniques
CN110414526B (en) Training method, training device, server and storage medium for semantic segmentation network
CN105160702A (en) Stereoscopic image dense matching method and system based on LiDAR point cloud assistance
CN110276768B (en) Image segmentation method, image segmentation device, image segmentation apparatus, and medium
CN109598754A (en) A kind of binocular depth estimation method based on depth convolutional network
CN109726739A (en) A kind of object detection method and system
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN111028281A (en) Depth information calculation method and device based on light field binocular system
CN115880555B (en) Target detection method, model training method, device, equipment and medium
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN113989758A (en) Anchor guide 3D target detection method and device for automatic driving
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN117496312A (en) Three-dimensional multi-target detection method based on multi-mode fusion algorithm
CN116402876A (en) Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN110276801B (en) Object positioning method and device and storage medium
CN114187208B (en) Semi-global stereo matching method based on fusion cost and self-adaptive penalty term coefficient
CN113269823A (en) Depth data acquisition method and device, storage medium and electronic equipment
CN112489097A (en) Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
CN112508996A (en) Target tracking method and device for anchor-free twin network corner generation
CN116310326A (en) Multi-mode point cloud segmentation method, system, equipment and storage medium
CN116912645A (en) Three-dimensional target detection method and device integrating texture and geometric features
CN114842287B (en) Monocular three-dimensional target detection model training method and device of depth-guided deformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant