CN116645578B - Multi-mode data fusion method and three-dimensional target detection method thereof - Google Patents

Multi-mode data fusion method and three-dimensional target detection method thereof

Info

Publication number
CN116645578B
CN116645578B (application CN202310565983.2A)
Authority
CN
China
Prior art keywords
dimensional
feature
voxel
characteristic
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310565983.2A
Other languages
Chinese (zh)
Other versions
CN116645578A (en)
Inventor
王意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Science and Technology
Original Assignee
Guangdong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Science and Technology filed Critical Guangdong University of Science and Technology
Priority to CN202310565983.2A
Publication of CN116645578A
Application granted
Publication of CN116645578B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G06V 10/16 Image acquisition using multiple overlapping images; Image stitching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/467 Encoded features or binary features, e.g. local binary patterns [LBP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of machine vision, and provides a multi-mode data fusion method and a three-dimensional target detection method thereof. The method comprises the following steps: acquiring an RGBD image and an RGB image; obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD image; obtaining a two-dimensional image feature based on the RGB image; obtaining a second three-dimensional voxel feature according to the depth block information and the two-dimensional image feature; and performing data fusion through a Transformer network according to the first three-dimensional voxel feature and the second three-dimensional voxel feature to obtain a target voxel feature. By combining the RGBD image containing depth information with the ordinary three-channel RGB image and fusing the data through a Transformer network, more accurate and complete target voxel features are obtained; accurate three-dimensional target detection can then be performed based on these target voxel features, which greatly improves the efficiency of three-dimensional target detection and effectively ensures the accuracy of the detection result.

Description

Multi-mode data fusion method and three-dimensional target detection method thereof
Technical Field
The invention relates to the technical field of machine vision, in particular to a multi-mode data fusion method and a three-dimensional target detection method thereof.
Background
With the rapid development of deep learning, machine vision technology has iterated continuously and target detection is being applied ever more widely; for example, in fruit picking, the fruit is first located by target detection and then picked by a robot. A traditional target detection method shoots pictures from several directions, traverses each picture to select candidate target regions, extracts features of those regions, and finally obtains the target position through an SVM classifier. Traversing the pictures has high time complexity, and the extracted two-dimensional image features contain only two-dimensional information, so the authenticity and accuracy of the features are hard to guarantee and the accuracy of target detection is low.
Disclosure of Invention
The invention provides a multi-mode data fusion method, which aims to obtain more accurate target voxel characteristics.
The invention provides a three-dimensional target detection method, which aims to improve the accuracy of three-dimensional target detection.
In a first aspect, the present invention provides a multi-modal data fusion method, including:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In one embodiment, the obtaining, based on the RGBD map, point cloud data, a first three-dimensional voxel feature, and depth block information includes:
obtaining point cloud data through image conversion based on the RGBD map;
according to the point cloud data, a point cloud coding network is utilized to obtain a first three-dimensional voxel characteristic;
and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the point cloud data.
In one embodiment, the obtaining the point cloud data based on the RGBD map through an image conversion formula includes:
based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of the camera on an x-axis and a focal length of the camera on a y-axis;
and obtaining point cloud data through an image conversion formula according to the image coordinates, the depth value, the focal length of the camera on the x axis and the focal length of the camera on the y axis.
In one embodiment, the obtaining depth block information according to the point cloud data by using a distance linear increment discretization algorithm includes:
obtaining the depth range, the number of depth blocks and the index of the range to which the depth value belongs according to the point cloud data;
and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the depth range, the number of the depth blocks and the index of the range to which the depth value belongs.
In one embodiment, the obtaining a second three-dimensional voxel feature according to the depth block information and the two-dimensional image feature includes:
according to the depth block information and the two-dimensional image characteristics, a three-dimensional view cone is obtained;
and obtaining the second three-dimensional voxel characteristic by utilizing a three-dimensional linear interpolation method according to the three-dimensional view cone.
In one embodiment, the obtaining the second three-dimensional voxel feature according to the three-dimensional view cone by using a three-dimensional linear interpolation method includes:
obtaining a view cone block sampling point of the three-dimensional view cone by using a three-dimensional linear interpolation method;
and converting the view cone block sampling points, adjusting their size and splicing the resulting feature blocks to obtain a second three-dimensional voxel feature.
In one embodiment, the performing data fusion through a Transformer network according to the first three-dimensional voxel feature and the second three-dimensional voxel feature to obtain a target voxel feature includes:
carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
and according to the first three-dimensional voxel characteristic and the third three-dimensional voxel characteristic, performing characteristic fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix to obtain a target voxel characteristic.
In one embodiment, the performing feature fusion and matrix conversion according to the first three-dimensional voxel feature and the third three-dimensional voxel feature through a multi-head Transformer network and a linear conversion matrix to obtain a target voxel feature includes:
performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
In a second aspect, the present invention provides a multi-modal data fusion apparatus comprising:
an image acquisition module for: acquiring an RGBD image and an RGB image;
RGBD map processing module, configured to: obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
an RGB diagram processing module configured to: based on the RGB image, obtaining two-dimensional image characteristics;
the obtaining module is used for: obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
the data fusion module is used for: performing data fusion through a multi-head Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In a third aspect, the present invention provides a three-dimensional target detection method, where three-dimensional target detection is performed on target voxel features obtained by the multi-modal data fusion method according to any one of the above-mentioned aspects through a three-dimensional target detection head.
In a fourth aspect, the present invention provides a three-dimensional object detection device, including an object detection module, configured to perform three-dimensional object detection on the object voxel feature obtained by the multi-mode data fusion method according to any one of the above-mentioned aspects through a three-dimensional object detection head.
According to the multi-mode data fusion method provided by the invention, the depth RGBD image containing the depth information and the general three-channel RGB image are combined to obtain more real point cloud data, the first three-dimensional voxel characteristic, the depth block information, the two-dimensional image characteristic and the second three-dimensional voxel characteristic, and then the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic are input into a Transformer network to perform comprehensive data fusion, so that more accurate, complete and clear target voxel characteristics can be obtained.
According to the three-dimensional target detection method provided by the invention, the existing three-dimensional target detection head is utilized, accurate three-dimensional target detection can be performed based on the target voxel characteristics obtained by the multi-mode data fusion method, the three-dimensional target detection precision can be greatly improved, the detection result accuracy is effectively ensured, and the application efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the following description will be given with a brief introduction to the drawings used in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained from these drawings without the inventive effort of a person skilled in the art.
FIG. 1 is a schematic flow chart of a multi-modal data fusion method provided by the invention;
FIG. 2 is a flow chart of the three-dimensional object detection method provided by the invention;
FIG. 3 is a schematic diagram of a multi-modal data fusion apparatus according to the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to the multi-mode data fusion method provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiments of the present invention provide embodiments of a multi-modal data fusion method. It should be noted that, although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in an order different from that shown or described herein.
Referring to fig. 1, fig. 1 is a flow chart of a multi-mode data fusion method provided by the invention. The multi-mode data fusion method provided by the embodiment of the invention comprises the following steps:
step 101, acquiring an RGBD image and an RGB image;
step 102, obtaining point cloud data, a first three-dimensional voxel feature P and depth block information D' based on the RGBD map;
step 103, obtaining a two-dimensional image feature F_2D based on the RGB map;
step 104, obtaining a second three-dimensional voxel feature I according to the depth block information D' and the two-dimensional image feature F_2D;
and step 105, performing data fusion through a multi-head Transformer network according to the first three-dimensional voxel feature P and the second three-dimensional voxel feature I to obtain a target voxel feature V.
The execution body of the embodiment of the invention can be any terminal side device meeting the device requirements, such as a data fusion device, a three-dimensional target detection device and the like.
In step 101, the terminal side device acquires an RGBD map and an RGB map.
It should be noted that an RGBD map combines an ordinary RGB image with a depth map (RGBD = RGB + D (Depth Map)). The depth channel is similar to a gray-scale image in which each pixel value is the actual distance from the sensor to the object surface, and an RGBD map can be captured by an RGBD camera. The RGBD map and the RGB map are typically registered, so their pixel points correspond one-to-one. The RGBD map and the RGB map in the embodiment of the present invention may be images of the space to be detected, and are used to obtain the target voxel feature V for subsequent three-dimensional target detection.
In step 102, the terminal device obtains point cloud data, a first three-dimensional voxel feature P, and depth block information D' based on the RGBD map.
In one embodiment, step 102 may include:
step 1021, obtaining point cloud data through image conversion based on the RGBD map;
step 1022, obtaining a first three-dimensional voxel feature P according to the point cloud data by using an existing point cloud encoding network;
step 1023, obtaining depth block information D' according to the point cloud data by using an existing distance linear increment discretization algorithm.
Specifically, regarding step 1021, the terminal-side device may obtain the image coordinates, the depth value, the focal length of the camera on the x-axis and the focal length of the camera on the y-axis based on the RGBD map, and obtain the point cloud data through image conversion formula (1). In formula (1), (x, y, z) represents the point cloud coordinates, (x', y') represents the image coordinates, D represents the depth value, f_x represents the focal length of the camera on the x-axis, and f_y represents the focal length of the camera on the y-axis. Through this image conversion, the RGBD map can be converted into more realistic point cloud data.
Specifically, with respect to step 1022, the terminal-side device may obtain the first three-dimensional voxel feature P from the point cloud data using an existing point cloud encoding network, such as VoxelNet. The first three-dimensional voxel feature P has size (X_P, Y_P, Z_P, C_F), where (X_P, Y_P, Z_P) represents the grid size of the first three-dimensional voxel feature P and C_F represents the number of channels of the first three-dimensional voxel feature P.
Specifically, with respect to step 1023, the terminal-side device may obtain, from the point cloud data, the depth range [d_min, d_max], the number of depth blocks N_D and the index d_i of the range to which each depth value belongs, and then obtain the depth block information D' according to the distance linear increment discretization algorithm, formula (2). In formula (2), d_c represents the calculated continuous depth values, and the calculated values d_c together constitute the depth block information D'.
In step 103, the terminal-side device obtains the two-dimensional image feature F_2D based on the RGB map.
It should be noted that the terminal-side device may process the RGB map through an existing two-dimensional deep learning backbone network, such as EfficientNet or RegNet, to extract the two-dimensional image feature F_2D. Its size is (W_F, H_F, C_F'), where W_F and H_F represent the width and height of the two-dimensional image feature F_2D respectively, and C_F' represents the number of channels of the two-dimensional image feature F_2D.
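As an illustration of this step, the sketch below uses a torchvision EfficientNet feature extractor as an assumed stand-in for the two-dimensional backbone; the specific backbone, input size and weights are placeholders.

```python
import torch
import torchvision

# Illustrative backbone choice; the patent only names the backbone family.
backbone = torchvision.models.efficientnet_b0(weights=None).features.eval()

rgb = torch.rand(1, 3, 480, 640)          # placeholder RGB image, (N, C, H, W)
with torch.no_grad():
    f2d = backbone(rgb)                    # two-dimensional image feature F_2D
print(f2d.shape)                           # (1, C_F', H_F, W_F), e.g. (1, 1280, 15, 20)
```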
In step 104, the terminal-side device obtains the second three-dimensional voxel feature I according to the depth block information D' and the two-dimensional image feature F_2D.
In one embodiment, step 104 includes:
step 1041, obtaining a three-dimensional view cone M based on the depth block information D' and the two-dimensional image feature F_2D;
step 1042, obtaining a second three-dimensional voxel feature I by using the existing three-dimensional linear interpolation method according to the three-dimensional view cone M.
Specifically, the terminal-side device may perform an outer product of the depth block information D' and the two-dimensional image feature F_2D to obtain a three-dimensional view cone M of size (W_F', H_F', N_D', C_F''), where W_F' and H_F' represent the width and height of the three-dimensional view cone M respectively, N_D' represents the number of depth blocks, and C_F'' represents the number of channels of the three-dimensional view cone M. The outer product of the depth block information D' and the two-dimensional image feature F_2D that yields the three-dimensional view cone M is written as expression (3), in which (i, j) represents the index of a feature pixel.
In particular, after the three-dimensional view cone M is obtained, the terminal-side device may convert it into the second three-dimensional voxel feature I of size (X_I, Y_I, Z_I, C_F'''), where (X_I, Y_I, Z_I) represents the grid size of the second three-dimensional voxel feature I and C_F''' represents the number of channels of the second three-dimensional voxel feature I.
In one embodiment, the terminal-side device may obtain the second three-dimensional voxel feature I as follows: for the i-th view cone block of the three-dimensional view cone M, i ∈ C_F'', a three-dimensional linear interpolation method is used to calculate the view cone block sampling points S_M of the three-dimensional view cone M; the view cone block sampling points S_M are then converted into voxel feature block intermediate points S_I through formula (4) below; the voxel feature block intermediate points S_I are adjusted in size (e.g., padded) to a voxel feature block I_i of the same size as the view cone block; and the voxel feature blocks I_i are spliced to form the second three-dimensional voxel feature I.
CM·S_I = S_M (4)
In formula (4), CM represents the calibration matrix of the camera.
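Formula (4) relates voxel-space intermediate points S_I to frustum-space sampling points S_M through the camera calibration matrix CM, and the feature values at those points are read out by three-dimensional (trilinear) interpolation. The sketch below illustrates only the interpolation step with torch.nn.functional.grid_sample, which performs trilinear sampling on 5D inputs; the sampling coordinates are random placeholders standing in for the normalized result of CM·S_I = S_M.

```python
import torch
import torch.nn.functional as F

# Frustum features M laid out as (N, C, N_D, H_F, W_F) for 3D sampling.
frustum = torch.rand(1, 64, 40, 15, 20)

# S_M: frustum-space sampling points for every voxel of the target grid (X_I, Y_I, Z_I),
# already normalized to [-1, 1] as grid_sample expects.
X_I, Y_I, Z_I = 32, 32, 8
s_m = torch.rand(1, Z_I, Y_I, X_I, 3) * 2 - 1   # in practice: CM·S_I = S_M, then normalize

# Trilinear interpolation of frustum features at the sampling points.
voxel_feat = F.grid_sample(frustum, s_m, mode='bilinear', align_corners=False)
print(voxel_feat.shape)   # (1, 64, Z_I, Y_I, X_I): samples forming the second voxel feature I
```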
In step 105, the terminal-side device performs data fusion through the multi-head Transformer network according to the first three-dimensional voxel feature P and the second three-dimensional voxel feature I to obtain the target voxel feature V.
In one embodiment, step 105 includes:
step 1051, using the non-empty voxel blocks F_P in the first three-dimensional voxel feature P, performing maximum pooling (max-pooling) and data flattening on the second three-dimensional voxel feature I to obtain a third three-dimensional voxel feature F_I of size (L, C_F), where L = X_P × Y_P × Z_P; the non-empty voxel blocks F_P and the third three-dimensional voxel feature F_I have the same dimensions;
step 1052, according to the non-empty voxel blocks F_P of the first three-dimensional voxel feature P and the third three-dimensional voxel feature F_I, performing feature fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix to obtain the target voxel feature V.
In one embodiment, step 1052 may be implemented as follows: according to the non-empty voxel blocks F_P in the first three-dimensional voxel feature P and the third three-dimensional voxel feature F_I, feature splicing is performed through a multi-head Transformer network to obtain a first feature splicing matrix concat(head_1, head_2, …, head_m); the linear conversion matrix W^O is then used to perform matrix conversion on the first feature splicing matrix to obtain a second feature splicing matrix A; finally, the non-empty voxel blocks F_P are fused with the second feature splicing matrix A to obtain the target voxel feature, which can be used for three-dimensional target detection in various applications.
Specifically, the terminal-side device may first regard the non-empty voxel blocks F_P as the Query (the Query is defined as Q_i = F_P·W_i^Q), and regard the third three-dimensional voxel feature F_I as the Key (the Key is defined as K_i = F_I·W_i^K) and the Value (the Value is defined as V_i = F_I·W_i^V); these features are input into an existing multi-head Transformer network, calculated according to formulas (5)-(6) below, and feature splicing is carried out.
Q_i = F_P·W_i^Q, K_i = F_I·W_i^K, V_i = F_I·W_i^V (5)
In formula (5), Q_i represents the query obtained from the non-empty voxel blocks F_P, and K_i and V_i represent the key and value obtained from the third three-dimensional voxel feature F_I; in formula (6), the attention term computed from Q_i and K_i represents a relationship vector, V_i represents the current input, and head_i represents the weighted output of the multi-head Transformer network.
Then, the terminal-side device may splice the outputs of the m Transformer heads to obtain the first feature splicing matrix concat(head_1, head_2, …, head_m). Next, according to formula (7) below, the linear conversion matrix W^O maps the spliced m-head Transformer output (the first feature splicing matrix concat(head_1, head_2, …, head_m)) to the three-dimensional voxel space; this matrix conversion yields the second feature splicing matrix A.
A = concat(head_1, head_2, …, head_m)·W^O (7)
Finally, the non-empty voxel blocks F_P are fused with the second feature splicing matrix A to obtain the target voxel feature V, expressed as V = concat(A, F_P).
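Formulas (5)-(7) describe a multi-head cross-attention in which the non-empty voxel blocks F_P act as queries and the third three-dimensional voxel feature F_I supplies keys and values, followed by concatenation with F_P. The sketch below assumes the standard scaled dot-product attention for formula (6) and bundles W_i^Q, W_i^K, W_i^V and W^O inside nn.MultiheadAttention; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class VoxelFusion(nn.Module):
    """Multi-head cross-attention fusion of F_P (queries) with F_I (keys/values), then concat."""

    def __init__(self, c_f=64, num_heads=4):
        super().__init__()
        # nn.MultiheadAttention bundles W_i^Q, W_i^K, W_i^V and the output projection W^O.
        self.attn = nn.MultiheadAttention(embed_dim=c_f, num_heads=num_heads, batch_first=True)

    def forward(self, f_p, f_i):
        # f_p, f_i: (N, L, C_F) with L = X_P * Y_P * Z_P (flattened voxel blocks)
        a, _ = self.attn(query=f_p, key=f_i, value=f_i)   # A = concat(head_1..head_m) W^O
        return torch.cat([a, f_p], dim=-1)                # V = concat(A, F_P)

fusion = VoxelFusion(c_f=64, num_heads=4)
f_p = torch.rand(1, 512, 64)   # non-empty voxel blocks F_P
f_i = torch.rand(1, 512, 64)   # third three-dimensional voxel feature F_I
v = fusion(f_p, f_i)
print(v.shape)                 # (1, 512, 128): target voxel feature V
```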
According to the multi-mode data fusion method provided by the embodiment of the invention, the RGBD image containing depth information and the ordinary three-channel RGB image are combined to obtain more realistic point cloud data, a first three-dimensional voxel feature, depth block information, a two-dimensional image feature and a second three-dimensional voxel feature; the first and second three-dimensional voxel features are then input into a multi-head Transformer network for comprehensive data fusion, so that more accurate, complete and clear target voxel features can be obtained. These target voxel features can be used for three-dimensional target detection in various applications, such as fruit picking.
Referring to fig. 2, the embodiment of the present invention further provides a three-dimensional target detection method; the execution body may be a terminal-side device such as a three-dimensional target detection device. Using an existing three-dimensional target detection head (a device for implementing three-dimensional target detection), three-dimensional target detection is performed on the target voxel features obtained by the foregoing multi-mode data fusion method: a three-dimensional anchor frame is generated, targets within the anchor frame are classified, and three-dimensional target detection is completed. This greatly improves the accuracy and efficiency of three-dimensional target detection, effectively ensures the accuracy of the detection result, and expands the application scenarios of the present invention.
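The detection head itself is an existing component; as one assumed possibility, an anchor-based head of the kind used with voxel features predicts, for each anchor, a class score and a seven-parameter three-dimensional box. A minimal illustrative sketch:

```python
import torch
import torch.nn as nn

class SimpleDetectionHead(nn.Module):
    """Illustrative anchor-based head: classification and 3D box regression over fused features."""

    def __init__(self, c_in=128, num_anchors=2, num_classes=1):
        super().__init__()
        self.cls = nn.Conv2d(c_in, num_anchors * num_classes, kernel_size=1)
        self.box = nn.Conv2d(c_in, num_anchors * 7, kernel_size=1)  # (x, y, z, w, l, h, yaw)

    def forward(self, bev):
        # bev: (N, C, Y, X) bird's-eye-view map obtained by collapsing the voxel grid's z-axis
        return self.cls(bev), self.box(bev)

head = SimpleDetectionHead(c_in=128)
scores, boxes = head(torch.rand(1, 128, 32, 32))
```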
In addition, three-dimensional target detection can be performed by other three-dimensional target detection technologies meeting requirements based on the target voxel characteristics obtained by the multi-mode data fusion method provided by the embodiment of the invention, so that the accuracy and the efficiency of three-dimensional target detection are improved.
Further, the multi-mode data fusion device provided by the invention and the multi-mode data fusion method provided by the invention are correspondingly referred to each other.
Referring to fig. 3, the multi-modal data fusion apparatus includes:
an image acquisition module 301 for: acquiring an RGBD image and an RGB image;
RGBD map processing module 302 is configured to: obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
an RGB diagram processing module 303, configured to: based on the RGB image, obtaining two-dimensional image characteristics;
an obtaining module 304 for: obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
a data fusion module 305, configured to: performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In one embodiment, the RGBD map processing module 302 may include:
the point cloud data obtaining module is used for: obtaining point cloud data through image conversion based on the RGBD map;
the first three-dimensional voxel feature obtaining module is used for: according to the point cloud data, a point cloud coding network is utilized to obtain a first three-dimensional voxel characteristic;
the depth block information obtaining module is used for: and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the point cloud data.
In one embodiment, the point cloud data obtaining module is specifically used for:
based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of the camera on an x-axis and a focal length of the camera on a y-axis;
and obtaining point cloud data through an image conversion formula according to the image coordinates, the depth value, the focal length of the camera on the x axis and the focal length of the camera on the y axis.
In one embodiment, the depth block information obtaining module is specifically configured to:
obtaining the depth range, the number of depth blocks and the index of the range to which the depth value belongs according to the point cloud data;
and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the depth range, the number of the depth blocks and the index of the range to which the depth value belongs.
In one embodiment, the obtaining module 304 may include:
the three-dimensional view cone obtaining module is used for: according to the depth block information and the two-dimensional image characteristics, a three-dimensional view cone is obtained;
a second three-dimensional voxel feature obtaining module for: and obtaining a second three-dimensional voxel characteristic by utilizing a three-dimensional linear interpolation method according to the three-dimensional view cone.
In one embodiment, the second three-dimensional voxel feature obtaining module is specifically configured to:
obtaining a view cone block sampling point of the three-dimensional view cone by using a three-dimensional linear interpolation method;
and converting the view cone block sampling points, adjusting their size and splicing the resulting feature blocks to obtain a second three-dimensional voxel feature.
In one embodiment, the data fusion module 305 may include:
a third three-dimensional voxel feature obtaining module for: carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
the target voxel feature obtaining module is used for: according to the first three-dimensional voxel characteristic and the third three-dimensional voxel characteristic, performing characteristic fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix to obtain a target voxel characteristic.
In one embodiment, the target voxel feature obtaining module may include:
the first concatenation module is used for: performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
the second splicing module is used for: performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
the feature fusion module is used for: and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
Furthermore, the invention also provides a three-dimensional target detection device, which comprises a target detection module, wherein the target detection module is used for carrying out three-dimensional target detection on the target voxel characteristics obtained according to the multi-mode data fusion method or the multi-mode data fusion device through a three-dimensional target detection head.
Fig. 4 illustrates a schematic structural diagram of an electronic device. As shown in fig. 4, the electronic device may include: a processor 810, a communication interface (Communications Interface) 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a multi-modal data fusion method comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the multi-modal data fusion method provided by the methods described above, the method comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the multi-modal data fusion method provided by the above methods, the method comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A method of multimodal data fusion, comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB map, obtaining a two-dimensional image feature F_2D of size (W_F, H_F, C_F'), where W_F and H_F represent the width and height of the two-dimensional image feature F_2D respectively, and C_F' represents the number of channels of the two-dimensional image feature F_2D;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic, performing data fusion through a Transformer network to obtain a target voxel characteristic;
the obtaining, based on the RGBD map, point cloud data, a first three-dimensional voxel feature, and depth block information includes:
based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of a camera on an x-axis and a focal length of the camera on a y-axis, and obtaining point cloud data through an image conversion formula, wherein the image conversion formula is as follows:
in formula (1), (x, y, z) represents the point cloud coordinates, (x', y') represents the image coordinates, D represents the depth value, f_x represents the focal length of the camera on the x-axis, and f_y represents the focal length of the camera on the y-axis;
according to the point cloud data, obtaining a first three-dimensional voxel feature P by using a point cloud encoding network, the size of the first three-dimensional voxel feature being (X_P, Y_P, Z_P, C_F), where (X_P, Y_P, Z_P) represents the grid size of the first three-dimensional voxel feature P and C_F represents the number of channels of the first three-dimensional voxel feature P;
obtaining, according to the point cloud data, a depth range [d_min, d_max], a number of depth blocks N_D and an index d_i of the range to which the depth value belongs, and obtaining depth block information by using a distance linear increment discretization algorithm, wherein the formula of the distance linear increment discretization algorithm is:
in formula (2), d_c represents the calculated continuous depth values, and the calculated values d_c constitute the depth block information D';
and obtaining a second three-dimensional voxel feature according to the depth block information and the two-dimensional image feature, including:
performing an outer product of the depth block information and the two-dimensional image feature to obtain a three-dimensional view cone M of size (W_F', H_F', N_D', C_F''), where W_F' and H_F' represent the width and height of the three-dimensional view cone M respectively, N_D' represents the number of depth blocks, and C_F'' represents the number of channels of the three-dimensional view cone M, wherein the outer product of the depth block information D' and the two-dimensional image feature F_2D that yields the three-dimensional view cone M is expressed as:
in formula (3), (i, j) represents the index of a feature pixel;
according to the three-dimensional view cone, a second three-dimensional voxel characteristic is obtained by utilizing a three-dimensional linear interpolation method;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic, wherein the method comprises the following steps:
carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
according to the first three-dimensional voxel feature and the third three-dimensional voxel feature, feature fusion and matrix conversion are carried out through a multi-head Transformer network and a linear conversion matrix, and a target voxel feature is obtained;
and performing feature fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a target voxel feature, wherein the method comprises the following steps:
performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
2. The method of claim 1, wherein the obtaining a second three-dimensional voxel feature from the three-dimensional view cone by using a three-dimensional linear interpolation method comprises:
obtaining a view cone block sampling point of the three-dimensional view cone by using a three-dimensional linear interpolation method;
and converting the view cone block sampling points, adjusting their size and splicing the resulting feature blocks to obtain a second three-dimensional voxel feature.
3. A multi-modal data fusion apparatus comprising:
an image acquisition module for: acquiring an RGBD image and an RGB image;
RGBD map processing module, configured to: obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
an RGB diagram processing module configured to: based on the RGB image, obtaining two-dimensional image characteristics;
the obtaining module is used for: obtaining a second three-dimensional voxel feature according to the depth block information and a two-dimensional image feature of size (W_F, H_F, C_F'), where W_F and H_F represent the width and height of the two-dimensional image feature F_2D respectively, and C_F' represents the number of channels of the two-dimensional image feature F_2D;
the data fusion module is used for: performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic;
wherein, RGBD graph processing module includes:
the point cloud data obtaining module is used for: based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of a camera on an x-axis and a focal length of the camera on a y-axis, and obtaining point cloud data through an image conversion formula, wherein the image conversion formula is as follows:
in formula (1), (x, y, z) represents the point cloud coordinates, (x', y') represents the image coordinates, D represents the depth value, f_x represents the focal length of the camera on the x-axis, and f_y represents the focal length of the camera on the y-axis;
the first three-dimensional voxel feature obtaining module is used for: obtaining a first three-dimensional voxel feature P by using a point cloud encoding network according to the point cloud data, the size of the first three-dimensional voxel feature being (X_P, Y_P, Z_P, C_F), where (X_P, Y_P, Z_P) represents the grid size of the first three-dimensional voxel feature P and C_F represents the number of channels of the first three-dimensional voxel feature P;
the depth block information obtaining module is used for: obtaining, according to the point cloud data, a depth range [d_min, d_max], a number of depth blocks N_D and an index d_i of the range to which the depth value belongs, and obtaining depth block information by using a distance linear increment discretization algorithm, wherein the formula of the distance linear increment discretization algorithm is:
in formula (2), d_c represents the calculated continuous depth values, and the calculated values d_c constitute the depth block information D';
and, the obtaining module includes:
the three-dimensional view cone obtaining module is used for: performing an outer product of the depth block information and the two-dimensional image feature to obtain a three-dimensional view cone M of size (W_F', H_F', N_D', C_F''), where W_F' and H_F' represent the width and height of the three-dimensional view cone M respectively, N_D' represents the number of depth blocks, and C_F'' represents the number of channels of the three-dimensional view cone M, wherein the outer product of the depth block information D' and the two-dimensional image feature F_2D that yields the three-dimensional view cone M is expressed as:
in formula (3), (i, j) represents the index of a feature pixel;
a second three-dimensional voxel feature obtaining module for: according to the three-dimensional view cone, a second three-dimensional voxel characteristic is obtained by utilizing a three-dimensional linear interpolation method;
and, the data fusion module includes:
a third three-dimensional voxel feature obtaining module for: carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
the target voxel feature obtaining module is used for: according to the first three-dimensional voxel feature and the third three-dimensional voxel feature, feature fusion and matrix conversion are carried out through a multi-head Transformer network and a linear conversion matrix, and a target voxel feature is obtained;
the target voxel characteristic obtaining module comprises:
the first concatenation module is used for: performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
the second splicing module is used for: performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
the feature fusion module is used for: and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
4. A three-dimensional object detection method, characterized in that three-dimensional object detection is performed on the object voxel characteristics obtained by the multi-modal data fusion method according to claim 1 or 2 by a three-dimensional object detection head.
CN202310565983.2A 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof Active CN116645578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310565983.2A CN116645578B (en) 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310565983.2A CN116645578B (en) 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof

Publications (2)

Publication Number Publication Date
CN116645578A CN116645578A (en) 2023-08-25
CN116645578B (en) 2024-01-26

Family

ID=87642807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310565983.2A Active CN116645578B (en) 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof

Country Status (1)

Country Link
CN (1) CN116645578B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222395A (en) * 2019-10-21 2020-06-02 杭州飞步科技有限公司 Target detection method and device and electronic equipment
CN114782785A (en) * 2022-03-22 2022-07-22 华为技术有限公司 Multi-sensor information fusion method and device
CN114973407A (en) * 2022-05-10 2022-08-30 华南理工大学 RGB-D-based video three-dimensional human body posture estimation method
CN115116049A (en) * 2022-08-29 2022-09-27 苏州魔视智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN115861601A (en) * 2022-12-20 2023-03-28 清华大学 Multi-sensor fusion sensing method and device

Also Published As

Publication number Publication date
CN116645578A (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111179324B (en) Object six-degree-of-freedom pose estimation method based on color and depth information fusion
CN108520535B (en) Object classification method based on depth recovery information
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
US11367195B2 (en) Image segmentation method, image segmentation apparatus, image segmentation device
KR101567792B1 (en) System and method for describing image outlines
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN114666564A (en) Method for synthesizing virtual viewpoint image based on implicit neural scene representation
CN112907573B (en) Depth completion method based on 3D convolution
CN112270332A (en) Three-dimensional target detection method and system based on sub-stream sparse convolution
CN112801945A (en) Depth Gaussian mixture model skull registration method based on dual attention mechanism feature extraction
CN111402331B (en) Robot repositioning method based on visual word bag and laser matching
CN113822256B (en) Face recognition method, electronic device and storage medium
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
Silva et al. Light-field imaging reconstruction using deep learning enabling intelligent autonomous transportation system
CN111723688B (en) Human body action recognition result evaluation method and device and electronic equipment
CN116645578B (en) Multi-mode data fusion method and three-dimensional target detection method thereof
CN116704307A (en) Target detection method and system based on fusion of image virtual point cloud and laser point cloud
CN116580085A (en) Deep learning algorithm for 6D pose estimation based on attention mechanism
CN116310105A (en) Object three-dimensional reconstruction method, device, equipment and storage medium based on multiple views
CN113887385A (en) Three-dimensional point cloud classification method based on multi-view attention convolution pooling
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
CN108268533A (en) A kind of Image Feature Matching method for image retrieval
CN113496521B (en) Method and device for generating depth image and camera external parameter by using multiple color pictures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant