CN116645578B - Multi-mode data fusion method and three-dimensional target detection method thereof - Google Patents

Multi-mode data fusion method and three-dimensional target detection method thereof

Info

Publication number
CN116645578B
CN116645578B (application CN202310565983.2A)
Authority
CN
China
Prior art keywords
dimensional
feature
voxel
characteristic
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310565983.2A
Other languages
Chinese (zh)
Other versions
CN116645578A (en)
Inventor
王意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Science and Technology
Original Assignee
Guangdong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Science and Technology filed Critical Guangdong University of Science and Technology
Priority to CN202310565983.2A
Publication of CN116645578A
Application granted
Publication of CN116645578B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G06V 10/16 Image acquisition using multiple overlapping images; Image stitching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/467 Encoded features or binary features, e.g. local binary patterns [LBP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of machine vision, and provides a multi-mode data fusion method and a three-dimensional target detection method thereof. The method comprises the following steps: acquiring an RGBD image and an RGB image; obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD image; obtaining a two-dimensional image feature based on the RGB image; obtaining a second three-dimensional voxel feature according to the depth block information and the two-dimensional image feature; and performing data fusion through a Transformer network according to the first three-dimensional voxel feature and the second three-dimensional voxel feature to obtain a target voxel feature. By combining the RGBD image containing depth information with the ordinary three-channel RGB image and fusing the data through a Transformer network, more accurate and complete target voxel features are obtained; accurate three-dimensional target detection can then be performed based on these target voxel features, which greatly improves the efficiency of three-dimensional target detection and effectively ensures the accuracy of the detection result.

Description

Multi-mode data fusion method and three-dimensional target detection method thereof
Technical Field
The invention relates to the technical field of machine vision, in particular to a multi-mode data fusion method and a three-dimensional target detection method thereof.
Background
With the rapid development of deep learning, machine vision technology has iterated continuously and target detection is being applied ever more widely; for example, in fruit picking, the fruit is first located by target detection and then picked by a robot. A traditional target detection method shoots pictures from several directions, traverses each picture to select candidate target regions, extracts features of those regions, and finally obtains the target position through an SVM classifier. Traversing the pictures has high time complexity, and the extracted two-dimensional image features contain only two-dimensional information, so the authenticity and accuracy of the features are hard to guarantee and the accuracy of target detection is low.
Disclosure of Invention
The invention provides a multi-mode data fusion method, which aims to obtain more accurate target voxel characteristics.
The invention provides a three-dimensional target detection method, which aims to improve the accuracy of three-dimensional target detection.
In a first aspect, the present invention provides a multi-modal data fusion method, including:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In one embodiment, the obtaining, based on the RGBD map, point cloud data, a first three-dimensional voxel feature, and depth block information includes:
obtaining point cloud data through image conversion based on the RGBD map;
according to the point cloud data, a point cloud coding network is utilized to obtain a first three-dimensional voxel characteristic;
and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the point cloud data.
In one embodiment, the obtaining the point cloud data based on the RGBD map through an image conversion formula includes:
based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of the camera on an x-axis and a focal length of the camera on a y-axis;
and obtaining point cloud data through an image conversion formula according to the image coordinates, the depth value, the focal length of the camera on the x axis and the focal length of the camera on the y axis.
In one embodiment, the obtaining depth block information according to the point cloud data by using a distance linear increment discretization algorithm includes:
obtaining the depth range, the number of depth blocks and the index of the range to which the depth value belongs according to the point cloud data;
and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the depth range, the number of the depth blocks and the index of the range to which the depth value belongs.
In one embodiment, the obtaining a second three-dimensional voxel feature according to the depth block information and the two-dimensional image feature includes:
according to the depth block information and the two-dimensional image characteristics, a three-dimensional view cone is obtained;
and obtaining the second three-dimensional voxel characteristic by utilizing a three-dimensional linear interpolation method according to the three-dimensional view cone.
In one embodiment, the obtaining the second three-dimensional voxel feature according to the three-dimensional view cone by using a three-dimensional linear interpolation method includes:
obtaining a view cone block sampling point of the three-dimensional view cone by using a three-dimensional linear interpolation method;
and converting the view cone block sampling points, adjusting their size and splicing the resulting feature blocks to obtain a second three-dimensional voxel feature.
In one embodiment, the performing data fusion through a Transformer network according to the first three-dimensional voxel feature and the second three-dimensional voxel feature to obtain a target voxel feature includes:
carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
and according to the first three-dimensional voxel characteristic and the third three-dimensional voxel characteristic, performing characteristic fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix to obtain a target voxel characteristic.
In one embodiment, the performing feature fusion and matrix conversion according to the first three-dimensional voxel feature and the third three-dimensional voxel feature through a multi-head Transformer network and a linear conversion matrix to obtain a target voxel feature includes:
performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
In a second aspect, the present invention provides a multi-modal data fusion apparatus comprising:
an image acquisition module for: acquiring an RGBD image and an RGB image;
RGBD map processing module, configured to: obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
an RGB diagram processing module configured to: based on the RGB image, obtaining two-dimensional image characteristics;
the obtaining module is used for: obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
the data fusion module is used for: performing data fusion through a multi-head Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In a third aspect, the present invention provides a three-dimensional target detection method, where three-dimensional target detection is performed on target voxel features obtained by the multi-modal data fusion method according to any one of the above-mentioned aspects through a three-dimensional target detection head.
In a fourth aspect, the present invention provides a three-dimensional object detection device, including an object detection module, configured to perform three-dimensional object detection on the object voxel feature obtained by the multi-mode data fusion method according to any one of the above-mentioned aspects through a three-dimensional object detection head.
According to the multi-mode data fusion method provided by the invention, the depth RGBD image containing the depth information and the general three-channel RGB image are combined to obtain more real point cloud data, the first three-dimensional voxel characteristic, the depth block information, the two-dimensional image characteristic and the second three-dimensional voxel characteristic, and then the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic are input into a Transformer network to perform comprehensive data fusion, so that more accurate, complete and clear target voxel characteristics can be obtained.
According to the three-dimensional target detection method provided by the invention, the existing three-dimensional target detection head is utilized, accurate three-dimensional target detection can be performed based on the target voxel characteristics obtained by the multi-mode data fusion method, the three-dimensional target detection precision can be greatly improved, the detection result accuracy is effectively ensured, and the application efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the following description will be given with a brief introduction to the drawings used in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained from these drawings without the inventive effort of a person skilled in the art.
FIG. 1 is a schematic flow chart of a multi-modal data fusion method provided by the invention;
FIG. 2 is a flow chart of the three-dimensional object detection method provided by the invention;
FIG. 3 is a schematic diagram of a multi-modal data fusion apparatus according to the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to the multi-mode data fusion method provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiments of the present invention provide embodiments of a multi-modal data fusion method. It should be noted that, although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in an order different from that shown or described herein.
Referring to fig. 1, fig. 1 is a flow chart of a multi-mode data fusion method provided by the invention. The multi-mode data fusion method provided by the embodiment of the invention comprises the following steps:
step 101, acquiring an RGBD image and an RGB image;
step 102, obtaining point cloud data, a first three-dimensional voxel feature P and depth block information D' based on the RGBD map;
step 103, obtaining a two-dimensional image feature F_2D based on the RGB map;
step 104, obtaining a second three-dimensional voxel feature I according to the depth block information D' and the two-dimensional image feature F_2D;
and step 105, performing data fusion through a multi-head Transformer network according to the first three-dimensional voxel feature P and the second three-dimensional voxel feature I to obtain a target voxel feature V.
The execution body of the embodiment of the invention can be any terminal side device meeting the device requirements, such as a data fusion device, a three-dimensional target detection device and the like.
In step 101, the terminal side device acquires an RGBD map and an RGB map.
It should be noted that an RGBD map combines an ordinary RGB image with a depth map (RGBD = RGB + D (Depth Map)). The depth channel is similar to a gray-scale image in which each pixel value is the actual distance from the sensor to the object surface, and an RGBD map can be captured by an RGBD camera. The RGBD map and the RGB map are typically registered, so their pixel points correspond one-to-one. The RGBD map and the RGB map in the embodiment of the present invention may be images of the space to be detected, and are used to obtain the target voxel feature V for subsequent three-dimensional target detection.
In step 102, the terminal device obtains point cloud data, a first three-dimensional voxel feature P, and depth block information D' based on the RGBD map.
In one embodiment, step 102 may include:
step 1021, obtaining point cloud data through image conversion based on the RGBD map;
step 1022, obtaining a first three-dimensional voxel feature P according to the point cloud data by using an existing point cloud encoding network;
step 1023, obtaining depth block information D' according to the point cloud data by using an existing distance linear increment discretization algorithm.
Specifically, regarding step 1021, the terminal-side device may obtain the image coordinates, the depth value, the focal length of the camera on the x-axis and the focal length of the camera on the y-axis based on the RGBD map, and obtain the point cloud data through image conversion formula (1). In formula (1), (x, y, z) represents the point cloud coordinates, (x', y') represents the image coordinates, D represents the depth value, f_x represents the focal length of the camera on the x-axis, and f_y represents the focal length of the camera on the y-axis. Through this image conversion, the RGBD map can be converted into more realistic point cloud data.
Specifically, with respect to step 1022, the terminal-side device may obtain the first three-dimensional voxel feature P from the point cloud data using an existing point cloud encoding network, such as VoxelNet. The first three-dimensional voxel feature P has size (X_P, Y_P, Z_P, C_F), where (X_P, Y_P, Z_P) represents the grid size of the first three-dimensional voxel feature P and C_F represents the number of channels of the first three-dimensional voxel feature P.
Specifically, with respect to step 1023, the terminal-side device may obtain, from the point cloud data, the depth range [d_min, d_max], the number of depth blocks N_D and the index d_i of the range to which each depth value belongs, and then obtain the depth block information D' according to the distance linear increment discretization algorithm, formula (2). In formula (2), d_c represents the calculated continuous depth values, and the calculated values d_c together constitute the depth block information D'.
In step 103, the terminal-side device obtains the two-dimensional image feature F_2D based on the RGB map.
It should be noted that the terminal-side device may process the RGB map through an existing two-dimensional deep learning backbone network, such as EfficientNet or RegNet, to extract the two-dimensional image feature F_2D. Its size is (W_F, H_F, C_F'), where W_F and H_F represent the width and height of the two-dimensional image feature F_2D respectively, and C_F' represents the number of channels of the two-dimensional image feature F_2D.
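As an illustration of this step, the sketch below uses a torchvision EfficientNet feature extractor as an assumed stand-in for the two-dimensional backbone; the specific backbone, input size and weights are placeholders.

```python
import torch
import torchvision

# Illustrative backbone choice; the patent only names the backbone family.
backbone = torchvision.models.efficientnet_b0(weights=None).features.eval()

rgb = torch.rand(1, 3, 480, 640)          # placeholder RGB image, (N, C, H, W)
with torch.no_grad():
    f2d = backbone(rgb)                    # two-dimensional image feature F_2D
print(f2d.shape)                           # (1, C_F', H_F, W_F), e.g. (1, 1280, 15, 20)
```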
In step 104, the terminal-side device obtains the second three-dimensional voxel feature I according to the depth block information D' and the two-dimensional image feature F_2D.
In one embodiment, step 104 includes:
step 1041, obtaining a three-dimensional view cone M based on the depth block information D' and the two-dimensional image feature F_2D;
step 1042, obtaining a second three-dimensional voxel feature I by using the existing three-dimensional linear interpolation method according to the three-dimensional view cone M.
Specifically, the terminal-side device may perform an outer product of the depth block information D' and the two-dimensional image feature F_2D to obtain a three-dimensional view cone M of size (W_F', H_F', N_D', C_F''), where W_F' and H_F' represent the width and height of the three-dimensional view cone M respectively, N_D' represents the number of depth blocks, and C_F'' represents the number of channels of the three-dimensional view cone M. The outer product of the depth block information D' and the two-dimensional image feature F_2D that yields the three-dimensional view cone M is written as expression (3), in which (i, j) represents the index of a feature pixel.
In particular, after the three-dimensional view cone M is obtained, the terminal-side device may convert it into the second three-dimensional voxel feature I of size (X_I, Y_I, Z_I, C_F'''), where (X_I, Y_I, Z_I) represents the grid size of the second three-dimensional voxel feature I and C_F''' represents the number of channels of the second three-dimensional voxel feature I.
In one embodiment, the terminal-side device may obtain the second three-dimensional voxel feature I as follows: for the i-th view cone block of the three-dimensional view cone M, i ∈ C_F'', a three-dimensional linear interpolation method is used to calculate the view cone block sampling points S_M of the three-dimensional view cone M; the view cone block sampling points S_M are then converted into voxel feature block intermediate points S_I through formula (4) below; the voxel feature block intermediate points S_I are adjusted in size (e.g., padded) to a voxel feature block I_i of the same size as the view cone block; and the voxel feature blocks I_i are spliced to form the second three-dimensional voxel feature I.
CM·S_I = S_M (4)
In formula (4), CM represents the calibration matrix of the camera.
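Formula (4) relates voxel-space intermediate points S_I to frustum-space sampling points S_M through the camera calibration matrix CM, and the feature values at those points are read out by three-dimensional (trilinear) interpolation. The sketch below illustrates only the interpolation step with torch.nn.functional.grid_sample, which performs trilinear sampling on 5D inputs; the sampling coordinates are random placeholders standing in for the normalized result of CM·S_I = S_M.

```python
import torch
import torch.nn.functional as F

# Frustum features M laid out as (N, C, N_D, H_F, W_F) for 3D sampling.
frustum = torch.rand(1, 64, 40, 15, 20)

# S_M: frustum-space sampling points for every voxel of the target grid (X_I, Y_I, Z_I),
# already normalized to [-1, 1] as grid_sample expects.
X_I, Y_I, Z_I = 32, 32, 8
s_m = torch.rand(1, Z_I, Y_I, X_I, 3) * 2 - 1   # in practice: CM·S_I = S_M, then normalize

# Trilinear interpolation of frustum features at the sampling points.
voxel_feat = F.grid_sample(frustum, s_m, mode='bilinear', align_corners=False)
print(voxel_feat.shape)   # (1, 64, Z_I, Y_I, X_I): samples forming the second voxel feature I
```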
In step 105, the terminal-side device performs data fusion through the multi-head Transformer network according to the first three-dimensional voxel feature P and the second three-dimensional voxel feature I to obtain the target voxel feature V.
In one embodiment, step 105 includes:
step 1051, using the non-empty voxel blocks F_P in the first three-dimensional voxel feature P, performing maximum pooling (max-pooling) and data flattening on the second three-dimensional voxel feature I to obtain a third three-dimensional voxel feature F_I of size (L, C_F), where L = X_P × Y_P × Z_P; the non-empty voxel blocks F_P and the third three-dimensional voxel feature F_I have the same dimensions;
step 1052, according to the non-empty voxel blocks F_P of the first three-dimensional voxel feature P and the third three-dimensional voxel feature F_I, performing feature fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix to obtain the target voxel feature V.
In one embodiment, step 1052 may be implemented as follows: according to the non-empty voxel blocks F_P in the first three-dimensional voxel feature P and the third three-dimensional voxel feature F_I, feature splicing is performed through a multi-head Transformer network to obtain a first feature splicing matrix concat(head_1, head_2, …, head_m); the linear conversion matrix W^O is then used to perform matrix conversion on the first feature splicing matrix to obtain a second feature splicing matrix A; finally, the non-empty voxel blocks F_P are fused with the second feature splicing matrix A to obtain the target voxel feature, which can be used for three-dimensional target detection in various applications.
Specifically, the terminal-side device may first regard the non-empty voxel blocks F_P as the Query (the Query is defined as Q_i = F_P·W_i^Q), and regard the third three-dimensional voxel feature F_I as the Key (the Key is defined as K_i = F_I·W_i^K) and the Value (the Value is defined as V_i = F_I·W_i^V); these features are input into an existing multi-head Transformer network, calculated according to formulas (5)-(6) below, and feature splicing is carried out.
Q_i = F_P·W_i^Q, K_i = F_I·W_i^K, V_i = F_I·W_i^V (5)
In formula (5), Q_i represents the query obtained from the non-empty voxel blocks F_P, and K_i and V_i represent the key and value obtained from the third three-dimensional voxel feature F_I; in formula (6), the attention term computed from Q_i and K_i represents a relationship vector, V_i represents the current input, and head_i represents the weighted output of the multi-head Transformer network.
Then, the terminal-side device may splice the outputs of the m Transformer heads to obtain the first feature splicing matrix concat(head_1, head_2, …, head_m). Next, according to formula (7) below, the linear conversion matrix W^O maps the spliced m-head Transformer output (the first feature splicing matrix concat(head_1, head_2, …, head_m)) to the three-dimensional voxel space; this matrix conversion yields the second feature splicing matrix A.
A = concat(head_1, head_2, …, head_m)·W^O (7)
Finally, the non-empty voxel blocks F_P are fused with the second feature splicing matrix A to obtain the target voxel feature V, expressed as V = concat(A, F_P).
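Formulas (5)-(7) describe a multi-head cross-attention in which the non-empty voxel blocks F_P act as queries and the third three-dimensional voxel feature F_I supplies keys and values, followed by concatenation with F_P. The sketch below assumes the standard scaled dot-product attention for formula (6) and bundles W_i^Q, W_i^K, W_i^V and W^O inside nn.MultiheadAttention; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class VoxelFusion(nn.Module):
    """Multi-head cross-attention fusion of F_P (queries) with F_I (keys/values), then concat."""

    def __init__(self, c_f=64, num_heads=4):
        super().__init__()
        # nn.MultiheadAttention bundles W_i^Q, W_i^K, W_i^V and the output projection W^O.
        self.attn = nn.MultiheadAttention(embed_dim=c_f, num_heads=num_heads, batch_first=True)

    def forward(self, f_p, f_i):
        # f_p, f_i: (N, L, C_F) with L = X_P * Y_P * Z_P (flattened voxel blocks)
        a, _ = self.attn(query=f_p, key=f_i, value=f_i)   # A = concat(head_1..head_m) W^O
        return torch.cat([a, f_p], dim=-1)                # V = concat(A, F_P)

fusion = VoxelFusion(c_f=64, num_heads=4)
f_p = torch.rand(1, 512, 64)   # non-empty voxel blocks F_P
f_i = torch.rand(1, 512, 64)   # third three-dimensional voxel feature F_I
v = fusion(f_p, f_i)
print(v.shape)                 # (1, 512, 128): target voxel feature V
```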
According to the multi-mode data fusion method provided by the embodiment of the invention, the RGBD image containing depth information and the ordinary three-channel RGB image are combined to obtain more realistic point cloud data, a first three-dimensional voxel feature, depth block information, a two-dimensional image feature and a second three-dimensional voxel feature; the first and second three-dimensional voxel features are then input into a multi-head Transformer network for comprehensive data fusion, so that more accurate, complete and clear target voxel features can be obtained. These target voxel features can be used for three-dimensional target detection in various applications, such as fruit picking.
Referring to fig. 2, the embodiment of the present invention further provides a three-dimensional target detection method; the execution body may be a terminal-side device such as a three-dimensional target detection device. Using an existing three-dimensional target detection head (a device for implementing three-dimensional target detection), three-dimensional target detection is performed on the target voxel features obtained by the foregoing multi-mode data fusion method: a three-dimensional anchor frame is generated, targets within the anchor frame are classified, and three-dimensional target detection is completed. This greatly improves the accuracy and efficiency of three-dimensional target detection, effectively ensures the accuracy of the detection result, and expands the application scenarios of the present invention.
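The detection head itself is an existing component; as one assumed possibility, an anchor-based head of the kind used with voxel features predicts, for each anchor, a class score and a seven-parameter three-dimensional box. A minimal illustrative sketch:

```python
import torch
import torch.nn as nn

class SimpleDetectionHead(nn.Module):
    """Illustrative anchor-based head: classification and 3D box regression over fused features."""

    def __init__(self, c_in=128, num_anchors=2, num_classes=1):
        super().__init__()
        self.cls = nn.Conv2d(c_in, num_anchors * num_classes, kernel_size=1)
        self.box = nn.Conv2d(c_in, num_anchors * 7, kernel_size=1)  # (x, y, z, w, l, h, yaw)

    def forward(self, bev):
        # bev: (N, C, Y, X) bird's-eye-view map obtained by collapsing the voxel grid's z-axis
        return self.cls(bev), self.box(bev)

head = SimpleDetectionHead(c_in=128)
scores, boxes = head(torch.rand(1, 128, 32, 32))
```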
In addition, three-dimensional target detection can be performed by other three-dimensional target detection technologies meeting requirements based on the target voxel characteristics obtained by the multi-mode data fusion method provided by the embodiment of the invention, so that the accuracy and the efficiency of three-dimensional target detection are improved.
Further, the multi-mode data fusion device provided by the invention and the multi-mode data fusion method provided by the invention are correspondingly referred to each other.
Referring to fig. 3, the multi-modal data fusion apparatus includes:
an image acquisition module 301 for: acquiring an RGBD image and an RGB image;
RGBD map processing module 302 is configured to: obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
an RGB diagram processing module 303, configured to: based on the RGB image, obtaining two-dimensional image characteristics;
an obtaining module 304 for: obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
a data fusion module 305, configured to: performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In one embodiment, the RGBD map processing module 302 may include:
the point cloud data obtaining module is used for: obtaining point cloud data through image conversion based on the RGBD map;
the first three-dimensional voxel feature obtaining module is used for: according to the point cloud data, a point cloud coding network is utilized to obtain a first three-dimensional voxel characteristic;
the depth block information obtaining module is used for: and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the point cloud data.
In one embodiment, the point cloud data obtaining module is specifically used for:
based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of the camera on an x-axis and a focal length of the camera on a y-axis;
and obtaining point cloud data through an image conversion formula according to the image coordinates, the depth value, the focal length of the camera on the x axis and the focal length of the camera on the y axis.
In one embodiment, the depth block information obtaining module is specifically configured to:
obtaining the depth range, the number of depth blocks and the index of the range to which the depth value belongs according to the point cloud data;
and obtaining depth block information by utilizing a distance linear increment discretization algorithm according to the depth range, the number of the depth blocks and the index of the range to which the depth value belongs.
In one embodiment, the obtaining module 304 may include:
the three-dimensional view cone obtaining module is used for: according to the depth block information and the two-dimensional image characteristics, a three-dimensional view cone is obtained;
a second three-dimensional voxel feature obtaining module for: and obtaining a second three-dimensional voxel characteristic by utilizing a three-dimensional linear interpolation method according to the three-dimensional view cone.
In one embodiment, the second three-dimensional voxel feature obtaining module is specifically configured to:
obtaining a view cone block sampling point of the three-dimensional view cone by using a three-dimensional linear interpolation method;
and converting the view cone block sampling points, adjusting their size and splicing the resulting feature blocks to obtain a second three-dimensional voxel feature.
In one embodiment, the data fusion module 305 may include:
a third three-dimensional voxel feature obtaining module for: carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
the target voxel feature obtaining module is used for: according to the first three-dimensional voxel characteristic and the third three-dimensional voxel characteristic, performing characteristic fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix to obtain a target voxel characteristic.
In one embodiment, the target voxel feature obtaining module may include:
the first concatenation module is used for: performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
the second splicing module is used for: performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
the feature fusion module is used for: and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
Furthermore, the invention also provides a three-dimensional target detection device, which comprises a target detection module, wherein the target detection module is used for carrying out three-dimensional target detection on the target voxel characteristics obtained according to the multi-mode data fusion method or the multi-mode data fusion device through a three-dimensional target detection head.
Fig. 4 illustrates a schematic structural diagram of an electronic device. As shown in fig. 4, the electronic device may include: a processor 810, a communication interface (Communications Interface) 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a multi-modal data fusion method comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the multi-modal data fusion method provided by the methods described above, the method comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the multi-modal data fusion method provided by the above methods, the method comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB image, obtaining two-dimensional image characteristics;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A method of multimodal data fusion, comprising:
acquiring an RGBD image and an RGB image;
obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
based on the RGB map, obtaining a two-dimensional image feature F_2D of size (W_F, H_F, C_F'), where W_F and H_F represent the width and height of the two-dimensional image feature F_2D respectively, and C_F' represents the number of channels of the two-dimensional image feature F_2D;
obtaining a second three-dimensional voxel characteristic according to the depth block information and the two-dimensional image characteristic;
according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic, performing data fusion through a Transformer network to obtain a target voxel characteristic;
the obtaining, based on the RGBD map, point cloud data, a first three-dimensional voxel feature, and depth block information includes:
based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of a camera on an x-axis and a focal length of the camera on a y-axis, and obtaining point cloud data through an image conversion formula, wherein the image conversion formula is as follows:
in formula (1), (x, y, z) represents the point cloud coordinates, (x', y') represents the image coordinates, D represents the depth value, f_x represents the focal length of the camera on the x-axis, and f_y represents the focal length of the camera on the y-axis;
according to the point cloud data, obtaining a first three-dimensional voxel feature P by using a point cloud encoding network, the size of the first three-dimensional voxel feature being (X_P, Y_P, Z_P, C_F), where (X_P, Y_P, Z_P) represents the grid size of the first three-dimensional voxel feature P and C_F represents the number of channels of the first three-dimensional voxel feature P;
obtaining, according to the point cloud data, a depth range [d_min, d_max], a number of depth blocks N_D and an index d_i of the range to which the depth value belongs, and obtaining depth block information by using a distance linear increment discretization algorithm, wherein the formula of the distance linear increment discretization algorithm is:
in formula (2), d_c represents the calculated continuous depth values, and the calculated values d_c constitute the depth block information D';
and obtaining a second three-dimensional voxel feature according to the depth block information and the two-dimensional image feature, including:
performing an outer product of the depth block information and the two-dimensional image feature to obtain a three-dimensional view cone M of size (W_F', H_F', N_D', C_F''), where W_F' and H_F' represent the width and height of the three-dimensional view cone M respectively, N_D' represents the number of depth blocks, and C_F'' represents the number of channels of the three-dimensional view cone M, wherein the outer product of the depth block information D' and the two-dimensional image feature F_2D that yields the three-dimensional view cone M is expressed as:
in formula (3), (i, j) represents the index of a feature pixel;
according to the three-dimensional view cone, a second three-dimensional voxel characteristic is obtained by utilizing a three-dimensional linear interpolation method;
and performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic, wherein the method comprises the following steps:
carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
according to the first three-dimensional voxel feature and the third three-dimensional voxel feature, feature fusion and matrix conversion are carried out through a multi-head Transformer network and a linear conversion matrix, and a target voxel feature is obtained;
and performing feature fusion and matrix conversion through a multi-head Transformer network and a linear conversion matrix according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a target voxel feature, wherein the method comprises the following steps:
performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
2. The method of claim 1, wherein the obtaining a second three-dimensional voxel feature from the three-dimensional view cone by using a three-dimensional linear interpolation method comprises:
obtaining a view cone block sampling point of the three-dimensional view cone by using a three-dimensional linear interpolation method;
and converting the view cone block sampling points, adjusting their size and splicing the resulting feature blocks to obtain a second three-dimensional voxel feature.
3. A multi-modal data fusion apparatus comprising:
an image acquisition module for: acquiring an RGBD image and an RGB image;
RGBD map processing module, configured to: obtaining point cloud data, a first three-dimensional voxel feature and depth block information based on the RGBD map;
an RGB diagram processing module configured to: based on the RGB image, obtaining two-dimensional image characteristics;
the obtaining module is used for: obtaining a second three-dimensional voxel feature according to the depth block information and a two-dimensional image feature of size (W_F, H_F, C_F'), where W_F and H_F represent the width and height of the two-dimensional image feature F_2D respectively, and C_F' represents the number of channels of the two-dimensional image feature F_2D;
the data fusion module is used for: performing data fusion through a Transformer network according to the first three-dimensional voxel characteristic and the second three-dimensional voxel characteristic to obtain a target voxel characteristic;
wherein, RGBD graph processing module includes:
the point cloud data obtaining module is used for: based on the RGBD image, obtaining an image coordinate, a depth value, a focal length of a camera on an x-axis and a focal length of the camera on a y-axis, and obtaining point cloud data through an image conversion formula, wherein the image conversion formula is as follows:
in formula (1), (x, y, z) represents the point cloud coordinates, (x', y') represents the image coordinates, D represents the depth value, f_x represents the focal length of the camera on the x-axis, and f_y represents the focal length of the camera on the y-axis;
the first three-dimensional voxel feature obtaining module is used for: obtaining a first three-dimensional voxel feature P by using a point cloud encoding network according to the point cloud data, the size of the first three-dimensional voxel feature being (X_P, Y_P, Z_P, C_F), where (X_P, Y_P, Z_P) represents the grid size of the first three-dimensional voxel feature P and C_F represents the number of channels of the first three-dimensional voxel feature P;
the depth block information obtaining module is used for: obtaining, according to the point cloud data, a depth range [d_min, d_max], a number of depth blocks N_D and an index d_i of the range to which the depth value belongs, and obtaining depth block information by using a distance linear increment discretization algorithm, wherein the formula of the distance linear increment discretization algorithm is:
in formula (2), d_c represents the calculated continuous depth values, and the calculated values d_c constitute the depth block information D';
and, the obtaining module includes:
the three-dimensional view cone obtaining module is used for: performing an outer product of the depth block information and the two-dimensional image feature to obtain a three-dimensional view cone M of size (W_F', H_F', N_D', C_F''), where W_F' and H_F' represent the width and height of the three-dimensional view cone M respectively, N_D' represents the number of depth blocks, and C_F'' represents the number of channels of the three-dimensional view cone M, wherein the outer product of the depth block information D' and the two-dimensional image feature F_2D that yields the three-dimensional view cone M is expressed as:
in formula (3), (i, j) represents the index of a feature pixel;
a second three-dimensional voxel feature obtaining module for: according to the three-dimensional view cone, a second three-dimensional voxel characteristic is obtained by utilizing a three-dimensional linear interpolation method;
and, the data fusion module includes:
a third three-dimensional voxel feature obtaining module for: carrying out pooling treatment and data flattening treatment on the second three-dimensional voxel characteristic by using a non-empty voxel block in the first three-dimensional voxel characteristic to obtain a third three-dimensional voxel characteristic;
the target voxel feature obtaining module is used for: according to the first three-dimensional voxel feature and the third three-dimensional voxel feature, feature fusion and matrix conversion are carried out through a multi-head Transformer network and a linear conversion matrix, and a target voxel feature is obtained;
the target voxel characteristic obtaining module comprises:
the first concatenation module is used for: performing feature stitching through a multi-head Transformer network according to the first three-dimensional voxel feature and the third three-dimensional voxel feature to obtain a first feature stitching matrix;
the second splicing module is used for: performing matrix conversion on the first characteristic splicing matrix by using a linear conversion matrix to obtain a second characteristic splicing matrix;
the feature fusion module is used for: and carrying out feature fusion on the non-empty voxel block and the second feature splicing matrix to obtain the target voxel feature.
4. A three-dimensional object detection method, characterized in that three-dimensional object detection is performed on the object voxel characteristics obtained by the multi-modal data fusion method according to claim 1 or 2 by a three-dimensional object detection head.
CN202310565983.2A 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof Active CN116645578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310565983.2A CN116645578B (en) 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310565983.2A CN116645578B (en) 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof

Publications (2)

Publication Number Publication Date
CN116645578A CN116645578A (en) 2023-08-25
CN116645578B (en) 2024-01-26

Family

ID=87642807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310565983.2A Active CN116645578B (en) 2023-05-18 2023-05-18 Multi-mode data fusion method and three-dimensional target detection method thereof

Country Status (1)

Country Link
CN (1) CN116645578B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222395A (en) * 2019-10-21 2020-06-02 杭州飞步科技有限公司 Target detection method and device and electronic equipment
CN114782785A (en) * 2022-03-22 2022-07-22 华为技术有限公司 Multi-sensor information fusion method and device
CN114973407A (en) * 2022-05-10 2022-08-30 华南理工大学 RGB-D-based video three-dimensional human body posture estimation method
CN115116049A (en) * 2022-08-29 2022-09-27 苏州魔视智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN115861601A (en) * 2022-12-20 2023-03-28 清华大学 Multi-sensor fusion sensing method and device

Also Published As

Publication number Publication date
CN116645578A (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111179324B (en) Object six-degree-of-freedom pose estimation method based on color and depth information fusion
CN108520535B (en) Object classification method based on depth recovery information
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
US11367195B2 (en) Image segmentation method, image segmentation apparatus, image segmentation device
KR101567792B1 (en) System and method for describing image outlines
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN114666564A (en) Method for synthesizing virtual viewpoint image based on implicit neural scene representation
CN112907573B (en) Depth completion method based on 3D convolution
CN112270332A (en) Three-dimensional target detection method and system based on sub-stream sparse convolution
CN112801945A (en) Depth Gaussian mixture model skull registration method based on dual attention mechanism feature extraction
CN111402331B (en) Robot repositioning method based on visual word bag and laser matching
CN113822256B (en) Face recognition method, electronic device and storage medium
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
Silva et al. Light-field imaging reconstruction using deep learning enabling intelligent autonomous transportation system
CN111723688B (en) Human body action recognition result evaluation method and device and electronic equipment
CN116645578B (en) Multi-mode data fusion method and three-dimensional target detection method thereof
CN116704307A (en) Target detection method and system based on fusion of image virtual point cloud and laser point cloud
CN116580085A (en) Deep learning algorithm for 6D pose estimation based on attention mechanism
CN116310105A (en) Object three-dimensional reconstruction method, device, equipment and storage medium based on multiple views
CN113887385A (en) Three-dimensional point cloud classification method based on multi-view attention convolution pooling
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
CN108268533A (en) A kind of Image Feature Matching method for image retrieval
CN113496521B (en) Method and device for generating depth image and camera external parameter by using multiple color pictures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant