CN115880333A - Three-dimensional single-target tracking method based on multi-mode information fusion - Google Patents

Three-dimensional single-target tracking method based on multi-mode information fusion

Info

Publication number
CN115880333A
CN115880333A (application CN202211545845.XA)
Authority
CN
China
Prior art keywords
point cloud
similarity
dimensional
features
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211545845.XA
Other languages
Chinese (zh)
Inventor
方正
林雨
李智恒
崔宇波
李硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
63983 Troops of PLA
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202211545845.XA priority Critical patent/CN115880333A/en
Publication of CN115880333A publication Critical patent/CN115880333A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a three-dimensional single-target tracking method based on multi-modal information fusion. First, the original image and the laser point cloud are spatially aligned. Second, a dual-stream feature extraction network built with deep learning extracts the high-dimensional semantic features of the image and the point cloud in parallel. Next, a self-attention mechanism weights the important information within each modality, and a cross-attention mechanism constructs semantic associations between the modalities. Texture and geometric similarities are then computed from the semantically enhanced features and combined into multi-modal similarity features by an attention mechanism. Finally, a multi-layer convolutional structure predicts the spatial position and orientation of the target. By fully exploiting the complementary strengths of heterogeneous multi-sensor data and adaptively fusing image texture information with point cloud geometric features through attention mechanisms in deep learning, the method generates more robust multi-modal features, accurately regresses the three-dimensional coordinates and orientation of the tracked target, and improves the robustness and accuracy of the three-dimensional target tracker.

Description

Three-dimensional single-target tracking method based on multi-mode information fusion
Technical Field
The invention belongs to the technical field of three-dimensional point cloud single-target tracking, and particularly relates to a three-dimensional single-target tracking method based on multi-modal information fusion.
Background
In recent years, with the rapid development of artificial intelligence technology, intelligent equipment such as robots and autonomous vehicles has entered public view and become a new strategic direction for manufacturing in China; the safe and stable operation of these intelligent products is impossible without environment perception technology. Environment perception means using sensors such as cameras, lidar and ultrasonic devices to acquire omnidirectional data on the state of the environment, from which a processor and intelligent algorithms extract the most critical semantic information, such as the positions and sizes of nearby vehicles and pedestrians. This information provides vital data support for the decision and control stages of an intelligent system and ensures its reliable operation.
The rapid growth of deep learning has also greatly advanced three-dimensional target tracking algorithms, and many researchers have applied deep learning to target tracking with considerable success. Compared with traditional algorithms, the high-dimensional features extracted by deep learning are better suited to matching targets and therefore enable robust tracking. However, most existing three-dimensional tracking algorithms inherit the twin (Siamese) network structure of two-dimensional trackers and build similarity solely from the geometric information provided by the point cloud, which makes it difficult for the tracker to distinguish objects with similar structures. Image texture information can effectively address this problem: two people with similar geometry can be distinguished by texture cues such as appearance and clothing. Deep fusion of multi-modal data by means of deep learning can therefore alleviate the tendency of trackers to fail in complex environments.
Chinese patent CN114862911A proposes a three-dimensional point cloud single-target tracking method based on graph convolution. The method improves on the existing three-dimensional single-target tracker P2B: it takes the template and search point clouds as network input, down-samples the point clouds and extracts seed-point features, uses a graph convolution module to fuse global and local features and encode template cues into the search area, and finally feeds the seed points carrying the encoded template information into a Hough voting module to locate and track the target in the search area and generate a three-dimensional bounding box. This scheme uses lidar as the sensor for perceiving the environment. Although the point cloud collected by the radar can describe the geometric outline of an object, the point cloud is sparse and unordered, so at long range or under occlusion the object it represents is incomplete and the tracker's performance degrades sharply. Furthermore, because the approach relies only on the geometric information of the point cloud, the tracker struggles to resolve the target when the target and the background have similar geometry. The method is therefore prone to target confusion and target loss, and good tracking performance is hard to guarantee in complex scenes.
Chinese patent CN111091582A proposes a single visual-target tracking algorithm and system based on a deep neural network, and additionally provides a size-adjustment module that dynamically adjusts the cropping size of the search area according to the size of the target, so that the tracker adapts to targets of different sizes and motion characteristics. In this method the image is the sole data source; although images contain rich texture information, they are easily affected by environmental factors such as illumination and weather, so an image tracker cannot operate around the clock. In addition, because an image lacks three-dimensional information about the object, an image tracker can only follow the target in the two-dimensional image plane; yet in robotic and autonomous-driving tracking tasks it is the target's motion trajectory in three-dimensional space, rather than its position change in the image plane, that matters, so the application scenarios of image trackers are very limited.
Disclosure of Invention
To address these problems, the invention provides a three-dimensional single-target tracking method based on multi-modal information fusion. It fully exploits the complementary strengths of heterogeneous multi-sensor data and adaptively fuses image texture information with point cloud geometric features through attention mechanisms in deep learning, generating more robust multi-modal features so that the three-dimensional coordinates and orientation of the tracked target can be regressed accurately. This resolves the difficulty point cloud trackers have in distinguishing objects with similar structures, improves the network's ability to track distant targets, mitigates the impact of point cloud sparsity and occlusion on tracking performance, and ultimately improves the robustness and accuracy of the three-dimensional target tracker. In addition, by taking images and low-beam lidar data as system input while still providing robust tracking results, the method avoids dependence on expensive high-beam radars for tracking, reduces vehicle production cost, promotes the rapid deployment of autonomous-driving products, and brings considerable economic benefits.
A three-dimensional single-target tracking method based on multi-modal information fusion comprises the following steps:
step 1: carrying out spatial alignment on images and point cloud data acquired by different sensors; the method comprises the following steps:
step 1.1: projecting the point cloud to an image plane according to the internal and external parameters of the camera;
step 1.2: assigning point cloud depths to corresponding pixels;
step 1.3: back projecting the pixels to a three-dimensional space, thereby generating a pseudo point cloud which has image texture information and is aligned with an original point cloud coordinate system;
step 2: constructing a double-flow feature extraction network based on a deep learning method, and realizing parallel extraction of high-dimensional semantic features of pseudo point clouds and point clouds by different network branches; the method comprises the following steps:
step 2.1: extracting texture features of the pseudo-point cloud; the method comprises the following steps:
step 2.1.1: for the pseudo point clouds obtained in step 1, apply the farthest point sampling algorithm to the template pseudo point cloud and the search pseudo point cloud to down-sample each of them to Q key points;
step 2.1.2: for the template pseudo point cloud and search pseudo point cloud obtained in step 2.1.1, perform KNN (K-nearest-neighbor) clustering on the points within radius R around each of the Q key points, and then aggregate the clustered point features into the key points with an MLP (multi-layer perceptron) network;
step 2.1.3: repeatedly apply farthest point sampling, KNN clustering and the MLP network until the number of key points is reduced to N, and output a template point set and a search point set each containing N points, where every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i);
Step 2.2: extracting the geometric characteristics of the real point cloud, including:
step 2.2.1: voxelize the real template point cloud and search point cloud, converting them into dense voxel representations;
step 2.2.2: apply a three-dimensional sparse convolution network to the template point cloud and the search point cloud respectively to extract the geometric features of the points inside each voxel, generating template three-dimensional voxel features and search three-dimensional voxel features of dimension W×L×H×C, where W, L, H, C denote the dimensions of the corresponding tensor;
step 2.2.3: for the three-dimensional voxel features, merge the height channel into the feature channel to output denser BEV (bird's-eye-view) features of dimension W×L×C', where C' = H×C;
step 3: realize interaction and enhancement of the multi-modal features by combining self-attention and cross-attention; the method comprises the following steps:
step 3.1: constructing a long-distance dependency relationship in a mode through a self-attention mechanism, learning importance degrees of different features and weighting the features;
step 3.2: through cross attention, semantic relations of different modes are constructed, so that the information of different modes can strengthen the characteristics of the same semantic object, and more robust multi-mode semantic characteristics are generated;
step 4: fuse the similarity between image pixels and point cloud data based on the cross-modal modeling capability of the attention mechanism; the method comprises the following steps:
step 4.1: apply a pixel-wise cosine similarity function to the semantically enhanced features of the search point cloud and the template point cloud to generate the geometric similarity S_geo ∈ R^(W×L×D), where W×L×D denotes the dimension of the tensor S_geo;
step 4.2: apply a point-wise cosine similarity function to the semantically enhanced features of the search pseudo point cloud and the template pseudo point cloud to generate the texture similarity S_tex ∈ R^(N×D), where N×D denotes the dimension of the tensor S_tex;
step 4.3: fuse the geometric similarity and the texture similarity through a cross-attention mechanism to generate the more robust multi-modal similarity feature S'_fus;
step 5: according to the multi-modal similarity feature S'_fus, predict the spatial position and orientation of the target with a multi-layer convolutional structure; the method comprises the following steps:
step 5.1: extracting implicit target clues in the multi-modal similarity features through a multi-layer convolution structure to generate a feature map;
step 5.2: predicting the confidence coefficient of a target and a corresponding spatial attribute at each position in the characteristic diagram, wherein the position with the highest confidence coefficient is the position to be positioned of the target, and the spatial attribute is the deviation amount and the direction of the position of the target and is used for correcting the position to be positioned;
step 5.3: and taking the corrected to-be-positioned position as a prediction result of the tracking target.
The invention has the beneficial effects that:
1) The invention makes full use of the characteristics of each sensor so that their advantages complement one another, improving the tracker's adaptability to long-range or occluded targets and better distinguishing objects with similar structures;
2) Compared with existing multi-modal trackers, the proposed tracking method achieves the best performance in terms of success rate and precision;
3) The method can be deployed on a mobile-robot platform or an autonomous vehicle and runs in real time; on an Nvidia 2080Ti GPU, for example, it reaches a processing speed of 12 FPS.
Drawings
FIG. 1 is a schematic diagram of a three-dimensional single-target tracking method based on multi-modal information fusion according to the present invention;
FIG. 2 is a diagram of the multi-modal dual-stream feature extraction network architecture in accordance with the present invention;
FIG. 3 is a diagram of a multi-modal feature interaction and enhancement network architecture in accordance with the present invention;
FIG. 4 is a diagram of the multi-modal similarity fusion scheme according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
The invention focuses on three-dimensional target tracking, an extremely important task in environment perception, whose core is to estimate the subsequent motion state of a target from its initial state and the data (images and point clouds) of the two frames. Because a perception system often needs to localize a specific target over a long period in practical applications, three-dimensional target tracking has extremely wide application value, including: (1) autonomous driving: lidar and cameras automatically monitor and track pedestrians, vehicles and other targets around the vehicle, keep a safe distance from them, and effectively avoid the collision risks posed by emergencies; (2) intelligent surveillance: the monitored scene is analyzed intelligently, suspicious targets are tracked continuously, and abnormal activities of suspects trigger early warnings, effectively preventing dangerous situations; (3) military guidance: targets of strategic value are identified intelligently by computing equipment so that they can be located precisely, tracked continuously and struck effectively.
The design of the invention is illustrated in fig. 1. First, the multi-modal data are spatially aligned: pixels containing texture information are projected into three-dimensional space using the camera's intrinsic and extrinsic parameters, generating a pseudo point cloud that lies in the same physical space as the real point cloud; prior information (three-dimensional position and size) is then obtained from the ground-truth target box of the initial frame, and a target template region and a search region where the target may appear are cropped from the point cloud and the pseudo point cloud. Second, a dual-stream network extracts the two kinds of modal features: the pseudo point cloud branch rapidly encodes texture information, while the point cloud branch extracts geometric features and generates a dense BEV (bird's-eye-view) representation. Subsequently, an attention interaction and enhancement module weights the important features within each modality and constructs semantic associations between the important features of the different modalities, yielding more robust multi-modal features. The method further uses the enhanced features to compute intra-modal similarity, producing texture and geometric similarities, which are then fused adaptively through an attention mechanism. Finally, the invention predicts the position and orientation of the target from the multi-modal similarity features.
The three-dimensional single-target tracking method based on multi-modal information fusion proceeds as follows: first, spatially align the original image and the laser point cloud; second, build a dual-stream feature extraction network with deep learning to extract the high-dimensional semantic features of the image and the point cloud in parallel; next, weight the important information within each modality with a self-attention mechanism and construct semantic associations between modalities with a cross-attention mechanism; then compute texture and geometric similarities from the semantically enhanced features and generate multi-modal similarity features with an attention mechanism; finally, predict the spatial position and orientation of the target with a multi-layer convolutional structure. The specific implementation is as follows:
Spatial alignment of image and point cloud data: because the image and the point cloud lie in different sensor coordinate systems, establishing the correspondence between them directly with a neural network is crude and easily confuses different semantic information during fusion. The invention therefore maps image pixels into the point cloud space to guarantee the spatial alignment of the multi-modal data.
Step 1: carrying out spatial alignment on images and point cloud data acquired by different sensors; the method comprises the following steps:
step 1.1: projecting the point cloud to an image plane according to the internal and external parameters of the camera;
step 1.2: assigning point cloud depths to corresponding pixels;
step 1.3: back projecting the pixels to a three-dimensional space, thereby generating a pseudo point cloud which has image texture information and is aligned with an original point cloud coordinate system;
Assume that the three-dimensional coordinates of a space point P measured by the lidar are X = (x, y, z), that the rotation matrix from the lidar to the camera coordinate system is R and the translation vector is T, and that the camera intrinsic matrix is K; the pixel Y = (u, v) uniquely corresponding to the space point P in the image plane is then obtained from the following point cloud-image projection formula, where (u, v) are the pixel coordinates:
Z·[u, v, 1]^T = K·(R·X + T),  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where (f_x, f_y) is the focal length and (c_x, c_y) is the imaging origin (principal point). Next, the point cloud depth Z is assigned to the corresponding image pixel (u, v), and the pixel is back-projected into the point cloud coordinate system by inverting the same projection formula, which yields the pseudo point cloud carrying image texture information; a small sketch of this alignment follows.
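For illustration, here is a minimal NumPy sketch of the projection and back-projection just described; the function name, the row-vector convention and the (R, T, K) argument layout are assumptions made for this example rather than the patent's reference implementation.

```python
import numpy as np

def lidar_to_pseudo_point_cloud(points, image, K, R, T):
    """points: (N, 3) lidar XYZ; image: (H, W, 3) uint8; K: (3, 3); R: (3, 3); T: (3,)."""
    cam = points @ R.T + T                        # lidar frame -> camera frame
    cam = cam[cam[:, 2] > 0]                      # keep points in front of the camera
    uvw = cam @ K.T                               # pinhole projection (point cloud-image formula)
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)   # pixel coordinates (u, v)
    H, W = image.shape[:2]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    cam, uv = cam[inside], uv[inside]
    # Assign the point cloud depth Z to each pixel, back-project the pixel to the camera
    # frame, then return to the lidar frame so the pseudo point cloud stays aligned.
    z = cam[:, 2:3]
    xyz_cam = np.hstack([(uv[:, 0:1] - K[0, 2]) * z / K[0, 0],
                         (uv[:, 1:2] - K[1, 2]) * z / K[1, 1],
                         z])
    xyz_lidar = (xyz_cam - T) @ R                 # inverse rigid transform (R orthonormal)
    rgb = image[uv[:, 1], uv[:, 0]] / 255.0       # attach the texture information
    return np.hstack([xyz_lidar, rgb])            # (M, 6) pseudo point cloud with RGB
```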
Image and point cloud dual-stream feature extraction: after obtaining the image pseudo point clouds that lie in the same spatial coordinate system as the point clouds, i.e.
the template pseudo point cloud and the search pseudo point cloud, the real point clouds (template point cloud and search point cloud) and the pseudo point clouds are fed simultaneously into the dual-stream feature extraction network, whose two branches extract the features of the two modalities respectively.
Step 2: constructing a double-flow feature extraction network based on a deep learning method, and realizing parallel extraction of high-dimensional semantic features of pseudo point clouds and point clouds by different network branches; the method comprises the following steps:
step 2.1: extracting texture features of the pseudo-point cloud; the method comprises the following steps:
step 2.1.1: for the pseudo point clouds obtained in step 1, apply the farthest point sampling algorithm to the template pseudo point cloud and the search pseudo point cloud to down-sample each of them to Q key points;
step 2.1.2: for the template pseudo point cloud and search pseudo point cloud obtained in step 2.1.1, perform KNN (K-nearest-neighbor) clustering on the points within radius R around each of the Q key points, and then aggregate the clustered point features into the key points with an MLP (multi-layer perceptron) network;
step 2.1.3: repeatedly apply farthest point sampling, KNN clustering and the MLP network until the number of key points is reduced to N, and output a template point set and a search point set each containing N points, where every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i);
Step 2.2: extracting the geometric characteristics of the real point cloud, including:
step 2.2.1: voxelize the real template point cloud and search point cloud, converting them into dense voxel representations;
step 2.2.2: apply a three-dimensional sparse convolution network to the template point cloud and the search point cloud respectively to extract the geometric features of the points inside each voxel, generating template and search three-dimensional voxel features of dimension W×L×H×C, where W×L×H×C denotes the tensor dimension;
step 2.2.3: for the three-dimensional voxel features, merge the height channel into the feature channel to output denser BEV (bird's-eye-view) features of dimension W×L×C', where C' = H×C;
As shown in fig. 2, in the pseudo point cloud branch the template pseudo point cloud and the search pseudo point cloud are first subjected to
the farthest point sampling algorithm; the Q sampled points are taken as point cloud key points, KNN (K-nearest-neighbor) clustering is performed on the points within radius R around each key point, and an MLP (multi-layer perceptron) network then aggregates the clustered point features into the key points. After this process is repeated several times, the pseudo point cloud branch finally outputs a template point set and a search point set each containing N points, where every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i). For brevity, these outputs are subsequently denoted F_p^t (pseudo point cloud features of the template region) and F_p^s (pseudo point cloud features of the search region). This branch quickly aggregates the texture features of the pseudo point cloud and in the end relies on only a small number of points to represent the local texture of foreground and background, which effectively reduces the computational cost of the subsequent network; a small sketch of this set-abstraction stage follows.
Unlike the pseudo point cloud branch, the real point cloud branch first voxelizes the template point cloud
and the search point cloud, converting the sparse point clouds into dense voxels, and then extracts the geometric features of the points inside the voxels with three-dimensional sparse convolution, generating template three-dimensional voxel features and search three-dimensional voxel features of dimension W×L×H×C. The height channel is then merged into the feature channel to output a denser template BEV feature and search BEV feature of dimension W×L×C', where C' = H×C; these are subsequently denoted F_r^t and F_r^s. This helps to mitigate the influence of point cloud sparsity; a sketch of the height-to-channel flattening follows.
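A minimal sketch of the height-to-channel flattening that produces the BEV features; the (B, C, H, L, W) memory layout is an assumed convention.

```python
import torch

def voxel_to_bev(voxel_feat: torch.Tensor) -> torch.Tensor:
    """voxel_feat: (B, C, H, L, W) dense voxel features -> BEV features (B, H*C, L, W)."""
    b, c, h, l, w = voxel_feat.shape
    return voxel_feat.reshape(b, c * h, l, w)   # merge the height channel into the feature channel

bev = voxel_to_bev(torch.randn(2, 64, 4, 128, 128))   # -> shape (2, 256, 128, 128), C' = H * C
```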
Multi-modal feature interaction and enhancement: attention-enhanced versions of the pseudo point cloud template feature F_p^t,
the pseudo point cloud search feature F_p^s, the real point cloud template feature F_r^t and the real point cloud search feature F_r^s are obtained as follows.
First, a self-attention mechanism redistributes the weights of the point cloud and pseudo point cloud features; the size of the features is unchanged in this process. Let F be an input feature; after self-attention processing it becomes the attention-enhanced feature
F̂. The feature F̂ is then passed through a cross-attention mechanism that constructs semantic associations between the modalities and outputs the multi-modal semantically enhanced feature F̃.
The following first introduces the attention mechanism rationale:
The attention mechanism consists of three parts: input feature and position encoding, similarity calculation, and feature weighting. Its main idea is to generate feature weights adaptively according to feature similarity, so that the difference between important and unimportant parts of the features becomes more pronounced. The general formula of the attention mechanism is:
Q, K, V = α(F + P), β(F + P), γ(F)
Attention(Q, K, V) = Softmax(Q·K^T / √d)·V
where F is the input feature of the attention mechanism; α, β and γ denote linear layers or multi-layer perceptrons whose weights are not shared; P denotes the position encoding feature; and Q, K, V denote the Query, Key and Value matrices of the attention mechanism. The feature similarity between Q and K is computed as Q·K^T, divided by the scaling factor √d, normalized with Softmax, and then multiplied with V to produce the attention-enhanced feature; a compact code rendering follows.
The detailed process of self-attention and cross-attention will be described below in conjunction with fig. 3:
in the self-attention-enhancing stage, the principle is shown in the following formula:
F̂_r = LayerNorm(F_r + MultiheadAttn(Q_r, K_r, V_r))
where F_r is the input feature of the real point cloud; Q_r, K_r, V_r are all obtained from the input feature F_r by linear transformations; MultiheadAttn denotes concatenating the results computed by multiple heads; LayerNorm denotes applying layer normalization to the features; and F̂_r denotes the enhanced feature that is output.
Taking the search point cloud characteristics as the input of a self-attention mechanism to obtain re-weighted search point cloud characteristics; taking the template point cloud characteristics as the input of a self-attention mechanism to obtain the re-weighted template point cloud characteristics;
and for the pseudo point cloud characteristics, realizing characteristic enhancement according to the following formula:
F̂_p = LayerNorm(F_p + MultiheadAttn(Q_p, K_p, V_p))
where F_p is the input feature of the pseudo point cloud; Q_p, K_p, V_p are all obtained from the input feature F_p by linear transformations; MultiheadAttn denotes concatenating the results computed by multiple heads; LayerNorm denotes applying layer normalization to the features; and F̂_p denotes the enhanced feature that is output.
Taking the searched pseudo-point cloud characteristics as the input of a self-attention mechanism to obtain the re-weighted searched pseudo-point cloud characteristics; and taking the template pseudo-point cloud characteristics as the input of a self-attention mechanism to obtain the template pseudo-point cloud characteristics after re-weighting.
In the cross attention enhancement stage, cross-modal feature enhancement is performed on the output feature of the previous stage according to the following formula:
F̃_r = LayerNorm(F̂_r + MultiheadAttn(Q_r, K_p, V_p))
F̃_p = LayerNorm(F̂_p + MultiheadAttn(Q_p, K_r, V_r))
where Q_r is obtained by a linear transformation of the self-attention-enhanced feature F̂_r, and K_p, V_p are obtained by linear transformations of the self-attention-enhanced feature F̂_p (and symmetrically for Q_p, K_r, V_r); F̃_r denotes the final enhanced real point cloud feature produced by the cross-attention mechanism, and F̃_p denotes the final enhanced pseudo point cloud feature. Unlike the self-attention mechanism, the cross-attention mechanism generates the Query from the enhanced feature of the current branch and the Key and Value from the enhanced feature of the other branch; the attention mechanism computes the feature similarity between the modalities and strengthens the weights of features carrying similar semantic information, thereby constructing semantic associations between the modalities and finally producing the semantically enhanced features F̃.
The re-weighted search pseudo point cloud and search point cloud features are used as the inputs of cross-attention to obtain the final semantically enhanced search pseudo point cloud and real point cloud features; the re-weighted template pseudo point cloud and template point cloud features are used as the inputs of cross-attention to obtain the final semantically enhanced template pseudo point cloud and real point cloud features. A minimal sketch of this interaction-and-enhancement stage follows.
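A minimal sketch of this interaction-and-enhancement stage built on torch.nn.MultiheadAttention: each modality is first re-weighted by self-attention, then each branch queries the other through cross-attention. Both feature maps are assumed to have been flattened into token sequences of equal channel width, and the head count and single-layer depth are illustrative choices.

```python
import torch
import torch.nn as nn

class ModalInteraction(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, f_r, f_p):
        """f_r: (B, N, dim) real point cloud tokens; f_p: (B, M, dim) pseudo point cloud tokens."""
        # Self-attention: weight the important features inside each modality.
        f_r_hat = self.norm[0](f_r + self.self_r(f_r, f_r, f_r)[0])
        f_p_hat = self.norm[1](f_p + self.self_p(f_p, f_p, f_p)[0])
        # Cross-attention: Query from one branch, Key/Value from the other branch,
        # so features describing the same semantic object reinforce each other.
        f_r_out = self.norm[2](f_r_hat + self.cross_r(f_r_hat, f_p_hat, f_p_hat)[0])
        f_p_out = self.norm[3](f_p_hat + self.cross_p(f_p_hat, f_r_hat, f_r_hat)[0])
        return f_r_out, f_p_out
```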
And step 3: interaction and enhancement of multi-modal features are achieved by combining self-attention and cross-attention; the method comprises the following steps:
step 3.1: constructing a long-distance dependency relationship in a mode through a self-attention mechanism, learning importance degrees of different features and weighting the features;
step 3.2: through cross attention, semantic relations of different modes are constructed, so that the information of different modes can strengthen the characteristics of the same semantic object, and more robust multi-mode semantic characteristics are generated;
multi-modal similarity fusion: semantic enhancement of features with point clouds and pseudo-point clouds
, i.e. the semantically enhanced features F̃, is used to generate the geometric similarity S_geo ∈ R^(W×L×D) and the texture similarity S_tex ∈ R^(N×D), where W and L are the size of the feature map, N is the number of pseudo point cloud key points, and D is the feature dimension. The similarity is computed as
S = Correlation(F̃^s, F̃^t)
where Correlation is the cosine similarity function, F̃^s denotes the final enhanced feature of the search area and F̃^t denotes the final enhanced feature of the template region; a small sketch of this correlation step follows.
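A small sketch of the Correlation step: cosine similarity between every search feature and every template feature, usable for both the pixel-wise geometric branch and the point-wise texture branch. The flattened (B, N, C) token layout is an assumption.

```python
import torch
import torch.nn.functional as F

def correlation(search, template):
    """search: (B, N_search, C); template: (B, D, C) -> cosine similarity (B, N_search, D)."""
    s = F.normalize(search, dim=-1)
    t = F.normalize(template, dim=-1)
    return s @ t.transpose(-2, -1)   # cosine similarity of every search/template feature pair
```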
After the similarity of each modality (real point cloud and pseudo point cloud) has been generated, the geometric and texture similarities are fused as shown in fig. 4. To unify the representations of the two similarities, the sparse texture similarity of the pseudo point cloud branch is first converted into a dense BEV similarity feature: the texture similarity features are voxelized, the similarity features inside each voxel are aggregated, and a dense BEV similarity S'_tex ∈ R^(W×L×D) is produced. The texture similarity S'_tex and the geometric similarity S_geo are then fused through cross-attention to generate the more robust multi-modal similarity feature S_fus. The mathematical expression of the similarity fusion is:
Q_geo = S_geo·W_Q,  K_tex = S'_tex·W_K,  V_tex = S'_tex·W_V
S_fus = Softmax(Q_geo·K_tex^T / √d)·V_tex
S'_fus = MLP(S_fus) + S_fus
where MLP denotes a multi-layer perceptron, Q_geo is the Query generated from the geometric similarity, and K_tex, V_tex are the Key and Value generated from the texture similarity. Because this step encodes the geometric and texture cues of the target into the multi-modal similarity feature S'_fus, it improves the robustness of the tracker and alleviates failures in complex environments; a sketch of this fusion module follows.
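A sketch of the fusion module implementing the three equations above; the projection sizes and the flattened BEV layout of S_geo and S'_tex are assumptions.

```python
import torch
import torch.nn as nn

class SimilarityFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w_q, self.w_k, self.w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, s_geo, s_tex_bev):
        """s_geo, s_tex_bev: (B, W*L, D) geometric and voxelized texture similarity maps."""
        q = self.w_q(s_geo)                                  # Query from the geometric similarity
        k, v = self.w_k(s_tex_bev), self.w_v(s_tex_bev)      # Key/Value from the texture similarity
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        s_fus = attn @ v                                     # multi-modal similarity S_fus
        return self.mlp(s_fus) + s_fus                       # S'_fus with the MLP residual
```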
step 4: fuse the texture similarity of the image and the geometric similarity of the point cloud based on the cross-modal modeling capability of the attention mechanism; the method comprises the following steps:
step 4.1: apply a pixel-wise cosine similarity function to the semantically enhanced features of the search point cloud and the template point cloud to generate the geometric similarity S_geo ∈ R^(W×L×D), where W×L×D denotes the dimension of the tensor S_geo;
step 4.2: apply a point-wise cosine similarity function to the semantically enhanced features of the search pseudo point cloud and the template pseudo point cloud to generate the texture similarity S_tex ∈ R^(N×D), where N×D denotes the dimension of the tensor S_tex;
step 4.3: fuse the geometric similarity and the texture similarity through a cross-attention mechanism to generate the more robust multi-modal similarity feature S'_fus;
Target position prediction: after the fused similarity feature has been obtained in step 4, separate convolutional neural network branches are applied to the similarity feature S'_fus to classify the target position and to regress the target's relative offset and orientation.
Classification branch: each feature center of S'_fus is treated as a key point.
The classification branch predicts a score for every key point and outputs a heatmap Ŷ ∈ [0,1]^(W×H×1); the key point with the highest classification score in the heatmap indicates where the target is most likely to be. To construct the constraint for the loss function, the invention generates the ground-truth heatmap Y ∈ [0,1]^(W×H×1) with a Gaussian kernel:
Y_xy = exp(-((x - p_x)^2 + (y - p_y)^2) / (2·δ_p^2))
where (p_x, p_y) is the key point of the target and δ_p is a standard deviation adapted to the target size. The network parameters are then optimized with the Focal Loss:
L_h = -(1/N)·Σ_xy [ (1 - Ŷ_xy)^α·log(Ŷ_xy) if Y_xy = 1;  (1 - Y_xy)^β·(Ŷ_xy)^α·log(1 - Ŷ_xy) otherwise ]
where α and β are the hyper-parameters of the Focal Loss, N is the number of key points in the similarity feature S'_fus, and L_h is the value of the Focal Loss function; a sketch of the heatmap target and this loss follows.
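A sketch of the Gaussian heatmap target and the focal loss described above; the default values α = 2 and β = 4 follow common practice for this loss and are an assumption here, since the patent leaves them as hyper-parameters.

```python
import torch

def gaussian_heatmap(w, l, center, sigma):
    """Ground-truth heatmap of shape (L, W) centred on the target key point (p_x, p_y)."""
    ys, xs = torch.meshgrid(torch.arange(l).float(), torch.arange(w).float(), indexing="ij")
    return torch.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: predicted and ground-truth heatmaps of the same shape, values in [0, 1]."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                                   # positive locations (heatmap peaks)
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * (1 - pos)
    return -(pos_loss.sum() + neg_loss.sum()) / pos.sum().clamp(min=1.0)
```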
regression branches mainly predict three-dimensional attributes of targets
, namely the position offset (x_offset, y_offset), the height z and the orientation θ; the position offset is used to correct the key-point position so as to obtain a more accurate target localization result. The invention then adopts an L1 loss function to compute the loss of the target's three-dimensional attributes:
L_offset = |x̂_offset - x_offset| + |ŷ_offset - y_offset|,  L_z = |ẑ - z|,  L_θ = |θ̂ - θ|
where L_offset is the value of the position-offset loss function, L_z the value of the height loss function and L_θ the value of the orientation loss function; (x_offset, y_offset) denotes the position offset of a sample target used to train the model, z its height and θ its orientation, and x̂_offset, ŷ_offset, ẑ, θ̂ are the corresponding network predictions.
Finally, a final loss value L is obtained through weighted summation,
L = λ_h·L_h + λ_offset·L_offset + λ_z·L_z + λ_θ·L_θ
where L is the total loss function, λ_h is the coefficient of the Focal Loss term, λ_offset the coefficient of the offset loss, λ_z the coefficient of the height loss and λ_θ the coefficient of the orientation loss; a sketch of the combined loss follows.
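A sketch of the regression losses and the weighted total loss L; the λ weights are placeholders, since the patent does not state their values.

```python
import torch.nn.functional as F

def tracking_loss(heat_loss, pred, gt, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """pred/gt: dicts holding 'offset' (B, 2), 'z' (B, 1), 'theta' (B, 1) at the target key point."""
    l_offset = F.l1_loss(pred["offset"], gt["offset"])     # position-offset L1 loss
    l_z = F.l1_loss(pred["z"], gt["z"])                    # height L1 loss
    l_theta = F.l1_loss(pred["theta"], gt["theta"])        # orientation L1 loss
    lam_h, lam_off, lam_z, lam_theta = lambdas
    return lam_h * heat_loss + lam_off * l_offset + lam_z * l_z + lam_theta * l_theta
```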
step 5: according to the multi-modal similarity feature S'_fus, predict the spatial position and orientation of the target with a multi-layer convolutional structure; the method comprises the following steps:
step 5.1: extracting target clues implied in the multi-modal similarity characteristics through a multi-layer convolution structure to generate a characteristic diagram;
step 5.2: predicting the confidence coefficient of a target and a corresponding spatial attribute at each position in the characteristic diagram, wherein the position with the highest confidence coefficient is the position to be positioned of the target, and the spatial attribute is the deviation amount and the direction of the position of the target and is used for correcting the position to be positioned;
step 5.3: and taking the corrected to-be-positioned position as a prediction result of the tracking target.
The invention provides a three-dimensional single-target tracking method based on multi-modal information fusion that solves the problem of single-modality trackers confusing objects with similar geometric structures, thereby producing more robust tracking results. The key technical improvements are as follows:
(1) Spatial alignment of multi-modal data: pixels are mapped into the spatial coordinate system of the point cloud using the point cloud depth and the camera's intrinsic and extrinsic parameters, generating a pseudo point cloud with texture features; this reduces the difficulty of fusing multi-modal features in the subsequent network and improves the utilization of image features.
(2) Multi-modal dual-stream feature extraction network: different backbones extract the multi-modal features efficiently. The point cloud branch generates dense BEV geometric features, which helps mitigate the effect of point cloud sparsity, while the pseudo point cloud branch represents the local texture of the target and background with only a small number of points, which reduces the computational overhead of the subsequent network.
(3) Multi-modal information interaction and enhancement mechanism: important features within each modality are adaptively enhanced by self-attention, and multi-modal semantic associations are then built by cross-attention so that the features of the same semantic object are strengthened.
(4) Multi-modal similarity fusion: geometric and texture similarities are fused adaptively through an attention mechanism, producing a similarity feature that encodes the target's geometric and texture characteristics and is used for the final target regression, thereby improving the robustness of the tracker in complex scenes such as sparse point clouds or interference from many structurally similar objects.

Claims (8)

1. A three-dimensional single-target tracking method based on multi-modal information fusion, characterized by comprising the following steps:
step 1: spatially aligning the images and point cloud data acquired by different sensors;
step 2: constructing a dual-stream feature extraction network based on a deep learning method, with different network branches extracting the high-dimensional semantic features of the pseudo point cloud and the point cloud in parallel;
step 3: realizing interaction and enhancement of the multi-modal features by combining self-attention and cross-attention;
step 4: computing the texture similarity of the image and the geometric similarity of the point cloud based on the semantically enhanced features generated in step 3, and fusing the two similarity features by cross-attention to generate the multi-modal similarity;
step 5: according to the multi-modal similarity feature S'_fus, predicting the spatial position and orientation of the target with a multi-layer convolutional structure.
2. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 1 comprises:
step 1.1: projecting the point cloud to an image plane according to internal and external parameters of the camera;
step 1.2: assigning point cloud depths to corresponding pixels;
step 1.3: and back projecting the pixels to a three-dimensional space, thereby generating a pseudo point cloud which has image texture information and is aligned with the original point cloud coordinate system.
3. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 2 comprises:
step 2.1: extracting texture features of the pseudo-point cloud;
step 2.2: and extracting the geometric characteristics of the real point cloud.
4. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 3, wherein the step 2.1 comprises:
step 2.1.1: for the pseudo point clouds obtained in step 1, applying the farthest point sampling algorithm to the template pseudo point cloud and the search pseudo point cloud to down-sample each of them to Q key points;
step 2.1.2: for the template pseudo point cloud and search pseudo point cloud obtained in step 2.1.1, performing KNN clustering on the points within radius R around each of the Q key points, and then aggregating the clustered point features into the key points with an MLP network;
step 2.1.3: repeatedly applying farthest point sampling, KNN clustering and the MLP network until the number of key points is reduced to N, and outputting a template point set and a search point set each containing N points, wherein every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i).
5. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 3, wherein the step 2.2 comprises:
step 2.2.1: voxelizing the real template point cloud and search point cloud, converting them into dense voxel representations;
step 2.2.2: applying a three-dimensional sparse convolution network to the template point cloud and the search point cloud respectively to extract the geometric features of the points inside each voxel, generating template and search three-dimensional voxel features of dimension W×L×H×C, wherein W, L, H, C denote the dimensions of the corresponding tensor;
step 2.2.3: for the three-dimensional voxel features, merging the height channel into the feature channel to output denser BEV features of dimension W×L×C', wherein C' = H×C.
6. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 3 comprises:
step 3.1: constructing a long-distance dependency relationship in a mode through a self-attention mechanism, learning importance degrees of different features and weighting the features;
step 3.2: through cross attention, semantic relations of different modes are constructed, so that the information of different modes can strengthen the characteristics of the same semantic object, and more robust multi-mode semantic characteristics are generated.
7. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 4 comprises:
step 4.1: applying a pixel-wise cosine similarity function to the semantically enhanced features of the search point cloud and the template point cloud to generate the geometric similarity S_geo ∈ R^(W×L×D), wherein W×L×D denotes the dimension of the tensor S_geo;
step 4.2: applying a point-wise cosine similarity function to the semantically enhanced features of the search pseudo point cloud and the template pseudo point cloud to generate the texture similarity S_tex ∈ R^(N×D), wherein N×D denotes the dimension of the tensor S_tex;
step 4.3: fusing the geometric similarity and the texture similarity through a cross-attention mechanism to generate the more robust multi-modal similarity feature S'_fus.
8. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 5 comprises:
step 5.1: extracting target clues implied in the multi-modal similarity characteristics through a multi-layer convolution structure to generate a characteristic diagram;
step 5.2: predicting the confidence coefficient of a target and a corresponding spatial attribute of each position in the characteristic diagram, wherein the position with the highest confidence coefficient is the position to be positioned of the target, and the spatial attribute is the deviation amount and the direction of the target position and is used for correcting the position to be positioned;
step 5.3: and taking the corrected to-be-positioned position as a prediction result of the tracking target.
CN202211545845.XA 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion Pending CN115880333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545845.XA CN115880333A (en) 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211545845.XA CN115880333A (en) 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion

Publications (1)

Publication Number Publication Date
CN115880333A (en) 2023-03-31

Family

ID=85765768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545845.XA Pending CN115880333A (en) 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion

Country Status (1)

Country Link
CN (1) CN115880333A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173655A (en) * 2023-08-28 2023-12-05 南京航空航天大学 Multi-mode 3D target detection method based on semantic propagation and cross-attention mechanism
CN117557993A (en) * 2024-01-12 2024-02-13 杭州像素元科技有限公司 Construction method and application of double-frame interaction perception 3D association detection model
CN117557993B (en) * 2024-01-12 2024-03-29 杭州像素元科技有限公司 Construction method and application of double-frame interaction perception 3D association detection model

Similar Documents

Publication Publication Date Title
Yang et al. Multifeature fusion-based object detection for intelligent transportation systems
Wang et al. Multi-modal 3d object detection in autonomous driving: A survey and taxonomy
Yi et al. Segvoxelnet: Exploring semantic context and depth-aware features for 3d vehicle detection from point cloud
Xu et al. Head pose estimation using deep neural networks and 3D point clouds
CN115880333A (en) Three-dimensional single-target tracking method based on multi-mode information fusion
Wang et al. Nearest neighbor-based contrastive learning for hyperspectral and LiDAR data classification
Wang et al. An overview of 3d object detection
CN115272416A (en) Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion
Ouyang et al. A cgans-based scene reconstruction model using lidar point cloud
Zheng et al. Dim target detection method based on deep learning in complex traffic environment
Wang et al. A survey of 3D point cloud and deep learning-based approaches for scene understanding in autonomous driving
Karangwa et al. Vehicle detection for autonomous driving: A review of algorithms and datasets
Huang et al. Small target detection model in aerial images based on TCA-YOLOv5m
Zhou et al. Retrieval and localization with observation constraints
CN117522990A (en) Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
Bi et al. Machine vision
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
Duan Deep learning-based multitarget motion shadow rejection and accurate tracking for sports video
Li et al. Target adaptive tracking based on GOTURN algorithm with convolutional neural network and data fusion
CN116664856A (en) Three-dimensional target detection method, system and storage medium based on point cloud-image multi-cross mixing
Zhao et al. DHA: Lidar and vision data fusion-based on road object classifier
Sun et al. The recognition framework of deep kernel learning for enclosed remote sensing objects
Ma et al. Multi-modal information fusion for LiDAR-based 3D object detection framework
Long et al. Radar fusion monocular depth estimation based on dual attention
CN117557599B (en) 3D moving object tracking method and system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230930

Address after: 110819 No. 3 lane, Heping Road, Heping District, Shenyang, Liaoning 11

Applicant after: Northeastern University

Applicant after: 63983 FORCES, PLA

Address before: 110819 No. 3 lane, Heping Road, Heping District, Shenyang, Liaoning 11

Applicant before: Northeastern University
