CN115880333A - Three-dimensional single-target tracking method based on multi-mode information fusion - Google Patents

Three-dimensional single-target tracking method based on multi-mode information fusion

Info

Publication number
CN115880333A
CN115880333A (application CN202211545845.XA)
Authority
CN
China
Prior art keywords
point cloud
similarity
dimensional
features
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211545845.XA
Other languages
Chinese (zh)
Inventor
方正
林雨
李智恒
崔宇波
李硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
63983 Troops of PLA
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202211545845.XA priority Critical patent/CN115880333A/en
Publication of CN115880333A publication Critical patent/CN115880333A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a three-dimensional single-target tracking method based on multi-modal information fusion. First, the original image and the laser point cloud are spatially aligned. Second, a dual-stream feature extraction network built with deep learning extracts the high-dimensional semantic features of the image and the point cloud in parallel. Next, a self-attention mechanism weights the important information within each modality, and a cross-attention mechanism constructs semantic associations between the modalities. Texture and geometric similarities are then computed from the semantically enhanced features and combined into multi-modal similarity features by an attention mechanism. Finally, a multi-layer convolutional structure predicts the spatial position and orientation of the target. By fully exploiting the complementary strengths of heterogeneous multi-sensor data and adaptively fusing image texture information with point cloud geometric features through attention mechanisms in deep learning, the method generates more robust multi-modal features, accurately regresses the three-dimensional coordinates and orientation of the tracked target, and improves the robustness and accuracy of the three-dimensional target tracker.

Description

Three-dimensional single-target tracking method based on multi-mode information fusion
Technical Field
The invention belongs to the technical field of three-dimensional point cloud single-target tracking, and particularly relates to a three-dimensional single-target tracking method based on multi-modal information fusion.
Background
In recent years, with the rapid development of artificial intelligence technology, intelligent equipment such as robots and autonomous vehicles has entered public view and become a new strategic direction for manufacturing in China; the safe and stable operation of these intelligent products is impossible without environment perception technology. Environment perception means using sensors such as cameras, lidar and ultrasonic devices to acquire omnidirectional data on the state of the environment, from which a processor and intelligent algorithms extract the most critical semantic information, such as the positions and sizes of nearby vehicles and pedestrians. This information provides vital data support for the decision and control stages of an intelligent system and ensures its reliable operation.
The rapid growth of deep learning has also greatly advanced three-dimensional target tracking algorithms, and many researchers have applied deep learning to target tracking with considerable success. Compared with traditional algorithms, the high-dimensional features extracted by deep learning are better suited to matching targets and therefore enable robust tracking. However, most existing three-dimensional tracking algorithms inherit the twin (Siamese) network structure of two-dimensional trackers and build similarity solely from the geometric information provided by the point cloud, which makes it difficult for the tracker to distinguish objects with similar structures. Image texture information can effectively address this problem: two people with similar geometry can be distinguished by texture cues such as appearance and clothing. Deep fusion of multi-modal data by means of deep learning can therefore alleviate the tendency of trackers to fail in complex environments.
Chinese patent CN114862911A proposes a three-dimensional point cloud single-target tracking method based on graph convolution. The method improves on the existing three-dimensional single-target tracker P2B: it takes the template and search point clouds as network input, down-samples the point clouds and extracts seed-point features, uses a graph convolution module to fuse global and local features and encode template cues into the search area, and finally feeds the seed points carrying the encoded template information into a Hough voting module to locate and track the target in the search area and generate a three-dimensional bounding box. This scheme uses lidar as the sensor for perceiving the environment. Although the point cloud collected by the radar can describe the geometric outline of an object, the point cloud is sparse and unordered, so at long range or under occlusion the object it represents is incomplete and the tracker's performance degrades sharply. Furthermore, because the approach relies only on the geometric information of the point cloud, the tracker struggles to resolve the target when the target and the background have similar geometry. The method is therefore prone to target confusion and target loss, and good tracking performance is hard to guarantee in complex scenes.
Chinese patent CN111091582A proposes a single visual-target tracking algorithm and system based on a deep neural network, and additionally provides a size-adjustment module that dynamically adjusts the cropping size of the search area according to the size of the target, so that the tracker adapts to targets of different sizes and motion characteristics. In this method the image is the sole data source; although images contain rich texture information, they are easily affected by environmental factors such as illumination and weather, so an image tracker cannot operate around the clock. In addition, because an image lacks three-dimensional information about the object, an image tracker can only follow the target in the two-dimensional image plane; yet in robotic and autonomous-driving tracking tasks it is the target's motion trajectory in three-dimensional space, rather than its position change in the image plane, that matters, so the application scenarios of image trackers are very limited.
Disclosure of Invention
To address these problems, the invention provides a three-dimensional single-target tracking method based on multi-modal information fusion. It fully exploits the complementary strengths of heterogeneous multi-sensor data and adaptively fuses image texture information with point cloud geometric features through attention mechanisms in deep learning, generating more robust multi-modal features so that the three-dimensional coordinates and orientation of the tracked target can be regressed accurately. This resolves the difficulty point cloud trackers have in distinguishing objects with similar structures, improves the network's ability to track distant targets, mitigates the impact of point cloud sparsity and occlusion on tracking performance, and ultimately improves the robustness and accuracy of the three-dimensional target tracker. In addition, by taking images and low-beam lidar data as system input while still providing robust tracking results, the method avoids dependence on expensive high-beam radars for tracking, reduces vehicle production cost, promotes the rapid deployment of autonomous-driving products, and brings considerable economic benefits.
A three-dimensional single-target tracking method based on multi-modal information fusion comprises the following steps:
step 1: carrying out spatial alignment on images and point cloud data acquired by different sensors; the method comprises the following steps:
step 1.1: projecting the point cloud to an image plane according to the internal and external parameters of the camera;
step 1.2: assigning point cloud depths to corresponding pixels;
step 1.3: back projecting the pixels to a three-dimensional space, thereby generating a pseudo point cloud which has image texture information and is aligned with an original point cloud coordinate system;
step 2: constructing a double-flow feature extraction network based on a deep learning method, and realizing parallel extraction of high-dimensional semantic features of pseudo point clouds and point clouds by different network branches; the method comprises the following steps:
step 2.1: extracting texture features of the pseudo-point cloud; the method comprises the following steps:
step 2.1.1: for the pseudo point clouds obtained in step 1, apply the farthest point sampling algorithm to the template pseudo point cloud and the search pseudo point cloud to down-sample each of them to Q key points;
step 2.1.2: for the template pseudo point cloud and search pseudo point cloud obtained in step 2.1.1, perform KNN (K-nearest-neighbor) clustering on the points within radius R around each of the Q key points, and then aggregate the clustered point features into the key points with an MLP (multi-layer perceptron) network;
step 2.1.3: repeatedly apply farthest point sampling, KNN clustering and the MLP network until the number of key points is reduced to N, and output a template point set and a search point set each containing N points, where every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i);
Step 2.2: extracting the geometric characteristics of the real point cloud, including:
step 2.2.1: voxelize the real template point cloud and search point cloud, converting them into dense voxel representations;
step 2.2.2: apply a three-dimensional sparse convolution network to the template point cloud and the search point cloud respectively to extract the geometric features of the points inside each voxel, generating template three-dimensional voxel features and search three-dimensional voxel features of dimension W×L×H×C, where W, L, H, C denote the dimensions of the corresponding tensor;
step 2.2.3: for the three-dimensional voxel features, merge the height channel into the feature channel to output denser BEV (bird's-eye-view) features of dimension W×L×C', where C' = H×C;
step 3: realize interaction and enhancement of the multi-modal features by combining self-attention and cross-attention; the method comprises the following steps:
step 3.1: constructing a long-distance dependency relationship in a mode through a self-attention mechanism, learning importance degrees of different features and weighting the features;
step 3.2: through cross attention, semantic relations of different modes are constructed, so that the information of different modes can strengthen the characteristics of the same semantic object, and more robust multi-mode semantic characteristics are generated;
step 4: fuse the similarity between image pixels and point cloud data based on the cross-modal modeling capability of the attention mechanism; the method comprises the following steps:
step 4.1: apply a pixel-wise cosine similarity function to the semantically enhanced features of the search point cloud and the template point cloud to generate the geometric similarity S_geo ∈ R^(W×L×D), where W×L×D denotes the dimension of the tensor S_geo;
step 4.2: apply a point-wise cosine similarity function to the semantically enhanced features of the search pseudo point cloud and the template pseudo point cloud to generate the texture similarity S_tex ∈ R^(N×D), where N×D denotes the dimension of the tensor S_tex;
step 4.3: fuse the geometric similarity and the texture similarity through a cross-attention mechanism to generate the more robust multi-modal similarity feature S'_fus;
step 5: according to the multi-modal similarity feature S'_fus, predict the spatial position and orientation of the target with a multi-layer convolutional structure; the method comprises the following steps:
step 5.1: extracting implicit target clues in the multi-modal similarity features through a multi-layer convolution structure to generate a feature map;
step 5.2: predicting the confidence coefficient of a target and a corresponding spatial attribute at each position in the characteristic diagram, wherein the position with the highest confidence coefficient is the position to be positioned of the target, and the spatial attribute is the deviation amount and the direction of the position of the target and is used for correcting the position to be positioned;
step 5.3: and taking the corrected to-be-positioned position as a prediction result of the tracking target.
The invention has the beneficial effects that:
1) The invention makes full use of the characteristics of each sensor so that their advantages complement one another, improving the tracker's adaptability to long-range or occluded targets and better distinguishing objects with similar structures;
2) Compared with existing multi-modal trackers, the proposed tracking method achieves the best performance in terms of success rate and precision;
3) The method can be deployed on a mobile-robot platform or an autonomous vehicle and runs in real time; on an Nvidia 2080Ti GPU, for example, it reaches a processing speed of 12 FPS.
Drawings
FIG. 1 is a schematic diagram of a three-dimensional single-target tracking method based on multi-modal information fusion according to the present invention;
FIG. 2 is a diagram of the multi-modal dual-stream feature extraction network architecture in accordance with the present invention;
FIG. 3 is a diagram of a multi-modal feature interaction and enhancement network architecture in accordance with the present invention;
FIG. 4 is a diagram of the multi-modal similarity fusion scheme according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
The invention focuses on three-dimensional target tracking, an extremely important task in environment perception, whose core is to estimate the subsequent motion state of a target from its initial state and the data (images and point clouds) of the two frames. Because a perception system often needs to localize a specific target over a long period in practical applications, three-dimensional target tracking has extremely wide application value, including: (1) autonomous driving: lidar and cameras automatically monitor and track pedestrians, vehicles and other targets around the vehicle, keep a safe distance from them, and effectively avoid the collision risks posed by emergencies; (2) intelligent surveillance: the monitored scene is analyzed intelligently, suspicious targets are tracked continuously, and abnormal activities of suspects trigger early warnings, effectively preventing dangerous situations; (3) military guidance: targets of strategic value are identified intelligently by computing equipment so that they can be located precisely, tracked continuously and struck effectively.
The design of the invention is illustrated in fig. 1. First, the multi-modal data are spatially aligned: pixels containing texture information are projected into three-dimensional space using the camera's intrinsic and extrinsic parameters, generating a pseudo point cloud that lies in the same physical space as the real point cloud; prior information (three-dimensional position and size) is then obtained from the ground-truth target box of the initial frame, and a target template region and a search region where the target may appear are cropped from the point cloud and the pseudo point cloud. Second, a dual-stream network extracts the two kinds of modal features: the pseudo point cloud branch rapidly encodes texture information, while the point cloud branch extracts geometric features and generates a dense BEV (bird's-eye-view) representation. Subsequently, an attention interaction and enhancement module weights the important features within each modality and constructs semantic associations between the important features of the different modalities, yielding more robust multi-modal features. The method further uses the enhanced features to compute intra-modal similarity, producing texture and geometric similarities, which are then fused adaptively through an attention mechanism. Finally, the invention predicts the position and orientation of the target from the multi-modal similarity features.
The three-dimensional single-target tracking method based on multi-modal information fusion proceeds as follows: first, spatially align the original image and the laser point cloud; second, build a dual-stream feature extraction network with deep learning to extract the high-dimensional semantic features of the image and the point cloud in parallel; next, weight the important information within each modality with a self-attention mechanism and construct semantic associations between modalities with a cross-attention mechanism; then compute texture and geometric similarities from the semantically enhanced features and generate multi-modal similarity features with an attention mechanism; finally, predict the spatial position and orientation of the target with a multi-layer convolutional structure. The specific implementation is as follows:
Spatial alignment of image and point cloud data: because the image and the point cloud lie in different sensor coordinate systems, establishing the correspondence between them directly with a neural network is crude and easily confuses different semantic information during fusion. The invention therefore maps image pixels into the point cloud space to guarantee the spatial alignment of the multi-modal data.
Step 1: carrying out spatial alignment on images and point cloud data acquired by different sensors; the method comprises the following steps:
step 1.1: projecting the point cloud to an image plane according to the internal and external parameters of the camera;
step 1.2: assigning point cloud depths to corresponding pixels;
step 1.3: back projecting the pixels to a three-dimensional space, thereby generating a pseudo point cloud which has image texture information and is aligned with an original point cloud coordinate system;
Assume that the three-dimensional coordinates of a space point P measured by the lidar are X = (x, y, z), that the rotation matrix from the lidar to the camera coordinate system is R and the translation vector is T, and that the camera intrinsic matrix is K; the pixel Y = (u, v) uniquely corresponding to the space point P in the image plane is then obtained from the following point cloud-image projection formula, where (u, v) are the pixel coordinates:
Z·[u, v, 1]^T = K·(R·X + T),  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where (f_x, f_y) is the focal length and (c_x, c_y) is the imaging origin (principal point). Next, the point cloud depth Z is assigned to the corresponding image pixel (u, v), and the pixel is back-projected into the point cloud coordinate system by inverting the same projection formula, which yields the pseudo point cloud carrying image texture information; a small sketch of this alignment follows.
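For illustration, here is a minimal NumPy sketch of the projection and back-projection just described; the function name, the row-vector convention and the (R, T, K) argument layout are assumptions made for this example rather than the patent's reference implementation.

```python
import numpy as np

def lidar_to_pseudo_point_cloud(points, image, K, R, T):
    """points: (N, 3) lidar XYZ; image: (H, W, 3) uint8; K: (3, 3); R: (3, 3); T: (3,)."""
    cam = points @ R.T + T                        # lidar frame -> camera frame
    cam = cam[cam[:, 2] > 0]                      # keep points in front of the camera
    uvw = cam @ K.T                               # pinhole projection (point cloud-image formula)
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)   # pixel coordinates (u, v)
    H, W = image.shape[:2]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    cam, uv = cam[inside], uv[inside]
    # Assign the point cloud depth Z to each pixel, back-project the pixel to the camera
    # frame, then return to the lidar frame so the pseudo point cloud stays aligned.
    z = cam[:, 2:3]
    xyz_cam = np.hstack([(uv[:, 0:1] - K[0, 2]) * z / K[0, 0],
                         (uv[:, 1:2] - K[1, 2]) * z / K[1, 1],
                         z])
    xyz_lidar = (xyz_cam - T) @ R                 # inverse rigid transform (R orthonormal)
    rgb = image[uv[:, 1], uv[:, 0]] / 255.0       # attach the texture information
    return np.hstack([xyz_lidar, rgb])            # (M, 6) pseudo point cloud with RGB
```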
Image and point cloud dual-stream feature extraction: after obtaining the image pseudo point clouds that lie in the same spatial coordinate system as the point clouds, i.e.
the template pseudo point cloud and the search pseudo point cloud, the real point clouds (template point cloud and search point cloud) and the pseudo point clouds are fed simultaneously into the dual-stream feature extraction network, whose two branches extract the features of the two modalities respectively.
Step 2: constructing a double-flow feature extraction network based on a deep learning method, and realizing parallel extraction of high-dimensional semantic features of pseudo point clouds and point clouds by different network branches; the method comprises the following steps:
step 2.1: extracting texture features of the pseudo-point cloud; the method comprises the following steps:
step 2.1.1: for the pseudo point clouds obtained in step 1, apply the farthest point sampling algorithm to the template pseudo point cloud and the search pseudo point cloud to down-sample each of them to Q key points;
step 2.1.2: for the template pseudo point cloud and search pseudo point cloud obtained in step 2.1.1, perform KNN (K-nearest-neighbor) clustering on the points within radius R around each of the Q key points, and then aggregate the clustered point features into the key points with an MLP (multi-layer perceptron) network;
step 2.1.3: repeatedly apply farthest point sampling, KNN clustering and the MLP network until the number of key points is reduced to N, and output a template point set and a search point set each containing N points, where every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i);
Step 2.2: extracting the geometric characteristics of the real point cloud, including:
step 2.2.1: voxelize the real template point cloud and search point cloud, converting them into dense voxel representations;
step 2.2.2: apply a three-dimensional sparse convolution network to the template point cloud and the search point cloud respectively to extract the geometric features of the points inside each voxel, generating template and search three-dimensional voxel features of dimension W×L×H×C, where W×L×H×C denotes the tensor dimension;
step 2.2.3: for the three-dimensional voxel features, merge the height channel into the feature channel to output denser BEV (bird's-eye-view) features of dimension W×L×C', where C' = H×C;
As shown in fig. 2, in the pseudo point cloud branch the template pseudo point cloud and the search pseudo point cloud are first subjected to
the farthest point sampling algorithm; the Q sampled points are taken as point cloud key points, KNN (K-nearest-neighbor) clustering is performed on the points within radius R around each key point, and an MLP (multi-layer perceptron) network then aggregates the clustered point features into the key points. After this process is repeated several times, the pseudo point cloud branch finally outputs a template point set and a search point set each containing N points, where every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i). For brevity, these outputs are subsequently denoted F_p^t (pseudo point cloud features of the template region) and F_p^s (pseudo point cloud features of the search region). This branch quickly aggregates the texture features of the pseudo point cloud and in the end relies on only a small number of points to represent the local texture of foreground and background, which effectively reduces the computational cost of the subsequent network; a small sketch of this set-abstraction stage follows.
Unlike the pseudo point cloud branch, the real point cloud branch first voxelizes the template point cloud
and the search point cloud, converting the sparse point clouds into dense voxels, and then extracts the geometric features of the points inside the voxels with three-dimensional sparse convolution, generating template three-dimensional voxel features and search three-dimensional voxel features of dimension W×L×H×C. The height channel is then merged into the feature channel to output a denser template BEV feature and search BEV feature of dimension W×L×C', where C' = H×C; these are subsequently denoted F_r^t and F_r^s. This helps to mitigate the influence of point cloud sparsity; a sketch of the height-to-channel flattening follows.
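A minimal sketch of the height-to-channel flattening that produces the BEV features; the (B, C, H, L, W) memory layout is an assumed convention.

```python
import torch

def voxel_to_bev(voxel_feat: torch.Tensor) -> torch.Tensor:
    """voxel_feat: (B, C, H, L, W) dense voxel features -> BEV features (B, H*C, L, W)."""
    b, c, h, l, w = voxel_feat.shape
    return voxel_feat.reshape(b, c * h, l, w)   # merge the height channel into the feature channel

bev = voxel_to_bev(torch.randn(2, 64, 4, 128, 128))   # -> shape (2, 256, 128, 128), C' = H * C
```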
Multi-modal feature interaction and enhancement: attention-enhanced versions of the pseudo point cloud template feature F_p^t,
the pseudo point cloud search feature F_p^s, the real point cloud template feature F_r^t and the real point cloud search feature F_r^s are obtained as follows.
First, a self-attention mechanism redistributes the weights of the point cloud and pseudo point cloud features; the size of the features is unchanged in this process. Let F be an input feature; after self-attention processing it becomes the attention-enhanced feature
F̂. The feature F̂ is then passed through a cross-attention mechanism that constructs semantic associations between the modalities and outputs the multi-modal semantically enhanced feature F̃.
The following first introduces the attention mechanism rationale:
The attention mechanism consists of three parts: input feature and position encoding, similarity calculation, and feature weighting. Its main idea is to generate feature weights adaptively according to feature similarity, so that the difference between important and unimportant parts of the features becomes more pronounced. The general formula of the attention mechanism is:
Q, K, V = α(F + P), β(F + P), γ(F)
Attention(Q, K, V) = Softmax(Q·K^T / √d)·V
where F is the input feature of the attention mechanism; α, β and γ denote linear layers or multi-layer perceptrons whose weights are not shared; P denotes the position encoding feature; and Q, K, V denote the Query, Key and Value matrices of the attention mechanism. The feature similarity between Q and K is computed as Q·K^T, divided by the scaling factor √d, normalized with Softmax, and then multiplied with V to produce the attention-enhanced feature; a compact code rendering follows.
The detailed process of self-attention and cross-attention will be described below in conjunction with fig. 3:
in the self-attention-enhancing stage, the principle is shown in the following formula:
F̂_r = LayerNorm(F_r + MultiheadAttn(Q_r, K_r, V_r))
where F_r is the input feature of the real point cloud; Q_r, K_r, V_r are all obtained from the input feature F_r by linear transformations; MultiheadAttn denotes concatenating the results computed by multiple heads; LayerNorm denotes applying layer normalization to the features; and F̂_r denotes the enhanced feature that is output.
Taking the search point cloud characteristics as the input of a self-attention mechanism to obtain re-weighted search point cloud characteristics; taking the template point cloud characteristics as the input of a self-attention mechanism to obtain the re-weighted template point cloud characteristics;
and for the pseudo point cloud characteristics, realizing characteristic enhancement according to the following formula:
F̂_p = LayerNorm(F_p + MultiheadAttn(Q_p, K_p, V_p))
where F_p is the input feature of the pseudo point cloud; Q_p, K_p, V_p are all obtained from the input feature F_p by linear transformations; MultiheadAttn denotes concatenating the results computed by multiple heads; LayerNorm denotes applying layer normalization to the features; and F̂_p denotes the enhanced feature that is output.
Taking the searched pseudo-point cloud characteristics as the input of a self-attention mechanism to obtain the re-weighted searched pseudo-point cloud characteristics; and taking the template pseudo-point cloud characteristics as the input of a self-attention mechanism to obtain the template pseudo-point cloud characteristics after re-weighting.
In the cross attention enhancement stage, cross-modal feature enhancement is performed on the output feature of the previous stage according to the following formula:
F̃_r = LayerNorm(F̂_r + MultiheadAttn(Q_r, K_p, V_p))
F̃_p = LayerNorm(F̂_p + MultiheadAttn(Q_p, K_r, V_r))
where Q_r is obtained by a linear transformation of the self-attention-enhanced feature F̂_r, and K_p, V_p are obtained by linear transformations of the self-attention-enhanced feature F̂_p (and symmetrically for Q_p, K_r, V_r); F̃_r denotes the final enhanced real point cloud feature produced by the cross-attention mechanism, and F̃_p denotes the final enhanced pseudo point cloud feature. Unlike the self-attention mechanism, the cross-attention mechanism generates the Query from the enhanced feature of the current branch and the Key and Value from the enhanced feature of the other branch; the attention mechanism computes the feature similarity between the modalities and strengthens the weights of features carrying similar semantic information, thereby constructing semantic associations between the modalities and finally producing the semantically enhanced features F̃.
The re-weighted search pseudo point cloud and search point cloud features are used as the inputs of cross-attention to obtain the final semantically enhanced search pseudo point cloud and real point cloud features; the re-weighted template pseudo point cloud and template point cloud features are used as the inputs of cross-attention to obtain the final semantically enhanced template pseudo point cloud and real point cloud features. A minimal sketch of this interaction-and-enhancement stage follows.
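A minimal sketch of this interaction-and-enhancement stage built on torch.nn.MultiheadAttention: each modality is first re-weighted by self-attention, then each branch queries the other through cross-attention. Both feature maps are assumed to have been flattened into token sequences of equal channel width, and the head count and single-layer depth are illustrative choices.

```python
import torch
import torch.nn as nn

class ModalInteraction(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, f_r, f_p):
        """f_r: (B, N, dim) real point cloud tokens; f_p: (B, M, dim) pseudo point cloud tokens."""
        # Self-attention: weight the important features inside each modality.
        f_r_hat = self.norm[0](f_r + self.self_r(f_r, f_r, f_r)[0])
        f_p_hat = self.norm[1](f_p + self.self_p(f_p, f_p, f_p)[0])
        # Cross-attention: Query from one branch, Key/Value from the other branch,
        # so features describing the same semantic object reinforce each other.
        f_r_out = self.norm[2](f_r_hat + self.cross_r(f_r_hat, f_p_hat, f_p_hat)[0])
        f_p_out = self.norm[3](f_p_hat + self.cross_p(f_p_hat, f_r_hat, f_r_hat)[0])
        return f_r_out, f_p_out
```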
And step 3: interaction and enhancement of multi-modal features are achieved by combining self-attention and cross-attention; the method comprises the following steps:
step 3.1: constructing a long-distance dependency relationship in a mode through a self-attention mechanism, learning importance degrees of different features and weighting the features;
step 3.2: through cross attention, semantic relations of different modes are constructed, so that the information of different modes can strengthen the characteristics of the same semantic object, and more robust multi-mode semantic characteristics are generated;
multi-modal similarity fusion: semantic enhancement of features with point clouds and pseudo-point clouds
, i.e. the semantically enhanced features F̃, is used to generate the geometric similarity S_geo ∈ R^(W×L×D) and the texture similarity S_tex ∈ R^(N×D), where W and L are the size of the feature map, N is the number of pseudo point cloud key points, and D is the feature dimension. The similarity is computed as
S = Correlation(F̃^s, F̃^t)
where Correlation is the cosine similarity function, F̃^s denotes the final enhanced feature of the search area and F̃^t denotes the final enhanced feature of the template region; a small sketch of this correlation step follows.
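A small sketch of the Correlation step: cosine similarity between every search feature and every template feature, usable for both the pixel-wise geometric branch and the point-wise texture branch. The flattened (B, N, C) token layout is an assumption.

```python
import torch
import torch.nn.functional as F

def correlation(search, template):
    """search: (B, N_search, C); template: (B, D, C) -> cosine similarity (B, N_search, D)."""
    s = F.normalize(search, dim=-1)
    t = F.normalize(template, dim=-1)
    return s @ t.transpose(-2, -1)   # cosine similarity of every search/template feature pair
```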
After the similarity of each modality (real point cloud and pseudo point cloud) has been generated, the geometric and texture similarities are fused as shown in fig. 4. To unify the representations of the two similarities, the sparse texture similarity of the pseudo point cloud branch is first converted into a dense BEV similarity feature: the texture similarity features are voxelized, the similarity features inside each voxel are aggregated, and a dense BEV similarity S'_tex ∈ R^(W×L×D) is produced. The texture similarity S'_tex and the geometric similarity S_geo are then fused through cross-attention to generate the more robust multi-modal similarity feature S_fus. The mathematical expression of the similarity fusion is:
Q_geo = S_geo·W_Q,  K_tex = S'_tex·W_K,  V_tex = S'_tex·W_V
S_fus = Softmax(Q_geo·K_tex^T / √d)·V_tex
S'_fus = MLP(S_fus) + S_fus
where MLP denotes a multi-layer perceptron, Q_geo is the Query generated from the geometric similarity, and K_tex, V_tex are the Key and Value generated from the texture similarity. Because this step encodes the geometric and texture cues of the target into the multi-modal similarity feature S'_fus, it improves the robustness of the tracker and alleviates failures in complex environments; a sketch of this fusion module follows.
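A sketch of the fusion module implementing the three equations above; the projection sizes and the flattened BEV layout of S_geo and S'_tex are assumptions.

```python
import torch
import torch.nn as nn

class SimilarityFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w_q, self.w_k, self.w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, s_geo, s_tex_bev):
        """s_geo, s_tex_bev: (B, W*L, D) geometric and voxelized texture similarity maps."""
        q = self.w_q(s_geo)                                  # Query from the geometric similarity
        k, v = self.w_k(s_tex_bev), self.w_v(s_tex_bev)      # Key/Value from the texture similarity
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        s_fus = attn @ v                                     # multi-modal similarity S_fus
        return self.mlp(s_fus) + s_fus                       # S'_fus with the MLP residual
```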
step 4: fuse the texture similarity of the image and the geometric similarity of the point cloud based on the cross-modal modeling capability of the attention mechanism; the method comprises the following steps:
step 4.1: apply a pixel-wise cosine similarity function to the semantically enhanced features of the search point cloud and the template point cloud to generate the geometric similarity S_geo ∈ R^(W×L×D), where W×L×D denotes the dimension of the tensor S_geo;
step 4.2: apply a point-wise cosine similarity function to the semantically enhanced features of the search pseudo point cloud and the template pseudo point cloud to generate the texture similarity S_tex ∈ R^(N×D), where N×D denotes the dimension of the tensor S_tex;
step 4.3: fuse the geometric similarity and the texture similarity through a cross-attention mechanism to generate the more robust multi-modal similarity feature S'_fus;
Target position prediction: after the fused similarity feature has been obtained in step 4, separate convolutional neural network branches are applied to the similarity feature S'_fus to classify the target position and to regress the target's relative offset and orientation.
Classification branch: each feature center of S'_fus is treated as a key point.
The classification branch predicts a score for every key point and outputs a heatmap Ŷ ∈ [0,1]^(W×H×1); the key point with the highest classification score in the heatmap indicates where the target is most likely to be. To construct the constraint for the loss function, the invention generates the ground-truth heatmap Y ∈ [0,1]^(W×H×1) with a Gaussian kernel:
Y_xy = exp(-((x - p_x)^2 + (y - p_y)^2) / (2·δ_p^2))
where (p_x, p_y) is the key point of the target and δ_p is a standard deviation adapted to the target size. The network parameters are then optimized with the Focal Loss:
L_h = -(1/N)·Σ_xy [ (1 - Ŷ_xy)^α·log(Ŷ_xy) if Y_xy = 1;  (1 - Y_xy)^β·(Ŷ_xy)^α·log(1 - Ŷ_xy) otherwise ]
where α and β are the hyper-parameters of the Focal Loss, N is the number of key points in the similarity feature S'_fus, and L_h is the value of the Focal Loss function; a sketch of the heatmap target and this loss follows.
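A sketch of the Gaussian heatmap target and the focal loss described above; the default values α = 2 and β = 4 follow common practice for this loss and are an assumption here, since the patent leaves them as hyper-parameters.

```python
import torch

def gaussian_heatmap(w, l, center, sigma):
    """Ground-truth heatmap of shape (L, W) centred on the target key point (p_x, p_y)."""
    ys, xs = torch.meshgrid(torch.arange(l).float(), torch.arange(w).float(), indexing="ij")
    return torch.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: predicted and ground-truth heatmaps of the same shape, values in [0, 1]."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                                   # positive locations (heatmap peaks)
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * (1 - pos)
    return -(pos_loss.sum() + neg_loss.sum()) / pos.sum().clamp(min=1.0)
```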
regression branches mainly predict three-dimensional attributes of targets
, namely the position offset (x_offset, y_offset), the height z and the orientation θ; the position offset is used to correct the key-point position so as to obtain a more accurate target localization result. The invention then adopts an L1 loss function to compute the loss of the target's three-dimensional attributes:
L_offset = |x̂_offset - x_offset| + |ŷ_offset - y_offset|,  L_z = |ẑ - z|,  L_θ = |θ̂ - θ|
where L_offset is the value of the position-offset loss function, L_z the value of the height loss function and L_θ the value of the orientation loss function; (x_offset, y_offset) denotes the position offset of a sample target used to train the model, z its height and θ its orientation, and x̂_offset, ŷ_offset, ẑ, θ̂ are the corresponding network predictions.
Finally, a final loss value L is obtained through weighted summation,
L = λ_h·L_h + λ_offset·L_offset + λ_z·L_z + λ_θ·L_θ
where L is the total loss function, λ_h is the coefficient of the Focal Loss term, λ_offset the coefficient of the offset loss, λ_z the coefficient of the height loss and λ_θ the coefficient of the orientation loss; a sketch of the combined loss follows.
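A sketch of the regression losses and the weighted total loss L; the λ weights are placeholders, since the patent does not state their values.

```python
import torch.nn.functional as F

def tracking_loss(heat_loss, pred, gt, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """pred/gt: dicts holding 'offset' (B, 2), 'z' (B, 1), 'theta' (B, 1) at the target key point."""
    l_offset = F.l1_loss(pred["offset"], gt["offset"])     # position-offset L1 loss
    l_z = F.l1_loss(pred["z"], gt["z"])                    # height L1 loss
    l_theta = F.l1_loss(pred["theta"], gt["theta"])        # orientation L1 loss
    lam_h, lam_off, lam_z, lam_theta = lambdas
    return lam_h * heat_loss + lam_off * l_offset + lam_z * l_z + lam_theta * l_theta
```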
step 5: according to the multi-modal similarity feature S'_fus, predict the spatial position and orientation of the target with a multi-layer convolutional structure; the method comprises the following steps:
step 5.1: extracting target clues implied in the multi-modal similarity characteristics through a multi-layer convolution structure to generate a characteristic diagram;
step 5.2: predicting the confidence coefficient of a target and a corresponding spatial attribute at each position in the characteristic diagram, wherein the position with the highest confidence coefficient is the position to be positioned of the target, and the spatial attribute is the deviation amount and the direction of the position of the target and is used for correcting the position to be positioned;
step 5.3: and taking the corrected to-be-positioned position as a prediction result of the tracking target.
The invention provides a three-dimensional single-target tracking method based on multi-modal information fusion that solves the problem of single-modality trackers confusing objects with similar geometric structures, thereby producing more robust tracking results. The key technical improvements are as follows:
(1) Spatial alignment of multi-modal data: pixels are mapped into the spatial coordinate system of the point cloud using the point cloud depth and the camera's intrinsic and extrinsic parameters, generating a pseudo point cloud with texture features; this reduces the difficulty of fusing multi-modal features in the subsequent network and improves the utilization of image features.
(2) Multi-modal dual-stream feature extraction network: different backbones extract the multi-modal features efficiently. The point cloud branch generates dense BEV geometric features, which helps mitigate the effect of point cloud sparsity, while the pseudo point cloud branch represents the local texture of the target and background with only a small number of points, which reduces the computational overhead of the subsequent network.
(3) Multi-modal information interaction and enhancement mechanism: important features within each modality are adaptively enhanced by self-attention, and multi-modal semantic associations are then built by cross-attention so that the features of the same semantic object are strengthened.
(4) Multi-modal similarity fusion: geometric and texture similarities are fused adaptively through an attention mechanism, producing a similarity feature that encodes the target's geometric and texture characteristics and is used for the final target regression, thereby improving the robustness of the tracker in complex scenes such as sparse point clouds or interference from many structurally similar objects.

Claims (8)

1. A three-dimensional single-target tracking method based on multi-modal information fusion, characterized by comprising the following steps:
step 1: spatially aligning the images and point cloud data acquired by different sensors;
step 2: constructing a dual-stream feature extraction network based on a deep learning method, with different network branches extracting the high-dimensional semantic features of the pseudo point cloud and the point cloud in parallel;
step 3: realizing interaction and enhancement of the multi-modal features by combining self-attention and cross-attention;
step 4: computing the texture similarity of the image and the geometric similarity of the point cloud based on the semantically enhanced features generated in step 3, and fusing the two similarity features by cross-attention to generate the multi-modal similarity;
step 5: according to the multi-modal similarity feature S'_fus, predicting the spatial position and orientation of the target with a multi-layer convolutional structure.
2. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 1 comprises:
step 1.1: projecting the point cloud to an image plane according to internal and external parameters of the camera;
step 1.2: assigning point cloud depths to corresponding pixels;
step 1.3: and back projecting the pixels to a three-dimensional space, thereby generating a pseudo point cloud which has image texture information and is aligned with the original point cloud coordinate system.
3. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 2 comprises:
step 2.1: extracting texture features of the pseudo-point cloud;
step 2.2: and extracting the geometric characteristics of the real point cloud.
4. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 3, wherein the step 2.1 comprises:
step 2.1.1: for the pseudo point clouds obtained in step 1, applying the farthest point sampling algorithm to the template pseudo point cloud and the search pseudo point cloud to down-sample each of them to Q key points;
step 2.1.2: for the template pseudo point cloud and search pseudo point cloud obtained in step 2.1.1, performing KNN clustering on the points within radius R around each of the Q key points, and then aggregating the clustered point features into the key points with an MLP network;
step 2.1.3: repeatedly applying farthest point sampling, KNN clustering and the MLP network until the number of key points is reduced to N, and outputting a template point set and a search point set each containing N points, wherein every point s_i consists of a three-dimensional coordinate vector c_i and a C-dimensional descriptor f_i representing the local texture information of the object, i.e. s_i = (c_i, f_i).
5. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 3, wherein the step 2.2 comprises:
step 2.2.1: voxelizing the real template point cloud and search point cloud, converting them into dense voxel representations;
step 2.2.2: applying a three-dimensional sparse convolution network to the template point cloud and the search point cloud respectively to extract the geometric features of the points inside each voxel, generating template and search three-dimensional voxel features of dimension W×L×H×C, wherein W, L, H, C denote the dimensions of the corresponding tensor;
step 2.2.3: for the three-dimensional voxel features, merging the height channel into the feature channel to output denser BEV features of dimension W×L×C', wherein C' = H×C.
6. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 3 comprises:
step 3.1: constructing a long-distance dependency relationship in a mode through a self-attention mechanism, learning importance degrees of different features and weighting the features;
step 3.2: through cross attention, semantic relations of different modes are constructed, so that the information of different modes can strengthen the characteristics of the same semantic object, and more robust multi-mode semantic characteristics are generated.
7. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 4 comprises:
step 4.1: applying a pixel-wise cosine similarity function to the semantically enhanced features of the search point cloud and the template point cloud to generate the geometric similarity S_geo ∈ R^(W×L×D), wherein W×L×D denotes the dimension of the tensor S_geo;
step 4.2: applying a point-wise cosine similarity function to the semantically enhanced features of the search pseudo point cloud and the template pseudo point cloud to generate the texture similarity S_tex ∈ R^(N×D), wherein N×D denotes the dimension of the tensor S_tex;
step 4.3: fusing the geometric similarity and the texture similarity through a cross-attention mechanism to generate the more robust multi-modal similarity feature S'_fus.
8. The method for tracking the three-dimensional single target based on the multi-modal information fusion as claimed in claim 1, wherein the step 5 comprises:
step 5.1: extracting target clues implied in the multi-modal similarity characteristics through a multi-layer convolution structure to generate a characteristic diagram;
step 5.2: predicting the confidence coefficient of a target and a corresponding spatial attribute of each position in the characteristic diagram, wherein the position with the highest confidence coefficient is the position to be positioned of the target, and the spatial attribute is the deviation amount and the direction of the target position and is used for correcting the position to be positioned;
step 5.3: and taking the corrected to-be-positioned position as a prediction result of the tracking target.
CN202211545845.XA 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion Pending CN115880333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545845.XA CN115880333A (en) 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211545845.XA CN115880333A (en) 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion

Publications (1)

Publication Number Publication Date
CN115880333A (en) 2023-03-31

Family

ID=85765768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545845.XA Pending CN115880333A (en) 2022-12-05 2022-12-05 Three-dimensional single-target tracking method based on multi-mode information fusion

Country Status (1)

Country Link
CN (1) CN115880333A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173655A (en) * 2023-08-28 2023-12-05 南京航空航天大学 Multi-mode 3D target detection method based on semantic propagation and cross-attention mechanism
CN117557993A (en) * 2024-01-12 2024-02-13 杭州像素元科技有限公司 Construction method and application of double-frame interaction perception 3D association detection model
CN117557993B (en) * 2024-01-12 2024-03-29 杭州像素元科技有限公司 Construction method and application of double-frame interaction perception 3D association detection model

Similar Documents

Publication Publication Date Title
Yang et al. Multifeature fusion-based object detection for intelligent transportation systems
Wang et al. Multi-modal 3d object detection in autonomous driving: A survey and taxonomy
Yi et al. Segvoxelnet: Exploring semantic context and depth-aware features for 3d vehicle detection from point cloud
Xu et al. Head pose estimation using deep neural networks and 3D point clouds
CN115880333A (en) Three-dimensional single-target tracking method based on multi-mode information fusion
Wang et al. Nearest neighbor-based contrastive learning for hyperspectral and LiDAR data classification
Wang et al. An overview of 3d object detection
CN115272416A (en) Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion
Ouyang et al. A cgans-based scene reconstruction model using lidar point cloud
Zheng et al. Dim target detection method based on deep learning in complex traffic environment
Wang et al. A survey of 3D point cloud and deep learning-based approaches for scene understanding in autonomous driving
Karangwa et al. Vehicle detection for autonomous driving: A review of algorithms and datasets
Huang et al. Small target detection model in aerial images based on TCA-YOLOv5m
Zhou et al. Retrieval and localization with observation constraints
CN117522990A (en) Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
Bi et al. Machine vision
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
Duan Deep learning-based multitarget motion shadow rejection and accurate tracking for sports video
Li et al. Target adaptive tracking based on GOTURN algorithm with convolutional neural network and data fusion
CN116664856A (en) Three-dimensional target detection method, system and storage medium based on point cloud-image multi-cross mixing
Zhao et al. DHA: Lidar and vision data fusion-based on road object classifier
Sun et al. The recognition framework of deep kernel learning for enclosed remote sensing objects
Ma et al. Multi-modal information fusion for LiDAR-based 3D object detection framework
Long et al. Radar fusion monocular depth estimation based on dual attention
CN117557599B (en) 3D moving object tracking method and system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230930

Address after: 110819 No. 3 lane, Heping Road, Heping District, Shenyang, Liaoning 11

Applicant after: Northeastern University

Applicant after: 63983 FORCES, PLA

Address before: 110819 No. 3 lane, Heping Road, Heping District, Shenyang, Liaoning 11

Applicant before: Northeastern University
