CN115797408A - Target tracking method and device fusing multi-view image and three-dimensional point cloud - Google Patents

Target tracking method and device fusing multi-view image and three-dimensional point cloud

Info

Publication number
CN115797408A
CN115797408A CN202211522027.8A
Authority
CN
China
Prior art keywords
dimensional
point cloud
target
track
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211522027.8A
Other languages
Chinese (zh)
Inventor
冯建江
张猛
郭文轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202211522027.8A priority Critical patent/CN115797408A/en
Publication of CN115797408A publication Critical patent/CN115797408A/en
Pending legal-status Critical Current

Abstract

The invention discloses a target tracking method and device fusing multi-view images and a three-dimensional point cloud. The method first fuses multi-view multi-modal data to predict the three-dimensional positions of targets in a scene, then uses the three-dimensional detection results to obtain each target's three-dimensional spatial information and its appearance features in the multi-view images, computes a matching score matrix, and finally obtains an accurate three-dimensional target tracking result. Track repair is performed by matching the appearance features of the current object against the appearance features of historical tracks awaiting recovery. This three-dimensional target tracking method largely eliminates problems such as track exchange and target loss seen in single-modality tracking results, yields long-term target tracking tracks with high accuracy, good continuity and strong robustness, and facilitates scene perception and security monitoring.

Description

Target tracking method and device fusing multi-view image and three-dimensional point cloud
Technical Field
The invention relates to the technical field of computer vision perception, in particular to a target tracking method and device fusing a multi-view image and a three-dimensional point cloud.
Background
In the field of three-dimensional vision, accurate three-dimensional target detection and tracking is the basis of visual scene perception and analysis. In recent years, the rapid development of fields such as autonomous driving and scene monitoring has created an urgent demand for high-precision target tracking algorithms, making three-dimensional target tracking one of the most important research directions in computer vision. The detection-and-tracking task takes raw sensor data as input and outputs accurate target positions and tracking ids; it is the basis of subsequent stages such as path planning and an indispensable part of the whole system. Faced with the requirement of accurate three-dimensional positioning and tracking, depth cameras and multi-camera sensors suffer from low precision, short positioning range and strong sensitivity to illumination. Lidar offers long range, high precision and strong stability, but it is relatively expensive, its point clouds are sparse, and targets lack texture features. Existing three-dimensional target tracking techniques obtain target tracks directly from single-modality sensor input, so their results are limited by the shortcomings of the single sensor used, and it is difficult to obtain high-quality tracking results in complex and varied scenes. Fusing the multi-modal data of two-dimensional images and three-dimensional point clouds can therefore compensate for the weaknesses of a single sensor, enhance the robustness of tracking, and greatly improve the precision of the tracking results.
Existing three-dimensional target tracking techniques still have the following limitations and drawbacks. Three-dimensional tracking based on multi-view images obtains two-dimensional target detections by recognizing targets in the images, then relies on a calibrated multi-camera system and epipolar geometric constraints to compute the three-dimensional positions of targets, and finally performs matching and linking to obtain three-dimensional tracking tracks. Because of calibration errors and occlusion, it is difficult to obtain accurate three-dimensional positions from multi-view images alone, and the lack of spatial position information easily leads to lost tracks and id matching errors when targets occlude each other. Three-dimensional tracking based on three-dimensional point clouds uses lidar to collect high-precision point cloud data, obtains accurate three-dimensional detections of targets, computes a matching score matrix in three-dimensional space, and links the detections into three-dimensional tracks. Because point cloud data is sparse, targets lack texture features; matching scores and tracks are computed purely from the geometric positions of targets, so when several targets cluster together in three-dimensional space, tracking ids are easily exchanged.
With the continuous development of three-dimensional imaging systems, the target tracking technology based on the traditional single-mode data acquisition system cannot meet the requirement of robust high-precision tracking.
Disclosure of Invention
The present invention is directed to solving, at least in part, one of the technical problems in the related art.
Therefore, the invention provides a target tracking method fusing multi-view images and a three-dimensional point cloud. The method is divided into two stages: the first stage performs three-dimensional target detection by fusing multi-view images and three-dimensional point cloud data, and the second stage outputs target tracking tracks by matching and linking detections based on their spatial information and appearance features. In the first stage, two-dimensional images and three-dimensional point clouds from multiple viewpoints in the scene are input simultaneously for feature extraction and fusion. For the two-dimensional images, a two-dimensional feature extraction network obtains image features from each viewpoint, and the feature maps are projected onto a three-dimensional bird's-eye view according to the respective camera parameters and spliced and fused into a multi-view two-dimensional feature map. For the three-dimensional point clouds, the point clouds from lidars at different viewpoints are spatio-temporally registered and fused directly into a multi-angle dense point cloud, and a three-dimensional feature extraction network then produces a point cloud feature map. The two-dimensional feature map and the point cloud feature map are spliced and fused according to spatial position into a multi-modal feature map, and a detector produces an accurate three-dimensional detection result that serves as the input of the second stage. In the second stage, the detections from the first stage are matched and linked to obtain the final tracking tracks. The invention computes the intersection over union of target detection boxes in three-dimensional space as the spatial relation metric between targets. Further, to fuse features from both modalities, the invention projects the three-dimensional target detection boxes onto the input multi-view two-dimensional images to obtain the target regions on the images, extracts the appearance features of each target from the two-dimensional images, and computes the cosine distance of appearance features between different targets as the appearance relation metric. After the two relation metric matrices are fused, a matching algorithm links the detection boxes and outputs the final target tracking tracks. For a target that leaves the scene or is severely occluded, the detection algorithm may miss it, breaking its tracking track. When such a target is re-identified in the scene, the method matches the appearance features of the current object against the appearance features of the historical track awaiting recovery to repair the track and improve track completeness.
The invention also aims to provide a target tracking device fusing the multi-view image and the three-dimensional point cloud.
In order to achieve the above object, the present invention provides a target tracking method fusing a multi-view image and a three-dimensional point cloud, including:
respectively obtaining a two-dimensional feature map and a point cloud feature map based on the two-dimensional image and the three-dimensional point cloud of the multi-view angle;
inputting a multi-modal feature map obtained by fusing the two-dimensional feature map and the point cloud feature map into a pre-trained multi-modal fusion target detection model for detection to obtain a target detection frame; the target detection frame comprises a three-dimensional target detection result of a target track;
obtaining a spatial position matching matrix according to a detection frame comparison result of a target detection frame and a track tracking result at a historical moment, calculating two-dimensional image characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix, and aggregating the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix;
and matching the three-dimensional target detection result with the track tracking result at the historical moment based on the final relation metric matrix and a preset matching algorithm, and obtaining the tracking track result of the target track at the current moment according to the matching result.
In addition, the target tracking method fusing the multi-view image and the three-dimensional point cloud according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the obtaining a two-dimensional feature map and a point cloud feature map based on the multi-view two-dimensional image and the three-dimensional point cloud respectively includes: acquiring a multi-view two-dimensional image and a multi-view three-dimensional point cloud; extracting the multi-view two-dimensional image by using an image processing network to obtain an initial characteristic map, and projecting the initial characteristic map to a three-dimensional aerial view according to different camera parameters for splicing and fusing to obtain a multi-view fused two-dimensional characteristic map; and performing space-time registration on the three-dimensional point cloud by using a point cloud processing network, then fusing to obtain multi-angle dense point cloud, and performing feature extraction on the multi-angle dense point cloud to obtain a point cloud feature map.
Further, in one embodiment of the invention, the point cloud processing network comprises a feature extraction network and a backbone network; the backbone network comprises a first sub-network and a second sub-network; the image processing network comprises a convolutional neural network, the method further comprising: inputting the multi-view three-dimensional point cloud into the feature extraction network for point cloud conversion to obtain a pseudo image, and inputting the pseudo image into the first sub-network for feature extraction of feature maps with different spatial resolutions; inputting the features extracted from the feature maps with different spatial resolutions into the second sub-network for deconvolution operation, and then connecting the features in series to obtain the point cloud feature map; and inputting the multi-view two-dimensional image into the convolutional neural network to calculate a multi-channel feature map, and performing multi-camera information aggregation by using projection transformation of the multi-channel feature map so as to project the multi-channel feature map to a three-dimensional aerial view for splicing and fusing to obtain the two-dimensional feature map.
Further, in an embodiment of the present invention, the method further includes: acquiring sample data of a multi-modal characteristic diagram, carrying out data marking on a target detection frame by using the multi-sample data, and training a multi-modal fusion target detection model by using marked data; and inputting the two-dimensional image and the three-dimensional point cloud to be subjected to track tracking into the trained multi-mode fusion target detection model for target track detection to obtain a three-dimensional target detection result at each prediction moment.
Further, in an embodiment of the present invention, the method further includes: performing matching score calculation according to the geometric information of the detection frame of the three-dimensional target detection result at the current moment and the geometric information of the detection frame in the track tracking result obtained by matching at the historical moment to obtain the spatial position matching matrix; projecting the three-dimensional target detection frame to the multi-view two-dimensional image, and calculating the cosine distance of the appearance characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix; and adding and fusing the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix matched with the track tracking result of the target at the current moment and the historical moment.
Further, in the process of track repair processing, for a target which is lost and re-identified in a scene during tracking, the embodiment of the invention uses the matching of the appearance characteristics of the current object and the appearance characteristics of the historical track to be recovered to repair the track, thereby avoiding the situation that the track of the same object is broken, and improving the track integrity to obtain a more continuous and complete tracking result.
In order to achieve the above object, another aspect of the present invention provides a target tracking apparatus fusing a multi-view image and a three-dimensional point cloud, including:
the characteristic extraction module is used for respectively obtaining a two-dimensional characteristic diagram and a point cloud characteristic diagram based on the two-dimensional image and the three-dimensional point cloud of the multi-view angle;
the target detection module is used for inputting a multi-modal feature map obtained by fusing the two-dimensional feature map and the point cloud feature map into a pre-trained multi-modal fused target detection model for detection to obtain a target detection frame; the target detection frame comprises a three-dimensional target detection result of a target track;
the characteristic fusion module is used for obtaining a spatial position matching matrix according to a detection frame comparison result of a target detection frame and a track tracking result at a historical moment, calculating two-dimensional image characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix, and aggregating the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix;
and the matching output module is used for matching the three-dimensional target detection result with the track tracking result at the historical moment based on the final relation measurement matrix and a preset matching algorithm, and obtaining the tracking track result of the target track at the current moment according to the matching result.
According to the target tracking method and device fusing the multi-view images and the three-dimensional point cloud, the problems of track exchange, target loss and the like in a single-modal tracking result can be greatly eliminated, the target tracking track with high accuracy, good continuity and strong robustness can be obtained for a long time, and convenience is provided for scene perception and security monitoring.
The beneficial effects of the invention are as follows:
1) Compared with conventional single-modality multi-target recognition methods, the method fuses two-dimensional images, which carry rich appearance information, with three-dimensional point clouds, which carry accurate spatial information; the two modalities are fused at the feature level on the bird's-eye view, which greatly improves target recognition and detection performance in the scene.
2) Traditional point-cloud-based three-dimensional multi-target tracking algorithms use only the detected spatial information in the tracking stage and cannot distinguish different targets by appearance. The method introduces the appearance features of targets in the multi-view images in the tracking stage and computes appearance similarity, enhancing the discrimination of matching between different targets and thereby producing more accurate, continuous and robust tracking tracks.
3) For objects which disappear and reappear in the scene, the invention uses the matching of the appearance characteristics of the current object and the appearance characteristics of the historical track to be recovered to repair the track, thereby avoiding the situation that the track of the same object is broken and obtaining a more continuous and complete tracking result.
4) One typical application scenario of the invention is acquiring information about athletes during a game. With the multi-view multi-modal data, the precise positions and tracking tracks of players on the field can be obtained, from which running distance, speed and other statistics can be computed, which facilitates match analysis.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a target tracking method fusing a multi-view image and a three-dimensional point cloud according to an embodiment of the present invention;
FIG. 2 is a diagram of a target tracking method architecture incorporating multi-view images and three-dimensional point clouds in accordance with an embodiment of the invention;
FIG. 3 is an architectural diagram of target tracking based on spatial information and appearance characteristics according to an embodiment of the invention;
fig. 4 is a schematic structural diagram of a target tracking apparatus based on fusion of a multi-view image and a three-dimensional point cloud according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following describes a target tracking method and apparatus fusing a multi-view image and a three-dimensional point cloud according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a flowchart of a target tracking method fusing a multi-view image and a three-dimensional point cloud according to an embodiment of the present invention.
As shown in fig. 1, the method includes, but is not limited to, the following steps:
s1, respectively obtaining a two-dimensional feature map and a point cloud feature map based on a two-dimensional image and a three-dimensional point cloud of multiple visual angles;
s2, inputting a multi-modal feature map obtained by fusing the two-dimensional feature map and the point cloud feature map into a pre-trained multi-modal fused target detection model for detection to obtain a target detection frame; the target detection frame comprises a three-dimensional target detection result of a target track;
s3, obtaining a spatial position matching matrix according to the comparison result of the target detection frame and the detection frame of the track tracking result at the historical moment, calculating the two-dimensional image characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix, and aggregating the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix;
and S4, matching the three-dimensional target detection result with the track tracking result at the historical moment based on the final relation metric matrix and a preset matching algorithm, and obtaining the track tracking result of the target track at the current moment according to the matching result.
According to the target tracking method fusing the multi-view image and the three-dimensional point cloud, disclosed by the embodiment of the invention, the track is repaired by matching the appearance characteristics of the current object with the appearance characteristics of the historical track to be restored. The three-dimensional target tracking method can greatly eliminate the problems of track exchange, target loss and the like in a single modal tracking result, obtains the target long-term tracking track with high accuracy, good continuity and strong robustness, and provides convenience for scene perception and security monitoring.
The target tracking method fusing the multi-view image and the three-dimensional point cloud according to the embodiment of the invention is explained in detail below with reference to the accompanying drawings.
The following symbols are first explained: t: the current time being processed by the algorithm; S: the relation metric matrix; S1: the spatial position matching matrix; S2: the appearance feature matching matrix; B_i^t: the image block of the target in the i-th camera at time t; f_i^t: the appearance feature of the image block of the target in the i-th camera at time t.
Specifically, the whole system is divided into two stages, namely acquiring a three-dimensional target detection result and tracking a target based on spatial information and appearance characteristics. As shown in fig. 2 and 3.
The first stage: multi-view, multi-modal three-dimensional target detection.
It can be understood that, for the multi-view multi-modal three-dimensional target detection task, the invention simultaneously inputs two-dimensional images and three-dimensional point clouds from multiple viewpoints in the scene for feature extraction and fusion. For the two-dimensional images, a two-dimensional feature extraction network obtains image features from each viewpoint, and the feature maps are projected onto a three-dimensional bird's-eye view according to the respective camera parameters and spliced and fused into a multi-view two-dimensional feature map. For the three-dimensional point clouds, the point clouds from lidars at different viewpoints are spatio-temporally registered and directly fused into a multi-angle dense point cloud, and a three-dimensional feature extraction network then produces a point cloud feature map. The two-dimensional feature map and the point cloud feature map are spliced and fused according to spatial position into a multi-modal feature map, and a detection head then produces an accurate three-dimensional target detection result.
Specifically, a multi-modal fusion target detection deep learning network is constructed. The invention uses separate feature extraction networks and backbone networks to process the two kinds of data; the two resulting feature maps are spliced and fused according to spatial position into a multi-modal feature map, and a detection head then produces an accurate three-dimensional target detection result.
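As an illustrative aid (not part of the original disclosure), the following Python sketch shows one way the channel-wise fusion of the two bird's-eye-view feature maps and a simple detection head could be organized. The PyTorch framework, the layer sizes and the anchor-free style head are assumptions for illustration only.

```python
# Minimal sketch: concatenate the image BEV feature map and the point-cloud BEV
# feature map along channels, then regress per-cell class scores and 3D boxes.
# Channel counts and the head architecture are illustrative, not from the patent.
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    def __init__(self, img_ch, pc_ch, num_classes=1, box_dim=7):
        super().__init__()
        fused = img_ch + pc_ch
        self.neck = nn.Sequential(
            nn.Conv2d(fused, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.cls_head = nn.Conv2d(256, num_classes, 1)   # per-cell class score
        self.box_head = nn.Conv2d(256, box_dim, 1)       # x, y, z, l, w, h, yaw per cell

    def forward(self, img_bev, pc_bev):
        # both maps must live on the same bird's-eye-view grid (B, C, H, W)
        x = self.neck(torch.cat([img_bev, pc_bev], dim=1))
        return self.cls_head(x), self.box_head(x)
```

The only requirement inherited from the method is that both feature maps are defined on the same bird's-eye-view grid, so that channel-wise concatenation aligns the same spatial positions.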
As an example, a point cloud processing network is constructed. The point cloud processing network designed by the invention comprises two parts: the system comprises a feature extraction network and a backbone network, wherein the point cloud feature extraction network is used for point cloud feature coding, and the backbone network is used for further processing the extracted features.
Specifically, the function of the feature extraction network is to convert the point cloud into a pseudo-image and to extract point cloud features. First, the input point cloud is divided into a number of units, each unit being a three-dimensional grid cell obtained by dividing the point cloud with a fixed step length on the X-Y plane (Cartesian coordinate system). Because point cloud data is sparse, many units contain no points or only a few; to limit computational complexity, at most P non-empty units are processed, and each unit holds N point feature vectors. If a unit contains more than N points, N of them are chosen by random sampling; if it contains fewer than N, the remainder are zero-padded. In this way a frame of point cloud data is encoded into a dense tensor of dimension (D, P, N), where D is the dimension of a single point feature vector. Next, the data is processed by a convolutional network to generate a tensor of dimension (C, P, N); a max-pooling operation over each unit then yields a tensor of dimension (C, P). Finally, a scatter operator generates a pseudo-image of dimension (C, H, W), where H and W are the sizes of the grid on the X-Y plane.
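As an illustrative sketch (assuming a PointPillars-style encoding, which the text closely parallels but does not name), the following Python/NumPy code shows the unit partitioning, random sampling / zero padding, and the scatter step back to a (C, H, W) pseudo-image. The grid extents, step length and the caps P and N are placeholder values.

```python
# Hedged sketch of the pillar/unit encoding described above.
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              step=0.16, max_pillars=12000, max_points=32):
    """points: (M, 4) array of x, y, z, intensity. Returns the dense (D, P, N)
    tensor, the grid index of each unit, and the grid shape (H, W)."""
    W = int(round((x_range[1] - x_range[0]) / step))
    H = int(round((y_range[1] - y_range[0]) / step))
    ix = ((points[:, 0] - x_range[0]) / step).astype(int)
    iy = ((points[:, 1] - y_range[0]) / step).astype(int)
    keep = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
    points, ix, iy = points[keep], ix[keep], iy[keep]
    cells = {}
    for p, cx, cy in zip(points, ix, iy):
        cells.setdefault((cy, cx), []).append(p)
    cells = dict(list(cells.items())[:max_pillars])          # keep at most P non-empty units
    D, P = points.shape[1], len(cells)
    dense = np.zeros((D, P, max_points), dtype=np.float32)   # (D, P, N)
    coords = np.zeros((P, 2), dtype=int)
    for j, ((cy, cx), pts) in enumerate(cells.items()):
        pts = np.stack(pts)
        if len(pts) > max_points:                            # random sampling
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        dense[:, j, :len(pts)] = pts.T                       # zero padding otherwise
        coords[j] = (cy, cx)
    return dense, coords, (H, W)

def scatter_to_pseudo_image(pillar_feats, coords, grid_shape):
    """pillar_feats: (C, P) per-unit features (e.g. after a pointwise net + max-pool).
    Scatters them back onto the grid to form the (C, H, W) pseudo-image."""
    C, P = pillar_feats.shape
    H, W = grid_shape
    canvas = np.zeros((C, H, W), dtype=np.float32)
    canvas[:, coords[:, 0], coords[:, 1]] = pillar_feats
    return canvas
```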
As an example, the backbone network consists of a 2D convolutional neural network whose role is to extract high-dimensional features from the pseudo-image output by the feature extraction network. The backbone is divided into two sub-networks: a top-down sub-network extracts features at progressively smaller spatial resolutions, and a second sub-network upsamples the features extracted at the different resolutions to a common size by deconvolution and then concatenates them.
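A minimal PyTorch sketch of such a two-branch backbone is given below; channel counts, the number of stages and the use of batch normalization are illustrative assumptions, not taken from the patent.

```python
# Sketch: a top-down stride-2 convolution stack plus a deconvolution branch that
# brings each resolution back to a common size before channel-wise concatenation.
import torch
import torch.nn as nn

class BEVBackbone(nn.Module):
    def __init__(self, in_ch=64, chans=(64, 128, 256), out_ch=128):
        super().__init__()
        self.down, self.up = nn.ModuleList(), nn.ModuleList()
        c_prev = in_ch
        for i, c in enumerate(chans):
            self.down.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True)))
            # deconvolution returns each scale to the first stage's resolution
            self.up.append(nn.Sequential(
                nn.ConvTranspose2d(c, out_ch, 2 ** i, stride=2 ** i),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
            c_prev = c

    def forward(self, x):               # x: (B, in_ch, H, W) pseudo-image, H and W divisible by 8
        feats = []
        for d, u in zip(self.down, self.up):
            x = d(x)
            feats.append(u(x))
        return torch.cat(feats, dim=1)  # (B, out_ch * len(chans), H/2, W/2)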
As an example, an image processing network is constructed. The input of the image processing network is the images collected by cameras at different viewpoints; a convolutional neural network processes the input images simultaneously and extracts features. The convolutional neural network computes a C-channel feature map for each of the M input images, sharing weights across all computations. To retain a higher spatial resolution in the feature maps, the last 3 convolution layers are replaced with dilated convolutions. Before projection, the M feature maps are resized to a fixed size [H, W] (H and W denote the height and width of the feature map).
Furthermore, multi-camera information is aggregated through projective transformation of the feature maps: the feature maps from the multiple viewpoints are projected onto the three-dimensional bird's-eye view and spliced and fused. From the camera parameters, the correspondence between picture pixels and coordinates on the ground can be obtained, so a feature map can be projected through a set of ground coordinates (z = 0) and the corresponding set of image pixels. To strengthen the network's perception of spatial position, the projected feature maps are spliced and fused with an X-Y coordinate map to obtain a top-view feature map with M×C+2 channels. The two-dimensional feature map and the point cloud feature map are then spliced and fused according to spatial position into a multi-modal feature map, and a detection head produces an accurate three-dimensional target detection result.
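The following hedged sketch illustrates one way this ground-plane (z = 0) projection could be realized with per-camera homographies derived from 3×4 projection matrices; the grid extents, the cv2.warpPerspective call and the feat_scale parameter are assumptions for illustration, not the patent's exact procedure.

```python
# Sketch: project each camera's feature map onto the ground-plane BEV grid and
# stack the M projected maps together with an X-Y coordinate map (M*C + 2 channels).
import cv2
import numpy as np

def bev_homography(P, x_range, y_range, out_hw, feat_scale=1.0):
    """P: 3x4 world-to-pixel projection matrix. Returns the 3x3 homography mapping
    BEV grid pixels (u, v, 1) to feature-map pixels, via the ground plane z = 0."""
    H_bev, W_bev = out_hw
    sx = (x_range[1] - x_range[0]) / W_bev          # metres per BEV pixel in x
    sy = (y_range[1] - y_range[0]) / H_bev          # metres per BEV pixel in y
    # homogeneous BEV pixel -> homogeneous world point on the ground plane
    G = np.array([[sx, 0.0, x_range[0]],
                  [0.0, sy, y_range[0]],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    S = np.diag([feat_scale, feat_scale, 1.0])      # image pixel -> feature-map pixel
    return S @ (P @ G)

def project_feature_maps(feat_maps, proj_mats, x_range, y_range, out_hw):
    """feat_maps: list of (H, W, C) float32 per-camera feature maps.
    Returns the (H_bev, W_bev, M*C + 2) top-view map described in the text."""
    H_bev, W_bev = out_hw
    warped = []
    for F, P in zip(feat_maps, proj_mats):
        Hm = bev_homography(P, x_range, y_range, out_hw)
        inv = np.linalg.inv(Hm)                     # assumes the homography is invertible
        # warp channel by channel so each BEV cell samples the pixel it projects to
        warped.append(np.dstack([
            cv2.warpPerspective(np.ascontiguousarray(F[..., c]), inv, (W_bev, H_bev))
            for c in range(F.shape[-1])]))
    ys, xs = np.meshgrid(np.linspace(y_range[0], y_range[1], H_bev),
                         np.linspace(x_range[0], x_range[1], W_bev), indexing="ij")
    # append the X-Y coordinate map to strengthen spatial awareness
    return np.concatenate(warped + [xs[..., None], ys[..., None]], axis=-1)
```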
Further, training and inference of the multi-modal fusion target detection model. A multi-modal data acquisition system collects multi-view multi-modal data; three-dimensional target boxes are manually annotated on the collected data, and the annotated data is used to train the multi-modal fusion target detection model. During inference, a similar multi-modal data acquisition system collects the multi-view images and point cloud data to be tracked, and the trained multi-modal fusion target detection deep learning network produces a three-dimensional target detection result at each prediction moment in the point cloud and image sequence.
The second stage: target tracking based on spatial information and appearance features.
Specifically, the three-dimensional target detections at the current prediction moment, obtained in the first stage, are matched against the track tracking result of the previous moment. The intersection over union between each target and the detection boxes of the tracking result at the previous moment is computed as the spatial relation metric. The three-dimensional target detection boxes are projected onto the input multi-view two-dimensional images, and the cosine distance between the appearance features of each target and those of the track tracking result at the previous moment is computed as the appearance relation metric. After the two relation metric matrices are aggregated by averaging, the Hungarian matching algorithm matches the detections at the current moment with the track tracking result of the previous moment, finally yielding the tracking tracks of the targets at the current moment.
Further, the current prediction moment is denoted by time t and the previous moment by time t-1. The invention uses the distance intersection over union between detection boxes as the relation metric between detection boxes. If the detection result at time t contains m detection boxes and the tracking result obtained by matching at time t-1 contains n detection boxes, the computed matching scores form an m × n matrix, denoted S1.
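For illustration, the sketch below computes S1 with a plain axis-aligned bird's-eye-view intersection over union; the distance-weighted variant and rotated boxes used in practice would replace bev_iou without changing the surrounding structure, and the box parameterization here is hypothetical.

```python
# Sketch: build the (m, n) spatial position matching matrix S1 between the
# detections at time t and the track boxes at time t-1.
import numpy as np

def bev_iou(box_a, box_b):
    """Boxes given as (cx, cy, l, w): center and size in the ground plane."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def spatial_matching_matrix(dets_t, tracks_t1):
    """dets_t: m boxes at time t, tracks_t1: n track boxes at time t-1 -> (m, n) S1."""
    S1 = np.zeros((len(dets_t), len(tracks_t1)))
    for i, d in enumerate(dets_t):
        for j, tr in enumerate(tracks_t1):
            S1[i, j] = bev_iou(d, tr)
    return S1
```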
Further, thanks to the accurate three-dimensional spatial information of the three-dimensional point cloud, the fused multi-modal detection results obtained in the first stage carry accurate spatial information. However, for clustered targets at close range the detection boxes overlap, and tracking matching that uses only the three-dimensional spatial position is prone to track matching errors. In addition to spatial information, the method therefore extracts rich appearance features from the multi-view two-dimensional images and computes the cosine distance of appearance features between different targets, improving the precision of tracking matching.
As an example, the invention uses the projection matrix from the world coordinate system to the camera coordinate system to project the three-dimensional detection boxes obtained in the first stage onto the two-dimensional images, obtaining a two-dimensional detection box for each object on each image. The two-dimensional detection boxes are cropped from the images to obtain image blocks of each target at the different viewpoints. For time t, the image block of the target in the i-th camera is denoted B_i^t, and a two-dimensional image feature extraction deep learning network computes the appearance feature f_i^t of this image block. For each object obtained in the first stage, the invention extracts its appearance features at the K viewpoints, f_1^t, ..., f_K^t, and computes their cosine similarity with the K appearance features of the historical moments recorded in the track tracking result of the previous moment, f_1^(t-1), ..., f_K^(t-1). For a pair consisting of a detected target at time t and one at time t-1, a cosine similarity matrix of size K × K is computed; the matching scores are then screened and averaged to obtain the final appearance feature similarity, which serves as the appearance matching score.
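A short sketch of the per-pair appearance score is given below; the top-k screening rule is only one plausible reading of the "screened and averaged" step, and the feature dimensions are assumptions.

```python
# Sketch: cosine similarity between the K per-view features of a detection at time t
# and the K historical features stored for a track at time t-1, then top-k averaging.
import numpy as np

def appearance_score(feats_det, feats_track, top_k=3):
    """feats_det, feats_track: (K, D) appearance feature matrices."""
    a = feats_det / (np.linalg.norm(feats_det, axis=1, keepdims=True) + 1e-12)
    b = feats_track / (np.linalg.norm(feats_track, axis=1, keepdims=True) + 1e-12)
    sim = a @ b.T                                   # (K, K) cosine similarity matrix
    best = np.sort(sim.ravel())[::-1][:top_k]       # keep the strongest matches
    return float(best.mean())
```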
If the detection result at time t obtained in the first stage contains m detection boxes and the tracking result obtained by matching at time t-1 contains n detection boxes, the computed matching scores form an m × n matrix, denoted S2.
Further, the obtained spatial position matching matrix S1 and appearance feature matching matrix S2 are added and fused to obtain the final relation metric matrix S for matching the targets at the current moment with the tracking results of the previous moment. The invention uses a matching algorithm such as bipartite matching to connect the detected targets at the current moment with the tracking tracks. After the track matching and linking at time t is completed, the appearance features of each target at time t on the two-dimensional images are added to the historical appearance feature library of the tracking track to which the target belongs, for use in matching at the next moment.
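For illustration, the following sketch fuses S1 and S2 and solves the assignment with SciPy's Hungarian implementation; the minimum-score threshold for accepting a match is an assumed parameter.

```python
# Sketch: fuse the two matching matrices and link detections to tracks.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections_to_tracks(S1, S2, min_score=0.3):
    S = S1 + S2                                   # final (m, n) relation metric matrix
    rows, cols = linear_sum_assignment(-S)        # maximize total matching score
    matches = []
    unmatched_dets = set(range(S.shape[0]))
    unmatched_tracks = set(range(S.shape[1]))
    for r, c in zip(rows, cols):
        if S[r, c] >= min_score:                  # reject weak assignments
            matches.append((r, c))
            unmatched_dets.discard(r)
            unmatched_tracks.discard(c)
    return matches, sorted(unmatched_dets), sorted(unmatched_tracks)
```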
Further, regarding tracking track repair: a target may leave the scene or be occluded by an obstacle in the scene, so its tracking track is easily lost, and when detection of the target returns to normal its historical track should be restored. A traditional tracking algorithm, however, usually assigns it a new id and treats it as a completely new target, so the tracking algorithm needs to repair the reappearing tracking track. The specific method is as follows:
after the matching at the next moment is finished, three tracks are obtained, wherein the first track is a track successfully matched and connected with the detection result at the moment t, the second track is a historical track which cannot find a matched pair in the detection result at the moment t, and the third track is a detection result at the moment t without matching.
A historical track segment that is not successfully matched is not immediately considered lost; its appearance features continue to be kept, and it is treated as a historical track to be recovered.
For a detected target at time t that is not successfully matched, the method uses the appearance feature relation metric to compute the appearance score between the unmatched detection and each historical track to be recovered. If the score exceeds a threshold, the detection and the historical track are considered to belong to the same object and are connected, repairing the tracking track of the reappearing target and improving the continuity of that object's track.
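The sketch below illustrates this recovery step, reusing the appearance_score routine sketched earlier; the recover_pool structure, the score threshold and the id bookkeeping are assumptions for illustration.

```python
# Sketch: reconnect unmatched detections to tracks in the to-be-recovered pool by
# appearance, otherwise start a new track id.
def repair_tracks(unmatched_dets, recover_pool, det_feats, next_id, thresh=0.6):
    """unmatched_dets: detection indices; recover_pool: {track_id: (K, D) features};
    det_feats: {det_index: (K, D) features}. Returns ({det_index: track_id}, next_id)."""
    assignment = {}
    for d in unmatched_dets:
        best_id, best_s = None, thresh
        for tid, feats in recover_pool.items():
            s = appearance_score(det_feats[d], feats)
            if s > best_s:
                best_id, best_s = tid, s
        if best_id is not None:
            assignment[d] = best_id          # reconnect the reappearing target
            recover_pool.pop(best_id)
        else:
            assignment[d] = next_id          # genuinely new target: new id
            next_id += 1
    return assignment, next_id
```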
According to the target tracking method fusing the multi-view image and the three-dimensional point cloud, the problems of track exchange, target loss and the like in a single-mode tracking result can be greatly eliminated, the target tracking track with high accuracy, good continuity and strong robustness can be obtained for a long time, and convenience is provided for scene perception and security monitoring.
In order to implement the above embodiment, as shown in fig. 4, the present embodiment further provides a target tracking apparatus 10 for fusing a multi-view image and a three-dimensional point cloud, where the apparatus 10 includes a feature extraction module 100, a target detection module 200, a feature fusion module 300, and a matching output module 400.
The feature extraction module 100 is configured to obtain a two-dimensional feature map and a point cloud feature map based on a multi-view two-dimensional image and a multi-view three-dimensional point cloud respectively;
the target detection module 200 is configured to input a multi-modal feature map obtained by fusing the two-dimensional feature map and the point cloud feature map into a pre-trained multi-modal fused target detection model for detection to obtain a target detection frame; the target detection frame comprises a three-dimensional target detection result of a target track;
the feature fusion module 300 is configured to obtain a spatial position matching matrix according to a detection frame comparison result of the target detection frame and the track tracking result at the historical time, calculate two-dimensional image features of the target track and the track tracking result at the historical time to obtain an appearance feature matching matrix, and aggregate the spatial position matching matrix and the appearance feature matching matrix to obtain a final relationship metric matrix;
and the matching output module 400 is configured to match the three-dimensional target detection result with the track tracking result at the historical time based on the final relationship metric matrix and a preset matching algorithm, and obtain the track tracking result of the target track at the current time according to the matching result.
Further, the feature extraction module 100 is further configured to:
acquiring a multi-view two-dimensional image and a multi-view three-dimensional point cloud;
extracting a multi-view two-dimensional image by using an image processing network to obtain an initial feature map, and projecting the initial feature map to a three-dimensional aerial view according to different camera parameters for splicing and fusing to obtain a multi-view fused two-dimensional feature map;
and performing space-time registration on the three-dimensional point cloud by using a point cloud processing network, then fusing to obtain multi-angle dense point cloud, and performing feature extraction on the multi-angle dense point cloud to obtain a point cloud feature map.
Further, the point cloud processing network comprises a feature extraction network and a backbone network; the backbone network comprises a first sub-network and a second sub-network; the image processing network comprises a convolutional neural network, and the feature extraction module is further configured to:
inputting multi-view three-dimensional point cloud into a feature extraction network for point cloud conversion to obtain a pseudo image, and inputting the pseudo image into a first sub-network for feature extraction of feature maps with different spatial resolutions; inputting the features extracted from the feature maps with different spatial resolutions into a second sub-network for deconvolution operation, and then connecting the features in series to obtain a point cloud feature map; and
inputting the multi-view two-dimensional image into a convolution neural network to calculate a multi-channel feature map, and performing multi-camera information aggregation by using projection transformation of the multi-channel feature map so as to project the multi-channel feature map to a three-dimensional aerial view for splicing and fusing to obtain a two-dimensional feature map.
Further, the apparatus 10 further includes a model pre-training module, configured to:
acquiring sample data of the multi-modal characteristic diagram, carrying out data labeling on a target detection frame by using the multi-sample data, and training a multi-modal fusion target detection model by using the labeled data;
and inputting the two-dimensional image and the three-dimensional point cloud to be subjected to track tracking into the trained multi-mode fusion target detection model for target track detection, and obtaining a three-dimensional target detection result at each prediction moment.
Further, the apparatus 10 further comprises a matrix calculation module, configured to:
performing matching score calculation based on the geometric information of the detection frame of the current three-dimensional target detection result and the geometric information of the detection frame in the track tracking result obtained by matching the historical time to obtain a spatial position matching matrix;
projecting the three-dimensional target detection frame to a multi-view two-dimensional image, and calculating the cosine distance of the appearance characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix;
and adding and fusing the space position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix matched with the track tracking result of the target at the current moment and the track tracking result at the historical moment.
According to the target tracking device fusing the multi-view images and the three-dimensional point cloud, the problems of track exchange, target loss and the like in a single-modal tracking result can be greatly eliminated, the target tracking track with high accuracy, good continuity and strong robustness can be obtained for a long time, and convenience is provided for scene perception and security monitoring.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Claims (10)

1. A target tracking method fusing a multi-view image and a three-dimensional point cloud is characterized by comprising the following steps:
respectively obtaining a two-dimensional feature map and a point cloud feature map based on the two-dimensional image and the three-dimensional point cloud of the multi-view angle;
inputting a multi-modal feature map obtained by fusing the two-dimensional feature map and the point cloud feature map into a pre-trained multi-modal fusion target detection model for detection to obtain a target detection frame; the target detection frame comprises a three-dimensional target detection result of a target track;
obtaining a spatial position matching matrix according to a detection frame comparison result of a target detection frame and a track tracking result at a historical moment, calculating two-dimensional image characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix, and aggregating the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix;
and matching the three-dimensional target detection result with the track tracking result at the historical moment based on the final relation metric matrix and a preset matching algorithm, and obtaining the tracking track result of the target track at the current moment according to the matching result.
2. The method of claim 1, wherein the obtaining of the two-dimensional feature map and the point cloud feature map based on the multi-view two-dimensional image and the three-dimensional point cloud comprises:
acquiring a multi-view two-dimensional image and a multi-view three-dimensional point cloud;
extracting the multi-view two-dimensional image by using an image processing network to obtain an initial feature map, and projecting the initial feature map to a three-dimensional aerial view according to different camera parameters for splicing and fusing to obtain a multi-view fused two-dimensional feature map;
and performing space-time registration on the three-dimensional point cloud by using a point cloud processing network, then fusing to obtain multi-angle dense point cloud, and performing feature extraction on the multi-angle dense point cloud to obtain a point cloud feature map.
3. The method of claim 2, wherein the point cloud processing network comprises a feature extraction network and a backbone network; the backbone network comprises a first sub-network and a second sub-network; the image processing network comprises a convolutional neural network, the method further comprising:
inputting the multi-view three-dimensional point cloud into the feature extraction network for point cloud conversion to obtain a pseudo image, and inputting the pseudo image into the first sub-network for feature extraction of feature maps with different spatial resolutions; inputting the features extracted from the feature maps with different spatial resolutions into the second sub-network for deconvolution operation, and then connecting the features in series to obtain the point cloud feature map;
and inputting the multi-view two-dimensional image into the convolutional neural network to calculate a multi-channel feature map, and performing multi-camera information aggregation by using the projection transformation of the multi-channel feature map so as to project the multi-channel feature map to a three-dimensional aerial view for splicing and fusing to obtain the two-dimensional feature map.
4. The method of claim 1, further comprising:
acquiring sample data of a multi-modal characteristic diagram, carrying out data marking on a target detection frame by using the multi-sample data, and training a multi-modal fusion target detection model by using marked data;
and inputting the two-dimensional image and the three-dimensional point cloud to be subjected to track tracking into the trained multi-mode fusion target detection model for target track detection to obtain a three-dimensional target detection result at each prediction moment.
5. The method of claim 4, further comprising:
performing matching score calculation according to the geometric information of the detection frame of the three-dimensional target detection result at the current moment and the geometric information of the detection frame in the track tracking result obtained by matching at the historical moment to obtain the spatial position matching matrix;
projecting the three-dimensional target detection frame to the multi-view two-dimensional image, and calculating the cosine distance of the appearance characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix;
and adding and fusing the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix matched with the track tracking result of the target at the current moment and the track tracking result at the historical moment.
6. A target tracking device fusing a multi-view image and a three-dimensional point cloud, comprising:
the characteristic extraction module is used for respectively obtaining a two-dimensional characteristic diagram and a point cloud characteristic diagram based on the two-dimensional image and the three-dimensional point cloud of the multi-view angle;
the target detection module is used for inputting a multi-modal feature map obtained by fusing the two-dimensional feature map and the point cloud feature map into a pre-trained multi-modal fused target detection model for detection to obtain a target detection frame; the target detection frame comprises a three-dimensional target detection result of a target track;
the characteristic fusion module is used for obtaining a spatial position matching matrix according to a detection frame comparison result of a target detection frame and a track tracking result at a historical moment, calculating two-dimensional image characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix, and aggregating the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix;
and the matching output module is used for matching the three-dimensional target detection result with the track tracking result at the historical moment based on the final relation measurement matrix and a preset matching algorithm, and obtaining the tracking track result of the target track at the current moment according to the matching result.
7. The apparatus of claim 6, wherein the feature extraction module is further configured to:
acquiring a multi-view two-dimensional image and a multi-view three-dimensional point cloud;
extracting the multi-view two-dimensional image by using an image processing network to obtain an initial feature map, and projecting the initial feature map to a three-dimensional aerial view according to different camera parameters for splicing and fusing to obtain a multi-view fused two-dimensional feature map;
and performing space-time registration on the three-dimensional point cloud by using a point cloud processing network, then fusing to obtain multi-angle dense point cloud, and performing feature extraction on the multi-angle dense point cloud to obtain a point cloud feature map.
8. The apparatus of claim 7, wherein the point cloud processing network comprises a feature extraction network and a backbone network; the backbone network comprises a first sub-network and a second sub-network; the image processing network comprises a convolutional neural network, and the feature extraction module is further configured to:
inputting the multi-view three-dimensional point cloud into the feature extraction network for point cloud conversion to obtain a pseudo image, and inputting the pseudo image into the first sub-network for feature extraction of feature maps with different spatial resolutions; inputting the features extracted from the feature maps with different spatial resolutions into the second sub-network for deconvolution operation, and then connecting the features in series to obtain the point cloud feature map;
and inputting the multi-view two-dimensional image into the convolutional neural network to calculate a multi-channel feature map, and performing multi-camera information aggregation by using the projection transformation of the multi-channel feature map so as to project the multi-channel feature map to a three-dimensional aerial view for splicing and fusing to obtain the two-dimensional feature map.
9. The apparatus of claim 6, further comprising a model pre-training module to:
acquiring sample data of a multi-modal feature map, carrying out data marking on a target detection frame by using the multi-sample data, and training a multi-modal fusion target detection model by using marked data;
and inputting the two-dimensional image and the three-dimensional point cloud to be subjected to track tracking into the trained multi-mode fusion target detection model for target track detection, and obtaining a three-dimensional target detection result at each prediction moment.
10. The apparatus of claim 9, further comprising a matrix computation module configured to:
performing matching score calculation based on the geometric information of the detection frame of the current three-dimensional target detection result and the geometric information of the detection frame in the track tracking result obtained by matching the historical time to obtain the spatial position matching matrix;
projecting the three-dimensional target detection frame to the multi-view two-dimensional image, and calculating the cosine distance of the appearance characteristics of the target track and the track tracking result at the historical moment to obtain an appearance characteristic matching matrix;
and adding and fusing the spatial position matching matrix and the appearance characteristic matching matrix to obtain a final relation measurement matrix matched with the track tracking result of the target at the current moment and the track tracking result at the historical moment.
CN202211522027.8A 2022-11-30 2022-11-30 Target tracking method and device fusing multi-view image and three-dimensional point cloud Pending CN115797408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211522027.8A CN115797408A (en) 2022-11-30 2022-11-30 Target tracking method and device fusing multi-view image and three-dimensional point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211522027.8A CN115797408A (en) 2022-11-30 2022-11-30 Target tracking method and device fusing multi-view image and three-dimensional point cloud

Publications (1)

Publication Number Publication Date
CN115797408A true CN115797408A (en) 2023-03-14

Family

ID=85443852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211522027.8A Pending CN115797408A (en) 2022-11-30 2022-11-30 Target tracking method and device fusing multi-view image and three-dimensional point cloud

Country Status (1)

Country Link
CN (1) CN115797408A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523970A (en) * 2023-07-05 2023-08-01 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching
CN116523970B (en) * 2023-07-05 2023-10-20 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching
CN117095033A (en) * 2023-07-25 2023-11-21 重庆邮电大学 Multi-mode point cloud registration method based on image and geometric information guidance
CN117237401A (en) * 2023-11-08 2023-12-15 北京理工大学前沿技术研究院 Multi-target tracking method, system, medium and equipment for fusion of image and point cloud
CN117237401B (en) * 2023-11-08 2024-02-13 北京理工大学前沿技术研究院 Multi-target tracking method, system, medium and equipment for fusion of image and point cloud

Similar Documents

Publication Publication Date Title
CN110415342B (en) Three-dimensional point cloud reconstruction device and method based on multi-fusion sensor
CN109059954B (en) Method and system for supporting high-precision map lane line real-time fusion update
CN109631855B (en) ORB-SLAM-based high-precision vehicle positioning method
CN115797408A (en) Target tracking method and device fusing multi-view image and three-dimensional point cloud
CN103559711B (en) Based on the method for estimating of three dimensional vision system characteristics of image and three-dimensional information
CN108550143A (en) A kind of measurement method of the vehicle length, width and height size based on RGB-D cameras
Dong et al. A completely non-contact recognition system for bridge unit influence line using portable cameras and computer vision
KR102145557B1 (en) Apparatus and method for data fusion between heterogeneous sensors
Song et al. Stacked homography transformations for multi-view pedestrian detection
WO2022143237A1 (en) Target positioning method and system, and related device
CN112562005A (en) Space calibration method and system
CN111402632B (en) Risk prediction method for pedestrian movement track at intersection
CN114663473A (en) Personnel target positioning and tracking method and system based on multi-view information fusion
CN114842340A (en) Robot binocular stereoscopic vision obstacle sensing method and system
CN114639115A (en) 3D pedestrian detection method based on fusion of human body key points and laser radar
Bartl et al. Optinopt: Dual optimization for automatic camera calibration by multi-target observations
WO2024055966A1 (en) Multi-camera target detection method and apparatus
CN111882663A (en) Visual SLAM closed-loop detection method achieved by fusing semantic information
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
Kang et al. 3D urban reconstruction from wide area aerial surveillance video
CN111784798B (en) Map generation method and device, electronic equipment and storage medium
Khosravi et al. Vehicle speed and dimensions estimation using on-road cameras by identifying popular vehicles
Katai-Urban et al. Reconstructing atmospheric cloud particles from multiple fisheye cameras
CN113724333A (en) Space calibration method and system of radar equipment
JP3548652B2 (en) Apparatus and method for restoring object shape

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination