CN115641449A - Target tracking method for robot vision - Google Patents

Target tracking method for robot vision

Info

Publication number
CN115641449A
Authority
CN
China
Prior art keywords
attention, template, feature, module, network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211226432.5A
Other languages
Chinese (zh)
Inventor
侯跃恩
邓嘉明
罗志坚
高延增
刘茗铄
唐家晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaying University
Original Assignee
Jiaying University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiaying University filed Critical Jiaying University
Priority to CN202211226432.5A priority Critical patent/CN115641449A/en
Publication of CN115641449A publication Critical patent/CN115641449A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method for robot vision, belonging to the technical field of target tracking methods. The method comprises the following steps: (1) in the first frame of the image, manually obtain the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, crop the target image and the sample image blocks around it as templates, and obtain a template sample feature tensor through a feature-extraction deep network; (2) input a search-area sample into the feature-extraction deep network to obtain a search-area sample feature tensor; (3) input the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtain the tracking result through a classification network and a regression network. The invention aims to provide a target tracking method for robot vision with a high tracking success rate and a small tracking error that can realize real-time tracking; it is suitable for real-time target tracking.

Description

Target tracking method for robot vision
Technical Field
The present invention relates to a target tracking method, and more particularly, to a target tracking method for robot vision.
Background
Video target tracking has received much attention from researchers as an important topic in machine vision research. It aims to track a target through a video using the target state information given in the first frame, so as to obtain the target state in every frame. During tracking, situations such as target deformation, complex backgrounds with illumination changes, and target occlusion arise. In these cases, the feature structure of the target changes accordingly, making it difficult for the tracking algorithm to lock onto the target.
Since deep learning techniques were introduced into visual tracking, convolution has been widely used both for feature extraction and for fusing the template with the search region. Currently popular deep learning trackers are mainly built from convolution kernels; however, a convolution kernel cannot be made too large because of its computational cost. As a result, a convolution kernel cannot exchange long-range information in a single operation, and when similar objects appear or the target shape changes greatly, this shortcoming limits the model's ability to handle complex scenes.
The long-range dependency problem can be effectively addressed by introducing a self-attention mechanism, which has been applied successfully in machine translation, natural language processing and speech processing. It has also produced excellent experimental results in image processing tasks such as target tracking and target detection. Although the self-attention mechanism captures global information well, it does not pay particular attention to local information, which should carry a large weight around the target in target tracking. Therefore, this problem needs to be solved by developing a model that can handle local information while retaining the global modelling ability of the self-attention mechanism.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target tracking method for robot vision which has a high tracking success rate and a small tracking error and can realize real-time tracking.
The technical scheme of the invention is realized as follows: a target tracking method for robot vision, comprising the steps of:
(1) Manually obtaining, in the first frame of the image, the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, cropping the target image and the sample image blocks around it as templates, and obtaining a template sample feature tensor through a feature-extraction deep network;
(2) Inputting a search-area sample into the same feature-extraction deep network to obtain a search-area sample feature tensor;
(3) Inputting the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtaining the tracking result from the fused feature tensor through a classification network and a regression network.
In the above target tracking method for robot vision, in step (1), the feature-extraction deep network is specifically as follows: it uses the ResNet50 network as a baseline; ResNet50 comprises one stem layer and four branch layers containing 3, 4, 6 and 3 bottlenecks respectively;
in the feature-extraction deep network, the fourth layer of ResNet50 is discarded, and the down-sampling stride parameter of the Conv2d operator in the third layer is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used in place of the original 7 × 7 convolution kernel; in the other layers, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels; finally, a 1 × 1 convolution is added after the third layer.
In the above target tracking method for robot vision, in step (3), the involution-attention model is composed of an involution module, two Add & Norm modules and an FFN & ReLU module;
the involution module takes tensor A and tensor B as input; A ∈ R^(w×w×d) is used to construct the convolution tensor and B ∈ R^(w×w×d) is used to construct the involution kernels, where d is the number of channels and w × w is the scale of the image block;
to construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K are obtained as
Q = B′W_Q
K = B′W_K, (1)
where Q, K ∈ R^(w²×d);
the attention matrix M ∈ R^(w²×w²) is then obtained from formula (2);
M = softmax(QKᵀ/√d), (2)
the attention matrix M is then reshaped into an involution kernel tensor I ∈ R^(w×w×g×k×k), where g is the number of groups of involution kernels, w × w is the scale of the convolved image, and k × k is the involution kernel size.
In the above target tracking method for robot vision, the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: a search-area sample and a template-set sample, wherein the template-set sample consists of four templates and can be updated online;
when the input B is a search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area; this strategy is called involution attention strategy 1;
when the input B is a template-set tensor, the template-set tensor is formed by concatenating four templates; the i-th row of M describes the similarity between the i-th element in Q and all the elements of the four templates in K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template-set tensor; this strategy is called involution attention strategy 2.
In the above target tracking method for robot vision, in step (3), the feature-enhancement and feature-fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that they are based on the involution-attention model;
the specific steps for obtaining the fused feature tensor containing the template features and the search-area features are as follows: first, the template-set features F_T0 and the search-area features F_S0 are enhanced by the involution-attention template module and the involution-attention search-area module respectively, giving the enhanced features F_T1 and F_S1; then, the enhanced template features F_T1 and search-area features F_S1 are input simultaneously and cross-wise into the involution-attention template-search module and the involution-attention search-template module to obtain the fused features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature-enhancement fusion layer, which is repeated 4 times;
after the feature-enhancement fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
In the above target tracking method for robot vision, the involution-attention search-area module and the involution-attention template-search module use involution attention strategy 1 to obtain the involution kernels, and the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution attention strategy 2 to obtain the involution kernels.
In the above target tracking method for robot vision, in step (3), the classification network comprises 3 linear layers and 2 activations and is expressed as
f_c(F) = φ_2(φ_1(F*W_1)*W_2)*W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor f_c(F);
the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability that the sample is positive;
through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, the classification scores of the target region in S are all 1 and those of the background region are all 0.
In the above target tracking method for robot vision, in some cases, such as similar targets, occluded targets, or the target moving out of range, S may be contaminated; therefore, in the tracking method, a classification score higher than a predetermined value is regarded as a high score;
assuming that the numbers of high scores inside and outside the regression box are N_i and N_o and the area of the regression box is N_r, the update score is defined as s = (N_i − N_o)/N_r; when s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
In the above target tracking method for robot vision, in step (3), the regression network is established by estimating the probability distribution of the target box; the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has 4 channels, which represent the probability distributions of the left, top, right and bottom boundaries of the target box; the coordinates of the box are therefore
x_tl = Σ(x·P_left(x))
y_tl = Σ(y·P_top(y))
x_br = Σ(x·P_right(x))
y_br = Σ(y·P_bottom(y)), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom boundaries respectively; combining the IoU loss and the l_1 loss, the loss function of the regression network is
L_reg = λ_iou·L_iou(b, b̂) + λ_l·‖b − b̂‖_1, (6)
where λ_iou and λ_l are hyperparameters used to adjust the weights of the two terms, and b and b̂ are the ground-truth target box coordinates and the predicted target box coordinates respectively.
After the method is adopted, a template sample feature tensor is first obtained through the improved feature-extraction deep network from the template composed of the target image and the surrounding sample image blocks, and a search-area sample feature tensor is obtained from the search-area sample; the tracking result is then obtained through the novel feature-enhancement and feature-fusion network based on the involution-attention model, followed by the classification network and the regression network. This effectively increases the tracking success rate, reduces the tracking error, achieves real-time tracking, and improves the control performance of the robot.
Drawings
The invention will be further described in detail with reference to examples of embodiments shown in the drawings to which, however, the invention is not restricted.
FIG. 1 is a schematic diagram of the framework of the tracking method of the present invention;
FIG. 2 is a schematic view of the involution-attention model of the present invention;
FIG. 3 is a schematic illustration of the involution module of the present invention;
FIG. 4 is a schematic diagram illustrating the two attention-matrix reshaping strategies of the present invention;
FIG. 5 is a schematic representation of the relationship of the scoring heatmap of the present invention to a regression bounding box;
FIG. 6 is a block schematic diagram of the object tracking device of the present invention.
Detailed Description
Referring to FIG. 1, a target tracking method for robot vision according to the present invention includes the following steps:
(1) Manually obtaining, in the first frame of the image, the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, cropping the target image and the sample image blocks around it as templates, and obtaining a template sample feature tensor through a feature-extraction deep network;
(2) Inputting a search-area sample into the same feature-extraction deep network to obtain a search-area sample feature tensor;
(3) Inputting the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtaining the tracking result from the fused feature tensor through a classification network and a regression network.
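As a reading aid, these three steps can be arranged into the minimal tracking-loop sketch below; the helper names crop_template_set and crop_search_area, and the backbone, fusion_net, classifier and regressor callables, are hypothetical stand-ins for the networks described in this document, not the patented implementation itself.

```python
# Minimal sketch of steps (1)-(3); all module and helper names are hypothetical stand-ins.
def track_sequence(frames, init_box, backbone, fusion_net, classifier, regressor,
                   crop_template_set, crop_search_area):
    # Step (1): crop the target and surrounding sample blocks in the first frame as
    # templates and extract the template sample feature tensor with the deep network.
    templates = crop_template_set(frames[0], init_box)
    f_template = backbone(templates)

    results = [init_box]
    for frame in frames[1:]:
        # Step (2): crop a search area around the previous result and extract its
        # feature tensor with the same feature-extraction deep network.
        search_patch = crop_search_area(frame, results[-1])
        f_search = backbone(search_patch)

        # Step (3): involution-attention based feature enhancement and fusion, then
        # the classification and regression networks give the tracking result.
        f_fused = fusion_net(f_template, f_search)
        scores = classifier(f_fused)          # used for confidence / template update
        box = regressor(f_fused)              # predicted target box in this frame
        results.append(box)
    return results
```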
In step (1), in order to extract features more effectively, involution is used to reconstruct the existing feature-extraction deep network. The feature-extraction deep network provided by the invention is specifically as follows: it uses the ResNet50 network as a baseline; ResNet50 comprises one stem layer and four branch layers containing 3, 4, 6 and 3 bottlenecks respectively.
In the feature-extraction deep network, the fourth layer of ResNet50 is discarded, and, to obtain a higher feature resolution, the down-sampling stride parameter of the Conv2d operator in the third layer is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used in place of the original 7 × 7 convolution kernel; in the other layers, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels, so the feature-extraction deep network obtains a larger receptive field in a single operation. Finally, a 1 × 1 convolution is added after the third layer to reduce the channel dimension of the feature-extraction deep network output.
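A rough PyTorch sketch of these backbone changes (dropping the fourth layer, setting the layer-3 down-sampling stride to 1, and appending a 1 × 1 convolution) is given below; the involution replacement of the 7 × 7 stem kernel and the 3 × 3 kernels is only indicated in comments, and the 256-channel output width is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet50

class TrackerBackbone(nn.Module):
    """Sketch of the modified ResNet50 feature extractor (involution swap not shown)."""
    def __init__(self, out_channels=256):                 # output width is an assumption
        super().__init__()
        net = resnet50(weights=None)
        # Keep the stem and the first three branch layers; the fourth layer is discarded.
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2, self.layer3 = net.layer1, net.layer2, net.layer3
        # Change the down-sampling stride of the third layer's Conv2d from 2 to 1
        # so that the output feature map keeps a higher resolution.
        self.layer3[0].conv2.stride = (1, 1)
        self.layer3[0].downsample[0].stride = (1, 1)
        # In the full method the 7x7 stem convolution and every 3x3 convolution would
        # additionally be replaced by 7x7 involution kernels (omitted in this sketch).
        # A final 1x1 convolution reduces the channel dimension of the output.
        self.neck = nn.Conv2d(1024, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return self.neck(x)
```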
Table 1 lists the details of the modified kernels: the second column gives the size of the convolution kernel that was replaced, with the multiplier in brackets indicating how many times the convolution kernel of that layer is replaced; the third column gives the involution kernel size; the fourth column gives the number of channels of the convolution/involution kernel; and the last column gives the number of groups of involution kernels. The input to the backbone network is a template image and a search-area image; after passing through the backbone network, the backbone outputs the template features and the search-area features.
TABLE 1 Modified ResNet50 kernels (the table is reproduced as an image in the original publication)
Referring to FIG. 2, in the present embodiment, the involution-attention model in step (3) is composed of an involution module, two Add & Norm modules and an FFN & ReLU module.
Referring to FIG. 3, the involution module takes tensor A and tensor B as input; A ∈ R^(w×w×d) is used to construct the convolution tensor and B ∈ R^(w×w×d) is used to construct the involution kernels, where d is the number of channels and w × w is the scale of the image block.
To construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K are obtained as
Q = B′W_Q
K = B′W_K, (1)
where Q, K ∈ R^(w²×d);
the attention matrix M ∈ R^(w²×w²) is then obtained from formula (2),
M = softmax(QKᵀ/√d), (2)
The attention matrix M is then reshaped into an involution kernel tensor I ∈ R^(w×w×g×k×k), where g is the number of groups of involution kernels, w × w is the scale of the convolved image, and k × k is the involution kernel size.
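Under the reconstruction above, formulas (1) and (2) can be sketched in PyTorch as follows; the √d scaling inside the softmax and the d × d shape of W_Q and W_K are assumptions, and the final reshape of M into involution kernels (strategies 1 and 2 below) is omitted.

```python
import torch
import torch.nn as nn

class InvolutionKernelGenerator(nn.Module):
    """Sketch of the involution module's query/key attention, formulas (1)-(2)."""
    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)     # learnable W_Q (d x d assumed)
        self.w_k = nn.Linear(d, d, bias=False)     # learnable W_K (d x d assumed)
        self.d = d

    def forward(self, b):
        # b: tensor B of shape (d, w, w) used to build the involution kernels.
        d, w, _ = b.shape
        b_flat = b.reshape(d, w * w).t()            # unfold B into B' of shape (w*w, d)
        q = self.w_q(b_flat)                        # Q = B' W_Q, shape (w*w, d)
        k = self.w_k(b_flat)                        # K = B' W_K, shape (w*w, d)
        # Attention matrix M of shape (w*w, w*w); the sqrt(d) scaling is an assumption.
        m = torch.softmax(q @ k.t() / self.d ** 0.5, dim=-1)
        return m                                    # later reshaped into involution kernels

# Toy usage: a 64-channel 8x8 feature block gives a 64x64 attention matrix.
m = InvolutionKernelGenerator(d=64)(torch.randn(64, 8, 8))
assert m.shape == (64, 64)
```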
It is noted that the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: a search-area sample and a template-set sample, where the template-set sample consists of four templates and can be updated online. Referring to FIG. 4, the reshaping strategies for the two types of input B are shown.
When the input B is a search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K. As shown in FIG. 4(a), for simplicity, an M of shape 64 × 64 and an involution kernel of shape 2 × 2 are taken as an example. The dashed rectangle is one row of the matrix M, which can be reshaped into an 8 × 8 matrix. One element is then extracted from every 4 rows and every 4 columns to construct an involution kernel set of 16 groups of 2 × 2 kernels. Because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area. This strategy is called involution attention strategy 1.
When the input B is a template-set tensor, the present embodiment concatenates four templates to form the template-set tensor. As shown in FIG. 4(b), the 1-1 and 1-2 blocks represent the first template itself and the pair-wise similarity between the first and second templates, respectively. For example, the i-th row of M describes the similarity between the i-th element in Q and all the elements of the 4 templates in K. The dashed rectangle in M can be reshaped into an 8 × 8 matrix, where the blocks in the red, blue, yellow and green boxes are associated with the first, second, third and fourth templates respectively; specifically, in the figure, the number 2 corresponds to the yellow box, the number 3 to the red box, the number 4 to the blue box, and the number 6 to the green box. In each block, elements are extracted every two rows and two columns. Over all blocks, 16 groups of involution kernels are obtained. Because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template-set tensor. This strategy is called involution attention strategy 2.
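The numbers in the FIG. 4(a) example (a 64 × 64 matrix M, 2 × 2 kernels, 16 groups) make the strategy-1 reshaping easy to sketch for one row of M; the exact sampling order within the 8 × 8 block is an assumption.

```python
import torch

# Strategy 1 sketch: one row of M (length w*w = 64) -> 16 groups of 2x2 involution kernels.
w, k = 8, 2
stride = w // k                                   # sample one element every 4 rows/columns
row = torch.arange(w * w, dtype=torch.float32)    # stand-in for one row of the matrix M
block = row.reshape(w, w)                         # reshape the row into an 8x8 matrix

# Each (i, j) offset picks a strided 2x2 sub-grid of the block, so every kernel is
# sampled globally across the 8x8 block; 4x4 offsets give 16 groups of 2x2 kernels.
kernels = torch.stack([block[i::stride, j::stride]
                       for i in range(stride)
                       for j in range(stride)])
assert kernels.shape == (16, k, k)
```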
Further preferably, in step (3), the feature-enhancement and feature-fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that they are based on the involution-attention model.
The specific steps for obtaining the fused feature tensor containing the template features and the search-area features are as follows: first, the template-set features F_T0 and the search-area features F_S0 are enhanced by the involution-attention template module and the involution-attention search-area module respectively, giving the enhanced features F_T1 and F_S1; then, the enhanced template features F_T1 and search-area features F_S1 are input simultaneously and cross-wise into the involution-attention template-search module and the involution-attention search-template module to obtain the fused features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature-enhancement fusion layer, which is repeated 4 times.
After the feature-enhancement fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
In this embodiment, the involution-attention search-area module and the involution-attention template-search module use involution attention strategy 1 to obtain the involution kernels, and the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution attention strategy 2 to obtain the involution kernels.
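One possible reading of this module arrangement is sketched below: four feature-enhancement fusion layers built from the template, search-area, template-search and search-template modules, followed by the mixing module. The InvAttention class is only a placeholder for the involution-attention model, and the exact argument wiring of the cross modules is an assumption.

```python
import torch.nn as nn

class InvAttention(nn.Module):
    """Placeholder for one involution-attention block (strategy 1 or 2)."""
    def __init__(self, d, strategy):
        super().__init__()
        self.strategy = strategy
        self.proj = nn.Linear(d, d)            # the real block builds kernels from `context`

    def forward(self, x, context):
        return self.proj(x)

class FusionNetwork(nn.Module):
    """Sketch of the involution-attention feature enhancement and fusion network."""
    def __init__(self, d, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "template":        InvAttention(d, strategy=2),   # enhances F_T
                "search":          InvAttention(d, strategy=1),   # enhances F_S
                "template_search": InvAttention(d, strategy=1),   # cross fusion
                "search_template": InvAttention(d, strategy=2),   # cross fusion
            }) for _ in range(num_layers)
        ])
        self.mixing = InvAttention(d, strategy=2)                 # final mixing module

    def forward(self, f_t, f_s):
        for layer in self.layers:
            f_t1 = layer["template"](f_t, f_t)                    # F_T0 -> F_T1
            f_s1 = layer["search"](f_s, f_s)                      # F_S0 -> F_S1
            # Cross-wise inputs give the fused features F_T2 and F_S2 (wiring assumed).
            f_t = layer["search_template"](f_t1, f_s1)
            f_s = layer["template_search"](f_s1, f_t1)
        return self.mixing(f_s, f_t)                              # fused feature F for the heads
```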
Further preferably, in order to improve the robustness of tracking, the template set needs to be updated during the tracking process. However, when tracking drift, target occlusion or a large target displacement occurs, the current tracking result is not reliable. In order to ensure the reliability of the tracking result, the invention provides a classification network comprising 3 linear layers and 2 activations, which can be expressed as
f_c(F) = φ_2(φ_1(F*W_1)*W_2)*W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor f_c(F);
the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability that the sample is positive;
through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, the classification scores of the target region in S are all 1 and those of the background region are all 0.
However, in some cases, such as similar objects, occlusions, or out-of-range objects, S may become contaminated.
As shown in FIG. 5, the red box is the regression box produced by the regression network. In score heat map A, the regression box contains most of the high scores. In score heat map B, the regression box contains only part of the high scores because of a similar target. Thus, the result of heat map A is more reliable than that of heat map B, and it is reasonable to update the template set with the tracking result of heat map A. In the tracking method, a classification score higher than a predetermined value, which is 0.88 in this embodiment, is regarded as a high score. Assuming that the numbers of high scores inside and outside the regression box are N_i and N_o and the area of the regression box is N_r, the update score is defined as s = (N_i − N_o)/N_r. When s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
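The update test s = (N_i − N_o)/N_r can be written as a small helper, sketched below; the score-map coordinate convention and the way the update interval is passed in are assumptions, while the 0.88 high-score threshold follows this embodiment.

```python
import numpy as np

def should_update_template(score_map, box, tau, high=0.88, interval_reached=True):
    """Sketch of the update test: s = (N_i - N_o) / N_r, update when s > tau."""
    x_tl, y_tl, x_br, y_br = box                    # regression box in score-map coordinates
    high_mask = score_map > high                    # classification scores counted as 'high'

    inside = np.zeros_like(high_mask, dtype=bool)
    inside[y_tl:y_br, x_tl:x_br] = True

    n_i = np.count_nonzero(high_mask & inside)      # high scores inside the regression box
    n_o = np.count_nonzero(high_mask & ~inside)     # high scores outside the regression box
    n_r = max((x_br - x_tl) * (y_br - y_tl), 1)     # area of the regression box
    s = (n_i - n_o) / n_r
    return s > tau and interval_reached

# Toy usage on a 16x16 score map with a box covering its centre.
update = should_update_template(np.random.rand(16, 16), box=(4, 4, 12, 12), tau=0.5)
```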
Traditional anchor-free regression networks directly learn the target state and follow a Dirac delta distribution, which is limited in situations where the target boundary is not sharp enough, such as occlusion, motion blur, shadows and complex backgrounds. The invention establishes the regression network by estimating the probability distribution of the target box.
In this embodiment, the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has four channels, which represent the probability distributions of the left, top, right and bottom boundaries of the target box; the coordinates of the box are therefore
x_tl = Σ(x·P_left(x))
y_tl = Σ(y·P_top(y))
x_br = Σ(x·P_right(x))
y_br = Σ(y·P_bottom(y)), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom boundaries respectively; the regression network of the present invention handles uncertainty better than other regression networks. Combining the IoU loss L_iou and the l_1 loss, the loss function of the regression network is
L_reg = λ_iou·L_iou(b, b̂) + λ_l·‖b − b̂‖_1, (6)
where λ_iou and λ_l are hyperparameters used to adjust the weights of the two terms, and b and b̂ are the ground-truth target box coordinates and the predicted target box coordinates respectively.
Results of the experiment
The tracking method and apparatus of the present invention were tested on the currently prevailing TrackingNet and GOT-10k data sets and compared with current state-of-the-art methods.
Table 2 shows the test results of the tracking method of the present invention and other algorithms on TrackingNet; it can be seen that the method of the present invention achieves the best results on the Prec. (%), N.Prec. (%) and Success (AUC) metrics.
Table 2 Comparison of Prec., N.Prec. and AUC on the TrackingNet test set (the table is reproduced as an image in the original publication)
Table 3 shows the test results of the tracking method of the present invention and other algorithms on the GOT-10k test set; it can be seen that the method of the present invention achieves the best results on the mAO (%), SR0.5 (%) and SR0.75 (%) metrics.
Method mAO(%) SR0.5(%) SR0.75(%)
TrSiam 67.3 78.7 58.6
TrDiMP 68.8 80.5 59.7
TransT 72.3 82.4 68.2
TREG 66.8 77.8 57.2
STACK-ST50 68.0 77.7 62.3
SiamFC++ 59.5 69.5 47.9
The method of the invention 73.2 83.3 68.8
Referring to FIG. 6, the target tracking device based on the method of the present invention includes an image acquisition module, a feature extraction module, a feature enhancement module, a feature fusion module, a classification module, a regression module, a result display module, a template update module, and a robot data interface module.
The image acquisition module acquires video data with a camera; the feature extraction module is responsible for extracting the depth features of the target template and the search area; the feature enhancement module enhances the extracted depth features; the feature fusion module fuses the enhanced features of the target template and the search area; the classification module classifies and evaluates each region of the features; the regression module determines the target state; the result display module displays the tracking result on the original video image; the template update module decides, according to the outputs of the classification module and the regression module, whether the tracking result is used to update the target template set; and the robot data interface module transmits the tracking result to the robot controller to help the robot make decisions and plan actions.
The above-mentioned embodiments are only for convenience of description, and are not intended to limit the present invention in any way, and those skilled in the art will understand that the technical features of the present invention can be modified or changed by other equivalent embodiments without departing from the scope of the present invention.

Claims (9)

1. A target tracking method for robot vision, comprising the steps of:
(1) Manually obtaining, in the first frame of the image, the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, cropping the target image and the sample image blocks around it as templates, and obtaining a template sample feature tensor through a feature-extraction deep network;
(2) Inputting a search-area sample into the same feature-extraction deep network to obtain a search-area sample feature tensor;
(3) Inputting the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtaining a tracking result from the fused feature tensor through a classification network and a regression network.
2. The target tracking method for robot vision according to claim 1, wherein in step (1), the feature-extraction deep network is specifically as follows: it uses the ResNet50 network as a baseline; ResNet50 comprises one stem layer and four branch layers containing 3, 4, 6 and 3 bottlenecks respectively;
in the feature-extraction deep network, the fourth layer of ResNet50 is discarded, and the down-sampling stride parameter of the Conv2d operator in the third layer is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used in place of the original 7 × 7 convolution kernel; in the other layers, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels; finally, a 1 × 1 convolution is added after the third layer.
3. The target tracking method for robot vision according to claim 1, wherein the involution-attention model of step (3) is composed of an involution module, two Add & Norm modules and an FFN & ReLU module;
the involution module takes tensor A and tensor B as input; A ∈ R^(w×w×d) is used to construct the convolution tensor and B ∈ R^(w×w×d) is used to construct the involution kernels, where d is the number of channels and w × w is the scale of the image block;
to construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K are obtained as
Q = B′W_Q
K = B′W_K, (1)
where Q, K ∈ R^(w²×d);
the attention matrix M ∈ R^(w²×w²) is then obtained from formula (2);
M = softmax(QKᵀ/√d), (2)
the attention matrix M is then reshaped into an involution kernel tensor I ∈ R^(w×w×g×k×k), where g is the number of groups of involution kernels, w × w is the scale of the convolved image, and k × k is the involution kernel size.
4. The target tracking method for robot vision according to claim 3, characterized in that the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: a search-area sample and a template-set sample, wherein the template-set sample consists of four templates and can be updated online;
when the input B is a search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area; this strategy is called involution attention strategy 1;
when the input B is a template-set tensor, the template-set tensor is formed by concatenating four templates; the i-th row of M describes the similarity between the i-th element in Q and all the elements of the four templates in K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template-set tensor; this strategy is called involution attention strategy 2.
5. The target tracking method for robot vision according to claim 4, wherein in step (3), the feature-enhancement and feature-fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that they are based on the involution-attention model;
the specific steps for obtaining the fused feature tensor containing the template features and the search-area features are as follows: first, the template-set features F_T0 and the search-area features F_S0 are enhanced by the involution-attention template module and the involution-attention search-area module respectively, giving the enhanced features F_T1 and F_S1; then, the enhanced template features F_T1 and search-area features F_S1 are input simultaneously and cross-wise into the involution-attention template-search module and the involution-attention search-template module to obtain the fused features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature-enhancement fusion layer, which is repeated 4 times;
after the feature-enhancement fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
6. The target tracking method for robot vision according to claim 5, wherein the involution-attention search-area module and the involution-attention template-search module use involution attention strategy 1 to obtain the involution kernels, and the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution attention strategy 2 to obtain the involution kernels.
7. The target tracking method for robot vision according to claim 1, wherein in step (3), the classification network comprises 3 linear layers and 2 activations and is expressed as
f_c(F) = φ_2(φ_1(F*W_1)*W_2)*W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor f_c(F);
the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability that the sample is positive;
through the softmax function, f_c(F) is mapped to a classification score matrix S; ideally, the classification scores of the target region in S are all 1 and those of the background region are all 0.
8. The target tracking method for robot vision according to claim 7, characterized in that in some cases, such as similar targets, occlusion, or the target moving out of range, S may be contaminated; therefore, in the tracking method, a classification score higher than a predetermined value is regarded as a high score;
assuming that the numbers of high scores inside and outside the regression box are N_i and N_o and the area of the regression box is N_r, the update score is defined as s = (N_i − N_o)/N_r; when s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
9. The target tracking method for robot vision according to claim 1, characterized in that in step (3), the regression network is established by estimating the probability distribution of the target box; the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has 4 channels, which represent the probability distributions of the left, top, right and bottom boundaries of the target box; the coordinates of the box are therefore
x_tl = Σ(x·P_left(x))
y_tl = Σ(y·P_top(y))
x_br = Σ(x·P_right(x))
y_br = Σ(y·P_bottom(y)), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom boundaries respectively; combining the IoU loss and the l_1 loss, the loss function of the regression network is
L_reg = λ_iou·L_iou(b, b̂) + λ_l·‖b − b̂‖_1, (6)
where λ_iou and λ_l are hyperparameters used to adjust the weights of the two terms, and b and b̂ are the ground-truth target box coordinates and the predicted target box coordinates respectively.
CN202211226432.5A 2022-10-09 2022-10-09 Target tracking method for robot vision Pending CN115641449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211226432.5A CN115641449A (en) 2022-10-09 2022-10-09 Target tracking method for robot vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211226432.5A CN115641449A (en) 2022-10-09 2022-10-09 Target tracking method for robot vision

Publications (1)

Publication Number Publication Date
CN115641449A true CN115641449A (en) 2023-01-24

Family

ID=84941741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211226432.5A Pending CN115641449A (en) 2022-10-09 2022-10-09 Target tracking method for robot vision

Country Status (1)

Country Link
CN (1) CN115641449A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433727A (en) * 2023-06-13 2023-07-14 北京科技大学 Scalable single-stream tracking method based on staged continuous learning
CN116433727B (en) * 2023-06-13 2023-10-27 北京科技大学 Scalable single-stream tracking method based on staged continuous learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination