CN115641449A - Target tracking method for robot vision - Google Patents

Target tracking method for robot vision

Info

Publication number
CN115641449A
Authority
CN
China
Prior art keywords
attention, template, feature, module, network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211226432.5A
Other languages
Chinese (zh)
Inventor
侯跃恩
邓嘉明
罗志坚
高延增
刘茗铄
唐家晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaying University
Original Assignee
Jiaying University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiaying University filed Critical Jiaying University
Priority to CN202211226432.5A priority Critical patent/CN115641449A/en
Publication of CN115641449A publication Critical patent/CN115641449A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method for robot vision, belonging to the technical field of target tracking methods. The method comprises the following steps: (1) in the first frame of the image, manually obtain the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, crop the target image and the sample image blocks around it as templates, and obtain a template sample feature tensor through a feature-extraction deep network; (2) input a search-area sample into the feature-extraction deep network to obtain a search-area sample feature tensor; (3) input the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtain the tracking result through a classification network and a regression network. The invention aims to provide a target tracking method for robot vision with a high tracking success rate and a small tracking error that can realize real-time tracking; it is suitable for real-time target tracking.

Description

Target tracking method for robot vision
Technical Field
The present invention relates to a target tracking method, and more particularly, to a target tracking method for robot vision.
Background
Video target tracking has received much attention from researchers as an important topic in machine vision research. It aims to track a target through a video using the target state information given in the first frame, so as to obtain the target state in every frame. During tracking, situations such as target deformation, complex backgrounds with illumination changes, and target occlusion arise. In these cases, the feature structure of the target changes accordingly, making it difficult for the tracking algorithm to lock onto the target.
Since deep learning techniques were introduced into visual tracking, convolution has been widely used both for feature extraction and for fusing the template with the search region. Currently popular deep learning trackers are mainly built from convolution kernels; however, a convolution kernel cannot be made too large because of its computational cost. As a result, a convolution kernel cannot exchange long-range information in a single operation, and when similar objects appear or the target shape changes greatly, this shortcoming limits the model's ability to handle complex scenes.
The long-range dependency problem can be effectively addressed by introducing a self-attention mechanism, which has been applied successfully in machine translation, natural language processing and speech processing. It has also produced excellent experimental results in image processing tasks such as target tracking and target detection. Although the self-attention mechanism captures global information well, it does not pay particular attention to local information, which should carry a large weight around the target in target tracking. Therefore, this problem needs to be solved by developing a model that can handle local information while retaining the global modelling ability of the self-attention mechanism.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target tracking method for robot vision which has a high tracking success rate and a small tracking error and can realize real-time tracking.
The technical scheme of the invention is realized as follows: a target tracking method for robot vision, comprising the steps of:
(1) Manually obtaining, in the first frame of the image, the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, cropping the target image and the sample image blocks around it as templates, and obtaining a template sample feature tensor through a feature-extraction deep network;
(2) Inputting a search-area sample into the same feature-extraction deep network to obtain a search-area sample feature tensor;
(3) Inputting the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtaining the tracking result from the fused feature tensor through a classification network and a regression network.
In the above target tracking method for robot vision, in step (1), the feature-extraction deep network is specifically as follows: it uses the ResNet50 network as a baseline; ResNet50 comprises one stem layer and four branch layers containing 3, 4, 6 and 3 bottlenecks respectively;
in the feature-extraction deep network, the fourth layer of ResNet50 is discarded, and the down-sampling stride parameter of the Conv2d operator in the third layer is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used in place of the original 7 × 7 convolution kernel; in the other layers, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels; finally, a 1 × 1 convolution is added after the third layer.
In the above target tracking method for robot vision, in step (3), the involution-attention model is composed of an involution module, two Add & Norm modules and an FFN & ReLU module;
the involution module takes tensor A and tensor B as input; A ∈ R^(w×w×d) is used to construct the convolution tensor and B ∈ R^(w×w×d) is used to construct the involution kernels, where d is the number of channels and w × w is the scale of the image block;
to construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K are obtained as
Q = B′W_Q
K = B′W_K, (1)
where Q, K ∈ R^(w²×d);
the attention matrix M ∈ R^(w²×w²) is then obtained from formula (2);
M = softmax(QKᵀ/√d), (2)
the attention matrix M is then reshaped into an involution kernel tensor I ∈ R^(w×w×g×k×k), where g is the number of groups of involution kernels, w × w is the scale of the convolved image, and k × k is the involution kernel size.
In the above target tracking method for robot vision, the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: a search-area sample and a template-set sample, wherein the template-set sample consists of four templates and can be updated online;
when the input B is a search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area; this strategy is called involution attention strategy 1;
when the input B is a template-set tensor, the template-set tensor is formed by concatenating four templates; the i-th row of M describes the similarity between the i-th element in Q and all the elements of the four templates in K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template-set tensor; this strategy is called involution attention strategy 2.
In the above target tracking method for robot vision, in step (3), the feature-enhancement and feature-fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that they are based on the involution-attention model;
the specific steps for obtaining the fused feature tensor containing the template features and the search-area features are as follows: first, the template-set features F_T0 and the search-area features F_S0 are enhanced by the involution-attention template module and the involution-attention search-area module respectively, giving the enhanced features F_T1 and F_S1; then, the enhanced template features F_T1 and search-area features F_S1 are input simultaneously and cross-wise into the involution-attention template-search module and the involution-attention search-template module to obtain the fused features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature-enhancement fusion layer, which is repeated 4 times;
after the feature-enhancement fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
In the above target tracking method for robot vision, the involution-attention search-area module and the involution-attention template-search module use involution attention strategy 1 to obtain the involution kernels, and the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution attention strategy 2 to obtain the involution kernels.
In the above target tracking method for robot vision, in step (3), the classification network comprises 3 linear layers and 2 activations and is expressed as
f_c(F) = φ_2(φ_1(F*W_1)*W_2)*W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor f_c(F);
the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability that the sample is positive;
through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, the classification scores of the target region in S are all 1 and those of the background region are all 0.
In the above target tracking method for robot vision, in some cases, such as similar targets, occluded targets, or the target moving out of range, S may be contaminated; therefore, in the tracking method, a classification score higher than a predetermined value is regarded as a high score;
assuming that the numbers of high scores inside and outside the regression box are N_i and N_o and the area of the regression box is N_r, the update score is defined as s = (N_i − N_o)/N_r; when s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
In the above target tracking method for robot vision, in step (3), the regression network is established by estimating the probability distribution of the target box; the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has 4 channels, which represent the probability distributions of the left, top, right and bottom boundaries of the target box; the coordinates of the box are therefore
x_tl = Σ(x·P_left(x))
y_tl = Σ(y·P_top(y))
x_br = Σ(x·P_right(x))
y_br = Σ(y·P_bottom(y)), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom boundaries respectively; combining the IoU loss and the l_1 loss, the loss function of the regression network is
L_reg = λ_iou·L_iou(b, b̂) + λ_l·‖b − b̂‖_1, (6)
where λ_iou and λ_l are hyperparameters used to adjust the weights of the two terms, and b and b̂ are the ground-truth target box coordinates and the predicted target box coordinates respectively.
After the method is adopted, a template sample feature tensor is first obtained through the improved feature-extraction deep network from the template composed of the target image and the surrounding sample image blocks, and a search-area sample feature tensor is obtained from the search-area sample; the tracking result is then obtained through the novel feature-enhancement and feature-fusion network based on the involution-attention model, followed by the classification network and the regression network. This effectively increases the tracking success rate, reduces the tracking error, achieves real-time tracking, and improves the control performance of the robot.
Drawings
The invention will be further described in detail with reference to examples of embodiments shown in the drawings to which, however, the invention is not restricted.
FIG. 1 is a schematic diagram of the framework of the tracking method of the present invention;
FIG. 2 is a schematic view of the involution-attention model of the present invention;
FIG. 3 is a schematic illustration of the involution module of the present invention;
FIG. 4 is a schematic diagram illustrating the two attention-matrix reshaping strategies of the present invention;
FIG. 5 is a schematic representation of the relationship of the scoring heatmap of the present invention to a regression bounding box;
FIG. 6 is a block schematic diagram of the object tracking device of the present invention.
Detailed Description
Referring to FIG. 1, a target tracking method for robot vision according to the present invention includes the following steps:
(1) Manually obtaining, in the first frame of the image, the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, cropping the target image and the sample image blocks around it as templates, and obtaining a template sample feature tensor through a feature-extraction deep network;
(2) Inputting a search-area sample into the same feature-extraction deep network to obtain a search-area sample feature tensor;
(3) Inputting the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtaining the tracking result from the fused feature tensor through a classification network and a regression network.
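As a reading aid, these three steps can be arranged into the minimal tracking-loop sketch below; the helper names crop_template_set and crop_search_area, and the backbone, fusion_net, classifier and regressor callables, are hypothetical stand-ins for the networks described in this document, not the patented implementation itself.

```python
# Minimal sketch of steps (1)-(3); all module and helper names are hypothetical stand-ins.
def track_sequence(frames, init_box, backbone, fusion_net, classifier, regressor,
                   crop_template_set, crop_search_area):
    # Step (1): crop the target and surrounding sample blocks in the first frame as
    # templates and extract the template sample feature tensor with the deep network.
    templates = crop_template_set(frames[0], init_box)
    f_template = backbone(templates)

    results = [init_box]
    for frame in frames[1:]:
        # Step (2): crop a search area around the previous result and extract its
        # feature tensor with the same feature-extraction deep network.
        search_patch = crop_search_area(frame, results[-1])
        f_search = backbone(search_patch)

        # Step (3): involution-attention based feature enhancement and fusion, then
        # the classification and regression networks give the tracking result.
        f_fused = fusion_net(f_template, f_search)
        scores = classifier(f_fused)          # used for confidence / template update
        box = regressor(f_fused)              # predicted target box in this frame
        results.append(box)
    return results
```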
In step (1), in order to extract features more effectively, involution is used to reconstruct the existing feature-extraction deep network. The feature-extraction deep network provided by the invention is specifically as follows: it uses the ResNet50 network as a baseline; ResNet50 comprises one stem layer and four branch layers containing 3, 4, 6 and 3 bottlenecks respectively.
In the feature-extraction deep network, the fourth layer of ResNet50 is discarded, and, to obtain a higher feature resolution, the down-sampling stride parameter of the Conv2d operator in the third layer is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used in place of the original 7 × 7 convolution kernel; in the other layers, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels, so the feature-extraction deep network obtains a larger receptive field in a single operation. Finally, a 1 × 1 convolution is added after the third layer to reduce the channel dimension of the feature-extraction deep network output.
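A rough PyTorch sketch of these backbone changes (dropping the fourth layer, setting the layer-3 down-sampling stride to 1, and appending a 1 × 1 convolution) is given below; the involution replacement of the 7 × 7 stem kernel and the 3 × 3 kernels is only indicated in comments, and the 256-channel output width is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet50

class TrackerBackbone(nn.Module):
    """Sketch of the modified ResNet50 feature extractor (involution swap not shown)."""
    def __init__(self, out_channels=256):                 # output width is an assumption
        super().__init__()
        net = resnet50(weights=None)
        # Keep the stem and the first three branch layers; the fourth layer is discarded.
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2, self.layer3 = net.layer1, net.layer2, net.layer3
        # Change the down-sampling stride of the third layer's Conv2d from 2 to 1
        # so that the output feature map keeps a higher resolution.
        self.layer3[0].conv2.stride = (1, 1)
        self.layer3[0].downsample[0].stride = (1, 1)
        # In the full method the 7x7 stem convolution and every 3x3 convolution would
        # additionally be replaced by 7x7 involution kernels (omitted in this sketch).
        # A final 1x1 convolution reduces the channel dimension of the output.
        self.neck = nn.Conv2d(1024, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return self.neck(x)
```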
Table 1 lists the details of the modified kernels: the second column gives the size of the convolution kernel that was replaced, with the multiplier in brackets indicating how many times the convolution kernel of that layer is replaced; the third column gives the involution kernel size; the fourth column gives the number of channels of the convolution/involution kernel; and the last column gives the number of groups of involution kernels. The input to the backbone network is a template image and a search-area image; after passing through the backbone network, the backbone outputs the template features and the search-area features.
TABLE 1 Modified ResNet50 kernels (the table is reproduced as an image in the original publication)
Referring to FIG. 2, in the present embodiment, the involution-attention model in step (3) is composed of an involution module, two Add & Norm modules and an FFN & ReLU module.
Referring to FIG. 3, the involution module takes tensor A and tensor B as input; A ∈ R^(w×w×d) is used to construct the convolution tensor and B ∈ R^(w×w×d) is used to construct the involution kernels, where d is the number of channels and w × w is the scale of the image block.
To construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K are obtained as
Q = B′W_Q
K = B′W_K, (1)
where Q, K ∈ R^(w²×d);
the attention matrix M ∈ R^(w²×w²) is then obtained from formula (2),
M = softmax(QKᵀ/√d), (2)
The attention matrix M is then reshaped into an involution kernel tensor I ∈ R^(w×w×g×k×k), where g is the number of groups of involution kernels, w × w is the scale of the convolved image, and k × k is the involution kernel size.
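Under the reconstruction above, formulas (1) and (2) can be sketched in PyTorch as follows; the √d scaling inside the softmax and the d × d shape of W_Q and W_K are assumptions, and the final reshape of M into involution kernels (strategies 1 and 2 below) is omitted.

```python
import torch
import torch.nn as nn

class InvolutionKernelGenerator(nn.Module):
    """Sketch of the involution module's query/key attention, formulas (1)-(2)."""
    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)     # learnable W_Q (d x d assumed)
        self.w_k = nn.Linear(d, d, bias=False)     # learnable W_K (d x d assumed)
        self.d = d

    def forward(self, b):
        # b: tensor B of shape (d, w, w) used to build the involution kernels.
        d, w, _ = b.shape
        b_flat = b.reshape(d, w * w).t()            # unfold B into B' of shape (w*w, d)
        q = self.w_q(b_flat)                        # Q = B' W_Q, shape (w*w, d)
        k = self.w_k(b_flat)                        # K = B' W_K, shape (w*w, d)
        # Attention matrix M of shape (w*w, w*w); the sqrt(d) scaling is an assumption.
        m = torch.softmax(q @ k.t() / self.d ** 0.5, dim=-1)
        return m                                    # later reshaped into involution kernels

# Toy usage: a 64-channel 8x8 feature block gives a 64x64 attention matrix.
m = InvolutionKernelGenerator(d=64)(torch.randn(64, 8, 8))
assert m.shape == (64, 64)
```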
It is noted that the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: a search-area sample and a template-set sample, where the template-set sample consists of four templates and can be updated online. Referring to FIG. 4, the reshaping strategies for the two types of input B are shown.
When the input B is a search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K. As shown in FIG. 4(a), for simplicity, an M of shape 64 × 64 and an involution kernel of shape 2 × 2 are taken as an example. The dashed rectangle is one row of the matrix M, which can be reshaped into an 8 × 8 matrix. One element is then extracted from every 4 rows and every 4 columns to construct an involution kernel set of 16 groups of 2 × 2 kernels. Because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area. This strategy is called involution attention strategy 1.
When the input B is a template-set tensor, the present embodiment concatenates four templates to form the template-set tensor. As shown in FIG. 4(b), the 1-1 and 1-2 blocks represent the first template itself and the pair-wise similarity between the first and second templates, respectively. For example, the i-th row of M describes the similarity between the i-th element in Q and all the elements of the 4 templates in K. The dashed rectangle in M can be reshaped into an 8 × 8 matrix, where the blocks in the red, blue, yellow and green boxes are associated with the first, second, third and fourth templates respectively; specifically, in the figure, the number 2 corresponds to the yellow box, the number 3 to the red box, the number 4 to the blue box, and the number 6 to the green box. In each block, elements are extracted every two rows and two columns. Over all blocks, 16 groups of involution kernels are obtained. Because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template-set tensor. This strategy is called involution attention strategy 2.
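The numbers in the FIG. 4(a) example (a 64 × 64 matrix M, 2 × 2 kernels, 16 groups) make the strategy-1 reshaping easy to sketch for one row of M; the exact sampling order within the 8 × 8 block is an assumption.

```python
import torch

# Strategy 1 sketch: one row of M (length w*w = 64) -> 16 groups of 2x2 involution kernels.
w, k = 8, 2
stride = w // k                                   # sample one element every 4 rows/columns
row = torch.arange(w * w, dtype=torch.float32)    # stand-in for one row of the matrix M
block = row.reshape(w, w)                         # reshape the row into an 8x8 matrix

# Each (i, j) offset picks a strided 2x2 sub-grid of the block, so every kernel is
# sampled globally across the 8x8 block; 4x4 offsets give 16 groups of 2x2 kernels.
kernels = torch.stack([block[i::stride, j::stride]
                       for i in range(stride)
                       for j in range(stride)])
assert kernels.shape == (16, k, k)
```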
Further preferably, in step (3), the feature-enhancement and feature-fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that they are based on the involution-attention model.
The specific steps for obtaining the fused feature tensor containing the template features and the search-area features are as follows: first, the template-set features F_T0 and the search-area features F_S0 are enhanced by the involution-attention template module and the involution-attention search-area module respectively, giving the enhanced features F_T1 and F_S1; then, the enhanced template features F_T1 and search-area features F_S1 are input simultaneously and cross-wise into the involution-attention template-search module and the involution-attention search-template module to obtain the fused features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature-enhancement fusion layer, which is repeated 4 times.
After the feature-enhancement fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
In this embodiment, the involution-attention search-area module and the involution-attention template-search module use involution attention strategy 1 to obtain the involution kernels, and the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution attention strategy 2 to obtain the involution kernels.
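One possible reading of this module arrangement is sketched below: four feature-enhancement fusion layers built from the template, search-area, template-search and search-template modules, followed by the mixing module. The InvAttention class is only a placeholder for the involution-attention model, and the exact argument wiring of the cross modules is an assumption.

```python
import torch.nn as nn

class InvAttention(nn.Module):
    """Placeholder for one involution-attention block (strategy 1 or 2)."""
    def __init__(self, d, strategy):
        super().__init__()
        self.strategy = strategy
        self.proj = nn.Linear(d, d)            # the real block builds kernels from `context`

    def forward(self, x, context):
        return self.proj(x)

class FusionNetwork(nn.Module):
    """Sketch of the involution-attention feature enhancement and fusion network."""
    def __init__(self, d, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "template":        InvAttention(d, strategy=2),   # enhances F_T
                "search":          InvAttention(d, strategy=1),   # enhances F_S
                "template_search": InvAttention(d, strategy=1),   # cross fusion
                "search_template": InvAttention(d, strategy=2),   # cross fusion
            }) for _ in range(num_layers)
        ])
        self.mixing = InvAttention(d, strategy=2)                 # final mixing module

    def forward(self, f_t, f_s):
        for layer in self.layers:
            f_t1 = layer["template"](f_t, f_t)                    # F_T0 -> F_T1
            f_s1 = layer["search"](f_s, f_s)                      # F_S0 -> F_S1
            # Cross-wise inputs give the fused features F_T2 and F_S2 (wiring assumed).
            f_t = layer["search_template"](f_t1, f_s1)
            f_s = layer["template_search"](f_s1, f_t1)
        return self.mixing(f_s, f_t)                              # fused feature F for the heads
```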
Further preferably, in order to improve the robustness of tracking, the template set needs to be updated during the tracking process. However, when tracking drift, target occlusion or a large target displacement occurs, the current tracking result is not reliable. In order to ensure the reliability of the tracking result, the invention provides a classification network comprising 3 linear layers and 2 activations, which can be expressed as
f_c(F) = φ_2(φ_1(F*W_1)*W_2)*W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor f_c(F);
the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability that the sample is positive;
through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, the classification scores of the target region in S are all 1 and those of the background region are all 0.
However, in some cases, such as similar objects, occlusions, or out-of-range objects, S may become contaminated.
As shown in FIG. 5, the red box is the regression box produced by the regression network. In score heat map A, the regression box contains most of the high scores. In score heat map B, the regression box contains only part of the high scores because of a similar target. Thus, the result of heat map A is more reliable than that of heat map B, and it is reasonable to update the template set with the tracking result of heat map A. In the tracking method, a classification score higher than a predetermined value, which is 0.88 in this embodiment, is regarded as a high score. Assuming that the numbers of high scores inside and outside the regression box are N_i and N_o and the area of the regression box is N_r, the update score is defined as s = (N_i − N_o)/N_r. When s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
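The update test s = (N_i − N_o)/N_r can be written as a small helper, sketched below; the score-map coordinate convention and the way the update interval is passed in are assumptions, while the 0.88 high-score threshold follows this embodiment.

```python
import numpy as np

def should_update_template(score_map, box, tau, high=0.88, interval_reached=True):
    """Sketch of the update test: s = (N_i - N_o) / N_r, update when s > tau."""
    x_tl, y_tl, x_br, y_br = box                    # regression box in score-map coordinates
    high_mask = score_map > high                    # classification scores counted as 'high'

    inside = np.zeros_like(high_mask, dtype=bool)
    inside[y_tl:y_br, x_tl:x_br] = True

    n_i = np.count_nonzero(high_mask & inside)      # high scores inside the regression box
    n_o = np.count_nonzero(high_mask & ~inside)     # high scores outside the regression box
    n_r = max((x_br - x_tl) * (y_br - y_tl), 1)     # area of the regression box
    s = (n_i - n_o) / n_r
    return s > tau and interval_reached

# Toy usage on a 16x16 score map with a box covering its centre.
update = should_update_template(np.random.rand(16, 16), box=(4, 4, 12, 12), tau=0.5)
```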
Traditional anchor-free regression networks directly learn the target state and follow a Dirac delta distribution, which is limited in situations where the target boundary is not sharp enough, such as occlusion, motion blur, shadows and complex backgrounds. The invention establishes the regression network by estimating the probability distribution of the target box.
In this embodiment, the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has four channels, which represent the probability distributions of the left, top, right and bottom boundaries of the target box; the coordinates of the box are therefore
x_tl = Σ(x·P_left(x))
y_tl = Σ(y·P_top(y))
x_br = Σ(x·P_right(x))
y_br = Σ(y·P_bottom(y)), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom boundaries respectively; the regression network of the present invention handles uncertainty better than other regression networks. Combining the IoU loss L_iou and the l_1 loss, the loss function of the regression network is
L_reg = λ_iou·L_iou(b, b̂) + λ_l·‖b − b̂‖_1, (6)
where λ_iou and λ_l are hyperparameters used to adjust the weights of the two terms, and b and b̂ are the ground-truth target box coordinates and the predicted target box coordinates respectively.
Results of the experiment
The tracking method and apparatus of the present invention were tested on the currently prevailing TrackingNet and GOT-10k data sets and compared with current state-of-the-art methods.
Table 2 shows the test results of the tracking method of the present invention and other algorithms on TrackingNet; it can be seen that the method of the present invention achieves the best results on the Prec. (%), N.Prec. (%) and Success (AUC) metrics.
Table 2 Comparison of Prec., N.Prec. and AUC on the TrackingNet test set (the table is reproduced as an image in the original publication)
Table 3 shows the test results of the tracking method of the present invention and other algorithms on the GOT-10k test set; it can be seen that the method of the present invention achieves the best results on the mAO (%), SR0.5 (%) and SR0.75 (%) metrics.
Method mAO(%) SR0.5(%) SR0.75(%)
TrSiam 67.3 78.7 58.6
TrDiMP 68.8 80.5 59.7
TransT 72.3 82.4 68.2
TREG 66.8 77.8 57.2
STACK-ST50 68.0 77.7 62.3
SiamFC++ 59.5 69.5 47.9
The method of the invention 73.2 83.3 68.8
Referring to FIG. 6, the target tracking device based on the method of the present invention includes an image acquisition module, a feature extraction module, a feature enhancement module, a feature fusion module, a classification module, a regression module, a result display module, a template update module, and a robot data interface module.
The image acquisition module acquires video data with a camera; the feature extraction module is responsible for extracting the depth features of the target template and the search area; the feature enhancement module enhances the extracted depth features; the feature fusion module fuses the enhanced features of the target template and the search area; the classification module classifies and evaluates each region of the features; the regression module determines the target state; the result display module displays the tracking result on the original video image; the template update module decides, according to the outputs of the classification module and the regression module, whether the tracking result is used to update the target template set; and the robot data interface module transmits the tracking result to the robot controller to help the robot make decisions and plan actions.
The above-mentioned embodiments are only for convenience of description, and are not intended to limit the present invention in any way, and those skilled in the art will understand that the technical features of the present invention can be modified or changed by other equivalent embodiments without departing from the scope of the present invention.

Claims (9)

1. A target tracking method for robot vision, comprising the steps of:
(1) Manually obtaining, in the first frame of the image, the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, cropping the target image and the sample image blocks around it as templates, and obtaining a template sample feature tensor through a feature-extraction deep network;
(2) Inputting a search-area sample into the same feature-extraction deep network to obtain a search-area sample feature tensor;
(3) Inputting the template feature tensor and the search-area feature tensor simultaneously into a feature-enhancement and feature-fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search-area features, and then obtaining a tracking result from the fused feature tensor through a classification network and a regression network.
2. The target tracking method for robot vision according to claim 1, wherein in step (1), the feature-extraction deep network is specifically as follows: it uses the ResNet50 network as a baseline; ResNet50 comprises one stem layer and four branch layers containing 3, 4, 6 and 3 bottlenecks respectively;
in the feature-extraction deep network, the fourth layer of ResNet50 is discarded, and the down-sampling stride parameter of the Conv2d operator in the third layer is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used in place of the original 7 × 7 convolution kernel; in the other layers, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels; finally, a 1 × 1 convolution is added after the third layer.
3. The target tracking method for robot vision according to claim 1, wherein the involution-attention model of step (3) is composed of an involution module, two Add & Norm modules and an FFN & ReLU module;
the involution module takes tensor A and tensor B as input; A ∈ R^(w×w×d) is used to construct the convolution tensor and B ∈ R^(w×w×d) is used to construct the involution kernels, where d is the number of channels and w × w is the scale of the image block;
to construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K are obtained as
Q = B′W_Q
K = B′W_K, (1)
where Q, K ∈ R^(w²×d);
the attention matrix M ∈ R^(w²×w²) is then obtained from formula (2);
M = softmax(QKᵀ/√d), (2)
the attention matrix M is then reshaped into an involution kernel tensor I ∈ R^(w×w×g×k×k), where g is the number of groups of involution kernels, w × w is the scale of the convolved image, and k × k is the involution kernel size.
4. The target tracking method for robot vision according to claim 3, characterized in that the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: a search-area sample and a template-set sample, wherein the template-set sample consists of four templates and can be updated online;
when the input B is a search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area; this strategy is called involution attention strategy 1;
when the input B is a template-set tensor, the template-set tensor is formed by concatenating four templates; the i-th row of M describes the similarity between the i-th element in Q and all the elements of the four templates in K; because each kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template-set tensor; this strategy is called involution attention strategy 2.
5. The target tracking method for robot vision according to claim 4, wherein in step (3), the feature-enhancement and feature-fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that they are based on the involution-attention model;
the specific steps for obtaining the fused feature tensor containing the template features and the search-area features are as follows: first, the template-set features F_T0 and the search-area features F_S0 are enhanced by the involution-attention template module and the involution-attention search-area module respectively, giving the enhanced features F_T1 and F_S1; then, the enhanced template features F_T1 and search-area features F_S1 are input simultaneously and cross-wise into the involution-attention template-search module and the involution-attention search-template module to obtain the fused features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature-enhancement fusion layer, which is repeated 4 times;
after the feature-enhancement fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
6. The target tracking method for robot vision according to claim 5, wherein the involution-attention search-area module and the involution-attention template-search module use involution attention strategy 1 to obtain the involution kernels, and the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution attention strategy 2 to obtain the involution kernels.
7. The target tracking method for robot vision according to claim 1, wherein in step (3), the classification network comprises 3 linear layers and 2 activations and is expressed as
f_c(F) = φ_2(φ_1(F*W_1)*W_2)*W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor f_c(F);
the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability that the sample is positive;
through the softmax function, f_c(F) is mapped to a classification score matrix S; ideally, the classification scores of the target region in S are all 1 and those of the background region are all 0.
8. The target tracking method for robot vision according to claim 7, characterized in that in some cases, such as similar targets, occlusion, or the target moving out of range, S may be contaminated; therefore, in the tracking method, a classification score higher than a predetermined value is regarded as a high score;
assuming that the numbers of high scores inside and outside the regression box are N_i and N_o and the area of the regression box is N_r, the update score is defined as s = (N_i − N_o)/N_r; when s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
9. The target tracking method for robot vision according to claim 1, characterized in that in step (3), the regression network is established by estimating the probability distribution of the target box; the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has 4 channels, which represent the probability distributions of the left, top, right and bottom boundaries of the target box; the coordinates of the box are therefore
x_tl = Σ(x·P_left(x))
y_tl = Σ(y·P_top(y))
x_br = Σ(x·P_right(x))
y_br = Σ(y·P_bottom(y)), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom boundaries respectively; combining the IoU loss and the l_1 loss, the loss function of the regression network is
L_reg = λ_iou·L_iou(b, b̂) + λ_l·‖b − b̂‖_1, (6)
where λ_iou and λ_l are hyperparameters used to adjust the weights of the two terms, and b and b̂ are the ground-truth target box coordinates and the predicted target box coordinates respectively.
CN202211226432.5A 2022-10-09 2022-10-09 Target tracking method for robot vision Pending CN115641449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211226432.5A CN115641449A (en) 2022-10-09 2022-10-09 Target tracking method for robot vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211226432.5A CN115641449A (en) 2022-10-09 2022-10-09 Target tracking method for robot vision

Publications (1)

Publication Number Publication Date
CN115641449A true CN115641449A (en) 2023-01-24

Family

ID=84941741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211226432.5A Pending CN115641449A (en) 2022-10-09 2022-10-09 Target tracking method for robot vision

Country Status (1)

Country Link
CN (1) CN115641449A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433727A (en) * 2023-06-13 2023-07-14 北京科技大学 Scalable single-stream tracking method based on staged continuous learning
CN116433727B (en) * 2023-06-13 2023-10-27 北京科技大学 Scalable single-stream tracking method based on staged continuous learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination