CN115641449A - Target tracking method for robot vision - Google Patents
- Publication number: CN115641449A
- Application number: CN202211226432.5A
- Authority: CN (China)
- Legal status: Pending
Classifications
- Image Analysis (AREA)
Abstract
The invention discloses a target tracking method for robot vision, belonging to the technical field of target tracking. The method comprises the following steps: (1) in the first frame, manually obtain the upper-left and lower-right coordinates of the target to be tracked on the two-dimensional image, crop the target image together with surrounding sample image blocks as templates, and pass them through a feature extraction depth network to obtain a template sample feature tensor; (2) input a search area sample into the same feature extraction depth network to obtain a search area sample feature tensor; (3) feed the template feature tensor and the search area feature tensor simultaneously into a feature enhancement and feature fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search area features, and then obtain the tracking result through a classification network and a regression network. The invention provides a target tracking method for robot vision with a high tracking success rate and a small tracking error that can track in real time.
Description
Technical Field
The present invention relates to a target tracking method, and more particularly, to a target tracking method for robot vision.
Background
Video target tracking, an important topic in machine vision research, has received much attention from researchers. Its aim is to track a target through a video using the target state given in the first frame, so as to obtain the target state in every frame. During tracking there are situations such as changes in target shape, complex backgrounds with illumination changes, and target occlusion. In these cases the target's feature structure may change accordingly, making it difficult for a tracking algorithm to lock onto the target.
Since deep learning techniques were introduced into visual tracking, convolution has been widely used both for feature extraction and for fusing the template with the search region. Currently popular deep-learning trackers are built mainly from convolution kernels; however, a convolution kernel cannot be made too large because of the computational cost. A convolution kernel therefore cannot exchange long-range information in a single step, and when similar objects appear or the target's shape changes greatly, this defect limits the model's ability to handle complex scenes.
The long-range dependency problem can be solved effectively by introducing a self-attention mechanism, which has been applied successfully in machine translation, natural language processing and speech processing. It has also produced excellent experimental results in image processing tasks such as target tracking and target detection. Although the self-attention mechanism captures global information well, it does not particularly focus on local information, which for target tracking should carry large weight around the target. There is therefore a need for a model that, like self-attention, captures global information, but can handle local information as well.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target tracking method for robot vision that has a high tracking success rate and a small tracking error and can track in real time.
The technical scheme of the invention is realized as follows: a target tracking method for robot vision, comprising the steps of:
(1) Manually obtaining the upper left coordinate and the lower right coordinate of a target to be tracked on a two-dimensional image in the first frame of the image, intercepting a target image and sample image blocks around the target image as templates, and obtaining a template sample feature tensor through a feature extraction depth network;
(2) Inputting a search area sample into the same feature extraction depth network to obtain a search area sample feature tensor;
(3) Simultaneously inputting the template feature tensor and the search area feature tensor into a feature enhancement and feature fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search area features, and then passing the fused feature tensor through a classification network and a regression network to obtain the tracking result.
In the above target tracking method for robot vision, in step (1), the feature extraction depth network is specifically as follows: it uses the ResNet50 network as a backbone; ResNet50 comprises one stem layer and four stage layers, containing 3, 4, 6 and 3 bottleneck blocks respectively;
in the feature extraction depth network, the fourth stage of ResNet50 is discarded, and the downsampling stride parameter of the Conv2d operator in the third stage is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used instead of the original 7 × 7 convolution kernel; in the other stages, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels; finally, a 1 × 1 convolution is added after the third stage.
In the above target tracking method for robot vision, in step (3), the involution-attention model consists of an involution module, two Add & Norm modules and an FFN & ReLU module;
the involution module takes tensor A and tensor B as inputs; A ∈ R^(w×w×d) and B ∈ R^(w×w×d) are used to construct the convolved tensor and the involution kernels respectively, where d is the number of channels and w × w is the scale of the image block;
to construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K can be obtained as
Q = B′W_Q
K = B′W_K , (1)
the attention matrix M = QKᵀ, whose entry M_{i,j} is the similarity between row i of Q and row j of K, (2)
is then reshaped into an involution kernel tensor I ∈ R^(g×w×w×k×k), where g is the number of involution kernel groups, w × w is the scale of the convolved image, and k × k is the involution kernel size.
In the above target tracking method for robot vision, the way the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B, and two types of input B need to be processed: search area samples and template set samples, wherein a template set sample consists of four templates and can be updated online;
when the input B is a search area tensor, M_{i,j} represents the similarity between the ith row of Q and the jth row of K; because every kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area; this strategy is called involution-attention strategy 1;
when the input B is a template set tensor, the template set tensor is formed by concatenating four templates; the ith row of M describes the similarity between the ith element in Q and all elements of the four templates in K; since every kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template set tensor; this strategy is called involution-attention strategy 2.
In the above target tracking method for robot vision, in step (3), the feature enhancement and feature fusion network based on the involution-attention model consists of five modules: an involution-attention template module, an involution-attention search area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the involution-attention in the names of the five modules refers to the involution-attention model;
the specific steps for obtaining the fused feature tensor containing the template features and the search area features are as follows: first, the template set feature F_T0 and the search area feature F_S0 pass through the involution-attention template module and the involution-attention search area module respectively to obtain enhanced features F_T1 and F_S1; then, the enhanced template feature F_T1 and search area feature F_S1 are input simultaneously and crosswise into the involution-attention template-search module and the involution-attention search-template module to obtain fused features F_T2 and F_S2; the involution-attention template module, involution-attention search area module, involution-attention template-search module and involution-attention search-template module together form one feature enhancement and fusion layer, which is repeated 4 times;
after the feature enhancement and fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
In the above target tracking method for robot vision, the involution-attention search area module and the involution-attention template-search module use involution-attention strategy 1 to obtain involution kernels, while the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution-attention strategy 2 to obtain involution kernels.
In the above target tracking method for robot vision, in step (3), the classification network comprises 3 linear layers and 2 activations, and is represented as
f_c(F) = φ₂(φ₁(F·W₁)·W₂)·W₃ , (3)
where F is the output of the feature mixing network and W₁, W₂, W₃ are learnable parameter matrices; the output of the classification network is a binary tensor; the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [ y_i·log p_i + (1 − y_i)·log(1 − p_i) ] , (4)
where y_i is the ground-truth label of the ith sample, equal to 1 for a positive sample, and p_i is the probability of being a positive sample;
through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, in the score matrix S the classification scores of target regions are all 1 and those of background regions are all 0.
In the above target tracking method for robot vision, in some cases, such as similar targets, occluded targets or out-of-view targets, S may be contaminated; therefore, in the tracking method, a classification score higher than a predetermined value is regarded as a high score;
suppose the numbers of high scores inside and outside the regression box are N_i and N_o, and the area of the regression box is N_r; the update score is defined as s = (N_i − N_o)/N_r; when s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
In the above target tracking method for robot vision, in step (3), the regression network is established by estimating the probability distribution of the target box; the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has 4 channels, representing the probability distributions of the left, right, top and bottom sides of the target box; thus the coordinates of the box are
x_tl = Σ x·P_left(x)
y_tl = Σ y·P_top(y)
x_br = Σ x·P_right(x)
y_br = Σ y·P_bottom(y) , (5)
where P_left, P_top, P_right, P_bottom are the probability distributions of the left, top, right and bottom box sides respectively; combining the IoU loss and the l₁ loss, the loss function of the regression network is
L_reg = λ_iou·L_IoU(b, b̂) + λ_l·‖b − b̂‖₁ , (6)
where λ_iou and λ_l are hyperparameters adjusting the weights of the two terms, and b and b̂ are the ground-truth and predicted target box coordinates respectively.
With this method, a template sample feature tensor is first obtained from a template made of the target image and surrounding sample image blocks via the improved feature extraction depth network, and a search area sample feature tensor is obtained from a search area sample; a tracking result is then obtained through the novel involution-attention-based feature enhancement and fusion network followed by the classification and regression networks. This effectively raises the tracking success rate, reduces the tracking error, achieves real-time tracking, and improves the control performance of the robot.
Drawings
The invention will be further described in detail with reference to examples of embodiments shown in the drawings to which, however, the invention is not restricted.
FIG. 1 is a schematic diagram of the framework of the tracking method of the present invention;
FIG. 2 is a schematic view of an involution attention model of the present invention;
FIG. 3 is a schematic illustration of an involution module of the present invention;
FIG. 4 is a schematic diagram of the two attention-matrix reshaping strategies of the present invention;
FIG. 5 is a schematic representation of the relationship of the scoring heatmap of the present invention to a regression bounding box;
FIG. 6 is a block schematic diagram of the object tracking device of the present invention.
Detailed Description
Referring to fig. 1, a target tracking method for robot vision according to the present invention includes the following steps:
(1) Manually obtaining the upper left coordinates and the lower right coordinates of a target to be tracked on a two-dimensional image in the first frame of the image, intercepting a target image and sample image blocks around the target image as templates, and obtaining a template sample feature tensor through a feature extraction depth network;
(2) Inputting a search area sample into the same feature extraction depth network to obtain a search area sample feature tensor;
(3) Simultaneously inputting the template feature tensor and the search area feature tensor into a feature enhancement and feature fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search area features, and then passing the fused feature tensor through a classification network and a regression network to obtain the tracking result.
In step (1), in order to extract features more effectively, involution is used to reconstruct an existing feature extraction depth network. The feature extraction depth network provided by the invention is specifically as follows: it uses the ResNet50 network as a backbone; ResNet50 comprises one stem layer and four stage layers, containing 3, 4, 6 and 3 bottleneck blocks respectively.
In the feature extraction depth network, the fourth stage of ResNet50 is discarded and, to obtain higher feature resolution, the downsampling stride parameter of the Conv2d operator in the third stage is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used instead of the original 7 × 7 convolution kernel; in the other stages, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels, so the feature extraction depth network obtains a larger receptive field in a single step. Finally, a 1 × 1 convolution is added after the third stage to reduce the channel dimension of the network output.
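To make the involution ("inner convolution") operation concrete, the following is a minimal numpy sketch of a single involution step: unlike an ordinary convolution, each spatial position gets its own k × k kernel, and the channels are split into g groups that share a kernel. The function name, shapes and group-indexing scheme are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def involve(x, kernels, k=3):
    """One involution step: each spatial position has its own k x k kernel.

    x       : (H, W, d) feature map
    kernels : (H, W, g, k, k), one kernel per position per group;
              the d channels are split into g contiguous groups
    """
    H, W, d = x.shape
    g = kernels.shape[2]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for h in range(H):
        for w_ in range(W):
            patch = xp[h:h + k, w_:w_ + k, :]          # (k, k, d) neighbourhood
            for c in range(d):
                ker = kernels[h, w_, c * g // d]       # kernel group shared by channel c
                out[h, w_, c] = np.sum(ker * patch[:, :, c])
    return out

# sanity check: "identity" kernels (all weight at the centre) must copy the input
feat = np.arange(32, dtype=float).reshape(4, 4, 2)
ident = np.zeros((4, 4, 1, 3, 3))
ident[:, :, :, 1, 1] = 1.0
out = involve(feat, ident)
```

The kernels are data-dependent (here they come from the attention matrix), which is what distinguishes involution from a convolution whose kernel is fixed across positions.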
Table 1 lists the details of the modified kernels: the second column gives the size of the replaced convolution kernel, with the multiplier in brackets indicating how many times that layer's kernel is replaced; the third column gives the involution kernel size; the fourth column gives the number of channels of the convolution/involution kernel; and the last column gives the number of involution kernel groups. The backbone network takes the template image and the search area image as input, and outputs the template features and the search area features.
TABLE 1 Modified ResNet50 kernels
Referring to FIG. 2, in the present embodiment, the involution-attention model in step (3) consists of an involution module, two Add & Norm modules and an FFN & ReLU module.
Referring to FIG. 3, the involution module takes tensor A and tensor B as inputs; A ∈ R^(w×w×d) and B ∈ R^(w×w×d) are used to construct the convolved tensor and the involution kernels respectively, where d is the number of channels and w × w is the scale of the image block.
To construct the involution kernels, tensor B is unfolded into B′ ∈ R^(w²×d); then, given learnable parameter matrices W_Q ∈ R^(d×d) and W_K ∈ R^(d×d), the query Q and key K can be obtained as
Q = B′W_Q
K = B′W_K , (1)
The attention matrix M = QKᵀ (2) is then reshaped into an involution kernel tensor I ∈ R^(g×w×w×k×k), where g is the number of involution kernel groups, w × w is the scale of the convolved image, and k × k is the involution kernel size.
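The construction of Q, K and the attention matrix M from equation (1) can be sketched in a few lines of numpy; the concrete sizes (w = 8, d = 16) and the random weights are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w, d = 8, 16                       # assumed: an 8 x 8 image block with 16 channels
B = rng.standard_normal((w, w, d))
W_Q = rng.standard_normal((d, d))  # learnable parameter matrices in practice
W_K = rng.standard_normal((d, d))

B_flat = B.reshape(w * w, d)       # unfold B into B' of shape (w^2, d)
Q = B_flat @ W_Q                   # eq. (1)
K = B_flat @ W_K
M = Q @ K.T                        # M[i, j] = similarity of row i of Q and row j of K
```

M has shape (w², w²), which is exactly the number of elements needed to fill g groups of k × k kernels at each of the w × w positions.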
It should be noted that how the attention matrix M is reshaped into the involution kernel tensor I depends on the type of B; two types of input B need to be processed: search area samples and template set samples, where a template set sample consists of four templates and can be updated online. Referring to FIG. 4, the reshaping strategies for the two types of input B are shown.
When the input B is a search area tensor, M_{i,j} represents the similarity between the ith row of Q and the jth row of K; as shown in FIG. 4(a), for simplicity, take M of shape 64 × 64 and involution kernels of shape 2 × 2 as an example. The dashed rectangle marks one row of matrix M, which can be reshaped into an 8 × 8 matrix. Extracting one element from every 4 rows and every 4 columns constructs an involution kernel set of 16 groups of 2 × 2 kernels. Since every kernel is sampled globally, all involution kernels can capture the long-range dependencies of the search area. This strategy is called involution-attention strategy 1.
When the input B is a template set tensor, this embodiment concatenates four templates to form the template set tensor. As shown in FIG. 4(b), blocks 1-1 and 1-2 represent the first template itself and the pairwise similarity between the first and second templates, respectively. In general, the ith row of M describes the similarity between the ith element in Q and all elements of the 4 templates in K. The dashed rectangle in M can be reshaped into an 8 × 8 matrix whose blocks in the red, blue, yellow and green boxes are associated with the first, second, third and fourth templates respectively (in the figure, the number 2 corresponds to the yellow box, 3 to the red box, 4 to the blue box and 6 to the green box). In each block, elements are extracted every two rows and two columns; over all blocks this yields 16 groups of involution kernels. Since every kernel is sampled globally, all involution kernels can capture the long-range dependencies of the template set tensor. This strategy is called involution-attention strategy 2.
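The strided subsampling of strategy 1 (FIG. 4(a)) can be reproduced directly: each 64-element row of M becomes an 8 × 8 matrix, and taking every 4th row and column at each of the 16 offsets yields 16 groups of 2 × 2 kernels. The function name is hypothetical.

```python
import numpy as np

def strategy1_kernels(M, w=8, k=2):
    """Reshape each row of the (w*w) x (w*w) attention matrix M into
    g = (w // k)**2 groups of k x k involution kernels (FIG. 4(a)):
    one row -> a w x w matrix -> strided subsampling every w//k rows/cols."""
    s = w // k                           # subsampling stride (4 in the example)
    n = M.shape[0]
    kernels = np.empty((n, s * s, k, k))
    for p in range(n):
        row = M[p].reshape(w, w)         # one 64-element row as an 8 x 8 matrix
        for a in range(s):
            for b in range(s):
                kernels[p, a * s + b] = row[a::s, b::s]
    return kernels

M = np.arange(64 * 64, dtype=float).reshape(64, 64)
kernels = strategy1_kernels(M)           # 16 groups of 2 x 2 kernels per position
```

Because each 2 × 2 kernel samples positions spread across the whole 8 × 8 grid, every kernel mixes globally sampled similarities, which is the long-range property the text describes.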
Further preferably, in step (3), the feature enhancement and feature fusion network based on the involution-attention model consists of five modules: an involution-attention template module, an involution-attention search area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the involution-attention in the names of the five modules refers to the involution-attention model.
The specific steps for obtaining the fused feature tensor containing the template features and the search area features are as follows: first, the template set feature F_T0 and the search area feature F_S0 pass through the involution-attention template module and the involution-attention search area module respectively to obtain enhanced features F_T1 and F_S1; then, the enhanced template feature F_T1 and search area feature F_S1 are input simultaneously and crosswise into the involution-attention template-search module and the involution-attention search-template module to obtain fused features F_T2 and F_S2; the involution-attention template module, involution-attention search area module, involution-attention template-search module and involution-attention search-template module together form one feature enhancement and fusion layer, which is repeated 4 times.
After the feature enhancement and fusion layers, the involution-attention mixing module takes the fused features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
In this embodiment, the involution-attention search area module and the involution-attention template-search module use involution-attention strategy 1 to obtain involution kernels, while the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution-attention strategy 2 to obtain involution kernels.
Further preferably, in order to improve tracking robustness, the template set needs to be updated during tracking. However, when tracking drift, target occlusion or large target displacement occurs, the current tracking result is unreliable. To ensure the reliability of the tracking result, the invention provides a classification network comprising 3 linear layers and 2 activations, which can be expressed as
f_c(F) = φ₂(φ₁(F·W₁)·W₂)·W₃ , (3)
where F is the output of the feature mixing network and W₁, W₂, W₃ are learnable parameter matrices; the output of the classification network is a binary tensor; the classification loss is calculated using the standard binary cross-entropy loss,
L_cls = −Σ_i [ y_i·log p_i + (1 − y_i)·log(1 − p_i) ] , (4)
where y_i is the ground-truth label of the ith sample, equal to 1 for a positive sample, and p_i is the probability of being a positive sample.
Through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, in the score matrix S the classification scores of target regions are all 1 and those of background regions are all 0.
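A numpy sketch of the classification head of equation (3): three linear layers with two activations (ReLU is assumed for φ₁ and φ₂, which the patent does not name), followed by a softmax that maps the two output channels to a per-position foreground score. Sizes and random weights are placeholders.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def classify(F, W1, W2, W3):
    """Eq. (3): f_c(F) = phi2(phi1(F W1) W2) W3, with ReLU assumed for the
    two activations, then a row-wise softmax over the 2 output channels."""
    logits = relu(relu(F @ W1) @ W2) @ W3          # binary (n, 2) output tensor
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)       # row-wise softmax
    return probs[:, 1]                             # score of the "target" class

rng = np.random.default_rng(1)
n, d = 64, 32                                      # assumed feature-map size
F = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
W3 = rng.standard_normal((d, 2))
S = classify(F, W1, W2, W3)                        # one score per spatial position
```

Reshaping S back to the spatial grid gives the score heat maps discussed below.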
However, in some cases, such as similar objects, occlusions, or out-of-range objects, S may become contaminated.
As shown in FIG. 5, the red box in FIG. 5 is the regression box provided by the regression network. In score heat map A, the regression box contains most of the high scores. In score heat map B, the regression box contains only part of the high scores, because of a similar target. Thus the result of heat map A is more reliable than that of heat map B, and it is reasonable to update the template set with the tracking result of heat map A. In the tracking method, a classification score higher than a predetermined value (0.88 in this embodiment) is regarded as a high score. Suppose the numbers of high scores inside and outside the regression box are N_i and N_o, and the area of the regression box is N_r. The update score is defined as s = (N_i − N_o)/N_r. When s > τ and the update interval is reached, the template is updated, where τ is the template update threshold.
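The template-update rule above can be written directly from its definitions: count the high scores inside and outside the regression box, form s = (N_i − N_o)/N_r, and update when s exceeds τ. The cutoff 0.88 is the value stated in this embodiment; the value of τ below is an assumed placeholder.

```python
import numpy as np

def update_decision(scores, in_box, box_area, high=0.88, tau=0.5):
    """Template-update rule: s = (N_i - N_o) / N_r, update when s > tau.

    0.88 is the high-score cutoff from this embodiment; tau = 0.5 is an
    assumed placeholder for the template update threshold.
    """
    is_high = scores > high
    n_i = int(np.sum(is_high & in_box))    # high scores inside the regression box
    n_o = int(np.sum(is_high & ~in_box))   # high scores outside the box
    s = (n_i - n_o) / box_area
    return s, s > tau

scores = np.array([0.95, 0.90, 0.91, 0.10, 0.99])
in_box = np.array([True, True, False, False, True])
s, do_update = update_decision(scores, in_box, box_area=4)
```

A scattered high-score map (heat map B above) drives N_o up and s down, so contaminated results are rejected automatically.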
Traditional anchor-free regression networks directly learn the target state and follow a Dirac delta distribution, which is limited in situations where the target boundaries are not sharp, such as occlusion, motion blur, shadows and complex backgrounds. The invention instead establishes the regression network by estimating the probability distribution of the target box.
In this embodiment, the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; the output of the regression network has four channels, representing the probability distributions of the left, right, top and bottom sides of the target box; thus the coordinates of the box are
x_tl = Σ x·P_left(x)
y_tl = Σ y·P_top(y)
x_br = Σ x·P_right(x)
y_br = Σ y·P_bottom(y) , (5)
where P_left, P_top, P_right, P_bottom are the probability distributions of the left, top, right and bottom box sides respectively; the regression network of the present invention performs better than other regression networks when dealing with uncertainty. Combining the IoU loss and the l₁ loss, the loss function of the regression network is
L_reg = λ_iou·L_IoU(b, b̂) + λ_l·‖b − b̂‖₁ , (6)
where λ_iou and λ_l are hyperparameters adjusting the weights of the two terms, and b and b̂ are the ground-truth and predicted target box coordinates respectively.
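The expected-coordinate computation of equation (5) and the combined loss of equation (6) can be sketched as follows; the hyperparameter values λ_iou and λ_l and the helper names are illustrative assumptions, and the IoU loss is passed in precomputed.

```python
import numpy as np

def expected_coords(p_left, p_top, p_right, p_bottom):
    """Eq. (5): box corners as expectations over the four side distributions."""
    xs = np.arange(len(p_left), dtype=float)
    return (float(xs @ p_left),    # x_tl
            float(xs @ p_top),     # y_tl
            float(xs @ p_right),   # x_br
            float(xs @ p_bottom))  # y_br

def regression_loss(b, b_hat, iou_loss, lam_iou=2.0, lam_l1=5.0):
    """Eq. (6): lam_iou * L_IoU + lam_l1 * ||b - b_hat||_1.
    The weights lam_iou and lam_l1 are assumed values, not from the patent."""
    l1 = float(np.abs(np.asarray(b, dtype=float) - np.asarray(b_hat, dtype=float)).sum())
    return lam_iou * iou_loss + lam_l1 * l1

p = np.zeros(16)
p[5] = 1.0                                   # a delta distribution at index 5
box = expected_coords(p, p, p, p)            # all four expectations equal 5
```

For a sharp (delta) distribution the expectation reduces to the usual direct regression; for a flat distribution it expresses the boundary uncertainty that motivates this design.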
Results of the experiment
The tracking method and apparatus of the present invention were tested against currently advanced methods on two prevailing data sets, TrackingNet and GOT-10k.
Table 2 shows the test results of the tracking method of the present invention and other algorithms in TrackingNet, and it can be seen that the method of the present invention achieves the best results in indexes prec. (%), n.prec. (%), and Success (AUC).
Table 2 comparison of prec., n.prec, and AUC on the TrackingNet test set
Table 3 shows the test results of the tracking method and other algorithms of the present invention on the GOT-10k test set, and it can be seen that the method of the present invention obtains the best results in indexes mAO (%), SR0.5 (%) and SR0.75 (%).
Method | mAO(%) | SR0.5(%) | SR0.75(%) |
TrSiam | 67.3 | 78.7 | 58.6 |
TrDiMP | 68.8 | 80.5 | 59.7 |
TransT | 72.3 | 82.4 | 68.2 |
TREG | 66.8 | 77.8 | 57.2 |
STACK-ST50 | 68.0 | 77.7 | 62.3 |
SiamFC++ | 59.5 | 69.5 | 47.9 |
The method of the invention | 73.2 | 83.3 | 68.8 |
Referring to fig. 6, the target tracking device based on the method of the present invention includes an image acquisition module, a feature extraction module, a feature enhancement module, a feature fusion module, a classification module, a regression module, a result display module, a template update module, and a robot data interface module.
The image acquisition module acquires video data with a camera; the feature extraction module extracts the depth features of the target template and the search area; the feature enhancement module enhances the extracted depth features; the feature fusion module fuses the enhanced features of the target template and the search area; the classification module classifies and judges each region of the features; the regression module determines the target state; the result display module displays the tracking result in the original video image; the template update module decides, from the outputs of the classification and regression modules, whether the tracking result is used to update the target template set; and the robot data interface module transmits the tracking result to the robot controller to help the robot make decisions and plan actions.
The above-mentioned embodiments are only for convenience of description, and are not intended to limit the present invention in any way, and those skilled in the art will understand that the technical features of the present invention can be modified or changed by other equivalent embodiments without departing from the scope of the present invention.
Claims (9)
1. A target tracking method for robot vision, comprising the steps of:
(1) Manually obtaining the upper left coordinates and the lower right coordinates of a target to be tracked on a two-dimensional image in the first frame of the image, intercepting a target image and sample image blocks around the target image as templates, and obtaining a template sample feature tensor through a feature extraction depth network;
(2) Inputting a search area sample into the same feature extraction depth network to obtain a search area sample feature tensor;
(3) Simultaneously inputting the template feature tensor and the search area feature tensor into a feature enhancement and feature fusion network based on an involution-attention model to obtain a fused feature tensor containing both the template features and the search area features, and then passing the fused feature tensor through a classification network and a regression network to obtain the tracking result.
2. The target tracking method for robot vision according to claim 1, wherein in step (1), the feature extraction depth network is specifically as follows: it uses the ResNet50 network as a backbone; ResNet50 comprises one stem layer and four stage layers, containing 3, 4, 6 and 3 bottleneck blocks respectively;
in the feature extraction depth network, the fourth stage of ResNet50 is discarded, and the downsampling stride parameter of the Conv2d operator in the third stage is changed from 2 to 1; in the stem layer of ResNet50, a 7 × 7 involution kernel is used instead of the original 7 × 7 convolution kernel; in the other stages, all 3 × 3 convolution kernels are replaced by 7 × 7 involution kernels; finally, a 1 × 1 convolution is added after the third stage.
3. The method according to claim 1, wherein the involution-attention model of step (3) is composed of an involution module, two Add & Norm modules, and an FFN & ReLU module;
the involution module takes tensor A and tensor B as input; A and B are used to construct the tensor to be convolved and the involution kernel respectively, where d is the number of channels and w × w is the scale of the image block;
to construct the involution kernel, tensor B is unfolded into B′; then, given learnable parameter matrices W_Q and W_K, the query Q and the key K can be obtained as
Q = B′·W_Q,
K = B′·W_K, (1)
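Equation (1) can be sketched with NumPy on an unfolded tensor B′. The sizes, and the scaled softmax used to turn Q·Kᵀ into a normalized similarity matrix, are illustrative assumptions; the claims only specify that M measures the row-wise similarity between Q and K:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 49, 64                          # n = w*w flattened positions, d = channels (example sizes)
B_flat = rng.standard_normal((n, d))   # unfolded tensor B'

W_Q = rng.standard_normal((d, d))      # learnable parameter matrices
W_K = rng.standard_normal((d, d))

Q = B_flat @ W_Q                       # queries, Eq. (1)
K = B_flat @ W_K                       # keys,    Eq. (1)

# M[i, j]: similarity of row i of Q with row j of K; the 1/sqrt(d)
# scaling and row-wise softmax are a common normalization choice.
M = Q @ K.T / np.sqrt(d)
M = np.exp(M - M.max(axis=1, keepdims=True))
M /= M.sum(axis=1, keepdims=True)

assert M.shape == (n, n)
assert np.allclose(M.sum(axis=1), 1.0)
```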
4. A target tracking method for robot vision according to claim 3, characterized in that, since the involution kernel tensor I depends on the type of B, the attention matrix M must handle two types of input B: the search-area sample and the template-set sample, where the template-set sample consists of four templates and can be updated online;
when the input B is the search-area tensor, M_{i,j} represents the similarity between the i-th row of Q and the j-th row of K; because each kernel is sampled globally, every involution kernel can capture the long-range dependencies of the search area; this strategy is called involution-attention strategy 1;
when the input B is the template-set tensor, the four templates are concatenated to form it; the i-th row of M describes the similarity between the i-th element of Q and all elements of the four templates in K; because each kernel is sampled globally, every involution kernel can capture the long-range dependencies of the template-set tensor; this strategy is called involution-attention strategy 2.
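Strategy 2 can be sketched as follows: the four templates are concatenated along the token axis before the queries and keys are formed, so each row of M compares one query against all elements of all four templates. Token and channel counts are example sizes, not the patent's:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 36, 64                          # tokens per template, channels (example sizes)
templates = [rng.standard_normal((m, d)) for _ in range(4)]

# Template-set tensor: four templates concatenated, shape (4m, d).
B_prime = np.concatenate(templates, axis=0)

W_Q = rng.standard_normal((d, d))      # learnable parameter matrices
W_K = rng.standard_normal((d, d))
Q, K = B_prime @ W_Q, B_prime @ W_K

# Row i of M scores query i against every element of the whole set.
M = Q @ K.T

assert M.shape == (4 * m, 4 * m)
```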
5. The target tracking method for robot vision according to claim 4, wherein in step (3), the feature enhancement and feature fusion network based on the involution-attention model is composed of five modules: an involution-attention template module, an involution-attention search-area module, an involution-attention template-search module, an involution-attention search-template module, and an involution-attention mixing module; the term involution-attention in the five modules indicates that each module is based on the involution-attention model;
the specific steps for obtaining the fusion feature tensor containing the template features and the search-area features are as follows: first, the template-set feature F_T0 and the search-area feature F_S0 pass through the involution-attention template module and the involution-attention search-area module respectively to obtain enhanced features F_T1 and F_S1; then, the enhanced template feature F_T1 and search-area feature F_S1 are simultaneously and crosswise input into the involution-attention template-search module and the involution-attention search-template module to obtain fusion features F_T2 and F_S2; the involution-attention template module, the involution-attention search-area module, the involution-attention template-search module and the involution-attention search-template module together form a feature enhancement and fusion layer, which is repeated 4 times;
after the feature enhancement and fusion layers, the involution-attention mixing module takes the fusion features F_T2 and F_S2 as input and outputs the feature F, which is fed into the regression network and the classification network.
6. The method according to claim 5, wherein the involution-attention search-area module and the involution-attention template-search module use involution-attention strategy 1 to obtain the involution kernel, while the involution-attention template module, the involution-attention search-template module and the involution-attention mixing module use involution-attention strategy 2 to obtain the involution kernel.
7. The method according to claim 1, wherein in step (3), the classification network comprises 3 linear layers and 2 activations, and is expressed as
f_c(F) = φ_2(φ_1(F·W_1)·W_2)·W_3, (3)
where F is the output of the feature mixing network and W_1, W_2, W_3 are learnable parameter matrices; the output of the classification network is a binary tensor; the classification loss is calculated using the standard binary cross-entropy loss
L_cls = −Σ_i [ y_i·log p_i + (1 − y_i)·log(1 − p_i) ], (4)
where y_i is the ground-truth label of the i-th sample, equal to 1 for a positive sample, and p_i is the probability of being a positive sample;
through the softmax function, f_c(F) is mapped to a classification score matrix S. Ideally, in the score matrix S, the classification scores of the target region are all 1 and those of the background region are all 0.
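Equation (3) and the softmax scoring can be sketched with NumPy; the use of ReLU for the two activations, and all tensor sizes, are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d = 49, 64                      # example sizes
F = rng.standard_normal((n, d))    # output of the feature mixing network

W1 = rng.standard_normal((d, d))   # learnable parameter matrices
W2 = rng.standard_normal((d, d))
W3 = rng.standard_normal((d, 2))   # binary output: target vs background

# Eq. (3): three linear layers with two activations (ReLU assumed).
logits = relu(relu(F @ W1) @ W2) @ W3
S = softmax(logits)[:, 1]          # per-position score of being target

assert S.shape == (n,)
assert np.all((S >= 0) & (S <= 1))
```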
8. A target tracking method for robot vision according to claim 7, characterized in that in some cases, such as similar targets, occlusion, or the target moving out of view, S may be contaminated; therefore, in the tracking method, a classification score above a predetermined value is regarded as a high score;
assuming the numbers of high scores inside and outside the regression box are N_i and N_o, and the area of the regression box is N_r, an update score is defined as s = (N_i − N_o)/N_r; the template is updated when s is greater than τ and the update interval has elapsed, where τ is the template-update threshold.
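The template-update rule of claim 8 reduces to a few lines; the threshold value τ = 0.5 below is an illustrative assumption (the patent leaves τ as a predetermined parameter):

```python
def should_update_template(n_inside, n_outside, box_area,
                           tau=0.5, interval_reached=True):
    """Claim 8 update rule: s = (N_i - N_o) / N_r; update the template
    when s > tau and the update interval has elapsed.
    tau=0.5 is an illustrative choice, not from the patent."""
    s = (n_inside - n_outside) / box_area
    return s > tau and interval_reached

# High scores concentrated inside the box -> update the template.
assert should_update_template(n_inside=80, n_outside=5, box_area=100)
# Many high scores outside the box (e.g. a distractor) -> keep the old one.
assert not should_update_template(n_inside=30, n_outside=40, box_area=100)
```

The score is positive only when high-confidence responses concentrate inside the regression box, which guards the template set against contamination by occluders or similar objects.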
9. A target tracking method for robot vision according to claim 1, characterized in that in step (3), the regression network is established by estimating the probability distribution of the target box; the regression network is a fully convolutional network (FCN) with four Conv-BN-ReLU layers; its output has 4 channels, representing the probability distributions of the left, top, right and bottom edges of the target box; thus, the coordinates of the box are
x_tl = Σ_x x·P_left(x)
y_tl = Σ_y y·P_top(y)
x_br = Σ_x x·P_right(x)
y_br = Σ_y y·P_bottom(y), (5)
where P_left, P_top, P_right and P_bottom are the probability distributions of the left, top, right and bottom box edges respectively; combining the IoU loss and the ℓ1 loss yields the loss function of the regression network.
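Each coordinate in Eq. (5) is simply the expectation of a discrete distribution along one image axis, which can be verified with a toy distribution:

```python
import numpy as np

def expected_coordinate(probs):
    """Box edge as the expectation sum_x x * P(x) over a discrete
    probability distribution along one image axis (Eq. (5))."""
    xs = np.arange(len(probs))
    return float(np.sum(xs * probs))

# Example: a left-edge distribution sharply peaked at position 10.
p_left = np.zeros(32)
p_left[9:12] = [0.2, 0.6, 0.2]
x_tl = expected_coordinate(p_left)   # -> 10.0

assert abs(x_tl - 10.0) < 1e-9
```

Taking the expectation rather than the argmax gives sub-pixel box coordinates and keeps the regression head differentiable end to end.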
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211226432.5A CN115641449A (en) | 2022-10-09 | 2022-10-09 | Target tracking method for robot vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115641449A true CN115641449A (en) | 2023-01-24 |
Family
ID=84941741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211226432.5A Pending CN115641449A (en) | 2022-10-09 | 2022-10-09 | Target tracking method for robot vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115641449A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116433727A (en) * | 2023-06-13 | 2023-07-14 | 北京科技大学 | Scalable single-stream tracking method based on staged continuous learning |
CN116433727B (en) * | 2023-06-13 | 2023-10-27 | 北京科技大学 | Scalable single-stream tracking method based on staged continuous learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Porav et al. | Adversarial training for adverse conditions: Robust metric localisation using appearance transfer | |
CN109583340B (en) | Video target detection method based on deep learning | |
CN107832672B (en) | Pedestrian re-identification method for designing multi-loss function by utilizing attitude information | |
WO2020228446A1 (en) | Model training method and apparatus, and terminal and storage medium | |
CN108960211B (en) | Multi-target human body posture detection method and system | |
Dai et al. | MS2DG-Net: Progressive correspondence learning via multiple sparse semantics dynamic graph | |
CN110728209A (en) | Gesture recognition method and device, electronic equipment and storage medium | |
Xia et al. | Loop closure detection for visual SLAM using PCANet features | |
CN106446015A (en) | Video content access prediction and recommendation method based on user behavior preference | |
CN105844669A (en) | Video target real-time tracking method based on partial Hash features | |
Wang et al. | Pm-gans: Discriminative representation learning for action recognition using partial-modalities | |
US20200379481A1 (en) | Localising a vehicle | |
WO2023206935A1 (en) | Person re-identification method, system and device, and computer-readable storage medium | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN115661754B (en) | Pedestrian re-recognition method based on dimension fusion attention | |
CN111104924B (en) | Processing algorithm for identifying low-resolution commodity image | |
CN112329771A (en) | Building material sample identification method based on deep learning | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN115641449A (en) | Target tracking method for robot vision | |
CN114612709A (en) | Multi-scale target detection method guided by image pyramid characteristics | |
CN114764870A (en) | Object positioning model processing method, object positioning device and computer equipment | |
Salem et al. | A novel face inpainting approach based on guided deep learning | |
CN112418203A (en) | Robustness RGB-T tracking method based on bilinear convergence four-flow network | |
CN114743045B (en) | Small sample target detection method based on double-branch area suggestion network | |
CN113343953B (en) | FGR-AM method and system for remote sensing scene recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||