CN111291679B - Target specific response attention target tracking method based on twin network - Google Patents
- Publication number: CN111291679B (application CN202010081733.8A)
- Authority: CN (China)
- Prior art keywords: target, response, search area, convolution, attention
- Legal status: Active
Classifications
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
A target specific response attention target tracking method based on a twin network, relating to computer vision. The method addresses the deficiency that the original twin-network-based target tracking methods are not robust to complex tracking scenes such as rapid target motion, occlusion, rotation and background clutter. The proposed target response attention module effectively weakens the influence of noise information on tracking performance and at the same time strengthens feature information that is discriminative with respect to appearance changes of the target object, so that the twin network generates a better target response map for target position prediction and thereby achieves more robust tracking. The method comprises five main parts: CNN feature extraction; channel-by-channel cross-correlation to generate response maps; an attention network that generates weights used to weight each channel response map; determination of the target position on the final response map; and the training method of the proposed model.
Description
Technical Field
The invention relates to a computer vision technology, in particular to a target specific response attention target tracking method based on a twin network.
Background
Target tracking is a basic task in computer vision with wide application in fields such as video surveillance, vehicle navigation and augmented reality. Given a video sequence, object tracking selects an object of interest in the first frame and predicts the position of that object in subsequent frames using computer vision algorithms. In recent years, twin-network-based tracking methods have attracted the attention of researchers because they combine tracking accuracy with real-time speed, but the performance of the related algorithms is easily affected by rapid motion of the target or the camera, appearance changes of the target, cluttered backgrounds and other conditions that are unavoidable in practice.
In deep learning, attention mechanisms enable a model to selectively capture the important portions of an input for a particular task, and they can improve model performance without excessive computation and storage cost. Attention mechanisms have been widely used in image captioning, machine translation, speech recognition and other fields. For target tracking, DAVT applies discriminative spatial attention to identify certain specific regions on the target. ACFN attempts to use an attention mechanism to select a set of correlation filters for tracking. CSR-DCF uses a foreground spatial reliability map to constrain correlation filter learning. RTT uses a multidirectional recurrent neural network to generate saliency and select reliable regions belonging to the target object. These pioneering algorithms verify the superiority of attention mechanisms for target tracking and point to promising ways of improving the performance of target tracking methods with attention.
Disclosure of Invention
The invention aims to provide a target specific response attention target tracking method based on a twin network, which can effectively cope with complex tracking scenes such as rapid movement, occlusion, rotation, background disorder and the like of a target.
The invention comprises the following steps:
1) a video sequence is given, wherein a first frame comprises a marked target, a target template area Z and a target search area X are defined, the target template area Z is kept unchanged after being intercepted based on the given mark in the first frame, the target search area X is obtained from a current video frame to be tested, and an image block larger than the target template area Z is intercepted by utilizing a target position obtained from the previous frame;
2) inputting the target template region and the target search region from step 1) into a fully convolutional twin network to obtain the CNN feature Fz of the target template region Z and the CNN feature Fx of the target search region X;
3) inputting the CNN features Fx and Fz obtained in step 2) into the target specific response attention model to obtain a multi-channel response map Smulti weighted by the attention model; summing Smulti channel by channel gives the final response map S, and the position with the maximum response value in S is taken as the initial target position;
4) taking the target position obtained in the last frame in the step 1) as a center, constructing a search scale pyramid for a target search area, executing the step 3) for each search area of an estimated scale in the scale pyramid, selecting the scale of the target search area with the highest response value as the scale corresponding to the current frame, and combining the target position and the scale to obtain the actual size and position of the target, thereby realizing target tracking;
5) training a model: the model training is independent of the tracking process, and after the model is trained off-line, the trained model is used for the tracking step.
In step 1), the specific steps of acquiring the target search area X in the current video frame to be tested are as follows:
(1) in the initial frame, the target template region cropped according to the ground-truth box is slightly larger than the actual target; the actual side length Sz of the template crop is calculated according to the following formula:
Sz = sqrt((wz + c) × (hz + c))
where c = (wz + hz)/2, wz is the width of the target template region Z and hz is its height; the cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/Sz is saved and used to compute the crop size of the search area.
(2) the search area is resized to 255 × 255; to keep the target scale of the search area consistent with that of the template, the actual crop size of the search area is: Sx = 255/scale.
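The two crop-size rules can be sketched in a few lines of Python. The exact form of the template formula is an assumption here (a SiamFC-style context padding Sz = sqrt((wz + c)(hz + c)) with c = (wz + hz)/2), since only c is spelled out in the text:

```python
import math

def template_crop_size(wz, hz):
    # Side length Sz of the square template crop; assumed form:
    # Sz = sqrt((wz + c) * (hz + c)) with context margin c = (wz + hz) / 2.
    c = (wz + hz) / 2.0
    return math.sqrt((wz + c) * (hz + c))

def crop_sizes(wz, hz):
    sz = template_crop_size(wz, hz)
    scale = 125.0 / sz           # template crop is resized to 125 x 125
    sx = 255.0 / scale           # search crop so the target keeps the template scale
    return sz, scale, sx
```

Whatever the target size, the search crop ends up 255/125 = 2.04 times the template crop, which is what keeps the target at a consistent scale across the two regions.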
In step 2), the fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one extracts the features of the target template region and the other extracts the features of the target search region; the two branches share the same parameters. The process is described as follows:
Fx=ψp(X)
Fz=ψp(Z)
The fully convolutional twin network and the proposed target attention model are pre-trained using the visual recognition data set ILSVRC2015_VID: the twin network and the target specific response attention model are combined into a unified framework, a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same sequence are input, and a response map is output; the loss function adopted in training is the cross-entropy loss:
L(y, v) = log(1 + e^(-yv))
where y is the target label and v is the score at the corresponding position of the response map.
In step 2), inputting the target template region and the target search region from step 1) into the fully convolutional twin network to obtain the CNN feature Fz of the target template region Z and the CNN feature Fx of the target search region X comprises the following specific steps:
(1) after the input image is resized, the fully convolutional twin network, a fully convolutional network of five convolution layers, extracts the coarse-grained features of the input image;
(2) network details, each convolution layer is followed by a batch normalization layer and an activation layer ReLU; 1. a first convolution layer having a convolution kernel size of 11 × 11, a convolution step of 2, an input channel number of 3, and an output channel number of 96; 2. a first pooling layer, pooling size 3 x 3, maximal pooling of step size 2; 3. a second convolution layer, the convolution kernel is 5 multiplied by 5, the convolution step length is 1, the number of input channels is 96, and the number of output channels is 256; 4. the second pooling layer is the same as the first pooling layer, the pooling size is 3 × 3, and the step size is 2; 5. the third convolution layer, the convolution kernel size is 3 multiplied by 3, the convolution step size is 1, the input channel number is 256, and the output channel number is 384; 6. a fourth convolution layer having a convolution kernel size of 3 × 3, a convolution step size of 1, an input channel number 384, and an output channel number 384; 7. the fifth convolutional layer has a convolutional kernel size of 3 × 3, convolution step 1, input channel number 384, and output channel number 256.
(3) The whole tracking algorithm model is provided with two branch networks, one is a template branch, the other is a search area branch, the two branch networks adopt the networks described in the step (2), and the parameters are the same; the proposed attention model of the target specific response immediately follows the search area branch.
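The five-layer branch described in (2) can be sketched in PyTorch as below. This is a minimal sketch, not the patented implementation; following the text, every convolution (including the fifth) is followed by batch normalization and ReLU, which is an assumption for the final layer since SiamFC-style trackers sometimes omit activation there:

```python
import torch
import torch.nn as nn

def make_branch() -> nn.Sequential:
    # AlexNet-like fully convolutional branch as listed in (2); each
    # convolution is followed by batch normalization and ReLU.
    def block(cin, cout, k, s):
        return [nn.Conv2d(cin, cout, kernel_size=k, stride=s),
                nn.BatchNorm2d(cout), nn.ReLU()]
    layers = (block(3, 96, 11, 2) + [nn.MaxPool2d(3, 2)] +
              block(96, 256, 5, 1) + [nn.MaxPool2d(3, 2)] +
              block(256, 384, 3, 1) +
              block(384, 384, 3, 1) +
              block(384, 256, 3, 1))
    return nn.Sequential(*layers)
```

With the crop sizes from step 1), a 125 × 125 template maps to a 256 × 5 × 5 feature tensor and a 255 × 255 search region to 256 × 22 × 22, so the channel-wise cross-correlation of step 3) yields an 18 × 18 response map per channel.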
In step 3), the specific step of obtaining the final response map S may be:
(1) a channel-by-channel cross-correlation operation is performed on the CNN features Fx and Fz of the target search region X and the target template region Z, giving a multi-channel response map, represented as:
Smulti=Corrcw(Fx,Fz);
(2) Fx is input into an attention network H(·) to obtain the channel attention weights ω; the attention network is composed of a global average pooling layer and a three-layer multilayer perceptron, represented as:
ω=H(Fx)
(3) the calculated channel attention weights ω are applied to the multi-channel response map Smulti to obtain the weighted multi-channel response map Ŝ = ω ⊗ Smulti; Ŝ and Smulti are then added in a residual structure and summed over channels to obtain the final response map Sfinal; the overall process is represented by the following equation:
Sfinal = Σi (Smulti^i + ωi · Smulti^i)
where i indexes the channels.
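Steps (1)-(3) can be sketched as follows. The hidden width of the perceptron and the sigmoid output are assumptions, as the text only fixes the structure (global average pooling plus a three-layer MLP):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channelwise_xcorr(fx: torch.Tensor, fz: torch.Tensor) -> torch.Tensor:
    # Smulti = Corr_cw(Fx, Fz): each template channel is correlated only
    # with its matching search channel (grouped convolution, groups=C).
    c = fx.size(1)
    kernel = fz.view(c, 1, fz.size(2), fz.size(3))
    return F.conv2d(fx, kernel, groups=c)

class ResponseAttention(nn.Module):
    # H(.): global average pooling followed by a three-layer MLP that
    # outputs one weight per channel.
    def __init__(self, channels: int = 256, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, fx: torch.Tensor) -> torch.Tensor:
        return self.mlp(fx.mean(dim=(2, 3)))     # (1, C) channel weights omega

def final_response(fx, fz, attn):
    s_multi = channelwise_xcorr(fx, fz)          # (1, C, H, W)
    omega = attn(fx)[:, :, None, None]           # (1, C, 1, 1)
    s_hat = omega * s_multi                      # weighted response maps
    return (s_multi + s_hat).sum(dim=1)          # residual add, sum over channels
```

The residual addition keeps every raw channel response in the final map, so the attention weights only re-emphasize channels rather than being able to suppress them entirely.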
in step 4), the specific step of selecting the scale of the target search area with the highest response value as the scale corresponding to the current frame may be:
(1) to balance tracking precision and speed, three scales are selected for the multi-scale search; the 3 scale factors λi take the specific values [0.96385, 1, 1.0375]. Letting the actual crop size of the search area in the previous frame be Sx, the actual crop size at each scale is λi · Sx.
(2) And readjusting the image blocks of the three scales to be 255 multiplied by 255, performing scale search, firstly finding a maximum value in the response image in each scale, and comparing the maximum response values in the three scales, wherein the scale where the maximum response value is located is the scale corresponding to the current frame.
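The scale selection in (1)-(2) reduces to picking the scale whose response map has the largest peak; a plain-Python sketch (the helper name is illustrative):

```python
def pick_scale(response_maps, scale_factors=(0.96385, 1.0, 1.0375)):
    # response_maps: one 2-D response map (list of rows) per candidate
    # scale, each produced by running step 3) on the resized 255x255 crop.
    peaks = [max(max(row) for row in rmap) for rmap in response_maps]
    best = max(range(len(peaks)), key=lambda i: peaks[i])
    return scale_factors[best]
```

The chosen factor then multiplies the previous crop size Sx to update the target scale for the current frame.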
In step 5), the specific steps of training the model may be:
(1) 53200 training pairs are sampled from the visual recognition data set ILSVRC2015_VID; each training pair consists of a template and a search area block, the template and the search area block of the same training pair belong to the same video sequence, and the template frame precedes the search area frame;
(2) in the original search area the target lies at the center of the region; during training the target center is randomly shifted by 0 to 8 pixels, which improves the network's generalization to target shifts;
(3) the model is trained iteratively for 50 epochs with a batch size of 8; the learning rate decays exponentially from 10^-2 to 10^-5; the training loss function L is a weighted cross-entropy loss, as follows:
L = (1/n) Σi wi · log(1 + e^(-yi·vi))
where wi is the weight of each sample, yi is the true label value, vi is the value predicted by the model, and n is the number of samples in a batch.
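The weighted loss can be sketched as below; the batch-averaged form with per-sample weights wi is an assumption consistent with the unweighted loss L(y, v) = log(1 + e^(-yv)) given earlier:

```python
import math

def weighted_logistic_loss(v, y, w):
    # L = (1/n) * sum_i w_i * log(1 + exp(-y_i * v_i))
    # v: predicted scores, y: labels in {-1, +1}, w: per-sample weights.
    n = len(v)
    return sum(wi * math.log1p(math.exp(-yi * vi))
               for vi, yi, wi in zip(v, y, w)) / n
```

A confident correct prediction (yv large and positive) contributes almost nothing, while a confident wrong prediction is penalized roughly linearly in |v|.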
Aiming at the deficiency that the original twin-network-based target tracking methods are not robust to complex tracking scenes such as rapid target motion, occlusion, rotation and background clutter, the invention provides a target specific response attention target tracking method based on a twin network. The method comprises five main parts: CNN feature extraction; channel-by-channel cross-correlation to generate response maps; an attention network that generates weights used to weight each channel response map; determination of the target position on the final response map; and the training method of the proposed model.
Compared with existing attention mechanisms, the attention mechanism of the invention learns the attention network in an end-to-end manner. By combining the target specific response attention model and the fully convolutional twin network in a unified framework, complex tracking scenes such as rapid target motion, occlusion, rotation and background clutter are handled effectively in the actual tracking process, and real-time operation can be achieved.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention.
Fig. 2 is a response diagram of each channel obtained by the channel-by-channel cross-correlation operation in the present invention.
Fig. 3 is a comparison of the resulting response map of the present invention and a baseline algorithm.
Detailed Description
The present invention is described in detail below with reference to the following examples, which give specific embodiments and operating procedures on the premise of the technical solution of the invention; however, the scope of protection of the invention is not limited to the following examples.
Referring to fig. 1, an embodiment of the present invention includes the steps of:
A. given a video sequence, the first frame contains a marked object. And defining a target template area Z and a target search area X, wherein the target template area is kept unchanged after being intercepted based on a given mark in a first frame, and the target search area is obtained in a current video frame to be tested and is an image block which is larger than the target template area and is intercepted by using the target position obtained from the previous frame.
B. The target template region and the target search region described in step A are input into the fully convolutional twin network to obtain the CNN feature Fz of the target template region Z and the CNN feature Fx of the target search region X. The fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one extracts the features of the target template region and the other extracts the features of the target search region. Both branches share the same parameters. The process is described as follows:
Fx=ψp(X)
Fz=ψp(Z)
C. The two CNN features Fx and Fz are input into the target specific response attention model, yielding a multi-channel response map Smulti weighted by the attention mechanism; Smulti is summed channel by channel to obtain the final response map S, and the position with the maximum response value in the response map is taken as the initial target position.
D. In the aspect of scale estimation, in the step A, a search scale pyramid is constructed in advance for a target search area by taking the target of the previous frame as a center, the step C is executed for the search area of each scale in the scale pyramid, the scale of the search area with the highest response value is selected as the current scale, and the actual size and position of the target are obtained by combining the position and the scale of the target, so that the target tracking is realized.
E. Model training: the twin network and the proposed target attention model are pre-trained using the visual recognition data set ILSVRC2015_VID; the training set contains 3862 video segments totalling more than 1.12 million frames, all annotated with category information and target positions. With the twin network and the target specific response attention model combined into a unified framework, a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same sequence are input, and a response map is output; the loss function adopted in training is the cross-entropy loss L(y, v) = log(1 + e^(-yv)), where y is the target label and v is the score at the corresponding position of the response map.
The parameters in the single-frame target tracking process in step a are further described as follows:
A1. in the initial frame, the target template region extracted according to the ground truth is slightly larger than the actual target in order to capture some semantic information; the actual crop size Sz of the template is calculated according to the following formula:
Sz = sqrt((wz + c) × (hz + c))
where c = (wz + hz)/2, wz is the width of the target template region Z and hz is its height; the cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/Sz is saved and used to compute the crop size of the search area.
A2. The resizing of the search area is 255 × 255, and in order to ensure that the target dimension of the search area and the dimension of the template are consistent, the actual cut size of the search area is as follows:
Sx=255/scale
the deep full convolution neural network adopted in the step B comprises the following substeps:
B1. after the input image is resized, the network is a fully convolutional twin network of 5 convolution layers used to extract the coarse-grained features of the input image. A structure similar to AlexNet is used (A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, pp. 1106-1114, 2012).
B2. network details: each convolution layer is followed by a batch normalization layer and a ReLU activation layer; 1. first convolution layer: kernel size 11 × 11, stride 2, 3 input channels, 96 output channels; 2. first pooling layer: 3 × 3 max pooling, stride 2; 3. second convolution layer: kernel 5 × 5, stride 1, 96 input channels, 256 output channels; 4. second pooling layer: 3 × 3 pooling, stride 2; 5. third convolution layer: kernel 3 × 3, stride 1, 256 input channels, 384 output channels; 6. fourth convolution layer: kernel 3 × 3, stride 1, 384 input channels, 384 output channels; 7. fifth convolution layer: kernel 3 × 3, stride 1, 384 input channels, 256 output channels.
B3. The whole tracking algorithm model has two branch networks, one is a template branch, the other is a search area branch, the networks of the two branches are the networks described in B2, and the parameters are the same. The attention model of the target specific response immediately follows the search area branch.
The attention model of the target specific response in step C, further comprising the sub-steps of:
C1. first, a channel-by-channel cross-correlation operation is performed on the CNN features Fx and Fz of X and Z, giving a multi-channel response map, expressed by the following formula:
Smulti=Corrcw(Fx,Fz)
C2. Fx is input into an attention network H(·) to obtain the channel attention weights ω; the attention network is composed of a global average pooling layer and a three-layer multilayer perceptron. This process can be expressed as:
ω=H(Fx)
C3. the calculated weights ω are applied to the multi-channel response map Smulti to obtain the weighted multi-channel response map Ŝ = ω ⊗ Smulti; Ŝ and Smulti are then added in a residual structure and summed over channels to obtain the final response map Sfinal; the overall process is represented by the following equation:
Sfinal = Σi (Smulti^i + ωi · Smulti^i)
the parameters and flow in the multi-scale strategy of the single frame in the step D are further described as follows:
D1. during tracking, three scales are selected for the multi-scale search in order to balance tracking precision and speed; the 3 scale factors λi take the specific values [0.96385, 1, 1.0375]. Letting the crop size of the search region in the previous frame be Sx, the actual crop size at each scale is λi · Sx.
D2. The image blocks of three scales are all readjusted to be 255 × 255 in size, and scale search is performed: the maximum value in the response image in each scale is found first, the maximum response values in the three scales are compared, and the scale where the maximum response value is located is the scale corresponding to the current frame.
The parameters of the model training process in step E and the flow thereof are further described as follows:
E1. 53200 training pairs are sampled from ILSVRC2015_VID; each training pair consists of a template and a search area block, the template and the search area block of the same training pair belong to the same video sequence, and the template frame precedes the search area frame.
E2. in the original search area the target lies at the center of the region; during training the target center is randomly shifted by 0 to 8 pixels, which improves the network's generalization to target shifts.
E3. the model is trained iteratively for 50 epochs with a batch size of 8; the learning rate decays exponentially from 10^-2 to 10^-5; the training loss function is a weighted cross-entropy loss, expressed as:
L = (1/n) Σi wi · log(1 + e^(-yi·vi))
where wi is the weight of each sample, yi is the true label value, vi is the value predicted by the model, and n is the number of samples in a batch.
Table 1 shows the results of video attribute analysis of the method of the present invention compared to other methods on OTB 100.
TABLE 1
SiamFC corresponds to the method proposed by Bertinetto, L. et al (Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: European Conference on Computer Vision Workshops (ECCV Workshops), pp. 850-865 (2016));
SiamTri corresponds to the method proposed by Dong, X. et al (Dong, X., Shen, J.: Triplet loss in siamese network for object tracking. In: European Conference on Computer Vision (ECCV), pp. 472-488 (2018));
SRDCF corresponds to the method proposed by Danelljan, M. et al (Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4310-4318 (2015));
CSR-DCF corresponds to the method proposed by Lukezic, A. et al (Lukezic, A., Vojir, T., Zajc, L.C., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4847-4856 (2017));
TRACA corresponds to the method proposed by Choi, J. et al (Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y.: Context-aware deep feature compression for high-speed visual tracking. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 479-488 (2018));
CFNet corresponds to the method proposed by Valmadre, J. et al (Valmadre, J., Bertinetto, L., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5000-5008 (2017));
ACFN corresponds to the method proposed by Choi, J. et al (Choi, J., Chang, H.J., Yun, S., Fischer, T., Demiris, Y., Choi, J.Y.: Attentional correlation filter network for adaptive visual tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4828-4837 (2017));
Staple corresponds to the method proposed by Bertinetto, L. et al (Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S.: Staple: Complementary learners for real-time tracking. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1401-1409 (2016));
KCF corresponds to the method proposed by Henriques, J.F. et al (Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583-596 (2015)).
As can be seen from Table 1, the method can effectively handle complex tracking scenes such as rapid target motion, occlusion, rotation and background clutter in the actual tracking process, and its performance is superior to that of the other trackers, demonstrating the effectiveness of the proposed method.
Claims (6)
1. A target specific response attention target tracking method based on a twin network is characterized by comprising the following steps:
1) a video sequence is given, wherein a first frame comprises a marked target, a target template area Z and a target search area X are defined, the target template area Z is kept unchanged after being intercepted based on the given mark in the first frame, the target search area X is obtained from a current video frame to be tested, and an image block larger than the target template area Z is intercepted by using a target position obtained from the previous frame;
2) inputting the target template region and the target search region from step 1) into a fully convolutional twin network to obtain the CNN feature Fz of the target template region Z and the CNN feature Fx of the target search region X;
3) inputting the CNN features Fx and Fz obtained in step 2) into the target specific response attention model to obtain a multi-channel response map Smulti weighted by the attention model; summing Smulti channel by channel gives the final response map S, and the position with the maximum response value in S is taken as the initial target position;
the specific steps for obtaining the final response diagram S are as follows:
(1) a channel-by-channel cross-correlation operation is performed on the CNN features Fx and Fz of the target search region X and the target template region Z, giving a multi-channel response map, represented as:
Smulti=Corrcw(Fx,Fz);
(2) Fx is input into an attention network H(·) to obtain the channel attention weights ω; the attention network is composed of a global average pooling layer and a three-layer multilayer perceptron, represented as:
ω=H(Fx)
(3) the calculated channel attention weights ω are applied to the multi-channel response map Smulti to obtain the weighted multi-channel response map Ŝ = ω ⊗ Smulti; Ŝ and Smulti are then added in a residual structure and summed over channels to obtain the final response map Sfinal; the overall process is represented by the following equation:
Sfinal = Σi (Smulti^i + ωi · Smulti^i)
4) taking the target position obtained in the last frame in the step 1) as a center, constructing a search scale pyramid for a target search area, executing the step 3) for each search area of an estimated scale in the scale pyramid, selecting the scale of the target search area with the highest response value as the scale corresponding to the current frame, and combining the target position and the scale to obtain the actual size and position of the target, thereby realizing target tracking;
5) training a model: the training of the model is independent of the tracking process, and after the model is trained off-line, the trained model is used for the tracking step.
2. The twin network based target specific response attention target tracking method as claimed in claim 1, wherein in step 1), the specific steps of obtaining the target search area X in the current video frame to be tested are as follows:
1.1 in the initial frame, the target template region is cropped according to the ground truth; the size of the target template region is slightly larger than the actual target in order to capture some semantic information, and is calculated according to the following formula:
Sz = sqrt((wz + c) × (hz + c))
where c = (wz + hz)/2, wz is the width of the target template region Z and hz is its height; the cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/Sz is saved and used to compute the crop size of the search area;
1.2 the resizing is 255 × 255 with respect to the search area, and in order to ensure that the target dimension of the search area and the dimension of the template are consistent, the actual cut size of the search area is: sx=255/scale。
3. The twin network based target specific response attention target tracking method as claimed in claim 1, wherein in step 2), the fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one extracts the features of the target template region and the other extracts the features of the target search region; the two branches share the same parameters; the process is described as follows:
Fx=ψp(X)
Fz=ψp(Z)
The fully convolutional twin network and the proposed target attention model are pre-trained on the visual recognition data set ILSVRC2015_VID by combining the twin network and the target specific response attention model into a unified framework: a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same video sequence are input, and a response map is output; the loss function adopted in the training process is a cross-entropy loss function:
L(k, u) = log(1 + e^(-ku))
wherein k is the target label, and u is the score of the corresponding position of the response map.
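The per-position logistic loss above can be sketched in pure Python (the aggregation over response-map positions is an assumption; the patent does not state how per-position losses are combined):

```python
import math

def logistic_loss(k, u):
    """Per-position loss L(k, u) = log(1 + exp(-k*u)),
    with k in {-1, +1} the target label and u the response-map score."""
    return math.log(1 + math.exp(-k * u))

def response_map_loss(labels, scores):
    """Mean per-position loss over a flattened response map
    (mean aggregation is an assumption, not stated in the claim)."""
    return sum(logistic_loss(k, u) for k, u in zip(labels, scores)) / len(labels)
```

A correctly classified position with a large score k·u contributes a loss near zero, while a misclassified position is penalized roughly linearly in |u|.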
4. The twin network based target specific response attention target tracking method as claimed in claim 1, characterized in that in step 2), the target template region and the target search region in step 1) are input into the full convolution twin network to obtain the CNN feature Fz of the target template region Z and the CNN feature Fx of the target search region X; the specific steps are as follows:
2.1 the full convolution twin network is a fully convolutional network of five convolution layers which, after the input images are resized, extracts their coarse-grained features;
2.2 network details: each convolution layer is followed by a batch normalization layer and a ReLU activation layer; ① first convolution layer: convolution kernel size 11 × 11, convolution stride 2, 3 input channels, 96 output channels; ② first pooling layer: max pooling with pooling size 3 × 3 and stride 2; ③ second convolution layer: convolution kernel 5 × 5, stride 1, 96 input channels, 256 output channels; ④ second pooling layer: same as the first pooling layer, pooling size 3 × 3, stride 2; ⑤ third convolution layer: convolution kernel size 3 × 3, stride 1, 256 input channels, 384 output channels; ⑥ fourth convolution layer: convolution kernel size 3 × 3, stride 1, 384 input channels, 384 output channels; ⑦ fifth convolution layer: convolution kernel size 3 × 3, stride 1, 384 input channels, 256 output channels;
2.3 the whole tracking algorithm model has two branch networks, one being the template branch and the other the search area branch; both branches use the network described in step 2.2 and share the same parameters; the proposed target specific response attention model immediately follows the search area branch.
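The five-layer branch of step 2.2 could be sketched as follows (assuming PyTorch; the patent specifies the architecture but not a framework, and, following the claim text literally, every convolution layer, including the fifth, is given batch normalization and ReLU):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_c, out_c, k, stride):
    # per step 2.2, each convolution layer is followed by BN and ReLU
    return [nn.Conv2d(in_c, out_c, k, stride),
            nn.BatchNorm2d(out_c),
            nn.ReLU(inplace=True)]

class Backbone(nn.Module):
    """Five-layer AlexNet-like fully convolutional branch from step 2.2;
    both the template and search-area branches share this network."""
    def __init__(self):
        super().__init__()
        layers = []
        layers += conv_bn_relu(3, 96, 11, 2)     # conv1: 11x11, stride 2
        layers += [nn.MaxPool2d(3, 2)]           # pool1: 3x3, stride 2
        layers += conv_bn_relu(96, 256, 5, 1)    # conv2: 5x5, stride 1
        layers += [nn.MaxPool2d(3, 2)]           # pool2: 3x3, stride 2
        layers += conv_bn_relu(256, 384, 3, 1)   # conv3: 3x3, stride 1
        layers += conv_bn_relu(384, 384, 3, 1)   # conv4: 3x3, stride 1
        layers += conv_bn_relu(384, 256, 3, 1)   # conv5: 3x3, stride 1
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)
```

With these kernel sizes and strides (and no padding), a 255 × 255 search region yields a 256-channel feature map of spatial size 22 × 22, and a 125 × 125 template yields 5 × 5.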
5. The twin network-based target specific response attention target tracking method as claimed in claim 1, wherein in step 4), the specific steps of selecting the scale of the target search area with the highest response value as the scale corresponding to the current frame are:
4.1 to balance tracking accuracy and speed, three scales are selected for the multi-scale search; the 3 scale factors λi take the specific values [0.96385, 1, 1.0375]; let the actual crop size of the search area in the previous frame be Sx, then the actual crop size at each scale is λi × Sx;
4.2 readjusting the image blocks of the three scales to 255 × 255 and performing the scale search: first find the maximum value in the response map at each scale, then compare the maximum response values of the three scales; the scale containing the largest maximum response is the scale corresponding to the current frame.
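The scale selection of steps 4.1-4.2 can be sketched in pure Python, with one 2-D response map per scale represented as a list of rows (function names are illustrative):

```python
SCALE_FACTORS = [0.96385, 1.0, 1.0375]  # the three scale factors of step 4.1

def crop_sizes(s_x):
    """Actual crop size of the search area at each of the three scales."""
    return [lam * s_x for lam in SCALE_FACTORS]

def select_scale(response_maps):
    """Pick the scale whose response map contains the largest value.

    response_maps: one 2-D response map (list of rows) per scale, obtained
    by scoring the three resized 255 x 255 search regions with the network."""
    peaks = [max(max(row) for row in rmap) for rmap in response_maps]
    best = peaks.index(max(peaks))
    return SCALE_FACTORS[best], peaks[best]
```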
6. The twin network-based target specific response attention target tracking method as claimed in claim 1, wherein in step 5), the specific steps of training the model are:
5.1 sampling 53200 training pairs from the visual recognition data set ILSVRC2015_VID, wherein each training pair consists of a template and a search area block, the template and the search area block of the same training pair belong to the same video sequence, and the frame of the template precedes that of the search area block;
5.2 for an original search area, the target is located at the center of the area; during training, the target center is randomly offset by 0-8 pixels so as to improve the generalization of the network to target offsets;
5.3 the model is iteratively trained for 50 generations, the number of samples selected in one training batch is 8, and the learning rate decays exponentially from 10^-2 to 10^-5; the training loss function L is a weighted cross-entropy loss, as follows: L = Σ(i=1..n) wi · log(1 + e^(-yi·vi)), where wi represents the weight of each sample, yi represents the true label value, vi represents the value predicted by the model, and n represents the number of samples sampled in a batch.
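The weighted cross-entropy loss of step 5.3 can be sketched in pure Python (a sketch under the assumption that the weights already carry any normalization, e.g. summing to 1 over the batch; labels are in {-1, +1} to match the logistic loss of claim 3):

```python
import math

def weighted_cross_entropy(weights, labels, predictions):
    """Weighted logistic loss over a batch of n samples:
    L = sum_i w_i * log(1 + exp(-y_i * v_i)), labels y_i in {-1, +1}."""
    return sum(w * math.log(1 + math.exp(-y * v))
               for w, y, v in zip(weights, labels, predictions))
```

With uniform weights w_i = 1/n this reduces to the mean of the unweighted logistic loss; non-uniform weights let the training re-balance, e.g., positive against negative samples.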
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010081733.8A CN111291679B (en) | 2020-02-06 | 2020-02-06 | Target specific response attention target tracking method based on twin network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111291679A CN111291679A (en) | 2020-06-16 |
CN111291679B true CN111291679B (en) | 2022-05-27 |
Family
ID=71026700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010081733.8A Active CN111291679B (en) | 2020-02-06 | 2020-02-06 | Target specific response attention target tracking method based on twin network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111291679B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860248B (en) * | 2020-07-08 | 2021-06-25 | 上海蠡图信息科技有限公司 | Visual target tracking method based on twin gradual attention-guided fusion network |
CN111787227A (en) * | 2020-07-22 | 2020-10-16 | 苏州臻迪智能科技有限公司 | Style migration method and device based on tracking shooting |
CN111899283B (en) * | 2020-07-30 | 2023-10-17 | 北京科技大学 | Video target tracking method |
CN112150504A (en) * | 2020-08-03 | 2020-12-29 | 上海大学 | Visual tracking method based on attention mechanism |
CN112085718B (en) * | 2020-09-04 | 2022-05-10 | 厦门大学 | NAFLD ultrasonic video diagnosis system based on twin attention network |
CN112288772B (en) * | 2020-10-14 | 2022-06-07 | 武汉大学 | Channel attention target tracking method based on online multi-feature selection |
CN112348849B (en) * | 2020-10-27 | 2023-06-20 | 南京邮电大学 | Twin network video target tracking method and device |
CN112215872B (en) * | 2020-11-04 | 2024-03-22 | 上海海事大学 | Multi-full convolution fusion single-target tracking method based on twin network |
CN112560695B (en) * | 2020-12-17 | 2023-03-24 | 中国海洋大学 | Underwater target tracking method, system, storage medium, equipment, terminal and application |
CN112598739B (en) * | 2020-12-25 | 2023-09-01 | 哈尔滨工业大学(深圳) | Mobile robot infrared target tracking method, system and storage medium based on space-time characteristic aggregation network |
CN112750148B (en) * | 2021-01-13 | 2024-03-22 | 浙江工业大学 | Multi-scale target perception tracking method based on twin network |
CN112785626A (en) * | 2021-01-27 | 2021-05-11 | 安徽大学 | Twin network small target tracking method based on multi-scale feature fusion |
CN113192124A (en) * | 2021-03-15 | 2021-07-30 | 大连海事大学 | Image target positioning method based on twin network |
CN113158881B (en) * | 2021-04-19 | 2022-06-14 | 电子科技大学 | Cross-domain pedestrian re-identification method based on attention mechanism |
CN113205544B (en) * | 2021-04-27 | 2022-04-29 | 武汉大学 | Space attention reinforcement learning tracking method based on cross-over ratio estimation |
CN113362373B (en) * | 2021-06-01 | 2023-12-15 | 北京首都国际机场股份有限公司 | Double-twin-network-based aircraft tracking method in complex apron area |
CN113658218B (en) * | 2021-07-19 | 2023-10-13 | 南京邮电大学 | Dual-template intensive twin network tracking method, device and storage medium |
CN113379806B (en) * | 2021-08-13 | 2021-11-09 | 南昌工程学院 | Target tracking method and system based on learnable sparse conversion attention mechanism |
CN113628249B (en) * | 2021-08-16 | 2023-04-07 | 电子科技大学 | RGBT target tracking method based on cross-modal attention mechanism and twin structure |
CN113793359B (en) * | 2021-08-25 | 2024-04-05 | 西安工业大学 | Target tracking method integrating twin network and related filtering |
CN113888590B (en) * | 2021-09-13 | 2024-04-16 | 华南理工大学 | Video target tracking method based on data enhancement and twin network |
CN113870312B (en) * | 2021-09-30 | 2023-09-22 | 四川大学 | Single target tracking method based on twin network |
CN113936040B (en) * | 2021-10-15 | 2023-09-15 | 哈尔滨工业大学 | Target tracking method based on capsule network and natural language query |
CN113822233B (en) * | 2021-11-22 | 2022-03-22 | 青岛杰瑞工控技术有限公司 | Method and system for tracking abnormal fishes cultured in deep sea |
CN114241003B (en) * | 2021-12-14 | 2022-08-19 | 成都阿普奇科技股份有限公司 | All-weather lightweight high-real-time sea surface ship detection and tracking method |
CN116645399B (en) * | 2023-07-19 | 2023-10-13 | 山东大学 | Residual network target tracking method and system based on attention mechanism |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108898620A (en) * | 2018-06-14 | 2018-11-27 | 厦门大学 | Method for tracking target based on multiple twin neural network and regional nerve network |
CN109978921A (en) * | 2019-04-01 | 2019-07-05 | 南京信息工程大学 | A kind of real-time video target tracking algorithm based on multilayer attention mechanism |
CN110223324A (en) * | 2019-06-05 | 2019-09-10 | 东华大学 | A kind of method for tracking target of the twin matching network indicated based on robust features |
CN110335290A (en) * | 2019-06-04 | 2019-10-15 | 大连理工大学 | Twin candidate region based on attention mechanism generates network target tracking method |
CN110675423A (en) * | 2019-08-29 | 2020-01-10 | 电子科技大学 | Unmanned aerial vehicle tracking method based on twin neural network and attention model |
Non-Patent Citations (3)
Title |
---|
A Twofold Siamese Network for Real-Time Object Tracking;Anfeng He et al.;《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;20181217;全文 * |
Multi-granularity Hierarchical Attention Siamese Network for Visual Tracking;Xing Chen et al.;《2018 International Joint Conference on Neural Networks (IJCNN)》;20181015;全文 * |
Target Tracking Method Based on Siamese Network and Attention Mechanism; Zhou Diya et al.; 《Information & Communications》; 20191215 (No. 12); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111291679B (en) | Target specific response attention target tracking method based on twin network | |
CN108549839B (en) | Adaptive feature fusion multi-scale correlation filtering visual tracking method | |
CN111914664A (en) | Vehicle multi-target detection and track tracking method based on re-identification | |
CN111460968B (en) | Unmanned aerial vehicle identification and tracking method and device based on video | |
CN108647694B (en) | Context-aware and adaptive response-based related filtering target tracking method | |
CN110569723A (en) | Target tracking method combining feature fusion and model updating | |
CN108961308B (en) | Residual error depth characteristic target tracking method for drift detection | |
CN110942471B (en) | Long-term target tracking method based on space-time constraint | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN113011329A (en) | Pyramid network based on multi-scale features and dense crowd counting method | |
CN111738344A (en) | Rapid target detection method based on multi-scale fusion | |
CN110992401A (en) | Target tracking method and device, computer equipment and storage medium | |
CN111640138A (en) | Target tracking method, device, equipment and storage medium | |
CN115171165A (en) | Pedestrian re-identification method and device with global features and step-type local features fused | |
CN113192124A (en) | Image target positioning method based on twin network | |
CN115375737B (en) | Target tracking method and system based on adaptive time and serialized space-time characteristics | |
CN114898403A (en) | Pedestrian multi-target tracking method based on Attention-JDE network | |
CN111340842A (en) | Correlation filtering target tracking algorithm based on joint model | |
CN111507215A (en) | Video target segmentation method based on space-time convolution cyclic neural network and cavity convolution | |
Li et al. | Real-time pedestrian detection with deep supervision in the wild | |
Chen et al. | Norm-aware embedding for efficient person search and tracking | |
Zhou et al. | Discriminative attention-augmented feature learning for facial expression recognition in the wild | |
CN108257148B (en) | Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking | |
CN117011655A (en) | Adaptive region selection feature fusion based method, target tracking method and system | |
CN111275694A (en) | Attention mechanism guided progressive division human body analytic model and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||