CN111291679B - Target specific response attention target tracking method based on twin network


Info

Publication number
CN111291679B
CN111291679B
Authority
CN
China
Prior art keywords
target
response
search area
convolution
attention
Prior art date
Legal status
Active
Application number
CN202010081733.8A
Other languages
Chinese (zh)
Other versions
CN111291679A (en)
Inventor
Hanzi Wang (王菡子)
Penghui Zhao (赵鹏辉)
Haosheng Chen (陈昊升)
Yanjie Liang (梁艳杰)
Yan Yan (严严)
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202010081733.8A
Publication of CN111291679A
Application granted
Publication of CN111291679B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

A target-specific response attention target tracking method based on a twin network, relating to computer vision technology. The method is proposed to address the deficiency that the original twin-network-based target tracking methods are not robust enough to complex tracking scenes such as fast target motion, occlusion, rotation, and background clutter. The proposed target response attention module effectively weakens the influence of noise information on tracking performance during tracking, while strengthening the feature information that is discriminative for appearance changes of the target object, so that the twin network generates a better target response map for target position prediction and thus achieves more robust tracking performance. It comprises five main parts: CNN feature extraction; channel-wise cross-correlation to generate response maps; generating weights with an attention network and weighting each channel response map; determining the target position on the final response map; and the training method of the proposed model.

Description

Target specific response attention target tracking method based on twin network
Technical Field
The invention relates to a computer vision technology, in particular to a target specific response attention target tracking method based on a twin network.
Background
Target tracking is a basic task in computer vision, with wide application in video surveillance, vehicle navigation, augmented reality, and other fields. Given a video sequence, target tracking selects an object of interest in the first frame and predicts the position of that object in subsequent frames by computer vision algorithms. In recent years, twin-network-based tracking methods have attracted researchers' attention because they offer both tracking accuracy and real-time speed, but the performance of these algorithms is easily affected by fast motion of the target or camera, appearance changes of the target, cluttered backgrounds, and similar conditions that are unavoidable in practice.
In deep learning, an attention mechanism enables a model to selectively capture the important parts of its input for a particular task, and it can improve model performance without excessive computation and storage cost. Attention mechanisms have therefore been widely used in image captioning, machine translation, speech recognition, and other fields. For target tracking, DAVT applies discriminative spatial attention to identify specific regions on the target. ACFN attempts to use an attention mechanism to select a subset of correlation filters for tracking. CSR-DCF uses a foreground spatial reliability map to constrain correlation filter learning. RTT uses a multi-directional recurrent neural network to generate saliency and select reliable regions belonging to the target object. These pioneering algorithms verify the superiority of the attention mechanism for target tracking and point to the prospect of improving tracking performance by exploiting attention.
Disclosure of Invention
The invention aims to provide a target-specific response attention target tracking method based on a twin network, which can effectively cope with complex tracking scenes such as fast target motion, occlusion, rotation, and background clutter.
The invention comprises the following steps:
1) A video sequence is given, whose first frame contains an annotated target. A target template region Z and a target search region X are defined: the target template region Z is cropped in the first frame based on the given annotation and then kept unchanged, while the target search region X is obtained from the current video frame under test by cropping, around the target position obtained in the previous frame, an image block larger than the target template region Z;
2) Inputting the target template region and the target search region of step 1) into a fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X;
3) Inputting the CNN features F_x and F_z obtained in step 2) into the target-specific response attention model to obtain a multi-channel response map S_multi weighted by the attention model; summing S_multi channel by channel gives the final response map S, and the position with the maximum response value in S is determined as the initial target position;
4) Taking the target position obtained in the previous frame in step 1) as the center, a search scale pyramid is constructed for the target search region; step 3) is executed for the search region at each estimated scale in the pyramid, the scale of the target search region with the highest response value is selected as the scale corresponding to the current frame, and the target position and scale are combined to obtain the actual size and position of the target, thereby realizing target tracking;
5) Training the model: model training is independent of the tracking process; after the model is trained offline, the trained model is used in the tracking steps above.
In step 1), the specific steps of obtaining the target search region X from the current video frame under test are as follows:
(1) In the initial frame, the target template region is cropped, according to the ground-truth annotation, slightly larger than the actual target. The actual crop size S_z of the template is calculated according to the following formula:
S_z = sqrt((w_z + c)(h_z + c))
where c = (w_z + h_z)/2, w_z denotes the width of the target template region Z, and h_z denotes its height. The cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/S_z is saved; this size is used to calculate the crop size of the search region.
(2) The search region is resized to 255 × 255. To ensure that the target scale in the search region is consistent with the scale of the template, the actual crop size of the search region is: S_x = 255/scale.
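As a concrete illustration of the crop-size arithmetic above, here is a minimal Python sketch; the function name crop_sizes and its return convention are our own, not taken from the patent:

```python
import math

def crop_sizes(w_z, h_z):
    """Template and search-area crop sizes from step 1.

    w_z, h_z: width and height of the annotated target box in pixels.
    Returns (S_z, scale, S_x) following the formulas in the text.
    """
    c = (w_z + h_z) / 2.0                   # context margin c = (w_z + h_z) / 2
    s_z = math.sqrt((w_z + c) * (h_z + c))  # S_z = sqrt((w_z + c)(h_z + c))
    scale = 125.0 / s_z                     # template crop is resized to 125 x 125
    s_x = 255.0 / scale                     # search crop size keeping the target scale consistent
    return s_z, scale, s_x

# Example: a 50 x 80 target gives S_z = sqrt(115 * 145), roughly 129 pixels.
print(crop_sizes(50, 80))
```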
In step 2), the fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one for extracting the features of the target template region and one for extracting the features of the target search region, the two branches sharing the same parameters; the process is described as follows:
F_x = ψ_p(X)
F_z = ψ_p(Z)
The fully convolutional twin network and the proposed target attention model are pre-trained on the visual recognition dataset ILSVRC2015_VID. The twin network and the target-specific response attention model are combined into a unified framework: a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same video sequence are input, and a response map is output. The loss function adopted during training is the cross-entropy loss:
L(y, v) = log(1 + e^(-yv))
where y is the target label and v is the score at the corresponding position of the response map.
In step 2), inputting the target template region and the target search region of step 1) into the fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X comprises the following specific steps:
(1) After the input images of the fully convolutional twin network are resized, the network, a fully convolutional network of five convolutional layers, extracts the coarse-grained features of the input image;
(2) network details, each convolution layer is followed by a batch normalization layer and an activation layer ReLU; 1. a first convolution layer having a convolution kernel size of 11 × 11, a convolution step of 2, an input channel number of 3, and an output channel number of 96; 2. a first pooling layer, pooling size 3 x 3, maximal pooling of step size 2; 3. a second convolution layer, the convolution kernel is 5 multiplied by 5, the convolution step length is 1, the number of input channels is 96, and the number of output channels is 256; 4. the second pooling layer is the same as the first pooling layer, the pooling size is 3 × 3, and the step size is 2; 5. the third convolution layer, the convolution kernel size is 3 multiplied by 3, the convolution step size is 1, the input channel number is 256, and the output channel number is 384; 6. a fourth convolution layer having a convolution kernel size of 3 × 3, a convolution step size of 1, an input channel number 384, and an output channel number 384; 7. the fifth convolutional layer has a convolutional kernel size of 3 × 3, convolution step 1, input channel number 384, and output channel number 256.
(3) The whole tracking algorithm model has two branch networks, a template branch and a search-region branch; both branches use the network described in step (2) with identical parameters. The proposed target-specific response attention model immediately follows the search-region branch.
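The backbone specified in steps (1) and (2) can be transcribed directly; the following PyTorch sketch follows the layer list above (grouping conv-BN-ReLU into a helper and using nn.Sequential are our implementation choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

class SiameseBackbone(nn.Module):
    """Five-layer AlexNet-like fully convolutional backbone of step 2.

    Each convolution is followed by batch normalization and ReLU,
    as stated in the text; the same module serves both branches.
    """
    def __init__(self):
        super().__init__()
        def conv_bn_relu(cin, cout, k, s):
            return [nn.Conv2d(cin, cout, k, stride=s),
                    nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        layers = []
        layers += conv_bn_relu(3, 96, 11, 2)    # conv1: 11x11, stride 2, 3 -> 96
        layers += [nn.MaxPool2d(3, 2)]          # pool1: 3x3 max pooling, stride 2
        layers += conv_bn_relu(96, 256, 5, 1)   # conv2: 5x5, stride 1, 96 -> 256
        layers += [nn.MaxPool2d(3, 2)]          # pool2: 3x3 max pooling, stride 2
        layers += conv_bn_relu(256, 384, 3, 1)  # conv3: 3x3, stride 1, 256 -> 384
        layers += conv_bn_relu(384, 384, 3, 1)  # conv4: 3x3, stride 1, 384 -> 384
        layers += conv_bn_relu(384, 256, 3, 1)  # conv5: 3x3, stride 1, 384 -> 256
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)

backbone = SiameseBackbone()
f_z = backbone(torch.randn(1, 3, 125, 125))  # template features F_z: (1, 256, 5, 5)
f_x = backbone(torch.randn(1, 3, 255, 255))  # search features F_x: (1, 256, 22, 22)
print(f_z.shape, f_x.shape)
```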
In step 3), the specific steps of obtaining the final response map S may be:
(1) performing channel-by-channel cross-correlation on the CNN features F_x and F_z of the target search region X and the target template region Z to obtain a multi-channel response map, represented as:
S_multi = Corr_cw(F_x, F_z);
(2) f is to bexInputting an attention network H (·), obtaining an attention weight omega of a channel, wherein the attention network is composed of a global mean pooling and a three-layer multilayer perceptron and is represented as follows:
ω=H(Fx)
(3) weighting the calculated channel attention weights ω onto the multi-channel response map S_multi to obtain the weighted multi-channel response map Ŝ_multi = ω ⊗ S_multi; Ŝ_multi and S_multi are then added in a residual structure and summed over channels to obtain the final response map S_final. The overall process is represented by the following equation:
S_final = Σ_{c=1}^{C} (Ŝ_multi + S_multi)^(c),  where Ŝ_multi = ω ⊗ S_multi
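A minimal PyTorch sketch of step 3 follows. Realizing the channel-wise cross-correlation with a grouped convolution, the hidden width of the three-layer perceptron, and the final sigmoid on the weights are all our assumptions; the patent fixes none of them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetResponseAttention(nn.Module):
    """Target-specific response attention of step 3 (a sketch)."""

    def __init__(self, channels=256, hidden=64):
        super().__init__()
        # H(.): global average pooling followed by a three-layer MLP.
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, f_x, f_z):
        b, c, hx, wx = f_x.shape
        # S_multi = Corr_cw(F_x, F_z): a grouped convolution correlates
        # channel i of F_x with channel i of F_z.
        s_multi = F.conv2d(f_x.reshape(1, b * c, hx, wx),
                           f_z.reshape(b * c, 1, *f_z.shape[-2:]),
                           groups=b * c)
        s_multi = s_multi.reshape(b, c, *s_multi.shape[-2:])
        # omega = H(F_x): one attention weight per channel.
        omega = self.mlp(f_x.mean(dim=(2, 3))).reshape(b, c, 1, 1)
        # Residual weighting, then channel-wise sum -> final response map.
        return (omega * s_multi + s_multi).sum(dim=1, keepdim=True)
```

With the backbone features above (F_z of size 256 × 5 × 5 and F_x of size 256 × 22 × 22), the returned response map has spatial size 18 × 18.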
In step 4), the specific steps of selecting the scale of the target search region with the highest response value as the scale corresponding to the current frame may be:
(1) In order to balance tracking accuracy and speed, three scales are selected for the multi-scale search, with the three scale factors λ_i taking the specific values [0.96385, 1, 1.0375]. Let the actual crop size of the search region in the previous frame be S_x; the actual crop size at each scale is then
S_x^(i) = λ_i × S_x,  i = 1, 2, 3
(2) The image blocks at the three scales are all resized to 255 × 255 and the scale search is performed: first the maximum value of the response map at each scale is found, then the maximum response values of the three scales are compared, and the scale containing the overall maximum response is the scale corresponding to the current frame.
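The scale selection can be sketched as below; select_scale is our own name, and track_one_scale is a hypothetical helper standing in for "crop at the given size around the previous position, resize to 255 × 255, and run step 3)":

```python
import numpy as np

SCALE_FACTORS = (0.96385, 1.0, 1.0375)  # the three scale factors lambda_i

def select_scale(track_one_scale, frame, center, s_x):
    """Multi-scale search of step 4 (a sketch).

    track_one_scale(frame, center, size) is assumed to return the final
    response map for one crop size as a 2-D numpy array.
    """
    responses = [track_one_scale(frame, center, lam * s_x)
                 for lam in SCALE_FACTORS]      # crop size per scale: lambda_i * S_x
    peaks = [r.max() for r in responses]        # maximum response at each scale
    best = int(np.argmax(peaks))                # the scale with the overall maximum wins
    dy, dx = np.unravel_index(responses[best].argmax(),
                              responses[best].shape)  # peak -> target position
    return best, (dy, dx), SCALE_FACTORS[best] * s_x
```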
In step 5), the specific steps of training the model may be:
(1) 53200 training pairs are sampled from the visual recognition dataset ILSVRC2015_VID; each training pair consists of a template and a search-region block, the template and search-region block of the same pair belong to the same video sequence, and the template's frame precedes that of the search-region block;
(2) Since the target of the original search region lies at the center of the region, the target center is randomly shifted by 0-8 pixels during training, improving the network's generalization to target displacement;
(3) The model is iteratively trained for 50 epochs with a batch size of 8, and the learning rate decays exponentially from 10^-2 to 10^-5. The training loss function L is a weighted cross-entropy loss, as follows:
L = (1/n) Σ_{i=1}^{n} w_i log(1 + e^(-y_i v_i))
where w represents the weight of each sample, y represents the true label value, v represents the value predicted by the model, and n represents the number of samples taken in a batch.
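A direct transcription of this loss in PyTorch; softplus(-y·v) equals log(1 + e^(-y·v)) and is numerically stable, and the labels are assumed to take values in {-1, +1}:

```python
import torch
import torch.nn.functional as F

def weighted_logistic_loss(v, y, w):
    """Weighted cross-entropy loss of step 5 (a sketch).

    v: predicted response-map scores, y: labels in {-1, +1},
    w: per-sample weights; all tensors of the same shape.
    Computes (1/n) * sum_i w_i * log(1 + exp(-y_i * v_i)).
    """
    return (w * F.softplus(-y * v)).mean()
```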
Aiming at the deficiency that the original twin-network-based target tracking methods are not robust enough to complex tracking scenes such as fast target motion, occlusion, rotation, and background clutter, the invention provides a target-specific response attention target tracking method based on a twin network. The method comprises five main parts: CNN feature extraction; channel-wise cross-correlation to generate response maps; generating weights with an attention network and weighting each channel response map; determining the target position on the final response map; and the training method of the proposed model.
Compared with existing attention mechanisms, the attention network of the invention is learned in an end-to-end manner. By combining the target-specific response attention model and the fully convolutional twin network in a unified framework, complex tracking scenes in actual tracking, such as fast target motion, occlusion, rotation, and background clutter, are handled effectively, and real-time operation is achieved.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention.
Fig. 2 is a response diagram of each channel obtained by the channel-by-channel cross-correlation operation in the present invention.
Fig. 3 is a comparison of the resulting response map of the present invention and a baseline algorithm.
Detailed Description
The following examples, given on the premise of the technical solution of the invention, describe the embodiments and specific operation processes of the invention in detail; the scope of protection of the invention, however, is not limited to these examples.
Referring to fig. 1, an embodiment of the present invention includes the steps of:
A. given a video sequence, the first frame contains a marked object. And defining a target template area Z and a target search area X, wherein the target template area is kept unchanged after being intercepted based on a given mark in a first frame, and the target search area is obtained in a current video frame to be tested and is an image block which is larger than the target template area and is intercepted by using the target position obtained from the previous frame.
B. Inputting the target template region and the target search region described in step A into the fully convolutional twin network yields the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X. The fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one extracts the features of the target template region and the other extracts the features of the target search region. Both branches share the same parameters. The process is described as follows:
F_x = ψ_p(X)
F_z = ψ_p(Z)
C. These two CNN features F_x and F_z are input into the target-specific response attention model, yielding a multi-channel response map S_multi weighted by the attention mechanism; S_multi is summed channel by channel to obtain the final response map S, and the position with the maximum response value in the response map is determined as the initial target position.
D. For scale estimation, a search scale pyramid is constructed in advance for the target search region of step A, centered on the target of the previous frame; step C is executed for the search region at each scale in the pyramid, the scale of the search region with the highest response value is selected as the current scale, and the target position and scale are combined to obtain the actual size and position of the target, realizing target tracking.
E. Model training: the twin network and the proposed target attention model are pre-trained on the visual recognition dataset ILSVRC2015_VID, whose training set contains 3862 video segments totalling more than 1.12 million frames, all annotated with category information and target positions. With the twin network and the target-specific response attention model combined into a unified framework, a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same sequence are input, and a response map is output. The loss function adopted during training is the cross-entropy loss L(y, v) = log(1 + e^(-yv)), where y is the target label and v is the score at the corresponding position of the response map.
The parameters of the single-frame target tracking process in step A are further described as follows:
A1. In the initial frame, the target template region cropped according to the ground-truth annotation is slightly larger than the actual target in order to capture some semantic information. The actual template crop size S_z is calculated according to the following formula:
S_z = sqrt((w_z + c)(h_z + c))
where c = (w_z + h_z)/2, w_z denotes the width of the target template region Z, and h_z denotes its height. The cropped template image block is then resized to 125 × 125, and the rescaling factor scale = 125/S_z is saved; this size is used to calculate the crop size of the search region.
A2. The search region is resized to 255 × 255. To ensure that the target scale in the search region is consistent with the scale of the template, the actual crop size of the search region is:
S_x = 255/scale
the deep full convolution neural network adopted in the step B comprises the following substeps:
B1. After the input images are resized, the network, a fully convolutional twin network of five convolutional layers, extracts the coarse-grained features of the input image. A structure similar to AlexNet is adopted (A. Krizhevsky, I. Sutskever, G. E. Hinton: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106-1114 (2012)).
B2. Network details: each convolutional layer is followed by a batch normalization layer and a ReLU activation layer. 1. The first convolutional layer: kernel size 11 × 11, stride 2, 3 input channels, 96 output channels; 2. a 3 × 3 max pooling layer with stride 2; 3. the second convolutional layer: kernel size 5 × 5, stride 1, 96 input channels, 256 output channels; 4. the second pooling layer: pooling size 3 × 3, stride 2; 5. the third convolutional layer: kernel size 3 × 3, stride 1, 256 input channels, 384 output channels; 6. the fourth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 384 output channels; 7. the fifth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 256 output channels.
B3. The whole tracking algorithm model has two branch networks, a template branch and a search-region branch; both branches use the network described in B2 with identical parameters. The attention model of the target-specific response immediately follows the search-region branch.
The target-specific response attention model in step C further comprises the following sub-steps:
C1. First, channel-by-channel cross-correlation is performed on the CNN features F_x and F_z of X and Z to obtain a multi-channel response map, expressed by the following formula:
S_multi = Corr_cw(F_x, F_z)
C2. F_x is input into an attention network H(·) to obtain the channel attention weights ω; the attention network consists of a global average pooling layer and a three-layer multilayer perceptron. This process can be expressed as:
ω = H(F_x)
C3. Weighting the calculated weights ω onto the multi-channel response map S_multi yields the weighted multi-channel response map Ŝ_multi = ω ⊗ S_multi; Ŝ_multi and S_multi are then added in a residual structure and summed over channels to obtain the final response map S_final. The overall process is represented by the following equation:
S_final = Σ_{c=1}^{C} (Ŝ_multi + S_multi)^(c)
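To make the wiring of steps B and C concrete, here is a minimal sketch that reuses the SiameseBackbone and TargetResponseAttention classes sketched earlier; the class name TrackerModel is our own:

```python
import torch
import torch.nn as nn

class TrackerModel(nn.Module):
    """Unified model of steps B and C: one shared backbone for both
    branches, with the target-specific response attention head placed
    after the search-region branch (a sketch)."""

    def __init__(self):
        super().__init__()
        self.backbone = SiameseBackbone()                # shared parameters
        self.attention = TargetResponseAttention(channels=256)

    def forward(self, z, x):
        f_z = self.backbone(z)  # template branch: F_z
        f_x = self.backbone(x)  # search-region branch: F_x
        return self.attention(f_x, f_z)

model = TrackerModel()
resp = model(torch.randn(1, 3, 125, 125), torch.randn(1, 3, 255, 255))
print(resp.shape)  # final response map, e.g. (1, 1, 18, 18)
```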
the parameters and flow in the multi-scale strategy of the single frame in the step D are further described as follows:
D1. During tracking, three scales are selected for the multi-scale search in order to balance tracking accuracy and speed, with the three scale factors λ_i taking the specific values [0.96385, 1, 1.0375]. Let the crop size of the search region in the previous frame be S_x; the actual crop size at each scale is then
S_x^(i) = λ_i × S_x,  i = 1, 2, 3
D2. The image blocks at the three scales are all resized to 255 × 255 and the scale search is performed: the maximum value of the response map at each scale is found first, the maximum response values of the three scales are compared, and the scale containing the overall maximum response is the scale corresponding to the current frame.
The parameters of the model training process in step E and the flow thereof are further described as follows:
E1. 53200 training pairs are sampled from ILSVRC2015_VID; each training pair consists of a template and a search-region block, the template and search-region block of the same pair belong to the same video sequence, and the template's frame precedes that of the search-region block.
E2. Since the target of the original search region lies at the center of the region, the target center is randomly shifted by 0 to 8 pixels during training, improving the network's generalization to target displacement.
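A sketch of this augmentation, assuming the 0-8 pixel shift is sampled independently per axis (the text does not fix the exact sampling scheme):

```python
import random

def jitter_center(cx, cy, max_shift=8):
    """Random target-center jitter of step E2 (a sketch).

    Shifts the search-area crop center by up to max_shift pixels per
    axis so the target is not always exactly centered during training.
    """
    return (cx + random.randint(-max_shift, max_shift),
            cy + random.randint(-max_shift, max_shift))
```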
E3. The model is iteratively trained for 50 epochs with a batch size of 8, and the learning rate decays exponentially from 10^-2 to 10^-5. The training loss function is a weighted cross-entropy loss, expressed as:
L = (1/n) Σ_{i=1}^{n} w_i log(1 + e^(-y_i v_i))
where w represents the weight of each sample, y represents the true label value, v represents the value predicted by the model, and n represents the number of samples taken in a batch.
Table 1 shows the video attribute analysis results of the method of the invention compared with other methods on OTB100.
(Table 1 and its continuation are provided as images in the original publication and are not reproduced here.)
SiamFC corresponds to the method proposed by Bertinetto, L. et al. (Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: European Conference on Computer Vision Workshops (ECCV Workshops), pp. 850-865 (2016));
SiamTri corresponds to the method proposed by Dong, X. et al. (Dong, X., Shen, J.: Triplet loss in siamese network for object tracking. In: European Conference on Computer Vision (ECCV), pp. 472-488 (2018));
SRDCF corresponds to the method proposed by Danelljan, M. et al. (Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4310-4318 (2015));
CSR-DCF corresponds to the method proposed by Lukežič, A. et al. (Lukežič, A., Vojíř, T., Zajc, L.Č., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4847-4856 (2017));
TRACA corresponds to the method proposed by Choi, J. et al. (Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y.: Context-aware deep feature compression for high-speed visual tracking. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 479-488 (2018));
CFNet corresponds to the method proposed by Valmadre, J. et al. (Valmadre, J., Bertinetto, L., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5000-5008 (2017));
ACFN corresponds to the method proposed by Choi, J. et al. (Choi, J., Chang, H.J., Yun, S., Fischer, T., Demiris, Y., Choi, J.Y.: Attentional correlation filter network for adaptive visual tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4828-4837 (2017));
Staple corresponds to the method proposed by Bertinetto, L. et al. (Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S.: Staple: Complementary learners for real-time tracking. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1401-1409 (2016));
KCF corresponds to the method proposed by Henriques, J.F. et al. (Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583-596 (2015)).
As can be seen from Table 1, the method can effectively handle complex tracking scenes in actual tracking, such as fast target motion, occlusion, rotation, and background clutter, and its performance is superior to that of the other trackers, demonstrating the effectiveness of the proposed method.

Claims (6)

1. A target-specific response attention target tracking method based on a twin network, characterized by comprising the following steps:
1) a video sequence is given, whose first frame contains an annotated target; a target template region Z and a target search region X are defined, where the target template region Z is cropped in the first frame based on the given annotation and then kept unchanged, and the target search region X is obtained from the current video frame under test by cropping, around the target position obtained in the previous frame, an image block larger than the target template region Z;
2) inputting the target template region and the target search region of step 1) into a fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X;
3) inputting the CNN features F_x and F_z obtained in step 2) into the target-specific response attention model to obtain a multi-channel response map S_multi weighted by the attention model; summing S_multi channel by channel gives the final response map S, and the position with the maximum response value in S is determined as the initial target position;
the specific steps of obtaining the final response map S are as follows:
(1) performing channel-by-channel cross-correlation on the CNN features F_x and F_z of the target search region X and the target template region Z to obtain a multi-channel response map, represented as:
S_multi = Corr_cw(F_x, F_z);
(2) f is to bexInputting an attention network H (·), obtaining an attention weight omega of a channel, wherein the attention network is composed of a global mean pooling and a three-layer multilayer perceptron and is represented as follows:
ω=H(Fx)
(3) weighting the calculated channel attention weights ω onto the multi-channel response map S_multi to obtain the weighted multi-channel response map Ŝ_multi = ω ⊗ S_multi; Ŝ_multi and S_multi are then added in a residual structure and summed over channels to obtain the final response map S_final, the overall process being represented by the following equation:
S_final = Σ_{c=1}^{C} (Ŝ_multi + S_multi)^(c)
4) taking the target position obtained in the previous frame in step 1) as the center, constructing a search scale pyramid for the target search region, executing step 3) for the search region at each estimated scale in the scale pyramid, selecting the scale of the target search region with the highest response value as the scale corresponding to the current frame, and combining the target position and scale to obtain the actual size and position of the target, thereby realizing target tracking;
5) training the model: model training is independent of the tracking process; after the model is trained offline, the trained model is used in the tracking steps.
2. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, wherein in step 1) the specific steps of obtaining the target search region X from the current video frame under test are as follows:
1.1 in the initial frame the target template region is cropped according to the ground-truth annotation slightly larger than the actual target in order to capture some semantic information; its size S_z is calculated according to the following formula:
S_z = sqrt((w_z + c)(h_z + c))
where c = (w_z + h_z)/2, w_z denotes the width of the target template region Z, and h_z denotes its height; the cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/S_z is saved, which is used to calculate the crop size of the search region;
1.2 the search region is resized to 255 × 255; to ensure that the target scale in the search region is consistent with the scale of the template, the actual crop size of the search region is: S_x = 255/scale.
3. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, characterized in that in step 2) the fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one for extracting the features of the target template region and one for extracting the features of the target search region, the two branches sharing the same parameters; the process is described as follows:
F_x = ψ_p(X)
F_z = ψ_p(Z)
the fully convolutional twin network and the proposed target attention model are pre-trained on the visual recognition dataset ILSVRC2015_VID by combining the twin network and the target-specific response attention model into a unified framework, inputting a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same video sequence, and outputting a response map, the loss function adopted during training being the cross-entropy loss:
L(k, u) = log(1 + e^(-ku))
where k is the target label and u is the score at the corresponding position of the response map.
4. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, characterized in that in step 2) the target template region and the target search region of step 1) are input into the fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X through the following specific steps:
2.1 after the input images of the fully convolutional twin network are resized, the network, a fully convolutional network of five convolutional layers, extracts the coarse-grained features of the input image;
2.2 network details: each convolutional layer is followed by a batch normalization layer and a ReLU activation layer; ① the first convolutional layer: kernel size 11 × 11, stride 2, 3 input channels, 96 output channels; ② the first pooling layer: 3 × 3 max pooling with stride 2; ③ the second convolutional layer: kernel size 5 × 5, stride 1, 96 input channels, 256 output channels; ④ the second pooling layer, identical to the first: pooling size 3 × 3, stride 2; ⑤ the third convolutional layer: kernel size 3 × 3, stride 1, 256 input channels, 384 output channels; ⑥ the fourth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 384 output channels; ⑦ the fifth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 256 output channels;
2.3 the whole tracking algorithm model has two branch networks, a template branch and a search-region branch; both branches use the network described in step 2.2 with identical parameters; the proposed target-specific response attention model immediately follows the search-region branch.
5. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, wherein in step 4) the specific steps of selecting the scale of the target search region with the highest response value as the scale corresponding to the current frame are:
4.1 in order to balance tracking accuracy and speed, three scales are selected for the multi-scale search, with the three scale factors λ_i taking the specific values [0.96385, 1, 1.0375]; let the actual crop size of the search region in the previous frame be S_x, then the actual crop size at each scale is
S_x^(i) = λ_i × S_x,  i = 1, 2, 3
4.2 the image blocks at the three scales are all resized to 255 × 255 and the scale search is performed: first the maximum value of the response map at each scale is found, then the maximum response values of the three scales are compared, the scale containing the overall maximum response being the scale corresponding to the current frame.
6. The twin network-based target specific response attention target tracking method as claimed in claim 1, wherein in step 5), the specific steps of training the model are:
5.1 sampling 53200 training pairs from the visual recognition dataset ILSVRC2015_VID, each training pair consisting of a template and a search-region block, where the template and search-region block of the same pair belong to the same video sequence and the template's frame precedes that of the search-region block;
5.2 for the original search region, the target lies at the center of the region; during training the target center is randomly shifted by 0 to 8 pixels to improve the network's generalization to target displacement;
5.3 the model is iteratively trained for 50 epochs with a batch size of 8, the learning rate decaying exponentially from 10^-2 to 10^-5; the training loss function L is a weighted cross-entropy loss, as follows:
L = (1/n) Σ_{i=1}^{n} w_i log(1 + e^(-y_i v_i))
where w represents the weight of each sample, y represents the true label value, v represents the value predicted by the model, and n represents the number of samples sampled in a batch.
CN202010081733.8A 2020-02-06 2020-02-06 Target specific response attention target tracking method based on twin network Active CN111291679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010081733.8A CN111291679B (en) 2020-02-06 2020-02-06 Target specific response attention target tracking method based on twin network


Publications (2)

Publication Number Publication Date
CN111291679A CN111291679A (en) 2020-06-16
CN111291679B true CN111291679B (en) 2022-05-27

Family

ID=71026700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010081733.8A Active CN111291679B (en) 2020-02-06 2020-02-06 Target specific response attention target tracking method based on twin network

Country Status (1)

Country Link
CN (1) CN111291679B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860248B (en) * 2020-07-08 2021-06-25 上海蠡图信息科技有限公司 Visual target tracking method based on twin gradual attention-guided fusion network
CN111787227A (en) * 2020-07-22 2020-10-16 苏州臻迪智能科技有限公司 Style migration method and device based on tracking shooting
CN111899283B (en) * 2020-07-30 2023-10-17 北京科技大学 Video target tracking method
CN112150504A (en) * 2020-08-03 2020-12-29 上海大学 Visual tracking method based on attention mechanism
CN112085718B (en) * 2020-09-04 2022-05-10 厦门大学 NAFLD ultrasonic video diagnosis system based on twin attention network
CN112288772B (en) * 2020-10-14 2022-06-07 武汉大学 Channel attention target tracking method based on online multi-feature selection
CN112348849B (en) * 2020-10-27 2023-06-20 南京邮电大学 Twin network video target tracking method and device
CN112215872B (en) * 2020-11-04 2024-03-22 上海海事大学 Multi-full convolution fusion single-target tracking method based on twin network
CN112560695B (en) * 2020-12-17 2023-03-24 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application
CN112598739B (en) * 2020-12-25 2023-09-01 哈尔滨工业大学(深圳) Mobile robot infrared target tracking method, system and storage medium based on space-time characteristic aggregation network
CN112750148B (en) * 2021-01-13 2024-03-22 浙江工业大学 Multi-scale target perception tracking method based on twin network
CN112785626A (en) * 2021-01-27 2021-05-11 安徽大学 Twin network small target tracking method based on multi-scale feature fusion
CN113192124A (en) * 2021-03-15 2021-07-30 大连海事大学 Image target positioning method based on twin network
CN113158881B (en) * 2021-04-19 2022-06-14 电子科技大学 Cross-domain pedestrian re-identification method based on attention mechanism
CN113205544B (en) * 2021-04-27 2022-04-29 武汉大学 Space attention reinforcement learning tracking method based on cross-over ratio estimation
CN113362373B (en) * 2021-06-01 2023-12-15 北京首都国际机场股份有限公司 Double-twin-network-based aircraft tracking method in complex apron area
CN113658218B (en) * 2021-07-19 2023-10-13 南京邮电大学 Dual-template intensive twin network tracking method, device and storage medium
CN113379806B (en) * 2021-08-13 2021-11-09 南昌工程学院 Target tracking method and system based on learnable sparse conversion attention mechanism
CN113628249B (en) * 2021-08-16 2023-04-07 电子科技大学 RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN113793359B (en) * 2021-08-25 2024-04-05 西安工业大学 Target tracking method integrating twin network and related filtering
CN113888590B (en) * 2021-09-13 2024-04-16 华南理工大学 Video target tracking method based on data enhancement and twin network
CN113870312B (en) * 2021-09-30 2023-09-22 四川大学 Single target tracking method based on twin network
CN113936040B (en) * 2021-10-15 2023-09-15 哈尔滨工业大学 Target tracking method based on capsule network and natural language query
CN113822233B (en) * 2021-11-22 2022-03-22 青岛杰瑞工控技术有限公司 Method and system for tracking abnormal fishes cultured in deep sea
CN114241003B (en) * 2021-12-14 2022-08-19 成都阿普奇科技股份有限公司 All-weather lightweight high-real-time sea surface ship detection and tracking method
CN116645399B (en) * 2023-07-19 2023-10-13 山东大学 Residual network target tracking method and system based on attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898620A (en) * 2018-06-14 2018-11-27 厦门大学 Method for tracking target based on multiple twin neural network and regional nerve network
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN110335290A (en) * 2019-06-04 2019-10-15 大连理工大学 Twin candidate region based on attention mechanism generates network target tracking method
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Twofold Siamese Network for Real-Time Object Tracking; Anfeng He et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-17; full text *
Multi-granularity Hierarchical Attention Siamese Network for Visual Tracking; Xing Chen et al.; 2018 International Joint Conference on Neural Networks (IJCNN); 2018-10-15; full text *
Target tracking method based on twin network and attention mechanism; Zhou Diya et al.; Information & Communications; 2019-12-15 (No. 12); full text *


Similar Documents

Publication Publication Date Title
CN111291679B (en) Target specific response attention target tracking method based on twin network
CN108549839B (en) Adaptive feature fusion multi-scale correlation filtering visual tracking method
CN111914664A (en) Vehicle multi-target detection and track tracking method based on re-identification
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN108647694B (en) Context-aware and adaptive response-based related filtering target tracking method
CN110569723A (en) Target tracking method combining feature fusion and model updating
CN108961308B (en) Residual error depth characteristic target tracking method for drift detection
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN111640138A (en) Target tracking method, device, equipment and storage medium
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN113192124A (en) Image target positioning method based on twin network
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN114898403A (en) Pedestrian multi-target tracking method based on Attention-JDE network
CN111340842A (en) Correlation filtering target tracking algorithm based on joint model
CN111507215A (en) Video target segmentation method based on space-time convolution cyclic neural network and cavity convolution
Li et al. Real-time pedestrian detection with deep supervision in the wild
Chen et al. Norm-aware embedding for efficient person search and tracking
Zhou et al. Discriminative attention-augmented feature learning for facial expression recognition in the wild
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN111275694A (en) Attention mechanism guided progressive division human body analytic model and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant