CN108596958B - Target tracking method based on difficult positive sample generation - Google Patents

Target tracking method based on difficult positive sample generation

Info

Publication number
CN108596958B
Authority
CN
China
Prior art keywords
difficult
network
layer
action
positive sample
Prior art date
Legal status
Active
Application number
CN201810443211.0A
Other languages
Chinese (zh)
Other versions
CN108596958A (en)
Inventor
李成龙
杨芮
王逍
汤进
罗斌
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN201810443211.0A
Publication of CN108596958A
Application granted
Publication of CN108596958B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G06T2207/10024 - Color image
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method based on difficult positive sample generation. For each video in the training data, a variational auto-encoder is used to learn the corresponding manifold, i.e., a positive sample generation network, and the encoding of an input image is fine-tuned to generate a large number of positive samples. The positive samples are input into a difficult positive sample conversion network, and an agent is trained to learn to occlude the target object with background image blocks; the agent continuously adjusts the bounding box so that the samples become difficult to recognize, thereby generating difficult positive samples, and the occluded difficult positive samples are output. Based on the generated difficult positive samples, a twin network is trained to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed. Based on difficult positive sample generation, the invention learns the manifold distribution of the target directly from data, thereby obtaining a large number of diverse positive samples.

Description

Target tracking method based on difficult positive sample generation
Technical Field
The invention relates to a visual tracking technology, in particular to a target tracking method based on difficult positive sample generation.
Background
Currently, mainstream deep-learning-based tracking methods generally comprise the following steps: first, collecting a large number of manually annotated videos; second, densely sampling positive and negative samples near the first-frame annotation box of each video; third, training a binary classifier with the samples obtained in the previous step; fourth, determining candidate regions near the search box, classifying them, and selecting the region with the highest score as the tracking result; and fifth, repeating the above steps until the video ends.
Shortcomings of the prior art: as shown in Fig. 1, the existing dense sampling method yields insufficient sample diversity; difficult samples are few, and the model is too sensitive to challenging factors. Since visual tracking only gives one bounding box as the initial condition, and tracked targets are diverse, tracking methods based on deep learning cannot obtain enough training samples; this is a typical small-sample learning problem. As shown in Fig. 2, in existing annotated videos, the frames containing the various challenging factors are very few.
The diversity of positive samples obtained by conventional dense sampling is insufficient, so the model easily over-fits and is too sensitive to challenging factors. Existing difficult positive samples are obtained from the prediction results of the model, namely: a threshold range is set, all samples whose confidence falls within that range are selected, and they are used in the next cycle to continue fine-tuning the model, making it more robust. However, this selection depends on the predictions of the model, and those predictions are not all accurate, which introduces uncertainty into the tracking model.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the samples obtained by the conventional dense sampling method are insufficient in diversity, difficult samples are few, and the model is too sensitive to challenging factors; the invention therefore provides a target tracking method based on difficult positive sample generation.
The invention solves the technical problem through the following technical scheme, comprising the following steps:
(1) acquiring annotated videos for training a deep tracking model;
(2) for each video in the training data, using a variational auto-encoder to learn the corresponding manifold, i.e., a positive sample generation network, and fine-tuning the encoding of the input image to generate a large number of positive samples;
(3) inputting the positive samples into a difficult positive sample conversion network and training an agent to learn to occlude the target object with background image blocks, the agent continuously adjusting the bounding box so that the samples become difficult to recognize, thereby generating difficult positive samples, and outputting the occluded difficult positive samples;
(4) based on the generated difficult positive samples, training a twin network to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed.
In step (1), a certain amount of tracking video is annotated manually; the annotation consists of finding, in each video frame, the same target object as in the first frame and giving the position of the target object in the current frame.
In step (2), the data is preprocessed and stored as an h5 file readable by the deep neural network; a sampled image block whose overlap with the ground truth exceeds a discrimination threshold is a positive sample, and one whose overlap is below a preset threshold is a negative sample.
The difficult positive sample conversion network comprises convolutional layers, which enhance the original signal features and reduce noise through the convolution operation;
pooling layers, which exploit the locality of images to reduce the number of features by down-sampling;
and fully connected layers, in which each neuron is connected to every neuron of the next layer and which perform ordinary classification.
The positive sample image to be occluded is taken as input; the convolution operation is performed from the input layer to the convolutional layer, where each neuron of the convolutional layer can be connected to a local receptive field of a certain size in the input layer, and the features of the image to be occluded are obtained after convolution; the step from the convolutional layer to the pooling layer reduces the number of features of the previous layer; the features obtained after the convolutional and pooling layers are classified by the fully connected layer, and the result is finally output after its computation; each output node of the fully connected layer gives the probability of the agent selecting the corresponding action, i.e., the probability that, in the current state, that action should be executed to change the occluding region.
The training process adopts the following main deep learning parameter settings: for the positive sample generation network, the initial learning rate is 0.001, the optimization algorithm is RMSprop, and the number of training iterations is 20,000; for the difficult positive sample conversion network, the mini-batch of training data is 100, the optimization method is Adam, and the initial learning rate is 1e-6; for the twin network subsequently used for tracking, the initial learning rate is 0.0001, the momentum is 0.9, and the weight decay parameter is 0.0005.
The deep learning process is as follows:
for the positive sample generation network, two fully connected layers are used to encode the input image, which is flattened into a column vector; the encoded features are then fed into two fully connected branches to estimate the mean and standard deviation, passed through three fully connected layers, and the reconstructed image is finally output;
for the difficult positive sample conversion network, a sample is input into a pre-trained VGG network, and the action selected by the agent in the current state is output through two fully connected layers; a new occluded sample is obtained by executing the action, and its similarity to the ground truth is computed; if the action is a movement action, a positive reward is given when the similarity decreases, otherwise a negative reward is given;
if the action is the stop action, a positive reward is given when the similarity is below a certain threshold, otherwise a negative reward is given;
for the twin network (i.e., a Siamese network) used for tracking, there are two branch networks, used respectively to encode the target to be searched and the candidate search region of the current frame, and the parameters of the two branches are shared;
the training of the network is based on positive and negative sample pairs, i.e.: if the overlap of two image blocks is greater than a certain threshold they are regarded as the same image block and given the label 1, otherwise they are regarded as different image blocks and given the label 0;
and the Margin Contrastive Loss function is used to measure the difference between the output of the model and the ground-truth pair; this difference can be back-propagated through the network layer by layer to train the parameters of the model.
The actions comprise moving actions and stopping actions; the movement action represents a change of the current observation region; the stop action means that the occlusion area of the current frame has been found, and the search process of the current video frame is stopped, specifically: move up, move down, move left, move right, zoom out, enlarge, thin, flatten, stop.
Suppose the score of the occluded target at time t is S_t and the score at the previous time is S_(t-1). The reward function for a movement action is set to:
R_move(s, s') = +1 if S_t < S_(t-1), and -1 otherwise,
where s and s' represent the current state and the state at the next instant, respectively.
The reward function for the stop action is:
R_stop(s) = +1 if S_t < φ, and -1 otherwise,
where φ is a preset threshold parameter. This reward function shows that if the agent chooses the stop action at the current moment, the similarity between the currently occluded sample and the true sample is calculated; if that similarity is below the threshold, a positive reward is given, otherwise a negative reward is given.
Compared with the prior art, the invention has the following advantages: based on difficult positive sample generation, the invention uses a generative model on the existing training data set to learn the manifold distribution of the target directly from the data, so that a large number of diverse positive samples can be obtained without additional manual annotation.
The generation of difficult positive samples is treated as a sequential decision problem, and a reinforcement learning algorithm is used to automatically learn how to occlude the target, simulating real occlusion situations and thereby obtaining more challenging positive samples.
With the model trained on difficult positive samples, the test time is not noticeably increased, while the robustness and tracking accuracy of the tracking algorithm are improved significantly.
Drawings
FIG. 1 is a schematic diagram of a dense sampling method commonly used in prior art visual tracking;
FIG. 2 is a representation of various challenging factors in a video;
FIG. 3 is a flow chart of a difficult positive sample generation method of the present invention;
FIG. 4 shows samples obtained by constructing and sampling the positive sample manifold according to the present invention;
panel (a) shows the process of encoding a real video frame on the learned target manifold, perturbing the code and then decoding a simulated target image; panel (b) shows the actions selected and performed by the agent at each moment, i.e., the process of occluding the target object with background image blocks;
FIG. 5 shows difficult positive samples obtained by the present invention;
FIG. 6 is a flow chart of the present invention;
FIG. 7 is a diagram illustrating the operation of bounding box transformation in the deep reinforcement learning algorithm of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
As shown in Figs. 3 to 7, the method of this embodiment, based on difficult positive sample generation, includes the following steps:
(1) acquiring annotated videos for training a deep tracking model;
(2) for each video in the training data, a variational auto-encoder is used to learn the corresponding manifold (i.e., a positive sample generation network); the network structure of the positive sample generation network consists of fully connected layers, the input of the network is an image matrix flattened into a column vector, the output is the reconstructed image vector, and size normalization is then carried out to obtain a conventional color image.
(3) For the reconstructed positive samples, this embodiment performs conversion into difficult positive samples; specifically, the generation of difficult positive samples is treated as a sequential decision problem, and this part of the network is learned with deep reinforcement learning. The network structure of the difficult positive sample conversion network comprises convolutional layers, pooling layers and fully connected layers. The input of the network is a color image; the agent continuously adjusts the bounding box (i.e., moving, scaling and similar operations) to make the sample difficult to recognize, thereby generating difficult positive samples, and the occluded difficult positive samples are output.
(4) Based on the generated difficult positive samples, this embodiment trains the twin network to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed.
The difficult positive sample conversion network may include three types of layers: convolutional, pooling and fully connected, as follows:
a convolutional layer (Convolutional Layer), which enhances the original signal features and reduces noise through the convolution operation; the specific convolution operation can be implemented with the prior art;
a pooling layer (Pooling Layer), which exploits the locality of images to reduce the number of features by down-sampling, and can use max pooling, mean pooling, stochastic pooling and similar modes; the specific implementation can follow the prior art;
a fully connected layer (Fully Connected Layer), in which each neuron is connected to every neuron of the next layer; it performs ordinary classification like a conventional Multi-Layer Perceptron (MLP) neural network.
The positive sample image to be occluded is taken as input; the convolution operation is performed from the input layer to the convolutional layer, where each neuron of the convolutional layer can be connected to a local receptive field of a certain size in the input layer, and the features of the image to be occluded are obtained after convolution. The step from the convolutional layer to the pooling layer may be called pooling, with the goal of reducing the number of features of the previous layer. The features obtained after the convolutional and pooling layers are classified by the fully connected layer, and the result is finally output after its computation.
Each output node of the fully connected layer gives the probability of the agent selecting the corresponding action, i.e., the probability that, in the current state, that action should be executed to change the occluding region.
By continuously training the parameters of the positive sample generation network and the difficult positive sample conversion network, ideal difficult positive samples can be obtained; deep learning thus completes the generation of difficult positive samples automatically, without manual participation.
The training process is implemented with the deep learning toolkits Keras and Caffe, and the main parameters are set as follows: for the positive sample generation network, the initial learning rate is 0.001, the optimization algorithm is RMSprop, and the number of training iterations is 20,000; for the difficult positive sample conversion network, the mini-batch of training data is 100, the optimization method is Adam, and the initial learning rate is 1e-6; for the twin network subsequently used for tracking, the initial learning rate is 0.0001, the momentum is 0.9, and the weight decay parameter is 0.0005.
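For reference, these hyper-parameter settings can be summarized in a small configuration block; this is only a restatement of the values listed above, and the dictionary keys are illustrative rather than part of the patent:

```python
# Summary of the training settings stated above; dictionary keys are illustrative.
TRAIN_CONFIG = {
    "positive_sample_generation_net": {"optimizer": "RMSprop", "lr": 1e-3, "iterations": 20000},
    "difficult_positive_conversion_net": {"optimizer": "Adam", "lr": 1e-6, "mini_batch": 100},
    "twin_tracking_net": {"optimizer": "SGD", "lr": 1e-4, "momentum": 0.9, "weight_decay": 5e-4},
}
```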
The embodiment can automatically generate the hard positive sample by using the deep learning network and train the deep tracking algorithm, and the specific operation can include the following steps:
collection of annotated data:
To better train the deep network, a certain amount of tracking video needs to be annotated manually; the annotation consists of finding, in each video frame, the same target object as in the first frame and giving the position of the target object in the current frame.
Preprocessing:
Since the method proposed by this embodiment requires a pre-trained tracking model, the data is preprocessed first and stored as an h5 file readable by the deep neural network. In this process, the discrimination thresholds for positive and negative samples are set as follows: a sampled image block whose overlap with the ground truth reaches 0.7 is regarded as a positive sample, and one whose overlap is below 0.5 is regarded as a negative sample.
Designing a deep neural network:
the network structure may comprise three components, respectively: convolutional layers (convolutional layers), pooling layers (PoolingLayer), and fully connected layers (fullonnectedLayer).
For the positive sample generation network, this embodiment uses two fully connected layers to encode the input image, which is flattened into a column vector; the encoded features are then fed into two fully connected branches to estimate the mean and standard deviation, passed through three fully connected layers, and the reconstructed image is finally output.
For the difficult positive sample conversion network, a sample is input into a pre-trained VGG network, and the action selected by the agent in the current state is output through two fully connected layers; a new occluded sample is obtained by executing the action, and its similarity to the ground truth is computed. If the action is a movement action, a positive reward is given when the similarity decreases, otherwise a negative reward is given; if the action is the stop action, a positive reward is given when the similarity is below a certain threshold, otherwise a negative reward is given.
For the twin network used for tracking, there are two branch networks, used respectively to encode the target to be searched and the candidate search region of the current frame, and the parameters of the two branches are shared. The training of the network is based on positive and negative sample pairs, i.e.: if the overlap of two image blocks is greater than a certain threshold they are regarded as the same image block and given the label 1, otherwise they are regarded as different image blocks and given the label 0. The Margin Contrastive Loss function is used to measure the difference between the model output and the true image pair; this difference can be back-propagated through the network layer by layer to train the parameters of the model.
Training of models
This embodiment may use existing deep network training tools to train the model, such as the Keras toolkit and Caffe. When using Caffe, a solver file can be defined, which specifies how the model is optimized (trained), i.e. the parameter back-propagation algorithm. The key parameters may include the base learning rate, the learning momentum, the weight penalty (weight decay) coefficient, and so on. The base learning rate can be set to 0.0001-0.01, the learning momentum can range from 0.9 to 0.99, and the weight penalty coefficient can range from 0.0001 to 0.001.
In a specific implementation, the three main network modules in this embodiment can operate in batches, identifying and tracking multiple target images at the same time. The three sub-network modules are described below:
the first subnetwork module utilizes a positive sample generation network to carry out positive sample expansion:
the module realizes the learning of the positive sample flow pattern by adopting a variational self-encoder network. This embodiment extracts the target object from the video frame, then unifies its resolution into 64 x 64, and then pulls it into a column vector with dimension 12288(64 x 3). The dimension of the middle fully-connected layer is 512 dimensions, and the dimension of the hidden layer coding is 2. The output dimension of the network after reconstruction is 12288, and then the resolution is adjusted to 64 × 3, i.e. the reconstructed image is obtained. In addition, in this embodiment, the network structure of the diversity self-encoder may also adopt a convolution structure. In order to obtain a cleaner stream, the present embodiment performs stream construction separately for each video. In other words, the present embodiment can perform the learning of the variational self-encoder for each training video by using the obtained target object.
The second sub-network module performs difficult positive sample conversion using the difficult positive sample conversion network:
The resolution of the input image is unified to 224 × 224 and it is input into the VGG network to obtain the feature expression of the corresponding image; the size of the feature map is 512 × 7 × 7 = 25088.
The deep Q-network immediately follows the VGG network; specifically, it is composed of three fully connected layers, whose dimensions are: 1024, 9.
The dimension of the final output corresponds to the length of the action list designed in this embodiment and represents the probability of selecting the corresponding actions.
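A minimal Keras sketch of this deep Q-network is given below. The pre-trained VGG-16 backbone, the 224 × 224 input and the 9-way output follow the text; the single 1024-dimensional hidden layer is taken from the dimensions listed above together with the earlier two-fully-connected-layer description, so the exact depth of the head should be read as an assumption:

```python
# Hedged sketch of the difficult positive sample conversion (deep Q) network.
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Flatten, Dense
from keras.models import Model

state_img = Input(shape=(224, 224, 3))
vgg = VGG16(include_top=False, weights='imagenet', input_tensor=state_img)  # pre-trained feature extractor
features = Flatten()(vgg.output)                  # 512 x 7 x 7 = 25088-dimensional state features
hidden = Dense(1024, activation='relu')(features)
q_values = Dense(9)(hidden)                       # one Q value per action (8 movement actions + stop)
q_network = Model(state_img, q_values)

# Keeping the VGG backbone frozen is an assumption of this sketch, not stated in the text.
for layer in vgg.layers:
    layer.trainable = False
```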
this embodiment considers the difficult positive sample conversion process as a sequence decision process, specifically:
the state is as follows: the present embodiment may normalize the image to 224 × 224, then input the image into the VGG network, and then extract the features of the 8th layer as the state of the current step;
the actions are as follows: there are two types of actions in this embodiment, namely: a moving operation and a stopping operation; the movement action represents a change of the current observation region; the stop action indicates that an occlusion region of the current frame has been found and the search process of the current video frame is stopped. In the present embodiment, 8 moving actions and one stopping action are designed, as shown in fig. 7, which respectively are: move up, move down, move left, move right, zoom out, enlarge, thin, flatten, stop.
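For illustration, the sketch below applies each of the nine actions to an occluder bounding box. The action names follow the list above, but the step size (10% of the box size) and the exact geometric effect of "zoom out", "enlarge", "thin" and "flatten" are assumptions, since the text only names the actions:

```python
# Hedged sketch of the bounding-box transformations behind the nine actions (cf. Fig. 7).
ALPHA = 0.1   # assumed step size as a fraction of the current box size

def apply_action(box, action):
    """box = [x, y, w, h] of the occluding region; returns the adjusted box, or None for 'stop'."""
    x, y, w, h = box
    dx, dy = ALPHA * w, ALPHA * h
    moves = {
        "move_up":    [x, y - dy, w, h],
        "move_down":  [x, y + dy, w, h],
        "move_left":  [x - dx, y, w, h],
        "move_right": [x + dx, y, w, h],
        "zoom_out":   [x + dx / 2, y + dy / 2, w - dx, h - dy],   # shrink, keeping the centre
        "enlarge":    [x - dx / 2, y - dy / 2, w + dx, h + dy],
        "thin":       [x + dx / 2, y, w - dx, h],                 # reduce width only
        "flatten":    [x, y + dy / 2, w, h - dy],                 # reduce height only
    }
    return None if action == "stop" else moves[action]
```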
Rewarding: the goal of the agent in this embodiment is to receive the maximum reward, so the design of the reward function will be the key to the success of the strategy. Suppose that the score of an occluded object at time t is StThe score at the previous time is St-1The reward function for a movement action may be set as:
Figure BDA0001656565340000071
where s and s' represent the current and next instant states, respectively.
The stop action has no next-moment state, so this embodiment designs a separate reward function for it:
R_stop(s) = +1 if S_t < φ, and -1 otherwise,
where φ is a preset threshold parameter. This reward function states that if the agent chooses the stop action at the current moment, the similarity between the currently occluded sample and the true sample is calculated; a positive reward is given if the similarity is below the threshold, otherwise a negative reward is given.
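The two reward functions can be transcribed directly (a sketch; S_t is the matching score of the occluded sample against the true target, and phi is the preset threshold):

```python
# Hedged transcription of the reward functions above.
def movement_reward(score_t, score_t_prev):
    """Positive reward when the occlusion lowers the matching score, negative otherwise."""
    return 1.0 if score_t < score_t_prev else -1.0

def stop_reward(score_t, phi):
    """Reward for the stop action: positive when the occluded sample already scores below phi."""
    return 1.0 if score_t < phi else -1.0
```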
The agent continuously interacts with the environment to obtain a large number of training samples, which are stored in a replay (playback) unit; mini-batch samples are then drawn from it to learn the occlusion strategy. Besides the replay unit, another important way to break the correlation between data is the use of a target network: specifically, the model parameters are copied every τ steps, and the state at the current time and the state at the next time are input into the online network and the target network, respectively, to perceive the environment.
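A minimal sketch of the replay unit and the periodic target-network copy is shown below; the buffer capacity and the copy interval tau are illustrative values, not taken from the patent:

```python
# Hedged sketch: experience replay and target network update for the occlusion agent.
import random
from collections import deque

replay_buffer = deque(maxlen=50000)   # the "playback unit"; capacity is an assumption
TAU = 1000                            # copy the online weights to the target network every tau steps

def store_transition(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=100):
    return random.sample(list(replay_buffer), batch_size)

def maybe_update_target(step, online_net, target_net):
    """online_net perceives the current state; target_net evaluates the next state."""
    if step % TAU == 0:
        target_net.set_weights(online_net.get_weights())
```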
The third sub-network module trains the deep tracking model using the twin network.
The video frame to be tracked is convolved with the kernels of the first convolutional layer: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 64, and the number of convolution kernel parameters is 64 × 3 × 3 × 3 = 1728;
through the first pooling layer (Pooling Layer), the pooling range size (kernel size) can be 2 × 2, moving 2 pixels each time;
through the second convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 128, and the number of convolution kernel parameters is 128 × 3 × 3 × 64 = 73728;
through the second pooling layer (Pooling Layer), the pooling range size (kernel size) can be 2 × 2, moving 2 pixels each time;
through the third convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 256, and the number of convolution kernel parameters is 256 × 3 × 3 × 128 = 294912;
through the fourth convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 512, and the number of convolution kernel parameters is 512 × 3 × 3 × 256 = 1179648;
through the fifth convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 512, and the number of convolution kernel parameters is 512 × 3 × 3 × 512 = 2359296;
through a region-of-interest pooling layer (ROI Pooling Layer), feature maps of different sizes are mapped into vectors of uniform dimension and then input into the next fully connected layer;
through the fully connected layer (Fully Connected Layer), the number of nodes can be 4096, and the number of related weight parameters can be 4096 × 4096 = 16777216;
the obtained features are normalized through an L2 normalization layer;
finally, a distance measure is computed between the features obtained from the two input branches, using the Margin Contrastive Loss, and the similarity between the two given image blocks is output.
In the implementation, each convolutional layer can be followed by a nonlinearity, and each fully connected layer can be followed by a nonlinearity and a dropout layer to avoid overfitting.
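A minimal Keras sketch of the matching part is given below: the two branches share one encoder, their L2-normalised features are compared by Euclidean distance, and the Margin Contrastive Loss described above is applied. The encoder constructor `build_branch` and the margin value are placeholders/assumptions:

```python
# Hedged sketch: twin (Siamese) matching with a margin contrastive loss.
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def euclidean_distance(tensors):
    a, b = tensors
    return K.sqrt(K.maximum(K.sum(K.square(a - b), axis=1, keepdims=True), K.epsilon()))

def margin_contrastive_loss(y_true, dist, margin=1.0):
    # label 1: same target (pull features together); label 0: different (push apart by at least `margin`)
    return K.mean(y_true * K.square(dist) +
                  (1.0 - y_true) * K.square(K.maximum(margin - dist, 0.0)))

def build_siamese(build_branch, input_shape):
    """build_branch returns the shared encoder, e.g. the conv / pool / fc / L2-norm stack above."""
    branch = build_branch(input_shape)
    target_patch = Input(shape=input_shape)
    candidate_patch = Input(shape=input_shape)
    dist = Lambda(euclidean_distance)([branch(target_patch), branch(candidate_patch)])
    model = Model([target_patch, candidate_patch], dist)
    model.compile(optimizer='sgd', loss=margin_contrastive_loss)  # SGD per the settings above
    return model
```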
By adopting the model provided by this embodiment, the robustness of the model can be improved significantly, and good experimental results are obtained on multiple public data sets.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A target tracking method based on difficult positive sample generation is characterized by comprising the following steps:
(1) acquiring annotated videos for training a deep tracking model;
(2) for each video in the training data, using a variational auto-encoder to learn the corresponding manifold, i.e., a positive sample generation network, and fine-tuning the encoding of the input image to generate a large number of positive samples;
(3) inputting the positive samples into a difficult positive sample conversion network and training an agent to learn to occlude the target object with background image blocks, the agent continuously adjusting the bounding box so that the samples become difficult to recognize, thereby generating difficult positive samples, and outputting the occluded difficult positive samples;
(4) based on the generated difficult positive samples, training a twin network to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed.
2. The method for tracking a target based on difficult positive sample generation according to claim 1, wherein in step (1) a certain amount of tracking video is annotated manually, the annotation comprising finding, in each video frame, the same target object as in the first frame and giving the position of the target object in the current frame.
3. The method for tracking a target based on difficult positive sample generation according to claim 1, wherein in step (2) the data is preprocessed and stored as an h5 file readable by the deep neural network; a sampled image block whose overlap with the ground truth exceeds a discrimination threshold is a positive sample, and one whose overlap is below a preset threshold is a negative sample.
4. The target tracking method based on difficult positive sample generation according to claim 1, wherein the difficult positive sample conversion network comprises convolutional layers, which enhance the original signal features and reduce noise through the convolution operation;
pooling layers, which exploit the locality of images to reduce the number of features by down-sampling;
and fully connected layers, in which each neuron is connected to every neuron of the next layer and which perform ordinary classification.
5. The method for tracking a target based on difficult positive sample generation according to claim 4, wherein the positive sample image to be occluded is taken as input; the convolution operation is performed from the input layer to the convolutional layer, where each neuron of the convolutional layer can be connected to a local receptive field of a certain size in the input layer, and the features of the image to be occluded are obtained after convolution; the step from the convolutional layer to the pooling layer reduces the number of features of the previous layer; the features obtained after the convolutional and pooling layers are classified by the fully connected layer, and the result is finally output after its computation; each output node of the fully connected layer gives the probability of the agent selecting the corresponding action, i.e., the probability that, in the current state, that action should be executed to change the occluding region.
6. The method for tracking a target based on difficult positive sample generation according to claim 1, wherein the training process adopts deep learning: for the positive sample generation network, the initial learning rate is 0.001, the optimization algorithm is RMSprop, and the number of training iterations is 20,000; for the difficult positive sample conversion network, the mini-batch of training data is 100, the optimization method is Adam, and the initial learning rate is 1e-6; for the twin network subsequently used for tracking, the initial learning rate is 0.0001, the momentum is 0.9, and the weight decay parameter is 0.0005.
7. The method for tracking the target based on the generation of the difficult positive sample as claimed in claim 6, wherein the deep learning process is as follows:
for the positive sample generation network, two fully connected layers are used to encode the input image, which is flattened into a column vector; the encoded features are then fed into two fully connected branches to estimate the mean and standard deviation, passed through three fully connected layers, and the reconstructed image is finally output;
for the difficult positive sample conversion network, a sample is input into a pre-trained VGG network, and the action selected by the agent in the current state is output through two fully connected layers; a new occluded sample is obtained by executing the action, and its similarity to the ground truth is computed; if the action is a movement action, a positive reward is given when the similarity decreases, otherwise a negative reward is given;
if the action is the stop action, a positive reward is given when the similarity is below a certain threshold, otherwise a negative reward is given;
for the twin network used for tracking, there are two branch networks, used respectively to encode the target to be searched and the candidate search region of the current frame, and the parameters of the two branches are shared;
the training of the network is based on positive and negative sample pairs, i.e.: if the overlap of two image blocks is greater than a certain threshold they are regarded as the same image block and given the label 1, otherwise they are regarded as different image blocks and given the label 0;
and the Margin Contrastive Loss function is used to measure the difference between the output of the model and the ground-truth pair; this difference can be back-propagated through the network layer by layer to train the parameters of the model.
8. The method of claim 5, wherein the actions comprise a moving action and a stopping action; the movement action represents a change of the current observation region; the stop action means that the occlusion area of the current frame has been found, and the search process of the current video frame is stopped, specifically: move up, move down, move left, move right, zoom out, enlarge, thin, flatten, stop.
9. The method for tracking a target based on difficult positive sample generation according to claim 8, wherein, assuming the score of the occluded object at time t is S_t and the score at the previous time is S_(t-1), the reward function for a movement action is set to:
R_move(s, s') = +1 if S_t < S_(t-1), and -1 otherwise,
where s represents the state at time t-1 and s' represents the state at time t.
10. The method for tracking a target based on difficult positive sample generation according to claim 8, wherein the reward function for the stop action is:
R_stop(s) = +1 if S_t < φ, and -1 otherwise,
wherein φ is a preset threshold parameter, and the reward function shows that if the agent chooses the stop action at the current moment, the similarity between the currently occluded sample and the true sample is calculated; if the similarity is below the threshold a positive reward is given, otherwise a negative reward is given.
CN201810443211.0A 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation Active CN108596958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810443211.0A CN108596958B (en) 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810443211.0A CN108596958B (en) 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation

Publications (2)

Publication Number Publication Date
CN108596958A CN108596958A (en) 2018-09-28
CN108596958B true CN108596958B (en) 2021-06-04

Family

ID=63636958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810443211.0A Active CN108596958B (en) 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation

Country Status (1)

Country Link
CN (1) CN108596958B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614979B (en) * 2018-10-11 2023-05-02 北京大学 Data augmentation method and image classification method based on selection and generation
CN109559358B (en) * 2018-10-22 2023-07-04 天津大学 Image sample up-sampling method based on convolution self-coding
CN111192288B (en) * 2018-11-14 2023-08-04 天津大学青岛海洋技术研究院 Target tracking algorithm based on deformation sample generation network
CN109800689B (en) * 2019-01-04 2022-03-29 西南交通大学 Target tracking method based on space-time feature fusion learning
CN109885482A (en) * 2019-01-16 2019-06-14 重庆大学 Software Defects Predict Methods based on the study of few sample data
CN109919183B (en) * 2019-01-24 2020-12-18 北京大学 Image identification method, device and equipment based on small samples and storage medium
CN109753975B (en) * 2019-02-02 2021-03-09 杭州睿琪软件有限公司 Training sample obtaining method and device, electronic equipment and storage medium
CN110084146B (en) * 2019-04-08 2021-06-04 清华大学 Pedestrian detection method and device based on shielding perception self-supervision learning
CN110349176B (en) * 2019-06-28 2021-04-06 华中科技大学 Target tracking method and system based on triple convolutional network and perceptual interference learning
CN110415271B (en) * 2019-06-28 2022-06-07 武汉大学 Appearance diversity-based method for tracking generation twin-resisting network target
CN110610197B (en) * 2019-08-19 2022-09-27 北京迈格威科技有限公司 Method and device for mining difficult sample and training model and electronic equipment
CN110852285B (en) * 2019-11-14 2023-04-18 腾讯科技(深圳)有限公司 Object detection method and device, computer equipment and storage medium
CN110991337B (en) * 2019-12-02 2023-08-25 山东浪潮科学研究院有限公司 Vehicle detection method based on self-adaptive two-way detection network
CN111565318A (en) * 2020-05-06 2020-08-21 中国科学院重庆绿色智能技术研究院 Video compression method based on sparse samples
CN111862158B (en) * 2020-07-21 2023-08-29 湖南师范大学 Staged target tracking method, device, terminal and readable storage medium
CN112801182B (en) * 2021-01-27 2022-11-04 安徽大学 RGBT target tracking method based on difficult sample perception
CN112784929B (en) * 2021-03-14 2023-03-28 西北工业大学 Small sample image classification method and device based on double-element group expansion
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113129337B (en) * 2021-04-14 2022-07-19 桂林电子科技大学 Background perception tracking method, computer readable storage medium and computer device
CN113258996B (en) * 2021-07-05 2021-09-17 南京华脉科技股份有限公司 Optical cable monitoring method in submarine cable production and laying process based on artificial intelligence


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004038602A1 (en) * 2002-10-24 2004-05-06 Warner-Lambert Company, Llc Integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications
CN102903122A (en) * 2012-09-13 2013-01-30 西北工业大学 Video object tracking method based on feature optical flow and online ensemble learning
CN103559237A (en) * 2013-10-25 2014-02-05 南京大学 Semi-automatic image annotation sample generating method based on target tracking
WO2016142285A1 (en) * 2015-03-06 2016-09-15 Thomson Licensing Method and apparatus for image search using sparsifying analysis operators

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chao Wang, et al., "A system of automated training sample generation for visual-based car detection", 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012-12-24, pp. 4169-4167 *
C. Xu, et al., "Robust visual tracking via online multiple instance learning with fisher information", Pattern Recognition, 2015-12-31, vol. 48, no. 12, pp. 3917-3926 *
Ao Wei, et al., "Improved algorithm for the generalization ability of extreme learning machine based on simulated sample generation" (in Chinese), Journal of Nanjing University, 2018-01-31, vol. 54, no. 1, pp. 75-84 *

Also Published As

Publication number Publication date
CN108596958A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108596958B (en) Target tracking method based on difficult positive sample generation
Dai et al. Human action recognition using two-stream attention based LSTM networks
CN109961034B (en) Video target detection method based on convolution gating cyclic neural unit
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113688723A (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN112446342B (en) Key frame recognition model training method, recognition method and device
CN113313123B (en) Glance path prediction method based on semantic inference
CN109461177B (en) Monocular image depth prediction method based on neural network
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN113807356A (en) End-to-end low visibility image semantic segmentation method
Cai et al. Multiscale attentive image de-raining networks via neural architecture search
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
CN117197632A (en) Transformer-based electron microscope pollen image target detection method
CN111242003A (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
Wang et al. Research on the multi-scale network crowd density estimation algorithm based on the attention mechanism
Zhang [Retracted] An Intelligent and Fast Dance Action Recognition Model Using Two‐Dimensional Convolution Network Method
CN115294353A (en) Crowd scene image subtitle description method based on multi-layer attribute guidance
CN113642596A (en) Brain network classification method based on community detection and double-path self-coding
Yang et al. Instance-aware detailed action labeling in videos
CN117593666B (en) Geomagnetic station data prediction method and system for aurora image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant