CN113888595B - Twin network single-target visual tracking method based on difficult sample mining - Google Patents
Info
- Publication number
- CN113888595B (application CN202111152770.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- image
- sample
- images
- difficult
- Prior art date
- 2021-09-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The invention discloses a twin network single-target tracking method based on difficult sample mining, comprising steps such as constructing a training set and constructing a convolutional twin network based on difficult sample mining. The invention introduces difficult sample mining into the target tracking method: difficult negative samples are mined as training data during training and used to update the network parameters, and a difficult-sample triplet loss is chosen as the loss function so that the difficult samples are continuously optimized. By optimizing this loss, the model keeps mining difficult negative samples throughout training, the network is fully trained, similar targets are better distinguished, the model learns discriminative features, and the target tracking effect is improved.
Description
Technical Field
The invention belongs to the technical field of computer vision, relates to an image processing technology, and particularly relates to a twin network single target tracking method based on difficult sample mining.
Background
Single-target visual tracking is one of the popular yet challenging research topics in computer vision. It is widely used in intelligent video surveillance, robot visual navigation, medical diagnosis, positioning and tracking of underwater organisms, and the like, and has broad development prospects. Visual target tracking means designating the target to be tracked in the first frame of a video sequence and calibrating its initial position, and then predicting the position and size of the target in subsequent frames so as to track it accurately.
Early classical algorithms all operate in the time domain; they involve complex computations, and the heavy computational load makes tracking poorly real-time. Algorithms based on correlation filtering then appeared: by introducing correlation filtering, these target tracking methods convert the computation into the frequency domain, which greatly reduces the amount of operations and greatly improves the speed. With the development of deep learning, researchers have introduced deep learning techniques into target tracking, and a series of methods have been proposed and have achieved good results.
In recent years, methods that track targets with a twin network have received unprecedented attention. Existing methods use a convolutional neural network to extract features for target modeling. In the target tracking process, offline training on the tracked target is one of the keys to the performance of a tracking model, and the selection of training data is particularly important during offline training. Existing twin-network-based methods only use the target region: the features extracted from the target region are directly correlated with the features of the test frame image, so the robustness is poor, complex scenes such as similar objects cannot be handled, and the discrimination capability is insufficient. Existing methods usually label an instance as positive when its coordinate distance to the object is small and as negative otherwise, and then maximize the similarity score of positive instance pairs and minimize that of negative instance pairs through a logistic loss. Such methods only use the pairwise relation between sample pairs, ignore the potential relation among the prototype, the positive instances and the negative instances, and do not consider the effect of difficult samples on the model, so they cannot handle complex scenes such as similar objects; the importance of difficult samples has already been demonstrated by researchers in fields such as object recognition.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a twin network single-target tracking method based on difficult sample mining. Difficult sample mining is introduced into the target tracking method: difficult negative samples are mined as training data during training and used to update the network parameters, and the difficult-sample triplet loss is chosen as the loss function. By optimizing this loss, the model keeps mining difficult negative samples during training, so that the network is fully trained, similar targets are better distinguished, and the model learns discriminative features.
In order to solve the technical problems, the invention adopts the following technical scheme:
a twin network single target tracking method based on difficult sample mining comprises the following steps:
Step (1), constructing a training set: cutting out a target template image Z and a search area image X for every image in an image sequence training set according to the target position and size of the image, dividing the search area images X into positive example images P and negative example images N, the image Z and the image P forming a positive sample pair, the image Z and the image N forming a negative sample pair, and the (Z, P, N) triplets formed by the target template images Z, the positive example images P and the negative example images N constituting a training data set;
Step (2), constructing a convolution twin network based on difficult sample mining, wherein the network comprises three branches and the three branches share weights of a feature extraction network; the three branches are respectively used for acquiring a feature map of a target template image, a feature map of a positive sample image of a search area and a feature map of a negative sample image, wherein during feature extraction, a difficult sample is defined, and difficult sample mining is introduced to learn features with distinguishing capability;
step (3), performing a cross-correlation operation on the target template image feature map obtained in step (2) and the search area image feature map to obtain a response map, wherein the position with the highest score in the response map is taken as the position most similar to the target object, and the response map is enlarged to the original image size, thereby determining the position of the target on the image to be searched;
Step (4), training a twin network based on difficult sample mining based on the training set in the step (1) to obtain a training convergence twin network;
and (5) performing online target tracking by utilizing the trained twin network.
Further, the operation of step (1) includes cropping the target region template image and cropping the search region image; the target template image is cropped as follows: the target frame of the template image is known in target tracking, a square region centered on the tracked target is cut out, the center of the target region represents the target position, the four sides of the target frame are each expanded by q pixels, and finally the cropped target image block is scaled; the search region image is cropped as follows: centered on the target region, the four sides of the target frame are each expanded by 2q pixels, and the cropped search region image block is then scaled; where q = (w+h)/4, w is the width of the target frame and h is the height of the target frame.
Further, in step (2), the feature extraction networks of the different branches of the twin network are all a fine-tuned ResNet-50, and features of the input image are extracted through the ResNet-50.
Further, the positive sample pair is an image pair with similar visual characteristics and high reference contrast, and the negative sample pair is an image pair with similar visual characteristics and low reference contrast; the difficult samples in the dataset are defined as:
P = {(i, j) | S_v(x_i, x_j) ≥ α, S_c(y_i, y_j) ≥ β}
N = {(m, n) | S_v(x_m, x_n) ≥ α, S_c(y_m, y_n) < β}
wherein S_v denotes the visual feature similarity, S_c denotes the reference contrast similarity, α denotes the threshold of the visual feature similarity, and β denotes the threshold of the reference contrast similarity;
When selecting pictures from the training set for training, for each picture the least similar positive sample and the most similar negative sample are selected to form a triplet, and the difficult-sample triplet loss is calculated; the difficult-sample triplet loss is defined as:
L_hard = Σ_{i=1..M} Σ_{a=1..N} ( max d_{A,P} − min d_{A,N} + θ )_+
wherein M denotes the M targets selected in each batch of samples, N denotes the N pictures randomly selected from each target, (z)_+ denotes max(z, 0), z denotes max d_{A,P} − min d_{A,N} + θ, θ is a threshold parameter set according to actual needs, d_{A,P} denotes the distance between the template sample and the positive sample, and d_{A,N} denotes the distance between the template sample and the negative sample;
By optimizing the loss L_hard, the model continuously mines positive sample pairs and difficult negative samples during training and learns discriminative features.
Further, step (3) operates as follows: after feature extraction, features from different layers are fused; the lower-layer features carry more target position information and the higher-layer features carry more semantic information, so the higher-layer features are first up-sampled and then fused with the lower-layer features, and the multi-layer fused feature maps of the different branches are generated iteratively; the target template image feature map is cross-correlated with the positive sample image feature map and with the negative sample image feature map of the search area respectively to obtain response maps, the response maps are enlarged to the original image size, and the position of the target on the image to be searched is determined.
Further, the specific operation of step (4) is as follows:
1) Training with the initial positive and negative samples, so that Z is pulled close to P and pushed away from N, to obtain a trained classifier;
2) Classifying the samples with the trained classifier, putting the misclassified samples into the negative sample subset as difficult negative samples, and then continuing to train the classifier;
3) The process is repeated until the performance of the classifier is no longer improved.
Further, the online tracking process in step (5) includes the following steps:
1) Reading the first frame picture of the video sequence to be tracked and obtaining its bounding box information, cutting out the target template image Z of the first frame according to the target template cropping method of step (1), inputting Z into the template branch of the training-converged twin network of step (4), extracting and fusing the multi-layer features of the template image, and then setting t = 2;
2) Reading the t-th frame of the video to be tracked, cutting out the search area image of the t-th frame according to the target position determined in frame t−1 and the search area cropping method of step (1), inputting the cropped t-th frame search area image into the search branch of the training-converged twin network of step (4), and extracting the features of the t-th frame search image;
3) Performing cross-correlation operation on the feature map obtained in the step 1) after multi-layer fusion and the feature map obtained in the step 2);
4) Setting t = t + 1 and judging whether t ≤ T, where T is the total number of frames of the video sequence to be tracked; if so, executing steps 2)-3); otherwise, ending the tracking process of the video sequence.
Compared with the prior art, the invention has the advantages that:
Aiming at the problem that existing twin network target tracking methods do not consider the effect of difficult samples on the model, a twin network target tracking method based on difficult sample mining is designed: difficult sample mining is introduced into the target tracking twin network structure, difficult negative samples are mined as training data during training, and the difficult-sample triplet loss is chosen as the loss function, so that the model is continuously optimized, learns discriminative features, and achieves a good target tracking effect.
Specifically, during training the initial positive and negative samples are used first; the trained classifier then classifies the samples, the misclassified samples are put into the negative sample subset as difficult negative samples, training continues, and this is repeated until the performance of the classifier no longer improves. Unlike conventional triplet training samples, the invention selects difficult-sample triplets and uses the difficult samples to update the network parameters during training: for each picture, the least similar positive sample and the most similar negative sample are selected to compute the difficult triplet loss. By optimizing this loss, the model keeps mining difficult negative samples during training, so the network is fully trained, similar targets are better distinguished, problems such as local change and background interference in images are handled, and the learned model has stronger generalization capability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic overall flow chart of the present invention;
FIG. 2 is a schematic diagram of a difficult sample mining strategy architecture according to the present invention;
FIG. 3 is a tracking effect of object tracking for a first video sequence using the method of the present invention;
fig. 4 is a tracking effect of object tracking for a second video sequence using the method of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
In combination with the overall flow shown in fig. 1, the twin network single target tracking method based on difficult sample mining comprises the following steps:
And (1) constructing a training set.
According to the target position and size of each image, a target template image Z and a search area image X are cut out for every image in the image sequence training set; the search area images X are divided into positive example images P and negative example images N; the image Z and the image P form a positive sample pair, the image Z and the image N form a negative sample pair, and the (Z, P, N) triplets formed by the target template images Z, the positive example images P and the negative example images N constitute the training data set.
Specifically, the operation of step (1) includes cropping the target area template image and cropping the search area image. The target template image is cropped as follows: the target frame of the template image is known in target tracking; a square region centered on the tracked target is cut out, with the center of the target area representing the target position; the four sides of the target frame are each expanded by q pixels, and finally the cropped target image block is scaled to 127×127. The search area image is cropped as follows: centered on the target area, the four sides of the target frame are each expanded by 2q pixels, and the cropped search area image block is then scaled to 255×255; here q = (w+h)/4, where w is the width of the target frame and h is the height of the target frame.
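Purely as a non-limiting illustration, the cropping rule above can be sketched as follows; the choice of the square side as the larger padded dimension, the mean-value padding for out-of-image pixels and the OpenCV-based resizing are assumptions of this sketch rather than features fixed by the embodiment:

```python
import cv2  # assumed available for padding and resizing


def crop_square_region(img, cx, cy, w, h, pad_factor, out_size):
    """Crop a square region centred on the target and rescale it.

    pad_factor=1 reproduces the template crop (q pixels of context per side,
    rescaled to out_size=127); pad_factor=2 the search crop (2q per side,
    out_size=255), with q = (w + h) / 4.
    """
    q = (w + h) / 4.0
    side = int(round(max(w, h) + 2 * pad_factor * q))
    half = side // 2
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half

    # Pad the frame with its channel-wise mean so the crop never runs off the edge.
    pad = max(0, -x0, -y0, x0 + side - img.shape[1], y0 + side - img.shape[0])
    if pad > 0:
        mean = tuple(float(m) for m in img.mean(axis=(0, 1)))
        img = cv2.copyMakeBorder(img, pad, pad, pad, pad,
                                 cv2.BORDER_CONSTANT, value=mean)
        x0, y0 = x0 + pad, y0 + pad

    patch = img[y0:y0 + side, x0:x0 + side]
    return cv2.resize(patch, (out_size, out_size))


# Template Z and search region X are cut from the same annotated frame:
# z_img = crop_square_region(frame, cx, cy, w, h, pad_factor=1, out_size=127)
# x_img = crop_square_region(frame, cx, cy, w, h, pad_factor=2, out_size=255)
```

The template crop of the first frame and the search crops of later frames can both reuse this routine, differing only in the padding factor and output size.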
And (2) constructing a convolution twin network based on difficult sample mining, and obtaining feature graphs of different branches.
The network contains three branches that share the weights of the feature extraction network; the three branches are used to obtain the feature map of the target template image, the feature map of the positive sample image of the search area and the feature map of the negative sample image respectively; during feature extraction, difficult samples are defined and difficult sample mining is introduced to learn discriminative features.
Specifically, the feature extraction networks of the different branches of the twin network in step (2) are all a fine-tuned ResNet-50, and features are extracted from the input image by the ResNet-50.
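A minimal sketch of the shared-weight, three-branch feature extractor is given below; the exact truncation point of ResNet-50 and the module names are assumptions of the sketch (the embodiment only requires that one fine-tuned ResNet-50 be shared by all three branches):

```python
import torch.nn as nn
from torchvision.models import resnet50


class TripletSiameseBackbone(nn.Module):
    """One ResNet-50 trunk shared by the template, positive and negative branches."""

    def __init__(self):
        super().__init__()
        trunk = resnet50()  # in practice initialised from pretrained weights and fine-tuned
        # Keep only the convolutional stages; dropping avgpool/fc is an assumption.
        self.features = nn.Sequential(*list(trunk.children())[:-2])

    def extract(self, x):
        return self.features(x)

    def forward(self, z, p, n):
        # Weight sharing: the same trunk processes all three inputs.
        return self.extract(z), self.extract(p), self.extract(n)
```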
Difficult sample mining is introduced to learn discriminative features. In connection with the difficult sample mining strategy of the invention shown in fig. 2, the invention obtains valid difficult sample pairs from both visual feature similarity and reference contrast similarity: image pairs with similar visual features and high reference contrast are defined as positive sample pairs, and image pairs with similar visual features and low reference contrast are defined as negative sample pairs.
The difficult samples in the dataset are defined as:
P = {(i, j) | S_v(x_i, x_j) ≥ α, S_c(y_i, y_j) ≥ β}
N = {(m, n) | S_v(x_m, x_n) ≥ α, S_c(y_m, y_n) < β}
wherein S_v denotes the visual feature similarity, S_c denotes the reference contrast similarity, α denotes the threshold of the visual feature similarity, and β denotes the threshold of the reference contrast similarity.
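For illustration only, the construction of the positive pair set P and the negative pair set N from the two similarity measures can be sketched as follows; the concrete forms of S_v and S_c and the threshold values α and β are left open by the embodiment, so they appear here simply as inputs:

```python
def build_hard_pairs(visual_sim, contrast_sim, alpha, beta):
    """Split candidate pairs into positive pairs P and hard negative pairs N.

    visual_sim[i][j]   -- visual feature similarity S_v(x_i, x_j)
    contrast_sim[i][j] -- reference contrast similarity S_c(y_i, y_j)
    A pair is kept only if it is visually similar (>= alpha); it is a positive
    pair if its reference contrast is also high (>= beta), otherwise a hard
    negative pair.
    """
    positives, negatives = [], []
    n = len(visual_sim)
    for i in range(n):
        for j in range(n):
            if i == j or visual_sim[i][j] < alpha:
                continue
            if contrast_sim[i][j] >= beta:
                positives.append((i, j))
            else:
                negatives.append((i, j))
    return positives, negatives
```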
Conventional triplet sampling draws three pictures from the training data, which is simple, but most of the sampled pairs are simple, easily distinguished sample pairs; if a large proportion of the training pairs are simple pairs, it is not conducive to the network learning good features. Therefore, when pictures are selected from the training set for training, for each picture the least similar positive sample and the most similar negative sample are selected to form a triplet, and the difficult-sample triplet loss is calculated.
The difficult sample triplet loss is defined as:
L_hard = Σ_{i=1..M} Σ_{a=1..N} ( max d_{A,P} − min d_{A,N} + θ )_+
wherein M denotes the M targets selected in each batch of samples, N denotes the N pictures randomly selected from each target, (z)_+ denotes max(z, 0), z denotes max d_{A,P} − min d_{A,N} + θ, θ is a threshold parameter set according to actual needs, d_{A,P} denotes the distance between the template sample and the positive sample, and d_{A,N} denotes the distance between the template sample and the negative sample.
By optimizing the loss L_hard, the model continuously mines positive sample pairs and difficult negative samples during training and learns discriminative features.
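A hedged PyTorch sketch of this batch-hard reading of the loss follows: each batch holds M targets with N images each, and every anchor is paired with its least similar positive and most similar negative within the batch. The Euclidean distance, the averaging over anchors and the example margin value are assumptions of the sketch:

```python
import torch


def hard_triplet_loss(embeddings, labels, theta=0.3):
    """Batch-hard triplet loss over a batch of M targets x N images per target.

    embeddings : (M*N, D) feature vectors, labels : (M*N,) target identities.
    For every anchor, d_{A,P} is taken over its least similar positive (max
    distance) and d_{A,N} over its most similar negative (min distance); the
    hinge (.)_+ keeps only the positive part.
    """
    dist = torch.cdist(embeddings, embeddings, p=2)            # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # same-target mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    d_ap = dist.masked_fill(~same | eye, float("-inf")).max(dim=1).values  # hardest positive
    d_an = dist.masked_fill(same, float("inf")).min(dim=1).values          # hardest negative
    return torch.clamp(d_ap - d_an + theta, min=0).mean()
```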
And (3) performing a cross-correlation operation on the target template image feature map obtained in step (2) and the search area image feature map to obtain a response map, wherein the position with the highest score in the response map is taken as the position most similar to the target object, thereby determining the position of the target.
Specifically, step (3) operates as follows: after feature extraction, features from different layers are fused; the lower-layer features carry more target position information and the higher-layer features carry more semantic information, so the higher-layer features are first up-sampled and then fused with the lower-layer features, and the multi-layer fused feature maps of the different branches are generated iteratively. The target template image feature map is cross-correlated with the positive sample image feature map and with the negative sample image feature map of the search area respectively to obtain response maps. The response maps are enlarged to the original image size so as to determine the position of the target on the image to be searched.
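The upsample-and-add fusion and the cross-correlation can be expressed compactly with standard tensor operations; the following sketch assumes equal channel counts for the fused layers, bilinear upsampling, and a depthwise correlation followed by a channel sum, none of which is mandated by the embodiment:

```python
import torch.nn.functional as F


def fuse_layers(low, high):
    """Up-sample the semantically richer high-level map and add it to the
    low-level map, which carries more positional detail (equal channel counts
    are assumed; a 1x1 convolution would normally align them)."""
    high_up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                            align_corners=False)
    return low + high_up


def cross_correlation(template_feat, search_feat):
    """Slide the template feature map over the search feature map.

    template_feat: (B, C, Hz, Wz), search_feat: (B, C, Hx, Wx).  Implemented
    as a grouped convolution so each sample is correlated with its own
    template; summing over channels yields a (B, 1, Ho, Wo) response map.
    """
    b, c, hz, wz = template_feat.shape
    search = search_feat.reshape(1, b * c, *search_feat.shape[-2:])
    kernel = template_feat.reshape(b * c, 1, hz, wz)
    resp = F.conv2d(search, kernel, groups=b * c)   # depthwise correlation
    return resp.reshape(b, c, *resp.shape[-2:]).sum(dim=1, keepdim=True)
```

The resulting response map would then be interpolated back to the search-image resolution, and its peak gives the predicted target position.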
And (4) training the twin network based on difficult sample mining based on the training set in the step (1) to obtain a training convergence twin network.
Specifically, the specific operation of step (4) is as follows:
1) Training with the initial positive and negative samples, so that Z is pulled close to P and pushed away from N, to obtain a trained classifier;
2) Classifying the samples with the trained classifier, putting the misclassified samples into the negative sample subset as difficult negative samples, and then continuing to train the classifier;
3) The process is repeated until the performance of the classifier is no longer improved.
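The iterative mining procedure of steps 1)-3) can be written schematically as the loop below; train_fn, misclassified_fn and evaluate_fn are placeholders (assumptions of this sketch) for one triplet-loss training pass, the classification of candidate samples, and the measurement of classifier performance:

```python
def train_with_hard_negative_mining(model, train_fn, misclassified_fn, evaluate_fn,
                                    positives, initial_negatives, candidate_pool,
                                    max_rounds=10, tol=1e-4):
    """Steps 1)-3): train, mine misclassified candidates as hard negatives, retrain.

    train_fn(model, pos, neg)             -- one training pass with the triplet loss
    misclassified_fn(model, pool) -> list -- samples the current classifier gets wrong
    evaluate_fn(model) -> float           -- validation score of the classifier
    """
    neg_subset = list(initial_negatives)
    best = float("-inf")
    for _ in range(max_rounds):
        train_fn(model, positives, neg_subset)        # pull Z towards P, push it from N
        hard_negatives = misclassified_fn(model, candidate_pool)
        neg_subset.extend(s for s in hard_negatives if s not in neg_subset)
        score = evaluate_fn(model)
        if score <= best + tol:                       # performance no longer improves
            break
        best = score
    return model
```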
And (5) performing online target tracking by utilizing the trained twin network.
Specifically, the online tracking process in step (5) includes the following steps:
1) Reading the first frame picture of the video sequence to be tracked and obtaining its bounding box information; cutting out the target template image Z of the first frame according to the target template cropping method of step (1), inputting Z into the template branch of the training-converged twin network of step (4), extracting and fusing the multi-layer features of the template image, and then setting t = 2.
2) Reading the t-th frame of the video to be tracked; cutting out the search area image of the t-th frame according to the target position determined in frame t−1 and the search area cropping method of step (1), inputting the cropped t-th frame search area image into the search branch of the training-converged twin network of step (4), and extracting the features of the t-th frame search image.
3) Performing the cross-correlation operation between the multi-layer fused feature map obtained in step 1) and the feature map obtained in step 2).
4) Setting t = t + 1 and judging whether t ≤ T, where T is the total number of frames of the video sequence to be tracked; if so, executing steps 2)-3); otherwise, ending the tracking process of the video sequence.
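Putting steps 1)-4) together, a per-frame tracking loop might look like the sketch below; the helper names reuse the earlier sketches, and the tensor shapes returned by the cropping helpers, the bicubic upsampling of the response map and the simplified peak-to-coordinate conversion are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def track_sequence(frames, init_box, extract, xcorr, crop_template, crop_search):
    """Online tracking loop: frame 1 fixes the template, frames 2..T are searched.

    frames    -- list of decoded video frames, the first one annotated
    init_box  -- (cx, cy, w, h) of the target in the first frame
    extract   -- fused multi-layer feature extractor shared by both branches
    xcorr     -- cross-correlation returning a (1, 1, Ho, Wo) response map
    crop_*    -- helpers returning (1, 3, 127, 127) / (1, 3, 255, 255) tensors
    The rescaling of the peak displacement from crop coordinates back to frame
    coordinates is simplified away.
    """
    cx, cy, w, h = init_box
    z_feat = extract(crop_template(frames[0], cx, cy, w, h))   # template from frame 1 only
    boxes = [(cx, cy, w, h)]

    for frame in frames[1:]:                                   # t = 2 .. T
        x = crop_search(frame, cx, cy, w, h)                   # centred on previous position
        response = xcorr(z_feat, extract(x))
        up = F.interpolate(response, size=(255, 255),
                           mode="bicubic", align_corners=False)  # back to search-image size
        peak = int(torch.argmax(up.flatten()))
        dy, dx = divmod(peak, up.shape[-1])
        cx, cy = cx + (dx - 127), cy + (dy - 127)              # offset from the crop centre
        boxes.append((cx, cy, w, h))
    return boxes
```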
Fig. 3 is a tracking effect of object tracking for a first video sequence using the method of the present invention. It can be seen that the target tracking method provided by the invention can effectively track targets with similar background interference.
Fig. 4 is a tracking effect of object tracking for a second video sequence using the method of the present invention. It can be seen that the target tracking method provided by the invention can effectively track the target with posture change and rapid movement.
In summary, the invention introduces difficult sample mining into the target tracking twin network structure and designs the difficult triplet loss; the network can thus be fully trained, the discrimination capability of the classifier is strengthened, similar targets are better distinguished, problems such as local change and background interference in images can be handled, and the learned model has stronger generalization capability.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed, and that various changes, modifications, additions and substitutions can be made by those skilled in the art without departing from the spirit and scope of the invention.
Claims (6)
1. The twin network single target tracking method based on difficult sample mining is characterized by comprising the following steps:
step (1), constructing a training set: cutting out a target template image Z and a search area image X for every image in an image sequence training set according to the target position and size of the image, dividing the search area images X into positive example images P and negative example images N, the image Z and the image P forming a positive sample pair, the image Z and the image N forming a negative sample pair, and the (Z, P, N) triplets formed by the target template images Z, the positive example images P and the negative example images N constituting a training data set;
The positive sample pair is an image pair with similar visual characteristics and high reference contrast, and the negative sample pair is an image pair with similar visual characteristics and low reference contrast; the difficult samples in the dataset are defined as:
P = {(i, j) | S_v(x_i, x_j) ≥ α, S_c(y_i, y_j) ≥ β}
N = {(m, n) | S_v(x_m, x_n) ≥ α, S_c(y_m, y_n) < β}
wherein S_v denotes the visual feature similarity, S_c denotes the reference contrast similarity, α denotes the threshold of the visual feature similarity, and β denotes the threshold of the reference contrast similarity;
When selecting images from the training set for training, for each image the least similar positive sample and the most similar negative sample are selected to form a triplet, and the difficult-sample triplet loss is calculated; the difficult-sample triplet loss is defined as:
L_hard = Σ_{i=1..M} Σ_{a=1..N} ( max d_{A,P} − min d_{A,N} + θ )_+
wherein M denotes the M targets selected in each batch of samples, N denotes the N images randomly selected from each target, (z)_+ denotes max(z, 0), z denotes max d_{A,P} − min d_{A,N} + θ, θ is a threshold parameter set according to actual needs, d_{A,P} denotes the distance between the template sample and the positive sample, and d_{A,N} denotes the distance between the template sample and the negative sample;
By optimizing the loss L_hard, the model continuously mines positive sample pairs and difficult negative samples during training and learns discriminative features;
Step (2), constructing a convolution twin network based on difficult sample mining, wherein the network comprises three branches and the three branches share weights of a feature extraction network; the three branches are respectively used for acquiring a feature map of a target template image, a feature map of a positive sample image of a search area and a feature map of a negative sample image, wherein during feature extraction, a difficult sample is defined, and difficult sample mining is introduced to learn features with distinguishing capability;
step (3), performing a cross-correlation operation on the target template image feature map obtained in step (2) and the search area image feature map to obtain a response map, wherein the position with the highest score in the response map is taken as the position most similar to the target object, and the response map is enlarged to the original image size, thereby determining the position of the target on the image to be searched;
Step (4), training a twin network based on difficult sample mining based on the training set in the step (1) to obtain a training convergence twin network;
and (5) performing online target tracking by utilizing the trained twin network.
2. The difficult sample mining-based twin network single target tracking method of claim 1, wherein the operations of step (1) comprise cropping the target region template image and cropping the search region image; the target template image is cropped as follows: the target frame of the template image is known in target tracking, a square region centered on the tracked target is cut out, the center of the target region represents the target position, the four sides of the target frame are each expanded by q pixels, and finally the cropped target image block is scaled; the search area image is cropped as follows: centered on the target area, the four sides of the target frame are each expanded by 2q pixels, and the cropped search area image block is then scaled; where q = (w+h)/4, w is the width of the target frame and h is the height of the target frame.
3. The twin network single target tracking method based on difficult sample mining according to claim 1, wherein the feature extraction networks of the different branches of the twin network in step (2) are all a fine-tuned ResNet-50, and features of the input image are extracted through the ResNet-50.
4. The twin network single target tracking method based on difficult sample mining of claim 1, wherein step (3) operates as follows: after feature extraction, features from different layers are fused; the lower-layer features carry more target position information and the higher-layer features carry more semantic information, so the higher-layer features are first up-sampled and then fused with the lower-layer features, and the multi-layer fused feature maps of the different branches are generated iteratively; the target template image feature map is cross-correlated with the positive sample image feature map and with the negative sample image feature map of the search area respectively to obtain response maps, the response maps are enlarged to the original image size, and the position of the target on the image to be searched is determined.
5. The twin network single target tracking method based on difficult sample mining of claim 1, wherein the specific operation of step (4) is as follows:
1) Training with the initial positive and negative samples, so that Z is pulled close to P and pushed away from N, to obtain a trained classifier;
2) Classifying the samples with the trained classifier, putting the misclassified samples into the negative sample subset as difficult negative samples, and then continuing to train the classifier;
3) The process is repeated until the performance of the classifier is no longer improved.
6. The twin network single target tracking method based on difficult sample mining of claim 2, wherein the online target tracking process in step (5) comprises the steps of:
1) Reading the first frame image of the video sequence to be tracked and obtaining its bounding box information, cutting out the target template image Z of the first frame according to the target template cropping method of step (1), inputting Z into the template branch of the training-converged twin network of step (4), extracting and fusing the multi-layer features of the template image, and then setting t = 2;
2) Reading the t-th frame of the video to be tracked, cutting out the search area image of the t-th frame according to the target position determined in frame t−1 and the search area cropping method of step (1), inputting the cropped t-th frame search area image into the search branch of the training-converged twin network of step (4), and extracting the features of the t-th frame search image;
3) Performing cross-correlation operation on the feature map obtained in the step 1) after multi-layer fusion and the feature map obtained in the step 2);
4) Setting t = t + 1 and judging whether t ≤ T, where T is the total number of frames of the video sequence to be tracked; if so, executing steps 2)-3); otherwise, ending the tracking process of the video sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152770.4A CN113888595B (en) | 2021-09-29 | 2021-09-29 | Twin network single-target visual tracking method based on difficult sample mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152770.4A CN113888595B (en) | 2021-09-29 | 2021-09-29 | Twin network single-target visual tracking method based on difficult sample mining |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113888595A CN113888595A (en) | 2022-01-04 |
CN113888595B true CN113888595B (en) | 2024-05-14 |
Family
ID=79008165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111152770.4A Active CN113888595B (en) | 2021-09-29 | 2021-09-29 | Twin network single-target visual tracking method based on difficult sample mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113888595B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114579783A (en) * | 2022-03-09 | 2022-06-03 | Nanjing University of Posts and Telecommunications | Unsupervised image embedding learning method based on nearest neighbor and difficult sample mining |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111340850A (en) * | 2020-03-20 | 2020-06-26 | Systems General Research Institute, Institute of Systems Engineering, Academy of Military Sciences | Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss |
CN111354017A (en) * | 2020-03-04 | 2020-06-30 | Jiangnan University | Target tracking method based on twin neural network and parallel attention module |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | Nanjing University of Posts and Telecommunications | Vehicle-mounted video target detection method based on deep learning |
- 2021-09-29: CN application CN202111152770.4A filed (granted as CN113888595B, status Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | Nanjing University of Posts and Telecommunications | Vehicle-mounted video target detection method based on deep learning |
CN111354017A (en) * | 2020-03-04 | 2020-06-30 | Jiangnan University | Target tracking method based on twin neural network and parallel attention module |
CN111340850A (en) * | 2020-03-20 | 2020-06-26 | Systems General Research Institute, Institute of Systems Engineering, Academy of Military Sciences | Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss |
Non-Patent Citations (3)
Title |
---|
A single-target tracking algorithm based on diverse positive instances; Zhang Boyan; Zhong Yong; Journal of Harbin Institute of Technology; 2020-09-25 (No. 10); full text *
A survey of Siamese-network-based tracking algorithms; Xiong Changzhen; Li Yan; Industrial Control Computer; 2020-03-25 (No. 03); full text *
Research on a vehicle tracking method based on contour features and extended Kalman filtering; Ji Xiaopeng; Wei Zhiqiang; Journal of Image and Graphics; 2011-02-16 (No. 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113888595A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108665481B (en) | Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion | |
CN112069896B (en) | Video target tracking method based on twin network fusion multi-template features | |
EP1934941B1 (en) | Bi-directional tracking using trajectory segment analysis | |
CN109598684B (en) | Correlation filtering tracking method combined with twin network | |
CN112184752A (en) | Video target tracking method based on pyramid convolution | |
CN111354017A (en) | Target tracking method based on twin neural network and parallel attention module | |
CN112651998B (en) | Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network | |
CN112489081B (en) | Visual target tracking method and device | |
CN109461172A (en) | Manually with the united correlation filtering video adaptive tracking method of depth characteristic | |
CN109410247A (en) | A kind of video tracking algorithm of multi-template and adaptive features select | |
CN108520530A (en) | Method for tracking target based on long memory network in short-term | |
CN110399840B (en) | Rapid lawn semantic segmentation and boundary detection method | |
CN112668483A (en) | Single-target person tracking method integrating pedestrian re-identification and face detection | |
CN113706581B (en) | Target tracking method based on residual channel attention and multi-level classification regression | |
CN111931654A (en) | Intelligent monitoring method, system and device for personnel tracking | |
CN114861761B (en) | Loop detection method based on twin network characteristics and geometric verification | |
CN112434599A (en) | Pedestrian re-identification method based on random shielding recovery of noise channel | |
CN113920472A (en) | Unsupervised target re-identification method and system based on attention mechanism | |
CN109740552A (en) | A kind of method for tracking target based on Parallel Signature pyramid neural network | |
CN113888595B (en) | Twin network single-target visual tracking method based on difficult sample mining | |
CN114495170A (en) | Pedestrian re-identification method and system based on local self-attention inhibition | |
CN102081740B (en) | 3D image classification method based on scale invariant features | |
CN114038011A (en) | Method for detecting abnormal behaviors of human body in indoor scene | |
CN116543019A (en) | Single-target tracking method based on accurate bounding box prediction | |
CN115953570A (en) | Twin network target tracking method combining template updating and trajectory prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||