CN111415318B - Unsupervised correlation filtering target tracking method and system based on jigsaw task - Google Patents
Unsupervised correlation filtering target tracking method and system based on jigsaw task
- Publication number: CN111415318B (application CN202010201902.7A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06N3/045 — Neural networks; combinations of networks
- G06T3/4038 — Image mosaicing, e.g. composing plane images from plane sub-images
- G06T5/70 — Denoising; smoothing
- G06T2207/10004 — Still image; photographic image
- G06T2207/20024 — Filtering details
- G06T2207/20081 — Training; learning
- G06T2207/20112 — Image segmentation details
- G06T2207/20132 — Image cropping
Abstract
The invention relates to an unsupervised correlation filtering target tracking method and system based on a jigsaw task, comprising two stages: offline pre-training and online fine-tuning. In the offline pre-training stage, the jigsaw-task-based training of the neural network combines two tasks: unsupervised correlation filtering training and jigsaw task training. The training process can be divided into four parts: data processing, depth feature extraction, jigsaw task training, and unsupervised correlation filtering training. In the training of the unsupervised correlation filtering algorithm, the invention simultaneously introduces a prediction task for the image-block position index, which increases the capability of the deep neural network to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, so that accuracy is improved.
Description
Technical Field
The invention relates to the field of automatic identification, in particular to an unsupervised correlation filtering target tracking method and system based on a jigsaw task.
Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Target tracking is an important subject in the field of computer vision research and has broad application prospects in everyday life. Specific application fields include intelligent video surveillance, three-dimensional reconstruction, human-computer interaction, image understanding, and intelligent visual navigation. The target tracking task obtains the complete motion trajectory of a moving target by estimating its exact position and occupied area in a continuous video sequence, enabling analysis and understanding of the target's motion behavior and laying the groundwork for subsequent high-level tasks. Although the performance of target tracking algorithms has improved greatly, real-time and stable tracking of a moving target in complex real scenes still faces great challenges. These challenges arise both from changes in the moving target itself, such as shape and pose changes, and from external factors, such as motion blur, background occlusion, background clutter, and illumination changes, all of which present difficulties for target tracking.
Tracking algorithms based on correlation filtering are a research hotspot in the current target tracking field. Such a method uses the target image to train a discriminative correlation filter, performs correlation filtering on the search-region image, and takes the maximum position of the filter response map as the target position. Correlation filtering algorithms are widely applied because of their good accuracy and fast tracking speed. In addition, target tracking algorithms based on deep learning have achieved great success, mainly because features extracted by deep neural networks have stronger expressive power and anti-interference capability than traditional hand-crafted features. Shallower features contain more target position information, while deeper features contain more target semantic information, so fusing features of different layers improves the accuracy of the overall algorithm. Meanwhile, because training data for the target tracking task is severely lacking, performing offline pre-training in an unsupervised manner and then fine-tuning the model online provides a feasible direction for the practical application of deep learning in target tracking. To ensure both the speed and the precision of the target tracking task, many algorithms combine correlation filtering with deep learning and achieve good performance improvements on relevant datasets.
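As a concrete illustration of the correlation-filtering principle described above, the following is a minimal NumPy sketch of a classic Fourier-domain correlation filter (a MOSSE-style formulation; the function names and the regularization constant `lam` are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    # Closed-form (ridge-regression) correlation filter in the Fourier domain.
    # x: target patch, y: desired response map with a peak at the target center.
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def respond(h_hat, z):
    # Apply the learned filter spectrum to a search patch; the location of the
    # response maximum is the estimated target position.
    return np.real(np.fft.ifft2(h_hat * np.fft.fft2(z)))
```

Tracking then reduces to taking the argmax of the response map over the search region, frame by frame.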
The inventors found that existing unsupervised algorithms have insufficient capability to extract object detail features and struggle to balance semantic information and position information, so their accuracy still needs improvement.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides an unsupervised correlation filtering target tracking method and system based on a jigsaw task. In the training of the unsupervised correlation filtering algorithm, a prediction task for the image-block position index is simultaneously introduced, which increases the capability of the deep neural network to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, so that accuracy is improved. The method comprises two stages: offline pre-training and online fine-tuning.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
The first aspect of the invention provides an unsupervised correlation filtering target tracking method based on a jigsaw task, comprising the following steps:
processing the input image;
extracting depth features from the processed image with a twin depth network model;
training the jigsaw task: processing the extracted depth features with a classifier network model and predicting the position index of each small image;
performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
and updating the appearance model online and performing online fine-tuning.
In the offline pre-training stage, the jigsaw-task-based training of the neural network combines two tasks: unsupervised correlation filtering training and jigsaw task training. The depth features of the input image are extracted through the twin network and fused; meanwhile, jigsaw-task training gives the features of the twin network model stronger generality and detail-extraction capability, yielding higher tracking precision.
In a second aspect of the present invention, there is provided an unsupervised correlation filtering target tracking system based on a jigsaw task, comprising:
a data input device;
the data processing module is used for processing the input image;
the depth feature extraction module, used for extracting depth features from the processed image with the twin depth network model;
the jigsaw task training module, used for processing the extracted depth features with a classifier network model and predicting the position index of each small image;
the unsupervised correlation filtering training module, used for performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction;
and the online fine-tuning module, used for updating the appearance model online.
The system of the invention is simple to operate and convenient to interact with, and has great reference and practical value.
The invention has the following beneficial effects:
(1) In the training of the unsupervised correlation filtering algorithm, the invention simultaneously introduces a prediction task for the image-block position index, which increases the capability of the deep neural network to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, so that accuracy is improved.
(2) The system of the invention is simple to operate and convenient to interact with, and has great reference and practical value.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is the jigsaw task training network of embodiment 1;
FIG. 2 is the overall training framework of embodiment 1.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Term interpretation:
Search: the search region;
Template: the template region;
CNN: convolutional neural network;
Correlation Filter: correlation filter;
Pseudo Label: pseudo label;
Consistency Loss: consistency loss function;
Initial Label: initial label.
An unsupervised correlation filtering target tracking method based on a jigsaw task comprises the following steps:
processing the input image;
extracting depth features from the processed image with a twin depth network model;
training the jigsaw task: processing the extracted depth features with a classifier network model and predicting the position index of each small image;
performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
and updating the appearance model online and performing online fine-tuning.
In some embodiments, the data processing uses the ILSVRC2015 dataset.
In some embodiments, the specific steps of the data processing include:
centrally cropping each picture to a size determined by its original length and width;
scaling the cropped picture to 125x125 resolution;
for pictures of the same video sequence, randomly selecting 3 pictures as the template T, search area S1, and search area S2;
cutting out non-overlapping small pictures of 50x50 resolution at four positions of each selected picture (upper left, lower left, upper right, and lower right), and scaling them to 63x63 resolution;
randomly dithering each channel of each cut small image within 2 pixels;
randomly scrambling the obtained small images, and using the scrambled position index, the scrambled 63x63 small images, and the 125x125 large picture as a group of training data.
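The data-processing steps above can be sketched as follows (NumPy only; the nearest-neighbor resize and whole-patch jitter are simplifications of the scaling and per-channel dithering described in the text, and all names are illustrative):

```python
import numpy as np

def nearest_resize(patch, size):
    # Simplified stand-in for the 50x50 -> 63x63 scaling step.
    idx = (np.arange(size) * patch.shape[0] / size).astype(int)
    return patch[np.ix_(idx, idx)]

def make_jigsaw_sample(img, rng):
    # Cut four non-overlapping 50x50 corner patches from a 125x125 picture,
    # resize to 63x63, jitter within 2 pixels, and randomly scramble them.
    # Returns the scrambled patches and the permutation (the training label).
    assert img.shape[:2] == (125, 125)
    corners = [(0, 0), (75, 0), (0, 75), (75, 75)]  # UL, LL, UR, LR origins
    patches = []
    for y, x in corners:
        p = nearest_resize(img[y:y + 50, x:x + 50].astype(np.float32), 63)
        dy, dx = rng.integers(-2, 3, size=2)         # random dithering
        patches.append(np.roll(p, (dy, dx), axis=(0, 1)))
    perm = rng.permutation(4)                        # scrambled position index
    return [patches[i] for i in perm], perm
```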
In some embodiments, the specific steps of the depth feature extraction are:
extracting, from each of the 4 small images of 63x63 resolution, depth features of 1x1 resolution through the twin convolutional network model;
extracting the features of specific layers from the 125x125 picture through the twin convolutional network model;
scaling the resulting layer features of different resolutions to a fixed 125x125 resolution using bilinear interpolation.
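The bilinear scaling step can be sketched as a standalone NumPy routine (a hypothetical implementation; a real system would likely use its deep-learning framework's built-in interpolation):

```python
import numpy as np

def bilinear_resize(feat, size):
    # Upsample an (H, W, C) feature map to (size, size, C) bilinearly.
    h, w = feat.shape[:2]
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```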
In some embodiments, the specific steps of the jigsaw task training are:
reshaping the 4 three-dimensional depth features obtained in the depth feature extraction into one-dimensional vectors and concatenating them in a given order;
passing the obtained feature vector through a classifier composed of several fully-connected layers to predict the position index of the scrambled small images, and calculating a cross-entropy loss function between the prediction and the given position index. The depth features of the input image are extracted through the twin network and fused; meanwhile, jigsaw-task training gives the features of the twin network model stronger generality and detail-extraction capability, yielding higher tracking precision.
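A NumPy sketch of this prediction step, assuming four 1x1xC patch features and a 512-unit / 24-unit fully-connected classifier (the weight initialization and the channel count C are illustrative assumptions):

```python
import numpy as np

def jigsaw_logits(feats, W1, b1, W2, b2):
    # Flatten the four 1x1xC depth features, concatenate in the given order,
    # then apply two fully-connected layers (ReLU after the first).
    x = np.concatenate([f.ravel() for f in feats])
    h = np.maximum(W1 @ x + b1, 0.0)   # 512 hidden units
    return W2 @ h + b2                 # 24 logits: one per permutation of 4

def cross_entropy(logits, label):
    # Softmax cross-entropy against the given (scrambled) position index.
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]
```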
In some embodiments, the specific steps of the unsupervised correlation filtering training include:
from the features Ft and Fs1 of a specific layer and the corresponding response image Yt, obtaining the response image of that layer with the correlation filtering algorithm, and obtaining the response image Rs1 of search area S1 by weighted summation;
from the features Fs1 and Fs2 of a specific layer and the obtained response image Rs1, obtaining the response image of that layer with the correlation filtering algorithm, and obtaining the response image Rs2 of search area S2 by weighted summation;
from the features Fs2 and Ft of a specific layer and the obtained response image Rs2, obtaining the response image of that layer with the correlation filtering algorithm, and obtaining the response image Rt of the template T by weighted summation;
calculating a mean-square loss function between the obtained response image Rt and the original response image Yt. The template T and the search areas S1 and S2 of the same video sequence yield multi-layer features through the depth feature extraction step. Shallow features carry more position information but weak semantic information; deep features carry more semantic information and strong anti-interference capability but lack the necessary position information. Combining features of different layers therefore helps improve the target tracking effect.
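The forward-backward cycle above can be sketched per feature layer with a Fourier-domain correlation filter (a simplified single-channel NumPy sketch; the function names and the regularization constant are assumptions):

```python
import numpy as np

def cf_track(feat_a, feat_b, resp_a, lam=1e-4):
    # Train a correlation filter on (feat_a, resp_a), then apply it to feat_b.
    A, B, Y = np.fft.fft2(feat_a), np.fft.fft2(feat_b), np.fft.fft2(resp_a)
    H = (Y * np.conj(A)) / (A * np.conj(A) + lam)
    return np.real(np.fft.ifft2(H * B))

def cycle_loss(f_t, f_s1, f_s2, y_t, lam=1e-4):
    # T -> S1 -> S2 -> T cycle; the tracked-back response should match Yt.
    r_s1 = cf_track(f_t, f_s1, y_t, lam)     # template tracks search area 1
    r_s2 = cf_track(f_s1, f_s2, r_s1, lam)   # pseudo-label propagated to area 2
    r_t = cf_track(f_s2, f_t, r_s2, lam)     # tracked back to the template
    return np.mean((r_t - y_t) ** 2)         # mean-square consistency loss
```

When the three feature maps depict the same (unmoved) target, the cycle loss is close to zero, which is what the unsupervised objective exploits.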
In some embodiments, the twin convolutional network structure is:
the first convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; the activation function is the rectified linear unit (ReLU);
the second convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; the activation function is ReLU;
the fourth convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; the activation function is ReLU;
the sixth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the eighth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the tenth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
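As a sanity check, the following sketch traces feature-map sizes through the ten layers, assuming zero-padding of 1 on each 3x3 convolution and 2x2 max pooling (stride 2) after every even-numbered layer — assumptions, since the text does not state padding or pooling parameters. Under them, a 63x63 small image indeed reduces to a 1x1 depth feature, consistent with the depth-feature-extraction step:

```python
def twin_shapes(size, pad=1, pool_after=(2, 4, 6, 8, 10)):
    # Trace (spatial size, output channels) after each of the ten conv layers.
    channels = [32, 32, 64, 64, 128, 128, 256, 256, 256, 256]
    shapes = []
    for i, c in enumerate(channels, start=1):
        size = size + 2 * pad - 3 + 1   # 3x3 convolution, stride 1
        if i in pool_after:
            size //= 2                  # 2x2 max pooling, stride 2
        shapes.append((size, c))
    return shapes
```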
In some embodiments, the classifier network structure is:
the first fully-connected layer, comprising 512 hidden units, with ReLU as the activation function;
the second fully-connected layer, comprising 24 hidden units, whose output is the predicted position index of the small images.
In some embodiments, the specific steps of the online fine-tuning include:
cropping and scaling: cropping from the search image a region centered at the target position of the previous frame but of larger size, and scaling it to 125x125 resolution;
using the resulting image to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt) * Wt-1 + αt * W
where αt ∈ [0, 1];
cropping from the search image three images centered at the previous frame's target position but with different resolutions, and obtaining the maximum response value of the three through the correlation filtering algorithm: the resolution with the maximum response gives the target size on the search image, and the position of the maximum gives the target's displacement; the cropping and scaling steps are then repeated with the current search image as the template.
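The online update and multi-scale search can be sketched as follows (a minimal NumPy sketch; the learning-rate value and function names are illustrative):

```python
import numpy as np

def update_filter(w_prev, w_new, alpha=0.01):
    # Linear interpolation update of the filter parameters:
    # W_t = (1 - alpha_t) * W_{t-1} + alpha_t * W, with alpha_t in [0, 1].
    return (1.0 - alpha) * w_prev + alpha * w_new

def pick_scale(responses):
    # Among response maps computed at several crop scales, pick the scale and
    # position of the global maximum: the scale estimates the target size,
    # the position gives the target's displacement.
    best = max(range(len(responses)), key=lambda i: responses[i].max())
    pos = np.unravel_index(responses[best].argmax(), responses[best].shape)
    return best, pos
```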
The invention will now be described in further detail with reference to the following specific examples, which should be construed as illustrative rather than limiting.
Example 1:
The invention provides an unsupervised correlation filtering target tracking method and system based on a jigsaw task; the method comprises two stages: offline pre-training and online fine-tuning.
In the offline pre-training stage, the jigsaw-task-based training of the neural network combines two tasks: unsupervised correlation filtering training and jigsaw task training. The depth features of the input image are extracted through the twin network and fused; meanwhile, jigsaw-task training gives the features of the twin network model stronger generality and detail-extraction capability, yielding higher tracking precision. The training process can be divided into four parts: data processing, depth feature extraction, jigsaw task training, and unsupervised correlation filtering training.
Data processing: the ILSVRC2015 dataset is used in the training process; the specific steps are as follows:
step (1): cutting each picture centrally to obtain original picture with length and widthSize of the product.
Step (2): and (3) scaling the picture obtained by clipping in the step (1) to a 125x125 resolution.
Step (3): for the pictures of the same video sequence, randomly selecting 3 pictures as templates T and searching areas S 1 Search area S 2 。
Step (4): for each picture selected in the step (3), a non-overlapping small picture with the resolution of 50x50 is respectively cut out at the upper left, the lower left, the upper right and the lower right of the picture, and is scaled to the resolution of 63x 63.
Step (5): and (3) randomly dithering each channel of each small image cut in the step (4) within 2 pixel points.
Step (6): randomly scrambling the small images obtained in the step (5), and taking the scrambled position indexes, the scrambled small images with the resolution of 63x63 and the large images with the resolution of 125x125 as a group of training data to participate in training.
Depth feature extraction: the method uses a twin depth network model for feature extraction, with the following specific steps:
Step (1): extract, from each of the 4 small images of 63x63 resolution, depth features of 1x1 resolution through the twin convolutional network model.
Step (2): extract the features of specific layers from the 125x125 picture through the twin convolutional network model.
Step (3): scale the layer features of different resolutions obtained in step (2) to a fixed 125x125 resolution using bilinear interpolation.
Jigsaw task training: the position index of the small images is predicted using a classifier network model, as shown in FIG. 1. The specific steps are as follows:
step (1): and (3) adjusting the 4 three-dimensional depth features obtained in the depth feature extraction step (1) into one-dimensional vectors, and connecting the three-dimensional depth features together in a given order.
Step (2): and (3) predicting the position index of the disturbed small image by using the feature vector obtained in the step (1) through a classifier formed by a plurality of full-connection layers, and calculating a cross entropy loss function between the position index and the given position index.
Unsupervised correlation filtering training: the specific training process is shown in FIG. 2. The template T and the search areas S1 and S2 of the same video sequence yield multi-layer features through depth feature extraction step (3). Shallow features carry more position information but weak semantic information; deep features carry more semantic information and strong anti-interference capability but lack the necessary position information. Combining features of different layers therefore helps improve the target tracking effect. During training, the correlation filtering algorithm is applied cyclically to the features of different pictures, with the following specific steps:
Step (1): from the features Ft and Fs1 of a specific layer and the corresponding response image Yt, obtain the response image of that layer with the correlation filtering algorithm, and obtain the response image Rs1 of search area S1 by weighted summation.
Step (2): from the features Fs1 and Fs2 of a specific layer and the response image Rs1 obtained in step (1), obtain the response image of that layer with the correlation filtering algorithm, and obtain the response image Rs2 of search area S2 by weighted summation.
Step (3): from the features Fs2 and Ft of a specific layer and the response image Rs2 obtained in step (2), obtain the response image of that layer with the correlation filtering algorithm, and obtain the response image Rt of the template T by weighted summation.
Step (4): calculate a mean-square loss function between the response image Rt obtained in step (3) and the original response image Yt.
The following is the complete network structure of the present invention:
Twin convolutional network structure:
the first convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; the activation function is the rectified linear unit (ReLU);
the second convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; the activation function is ReLU;
the fourth convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; the activation function is ReLU;
the sixth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the eighth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the tenth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
Classifier network structure:
the first fully-connected layer, comprising 512 hidden units, with ReLU as the activation function;
the second fully-connected layer, comprising 24 hidden units, whose output is the predicted position index of the small images.
In the online fine tuning stage, in order to better capture the change of the target in the motion process, the appearance model needs to be updated online, and the specific steps are as follows:
step (1): an image which is the same as the center position of the previous frame of image but is larger is cut out from the search image, and the image is scaled to the 125x125 resolution size.
Step (2): use the image obtained in step (1) to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt) * Wt-1 + αt * W
where αt ∈ [0, 1].
Step (3): crop from the search image three images centered at the previous frame's target position but with different resolutions, and obtain the maximum response value of the three through the correlation filtering algorithm: the resolution with the maximum response gives the target size on the search image, and the position of the maximum gives the target's displacement. Then repeat step (1) with the current search image as the template.
Finally, it should be noted that the above embodiments are only preferred embodiments of the present invention, and the invention is not limited to them; those skilled in the art may modify or substitute parts of them without inventive effort. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in its protection scope.
Claims (9)
1. An unsupervised correlation filtering target tracking method based on a jigsaw task, characterized by comprising the following steps:
processing the input image;
adopting a twin depth network model to extract depth characteristics of the processed image;
training on a jigsaw task: processing the extracted depth features with a classifier network model and predicting the position indices of the small images;
performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
updating the appearance model on line and carrying out on-line fine adjustment;
the specific steps of data processing include:
cropping each picture about its center to a given proportion of the original picture's length and width;
scaling the cut picture to 125x125 resolution;
for the pictures of the same video sequence, randomly selecting 3 pictures as a template T, a search area S1 and a search area S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions of each selected picture (upper left, lower left, upper right and lower right), and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small picture by up to 2 pixels;
randomly shuffling the obtained small pictures, and taking the shuffled position index, the shuffled small pictures of 63x63 resolution and the large picture of 125x125 resolution as one group of training data for training.
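The data-preparation steps of claim 1 can be sketched as follows. This is an illustrative NumPy version under stated assumptions: all names are hypothetical, the jitter is modelled as a circular shift, and the final 50x50 → 63x63 rescale is omitted for brevity:

```python
# Illustrative sketch of the jigsaw data preparation: crop four corner
# patches from a 125x125 image, jitter each channel, shuffle the patches.
import numpy as np

def make_jigsaw_sample(img, rng):
    """img: 125x125xC array. Returns (shuffled patches, permutation)."""
    corners = [(0, 0), (75, 0), (0, 75), (75, 75)]  # TL, BL, TR, BR offsets
    patches = [img[y:y + 50, x:x + 50].copy() for (y, x) in corners]
    # per-channel jitter within 2 pixels (modelled here as a circular shift)
    patches = [np.stack([np.roll(p[..., c], int(rng.integers(-2, 3)), axis=0)
                         for c in range(p.shape[-1])], axis=-1)
               for p in patches]
    perm = rng.permutation(4)        # one of 4! = 24 possible orders
    shuffled = [patches[i] for i in perm]
    return shuffled, perm            # perm is the position-index label
```

The 24 possible permutations of four patches match the 24 output units of the classifier recited later in the claims.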
2. The jigsaw-task-based unsupervised correlation filtering target tracking method of claim 1, wherein the data processing uses the ILSVRC2015 dataset.
3. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 1, wherein the specific steps of depth feature extraction are as follows:
extracting depth features of corresponding 1x1 resolution from the 4 small images with the resolution of 63x63 through a twin convolution network model respectively;
extracting the characteristics of a specific layer from a picture with a resolution of 125x125 through a twin convolution network model;
the resulting layer features of different resolution sizes are scaled to a fixed resolution of 125x125 using bilinear interpolation.
4. The method for tracking an unsupervised related filtering target based on a jigsaw task as claimed in claim 3, wherein the specific steps of training the jigsaw task are as follows:
adjusting the 4 three-dimensional depth features obtained in the depth feature extraction into one-dimensional vectors, and concatenating them in a given order;
predicting the position index of the shuffled small pictures from the obtained feature vector through a classifier composed of several fully connected layers, and computing the cross-entropy loss between the prediction and the given position index.
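A minimal sketch of this prediction head follows. The weight shapes mirror the classifier structure recited in the claims (512 hidden units, 24 outputs); all names and the plain-NumPy formulation are assumptions:

```python
# Illustrative jigsaw prediction head: flatten and concatenate the four
# patch features, apply FC1 + ReLU, FC2, softmax, and cross-entropy loss.
import numpy as np

def jigsaw_head(patch_feats, w1, w2, target_index):
    """patch_feats: four 1-D feature vectors in a fixed order;
    w1: (512, d) FC1 weights; w2: (24, 512) FC2 weights."""
    x = np.concatenate([f.ravel() for f in patch_feats])  # flatten + concat
    h = np.maximum(w1 @ x, 0.0)                           # FC1 + ReLU
    logits = w2 @ h                                       # FC2: 24 logits
    logits = logits - logits.max()                        # stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[target_index])                   # cross-entropy
    return probs, loss
```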
5. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the specific step of unsupervised correlation filtering training comprises:
for the specific-layer features of the template T and its corresponding response image (the initial label), obtaining a response image of each layer by the correlation filtering algorithm, and summing them by weights to obtain the response image R_S1 of the search area S1;
for the specific-layer features of the search area S1 and the resulting response image R_S1, obtaining a response image of each layer by the correlation filtering algorithm, and summing them by weights to obtain the response image R_S2 of the search area S2;
for the specific-layer features of the search area S2 and the resulting response image R_S2, obtaining a response image of each layer by the correlation filtering algorithm, and summing them by weights to obtain the response image R_T of the template T.
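One step of this cycle can be sketched as a single-channel discriminative correlation filter solved in the Fourier domain. This is an illustrative stand-in, not the patent's exact formulation; the regularizer `lam` and all names are assumptions:

```python
# Illustrative closed-form correlation filter: learn a filter mapping
# `template` to the desired response `label`, then apply it to `search`.
import numpy as np

def dcf_response(template, label, search, lam=1e-4):
    """All arguments are HxW arrays of equal size; `lam` is a ridge
    regularizer that avoids division by near-zero spectra."""
    T = np.fft.fft2(template)
    Y = np.fft.fft2(label)
    S = np.fft.fft2(search)
    W = (Y * np.conj(T)) / (T * np.conj(T) + lam)  # closed-form filter
    return np.real(np.fft.ifft2(W * S))            # response map on search
```

Chaining such steps (template to S1, S1 to S2, S2 back to the template) and penalizing the mismatch between the final response and the initial label yields the unsupervised, cycle-consistent training signal described in the claim.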
6. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the twin convolution network model is:
the first convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 32 feature maps; the activation function is a linear rectification unit ReLU;
the second convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 32 feature maps, followed by local response normalization and maximum pooling;
the third convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 64 feature maps; the activation function is a linear rectification unit ReLU;
the fourth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 64 feature maps, followed by local response normalization and maximum pooling;
the fifth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 128 feature maps; the activation function is a linear rectification unit ReLU;
the sixth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 128 feature maps, followed by local response normalization and maximum pooling;
the seventh convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps; the activation function is a linear rectification unit ReLU;
the eighth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps, followed by local response normalization and maximum pooling;
the ninth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps; the activation function is a linear rectification unit ReLU;
the tenth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps, followed by local response normalization and maximum pooling.
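The ten-layer structure can be summarized with a small shape-bookkeeping sketch. Note the claim does not state padding or pooling parameters, so 'same' padding and a 2x2/stride-2 pool after each even layer are assumptions here (claim 3's 63x63 → 1x1 mapping implies different padding, which this sketch does not attempt to reproduce):

```python
# Shape bookkeeping for the ten-layer twin network of claim 6, under the
# assumed 'same' padding and 2x2/stride-2 pooling after even layers.
def feature_map_plan(in_res=125):
    channels = [32, 32, 64, 64, 128, 128, 256, 256, 256, 256]
    res = in_res
    plan = []
    for i, c in enumerate(channels, start=1):
        if i % 2 == 0:            # LRN + assumed max pool halves resolution
            res = res // 2
        plan.append((i, c, res))  # (layer index, feature maps, resolution)
    return plan
```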
7. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the classifier network structure is as follows:
the first fully connected layer comprises 512 hidden units, with a linear rectification unit (ReLU) as the activation function;
the second fully connected layer comprises 24 hidden units and outputs the predicted position index of the small images.
8. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 1, wherein the specific step of online fine tuning comprises:
cropping and scaling: an image centered at the same position as in the previous frame but of larger size is cropped from the search image and scaled to 125x125 resolution;
the resulting image is used to update the parameters of the correlation filtering algorithm: W_t = (1 - α_t)·W_{t-1} + α_t·W, where α_t ∈ [0, 1];
three images with the same center position as the previous frame but different resolutions are cropped from the search image, and the correlation filtering algorithm is applied to each; the resolution of the image with the largest response value gives the size of the target on the search image, and the position of the peak response gives the direction of target motion; the cropping and scaling steps are then repeated with the current search image as the template.
9. An unsupervised correlation filtering target tracking system based on a jigsaw task, comprising:
a data input device;
the data processing module is used for processing the input image;
the depth feature extraction module is used for extracting depth features of the processed image by adopting the twin depth network model;
the jigsaw task training module is used for processing the extracted depth features by using a classifier network model and predicting the position index of the small image;
the unsupervised correlation filtering training module is used for performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
the online fine tuning module is used for online updating the appearance model;
the specific steps of the data processing module for data processing comprise:
cropping each picture about its center to a given proportion of the original picture's length and width;
scaling the cut picture to 125x125 resolution;
for the pictures of the same video sequence, randomly selecting 3 pictures as a template T, a search area S1 and a search area S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions of each selected picture (upper left, lower left, upper right and lower right), and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small picture by up to 2 pixels;
randomly shuffling the obtained small pictures, and taking the shuffled position index, the shuffled small pictures of 63x63 resolution and the large picture of 125x125 resolution as one group of training data for training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010201902.7A CN111415318B (en) | 2020-03-20 | 2020-03-20 | Unsupervised related filtering target tracking method and system based on jigsaw task |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111415318A CN111415318A (en) | 2020-07-14 |
CN111415318B true CN111415318B (en) | 2023-06-13 |
Family
ID=71494404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010201902.7A Active CN111415318B (en) | 2020-03-20 | 2020-03-20 | Unsupervised related filtering target tracking method and system based on jigsaw task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111415318B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016591A (en) * | 2020-08-04 | 2020-12-01 | 杰创智能科技股份有限公司 | Training method of image recognition model and image recognition method |
CN113240591B (en) * | 2021-04-13 | 2022-10-04 | 浙江大学 | Sparse deep completion method based on countermeasure network |
CN113112518B (en) * | 2021-04-19 | 2024-03-26 | 深圳思谋信息科技有限公司 | Feature extractor generation method and device based on spliced image and computer equipment |
CN113192062A (en) * | 2021-05-25 | 2021-07-30 | 湖北工业大学 | Arterial plaque ultrasonic image self-supervision segmentation method based on image restoration |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN110211192A (en) * | 2019-05-13 | 2019-09-06 | 南京邮电大学 | A kind of rendering method based on the threedimensional model of deep learning to two dimensional image |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9218365B2 (en) * | 2011-12-15 | 2015-12-22 | Yeda Research And Development Co. Ltd. | Device, system, and method of visual inference by collaborative composition |
US11055854B2 (en) * | 2018-08-23 | 2021-07-06 | Seoul National University R&Db Foundation | Method and system for real-time target tracking based on deep learning |
2020-03-20 CN CN202010201902.7A patent/CN111415318B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN110211192A (en) * | 2019-05-13 | 2019-09-06 | 南京邮电大学 | A kind of rendering method based on the threedimensional model of deep learning to two dimensional image |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
Non-Patent Citations (1)
Title |
---|
Liu Kan et al. A wireless positioning method based on deep neural networks. Computer Engineering, 2016, (No. 07), 88-91. *
Also Published As
Publication number | Publication date |
---|---|
CN111415318A (en) | 2020-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111415318B (en) | Unsupervised related filtering target tracking method and system based on jigsaw task | |
US11637971B2 (en) | Automatic composition of composite images or videos from frames captured with moving camera | |
US10719940B2 (en) | Target tracking method and device oriented to airborne-based monitoring scenarios | |
Zhang et al. | SiamFT: An RGB-infrared fusion tracking method via fully convolutional Siamese networks | |
CN109410242B (en) | Target tracking method, system, equipment and medium based on double-current convolutional neural network | |
KR20220108165A (en) | Target tracking method, apparatus, electronic device and storage medium | |
CN113011329B (en) | Multi-scale feature pyramid network-based and dense crowd counting method | |
CN111260688A (en) | Twin double-path target tracking method | |
CN111639571B (en) | Video action recognition method based on contour convolution neural network | |
CN111696110A (en) | Scene segmentation method and system | |
CN112232134A (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN113255429B (en) | Method and system for estimating and tracking human body posture in video | |
Lin et al. | High resolution animated scenes from stills | |
CN115713546A (en) | Lightweight target tracking algorithm for mobile terminal equipment | |
CN114036969A (en) | 3D human body action recognition algorithm under multi-view condition | |
CN112700476A (en) | Infrared ship video tracking method based on convolutional neural network | |
CN113592900A (en) | Target tracking method and system based on attention mechanism and global reasoning | |
CN115862130B (en) | Behavior recognition method based on human body posture and trunk sports field thereof | |
CN115761885B (en) | Behavior recognition method for common-time and cross-domain asynchronous fusion driving | |
CN112257638A (en) | Image comparison method, system, equipment and computer readable storage medium | |
CN109492530B (en) | Robust visual object tracking method based on depth multi-scale space-time characteristics | |
Gupta et al. | Reconnoitering the Essentials of Image and Video Processing: A Comprehensive Overview | |
Xue et al. | Multiscale feature extraction network for real-time semantic segmentation of road scenes on the autonomous robot | |
CN114743045B (en) | Small sample target detection method based on double-branch area suggestion network | |
CN113627410B (en) | Method for recognizing and retrieving action semantics in video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||