CN111415318B - Unsupervised correlation filtering target tracking method and system based on jigsaw task - Google Patents

Unsupervised correlation filtering target tracking method and system based on jigsaw task

Info

Publication number
CN111415318B
CN111415318B
Authority
CN
China
Prior art keywords
image
training
task
resolution
unsupervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010201902.7A
Other languages
Chinese (zh)
Other versions
CN111415318A (en)
Inventor
张伟
王嘉伦
宋柯
宋然
顾建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010201902.7A priority Critical patent/CN111415318B/en
Publication of CN111415318A publication Critical patent/CN111415318A/en
Application granted granted Critical
Publication of CN111415318B publication Critical patent/CN111415318B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20024 Filtering details
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20132 Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an unsupervised correlation filtering target tracking method and system based on a jigsaw task, comprising two stages: offline pre-training and online fine-tuning. In the offline pre-training stage, the jigsaw-task-based neural network training combines two tasks: unsupervised correlation filtering training and jigsaw task training. The training process divides into four parts: data processing, depth feature extraction, jigsaw task training, and unsupervised correlation filtering training. By introducing a prediction task for image-block position indices alongside the unsupervised correlation filtering training, the invention strengthens the deep neural network's ability to extract object detail features, and by fusing features of different layers the algorithm takes both semantic information and position information into account, improving accuracy.

Description

Unsupervised correlation filtering target tracking method and system based on jigsaw task
Technical Field
The invention relates to the field of automatic identification, in particular to an unsupervised correlation filtering target tracking method and system based on a jigsaw task.
Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Target tracking is an important subject in the field of computer vision research and has broad application prospects in everyday life. Specific application fields include intelligent video surveillance, three-dimensional reconstruction, human-computer interaction, image understanding, and intelligent visual navigation. A target tracking task estimates the exact position and occupied area of a moving target in a continuous video sequence to obtain its complete motion trajectory, enabling analysis and understanding of the target's behavior and laying the groundwork for subsequent high-level tasks. Although the performance of target tracking algorithms has improved greatly, achieving real-time, stable tracking of a moving target in complex real-world scenes still faces great challenges. These challenges arise both from changes in the moving target itself, such as shape and pose changes, and from external factors, such as motion blur, occlusion, background clutter, and illumination changes, all of which create difficulties for target tracking.
Tracking algorithms based on correlation filtering are a research hotspot in the current target tracking field. This approach uses the target image to train a discriminative correlation filter, applies correlation filtering to the search-area image, and finds the position of the maximum of the filter response map, which is the corresponding target position. Correlation filtering algorithms are widely used at present because of their good accuracy and fast tracking speed. In addition, target tracking algorithms based on deep learning have achieved great success in the target tracking field, mainly because features extracted by deep neural networks have stronger expressive and anti-interference capability than traditional hand-crafted features. Shallower features contain more target position information, while deeper features contain more target semantic information, so fusing features of different layers improves the accuracy of the overall algorithm. Meanwhile, because training data for the target tracking task is severely lacking, offline pre-training in an unsupervised manner followed by online fine-tuning of the model offers a feasible direction for the practical application of deep learning to target tracking. To guarantee both the speed and the precision of the target tracking task, many algorithms combine correlation filtering with deep learning and achieve good performance gains on the relevant datasets.
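For concreteness, the following is a minimal single-channel sketch of the classical closed-form correlation filter described above (a MOSSE/KCF-style ridge-regression solve in the Fourier domain); the Gaussian label width, the regularization constant lam, and the helper names are illustrative assumptions, not values or functions taken from the patent.

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    # Closed-form ridge-regression correlation filter solved element-wise
    # in the Fourier domain (MOSSE/KCF-style).
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def respond(W, z):
    # Correlate the learned filter with a search patch; the peak of the
    # response map marks the predicted target position.
    return np.real(np.fft.ifft2(W * np.fft.fft2(z)))

# Toy usage with an illustrative Gaussian label centred on the target.
h = w = 125
yy, xx = np.mgrid[0:h, 0:w]
label = np.exp(-(((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * 5.0 ** 2)))
target_feat = np.random.rand(h, w)                        # stand-in feature map
W = train_filter(target_feat, label)
search_feat = np.roll(target_feat, (4, -3), axis=(0, 1))  # shifted copy of the target
dy, dx = np.unravel_index(respond(W, search_feat).argmax(), (h, w))
```

The response peak (dy, dx) recovers the shift applied to the search patch, which is exactly the displacement a correlation-filter tracker reads off at each frame.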
The inventors found that existing unsupervised algorithms have insufficient capability to extract object detail features and struggle to take both semantic information and position information into account, so their accuracy still needs improvement.
Disclosure of Invention
To remedy the deficiencies of the prior art, the invention provides an unsupervised correlation filtering target tracking method and system based on a jigsaw task. During the training of the unsupervised correlation filtering algorithm, a prediction task for image-block position indices is introduced at the same time to strengthen the deep neural network's extraction of object detail features, and by fusing features of different layers the algorithm takes both semantic information and position information into account, improving accuracy. The method comprises two stages: offline pre-training and online fine-tuning.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides an unsupervised related filtering target tracking method based on a jigsaw task, which comprises the following steps:
processing the input image;
extracting depth features from the processed image with a twin (Siamese) depth network model;
jigsaw task training: processing the extracted depth features with a classifier network model to predict the position indices of the small images;
performing unsupervised correlation filtering training on the multi-layer features obtained by depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
updating the appearance model online for online fine-tuning.
In the offline pre-training stage, the jigsaw-task-based neural network training combines two tasks: unsupervised correlation filtering training and jigsaw task training. The depth features of the input image are extracted through the twin network and fused; meanwhile, the jigsaw task training gives the twin network model's features stronger generality and detail-extraction capability, yielding higher tracking precision.
In a second aspect of the present invention, there is provided an unsupervised correlation filtering target tracking system based on a jigsaw task, comprising:
a data input device;
a data processing module for processing the input image;
a depth feature extraction module for extracting depth features from the processed image with the twin depth network model;
a jigsaw task training module for processing the extracted depth features with a classifier network model and predicting the position indices of the small images;
an unsupervised correlation filtering training module for performing unsupervised correlation filtering training on the multi-layer features obtained by depth feature extraction;
and an online fine-tuning module for updating the appearance model online.
The system of the invention is simple to operate, convenient to interact with, and of great reference and practical value.
The invention has the beneficial effects that:
(1) During the training of the unsupervised correlation filtering algorithm, the invention introduces a prediction task for image-block position indices at the same time, strengthening the deep neural network's ability to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, improving accuracy.
(2) The system of the invention is simple to operate, convenient to interact with, and of great reference and practical value.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is the jigsaw task training network of Embodiment 1;
FIG. 2 is the overall training framework of Embodiment 1.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Term interpretation:
Search: search region;
Template: template region;
CNN: convolutional neural network;
Correlation Filter: correlation filter;
Pseudo Label: pseudo label;
Consistency Loss: consistency loss function;
Initial Label: initial label.
An unsupervised correlation filtering target tracking method based on a jigsaw task comprises the following steps:
processing the input image;
extracting depth features from the processed image with the twin depth network model;
jigsaw task training: processing the extracted depth features with a classifier network model to predict the position indices of the small images;
performing unsupervised correlation filtering training on the multi-layer features obtained by depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
updating the appearance model online for online fine-tuning.
In some embodiments, the data processing uses the ILSVRC2015 dataset.
In some embodiments, the specific steps of the data processing include:
centrally cropping each picture to a size whose length and width are a fixed fraction of the original picture (the crop-size expression is rendered only as an image in the original publication);
scaling the cut picture to 125x125 resolution;
for pictures of the same video sequence, randomly selecting 3 pictures as the template T and the search areas S1 and S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions upper-left, lower-left, upper-right, and lower-right of each selected picture, and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small image by up to 2 pixels;
randomly shuffling the obtained small images, and using the shuffled position index, the shuffled small images of 63x63 resolution, and the large image of 125x125 resolution as one group of training data for training.
In some embodiments, the specific steps of depth feature extraction are:
passing each of the 4 small images of 63x63 resolution through the twin convolutional network model to extract depth features of 1x1 resolution;
passing the picture of 125x125 resolution through the twin convolutional network model to extract features of specific layers;
scaling the resulting layer features of different resolutions to a fixed 125x125 resolution using bilinear interpolation.
In some embodiments, the specific steps of the jigsaw task training are:
reshaping the 4 three-dimensional depth features obtained during depth feature extraction into one-dimensional vectors and concatenating them in a given order;
passing the obtained feature vector through a classifier composed of several fully connected layers to predict the position index of the shuffled small images, and computing a cross-entropy loss between the prediction and the given position index. The depth features of the input image are extracted through the twin network and fused; meanwhile, the jigsaw task training gives the twin network model's features stronger generality and detail-extraction capability, yielding higher tracking precision.
In some embodiments, the specific steps of the unsupervised correlation filter training include:
features F for particular layers t 、F s1 And corresponding response image Y t The response image of the corresponding layer can be obtained by using a correlation filtering algorithm, and the search area S is obtained by adding weights 1 Response image R of (2) s1
Features F for particular layers s1 、F s2 The obtained response image R s1 The response image of the corresponding layer can be obtained by using a correlation filtering algorithm, and the search area S can be obtained by adding weights 2 Response image R of (2) s2
Features F for particular layers s2 、F t The obtained response image R s2 The response image of the corresponding layer can be obtained by using a correlation filtering algorithm, and the response image R of the template T can be obtained by adding weights t
For the obtained response image R t And an original response image Y t A mean square loss function is calculated. Template T, search area S of the same video sequence 1 Search area S 2 And obtaining the multi-layer features through a depth feature extraction step. The shallow layer features have more position information, but the semantic information is not obvious, the deep layer features have more semantic information, the anti-interference capability is strong, but the necessary position information is lacking, and the improvement of the target tracking effect is facilitated by combining the features of different layers.
In some embodiments, the twin convolutional network structure is:
the first convolution layer: convolution kernel 3x3, stride 1x1, outputting 32 feature maps; the activation function is the linear rectification unit ReLU;
the second convolution layer: convolution kernel 3x3, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: convolution kernel 3x3, stride 1x1, outputting 64 feature maps; the activation function is the linear rectification unit ReLU;
the fourth convolution layer: convolution kernel 3x3, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: convolution kernel 3x3, stride 1x1, outputting 128 feature maps; the activation function is the linear rectification unit ReLU;
the sixth convolution layer: convolution kernel 3x3, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; the activation function is the linear rectification unit ReLU;
the eighth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; the activation function is the linear rectification unit ReLU;
the tenth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
In some embodiments, the classifier network structure is:
the first fully connected layer, comprising 512 hidden units, with the linear rectification unit ReLU as activation function;
the second fully connected layer, comprising 24 hidden units, whose output is the predicted small-image position index.
In some embodiments, the specific steps of the online trimming include:
cropping and scaling: cropping from the search image an image centered at the same position as in the previous frame but larger, and scaling it to 125x125 resolution;
the resulting image is used to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt)·Wt-1 + αt·W
where αt ∈ [0,1];
cropping from the search image three images centered at the same position as in the previous frame but with different resolutions, and finding the largest response value among the three via the correlation filtering algorithm; the resolution of the winning image gives the target's size on the search image, and the position of the peak response gives the target's direction of motion; the cropping and scaling steps are then repeated with the current search image as the template.
The invention will now be described in further detail with reference to the following specific examples, which should be construed as illustrative rather than limiting.
Example 1:
the invention provides an unsupervised related filtering target tracking method and system based on a jigsaw task, wherein the method comprises two stages of offline pre-training and online fine tuning.
In the offline pre-training stage, the jigsaw-task-based neural network training combines two tasks: unsupervised correlation filtering training and jigsaw task training. The depth features of the input image are extracted through the twin network and fused; meanwhile, the jigsaw task training gives the twin network model's features stronger generality and detail-extraction capability, yielding higher tracking precision. The training process divides into four parts: data processing, depth feature extraction, jigsaw task training, and unsupervised correlation filtering training.
Data processing: the ILSVRC2015 dataset is used in the training process; the specific steps are as follows:
step (1): cutting each picture centrally to obtain original picture with length and width
Figure BDA0002419672830000091
Size of the product.
Step (2): scale the picture cropped in step (1) to 125x125 resolution.
Step (3): for pictures of the same video sequence, randomly select 3 pictures as the template T and the search areas S1 and S2.
Step (4): from each picture selected in step (3), crop non-overlapping small pictures of 50x50 resolution at the upper-left, lower-left, upper-right, and lower-right positions, and scale them to 63x63 resolution.
Step (5): randomly jitter each channel of each small image cropped in step (4) by up to 2 pixels.
Step (6): randomly shuffle the small images obtained in step (5), and use the shuffled position index, the shuffled small images of 63x63 resolution, and the large image of 125x125 resolution as one group of training data for training.
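The following is a minimal sketch of the data-processing steps above. A half-size central crop is assumed (the exact crop ratio appears only as an equation image in the original), the per-channel 2-pixel jitter of step (5) is omitted for brevity, and patches are cut from the template image only for illustration.

```python
import random
from PIL import Image

def make_training_group(frame_paths):
    # Build one training group from frames of the same video sequence:
    # template T and search areas S1, S2, plus shuffled jigsaw patches.
    big_imgs = []
    for path in random.sample(frame_paths, 3):
        im = Image.open(path).convert('RGB')
        w, h = im.size
        # Central crop; the crop ratio is assumed (half size) for illustration.
        im = im.crop((w // 4, h // 4, 3 * w // 4, 3 * h // 4)).resize((125, 125))
        big_imgs.append(im)

    # Four non-overlapping 50x50 corner patches of the template, scaled to 63x63.
    boxes = [(0, 0, 50, 50), (0, 75, 50, 125), (75, 0, 125, 50), (75, 75, 125, 125)]
    patches = [big_imgs[0].crop(b).resize((63, 63)) for b in boxes]

    order = list(range(4))
    random.shuffle(order)            # shuffled position index = the jigsaw label
    return [patches[i] for i in order], order, big_imgs
```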
Depth feature extraction: the method uses a twin depth network model for feature extraction, and comprises the following specific steps:
step (1): and 4 small images with the resolution of 63x63 are respectively extracted into depth features with the resolution of 1x1 through a twin convolution network model.
Step (2): the 125x125 resolution picture is used for extracting the characteristics of a specific layer through a twin convolution network model.
Step (3): for layer features of different resolution sizes in step (2), bilinear interpolation is used to scale them to a fixed resolution of 125x 125.
Jigsaw task training: the position index of the small images is predicted using a classifier network model, as shown in FIG. 1. The specific steps are as follows:
step (1): and (3) adjusting the 4 three-dimensional depth features obtained in the depth feature extraction step (1) into one-dimensional vectors, and connecting the three-dimensional depth features together in a given order.
Step (2): and (3) predicting the position index of the disturbed small image by using the feature vector obtained in the step (1) through a classifier formed by a plurality of full-connection layers, and calculating a cross entropy loss function between the position index and the given position index.
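A minimal sketch of the jigsaw classifier and its cross-entropy loss follows; the 256-dimensional patch descriptors and the batch size are assumptions for illustration, and the 24 output units correspond to the 4! = 24 possible orderings of the four patches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed dimensions: each of the 4 shuffled patches yields a 256-channel 1x1
# descriptor from the twin network, so the concatenated vector has 4*256 entries.
classifier = nn.Sequential(
    nn.Linear(4 * 256, 512),
    nn.ReLU(inplace=True),       # first fully connected layer, 512 hidden units
    nn.Linear(512, 24),          # second fully connected layer, 24 position indices
)

patch_feats = torch.randn(8, 4, 256)         # batch of 8 groups of patch descriptors
logits = classifier(patch_feats.flatten(1))  # reshape to 1-D and concatenate in order
perm_index = torch.randint(0, 24, (8,))      # placeholder ground-truth shuffle indices
jigsaw_loss = F.cross_entropy(logits, perm_index)
```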
Unsupervised correlation filtering training: the specific training process is shown in FIG. 2. The template T and the search areas S1 and S2 of the same video sequence yield multi-layer features through depth feature extraction step (3). Shallow features carry more position information but weaker semantic information; deep features carry more semantic information and resist interference well but lack the necessary position information; combining features of different layers helps improve the tracking effect. During training, the correlation filtering algorithm is applied cyclically to the features of different pictures, with the following specific steps:
step (1): features F for particular layers t 、F s1 And corresponding response image Y t The response image of the corresponding layer can be obtained by using a correlation filtering algorithm, and the search area S is obtained by adding weights 1 Response image R of (2) s1
Step (2): features F for particular layers s1 、F s2 And the response image R obtained in the step (1) s1 The response image of the corresponding layer can be obtained by using a correlation filtering algorithm, and the search area S can be obtained by adding weights 2 Response image R of (2) s2
Step (3): features F for particular layers s2 、F t And the response image R obtained in the step (2) s2 The response image of the corresponding layer can be obtained by using a correlation filtering algorithm, and the response image R of the template T can be obtained by adding weights t
Step (4): for the response image R obtained in step (3) t And an original response image Y t A mean square loss function is calculated.
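The following sketch illustrates this cyclic training for a single layer and a single feature channel, with the per-layer weighted summation omitted; cf_solve and cf_respond are illustrative helper names, not functions from the patent.

```python
import numpy as np

def cf_solve(x, y, lam=1e-4):
    # Fourier-domain ridge-regression solve for a correlation filter.
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def cf_respond(W, z):
    return np.real(np.fft.ifft2(W * np.fft.fft2(z)))

def cycle_consistency_loss(F_t, F_s1, F_s2, Y_t):
    # Step (1): the filter trained on the template tracks into S1 (pseudo label).
    R_s1 = cf_respond(cf_solve(F_t, Y_t), F_s1)
    # Step (2): the filter trained on S1 with pseudo label R_s1 tracks into S2.
    R_s2 = cf_respond(cf_solve(F_s1, R_s1), F_s2)
    # Step (3): the filter trained on S2 with pseudo label R_s2 tracks back to T.
    R_t = cf_respond(cf_solve(F_s2, R_s2), F_t)
    # Step (4): consistency (mean-square) loss against the initial label.
    return np.mean((R_t - Y_t) ** 2)
```

Because the cycle must return to the initial label Yt, the network can be trained without any manual annotation: a poor tracker drifts during the T → S1 → S2 → T loop and is penalized by the consistency loss.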
The following is the complete network structure of the present invention:
twinning convolutional network structure:
the first convolution layer: convolution kernel 3x3, stride 1x1, outputting 32 feature maps; the activation function is the linear rectification unit ReLU;
the second convolution layer: convolution kernel 3x3, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: convolution kernel 3x3, stride 1x1, outputting 64 feature maps; the activation function is the linear rectification unit ReLU;
the fourth convolution layer: convolution kernel 3x3, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: convolution kernel 3x3, stride 1x1, outputting 128 feature maps; the activation function is the linear rectification unit ReLU;
the sixth convolution layer: convolution kernel 3x3, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; the activation function is the linear rectification unit ReLU;
the eighth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; the activation function is the linear rectification unit ReLU;
the tenth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
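A sketch of one branch of this twin network in PyTorch follows; the convolution padding of 1 (chosen so that five 2x2 poolings map a 63x63 patch to a 1x1 descriptor, consistent with the stated resolutions) and the local-response-normalization neighborhood size of 5 are assumptions, since the patent does not state them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinBackbone(nn.Module):
    # One branch of the twin (Siamese) network; both branches share these weights.
    def __init__(self):
        super().__init__()
        chans = [3, 32, 32, 64, 64, 128, 128, 256, 256, 256, 256]
        layers = []
        for i in range(10):
            # padding=1 is assumed so five 2x2 poolings give 63x63 -> 1x1.
            layers.append(nn.Conv2d(chans[i], chans[i + 1], 3, stride=1, padding=1))
            if i % 2 == 0:
                layers.append(nn.ReLU(inplace=True))    # odd layers: ReLU
            else:
                layers.append(nn.LocalResponseNorm(5))  # even layers: LRN
                layers.append(nn.MaxPool2d(2, 2))       # followed by max pooling
        self.features = nn.ModuleList(layers)

    def forward(self, x, fuse=True):
        stages = []
        for layer in self.features:
            x = layer(x)
            if isinstance(layer, nn.MaxPool2d):
                stages.append(x)                 # one feature map per stage
        if not fuse:
            return stages[-1]                    # 63x63 patch -> final 1x1 descriptor
        # Multi-layer fusion: bilinearly rescale every stage to 125x125.
        return [F.interpolate(s, size=(125, 125), mode='bilinear',
                              align_corners=False) for s in stages]

net = TwinBackbone()
fused = net(torch.randn(1, 3, 125, 125))                 # 5 maps, each 125x125
patch_vec = net(torch.randn(1, 3, 63, 63), fuse=False).flatten(1)  # shape (1, 256)
```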
Classifier network structure:
the first fully connected layer, comprising 512 hidden units, with the linear rectification unit ReLU as activation function;
the second fully connected layer, comprising 24 hidden units, whose output is the predicted small-image position index.
In the online fine tuning stage, in order to better capture the change of the target in the motion process, the appearance model needs to be updated online, and the specific steps are as follows:
step (1): an image which is the same as the center position of the previous frame of image but is larger is cut out from the search image, and the image is scaled to the 125x125 resolution size.
Step (2): the image obtained in step (1) is used to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt)·Wt-1 + αt·W
where αt ∈ [0,1].
Step (3): crop from the search image three images centered at the same position as in the previous frame but with different resolutions, and find the largest response value among the three via the correlation filtering algorithm; the resolution of the winning image gives the target's size on the search image, and the position of the peak response gives the target's direction of motion. Then repeat step (1) with the current search image as the template.
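A minimal sketch of this online update and multi-scale search follows; update_filter and locate_over_scales are illustrative helper names, and the learning-rate value is an assumption, not a value from the patent.

```python
import numpy as np

def update_filter(W_prev, W_new, alpha_t=0.02):
    # Linear interpolation update: Wt = (1 - αt)·Wt-1 + αt·W, with αt in [0, 1].
    # alpha_t = 0.02 is an illustrative learning rate only.
    assert 0.0 <= alpha_t <= 1.0
    return (1.0 - alpha_t) * W_prev + alpha_t * W_new

def locate_over_scales(responses):
    # responses: response maps of three crops taken at different resolutions.
    # The map holding the global maximum fixes the target scale; the position
    # of that peak gives the target's displacement on the search image.
    peaks = [r.max() for r in responses]
    k = int(np.argmax(peaks))
    dy, dx = np.unravel_index(responses[k].argmax(), responses[k].shape)
    return k, (dy, dx)
```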
Finally, it should be noted that the above embodiments are only preferred embodiments of the invention, and the invention is not limited to them; those skilled in the art may modify or substitute parts of them without inventive effort. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its protection scope.

Claims (9)

1. An unsupervised correlation filtering target tracking method based on a jigsaw task, characterized by comprising the following steps:
processing the input image;
adopting a twin depth network model to extract depth characteristics of the processed image;
training a jigsaw task, processing the extracted depth features by using a classifier network model, and predicting the position index of the small image;
performing unsupervised correlation filtering training on the multi-layer features obtained by depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
updating the appearance model online for online fine-tuning;
the specific steps of data processing include:
centrally cropping each picture to a size whose length and width are a fixed fraction of the original picture (the crop-size expression is rendered only as an image in the original publication);
scaling the cropped picture to 125x125 resolution;
for pictures of the same video sequence, randomly selecting 3 pictures as the template T and the search areas S1 and S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions upper-left, lower-left, upper-right, and lower-right of each selected picture, and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small image by up to 2 pixels;
randomly shuffling the obtained small images, and using the shuffled position index, the shuffled small images of 63x63 resolution, and the large image of 125x125 resolution as one group of training data for training.
2. The jigsaw-task-based unsupervised correlation filtering target tracking method of claim 1, wherein the data processing uses the ILSVRC2015 dataset.
3. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 1, wherein the specific steps of depth feature extraction are as follows:
passing each of the 4 small images of 63x63 resolution through the twin convolutional network model to extract depth features of 1x1 resolution;
passing the picture of 125x125 resolution through the twin convolutional network model to extract features of specific layers;
scaling the resulting layer features of different resolutions to a fixed 125x125 resolution using bilinear interpolation.
4. The method for tracking an unsupervised related filtering target based on a jigsaw task as claimed in claim 3, wherein the specific steps of training the jigsaw task are as follows:
reshaping the 4 three-dimensional depth features obtained during depth feature extraction into one-dimensional vectors and concatenating them in a given order;
passing the obtained feature vector through a classifier composed of several fully connected layers to predict the position index of the shuffled small images, and computing a cross-entropy loss between the prediction and the given position index.
5. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the specific steps of the unsupervised correlation filtering training comprise:
for features Ft and Fs1 of specific layers and the corresponding response image Yt, obtaining the response image of each corresponding layer with the correlation filtering algorithm, and obtaining the response image Rs1 of search area S1 by weighted summation;
for features Fs1 and Fs2 of specific layers and the obtained response image Rs1, obtaining the response image of each corresponding layer with the correlation filtering algorithm, and obtaining the response image Rs2 of search area S2 by weighted summation;
for features Fs2 and Ft of specific layers and the obtained response image Rs2, obtaining the response image of each corresponding layer with the correlation filtering algorithm, and obtaining the response image Rt of the template T by weighted summation;
computing a mean-square loss between the obtained response image Rt and the original response image Yt.
6. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the twin convolution network model is:
the first convolution layer: convolution kernel 3x3, stride 1x1, outputting 32 feature maps; the activation function is the linear rectification unit ReLU;
the second convolution layer: convolution kernel 3x3, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: convolution kernel 3x3, stride 1x1, outputting 64 feature maps; the activation function is the linear rectification unit ReLU;
the fourth convolution layer: convolution kernel 3x3, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: convolution kernel 3x3, stride 1x1, outputting 128 feature maps; the activation function is the linear rectification unit ReLU;
the sixth convolution layer: convolution kernel 3x3, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; the activation function is the linear rectification unit ReLU;
the eighth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; the activation function is the linear rectification unit ReLU;
the tenth convolution layer: convolution kernel 3x3, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
7. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the classifier network structure is as follows:
the first fully connected layer, comprising 512 hidden units, with the linear rectification unit ReLU as activation function;
the second fully connected layer, comprising 24 hidden units, whose output is the predicted small-image position index.
8. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 1, wherein the specific steps of the online fine-tuning comprise:
cropping and scaling: cropping from the search image an image centered at the same position as in the previous frame but larger, and scaling it to 125x125 resolution;
the resulting image is used to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt)·Wt-1 + αt·W
where αt ∈ [0,1];
cropping from the search image three images centered at the same position as in the previous frame but with different resolutions, and finding the largest response value among the three via the correlation filtering algorithm; the resolution of the winning image gives the target's size on the search image, and the position of the peak response gives the target's direction of motion; the cropping and scaling steps are then repeated with the current search image as the template.
9. An unsupervised correlation filtering target tracking system based on a jigsaw task, comprising:
a data input device;
the data processing module is used for processing the input image;
the depth feature extraction module is used for extracting depth features of the processed image by adopting the twin depth network model;
the jigsaw task training module is used for processing the extracted depth features by using a classifier network model and predicting the position index of the small image;
an unsupervised correlation filtering training module for performing unsupervised correlation filtering training on the multi-layer features obtained by depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
the online fine tuning module is used for online updating the appearance model;
the specific steps of the data processing performed by the data processing module comprise:
centrally cropping each picture to a size whose length and width are a fixed fraction of the original picture (the crop-size expression is rendered only as an image in the original publication);
scaling the cropped picture to 125x125 resolution;
for pictures of the same video sequence, randomly selecting 3 pictures as the template T and the search areas S1 and S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions upper-left, lower-left, upper-right, and lower-right of each selected picture, and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small image by up to 2 pixels;
randomly shuffling the obtained small images, and using the shuffled position index, the shuffled small images of 63x63 resolution, and the large image of 125x125 resolution as one group of training data for training.
CN202010201902.7A 2020-03-20 2020-03-20 Unsupervised correlation filtering target tracking method and system based on jigsaw task Active CN111415318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010201902.7A CN111415318B (en) 2020-03-20 2020-03-20 Unsupervised correlation filtering target tracking method and system based on jigsaw task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010201902.7A CN111415318B (en) 2020-03-20 2020-03-20 Unsupervised correlation filtering target tracking method and system based on jigsaw task

Publications (2)

Publication Number Publication Date
CN111415318A CN111415318A (en) 2020-07-14
CN111415318B true CN111415318B (en) 2023-06-13

Family

ID=71494404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010201902.7A Active CN111415318B (en) 2020-03-20 2020-03-20 Unsupervised correlation filtering target tracking method and system based on jigsaw task

Country Status (1)

Country Link
CN (1) CN111415318B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016591A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Training method of image recognition model and image recognition method
CN113240591B (en) * 2021-04-13 2022-10-04 浙江大学 Sparse deep completion method based on countermeasure network
CN113112518B (en) * 2021-04-19 2024-03-26 深圳思谋信息科技有限公司 Feature extractor generation method and device based on spliced image and computer equipment
CN113192062A (en) * 2021-05-25 2021-07-30 湖北工业大学 Arterial plaque ultrasonic image self-supervision segmentation method based on image restoration

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN110211192A (en) * 2019-05-13 2019-09-06 南京邮电大学 A kind of rendering method based on the threedimensional model of deep learning to two dimensional image
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9218365B2 (en) * 2011-12-15 2015-12-22 Yeda Research And Development Co. Ltd. Device, system, and method of visual inference by collaborative composition
US11055854B2 (en) * 2018-08-23 2021-07-06 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN110211192A (en) * 2019-05-13 2019-09-06 南京邮电大学 A kind of rendering method based on the threedimensional model of deep learning to two dimensional image
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Kan et al. A wireless positioning method based on deep neural networks. Computer Engineering, 2016, (07), 88-91. *

Also Published As

Publication number Publication date
CN111415318A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN111415318B (en) Unsupervised correlation filtering target tracking method and system based on jigsaw task
US11637971B2 (en) Automatic composition of composite images or videos from frames captured with moving camera
US10719940B2 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
Zhang et al. SiamFT: An RGB-infrared fusion tracking method via fully convolutional Siamese networks
CN109410242B (en) Target tracking method, system, equipment and medium based on double-current convolutional neural network
KR20220108165A (en) Target tracking method, apparatus, electronic device and storage medium
CN113011329B (en) Multi-scale feature pyramid network-based and dense crowd counting method
CN111260688A (en) Twin double-path target tracking method
CN111639571B (en) Video action recognition method based on contour convolution neural network
CN111696110A (en) Scene segmentation method and system
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113255429B (en) Method and system for estimating and tracking human body posture in video
Lin et al. High resolution animated scenes from stills
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
CN114036969A (en) 3D human body action recognition algorithm under multi-view condition
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
CN113592900A (en) Target tracking method and system based on attention mechanism and global reasoning
CN115862130B (en) Behavior recognition method based on human body posture and trunk sports field thereof
CN115761885B (en) Behavior recognition method for common-time and cross-domain asynchronous fusion driving
CN112257638A (en) Image comparison method, system, equipment and computer readable storage medium
CN109492530B (en) Robust visual object tracking method based on depth multi-scale space-time characteristics
Gupta et al. Reconnoitering the Essentials of Image and Video Processing: A Comprehensive Overview
Xue et al. Multiscale feature extraction network for real-time semantic segmentation of road scenes on the autonomous robot
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network
CN113627410B (en) Method for recognizing and retrieving action semantics in video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant