CN111415318B - Unsupervised correlation filtering target tracking method and system based on jigsaw task - Google Patents
Unsupervised correlation filtering target tracking method and system based on jigsaw task
- Publication number: CN111415318B (application CN202010201902.7A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06N3/045 — Neural networks; combinations of networks
- G06T3/4038 — Image mosaicing, e.g. composing plane images from plane sub-images
- G06T5/70 — Denoising; smoothing
- G06T2207/10004 — Still image; photographic image
- G06T2207/20024 — Filtering details
- G06T2207/20081 — Training; learning
- G06T2207/20112 — Image segmentation details
- G06T2207/20132 — Image cropping
Abstract
The invention relates to an unsupervised correlation filtering target tracking method and system based on a jigsaw task, comprising two stages: offline pre-training and online fine-tuning. In the offline pre-training stage, the jigsaw-task-based training of the neural network combines two tasks: unsupervised correlation filtering training and jigsaw task training. The training process can be divided into four parts: data processing, depth feature extraction, jigsaw task training, and unsupervised correlation filtering training. In the training of the unsupervised correlation filtering algorithm, the invention simultaneously introduces a prediction task for the image-block position index, which increases the capability of the deep neural network to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, so that accuracy is improved.
Description
Technical Field
The invention relates to the field of automatic identification, in particular to an unsupervised correlation filtering target tracking method and system based on a jigsaw task.
Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Target tracking is an important subject in the field of computer vision research and has broad application prospects in everyday life. Specific application fields include intelligent video surveillance, three-dimensional reconstruction, human-computer interaction, image understanding, and intelligent visual navigation. The target tracking task obtains the complete motion trajectory of a moving target by estimating its exact position and occupied area in a continuous video sequence, enabling analysis and understanding of the target's motion behavior and laying the groundwork for subsequent high-level tasks. Although the performance of target tracking algorithms has improved greatly, real-time and stable tracking of a moving target in complex real scenes still faces great challenges. These challenges arise both from changes in the moving target itself, such as shape and pose changes, and from external factors, such as motion blur, background occlusion, background clutter, and illumination changes, all of which present difficulties for target tracking.
Tracking algorithms based on correlation filtering are a research hotspot in the current target tracking field. Such a method uses the target image to train a discriminative correlation filter, performs correlation filtering on the search-region image, and takes the maximum position of the filter response map as the target position. Correlation filtering algorithms are widely applied because of their good accuracy and fast tracking speed. In addition, target tracking algorithms based on deep learning have achieved great success, mainly because features extracted by deep neural networks have stronger expressive power and anti-interference capability than traditional hand-crafted features. Shallower features contain more target position information, while deeper features contain more target semantic information, so fusing features of different layers improves the accuracy of the overall algorithm. Meanwhile, because training data for the target tracking task is severely lacking, performing offline pre-training in an unsupervised manner and then fine-tuning the model online provides a feasible direction for the practical application of deep learning in target tracking. To ensure both the speed and the precision of the target tracking task, many algorithms combine correlation filtering with deep learning and achieve good performance improvements on relevant datasets.
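As a concrete illustration of the correlation-filtering principle described above, the following is a minimal NumPy sketch of a classic Fourier-domain correlation filter (a MOSSE-style formulation; the function names and the regularization constant `lam` are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    # Closed-form (ridge-regression) correlation filter in the Fourier domain.
    # x: target patch, y: desired response map with a peak at the target center.
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def respond(h_hat, z):
    # Apply the learned filter spectrum to a search patch; the location of the
    # response maximum is the estimated target position.
    return np.real(np.fft.ifft2(h_hat * np.fft.fft2(z)))
```

Tracking then reduces to taking the argmax of the response map over the search region, frame by frame.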
The inventors found that existing unsupervised algorithms have insufficient capability to extract object detail features and struggle to balance semantic information and position information, so their accuracy still needs improvement.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides an unsupervised correlation filtering target tracking method and system based on a jigsaw task. In the training of the unsupervised correlation filtering algorithm, a prediction task for the image-block position index is simultaneously introduced, which increases the capability of the deep neural network to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, so that accuracy is improved. The method comprises two stages: offline pre-training and online fine-tuning.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
The first aspect of the invention provides an unsupervised correlation filtering target tracking method based on a jigsaw task, comprising the following steps:
processing the input image;
extracting depth features from the processed image with a twin depth network model;
training the jigsaw task: processing the extracted depth features with a classifier network model and predicting the position index of each small image;
performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
and updating the appearance model online and performing online fine-tuning.
In the offline pre-training stage, the jigsaw-task-based training of the neural network combines two tasks: unsupervised correlation filtering training and jigsaw task training. The depth features of the input image are extracted through the twin network and fused; meanwhile, jigsaw-task training gives the features of the twin network model stronger generality and detail-extraction capability, yielding higher tracking precision.
In a second aspect of the present invention, there is provided an unsupervised correlation filtering target tracking system based on a jigsaw task, comprising:
a data input device;
the data processing module is used for processing the input image;
the depth feature extraction module, used for extracting depth features from the processed image with the twin depth network model;
the jigsaw task training module, used for processing the extracted depth features with a classifier network model and predicting the position index of each small image;
the unsupervised correlation filtering training module, used for performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction;
and the online fine-tuning module, used for updating the appearance model online.
The system of the invention is simple to operate and convenient to interact with, and has great reference and practical value.
The invention has the following beneficial effects:
(1) In the training of the unsupervised correlation filtering algorithm, the invention simultaneously introduces a prediction task for the image-block position index, which increases the capability of the deep neural network to extract object detail features; by fusing features of different layers, the algorithm takes both semantic information and position information into account, so that accuracy is improved.
(2) The system of the invention is simple to operate and convenient to interact with, and has great reference and practical value.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is the jigsaw task training network of embodiment 1;
FIG. 2 is the overall training framework of embodiment 1.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Term interpretation:
Search: the search region;
Template: the template region;
CNN: convolutional neural network;
Correlation Filter: correlation filter;
Pseudo Label: pseudo label;
Consistency Loss: consistency loss function;
Initial Label: initial label.
An unsupervised correlation filtering target tracking method based on a jigsaw task comprises the following steps:
processing the input image;
extracting depth features from the processed image with a twin depth network model;
training the jigsaw task: processing the extracted depth features with a classifier network model and predicting the position index of each small image;
performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
and updating the appearance model online and performing online fine-tuning.
In some embodiments, the data processing uses the ILSVRC2015 dataset.
In some embodiments, the specific steps of the data processing include:
centrally cropping each picture to a size determined by its original length and width;
scaling the cropped picture to 125x125 resolution;
for pictures of the same video sequence, randomly selecting 3 pictures as the template T, search area S1, and search area S2;
cutting out non-overlapping small pictures of 50x50 resolution at four positions of each selected picture (upper left, lower left, upper right, and lower right), and scaling them to 63x63 resolution;
randomly dithering each channel of each cut small image within 2 pixels;
randomly scrambling the obtained small images, and using the scrambled position index, the scrambled 63x63 small images, and the 125x125 large picture as a group of training data.
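The data-processing steps above can be sketched as follows (NumPy only; the nearest-neighbor resize and whole-patch jitter are simplifications of the scaling and per-channel dithering described in the text, and all names are illustrative):

```python
import numpy as np

def nearest_resize(patch, size):
    # Simplified stand-in for the 50x50 -> 63x63 scaling step.
    idx = (np.arange(size) * patch.shape[0] / size).astype(int)
    return patch[np.ix_(idx, idx)]

def make_jigsaw_sample(img, rng):
    # Cut four non-overlapping 50x50 corner patches from a 125x125 picture,
    # resize to 63x63, jitter within 2 pixels, and randomly scramble them.
    # Returns the scrambled patches and the permutation (the training label).
    assert img.shape[:2] == (125, 125)
    corners = [(0, 0), (75, 0), (0, 75), (75, 75)]  # UL, LL, UR, LR origins
    patches = []
    for y, x in corners:
        p = nearest_resize(img[y:y + 50, x:x + 50].astype(np.float32), 63)
        dy, dx = rng.integers(-2, 3, size=2)         # random dithering
        patches.append(np.roll(p, (dy, dx), axis=(0, 1)))
    perm = rng.permutation(4)                        # scrambled position index
    return [patches[i] for i in perm], perm
```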
In some embodiments, the specific steps of the depth feature extraction are:
extracting, from each of the 4 small images of 63x63 resolution, depth features of 1x1 resolution through the twin convolutional network model;
extracting the features of specific layers from the 125x125 picture through the twin convolutional network model;
scaling the resulting layer features of different resolutions to a fixed 125x125 resolution using bilinear interpolation.
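The bilinear scaling step can be sketched as a standalone NumPy routine (a hypothetical implementation; a real system would likely use its deep-learning framework's built-in interpolation):

```python
import numpy as np

def bilinear_resize(feat, size):
    # Upsample an (H, W, C) feature map to (size, size, C) bilinearly.
    h, w = feat.shape[:2]
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```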
In some embodiments, the specific steps of the jigsaw task training are:
reshaping the 4 three-dimensional depth features obtained in the depth feature extraction into one-dimensional vectors and concatenating them in a given order;
passing the obtained feature vector through a classifier composed of several fully-connected layers to predict the position index of the scrambled small images, and calculating a cross-entropy loss function between the prediction and the given position index. The depth features of the input image are extracted through the twin network and fused; meanwhile, jigsaw-task training gives the features of the twin network model stronger generality and detail-extraction capability, yielding higher tracking precision.
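A NumPy sketch of this prediction step, assuming four 1x1xC patch features and a 512-unit / 24-unit fully-connected classifier (the weight initialization and the channel count C are illustrative assumptions):

```python
import numpy as np

def jigsaw_logits(feats, W1, b1, W2, b2):
    # Flatten the four 1x1xC depth features, concatenate in the given order,
    # then apply two fully-connected layers (ReLU after the first).
    x = np.concatenate([f.ravel() for f in feats])
    h = np.maximum(W1 @ x + b1, 0.0)   # 512 hidden units
    return W2 @ h + b2                 # 24 logits: one per permutation of 4

def cross_entropy(logits, label):
    # Softmax cross-entropy against the given (scrambled) position index.
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]
```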
In some embodiments, the specific steps of the unsupervised correlation filtering training include:
from the features Ft and Fs1 of a specific layer and the corresponding response image Yt, obtaining the response image of that layer with the correlation filtering algorithm, and obtaining the response image Rs1 of search area S1 by weighted summation;
from the features Fs1 and Fs2 of a specific layer and the obtained response image Rs1, obtaining the response image of that layer with the correlation filtering algorithm, and obtaining the response image Rs2 of search area S2 by weighted summation;
from the features Fs2 and Ft of a specific layer and the obtained response image Rs2, obtaining the response image of that layer with the correlation filtering algorithm, and obtaining the response image Rt of the template T by weighted summation;
calculating a mean-square loss function between the obtained response image Rt and the original response image Yt. The template T and the search areas S1 and S2 of the same video sequence yield multi-layer features through the depth feature extraction step. Shallow features carry more position information but weak semantic information; deep features carry more semantic information and strong anti-interference capability but lack the necessary position information. Combining features of different layers therefore helps improve the target tracking effect.
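The forward-backward cycle above can be sketched per feature layer with a Fourier-domain correlation filter (a simplified single-channel NumPy sketch; the function names and the regularization constant are assumptions):

```python
import numpy as np

def cf_track(feat_a, feat_b, resp_a, lam=1e-4):
    # Train a correlation filter on (feat_a, resp_a), then apply it to feat_b.
    A, B, Y = np.fft.fft2(feat_a), np.fft.fft2(feat_b), np.fft.fft2(resp_a)
    H = (Y * np.conj(A)) / (A * np.conj(A) + lam)
    return np.real(np.fft.ifft2(H * B))

def cycle_loss(f_t, f_s1, f_s2, y_t, lam=1e-4):
    # T -> S1 -> S2 -> T cycle; the tracked-back response should match Yt.
    r_s1 = cf_track(f_t, f_s1, y_t, lam)     # template tracks search area 1
    r_s2 = cf_track(f_s1, f_s2, r_s1, lam)   # pseudo-label propagated to area 2
    r_t = cf_track(f_s2, f_t, r_s2, lam)     # tracked back to the template
    return np.mean((r_t - y_t) ** 2)         # mean-square consistency loss
```

When the three feature maps depict the same (unmoved) target, the cycle loss is close to zero, which is what the unsupervised objective exploits.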
In some embodiments, the twin convolutional network structure is:
the first convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; the activation function is the rectified linear unit (ReLU);
the second convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; the activation function is ReLU;
the fourth convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; the activation function is ReLU;
the sixth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the eighth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the tenth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
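As a sanity check, the following sketch traces feature-map sizes through the ten layers, assuming zero-padding of 1 on each 3x3 convolution and 2x2 max pooling (stride 2) after every even-numbered layer — assumptions, since the text does not state padding or pooling parameters. Under them, a 63x63 small image indeed reduces to a 1x1 depth feature, consistent with the depth-feature-extraction step:

```python
def twin_shapes(size, pad=1, pool_after=(2, 4, 6, 8, 10)):
    # Trace (spatial size, output channels) after each of the ten conv layers.
    channels = [32, 32, 64, 64, 128, 128, 256, 256, 256, 256]
    shapes = []
    for i, c in enumerate(channels, start=1):
        size = size + 2 * pad - 3 + 1   # 3x3 convolution, stride 1
        if i in pool_after:
            size //= 2                  # 2x2 max pooling, stride 2
        shapes.append((size, c))
    return shapes
```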
In some embodiments, the classifier network structure is:
the first fully-connected layer, comprising 512 hidden units, with ReLU as the activation function;
the second fully-connected layer, comprising 24 hidden units, whose output is the predicted position index of the small images.
In some embodiments, the specific steps of the online fine-tuning include:
cropping and scaling: cropping from the search image a region centered at the target position of the previous frame but of larger size, and scaling it to 125x125 resolution;
using the resulting image to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt) * Wt-1 + αt * W
where αt ∈ [0, 1];
cropping from the search image three images centered at the previous frame's target position but with different resolutions, and obtaining the maximum response value of the three through the correlation filtering algorithm: the resolution with the maximum response gives the target size on the search image, and the position of the maximum gives the target's displacement; the cropping and scaling steps are then repeated with the current search image as the template.
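The online update and multi-scale search can be sketched as follows (a minimal NumPy sketch; the learning-rate value and function names are illustrative):

```python
import numpy as np

def update_filter(w_prev, w_new, alpha=0.01):
    # Linear interpolation update of the filter parameters:
    # W_t = (1 - alpha_t) * W_{t-1} + alpha_t * W, with alpha_t in [0, 1].
    return (1.0 - alpha) * w_prev + alpha * w_new

def pick_scale(responses):
    # Among response maps computed at several crop scales, pick the scale and
    # position of the global maximum: the scale estimates the target size,
    # the position gives the target's displacement.
    best = max(range(len(responses)), key=lambda i: responses[i].max())
    pos = np.unravel_index(responses[best].argmax(), responses[best].shape)
    return best, pos
```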
The invention will now be described in further detail with reference to the following specific examples, which should be construed as illustrative rather than limiting.
Example 1:
The invention provides an unsupervised correlation filtering target tracking method and system based on a jigsaw task; the method comprises two stages: offline pre-training and online fine-tuning.
In the offline pre-training stage, the jigsaw-task-based training of the neural network combines two tasks: unsupervised correlation filtering training and jigsaw task training. The depth features of the input image are extracted through the twin network and fused; meanwhile, jigsaw-task training gives the features of the twin network model stronger generality and detail-extraction capability, yielding higher tracking precision. The training process can be divided into four parts: data processing, depth feature extraction, jigsaw task training, and unsupervised correlation filtering training.
Data processing: the ILSVRC2015 dataset is used in the training process; the specific steps are as follows:
step (1): cutting each picture centrally to obtain original picture with length and widthSize of the product.
Step (2): and (3) scaling the picture obtained by clipping in the step (1) to a 125x125 resolution.
Step (3): for the pictures of the same video sequence, randomly selecting 3 pictures as templates T and searching areas S 1 Search area S 2 。
Step (4): for each picture selected in the step (3), a non-overlapping small picture with the resolution of 50x50 is respectively cut out at the upper left, the lower left, the upper right and the lower right of the picture, and is scaled to the resolution of 63x 63.
Step (5): and (3) randomly dithering each channel of each small image cut in the step (4) within 2 pixel points.
Step (6): randomly scrambling the small images obtained in the step (5), and taking the scrambled position indexes, the scrambled small images with the resolution of 63x63 and the large images with the resolution of 125x125 as a group of training data to participate in training.
Depth feature extraction: the method uses a twin depth network model for feature extraction, with the following specific steps:
Step (1): extract, from each of the 4 small images of 63x63 resolution, depth features of 1x1 resolution through the twin convolutional network model.
Step (2): extract the features of specific layers from the 125x125 picture through the twin convolutional network model.
Step (3): scale the layer features of different resolutions obtained in step (2) to a fixed 125x125 resolution using bilinear interpolation.
Jigsaw task training: the position index of the small images is predicted using a classifier network model, as shown in FIG. 1. The specific steps are as follows:
step (1): and (3) adjusting the 4 three-dimensional depth features obtained in the depth feature extraction step (1) into one-dimensional vectors, and connecting the three-dimensional depth features together in a given order.
Step (2): and (3) predicting the position index of the disturbed small image by using the feature vector obtained in the step (1) through a classifier formed by a plurality of full-connection layers, and calculating a cross entropy loss function between the position index and the given position index.
Unsupervised correlation filtering training: the specific training process is shown in FIG. 2. The template T and the search areas S1 and S2 of the same video sequence yield multi-layer features through depth feature extraction step (3). Shallow features carry more position information but weak semantic information; deep features carry more semantic information and strong anti-interference capability but lack the necessary position information. Combining features of different layers therefore helps improve the target tracking effect. During training, the correlation filtering algorithm is applied cyclically to the features of different pictures, with the following specific steps:
Step (1): from the features Ft and Fs1 of a specific layer and the corresponding response image Yt, obtain the response image of that layer with the correlation filtering algorithm, and obtain the response image Rs1 of search area S1 by weighted summation.
Step (2): from the features Fs1 and Fs2 of a specific layer and the response image Rs1 obtained in step (1), obtain the response image of that layer with the correlation filtering algorithm, and obtain the response image Rs2 of search area S2 by weighted summation.
Step (3): from the features Fs2 and Ft of a specific layer and the response image Rs2 obtained in step (2), obtain the response image of that layer with the correlation filtering algorithm, and obtain the response image Rt of the template T by weighted summation.
Step (4): calculate a mean-square loss function between the response image Rt obtained in step (3) and the original response image Yt.
The following is the complete network structure of the present invention:
Twin convolutional network structure:
the first convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; the activation function is the rectified linear unit (ReLU);
the second convolution layer: 3x3 kernel, stride 1x1, outputting 32 feature maps; followed by local response normalization and max pooling;
the third convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; the activation function is ReLU;
the fourth convolution layer: 3x3 kernel, stride 1x1, outputting 64 feature maps; followed by local response normalization and max pooling;
the fifth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; the activation function is ReLU;
the sixth convolution layer: 3x3 kernel, stride 1x1, outputting 128 feature maps; followed by local response normalization and max pooling;
the seventh convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the eighth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling;
the ninth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; the activation function is ReLU;
the tenth convolution layer: 3x3 kernel, stride 1x1, outputting 256 feature maps; followed by local response normalization and max pooling.
Classifier network structure:
the first fully-connected layer, comprising 512 hidden units, with ReLU as the activation function;
the second fully-connected layer, comprising 24 hidden units, whose output is the predicted position index of the small images.
In the online fine tuning stage, in order to better capture the change of the target in the motion process, the appearance model needs to be updated online, and the specific steps are as follows:
step (1): an image which is the same as the center position of the previous frame of image but is larger is cut out from the search image, and the image is scaled to the 125x125 resolution size.
Step (2): use the image obtained in step (1) to update the parameters of the correlation filtering algorithm:
Wt = (1 - αt) * Wt-1 + αt * W
where αt ∈ [0, 1].
Step (3): crop from the search image three images centered at the previous frame's target position but with different resolutions, and obtain the maximum response value of the three through the correlation filtering algorithm: the resolution with the maximum response gives the target size on the search image, and the position of the maximum gives the target's displacement. Then repeat step (1) with the current search image as the template.
Finally, it should be noted that the above embodiments are only preferred embodiments of the present invention, and the invention is not limited to them; those skilled in the art may modify or substitute parts of them without inventive effort. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in its protection scope.
Claims (9)
1. An unsupervised correlation filtering target tracking method based on a jigsaw task, characterized by comprising the following steps:
processing the input image;
adopting a twin depth network model to extract depth characteristics of the processed image;
training on a jigsaw task: processing the extracted depth features with a classifier network model and predicting the position indices of the small images;
performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
updating the appearance model on line and carrying out on-line fine adjustment;
the specific steps of data processing include:
cropping each picture about its center to a given proportion of the original picture's length and width;
scaling the cut picture to 125x125 resolution;
for the pictures of the same video sequence, randomly selecting 3 pictures as a template T, a search area S1 and a search area S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions of each selected picture (upper left, lower left, upper right and lower right), and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small picture by up to 2 pixels;
randomly shuffling the obtained small pictures, and taking the shuffled position index, the shuffled small pictures of 63x63 resolution and the large picture of 125x125 resolution as one group of training data for training.
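The data-preparation steps of claim 1 can be sketched as follows. This is an illustrative NumPy version under stated assumptions: all names are hypothetical, the jitter is modelled as a circular shift, and the final 50x50 → 63x63 rescale is omitted for brevity:

```python
# Illustrative sketch of the jigsaw data preparation: crop four corner
# patches from a 125x125 image, jitter each channel, shuffle the patches.
import numpy as np

def make_jigsaw_sample(img, rng):
    """img: 125x125xC array. Returns (shuffled patches, permutation)."""
    corners = [(0, 0), (75, 0), (0, 75), (75, 75)]  # TL, BL, TR, BR offsets
    patches = [img[y:y + 50, x:x + 50].copy() for (y, x) in corners]
    # per-channel jitter within 2 pixels (modelled here as a circular shift)
    patches = [np.stack([np.roll(p[..., c], int(rng.integers(-2, 3)), axis=0)
                         for c in range(p.shape[-1])], axis=-1)
               for p in patches]
    perm = rng.permutation(4)        # one of 4! = 24 possible orders
    shuffled = [patches[i] for i in perm]
    return shuffled, perm            # perm is the position-index label
```

The 24 possible permutations of four patches match the 24 output units of the classifier recited later in the claims.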
2. The jigsaw-task-based unsupervised correlation filtering target tracking method of claim 1, wherein the data processing uses the ILSVRC2015 dataset.
3. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 1, wherein the specific steps of depth feature extraction are as follows:
extracting depth features of corresponding 1x1 resolution from the 4 small images with the resolution of 63x63 through a twin convolution network model respectively;
extracting the characteristics of a specific layer from a picture with a resolution of 125x125 through a twin convolution network model;
the resulting layer features of different resolution sizes are scaled to a fixed resolution of 125x125 using bilinear interpolation.
4. The method for tracking an unsupervised related filtering target based on a jigsaw task as claimed in claim 3, wherein the specific steps of training the jigsaw task are as follows:
adjusting the 4 three-dimensional depth features obtained in the depth feature extraction into one-dimensional vectors, and concatenating them in a given order;
predicting the position index of the shuffled small pictures from the obtained feature vector through a classifier composed of several fully connected layers, and computing the cross-entropy loss between the prediction and the given position index.
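A minimal sketch of this prediction head follows. The weight shapes mirror the classifier structure recited in the claims (512 hidden units, 24 outputs); all names and the plain-NumPy formulation are assumptions:

```python
# Illustrative jigsaw prediction head: flatten and concatenate the four
# patch features, apply FC1 + ReLU, FC2, softmax, and cross-entropy loss.
import numpy as np

def jigsaw_head(patch_feats, w1, w2, target_index):
    """patch_feats: four 1-D feature vectors in a fixed order;
    w1: (512, d) FC1 weights; w2: (24, 512) FC2 weights."""
    x = np.concatenate([f.ravel() for f in patch_feats])  # flatten + concat
    h = np.maximum(w1 @ x, 0.0)                           # FC1 + ReLU
    logits = w2 @ h                                       # FC2: 24 logits
    logits = logits - logits.max()                        # stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[target_index])                   # cross-entropy
    return probs, loss
```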
5. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the specific step of unsupervised correlation filtering training comprises:
for the specific-layer features of the template T and its corresponding response image (the initial label), obtaining a response image of each layer by the correlation filtering algorithm, and summing them by weights to obtain the response image R_S1 of the search area S1;
for the specific-layer features of the search area S1 and the resulting response image R_S1, obtaining a response image of each layer by the correlation filtering algorithm, and summing them by weights to obtain the response image R_S2 of the search area S2;
for the specific-layer features of the search area S2 and the resulting response image R_S2, obtaining a response image of each layer by the correlation filtering algorithm, and summing them by weights to obtain the response image R_T of the template T.
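One step of this cycle can be sketched as a single-channel discriminative correlation filter solved in the Fourier domain. This is an illustrative stand-in, not the patent's exact formulation; the regularizer `lam` and all names are assumptions:

```python
# Illustrative closed-form correlation filter: learn a filter mapping
# `template` to the desired response `label`, then apply it to `search`.
import numpy as np

def dcf_response(template, label, search, lam=1e-4):
    """All arguments are HxW arrays of equal size; `lam` is a ridge
    regularizer that avoids division by near-zero spectra."""
    T = np.fft.fft2(template)
    Y = np.fft.fft2(label)
    S = np.fft.fft2(search)
    W = (Y * np.conj(T)) / (T * np.conj(T) + lam)  # closed-form filter
    return np.real(np.fft.ifft2(W * S))            # response map on search
```

Chaining such steps (template to S1, S1 to S2, S2 back to the template) and penalizing the mismatch between the final response and the initial label yields the unsupervised, cycle-consistent training signal described in the claim.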
6. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the twin convolution network model is:
the first convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 32 feature maps; the activation function is a linear rectification unit ReLU;
the second convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 32 feature maps, followed by local response normalization and maximum pooling;
the third convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 64 feature maps; the activation function is a linear rectification unit ReLU;
the fourth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 64 feature maps, followed by local response normalization and maximum pooling;
the fifth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 128 feature maps; the activation function is a linear rectification unit ReLU;
the sixth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 128 feature maps, followed by local response normalization and maximum pooling;
the seventh convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps; the activation function is a linear rectification unit ReLU;
the eighth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps, followed by local response normalization and maximum pooling;
the ninth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps; the activation function is a linear rectification unit ReLU;
the tenth convolution layer, with convolution kernel size 3x3 and stride 1x1, outputs 256 feature maps, followed by local response normalization and maximum pooling.
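The ten-layer structure can be summarized with a small shape-bookkeeping sketch. Note the claim does not state padding or pooling parameters, so 'same' padding and a 2x2/stride-2 pool after each even layer are assumptions here (claim 3's 63x63 → 1x1 mapping implies different padding, which this sketch does not attempt to reproduce):

```python
# Shape bookkeeping for the ten-layer twin network of claim 6, under the
# assumed 'same' padding and 2x2/stride-2 pooling after even layers.
def feature_map_plan(in_res=125):
    channels = [32, 32, 64, 64, 128, 128, 256, 256, 256, 256]
    res = in_res
    plan = []
    for i, c in enumerate(channels, start=1):
        if i % 2 == 0:            # LRN + assumed max pool halves resolution
            res = res // 2
        plan.append((i, c, res))  # (layer index, feature maps, resolution)
    return plan
```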
7. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 3, wherein the classifier network structure is as follows:
the first fully connected layer comprises 512 hidden units, with a linear rectification unit (ReLU) as the activation function;
the second fully connected layer comprises 24 hidden units and outputs the predicted position index of the small images.
8. The method for tracking an unsupervised correlation filtering target based on a jigsaw task as claimed in claim 1, wherein the specific step of online fine tuning comprises:
cropping and scaling: an image centered at the same position as in the previous frame but of larger size is cropped from the search image and scaled to 125x125 resolution;
the resulting image is used to update the parameters of the correlation filtering algorithm: W_t = (1 - α_t)·W_{t-1} + α_t·W, where α_t ∈ [0, 1];
three images with the same center position as the previous frame but different resolutions are cropped from the search image, and the correlation filtering algorithm is applied to each; the resolution of the image with the largest response value gives the size of the target on the search image, and the position of the peak response gives the direction of target motion; the cropping and scaling steps are then repeated with the current search image as the template.
9. An unsupervised correlation filtering target tracking system based on a jigsaw task, comprising:
a data input device;
the data processing module is used for processing the input image;
the depth feature extraction module is used for extracting depth features of the processed image by adopting the twin depth network model;
the jigsaw task training module is used for processing the extracted depth features by using a classifier network model and predicting the position index of the small image;
the unsupervised correlation filtering training module is used for performing unsupervised correlation filtering training on the multi-layer features obtained by the depth feature extraction, cyclically applying the correlation filtering algorithm to the features of different pictures;
the online fine tuning module is used for online updating the appearance model;
the specific steps of the data processing module for data processing comprise:
cropping each picture about its center to a given proportion of the original picture's length and width;
scaling the cut picture to 125x125 resolution;
for the pictures of the same video sequence, randomly selecting 3 pictures as a template T, a search area S1 and a search area S2;
cropping non-overlapping small pictures of 50x50 resolution from the four positions of each selected picture (upper left, lower left, upper right and lower right), and scaling them to 63x63 resolution;
randomly jittering each channel of each cropped small picture by up to 2 pixels;
randomly shuffling the obtained small pictures, and taking the shuffled position index, the shuffled small pictures of 63x63 resolution and the large picture of 125x125 resolution as one group of training data for training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010201902.7A CN111415318B (en) | 2020-03-20 | 2020-03-20 | Unsupervised related filtering target tracking method and system based on jigsaw task |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111415318A CN111415318A (en) | 2020-07-14 |
CN111415318B true CN111415318B (en) | 2023-06-13 |
Family
ID=71494404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010201902.7A Active CN111415318B (en) | 2020-03-20 | 2020-03-20 | Unsupervised related filtering target tracking method and system based on jigsaw task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111415318B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016591A (en) * | 2020-08-04 | 2020-12-01 | 杰创智能科技股份有限公司 | Training method of image recognition model and image recognition method |
CN113240591B (en) * | 2021-04-13 | 2022-10-04 | 浙江大学 | Sparse deep completion method based on countermeasure network |
CN113112518B (en) * | 2021-04-19 | 2024-03-26 | 深圳思谋信息科技有限公司 | Feature extractor generation method and device based on spliced image and computer equipment |
CN113192062A (en) * | 2021-05-25 | 2021-07-30 | 湖北工业大学 | Arterial plaque ultrasonic image self-supervision segmentation method based on image restoration |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN110211192A (en) * | 2019-05-13 | 2019-09-06 | 南京邮电大学 | A kind of rendering method based on the threedimensional model of deep learning to two dimensional image |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9218365B2 (en) * | 2011-12-15 | 2015-12-22 | Yeda Research And Development Co. Ltd. | Device, system, and method of visual inference by collaborative composition |
US11055854B2 (en) * | 2018-08-23 | 2021-07-06 | Seoul National University R&Db Foundation | Method and system for real-time target tracking based on deep learning |
2020-03-20 CN CN202010201902.7A patent/CN111415318B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN110211192A (en) * | 2019-05-13 | 2019-09-06 | 南京邮电大学 | A kind of rendering method based on the threedimensional model of deep learning to two dimensional image |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
Non-Patent Citations (1)
Title |
---|
Liu Kan et al. A wireless positioning method based on deep neural networks. Computer Engineering, 2016, (No. 07), 88-91. *
Also Published As
Publication number | Publication date |
---|---|
CN111415318A (en) | 2020-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111415318B (en) | Unsupervised related filtering target tracking method and system based on jigsaw task | |
US11637971B2 (en) | Automatic composition of composite images or videos from frames captured with moving camera | |
US10719940B2 (en) | Target tracking method and device oriented to airborne-based monitoring scenarios | |
Zhang et al. | SiamFT: An RGB-infrared fusion tracking method via fully convolutional Siamese networks | |
CN109410242B (en) | Target tracking method, system, equipment and medium based on double-current convolutional neural network | |
KR20220108165A (en) | Target tracking method, apparatus, electronic device and storage medium | |
CN113011329B (en) | Multi-scale feature pyramid network-based and dense crowd counting method | |
CN111260688A (en) | Twin double-path target tracking method | |
CN111639571B (en) | Video action recognition method based on contour convolution neural network | |
CN111696110A (en) | Scene segmentation method and system | |
CN112232134A (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN113255429B (en) | Method and system for estimating and tracking human body posture in video | |
Lin et al. | High resolution animated scenes from stills | |
CN115713546A (en) | Lightweight target tracking algorithm for mobile terminal equipment | |
CN114036969A (en) | 3D human body action recognition algorithm under multi-view condition | |
CN112700476A (en) | Infrared ship video tracking method based on convolutional neural network | |
CN113592900A (en) | Target tracking method and system based on attention mechanism and global reasoning | |
CN115862130B (en) | Behavior recognition method based on human body posture and trunk sports field thereof | |
CN115761885B (en) | Behavior recognition method for common-time and cross-domain asynchronous fusion driving | |
CN112257638A (en) | Image comparison method, system, equipment and computer readable storage medium | |
CN109492530B (en) | Robust visual object tracking method based on depth multi-scale space-time characteristics | |
Gupta et al. | Reconnoitering the Essentials of Image and Video Processing: A Comprehensive Overview | |
Xue et al. | Multiscale feature extraction network for real-time semantic segmentation of road scenes on the autonomous robot | |
CN114743045B (en) | Small sample target detection method based on double-branch area suggestion network | |
CN113627410B (en) | Method for recognizing and retrieving action semantics in video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||