CN113379806B - Target tracking method and system based on learnable sparse conversion attention mechanism - Google Patents


Info

Publication number
CN113379806B
CN113379806B (application CN202110929160.4A)
Authority
CN
China
Prior art keywords
target
image
frame
search area
learnable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110929160.4A
Other languages
Chinese (zh)
Other versions
CN113379806A (en)
Inventor
王军 (Wang Jun)
章利民 (Zhang Limin)
王员云 (Wang Yuanyun)
孟晨晨 (Meng Chenchen)
张珮芸 (Zhang Peiyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202110929160.4A priority Critical patent/CN113379806B/en
Publication of CN113379806A publication Critical patent/CN113379806A/en
Application granted granted Critical
Publication of CN113379806B publication Critical patent/CN113379806B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/60 Analysis of geometric attributes
    • G06T7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method and system based on a learnable sparse conversion attention mechanism, comprising the following steps: initializing the image in a given first-frame target frame to generate a target template image; in each subsequent frame, taking the target center of the image in the previous frame's target frame as the center point and obtaining several search-area images through a multi-scale strategy; inputting the target template image and the search-area images into a weight-sharing convolutional neural network model and extracting their features separately; performing spatial conversion and channel conversion on the extracted features based on the learnable sparse model; and, taking the target template depth feature as a convolution kernel, performing a sliding-window operation over the search-area images to obtain several score maps, then inferring the relative displacement and scale change of the target from the position of the maximum score value to realize target tracking. The method has good robustness and real-time performance and achieves a good target-tracking effect.

Description

Target tracking method and system based on learnable sparse conversion attention mechanism
Technical Field
The invention relates to the technical field of computer vision and digital image processing, in particular to a target tracking method and system based on a learnable sparse conversion attention mechanism.
Background
In recent years, visual tracking has been a research hotspot in computer vision: it estimates the target's position in subsequent video frames from the target's initial state in the first frame. The rapid development of deep learning has driven significant progress in the field. However, robust and accurate target tracking in complex scenes remains highly challenging under occlusion, motion blur, scale change, and illumination change.
In general, visual tracking algorithms fall into two categories: discriminative and generative. Specifically, (1) algorithms based on a discriminative model treat tracking as a binary classification problem: target and background information are extracted simultaneously to train a classifier that distinguishes the target from the background of the current frame, yielding the current-frame target position; (2) algorithms based on a generative model build a motion model through online learning and then search for the candidate region with the minimum reconstruction error to realize tracking. Meanwhile, in recent years, methods based on deep learning have exploited the strong representational power of depth features, greatly improving the robustness and accuracy of tracking algorithms, and have gradually become the mainstream.
Specifically, tracking algorithms based on deep learning mainly exploit the strong feature-extraction and representation capacity of convolutional neural networks to extract target features and separate foreground from background, thereby identifying the tracked target. Video tracking algorithms based on a Siamese (twin) network convert the tracking problem into a matching problem and allow offline end-to-end training on large-scale data sets, greatly improving both speed and accuracy.
However, in the prior art, the robustness and accuracy of the appearance models of some visual tracking algorithms are not ideal, and they cannot handle well the influence of appearance changes such as motion blur, illumination change, complex background, and occlusion.
Disclosure of Invention
In view of the above, it is necessary to address the prior-art problem that the appearance models of some visual tracking algorithms lack robustness and accuracy and cannot handle well the influence of appearance changes such as motion blur, illumination change, complex background, and occlusion.
The embodiment of the invention provides a target tracking method based on a learnable sparse conversion attention mechanism, wherein the method comprises the following steps:
the method comprises the following steps: initializing an image in a given first frame target frame to generate a target template image;
step two: in a second frame and a subsequent frame, taking the target center of the image in the target frame of the previous frame as a central point, obtaining a plurality of search area images through a multi-scale strategy, and adjusting the plurality of search area images to be the same in size;
step three: inputting the target template image and the search area image into a convolutional neural network model sharing weight values, and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
step four: performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy;
step five: taking the depth features of the target template processed by the learnable sparse model as convolution kernels, and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
step six: according to the position with the maximum score value among the score maps, estimating the relative displacement of the target center of the previous-frame image in the current frame, and obtaining the scale change of the target tracking image through the multi-scale strategy, so as to realize tracking of the target.
The invention provides a target tracking method based on a learnable sparse conversion attention mechanism that combines a convolutional neural network model with a learnable sparse conversion model to obtain sparser and more robust target template and search-area image features. In addition, the similarity between the target template features and the search-area features is computed by cross-correlation, and a multi-scale strategy adapts to target scale changes. The proposed method has good robustness and real-time performance, handles appearance changes including occlusion, illumination change, and motion blur well, and finally achieves a good target-tracking effect.
The target tracking method based on the learnable sparse conversion attention mechanism, wherein in step one, the coordinates of the center of the target to be tracked in the first-frame target frame are $(x_0, y_0)$, the height and width of the target to be tracked in the first-frame target frame are $h$ and $w$ respectively, and a correlation coefficient $p$ is set accordingly as a function of $h$ and $w$.
the target tracking method based on the learnable sparse conversion attention mechanism is characterized in that in the step one, correlation coefficients are used
Figure 375433DEST_PATH_IMAGE004
Obtaining side lengths of target template images
Figure 48991DEST_PATH_IMAGE006
The corresponding expression is:
Figure 779049DEST_PATH_IMAGE007
the target tracking method based on the learnable sparse conversion attention mechanism is characterized in that in the second step, the side length of the image of the search area is searched
Figure 182349DEST_PATH_IMAGE008
By correlation coefficient
Figure 831505DEST_PATH_IMAGE004
Height from the image in the previous frame target frame
Figure 116993DEST_PATH_IMAGE009
And width
Figure 260529DEST_PATH_IMAGE010
And calculating to obtain the following concrete expression:
Figure 315073DEST_PATH_IMAGE011
wherein, when the previous frame is the first frame, the height and width of the image are respectively
Figure 666289DEST_PATH_IMAGE012
And
Figure 439073DEST_PATH_IMAGE013
the target tracking method based on the learnable sparse conversion attention mechanism is characterized in that in the second step, the side length of the image of the search area is obtained
Figure 120721DEST_PATH_IMAGE014
After the step of (a), the method further comprises:
center of object of image in previous frame object frame
Figure 295350DEST_PATH_IMAGE015
As a central point, by respectively
Figure 286309DEST_PATH_IMAGE017
As different side lengths, to obtain different search area images, wherein,
Figure 749651DEST_PATH_IMAGE018
wherein a plurality of the search area images are all adjusted to
Figure 359624DEST_PATH_IMAGE019
The size of (a).
In the third step, in the step of extracting the depth features by the convolutional neural network, the corresponding convolution operation is represented as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel.
The target tracking method based on the learnable sparse conversion attention mechanism, wherein in step four, performing the spatial conversion comprises: decomposing a local region of the input image into different frequency bands through successive row and column transforms, and initializing the corresponding column and row transform weights; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
The target tracking method based on the learnable sparse conversion attention mechanism, wherein step six specifically comprises: finding the position $(x_{\max}, y_{\max})$ of the maximum score value in the three score maps; calculating the relative displacement between this position and the target center of the image in the previous-frame target frame; and updating the position of the target center of the current-frame target tracking image according to the relative displacement, so as to locate the target.
The target tracking method based on the learnable sparse conversion attention mechanism, further comprising: updating the scale of the current-frame target tracking image according to the scale at which the maximum score value among the three score maps lies, wherein the corresponding scale change $\Delta s$ is determined by the scale $s^{*}$ at which that maximum lies.
The invention also provides a target tracking system based on the learnable sparse conversion attention mechanism, wherein the system comprises:
the first processing module is used for initializing the image in the given first frame target frame to generate a target template image;
the second processing module is used for obtaining a plurality of search area images by taking the target center of the image in the target frame of the previous frame as a central point through a multi-scale strategy in the second frame and the subsequent frames and adjusting the plurality of search area images to be the same in size;
the first learning module is used for inputting the target template image and the search area image into a convolutional neural network model sharing weight values and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
the second learning module is used for performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy;
the sliding window processing module is used for taking the depth features of the target template processed by the learnable sparse model as convolution kernels and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
and the positioning tracking module is used for estimating the relative displacement of the target center of the image in the target frame of the previous frame in the current frame according to the position with the maximum score value in the score maps, and acquiring the scale change of the target tracking image through a multi-scale strategy so as to realize the tracking of the target.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic diagram of a target tracking method based on a learnable sparse conversion attention mechanism proposed by the present invention;
fig. 2 is a structural diagram of a target tracking system based on a learnable sparse conversion attention mechanism proposed in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
In the prior art, the robustness and accuracy of the appearance models of some visual tracking algorithms are not ideal, and the influence of appearance changes such as motion blur, illumination change, and occlusion cannot be handled well.
In order to solve the technical problem, the present invention provides a target tracking method based on a learnable sparse transformation attention mechanism, please refer to fig. 1, the method includes the following steps:
s101, initializing the image in the given first frame target frame to generate a target template image.
In this step, the coordinates of the center of the target to be tracked in the first-frame target frame are $(x_0, y_0)$, and the height and width of the target to be tracked in the first-frame target frame are $h$ and $w$ respectively. In addition, a correlation coefficient $p$ is set accordingly as a function of $h$ and $w$, and the side length $z$ of the target template image is obtained from $p$. That is, an image block with center $(x_0, y_0)$ and side length $z$ is cropped out and resized to a fixed template size.
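As a concrete illustration of the cropping rule, here is a minimal Python sketch. The patent publishes its expressions for the correlation coefficient $p$ and the template side length $z$ only as equation images, so the SiamFC-style context margin $p = (h + w)/4$ and $z = \sqrt{(h + 2p)(w + 2p)}$ below are assumptions, not the patent's confirmed formulas.

```python
import math

def template_side(h: float, w: float) -> float:
    """Side length of the square template crop around the target.

    Assumes the SiamFC-style context margin p = (h + w) / 4; the patent's
    own expression for the "correlation coefficient" p is not recoverable
    from the published text, so this is an assumption.
    """
    p = (h + w) / 4.0  # assumed context margin
    return math.sqrt((h + 2 * p) * (w + 2 * p))
```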
And S102, in the second frame and the subsequent frames, taking the target center of the image in the target frame of the previous frame as a central point, obtaining a plurality of search area images through a multi-scale strategy, and adjusting the plurality of search area images to be the same in size.
Here, the sampling in step S102 is the same as in step S101, except that a multi-scale strategy is used to generate the search-region images, which are then resized to the same size.
Specifically, in this step, the side length $x$ of the search-area image is calculated from the correlation coefficient $p$ and the height $h_{t-1}$ and width $w_{t-1}$ of the image in the previous-frame target frame; when the previous frame is the first frame, the height and width are $h$ and $w$ respectively. After the side length $x$ is obtained, the target center $(x_{t-1}, y_{t-1})$ of the previous-frame target frame is taken as the center point, and three scaled side lengths $\{a^{-1}x,\ x,\ ax\}$ (where $a$ is a fixed scale step) are used to obtain different search-area images (pixels falling outside the range of the current frame are filled with the mean value). Finally, the three search-area images are all resized to the same fixed size.
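A sketch of the multi-scale cropping just described, with mean-value filling for out-of-frame pixels as the text specifies; the scale step 1.04, the 255-pixel output size, and the use of OpenCV for resizing are assumptions:

```python
import numpy as np
import cv2  # assumed available for resizing

def crop_search_regions(frame, cx, cy, x_side,
                        scales=(1 / 1.04, 1.0, 1.04), out_size=255):
    """Crop three scaled square search regions centered on (cx, cy).

    Out-of-frame pixels are filled with the per-channel mean, as the patent
    describes; the scale step and output size are illustrative assumptions.
    """
    mean = frame.mean(axis=(0, 1))
    crops = []
    for s in scales:
        half = int(round(x_side * s)) // 2
        # mean-padded canvas large enough for any crop position
        canvas = np.full((frame.shape[0] + 2 * half,
                          frame.shape[1] + 2 * half, 3), mean, dtype=frame.dtype)
        canvas[half:half + frame.shape[0], half:half + frame.shape[1]] = frame
        px, py = int(cx) + half, int(cy) + half  # center in padded coordinates
        patch = canvas[py - half:py + half, px - half:px + half]
        crops.append(cv2.resize(patch, (out_size, out_size)))
    return crops
```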
s103, inputting the target template image and the search area image into a convolutional neural network model sharing a weight, and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network.
It is noted that the target template image and the search-area image use the same convolutional neural network with shared weights, and a fully convolutional network is used in the feature-extraction stage. When training the network parameters, the weights are randomly initialized and then optimized by backpropagation of a cross-entropy loss between the ground truth and the predicted value; the loss between the two is monitored to find a set of parameters that fits the training data well.
In addition, the feature-extraction backbone is AlexNet, of which the first four convolutional layers are used and the fully connected layers are removed; through large-scale offline end-to-end training, the AlexNet parameters can cope well with complex changes in target appearance.
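A minimal PyTorch sketch of such a shared-weight backbone. The patent states only that the first four convolutional layers of AlexNet are used with the fully connected layers removed; the exact channel widths, kernel sizes, and strides below follow the classic AlexNet stack and should be read as assumptions:

```python
import torch.nn as nn

class Backbone(nn.Module):
    """First four AlexNet-style conv layers, no fully connected layers.

    Channel widths and strides are assumptions; the patent only states that
    the first four convolutional layers of AlexNet are used.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3),
        )

    def forward(self, z, x):
        # one network, shared weights: template and search pass through the same layers
        return self.features(z), self.features(x)
```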
In this step, in the step of extracting the depth features by the convolutional neural network, the corresponding convolution operation is expressed as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel.
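The formula above is the usual im2col view of convolution: each output value $y_{n,m}$ is the dot product of one flattened sliding window with one flattened kernel. A quick PyTorch check of that equivalence, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

X = torch.randn(1, 4, 8, 8)         # c = 4 input channels
W = torch.randn(6, 4, 3, 3)         # m = 6 kernels, k = 3
cols = F.unfold(X, kernel_size=3)   # each column is one sliding window x_n
Y = (W.view(6, -1) @ cols).view(1, 6, 6, 6)  # y_{n,m} = sum_i w_{m,i} * x_{n,i}
assert torch.allclose(Y, F.conv2d(X, W), atol=1e-5)
```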
And S104, performing spatial conversion and channel conversion on the target template depth features and the search-area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy.
The main purpose of the spatial conversion is to reduce the redundancy of spatial features. Specifically, a local region of the input image is decomposed into different frequency bands through successive row and column transforms, and the corresponding column and row transform weights are initialized; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
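The Kronecker construction can be checked numerically. In the sketch below, the initial column and row weights $W_c$ and $W_r$ are set to an orthonormal DCT-II basis, which decomposes a patch into frequency bands as the text describes; the DCT choice is an assumption, since the patent only says the weights are initialized and then learned:

```python
import numpy as np

k = 4
# assumed initialization: orthonormal DCT-II basis for columns and rows
n = np.arange(k)
W_c = np.sqrt(2.0 / k) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * k))
W_c[0, :] = 1.0 / np.sqrt(k)
W_r = W_c.copy()

W_s = np.kron(W_c, W_r)  # spatial-transform weights, shape (k*k, k*k)

# applying W_s to a row-major-flattened k-by-k patch equals W_c @ patch @ W_r.T
patch = np.random.randn(k, k)
lhs = (W_s @ patch.reshape(-1)).reshape(k, k)
rhs = W_c @ patch @ W_r.T
assert np.allclose(lhs, rhs)
```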
The channel conversion, in turn, mainly reduces the redundancy among channels: the correlations among channels are used to remap the input features, changing the number of channels. A residual structure is adopted, which on the one hand preserves the important information of the input features and on the other hand highlights the regions of interest in the input image.
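A sketch of the channel-conversion idea as described, i.e. a learned cross-channel mapping wrapped in a residual connection; the 1×1 convolution and the sigmoid gate are assumed details, chosen only to make the sketch concrete:

```python
import torch
import torch.nn as nn

class ChannelTransform(nn.Module):
    """Learned cross-channel mapping with a residual connection.

    The 1x1 convolution and sigmoid gate are assumptions; the patent only
    states that channel correlations remap the features and that a residual
    structure preserves the input information.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat):
        gate = torch.sigmoid(self.mix(feat))  # highlight regions of interest
        return feat + gate * feat             # residual: keep the input, add emphasis
```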
And S105, taking the depth feature of the target template processed by the learnable sparse model as a convolution kernel, and performing sliding window operation on the image of the search area to obtain a plurality of score maps.
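Step S105 is the standard Siamese cross-correlation: the template's depth feature acts as the convolution kernel slid over each search-area feature map. A minimal PyTorch sketch (the shape conventions are assumptions; the patent does not publish tensor sizes):

```python
import torch.nn.functional as F

def score_maps(template_feat, search_feats):
    """Cross-correlate the template feature over each search-area feature.

    template_feat: (c, hz, wz); search_feats: list of (c, hx, wx) tensors,
    one per scale. Returns one 2-D score map per scale.
    """
    kernel = template_feat.unsqueeze(0)                 # (1, c, hz, wz)
    return [F.conv2d(f.unsqueeze(0), kernel).squeeze()  # (hx-hz+1, wx-wz+1)
            for f in search_feats]
```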
And S106, according to the position with the maximum score value among the multiple score maps, estimating the relative displacement of the target center of the previous-frame image in the current frame, and obtaining the scale change of the target tracking image through the multi-scale strategy, so as to realize target tracking.
In this step, the position $(x_{\max}, y_{\max})$ of the maximum score value is found in the three score maps, and the relative displacement between this position and the target center of the image in the previous-frame target frame is calculated. The position of the target center of the current-frame target tracking image is then updated according to this relative displacement, so as to locate the target. Meanwhile, the scale of the current-frame target tracking image is updated according to the scale at which the maximum score value among the three score maps lies; the corresponding scale change $\Delta s$ is determined by the scale $s^{*}$ at which that maximum lies.
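As a sketch of how the peak position and scale are read out in step S106: the displacement of the peak from the score-map center, multiplied by the total stride of the backbone, gives the displacement in search-crop pixels. The stride value 8 is an assumption, and mapping back to frame coordinates additionally requires the crop-to-frame resize ratio:

```python
import numpy as np

def locate(score_maps, scales, stride=8):
    """Pick the best scale and the peak displacement from the map center.

    Returns (scale, dy, dx) with the displacement in search-crop pixels;
    stride is the assumed total stride of the backbone.
    """
    best = int(np.argmax([m.max() for m in score_maps]))
    m = score_maps[best]
    r, c = np.unravel_index(int(np.argmax(m)), m.shape)
    dy = (r - (m.shape[0] - 1) / 2) * stride
    dx = (c - (m.shape[1] - 1) / 2) * stride
    return scales[best], dy, dx
```

The returned displacement, rescaled by the crop-to-frame ratio, updates the target center; the returned scale, optionally damped to avoid jitter, updates the target size.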
The invention provides a target tracking method based on a learnable sparse conversion attention mechanism that combines a convolutional neural network model with a learnable sparse conversion model to obtain sparser and more robust target template and search-area image features. In addition, the similarity between the target template features and the search-area features is computed by cross-correlation, and a multi-scale strategy adapts to target scale changes. The proposed method has good robustness and real-time performance, handles appearance changes including occlusion, illumination change, and motion blur well, and finally achieves a good target-tracking effect.
Referring to fig. 2, the present invention further provides a target tracking system based on a learnable sparse conversion attention mechanism, wherein the system includes a first processing module 11, a second processing module 12, a first learning module 13, a second learning module 14, a sliding window processing module 15, and a positioning tracking module 16;
a first processing module 11, configured to initialize an image in a given first frame target frame to generate a target template image;
a second processing module 12, configured to, in the second frame and each subsequent frame, take the target center of the image in the previous-frame target frame as the center point, obtain a plurality of search-area images through a multi-scale strategy, and resize them to the same size;
the first learning module 13 is configured to input the target template image and the search area image into a convolutional neural network model sharing a weight, and extract a target template depth feature and a search area depth feature through a convolutional neural network respectively;
the second learning module 14 is configured to perform spatial transformation and channel transformation on the depth feature of the target template and the depth feature of the search area based on a learnable sparse model to reduce spatial feature redundancy and inter-channel redundancy;
the sliding window processing module 15 is configured to perform a sliding window operation on the search area image by using the depth feature of the target template processed by the learnable sparse model as a convolution kernel to obtain a plurality of score maps;
and the positioning and tracking module 16 is configured to estimate, according to the position with the largest score value in the multiple score maps, a relative displacement of a target center of the image in the target frame of the previous frame in the current frame, and obtain, through a multi-scale strategy, a scale change of the target tracking image, so as to implement tracking of the target.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A target tracking method based on a learnable sparse conversion attention mechanism is characterized by comprising the following steps:
the method comprises the following steps: initializing an image in a given first frame target frame to generate a target template image;
step two: in a second frame and a subsequent frame, taking the target center of the image in the target frame of the previous frame as a central point, obtaining a plurality of search area images through a multi-scale strategy, and adjusting the plurality of search area images to be the same in size;
step three: inputting the target template image and the search area image into a convolutional neural network model sharing weight values, and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
step four: performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model to reduce spatial feature redundancy and inter-channel redundancy;
step five: taking the depth features of the target template processed by the learnable sparse model as convolution kernels, and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
step six: according to the position with the maximum score value in the score maps, the relative displacement of the target center of the image in the target frame of the previous frame in the current frame is estimated, and the scale change of the target tracking image is obtained through a multi-scale strategy so as to realize the tracking of the target;
in the third step, in the step of extracting the depth features by the convolutional neural network, the corresponding convolution operation is represented as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel;

and in the fourth step, performing the spatial conversion comprises: decomposing a local region of the input image into different frequency bands through successive row and column transforms, and initializing the corresponding column and row transform weights; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
2. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 1, wherein in step one, the coordinates of the center of the target to be tracked in the first-frame target frame are $(x_0, y_0)$, the height and width of the target to be tracked in the first-frame target frame are $h$ and $w$ respectively, and a correlation coefficient $p$ is set accordingly as a function of $h$ and $w$.
3. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 2, wherein in step one, the side length $z$ of the target template image is obtained from the correlation coefficient $p$.
4. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 2, wherein in step two, the side length $x$ of the search-area image is calculated from the correlation coefficient $p$ and the height $h_{t-1}$ and width $w_{t-1}$ of the image in the previous-frame target frame; when the previous frame is the first frame, the height and width are $h$ and $w$ respectively.
5. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 4, wherein in step two, after the step of obtaining the side length $x$ of the search-area image, the method further comprises: taking the target center $(x_{t-1}, y_{t-1})$ of the image in the previous-frame target frame as the center point, and using three scaled side lengths $\{a^{-1}x,\ x,\ ax\}$ as different side lengths to obtain different search-area images, wherein the search-area images are all resized to the same fixed size.
6. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 2, wherein step six specifically comprises: finding the position $(x_{\max}, y_{\max})$ of the maximum score value in the three score maps, and calculating the relative displacement between this position and the target center of the image in the previous-frame target frame; and updating the position of the target center of the current-frame target tracking image according to the relative displacement, so as to locate the target.
7. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 6, further comprising: updating the scale of the current-frame target tracking image according to the scale at which the maximum score value among the three score maps lies, wherein the corresponding scale change $\Delta s$ is determined by the scale $s^{*}$ at which that maximum lies.
8. A target tracking system based on a learnable sparse conversion attention mechanism, the system comprising:
the first processing module is used for initializing the image in the given first frame target frame to generate a target template image;
the second processing module is used for obtaining a plurality of search area images by taking the target center of the image in the target frame of the previous frame as a central point through a multi-scale strategy in the second frame and the subsequent frames and adjusting the plurality of search area images to be the same in size;
the first learning module is used for inputting the target template image and the search area image into a convolutional neural network model sharing weight values and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
the second learning module is used for performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy;
the sliding window processing module is used for taking the depth features of the target template processed by the learnable sparse model as convolution kernels and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
the positioning and tracking module is used for estimating the relative displacement of the target center of the image in the target frame of the previous frame in the current frame according to the position with the maximum score value in the score maps, and obtaining the scale change of the target tracking image through a multi-scale strategy so as to realize the tracking of the target;
wherein the first learning module is configured to extract the depth features through the convolutional neural network, the corresponding convolution operation being represented as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel;

and wherein the second learning module is configured, when performing the spatial conversion, to decompose a local region of the input image into different frequency bands through successive row and column transforms and to initialize the corresponding column and row transform weights; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
CN202110929160.4A 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism Active CN113379806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110929160.4A CN113379806B (en) 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110929160.4A CN113379806B (en) 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism

Publications (2)

Publication Number Publication Date
CN113379806A CN113379806A (en) 2021-09-10
CN113379806B true CN113379806B (en) 2021-11-09

Family

ID=77577066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110929160.4A Active CN113379806B (en) 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism

Country Status (1)

Country Link
CN (1) CN113379806B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119669A (en) * 2021-11-30 2022-03-01 南昌工程学院 Image matching target tracking method and system based on Shuffle attention
CN115063445B (en) * 2022-08-18 2022-11-08 南昌工程学院 Target tracking method and system based on multi-scale hierarchical feature representation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108492313A (en) * 2018-02-05 2018-09-04 绍兴文理学院 A kind of dimension self-adaption visual target tracking method based on middle intelligence similarity measure
CN109427055A (en) * 2017-09-04 2019-03-05 长春长光精密仪器集团有限公司 The remote sensing images surface vessel detection method of view-based access control model attention mechanism and comentropy
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060274A (en) * 2019-04-12 2019-07-26 北京影谱科技股份有限公司 The visual target tracking method and device of neural network based on the dense connection of depth
CN111126132A (en) * 2019-10-25 2020-05-08 宁波必创网络科技有限公司 Learning target tracking algorithm based on twin network
CN111260688A (en) * 2020-01-13 2020-06-09 深圳大学 Twin double-path target tracking method
CN111291679B (en) * 2020-02-06 2022-05-27 厦门大学 Target specific response attention target tracking method based on twin network
CN112991385B (en) * 2021-02-08 2023-04-28 西安理工大学 Twin network target tracking method based on different measurement criteria

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427055A (en) * 2017-09-04 2019-03-05 长春长光精密仪器集团有限公司 The remote sensing images surface vessel detection method of view-based access control model attention mechanism and comentropy
CN108492313A (en) * 2018-02-05 2018-09-04 绍兴文理学院 A kind of dimension self-adaption visual target tracking method based on middle intelligence similarity measure
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model

Also Published As

Publication number Publication date
CN113379806A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108090919B (en) Improved kernel correlation filtering tracking method based on super-pixel optical flow and adaptive learning factor
US11551333B2 (en) Image reconstruction method and device
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN110570458B (en) Target tracking method based on internal cutting and multi-layer characteristic information fusion
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN113379806B (en) Target tracking method and system based on learnable sparse conversion attention mechanism
CN111738344B (en) Rapid target detection method based on multi-scale fusion
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN113313810B (en) 6D attitude parameter calculation method for transparent object
CN113989301A (en) Colorectal polyp segmentation method fusing neural networks of multiple attention mechanisms
CN111310768B (en) Saliency target detection method based on robustness background prior and global information
CN115393584A (en) Establishment method based on multi-task ultrasonic thyroid nodule segmentation and classification model, segmentation and classification method and computer equipment
EP3872761A2 (en) Analysing objects in a set of frames
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN110809126A (en) Video frame interpolation method and system based on adaptive deformable convolution
CN111860823A (en) Neural network training method, neural network training device, neural network image processing method, neural network image processing device, neural network image processing equipment and storage medium
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN112270366A (en) Micro target detection method based on self-adaptive multi-feature fusion
CN114119669A (en) Image matching target tracking method and system based on Shuffle attention
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113538221A (en) Three-dimensional face processing method, training method, generating method, device and equipment
CN110503093B (en) Region-of-interest extraction method based on disparity map DBSCAN clustering
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN114782455B (en) Cotton row center line image extraction method for agricultural machine embedded equipment
CN106485686A (en) One kind is based on gravitational spectral clustering image segmentation algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant