CN114708306A - Single-target tracking method and device and storage medium - Google Patents

Single-target tracking method and device and storage medium

Info

Publication number
CN114708306A
CN114708306A (application CN202210240068.1A)
Authority
CN
China
Prior art keywords
target
frame
prediction
search
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210240068.1A
Other languages
Chinese (zh)
Other versions
CN114708306B (en)
Inventor
范保杰
郭小宾
蒋国平
徐丰羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210240068.1A priority Critical patent/CN114708306B/en
Priority claimed from CN202210240068.1A external-priority patent/CN114708306B/en
Publication of CN114708306A publication Critical patent/CN114708306A/en
Application granted
Publication of CN114708306B publication Critical patent/CN114708306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/269: Analysis of motion using gradient-based methods
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20112: Image segmentation details
    • G06T2207/20132: Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-target tracking method, device, and storage medium. The method extracts features of a target region and a search region with a transformer backbone network, and fuses the target-region and search-region features with a transformer-based encoder-decoder architecture to serve the subsequent prediction task; while the target-region and search-region feature encodings are fed into the transformer decoder, M (e.g., 100) target pre-selection boxes based on the target position of the previous frame are added. An IoU prediction module performs N iterations of optimization on the target prediction boxes decoded by the transformer decoder to obtain optimized prediction boxes, then computes the IoU between each optimized prediction box and the annotation box, and takes the average of the three optimized prediction boxes with the highest IoU as the final prediction result. Accuracy is improved while speed is preserved.

Description

Single-target tracking method and device and storage medium
Technical Field
The invention belongs to the technical field of target tracking and relates to a single-target tracking method, device, and storage medium, in particular to a transformer-based single-target tracking method, device, and storage medium using IoU prediction.
Background
Single-target tracking is a very challenging task in computer vision. Given the target information in the first frame, the tracker must automatically determine the state of the target in subsequent frames. Unlike the target detection task, the target's features become available only at inference time, meaning there is no prior information about the target, such as its class or surrounding environment. Thanks to the rapid development of deep learning in recent years, the field of target tracking has produced many research results.
The prior art has the following defect: when facing complex conditions (e.g., blur, scale change, color shift, and fast motion), most existing trackers handle these scenes poorly.
Disclosure of Invention
The purpose is as follows: in order to overcome the defects in the prior art, the invention provides a single-target tracking method, device, and storage medium. IoU prediction is performed based on the transformer, making full use of the transformer's strong ability to fuse local and global information, while avoiding the intricate hyperparameter design and post-processing of traditional trackers and extracting target features more efficiently.
The technical scheme is as follows: in order to solve the technical problems, the technical scheme adopted by the invention is as follows:
In a first aspect, a single-target tracking method is provided, comprising:
acquiring video data;
determining a target region in the first frame of the video data, and performing feature extraction through a transformer backbone network and encoding through a transformer encoder on the target region to obtain the target-region feature encoding;
generating M target prior boxes with a Gaussian function according to the target region of the first frame, where M is an integer greater than 3;
determining a search region according to the target region of the first frame;
performing transformer backbone-network feature extraction and transformer encoder encoding on the search region to obtain the search-region feature encoding;
inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, obtaining the intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing the network parameters with the IoU loss, and performing N iterations of optimization on the target prediction boxes by gradient descent to obtain optimized prediction boxes;
and selecting the average of the optimized prediction boxes with the top-three IoU values as the final target tracking box.
In some embodiments, determining a target region in the first frame of the video data comprises:
labeling the target in the first frame with a rectangular box, where the target region comprises a target position and a target size determined by the position and size of the rectangular box, respectively.
In some embodiments, determining a search region according to the target region of the first frame of the video data comprises:
crop_sz = √(w × h) × s
where crop_sz denotes the size of the search region cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes its height, and s denotes a search factor.
In some embodiments, as shown in fig. 2 and fig. 3, inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, and obtaining the IoU values between the M output target prediction boxes and the annotation box, comprises: concatenating the target-region feature encoding and the search-region feature encoding along the channel dimension to obtain a feature map; obtaining M target prediction boxes from the feature map and the M target prior boxes; and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network (FFN).
In some embodiments, optimizing the network parameters with the IoU loss and performing N iterations of optimization on the target prediction boxes by gradient descent uses:
L_IoU = 1 - |A ∩ B| / |A ∪ B|
where A is a prediction box and B is the annotation box.
In some embodiments, M is 100.
In a second aspect, the present invention provides a single-target tracking device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
Beneficial effects: compared with other transformer-based target tracking algorithms, the single-target tracking method and system provided by the invention jointly feed the template and the search image into a transformer decoder while adding 100 target prior boxes based on the target position of the previous frame. The insight is that in a tracking task the target's state changes little between consecutive frames, so the 100 target prior boxes give the network a positional prior, improving accuracy while preserving speed. Tracking performance is ensured by effectively using the template-frame information and giving the model an appropriate prior. In addition, the method avoids the careful anchor design required by traditional anchor-based methods and eliminates the complex post-processing steps of most target tracking algorithms, improving tracking accuracy while remaining real-time.
Drawings
FIG. 1 is a flow chart of a single target tracking method according to an embodiment of the invention;
fig. 2 and 3 are schematic diagrams of a network of a single-target tracking system according to an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the figures and examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "exceeding", "more than", "less than", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. Where "first" and "second" are used, they merely distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, reference to the description of "one embodiment", "some embodiments", "illustrative embodiments", "examples", "specific examples", or "some examples", etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
A single-target tracking method, comprising:
acquiring video data;
determining a target region in the first frame of the video data, and performing feature extraction through a transformer backbone network and encoding through a transformer encoder on the target region to obtain the target-region feature encoding;
generating M target prior boxes with a Gaussian function according to the target region of the first frame, where M is an integer greater than 3;
determining a search region according to the target region of the first frame;
performing transformer backbone-network feature extraction and transformer encoder encoding on the search region to obtain the search-region feature encoding;
inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, obtaining the intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing the network parameters with the IoU loss, and performing N iterations of optimization on the target prediction boxes by gradient descent to obtain optimized prediction boxes;
and computing the IoU of the optimized prediction boxes by a network forward pass, then selecting the average of the top-three boxes by IoU value as the final target tracking box.
In some embodiments, determining a target region in the first frame of the video data comprises:
labeling the target in the first frame with a rectangular box, where the target region comprises a target position and a target size determined by the position and size of the rectangular box, respectively.
While passing through the backbone network, the initial frame image also passes through a Position Encoding module, which records the relative or absolute position of each image pixel in the sequence. The position encoding is computed as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
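For illustration, a minimal PyTorch sketch of this sinusoidal position encoding is given below (the function name and the even-d_model assumption are ours, not from the patent):

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the PE table described above; d_model is assumed even."""
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))               # 10000^(-2i/d)
    pe[:, 0::2] = torch.sin(pos * div)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)   # PE(pos, 2i+1)
    return pe
```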
the transform encoder structure specifically comprises six identical encoders. Each encoder consists of Multi-Head Self-attachment, a Norm layer and a Feed-Forward Network, and each structural formula is expressed as follows: the Self-orientation formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Multi-Head Self-Attention:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W
where head_i is the output of the i-th Self-Attention head, h is the number of heads, and W is a weight matrix.
Feed-Forward Network:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
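A compact sketch of one encoder layer assembled from these formulas (the post-norm residual arrangement, the layer sizes, and adding the position encoding to queries and keys are assumptions; the patent only fixes the number of layers at six):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: Multi-Head Self-Attention -> Norm -> FFN -> Norm."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        # FFN(x) = max(0, x W1 + b1) W2 + b2
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        q = k = x + pos                           # position encoding on queries/keys
        x = self.norm1(x + self.attn(q, k, x)[0]) # residual + Norm
        return self.norm2(x + self.ffn(x))        # residual + Norm
```

Six such layers stacked would form the encoder described above.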
In some embodiments, generating M target prior boxes with a Gaussian function according to the target region of the first frame of the video data comprises using the following Gaussian function:
f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²))
where μ = 0 and σ takes the values 0.05 and 0.5, respectively. In some embodiments, M is 100: 100 target prior boxes are generated around the target box of the previous frame.
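As an illustrative sketch, the prior boxes could be sampled as follows; which σ perturbs the center and which the size is our assumption, since the patent only gives μ = 0 and σ = 0.05 / 0.5:

```python
import torch

def sample_prior_boxes(prev_box: torch.Tensor, m: int = 100,
                       sigma_center: float = 0.05,
                       sigma_scale: float = 0.5) -> torch.Tensor:
    """Jitter the previous frame's box (cx, cy, w, h) with Gaussian noise (mu = 0).

    Assumption: sigma = 0.05 perturbs the center (relative to box size)
    and sigma = 0.5 perturbs the size on a log scale.
    """
    cx, cy, w, h = prev_box
    noise_c = torch.randn(m, 2) * sigma_center   # relative center offsets
    noise_s = torch.randn(m, 2) * sigma_scale    # log-scale size jitter
    centers = torch.stack([cx + noise_c[:, 0] * w,
                           cy + noise_c[:, 1] * h], dim=1)
    sizes = torch.stack([w * torch.exp(noise_s[:, 0]),
                         h * torch.exp(noise_s[:, 1])], dim=1)
    return torch.cat([centers, sizes], dim=1)    # (m, 4) prior boxes

# e.g. priors = sample_prior_boxes(torch.tensor([160., 160., 64., 48.]))
```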
In some embodiments, determining a search region according to the target region of the first frame of the video data comprises:
crop_sz = √(w × h) × s
where crop_sz denotes the size of the search region cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes its height, and s denotes a search factor.
Further, in some embodiments, after the search region is determined, the cropped picture is resized to a fixed 320 × 320 with a PyTorch function and then fed into the transformer backbone network for feature extraction.
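A minimal sketch of this cropping-and-resizing step, assuming torchvision for the image operations and an illustrative search factor s (the patent does not fix s):

```python
import math
import torch
import torchvision.transforms.functional as TF

def crop_search_region(frame: torch.Tensor, prev_box, s: float = 4.0) -> torch.Tensor:
    """Crop a square region of side crop_sz = sqrt(w * h) * s centered on the
    previous target position, then resize to a fixed 320 x 320 (frame is CxHxW)."""
    cx, cy, w, h = prev_box
    crop_sz = int(math.ceil(math.sqrt(w * h) * s))
    top, left = int(cy - crop_sz / 2), int(cx - crop_sz / 2)
    patch = TF.crop(frame, top, left, crop_sz, crop_sz)  # zero-pads out-of-bounds area
    return TF.resize(patch, [320, 320])
```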
In some embodiments, inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, and obtaining the IoU values between the M output target prediction boxes and the annotation box, comprises: concatenating the target-region feature encoding and the search-region feature encoding along the channel dimension to obtain a feature map; obtaining M target prediction boxes from the feature map and the M target prior boxes; and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network (FFN).
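The following sketch shows the shape of such a fusion-and-prediction head; all module sizes, the box-embedding layer, and equal template/search sequence lengths are assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class IoUPredictionHead(nn.Module):
    """Concatenate template/search encodings on the channel dimension, decode the
    M prior boxes as queries, and regress one IoU value per predicted box."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.box_embed = nn.Linear(4, d_model)        # embed (cx, cy, w, h) priors
        self.fuse = nn.Linear(2 * d_model, d_model)   # channel-wise concat -> d_model
        self.iou_ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, 1))

    def forward(self, feat_t, feat_s, priors):
        # feat_t, feat_s: (L, B, d_model), assumed same length L; priors: (M, B, 4)
        memory = self.fuse(torch.cat([feat_t, feat_s], dim=-1))
        queries = self.box_embed(priors)              # (M, B, d_model)
        decoded = self.decoder(queries, memory)
        return self.iou_ffn(decoded).squeeze(-1)      # (M, B) predicted IoU values
```

Because the predicted IoU is a differentiable function of the input box coordinates, this head also supports the gradient-based box refinement described below.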
In some embodiments, optimizing the network parameters with the IoU loss and performing N iterations of optimization on the target prediction boxes by gradient descent uses:
L_IoU = 1 - |A ∩ B| / |A ∪ B|
where A is a prediction box and B is the annotation box.
The target prediction boxes are iteratively optimized N times using gradient descent; in some embodiments, N is 10. The specific process is as follows:
Step 1: the network parameters θ_0 are optimized with the IoU loss, and the step size is set to α = 1;
Step 2: the M target prediction boxes A = {A_1, ..., A_M} are passed through the feed-forward network to obtain the IoU loss with respect to the annotation box, denoted L_IoU;
Step 3: the gradient ∂L_IoU / ∂A is computed;
Step 4: the M target prediction boxes are updated: A ← A - α · ∂L_IoU / ∂A;
Step 5: after N iterations, the optimized boxes A^N = {A_1^N, ..., A_M^N} are obtained.
This optimization process only iteratively refines the target prediction boxes; it does not update the network parameters.
Finally, the prediction boxes obtained after the N optimization steps are scored by a forward IoU computation, and the best boxes are selected and averaged as the final prediction box A_final used to track the target.
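An illustrative sketch of this refinement loop for a single frame (batch size 1), reusing the hypothetical IoUPredictionHead above; N = 10 and α = 1 follow the text, everything else is an assumption:

```python
import torch

def refine_boxes(iou_head, feat_t, feat_s, boxes: torch.Tensor,
                 n_steps: int = 10, alpha: float = 1.0) -> torch.Tensor:
    """Iteratively refine M boxes (shape (M, 1, 4)) by gradient descent on
    L_IoU = 1 - predicted IoU; the network weights are never updated."""
    boxes = boxes.clone().requires_grad_(True)
    for _ in range(n_steps):
        loss = (1.0 - iou_head(feat_t, feat_s, boxes)).sum()  # sum of L_IoU over boxes
        grad, = torch.autograd.grad(loss, boxes)
        boxes = (boxes - alpha * grad).detach().requires_grad_(True)
    with torch.no_grad():                                     # final forward IoU pass
        iou = iou_head(feat_t, feat_s, boxes).squeeze(-1)     # (M,)
        top3 = iou.topk(3).indices
    return boxes.detach().squeeze(1)[top3].mean(dim=0)        # averaged A_final
```

Only the box coordinates receive gradients here; the IoU head stays frozen, matching the note above that the network parameters are not updated.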
Example 2
In a second aspect, the present embodiment provides a single-target tracking apparatus, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to embodiment 1.
Example 3
In a third aspect, the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (8)

1. A single-target tracking method, comprising:
acquiring video data;
determining a target region in the first frame of the video data, and performing feature extraction through a transformer backbone network and encoding through a transformer encoder on the target region to obtain the target-region feature encoding;
generating M target prior boxes with a Gaussian function according to the target region of the first frame, wherein M is an integer greater than 3;
determining a search region according to the target region of the first frame;
performing transformer backbone-network feature extraction and transformer encoder encoding on the search region to obtain the search-region feature encoding;
inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, obtaining the intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing the network parameters with the IoU loss, and performing N iterations of optimization on the target prediction boxes by gradient descent to obtain optimized prediction boxes;
and selecting the average of the optimized prediction boxes with the top-three IoU values as the final target tracking box.
2. The single-target tracking method according to claim 1, wherein determining a target region in the first frame of the video data comprises:
labeling the target in the first frame with a rectangular box, wherein the target region comprises a target position and a target size determined by the position and size of the rectangular box, respectively.
3. The single-target tracking method according to claim 1 or 2, wherein determining the search region according to the target region of the first frame of the video data comprises:
crop_sz = √(w × h) × s
wherein crop_sz denotes the size of the search region cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes its height, and s denotes a search factor.
4. The single-target tracking method according to claim 1, wherein inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, and obtaining the IoU values between the M output target prediction boxes and the annotation box, comprises: concatenating the target-region feature encoding and the search-region feature encoding along the channel dimension to obtain a feature map; obtaining M target prediction boxes from the feature map and the M target prior boxes; and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network.
5. The single-target tracking method according to claim 1, wherein optimizing the network parameters with the IoU loss and performing N iterations of optimization on the target prediction boxes by gradient descent uses:
L_IoU = 1 - |A ∩ B| / |A ∪ B|
wherein A is a prediction box and B is the annotation box.
6. The method of claim 5, wherein M is 100.
7. A single-target tracking device, comprising a processor and a storage medium;
the storage medium is configured to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 6.
8. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, implementing the steps of the method of any one of claims 1 to 6.
CN202210240068.1A 2022-03-10 Single-target tracking method, single-target tracking device and storage medium Active CN114708306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210240068.1A CN114708306B (en) 2022-03-10 Single-target tracking method, single-target tracking device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210240068.1A CN114708306B (en) 2022-03-10 Single-target tracking method, single-target tracking device and storage medium

Publications (2)

Publication Number Publication Date
CN114708306A (en) 2022-07-05
CN114708306B (en) 2024-10-29


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192291A (en) * 2019-12-06 2020-05-22 东南大学 Target tracking method based on cascade regression and twin network
US20200327680A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deep adversarial training
WO2020228353A1 (en) * 2019-05-13 2020-11-19 深圳先进技术研究院 Motion acceleration-based image search method, system, and electronic device
CN112541944A (en) * 2020-12-10 2021-03-23 山东师范大学 Probability twin target tracking method and system based on conditional variational encoder
CN113256678A (en) * 2021-04-26 2021-08-13 中国人民解放军32802部队 Target tracking method based on self-attention transformation network


Similar Documents

Publication Publication Date Title
KR102279350B1 (en) Method and device for generating image data set to be used for learning cnn capable of detecting obstruction in autonomous driving circumstance, and testing method, and testing device using the same
Jiang et al. Matching by linear programming and successive convexification
US9947077B2 (en) Video object tracking in traffic monitoring
US11184558B1 (en) System for automatic video reframing
CN112270710B (en) Pose determining method, pose determining device, storage medium and electronic equipment
CN111814753A (en) Target detection method and device under foggy weather condition
CN110443279B (en) Unmanned aerial vehicle image vehicle detection method based on lightweight neural network
JP2020038666A (en) Method for generating data set for learning for detection of obstacle in autonomous driving circumstances and computing device, learning method, and learning device using the same
JP2020038668A (en) Method for generating image data set for cnn learning for detecting obstacle in autonomous driving circumstances and computing device
WO2021164515A1 (en) Detection method and apparatus for tampered image
Dong et al. Temporal feature augmented network for video instance segmentation
CN116740413A (en) Deep sea biological target detection method based on improved YOLOv5
CN117132914A (en) Method and system for identifying large model of universal power equipment
CN116052108A (en) Transformer-based traffic scene small sample target detection method and device
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
CN114708306B (en) Single-target tracking method, single-target tracking device and storage medium
CN114708306A (en) Single-target tracking method and device and storage medium
Xiong et al. Distortion map-guided feature rectification for efficient video semantic segmentation
EP4235492A1 (en) A computer-implemented method, data processing apparatus and computer program for object detection
Cho et al. Deep photo-geometric loss for relative camera pose estimation
JP2024068729A (en) Learning device, parameter adjustment method and recording medium
Shi et al. Small object detection algorithm incorporating swin transformer for tea buds
CN113987270A (en) Method, device, terminal and storage medium for determining similar video clips
Zhou et al. LEDet: localization estimation detector with data augmentation for ship detection based on unmanned surface vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant