CN114708306A - Single-target tracking method and device and storage medium - Google Patents
Info
- Publication number
- CN114708306A (application number CN202210240068.1A)
- Authority
- CN
- China
- Prior art keywords
- target
- frame
- prediction
- search
- area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/269 — Image analysis; analysis of motion using gradient-based methods
- G06T2207/10016 — Indexing scheme for image analysis or image enhancement; image acquisition modality; video; image sequence
- G06T2207/20084 — Indexing scheme for image analysis or image enhancement; special algorithmic details; artificial neural networks [ANN]
- G06T2207/20112 — Indexing scheme for image analysis or image enhancement; special algorithmic details; image segmentation details
- G06T2207/20132 — Indexing scheme for image analysis or image enhancement; special algorithmic details; image cropping
Abstract
The invention discloses a single-target tracking method, device, and storage medium. The method extracts features of a target area and a search area with a transformer backbone network, and fuses the target area features and the search area features with a transformer-based encoder-decoder architecture to serve the subsequent prediction task; while the target area feature code and the search area feature code are fed into the transformer decoder, M (100) target prior boxes based on the target position in the previous frame are added as well. An IoU prediction module performs N iterations of optimization on the target prediction boxes decoded by the transformer decoder to obtain optimized prediction boxes, computes the IoU between each optimized prediction box and the annotation box, and takes the average of the three optimized prediction boxes with the highest IoU as the final prediction result. Accuracy is improved while speed is maintained.
Description
Technical Field
The invention belongs to the technical field of target tracking and relates to a single-target tracking method, device, and storage medium, in particular to a single-target tracking method, device, and storage medium performing transformer-based IoU prediction.
Background
Single-target tracking is a very challenging task in computer vision. Given the target information in the first frame, the tracker must automatically determine the state of the target in every subsequent frame. Unlike target detection, the target's appearance only becomes available at inference time, which means there is no prior information about the target, such as its class or surrounding environment. Thanks to the rapid development of deep learning in recent years, the field of target tracking has produced many research results.
The prior art has the following defect: when facing complex conditions (e.g., blur, scale change, color shift, and fast motion), most existing trackers do not handle these scenes well.
Summary of the Invention
Purpose: to overcome the deficiencies of the prior art, the invention provides a single-target tracking method, device, and storage medium. IoU prediction is performed with a transformer, which makes full use of the transformer's strong capability to integrate local and global information, avoids the complex parameter design and post-processing of traditional trackers, and extracts target features more efficiently.
Technical solution: to solve the above technical problems, the invention adopts the following technical solution:
in a first aspect, a single target tracking method is provided, including:
acquiring video data;
determining a target area in a first frame of the video data, and performing feature extraction with a transformer backbone network and encoding with a transformer encoder on the target area to obtain a target area feature code;
generating M target prior boxes with a Gaussian function according to the target area of the first frame, wherein M is an integer greater than 3;
determining a search area according to the target area of the first frame;
performing transformer backbone network feature extraction and transformer encoder encoding on the search area to obtain a search area feature code;
inputting the target area feature code, the search area feature code, and the M target prior boxes into a transformer decoder for decoding to obtain intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing network parameters with an IoU loss, and performing N iterations of gradient-descent optimization on the target prediction boxes to obtain optimized prediction boxes;
selecting the average of the three optimized prediction boxes with the highest IoU values as the final target tracking box.
In some embodiments, determining the target area of the first frame of video data comprises:
labeling the target in the first frame of the video data with a rectangular box, wherein the target area comprises a target position and a target size, determined respectively by the position and the size of the rectangular box.
In some embodiments, determining a search area from a target area of a first frame of video data comprises:
wherein crop_sz denotes the size of the search area cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes the height of the target box in the previous frame, and s denotes a search factor.
In some embodiments, as shown in fig. 2 and fig. 3, inputting the target area feature code, the search area feature code, and the M target prior boxes into a transformer decoder for decoding to obtain the IoU values between the M output target prediction boxes and the annotation box comprises: concatenating the target area feature code and the search area feature code along the channel dimension to obtain a feature map, obtaining M target prediction boxes from the feature map and the M target prior boxes, and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network (FFN).
In some embodiments, optimizing network parameters with the IoU loss and performing N iterations of gradient-descent optimization on the target prediction boxes comprises computing
IoU(A, B) = |A ∩ B| / |A ∪ B|,
wherein A denotes a prediction box and B denotes the annotation box.
In some embodiments, M is 100.
In a second aspect, the present invention provides a single target tracking device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
Advantageous effects: compared with other transformer-based target tracking algorithms, in the single-target tracking method and system provided by the invention the template and the search image are jointly fed to the transformer decoder, together with 100 target prior boxes based on the target position in the previous frame. The underlying observation is that the target's state changes little between consecutive frames in a tracking task, so the 100 prior boxes effectively hand the network a positional prior, improving accuracy while preserving speed. Tracking performance is ensured by effectively exploiting the template frame information and giving the model a suitable prior. In addition, the method avoids the careful anchor-box design required by traditional anchor-based approaches and eliminates the complex post-processing steps found in most target tracking algorithms, improving tracking accuracy while remaining real-time.
Drawings
FIG. 1 is a flow chart of a single target tracking method according to an embodiment of the invention;
FIGS. 2 and 3 are schematic diagrams of the network of a single-target tracking system according to an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the figures and examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
In the description of the present invention, "several" means one or more, and "a plurality" means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it. If "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, reference to the description of "one embodiment", "some embodiments", "illustrative embodiments", "examples", "specific examples", or "some examples", etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
A single target tracking method, comprising:
acquiring video data;
determining a target area in a first frame of the video data, and performing feature extraction with a transformer backbone network and encoding with a transformer encoder on the target area to obtain a target area feature code;
generating M target prior boxes with a Gaussian function according to the target area of the first frame, wherein M is an integer greater than 3;
determining a search area according to the target area of the first frame;
performing transformer backbone network feature extraction and transformer encoder encoding on the search area to obtain a search area feature code;
inputting the target area feature code, the search area feature code, and the M target prior boxes into a transformer decoder for decoding to obtain intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing network parameters with an IoU loss, and performing N iterations of gradient-descent optimization on the target prediction boxes to obtain optimized prediction boxes;
selecting the average of the three optimized prediction boxes with the highest IoU values, as computed by a forward pass of the network, as the final target tracking box.
In some embodiments, determining the target area of the first frame of video data comprises:
labeling the target in the first frame of the video data with a rectangular box, wherein the target area comprises a target position and a target size, determined respectively by the position and the size of the rectangular box.
While passing through the backbone network, the initial frame image also passes through a Position Encoding module, which preserves the relative or absolute position of each image pixel in the sequence. The position encoding is computed as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
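By way of illustration, the sinusoidal position encoding above can be computed as in the following minimal PyTorch sketch (the function name and tensor shapes are illustrative assumptions, not taken from the patent):

```python
import math
import torch

def sinusoidal_position_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    """Build the PE(pos, 2i) / PE(pos, 2i+1) sin-cos table described above."""
    pe = torch.zeros(num_positions, d_model)
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    # 10000^(-2i/d) for each even channel index 2i
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # even channels: sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd channels: cosine
    return pe  # shape: (num_positions, d_model)
```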
the transform encoder structure specifically comprises six identical encoders. Each encoder consists of Multi-Head Self-attachment, a Norm layer and a Feed-Forward Network, and each structural formula is expressed as follows: the Self-orientation formula:
Multi-Head Self-Attention:
MultiHead(Q,K,V)=Concat(head1,...,headh)W
wherein, head refers to the result of each Self-orientation, h refers to the number of heads, and W is a weighting matrix.
Feed-Forward Network:
FFW(x)=max(0,xW1+b1)W2+b2。
In some embodiments, generating the M target prior boxes with a Gaussian function from the target area of the first frame of video data uses the following Gaussian function:
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
where μ is 0 and σ is 0.05 and 0.5, respectively. In some embodiments, M is 100, and the 100 target prior boxes are generated relative to the target box of the previous frame.
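One plausible way to draw the 100 prior boxes is to jitter the previous-frame box with zero-mean Gaussian noise; which σ applies to the center and which to the size is an assumption here, since the text only gives μ = 0 and σ = 0.05 / 0.5:

```python
import torch

def generate_prior_boxes(prev_box: torch.Tensor, m: int = 100,
                         sigma_center: float = 0.05,
                         sigma_size: float = 0.5) -> torch.Tensor:
    """Sample m prior boxes around the previous-frame box (cx, cy, w, h).
    The assignment of the two sigmas to center/size is an assumption."""
    _, _, w, h = prev_box
    boxes = prev_box.repeat(m, 1).float()
    boxes[:, 0] += torch.randn(m) * sigma_center * w        # jitter center x
    boxes[:, 1] += torch.randn(m) * sigma_center * h        # jitter center y
    boxes[:, 2] *= torch.exp(torch.randn(m) * sigma_size)   # jitter width
    boxes[:, 3] *= torch.exp(torch.randn(m) * sigma_size)   # jitter height
    return boxes  # (m, 4)
```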
In some embodiments, determining a search area from a target area of a first frame of video data comprises:
wherein crop_sz denotes the size of the search area cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes the height of the target box in the previous frame, and s denotes a search factor.
Further, in some embodiments, after the search area is determined, the picture is resized to a fixed 320 × 320 size with a PyTorch function and then fed into the transformer backbone network for feature extraction.
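A sketch of the cropping and resizing step: the crop-size rule crop_sz = ceil(sqrt(w·h)·s) follows the convention of trackers such as ATOM/DiMP and is an assumption here, as is the search factor s, since the patent's exact formula is not reproduced above:

```python
import math
import torch
import torch.nn.functional as F

def crop_search_area(frame: torch.Tensor, prev_box, s: float = 5.0,
                     out_sz: int = 320) -> torch.Tensor:
    """Crop a square search region centered on the previous box and
    resize it to a fixed out_sz x out_sz (320 x 320 in the text)."""
    cx, cy, w, h = prev_box                      # previous-frame box
    crop_sz = math.ceil(math.sqrt(w * h) * s)    # assumed crop-size rule
    x1 = max(0, int(cx - crop_sz / 2))
    y1 = max(0, int(cy - crop_sz / 2))
    patch = frame[:, y1:y1 + crop_sz, x1:x1 + crop_sz]  # frame: (C, H, W)
    patch = F.interpolate(patch.unsqueeze(0), size=(out_sz, out_sz),
                          mode='bilinear', align_corners=False)
    return patch.squeeze(0)  # (C, out_sz, out_sz)
```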
In some embodiments, inputting the target area feature code, the search area feature code, and the M target prior boxes into a transformer decoder for decoding to obtain the IoU values between the M output target prediction boxes and the annotation box comprises: concatenating the target area feature code and the search area feature code along the channel dimension to obtain a feature map, obtaining M target prediction boxes from the feature map and the M target prior boxes, and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network (FFN).
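The fusion and IoU head described above might be organized as in the following sketch; the layer sizes, the query-embedding scheme, and the assumption that the two feature codes share the same sequence length are all illustrative, not taken from the patent:

```python
import torch
from torch import nn

class IoUPredictionHead(nn.Module):
    """Concatenate target and search feature codes along the channel
    dimension, decode with the M prior boxes as queries, and regress one
    IoU value per box through a feed-forward network."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)   # channel-wise concat -> d_model
        layer = nn.TransformerDecoderLayer(d_model, n_heads)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_embed = nn.Linear(4, d_model)        # embed (cx, cy, w, h) priors
        self.iou_ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, 1))

    def forward(self, target_code, search_code, prior_boxes):
        # target_code, search_code: (L, B, d_model); prior_boxes: (M, B, 4)
        memory = self.fuse(torch.cat([target_code, search_code], dim=-1))
        queries = self.box_embed(prior_boxes)
        decoded = self.decoder(queries, memory)       # (M, B, d_model)
        return self.iou_ffn(decoded).squeeze(-1)      # predicted IoU per box
```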
In some embodiments, optimizing network parameters with the IoU loss and performing N iterations of gradient-descent optimization on the target prediction boxes comprises computing
IoU(A, B) = |A ∩ B| / |A ∪ B|,
wherein A denotes a prediction box and B denotes the annotation box.
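For reference, the IoU between two axis-aligned boxes in (x1, y1, x2, y2) form can be computed directly from the definition above:

```python
def box_iou(a, b):
    """IoU(A, B) = |A ∩ B| / |A ∪ B| for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])      # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```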
The target prediction boxes are iteratively optimized N times using a gradient descent method; in some embodiments, N is taken to be 10. The specific process is as follows:
Step 1: optimize the network parameters θ0 with the IoU loss, and set the step size α = 1;
Step 2: pass the M target prediction boxes through the feed-forward network to obtain the IoU loss relative to the annotation box, denoted L_IoU;
Step 3: update the target prediction boxes along the negative gradient of L_IoU with step size α, and repeat Step 2 for N iterations.
The optimization process iteratively optimizes only the target prediction boxes and does not update the network parameters.
Finally, the prediction boxes that have undergone the N optimization iterations and a forward IoU computation are ranked, and the three with the highest IoU values are averaged to give the final prediction box A_final used to track the target.
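A minimal sketch of this refinement loop, assuming a trained `iou_head` callable that maps boxes (plus the fused feature codes `memory`) to predicted IoU values; the frozen network, the step size α = 1, N = 10, and the top-3 averaging follow the text, while the callable and its signature are hypothetical:

```python
import torch

def refine_boxes(pred_boxes: torch.Tensor, iou_head, memory,
                 n_steps: int = 10, alpha: float = 1.0) -> torch.Tensor:
    """Update only the box coordinates by gradient descent on the IoU
    loss; network parameters stay frozen throughout."""
    boxes = pred_boxes.clone().requires_grad_(True)
    for _ in range(n_steps):
        loss = -iou_head(boxes, memory).sum()   # descend on -IoU = ascend on IoU
        grad, = torch.autograd.grad(loss, boxes)
        boxes = (boxes - alpha * grad).detach().requires_grad_(True)
    with torch.no_grad():                       # forward pass to rank the boxes
        top3 = iou_head(boxes, memory).topk(3).indices
    return boxes[top3].mean(dim=0)              # final box A_final
```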
Example 2
In a second aspect, the present embodiment provides a single target tracking apparatus, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to embodiment 1.
Example 3
In a third aspect, the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (8)
1. A method for single target tracking, comprising:
acquiring video data;
determining a target area in a first frame of the video data, and performing feature extraction with a transformer backbone network and encoding with a transformer encoder on the target area to obtain a target area feature code;
generating M target prior boxes with a Gaussian function according to the target area of the first frame, wherein M is an integer greater than 3;
determining a search area according to the target area of the first frame;
performing transformer backbone network feature extraction and transformer encoder encoding on the search area to obtain a search area feature code;
inputting the target area feature code, the search area feature code, and the M target prior boxes into a transformer decoder for decoding to obtain intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing network parameters with an IoU loss, and performing N iterations of gradient-descent optimization on the target prediction boxes to obtain optimized prediction boxes;
selecting the average of the three optimized prediction boxes with the highest IoU values as the final target tracking box.
2. The single-target tracking method according to claim 1, wherein determining the target area of the first frame of video data comprises:
labeling the target in the first frame of the video data with a rectangular box, wherein the target area comprises a target position and a target size, determined respectively by the position and the size of the rectangular box.
3. The single-target tracking method of claim 1 or 2, wherein determining the search area based on the target area of the first frame of video data comprises:
wherein crop_sz denotes the size of the search area cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes the height of the target box in the previous frame, and s denotes a search factor.
4. The single-target tracking method according to claim 1, wherein inputting the target area feature code, the search area feature code, and the M target prior boxes into a transformer decoder for decoding to obtain the IoU values between the M output target prediction boxes and the annotation box comprises: concatenating the target area feature code and the search area feature code along the channel dimension to obtain a feature map, obtaining M target prediction boxes from the feature map and the M target prior boxes, and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network.
5. The single-target tracking method according to claim 1, wherein optimizing network parameters with the IoU loss and performing N iterations of gradient-descent optimization on the target prediction boxes comprises computing
IoU(A, B) = |A ∩ B| / |A ∪ B|,
wherein A denotes a prediction box and B denotes the annotation box.
6. The method of claim 5, wherein M is 100.
7. A single-target tracking device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 6.
8. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, implementing the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210240068.1A CN114708306B (en) | 2022-03-10 | | Single-target tracking method, single-target tracking device and storage medium
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210240068.1A CN114708306B (en) | 2022-03-10 | | Single-target tracking method, single-target tracking device and storage medium
Publications (2)
Publication Number | Publication Date |
---|---|
CN114708306A (en) | 2022-07-05
CN114708306B (en) | 2024-10-29
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111192291A (en) * | 2019-12-06 | 2020-05-22 | 东南大学 | Target tracking method based on cascade regression and twin network |
US20200327680A1 (en) * | 2019-04-12 | 2020-10-15 | Beijing Moviebook Science and Technology Co., Ltd. | Visual target tracking method and apparatus based on deep adversarial training |
WO2020228353A1 (en) * | 2019-05-13 | 2020-11-19 | 深圳先进技术研究院 | Motion acceleration-based image search method, system, and electronic device |
CN112541944A (en) * | 2020-12-10 | 2021-03-23 | 山东师范大学 | Probability twin target tracking method and system based on conditional variational encoder |
CN113256678A (en) * | 2021-04-26 | 2021-08-13 | 中国人民解放军32802部队 | Target tracking method based on self-attention transformation network |
Similar Documents
Publication | Title
---|---
KR102279350B1 | Method and device for generating image data set to be used for learning cnn capable of detecting obstruction in autonomous driving circumstance, and testing method, and testing device using the same
Jiang et al. | Matching by linear programming and successive convexification
US9947077B2 | Video object tracking in traffic monitoring
US11184558B1 | System for automatic video reframing
CN112270710B | Pose determining method, pose determining device, storage medium and electronic equipment
CN111814753A | Target detection method and device under foggy weather condition
CN110443279B | Unmanned aerial vehicle image vehicle detection method based on lightweight neural network
JP2020038666A | Method for generating data set for learning for detection of obstacle in autonomous driving circumstances and computing device, learning method, and learning device using the same
JP2020038668A | Method for generating image data set for cnn learning for detecting obstacle in autonomous driving circumstances and computing device
WO2021164515A1 | Detection method and apparatus for tampered image
Dong et al. | Temporal feature augmented network for video instance segmentation
CN116740413A | Deep sea biological target detection method based on improved YOLOv5
CN117132914A | Method and system for identifying large model of universal power equipment
CN116052108A | Transformer-based traffic scene small sample target detection method and device
CN112070181B | Image stream-based cooperative detection method and device and storage medium
CN117788544A | Image depth estimation method based on lightweight attention mechanism
CN114708306B | Single-target tracking method, single-target tracking device and storage medium
CN114708306A | Single-target tracking method and device and storage medium
Xiong et al. | Distortion map-guided feature rectification for efficient video semantic segmentation
EP4235492A1 | A computer-implemented method, data processing apparatus and computer program for object detection
Cho et al. | Deep photo-geometric loss for relative camera pose estimation
JP2024068729A | Learning device, parameter adjustment method and recording medium
Shi et al. | Small object detection algorithm incorporating swin transformer for tea buds
CN113987270A | Method, device, terminal and storage medium for determining similar video clips
Zhou et al. | LEDet: localization estimation detector with data augmentation for ship detection based on unmanned surface vehicle
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |