CN111815677A - Target tracking method and device, terminal equipment and readable storage medium - Google Patents

Target tracking method and device, terminal equipment and readable storage medium

Info

Publication number
CN111815677A
CN111815677A (application CN202010661194.5A)
Authority
CN
China
Prior art keywords
template
detection
candidate
sample set
preset size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010661194.5A
Other languages
Chinese (zh)
Inventor
衣杨
赵小蕾
陈嘉谦
邱泽敏
刘东琳
陈怡华
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua College of Sun Yat Sen University
Original Assignee
Xinhua College of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua College of Sun Yat Sen University filed Critical Xinhua College of Sun Yat Sen University
Priority to CN202010661194.5A
Publication of CN111815677A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods involving models
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a target tracking method and device, a terminal device and a readable storage medium. The method comprises: extracting n template feature maps of different scales from a template frame using a feature pyramid network; extracting n detection feature maps of different scales from a detection frame using the feature pyramid network; determining, with the i-th region candidate sub-network, a preset number of candidate images on the detection feature map, together with their scores and positions, according to the i-th template feature map and the detection feature map of the same scale; and, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all candidate images as tracking targets according to the scores, with the positions corresponding to the tracking targets taken as target positions. The scheme achieves accurate tracking of small targets.

Description

Target tracking method and device, terminal equipment and readable storage medium
Technical Field
The present invention relates to the field of target tracking, and in particular, to a target tracking method, an apparatus, a terminal device, and a readable storage medium.
Background
In recent years, deep learning has begun to enter the field of military target tracking, attracting the attention of more and more researchers. The low-level features of tracking images have higher resolution, which facilitates accurate localization of targets; the high-level features contain more semantic information, can handle larger target changes, prevent the tracker from drifting, and make it easier to localize the target within a range. Deep learning can therefore extract target features better and represent targets better. However, because deep learning requires large training sets and long online update times, the timeliness of tracking remains a considerable challenge.
Disclosure of Invention
In view of the foregoing problems, the present invention provides a target tracking method, an apparatus, a terminal device and a readable storage medium.
A first embodiment of the present invention provides a target tracking method, including:
extracting n template feature maps of different scales from a template frame by using a feature pyramid network;
extracting, from a detection frame by using the feature pyramid network, n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps;
determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map; and
when the n region candidate sub-networks have finished processing the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, and taking the positions corresponding to the tracking targets as target positions.
A target tracking method according to a second embodiment of the present invention, where the determining, by using the i-th region candidate sub-network, of a preset number of candidate images and the scores and positions corresponding to the candidate images on the detection feature map according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map, includes:
performing a convolution operation on the i-th template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
performing a convolution operation on the detection feature map with the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
calculating, by the classification branch of the i-th region candidate sub-network, the score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
determining, by the regression branch of the i-th region candidate sub-network, the position of each sample in the detection sample set of the fourth preset size according to the template sample set of the second preset size and the detection sample set of the fourth preset size; and
determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, where the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
A target tracking method according to a third embodiment of the present invention further includes:
determining position response values of target positions corresponding to the first m candidate images;
and when the maximum position response value is greater than a preset response threshold, taking the candidate image corresponding to the maximum position response value as the new template frame.
In the above target tracking method, the position response value is calculated according to the following formula:

$$\hat{y}(t^*) = y(t^*) - \Delta(t - t^*)\, y(t)$$

where $\hat{y}(t^*)$ represents the position response value, $t^*$ represents the target position, $y(t^*)$ represents the response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y(t)$ represents the response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function.
In the target tracking method according to the above embodiment, a training sample set is used to train, in advance, a target tracking model corresponding to the target tracking method until the error loss of the target tracking model is smaller than a preset error threshold.

The error loss is calculated using the following loss function:

$$\mathcal{L} = -\hat{y}_\beta(t^*) + \mu \sum_{i=1}^{n} \beta_i^2$$

where $\mathcal{L}$ represents the error loss, $\beta_i$ represents the weighting coefficient of the i-th region candidate sub-network, $\mu$ represents the attenuation parameter, and $\hat{y}_\beta(t^*)$ represents the weighted response value of the n region candidate sub-networks.

The weighted response value is calculated as:

$$\hat{y}_\beta(t^*) = y_\beta(t^*) - \Delta(t - t^*)\, y_\beta(t)$$

where $y_\beta(t^*)$ represents the weighted response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y_\beta(t)$ represents the weighted response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function.

The weighted response result is calculated as:

$$y_\beta(t) = \sum_{i=1}^{n} \beta_i\, s_i(t)$$

where $s_i(t)$ represents the response result of the i-th region candidate sub-network.
In the target tracking method according to the foregoing embodiment, the n different scales include at least one of the 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixel scales.
A fourth embodiment of the present invention provides a target tracking apparatus, including:
the template characteristic image acquisition module is used for extracting n template characteristic images with different scales from the template frame by utilizing the characteristic pyramid network;
the detection characteristic graph acquisition module is used for extracting n detection characteristic graphs with different scales, which are in one-to-one correspondence with the scales of the n template characteristic graphs, from a detection frame by utilizing a characteristic pyramid network;
the candidate image determination module is used for determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map;
and the tracking target determining module is used for determining the first m candidate images with the highest scores in all the candidate images as the tracking targets according to the scores of all the candidate images after the n regional candidate sub-networks process the n template feature images and the corresponding detection feature images, and the positions corresponding to the tracking targets are used as target positions.
The above candidate image determination module includes:
a template sample set obtaining unit, configured to perform convolution operation on the ith template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
a detection sample set obtaining unit, configured to perform convolution operation on the detection feature map with the same scale as the ith template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
a candidate image score calculation unit, configured to calculate, by the classification branch of the ith regional candidate subnetwork, a score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
a sample position determining unit, configured to determine, by the regression branch of the i-th area candidate subnetwork, a position of each sample in the fourth preset-sized detection sample set according to the second preset-sized template sample set and the fourth preset-sized detection sample set;
and the candidate image position determining unit is used for determining the position corresponding to each candidate image according to the position of each sample in the detection sample set with the fourth preset size, wherein the detection sample set with the third preset size is equal to the detection sample set with the fourth preset size.
The above embodiments also relate to a terminal device comprising a memory for storing a computer program and a processor for executing the computer program so that the terminal device performs the above target tracking method.
The above embodiments further relate to a readable storage medium storing a computer program which, when run on a processor, performs the above target tracking method.
In the technical solution of the invention, n template feature maps of different scales are extracted from a template frame using a feature pyramid network; n detection feature maps of different scales, in one-to-one correspondence with the scales of the n template feature maps, are extracted from a detection frame; a preset number of candidate images on the detection feature map, together with their scores and positions, are determined by the i-th region candidate sub-network according to the i-th template feature map and the detection feature map of the same scale; and, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all candidate images are determined as tracking targets, with the positions corresponding to the tracking targets taken as target positions. On the one hand, the technical solution uses the feature pyramid network as the feature extraction layer of the tracking framework, effectively fusing low-level high-resolution information with high-level high-semantic information, so the target position can be located more accurately; the tracking performance is especially prominent for small targets. On the other hand, the tracking target is screened through the improved region candidate sub-networks, realizing accurate tracking of small targets.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 is a flow chart illustrating a target tracking method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a target tracking model according to an embodiment of the present invention;
FIG. 3 illustrates a flow diagram for determining alternative image scores and positions according to an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating a structure of a regional candidate subnetwork model according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating another target tracking method according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a target tracking apparatus according to an embodiment of the present invention;
FIG. 7 shows a structural schematic diagram for determining alternative image scores and positions according to an embodiment of the invention.
Description of the main element symbols:
1 - target tracking device; 100 - template feature map acquisition module; 200 - detection feature map acquisition module; 300 - candidate image determination module; 400 - tracking target determination module; 500 - response value calculation module; 600 - template frame update module; 310 - template sample set acquisition unit; 320 - detection sample set acquisition unit; 330 - candidate image score calculation unit; 340 - sample position determination unit; 350 - candidate image position determination unit.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Hereinafter, the terms "including", "having" and their derivatives, as used in various embodiments of the present invention, are intended only to indicate specific features, numbers, steps, operations, elements, components or combinations thereof, and should not be construed as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components or combinations thereof.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
The present invention provides a target tracking method obtained by improving three existing techniques: the twin (Siamese) neural network structure, the feature pyramid network, and the region candidate network.
A twin network structure means that the main structure of the network consists of an upper branch and a lower branch that share all the weights of the same network; like identical twins, it is used to solve classification problems in which the classes are numerous or uncertain but each class has few samples. In the field of visual target tracking, the upper branch of the twin network is the Template branch, which extracts the appearance features of the template frame; the lower branch is the Detection branch, whose input is the candidate search area cropped from the current frame according to the tracking result of the previous frame. After both pass through the same network, similarity is computed between the feature map of the template branch and the feature maps of multiple candidate areas of the current frame, and the candidate area with the highest score is taken as the tracking result of the current frame.
A feature pyramid generally refers to an image pyramid constructed from multiple copies of a single input image at different scales, resembling a real pyramid when the four vertices of the images are connected. The Feature Pyramid Network (FPN) used in the field of target tracking adds one dimension (depth) to the two-dimensional image. Unlike traditional detection algorithms that predict only from top-level features, the feature pyramid network fuses features of different levels and predicts independently at different feature layers, thereby obtaining more robust semantic information. Through its bottom-up and top-down pathways, the feature pyramid network makes full use of low-level high-resolution information and high-level high-semantic information; in particular, for small targets it increases the resolution of the feature maps, so that more useful information about small targets can be obtained.
The Region Proposal Network (RPN, referred to here as the region candidate network) is a network for extracting candidate boxes and first appeared in the Faster R-CNN architecture. It uses candidate boxes, also known as anchor boxes (Anchors), a technique commonly used in computer vision to represent fixed reference boxes. In a target tracking task, the tracked target has uncertain category, position and scale; by presetting a group of fixed reference boxes of different scales at different positions, the improved anchor box technique can cover nearly all positions and scales, with each fixed reference box responsible for detecting targets whose overlap with it exceeds a preset threshold. The region candidate network thus achieves both good recognition quality and high recognition speed.
Example 1
In this embodiment, referring to fig. 1, it is shown that a target tracking method includes the following steps:
step S100: and extracting n template feature maps with different scales from the template frame by using the feature pyramid network.
In the initial stage of target tracking, the first frame of the video can be used as the template frame, and n template feature maps of different scales are extracted from it with a Feature Pyramid Network (FPN). For example, referring to FIG. 2, the FPN extracts 4 template feature maps of different scales from the template frame: 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixels.
It should be understood that n is a positive integer, and may be set to 3, 4, 5, 6, etc., and may be adjusted according to the actual effect of target tracking, and the size of the template feature map may also be flexibly set according to specific requirements.
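As a concrete illustration of this step, the following is a minimal PyTorch-style sketch of an FPN producing multi-scale template feature maps; the backbone stages, channel widths and layer names are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        # Bottom-up backbone stages (assumed; each halves the resolution).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        # 1x1 lateral connections project every stage to a common channel width.
        self.lat1 = nn.Conv2d(64, out_channels, 1)
        self.lat2 = nn.Conv2d(128, out_channels, 1)
        self.lat3 = nn.Conv2d(256, out_channels, 1)

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        # Top-down pathway: upsample the coarser map and add the lateral map,
        # fusing high-level semantics with low-level resolution.
        p3 = self.lat3(c3)
        p2 = self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = self.lat1(c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        return [p1, p2, p3]  # n = 3 feature maps of different scales

template = torch.randn(1, 3, 256, 256)    # template frame
template_maps = TinyFPN()(template)       # 128x128, 64x64 and 32x32 maps
```

With a 256 × 256 input this sketch yields maps at 128 × 128, 64 × 64 and 32 × 32 pixels, matching three of the scales in the example above.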
Step S200: and extracting n detection feature maps with different scales, which are in one-to-one correspondence with the scales of the n template feature maps, from the detection frame by using a feature pyramid network.
The detection frame is the current frame from which the tracking target is to be obtained; the feature pyramid network extracts from it n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps. For example, referring to FIG. 2, the FPN extracts 4 detection feature maps of different scales from the detection frame: 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixels.
Step S300: and determining preset candidate images in number and corresponding scores and positions of the candidate images on the detection feature map by using the ith area candidate sub-network according to the ith template feature map and the detection feature map with the same scale as the ith template feature map.
For example, referring to FIG. 2, when n is 4, the 4 scales correspond to 4 region candidate sub-networks (RPN): the template feature map and detection feature map at the 32 × 32 pixel scale serve as inputs of the first region candidate sub-network, those at the 64 × 64 pixel scale as inputs of the second, those at the 128 × 128 pixel scale as inputs of the third, and those at the 256 × 256 pixel scale as inputs of the fourth.
The 4 region candidate sub-networks (RPN) respectively determine a preset number of candidate images and the scores and positions of those candidate images on the corresponding detection feature maps. Anchor boxes with 3 aspect ratios, namely 1:2, 1:1 and 2:1, can be preset on the detection feature map of each scale to cover tracking targets that may exist in the map; accordingly, the detection feature maps of the 4 scales comprise 12 preset anchor box shapes.
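A small sketch of such anchor generation follows; the base size per scale and the height-to-width convention are assumptions for illustration.

```python
import numpy as np

def make_anchors(base_size, ratios=(0.5, 1.0, 2.0)):
    """Anchor shapes (cx, cy, w, h), centred at the origin, for one scale.

    base_size sets the anchor area for one pyramid level; r is the assumed
    height-to-width ratio, so ratios (0.5, 1.0, 2.0) give the 1:2, 1:1 and
    2:1 boxes of the text. Four scales x three ratios = 12 anchor shapes.
    """
    area = float(base_size) * base_size
    anchors = []
    for r in ratios:
        w = np.sqrt(area / r)     # preserve the area while changing shape
        h = w * r
        anchors.append([0.0, 0.0, w, h])
    return np.asarray(anchors)

print(make_anchors(64))           # base size per scale is an assumption
```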
The classification score of each anchor box may be calculated using the classification branch of the region candidate sub-network (RPN), and the regression position of each anchor box may be determined using its regression branch. It should be understood that each anchor box contains a candidate image: the regression position of each anchor box is the position of the corresponding candidate image, and the classification score of each anchor box is the score of the corresponding candidate image.
Step S400: when the n region candidate sub-networks have finished processing the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, and taking the positions corresponding to the tracking targets as target positions.
All candidate images are sorted from high to low by their corresponding scores, and the top m candidate images with the highest scores are determined as the tracking targets, where m is a preset positive integer and i is a positive integer not exceeding n. A minimal sketch of this selection is given below.
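The sketch assumes the scores and positions from all n sub-networks have already been concatenated into flat arrays.

```python
import numpy as np

def select_targets(scores, boxes, m):
    """Keep the top m highest-scoring candidates across all n sub-networks.

    scores: shape (N,), concatenated candidate scores from every RPN sub-network.
    boxes:  shape (N, 4), candidate positions as (cx, cy, w, h).
    """
    order = np.argsort(scores)[::-1][:m]   # indices of the m best scores
    return scores[order], boxes[order]

scores = np.random.rand(12)                # e.g. 12 anchors over 4 scales
boxes = np.random.rand(12, 4)
top_scores, top_boxes = select_targets(scores, boxes, m=3)
```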
In this embodiment, n template feature maps of different scales are extracted from a template frame using a feature pyramid network; n detection feature maps of different scales, in one-to-one correspondence with the scales of the n template feature maps, are extracted from a detection frame; a preset number of candidate images on the detection feature map, together with their scores and positions, are determined by the i-th region candidate sub-network according to the i-th template feature map and the detection feature map of the same scale; and, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all candidate images are determined as tracking targets, with their positions taken as target positions. On the one hand, the technical scheme of this embodiment uses the feature pyramid network as the feature extraction layer of the tracking framework, effectively fusing low-level high-resolution information with high-level high-semantic information, so that the target position can be located more accurately; the tracking performance is especially prominent for small targets. On the other hand, the tracking target is screened through the improved region candidate sub-networks, realizing accurate tracking of small targets.
Example 2
As shown in FIG. 4, each region candidate sub-network contains a classification branch for distinguishing target from background and a regression branch for bounding-box regression.
Further, referring to fig. 3, the step S300 of the above embodiment 1 includes the following steps:
step S310: and performing convolution operation on the ith template characteristic diagram to obtain a template sample set with a first preset size and a template sample set with a second preset size.
And the first convolution layer of the ith area candidate sub-network performs convolution operation on the input ith template feature map to obtain a template sample set with a first preset size and a template sample set with a second preset size.
Further, a template sample set of the first preset size 4 × 4 × (2k × 256), denoted $[\varphi(z)]_{cls}$, is obtained on the classification branch of the i-th region candidate sub-network; the features of a template sample of size 4 × 4 have 2k variations over k different anchor boxes. It should be understood that the 2k variations arise because the image in each anchor box may be in one of two states, background or target, i.e., 0 or 1.
Further, a template sample set of the second preset size 4 × 4 × (4k × 256), denoted $[\varphi(z)]_{reg}$, is obtained on the regression branch of the i-th region candidate sub-network; the features of a template sample of 4 × 4 pixels have 4k variations over the k different anchor boxes. It should be understood that the 4k variations correspond to the width, height, abscissa and ordinate of the position; each anchor box is represented by its corresponding width, height, abscissa and ordinate.
Step S320: performing a convolution operation on the detection feature map with the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size.
The first convolution layer of the i-th region candidate sub-network performs a convolution operation on the input detection feature map of the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size.
Further, a detection sample set of the third preset size 20 × 20 × 256, denoted $[\varphi(x)]_{cls}$, is obtained on the classification branch of the i-th region candidate sub-network, and a detection sample set of the fourth preset size 20 × 20 × 256, denoted $[\varphi(x)]_{reg}$, is obtained on its regression branch. The 256 in the above sizes is the number of channels of the samples; the feature dimension is expanded to 256 through the training of the feature pyramid network.
Step S330: the classification branch of the i-th region candidate sub-network calculates the score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size.
The classification branch gives the classification score corresponding to each input detection sample, i.e., the detailed score of being predicted as target or background. The corresponding scores can be expressed as

$$A^{cls} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls}$$

where $\star$ denotes the correlation operation.
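This correlation can be illustrated with a small PyTorch sketch in which the template features act as convolution kernels sliding over the detection features; the 4 × 4 and 20 × 20 sizes follow the example above, while k, the number of anchors per location, is an arbitrary assumption here.

```python
import torch
import torch.nn.functional as F

k = 3                                  # anchors per location (assumed)
# Template classification features treated as 2k correlation kernels of
# 256 channels each (spatial size 4 x 4), and detection features 20 x 20.
z_cls = torch.randn(2 * k, 256, 4, 4)  # [phi(z)]_cls as conv kernels
x_cls = torch.randn(1, 256, 20, 20)    # [phi(x)]_cls

# The "star" correlation: slide every template kernel over the detection
# features; each spatial position gets 2k scores (target/background per anchor).
a_cls = F.conv2d(x_cls, z_cls)         # shape (1, 2k, 17, 17)
print(a_cls.shape)
```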
Step S340: the regression branch of the i-th region candidate sub-network determines the position of each sample in the detection sample set of the fourth preset size according to the template sample set of the second preset size and the detection sample set of the fourth preset size.
The regression branch gives the position regression values of each detection sample; the position regression values comprise abscissa, ordinate, width and height, corresponding to the four values $d_x$, $d_y$, $d_w$ and $d_h$:

$$A^{reg} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg}$$

where $\star$ denotes the correlation operation.
Step S350: determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, where the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
Since the two detection sample sets have the same size, the position corresponding to each candidate image can be determined from the position of each sample in the detection sample set of the fourth preset size.
From the classification output information $A^{cls}$ and the regression output information $A^{reg}$ of the top m candidates, the position information $\{(x^{pro}, y^{pro}, w^{pro}, h^{pro})\}$ of the m candidate positions with the highest scores can be obtained. The specific calculation formulas are:

$$x^{pro} = x^{an} + d_x \cdot w^{an}$$
$$y^{pro} = y^{an} + d_y \cdot h^{an}$$
$$w^{pro} = w^{an} \cdot e^{d_w}$$
$$h^{pro} = h^{an} \cdot e^{d_h}$$

where $(x^{an}, y^{an}, w^{an}, h^{an})$ are the original center coordinates and width and height of the anchor box corresponding to the candidate position, cls denotes the classification branch and reg the regression branch, and the subscripts range over i ∈ [0, w), j ∈ [0, h), l ∈ [0, 2k), p ∈ [0, k); each A is a set of vectors of output information.
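For illustration, a sketch of this refinement, assuming the offsets follow the standard anchor parameterization given above:

```python
import numpy as np

def refine_anchors(anchors, deltas):
    """Apply regression offsets (dx, dy, dw, dh) to anchor boxes.

    anchors: shape (N, 4), original anchors (cx_an, cy_an, w_an, h_an).
    deltas:  shape (N, 4), the regression branch outputs.
    Implements the refinement formulas above: centre offsets scaled by the
    anchor size, exponential scaling for width and height.
    """
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```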
Example 3
In this embodiment, referring to fig. 5, it is shown that the target tracking method further includes the following steps after the above steps S100 to S400:
step S500: and determining position response values of the target positions corresponding to the first m candidate images.
The position response values of the target positions corresponding to the first m candidate images can be calculated respectively according to the following formula:

$$\hat{y}(t^*) = y(t^*) - \Delta(t - t^*)\, y(t)$$

where $\hat{y}(t^*)$ represents the position response value, $t^*$ represents the target position corresponding to any one of the m candidate images, $y(t^*)$ represents the response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y(t)$ represents the response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function: the closer $t$ is to $t^*$, the closer $\Delta(t - t^*)$ is to 0, and the farther $t$ is from $t^*$, the closer $\Delta(t - t^*)$ is to 1.
It should be appreciated that the m position response values corresponding to the m candidate positions can be determined according to the above formula.
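A minimal sketch of this computation follows. The patent only requires a twice continuously differentiable Δ with the behaviour described above, so the Gaussian-based form used here is an assumption, as are the sigma value and the example numbers.

```python
import numpy as np

def delta(d, sigma=16.0):
    # Smooth weight: ~0 when t is near t*, saturating to 1 far away.
    # One possible choice of the required twice continuously differentiable
    # function; the patent's exact Delta is not reproduced here.
    d = np.asarray(d, dtype=float)
    return 1.0 - np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def position_response(y_star, t_star, y_interf, t_interf):
    """Response at the target position, discounted by the response at the
    nearest interference position, following the formula above."""
    return y_star - delta(np.subtract(t_interf, t_star)) * y_interf

# One response value per candidate position, e.g.:
print(position_response(0.92, (64, 64), 0.40, (30, 95)))
```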
Step S600: when the maximum position response value is greater than a preset response threshold, taking the candidate image corresponding to the maximum position response value as the new template frame.
That is, among the m candidate positions, the candidate image whose position response value is maximal and exceeds the preset response threshold is selected as the new template frame, and execution continues from step S100.
This online updating mode based on high-score sample feedback uses high-scoring candidate samples from the tracking process as new template frames for subsequent detection tasks, effectively improving the accuracy and robustness of target tracking. A sketch of this update rule is given below.
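The sketch assumes the candidate images and their response values are held in parallel lists; the threshold value is whatever the preset response threshold is configured to be.

```python
def maybe_update_template(candidates, response_values, threshold):
    """High-score feedback: return the best candidate image as the new
    template frame when its position response value exceeds the preset
    threshold; otherwise return None to keep the current template."""
    best = max(range(len(response_values)), key=response_values.__getitem__)
    return candidates[best] if response_values[best] > threshold else None
```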
Further, a target tracking model corresponding to the target tracking method is trained in advance using a training sample set until the error loss of the target tracking model is smaller than a preset error threshold. The error loss is calculated using the following loss function:

$$\mathcal{L} = -\hat{y}_\beta(t^*) + \mu \sum_{i=1}^{n} \beta_i^2$$

where $\mathcal{L}$ represents the error loss, $\beta_i$ represents the weighting coefficient of the i-th region candidate sub-network, $\mu$ represents the attenuation parameter, and $\hat{y}_\beta(t^*)$ represents the weighted response value of the n region candidate sub-networks, calculated as:

$$\hat{y}_\beta(t^*) = y_\beta(t^*) - \Delta(t - t^*)\, y_\beta(t)$$

where $y_\beta(t^*)$ represents the weighted response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y_\beta(t)$ represents the weighted response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function. The weighted response result is calculated as:

$$y_\beta(t) = \sum_{i=1}^{n} \beta_i\, s_i(t)$$

where $s_i(t)$ represents the response result of the i-th region candidate sub-network, and the weighting coefficients sum to 1, i.e.

$$\sum_{i=1}^{n} \beta_i = 1$$
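A sketch of the weighted fusion follows; the normalization step enforces the unit-sum constraint on the coefficients, and the map sizes and weights are illustrative.

```python
import numpy as np

def fused_response(responses, beta):
    """y_beta(t) = sum_i beta_i * s_i(t) over the n sub-network response maps.

    responses: array of shape (n, H, W), the maps s_i of the n RPN sub-networks.
    beta:      array of shape (n,), the weighting coefficients.
    """
    beta = np.asarray(beta, dtype=float)
    beta = beta / beta.sum()                 # enforce sum_i beta_i = 1
    return np.tensordot(beta, np.asarray(responses), axes=1)  # (H, W)

maps = np.random.rand(4, 17, 17)             # n = 4 sub-networks
print(fused_response(maps, [0.4, 0.3, 0.2, 0.1]).shape)
```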
The error loss of the target tracking model corresponding to the target tracking method is calculated with the above loss function; when the error loss of the target tracking model is smaller than the preset error threshold, the tracking quality of the model meets the standard.
Example 4
In the present embodiment, referring to FIG. 6, a target tracking device 1 is shown, comprising: a template feature map acquisition module 100, a detection feature map acquisition module 200, a candidate image determination module 300, and a tracking target determination module 400.
The template feature map acquisition module 100 extracts n template feature maps of different scales from a template frame using a feature pyramid network; the detection feature map acquisition module 200 extracts from a detection frame n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps; the candidate image determination module 300 determines, with the i-th region candidate sub-network, a preset number of candidate images on the detection feature map, together with their scores and positions, according to the i-th template feature map and the detection feature map of the same scale; the tracking target determination module 400 determines, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all candidate images as tracking targets according to their scores, the positions corresponding to the tracking targets being the target positions.
Further, referring to FIG. 7, the candidate image determination module 300 includes: a template sample set acquisition unit 310, a detection sample set acquisition unit 320, a candidate image score calculation unit 330, a sample position determination unit 340, and a candidate image position determination unit 350.
A template sample set obtaining unit 310, configured to perform convolution operation on the ith template feature map to obtain a template sample set with a first preset size and a template sample set with a second preset size; a detection sample set obtaining unit 320, configured to perform convolution operation on the detection feature map with the same scale as the ith template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size; a candidate image score calculation unit 330, configured to calculate, by the classification branch of the i-th regional candidate sub-network, a score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size; a sample position determining unit 340, configured to determine, by the regression branch of the i-th area candidate subnetwork, a position of each sample in the fourth preset-sized detection sample set according to the second preset-sized template sample set and the fourth preset-sized detection sample set; a candidate image position determining unit 350, configured to determine, according to the position of each sample in the fourth preset-size detection sample set, a position corresponding to each candidate image, where the third preset-size detection sample set is equal to the fourth preset-size detection sample set.
The target tracking device 1 further includes: a response value calculating module 500, configured to determine position response values of target positions corresponding to the m previous candidate images; a template frame updating module 600, configured to, when the maximum position response value is greater than a preset response threshold, take the candidate image corresponding to the maximum position response value as the template frame.
The target tracking device 1 of this embodiment executes the target tracking method of the above embodiments through the cooperation of the template feature map acquisition module 100, the detection feature map acquisition module 200, the candidate image determination module 300 and the tracking target determination module 400; the implementations and beneficial effects described in the above embodiments also apply to this embodiment and are not repeated here.
The above embodiments also relate to a terminal device, including a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to execute the target tracking method of the above embodiments.
The above embodiments further relate to a readable storage medium storing a computer program which, when run on a processor, performs the target tracking method of the above embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A method of object tracking, the method comprising:
extracting n template feature maps of different scales from a template frame by using a feature pyramid network;
extracting, from a detection frame by using the feature pyramid network, n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps;
determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map; and
when the n region candidate sub-networks have finished processing the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, and taking the positions corresponding to the tracking targets as target positions.
2. The method according to claim 1, wherein the determining, by using the i-th region candidate sub-network, of a preset number of candidate images and the scores and positions corresponding to the candidate images on the detection feature map according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map comprises:
performing a convolution operation on the i-th template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
performing a convolution operation on the detection feature map with the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
calculating, by the classification branch of the i-th region candidate sub-network, the score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
determining, by the regression branch of the i-th region candidate sub-network, the position of each sample in the detection sample set of the fourth preset size according to the template sample set of the second preset size and the detection sample set of the fourth preset size; and
determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, wherein the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
3. The target tracking method of claim 1, further comprising:
determining position response values of target positions corresponding to the first m candidate images;
and when the maximum position response value is larger than a preset position response threshold, taking the candidate image corresponding to the maximum position response value as the template frame.
4. The target tracking method of claim 3, wherein the position response value is calculated according to the formula:

$$\hat{y}(t^*) = y(t^*) - \Delta(t - t^*)\, y(t)$$

where $\hat{y}(t^*)$ represents the position response value, $t^*$ represents the target position, $y(t^*)$ represents the response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y(t)$ represents the response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function.
5. The target tracking method according to claim 1, wherein a target tracking model corresponding to the target tracking method is trained in advance by using a training sample set until an error loss of the target tracking model is smaller than a preset error threshold;
the error loss is calculated using the following loss function:

$$\mathcal{L} = -\hat{y}_\beta(t^*) + \mu \sum_{i=1}^{n} \beta_i^2$$

where $\mathcal{L}$ represents the error loss, $\beta_i$ represents the weighting coefficient of the i-th region candidate sub-network, $\mu$ represents the attenuation parameter, and $\hat{y}_\beta(t^*)$ represents the weighted response value of the n region candidate sub-networks;
the weighted response value is calculated as:

$$\hat{y}_\beta(t^*) = y_\beta(t^*) - \Delta(t - t^*)\, y_\beta(t)$$

where $y_\beta(t^*)$ represents the weighted response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y_\beta(t)$ represents the weighted response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function;
the weighted response result is calculated as:

$$y_\beta(t) = \sum_{i=1}^{n} \beta_i\, s_i(t)$$

where $s_i(t)$ represents the response result of the i-th region candidate sub-network.
6. The method of any of claims 1-5, wherein the n different scales comprise at least one of a 32 x 32 pixel scale, a 64 x 64 pixel scale, a 128 x 128 pixel scale, and a 256 x 256 pixel scale.
7. An object tracking device, the device comprising:
the template characteristic image acquisition module is used for extracting n template characteristic images with different scales from the template frame by utilizing the characteristic pyramid network;
the detection characteristic graph acquisition module is used for extracting n detection characteristic graphs with different scales, which are in one-to-one correspondence with the scales of the n template characteristic graphs, from a detection frame by utilizing a characteristic pyramid network;
the candidate image determination module is used for determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map; and
the tracking target determination module is used for determining, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, the positions corresponding to the tracking targets being taken as target positions.
8. The target tracking device of claim 7, wherein the alternative image determination module comprises:
a template sample set obtaining unit, configured to perform convolution operation on the ith template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
a detection sample set obtaining unit, configured to perform convolution operation on the detection feature map with the same scale as the ith template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
a candidate image score calculation unit, configured to calculate, by the classification branch of the ith regional candidate subnetwork, a score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
a sample position determining unit, configured to determine, by the regression branch of the i-th area candidate subnetwork, a position of each sample in the fourth preset-sized detection sample set according to the second preset-sized template sample set and the fourth preset-sized detection sample set;
and the candidate image position determination unit is used for determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, wherein the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
9. A terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the target tracking method of any one of claims 1 to 6.
10. A readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the target tracking method of any one of claims 1 to 6.
CN202010661194.5A 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium Pending CN111815677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661194.5A CN111815677A (en) 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010661194.5A CN111815677A (en) 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111815677A (en) 2020-10-23

Family

ID=72841718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661194.5A Pending CN111815677A (en) 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111815677A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614157A (en) * 2020-12-17 2021-04-06 上海眼控科技股份有限公司 Video target tracking method, device, equipment and storage medium
CN116309710A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Target tracking method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110033473A (en) * 2019-04-15 2019-07-19 西安电子科技大学 Motion target tracking method based on template matching and depth sorting network
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110033473A (en) * 2019-04-15 2019-07-19 西安电子科技大学 Motion target tracking method based on template matching and depth sorting network
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614157A (en) * 2020-12-17 2021-04-06 上海眼控科技股份有限公司 Video target tracking method, device, equipment and storage medium
CN116309710A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Target tracking method and electronic equipment

Similar Documents

Publication Publication Date Title
CN109271856B (en) Optical remote sensing image target detection method based on expansion residual convolution
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN110084836B (en) Target tracking method based on deep convolution characteristic hierarchical response fusion
CN110287826B (en) Video target detection method based on attention mechanism
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN108009529B (en) Forest fire smoke video target detection method based on characteristic root and hydrodynamics
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
CN111274981B (en) Target detection network construction method and device and target detection method
CN106815323A (en) A kind of cross-domain vision search method based on conspicuousness detection
CN111160407A (en) Deep learning target detection method and system
CN113140005A (en) Target object positioning method, device, equipment and storage medium
CN108364305B (en) Vehicle-mounted camera video target tracking method based on improved DSST
CN108154159A (en) A kind of method for tracking target with automatic recovery ability based on Multistage Detector
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN112926399A (en) Target object detection method and device, electronic equipment and storage medium
KR101917525B1 (en) Method and apparatus for identifying string
CN116912796A (en) Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device
CN110147768B (en) Target tracking method and device
CN116977633A (en) Feature element segmentation model training method, feature element segmentation method and device
CN112862730B (en) Point cloud feature enhancement method and device, computer equipment and storage medium
CN112801092B (en) Method for detecting character elements in natural scene image
CN114332457A (en) Image instance segmentation model training method, image instance segmentation method and device
CN111815677A (en) Target tracking method and device, terminal equipment and readable storage medium
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination