CN111815677A - Target tracking method and device, terminal equipment and readable storage medium - Google Patents

Target tracking method and device, terminal equipment and readable storage medium

Info

Publication number
CN111815677A
CN111815677A (application CN202010661194.5A)
Authority
CN
China
Prior art keywords
template
detection
candidate
sample set
preset size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010661194.5A
Other languages
Chinese (zh)
Inventor
衣杨
赵小蕾
陈嘉谦
邱泽敏
刘东琳
陈怡华
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua College of Sun Yat Sen University
Original Assignee
Xinhua College of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua College of Sun Yat Sen University filed Critical Xinhua College of Sun Yat Sen University
Priority to CN202010661194.5A
Publication of CN111815677A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods involving models
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a target tracking method and device, a terminal device and a readable storage medium. The method comprises: extracting n template feature maps of different scales from a template frame using a feature pyramid network; extracting n detection feature maps of different scales from a detection frame using the feature pyramid network; determining, with the i-th region candidate sub-network, a preset number of candidate images on the detection feature map, together with their scores and positions, according to the i-th template feature map and the detection feature map of the same scale; and, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all candidate images as tracking targets according to the scores, with the positions corresponding to the tracking targets taken as target positions. The scheme achieves accurate tracking of small targets.

Description

Target tracking method and device, terminal equipment and readable storage medium
Technical Field
The present invention relates to the field of target tracking, and in particular, to a target tracking method, an apparatus, a terminal device, and a readable storage medium.
Background
In recent years, deep learning has begun to enter the field of military target tracking, attracting the attention of more and more researchers. The low-level features of tracking images have higher resolution, which facilitates accurate localization of targets; the high-level features contain more semantic information, can handle larger target changes, prevent the tracker from drifting, and make it easier to localize the target within a range. Deep learning can therefore extract target features better and represent targets better. However, because deep learning requires large training sets and long online update times, the timeliness of tracking remains a considerable challenge.
Disclosure of Invention
In view of the foregoing problems, the present invention provides a target tracking method, an apparatus, a terminal device and a readable storage medium.
A first embodiment of the present invention provides a target tracking method, including:
extracting n template feature maps of different scales from a template frame by using a feature pyramid network;
extracting, from a detection frame by using the feature pyramid network, n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps;
determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map; and
when the n region candidate sub-networks have finished processing the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, and taking the positions corresponding to the tracking targets as target positions.
A target tracking method according to a second embodiment of the present invention, where the determining, by using the i-th region candidate sub-network, of a preset number of candidate images and the scores and positions corresponding to the candidate images on the detection feature map according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map, includes:
performing a convolution operation on the i-th template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
performing a convolution operation on the detection feature map with the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
calculating, by the classification branch of the i-th region candidate sub-network, the score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
determining, by the regression branch of the i-th region candidate sub-network, the position of each sample in the detection sample set of the fourth preset size according to the template sample set of the second preset size and the detection sample set of the fourth preset size; and
determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, where the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
A target tracking method according to a third embodiment of the present invention further includes:
determining position response values of target positions corresponding to the first m candidate images;
and when the maximum position response value is greater than a preset response threshold, taking the candidate image corresponding to the maximum position response value as the new template frame.
In the above target tracking method, the position response value is calculated according to the following formula:

$$\hat{y}(t^*) = y(t^*) - \Delta(t - t^*)\, y(t)$$

where $\hat{y}(t^*)$ represents the position response value, $t^*$ represents the target position, $y(t^*)$ represents the response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y(t)$ represents the response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function.
In the target tracking method according to the above embodiment, a training sample set is used to train, in advance, a target tracking model corresponding to the target tracking method until the error loss of the target tracking model is smaller than a preset error threshold.

The error loss is calculated using the following loss function:

$$\mathcal{L} = -\hat{y}_\beta(t^*) + \mu \sum_{i=1}^{n} \beta_i^2$$

where $\mathcal{L}$ represents the error loss, $\beta_i$ represents the weighting coefficient of the i-th region candidate sub-network, $\mu$ represents the attenuation parameter, and $\hat{y}_\beta(t^*)$ represents the weighted response value of the n region candidate sub-networks.

The weighted response value is calculated as:

$$\hat{y}_\beta(t^*) = y_\beta(t^*) - \Delta(t - t^*)\, y_\beta(t)$$

where $y_\beta(t^*)$ represents the weighted response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y_\beta(t)$ represents the weighted response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function.

The weighted response result is calculated as:

$$y_\beta(t) = \sum_{i=1}^{n} \beta_i\, s_i(t)$$

where $s_i(t)$ represents the response result of the i-th region candidate sub-network.
In the target tracking method according to the foregoing embodiment, the n different scales include at least one of the 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixel scales.
A fourth embodiment of the present invention provides a target tracking apparatus, including:
the template characteristic image acquisition module is used for extracting n template characteristic images with different scales from the template frame by utilizing the characteristic pyramid network;
the detection characteristic graph acquisition module is used for extracting n detection characteristic graphs with different scales, which are in one-to-one correspondence with the scales of the n template characteristic graphs, from a detection frame by utilizing a characteristic pyramid network;
the candidate image determination module is used for determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map;
and the tracking target determining module is used for determining the first m candidate images with the highest scores in all the candidate images as the tracking targets according to the scores of all the candidate images after the n regional candidate sub-networks process the n template feature images and the corresponding detection feature images, and the positions corresponding to the tracking targets are used as target positions.
The above candidate image determination module includes:
a template sample set obtaining unit, configured to perform convolution operation on the ith template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
a detection sample set obtaining unit, configured to perform convolution operation on the detection feature map with the same scale as the ith template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
a candidate image score calculation unit, configured to calculate, by the classification branch of the ith regional candidate subnetwork, a score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
a sample position determining unit, configured to determine, by the regression branch of the i-th area candidate subnetwork, a position of each sample in the fourth preset-sized detection sample set according to the second preset-sized template sample set and the fourth preset-sized detection sample set;
and the candidate image position determining unit is used for determining the position corresponding to each candidate image according to the position of each sample in the detection sample set with the fourth preset size, wherein the detection sample set with the third preset size is equal to the detection sample set with the fourth preset size.
The above embodiments also relate to a terminal device comprising a memory for storing a computer program and a processor for executing the computer program so that the terminal device performs the above target tracking method.
The above embodiments further relate to a readable storage medium storing a computer program which, when run on a processor, performs the above target tracking method.
In the technical solution of the invention, n template feature maps of different scales are extracted from a template frame using a feature pyramid network; n detection feature maps of different scales, in one-to-one correspondence with the scales of the n template feature maps, are extracted from a detection frame; a preset number of candidate images on the detection feature map, together with their scores and positions, are determined by the i-th region candidate sub-network according to the i-th template feature map and the detection feature map of the same scale; and, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all candidate images are determined as tracking targets, with the positions corresponding to the tracking targets taken as target positions. On the one hand, the technical solution uses the feature pyramid network as the feature extraction layer of the tracking framework, effectively fusing low-level high-resolution information with high-level high-semantic information, so the target position can be located more accurately; the tracking performance is especially prominent for small targets. On the other hand, the tracking target is screened through the improved region candidate sub-networks, realizing accurate tracking of small targets.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 is a flow chart illustrating a target tracking method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a target tracking model according to an embodiment of the present invention;
FIG. 3 illustrates a flow diagram for determining alternative image scores and positions according to an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating a structure of a regional candidate subnetwork model according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating another target tracking method according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a target tracking apparatus according to an embodiment of the present invention;
FIG. 7 shows a structural schematic diagram for determining alternative image scores and positions according to an embodiment of the invention.
Description of the main element symbols:
1 - target tracking device; 100 - template feature map acquisition module; 200 - detection feature map acquisition module; 300 - candidate image determination module; 400 - tracking target determination module; 500 - response value calculation module; 600 - template frame update module; 310 - template sample set acquisition unit; 320 - detection sample set acquisition unit; 330 - candidate image score calculation unit; 340 - sample position determination unit; 350 - candidate image position determination unit.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Hereinafter, the terms "including", "having" and their derivatives, as used in various embodiments of the present invention, are intended only to indicate specific features, numbers, steps, operations, elements, components or combinations thereof, and should not be construed as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components or combinations thereof.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
The present invention provides a target tracking method obtained by improving three existing techniques: the twin (Siamese) neural network structure, the feature pyramid network, and the region candidate network.
A twin network structure means that the main structure of the network consists of an upper branch and a lower branch that share all the weights of the same network; like identical twins, it is used to solve classification problems in which the classes are numerous or uncertain but each class has few samples. In the field of visual target tracking, the upper branch of the twin network is the Template branch, which extracts the appearance features of the template frame; the lower branch is the Detection branch, whose input is the candidate search area cropped from the current frame according to the tracking result of the previous frame. After both pass through the same network, similarity is computed between the feature map of the template branch and the feature maps of multiple candidate areas of the current frame, and the candidate area with the highest score is taken as the tracking result of the current frame.
A feature pyramid generally refers to an image pyramid constructed from multiple copies of a single input image at different scales, resembling a real pyramid when the four vertices of the images are connected. The Feature Pyramid Network (FPN) used in the field of target tracking adds one dimension (depth) to the two-dimensional image. Unlike traditional detection algorithms that predict only from top-level features, the feature pyramid network fuses features of different levels and predicts independently at different feature layers, thereby obtaining more robust semantic information. Through its bottom-up and top-down pathways, the feature pyramid network makes full use of low-level high-resolution information and high-level high-semantic information; in particular, for small targets it increases the resolution of the feature maps, so that more useful information about small targets can be obtained.
The Region Proposal Network (RPN, referred to here as the region candidate network) is a network for extracting candidate boxes and first appeared in the Faster R-CNN architecture. It uses candidate boxes, also known as anchor boxes (Anchors), a technique commonly used in computer vision to represent fixed reference boxes. In a target tracking task, the tracked target has uncertain category, position and scale; by presetting a group of fixed reference boxes of different scales at different positions, the improved anchor box technique can cover nearly all positions and scales, with each fixed reference box responsible for detecting targets whose overlap with it exceeds a preset threshold. The region candidate network thus achieves both good recognition quality and high recognition speed.
Example 1
In this embodiment, referring to fig. 1, it is shown that a target tracking method includes the following steps:
step S100: and extracting n template feature maps with different scales from the template frame by using the feature pyramid network.
In the initial stage of target tracking, the first frame of the video can be used as the template frame, and n template feature maps of different scales are extracted from it with a Feature Pyramid Network (FPN). For example, referring to FIG. 2, the FPN extracts 4 template feature maps of different scales from the template frame: 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixels.
It should be understood that n is a positive integer, and may be set to 3, 4, 5, 6, etc., and may be adjusted according to the actual effect of target tracking, and the size of the template feature map may also be flexibly set according to specific requirements.
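As a concrete illustration of this step, the following is a minimal PyTorch-style sketch of an FPN producing multi-scale template feature maps; the backbone stages, channel widths and layer names are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        # Bottom-up backbone stages (assumed; each halves the resolution).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        # 1x1 lateral connections project every stage to a common channel width.
        self.lat1 = nn.Conv2d(64, out_channels, 1)
        self.lat2 = nn.Conv2d(128, out_channels, 1)
        self.lat3 = nn.Conv2d(256, out_channels, 1)

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        # Top-down pathway: upsample the coarser map and add the lateral map,
        # fusing high-level semantics with low-level resolution.
        p3 = self.lat3(c3)
        p2 = self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = self.lat1(c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        return [p1, p2, p3]  # n = 3 feature maps of different scales

template = torch.randn(1, 3, 256, 256)    # template frame
template_maps = TinyFPN()(template)       # 128x128, 64x64 and 32x32 maps
```

With a 256 × 256 input this sketch yields maps at 128 × 128, 64 × 64 and 32 × 32 pixels, matching three of the scales in the example above.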
Step S200: and extracting n detection feature maps with different scales, which are in one-to-one correspondence with the scales of the n template feature maps, from the detection frame by using a feature pyramid network.
The detection frame is the current frame from which the tracking target is to be obtained; the feature pyramid network extracts from it n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps. For example, referring to FIG. 2, the FPN extracts 4 detection feature maps of different scales from the detection frame: 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixels.
Step S300: and determining preset candidate images in number and corresponding scores and positions of the candidate images on the detection feature map by using the ith area candidate sub-network according to the ith template feature map and the detection feature map with the same scale as the ith template feature map.
For example, referring to FIG. 2, when n is 4, the 4 scales correspond to 4 region candidate sub-networks (RPN): the template feature map and detection feature map at the 32 × 32 pixel scale serve as inputs of the first region candidate sub-network, those at the 64 × 64 pixel scale as inputs of the second, those at the 128 × 128 pixel scale as inputs of the third, and those at the 256 × 256 pixel scale as inputs of the fourth.
The 4 region candidate sub-networks (RPN) respectively determine a preset number of candidate images and the scores and positions of those candidate images on the corresponding detection feature maps. Anchor boxes with 3 aspect ratios, namely 1:2, 1:1 and 2:1, can be preset on the detection feature map of each scale to cover tracking targets that may exist in the map; accordingly, the detection feature maps of the 4 scales comprise 12 preset anchor box shapes.
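A small sketch of such anchor generation follows; the base size per scale and the height-to-width convention are assumptions for illustration.

```python
import numpy as np

def make_anchors(base_size, ratios=(0.5, 1.0, 2.0)):
    """Anchor shapes (cx, cy, w, h), centred at the origin, for one scale.

    base_size sets the anchor area for one pyramid level; r is the assumed
    height-to-width ratio, so ratios (0.5, 1.0, 2.0) give the 1:2, 1:1 and
    2:1 boxes of the text. Four scales x three ratios = 12 anchor shapes.
    """
    area = float(base_size) * base_size
    anchors = []
    for r in ratios:
        w = np.sqrt(area / r)     # preserve the area while changing shape
        h = w * r
        anchors.append([0.0, 0.0, w, h])
    return np.asarray(anchors)

print(make_anchors(64))           # base size per scale is an assumption
```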
The classification score of each anchor box may be calculated using the classification branch of the region candidate sub-network (RPN), and the regression position of each anchor box may be determined using its regression branch. It should be understood that each anchor box contains a candidate image: the regression position of each anchor box is the position of the corresponding candidate image, and the classification score of each anchor box is the score of the corresponding candidate image.
Step S400: when the n region candidate sub-networks have finished processing the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, and taking the positions corresponding to the tracking targets as target positions.
All candidate images are sorted from high to low by their corresponding scores, and the top m candidate images with the highest scores are determined as the tracking targets, where m is a preset positive integer and i is a positive integer not exceeding n. A minimal sketch of this selection is given below.
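The sketch assumes the scores and positions from all n sub-networks have already been concatenated into flat arrays.

```python
import numpy as np

def select_targets(scores, boxes, m):
    """Keep the top m highest-scoring candidates across all n sub-networks.

    scores: shape (N,), concatenated candidate scores from every RPN sub-network.
    boxes:  shape (N, 4), candidate positions as (cx, cy, w, h).
    """
    order = np.argsort(scores)[::-1][:m]   # indices of the m best scores
    return scores[order], boxes[order]

scores = np.random.rand(12)                # e.g. 12 anchors over 4 scales
boxes = np.random.rand(12, 4)
top_scores, top_boxes = select_targets(scores, boxes, m=3)
```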
In this embodiment, n template feature maps of different scales are extracted from a template frame using a feature pyramid network; n detection feature maps of different scales, in one-to-one correspondence with the scales of the n template feature maps, are extracted from a detection frame; a preset number of candidate images on the detection feature map, together with their scores and positions, are determined by the i-th region candidate sub-network according to the i-th template feature map and the detection feature map of the same scale; and, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all candidate images are determined as tracking targets, with their positions taken as target positions. On the one hand, the technical scheme of this embodiment uses the feature pyramid network as the feature extraction layer of the tracking framework, effectively fusing low-level high-resolution information with high-level high-semantic information, so that the target position can be located more accurately; the tracking performance is especially prominent for small targets. On the other hand, the tracking target is screened through the improved region candidate sub-networks, realizing accurate tracking of small targets.
Example 2
As shown in FIG. 4, each region candidate sub-network contains a classification branch for distinguishing target from background and a regression branch for bounding-box regression.
Further, referring to fig. 3, the step S300 of the above embodiment 1 includes the following steps:
step S310: and performing convolution operation on the ith template characteristic diagram to obtain a template sample set with a first preset size and a template sample set with a second preset size.
And the first convolution layer of the ith area candidate sub-network performs convolution operation on the input ith template feature map to obtain a template sample set with a first preset size and a template sample set with a second preset size.
Further, a template sample set of the first preset size 4 × 4 × (2k × 256), denoted $[\varphi(z)]_{cls}$, is obtained on the classification branch of the i-th region candidate sub-network; the features of a template sample of size 4 × 4 have 2k variations over k different anchor boxes. It should be understood that the 2k variations arise because the image in each anchor box may be in one of two states, background or target, i.e., 0 or 1.
Further, a template sample set of the second preset size 4 × 4 × (4k × 256), denoted $[\varphi(z)]_{reg}$, is obtained on the regression branch of the i-th region candidate sub-network; the features of a template sample of 4 × 4 pixels have 4k variations over the k different anchor boxes. It should be understood that the 4k variations correspond to the width, height, abscissa and ordinate of the position; each anchor box is represented by its corresponding width, height, abscissa and ordinate.
Step S320: performing a convolution operation on the detection feature map with the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size.
The first convolution layer of the i-th region candidate sub-network performs a convolution operation on the input detection feature map of the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size.
Further, a detection sample set of the third preset size 20 × 20 × 256, denoted $[\varphi(x)]_{cls}$, is obtained on the classification branch of the i-th region candidate sub-network, and a detection sample set of the fourth preset size 20 × 20 × 256, denoted $[\varphi(x)]_{reg}$, is obtained on its regression branch. The 256 in the above sizes is the number of channels of the samples; the feature dimension is expanded to 256 through the training of the feature pyramid network.
Step S330: the classification branch of the i-th region candidate sub-network calculates the score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size.
The classification branch gives the classification score corresponding to each input detection sample, i.e., the detailed score of being predicted as target or background. The corresponding scores can be expressed as

$$A^{cls} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls}$$

where $\star$ denotes the correlation operation.
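This correlation can be illustrated with a small PyTorch sketch in which the template features act as convolution kernels sliding over the detection features; the 4 × 4 and 20 × 20 sizes follow the example above, while k, the number of anchors per location, is an arbitrary assumption here.

```python
import torch
import torch.nn.functional as F

k = 3                                  # anchors per location (assumed)
# Template classification features treated as 2k correlation kernels of
# 256 channels each (spatial size 4 x 4), and detection features 20 x 20.
z_cls = torch.randn(2 * k, 256, 4, 4)  # [phi(z)]_cls as conv kernels
x_cls = torch.randn(1, 256, 20, 20)    # [phi(x)]_cls

# The "star" correlation: slide every template kernel over the detection
# features; each spatial position gets 2k scores (target/background per anchor).
a_cls = F.conv2d(x_cls, z_cls)         # shape (1, 2k, 17, 17)
print(a_cls.shape)
```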
Step S340: the regression branch of the i-th region candidate sub-network determines the position of each sample in the detection sample set of the fourth preset size according to the template sample set of the second preset size and the detection sample set of the fourth preset size.
The regression branch gives the position regression values of each detection sample; the position regression values comprise abscissa, ordinate, width and height, corresponding to the four values $d_x$, $d_y$, $d_w$ and $d_h$:

$$A^{reg} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg}$$

where $\star$ denotes the correlation operation.
Step S350: determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, where the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
Since the two detection sample sets have the same size, the position corresponding to each candidate image can be determined from the position of each sample in the detection sample set of the fourth preset size.
From the classification output information $A^{cls}$ and the regression output information $A^{reg}$ of the top m candidates, the position information $\{(x^{pro}, y^{pro}, w^{pro}, h^{pro})\}$ of the m candidate positions with the highest scores can be obtained. The specific calculation formulas are:

$$x^{pro} = x^{an} + d_x \cdot w^{an}$$
$$y^{pro} = y^{an} + d_y \cdot h^{an}$$
$$w^{pro} = w^{an} \cdot e^{d_w}$$
$$h^{pro} = h^{an} \cdot e^{d_h}$$

where $(x^{an}, y^{an}, w^{an}, h^{an})$ are the original center coordinates and width and height of the anchor box corresponding to the candidate position, cls denotes the classification branch and reg the regression branch, and the subscripts range over i ∈ [0, w), j ∈ [0, h), l ∈ [0, 2k), p ∈ [0, k); each A is a set of vectors of output information.
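For illustration, a sketch of this refinement, assuming the offsets follow the standard anchor parameterization given above:

```python
import numpy as np

def refine_anchors(anchors, deltas):
    """Apply regression offsets (dx, dy, dw, dh) to anchor boxes.

    anchors: shape (N, 4), original anchors (cx_an, cy_an, w_an, h_an).
    deltas:  shape (N, 4), the regression branch outputs.
    Implements the refinement formulas above: centre offsets scaled by the
    anchor size, exponential scaling for width and height.
    """
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```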
Example 3
In this embodiment, referring to fig. 5, it is shown that the target tracking method further includes the following steps after the above steps S100 to S400:
step S500: and determining position response values of the target positions corresponding to the first m candidate images.
The position response values of the target positions corresponding to the first m candidate images can be calculated respectively according to the following formula:

$$\hat{y}(t^*) = y(t^*) - \Delta(t - t^*)\, y(t)$$

where $\hat{y}(t^*)$ represents the position response value, $t^*$ represents the target position corresponding to any one of the m candidate images, $y(t^*)$ represents the response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y(t)$ represents the response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function: the closer $t$ is to $t^*$, the closer $\Delta(t - t^*)$ is to 0, and the farther $t$ is from $t^*$, the closer $\Delta(t - t^*)$ is to 1.
It should be appreciated that the m position response values corresponding to the m candidate positions can be determined according to the above formula.
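A minimal sketch of this computation follows. The patent only requires a twice continuously differentiable Δ with the behaviour described above, so the Gaussian-based form used here is an assumption, as are the sigma value and the example numbers.

```python
import numpy as np

def delta(d, sigma=16.0):
    # Smooth weight: ~0 when t is near t*, saturating to 1 far away.
    # One possible choice of the required twice continuously differentiable
    # function; the patent's exact Delta is not reproduced here.
    d = np.asarray(d, dtype=float)
    return 1.0 - np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def position_response(y_star, t_star, y_interf, t_interf):
    """Response at the target position, discounted by the response at the
    nearest interference position, following the formula above."""
    return y_star - delta(np.subtract(t_interf, t_star)) * y_interf

# One response value per candidate position, e.g.:
print(position_response(0.92, (64, 64), 0.40, (30, 95)))
```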
Step S600: when the maximum position response value is greater than a preset response threshold, taking the candidate image corresponding to the maximum position response value as the new template frame.
That is, among the m candidate positions, the candidate image whose position response value is maximal and exceeds the preset response threshold is selected as the new template frame, and execution continues from step S100.
This online updating mode based on high-score sample feedback uses high-scoring candidate samples from the tracking process as new template frames for subsequent detection tasks, effectively improving the accuracy and robustness of target tracking. A sketch of this update rule is given below.
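The sketch assumes the candidate images and their response values are held in parallel lists; the threshold value is whatever the preset response threshold is configured to be.

```python
def maybe_update_template(candidates, response_values, threshold):
    """High-score feedback: return the best candidate image as the new
    template frame when its position response value exceeds the preset
    threshold; otherwise return None to keep the current template."""
    best = max(range(len(response_values)), key=response_values.__getitem__)
    return candidates[best] if response_values[best] > threshold else None
```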
Further, a target tracking model corresponding to the target tracking method is trained in advance using a training sample set until the error loss of the target tracking model is smaller than a preset error threshold. The error loss is calculated using the following loss function:

$$\mathcal{L} = -\hat{y}_\beta(t^*) + \mu \sum_{i=1}^{n} \beta_i^2$$

where $\mathcal{L}$ represents the error loss, $\beta_i$ represents the weighting coefficient of the i-th region candidate sub-network, $\mu$ represents the attenuation parameter, and $\hat{y}_\beta(t^*)$ represents the weighted response value of the n region candidate sub-networks, calculated as:

$$\hat{y}_\beta(t^*) = y_\beta(t^*) - \Delta(t - t^*)\, y_\beta(t)$$

where $y_\beta(t^*)$ represents the weighted response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y_\beta(t)$ represents the weighted response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function. The weighted response result is calculated as:

$$y_\beta(t) = \sum_{i=1}^{n} \beta_i\, s_i(t)$$

where $s_i(t)$ represents the response result of the i-th region candidate sub-network, and the weighting coefficients sum to 1, i.e.

$$\sum_{i=1}^{n} \beta_i = 1$$
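A sketch of the weighted fusion follows; the normalization step enforces the unit-sum constraint on the coefficients, and the map sizes and weights are illustrative.

```python
import numpy as np

def fused_response(responses, beta):
    """y_beta(t) = sum_i beta_i * s_i(t) over the n sub-network response maps.

    responses: array of shape (n, H, W), the maps s_i of the n RPN sub-networks.
    beta:      array of shape (n,), the weighting coefficients.
    """
    beta = np.asarray(beta, dtype=float)
    beta = beta / beta.sum()                 # enforce sum_i beta_i = 1
    return np.tensordot(beta, np.asarray(responses), axes=1)  # (H, W)

maps = np.random.rand(4, 17, 17)             # n = 4 sub-networks
print(fused_response(maps, [0.4, 0.3, 0.2, 0.1]).shape)
```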
The error loss of the target tracking model corresponding to the target tracking method is calculated with the above loss function; when the error loss of the target tracking model is smaller than the preset error threshold, the tracking quality of the model meets the standard.
Example 4
In the present embodiment, referring to FIG. 6, a target tracking device 1 is shown, comprising: a template feature map acquisition module 100, a detection feature map acquisition module 200, a candidate image determination module 300, and a tracking target determination module 400.
The template feature map acquisition module 100 extracts n template feature maps of different scales from a template frame using a feature pyramid network; the detection feature map acquisition module 200 extracts from a detection frame n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps; the candidate image determination module 300 determines, with the i-th region candidate sub-network, a preset number of candidate images on the detection feature map, together with their scores and positions, according to the i-th template feature map and the detection feature map of the same scale; the tracking target determination module 400 determines, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all candidate images as tracking targets according to their scores, the positions corresponding to the tracking targets being the target positions.
Further, referring to FIG. 7, the candidate image determination module 300 includes: a template sample set acquisition unit 310, a detection sample set acquisition unit 320, a candidate image score calculation unit 330, a sample position determination unit 340, and a candidate image position determination unit 350.
A template sample set obtaining unit 310, configured to perform convolution operation on the ith template feature map to obtain a template sample set with a first preset size and a template sample set with a second preset size; a detection sample set obtaining unit 320, configured to perform convolution operation on the detection feature map with the same scale as the ith template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size; a candidate image score calculation unit 330, configured to calculate, by the classification branch of the i-th regional candidate sub-network, a score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size; a sample position determining unit 340, configured to determine, by the regression branch of the i-th area candidate subnetwork, a position of each sample in the fourth preset-sized detection sample set according to the second preset-sized template sample set and the fourth preset-sized detection sample set; a candidate image position determining unit 350, configured to determine, according to the position of each sample in the fourth preset-size detection sample set, a position corresponding to each candidate image, where the third preset-size detection sample set is equal to the fourth preset-size detection sample set.
The target tracking device 1 further includes: a response value calculating module 500, configured to determine position response values of target positions corresponding to the m previous candidate images; a template frame updating module 600, configured to, when the maximum position response value is greater than a preset response threshold, take the candidate image corresponding to the maximum position response value as the template frame.
The target tracking device 1 of this embodiment executes the target tracking method of the above embodiments through the cooperation of the template feature map acquisition module 100, the detection feature map acquisition module 200, the candidate image determination module 300 and the tracking target determination module 400; the implementations and beneficial effects described in the above embodiments also apply to this embodiment and are not repeated here.
The above embodiments also relate to a terminal device, including a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to execute the target tracking method of the above embodiments.
The above embodiments further relate to a readable storage medium storing a computer program which, when run on a processor, performs the target tracking method of the above embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A method of object tracking, the method comprising:
extracting n template feature maps of different scales from a template frame by using a feature pyramid network;
extracting, from a detection frame by using the feature pyramid network, n detection feature maps of different scales in one-to-one correspondence with the scales of the n template feature maps;
determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map; and
when the n region candidate sub-networks have finished processing the n template feature maps and the corresponding detection feature maps, determining the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, and taking the positions corresponding to the tracking targets as target positions.
2. The method according to claim 1, wherein the determining, by using the i-th region candidate sub-network, of a preset number of candidate images and the scores and positions corresponding to the candidate images on the detection feature map according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map comprises:
performing a convolution operation on the i-th template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
performing a convolution operation on the detection feature map with the same scale as the i-th template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
calculating, by the classification branch of the i-th region candidate sub-network, the score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
determining, by the regression branch of the i-th region candidate sub-network, the position of each sample in the detection sample set of the fourth preset size according to the template sample set of the second preset size and the detection sample set of the fourth preset size; and
determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, wherein the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
3. The target tracking method of claim 1, further comprising:
determining position response values of target positions corresponding to the first m candidate images;
and when the maximum position response value is larger than a preset position response threshold, taking the candidate image corresponding to the maximum position response value as the template frame.
4. The target tracking method of claim 3, wherein the position response value is calculated according to the formula:

$$\hat{y}(t^*) = y(t^*) - \Delta(t - t^*)\, y(t)$$

where $\hat{y}(t^*)$ represents the position response value, $t^*$ represents the target position, $y(t^*)$ represents the response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y(t)$ represents the response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function.
5. The target tracking method according to claim 1, wherein a target tracking model corresponding to the target tracking method is trained in advance by using a training sample set until an error loss of the target tracking model is smaller than a preset error threshold;
the error loss is calculated using the following loss function:

$$\mathcal{L} = -\hat{y}_\beta(t^*) + \mu \sum_{i=1}^{n} \beta_i^2$$

where $\mathcal{L}$ represents the error loss, $\beta_i$ represents the weighting coefficient of the i-th region candidate sub-network, $\mu$ represents the attenuation parameter, and $\hat{y}_\beta(t^*)$ represents the weighted response value of the n region candidate sub-networks;
the weighted response value is calculated as:

$$\hat{y}_\beta(t^*) = y_\beta(t^*) - \Delta(t - t^*)\, y_\beta(t)$$

where $y_\beta(t^*)$ represents the weighted response result at the target position $t^*$, $t$ represents the interference position closest to the target position, $y_\beta(t)$ represents the weighted response result at the interference position $t$, and $\Delta$ is a twice continuously differentiable function;
the weighted response result is calculated as:

$$y_\beta(t) = \sum_{i=1}^{n} \beta_i\, s_i(t)$$

where $s_i(t)$ represents the response result of the i-th region candidate sub-network.
6. The method of any of claims 1-5, wherein the n different scales comprise at least one of a 32 x 32 pixel scale, a 64 x 64 pixel scale, a 128 x 128 pixel scale, and a 256 x 256 pixel scale.
7. An object tracking device, the device comprising:
the template characteristic image acquisition module is used for extracting n template characteristic images with different scales from the template frame by utilizing the characteristic pyramid network;
the detection characteristic graph acquisition module is used for extracting n detection characteristic graphs with different scales, which are in one-to-one correspondence with the scales of the n template characteristic graphs, from a detection frame by utilizing a characteristic pyramid network;
the candidate image determination module is used for determining, by using an i-th region candidate sub-network, a preset number of candidate images on the detection feature map and the scores and positions corresponding to the candidate images, according to the i-th template feature map and the detection feature map with the same scale as the i-th template feature map; and
the tracking target determination module is used for determining, after the n region candidate sub-networks have processed the n template feature maps and the corresponding detection feature maps, the top m candidate images with the highest scores among all the candidate images as tracking targets according to the scores of the candidate images, the positions corresponding to the tracking targets being taken as target positions.
8. The target tracking device of claim 7, wherein the alternative image determination module comprises:
a template sample set obtaining unit, configured to perform convolution operation on the ith template feature map to obtain a template sample set of a first preset size and a template sample set of a second preset size;
a detection sample set obtaining unit, configured to perform convolution operation on the detection feature map with the same scale as the ith template feature map to obtain a detection sample set of a third preset size and a detection sample set of a fourth preset size;
a candidate image score calculation unit, configured to calculate, by the classification branch of the ith regional candidate subnetwork, a score of each candidate image in the detection sample set of the third preset size according to the template sample set of the first preset size and the detection sample set of the third preset size;
a sample position determining unit, configured to determine, by the regression branch of the i-th area candidate subnetwork, a position of each sample in the fourth preset-sized detection sample set according to the second preset-sized template sample set and the fourth preset-sized detection sample set;
and the candidate image position determination unit is used for determining the position corresponding to each candidate image according to the position of each sample in the detection sample set of the fourth preset size, wherein the detection sample set of the third preset size and the detection sample set of the fourth preset size are of the same size.
9. A terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the target tracking method of any one of claims 1 to 6.
10. A readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the target tracking method of any one of claims 1 to 6.
CN202010661194.5A 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium Pending CN111815677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661194.5A CN111815677A (en) 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010661194.5A CN111815677A (en) 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111815677A (en) 2020-10-23

Family

ID=72841718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661194.5A Pending CN111815677A (en) 2020-07-10 2020-07-10 Target tracking method and device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111815677A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614157A (en) * 2020-12-17 2021-04-06 上海眼控科技股份有限公司 Video target tracking method, device, equipment and storage medium
CN116309710A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Target tracking method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110033473A (en) * 2019-04-15 2019-07-19 西安电子科技大学 Motion target tracking method based on template matching and depth sorting network
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110033473A (en) * 2019-04-15 2019-07-19 西安电子科技大学 Motion target tracking method based on template matching and depth sorting network
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614157A (en) * 2020-12-17 2021-04-06 上海眼控科技股份有限公司 Video target tracking method, device, equipment and storage medium
CN116309710A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Target tracking method and electronic equipment

Similar Documents

Publication Publication Date Title
CN109271856B (en) Optical remote sensing image target detection method based on expansion residual convolution
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN110084836B (en) Target tracking method based on deep convolution characteristic hierarchical response fusion
CN110287826B (en) Video target detection method based on attention mechanism
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN108009529B (en) Forest fire smoke video target detection method based on characteristic root and hydrodynamics
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
CN111274981B (en) Target detection network construction method and device and target detection method
CN106815323A (en) A kind of cross-domain vision search method based on conspicuousness detection
CN111160407A (en) Deep learning target detection method and system
CN113140005A (en) Target object positioning method, device, equipment and storage medium
CN108364305B (en) Vehicle-mounted camera video target tracking method based on improved DSST
CN108154159A (en) A kind of method for tracking target with automatic recovery ability based on Multistage Detector
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN112926399A (en) Target object detection method and device, electronic equipment and storage medium
KR101917525B1 (en) Method and apparatus for identifying string
CN116912796A (en) Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device
CN110147768B (en) Target tracking method and device
CN116977633A (en) Feature element segmentation model training method, feature element segmentation method and device
CN112862730B (en) Point cloud feature enhancement method and device, computer equipment and storage medium
CN112801092B (en) Method for detecting character elements in natural scene image
CN114332457A (en) Image instance segmentation model training method, image instance segmentation method and device
CN111815677A (en) Target tracking method and device, terminal equipment and readable storage medium
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination