CN113393493B - Target object tracking method and device

Info

Publication number: CN113393493B
Authority: CN (China)
Prior art keywords: frame image, information, current frame, filtering template, template corresponding
Legal status: Active
Application number: CN202110592320.0A
Other languages: Chinese (zh)
Other versions: CN113393493A
Inventors: Zhan Yibing (詹忆冰), Wu Shuang (吴爽)
Current Assignee: Jingdong Technology Information Technology Co Ltd
Original Assignee: Jingdong Technology Information Technology Co Ltd
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202110592320.0A
Publication of CN113393493A
Application granted
Publication of CN113393493B

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/262: Analysis of motion using transform domain methods, e.g. Fourier domain methods
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/20048: Special algorithmic details; transform domain processing

Abstract

The application discloses a target object tracking method and device. One embodiment of the method comprises: for each frame image in a video to be processed, performing the following target tracking operation: determining the position information of a target object in the current frame image through the filtering template corresponding to the previous frame image; generating response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image. The method improves the universality, robustness, precision and credibility of target tracking.

Description

Target object tracking method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a target object tracking method and device.
Background
In machine vision tasks, target tracking has a wide range of application scenarios and tremendous commercial value. Correlation filtering algorithms are fast and efficient, easy to deploy on a CPU (central processing unit), and capable of real-time target tracking, so they are widely applied to target tracking tasks. However, current correlation filtering models update the filtering model with reference only to the Gaussian pseudo-label preset in each frame image.
Disclosure of Invention
The embodiment of the application provides a target object tracking method and device.
In a first aspect, an embodiment of the present application provides a target object tracking method: for each frame image in a video to be processed, the following target tracking operation is performed: determining the position information of a target object in the current frame image through the filtering template corresponding to the previous frame image; generating response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image.
In some embodiments, obtaining the filtering template corresponding to the current frame image according to the position information and the response information includes: combining the position information and the response information to obtain label information of the current frame image; and obtaining the filtering template corresponding to the current frame image based on minimization of the difference between the label information and target response information, where the target response information characterizes the response information of the current frame image to the filtering template corresponding to the current frame image.
In some embodiments, obtaining the filtering template corresponding to the current frame image based on minimization of the difference between the label information and the target response information includes: applying a preset weighting window to the label information and the target response information to obtain post-action information; and obtaining the filtering template corresponding to the current frame image based on minimization of the post-action information.
In some embodiments, the center point of the preset weighting window has a first weight and the non-center points have a second weight; and applying the preset weighting window to the label information and the target response information to obtain the post-action information includes: applying the preset weighting window to the label information and the target response information to obtain post-action information in which the matching losses at the center point and the non-center points of the target object are distinguished by the first weight and the second weight.
In some embodiments, obtaining the filtering template corresponding to the current frame image based on minimization of the post-action information includes: based on minimization of the post-action information, obtaining the filtering template corresponding to the current frame image using a preset constraint aimed at distinguishing the background and foreground of the target object in the image.
In some embodiments, applying the preset weighting window to the label information and the target response information to obtain the post-action information, and obtaining the filtering template corresponding to the current frame image based on minimization of the post-action information, includes: for each channel of each frame image, applying the preset weighting window to the label information and the target response information of that channel to obtain the post-action information corresponding to the channel, and obtaining the filtering template corresponding to the channel of the current frame image based on minimization of that post-action information. Determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image then includes: determining the position information of the target object in the current frame image by applying the filtering templates, each corresponding one-to-one to a channel of the previous frame image, to the corresponding channels of the current frame image.
In some embodiments, obtaining the post-action information corresponding to the channel by applying the preset weighting window to the label information and the target response information corresponding to the channel, and obtaining the filtering template corresponding to the channel of the current frame image based on minimization of the post-action information, includes: in a multi-scale filtering mode, applying the preset weighting window to the label information and the target scale response information corresponding to the channel to obtain the post-action information corresponding to the channel, and obtaining the multi-scale filtering templates corresponding to the channel of the current frame image based on minimization of the post-action information, where the target scale response information characterizes the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
In some embodiments, the method further comprises: for the first frame image in the video to be processed, determining a filtering template corresponding to the first frame image according to a target frame representing the target object in the first frame image.
In a second aspect, an embodiment of the present application provides a target object tracking apparatus that performs the following target tracking operation for each frame image in a video to be processed: a first determining unit configured to determine the position information of a target object in the current frame image through the filtering template corresponding to the previous frame image; a second determining unit configured to generate response information of the previous frame image to the filtering template corresponding to the previous frame image; and an obtaining unit configured to obtain a filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image.
In some embodiments, the obtaining unit is further configured to: combine the position information and the response information to obtain label information of the current frame image; and obtain the filtering template corresponding to the current frame image based on minimization of the difference between the label information and target response information, where the target response information characterizes the response information of the current frame image to the filtering template corresponding to the current frame image.
In some embodiments, the obtaining unit is further configured to: apply a preset weighting window to the label information and the target response information to obtain post-action information; and obtain the filtering template corresponding to the current frame image based on minimization of the post-action information.
In some embodiments, the center point of the preset weighting window has a first weight and the non-center points have a second weight; and the obtaining unit is further configured to: apply the preset weighting window to the label information and the target response information to obtain post-action information in which the matching losses at the center point and the non-center points of the target object are distinguished by the first weight and the second weight.
In some embodiments, the obtaining unit is further configured to: based on minimization of the post-action information, obtain the filtering template corresponding to the current frame image using a preset constraint aimed at distinguishing the background and foreground of the target object in the image.
In some embodiments, the obtaining unit is further configured to: for each channel of each frame image, apply the preset weighting window to the label information and the target response information of that channel to obtain the post-action information corresponding to the channel, and obtain the filtering template corresponding to the channel of the current frame image based on minimization of that post-action information; and the first determining unit is further configured to: determine the position information of the target object in the current frame image by applying the filtering templates, each corresponding one-to-one to a channel of the previous frame image, to the corresponding channels of the current frame image.
In some embodiments, the obtaining unit is further configured to: in a multi-scale filtering mode, apply the preset weighting window to the label information and the target scale response information corresponding to the channel to obtain the post-action information corresponding to the channel, and obtain the multi-scale filtering templates corresponding to the channel of the current frame image based on minimization of the post-action information, where the target scale response information characterizes the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
In some embodiments, the apparatus further comprises: a third determining unit configured to, for the first frame image in the video to be processed, determine a filtering template corresponding to the first frame image according to a target frame representing the target object in the first frame image.
In a third aspect, embodiments of the present application provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
According to the target object tracking method and device of the present application, the following target tracking operation is performed for each frame image in the video to be processed: determining the position information of a target object in the current frame image through the filtering template corresponding to the previous frame image; generating response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image. A target object tracking method is thereby provided that can flexibly combine the position information of the target object in the current frame image with the response information corresponding to the previous frame image, obtain the filtering template for tracking the target object from the combined information, and improve the universality, robustness and accuracy of target tracking.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a target object tracking method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the target object tracking method according to the present embodiment;
FIG. 4 is a flow chart of yet another embodiment of a method of tracking a target object according to the present application;
FIG. 5 is a block diagram of one embodiment of a target object tracking apparatus according to the present application;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary architecture 100 to which the target object tracking methods and apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connection between the terminal devices 101, 102, 103 constitutes a topology network, the network 104 being the medium for providing the communication link between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The terminal devices 101, 102, 103 may be hardware devices or software supporting network connections for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display, processing, etc., including, but not limited to, cameras, smartphones, tablets, electronic book readers, laptop and desktop computers, etc. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. It may be implemented as a plurality of software or software modules, for example, for providing distributed services, or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, for example, a background server that acquires a video to be processed that is captured or transmitted by the terminal devices 101, 102, 103, and tracks a target object in the video to be processed. Optionally, the background server may feed back the target tracking result to the terminal device. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be further noted that, the tracking method of the target object provided by the embodiment of the present application may be executed by a server, may be executed by a terminal device, or may be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit) included in the target object tracking device may be set in the server, may be set in the terminal device, or may be set in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the tracking method of the target object is run does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., server or terminal device) on which the tracking method of the target object is run.
With continued reference to fig. 2, a flow 200 of one embodiment of a target object tracking method is shown; for each frame image in the video to be processed, the following target tracking operation is performed:
step 201, determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image.
In this embodiment, the execution subject (e.g., the server in fig. 1) of the tracking method of the target object may determine the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image.
The video to be processed may be a video including any target object. For example, the video to be processed is video of a target object including a person, an animal, or the like, which is taken by the monitoring apparatus. Each frame of image in the video to be processed may include a plurality of target objects, or may not include a target object in some frames.
As an example, the execution body may perform a convolution operation on the current frame image with the filtering template corresponding to the previous frame image, obtaining a response map representing the response information corresponding to the current frame image; further, the point of maximum response in the response map is determined as the center point of the target object in the current frame image, thereby determining the position information of the target object. For example, the center point of the target object in the current frame image may be directly taken as the position information of the target object; for another example, a target frame surrounding the target object may be derived from the center point, thereby determining the position of the region where the target object is located.
The filtering template should satisfy the following condition: its convolution with the target object in the image has the maximum response at the center point of the target object and smaller responses at non-center points; specifically, the more background a region contains, the smaller the response of the filtering template should be.
Since the time interval between two adjacent frames of the video to be processed is short, the difference between the two frames should not be large. Under this assumption, the position of the target object in the current frame image remains, with high probability, close to its position in the previous frame image; that is, the two frame images exhibit visual characteristics such as temporal continuity and spatial continuity. Performing a convolution operation between the filtering template corresponding to the previous frame image and the current frame image yields the response map of the current frame image, and the point of maximum response in the response map is taken as the center point of the target object in the current frame image. By determining the position information of the target object in each frame image of the video to be processed, tracking of the target object in the video is realized.
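As an illustration of this step, the sketch below (a minimal single-channel setup with assumed names, not the patent's exact implementation) correlates the previous frame's filtering template with the current frame in the Fourier domain and takes the peak of the response map as the target's center point:

```python
import numpy as np

def response_map(frame: np.ndarray, template: np.ndarray) -> np.ndarray:
    # Circular cross-correlation of frame and template, computed via FFT.
    F = np.fft.fft2(frame)
    H = np.fft.fft2(template, s=frame.shape)  # zero-pad template to frame size
    return np.real(np.fft.ifft2(F * np.conj(H)))

def locate_target(frame: np.ndarray, prev_template: np.ndarray) -> tuple:
    # Step 201: the point of maximum response is taken as the target's center.
    res = response_map(frame, prev_template)
    return np.unravel_index(np.argmax(res), res.shape)  # (row, col)
```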
Step 202, generating response information of the previous frame image to the filtering template corresponding to the previous frame image.
In this embodiment, the execution body may generate response information of the previous frame image to the filtering template corresponding to the previous frame image.
As an example, the execution body may perform a convolution operation on the previous frame image with the filtering template corresponding to the previous frame image, thereby determining the response information of the previous frame image to that filtering template. The response information can be represented as a response map.
It will be appreciated that, in the response map, the response is maximal at the center point of the target object.
And 203, obtaining a filtering template corresponding to the current frame image according to the position information and the response information.
In this embodiment, the execution body obtains the filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image.
As an example, the execution body may set a Gaussian distribution label for the current frame image centered on the center point of the target object in the response map corresponding to the current frame image (i.e., the point of maximum response in the response map). The center point of the Gaussian distribution label coincides with the center point of the target object in the response map corresponding to the current frame image.
In the response map of the previous frame image to its filtering template, the response is largest at the center point of the target object and smaller at non-center points. The execution body can combine the Gaussian distribution label representing the position information of the target object in the current frame image with the response information of the previous frame image to obtain combined information, and determine the filtering template corresponding to the current frame image by taking the combined information as the label. If the center point of the target object in the response map of the previous frame image does not coincide with the center point of the Gaussian distribution label, a matrix shift operation is applied so that the two center points coincide.
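A hedged sketch of these two label ingredients, with illustrative names: a Gaussian distribution label peaking at the detected center, and the matrix shift (here a circular roll) that aligns the previous response map's peak with that center:

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    # Gaussian pseudo-label peaking at `center`; wrap-around distances keep it
    # consistent with the circular-correlation model.
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    dr = np.minimum(np.abs(rows - center[0]), shape[0] - np.abs(rows - center[0]))
    dc = np.minimum(np.abs(cols - center[1]), shape[1] - np.abs(cols - center[1]))
    return np.exp(-(dr**2 + dc**2) / (2.0 * sigma**2))

def align_response(prev_res, center):
    # Matrix shift: roll the previous response so its peak coincides with `center`.
    peak = np.unravel_index(np.argmax(prev_res), prev_res.shape)
    return np.roll(prev_res, (center[0] - peak[0], center[1] - peak[1]), axis=(0, 1))
```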
In some optional implementations of this embodiment, the executing body may execute the step 203 as follows:
first, combining the position information and the response information to obtain the label information of the current frame image.
The execution body may set preset weights for the position information and the response information, and obtain the label information of the current frame image as their weighted combination.
And secondly, obtaining a filtering template corresponding to the current frame image based on minimization between the label information and the target response information.
The target response information characterizes response information of the current frame image to a filtering template corresponding to the current frame image.
In this implementation, the execution body may measure the distance between the label information and the target response information with the L2 norm, and determine the filtering template corresponding to the current frame image by minimizing this distance. To improve the generalization capability of the filtering template, a regularization term on the filtering template can also be introduced.
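A simplified single-channel sketch of these two steps, dropping the weighting window and foreground constraint introduced later: the label is a convex mix of the Gaussian label and the aligned previous response (theta and lam are assumed hyperparameters), and the template is the standard closed-form frequency-domain ridge solution rather than the patent's full solver:

```python
import numpy as np

def solve_template(x, y, prev_res, theta=0.8, lam=1e-2):
    # Label information: weighted combination of position and response information.
    label = theta * y + (1.0 - theta) * prev_res
    X, L = np.fft.fft2(x), np.fft.fft2(label)
    # Minimizes ||W * x - label||_2^2 + lam * ||W||_2^2, solved per frequency bin.
    W_hat = np.conj(X) * L / (np.conj(X) * X + lam)
    return np.real(np.fft.ifft2(W_hat))
```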
In some optional implementations of this embodiment, the execution body may apply a preset weighting window to the label information and the target response information to obtain post-action information, and obtain the filtering template corresponding to the current frame image based on minimization of the post-action information.
Different positions in the preset weighting window carry different weights. As an example, the weight of the center point in the preset weighting window is greater than that of the non-center points. The preset weighting window lets the filtering template concentrate more on the label matching loss at the center point of the target object in the response map (by assigning a larger weight to the center point in the preset weighting window), or reduce the label matching loss at the center point (by assigning it a smaller weight).
In some optional implementations of this embodiment, the center point of the preset weighting window has a first weight and the non-center points have a second weight. In this implementation, the execution body may apply the preset weighting window to the label information and the target response information to obtain post-action information in which the matching losses at the center point and the non-center points of the target object are distinguished by the first weight and the second weight.
Setting different weights for the center point and the non-center points of the preset weighting window rests mainly on three considerations. First, in the correlation filtering algorithm only the center point is a true positive sample; because of the cyclic-signal assumption underlying the discrete Fourier transform, the feature samples at non-center positions do not exist in a real scene, so label matching at non-center points does not need a high weight or much attention. Second, non-center positions are dominated by background, and in this embodiment the response in background-heavy regions of the image is expected to be as low as possible; the background part does not require accurate label matching. Third, when detecting the position of the target object in the current frame image, only the center point of the target object is used, so the filtering template should focus on the center point position of the target object rather than on non-center positions. Therefore, the center point position and the non-center positions of the target object are given different weights.
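A sketch of such a window under the stated two-weight scheme (a and b are assumed hyperparameter values): weight b at the target's center point and weight a everywhere else.

```python
import numpy as np

def weighting_window(shape, center, a=0.1, b=1.0):
    # Preset weighting window: first weight b at the center point,
    # second weight a at all non-center points.
    u = np.full(shape, a, dtype=float)
    u[center] = b
    return u

# Weighted matching loss between label and response, as described above:
# loss = np.sum((u * (label - response)) ** 2)   # elementwise product
```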
In some optional implementations of this embodiment, for the second step, the execution body may obtain, based on minimization of the post-action information, the filtering template corresponding to the current frame image using a preset constraint aimed at distinguishing the background and foreground of the target object in the image.
In this implementation, label matching of the background portion of the image is ignored, while label matching concentrates on the tracked target object region.
In some optional implementations of this embodiment, the execution body may determine a filtering template for each channel of the image, channel by channel. Taking an RGB image with its three channels R (red), G (green) and B (blue) as an example, the execution body determines three filtering templates corresponding to the three channels one by one.
Specifically, for each channel of each frame image, the preset weighting window is applied to the label information and the target response information corresponding to the channel to obtain the post-action information corresponding to the channel, and the filtering template corresponding to the channel of the current frame image is obtained based on minimization of the post-action information.
In this implementation, the execution body executes step 201 as follows: determining the position information of the target object in the current frame image by applying the filtering templates, each corresponding one-to-one to a channel of the previous frame image, to the corresponding channels of the current frame image.
For example, for each frame image, the execution body may obtain a response map for each channel through the filtering template corresponding to that channel, fuse the response maps to obtain a fused response map, and determine the point of maximum response in the fused response map as the center point of the target object, thereby determining the position information of the target object.
In this implementation, the execution body executes step 202 as follows: determining the response information of the previous frame image to each filtering template through the filtering templates corresponding one-to-one to the channels of the previous frame image.
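A sketch (assumed names) of the per-channel variant described above: one template per channel, with the per-channel response maps fused by summation before the peak is located.

```python
import numpy as np

def fused_response(frame_channels, templates):
    # One response map per channel, fused by summation (per-channel step 201).
    acc = np.zeros(frame_channels[0].shape)
    for c, w in zip(frame_channels, templates):
        C_hat = np.fft.fft2(c)
        W_hat = np.fft.fft2(w, s=c.shape)
        acc += np.real(np.fft.ifft2(C_hat * np.conj(W_hat)))
    return acc
```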
In some optional implementations of this embodiment, the executing body determines a filtering template corresponding to the first frame image in the video to be processed by: and aiming at a first frame image in the video to be processed, determining a filtering template corresponding to the first frame image according to a target frame representing a target object in the first frame image.
The target frame indicates the region position of the target object in the first frame image. Taking the target frame as the label, the filtering template of the first frame image is determined under the condition that its convolution with the target object in the image has the maximum response at the center point of the target object and smaller responses at non-center points.
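A hedged sketch of this initialization, reusing the gaussian_label helper sketched earlier: the target frame supplies the center for a Gaussian label, and the first template is solved from the first frame alone (the box layout and hyperparameters are assumptions, not from the patent).

```python
import numpy as np

def init_template(first_frame, box, sigma=2.0, lam=1e-2):
    # box = (top, left, height, width); its center seeds the Gaussian label.
    r0, c0, h, w = box
    center = (r0 + h // 2, c0 + w // 2)
    y = gaussian_label(first_frame.shape, center, sigma)
    X, Y = np.fft.fft2(first_frame), np.fft.fft2(y)
    return np.real(np.fft.ifft2(np.conj(X) * Y / (np.conj(X) * X + lam)))
```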
A specific implementation of this embodiment is given below.
first, a mathematical model of the filtering template is constructed:
wherein ε (W) t ) A solution function representing a filtering template of the t-th frame image;sequentially representing a filter template of a kth channel of a t frame image and a filter template of a kth channel of a t-1 frame image; u represents a preset weighting window; y represents a Gaussian distribution label of the t-th frame image; /> Sequentially representing the image information of the kth channel of the t-th frame image and the image information of the kth channel of the t-1 th frame image; />A regularization term representing the ability to generalize a function; θ representsWeight parameters, lambda represents a regularized term coefficient, and C represents the number of channels of the image; delta represents matrix shift operation, and aims to make the response diagram of the previous frame image and the response diagram of the current frame image be similar as much as possible, so as to realize cross-time continuity prior, namely the response of the filtering template to the same target object in the adjacent frame image should be consistent; the ",. Sup.,. Sup.sequentially represent Hadamard products and (5) convolution.
Here $y$, $u$, $x$, $W$ are all illustrated as 1-dimensional vectors, specifically vectors of size $n\times 1$; the computation for a 2-dimensional image matrix is analogous. In the 2-dimensional case of an $n\times n$ matrix, an ordinary convolution of an $n\times n$ matrix with an $n\times n$ filtering template has time complexity $O(n^4)$; based on the correlation filtering model, the computational complexity can be reduced to $O(n^2\log n)$, i.e., the complexity of the fast Fourier transform, which lowers the algorithmic complexity and speeds up the tracking process.
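A small numeric check of this equivalence (illustrative setup, not from the patent): circular correlation computed directly, position by position, matches the FFT-based computation while the latter costs only $O(n^2\log n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal((n, n))
w = rng.standard_normal((n, n))

# Direct circular cross-correlation: O(n^4) over all n*n shifts.
direct = np.array([[np.sum(w * np.roll(x, (-i, -j), axis=(0, 1)))
                    for j in range(n)] for i in range(n)])

# Same quantity via the FFT: O(n^2 log n).
fast = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(w))))
assert np.allclose(direct, fast)
```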
Here $x$ may be a hand-crafted feature, such as the Histogram of Oriented Gradient (HOG) feature, the color feature (CN), or a gray-scale feature, or it may be a deep feature.
The post-action information is:

$$l=u\odot\Big(\theta\,y+(1-\theta)\,\Delta\Big(\sum_{k=1}^{C}W_{t-1}^{k}\star x_{t-1}^{k}\Big)\Big)\qquad(2)$$

The post-action information can be regarded as the label that determines the filtering template of the t-th frame image.
To simplify the computation, the positions of label matching are divided into two regions: the center point, assigned weight $b$, and the non-center region, assigned weight $a$. This is characterized by the following preset weighting window:

$$u(i)=\begin{cases}b, & i\ \text{is the center point}\\ a, & \text{otherwise}\end{cases}\qquad(3)$$
in order to enhance the capability of the filtering template to distinguish the foreground from the background, the above formulas (1), (2) and (3) are combined to obtain:
wherein,as an auxiliary variable, B is a mapping matrix of n×d. B are all 0 (corresponding to background position) or 1 (corresponding to foreground position) and are intended to ignore the label matching of the background part, while focusing on the label matching of the tracked target object area, and N >>D, a step of performing the process; f (i) represents a mapping function, and represents that the characteristic vector x of the kth channel of the t frame is translated by i units to obtain a new vector; u (i) and l (i) represent the values of the ith position in the vectors u (characterizing the preset weighting window) and l (characterizing the post-action information), respectively, and u (i), +.>
Then, formula (4) is solved with a block coordinate descent method; after the auxiliary variable $g_{t}$ is introduced, formula (5) is obtained, where $\mu$ is a penalty parameter that enforces the constraint between $W_{t}$ and the auxiliary variable $g_{t}$ as strictly as possible. Formula (5) can be divided into two iterated sub-problems, solved for $W_{t}$ and $g_{t}$ respectively.
Taking the derivative of formula (5) with respect to $W_{t}$ yields formula (7), where $\hat{x}^{*}$ denotes the conjugate of $\hat{x}$.
Taking the derivative of formula (5) with respect to $g_{t}$ yields formula (8), where $I$ denotes the identity matrix.
Substituting formula (3) into formula (7) simplifies it to formula (9), where $M$ is defined by formula (10); $M$ and $\hat{M}$ denote, in turn, the matrix and its Fourier-domain counterpart, with the inverse transform carrying $\hat{M}$ back to the time-domain space.
To avoid the matrix inverse operation, the properties of the cyclic matrix give formula (11), where $\mathrm{diag}$ denotes a diagonal matrix, $F$ the Fourier transform matrix, and $(\cdot)^{H}$ the conjugate transpose.
Then, based on the Sherman-Morrison formula for rank-1 optimization:

$$\big(A+xy^{\mathsf T}\big)^{-1}=A^{-1}-\frac{A^{-1}xy^{\mathsf T}A^{-1}}{1+y^{\mathsf T}A^{-1}x}$$

where $A$ is an invertible square matrix and $x$, $y$ are vectors (with $1+y^{\mathsf T}A^{-1}x\neq 0$).
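A quick numeric sanity check of this rank-1 identity (illustrative, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # invertible square matrix
x = rng.standard_normal((n, 1))
y = rng.standard_normal((n, 1))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + x @ y.T)
rhs = Ainv - (Ainv @ x @ y.T @ Ainv) / (1.0 + (y.T @ Ainv @ x).item())
assert np.allclose(lhs, rhs)  # Sherman-Morrison rank-1 update holds
```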
Based on the Sherman-Morrison formula, the optimization formula (12) for $W$ can be obtained, where $A=a^{2}M+\mu I_{N}$ and $\phi=f_{t}(1)$.
Finally, formula (12) is simplified to obtain the representation of $\hat{W}_{t}$ in the frequency domain, formula (13).
thus, the rank 1 optimization method has the property of cyclic matrix, avoids the inverse of the calculated n-n matrix and greatly quickens the speedIs a solution to (c).
As aboveIs redundant and can only be applied to a single channel. Each channel of the image can thus be considered consistent, and a filtering template for each channel is derived based on the solution process described above.
After the filtering template of the current frame image is obtained, it is regarded as a filter, and the response map of the next frame image is obtained from the filtering template of the current frame image and the feature $x$ of the next frame image:

$$\mathrm{Res}_{t+1}=\sum_{k=1}^{C}W_{t}^{k}\star x_{t+1}^{k}$$
then, the position of the target object in the target frame of the next frame is calculated based on the following equation:
(x,y)=arg max Res t+1 (x,y)
x,y
thus, target tracking for t+1 frames is completed.
In some optional implementations of this embodiment, for each channel of each frame image, a multi-scale filtering mode is adopted: the preset weighting window is applied to the label information and the target scale response information corresponding to the channel to obtain the post-action information corresponding to the channel, and the multi-scale filtering templates corresponding to the channel of the current frame image are obtained based on minimization of the post-action information. The target scale response information characterizes the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
To achieve multi-scale target tracking, the correlation filtering algorithm generates several filtering templates of different scales, then generates several response maps of different scales, determines the response map with the largest response among them, and takes its scale as the scaling of the image.
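A sketch of this multi-scale search under an assumed scale set, using scipy.ndimage.zoom for resampling: each scale's response peak is compared and the strongest one gives the image's scaling.

```python
import numpy as np
from scipy.ndimage import zoom

def best_scale(frame, template, scales=(0.95, 1.0, 1.05)):
    best_s, best_peak = 1.0, -np.inf
    for s in scales:
        xs = zoom(frame, s)                          # resample the frame
        res = np.real(np.fft.ifft2(np.fft.fft2(xs) *
                                   np.conj(np.fft.fft2(template, s=xs.shape))))
        if res.max() > best_peak:                    # keep the strongest peak
            best_s, best_peak = s, res.max()
    return best_s
```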
The values of the non-center weight $a$ and the center weight $b$ of the preset weighting window $u$ in formulas (2) and (3) are hyperparameters that can be flexibly adjusted according to the characteristics of the tracked target and the properties of the data set, so as to adapt to different situations and improve the universality and accuracy of the present application. In some cases, by adjusting these hyperparameters, the method of the present application can even degenerate into the general target model, so universality is enhanced without being weaker than the results of the general method. Moreover, the method does not require the images of the video to be processed to satisfy a low-rank data-distribution assumption; instead, it enhances the model's ability to distinguish the data in the image through the specialization of the weighting window at center and non-center positions, strengthening the robustness and credibility of the target object tracking method.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the target object tracking method according to the present embodiment. In the application scenario of fig. 3, the server 302 first acquires the monitoring video shot by the terminal device 301. For each frame image in the monitoring video, the server 302 performs the following target tracking operation: first, the filtering template i-1 corresponding to the (t-1)-th frame image is determined, and the position information of the target object in the t-th frame image is determined based on the filtering template i-1; then, the response information of the (t-1)-th frame image to the filtering template i-1 is determined; and then, according to the position information and the response information, the filtering template i corresponding to the t-th frame image is obtained. Further, in the next target tracking operation, the position information of the target object in the (t+1)-th frame image can be determined through the filtering template i corresponding to the t-th frame image.
The method provided by the above embodiment of the present application performs the following target tracking operation for each frame image in the video to be processed: determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image; generating the response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining the filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image. A target object tracking method is thereby provided, improving the universality, robustness, accuracy and credibility of target tracking.
With continued reference to fig. 4, a schematic flow 400 of one embodiment of a target object tracking method according to the present application is shown; for each frame image in the video to be processed, the following target tracking operation is performed:
step 401, determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image.
Step 402, generating response information of the previous frame image to the filtering template corresponding to the previous frame image.
And step 403, combining the position information and the response information to obtain the label information of the current frame image.
The filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
And step 404, applying the preset weighting window to the label information and the target response information to obtain post-action information in which the matching losses at the center point and the non-center points of the target object are distinguished by the first weight and the second weight, and obtaining the filtering template corresponding to the current frame image based on minimization of the post-action information.
The center point of the preset weighting window has a first weight, and the non-center points have a second weight.
As can be seen from this embodiment, compared with the embodiment corresponding to fig. 2, the flow 400 of the target object tracking method in this embodiment details the determination of the filtering template: labels based on the temporal-smoothness assumption are constructed in the correlation filtering, together with a preset weighting window that matches the labels with different weights, further improving the robustness and accuracy of target tracking.
With continued reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a target object tracking apparatus, where an embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the target object tracking apparatus performs the following target tracking operation for each frame image in the video to be processed: a first determining unit 501 configured to determine, through the filtering template corresponding to the previous frame image, the position information of the target object in the current frame image; a second determining unit 502 configured to generate the response information of the previous frame image to the filtering template corresponding to the previous frame image; and an obtaining unit 503 configured to obtain the filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: combine the position information and the response information to obtain label information of the current frame image; and obtain the filtering template corresponding to the current frame image based on minimization of the difference between the label information and target response information, where the target response information characterizes the response information of the current frame image to the filtering template corresponding to the current frame image.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: apply a preset weighting window to the label information and the target response information to obtain post-action information; and obtain the filtering template corresponding to the current frame image based on minimization of the post-action information.
In some optional implementations of this embodiment, the center point of the preset weighting window has a first weight and the non-center points have a second weight; the obtaining unit 503 is further configured to: apply the preset weighting window to the label information and the target response information to obtain post-action information in which the matching losses at the center point and the non-center points of the target object are distinguished by the first weight and the second weight.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: based on minimization of the post-action information, obtain the filtering template corresponding to the current frame image using a preset constraint aimed at distinguishing the background and foreground of the target object in the image.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: for each channel of each frame image, apply the preset weighting window to the label information and the target response information of that channel to obtain the post-action information corresponding to the channel, and obtain the filtering template corresponding to the channel of the current frame image based on minimization of that post-action information; and the first determining unit is further configured to: determine the position information of the target object in the current frame image by applying the filtering templates, each corresponding one-to-one to a channel of the previous frame image, to the corresponding channels of the current frame image.
In some embodiments, the obtaining unit 503 is further configured to: for each channel of each frame image, in a multi-scale filtering mode, apply the preset weighting window to the label information and the target scale response information corresponding to the channel to obtain the post-action information corresponding to the channel, and obtain the multi-scale filtering templates corresponding to the channel of the current frame image based on minimization of the post-action information, where the target scale response information characterizes the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
In some embodiments, the apparatus further comprises: a third determining unit (not shown in the figure) configured to, for the first frame image in the video to be processed, determine a filtering template corresponding to the first frame image according to a target frame representing the target object in the first frame image.
In this embodiment, the target object tracking apparatus performs the following target tracking operation for each frame image in the video to be processed: the first determining unit determines the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image; the second determining unit generates the response information of the previous frame image to the filtering template corresponding to the previous frame image; and the obtaining unit obtains the filtering template corresponding to the current frame image according to the position information and the response information, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image. A target object tracking apparatus is thereby provided, improving the universality, robustness and accuracy of target tracking.
Referring now to FIG. 6, there is illustrated a schematic diagram of a computer system 600 suitable for use in implementing the apparatus of embodiments of the present application (e.g., apparatus 101, 102, 103, 105 illustrated in FIG. 1). The apparatus shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments herein.
As shown in fig. 6, the computer system 600 includes a processor (e.g., a CPU, central processing unit) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the method of the present application are performed when the computer program is executed by the processor 601.
It should be noted that the computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the client computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, for example, described as: a processor comprising a first determining unit, a second determining unit, and an obtaining unit. The names of these units do not, in some cases, limit the units themselves; for example, the obtaining unit may also be described as "a unit that obtains a filtering template corresponding to the current frame image from the position information and the response information".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: for each frame image in a video to be processed, perform the following target tracking operation: determine the position information of a target object in the current frame image through a filtering template corresponding to the previous frame image; generate response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtain a filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
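To make the target tracking operation above concrete, the following is a minimal Python/NumPy sketch of one plausible correlation-filter realization of the per-frame loop. It is an editorial illustration only: the function names, the Gaussian label shape, the MOSSE-style ridge-regression update, and the regularization weight `lam` are all assumptions, not the implementation of the present application.

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    # Ideal response map: a Gaussian peak at the target position.
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - center[1]) ** 2 + (ys - center[0]) ** 2) / (2 * sigma ** 2))

def track_frame(frame_feat, prev_template_f, lam=1e-2):
    """One target tracking operation on a 2-D feature map of the current frame.

    prev_template_f stores the (conjugate) filter of the previous frame in the
    Fourier domain, MOSSE-style, so correlation is a plain spectral product.
    """
    frame_f = np.fft.fft2(frame_feat)

    # 1. Determine position information: correlate the current frame with
    #    the filtering template corresponding to the previous frame.
    response = np.real(np.fft.ifft2(frame_f * prev_template_f))
    position = np.unravel_index(np.argmax(response), response.shape)

    # 2. Combine position information and response information into the
    #    label information of the current frame.
    label = gaussian_label(frame_feat.shape, position) * response.max()

    # 3. Obtain the new template by minimizing the gap between the label and
    #    the template's response (closed-form ridge-regression solution).
    label_f = np.fft.fft2(label)
    template_f = (label_f * np.conj(frame_f)) / (frame_f * np.conj(frame_f) + lam)
    return position, template_f
```

A caller would initialize the template from the first frame's target box and then thread the returned template through successive calls, one per frame of the video to be processed.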
The foregoing description covers only the preferred embodiments of the present application and explains the technical principles employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to technical solutions formed by the specific combinations of the above technical features, and is intended to cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present application (but not limited thereto).

Claims (11)

1. A method of tracking a target object, comprising: for each frame image in a video to be processed, performing the following target tracking operation:
determining the position information of a target object in the current frame image through a filtering template corresponding to the previous frame image;
generating response information of the previous frame image to the filtering template corresponding to the previous frame image;
obtaining, according to the position information and the response information, a filtering template corresponding to the current frame image, including: combining the position information and the response information to obtain label information of the current frame image; and obtaining the filtering template corresponding to the current frame image by minimizing the difference between the label information and target response information, wherein the target response information characterizes the response information of the current frame image to the filtering template corresponding to the current frame image, and the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
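Read in standard correlation-filter notation, the minimization in this claim can be written as follows; the symbols are editorial assumptions rather than the claim's own notation, and the regularization term is an assumed addition for well-posedness:

y_t = \phi(p_t, r_{t-1}), \qquad f_t = \arg\min_{f} \; \lVert y_t - f \star x_t \rVert_2^2 + \lambda \lVert f \rVert_2^2

where p_t is the position information, r_{t-1} the response information of the previous frame image, y_t the resulting label information, f \star x_t the target response information of the current frame image x_t to a candidate template f, and f_t the filtering template carried forward to locate the target in the next frame image.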
2. The method of claim 1, wherein the obtaining the filtering template corresponding to the current frame image by minimizing the difference between the label information and the target response information includes:
applying a preset weighting window to the label information and the target response information to obtain weighted information; and
obtaining the filtering template corresponding to the current frame image by minimizing the weighted information.
3. The method of claim 2, wherein the center point of the preset weighting window has a first weight and non-center points of the preset weighting window have a second weight; and
the applying the preset weighting window to the label information and the target response information to obtain weighted information includes:
applying the preset weighting window to the label information and the target response information to obtain weighted information in which the first weight and the second weight distinguish the matching loss at the center point of the target object from the matching loss at non-center points.
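A minimal sketch of one way such a preset weighting window could be constructed and applied, assuming a single first weight at the window's center point and a uniform second weight elsewhere; the names and values are illustrative, not taken from the application:

```python
import numpy as np

def preset_weighting_window(shape, w_center=1.0, w_other=0.1):
    # Second weight everywhere, first weight at the single center point, so
    # the matching loss at the target's center is weighted differently from
    # the loss at non-center points.
    win = np.full(shape, w_other)
    win[shape[0] // 2, shape[1] // 2] = w_center
    return win

# Applying the window: element-wise weighting of the label/response gap.
# weighted = preset_weighting_window(label.shape) * (label - response) ** 2
```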
4. The method according to claim 2, wherein the obtaining the filtering template corresponding to the current frame image by minimizing the weighted information includes:
obtaining the filtering template corresponding to the current frame image by minimizing the weighted information under a preset constraint for distinguishing the background from the foreground of the target object in the image.
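One common form of a background/foreground constraint in the correlation-filter literature is a binary spatial mask that restricts the template's support to the target's foreground region (as in BACF-style trackers). The sketch below is an assumed illustration of that idea, not the claim's specific constraint:

```python
import numpy as np

def foreground_mask(shape, box):
    # box = (top, left, height, width) of the target's foreground region.
    mask = np.zeros(shape)
    t, l, h, w = box
    mask[t:t + h, l:l + w] = 1.0
    return mask

# Constrained update: zero the template outside the foreground after each
# minimization step, suppressing background content in the filter.
# template = foreground_mask(template.shape, target_box) * template
```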
5. The method according to claim 2, wherein the applying the preset weighting window to the label information and the target response information to obtain weighted information, and the obtaining the filtering template corresponding to the current frame image by minimizing the weighted information, include:
for each channel of each frame image, applying the preset weighting window to the label information and the target response information corresponding to the channel to obtain weighted information corresponding to the channel, and obtaining a filtering template corresponding to the channel of the current frame image by minimizing that weighted information; and
the determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image includes:
determining the position information of the target object in the current frame image by applying, to each channel of the current frame image, the filtering template corresponding one-to-one to that channel of the previous frame image.
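A sketch of the per-channel localization step, assuming (as the correlation-filter literature usually does, though the claim does not say so) that the per-channel responses are summed before taking the peak; `channel_templates_f` are the Fourier-domain templates in one-to-one correspondence with the channels of the previous frame image:

```python
import numpy as np

def locate_multichannel(frame_feats, channel_templates_f):
    # frame_feats: (C, H, W) feature channels of the current frame image.
    response = np.zeros(frame_feats.shape[1:])
    for feat, tmpl_f in zip(frame_feats, channel_templates_f):
        # Same MOSSE-style convention as earlier: the template stores the
        # conjugate filter, so correlation is a plain spectral product.
        response += np.real(np.fft.ifft2(np.fft.fft2(feat) * tmpl_f))
    return np.unravel_index(np.argmax(response), response.shape)
```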
6. The method according to claim 5, wherein the applying the preset weighting window to the label information and the target response information corresponding to the channel to obtain weighted information corresponding to the channel, and the obtaining a filtering template corresponding to the channel of the current frame image by minimizing that weighted information, include:
obtaining, in a multi-scale filtering manner, the weighted information corresponding to the channel by applying the preset weighting window to the label information and target scale response information corresponding to the channel, and obtaining a multi-scale filtering template corresponding to the channel of the current frame image by minimizing that weighted information, wherein the target scale response information characterizes the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
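Multi-scale filtering is often realized by evaluating candidate scales of the search region against single-scale templates and keeping the best-responding scale. The sketch below, which leans on `scipy.ndimage.zoom` for resampling, is an editorial reading of that idea rather than the claim's procedure:

```python
import numpy as np
from scipy.ndimage import zoom  # resampling helper (assumed dependency)

def best_scale_response(frame_feat, template_f, scales=(0.95, 1.0, 1.05)):
    h, w = frame_feat.shape
    best_scale, best_peak = None, -np.inf
    for s in scales:
        scaled = zoom(frame_feat, s)
        # Crop or zero-pad the rescaled patch back to the template's size.
        canvas = np.zeros((h, w))
        sh, sw = min(h, scaled.shape[0]), min(w, scaled.shape[1])
        canvas[:sh, :sw] = scaled[:sh, :sw]
        resp = np.real(np.fft.ifft2(np.fft.fft2(canvas) * template_f))
        if resp.max() > best_peak:
            best_scale, best_peak = s, resp.max()
    return best_scale, best_peak
```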
7. The method of claim 1, further comprising:
for a first frame image in the video to be processed, determining a filtering template corresponding to the first frame image according to a target frame representing the target object in the first frame image.
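For the first frame image there is no previous response information to combine, so the template can be fit directly to an ideal label centered on the given target frame. A sketch under that assumption, reusing `gaussian_label` and the ridge-regression update from the earlier sketch:

```python
import numpy as np

def init_template(first_frame_feat, target_box, lam=1e-2):
    # target_box = (top, left, height, width) marking the target object.
    t, l, h, w = target_box
    center = (t + h // 2, l + w // 2)
    label_f = np.fft.fft2(gaussian_label(first_frame_feat.shape, center))
    frame_f = np.fft.fft2(first_frame_feat)
    # Same closed-form fit as the per-frame update, without response mixing.
    return (label_f * np.conj(frame_f)) / (frame_f * np.conj(frame_f) + lam)
```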
8. A target object tracking device, configured to perform the following target tracking operation for each frame image in a video to be processed, the device comprising:
a first determining unit configured to determine the position information of a target object in the current frame image through a filtering template corresponding to the previous frame image;
a second determining unit configured to generate response information of the previous frame image to the filtering template corresponding to the previous frame image;
an obtaining unit configured to obtain a filtering template corresponding to the current frame image according to the position information and the response information, including: combining the position information and the response information to obtain label information of the current frame image; and obtaining the filtering template corresponding to the current frame image by minimizing the difference between the label information and target response information, wherein the target response information characterizes the response information of the current frame image to the filtering template corresponding to the current frame image, and the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
9. The apparatus of claim 8, wherein the obtaining unit is further configured to:
apply a preset weighting window to the label information and the target response information to obtain weighted information; and obtain the filtering template corresponding to the current frame image by minimizing the weighted information.
10. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
11. An electronic device, comprising:
One or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
CN202110592320.0A 2021-05-28 2021-05-28 Target object tracking method and device Active CN113393493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110592320.0A CN113393493B (en) 2021-05-28 2021-05-28 Target object tracking method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110592320.0A CN113393493B (en) 2021-05-28 2021-05-28 Target object tracking method and device

Publications (2)

Publication Number Publication Date
CN113393493A CN113393493A (en) 2021-09-14
CN113393493B true CN113393493B (en) 2024-04-05

Family

ID=77619549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110592320.0A Active CN113393493B (en) 2021-05-28 2021-05-28 Target object tracking method and device

Country Status (1)

Country Link
CN (1) CN113393493B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11995895B2 (en) * 2019-06-03 2024-05-28 Nvidia Corporation Multi-object tracking using correlation filters in video analytics applications

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146912A (en) * 2018-07-26 2019-01-04 湖南人文科技学院 A kind of visual target tracking method based on Objective analysis
CN109410247A (en) * 2018-10-16 2019-03-01 中国石油大学(华东) A kind of video tracking algorithm of multi-template and adaptive features select
CN110084836A (en) * 2019-04-26 2019-08-02 西安电子科技大学 Method for tracking target based on the response fusion of depth convolution Dividing Characteristics
CN110097575A (en) * 2019-04-28 2019-08-06 电子科技大学 A kind of method for tracking target based on local feature and scale pond
WO2020228522A1 (en) * 2019-05-10 2020-11-19 腾讯科技(深圳)有限公司 Target tracking method and apparatus, storage medium and electronic device
CN110349190A (en) * 2019-06-10 2019-10-18 广州视源电子科技股份有限公司 Method for tracking target, device, equipment and the readable storage medium storing program for executing of adaptive learning
CN111161323A (en) * 2019-12-31 2020-05-15 北京理工大学重庆创新中心 Complex scene target tracking method and system based on correlation filtering
CN111899278A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Unmanned aerial vehicle image rapid target tracking method based on mobile terminal
CN111931722A (en) * 2020-09-23 2020-11-13 杭州视语智能视觉系统技术有限公司 Correlated filtering tracking method combining color ratio characteristics
CN112036381A (en) * 2020-11-03 2020-12-04 中山大学深圳研究院 Visual tracking method, video monitoring method and terminal equipment
CN112233143A (en) * 2020-12-14 2021-01-15 浙江大华技术股份有限公司 Target tracking method, device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on multi-scale video target tracking algorithms; Hou Ying; Wang Ying; Lin Xinyu; Information Technology and Informatization (04); full text *

Also Published As

Publication number Publication date
CN113393493A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
US10902245B2 (en) Method and apparatus for facial recognition
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
US10410315B2 (en) Method and apparatus for generating image information
CN108229591B (en) Neural network adaptive training method and apparatus, device, program, and storage medium
US11216924B2 (en) Method and apparatus for processing image
US11941529B2 (en) Method and apparatus for processing mouth image
CN109255337B (en) Face key point detection method and device
CN108509921B (en) Method and apparatus for generating information
KR20210012012A (en) Object tracking methods and apparatuses, electronic devices and storage media
CN113505848B (en) Model training method and device
CN109977832B (en) Image processing method, device and storage medium
CN110059623B (en) Method and apparatus for generating information
CN107992790A (en) Target long time-tracking method and system, storage medium and electric terminal
CN111445496B (en) Underwater image recognition tracking system and method
CN112991218A (en) Image processing method, device, equipment and storage medium
CN114677422A (en) Depth information generation method, image blurring method and video blurring method
CN113158773B (en) Training method and training device for living body detection model
CN108038473B (en) Method and apparatus for outputting information
CN110852250B (en) Vehicle weight removing method and device based on maximum area method and storage medium
CN113393493B (en) Target object tracking method and device
WO2020052120A1 (en) Method and device for processing feature point of image
CN114119990B (en) Method, apparatus and computer program product for image feature point matching
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
CN114821777A (en) Gesture detection method, device, equipment and storage medium
CN116704593A (en) Predictive model training method, apparatus, electronic device, and computer-readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100176 601, 6th floor, building 2, No. 18, Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing

Applicant after: Jingdong Technology Information Technology Co.,Ltd.

Address before: 100176 601, 6th floor, building 2, No. 18, Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing

Applicant before: Jingdong Shuke Haiyi Information Technology Co.,Ltd.

GR01 Patent grant