CN113393493A - Target object tracking method and device - Google Patents
Target object tracking method and device
- Publication number
- CN113393493A (application CN202110592320.0A)
- Authority
- CN
- China
- Prior art keywords
- frame image
- information
- current frame
- target
- filtering template
- Prior art date
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/262—Analysis of motion using transform domain methods, e.g. Fourier domain methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
Abstract
The application discloses a target object tracking method and device. One embodiment of the method comprises performing the following target tracking operation for each frame of image in a video to be processed: determining the position information of a target object in the current frame image through the filtering template corresponding to the previous frame image; generating the response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining the filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image. The method improves the universality, robustness, accuracy and reliability of target tracking.
Description
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a target object tracking method and device.
Background
In machine vision tasks, target tracking has wide application scenarios and great commercial value. Because correlation filtering algorithms are fast and efficient, easy to deploy on a Central Processing Unit (CPU), and able to track targets in real time, they are widely applied to target tracking tasks. However, current correlation filtering models update the filter model with reference only to a Gaussian pseudo-label preset for each frame of image.
Disclosure of Invention
The embodiment of the application provides a target object tracking method and device.
In a first aspect, an embodiment of the present application provides a target object tracking method, which performs the following target tracking operations for each frame of image in a video to be processed: determining the position information of a target object in the current frame image through a filter template corresponding to the previous frame image; generating response information of the previous frame image to a filtering template corresponding to the previous frame image; and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
In some embodiments, the obtaining a filtering template corresponding to the current frame image according to the position information and the response information includes: combining the position information and the response information to obtain label information of the current frame image; and obtaining a filtering template corresponding to the current frame image based on the minimization between the label information and the target response information, wherein the target response information represents the response information of the current frame image to the filtering template corresponding to the current frame image.
In some embodiments, the obtaining a filtering template corresponding to the current frame image based on the minimization between the tag information and the target response information includes: acting on the label information and the target response information through a preset weighting window to obtain acted information; and obtaining a filtering template corresponding to the current frame image based on the minimization of the acted information.
In some embodiments, the center point of the predetermined weighting window is a first weight, and the non-center point of the predetermined weighting window is a second weight; and the above acting on the tag information and the target response information through the preset weighting window to obtain acted information, including: and acting on the label information and the target response information through a preset weighting window to obtain the acted information of the matching loss of a central point and a non-central point of the target object distinguished by the first weight and the second weight.
In some embodiments, the obtaining a corresponding filtering template of the current frame image based on the minimization of the affected information includes: and based on the minimization of the acted information, obtaining a filtering template corresponding to the current frame image by using a preset constraint aiming at distinguishing the background and the foreground of the target object in the image.
In some embodiments, the tag information and the target response information are acted on through a preset weighting window to obtain acted information; based on the minimization of the acted information, obtaining a filtering template corresponding to the current frame image, wherein the filtering template comprises the following steps: for each channel of each frame image, acting on the label information and the target response information corresponding to the channel through a preset weighting window to obtain acted information corresponding to the channel, and obtaining a filtering template corresponding to the channel of the current frame image based on minimization of the acted information; and the determining the position information of the target object in the current frame image by the filter template corresponding to the previous frame image includes: and acting on each channel of the current frame image based on the one-to-one corresponding filtering template of each channel of the previous frame image to determine the position information of the target object in the current frame image.
In some embodiments, the obtaining the post-operation information corresponding to the channel by applying the preset weighting window to the tag information and the target response information corresponding to the channel, and obtaining the filtering template corresponding to the channel of the current frame image based on minimization of the post-operation information includes: and obtaining multi-scale filtering templates corresponding to the channels of the current frame image based on minimization of the acted information by using a multi-scale filtering mode and acting on the label information and the target scale response information corresponding to the channels through a preset weighting window, wherein the target scale response information represents the channels of the current frame image and is used for responding to the response information of the filtering templates with a single scale in the multi-scale filtering templates.
In some embodiments, the above method further comprises: and aiming at a first frame image in the video to be processed, determining a filtering template corresponding to the first frame image according to a target frame representing a target object in the first frame image.
In a second aspect, an embodiment of the present application provides a target object tracking apparatus that performs the following target tracking operations for each frame of image in a video to be processed: a first determining unit configured to determine the position information of a target object in the current frame image through the filtering template corresponding to the previous frame image; a second determining unit configured to generate the response information of the previous frame image to the filtering template corresponding to the previous frame image; and an obtaining unit configured to obtain the filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
In some embodiments, the deriving unit is further configured to: combining the position information and the response information to obtain label information of the current frame image; and obtaining a filtering template corresponding to the current frame image based on the minimization between the label information and the target response information, wherein the target response information represents the response information of the current frame image to the filtering template corresponding to the current frame image.
In some embodiments, the deriving unit is further configured to: acting on the label information and the target response information through a preset weighting window to obtain acted information; and obtaining a filtering template corresponding to the current frame image based on the minimization of the acted information.
In some embodiments, the center point of the predetermined weighting window is a first weight, and the non-center point of the predetermined weighting window is a second weight; and an obtaining unit further configured to: and acting on the label information and the target response information through a preset weighting window to obtain the acted information of the matching loss of a central point and a non-central point of the target object distinguished by the first weight and the second weight.
In some embodiments, the deriving unit is further configured to: and based on the minimization of the acted information, obtaining a filtering template corresponding to the current frame image by using a preset constraint aiming at distinguishing the background and the foreground of the target object in the image.
In some embodiments, the deriving unit is further configured to: for each channel of each frame image, acting on the label information and the target response information corresponding to the channel through a preset weighting window to obtain acted information corresponding to the channel, and obtaining a filtering template corresponding to the channel of the current frame image based on minimization of the acted information; and a first determination unit further configured to: and acting on each channel of the current frame image based on the one-to-one corresponding filtering template of each channel of the previous frame image to determine the position information of the target object in the current frame image.
In some embodiments, the deriving unit is further configured to: and obtaining multi-scale filtering templates corresponding to the channels of the current frame image based on minimization of the acted information by using a multi-scale filtering mode and acting on the label information and the target scale response information corresponding to the channels through a preset weighting window, wherein the target scale response information represents the channels of the current frame image and is used for responding to the response information of the filtering templates with a single scale in the multi-scale filtering templates.
In some embodiments, the apparatus further comprises: and the third determining unit is configured to determine, for a first frame image in the video to be processed, a filtering template corresponding to the first frame image according to a target frame in the first frame image, wherein the target frame characterizes a target object.
In a third aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement a method as described in any implementation of the first aspect.
According to the method and the device for tracking the target object, the following target tracking operation is executed for each frame of image in the video to be processed: determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image; generating the response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining the filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image. The method thus flexibly combines the position information of the target object in the current frame image with the response information corresponding to the previous frame image, obtains the filtering template for tracking the target object from the combined information, and improves the universality, robustness and accuracy of target tracking.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for tracking a target object according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the target object tracking method according to the present embodiment;
FIG. 4 is a flow diagram of yet another embodiment of a target object tracking method according to the present application;
FIG. 5 is a block diagram of one embodiment of a target object tracking apparatus according to the present application;
FIG. 6 is a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary architecture 100 to which the target object tracking method and apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections between the terminal devices 101, 102, 103 form a topological network, and the network 104 serves to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may be hardware devices or software that support network connections for data interaction and data processing. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display, processing, and other functions, including but not limited to cameras, smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented, for example, as multiple software or software modules to provide distributed services, or as a single software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, for example, a background server that acquires a to-be-processed video captured or sent by the terminal devices 101, 102, and 103 and tracks a target object in the to-be-processed video. Optionally, the background server may feed back the target tracking result to the terminal device. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the target object tracking method provided in the embodiments of the present application may be executed by the server, by the terminal device, or by the server and the terminal device in cooperation with each other. Accordingly, the parts (for example, the units) included in the target object tracking apparatus may all be provided in the server, all in the terminal device, or distributed between the server and the terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the tracking method of the target object operates does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the tracking method of the target object operates.
With continuing reference to FIG. 2, a flow 200 of one embodiment of a target object tracking method is shown; for each frame of image in a video to be processed, the following target tracking operations are performed:
Step 201, determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image.

In this embodiment, an execution subject of the target object tracking method (for example, the server in fig. 1) may determine the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image.
The video to be processed may be a video including an arbitrary target object. For example, the video to be processed is a video of a target object including a person, an animal, or the like, captured by a monitoring apparatus. Each frame of image in the video to be processed may include a plurality of target objects, or some frames may not include target objects.
As an example, the executing body may perform a convolution operation on the current frame image based on the filtering template corresponding to the previous frame image to obtain a response map representing response information corresponding to the current frame image; and then, determining the point with the maximum response in the response image as the central point of the target object in the current frame image so as to determine the position information of the target object. For example, the central point of the target object in the current frame image is directly determined as the position information of the target object; for another example, a target frame surrounding the target object is obtained according to the central point of the target object, so as to determine the position of the area where the target object is located.
The filtering template should satisfy the following condition: its convolution with the target object in an image has the largest response at the center point of the target object and smaller responses at non-center points; specifically, the more background an image region contains, the smaller the response of the filtering template to it.
Since the time interval between two frames of images of the video to be processed is short, the difference between the two frames of images should not be too large. Based on the above assumptions, the position information of the target object in the current frame image is consistent with the position information of the target object in the previous frame image with a high probability, that is, the two frame images have visual characteristics such as temporal continuity, spatial continuity, and the like. And performing convolution operation on the filtering template corresponding to the previous frame image and the current frame image to obtain a response image of the current frame image, and further taking the point with the maximum response in the response image as the central point of the target object in the current frame image. And for each frame of image in the video to be processed, determining the position information of the target object in the current frame of image, namely realizing the tracking of the target object in the video to be processed.
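As an illustrative sketch of this step (not a definitive implementation), the following NumPy code applies a filtering template to the feature map of the current frame in the Fourier domain and takes the peak of the response map as the target's center point; the function name, the Fourier-domain storage of the template, and the single-channel setting are assumptions of the example:

```python
import numpy as np

def locate_target(template_freq, frame_feat):
    """Apply a learned filtering template (stored in the Fourier domain,
    an assumed representation) to a single-channel feature map and return
    the peak of the response map as the target's center point."""
    # Convolution in the spatial domain is an element-wise product in the
    # Fourier domain, which is what keeps this step fast.
    response = np.real(np.fft.ifft2(template_freq * np.fft.fft2(frame_feat)))
    # The point with the maximum response is taken as the target's center.
    cy, cx = np.unravel_index(np.argmax(response), response.shape)
    return (cy, cx), response
```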
Step 202, generating the response information of the previous frame image to the filtering template corresponding to the previous frame image.

In this embodiment, the execution subject may generate the response information of the previous frame image to the filtering template corresponding to the previous frame image.
As an example, the execution subject may perform a convolution operation on the previous frame image based on the corresponding filtering template of the previous frame image, so as to determine the response information of the previous frame image to the filtering template. Wherein, the response information can be represented in a response graph.
It can be understood that the response map has the largest response at the center point of the target object.
Step 203, obtaining the filtering template corresponding to the current frame image according to the position information and the response information.
In this embodiment, the execution subject may obtain the filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
As an example, the execution subject may set a Gaussian distribution label for the current frame image centered on the center point of the target object in the response map corresponding to the current frame image (i.e., the point with the largest response in the response map). The center point corresponding to the Gaussian distribution label coincides with the center point of the target object in the response map corresponding to the current frame image.
The response map of the previous frame image to its filtering template is likewise largest at the center point of the target object and smaller at non-center points. The execution subject may combine the Gaussian distribution label representing the position information of the target object in the current frame image with the response information of the previous frame image to obtain combined information, and determine the filtering template corresponding to the current frame image using the combined information as the label. If the center point of the target object in the response map of the previous frame image to its filtering template does not coincide with the center point of the Gaussian distribution label, the former needs to be brought into coincidence with the latter through a matrix shift operation.
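A minimal sketch of such a matrix shift operation, assuming the response map is a NumPy array and the shift is circular (the names are illustrative):

```python
import numpy as np

def align_response(prev_response, label_center):
    """Circularly shift the previous frame's response map so that its peak
    coincides with the center point of the Gaussian distribution label."""
    py, px = np.unravel_index(np.argmax(prev_response), prev_response.shape)
    ly, lx = label_center
    return np.roll(prev_response, shift=(ly - py, lx - px), axis=(0, 1))
```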
In some optional implementations of this embodiment, the executing main body may execute the step 203 by:
firstly, combining the position information and the response information to obtain the label information of the current frame image.
The execution body may set a preset weight for the position information and the response information, and obtain the label information of the current frame image based on the preset weight in combination with the position information and the response information.
Secondly, a filtering template corresponding to the current frame image is obtained based on the minimization between the label information and the target response information.
The target response information represents the response information of the current frame image to the corresponding filtering template of the current frame image.
In this implementation, the execution subject may measure the distance between the label information and the target response information with the L2 norm, and then determine the filtering template corresponding to the current frame image based on the minimization of that distance. In order to improve the generalization ability of the filtering template, a regularization term on the filtering template can be introduced.
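The following sketch shows, under simplifying assumptions, the combination of the two kinds of information into label information and the closed-form Fourier-domain minimizer of the resulting L2 distance with a regularization term, for a single channel; the preset weighting window and the foreground constraint described below are deliberately omitted here, and theta and lam are placeholder values:

```python
import numpy as np

def make_label(gauss_label, aligned_prev_response, theta=0.5):
    """Combine position information (Gaussian label) and the previous
    frame's response information with a preset weight theta (assumed)."""
    return theta * gauss_label + (1.0 - theta) * aligned_prev_response

def solve_template(frame_feat, label, lam=1e-2):
    """Per-frequency-bin ridge solution of
    min_W ||W * x - l||_2^2 + lam * ||W||_2^2 (single channel, no
    weighting window): W_hat = conj(X) L / (conj(X) X + lam)."""
    X = np.fft.fft2(frame_feat)
    L = np.fft.fft2(label)
    return (np.conj(X) * L) / (np.conj(X) * X + lam)  # Fourier-domain template
```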
In some optional implementation manners of this embodiment, the execution main body may act on the tag information and the target response information through a preset weighting window to obtain the acted information, and obtain the filtering template corresponding to the current frame image based on minimization of the acted information.
Different weights are set at different positions in the preset weighting window. As an example, the weight of the center point in the preset weighting window is greater than the weight of the non-center points. The preset weighting window can make the filtering template concentrate more on the label matching loss at the center point position of the target object in the response map (realized by setting a larger weight at the center point of the preset weighting window), or reduce the label matching loss at the center position (realized by setting a smaller weight at the center point of the preset weighting window).
In some optional implementations of this embodiment, a central point of the preset weighting window is a first weight, and a non-central point of the preset weighting window is a second weight. In this implementation manner, the execution subject may act on the tag information and the target response information through a preset weighting window to obtain post-action information that distinguishes matching loss between a central point and a non-central point of the target object by using the first weight and the second weight.
Different weights are set for the center point and non-center points of the preset weighting window mainly out of the following three considerations. First, in the correlation filtering algorithm only the center point is treated as a positive sample; because of the circular-shift assumption underlying the discrete Fourier transform, the samples at non-center positions do not exist in the real scene, so the label matching at non-center points does not need to be weighted highly. Second, non-center positions are dominated by background, and this embodiment expects the local response of background-heavy image regions to be as low as possible, so the background part does not require accurate label matching. Third, in detecting the position of the target object in the current frame image, only the center point of the target object is used, so the attention of the filtering template should be on the center point position of the target object rather than on non-center positions. Therefore, different weights are given to the center point position and the non-center point positions of the target object.
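A minimal sketch of constructing such a preset weighting window, assuming the label is centered in the map; the concrete values of the first weight b and the second weight a are tuning choices, not fixed by this application:

```python
import numpy as np

def make_weight_window(shape, b, a):
    """Preset weighting window u: the first weight b at the (assumed
    centered) target center point, the second weight a at non-center points."""
    u = np.full(shape, a, dtype=np.float64)
    u[shape[0] // 2, shape[1] // 2] = b
    return u
```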
In some optional implementation manners of this embodiment, for the second step, the executing entity may obtain, based on the minimization of the post-action information, a filtering template corresponding to the current frame image by using a preset constraint aiming at distinguishing a background and a foreground of the target object in the image.
In this implementation, label matching of the background portion of the image is ignored, and label matching is concentrated on the region of the tracked target object.
In some optional implementation manners of this embodiment, the execution subject may determine, in units of channels of the image, a filter template corresponding to each channel of the image. Taking an RGB image as an example, including three channels of R (red), G (green), and B (blue), the execution body may determine three filtering templates corresponding to the three channels one by one.
Specifically, for each channel of each frame image, the preset weighting window is applied to the label information and the target response information corresponding to the channel to obtain the post-action information corresponding to the channel, and the filtering template corresponding to the channel of the current frame image is obtained based on the minimization of the post-action information.
In this implementation, the executing body executes the step 201 as follows: and acting on each channel of the current frame image based on the one-to-one corresponding filtering template of each channel of the previous frame image to determine the position information of the target object in the current frame image.
As an example, for each frame of image, the execution subject may obtain a response map of each channel through a filter template corresponding to each channel of the image, further fuse each response map, obtain a fused response map, determine a point with the maximum response in the fused response map as a central point of the target object, and further determine the position information of the target object.
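A sketch of the per-channel application and fusion, assuming one Fourier-domain template per channel and element-wise summation as the fusion rule (the fusion operator is an assumption of this example):

```python
import numpy as np

def fused_response(templates_freq, frame_feats):
    """Apply each channel's filtering template to the matching channel of
    the current frame, sum the per-channel response maps into one fused
    map, and return its peak as the target's center point."""
    fused = np.zeros(frame_feats[0].shape)
    for W, x in zip(templates_freq, frame_feats):
        fused += np.real(np.fft.ifft2(W * np.fft.fft2(x)))
    cy, cx = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (cy, cx)
```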
In this implementation, the executing body executes the step 202 as follows: and determining the response information of the previous frame image to each filtering template through the filtering templates corresponding to the channels of the previous frame image one by one.
In some optional implementations of this embodiment, the executing entity determines the filtering template corresponding to the first frame image in the video to be processed by: and aiming at a first frame image in the video to be processed, determining a filtering template corresponding to the first frame image according to a target frame representing a target object in the first frame image.
The target frame indicates the region position of the target object in the first frame image. Taking the target frame as the label, and based on the convolution of the filtering template with the target object in the image, the filtering template of the first frame image is determined under the condition that the response at the center point of the target object is largest and the responses at non-center points are small.
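A minimal sketch of building such a Gaussian distribution label from the annotated target frame of the first image; the bandwidth sigma is an assumed parameter:

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    """Gaussian distribution label peaking at the center of the target
    frame annotated in the first frame image."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = center
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))
```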
A specific implementation of the present embodiment is given as follows:
first, a mathematical model of the filter template is constructed:
$$\varepsilon(W_t)=\sum_{k=1}^{C}\Big\|\,u\odot\big(W_t^{k}*x_t^{k}-l\big)\Big\|_2^2+\lambda\sum_{k=1}^{C}\big\|W_t^{k}\big\|_2^2\qquad(1)$$

wherein ε(W_t) is the solving function of the filtering template of the t-th frame image; W_t^k and W_{t−1}^k respectively denote the filtering template of the k-th channel of the t-th frame image and of the (t−1)-th frame image; u denotes the preset weighting window; y denotes the Gaussian distribution label of the t-th frame image; x_t^k and x_{t−1}^k respectively denote the image information of the k-th channel of the t-th frame image and of the (t−1)-th frame image; λΣ_k‖W_t^k‖² is the regularization term for improving the generalization ability of the function; θ denotes the weight parameter, λ the regularization coefficient, and C the number of channels of the image; Δ denotes the matrix shift operation, whose aim is to make the response map of the previous frame image as similar as possible to that of the current frame image, realizing a cross-time continuity prior, i.e., the responses of the filtering templates to the same target object in adjacent frame images should be consistent; ⊙ and * denote the Hadamard product and convolution, respectively; and l denotes the label (acted information) defined in equation (2) below.
Here, y, u, x, and W are all described taking a 1-dimensional vector as an example (specifically, n × 1); the calculation for a 2-dimensional image matrix is similar to the one-dimensional case. When using ordinary convolution for an n × n matrix, i.e., the 2-dimensional case, convolving an n × n matrix with an n × n filtering template has time complexity O(n⁴). Based on the correlation filtering model, the computational complexity can be reduced to O(n² log n), i.e., the complexity of the fast Fourier transform, thereby reducing the algorithm complexity and speeding up the tracking process.
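The complexity claim can be checked on a small example: direct circular convolution of an n × n matrix with an n × n template takes four nested loops, i.e., O(n⁴), while the Fourier-domain product gives the same result in O(n² log n). A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal((n, n))   # image (or feature) block
w = rng.standard_normal((n, n))   # filtering template

# Direct circular convolution: four nested loops, O(n^4).
direct = np.zeros((n, n))
for dy in range(n):
    for dx in range(n):
        for i in range(n):
            for j in range(n):
                direct[dy, dx] += w[i, j] * x[(dy - i) % n, (dx - j) % n]

# The same convolution through the FFT: O(n^2 log n).
fast = np.real(np.fft.ifft2(np.fft.fft2(w) * np.fft.fft2(x)))

assert np.allclose(direct, fast)  # identical results, very different cost
```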
x may be a Histogram of Oriented Gradients (HOG) feature, a Color Names (CN) feature, a grayscale feature, or another handcrafted feature, or may be a deep feature.
Wherein the acted information (the label l) is as follows:

$$l=\theta\,y+(1-\theta)\,\Delta\big(W_{t-1}^{k}*x_{t-1}^{k}\big)\qquad(2)$$
the post-contribution information may be viewed as a label that determines a filter template for the t-th frame image.
In order to simplify the operation, the positions matched against the label are divided into two regions: one is the center point, assigned the weight b; the other is the non-center-point region, assigned the weight a. This is characterized by the preset weighting window as follows:

$$u(i)=\begin{cases}b,& i\text{ is the center point}\\[2pt] a,&\text{otherwise}\end{cases}\qquad(3)$$
in order to enhance the distinguishing capability of the filtering template for the foreground and the background, the above equations (1), (2) and (3) are combined to obtain:
wherein,b is a mapping matrix of N x D as an auxiliary variable. All values in B are 0 (corresponding to background position) or 1 (corresponding to foreground position) aiming at ignoring tag matching of background part but tag matching concentrated on tracked target object area, and N>>D; (i) representing a mapping function, representing that the characteristic vector x of the kth channel of the t frame is translated by i units to obtain a new vector; u (i) and l (i) represent the values of the ith position in vectors u (characterizing the preset weighting window) and l (characterizing the post-operative information), respectively, and u (i),
Then, solving the above equation (4) with the block coordinate descent method and introducing the auxiliary variable g, we can obtain:

$$\varepsilon\big(W_t^{k},g_t^{k}\big)=\sum_{i=1}^{N}u(i)\Big({g_t^{k}}^{\top}x_t^{k}(i)-l(i)\Big)^{2}+\lambda\big\|W_t^{k}\big\|_2^{2}+\mu\big\|g_t^{k}-BW_t^{k}\big\|_2^{2}\qquad(6)$$

where μ is a penalty parameter for making the constraint condition g_t^k = BW_t^k be satisfied as far as possible. Equation (6) above can be divided into two iterated sub-problems, solving for W_t and g_t respectively.
Taking the derivative of equation (6) with respect to g_t^k and setting it to zero yields:

$$g_t^{k}=\Big(\sum_{i=1}^{N}u(i)^{2}\,x_t^{k}(i)\,x_t^{k}(i)^{\top}+\mu I\Big)^{-1}\Big(\sum_{i=1}^{N}u(i)^{2}\,l(i)\,x_t^{k}(i)+\mu BW_t^{k}\Big)\qquad(7)$$

Taking the derivative of equation (6) with respect to W_t^k and setting it to zero yields:

$$W_t^{k}=\big(\lambda I+\mu B^{\top}B\big)^{-1}\mu B^{\top}g_t^{k}\qquad(8)$$

where I denotes the identity matrix.
Substituting equation (3) into equation (7) reduces the above equation (7) to:

$$g_t^{k}=\big(a^{2}M+(b^{2}-a^{2})\,\varphi\varphi^{\top}+\mu I_{N}\big)^{-1}\Big(\sum_{i=1}^{N}u(i)^{2}\,l(i)\,x_t^{k}(i)+\mu BW_t^{k}\Big)\qquad(9)$$

wherein M is:

$$M=\sum_{i=1}^{N}x_t^{k}(i)\,x_t^{k}(i)^{\top}\qquad(10)$$
To avoid computing the inverse of the matrix directly, based on the properties of the circulant matrix we obtain:

$$M=F^{H}\,\mathrm{diag}\big(\hat{x}_t^{k}\odot\hat{x}_t^{k*}\big)\,F\qquad(11)$$

wherein diag denotes a diagonal matrix, F denotes the Fourier transform matrix, and H denotes the conjugate transpose.
Then, based on the rank-1-update Sherman–Morrison formula:

$$\big(A+xy^{\top}\big)^{-1}=A^{-1}-\frac{A^{-1}xy^{\top}A^{-1}}{1+y^{\top}A^{-1}x}\qquad(12)$$
In the Sherman–Morrison formula, A can be any invertible square matrix; x and y are both vectors.
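The identity can be verified numerically; in the sketch below A is inverted explicitly only for the check, whereas in the solver A⁻¹ is cheap to apply thanks to the diagonalized circulant structure above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)  # a well-conditioned square matrix
x = rng.standard_normal((n, 1))
y = rng.standard_normal((n, 1))

Ainv = np.linalg.inv(A)
# Sherman-Morrison: (A + x y^T)^-1 = A^-1 - (A^-1 x y^T A^-1) / (1 + y^T A^-1 x)
sm = Ainv - (Ainv @ x @ y.T @ Ainv) / (1.0 + float(y.T @ Ainv @ x))

assert np.allclose(sm, np.linalg.inv(A + x @ y.T))
```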
The initial optimal formula for W (through its auxiliary variable g) can be obtained:

$$g_t^{k}=A^{-1}\rho-\frac{(b^{2}-a^{2})\,A^{-1}\varphi\varphi^{\top}A^{-1}\rho}{1+(b^{2}-a^{2})\,\varphi^{\top}A^{-1}\varphi}\qquad(13)$$

wherein A = a²M + μI_N, φ = f_t(1), and ρ = Σᵢ u(i)² l(i) x_t^k(i) + μBW_t^k denotes the right-hand factor of equation (9).
Therefore, by combining the property of the circulant matrix with the above rank-1 optimization, computing the inverse of the n × n matrix is avoided, which greatly accelerates the solving.

The solution space above is very redundant and applies only to a single channel. Each channel of the image can thus be treated identically, and a filtering template for each channel is obtained based on the above solving process.
After the filtering template of the current frame image is obtained, it is regarded as a filter, and the response map of the next frame image is obtained from the filtering template of the current frame image and the feature x of the next frame image:

$$\mathrm{Res}_{t+1}=\sum_{k=1}^{C}W_t^{k}*x_{t+1}^{k}\qquad(14)$$
then, the position of the target object in the target frame of the next frame is calculated based on the following equation:
(x,y)=arg max Rest+1(x,y)
x,y
thus, target tracking for the t +1 frame is completed.
In some optional implementations of this embodiment, for each channel of each frame of image, a multi-scale filtering manner is adopted: the label information and the target scale response information corresponding to the channel are acted on through the preset weighting window to obtain the acted information corresponding to the channel, and the multi-scale filtering templates corresponding to the channel of the current frame image are obtained based on the minimization of the acted information. The target scale response information characterizes the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
In order to realize multi-scale target tracking, the correlation filtering algorithm generates a plurality of filtering templates of different scales, then generates a plurality of response maps of different scales, and determines from them the response map with the largest response; its scale is taken as the current scaling of the image.
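A minimal sketch of the scale selection, assuming parallel lists of single-scale Fourier-domain templates and correspondingly resampled feature maps (the list-based interface is an assumption of this example):

```python
import numpy as np

def best_scale(templates_freq, feats_by_scale):
    """Evaluate one single-scale filtering template per candidate scale and
    return the index of the scale whose response map peaks highest; that
    scale is taken as the current scaling of the target."""
    best_peak, best_idx = -np.inf, -1
    for idx, (W, x) in enumerate(zip(templates_freq, feats_by_scale)):
        resp = np.real(np.fft.ifft2(W * np.fft.fft2(x)))
        if resp.max() > best_peak:
            best_peak, best_idx = float(resp.max()), idx
    return best_idx
```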
The hyperparameters, i.e., the values a (non-center point) and b (center point) of the preset weighting window u in equation (3), can be flexibly adjusted according to the characteristics of the tracked target and the conditions of the data set, so as to adapt to different situations and improve the universality and accuracy of the method. In some cases the hyperparameters can even be adjusted so that the method degenerates to a general target model, ensuring that the universality of the present application is enhanced on the premise of being no weaker than the results of general methods. Moreover, the method in the present application does not require the images of the video to be processed to satisfy data distribution assumptions such as low rank; instead, treating the center point and non-center points differently through the weighting window enhances the model's ability to distinguish the data in the image and strengthens the robustness and reliability of the target object tracking method.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the target object tracking method according to the present embodiment. In the application scenario of fig. 3, the server 302 first acquires the surveillance video captured by the terminal device 301. For each frame of image in the surveillance video, the server 302 performs the following target tracking operations: first, it determines the filtering template i−1 corresponding to the (t−1)-th frame image, and determines the position information of the target object in the t-th frame image based on the filtering template i−1; then, it determines the response information of the (t−1)-th frame image to the filtering template i−1; and then, it obtains the filtering template i corresponding to the t-th frame image according to the position information and the response information. Furthermore, in the next target tracking operation, the position information of the target object in the (t+1)-th frame image can be determined through the filtering template i corresponding to the t-th frame image.
The method provided by the above embodiment of the present application performs the following target tracking operation for each frame of image in the video to be processed: determining the position information of a target object in the current frame image through a filter template corresponding to the previous frame image; generating response information of the previous frame image to a filtering template corresponding to the previous frame image; and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image, so that the target object tracking method is provided, and the universality, robustness, accuracy and reliability of target tracking are improved.
With continuing reference to FIG. 4, a flow 400 of yet another embodiment of the target object tracking method according to the present application is shown; for each frame of image in the video to be processed, the following target tracking operations are performed:
Step 401, determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image.

Step 402, generating the response information of the previous frame image to the filtering template corresponding to the previous frame image.
Step 403, combining the position information and the response information to obtain the label information of the current frame image.
The filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
Step 404, acting on the label information and the target response information through the preset weighting window to obtain acted information that distinguishes the matching loss at the center point and non-center points of the target object by the first weight and the second weight, and obtaining the filtering template corresponding to the current frame image based on the minimization of the acted information.
The center point of the preset weighting window is a first weight, and the non-center point of the preset weighting window is a second weight.
As can be seen, compared with the embodiment corresponding to FIG. 2, the flow 400 of the target object tracking method in this embodiment specifies the process of determining the filtering template: in the correlation filtering, a label based on a temporal smoothing assumption is constructed, and a preset weighting window that matches the label with different weights is constructed, further improving the robustness and accuracy of target tracking.
With continuing reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of a target object tracking apparatus, which corresponds to the method embodiment shown in fig. 2 and can be applied to various electronic devices.
As shown in fig. 5, the target object tracking apparatus performs the following target tracking operations for each frame of image in the video to be processed through the following units: a first determining unit 501 configured to determine the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image; a second determining unit 502 configured to generate the response information of the previous frame image to the filtering template corresponding to the previous frame image; and an obtaining unit 503 configured to obtain, according to the position information and the response information, the filtering template corresponding to the current frame image, where the filtering template corresponding to the current frame image is used to determine the position information of the target object in the next frame image.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: combining the position information and the response information to obtain label information of the current frame image; and obtaining a filtering template corresponding to the current frame image based on the minimization between the label information and the target response information, wherein the target response information represents the response information of the current frame image to the filtering template corresponding to the current frame image.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: acting on the label information and the target response information through a preset weighting window to obtain acted information; and obtaining a filtering template corresponding to the current frame image based on the minimization of the acted information.
In some optional implementation manners of this embodiment, a central point of the preset weighting window is a first weight, and a non-central point of the preset weighting window is a second weight; a deriving unit 503, further configured to: and acting on the label information and the target response information through a preset weighting window to obtain the acted information of the matching loss of a central point and a non-central point of the target object distinguished by the first weight and the second weight.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: and based on the minimization of the acted information, obtaining a filtering template corresponding to the current frame image by using a preset constraint aiming at distinguishing the background and the foreground of the target object in the image.
In some optional implementations of this embodiment, the obtaining unit 503 is further configured to: for each channel of each frame image, acting on the label information and the target response information corresponding to the channel through a preset weighting window to obtain acted information corresponding to the channel, and obtaining a filtering template corresponding to the channel of the current frame image based on minimization of the acted information; and a first determination unit further configured to: and acting on each channel of the current frame image based on the one-to-one corresponding filtering template of each channel of the previous frame image to determine the position information of the target object in the current frame image.
In some embodiments, the deriving unit 503 is further configured to: and for each channel of each frame of image, a multi-scale filtering mode is adopted, label information and target scale response information corresponding to the channel are acted through a preset weighting window to obtain acted information corresponding to the channel, and a multi-scale filtering template corresponding to the channel of the current frame of image is obtained based on minimization of the acted information, wherein the target scale response information represents the channel of the current frame of image, and response information of the filtering template with a single scale in the multi-scale filtering template is obtained.
In some embodiments, the apparatus further comprises: and a third determining unit (not shown in the figure) configured to determine, for a first frame image in the video to be processed, a corresponding filtering template of the first frame image according to a target frame representing a target object in the first frame image.
In this embodiment, the target object tracking apparatus performs the following target tracking operation for each frame of image in the video to be processed through the following units: the first determining unit determines the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image; the second determining unit generates the response information of the previous frame image to the filtering template corresponding to the previous frame image; and the obtaining unit obtains the filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image. A target object tracking apparatus is thus provided, improving the universality, robustness and accuracy of target tracking.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing devices of embodiments of the present application (e.g., devices 101, 102, 103, 105 shown in FIG. 1). The apparatus shown in fig. 6 is only an example, and should not bring any limitation to the function and use range of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a processor (e.g., a CPU, central processing unit) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the system 600. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the processor 601, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the client computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising a first determining unit, a second determining unit, and an obtaining unit. In some cases, the names of these units do not constitute a limitation of the units themselves; for example, the obtaining unit may also be described as a "unit that obtains a filtering template corresponding to the current frame image according to the position information and the response information".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to perform, for each frame of image in the video to be processed, the following target tracking operation: determining the position information of a target object in the current frame image through a filtering template corresponding to the previous frame image; generating response information of the previous frame image to the filtering template corresponding to the previous frame image; and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
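Read as an algorithm, this per-frame loop has the shape of a classic correlation-filter tracker. The sketch below is a minimal single-channel illustration, not the patented implementation: it assumes grayscale search patches, a MOSSE-style closed-form fit in the Fourier domain, and an ideal Gaussian label; all function names and the regularization constant `lam` are illustrative.

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    """Ideal response: a Gaussian peak at the target position."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / (2 * sigma ** 2))

def fit_template(patch, label, lam=1e-2):
    """MOSSE-style closed form: the template whose correlation with the
    patch best matches the label, with ridge regularization."""
    F, Y = np.fft.fft2(patch), np.fft.fft2(label)
    return (Y * np.conj(F)) / (F * np.conj(F) + lam)

def detect(patch, template):
    """Apply the previous frame's template; the response peak gives
    the target position in the current frame."""
    resp = np.real(np.fft.ifft2(template * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(resp), resp.shape), resp

# dummy video: a list of 64x64 grayscale search patches
rng = np.random.default_rng(0)
video_patches = [rng.random((64, 64)) for _ in range(5)]

template = None
for patch in video_patches:
    if template is None:                      # first frame: assume a centred target box
        pos = (patch.shape[0] // 2, patch.shape[1] // 2)
    else:
        pos, resp = detect(patch, template)   # step 1: locate the target with the old template
    label = gaussian_label(patch.shape, pos)  # steps 2-3: label built from the detected position
    template = fit_template(patch, label)     # new template, used on the *next* frame
```

The key property exploited here is that element-wise operations in the Fourier domain replace spatial correlation, so both detection and the template update run in O(N log N) per frame.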
The above description covers only preferred embodiments of the present application and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but are not limited to) features having similar functions disclosed in the present application.
Claims (13)
1. A method of tracking a target object, comprising: performing, for each frame of image in a video to be processed, the following target tracking operation:
determining the position information of a target object in the current frame image through a filtering template corresponding to the previous frame image;
generating response information of the previous frame image to the filtering template corresponding to the previous frame image;
and obtaining a filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
2. The method according to claim 1, wherein obtaining the filtering template corresponding to the current frame image according to the position information and the response information comprises:
combining the position information and the response information to obtain label information of the current frame image;
and obtaining the filtering template corresponding to the current frame image by minimizing the difference between the label information and target response information, wherein the target response information represents the response information of the current frame image to the filtering template corresponding to the current frame image.
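One plausible reading of this claim, sketched below under stated assumptions: the label blends an ideal peak at the detected position with the normalized previous response, and the template is the ridge-regression minimizer of the gap between that label and the template's response on the current frame. The blend weight `alpha`, `sigma`, and `lam` are invented for illustration.

```python
import numpy as np

def combine_label(shape, pos, prev_response, alpha=0.8, sigma=2.0):
    """Label information = position information (ideal peak at the
    detected position) combined with the previous response information."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    ideal = np.exp(-((yy - pos[0]) ** 2 + (xx - pos[1]) ** 2) / (2 * sigma ** 2))
    r = prev_response - prev_response.min()
    r = r / (r.max() + 1e-12)            # normalize the observed response to [0, 1]
    return alpha * ideal + (1.0 - alpha) * r

def minimize_template(patch, label, lam=1e-2):
    """Closed-form minimizer of ||response(template, patch) - label||^2
    + lam * ||template||^2, solved per frequency bin."""
    F, Y = np.fft.fft2(patch), np.fft.fft2(label)
    return (Y * np.conj(F)) / (F * np.conj(F) + lam)
```

Blending in the previous response, rather than regressing to a fixed ideal peak, is what lets the label information carry temporal evidence from frame to frame.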
3. The method of claim 2, wherein obtaining the filtering template corresponding to the current frame image by minimizing the difference between the label information and the target response information comprises:
applying a preset weighting window to the label information and the target response information to obtain weighted information;
and obtaining the filtering template corresponding to the current frame image by minimizing the weighted information.
4. The method according to claim 3, wherein the center point of the preset weighting window carries a first weight and the non-center points of the preset weighting window carry a second weight; and
applying the preset weighting window to the label information and the target response information to obtain the weighted information comprises:
applying the preset weighting window to the label information and the target response information to obtain weighted information in which the first weight and the second weight distinguish the matching loss at the center point of the target object from the matching loss at non-center points.
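A hedged sketch of what such a weighting window could look like: the first weight over a small neighborhood of the target center, the second weight everywhere else, so mismatches at the center contribute more to the matching loss. The neighborhood radius and the weight values are assumptions, not taken from the patent.

```python
import numpy as np

def preset_weighting_window(shape, center, w_center=4.0, w_other=1.0, radius=3):
    """First weight within a small neighbourhood of the centre point,
    second weight at all non-centre points."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    win = np.full(shape, w_other, dtype=float)
    win[(yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2] = w_center
    return win

def weighted_matching_loss(label, target_response, win):
    """The window acts on label and response: centre mismatches cost more."""
    return float(np.sum(win * (label - target_response) ** 2))
```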
5. The method of claim 3, wherein obtaining the filtering template corresponding to the current frame image by minimizing the weighted information comprises:
minimizing the weighted information under a preset constraint for distinguishing the background from the foreground of the target object in the image, to obtain the filtering template corresponding to the current frame image.
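The claim leaves the exact form of the constraint open. One common realization in correlation filtering (for example, BACF-style trackers) is a binary crop mask that confines the template's spatial support to the target region and zeroes the background; the sketch below assumes that reading, and both function names are illustrative.

```python
import numpy as np

def foreground_mask(filter_shape, target_shape):
    """1 over the centred target (foreground) region, 0 over the
    surrounding background; used as a support constraint."""
    mask = np.zeros(filter_shape)
    (fy, fx), (ty, tx) = filter_shape, target_shape
    y0, x0 = (fy - ty) // 2, (fx - tx) // 2
    mask[y0:y0 + ty, x0:x0 + tx] = 1.0
    return mask

def apply_constraint(template_freq, mask):
    """Project a Fourier-domain template onto the constraint set:
    back to the spatial domain, zero the background, forward again."""
    spatial = np.real(np.fft.ifft2(template_freq))
    return np.fft.fft2(spatial * mask)
```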
6. The method according to claim 3, wherein applying the preset weighting window to the label information and the target response information to obtain the weighted information, and obtaining the filtering template corresponding to the current frame image by minimizing the weighted information, comprise:
for each channel of each frame image, applying the preset weighting window to the label information and the target response information corresponding to the channel to obtain weighted information corresponding to the channel, and obtaining a filtering template corresponding to that channel of the current frame image by minimizing the weighted information; and
determining the position information of the target object in the current frame image through the filtering template corresponding to the previous frame image comprises:
applying, to each channel of the current frame image, the filtering template corresponding one-to-one to that channel of the previous frame image, and determining the position information of the target object in the current frame image.
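A sketch of this per-channel variant, assuming a C x H x W feature array and the same closed-form fit as the earlier examples: each channel gets its own template, and at detection time each template acts on its matching channel, with the per-channel responses accumulated before the peak is taken.

```python
import numpy as np

def fit_templates_per_channel(features, label, lam=1e-2):
    """features: C x H x W array; one template per feature channel."""
    Y = np.fft.fft2(label)
    templates = []
    for chan in features:
        F = np.fft.fft2(chan)
        templates.append((Y * np.conj(F)) / (F * np.conj(F) + lam))
    return templates

def detect_per_channel(features, templates):
    """Apply each channel's template to the matching channel of the
    current frame and sum the responses; the peak gives the position."""
    resp = sum(np.real(np.fft.ifft2(H * np.fft.fft2(chan)))
               for chan, H in zip(features, templates))
    return np.unravel_index(np.argmax(resp), resp.shape), resp
```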
7. The method of claim 6, wherein applying the preset weighting window to the label information and the target response information corresponding to the channel to obtain the weighted information corresponding to the channel, and obtaining the filtering template corresponding to that channel of the current frame image by minimizing the weighted information, comprise:
in a multi-scale filtering mode, applying the preset weighting window to the label information and the target scale response information corresponding to the channel, and obtaining multi-scale filtering templates corresponding to the channels of the current frame image by minimizing the weighted information, wherein the target scale response information represents the response information of the channel of the current frame image to a single-scale filtering template among the multi-scale filtering templates.
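Multi-scale filtering can be sketched as one template per scale, with the target scale chosen by whichever single-scale template produces the strongest peak. A single-channel illustration follows; the scale set and the nearest-neighbor rescale are assumptions.

```python
import numpy as np

def rescale(patch, s):
    """Nearest-neighbour rescale about the patch centre, same shape."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy = np.clip(np.round((np.arange(h) - cy) / s + cy).astype(int), 0, h - 1)
    xx = np.clip(np.round((np.arange(w) - cx) / s + cx).astype(int), 0, w - 1)
    return patch[np.ix_(yy, xx)]

def best_scale(patch, scale_templates, scales=(0.95, 1.0, 1.05)):
    """Evaluate each single-scale template and keep the strongest peak."""
    best = None
    for s, template in zip(scales, scale_templates):
        resp = np.real(np.fft.ifft2(template * np.fft.fft2(rescale(patch, s))))
        peak = resp.max()
        if best is None or peak > best[0]:
            best = (peak, s, np.unravel_index(np.argmax(resp), resp.shape))
    return best  # (peak value, chosen scale, position at that scale)
```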
8. The method of claim 1, further comprising:
for a first frame image in the video to be processed, determining a filtering template corresponding to the first frame image according to a target box used for representing the target object in the first frame image.
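A minimal reading of this first-frame initialization: crop the patch indicated by the target box out of the first frame and fit the initial template against an ideal centered Gaussian label. The (x, y, w, h) box convention and all parameters below are assumptions.

```python
import numpy as np

def init_template(first_frame, box, lam=1e-2, sigma=2.0):
    """first_frame: H x W array; box = (x, y, w, h) around the target."""
    x, y, w, h = box
    patch = first_frame[y:y + h, x:x + w]
    yy, xx = np.mgrid[0:h, 0:w]
    label = np.exp(-((yy - (h - 1) / 2) ** 2 + (xx - (w - 1) / 2) ** 2) / (2 * sigma ** 2))
    F, Y = np.fft.fft2(patch), np.fft.fft2(label)
    return (Y * np.conj(F)) / (F * np.conj(F) + lam)

# usage: build the template for frame 1, then the per-frame loop takes over
rng = np.random.default_rng(0)
frame0 = rng.random((240, 320))
template0 = init_template(frame0, box=(120, 80, 64, 64))
```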
9. A target object tracking device that performs, by the following units, a target tracking operation for each frame of image in a video to be processed:
a first determining unit configured to determine the position information of the target object in the current frame image through a filtering template corresponding to the previous frame image;
a second determining unit configured to generate response information of the previous frame image to the filtering template corresponding to the previous frame image;
and an obtaining unit configured to obtain a filtering template corresponding to the current frame image according to the position information and the response information, wherein the filtering template corresponding to the current frame image is used for determining the position information of the target object in the next frame image.
10. The apparatus of claim 9, wherein the obtaining unit is further configured to:
combine the position information and the response information to obtain label information of the current frame image; and obtain the filtering template corresponding to the current frame image by minimizing the difference between the label information and target response information, wherein the target response information represents the response information of the current frame image to the filtering template corresponding to the current frame image.
11. The apparatus of claim 10, wherein the obtaining unit is further configured to:
apply a preset weighting window to the label information and the target response information to obtain weighted information; and obtain the filtering template corresponding to the current frame image by minimizing the weighted information.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-8.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon, wherein the one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110592320.0A CN113393493B (en) | 2021-05-28 | 2021-05-28 | Target object tracking method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113393493A | 2021-09-14 |
CN113393493B | 2024-04-05 |
Family
ID=77619549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110592320.0A (granted as CN113393493B, active) | Target object tracking method and device | 2021-05-28 | 2021-05-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113393493B (en) |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109146912A * | 2018-07-26 | 2019-01-04 | Hunan University of Humanities, Science and Technology | Visual target tracking method based on objective analysis |
CN109410247A * | 2018-10-16 | 2019-03-01 | China University of Petroleum (East China) | Video tracking algorithm with multiple templates and adaptive feature selection |
CN110084836A * | 2019-04-26 | 2019-08-02 | Xidian University | Target tracking method based on response fusion of deep convolution dividing features |
CN110097575A * | 2019-04-28 | 2019-08-06 | University of Electronic Science and Technology of China | Target tracking method based on local features and scale pooling |
WO2020228522A1 * | 2019-05-10 | 2020-11-19 | Tencent Technology (Shenzhen) Co., Ltd. | Target tracking method and apparatus, storage medium and electronic device |
US20200380274A1 * | 2019-06-03 | 2020-12-03 | Nvidia Corporation | Multi-object tracking using correlation filters in video analytics applications |
CN110349190A * | 2019-06-10 | 2019-10-18 | Guangzhou Shiyuan Electronic Technology Co., Ltd. | Adaptive-learning target tracking method, apparatus and device, and readable storage medium |
CN111161323A * | 2019-12-31 | 2020-05-15 | Chongqing Innovation Center of Beijing Institute of Technology | Complex-scene target tracking method and system based on correlation filtering |
CN111899278A * | 2020-06-22 | 2020-11-06 | Beihang University | Mobile-terminal-based rapid target tracking method for unmanned aerial vehicle images |
CN111931722A * | 2020-09-23 | 2020-11-13 | Hangzhou Shiyu Intelligent Vision System Technology Co., Ltd. | Correlation filtering tracking method combining color ratio features |
CN112036381A * | 2020-11-03 | 2020-12-04 | Shenzhen Research Institute of Sun Yat-sen University | Visual tracking method, video monitoring method and terminal equipment |
CN112233143A * | 2020-12-14 | 2021-01-15 | Zhejiang Dahua Technology Co., Ltd. | Target tracking method, device and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
HOU Ying; WANG Ying; LIN Xinyu: "Research on multi-scale video target tracking algorithms", Information Technology and Informatization, no. 04 *
Also Published As
Publication number | Publication date |
---|---|
CN113393493B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10902245B2 (en) | Method and apparatus for facial recognition | |
CN108898086B (en) | Video image processing method and device, computer readable medium and electronic equipment | |
CN109816589B (en) | Method and apparatus for generating cartoon style conversion model | |
CN109255337B (en) | Face key point detection method and device | |
CN108197618B (en) | Method and device for generating human face detection model | |
CN112132847A (en) | Model training method, image segmentation method, device, electronic device and medium | |
CN109389072B (en) | Data processing method and device | |
US11941529B2 (en) | Method and apparatus for processing mouth image | |
EP3598386A1 (en) | Method and apparatus for processing image | |
CN109377508B (en) | Image processing method and device | |
CN110516678B (en) | Image processing method and device | |
CN110059623B (en) | Method and apparatus for generating information | |
CN113379627A (en) | Training method of image enhancement model and method for enhancing image | |
CN110874853A (en) | Method, device and equipment for determining target motion and storage medium | |
CN112907628A (en) | Video target tracking method and device, storage medium and electronic equipment | |
CN112991218B (en) | Image processing method, device, equipment and storage medium | |
CN114677422A (en) | Depth information generation method, image blurring method and video blurring method | |
CN111445496B (en) | Underwater image recognition tracking system and method | |
CN110288625B (en) | Method and apparatus for processing image | |
CN117011137B (en) | Image stitching method, device and equipment based on RGB similarity feature matching | |
CN110852250B (en) | Vehicle weight removing method and device based on maximum area method and storage medium | |
CN109523564B (en) | Method and apparatus for processing image | |
CN110895699B (en) | Method and apparatus for processing feature points of image | |
CN110827254A (en) | Method and device for determining image definition | |
US20240177409A1 (en) | Image processing method and apparatus, electronic device, and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100176 601, 6th floor, building 2, No. 18, Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing
Applicant after: Jingdong Technology Information Technology Co.,Ltd.
Address before: 100176 601, 6th floor, building 2, No. 18, Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing
Applicant before: Jingdong Shuke Haiyi Information Technology Co.,Ltd.
GR01 | Patent grant | ||