CN110059521B

CN110059521B - Target tracking method and device

Info

Publication number: CN110059521B
Application number: CN201810049002.8A
Authority: CN
Inventors: 黄元捷
Original assignee: Zhejiang Uniview Technologies Co Ltd
Current assignee: Zhejiang Uniview Technologies Co Ltd
Priority date: 2018-01-18
Filing date: 2018-01-18
Publication date: 2022-05-13
Anticipated expiration: 2038-01-18
Also published as: CN110059521A

Abstract

The invention provides a target tracking method and a target tracking device, which are applied to a server storing a characteristic model corresponding to each target object. The method comprises the following steps: carrying out target detection on the current video frame image, and extracting corresponding CNN characteristics according to the position information of each object to be detected obtained by detection; calculating to obtain a corresponding similarity matrix according to the position information and the CNN characteristics of each object to be detected and the position information and the characteristic model of each target object in the previous video frame image; performing data association on each object to be detected and each target object based on the similarity matrix to obtain an optimal matching result; and if the optimal matching result has an object to be detected which is successfully matched with the corresponding target object, updating the corresponding feature model according to the CNN feature of the object to be detected, and obtaining a corresponding tracking result based on the object to be detected. The method has the advantages of strong anti-interference capability and high tracking success rate, and can continuously track the target object.

Description

Target tracking method and device

Technical Field

The invention relates to the technical field of multi-target tracking of video images, in particular to a target tracking method and device.

Background

With the continuous development of monitoring technologies, the application of multi-target tracking technologies for tracking multiple target objects in a monitored video is becoming more and more extensive. In the existing multi-target tracking scheme, in the tracking process of a target object, the target object is tracked by comparing a CNN (Convolutional Neural Network) feature of the target object in a current video image with a CNN feature of the target object when the target object is tracked successfully recently, but the multi-target tracking scheme is weak in anti-interference capability and low in target tracking success rate, and usually, because the CNN feature of the target object when the target object is tracked successfully recently carries a feature of a partial obstruction, the CNN feature of the target object in the current video image cannot be correctly matched with the CNN feature of the target object when the target object is tracked successfully recently, so that tracking failure is caused.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention aims to provide a target tracking method and a target tracking device.

As for the method, a preferred embodiment of the present invention provides a target tracking method, which is applied to a server, where the server stores feature models corresponding to target objects, where each feature model includes a history CNN feature of a corresponding target object, and the method includes:

carrying out target detection on a current video frame image, and extracting CNN characteristics corresponding to each object to be detected from the current video frame image according to the position information of each object to be detected in the detected current video frame image;

calculating to obtain a similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image according to the position information and the corresponding CNN characteristic of each object to be detected in the current video frame image, and the position information and the corresponding characteristic model of each target object in the previous video frame image;

performing data association on each object to be detected and each target object based on the similarity matrix to obtain an optimal matching result between the current video frame image and the previous video frame image;

and if the optimal matching result contains an object to be detected which is successfully matched with the corresponding target object, updating the characteristic model corresponding to the target object according to the CNN characteristic of the object to be detected which is successfully matched with the corresponding target object, and obtaining a corresponding tracking result based on the object to be detected which is successfully matched. According to the method, an optimal similarity matrix between each object to be detected in a current video frame image and each target object in a previous video frame image is calculated and obtained according to the CNN characteristics of each object to be detected in the current video frame image and the historical CNN characteristics included in the characteristic model of each target object in the previous video frame image, an optimal matching result between the current video frame image and the previous video frame image is obtained based on the similarity matrix, and a corresponding tracking result is obtained based on the object to be detected which is successfully matched with the corresponding target object and exists in the optimal matching result, so that the influence of an interfering object on target tracking is reduced, the target tracking success rate is improved, and the continuous tracking of the target object is realized.

In terms of an apparatus, a preferred embodiment of the present invention provides a target tracking apparatus applied to a server, where the server stores feature models corresponding to target objects, where each feature model includes a history CNN feature of a corresponding target object, and the apparatus includes:

the detection extraction module is used for carrying out target detection on the current video frame image and extracting CNN characteristics corresponding to each object to be detected from the current video frame image according to the position information of each object to be detected in the detected current video frame image;

the matrix calculation module is used for calculating to obtain a similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image according to the position information and the corresponding CNN characteristic of each object to be detected in the current video frame image and the position information and the corresponding characteristic model of each target object in the previous video frame image;

the image matching module is used for performing data association on each object to be detected and each target object based on the similarity matrix to obtain an optimal matching result between the current video frame image and the previous video frame image;

and the updating and tracking module is used for updating the characteristic model corresponding to the target object according to the CNN characteristics of the object to be detected successfully matched with the corresponding target object and obtaining a corresponding tracking result based on the successfully matched object to be detected if the object to be detected successfully matched with the corresponding target object exists in the optimal matching result.

Compared with the prior art, the target tracking method and device provided by the preferred embodiment of the invention have the following beneficial effects: the target tracking method is strong in anti-interference capability and high in target tracking success rate, and can continuously track the target object. The target tracking method is applied to a server, and the server stores characteristic models corresponding to target objects, wherein each characteristic model comprises historical CNN characteristics of the corresponding target object. Firstly, the method comprises the steps of carrying out target detection on a current video frame image to obtain each object to be detected in the current video frame image, and extracting CNN characteristics corresponding to each object to be detected from the current video frame image according to position information of each object to be detected in the detected current video frame image; then, according to the position information and corresponding CNN characteristics of each object to be detected in the current video frame image, and the position information and corresponding CNN characteristics of each target object in the previous video frame image, calculating to obtain an optimal similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image; then, the method carries out data association on each object to be detected and each target object based on the similarity matrix to obtain an optimal matching result between the current video frame image and the previous video frame image; finally, when the optimal matching result has an object to be detected which is successfully matched with the corresponding target object, the method updates the feature model corresponding to the target object according to the CNN feature of the object to be detected which is successfully matched with the corresponding target object, and obtains a corresponding tracking result based on the object to be detected which is successfully matched, thereby reducing the influence of an interfering object on target tracking, improving the success rate of target tracking and realizing continuous tracking of the target object.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments are briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the claims of the present invention, and it is obvious for those skilled in the art that other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 is a block diagram of a server according to a preferred embodiment of the present invention.

Fig. 2 is a flowchart illustrating a target tracking method according to a preferred embodiment of the invention.

Fig. 3 is a flowchart illustrating the sub-steps included in step S220 shown in fig. 2.

Fig. 4 is a flowchart illustrating the sub-steps included in step S240 shown in fig. 2.

FIG. 5 is a block diagram of the target tracking device shown in FIG. 1 according to a preferred embodiment of the present invention.

FIG. 6 is a block diagram of the matrix calculation module shown in FIG. 5.

Icon: 10-a server; 11-a memory; 12-a processor; 13-a communication unit; 100-a target tracking device; 110-a detection extraction module; 120-a matrix calculation module; 130-an image matching module; 140-update the tracking module; 121-similarity calculation operator module; 122-matrix generation submodule.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

Fig. 1 is a block diagram of a server 10 according to a preferred embodiment of the present invention. In the embodiment of the present invention, the server 10 is configured to perform target tracking on each monitored object in the acquired monitoring video, where the target tracking has strong anti-interference capability and high tracking success rate, and the server 10 may be, but is not limited to, a cloud server, a distributed server, a centralized server, and the like.

In this embodiment, the server 10 includes a target tracking device 100, a memory 11, a processor 12, and a communication unit 13. The memory 11, the processor 12 and the communication unit 13 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.

The memory 11 may be configured to store feature models corresponding to target objects in a surveillance video, where each feature model includes a historical CNN feature extracted when a corresponding target object is tracked by the server 10, the target object is an object to be tracked in the surveillance video, and the target object may be a person, a vehicle, an animal, and/or a plant. The Memory 11 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), and the like. The memory 11 may store a software program, and the processor 12 may execute the software program after receiving an execution instruction.

The processor 12 may be an integrated circuit chip having signal processing capabilities. The Processor 12 may be a general-purpose Processor including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also digital signal processors, application specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The processor 12 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The communication unit 13 is configured to establish a communication connection between the server 10 and another external device via a network, and to transmit and receive data via the network. The server 10 obtains a surveillance video that needs to be subjected to target tracking from the surveillance device through the communication unit 13, and after the target tracking of the surveillance video is completed, the surveillance video subjected to target tracking can be displayed on the display device through the communication unit 13.

The target tracking device 100 includes at least one software functional module that can be stored in the memory 11 in the form of software or firmware. The processor 12 may be used to execute executable modules stored in the memory 11 corresponding to the target tracking device 100, such as software functional modules and computer programs included in the target tracking device 100. In this embodiment, the target tracking apparatus 100 has a strong anti-interference capability, and can perform target tracking with a high success rate of continuously tracking the target objects in the monitoring video in a manner of comparing the CNN features of each object to be detected in the current video frame image with the historical CNN features included in the feature models of each target object one by one.

It is to be understood that the block diagram shown in fig. 1 is merely a schematic diagram of one structural component of the server 10, and that the server 10 may include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.

Fig. 2 is a block diagram of a target tracking method according to a preferred embodiment of the invention. In the embodiment of the present invention, the target tracking method is applied to the server 10, and is used for continuously performing target tracking on each target object in a monitored video, where the target tracking has strong anti-interference capability and a high tracking success rate, and feature models corresponding to each target object in the monitored video are stored in the server 10, and each feature model includes a history CNN feature of the corresponding target object. The following describes the specific flow and steps of the target tracking method shown in fig. 2 in detail.

In an embodiment of the present invention, the target tracking method includes the following steps:

step S210, performing target detection on the current video frame image, and extracting CNN characteristics corresponding to each object to be detected from the current video frame image according to the position information of each object to be detected in the detected current video frame image.

In this embodiment, the surveillance video acquired by the server 10 may be formed by continuously displaying a plurality of video frame images, and the server 10 may complete target tracking on an object to be tracked in the surveillance video by comparing CNN characteristics of each target object that may exist in the plurality of video frame images. The server 10 may obtain position information of each object to be detected in the current video frame image by performing target detection on the current video frame image, and deduct a feature region corresponding to each object to be detected from a corresponding position in the current video frame image according to the position information of each object to be detected, so as to extract a CNN feature corresponding to each object to be detected from the feature region of each object to be detected. The object to be detected is an object detected in a current video frame image, where the object to be detected may include an object that has been continuously tracked in at least one previous video frame image of the current video frame image arranged according to a time sequence in the surveillance video, and an object that newly appears in the current video frame image and needs to be tracked for the first time.

Step S220, calculating to obtain a similarity matrix between each object to be detected in the current video frame image and each object in the previous video frame image according to the position information and the corresponding CNN feature of each object to be detected in the current video frame image, and the position information and the corresponding feature model of each object in the previous video frame image.

In this embodiment, the previous video frame image is a previous video frame image of the current video frame image arranged according to a time sequence in the surveillance video, the target objects in the previous video frame image are all the target objects acquired from the surveillance video by the server 10 before performing target tracking on the current video frame image, and the target objects in the previous video frame image include the target objects directly exposed in the previous video frame image and the target objects not directly exposed in the previous video frame image before performing target detection on the current video frame image.

In this embodiment, the server 10 obtains an optimal similarity matrix between each object to be detected in the current video frame image and each object in the previous video frame image by comparing the CNN feature of each object to be detected in the current video frame image in the video frame image with the historical CNN feature included in the feature model corresponding to each object in the previous video frame image, and comparing the position information of each object to be detected in the current video frame image in the video frame image with the position information of each object in the previous video frame image that should correspond to each other. Wherein the historical CNN features included in each feature model are CNN features in the corresponding video frame image when the corresponding target object is successfully tracked. For example, a first video frame image, a third video frame image, a fifth video frame image, and a seventh video frame image are sequentially arranged according to a time sequence, a target object is successfully tracked in the first video frame image, the fifth video frame image, and the seventh video frame image, and if the number of the historical CNN features in the feature model is not limited, the feature model of the target object includes the CNN features corresponding to the target object in the first video frame image, the fifth video frame image, and the seventh video frame image.

Optionally, please refer to fig. 3, which is a flowchart illustrating the sub-steps included in step S220 shown in fig. 2. In this embodiment, the step of calculating the similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image in step S220 may include sub-step S221, sub-step S222, and sub-step S223:

and a substep S221, calculating and obtaining the feature similarity between each object to be detected and each target object based on the historical CNN features in the feature model corresponding to each target object.

In this embodiment, the server 10 may obtain the optimal feature similarity between each object to be detected and each target object by comparing and calculating the CNN features corresponding to each object to be detected in the current video frame image with all the historical CNN features included in the feature model of each target object in the previous video frame image.

Optionally, the step of calculating, based on the historical CNN features in the feature model corresponding to each target object, to obtain the feature similarity between each object to be detected and each target object includes:

calculating the cosine distance between the CNN characteristic of each object to be detected and each historical CNN characteristic in the characteristic model corresponding to each target object to obtain each cosine distance between the object to be detected and the corresponding target object;

and selecting the cosine distance with the minimum value from the cosine distances as the characteristic similarity between the object to be detected and the corresponding target object.

The optimal feature similarity between each object to be detected and the corresponding target object, which can be obtained according to the above steps, can be calculated by the following formula:

wherein M is_iIndicates a target serial number ofi characteristic model of the target object, F_i ⁰Initial CNN feature, F, representing the target object_i ⁿRepresenting the historical CNN feature, aff, of the target object in the nth video frame image_appRepresenting the degree of similarity of features between the object to be detected and the corresponding target object, F_iRepresenting the historical CNN characteristics of the corresponding target object,

representing the CNN characteristic, cosine (F) of the object to be examined_i，

) Feature F representing history CNN of target object_iCorresponding to CNN characteristics of object to be detected

The cosine distance between.

The substep S222 is to calculate and obtain the spatial similarity and the shape similarity between each object to be detected and each target object based on the position information and the target size information of each target object in the previous video frame image and the position information and the target size information of each object to be detected in the current video frame image;

in this embodiment, the position information of each target object in the previous video frame image includes X coordinate information and Y coordinate information of a coordinate point at the upper left corner of a corresponding feature region of the corresponding target object in the previous video frame image, the target size information of each target object in the previous video frame image includes a region width and a region height of a corresponding feature region of the corresponding target object in the previous video frame image, the position information of each object to be detected in the current video frame image includes X coordinate information and Y coordinate information of a coordinate point at the upper left corner of a corresponding feature region of the object to be detected in the current video frame image, and the target size information of each object to be detected in the current video frame image includes a region width and a region height of a corresponding feature region of the object to be detected in the current video frame image. The server 10 calculates and obtains the spatial similarity between each object to be detected and each target object according to the area width, the area height, the X coordinate information and the Y coordinate information of each object to be detected, and the X coordinate information and the Y coordinate information of each target object; the server 10 calculates and obtains the shape similarity between each object to be detected and each target object according to the area width, the area height, the X coordinate information, the Y coordinate information of each object to be detected, and the area width and the area height of each target object. In this embodiment, the spatial similarity and the shape similarity both conform to the matching criteria of the hungarian algorithm and the extended algorithm thereof. Wherein the spatial similarity and the shape similarity can be calculated by the following formula:

wherein, trk_iRepresenting the ith target object, det_jRepresenting the jth object to be detected, X, Y, W, H respectively representing the x coordinate value, y coordinate value, area width and area height of the upper left corner point of the corresponding characteristic area of the object in the corresponding video frame image, and aff_mot(trk_i，det_j) Representing the spatial similarity, aff, between the ith target object and the jth object to be detected_shp(trk_i，det_j) And representing the shape similarity between the ith target object and the jth object to be detected.

And a substep S223 of calculating the association similarity between each object to be detected and each target object according to the feature similarity, the spatial similarity and the shape similarity between each object to be detected and each target object, and correspondingly obtaining the similarity matrix.

In this embodiment, the server 10 obtains the optimal association similarity between each object to be detected and each target object by multiplying the feature similarity, the spatial similarity, and the shape similarity between each object to be detected and each target object, and arranges the optimal association similarity between each object to be detected and each target object in a matrix form to generate the optimal similarity matrix between the current video frame image and the previous video frame image.

Referring to fig. 2 again, in step S230, data association is performed between each object to be detected and each target object based on the similarity matrix, so as to obtain an optimal matching result between the current video frame image and the previous video frame image.

In this embodiment, the server 10 performs data association between each object to be detected and each target object based on the similarity matrix by using the hungarian algorithm or the extended algorithm thereof, so as to obtain an optimal matching result between the current video frame image and the previous video frame image.

Step S240, if there is an object to be detected successfully matched with the corresponding target object in the optimal matching result, updating the feature model corresponding to the target object according to the CNN feature of the object to be detected successfully matched with the corresponding target object, and obtaining a corresponding tracking result based on the successfully matched object to be detected.

In this embodiment, when obtaining the optimal matching result between the current video frame image and the previous video frame image, the server 10 performs object division on each object to be detected in the current video frame image, so as to obtain an object to be detected that is successfully matched with the corresponding target object in the previous video frame image, an object to be detected that is less matched with the corresponding target object in the previous video frame image, and an object to be detected that is not matched with each target object in the previous video frame image and that newly appears in the current video frame image and needs to be tracked.

In this embodiment, for an object to be detected that needs to be tracked newly appearing in the current video frame image, the server 10 may use the CNN feature of the object to be detected in the current video frame image as the initial CNN feature of the object to be detected, create a feature model of the object to be detected based on the initial CNN feature, and perform parameter correction on the object to be detected by using a Kalman filter to obtain a tracking result of the object to be detected. When the server 10 performs target tracking on a video frame image subsequent to the current video frame image, the created object to be detected is used as a target object of the monitoring video, and the target tracking is performed by using the feature model of the object to be detected.

In this embodiment, for an object to be detected with a low matching degree with a corresponding target object in the previous video frame image in the current video frame image, the server 10 predicts the position information of the object to be detected in the previous video frame image based on a Kalman filter, and determines whether to remove the tracker of the object to be detected according to the prediction result. If the prediction result indicates that the position information of the object to be detected in the past video frame image is unchanged for a long time and the total predicted time length of the object to be detected in the past video frame image is greater than a preset time length threshold, the server 10 removes the tracker of the object to be detected; if the prediction result indicates that the position information of the object to be detected in the previous video frame image is unchanged for a long time and the total predicted time length of the object to be detected in the previous video frame image is smaller than the preset time length threshold, the server 10 performs parameter correction on the object to be detected by using a Kalman filter to obtain the tracking result of the object to be detected.

In this embodiment, for the object to be detected in the current video frame image, which is successfully matched with the corresponding target object in the previous video frame image, the server 10 updates the feature model of the target object, which is successfully matched with the object to be detected, according to the CNN feature of the object to be detected in the current video frame image, and performs parameter correction on the object to be detected by using a Kalman filter to obtain the tracking result of the object to be detected, where the object to be detected is the target object which is successfully matched with the object to be detected.

Optionally, please refer to fig. 4, which is a flowchart illustrating the sub-steps included in step S240 shown in fig. 2. In this embodiment, the step of updating the feature model corresponding to the target object according to the CNN feature of the object to be detected successfully matched with the corresponding target object in step S240 includes substeps S241 and substep S242:

and a substep S241 of counting the feature number of the historical CNN features in the feature model corresponding to the target object to obtain the corresponding feature total number.

In this embodiment, when the feature model of the corresponding target object is updated, the server 10 obtains the total number of the corresponding features by counting the number of features of the history CNN features in the feature model of the target object.

And a substep S242, comparing the total number of the features with a preset feature storage number, and adding the CNN features of the object to be detected, which are successfully matched with the target object, into a feature model corresponding to the target object according to a comparison result.

In this embodiment, the step of adding, by the server 10, the CNN feature of the object to be detected, which is successfully matched with the target object, to the feature model corresponding to the target object according to the comparison result includes:

if the comparison result is that the total number of the features is smaller than the preset feature storage number, directly adding the CNN features of the object to be detected, which are successfully matched with the target object, into a corresponding feature model for storage;

and if the comparison result is that the total number of the features is not less than the preset feature storage number, replacing any one of the historical CNN features except the initial CNN feature in the corresponding feature model with the CNN feature of the object to be detected so as to add the CNN feature of the object to be detected into the feature model corresponding to the target object.

The number of the preset feature storage may be 10, 15, or 25, and the values may be configured differently according to actual requirements.

Fig. 5 is a block diagram of the target tracking device 100 shown in fig. 1 according to a preferred embodiment of the present invention. In the embodiment of the present invention, the target tracking device 100 includes a detection extraction module 110, a matrix calculation module 120, an image matching module 130, and an update tracking module 140.

The detection extraction module 110 is configured to perform target detection on a current video frame image, and extract, from the current video frame image, CNN features corresponding to each object to be detected according to position information of each object to be detected in the current video frame image.

In this embodiment, the detection extraction module 110 may execute step S210 shown in fig. 2, and the specific execution process may refer to the above detailed description of step S210.

The matrix calculation module 120 is configured to calculate, according to the position information and the corresponding CNN characteristic of each object to be detected in the current video frame image, and the position information and the corresponding characteristic model of each target object in the previous video frame image, to obtain a similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image.

In this embodiment, the matrix calculation module 120 may perform step S220 shown in fig. 2, and the specific implementation process may refer to the detailed description of step S220 above.

Fig. 6 is a block diagram of the matrix calculation module 120 shown in fig. 5. In this embodiment, the matrix calculation module 120 includes a similarity calculation sub-module 121 and a matrix generation sub-module 122.

The similarity operator module 121 is configured to calculate, based on the historical CNN features in the feature model corresponding to each target object, feature similarities between each object to be detected and each target object.

In this embodiment, the way for calculating, by the similarity operator module 121, the feature similarity between each object to be detected and each target object based on the historical CNN features in the feature model corresponding to each target object includes:

The similarity operator module 121 may perform the sub-step S221 shown in fig. 3, and the detailed implementation process may refer to the detailed description of the sub-step S221 above.

The similarity operator module 121 is further configured to calculate, based on the position information and the target size information of each target object in the previous video frame image, and the position information and the target size information of each target object in the current video frame image, a spatial similarity and a shape similarity between each target object and each target object.

In this embodiment, the similarity operator module 121 may further perform the sub-step S222 shown in fig. 3, and the specific implementation process may refer to the detailed description of the sub-step S222 above.

The matrix generation submodule 122 is configured to calculate association similarities between each object to be detected and each target object according to the feature similarity, the spatial similarity, and the shape similarity between each object to be detected and each target object, and accordingly obtain the similarity matrix.

In this embodiment, the matrix generation sub-module 122 may perform the sub-step S223 shown in fig. 3, and the detailed implementation process may refer to the detailed description of the sub-step S223 above.

Referring to fig. 5 again, the image matching module 130 is configured to perform data association between each object to be detected and each target object based on the similarity matrix, so as to obtain an optimal matching result between the current video frame image and the previous video frame image.

In this embodiment, the image matching module 130 may execute step S230 shown in fig. 2, and the specific execution process may refer to the above detailed description of step S230.

The update tracking module 140 is configured to, if an object to be detected successfully matched with the corresponding target object exists in the optimal matching result, update the feature model corresponding to the target object according to the CNN feature of the object to be detected successfully matched with the corresponding target object, and obtain a corresponding tracking result based on the object to be detected successfully matched.

In this embodiment, the manner of updating the feature model corresponding to the target object by the update tracking module 140 according to the CNN feature of the object to be detected successfully matched with the corresponding target object includes:

counting the feature number of the historical CNN features in the feature model corresponding to the target object to obtain the corresponding feature total number;

and comparing the total number of the features with the preset stored number of the features, and adding the CNN features of the object to be detected, which are successfully matched with the target object, into the feature model corresponding to the target object according to the comparison result.

The manner in which the update tracking module 140 adds the CNN feature of the object to be detected, which is successfully matched with the target object, to the feature model corresponding to the target object according to the comparison result includes:

In this embodiment, the update tracking module 140 may execute step S240 shown in fig. 2, and sub-step S241 and sub-step S242 shown in fig. 4, and the specific execution process may refer to the above detailed description of step S240, sub-step S241, and sub-step S242.

In summary, in the target tracking method and apparatus provided in the preferred embodiments of the present invention, the target tracking method has strong anti-interference capability and high target tracking success rate, and can continuously track the target object. The target tracking method is applied to a server, and the server stores characteristic models corresponding to target objects, wherein each characteristic model comprises historical CNN characteristics of the corresponding target object. Firstly, the method comprises the steps of carrying out target detection on a current video frame image to obtain each object to be detected in the current video frame image, and extracting CNN characteristics corresponding to each object to be detected from the current video frame image according to position information of each object to be detected in the detected current video frame image; then, according to the position information and corresponding CNN characteristics of each object to be detected in the current video frame image, and the position information and corresponding CNN characteristics of each target object in the previous video frame image, calculating to obtain an optimal similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image; then, the method carries out data association on each object to be detected and each target object based on the similarity matrix to obtain an optimal matching result between the current video frame image and the previous video frame image; finally, when the optimal matching result has an object to be detected which is successfully matched with the corresponding target object, the method updates the feature model corresponding to the target object according to the CNN feature of the object to be detected which is successfully matched with the corresponding target object, and obtains a corresponding tracking result based on the object to be detected which is successfully matched, thereby reducing the influence of an interfering object on target tracking, improving the success rate of target tracking and realizing continuous tracking of the target object.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A target tracking method is applied to a server, the server stores feature models corresponding to target objects, wherein each feature model comprises historical CNN features of the corresponding target object, and the method comprises the following steps:

performing target detection on a current video frame image, and extracting CNN characteristics corresponding to each object to be detected from the current video frame image according to the position information of each object to be detected in the detected current video frame image;

if the optimal matching result contains an object to be detected which is successfully matched with the corresponding target object, updating the feature model corresponding to the target object according to the CNN feature of the object to be detected which is successfully matched with the corresponding target object, and obtaining a corresponding tracking result based on the object to be detected which is successfully matched;

the step of calculating the similarity matrix between each object to be detected in the current video frame image and each target object in the previous video frame image comprises the following steps:

calculating to obtain the feature similarity between each object to be detected and each target object based on the historical CNN features in the corresponding feature model of each target object;

calculating to obtain the spatial similarity and the shape similarity between each object to be detected and each target object based on the position information and the target size information of each target object in the last video frame image and the position information and the target size information of each object to be detected in the current video frame image;

multiplying and calculating according to the feature similarity, the space similarity and the shape similarity between each object to be detected and each target object to obtain the association similarity between each object to be detected and each target object, and correspondingly obtaining a similarity matrix;

the step of calculating the spatial similarity and the shape similarity between each object to be detected and each target object based on the position information and the target size information of each target object in the previous video frame image and the position information and the target size information of each object to be detected in the current video frame image comprises the following steps:

calculating and solving the spatial similarity between each object to be detected and each target object according to the area width, the area height, the X coordinate information and the Y coordinate information of each object to be detected, and the X coordinate information and the Y coordinate information of each target object;

calculating and solving the shape similarity between each object to be detected and each target object according to the area width, the area height, the X coordinate information and the Y coordinate information of each object to be detected, and the area width and the area height of each target object;

the space similarity and the shape similarity both accord with the matching criterion of the Hungarian algorithm and the expansion algorithm thereof; wherein the spatial similarity and the shape similarity can be calculated by the following formula:

wherein, trk_iRepresenting the ith target object, det_jRepresenting the jth object to be detected, X, Y, W, H respectively representing the x coordinate value, y coordinate value, area width and area height of the upper left corner point of the corresponding characteristic area of the object in the corresponding video frame image, and aff_mot(trk_i，det_j) Representing the spatial similarity, aff, between the ith target object and the jth object to be detected_shp(trk_i，det_j) Representing the shape similarity between the ith target object and the jth object to be detected。

2. The method according to claim 1, wherein the step of calculating the feature similarity between each object to be detected and each target object based on the historical CNN features in the feature model corresponding to each target object comprises:

3. The method according to any one of claims 1-2, wherein the step of updating the feature model corresponding to the target object according to the CNN feature of the object to be detected that is successfully matched with the corresponding target object comprises:

4. The method according to claim 3, wherein the step of adding the CNN feature of the object to be detected, which is successfully matched with the target object, into the feature model corresponding to the target object according to the comparison result comprises:

5. An object tracking apparatus applied to a server storing feature models corresponding to respective object objects, wherein each feature model includes a history CNN feature of the corresponding object, the apparatus comprising:

the updating and tracking module is used for updating the characteristic model corresponding to the target object according to the CNN characteristics of the object to be detected successfully matched with the corresponding target object if the object to be detected successfully matched with the corresponding target object exists in the optimal matching result, and obtaining a corresponding tracking result based on the object to be detected successfully matched;

the matrix calculation module comprises a similarity calculation operator module and a matrix generation submodule;

the similarity operator module is used for calculating and obtaining the feature similarity between each object to be detected and each target object based on the historical CNN features in the feature model corresponding to each target object;

the similarity calculation operator module is further used for calculating and obtaining the spatial similarity and the shape similarity between each object to be detected and each target object based on the position information and the target size information of each target object in the last video frame image and the position information and the target size information of each object to be detected in the current video frame image;

the matrix generation submodule is used for multiplying and calculating the correlation similarity between each object to be detected and each target object according to the feature similarity, the space similarity and the shape similarity between each object to be detected and each target object, and correspondingly obtaining a similarity matrix;

the similarity calculation operator module is specifically configured to, when the similarity calculation operator module is used to calculate spatial similarity and shape similarity between each object to be detected and each target object based on the position information and the target size information of each target object in the previous video frame image and the position information and the target size information of each object to be detected in the current video frame image, the similarity calculation operator module is configured to:

wherein, trk_iRepresents the ith target object, det_jRepresenting the jth object to be detected, X, Y, W, H respectively representing the x coordinate value, the y coordinate value, the region width and the region height of the upper left corner of the corresponding characteristic region of the object in the corresponding video frame image, and aff_mot(trk_i，det_j) Representing the spatial similarity, aff, between the ith target object and the jth object to be detected_shp(trk_i，det_j) And representing the shape similarity between the ith target object and the jth object to be detected.

6. The device according to claim 5, wherein the manner of calculating the feature similarity between each object to be detected and each target object by the similarity operator module based on the historical CNN features in the feature model corresponding to each target object comprises:

7. The apparatus according to any one of claims 5 to 6, wherein the manner of updating the feature model corresponding to the target object by the update tracking module according to the CNN feature of the object to be detected successfully matched with the corresponding target object includes:

8. The apparatus according to claim 7, wherein the manner in which the update tracking module adds the CNN feature of the object to be detected, which is successfully matched with the target object, to the feature model corresponding to the target object according to the comparison result includes: