CN113256683B - Target tracking method and related equipment

Publication number: CN113256683B
Authority: CN (China)
Prior art keywords: target, frame, layer, prediction, frame image
Legal status: Active (granted)
Application number: CN202110627554.4A
Other languages: Chinese (zh)
Other versions: CN113256683A
Inventors: 王智卓, 曾卓熙, 陈宁
Assignee: Shenzhen Intellifusion Technologies Co Ltd
Application filed by Shenzhen Intellifusion Technologies Co Ltd


Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]


Abstract

The embodiment of the application provides a target tracking method and related equipment. The method comprises the following steps: performing track prediction on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t-th frame image and t is a positive integer; determining K anchor frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer; obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor frames; and if the first confidence score of the second prediction frame is greater than a preset score, determining that a second target framed by the second prediction frame in the t+1st frame image is the first target. By adopting the embodiment of the application, both the accuracy and the speed of target tracking are improved.

Description

Target tracking method and related equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a target tracking method and related devices.
Background
With the rapid expansion of urban scale, urban governance has become a critical issue. In current city management, video monitoring and analysis tasks are completed by installing intelligent cameras. Unlike traditional monitoring cameras, intelligent cameras can perform intelligent analysis, alarming and other work on the device side, in which multi-target tracking plays a vital role. Therefore, achieving robust, real-time, high-precision multi-target tracking has become an urgent task. However, current multi-target tracking methods do not achieve a good compromise between speed and accuracy.
Disclosure of Invention
The embodiment of the application discloses a target tracking method and related equipment, which are beneficial to improving the accuracy and speed of target tracking.
The first aspect of the embodiment of the application discloses a target tracking method, which comprises the following steps: track prediction is carried out on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t frame image, and t is a positive integer; determining K anchor point frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer; obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor blocks; and if the first confidence score of the second prediction frame is larger than a preset score, determining that a second target framed by the second prediction frame in the t+1st frame image is the first target.
In a possible implementation manner, the determining K anchor frames in the t+1st frame image according to the first prediction frame includes: determining a plurality of anchor frames in the t+1st frame image according to the first prediction frame; calculating intersection-over-union (IoU) values between the first prediction frame and each of the anchor frames to obtain a plurality of IoU values; and sorting the IoU values in descending order, and taking the anchor frames corresponding to the first K IoU values in the sorting result as the K anchor frames.
In a possible implementation manner, the obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor frames includes: if K is equal to 1, taking the single anchor frame as the second prediction frame; if K is greater than 1, merging the K anchor frames to obtain the second prediction frame; and calculating the first confidence score of the second prediction frame according to the IoU value corresponding to each of the K anchor frames.
In a possible implementation manner, the merging the K anchor frames to obtain the second prediction frame includes: performing a first multiply-accumulate calculation over the K anchor frames and the IoU value corresponding to each of the K anchor frames, and dividing the result of the first multiply-accumulate calculation by K to obtain the second prediction frame.
In one possible implementation manner, the calculating the first confidence score of the second prediction frame according to the IoU value corresponding to each of the K anchor frames includes: acquiring a second confidence score of each of the K anchor frames; performing a second multiply-accumulate calculation over the IoU value and the second confidence score of each of the K anchor frames, and dividing the result of the second multiply-accumulate calculation by K to obtain the first confidence score of the second prediction frame.
In one possible implementation, the method further includes: performing target detection on the t+1st frame image by using a lightweight target detection model; if a third target is detected in the t+1st frame image, performing feature extraction on the third target to obtain a first feature, wherein the third target is neither the first target nor the second target; calculating an average value of N second features to obtain a third feature, wherein the N second features are obtained by performing feature extraction on a fourth target in N frame images, the N second features correspond one-to-one to the N frame images, each of the N frame images precedes the t-th frame image and contains the fourth target, the fourth target is neither the first target nor the second target, and N is a positive integer; calculating a similarity between the first feature and the third feature; and if the similarity between the first feature and the third feature is greater than a preset similarity, determining that the third target is the fourth target.
In one possible implementation, the lightweight object detection model includes, connected in sequence: a first convolution layer, a maximum pooling layer, a second convolution layer, a third convolution layer, a first BN layer, a first ReLU layer, a fourth convolution layer, a second BN layer, a second ReLU layer, a fifth convolution layer, a third BN layer, a third ReLU layer, a sixth convolution layer, a fourth BN layer, a seventh convolution layer, a fifth BN layer, a fourth ReLU layer, an eighth convolution layer, a sixth BN layer, a ninth convolution layer, a seventh BN layer, a fifth ReLU layer, a tenth convolution layer, an eighth BN layer, an eleventh convolution layer, a ninth BN layer, a sixth ReLU layer, a twelfth convolution layer, a tenth BN layer, a thirteenth convolution layer, an eleventh BN layer, a seventh ReLU layer, a fourteenth convolution layer, a twelfth BN layer, a fifteenth convolution layer, a thirteenth BN layer, a sixteenth convolution layer, a fourteenth convolution layer, a seventeenth lu layer, a seventeenth BN layer, a ninth lu layer, a depth separation convolution layer, a sixteenth convolution layer, a sixty-detector layer, a full-order layer, a detector layer.
In one possible implementation, when t is equal to 1, the method further includes: and carrying out target detection on the t frame image by adopting the lightweight target detection model so as to obtain the first target.
The second aspect of the embodiment of the application discloses a target tracking device, which comprises a processing unit, wherein the processing unit is used for: track prediction is carried out on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t frame image, and t is a positive integer; determining K anchor point frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer; obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor blocks; and if the first confidence score of the second prediction frame is larger than a preset score, determining that a second target framed by the second prediction frame in the t+1st frame image is the first target.
In a possible implementation manner, in the determining K anchor blocks in the t+1st frame image according to the first prediction block, the processing unit is specifically configured to: determining a plurality of anchor blocks in the (t+1) th frame image according to the first prediction block; respectively calculating the cross-over ratios between the first prediction frame and the anchor blocks to obtain a plurality of cross-over ratios; and sequencing the multiple cross ratios according to the sequence from big to small, and taking anchor blocks corresponding to the first K cross ratios in the sequencing result as the K anchor blocks.
In a possible implementation manner, in the obtaining a second prediction frame according to the K anchor blocks and the first confidence score of the second prediction frame, the processing unit is specifically configured to: if the K is equal to 1, the K anchor blocks are used as the second prediction blocks; if the K is larger than 1, combining according to the K anchor blocks to obtain the second prediction frame; and calculating a first confidence score of the second prediction frame according to the corresponding intersection ratio of each anchor frame in the K anchor frames.
In a possible implementation manner, in the aspect of combining according to the K anchor blocks to obtain the second prediction block, the processing unit is specifically configured to: and performing first multiply-accumulate calculation according to the K anchor blocks and the corresponding intersection ratio of each anchor block in the K anchor blocks, and dividing the result of the first multiply-accumulate calculation by K to obtain the second prediction block.
In one possible implementation manner, in calculating the first confidence score of the second prediction frame according to the corresponding intersection ratio of each of the K anchor frames, the processing unit is specifically configured to: acquiring a second confidence score of each anchor block in the K anchor blocks; and calculating the corresponding intersection ratio of each anchor point frame in the K anchor point frames, carrying out second multiply-accumulate calculation on the intersection ratio and the second confidence score of each anchor point frame in the K anchor point frames, and dividing the result of the second multiply-accumulate calculation by K to obtain the first confidence score of the second prediction frame.
In a possible implementation, the processing unit is further configured to: performing target detection on the t+1st frame image by adopting a lightweight target detection model; if a third target is detected in the t+1st frame image, extracting features of the third target to obtain a first feature, wherein the third target is not the first target, and the third target is not the second target; calculating an average value of N second features to obtain a third feature, wherein the N second features are obtained by extracting features of a fourth target in N frame images, the N second features are in one-to-one correspondence with the N frame images, any N frame of the N frame images before the t frame image comprises an image of the fourth target, the fourth target is not the first target, the fourth target is not the second target, and the N is a positive integer; calculating the similarity of the first feature and the third feature; and if the similarity between the first feature and the third feature is greater than a preset similarity, determining that the third target is the fourth target.
In one possible implementation, the lightweight object detection model includes, connected in sequence: a first convolution layer, a maximum pooling layer, a second convolution layer, a third convolution layer, a first BN layer, a first ReLU layer, a fourth convolution layer, a second BN layer, a second ReLU layer, a fifth convolution layer, a third BN layer, a third ReLU layer, a sixth convolution layer, a fourth BN layer, a seventh convolution layer, a fifth BN layer, a fourth ReLU layer, an eighth convolution layer, a sixth BN layer, a ninth convolution layer, a seventh BN layer, a fifth ReLU layer, a tenth convolution layer, an eighth BN layer, an eleventh convolution layer, a ninth BN layer, a sixth ReLU layer, a twelfth convolution layer, a tenth BN layer, a thirteenth convolution layer, an eleventh BN layer, a seventh ReLU layer, a fourteenth convolution layer, a twelfth BN layer, a fifteenth convolution layer, a thirteenth BN layer, a sixteenth convolution layer, a fourteenth convolution layer, a seventeenth lu layer, a seventeenth BN layer, a ninth lu layer, a depth separation convolution layer, a sixteenth convolution layer, a sixty-detector layer, a full-order layer, a detector layer.
In a possible implementation, when t is equal to 1, the processing unit is further configured to: and carrying out target detection on the t frame image by adopting the lightweight target detection model so as to obtain the first target.
A third aspect of the present embodiments discloses an electronic device comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method according to any of the first aspect of the present embodiments.
A fourth aspect of the present application discloses a chip, including: a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method according to any of the first aspects of the embodiments of the present application.
A fifth aspect of the embodiments of the present application discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of the first aspects of the embodiments of the present application.
A sixth aspect of the embodiments of the present application discloses a computer program product for causing a computer to perform the method according to any one of the first aspect of the embodiments of the present application.
In the embodiment of the present application, the first target is a target in the t-th frame image of the video stream, and track prediction is performed on the first target in the t+1st frame image of the video stream to obtain a first prediction frame; K anchor frames are then determined in the t+1st frame image according to the first prediction frame; a second prediction frame and a first confidence score of the second prediction frame are obtained according to the K anchor frames; and if the first confidence score of the second prediction frame is greater than the preset score, the second target framed by the second prediction frame in the t+1st frame image is determined to be the first target. Compared with the prior art, which tracks the target directly through simple intersection-over-union (IoU) matching on the first prediction frame, the embodiment of the application generates the second prediction frame from the first prediction frame and the anchor frames and tracks the target according to the second prediction frame and its confidence score, so the accuracy of target tracking can be improved. Compared with the prior art, which tracks the target using both motion cues and appearance features, the embodiment of the application tracks the target using the second prediction frame, that is, using motion cues only, so the complex association process in target tracking can be omitted and the target tracking speed can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly introduce the drawings that are needed in the embodiments or the description of the prior art, it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a target tracking method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an anchor block according to an embodiment of the present application.
Fig. 3 is a schematic architecture diagram of a lightweight object detection model according to an embodiment of the present application.
Fig. 4 is a flowchart of another object tracking method according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a target tracking apparatus according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to facilitate understanding of the embodiments of the present application, technical problems to be specifically solved by the present application are further analyzed and presented.
Although many excellent multi-target tracking algorithms have been proposed in succession, these algorithms do not meet the actual needs of real-world scenarios. Most current multi-target tracking algorithms use the tracking-by-detection concept to complete the multi-target tracking task, and this concept has several problems, including:
(1) The upper limit of the overall multi-target tracking algorithm depends on the upper limit of the detection algorithm.
(2) After the detector (i.e. detection algorithm) loses the object, this method has difficulty recovering the lost object from the time-line.
(3) It is difficult to match and accurately associate the same objects across frames using simple IoU matching alone, while tracking with both motion cues and appearance features takes a significant amount of time.
Aiming at the problems, the embodiment of the application provides a lightweight multi-target tracking method based on target re-detection and anchor point selection, which uses the concept of tracking-by-detection and anchor point selection to complete the generation of a short track, and uses the concept of short track synthesis to complete the task of multi-target tracking; in order to further improve the accuracy of target association, the embodiment of the application also provides an anchor point association thought. Therefore, the performance dependence of a multi-target tracking algorithm on a target detection algorithm can be well eliminated, and a needed short track can be generated rapidly and accurately; and the performance of the multi-target tracking algorithm can be further improved through short track merging.
Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a target tracking method according to an embodiment of the present application, where the target tracking method may be applied to an electronic device, and the electronic device may include an intelligent camera, and the target tracking method includes, but is not limited to, the following steps.
And 101, carrying out track prediction on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t frame image, and t is a positive integer.
The t frame image and the t+1st frame image are two adjacent frame images in the video stream; when there are a plurality of targets in the t-th frame image, the first target is any one of the plurality of targets in the t-th frame image. The targets include pedestrians, animals, vehicles, etc.
It should be understood that when t is equal to 1, the t frame image is the 1 st frame image, the first target in the 1 st frame image is obtained by detection, for example, the 1 st frame image is subjected to target detection by a target detection algorithm or a target detection model, so as to obtain a detection frame, and the target framed in the detection frame is the first target. When t is greater than 1, the first target in the t frame image is obtained through target prediction in the t-1 frame image or is obtained through detection; specifically, when a first target appears in the t-1 th frame image, the first target in the t-1 th frame image is obtained through target prediction in the t-1 th frame image; when the first target does not appear in the t-1 frame image, namely the first target is a new target relative to the t-1 frame image, the first target in the t frame image is obtained through detection.
The first prediction frame may be an optical flow prediction frame, that is, a prediction frame for predicting the t+1st frame image of the first target through an optical flow algorithm; wherein the first prediction frame is also expressed in the form of (x, y, w, h), x and y representing the center point of the object, and w and h representing the width and height of the first prediction frame, respectively.
Step 102, determining K anchor point frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer.
Wherein K anchor frames are determined in the t+1st frame image according to the first prediction frame, that is, K anchor frames are determined near the position of the first prediction frame. The anchor frames are rectangular boxes of different sizes and scales; typical aspect ratios are 1:1, 1:2, 1:3, 1:4, 2:1, 3:1, 4:1, etc., as shown in fig. 2.
In a possible implementation manner, the determining K anchor frames in the t+1st frame image according to the first prediction frame includes: determining a plurality of anchor frames in the t+1st frame image according to the first prediction frame; calculating the IoU between the first prediction frame and each of the anchor frames to obtain a plurality of IoU values; and sorting the IoU values in descending order, and taking the anchor frames corresponding to the first K IoU values in the sorting result as the K anchor frames.
Specifically, anchor frames with different aspect ratios are determined at positions near the first prediction frame; the IoU between the first prediction frame and each of the anchor frames is then calculated; and K anchor frames are selected in descending order of their IoU values.
In this embodiment, among all anchor frames corresponding to the first prediction frame, the K anchor frames having the largest IoU with the first prediction frame are selected, which improves the accuracy of anchor frame selection and thus the accuracy of target tracking.
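The following is a minimal sketch of this anchor-frame selection step. The (cx, cy, w, h) box format, the helper names, and the choice of keeping the anchor area equal to the prediction frame's area are illustrative assumptions rather than the exact implementation of the embodiment.

```python
import numpy as np

def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def iou(box_a, box_b):
    """Intersection-over-union of two (cx, cy, w, h) boxes."""
    a, b = to_corners(box_a), to_corners(box_b)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def generate_anchors(pred_box, ratios=(1/1, 1/2, 1/3, 1/4, 2/1, 3/1, 4/1)):
    """Place anchor frames of several aspect ratios around the predicted center,
    keeping roughly the same area as the first prediction frame (an assumption)."""
    cx, cy, w, h = pred_box
    area = w * h
    anchors = []
    for r in ratios:                          # r = width / height
        aw, ah = np.sqrt(area * r), np.sqrt(area / r)
        anchors.append(np.array([cx, cy, aw, ah]))
    return anchors

def select_top_k_anchors(pred_box, anchors, k):
    """Rank anchor frames by IoU with the first prediction frame and keep the top K."""
    scored = sorted(((iou(pred_box, a), a) for a in anchors),
                    key=lambda t: t[0], reverse=True)
    top = scored[:k]
    weights = [s for s, _ in top]             # IoU values become the anchor weights
    boxes = [a for _, a in top]
    return boxes, weights
```

A call such as select_top_k_anchors(first_pred_box, generate_anchors(first_pred_box), K) would then yield the K anchor frames and the IoU weights used in the fusion step below.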
And step 103, obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor blocks.
The second prediction frame is a re-detection frame result, and has higher tracking precision. The first prediction frame, the second prediction frame and the detection frame are all rectangular frames, the rectangular frames are expressed by the form of (x, y, w, h), x and y represent the center point of the target, and w and h represent the width and height of the rectangular frames respectively.
In a possible implementation manner, the obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor frames includes: if K is equal to 1, taking the single anchor frame as the second prediction frame; if K is greater than 1, merging the K anchor frames to obtain the second prediction frame; and calculating the first confidence score of the second prediction frame according to the IoU value corresponding to each of the K anchor frames.
Specifically, if K is equal to 1, that is, only one anchor frame is selected for the first prediction frame, this anchor frame is directly taken as the second prediction frame. If K is greater than 1, that is, a plurality of anchor frames are selected for the first prediction frame, the K anchor frames are fused to obtain the second prediction frame.
The first confidence score of the second prediction frame may be used to determine whether the first target and the target defined by the second prediction frame are the same target, where the first confidence score is related to an intersection ratio corresponding to each of the K anchor frames.
In a possible implementation manner, the combining according to the K anchor blocks to obtain the second prediction block includes: and performing first multiply-accumulate calculation according to the K anchor blocks and the corresponding intersection ratio of each anchor block in the K anchor blocks, and dividing the result of the first multiply-accumulate calculation by K to obtain the second prediction block.
Specifically, anchor frame fusion is performed through formula (1) to obtain the second prediction frame:

    r_j^{t+1} = (1/K) * Σ_{k=1..K} w_{jk} · a_{jk}    (1)

In formula (1), r_j^{t+1} denotes the re-detection frame, in the t+1st frame image, of the j-th target of the t-th frame image, i.e., the second prediction frame of the j-th target in the t+1st frame image, which may be the second prediction frame of the first target in the t+1st frame image; a_{jk} denotes the k-th anchor frame of the optical flow prediction frame p_j^{t+1}, where p_j^{t+1} is the optical flow prediction frame of the j-th target in the t+1st frame image, that is, the first prediction frame, which may be the first prediction frame of the first target in the t+1st frame image; and w_{jk} denotes the weight value corresponding to the k-th anchor frame, namely the IoU between the k-th anchor frame and the first prediction frame.
In one possible implementation manner, the calculating to obtain the first confidence score of the second prediction frame according to the corresponding intersection ratio of each anchor frame in the K anchor frames includes: acquiring a second confidence score of each anchor block in the K anchor blocks; and calculating the corresponding intersection ratio of each anchor point frame in the K anchor point frames, carrying out second multiply-accumulate calculation on the intersection ratio and the second confidence score of each anchor point frame in the K anchor point frames, and dividing the result of the second multiply-accumulate calculation by K to obtain the first confidence score of the second prediction frame.
Specifically, the first confidence score of the second prediction frame is calculated through formula (2):

    c_j^{t+1} = (1/K) * Σ_{k=1..K} w_{jk} · s_{jk}    (2)

In formula (2), c_j^{t+1} denotes the confidence score, in the t+1st frame image, of the second prediction frame of the j-th target of the t-th frame image, i.e., the first confidence score, which may be the first confidence score of the first target in the t+1st frame image; w_{jk} denotes the weight value corresponding to the k-th anchor frame of the optical flow prediction frame p_j^{t+1}, namely the IoU between that anchor frame and the first prediction frame; and s_{jk} denotes the confidence score corresponding to the k-th anchor frame, i.e., the second confidence score.
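Continuing the sketch above, formulas (1) and (2) amount to IoU-weighted multiply-accumulate operations over the selected anchor frames; the function names and the reuse of the IoU weights from the selection step are assumptions.

```python
import numpy as np

def fuse_anchors(anchor_boxes, weights):
    """Formula (1): second prediction frame as the IoU-weighted multiply-accumulate
    of the K anchor frames, divided by K (the single anchor is returned when K = 1)."""
    k = len(anchor_boxes)
    if k == 1:
        return np.asarray(anchor_boxes[0])
    acc = sum(w * np.asarray(box) for w, box in zip(weights, anchor_boxes))
    return acc / k

def fuse_confidence(weights, anchor_scores):
    """Formula (2): first confidence score as the multiply-accumulate of each
    anchor frame's IoU weight and its own (second) confidence score, divided by K."""
    k = len(weights)
    return sum(w * s for w, s in zip(weights, anchor_scores)) / k

# Usage sketch: if the fused score clears the preset score, the target framed by
# the second prediction frame inherits the first target's ID.
# second_box = fuse_anchors(boxes, weights)
# if fuse_confidence(weights, scores) > preset_score:
#     tracks[target_id] = second_box
```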
Step 104, if the first confidence score of the second prediction frame is greater than a preset score, determining that a second target framed by the second prediction frame in the t+1st frame image is the first target.
Specifically, let the preset score be λ. If the first confidence score of the second prediction frame is greater than λ, the second target is the same target as the first target, and the ID of the first target is assigned to the second target; if the first confidence score of the second prediction frame is not greater than λ, the second prediction frame is an erroneous re-detection result, and the previous track breaks at this point, that is, the first target of the t-th frame image does not appear in the t+1st frame image.
In the embodiment of the present application, the first target is a target in the t-th frame image of the video stream, and track prediction is performed on the first target in the t+1st frame image of the video stream to obtain a first prediction frame; K anchor frames are then determined in the t+1st frame image according to the first prediction frame; a second prediction frame and a first confidence score of the second prediction frame are obtained according to the K anchor frames; and if the first confidence score of the second prediction frame is greater than the preset score, the second target framed by the second prediction frame in the t+1st frame image is determined to be the first target. Compared with the prior art, which tracks the target directly through simple IoU matching on the first prediction frame, the embodiment of the application generates the second prediction frame from the first prediction frame and the anchor frames and tracks the target according to the second prediction frame and its confidence score, so the accuracy of target tracking can be improved. Compared with the prior art, which tracks the target using both motion cues and appearance features, the embodiment of the application tracks the target using the second prediction frame, that is, using motion cues only, so the complex association process in target tracking can be omitted and the target tracking speed can be improved.
In one possible implementation, the method further includes: performing target detection on the t+1st frame image by adopting a lightweight target detection model; if a third target is detected in the t+1st frame image, extracting features of the third target to obtain a first feature, wherein the third target is not the first target, and the third target is not the second target; calculating an average value of N second features to obtain a third feature, wherein the N second features are obtained by extracting features of a fourth target in N frame images, the N second features are in one-to-one correspondence with the N frame images, any N frame of the N frame images before the t frame image comprises an image of the fourth target, the fourth target is not the first target, the fourth target is not the second target, and the N is a positive integer; calculating the similarity of the first feature and the third feature; and if the similarity between the first feature and the third feature is greater than a preset similarity, determining that the third target is the fourth target.
It should be appreciated that, besides the targets carried over from the t-th frame image to the t+1st frame image, the t+1st frame image may also contain newly appearing targets, that is, targets that appear in the t+1st frame image but not in the t-th frame image. The targets carried over from the t-th frame image are obtained through re-detection tracking, i.e., through the second prediction frames; a newly appearing target cannot be obtained through re-detection tracking, so the lightweight target detection model is used to perform target detection on the t+1st frame image to determine whether a newly appearing target exists. A newly appearing target in the t+1st frame image may be a target that appears for the first time in the video; it may also be a target that appeared before the t-th frame image but not in the t-th frame image, i.e., a target whose track has broken, for which track merging is required.
Whether a newly appearing target in the t+1st frame image is a target with a broken track is determined according to the similarity between the features of the newly appearing target and the features of the target with the broken track. Specifically, assume the third target is any newly appearing target in the t+1st frame image and the fourth target is any target with a broken track. First, the third target is detected in the t+1st frame image, that is, a detection frame framing the third target is obtained, and feature extraction is performed on the third target within that detection frame to obtain a first feature. Then the average value of N second features is calculated to obtain a third feature, where the N second features are obtained by performing feature extraction on the fourth target in N frame images (one second feature per frame image), and each of the N frame images precedes the t-th frame image and contains the fourth target. The similarity between the first feature and the third feature is then calculated; if the similarity is greater than a preset similarity, the third target is determined to be the fourth target, that is, the newly appearing target is the target with the broken track; otherwise, the third target is determined to be a target appearing for the first time.
Wherein the average value of the features of all detection frames of each track, including a track that has broken, is calculated through formula (3); the so-called track is in fact the set of detection frames of a certain target in different video frame images:

    f = (1/n) * Σ_{i=1..n} f_i    (3)

In formula (3), f represents the average value of the features of a certain track, that is, the third feature; f_i represents the feature of any one detection frame of the track, namely a second feature; and n represents the number of times the track appears in the video, i.e., the number of detection frames of the track.
In the embodiment, a lightweight target detection model is adopted to carry out target detection on the t+1st frame image; if the new appearing target is detected in the t+1st frame image, extracting the characteristics of the new appearing target to obtain a first characteristic; calculating the average value of the characteristics of the targets with broken tracks to obtain third characteristics; calculating the similarity between the first feature and the third feature, and judging whether the newly appeared target is a target with broken track according to the similarity between the first feature and the third feature; and under the condition that the newly-appeared target is a target with a broken track, the newly-appeared target and the target with the broken track are combined in track, so that the accuracy of target tracking is improved.
In one possible implementation, the lightweight object detection model includes, connected in sequence: a first convolution layer, a maximum pooling layer, a second convolution layer, a third convolution layer, a first BN layer, a first ReLU layer, a fourth convolution layer, a second BN layer, a second ReLU layer, a fifth convolution layer, a third BN layer, a third ReLU layer, a sixth convolution layer, a fourth BN layer, a seventh convolution layer, a fifth BN layer, a fourth ReLU layer, an eighth convolution layer, a sixth BN layer, a ninth convolution layer, a seventh BN layer, a fifth ReLU layer, a tenth convolution layer, an eighth BN layer, an eleventh convolution layer, a ninth BN layer, a sixth ReLU layer, a twelfth convolution layer, a tenth BN layer, a thirteenth convolution layer, an eleventh BN layer, a seventh ReLU layer, a fourteenth convolution layer, a twelfth BN layer, a fifteenth convolution layer, a thirteenth BN layer, a sixteenth convolution layer, a fourteenth convolution layer, a seventeenth lu layer, a seventeenth BN layer, a ninth lu layer, a depth separation convolution layer, a sixteenth convolution layer, a sixty-detector layer, a full-order layer, a detector layer.
It will be appreciated that since the target detection algorithm is required to be used in each frame of the overall multi-target tracking algorithm, both the speed and accuracy of the target detection algorithm can affect its performance. Therefore, the embodiment of the application makes a compromise between speed and precision, and designs a lightweight target detection model.
Specifically, the basic framework of the lightweight object detection model is shown in fig. 3 and comprises a reference (backbone) network, an SPA module and a Head network. The SPA module is a spatial channel attention module that can greatly improve the precision of the whole detection algorithm; its main effect is to set a weight value for each position on the feature map, and multiplying these weights with the original feature map emphasizes certain positions. The Head network comprises a detection branch and a classification branch. The classification branch performs the classification task, i.e., determining which type of target the detection frame contains. The detection branch performs the regression task, i.e., a more accurate adjustment of the detection frame; the detection frame gives the specific position of the object in the image and is usually denoted by (x, y, w, h), where x and y represent the upper-left corner of the rectangular frame and w and h represent its width and height, respectively.
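As one way to picture the SPA module's role of weighting each position on the feature map, the following PyTorch sketch implements a generic spatial attention block; the sigmoid-activated 1x1 convolution and the layer sizes are assumptions, not the exact SPA design of the embodiment.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generic spatial attention: predict one weight per spatial position and
    multiply it with the original feature map to emphasize certain positions."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),  # one score per position
            nn.Sigmoid(),                           # weights in (0, 1)
        )

    def forward(self, x):                           # x: (N, C, H, W)
        weights = self.score(x)                     # (N, 1, H, W)
        return x * weights                          # broadcast over channels

# e.g. applying it to a backbone feature map:
# feat = torch.randn(1, 256, 20, 20)
# out = SpatialAttention(256)(feat)                 # same shape, re-weighted
```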
The implementation details of the network architecture of the lightweight target detection model are shown in table 1; the lightweight target detection model is trained by using a Pytorch framework, and key training parameters of the whole network are shown in table 2.
Table 1 network architecture
In table 1, S1 and S2 represent different convolution strides; eps and momentum are parameters of the BN layers: eps is a small constant that prevents the denominator from being 0, and momentum is a momentum parameter, typically around 0.99.
TABLE 2 Training parameters of the lightweight object detection model

Parameter name | Default value | Description
Input_size     | 320*320       | Input picture size
lr             | 0.0001        | Learning rate
momentum       | 0.9           | Optimizer momentum parameter
epoch          | 30            | Number of training iterations (epochs)
batch_size     | 16            | Number of pictures used per training step
optimizer      | Adam          | Optimizer
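Under the parameters of Table 2, the training loop could be configured roughly as follows. The stand-in model, data and loss below are placeholders for illustration only; only the optimizer, learning rate, epoch count, batch size and input size come from the table, and mapping the momentum value 0.9 to Adam's first beta is an assumption.

```python
import torch
import torch.nn as nn

# Stand-in model and data, just to show the training configuration from Table 2.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 6, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

batch_size, input_size, epochs = 16, 320, 30
for epoch in range(epochs):
    images = torch.randn(batch_size, 3, input_size, input_size)   # 320*320 inputs
    target = torch.zeros(batch_size, 6, input_size, input_size)   # dummy regression target
    loss = nn.functional.mse_loss(model(images), target)          # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```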
In the embodiment, a lightweight target detection model is adopted to detect the target of the image, which is beneficial to realizing the compromise of speed and precision in target tracking.
In one possible implementation, when t is equal to 1, the method further includes: and carrying out target detection on the t frame image by adopting the lightweight target detection model so as to obtain the first target.
It will be appreciated that the frame images following the 1 st frame image may be tracked using the target in the previous frame image, but the 1 st frame image does not have the previous frame image and therefore it is necessary to detect the target to obtain the target and then follow-up tracking based on the target in the 1 st frame image. The lightweight object detection model can be used for detecting the object in the 1 st frame image to obtain the object in the 1 st frame image.
Referring to fig. 4, fig. 4 is a flowchart of another object tracking method according to an embodiment of the present application, where the object tracking method may be applied to an electronic device, and the electronic device may include a smart camera, and the object tracking method includes, but is not limited to, the following steps.
Step 401, capturing an image to obtain a t+1st frame image.
The application scene of the embodiment of the application is video monitoring, namely, a monitoring camera is hung at a certain height, and targets appearing in a specific area are captured and analyzed in real time. Through the above operation, an image to be processed can be obtained, that is, a t+1st frame image can be obtained.
Step 402, determine whether t is equal to 0.
Step 403, if t is equal to 0, then performing lightweight target detection.
Specifically, after the image to be processed is acquired, the image to be processed needs to be analyzed and processed. When t is equal to 0, the t+1st frame image is the 1st frame image, and for the 1st frame image in the video, a lightweight detection algorithm is firstly used to obtain the specific positions of all pedestrians in the 1st frame image, namely the detection frames of the pedestrians. For example, if there are j targets, i.e., j pedestrians, in the 1 st frame image, then the 1 st frame image is detected, so as to obtain j detection frames. Because the target detection algorithm is required to be used in each frame in the whole multi-target tracking algorithm, the speed and the precision of the detection algorithm can influence the performance of the target detection algorithm, a compromise is made between the speed and the precision in the embodiment of the application, and a lightweight target detection algorithm, namely a lightweight target detection model is designed, the implementation details of the network architecture of the lightweight target detection model are shown in table 1, and the basic framework of the lightweight target detection model is shown in fig. 3.
It should be appreciated that since the 1 st frame image is the first appearance of the object in the video, after detecting the detection frames of all pedestrians in the 1 st frame image by using the lightweight object detection model, it is directly saved, that is, step 409 is performed.
In step 404, if t is not equal to 0, the target position is predicted using the optical flow.
It should be appreciated that the detection frames of all pedestrians in the 1 st frame of image in the video can be obtained by step 403. For the t frame image, firstly, estimating the approximate position of each target in the t frame image in the t+1st frame image by using a dense optical flow algorithm, namely, predicting each target in the t frame image to obtain an optical flow prediction frame in the t+1st frame image. For example, if there are j objects in the t frame image, j optical flow prediction frames are predicted in the t+1st frame image.
The optical flow prediction algorithm predicts based on detection frames, that is, target detection is first performed in the 1st frame image of the video to obtain the targets. Through a large number of tests, the Farneback optical flow algorithm was finally selected to perform the prediction; it is a dense optical flow algorithm, but after acceleration it is fast and accurate enough to meet the requirements of most tracking scenes. For any one target in the t-th frame image, the implementation details of optical flow prediction are as follows (a minimal code sketch is given after the steps):
(1) And acquiring the central point positions x and y of the detection frame of the jth target in the t frame image and the width w and the height h of the detection frame of the jth target, wherein the jth target in the t frame image is any target in the t frame image.
(2) Horizontal and vertical offsets d (h, w) at each pixel position in a detection frame of a jth target in a t-th frame image are acquired from the t-th frame image and the t+1th frame image.
(3) And obtaining the predicted position of the jth target in the (t+1) th frame image according to the formula (4).
    p_j^{t+1} = d_j^t + (1/|d_j^t|) * Σ_{(h,w) ∈ d_j^t} d(h,w)    (4)

In formula (4), j denotes the j-th target in the t-th frame image, which may be the first target; d_j^t denotes the detection frame of the j-th target in the t-th frame image; |d_j^t| denotes the area of that detection frame; d(h,w) denotes the optical flow offset at pixel position (h,w) inside the detection frame; and p_j^{t+1} denotes the optical flow prediction frame of the j-th target in the t+1st frame image, i.e., the first prediction frame of the j-th target in the t+1st frame image.
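A minimal OpenCV-based sketch of this prediction step might look as follows. Averaging the dense flow inside the detection frame and shifting the frame by that mean offset follows formula (4); the specific Farneback parameters and the (cx, cy, w, h) box format are illustrative assumptions.

```python
import cv2
import numpy as np

def predict_box_with_flow(prev_gray, next_gray, box):
    """Shift a (cx, cy, w, h) detection frame by the mean Farneback optical flow
    computed over the pixels it covers (formula (4))."""
    # pyr_scale=0.5, levels=3, winsize=15, iterations=3, poly_n=5, poly_sigma=1.2, flags=0
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    cx, cy, w, h = box
    x1, y1 = int(cx - w / 2), int(cy - h / 2)
    x2, y2 = int(cx + w / 2), int(cy + h / 2)
    region = flow[max(y1, 0):y2, max(x1, 0):x2]      # offsets d(h, w) inside the box
    dx, dy = region[..., 0].mean(), region[..., 1].mean()
    return (cx + dx, cy + dy, w, h)                  # first prediction frame in frame t+1

# prev_gray / next_gray are consecutive frames converted with
# cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
```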
Step 405, anchor block matching.
It should be appreciated that, through the above step 404, an optical flow prediction frame of all objects in the t frame image in the t+1st frame image may be obtained, including a first prediction frame of a first object in the t frame image in the t+1st frame image. That is, through the above step 404, the approximate predicted positions of all the detection frames in the t+1st frame image can be obtained, but in many scenarios, there will be a larger deviation value only by using the predicted approximate positions, which will greatly affect the accuracy of the whole multi-target tracking algorithm.
In order to obtain more accurate predicted positions of all detection frames in the t+1st frame image, according to the embodiment of the present application, a plurality of manually set anchor point frames are added on the basis of the approximate predicted positions of the optical flow predicted frames, that is, the positions of the optical flow predicted frames corresponding to each target of the t frame image, so as to further adjust the predicted positions of each target of the t frame image; and then a re-detection algorithm is utilized to obtain the specific prediction position of each target of the t frame image in the t+1st frame image.
Among them, the anchor boxes are rectangular boxes with different sizes and dimensions, and typical aspect ratios are 1:1, 1:2, 1:3, 1:4, 2:1, 3:1, 4:1, and so on, and specific details are shown in fig. 2.
For the optical flow prediction frame, in the t+1st frame image, of any one target in the t-th frame image, for example the optical flow prediction frame p_j^{t+1} of the j-th target of the t-th frame image, the implementation details of anchor frame matching are as follows:
(1) For the optical flow prediction frame p_j^{t+1} of the j-th target of the t-th frame image in the t+1st frame image, calculate the IoU between p_j^{t+1} and each of its corresponding anchor frames; for example, if there are N anchor frames, the IoU between each of the N anchor frames and p_j^{t+1} is calculated, yielding N IoU values.
(2) Sort all anchor frames corresponding to the optical flow prediction frame p_j^{t+1} in descending order of IoU, and select the first K anchor frames; for example, if there are N anchor frames, K anchor frames are selected from the N anchor frames in descending order of IoU.
(3) When K = 1, take that anchor frame as the final prediction frame of the j-th target in the t+1st frame image. When K > 1, since each of the K anchor frames has a large IoU with the optical flow prediction frame p_j^{t+1}, these K anchor frames have scales and positions similar to those of p_j^{t+1}; in this case, the IoU values corresponding to the K anchor frames are taken as their weight values, i.e., w_{jk} = A(a_{jk}, p_j^{t+1}), where A denotes the matching (IoU) function and a_{jk} denotes the k-th anchor frame of the j-th optical flow prediction frame.
Step 406, obtaining a re-detection result.
It should be understood that, through the above step 405, K anchor frames of each target in the t-th frame image in the t+1st frame image may be obtained, that is, K anchor frames of each detection frame in the t+1st frame image may be obtained; or, K anchor blocks corresponding to the optical flow prediction blocks of each target in the t-th frame image in the t+1st frame image are obtained. Thus, K anchor blocks of each target in the t+1st frame image can be utilized to acquire a re-detection result of each target in the t+1st frame image.
Unlike tracking-by-detection-based multi-target tracking algorithms, which associate targets in the t-th frame image with targets in the t+1st frame image mainly through Kalman filtering and the Hungarian matching algorithm, the embodiment of the application mainly uses anchor frames and the re-detection concept to perform this association.
For the optical flow prediction frame, in the t+1st frame image, of any one target in the t-th frame image, for example the optical flow prediction frame p_j^{t+1} of the j-th target of the t-th frame image, the specific re-detection details are described below.
(1) For the optical flow prediction frame p_j^{t+1}, acquire its K corresponding anchor frames and the weight value corresponding to each of the K anchor frames, where the weight value of each anchor frame is its IoU value.
(2) Acquire the confidence score of each of the K anchor frames by using a detection algorithm such as SSD, YOLOv3 or YOLOv4.
(3) Combining the K anchor blocks through the formula (1) and the formula (2) to obtain a re-detection frame of the jth target in the (t+1) th frame image and a confidence score of the re-detection frame, namely, a re-detection result of the jth target in the (t+1) th frame image and a confidence score of the re-detection result.
In formula (1) or formula (2), r_j^{t+1} denotes the re-detection frame of the j-th target in the t+1st frame image, i.e., the second prediction frame of the j-th target in the t+1st frame image; c_j^{t+1} denotes the confidence score of that re-detection frame in the t+1st frame image, i.e., the first confidence score; w_{jk} denotes the weight value corresponding to the k-th anchor frame of the optical flow prediction frame p_j^{t+1}; and s_{jk} denotes the confidence score corresponding to the k-th anchor frame, i.e., the second confidence score.
(4) Filter the re-detection result through a set threshold. Let the threshold be λ. When c_j^{t+1} > λ, the re-detection frame r_j^{t+1} is a correct re-detection result, that is, the final prediction frame of the j-th detection frame in the t+1st frame image is obtained, and the ID value of the j-th target of the t-th frame image is assigned to the target framed by the re-detection frame in the t+1st frame image. When c_j^{t+1} ≤ λ, the re-detection frame r_j^{t+1} is an erroneous re-detection result; the previous track breaks at this point, that is, the j-th target of the t-th frame image does not appear in the t+1st frame image.
Step 407, non-maximum suppression (non maximum suppression, NMS) deduplication.
It should be appreciated that through the above step 406, a re-detection result of each target of the t-th frame image in the t+1st frame image can be obtained, which corresponds to the association result in tracking-by-detection. Anchor-frame-based target detection algorithms generally generate some re-detection frames with a high overlap rate, that is, several close re-detection frames may appear on one target; the main purpose of NMS de-duplication is to select the optimal re-detection frame and remove the other re-detection frames.
The implementation details of NMS de-duplication are as follows (a minimal sketch is given after the steps):
(1) And sorting all the redetected frames according to the confidence score value, and selecting the redetected frame with the highest current score.
(2) Traversing other re-detection frames except the re-detection frame with the current highest score, calculating IoU values between the other re-detection frames and the re-detection frame with the current highest score, and if the IoU value is larger than a set threshold value, considering that the coincidence ratio between the other re-detection frames and the re-detection frame with the current highest score is higher, and deleting the other re-detection frames.
(3) Continue to select the re-detection frame with the highest score among the re-detection frames that have not yet been processed, and repeat the above operation until NMS de-duplication has been performed for all targets; a compact sketch of the procedure follows below.
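The sketch below illustrates this greedy procedure; the vectorised IoU computation and the function name nms are illustrative assumptions, with boxes again taken as (x1, y1, x2, y2) rows of a numpy array.

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy NMS over re-detection frames; returns indices of the kept frames."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # (1) sort by confidence, highest first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # (2) IoU of the current best frame with every remaining frame
        xx1 = np.maximum(x1[best], x1[rest]); yy1 = np.maximum(y1[best], y1[rest])
        xx2 = np.minimum(x2[best], x2[rest]); yy2 = np.minimum(y2[best], y2[rest])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter + 1e-9)
        # drop frames that overlap the best frame too much, (3) repeat on the rest
        order = rest[iou <= iou_threshold]
    return keep
```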
Step 408, merging short tracks.
It should be understood that, through the above operations, a short track of a given target over the whole video may be obtained. Because a target to be tracked in a real scene may be occluded by other targets and may sometimes disappear temporarily, its track is often broken, so a subsequent track merging operation needs to be performed. The tracking task is to generate a track for each distinct target in the video frames; a track is in fact the sequence of rectangular boxes (detection boxes) of that target in different video frames, and the tracking algorithm assigns a unique ID value to the current target to identify it. However, in a real scene, occlusion of the target, rapid motion of the object and similar situations may cause the target to be lost entirely; at that point the current track ends, the same person is given a new ID in a subsequent frame, and track merging is therefore required.
The short track merging process used in the embodiments of the present application is as follows:
(1) For each newly appearing target in the (t+1)-th frame image, acquire a 128-dimensional feature representation of its detection frame by using a pedestrian re-identification algorithm, obtaining a first feature; here, a newly appearing target in the (t+1)-th frame image refers to a target that appears in the (t+1)-th frame image but not in the t-th frame image.
(2) Calculate the average value of the features of all the detection frames of each track through formula (3).
Formula (3) can be written as

f = (1/n) · Σ_{i=1}^{n} f_i    (3)

In formula (3), f represents the average value of the features of a certain track, namely the third feature; f_i represents the feature of any detection frame of that track, namely a second feature; n represents the number of times the track appears in the video, i.e., the number of detection frames of the track.
(3) Calculate the similarity between the first feature of the newly appearing target and the third feature of each already ended track; if the similarity between the first feature and a third feature is greater than the preset similarity, the newly appearing target belongs to an interrupted track, and the interrupted tracks are merged, i.e., the newly appearing target is merged into the track whose feature similarity is greater than the preset similarity, as sketched below.
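The following sketch illustrates steps (1) to (3); the cosine similarity measure, the dictionary layout of the ended tracks and all function names are assumptions made for illustration, since the text only requires computing a similarity between the first feature and the third feature.

```python
import numpy as np

def track_mean_feature(track_features):
    """Formula (3): average the ReID features of all detection frames on a track."""
    return np.mean(np.asarray(track_features, dtype=float), axis=0)

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float); b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def merge_new_target(new_feature, ended_tracks, sim_threshold):
    """Attach a newly appearing target to the most similar ended track.

    new_feature:   128-d ReID feature of the new detection (the first feature).
    ended_tracks:  dict mapping track ID -> list of per-frame features.
    sim_threshold: preset similarity above which the tracks are merged.
    Returns the matched track ID, or None if the target starts a new track.
    """
    best_id, best_sim = None, sim_threshold
    for track_id, feats in ended_tracks.items():
        sim = cosine_similarity(new_feature, track_mean_feature(feats))  # third feature
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id
```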
Step 409, updating the tracking result.
It should be understood that, through the above operations, the re-detection results of the plurality of targets in the (t+1)-th frame image can be acquired accurately; the rectangular frame information and the ID information of these re-detection results are then saved, and the tracks having the same ID are connected, which completes the processing of the (t+1)-th frame image. Processing of the subsequent frame images then continues until the entire video is finished.
In the target tracking method described in fig. 4, to address the problem that a multi-target tracking algorithm depends excessively on the performance of the detection algorithm, a scheme based on tracking-by-detection and anchor point selection is proposed. Through re-detection and anchor point selection, the tracking precision of a single target is improved, which in turn greatly improves the precision of the whole multi-target tracking algorithm. Matching anchor blocks removes the complex association process in multi-target tracking, further improving the speed of the multi-target tracking algorithm. Merging newly appearing tracks through feature matching further improves the accuracy of the whole multi-target tracking algorithm.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an object tracking device 500 provided in an embodiment of the present application, where the object tracking device 500 is applied to an electronic apparatus, and the object tracking device 500 may include a processing unit 501, where the detailed descriptions of the respective units are as follows:
The processing unit 501 is configured to: track prediction is carried out on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t frame image, and t is a positive integer; determining K anchor point frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer; obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor blocks; and if the first confidence score of the second prediction frame is larger than a preset score, determining that a second target framed by the second prediction frame in the t+1st frame image is the first target.
In a possible implementation manner, in the determining K anchor blocks in the (t+1)-th frame image according to the first prediction frame, the processing unit 501 is specifically configured to: determine a plurality of anchor blocks in the (t+1)-th frame image according to the first prediction frame; respectively calculate the intersection-over-union (IoU) ratio between the first prediction frame and each of the anchor blocks to obtain a plurality of IoU ratios; and sort the IoU ratios in descending order and take the anchor blocks corresponding to the first K IoU ratios in the sorted result as the K anchor blocks.
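A minimal sketch of this selection step, assuming (x1, y1, x2, y2) box coordinates and numpy arrays (the function name and box encoding are not taken from the patent), might look as follows.

```python
import numpy as np

def select_top_k_anchors(pred_box, anchors, k):
    """Pick the K candidate anchor boxes with the largest IoU against the
    first prediction box; returns the boxes and their IoU values."""
    anchors = np.asarray(anchors, dtype=float)
    px1, py1, px2, py2 = pred_box
    # intersection of the prediction box with every candidate anchor box
    x1 = np.maximum(px1, anchors[:, 0]); y1 = np.maximum(py1, anchors[:, 1])
    x2 = np.minimum(px2, anchors[:, 2]); y2 = np.minimum(py2, anchors[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (px2 - px1) * (py2 - py1)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    ious = inter / (area_p + area_a - inter + 1e-9)
    top = ious.argsort()[::-1][:k]            # sort IoUs from large to small, keep first K
    return anchors[top], ious[top]
```

The returned IoU values then serve as the weight values used in the multiply-accumulate combination described above.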
In one possible implementation manner, in the obtaining a second prediction frame according to the K anchor frames and the first confidence score of the second prediction frame, the processing unit 501 is specifically configured to: if the K is equal to 1, the K anchor blocks are used as the second prediction blocks; if the K is larger than 1, combining according to the K anchor blocks to obtain the second prediction frame; and calculating a first confidence score of the second prediction frame according to the corresponding intersection ratio of each anchor frame in the K anchor frames.
In one possible implementation manner, in the aspect of combining according to the K anchor blocks to obtain the second prediction block, the processing unit 501 is specifically configured to: and performing first multiply-accumulate calculation according to the K anchor blocks and the corresponding intersection ratio of each anchor block in the K anchor blocks, and dividing the result of the first multiply-accumulate calculation by K to obtain the second prediction block.
In one possible implementation manner, in calculating the first confidence score of the second prediction frame according to the corresponding intersection ratio of each of the K anchor frames, the processing unit 501 is specifically configured to: acquiring a second confidence score of each anchor block in the K anchor blocks; and calculating the corresponding intersection ratio of each anchor point frame in the K anchor point frames, carrying out second multiply-accumulate calculation on the intersection ratio and the second confidence score of each anchor point frame in the K anchor point frames, and dividing the result of the second multiply-accumulate calculation by K to obtain the first confidence score of the second prediction frame.
In a possible implementation manner, the processing unit 501 is further configured to: performing target detection on the t+1st frame image by adopting a lightweight target detection model; if a third target is detected in the t+1st frame image, extracting features of the third target to obtain a first feature, wherein the third target is not the first target, and the third target is not the second target; calculating an average value of N second features to obtain a third feature, wherein the N second features are obtained by extracting features of a fourth target in N frame images, the N second features are in one-to-one correspondence with the N frame images, any N frame of the N frame images before the t frame image comprises an image of the fourth target, the fourth target is not the first target, the fourth target is not the second target, and the N is a positive integer; calculating the similarity of the first feature and the third feature; and if the similarity between the first feature and the third feature is greater than a preset similarity, determining that the third target is the fourth target.
In one possible implementation, the lightweight object detection model includes, connected in sequence: a first convolution layer, a maximum pooling layer, a second convolution layer, a third convolution layer, a first BN layer, a first ReLU layer, a fourth convolution layer, a second BN layer, a second ReLU layer, a fifth convolution layer, a third BN layer, a third ReLU layer, a sixth convolution layer, a fourth BN layer, a seventh convolution layer, a fifth BN layer, a fourth ReLU layer, an eighth convolution layer, a sixth BN layer, a ninth convolution layer, a seventh BN layer, a fifth ReLU layer, a tenth convolution layer, an eighth BN layer, an eleventh convolution layer, a ninth BN layer, a sixth ReLU layer, a twelfth convolution layer, a tenth BN layer, a thirteenth convolution layer, an eleventh BN layer, a seventh ReLU layer, a fourteenth convolution layer, a twelfth BN layer, a fifteenth convolution layer, a thirteenth BN layer, a sixteenth convolution layer, a fourteenth convolution layer, a seventeenth lu layer, a seventeenth BN layer, a ninth lu layer, a depth separation convolution layer, a sixteenth convolution layer, a sixty-detector layer, a full-order layer, a detector layer.
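Because the translated layer list becomes garbled toward its end, only its opening layers are sketched here as an illustration, assuming PyTorch; every channel count, kernel size and stride is an assumption, since the text specifies only the layer types and their order.

```python
import torch.nn as nn

class LightweightDetectorStem(nn.Module):
    """Illustrative stem only: conv -> max pool -> conv -> conv -> BN -> ReLU -> conv -> BN -> ReLU.

    All channel counts, kernel sizes and strides are assumptions; the patent
    text only lists the layer types and their order.
    """
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),  # first convolution layer
            nn.MaxPool2d(kernel_size=2, stride=2),                           # maximum pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),                     # second convolution layer
            nn.Conv2d(32, 32, kernel_size=3, padding=1),                     # third convolution layer
            nn.BatchNorm2d(32),                                              # first BN layer
            nn.ReLU(inplace=True),                                           # first ReLU layer
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),           # fourth convolution layer
            nn.BatchNorm2d(64),                                              # second BN layer
            nn.ReLU(inplace=True),                                           # second ReLU layer
        )

    def forward(self, x):
        return self.stem(x)
```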
In a possible implementation, when t is equal to 1, the processing unit 501 is further configured to: and carrying out target detection on the t frame image by adopting the lightweight target detection model so as to obtain the first target.
It should be noted that the implementation of each unit may also correspond to the corresponding description of the method embodiment shown in fig. 1 or fig. 4. Of course, the object tracking device 500 provided in the embodiment of the present application includes, but is not limited to, the above unit modules; for example, the object tracking device 500 may further comprise a storage unit 502, and the storage unit 502 may be used for storing program code and data of the object tracking device 500.
In the object tracking device 500 illustrated in fig. 5, the first target is a target in the t-th frame image of the video stream, and track prediction is performed on the first target in the (t+1)-th frame image of the video stream to obtain a first prediction frame; K anchor blocks are then determined in the (t+1)-th frame image according to the first prediction frame; a second prediction frame and a first confidence score of the second prediction frame are obtained according to the K anchor blocks; and if the first confidence score of the second prediction frame is greater than the preset score, the second target framed by the second prediction frame in the (t+1)-th frame image is determined to be the first target. Compared with the prior art, which tracks the target directly through simple IoU matching of the first prediction frame, the embodiment of the application generates a second prediction frame from the first prediction frame and the anchor blocks and tracks the target according to the second prediction frame and its confidence score, so the accuracy of target tracking can be improved. Compared with the prior art, which tracks the target using both motion cues and appearance features, the embodiment of the application tracks the target with the second prediction frame using motion cues only, so the complex association process in target tracking can be omitted and the target tracking speed can be improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device 610 according to an embodiment of the present application, where the electronic device 610 includes a processor 611, a memory 612, and a communication interface 613, and the processor 611, the memory 612, and the communication interface 613 are connected to each other through a bus 614.
Memory 612 includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM), and the memory 612 is used for storing related computer programs and data. The communication interface 613 is used to receive and transmit data.
The processor 611 may be one or more central processing units (central processing unit, CPU), and in the case where the processor 611 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
The processor 611 in the electronic device 610 is configured to read the computer program code stored in the memory 612, and execute the method shown in fig. 1 or fig. 4.
It should be noted that the implementation of the respective operations may also correspond to the respective description of the method embodiment shown with reference to fig. 1 or fig. 4.
In the electronic device 610 depicted in fig. 6, the first target is a target in the t-th frame image of the video stream, and track prediction is performed on the first target in the (t+1)-th frame image of the video stream to obtain a first prediction frame; K anchor blocks are then determined in the (t+1)-th frame image according to the first prediction frame; a second prediction frame and a first confidence score of the second prediction frame are obtained according to the K anchor blocks; and if the first confidence score of the second prediction frame is greater than the preset score, the second target framed by the second prediction frame in the (t+1)-th frame image is determined to be the first target. Compared with the prior art, which tracks the target directly through simple IoU matching of the first prediction frame, the embodiment of the application generates a second prediction frame from the first prediction frame and the anchor blocks and tracks the target according to the second prediction frame and its confidence score, so the accuracy of target tracking can be improved. Compared with the prior art, which tracks the target using both motion cues and appearance features, the embodiment of the application tracks the target with the second prediction frame using motion cues only, so the complex association process in target tracking can be omitted and the target tracking speed can be improved.
The embodiment of the application also provides a chip, which comprises at least one processor, a memory and an interface circuit, wherein the memory, the interface circuit and the at least one processor are interconnected through lines, and the memory stores a computer program; the computer program, when executed by the processor, implements the method flow shown in fig. 1 or fig. 4.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, which when run on a computer, implements the method flow shown in fig. 1 or fig. 4.
Embodiments of the present application also provide a computer program product, which when run on a computer, implements the method flow shown in fig. 1 or fig. 4.
It should be appreciated that the processors referred to in the embodiments of the present application may be central processing units (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the memory referred to in the embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
Note that when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, the memory (storage module) is integrated into the processor.
It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should also be understood that the terms first, second, third, fourth, and the various numerical designations referred to herein are merely for convenience of description and are not intended to limit the scope of the present application.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method shown in the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
The above embodiments are only for illustrating the technical solutions of the present application, and not for limiting them; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; and such modifications and substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (9)

1. A target tracking method, comprising:
track prediction is carried out on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t frame image, and t is a positive integer;
determining K anchor point frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer;
obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor blocks;
If the first confidence score of the second prediction frame is larger than a preset score, determining that a second target framed by the second prediction frame in the (t+1) th frame image is the first target;
the method further comprises the steps of: performing target detection on the t+1st frame image by adopting a lightweight target detection model; if a third target is detected in the t+1st frame image, extracting features of the third target to obtain a first feature, wherein the third target is not the first target, and the third target is not the second target; calculating an average value of N second features to obtain a third feature, wherein the N second features are obtained by extracting features of a fourth target in N frame images, the N second features are in one-to-one correspondence with the N frame images, any N frame of the N frame images before the t frame image comprises an image of the fourth target, the fourth target is not the first target, the fourth target is not the second target, and the N is a positive integer; calculating the similarity of the first feature and the third feature; and if the similarity between the first feature and the third feature is greater than a preset similarity, determining that the third target is the fourth target.
2. The method of claim 1, wherein the determining K anchor blocks in the t+1st frame image according to the first prediction block comprises:
determining a plurality of anchor blocks in the (t+1) th frame image according to the first prediction block;
respectively calculating the cross-over ratios between the first prediction frame and the anchor blocks to obtain a plurality of cross-over ratios;
and sequencing the multiple cross ratios according to the sequence from big to small, and taking anchor blocks corresponding to the first K cross ratios in the sequencing result as the K anchor blocks.
3. The method of claim 2, wherein the obtaining a second prediction box from the K anchor boxes and the first confidence score for the second prediction box comprises:
if the K is equal to 1, the K anchor blocks are used as the second prediction blocks;
if the K is larger than 1, combining according to the K anchor blocks to obtain the second prediction frame;
and calculating a first confidence score of the second prediction frame according to the corresponding intersection ratio of each anchor frame in the K anchor frames.
4. The method of claim 3, wherein said combining according to the K anchor blocks to obtain the second prediction block comprises:
And performing first multiply-accumulate calculation according to the K anchor blocks and the corresponding intersection ratio of each anchor block in the K anchor blocks, and dividing the result of the first multiply-accumulate calculation by K to obtain the second prediction block.
5. The method of claim 3, wherein calculating a first confidence score for the second prediction block based on the corresponding intersection ratio for each of the K anchor blocks comprises:
acquiring a second confidence score of each anchor block in the K anchor blocks;
and calculating the corresponding intersection ratio of each anchor point frame in the K anchor point frames, carrying out second multiply-accumulate calculation on the intersection ratio and the second confidence score of each anchor point frame in the K anchor point frames, and dividing the result of the second multiply-accumulate calculation by K to obtain the first confidence score of the second prediction frame.
6. The method of claim 1, wherein the lightweight object detection model comprises, in order: a first convolution layer, a maximum pooling layer, a second convolution layer, a third convolution layer, a first BN layer, a first ReLU layer, a fourth convolution layer, a second BN layer, a second ReLU layer, a fifth convolution layer, a third BN layer, a third ReLU layer, a sixth convolution layer, a fourth BN layer, a seventh convolution layer, a fifth BN layer, a fourth ReLU layer, an eighth convolution layer, a sixth BN layer, a ninth convolution layer, a seventh BN layer, a fifth ReLU layer, a tenth convolution layer, an eighth BN layer, an eleventh convolution layer, a ninth BN layer, a sixth ReLU layer, a twelfth convolution layer, a tenth BN layer, a thirteenth convolution layer, an eleventh BN layer, a seventh ReLU layer, a fourteenth convolution layer, a twelfth BN layer, a fifteenth convolution layer, a thirteenth BN layer, a sixteenth convolution layer, a fourteenth convolution layer, a seventeenth lu layer, a seventeenth BN layer, a ninth lu layer, a depth separation convolution layer, a sixteenth convolution layer, a sixty-detector layer, a full-order layer, a detector layer.
7. An object tracking device, comprising a processing unit configured to:
track prediction is carried out on a first target in a t+1st frame image to obtain a first prediction frame, wherein the first target is a target in the t frame image, and t is a positive integer;
determining K anchor point frames in the t+1st frame image according to the first prediction frame, wherein K is a positive integer;
obtaining a second prediction frame and a first confidence score of the second prediction frame according to the K anchor blocks;
if the first confidence score of the second prediction frame is larger than a preset score, determining that a second target framed by the second prediction frame in the (t+1) th frame image is the first target;
the processing unit is further configured to: performing target detection on the t+1st frame image by adopting a lightweight target detection model; if a third target is detected in the t+1st frame image, extracting features of the third target to obtain a first feature, wherein the third target is not the first target, and the third target is not the second target; calculating an average value of N second features to obtain a third feature, wherein the N second features are obtained by extracting features of a fourth target in N frame images, the N second features are in one-to-one correspondence with the N frame images, any N frame of the N frame images before the t frame image comprises an image of the fourth target, the fourth target is not the first target, the fourth target is not the second target, and the N is a positive integer; calculating the similarity of the first feature and the third feature; and if the similarity between the first feature and the third feature is greater than a preset similarity, determining that the third target is the fourth target.
8. An electronic device comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-6.
9. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-6.
CN202110627554.4A 2020-12-30 2021-06-04 Target tracking method and related equipment Active CN113256683B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011627853X 2020-12-30
CN202011627853.XA CN112700472A (en) 2020-12-30 2020-12-30 Target tracking method and related equipment

Publications (2)

Publication Number Publication Date
CN113256683A CN113256683A (en) 2021-08-13
CN113256683B true CN113256683B (en) 2024-03-29

Family

ID=75513392

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011627853.XA Pending CN112700472A (en) 2020-12-30 2020-12-30 Target tracking method and related equipment
CN202110627554.4A Active CN113256683B (en) 2020-12-30 2021-06-04 Target tracking method and related equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202011627853.XA Pending CN112700472A (en) 2020-12-30 2020-12-30 Target tracking method and related equipment

Country Status (2)

Country Link
CN (2) CN112700472A (en)
WO (1) WO2022142416A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700472A (en) * 2020-12-30 2021-04-23 深圳云天励飞技术股份有限公司 Target tracking method and related equipment
CN115994929A (en) * 2023-03-24 2023-04-21 中国兵器科学研究院 Multi-target tracking method integrating space motion and apparent feature learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102015118067A1 (en) * 2014-10-22 2016-04-28 GM Global Technology Operations LLC Systems and methods for object recognition
CN109117794A (en) * 2018-08-16 2019-01-01 广东工业大学 A kind of moving target behavior tracking method, apparatus, equipment and readable storage medium storing program for executing
CN109214245A (en) * 2017-07-03 2019-01-15 株式会社理光 A kind of method for tracking target, device, equipment and computer readable storage medium
CN110472594A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Method for tracking target, information insertion method and equipment
WO2020147410A1 (en) * 2019-01-14 2020-07-23 平安科技(深圳)有限公司 Pedestrian detection method and system, computer device, and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10068135B2 (en) * 2016-12-22 2018-09-04 TCL Research America Inc. Face detection, identification, and tracking system for robotic devices
US10685244B2 (en) * 2018-02-27 2020-06-16 Tusimple, Inc. System and method for online real-time multi-object tracking
CN112700472A (en) * 2020-12-30 2021-04-23 深圳云天励飞技术股份有限公司 Target tracking method and related equipment


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Faster R-CNN: Towards real-time object detection with region proposal networks; Ren Shaoqing et al.; IEEE; full text *
You only look once: unified, real-time object detection; Redmon J. et al.; IEEE; full text *
Online multi-target tracking with multi-candidate association based on the R-FCN framework; E Gui; Wang Yongxiong; Opto-Electronic Engineering (01); full text *
Siamese network tracking algorithm based on target-aware feature screening; Chen Zhiwang; Zhang Zhongxin; Song Juan; Luo Hongfu; Peng Yong; Acta Optica Sinica (09); full text *

Also Published As

Publication number Publication date
WO2022142416A1 (en) 2022-07-07
CN113256683A (en) 2021-08-13
CN112700472A (en) 2021-04-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant