WO2021237749A1 - Method and apparatus for object tracking and reidentification - Google Patents

Method and apparatus for object tracking and reidentification

Info

Publication number
WO2021237749A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
fully connected
connected layer
neural network
learning method
Prior art date
Application number
PCT/CN2020/093538
Other languages
French (fr)
Inventor
Xiaoyi YANG
Original Assignee
Siemens Aktiengesellschaft
Siemens Ltd., China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft, Siemens Ltd., China filed Critical Siemens Aktiengesellschaft
Priority to PCT/CN2020/093538 priority Critical patent/WO2021237749A1/en
Publication of WO2021237749A1 publication Critical patent/WO2021237749A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • the present invention relates to techniques of image processing, and more particularly to a method, apparatus and computer-readable storage medium for object tracking and reidentification.
  • Moving object tracking is a widely used technique. Taking vehicles as an example, current tracking procedures are not very reliable: they rely on the comparison of vehicle positions in adjacent frames of a video, and the process is often interrupted by occlusion or by the disappearance of an object within the video, as shown in FIG. 1.
  • the nearest bounding boxes in two adjacent frames are the ones having the biggest overlapping area, that is area(P)∩area(C), as shown in FIG. 2.
  • IoU (intersection over union) can be used to judge the nearest bounding boxes: if the IoU is larger than a pre-defined threshold, it is determined that nearest bounding boxes for the same object are found, and therefore the object is tracked.
  • Embodiments of the present disclosure include methods, apparatuses for object tracking and methods, apparatuses for object reidentification.
  • a method for object tracking includes following steps:
  • an apparatus for object tracking includes:
  • a video frame acquisition module configured to acquire a first frame in a first video
  • a bounding box generation module configured to generate at least one first bounding box of the first frame via object detection
  • a calculation module configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
  • an apparatus for object tracking includes at least one processor; at least one memory, coupled to the at least one processor, configured to execute method according to the first aspect.
  • a computer-readable medium for object tracking stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the first aspect.
  • IoU of the pair of nearest bounding boxes can be calculated, and if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, it can be determined that the same object is detected in both of the nearest bounding boxes. Based on both the IoU and the change in size, the judgement can be more precise.
  • the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, it can be determined whether the same object is tracked through following procedure:
  • taking into account the cases where occlusion appears or the object moves out of sight, the position of the bounding box can be checked further, and the object's moving direction can be estimated. To search for the target either in the original video or in a video captured by another camera, the reidentification procedure can be triggered to check whether the same object is reidentified in both the previous and next frames, in case the object is lost in the current frame.
  • the same object can be reidentified via a neural network with a metric learning method, wherein the neural network is trained by a combination of a metric learning method and a representation learning method.
  • with the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them.
  • reidentification with metric learning can be more precise and efficient.
  • the complexity of training can also be reduced significantly.
  • the neural network can include:
  • a backbone part configured to extract features of image in the bounding boxes detected in frames of a video
  • the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • the selection of dataset can influence the performance of the neural network.
  • a dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features.
  • images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
  • a method for object reidentification includes following steps:
  • neural network is trained by a combination of a metric learning method and a representation learning method.
  • an apparatus for object reidentification includes:
  • an image acquisition module configured to acquire images
  • a reidentification module configured to reidentify in the acquired images via a neural network with metric learning method, wherein neural network is trained by a combination of a metric learning method and a representation learning method.
  • an apparatus for object reidentification includes at least one processor; at least one memory, coupled to the at least one processor, configured to execute method according to the fifth aspect.
  • a computer-readable medium for object reidentification stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the fifth aspect.
  • with the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them.
  • reidentification with metric learning can be more precise and efficient.
  • the complexity of training can also be reduced significantly.
  • the neural network can include:
  • the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • a dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features.
  • images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
  • FIG. 1 depicts scenarios in which vehicles are occluded or disappear.
  • FIG. 2 depicts current object tracking criteria.
  • FIG. 3A depicts the case in which a vehicle is wrongly untracked using current object tracking criteria.
  • FIG. 3B depicts the case in which a vehicle is wrongly tracked using current object tracking criteria.
  • FIG. 4 depicts a block diagram of an apparatus for object tracking in accordance with one embodiment of the present disclosure.
  • FIG. 5 depicts process of reidentification.
  • FIG. 6 depicts structure of a neural network in accordance with one embodiment of the present disclosure.
  • FIG. 7 and FIG. 8 depict flow diagrams of a method for object tracking in accordance with one embodiment of the present disclosure.
  • FIG. 9 depicts a block diagram of an apparatus for object reidentification in accordance with one embodiment of the present disclosure.
  • FIG. 10 depicts flow diagram of a method for object reidentification in accordance with one embodiment of the present disclosure.
  • FIG. 11 depicts examples of data resource for training a model used for object reidentification in accordance with one embodiment of the present disclosure.
  • FIG. 12 depicts object tracking result in accordance with one embodiment of the present disclosure.
  • the articles “a” , “an” , “the” and “said” are intended to mean that there are one or more of the elements.
  • the terms “comprising” , “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
  • FIG. 4 depicts a block diagram of an apparatus in accordance with one embodiment of the present disclosure.
  • the apparatus 10 for object tracking presented in the present disclosure can be implemented as a network of computer processors, to execute following method 100 for object tracking presented in the present disclosure.
  • the apparatus 10 can also be a single computer, as shown in FIG. 4, including at least one memory 101, which includes computer-readable medium, such as a random access memory (RAM) .
  • the apparatus 10 also includes at least one processor 102, coupled with the at least one memory 101.
  • Computer-executable instructions are stored in the at least one memory 101, and when executed by the at least one processor 102, can cause the at least one processor 102 to perform the steps described herein.
  • the at least one processor 102 may include a microprocessor, an application specific integrated circuit (ASIC) , a digital signal processor (DSP) , a central processing unit (CPU) , a graphics processing unit (GPU) , state machines, etc.
  • embodiments of computer-readable media include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions.
  • various other forms of computer-readable medium may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless.
  • the instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.
  • the at least one memory 101 shown in FIG. 4 can contain an object tracking program 20, when executed by the at least one processor 102, causing the at least one processor 102 to execute the method 100 or method 100’ for object tracking presented in the present disclosure.
  • Data 30, including videos of the target object can also be stored in the at least one memory 101.
  • the data 30 can be received via a communication module 103 of the apparatus 10.
  • the object tracking program 20 can include:
  • a video frame acquisition module 104 configured to acquire a first frame in a first video
  • a bounding box generation module 105 configured to generate at least one first bounding box of the first frame via object detection
  • a calculation module 106 configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; and determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
  • a video includes frames of images, which are usually taken by cameras in order of time.
  • the first frame is the current frame
  • the second frame is the previous frame of the current frame.
  • usually there is more than one object in a frame, so at least one bounding box can be found in each frame; each bounding box corresponds to one object.
  • calculation module 106 is further configured to:
  • the first threshold and the second threshold can be set according to the actual application scenario. Note that, equivalently, the calculation module 106 can determine that the same object is detected in both of the nearest bounding boxes if the change in size is not larger than the first threshold and the IoU is not smaller than the second threshold.
  • the calculation module 106 is further configured to: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
  • a change in size not smaller than the first threshold, or an IoU not larger than the second threshold, means that the nearest bounding boxes might not correspond to the same object; then further processing and judgement can be done for an accurate result.
  • the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame can be determined, based on which it can be decided whether to acquire the next frame from the same video as the current frame or from other videos across cameras. Based on such a position judgement, object tracking across cameras can be executed, which can effectively increase the success rate of tracking a moving object.
  • bounding boxes can be generated in the third frame and same object can be reidentified in the second frame and the third frame; once the same object is reidentified, it can be determined that the same object reidentified is tracked.
  • the calculation module 106 is further configured to reidentify via a neural network 90 with metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
  • the objective of reidentification is to estimate whether two images represent the same object.
  • the basic idea is to extract the features of images and to estimate their similarity according to their Euclidean distance. This process can be represented by FIG. 5.
  • Metric learning methods are widely discussed in reidentification topics; they aim to separate the features of objects. Representation learning methods, in contrast, are rarely mentioned in these tasks; they serve to express the features of categories, so they are usually used for classification. Considering that the classification process of representation learning implies a synthesis of distinguishable characteristics, if we regard each object as a category, representation learning works for feature separation as well. With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, as with many cars on the road, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
  • the neural network 90 used for object reidentification can include:
  • a backbone part 901 configured to extract features of image in the bounding boxes detected in frames of a video
  • the components in dot dashed-line rectangle are arranged according to representation learning methods. They work during the training process.
  • the components in dashed-line rectangle are arranged based on metric learning methods. They work during both the training and the application process.
  • the backbone part 901 can be implemented as ResNet-50, which provides enough features for a large number of categories. We also tried the lighter ResNet-18 backbone, but its depth proved insufficient and errors occurred in the results.
  • for the first fully connected layer 902, the input/output dimensions can be 2048/128. It can be used to synthesize the features coming out of the backbone part 901; moreover, the reduction of data dimension improves convergence during training and simplifies comparison during application.
  • for the second fully connected layer 903, the input/output dimensions can be 2048/901. It has a similar function to the first fully connected layer 902.
  • representation learning originally aims at classification; here we use it to differentiate the features of different objects, so the more object types we provide, the more distinguishable features we get. For example, for vehicle reidentification, we can select two datasets with 901 automobiles in total to train the neural network 90.
  • the loss function for classification part 904 can be implemented with Softmax.
  • the contrastive loss function part 905 is a kind of metric learning loss function. We select it because: 1) ResNet-50 provides enough features; and 2) the representation learning branch helps to separate the features. A more complicated loss function (for example the triplet loss) increases the difficulty of convergence during training, while the improvement in separation ability is not obvious.
  • the contrastive loss function can be defined as L = (1/(2N)) Σ_{i=1}^{N} [ y_i d_i^2 + (1 - y_i) max(margin - d_i, 0)^2 ]
  • N is the number of image pairs per batch
  • d_i is the Euclidean distance of the two vectors coming out of the first fully connected layer 902
  • the neural network 90 can be trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • the selection of the dataset can influence the performance of the neural network 90. Taking vehicles as an example, as shown in FIG. 11, we selected a dataset with a large number of automobiles. Some of them have similar appearance, which helps the neural network 90 concentrate on the distinguishable features. Moreover, for each automobile, the dataset 80 can cover images captured from different perspectives; as a result, the neural network 90 will be able to match multiple views (e.g. the front and the rear) of a car in actual use.
  • although the video frame acquisition module 104, the bounding box generation module 105 and the calculation module 106 are described above as software modules of the object tracking program 20, they can also be implemented in hardware, such as ASIC chips. They can be integrated into one chip or separately implemented and electrically connected.
  • the architecture above is merely exemplary and is used to explain the exemplary method 100 shown in FIG. 7 and method 100' shown in FIG. 8.
  • One exemplary method 100 according to the present disclosure includes following steps:
  • S103 finding a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame, that is the previous frame;
  • S105 determining, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
  • in comparison to method 100, optionally, before the step S105, in a step S103', the IoU of the pair of nearest bounding boxes can be calculated; then the step S105 can include the following sub-steps:
  • S1051 if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, determining that same object is detected in both of the nearest bounding boxes.
  • normally, the bounding box of an object should have stable shape and velocity.
  • the object is tracked if the bounding boxes in the current and previous frames largely overlap (high IoU) and the change in size is trivial; otherwise, there may be an occluder, or an object disappearing at the border of the frame.
  • the location of the bounding box can be checked further, then the object's moving direction can be estimated, and the reidentification procedure can be triggered in the following sub-steps:
  • S1052 if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, determining position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
  • S1053 determining, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
  • the procedure can go back to the sub-step S1051; the same object reidentified can be determined to be tracked.
  • the reidentification can be done via a neural network 90 with metric learning method, wherein neural network 90 is trained by a combination of a metric learning method and a representation learning method.
  • the neural network 90 can include:
  • a backbone part 901 configured to extract features of image in the bounding boxes detected in frames of a video
  • the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • an apparatus 50 and a method 200 for object reidentification are also provided in the present disclosure. They can be used for object reidentification in the above-mentioned apparatus 10 and method 100'.
  • the apparatus 50 has a similar structure to the apparatus 10, so only the differing parts are introduced here; for the rest, refer to the description of the apparatus 10.
  • the apparatus 50 can include:
  • at least one memory 501, coupled to the at least one processor 502, containing an object reidentification program 60 which, when executed by the at least one processor 502, causes the at least one processor 502 to execute the method 200 for object reidentification presented in the present disclosure.
  • Images 70 to be reidentified and dataset 80 used for training a neural network 90 can also be stored in the at least one memory 501. These data can be received via a communication module 503 of the apparatus 50.
  • the object reidentification program 60 can include:
  • an image acquisition module 504 configured to acquire images 70;
  • a reidentification module 505 configured to reidentify in the acquired images 70 via a neural network 90 with metric learning method, wherein neural network 90 is trained by a combination of a metric learning method and a representation learning method.
  • the neural network 90 can include:
  • a backbone part 901 configured to extract features of the first image and a second image
  • the neural network 90 can be trained based on a dataset 80 comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives, as shown in FIG. 11.
  • although the image acquisition module 504 and the reidentification module 505 are described above as software modules of the object reidentification program 60, they can also be implemented in hardware, such as ASIC chips. They can be integrated into one chip or separately implemented and electrically connected.
  • the architecture above is merely exemplary and is used to explain the exemplary method 200 shown in FIG. 10.
  • One exemplary method 200 according to the present disclosure includes following steps:
  • S202 reidentifying in the acquired images 70 via a neural network 90 with metric learning method, wherein neural network is trained by a combination of a metric learning method and a representation learning method.
  • the neural network 90 can include:
  • the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • a computer-readable medium is also provided in the present disclosure, storing computer-executable instructions, which upon execution by a computer, enables the computer to execute any of the methods presented in this disclosure.
  • a computer program which is being executed by at least one processor and performs any of the methods presented in this disclosure.
  • FIG. 12 shows test results of solutions provided in the present disclosure.
  • the solution was tested on a GeForce RTX 2080Ti with videos of 1080p resolution, captured by cameras along a street, so the same car appears in both videos. It takes 17 ms to compare a pair of cars in the frames. As the number of image pairs increases (n pairs, for example), the operation time grows, but not multiplicatively (t < 17n), because the images are packed as one matrix for calculation on the GPU. This module is triggered only if tracking fails, so its speed can meet the requirements of actual use.
  • An object tracking enhancement solution is provided in the present disclosure, which serves first to detect tracking failures and then to find the same object in other frames.
  • frames used for object tracking are not limited to adjacent frames of the same video; they may come from videos captured by two cameras far apart, which effectively solves the problem of object tracking across cameras.
  • an object reidentification solution is also provided: with the combination of a metric learning method and a representation learning method, the neural network used for object reidentification can effectively and precisely separate the features of different objects, especially similar-looking ones.
  • the choice of a dataset containing similar-looking objects and multiple perspectives of each object enables the neural network to recognize objects from many perspectives.
  • the neural network can work even when images are captured from the front or the back of a vehicle.
  • the neural network can easily be used in other scenarios after training, so the target object is not limited to vehicles; the solution is also applicable to other contexts (e.g. person tracking and object tracking).
  • the solution provided in the present disclosure can be widely applied in many surveillance scenarios. The enhancement of tracking in traffic has already been mentioned; it may also serve to track patients in hospitals or the elderly in nursing homes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus, system and computer-readable medium for object tracking and reidentification are presented. An object tracking enhancement solution is provided in the present disclosure, which serves first to detect tracking failures and then to find the same object in other frames. An object reidentification solution is also provided: with the combination of a metric learning method and a representation learning method, the neural network used for object reidentification can effectively and precisely separate the features of different objects, especially similar-looking ones.

Description

Method and apparatus for object tracking and reidentification Technical Field
The present invention relates to techniques of image processing, and more particularly to a method, apparatus and computer-readable storage medium for object tracking and reidentification.
Background Art
Moving object tracking is a widely used technique. Taking vehicles as an example, current tracking procedures are not very reliable: they rely on the comparison of vehicle positions in adjacent frames of a video, and the process is often interrupted by occlusion or by the disappearance of an object within the video, as shown in FIG. 1.
Most current tracking procedures complete when the nearest bounding boxes are found in two adjacent frames, as shown in FIG. 2. The nearest bounding boxes in two adjacent frames are the ones having the biggest overlapping area, that is area(P)∩area(C). In practice, IoU (intersection over union) can be used for the judgement of nearest bounding boxes:
IoU = area(P ∩ C) / area(P ∪ C)
If IoU is larger than a pre-defined threshold, it is determined that nearest bounding boxes for the same object are found, therefore the object is tracked.
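For illustration, below is a minimal Python sketch of this IoU test. The (x1, y1, x2, y2) box format, the threshold value and the example boxes are assumptions for the sketch, not details fixed by the disclosure.

```python
def iou(box_p, box_c):
    """Intersection over union of two axis-aligned bounding boxes.

    Boxes are assumed to be (x1, y1, x2, y2) tuples; the disclosure
    does not fix a coordinate format.
    """
    # Intersection rectangle, i.e. area(P ∩ C).
    ix1, iy1 = max(box_p[0], box_c[0]), max(box_p[1], box_c[1])
    ix2, iy2 = min(box_p[2], box_c[2]), min(box_p[3], box_c[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union: area(P) + area(C) - area(P ∩ C).
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_c = (box_c[2] - box_c[0]) * (box_c[3] - box_c[1])
    union = area_p + area_c - inter
    return inter / union if union > 0 else 0.0

# Example: the object counts as tracked if IoU exceeds a pre-defined threshold.
tracked = iou((10, 10, 50, 50), (12, 14, 55, 52)) > 0.5
```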
However, a failure occurs when occlusion appears or when the object moves out of sight. Referring to FIG. 3A, in the previous frame a target car A is complete in its bounding box, while in the current frame the target car A is partially occluded by a truck B. As the overlapping area becomes very small, the IoU might drop below the threshold, and the target car A is wrongly determined to be untracked. In another case, referring to FIG. 3B, in the previous frame the target car A is partially occluded by a truck B, while in the current frame the target car A is completely occluded by the truck B. The bounding box of the truck B might then be taken as the bounding box of the target car A, the IoU for the two adjacent frames might exceed the predefined threshold, and it is wrongly determined that the target car A is tracked.
Summary of the Invention
In this disclosure, on one hand, we propose solutions for object tracking that take into account more reasonable judgement criteria, which can enhance tracking accuracy.
On the other hand, improvements are made to object reidentification in two images, with which similar-looking objects can be differentiated precisely and easily.
Embodiments of the present disclosure include methods, apparatuses for object tracking and methods, apparatuses for object reidentification.
According to a first aspect of the present disclosure, a method for object tracking is presented. The method includes following steps:
- acquiring a first frame in a first video;
- generating at least one first bounding box of the first frame via object detection;
- finding a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame;
- calculating change in size between the pair of nearest bounding boxes;
- determining, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
According to a second aspect of the present disclosure, an apparatus for object tracking is presented. The apparatus includes:
- a video frame acquisition module, configured to acquire a first frame in a first video;
- a bounding box generation module, configured to generate at least one first bounding box of the first frame via object detection;
- a calculation module, configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
According to a third aspect of the present disclosure, an apparatus for object tracking is presented. The apparatus includes at least one processor and at least one memory, coupled to the at least one processor, configured to execute the method according to the first aspect.
According to a fourth aspect of the present disclosure, a computer-readable medium for object tracking is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the first aspect.
Current tracking procedures only take IoU into account, which results in wrong judgements. With the solutions provided in the present disclosure, the change in size of the nearest bounding boxes is considered. Usually, the probability that nearest bounding boxes in adjacent frames correspond to the same object is high; however, different objects usually have different sizes, so a judgement based on the change in size helps determine whether the two nearest bounding boxes correspond to the same object.
Optionally, before determining based on the change in size whether the same object is detected in both of the nearest bounding boxes, the IoU of the pair of nearest bounding boxes can be calculated; if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, it can be determined that the same object is detected in both of the nearest bounding boxes. Based on both the IoU and the change in size, the judgement can be more precise.
Optionally, if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, it can be determined whether the same object is tracked through following procedure:
- determining position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
- determining, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
- acquiring the third frame;
- generating at least one third bounding box of the third frame via object detection;
- reidentifying same object in the second frame and the third frame;
- determining the same object is tracked if the same object is reidentified.
Taking into account the cases where occlusion appears or the object moves out of sight, the position of the bounding box can be checked further, and the object's moving direction can be estimated. To search for the target either in the original video or in a video captured by another camera, the reidentification procedure can be triggered to check whether the same object is reidentified in both the previous and next frames, in case the object is lost in the current frame.
Optionally, the same object can be reidentified via a neural network with a metric learning method, wherein the neural network is trained by a combination of a metric learning method and a representation learning method. With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
Optionally, the neural network can include:
- a backbone part, configured to extract features of image in the bounding boxes detected in frames of a video;
- a first fully connected layer and a second fully connected layer, connected to the backbone part, configured to synthesize and reduce data dimension of features coming out from the backbone part, wherein the metric learning method is applied on the first fully connected layer and the representation learning method is applied on the second fully connected layer;
- a loss function for classification part, connected to the second fully connected layer, and
- a contrastive loss function part, connected to the first fully connected layer.
Optionally, the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives. The selection of the dataset can influence the performance of the neural network. A dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features, and images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
According to a fifth aspect of the present disclosure, a method for object reidentification is presented. The method includes following steps:
- acquiring images;
- reidentifying objects in the acquired images via a neural network with a metric learning method, wherein the neural network is trained by a combination of a metric learning method and a representation learning method.
According to a sixth aspect of the present disclosure, an apparatus for object reidentification is presented. The apparatus includes:
- an image acquisition module, configured to acquire images;
- a reidentification module, configured to reidentify in the acquired images via a neural network with metric learning method, wherein neural network is trained by a combination of a metric learning method and a representation learning method.
According to a seventh aspect of the present disclosure, an apparatus for object reidentification is presented. The apparatus includes at least one processor and at least one memory, coupled to the at least one processor, configured to execute the method according to the fifth aspect.
According to an eighth aspect of the present disclosure, a computer-readable medium for object reidentification is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the fifth aspect.
With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
Optionally, the neural network can include:
- a backbone part, configured to extract features of the images;
- a first fully connected layer and a second fully connected layer, connected to the backbone part, configured to synthesize and reduce data dimension of features coming out from the backbone part, wherein the metric learning method is applied on the first fully connected layer and the representation learning method is applied on the second fully connected layer;
- a loss function for classification part, connected to the second fully connected layer, and
- a contrastive loss function part, connected to the first fully connected layer.
Optionally, the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives. A dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features, and images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
Brief Description of the Drawings
The above mentioned attributes and other features and advantages of the present technique and the manner of attaining them will become more apparent and the present technique itself will be better understood by reference to the following description of embodiments of the present technique taken in conjunction with the accompanying drawings, wherein:
FIG. 1 depicts scenarios in which vehicles are occluded or disappear.
FIG. 2 depicts current object tracking criteria.
FIG. 3A depicts the case in which a vehicle is wrongly untracked using current object tracking criteria.
FIG. 3B depicts the case in which a vehicle is wrongly tracked using current object tracking criteria.
FIG. 4 depicts a block diagram of an apparatus for object tracking in accordance with one embodiment of the present disclosure.
FIG. 5 depicts process of reidentification.
FIG. 6 depicts structure of a neural network in accordance with one embodiment of the present disclosure.
FIG. 7 and FIG. 8 depict flow diagrams of a method for object tracking in accordance with one embodiment of the present disclosure.
FIG. 9 depicts a block diagram of an apparatus for object reidentification in accordance with one embodiment of the present disclosure.
FIG. 10 depicts flow diagram of a method for object reidentification in accordance with one embodiment of the present disclosure.
FIG. 11 depicts examples of data resource for training a model used for object reidentification in accordance with one embodiment of the present disclosure.
FIG. 12 depicts object tracking result in accordance with one embodiment of the present disclosure.
Reference Numbers:
10, an apparatus for object tracking
101, at least one memory
102, at least one processor
103, a communication module
20, an object tracking program
104, a video frame acquisition module
105, a bounding box generation module
106, a calculation module
30, data acquired
100, 100’ methods for object tracking
S101~S105, steps of method 100
S1051~S1056, sub-steps of S105
50, an apparatus for object reidentification
501, at least one memory
502, at least one processor
503, a communication module
60, an object reidentification program
504, an image acquisition module
505, a reidentification module
70, images to be reidentified
80, dataset used for training a neural network 90
90, a neural network
901, a backbone part
902, a first fully connected layer
903, a second fully connected layer
904, a loss function for classification part
905, a constructive loss function part
Detailed Description of Example Embodiments
Hereinafter, above-mentioned and other features of the present technique are described in detail. Various embodiments are described with reference to the drawing, where like reference numerals are used to refer to like elements throughout. In the following description, for purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be noted that the illustrated embodiments are intended to explain, and not to limit the invention. It may be evident that such embodiments may be practiced without these specific details.
When introducing elements of various embodiments of the present disclosure, the articles “a” , “an” , “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising” , “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
Now the present disclosure will be described hereinafter in details by referring to FIG. 1 to FIG. 9.
FIG. 4 depicts a block diagram of an apparatus in accordance with one embodiment of the present disclosure. The apparatus 10 for object tracking presented in the present disclosure can be implemented as a network of computer processors, to execute the following method 100 for object tracking presented in the present disclosure. The apparatus 10 can also be a single computer, as shown in FIG. 4, including at least one memory 101, which includes a computer-readable medium such as a random access memory (RAM). The apparatus 10 also includes at least one processor 102, coupled with the at least one memory 101. Computer-executable instructions are stored in the at least one memory 101 and, when executed by the at least one processor 102, can cause the at least one processor 102 to perform the steps described herein. The at least one processor 102 may include a microprocessor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), state machines, etc. Embodiments of computer-readable media include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.
The at least one memory 101 shown in FIG. 4 can contain an object tracking program 20 which, when executed by the at least one processor 102, causes the at least one processor 102 to execute the method 100 or method 100' for object tracking presented in the present disclosure. Data 30, including videos of the target object, can also be stored in the at least one memory 101. The data 30 can be received via a communication module 103 of the apparatus 10.
The object tracking program 20 can include:
- a video frame acquisition module 104, configured to acquire a first frame in a first video;
- a bounding box generation module 105, configured to generate at least one first bounding box of the first frame via object detection;
- a calculation module 106, configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; and determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
A video includes frames of images, which are usually taken by cameras in order of time. Here, the first frame is the current frame and the second frame is the previous frame of the current frame. Usually there is more than one object in a frame, so at least one bounding box can be found in each frame; each bounding box corresponds to one object.
As mentioned above, current tracking procedures only take IoU into account, which results in wrong judgements. Here, the change in size of the nearest bounding boxes is considered. Usually, the probability that nearest bounding boxes in adjacent frames correspond to the same object is high; however, different objects usually have different sizes, so taking the change in size into account helps determine whether the two nearest bounding boxes correspond to the same object.
Optionally, the calculation module 106 is further configured to:
- before determining based on the change in size whether same object is detected in both of the nearest bounding boxes, calculate IoU of the pair of nearest bounding boxes;
- determine that the same object is detected in both of the nearest bounding boxes, if the change in size is smaller than a first threshold and the IoU is larger than a second threshold.
Since IoU and change in size are both taken into account to determine whether the same object is detected in both of the nearest bounding boxes, the judgement is more accurate. Detection of the same object in both of the nearest bounding boxes means that the same object is tracked. The first threshold and the second threshold can be set according to the actual application scenario. Note that, equivalently, the calculation module 106 can determine that the same object is detected in both of the nearest bounding boxes if the change in size is not larger than the first threshold and the IoU is not smaller than the second threshold.
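As a concrete reading of this combined test, a minimal sketch follows. The relative-area definition of "change in size" and both default thresholds are illustrative assumptions (the disclosure says only that thresholds are set per application scenario); iou() is the helper sketched in the Background Art section.

```python
def size_change(box_a, box_b):
    """Relative change in bounding-box area: one plausible reading of
    "change in size", which the disclosure does not define precisely."""
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return abs(area_b - area_a) / max(area_a, 1e-6)

def same_object(box_prev, box_curr, size_thresh=0.3, iou_thresh=0.5):
    """Same object iff the size change is small AND the overlap is large.
    Both threshold values are illustrative assumptions."""
    return (size_change(box_prev, box_curr) < size_thresh
            and iou(box_prev, box_curr) > iou_thresh)  # iou() from the earlier sketch
```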
Optionally, the calculation module 106 is further configured to: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
- determine position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
- determine based on the position relationship whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
- acquire the third frame;
- generate at least one third bounding box of the third frame via object detection;
- reidentify same object in the second frame and the third frame;
- determine that the same object reidentified is tracked if the same object is reidentified.
A change in size not smaller than the first threshold, or an IoU not larger than the second threshold, means that the nearest bounding boxes might not correspond to the same object; then further processing and judgement can be done for an accurate result.
Firstly, the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame can be determined, based on which it can be decided whether to acquire the next frame from the same video as the current frame or from other videos across cameras. Based on such a position judgement, object tracking across cameras can be executed, which can effectively increase the success rate of tracking a moving object. Once the third frame, i.e. the frame next to the current one, is acquired, bounding boxes can be generated in the third frame and the same object can be reidentified in the second frame and the third frame; once the same object is reidentified, it can be determined that the reidentified object is tracked.
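A minimal sketch of this border check follows. The pixel margin and the rule "a box touching the frame border means the object is leaving the field of view" are assumptions about one plausible implementation; the disclosure does not spell out the position test.

```python
def choose_next_frame_source(box, frame_w, frame_h, margin=20):
    """Decide where to acquire the third frame from.

    If the first bounding box touches the frame border, the object is
    probably leaving the field of view, so the next frame is taken from
    a second video across cameras; otherwise from the first video.
    The margin (in pixels) is an illustrative assumption.
    """
    x1, y1, x2, y2 = box
    at_border = (x1 <= margin or y1 <= margin
                 or x2 >= frame_w - margin or y2 >= frame_h - margin)
    return "second_video" if at_border else "first_video"
```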
Optionally, the calculation module 106 is further configured to reidentify via a neural network 90 with metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
The objective of reidentification is to estimate whether two images represent the same object. The basic idea is to extract the features of the images and to estimate their similarity according to their Euclidean distance. This process can be represented by FIG. 5.
Considering the strong ability of neural networks to extract abstract features, we apply one here to achieve this task. A well-trained network provides quite different features for different objects.
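The comparison step of FIG. 5 can be sketched as below, assuming the two feature vectors come out of such a network; the distance threshold is an assumption.

```python
import torch

def reidentified(feat_a, feat_b, dist_thresh=1.0):
    """Declare two images to show the same object when the Euclidean
    distance between their feature vectors is small enough.
    The threshold value is an illustrative assumption."""
    return torch.dist(feat_a, feat_b, p=2).item() < dist_thresh
```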
Metric learning methods are widely discussed in reidentification topics; they aim to separate the features of objects. Representation learning methods, in contrast, are rarely mentioned in these tasks; they serve to express the features of categories, so they are usually used for classification. Considering that the classification process of representation learning implies a synthesis of distinguishable characteristics, if we regard each object as a category, representation learning works for feature separation as well. With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, as with many cars on the road, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
Now, referring to FIG. 6, the neural network 90 used for object reidentification can include:
- a backbone part 901, configured to extract features of image in the bounding boxes detected in frames of a video;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce data dimension of features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a constructive loss function part 905, connected to the first fully connected layer 902.
The components in the dot-dashed-line rectangle are arranged according to the representation learning method; they work only during the training process. The components in the dashed-line rectangle are arranged based on the metric learning method; they work during both the training and the application process.
The backbone part 901 can be implemented as ResNet-50, which provides enough features for a large number of categories. We also tried the lighter ResNet-18 backbone, but its depth proved insufficient and errors occurred in the results.
For the first fully connected layer 902, the input/output dimensions can be 2048/128. It can be used to synthesize the features coming out of the backbone part 901. Moreover, the reduction of data dimension improves convergence during training and simplifies comparison during application.
For the second fully connected layer 903, the input/output dimensions can be 2048/901. It has a similar function to the first fully connected layer 902. Representation learning originally aims at classification; here we use it to differentiate the features of different objects, so the more object types we provide, the more distinguishable features we get. For example, for vehicle reidentification, we can select two datasets with 901 automobiles in total to train the neural network 90.
The loss function for classification part 904 can be implemented with Softmax.
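A PyTorch sketch of this structure follows, under stated assumptions: the backbone is torchvision's ResNet-50 with its classifier head removed, the metric branch is 2048/128 and the classification branch 2048/901, as described above. The layer names, the pooling detail and the absence of pretrained weights are assumptions, not requirements of the disclosure.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ReidNet(nn.Module):
    """Sketch of the network of FIG. 6: a ResNet-50 backbone (part 901),
    a 2048/128 metric branch (layer 902, used in training and application)
    and a 2048/901 classification branch (layer 903, training only)."""

    def __init__(self, num_identities=901):
        super().__init__()
        backbone = resnet50(weights=None)
        # Drop the ImageNet classifier; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.fc_metric = nn.Linear(2048, 128)             # first FC layer 902
        self.fc_class = nn.Linear(2048, num_identities)   # second FC layer 903

    def forward(self, x):
        feats = self.backbone(x).flatten(1)   # (B, 2048) backbone features
        embedding = self.fc_metric(feats)     # metric learning branch
        logits = self.fc_class(feats)         # classification (Softmax) branch
        return embedding, logits
```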
The contrastive loss function part 905 is a kind of metric learning loss function. We select it because: 1) ResNet-50 provides enough features; and 2) the representation learning branch helps to separate the features. A more complicated loss function (for example the triplet loss) increases the difficulty of convergence during the training process, while the improvement in separation ability is not obvious. The contrastive loss function can be defined as:
L = (1/(2N)) Σ_{i=1}^{N} [ y_i d_i^2 + (1 - y_i) max(margin - d_i, 0)^2 ]
where N is the number of image pairs per batch, d_i is the Euclidean distance between the two vectors coming out of the first fully connected layer 902, y_i is the label that marks whether the objects in the two images are the same (if yes, y_i = 1; otherwise, y_i = 0), and margin is the separation enforced between features of different objects.
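A PyTorch sketch of this loss follows, assuming the standard contrastive form reconstructed above; the margin value is an assumption.

```python
import torch

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss over a batch of N image pairs.

    d: (N,) tensor of Euclidean distances d_i between pair embeddings;
    y: (N,) float tensor of labels y_i (1 = same object, 0 = different).
    """
    same = y * d.pow(2)                                      # pull same-object pairs together
    diff = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # push different pairs apart
    return (same + diff).mean() / 2                          # (1/(2N)) * sum over the batch
```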
It should be mentioned that, as an example, there are only two inputs shown in FIG. 6: the second frame (the previous frame) and the third frame (the next frame). However, there can be more than two inputs; the same object will then be reidentified from all the input images.
Optionally, the neural network 90 can be trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives. The selection of the dataset influences the performance of the neural network 90. Taking vehicles as an example, as shown in FIG. 11, we choose a dataset with a large quantity of automobiles. Some of them have similar appearance, which helps the neural network 90 concentrate on the distinguishable features. Moreover, for each automobile, the dataset 80 can cover images captured from different perspectives. As a result, the neural network 90 will be able to match multiple views (e.g. the front and the rear) of a car in actual use.
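To illustrate how the two branches are combined during training, the following hedged sketch performs one training step on a pair of images, reusing ReidNet and contrastive_loss from the sketches above. The equal weighting of the two losses and the per-image class labels are our assumptions, not details fixed by the disclosure.

```python
import torch.nn.functional as F

def training_step(model: ReidNet, img_a, img_b, same_label, class_a, class_b):
    """One combined step: contrastive loss on the 128-d embeddings (metric branch)
    plus softmax cross-entropy on the 901-way logits (representation branch)."""
    emb_a, logits_a = model(img_a)
    emb_b, logits_b = model(img_b)
    d = torch.norm(emb_a - emb_b, dim=1)            # Euclidean distance per image pair
    loss_metric = contrastive_loss(d, same_label)
    loss_repr = F.cross_entropy(logits_a, class_a) + F.cross_entropy(logits_b, class_b)
    return loss_metric + loss_repr                  # equal weighting is an assumption
```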
Although the video frame acquisition module 104, the bounding box generation module 105 and the calculation module 106 are described above as software modules of the object tracking program 20, they can also be implemented via hardware, such as ASIC chips. They can be integrated into one chip, or implemented separately and electrically connected.
It should be mentioned that the present disclosure may include apparatuses having a different architecture from that shown in FIG. 4. The architecture above is merely exemplary and is used to explain the exemplary method 100 shown in FIG. 7 and the method 100' shown in FIG. 8.
Various methods in accordance with the present disclosure may be carried out. One exemplary method 100 according to the present disclosure includes the following steps:
S101: acquiring a first frame in a first video, that is, the current frame;
S102: generating at least one first bounding box of the first frame via object detection;
S103: finding a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame, that is, the previous frame;
S104: calculating a change in size between the pair of nearest bounding boxes;
S105: determining, based on the change in size, whether the same object is detected in both of the nearest bounding boxes.
Now referring to FIG. 8, in comparison to method 100, optionally, before step S105, the IoU of the pair of nearest bounding boxes can be calculated in step S103'. Then step S105 can include the following sub-steps:
S1051: if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, determining that the same object is detected in both of the nearest bounding boxes.
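As a concrete illustration of sub-step S1051, the decision can be sketched as follows. The box format and the threshold values are illustrative assumptions only; the disclosure does not fix them.

```python
def iou_and_size_change(prev_box, cur_box):
    """Boxes as (x1, y1, x2, y2). Returns (IoU, relative change in size)."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    ix1, iy1 = max(prev_box[0], cur_box[0]), max(prev_box[1], cur_box[1])
    ix2, iy2 = min(prev_box[2], cur_box[2]), min(prev_box[3], cur_box[3])
    inter = area((ix1, iy1, ix2, iy2))
    iou = inter / (area(prev_box) + area(cur_box) - inter)
    size_change = abs(area(cur_box) - area(prev_box)) / area(prev_box)
    return iou, size_change

def same_object_tracked(prev_box, cur_box, size_thresh=0.3, iou_thresh=0.5):
    iou, size_change = iou_and_size_change(prev_box, cur_box)
    return size_change < size_thresh and iou > iou_thresh  # sub-step S1051
```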
The current object tracking procedure completes when the nearest bounding boxes are found in two adjacent frames. As mentioned above, a failure may occur when occlusion appears or when the object moves out of sight. In that case, the target must be searched for either in the original video or in a video captured by another camera.
Therefore, the position of the bounding box can be checked further here. Normally, the bounding box of an object should have a stable shape and velocity. As shown in FIG. 2, the object is tracked if the bounding boxes in the current and previous frames largely overlap (high IoU) and the change in size is trivial. Otherwise, there may be an occluder, or an object disappearing at the border of the frame. To further distinguish these two cases, the location of the bounding box can be checked; then the object's moving direction can be estimated, and the reidentification procedure can be triggered in the following sub-steps:
S1052: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, determining the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
Then, after sub-step S1052, the following sub-steps can be executed for object tracking:
S1053: determining, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
S1054: acquiring the third frame;
S1055: generating at least one third bounding box of the third frame via object detection;
S1056: reidentifying the same object in the second frame and the third frame.
If the same object is reidentified, the procedure can go back to sub-step S1051, and the reidentified object is determined to be tracked.
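The decision in sub-steps S1052/S1053 can be sketched as follows; this is our illustrative reading of the position check, and the pixel margin is an assumption.

```python
def choose_source_for_third_frame(box, frame_w, frame_h, border_margin=10):
    """Sketch of sub-steps S1052/S1053: if the unmatched bounding box touches
    the border of the first frame, the object has likely moved out of sight,
    so the third frame is acquired from the second video (another camera);
    otherwise an occlusion inside the frame is assumed and the third frame is
    taken from the first video. border_margin (pixels) is an assumption."""
    x1, y1, x2, y2 = box
    at_border = (x1 <= border_margin or y1 <= border_margin
                 or x2 >= frame_w - border_margin or y2 >= frame_h - border_margin)
    return "second_video" if at_border else "first_video"
```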
Optionally, in sub-step S1056, the reidentification can be done via a neural network 90 with a metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
Optionally, the neural network 90 can include:
- a backbone part 901, configured to extract features of the images in the bounding boxes detected in frames of a video;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce the data dimension of the features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a contrastive loss function part 905, connected to the first fully connected layer 902.
Optionally, the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
Now referring to FIG. 9 and FIG. 10, an apparatus 50 and a method 200 for object reidentification are also provided in the present disclosure. They can be used for object reidentification in the above-mentioned apparatus 10 and method 100'.
The apparatus 50 has a similar structure to the apparatus 10, so only the differing parts are introduced herein; for the rest, refer to the description of the apparatus 10.
As shown in FIG. 9, the apparatus 50 can include:
- at least one processor 502;
- at least one memory 501 coupled to the at least one processor 502, containing an object reidentification program 60 which, when executed by the at least one processor 502, causes the at least one processor 502 to execute the method 200 for object reidentification presented in the present disclosure.
The images 70 to be reidentified and the dataset 80 used for training the neural network 90 can also be stored in the at least one memory 501. These data can be received via a communication module 503 of the apparatus 50.
The object reidentification program 60 can include:
- an image acquisition module 504, configured to acquire images 70;
- a reidentification module 505, configured to reidentify the same object in the acquired images 70 via a neural network 90 with a metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
Optionally, the neural network 90 can include:
- a backbone part 901, configured to extract features of the acquired images 70;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce the data dimension of the features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a contrastive loss function part 905, connected to the first fully connected layer 902.
For details of the neural network 90, refer to FIG. 6 and the corresponding description above.
Optionally, the neural network 90 can be trained based on a dataset 80 comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives, as shown in FIG. 11.
Although the image acquisition module 504 and the reidentification module 505 are described above as software modules of the object reidentification program 60, they can also be implemented via hardware, such as ASIC chips. They can be integrated into one chip, or implemented separately and electrically connected.
It should be mentioned that the present disclosure may include apparatuses having a different architecture from that shown in FIG. 9. The architecture above is merely exemplary and is used to explain the exemplary method 200 shown in FIG. 10.
Various methods in accordance with the present disclosure may be carried out. One exemplary method 200 according to the present disclosure includes the following steps:
S201: acquiring images 70;
S202: reidentifying the same object in the acquired images 70 via a neural network 90 with a metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
Optionally, the neural network 90 can include:
- a backbone part 901, configured to extract features of the images 70;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce data dimension of features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a contrastive loss function part 905, connected to the first fully connected layer 902.
Optionally, the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
A computer-readable medium is also provided in the present disclosure, storing computer-executable instructions which, upon execution by a computer, enable the computer to execute any of the methods presented in this disclosure.
A computer program is also provided, which, when executed by at least one processor, performs any of the methods presented in this disclosure.
FIG. 12 shows test results of the solutions provided in the present disclosure. The solution was tested on a GeForce RTX 2080 Ti with videos of 1080p resolution, captured by two cameras along a street, so one car appears in both videos. It takes 17 ms to compare a pair of cars in the frames. As the number of image pairs increases (n pairs, for example), the operation time grows, but not multiplicatively (t < 17n ms), because the images are packed into a single matrix for calculation on the GPU. This module is triggered only when tracking fails, so its speed can meet the requirements of actual use.
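The batching behaviour described above can be sketched as follows, reusing the ReidNet sketch from earlier; the function assumes the candidate images are already cropped and stacked into tensors.

```python
@torch.no_grad()
def batched_pair_distances(model: ReidNet, queries: torch.Tensor, candidates: torch.Tensor):
    """queries/candidates: (n, 3, H, W) tensors forming n image pairs. Packing
    all pairs into two forward passes is why the runtime grows sublinearly
    (t < 17n ms) rather than costing 17 ms per pair."""
    emb_q, _ = model(queries)
    emb_c, _ = model(candidates)
    return torch.norm(emb_q - emb_c, dim=1)  # small distance => likely the same object
```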
An object tracking enhancement solution is provided in the present disclosure, which serves first to detect tracking failures and then to find the same object in other frames. Frames used for object tracking are not limited to adjacent frames of the same video; they may come from videos captured by two cameras at a long distance from each other, which effectively solves the problem of object tracking across cameras.
An object reidentification solution is also provided. With the combination of the metric learning method and the representation learning method, the neural network used for object reidentification can effectively and precisely separate the features of different objects, especially similar-looking ones. Choosing a dataset of different objects with similar appearance and of different perspectives of each object enables the neural network to recognize objects from many perspectives. When reidentifying vehicles, the neural network works even when the images are captured from the front or the rear of a vehicle.
The neural network can easily be used in other scenarios after training, so the target object is not limited to vehicles; the solution is also applicable to other contexts, such as person tracking. The solution provided in the present disclosure can be widely applied in many surveillance scenarios. We have already mentioned the enhancement of tracking in traffic; it may also serve to track patients in hospitals or the aged in nursing homes.
While the present technique has been described in detail with reference to certain embodiments, it should be appreciated that the present technique is not limited to those precise embodiments. Rather, in view of the present disclosure, which describes exemplary modes for practicing the invention, many modifications and variations would present themselves to those skilled in the art without departing from the scope and spirit of this invention. The scope of the invention is, therefore, indicated by the following claims rather than by the foregoing description. All changes, modifications, and variations coming within the meaning and range of equivalency of the claims are to be considered within their scope.

Claims (22)

  1. A method (100) for object tracking, comprising:
    - acquiring (S101) a first frame in a first video;
    - generating (S102) at least one first bounding box of the first frame via object detection;
    - finding (S103) a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame;
    - calculating (S104) a change in size between the pair of nearest bounding boxes;
    - determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes.
  2. The method (100) according to claim 1, wherein
    - before determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes, the method further comprises: calculating (S103') the IoU of the pair of nearest bounding boxes;
    - determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes comprises: if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, determining (S1051) that the same object is detected in both of the nearest bounding boxes.
  3. The method according to claim 2, wherein determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes further comprises: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
    - determining (S1052) the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
    - determining (S1053), based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
    - acquiring (S1054) the third frame;
    - generating (S1055) at least one third bounding box of the third frame via object detection;
    - reidentifying (S1056) the same object in the second frame and the third frame;
    - determining (S1051) that the reidentified object is tracked if the same object is reidentified.
  4. The method (100) according to claim 3, wherein reidentifying (S1056) the same object in the second frame and the third frame further comprises:
    - reidentifying (S1056) via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  5. The method (100) according to claim 4, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the images in the bounding boxes detected in frames of a video;
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  6. The method (100) according to claim 4, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  7. An apparatus (10) for object tracking, comprising:
    - a video frame acquisition module (104) , configured to acquire a first frame in a first video;
    - a bounding box generation module (105) , configured to generate at least one first bounding box of the first frame via object detection;
    - a calculation module (106) , configured to:
    - find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame;
    - calculate a change in size between the pair of nearest bounding boxes;
    - determine, based on the change in size, whether the same object is detected in both of the nearest bounding boxes.
  8. The apparatus (10) according to claim 7, wherein the calculation module (106) is further configured to:
    - before determining, based on the change in size, whether the same object is detected in both of the nearest bounding boxes, calculate the IoU of the pair of nearest bounding boxes;
    - determine that the same object is detected in both of the nearest bounding boxes if the change in size is smaller than a first threshold and the IoU is larger than a second threshold.
  9. The apparatus (10) according to claim 8, wherein the calculation module (106) is further configured to: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
    - determine the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
    - determine, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
    - acquire the third frame;
    - generate at least one third bounding box of the third frame via object detection;
    - reidentify the same object in the second frame and the third frame;
    - determine that the reidentified object is tracked if the same object is reidentified.
  10. The apparatus (10) according to claim 9, wherein the calculation module (106) is further configured to:
    - reidentify via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  11. The apparatus (10) according to claim 10, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the images in the bounding boxes detected in frames of a video;
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  12. The apparatus (10) according to claim 10, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  13. An apparatus (10) for object tracking, comprising:
    - at least one processor (102) ;
    - at least one memory (101), coupled to the at least one processor (102), configured to execute the method according to any of claims 1~6.
  14. A computer-readable medium for object tracking, storing computer-executable instructions, wherein the computer-executable instructions, when executed, cause at least one processor to execute the method according to any of claims 1~6.
  15. A method (200) for object reidentification, comprising:
    - acquiring (S201) images (70);
    - reidentifying (S202) the same object in the acquired images (70) via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  16. The method according to claim 15, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the images (70);
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  17. The method according to claim 15, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  18. An apparatus (50) for object reidentification, comprising:
    - an image acquisition module (504) , configured to acquire images (70) ;
    - a reidentification module (505), configured to reidentify the same object in the acquired images (70) via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  19. The apparatus (50) according to claim 18, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the acquired images (70);
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  20. The apparatus (50) according to claim 18, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  21. An apparatus (50) for object reidentification, comprising:
    - at least one processor (502);
    - at least one memory (501), coupled to the at least one processor (502), configured to execute the method according to any of claims 15~17.
  22. A computer-readable medium for object reidentification, storing computer-executable instructions, wherein the computer-executable instructions, when executed, cause at least one processor to execute the method according to any of claims 15~17.