WO2021237749A1 - Method and apparatus for object tracking and reidentification - Google Patents

Method and apparatus for object tracking and reidentification

Info

Publication number
WO2021237749A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
fully connected
connected layer
neural network
learning method
Prior art date
Application number
PCT/CN2020/093538
Other languages
French (fr)
Inventor
Xiaoyi YANG
Original Assignee
Siemens Aktiengesellschaft
Siemens Ltd., China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft, Siemens Ltd., China filed Critical Siemens Aktiengesellschaft
Priority to PCT/CN2020/093538 priority Critical patent/WO2021237749A1/en
Publication of WO2021237749A1 publication Critical patent/WO2021237749A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • the present invention relates to techniques of image processing, and more particularly to a method, apparatus and computer-readable storage medium for object tracking and reidentification.
  • Moving object tracking is a widely used technique. Taking vehicles as an example, current tracking procedures are not very reliable: they rely on the comparison of vehicle positions in adjacent frames of a video, and the process is often interrupted by occlusion or by the disappearance of an object within the video, as shown in FIG. 1.
  • the nearest bounding boxes in two adjacent frames are the ones having the biggest overlapping area, that is area(P)∩area(C), as shown in FIG. 2.
  • IoU (intersection over union) can be used to judge the nearest bounding boxes: if the IoU is larger than a pre-defined threshold, it is determined that nearest bounding boxes for the same object are found, and therefore the object is tracked.
  • Embodiments of the present disclosure include methods, apparatuses for object tracking and methods, apparatuses for object reidentification.
  • a method for object tracking includes following steps:
  • an apparatus for object tracking includes:
  • a video frame acquisition module configured to acquire a first frame in a first video
  • a bounding box generation module configured to generate at least one first bounding box of the first frame via object detection
  • a calculation module configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
  • an apparatus for object tracking includes at least one processor; at least one memory, coupled to the at least one processor, configured to execute method according to the first aspect.
  • a computer-readable medium for object tracking stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the first aspect.
  • IoU of the pair of nearest bounding boxes can be calculated, and if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, it can be determined that the same object is detected in both of the nearest bounding boxes. Based on both the IoU and the change in size, the judgement can be more precise.
  • the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, it can be determined whether the same object is tracked through following procedure:
  • taking into account the cases where occlusion appears or the object moves out of sight, the position of the bounding box can be checked further, and the object's moving direction can be estimated. To search for the target either in the original video or in a video captured by another camera, the reidentification procedure can be triggered to check whether the same object is reidentified in both the previous and next frames, in case the object is lost in the current frame.
  • the same object can be reidentified via a neural network with a metric learning method, wherein the neural network is trained by a combination of a metric learning method and a representation learning method.
  • with the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them.
  • reidentification with metric learning can be more precise and efficient.
  • the complexity of training can also be reduced significantly.
  • the neural network can include:
  • a backbone part configured to extract features of image in the bounding boxes detected in frames of a video
  • the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • the selection of dataset can influence the performance of the neural network.
  • a dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features.
  • images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
  • a method for object reidentification includes following steps:
  • neural network is trained by a combination of a metric learning method and a representation learning method.
  • an apparatus for object reidentification includes:
  • an image acquisition module configured to acquire images
  • a reidentification module configured to reidentify in the acquired images via a neural network with metric learning method, wherein neural network is trained by a combination of a metric learning method and a representation learning method.
  • an apparatus for object reidentification includes at least one processor; at least one memory, coupled to the at least one processor, configured to execute method according to the fifth aspect.
  • a computer-readable medium for object reidentification stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the fifth aspect.
  • with the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them.
  • reidentification with metric learning can be more precise and efficient.
  • the complexity of training can also be reduced significantly.
  • the neural network can include:
  • the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • a dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features.
  • images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
  • FIG. 1 depicts scenarios in which vehicles are occluded or disappear.
  • FIG. 2 depicts current object tracking criteria.
  • FIG. 3A depicts the case in which a vehicle is wrongly untracked using current object tracking criteria.
  • FIG. 3B depicts the case in which a vehicle is wrongly tracked using current object tracking criteria.
  • FIG. 4 depicts a block diagram of an apparatus for object tracking in accordance with one embodiment of the present disclosure.
  • FIG. 5 depicts process of reidentification.
  • FIG. 6 depicts structure of a neural network in accordance with one embodiment of the present disclosure.
  • FIG. 7 and FIG. 8 depict flow diagrams of a method for object tracking in accordance with one embodiment of the present disclosure.
  • FIG. 9 depicts a block diagram of an apparatus for object reidentification in accordance with one embodiment of the present disclosure.
  • FIG. 10 depicts flow diagram of a method for object reidentification in accordance with one embodiment of the present disclosure.
  • FIG. 11 depicts examples of data resource for training a model used for object reidentification in accordance with one embodiment of the present disclosure.
  • FIG. 12 depicts object tracking result in accordance with one embodiment of the present disclosure.
  • the articles “a” , “an” , “the” and “said” are intended to mean that there are one or more of the elements.
  • the terms “comprising” , “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
  • FIG. 4 depicts a block diagram of an apparatus in accordance with one embodiment of the present disclosure.
  • the apparatus 10 for object tracking presented in the present disclosure can be implemented as a network of computer processors, to execute following method 100 for object tracking presented in the present disclosure.
  • the apparatus 10 can also be a single computer, as shown in FIG. 4, including at least one memory 101, which includes computer-readable medium, such as a random access memory (RAM) .
  • the apparatus 10 also includes at least one processor 102, coupled with the at least one memory 101.
  • Computer-executable instructions are stored in the at least one memory 101, and when executed by the at least one processor 102, can cause the at least one processor 102 to perform the steps described herein.
  • the at least one processor 102 may include a microprocessor, an application specific integrated circuit (ASIC) , a digital signal processor (DSP) , a central processing unit (CPU) , a graphics processing unit (GPU) , state machines, etc.
  • embodiments of computer-readable media include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions.
  • various other forms of computer-readable medium may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless.
  • the instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.
  • the at least one memory 101 shown in FIG. 4 can contain an object tracking program 20, when executed by the at least one processor 102, causing the at least one processor 102 to execute the method 100 or method 100’ for object tracking presented in the present disclosure.
  • Data 30, including videos of the target object can also be stored in the at least one memory 101.
  • the data 30 can be received via a communication module 103 of the apparatus 10.
  • the object tracking program 20 can include:
  • a video frame acquisition module 104 configured to acquire a first frame in a first video
  • a bounding box generation module 105 configured to generate at least one first bounding box of the first frame via object detection
  • a calculation module 106 configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; and determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
  • a video includes frames of images, which are usually taken by cameras in order of time.
  • the first frame is the current frame
  • the second frame is the previous frame of the current frame.
  • usually there is more than one object in a frame, so at least one bounding box can be found in each frame; each bounding box corresponds to one object.
  • calculation module 106 is further configured to:
  • the first threshold and the second threshold can be set according to the actual application scenario. Note that, equivalently, the calculation module 106 can determine that the same object is detected in both of the nearest bounding boxes if the change in size is not larger than the first threshold and the IoU is not smaller than the second threshold.
  • the calculation module 106 is further configured to: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
  • a change in size not smaller than the first threshold, or an IoU not larger than the second threshold, means that the nearest bounding boxes might not correspond to the same object; then further processing and judgement can be done for an accurate result.
  • the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame can be determined, based on which it can be decided whether to acquire the next frame from the same video as the current frame or from other videos across cameras. Based on such a position judgement, object tracking across cameras can be executed, which can effectively increase the success rate of tracking a moving object.
  • bounding boxes can be generated in the third frame and same object can be reidentified in the second frame and the third frame; once the same object is reidentified, it can be determined that the same object reidentified is tracked.
  • the calculation module 106 is further configured to reidentify via a neural network 90 with metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
  • the objective of reidentification is to estimate whether two images represent the same object.
  • the basic idea is to extract the features of images and to estimate their similarity according to their Euclidean distance. This process can be represented by FIG. 5.
  • Metric learning methods are widely discussed in reidentification topics; they aim to separate the features of objects. Representation learning methods, in contrast, are rarely mentioned in these tasks; they serve to express the features of categories, so they are usually used for classification. Considering that the classification process of representation learning implies a synthesis of distinguishable characteristics, if we regard each object as a category, representation learning works for feature separation as well. With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, as with many cars on the road, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
  • the neural network 90 used for object reidentification can include:
  • a backbone part 901 configured to extract features of image in the bounding boxes detected in frames of a video
  • the components in dot dashed-line rectangle are arranged according to representation learning methods. They work during the training process.
  • the components in dashed-line rectangle are arranged based on metric learning methods. They work during both the training and the application process.
  • the backbone part 901 can be implemented as ResNet-50, which provides enough features for a large number of categories. We also tried the lighter ResNet-18 backbone, but its depth proved insufficient and errors occurred in the results.
  • for the first fully connected layer 902, the input/output dimensions can be 2048/128. It can be used to synthesize the features coming out of the backbone part 901; moreover, the reduction of data dimension improves convergence during training and simplifies comparison during application.
  • for the second fully connected layer 903, the input/output dimensions can be 2048/901. It has a similar function to the first fully connected layer 902.
  • representation learning originally aims at classification; here we use it to differentiate the features of different objects, so the more object types we provide, the more distinguishable features we get. For example, for vehicle reidentification, we can select two datasets with 901 automobiles in total to train the neural network 90.
  • the loss function for classification part 904 can be implemented with Softmax.
  • the contrastive loss function part 905 is a kind of metric learning loss function. We select it because: 1) ResNet-50 provides enough features; and 2) the representation learning branch helps to separate the features. A more complicated loss function (for example the triplet loss) increases the difficulty of convergence during training, while the improvement in separation ability is not obvious.
  • the contrastive loss function can be defined as L = (1/(2N)) Σ_{i=1}^{N} [ y_i d_i^2 + (1 - y_i) max(margin - d_i, 0)^2 ]
  • N is the number of image pairs per batch
  • d_i is the Euclidean distance of the two vectors coming out of the first fully connected layer 902
  • the neural network 90 can be trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • the selection of the dataset can influence the performance of the neural network 90. Taking vehicles as an example, as shown in FIG. 11, we selected a dataset with a large number of automobiles. Some of them have similar appearance, which helps the neural network 90 concentrate on the distinguishable features. Moreover, for each automobile, the dataset 80 can cover images captured from different perspectives; as a result, the neural network 90 will be able to match multiple views (e.g. the front and the rear) of a car in actual use.
  • although the video frame acquisition module 104, the bounding box generation module 105 and the calculation module 106 are described above as software modules of the object tracking program 20, they can also be implemented in hardware, such as ASIC chips. They can be integrated into one chip or separately implemented and electrically connected.
  • the architecture above is merely exemplary and is used to explain the exemplary method 100 shown in FIG. 7 and method 100' shown in FIG. 8.
  • One exemplary method 100 according to the present disclosure includes following steps:
  • S103 finding a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame, that is the previous frame;
  • S105 determining, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
  • in comparison to method 100, optionally, before the step S105, in a step S103', the IoU of the pair of nearest bounding boxes can be calculated; then the step S105 can include the following sub-steps:
  • S1051 if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, determining that same object is detected in both of the nearest bounding boxes.
  • normally, the bounding box of an object should have stable shape and velocity.
  • the object is tracked if the bounding boxes in the current and previous frames largely overlap (high IoU) and the change in size is trivial; otherwise, there may be an occluder, or an object disappearing at the border of the frame.
  • the location of the bounding box can be checked further, then the object's moving direction can be estimated, and the reidentification procedure can be triggered in the following sub-steps:
  • S1052 if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, determining position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
  • S1053 determining, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
  • the procedure can go back to the sub-step S1051; the same object reidentified can be determined to be tracked.
  • the reidentification can be done via a neural network 90 with metric learning method, wherein neural network 90 is trained by a combination of a metric learning method and a representation learning method.
  • the neural network 90 can include:
  • a backbone part 901 configured to extract features of image in the bounding boxes detected in frames of a video
  • the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • an apparatus 50 and a method 200 for object reidentification are also provided in the present disclosure. They can be used for object reidentification in the above-mentioned apparatus 10 and method 100'.
  • the apparatus 50 has a similar structure to the apparatus 10, so only the differing parts are introduced here; for the rest, refer to the description of the apparatus 10.
  • the apparatus 50 can include:
  • at least one memory 501, coupled to the at least one processor 502, containing an object reidentification program 60 which, when executed by the at least one processor 502, causes the at least one processor 502 to execute the method 200 for object reidentification presented in the present disclosure.
  • Images 70 to be reidentified and dataset 80 used for training a neural network 90 can also be stored in the at least one memory 501. These data can be received via a communication module 503 of the apparatus 50.
  • the object reidentification program 60 can include:
  • an image acquisition module 504 configured to acquire images 70;
  • a reidentification module 505 configured to reidentify in the acquired images 70 via a neural network 90 with metric learning method, wherein neural network 90 is trained by a combination of a metric learning method and a representation learning method.
  • the neural network 90 can include:
  • a backbone part 901 configured to extract features of the first image and a second image
  • the neural network 90 can be trained based on a dataset 80 comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives, as shown in FIG. 11.
  • although the image acquisition module 504 and the reidentification module 505 are described above as software modules of the object reidentification program 60, they can also be implemented in hardware, such as ASIC chips. They can be integrated into one chip or separately implemented and electrically connected.
  • the architecture above is merely exemplary and is used to explain the exemplary method 200 shown in FIG. 10.
  • One exemplary method 200 according to the present disclosure includes following steps:
  • S202 reidentifying in the acquired images 70 via a neural network 90 with metric learning method, wherein neural network is trained by a combination of a metric learning method and a representation learning method.
  • the neural network 90 can include:
  • the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  • a computer-readable medium is also provided in the present disclosure, storing computer-executable instructions, which upon execution by a computer, enables the computer to execute any of the methods presented in this disclosure.
  • a computer program which is being executed by at least one processor and performs any of the methods presented in this disclosure.
  • FIG. 12 shows test results of solutions provided in the present disclosure.
  • the solution was tested on a GeForce RTX 2080Ti with videos of 1080p resolution, captured by cameras along a street, so the same car appears in both videos. It takes 17 ms to compare a pair of cars in the frames. As the number of image pairs increases (n pairs, for example), the operation time grows, but not multiplicatively (t < 17n), because the images are packed as one matrix for calculation on the GPU. This module is triggered only if tracking fails, so its speed can meet the requirements of actual use.
  • An object tracking enhancement solution is provided in the present disclosure, which serves first to detect tracking failures and then to find the same object in other frames.
  • frames used for object tracking are not limited to adjacent frames of the same video; they may come from videos captured by two cameras far apart, which effectively solves the problem of object tracking across cameras.
  • an object reidentification solution is also provided: with the combination of a metric learning method and a representation learning method, the neural network used for object reidentification can effectively and precisely separate the features of different objects, especially similar-looking ones.
  • the choice of a dataset containing similar-looking objects and multiple perspectives of each object enables the neural network to recognize objects from many perspectives.
  • the neural network can work even when images are captured from the front or the back of a vehicle.
  • the neural network can easily be used in other scenarios after training, so the target object is not limited to vehicles; the solution is also applicable to other contexts (e.g. person tracking and object tracking).
  • the solution provided in the present disclosure can be widely applied in many surveillance scenarios. The enhancement of tracking in traffic has already been mentioned; it may also serve to track patients in hospitals or the elderly in nursing homes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus, system and computer-readable medium for object tracking and reidentification are presented. An object tracking enhancement solution is provided in the present disclosure, which serves first to detect tracking failures and then to find the same object in other frames. An object reidentification solution is also provided: with the combination of a metric learning method and a representation learning method, the neural network used for object reidentification can effectively and precisely separate the features of different objects, especially similar-looking ones.

Description

Method and apparatus for object tracking and reidentification Technical Field
The present invention relates to techniques of image processing, and more particularly to a method, apparatus and computer-readable storage medium for object tracking and reidentification.
Background Art
Moving object tracking is a widely used technique. Taking vehicles as an example, current tracking procedures are not very reliable: they rely on the comparison of vehicle positions in adjacent frames of a video, and the process is often interrupted by occlusion or by the disappearance of an object within the video, as shown in FIG. 1.
Most current tracking procedures complete when the nearest bounding boxes are found in two adjacent frames, as shown in FIG. 2. The nearest bounding boxes in two adjacent frames are the ones having the biggest overlapping area, that is area(P)∩area(C). In practice, IoU (intersection over union) can be used for the judgement of nearest bounding boxes:
IoU = area(P ∩ C) / area(P ∪ C)
If IoU is larger than a pre-defined threshold, it is determined that nearest bounding boxes for the same object are found, therefore the object is tracked.
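For illustration, below is a minimal Python sketch of this IoU test. The (x1, y1, x2, y2) box format, the threshold value and the example boxes are assumptions for the sketch, not details fixed by the disclosure.

```python
def iou(box_p, box_c):
    """Intersection over union of two axis-aligned bounding boxes.

    Boxes are assumed to be (x1, y1, x2, y2) tuples; the disclosure
    does not fix a coordinate format.
    """
    # Intersection rectangle, i.e. area(P ∩ C).
    ix1, iy1 = max(box_p[0], box_c[0]), max(box_p[1], box_c[1])
    ix2, iy2 = min(box_p[2], box_c[2]), min(box_p[3], box_c[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union: area(P) + area(C) - area(P ∩ C).
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_c = (box_c[2] - box_c[0]) * (box_c[3] - box_c[1])
    union = area_p + area_c - inter
    return inter / union if union > 0 else 0.0

# Example: the object counts as tracked if IoU exceeds a pre-defined threshold.
tracked = iou((10, 10, 50, 50), (12, 14, 55, 52)) > 0.5
```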
However, a failure occurs when occlusion appears or when the object moves out of sight. Referring to FIG. 3A, in the previous frame a target car A is complete in its bounding box, while in the current frame the target car A is partially occluded by a truck B. As the overlapping area becomes very small, the IoU might drop below the threshold, and the target car A is wrongly determined to be untracked. In another case, referring to FIG. 3B, in the previous frame the target car A is partially occluded by a truck B, while in the current frame the target car A is completely occluded by the truck B. The bounding box of the truck B might then be taken as the bounding box of the target car A, the IoU for the two adjacent frames might exceed the predefined threshold, and it is wrongly determined that the target car A is tracked.
Summary of the Invention
In this disclosure, on one hand, we propose solutions for object tracking that take into account more reasonable judgement criteria, which can enhance tracking accuracy.
On the other hand, improvements are made to object reidentification in two images, with which similar-looking objects can be differentiated precisely and easily.
Embodiments of the present disclosure include methods, apparatuses for object tracking and methods, apparatuses for object reidentification.
According to a first aspect of the present disclosure, a method for object tracking is presented. The method includes following steps:
- acquiring a first frame in a first video;
- generating at least one first bounding box of the first frame via object detection;
- finding a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame;
- calculating change in size between the pair of nearest bounding boxes;
- determining, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
According to a second aspect of the present disclosure, an apparatus for object tracking is presented. The apparatus includes:
- a video frame acquisition module, configured to acquire a first frame in a first video;
- a bounding box generation module, configured to generate at least one first bounding box of the first frame via object detection;
- a calculation module, configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
According to a third aspect of the present disclosure, an apparatus for object tracking is presented. The apparatus includes at least one processor and at least one memory, coupled to the at least one processor, configured to execute the method according to the first aspect.
According to a fourth aspect of the present disclosure, a computer-readable medium for object tracking is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the first aspect.
Current tracking procedures only take IoU into account, which results in wrong judgements. With the solutions provided in the present disclosure, the change in size of the nearest bounding boxes is considered. Usually, the probability that nearest bounding boxes in adjacent frames correspond to the same object is high; however, different objects usually have different sizes, so a judgement based on the change in size helps determine whether the two nearest bounding boxes correspond to the same object.
Optionally, before determining based on the change in size whether the same object is detected in both of the nearest bounding boxes, the IoU of the pair of nearest bounding boxes can be calculated; if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, it can be determined that the same object is detected in both of the nearest bounding boxes. Based on both the IoU and the change in size, the judgement can be more precise.
Optionally, if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, it can be determined whether the same object is tracked through following procedure:
- determining position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
- determining, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
- acquiring the third frame;
- generating at least one third bounding box of the third frame via object detection;
- reidentifying same object in the second frame and the third frame;
- determining the same object is tracked if the same object is reidentified.
Taking into account the cases where occlusion appears or the object moves out of sight, the position of the bounding box can be checked further, and the object's moving direction can be estimated. To search for the target either in the original video or in a video captured by another camera, the reidentification procedure can be triggered to check whether the same object is reidentified in both the previous and next frames, in case the object is lost in the current frame.
Optionally, the same object can be reidentified via a neural network with a metric learning method, wherein the neural network is trained by a combination of a metric learning method and a representation learning method. With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
Optionally, the neural network can include:
- a backbone part, configured to extract features of image in the bounding boxes detected in frames of a video;
- a first fully connected layer and a second fully connected layer, connected to the backbone part, configured to synthesize and reduce data dimension of features coming out from the backbone part, wherein the metric learning method is applied on the first fully connected layer and the representation learning method is applied on the second fully connected layer;
- a loss function for classification part, connected to the second fully connected layer, and
- a contrastive loss function part, connected to the first fully connected layer.
Optionally, the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives. The selection of the dataset can influence the performance of the neural network. A dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features, and images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
According to a fifth aspect of the present disclosure, a method for object reidentification is presented. The method includes following steps:
- acquiring images;
- reidentifying objects in the acquired images via a neural network with a metric learning method, wherein the neural network is trained by a combination of a metric learning method and a representation learning method.
According to a sixth aspect of the present disclosure, an apparatus for object reidentification is presented. The apparatus includes:
- an image acquisition module, configured to acquire images;
- a reidentification module, configured to reidentify in the acquired images via a neural network with metric learning method, wherein neural network is trained by a combination of a metric learning method and a representation learning method.
According to a seventh aspect of the present disclosure, an apparatus for object reidentification is presented. The apparatus includes at least one processor and at least one memory, coupled to the at least one processor, configured to execute the method according to the fifth aspect.
According to an eighth aspect of the present disclosure, a computer-readable medium for object reidentification is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the fifth aspect.
With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
Optionally, the neural network can include:
- a backbone part, configured to extract features of the images;
- a first fully connected layer and a second fully connected layer, connected to the backbone part, configured to synthesize and reduce data dimension of features coming out from the backbone part, wherein the metric learning method is applied on the first fully connected layer and the representation learning method is applied on the second fully connected layer;
- a loss function for classification part, connected to the second fully connected layer, and
- a contrastive loss function part, connected to the first fully connected layer.
Optionally, the neural network can be trained based on a dataset including images of different objects with similar appearance and/or images of the same object captured from different perspectives. A dataset including images of different objects having similar appearance can help the neural network concentrate on the distinguishable features, and images captured from different perspectives can help the neural network match multiple views of the same object in actual use.
Brief Description of the Drawings
The above mentioned attributes and other features and advantages of the present technique and the manner of attaining them will become more apparent and the present technique itself will be better understood by reference to the following description of embodiments of the present technique taken in conjunction with the accompanying drawings, wherein:
FIG. 1 depicts scenarios in which vehicles are occluded or disappear.
FIG. 2 depicts current object tracking criteria.
FIG. 3A depicts the case in which a vehicle is wrongly untracked using current object tracking criteria.
FIG. 3B depicts the case in which a vehicle is wrongly tracked using current object tracking criteria.
FIG. 4 depicts a block diagram of an apparatus for object tracking in accordance with one embodiment of the present disclosure.
FIG. 5 depicts process of reidentification.
FIG. 6 depicts structure of a neural network in accordance with one embodiment of the present disclosure.
FIG. 7 and FIG. 8 depict flow diagrams of a method for object tracking in accordance with one embodiment of the present disclosure.
FIG. 9 depicts a block diagram of an apparatus for object reidentification in accordance with one embodiment of the present disclosure.
FIG. 10 depicts flow diagram of a method for object reidentification in accordance with one embodiment of the present disclosure.
FIG. 11 depicts examples of data resource for training a model used for object reidentification in accordance with one embodiment of the present disclosure.
FIG. 12 depicts object tracking result in accordance with one embodiment of the present disclosure.
Reference Numbers:
10, an apparatus for object tracking
101, at least one memory
102, at least one processor
103, a communication module
20, an object tracking program
104, a video frame acquisition module
105, a bounding box generation module
106, a calculation module
30, data acquired
100, 100’ methods for object tracking
S101~S105, steps of method 100
S1051~S1056, sub-steps of S105
50, an apparatus for object reidentification
501, at least one memory
502, at least one processor
503, a communication module
60, an object reidentification program
504, an image acquisition module
505, a reidentification module
70, images to be reidentified
80, dataset used for training a neural network 90
90, a neural network
901, a backbone part
902, a first fully connected layer
903, a second fully connected layer
904, a loss function for classification part
905, a constructive loss function part
Detailed Description of Example Embodiments
Hereinafter, above-mentioned and other features of the present technique are described in detail. Various embodiments are described with reference to the drawing, where like reference numerals are used to refer to like elements throughout. In the following description, for purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be noted that the illustrated embodiments are intended to explain, and not to limit the invention. It may be evident that such embodiments may be practiced without these specific details.
When introducing elements of various embodiments of the present disclosure, the articles “a” , “an” , “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising” , “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
Now the present disclosure will be described hereinafter in details by referring to FIG. 1 to FIG. 9.
FIG. 4 depicts a block diagram of an apparatus in accordance with one embodiment of the present disclosure. The apparatus 10 for object tracking presented in the present disclosure can be implemented as a network of computer processors, to execute the following method 100 for object tracking presented in the present disclosure. The apparatus 10 can also be a single computer, as shown in FIG. 4, including at least one memory 101, which includes a computer-readable medium such as a random access memory (RAM). The apparatus 10 also includes at least one processor 102, coupled with the at least one memory 101. Computer-executable instructions are stored in the at least one memory 101 and, when executed by the at least one processor 102, can cause the at least one processor 102 to perform the steps described herein. The at least one processor 102 may include a microprocessor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), state machines, etc. Embodiments of computer-readable media include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.
The at least one memory 101 shown in FIG. 4 can contain an object tracking program 20 which, when executed by the at least one processor 102, causes the at least one processor 102 to execute the method 100 or method 100' for object tracking presented in the present disclosure. Data 30, including videos of the target object, can also be stored in the at least one memory 101. The data 30 can be received via a communication module 103 of the apparatus 10.
The object tracking program 20 can include:
- a video frame acquisition module 104, configured to acquire a first frame in a first video;
- a bounding box generation module 105, configured to generate at least one first bounding box of the first frame via object detection;
- a calculation module 106, configured to: find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame; calculate change in size between the pair of nearest bounding boxes; and determine, based on the change in size, whether same object is detected in both of the nearest bounding boxes.
A video includes frames of images, which are usually taken by cameras in order of time. Here, the first frame is the current frame and the second frame is the previous frame of the current frame. Usually there is more than one object in a frame, so at least one bounding box can be found in each frame; each bounding box corresponds to one object.
As mentioned above, current tracking procedures only take IoU into account, which results in wrong judgements. Here, the change in size of the nearest bounding boxes is considered. Usually, the probability that nearest bounding boxes in adjacent frames correspond to the same object is high; however, different objects usually have different sizes, so taking the change in size into account helps determine whether the two nearest bounding boxes correspond to the same object.
Optionally, the calculation module 106 is further configured to:
- before determining based on the change in size whether same object is detected in both of the nearest bounding boxes, calculate IoU of the pair of nearest bounding boxes;
- determine that the same object is detected in both of the nearest bounding boxes, if the change in size is smaller than a first threshold and the IoU is larger than a second threshold.
Since IoU and change in size are both taken into account to determine whether the same object is detected in both of the nearest bounding boxes, the judgement is more accurate. Detection of the same object in both of the nearest bounding boxes means that the same object is tracked. The first threshold and the second threshold can be set according to the actual application scenario. Note that, equivalently, the calculation module 106 can determine that the same object is detected in both of the nearest bounding boxes if the change in size is not larger than the first threshold and the IoU is not smaller than the second threshold.
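As a concrete reading of this combined test, a minimal sketch follows. The relative-area definition of "change in size" and both default thresholds are illustrative assumptions (the disclosure says only that thresholds are set per application scenario); iou() is the helper sketched in the Background Art section.

```python
def size_change(box_a, box_b):
    """Relative change in bounding-box area: one plausible reading of
    "change in size", which the disclosure does not define precisely."""
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return abs(area_b - area_a) / max(area_a, 1e-6)

def same_object(box_prev, box_curr, size_thresh=0.3, iou_thresh=0.5):
    """Same object iff the size change is small AND the overlap is large.
    Both threshold values are illustrative assumptions."""
    return (size_change(box_prev, box_curr) < size_thresh
            and iou(box_prev, box_curr) > iou_thresh)  # iou() from the earlier sketch
```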
Optionally, the calculation module 106 is further configured to: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
- determine position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
- determine based on the position relationship whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
- acquire the third frame;
- generate at least one third bounding box of the third frame via object detection;
- reidentify same object in the second frame and the third frame;
- determine that the same object reidentified is tracked if the same object is reidentified.
A change in size not smaller than the first threshold, or an IoU not larger than the second threshold, means that the nearest bounding boxes might not correspond to the same object; then further processing and judgement can be done for an accurate result.
Firstly, the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame can be determined, based on which it can be decided whether to acquire the next frame from the same video as the current frame or from other videos across cameras. Based on such a position judgement, object tracking across cameras can be executed, which can effectively increase the success rate of tracking a moving object. Once the third frame, i.e. the frame next to the current one, is acquired, bounding boxes can be generated in the third frame and the same object can be reidentified in the second frame and the third frame; once the same object is reidentified, it can be determined that the reidentified object is tracked.
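A minimal sketch of this border check follows. The pixel margin and the rule "a box touching the frame border means the object is leaving the field of view" are assumptions about one plausible implementation; the disclosure does not spell out the position test.

```python
def choose_next_frame_source(box, frame_w, frame_h, margin=20):
    """Decide where to acquire the third frame from.

    If the first bounding box touches the frame border, the object is
    probably leaving the field of view, so the next frame is taken from
    a second video across cameras; otherwise from the first video.
    The margin (in pixels) is an illustrative assumption.
    """
    x1, y1, x2, y2 = box
    at_border = (x1 <= margin or y1 <= margin
                 or x2 >= frame_w - margin or y2 >= frame_h - margin)
    return "second_video" if at_border else "first_video"
```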
Optionally, the calculation module 106 is further configured to reidentify via a neural network 90 with metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
The objective of reidentification is to estimate whether two images represent the same object. The basic idea is to extract the features of the images and to estimate their similarity according to their Euclidean distance. This process can be represented by FIG. 5.
Considering the strong ability of neural networks to extract abstract features, we apply one here to achieve this task. A well-trained network provides quite different features for different objects.
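The comparison step of FIG. 5 can be sketched as below, assuming the two feature vectors come out of such a network; the distance threshold is an assumption.

```python
import torch

def reidentified(feat_a, feat_b, dist_thresh=1.0):
    """Declare two images to show the same object when the Euclidean
    distance between their feature vectors is small enough.
    The threshold value is an illustrative assumption."""
    return torch.dist(feat_a, feat_b, p=2).item() < dist_thresh
```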
Metric learning methods are widely discussed in reidentification topics; they aim to separate the features of objects. Representation learning methods, in contrast, are rarely mentioned in these tasks; they serve to express the features of categories, so they are usually used for classification. Considering that the classification process of representation learning implies a synthesis of distinguishable characteristics, if we regard each object as a category, representation learning works for feature separation as well. With the combination of a representation learning method during training, the feature-separation capability can be enhanced, which compensates for the low success rate of metric learning methods. Especially when differences between objects are less obvious, as with many cars on the road, the representation learning method helps make the neural network powerful enough to find them. Based on the well-trained neural network, reidentification with metric learning can be more precise and efficient. Also, with the classification of the representation learning method, the complexity of training can be reduced significantly.
Now, referring to FIG. 6, the neural network 90 used for object reidentification can include:
- a backbone part 901, configured to extract features of image in the bounding boxes detected in frames of a video;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce data dimension of features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a constructive loss function part 905, connected to the first fully connected layer 902.
The components in the dot-dashed-line rectangle are arranged according to the representation learning method; they work only during the training process. The components in the dashed-line rectangle are arranged based on the metric learning method; they work during both the training and the application process.
The backbone part 901 can be implemented as ResNet-50, which provides enough features for a large number of categories. We also tried the lighter ResNet-18 backbone, but its depth proved insufficient and errors occurred in the results.
For the first fully connected layer 902, the input/output dimensions can be 2048/128. It can be used to synthesize the features coming out of the backbone part 901. Moreover, the reduction of data dimension improves convergence during training and simplifies comparison during application.
For the second fully connected layer 903, the input/output dimensions can be 2048/901. It has a similar function to the first fully connected layer 902. Representation learning originally aims at classification; here we use it to differentiate the features of different objects, so the more object types we provide, the more distinguishable features we get. For example, for vehicle reidentification, we can select two datasets with 901 automobiles in total to train the neural network 90.
The loss function for classification part 904 can be implemented with Softmax.
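A PyTorch sketch of this structure follows, under stated assumptions: the backbone is torchvision's ResNet-50 with its classifier head removed, the metric branch is 2048/128 and the classification branch 2048/901, as described above. The layer names, the pooling detail and the absence of pretrained weights are assumptions, not requirements of the disclosure.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ReidNet(nn.Module):
    """Sketch of the network of FIG. 6: a ResNet-50 backbone (part 901),
    a 2048/128 metric branch (layer 902, used in training and application)
    and a 2048/901 classification branch (layer 903, training only)."""

    def __init__(self, num_identities=901):
        super().__init__()
        backbone = resnet50(weights=None)
        # Drop the ImageNet classifier; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.fc_metric = nn.Linear(2048, 128)             # first FC layer 902
        self.fc_class = nn.Linear(2048, num_identities)   # second FC layer 903

    def forward(self, x):
        feats = self.backbone(x).flatten(1)   # (B, 2048) backbone features
        embedding = self.fc_metric(feats)     # metric learning branch
        logits = self.fc_class(feats)         # classification (Softmax) branch
        return embedding, logits
```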
The contrastive loss function part 905 is a kind of metric learning loss function. We select it because: 1) ResNet-50 provides enough features; and 2) the representation learning branch helps to separate the features. A more complicated loss function (for example the triplet loss) increases the difficulty of convergence during the training process, while the improvement in separation ability is not obvious. The contrastive loss function can be defined as:
L = (1/(2N)) Σ_{i=1}^{N} [ y_i d_i^2 + (1 - y_i) max(margin - d_i, 0)^2 ]
where N is the number of image pairs per batch, d_i is the Euclidean distance between the two vectors coming out of the first fully connected layer 902, y_i is the label that marks whether the objects in the two images are the same (if yes, y_i = 1; otherwise, y_i = 0), and margin is the separation enforced between features of different objects.
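A PyTorch sketch of this loss follows, assuming the standard contrastive form reconstructed above; the margin value is an assumption.

```python
import torch

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss over a batch of N image pairs.

    d: (N,) tensor of Euclidean distances d_i between pair embeddings;
    y: (N,) float tensor of labels y_i (1 = same object, 0 = different).
    """
    same = y * d.pow(2)                                      # pull same-object pairs together
    diff = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # push different pairs apart
    return (same + diff).mean() / 2                          # (1/(2N)) * sum over the batch
```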
It should be mentioned that, as an example, there are only two inputs shown in FIG. 6: the second frame (the previous frame) and the third frame (the next frame). However, there can be more than two inputs; the same object will then be reidentified from all the input images.
Optionally, the neural network 90 can be trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives. The selection of the dataset influences the performance of the neural network 90. Taking vehicles as an example, as shown in FIG. 11, we choose a dataset with a large quantity of automobiles. Some of them have similar appearance, which helps the neural network 90 concentrate on the distinguishable features. Moreover, for each automobile, the dataset 80 can cover images captured from different perspectives. As a result, the neural network 90 will be able to match multiple views (e.g. the front and the rear) of a car in actual use.
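To illustrate how the two branches are combined during training, the following hedged sketch performs one training step on a pair of images, reusing ReidNet and contrastive_loss from the sketches above. The equal weighting of the two losses and the per-image class labels are our assumptions, not details fixed by the disclosure.

```python
import torch.nn.functional as F

def training_step(model: ReidNet, img_a, img_b, same_label, class_a, class_b):
    """One combined step: contrastive loss on the 128-d embeddings (metric branch)
    plus softmax cross-entropy on the 901-way logits (representation branch)."""
    emb_a, logits_a = model(img_a)
    emb_b, logits_b = model(img_b)
    d = torch.norm(emb_a - emb_b, dim=1)            # Euclidean distance per image pair
    loss_metric = contrastive_loss(d, same_label)
    loss_repr = F.cross_entropy(logits_a, class_a) + F.cross_entropy(logits_b, class_b)
    return loss_metric + loss_repr                  # equal weighting is an assumption
```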
Although the video frame acquisition module 104, the bounding box generation module 105 and the calculation module 106 are described above as software modules of the object tracking program 20, they can also be implemented via hardware, such as ASIC chips. They can be integrated into one chip, or implemented separately and electrically connected.
It should be mentioned that the present disclosure may include apparatuses having a different architecture from that shown in FIG. 4. The architecture above is merely exemplary and is used to explain the exemplary method 100 shown in FIG. 7 and the method 100' shown in FIG. 8.
Various methods in accordance with the present disclosure may be carried out. One exemplary method 100 according to the present disclosure includes the following steps:
S101: acquiring a first frame in a first video, that is, the current frame;
S102: generating at least one first bounding box of the first frame via object detection;
S103: finding a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame, that is, the previous frame;
S104: calculating a change in size between the pair of nearest bounding boxes;
S105: determining, based on the change in size, whether the same object is detected in both of the nearest bounding boxes.
Now referring to FIG. 8, in comparison to method 100, optionally, before step S105, the IoU of the pair of nearest bounding boxes can be calculated in step S103'. Then step S105 can include the following sub-steps:
S1051: if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, determining that the same object is detected in both of the nearest bounding boxes.
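As a concrete illustration of sub-step S1051, the decision can be sketched as follows. The box format and the threshold values are illustrative assumptions only; the disclosure does not fix them.

```python
def iou_and_size_change(prev_box, cur_box):
    """Boxes as (x1, y1, x2, y2). Returns (IoU, relative change in size)."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    ix1, iy1 = max(prev_box[0], cur_box[0]), max(prev_box[1], cur_box[1])
    ix2, iy2 = min(prev_box[2], cur_box[2]), min(prev_box[3], cur_box[3])
    inter = area((ix1, iy1, ix2, iy2))
    iou = inter / (area(prev_box) + area(cur_box) - inter)
    size_change = abs(area(cur_box) - area(prev_box)) / area(prev_box)
    return iou, size_change

def same_object_tracked(prev_box, cur_box, size_thresh=0.3, iou_thresh=0.5):
    iou, size_change = iou_and_size_change(prev_box, cur_box)
    return size_change < size_thresh and iou > iou_thresh  # sub-step S1051
```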
The current object tracking procedure completes when the nearest bounding boxes are found in two adjacent frames. As mentioned above, a failure may occur when occlusion appears or when the object moves out of sight. In that case, the target must be searched for either in the original video or in a video captured by another camera.
Therefore, the position of the bounding box can be checked further here. Normally, the bounding box of an object should have a stable shape and velocity. As shown in FIG. 2, the object is tracked if the bounding boxes in the current and previous frames largely overlap (high IoU) and the change in size is trivial. Otherwise, there may be an occluder, or an object disappearing at the border of the frame. To further distinguish these two cases, the location of the bounding box can be checked; then the object's moving direction can be estimated, and the reidentification procedure can be triggered in the following sub-steps:
S1052: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold, determining the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
Then, after sub-step S1052, the following sub-steps can be executed for object tracking:
S1053: determining, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
S1054: acquiring the third frame;
S1055: generating at least one third bounding box of the third frame via object detection;
S1056: reidentifying the same object in the second frame and the third frame.
If the same object is reidentified, the procedure can go back to sub-step S1051, and the reidentified object is determined to be tracked.
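The decision in sub-steps S1052/S1053 can be sketched as follows; this is our illustrative reading of the position check, and the pixel margin is an assumption.

```python
def choose_source_for_third_frame(box, frame_w, frame_h, border_margin=10):
    """Sketch of sub-steps S1052/S1053: if the unmatched bounding box touches
    the border of the first frame, the object has likely moved out of sight,
    so the third frame is acquired from the second video (another camera);
    otherwise an occlusion inside the frame is assumed and the third frame is
    taken from the first video. border_margin (pixels) is an assumption."""
    x1, y1, x2, y2 = box
    at_border = (x1 <= border_margin or y1 <= border_margin
                 or x2 >= frame_w - border_margin or y2 >= frame_h - border_margin)
    return "second_video" if at_border else "first_video"
```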
Optionally, in sub-step S1056, the reidentification can be done via a neural network 90 with a metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
Optionally, the neural network 90 can include:
- a backbone part 901, configured to extract features of the images in the bounding boxes detected in frames of a video;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce the data dimension of the features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a contrastive loss function part 905, connected to the first fully connected layer 902.
Optionally, the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
Now referring to FIG. 9 and FIG. 10, an apparatus 50 and a method 200 for object reidentification are also provided in the present disclosure. They can be used for object reidentification in the above-mentioned apparatus 10 and method 100'.
The apparatus 50 has a similar structure to the apparatus 10, so only the differing parts are introduced herein; for the rest, refer to the description of the apparatus 10.
As shown in FIG. 9, the apparatus 50 can include:
- at least one processor 502;
- at least one memory 501 coupled to the at least one processor 502, containing an object reidentification program 60 which, when executed by the at least one processor 502, causes the at least one processor 502 to execute the method 200 for object reidentification presented in the present disclosure.
The images 70 to be reidentified and the dataset 80 used for training the neural network 90 can also be stored in the at least one memory 501. These data can be received via a communication module 503 of the apparatus 50.
The object reidentification program 60 can include:
- an image acquisition module 504, configured to acquire images 70;
- a reidentification module 505, configured to reidentify the same object in the acquired images 70 via a neural network 90 with a metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
Optionally, the neural network 90 can include:
- a backbone part 901, configured to extract features of the acquired images 70;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce the data dimension of the features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a contrastive loss function part 905, connected to the first fully connected layer 902.
For details of the neural network 90, refer to FIG. 6 and the corresponding description above.
Optionally, the neural network 90 can be trained based on a dataset 80 comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives, as shown in FIG. 11.
Although the image acquisition module 504 and the reidentification module 505 are described above as software modules of the object reidentification program 60, they can also be implemented via hardware, such as ASIC chips. They can be integrated into one chip, or implemented separately and electrically connected.
It should be mentioned that the present disclosure may include apparatuses having a different architecture from that shown in FIG. 9. The architecture above is merely exemplary and is used to explain the exemplary method 200 shown in FIG. 10.
Various methods in accordance with the present disclosure may be carried out. One exemplary method 200 according to the present disclosure includes the following steps:
S201: acquiring images 70;
S202: reidentifying the same object in the acquired images 70 via a neural network 90 with a metric learning method, wherein the neural network 90 is trained by a combination of a metric learning method and a representation learning method.
Optionally, the neural network 90 can include:
- a backbone part 901, configured to extract features of the images 70;
- a first fully connected layer 902 and a second fully connected layer 903, connected to the backbone part 901, configured to synthesize and reduce data dimension of features coming out from the backbone part 901, wherein the metric learning method is applied on the first fully connected layer 902 and the representation learning method is applied on the second fully connected layer 903;
- a loss function for classification part 904, connected to the second fully connected layer 903, and
- a contrastive loss function part 905, connected to the first fully connected layer 902.
Optionally, the neural network 90 is trained based on a dataset 80 including images of different objects with similar appearance and/or images of the same object captured from different perspectives.
A computer-readable medium is also provided in the present disclosure, storing computer-executable instructions which, upon execution by a computer, enable the computer to execute any of the methods presented in this disclosure.
A computer program is also provided, which, when executed by at least one processor, performs any of the methods presented in this disclosure.
FIG. 12 shows test results of the solutions provided in the present disclosure. The solution was tested on a GeForce RTX 2080 Ti with videos of 1080p resolution, captured by two cameras along a street, so one car appears in both videos. It takes 17 ms to compare a pair of cars in the frames. As the number of image pairs increases (n pairs, for example), the operation time grows, but not multiplicatively (t < 17n ms), because the images are packed into a single matrix for calculation on the GPU. This module is triggered only when tracking fails, so its speed can meet the requirements of actual use.
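The batching behaviour described above can be sketched as follows, reusing the ReidNet sketch from earlier; the function assumes the candidate images are already cropped and stacked into tensors.

```python
@torch.no_grad()
def batched_pair_distances(model: ReidNet, queries: torch.Tensor, candidates: torch.Tensor):
    """queries/candidates: (n, 3, H, W) tensors forming n image pairs. Packing
    all pairs into two forward passes is why the runtime grows sublinearly
    (t < 17n ms) rather than costing 17 ms per pair."""
    emb_q, _ = model(queries)
    emb_c, _ = model(candidates)
    return torch.norm(emb_q - emb_c, dim=1)  # small distance => likely the same object
```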
An object tracking enhancement solution is provided in the present disclosure, which serves first to detect tracking failures and then to find the same object in other frames. Frames used for object tracking are not limited to adjacent frames of the same video; they may come from videos captured by two cameras at a long distance from each other, which effectively solves the problem of object tracking across cameras.
An object reidentification solution is also provided. With the combination of the metric learning method and the representation learning method, the neural network used for object reidentification can effectively and precisely separate the features of different objects, especially similar-looking ones. Choosing a dataset of different objects with similar appearance and of different perspectives of each object enables the neural network to recognize objects from many perspectives. When reidentifying vehicles, the neural network works even when the images are captured from the front or the rear of a vehicle.
The neural network can easily be used in other scenarios after training, so the target object is not limited to vehicles; the solution is also applicable to other contexts, such as person tracking. The solution provided in the present disclosure can be widely applied in many surveillance scenarios. We have already mentioned the enhancement of tracking in traffic; it may also serve to track patients in hospitals or the aged in nursing homes.
While the present technique has been described in detail with reference to certain embodiments, it should be appreciated that the present technique is not limited to those precise embodiments. Rather, in view of the present disclosure, which describes exemplary modes for practicing the invention, many modifications and variations would present themselves to those skilled in the art without departing from the scope and spirit of this invention. The scope of the invention is, therefore, indicated by the following claims rather than by the foregoing description. All changes, modifications, and variations coming within the meaning and range of equivalency of the claims are to be considered within their scope.

Claims (22)

  1. A method (100) for object tracking, comprising:
    - acquiring (S101) a first frame in a first video;
    - generating (S102) at least one first bounding box of the first frame via object detection;
    - finding (S103) a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame;
    - calculating (S104) a change in size between the pair of nearest bounding boxes;
    - determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes.
  2. The method (100) according to claim 1, wherein
    - before determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes, the method further comprises: calculating (S103') the IoU of the pair of nearest bounding boxes;
    - determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes comprises: if the change in size is smaller than a first threshold and the IoU is larger than a second threshold, determining (S1051) that the same object is detected in both of the nearest bounding boxes.
  3. The method according to claim 2, wherein determining (S105), based on the change in size, whether the same object is detected in both of the nearest bounding boxes further comprises: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
    - determining (S1052) the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
    - determining (S1053), based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
    - acquiring (S1054) the third frame;
    - generating (S1055) at least one third bounding box of the third frame via object detection;
    - reidentifying (S1056) the same object in the second frame and the third frame;
    - determining (S1051) that the reidentified object is tracked if the same object is reidentified.
  4. The method (100) according to claim 3, wherein reidentifying (S1056) the same object in the second frame and the third frame further comprises:
    - reidentifying (S1056) via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  5. The method (100) according to claim 4, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the images in the bounding boxes detected in frames of a video;
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  6. The method (100) according to claim 4, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  7. An apparatus (10) for object tracking, comprising:
    - a video frame acquisition module (104) , configured to acquire a first frame in a first video;
    - a bounding box generation module (105) , configured to generate at least one first bounding box of the first frame via object detection;
    - a calculation module (106) , configured to:
    - find a pair of nearest bounding boxes, wherein one is from the at least one first bounding box, the other is from at least one second bounding box detected via object detection from a second frame of the first video, and the second frame is the previous one of the first frame;
    - calculate a change in size between the pair of nearest bounding boxes;
    - determine, based on the change in size, whether the same object is detected in both of the nearest bounding boxes.
  8. The apparatus (10) according to claim 7, wherein the calculation module (106) is further configured to:
    - before determining, based on the change in size, whether the same object is detected in both of the nearest bounding boxes, calculate the IoU of the pair of nearest bounding boxes;
    - determine that the same object is detected in both of the nearest bounding boxes if the change in size is smaller than a first threshold and the IoU is larger than a second threshold.
  9. The apparatus (10) according to claim 8, wherein the calculation module (106) is further configured to: if the change in size is not smaller than the first threshold or the IoU is not larger than the second threshold,
    - determine the position relationship between the first bounding box in the pair of nearest bounding boxes and the border of the first frame;
    - determine, based on the position relationship, whether to acquire a third frame from the first video or from a second video across cameras, wherein the third frame is the next frame of the first frame;
    - acquire the third frame;
    - generate at least one third bounding box of the third frame via object detection;
    - reidentify the same object in the second frame and the third frame;
    - determine that the reidentified object is tracked if the same object is reidentified.
  10. The apparatus (10) according to claim 9, wherein the calculation module (106) is further configured to:
    - reidentify via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  11. The apparatus (10) according to claim 10, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the images in the bounding boxes detected in frames of a video;
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  12. The apparatus (10) according to claim 10, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  13. An apparatus (10) for object tracking, comprising:
    - at least one processor (102) ;
    - at least one memory (101), coupled to the at least one processor (102), configured to execute the method according to any of claims 1~6.
  14. A computer-readable medium for object tracking, storing computer-executable instructions, wherein the computer-executable instructions, when executed, cause at least one processor to execute the method according to any of claims 1~6.
  15. A method (200) for object reidentification, comprising:
    - acquiring (S201) images (70);
    - reidentifying (S202) the same object in the acquired images (70) via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  16. The method according to claim 15, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the images (70);
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  17. The method according to claim 15, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  18. An apparatus (50) for object reidentification, comprising:
    - an image acquisition module (504) , configured to acquire images (70) ;
    - a reidentification module (505), configured to reidentify the same object in the acquired images (70) via a neural network (90) with a metric learning method, wherein the neural network (90) is trained by a combination of a metric learning method and a representation learning method.
  19. The apparatus (50) according to claim 18, wherein the neural network (90) comprises:
    - a backbone part (901), configured to extract features of the acquired images (70);
    - a first fully connected layer (902) and a second fully connected layer (903), connected to the backbone part (901), configured to synthesize and reduce the data dimension of the features coming out from the backbone part (901), wherein the metric learning method is applied on the first fully connected layer (902) and the representation learning method is applied on the second fully connected layer (903);
    - a loss function for classification part (904), connected to the second fully connected layer (903), and
    - a contrastive loss function part (905), connected to the first fully connected layer (902).
  20. The apparatus (50) according to claim 18, wherein the neural network (90) is trained based on a dataset (80) comprising images of different objects with similar appearance and/or images of the same object captured from different perspectives.
  21. An apparatus (50) for object reidentification, comprising:
    - at least one processor (502);
    - at least one memory (501), coupled to the at least one processor (502), configured to execute the method according to any of claims 15~17.
  22. A computer-readable medium for object reidentification, storing computer-executable instructions, wherein the computer-executable instructions, when executed, cause at least one processor to execute the method according to any of claims 15~17.