WO2023119969A1 - Object tracking method and object tracking device - Google Patents

Object tracking method and object tracking device

Info

Publication number
WO2023119969A1
Authority
WO
WIPO (PCT)
Prior art keywords
detection
detection result
tracking method
identification value
detection results
Prior art date
Application number
PCT/JP2022/042682
Other languages
French (fr)
Japanese (ja)
Inventor
文彬 佐藤
大気 関井
Original Assignee
コニカミノルタ株式会社
Priority date
Filing date
Publication date
Application filed by コニカミノルタ株式会社
Publication of WO2023119969A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion

Definitions

  • the present disclosure relates to technology for detecting and tracking an object from captured images.
  • Object detection technology that detects objects such as people and vehicles from images captured by cameras and tracks the same object in multiple frames is used in applications such as surveillance camera systems and in-vehicle camera systems.
  • In recent years, deep learning has been used for object tracking. As an object detection method using deep learning, Non-Patent Document 1 can be cited, for example. Non-Patent Document 1 discloses a technique for tracking an object by using the object detection results at time t and the object tracking results up to time t-1 to associate each detection result at time t with one of the tracking results up to time t-1.
  • In Non-Patent Document 1, because only the error in the correspondence between the tracking results up to time t-1 and the detection results at time t is backpropagated, errors involving detection results at times later than t cannot be handled. Object tracking accuracy could therefore potentially be improved by also using detection results from times later than t.
  • the present disclosure has been made in view of the above problems, and aims to provide an object tracking method and an object tracking device capable of performing object tracking with higher accuracy than conventional methods.
  • An object tracking method according to one aspect of the present disclosure is a method for tracking the same object across a plurality of frames of video captured by a camera, and includes: an object detection step of obtaining detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each detection result, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input; and a detection result classification step of classifying the detection results over the plurality of frames based on the similarity of the object identification values.
  • According to the present disclosure, the object identification values for all detection results are calculated collectively by neural computation that takes the object feature detection results over a plurality of frames as input. Because the identification value for any given frame is thus computed using the detection results of later (future) frames as well, the accuracy of object tracking can be improved.
  • FIG. 1 is a block diagram showing a schematic configuration of an object tracking system 1 according to Embodiment 1;
  • FIG. 2 is a diagram showing an example of an image 111 captured by a camera 15;
  • FIGS. 3A and 3B are diagrams for explaining the feature detection result data 113.
  • FIGS. 4A and 4B are diagrams for explaining the object identification value data 115.
  • FIGS. 5A and 5B are diagrams for explaining the group classification processing of the group classifier 116.
  • FIGS. 6A and 6B are diagrams for explaining the group classification result data 117.
  • FIG. 7 is a block diagram showing the configuration of a DNN.
  • Embodiment 1: An object tracking system 1 according to Embodiment 1 will be described below.
  • FIG. 1 is a block diagram showing the configuration of the object tracking system 1. As shown in the figure, the object tracking system 1 comprises a camera 15 and an object tracking device 10.
  • The camera 15 includes an imaging element such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge-Coupled Device) image sensor, and outputs an image of a predetermined size by photoelectrically converting the light focused on the imaging element into an electric signal.
  • The object tracking device 10 includes a control unit 11 and an input/output interface 12 for connecting to the camera 15.
  • The control unit 11 includes a CPU (Central Processing Unit) 11a, a main storage device 11b, an auxiliary storage device 11c, and the like.
  • The computer programs and data stored in the auxiliary storage device 11c are loaded into the main storage device 11b, and the CPU 11a operates according to the loaded computer programs and data, thereby realizing each processing unit (the object detector 112, the object identification value calculator 114, and the group classifier 116).
  • The auxiliary storage device 11c is composed of, for example, a hard disk and/or a non-volatile semiconductor memory.
  • the auxiliary storage device 13 stores an image 111 captured by the camera 15, feature detection result data 113, object identification value data 115, group classification result data 117, and the like.
  • Photographed image 111 is image data of a plurality of frames captured by the camera 15 .
  • FIG. 2 shows an example of image data 201 of one frame of the captured image 111 that is input to the object detector 112 .
  • the object detector 112 receives the captured image 111 , performs object detection processing, and outputs feature detection result data 113 .
  • the object detector 112 is a neural network that has performed machine learning to detect the features of the object to be detected.
  • An existing neural network can be used for the object detection unit 112 .
  • the object detection unit 112 uses OpenPose (see Non-Patent Document 2).
  • OpenPose is a neural network that detects joint points of a human body (characteristic points such as face, neck, shoulders, etc.) from image data.
  • FIG. 3(a) is a diagram schematically showing feature points of an object detected by the object detector 112.
  • FIG. 3(a) shows the detection result when the image data 201 in which two people are shown is input. As shown in the figure, a predetermined number of feature points 301 are detected for each detected person.
  • FIG. 3B shows an example of the data structure of the feature detection result data 113 for one feature point 301.
  • the feature detection result data 113 includes feature point IDs, position information, time information, likelihood information, object category information, and feature point category information.
  • a feature point ID is an identifier attached to uniquely identify a plurality of feature points detected by the object detector 112 .
  • the position information is information indicating the X-coordinate and Y-coordinate of the detected feature point in the detection image.
  • the time information is the frame number of the detected image.
  • The likelihood information indicates the confidence (likelihood) with which the feature point was detected.
  • the object category information is information indicating the category (type) of the object to which the detected feature points belong.
  • the object category information is, for example, values identifying humans, dogs, cats, cars, and the like.
  • the feature point category information is information indicating the category (type) of the detected feature point.
  • the feature point category information is, for example, values for identifying head joint points, neck joint points, shoulder joint points, and the like.
  • The object identification value calculator 114 is a machine-learned neural network that receives as input a plurality of feature detection result data 113 detected from a plurality of frames and calculates an object identification value for each of the input feature detection result data 113.
  • The object identification value should take the same value for detections belonging to the same object.
  • For example, when the feature detection result data 113 detected from captured images 111 in which the same person appears over a plurality of frames are input, the object identification value calculator 114 ideally calculates and outputs object identification values that are all identical for the feature points belonging to that person.
  • An existing neural network can be used for the object identification value calculator 114 .
  • the object identification value calculator 114 uses PointNet (see Non-Patent Document 5).
  • PointNet is a neural network for executing a specific task with point cloud data as input.
  • the object identification value calculator 114 is preferably a permutation-equivariant neural network that uses point cloud data as an input.
  • The object identification value calculator 114 of the present embodiment is trained, using a loss function designed around contrastive learning, so that feature points belonging to the same person yield the same object identification value and feature points belonging to different persons yield different object identification values.
  • The designed loss function combines a pull term L_pull and a push term L_push (the formulas themselves are given as images in the published application).
  • K_n,i is the output (the estimated object identification value) for the i-th feature point (i is an integer from 1 to I) belonging to the object whose object ID is n (n is an integer from 1 to N).
  • ⁇ m is the average value of outputs of feature points belonging to an object whose object ID is m (m is an integer equal to or greater than 1 and equal to or less than N).
  • N is the number of detected objects.
  • As teacher data, the object ID of the object to which each feature detection result data 113 belongs is given.
  • The object identification value calculator 114 calculates object identification value data 115 for each feature point (feature detection result data 113) detected by the object detector 112.
  • FIG. 4(a) is a diagram schematically showing a plurality of detected feature points 401.
  • FIG. 4(b) shows the data structure of the object identification value data 115 calculated for a detected feature point 401.
  • the object identification value data 115 includes feature point IDs and object identification values.
  • a feature point ID is an identifier attached to uniquely identify a feature point detected by the object detector 112 .
  • the object identification value is a vector value that should be the same value when belonging to the same object.
  • The group classifier 116 receives the object identification value data 115 of each feature point detected by the object detector 112 and classifies the feature points into groups based on the similarity of their object identification values. FIG. 5(a) is a diagram schematically showing a plurality of feature points 501 detected by the object detector 112, and FIG. 5(b) is a diagram schematically showing the feature points 501 after group classification.
  • An existing clustering method can be used for group classification. For example, the difference between the object identification values of two feature points may be calculated and, if the difference is smaller than a predetermined threshold, the two feature points are classified into the same group; performing this for all combinations groups the plurality of feature points.
  • As another group classification method, the K-means algorithm may be used: the value of k (the number of clusters) is varied over an arbitrary range to generate multiple clusterings, the optimum number of clusters is determined by the elbow method (based on the within-cluster sum of squared errors), and the clustering generated with the resulting optimum k is used as the group classification result.
  • The group classifier 116 calculates group classification result data 117 for each feature point (feature detection result data 113) detected by the object detector 112.
  • FIG. 6(a) is a diagram schematically showing a plurality of detected feature points 501.
  • FIG. 6(b) shows the data structure of the group classification result data 117 calculated for the detected feature points 501.
  • the group classification result data 117 includes feature point IDs and group classification results.
  • a feature point ID is an identifier attached to uniquely identify a feature point detected by the object detector 112 .
  • a group classification result is an identifier indicating a classified group, and one group indicates one object (same object detected in a plurality of frames).
  • As described above, the object detector 112 and the object identification value calculator 114 are machine-learned deep neural networks (DNNs). Any DNN may be used as the object detector 112 as long as it detects feature points from an input image and outputs point cloud data.
  • The object identification value calculator 114 receives point cloud data as input, and any DNN may be used as long as it is permutation-equivariant.
  • A neural network 700 shown in FIG. 7 will be described as an example of such a DNN.
  • a neural network is an information processing system that imitates a human neural network.
  • an engineered neuron model corresponding to a nerve cell is called a neuron U here.
  • a neural network 700 has a structure in which a large number of neurons U are connected. Further, the neural network 700 is composed of a plurality of layers 701 each having a plurality of neurons. A weight indicating the strength of connection between neurons is set between neurons in adjacent layers.
  • a multi-input single-output element is used as the neuron U.
  • the signal propagates in one direction, and the input value is multiplied by the above weight and input to the neuron U. This weight can be changed by learning. From the neuron U, the sum of the input values multiplied by the weights is transformed by the activation function and then output to each neuron U of the next layer.
  • the activation function for example, ReLU or a sigmoid function can be used.
  • the first layer is called the input layer, and data is input. For example, the pixel value of each pixel forming one image is input to each neuron U of the input layer. Position information, time information, likelihood information, object category information, and feature point information included in the point cloud data are input to each neuron U of the input layer.
  • the last layer called the output layer, is the layer that outputs the results.
  • As a learning method for the neural network 700, for example, backpropagation is used: an error (loss value) is calculated with a predetermined error function (loss function) from the value indicating the correct answer (teacher data) and the output of the neural network 700 for the training data, and the weights between neurons are then updated sequentially, for example by the steepest descent method, so as to minimize this error.
  • In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of an object, but another neural network may be used.
  • For example, a neural network that detects the circumscribed rectangle of an object (e.g., YOLO; see Non-Patent Document 3) may be used, and the vertices and center position of the circumscribed rectangle may be detected as the feature detection result.
  • Alternatively, a neural network that detects the contour of an object (e.g., Deep Snake; see Non-Patent Document 4) may be used, and the detected contour points and the center point of the object may be used as the feature detection result.
  • It is also not necessary to treat each of the plurality of feature points detected for one object as a separate feature detection result; object information consisting of the plurality of feature points detected for one object may be used as the feature detection result. For example, skeleton information consisting of the position information of the plurality of joint points output by OpenPose may be used as the feature detection result; in this case, an object identification value is calculated for the skeleton information and group classification is performed on it.
  • Similarly, rectangle information consisting of the position and size of the circumscribed rectangle of the object output by YOLO, or contour information consisting of the position information of the plurality of contour points of the object output by Deep Snake, may be used as the feature detection result.
  • The feature detection result data 113 includes likelihood information, object category information, and feature point category information, but may include other information as well; for example, information on the appearance of the object (e.g., color information) may be included.
  • The object detector 112 may receive a single image of one frame as input and detect the features of the objects in that frame, or may receive a plurality of images consisting of a plurality of frames including that frame as input and detect the features of the objects in that frame.
  • In Non-Patent Document 1, the features of an object are detected by a neural network using a mechanism called self-attention.
  • The object detector 112 may likewise detect object features using the self-attention mechanism described in Non-Patent Document 1.
  • In the above embodiment, the object identification value calculator 114 performs the task of calculating an object identification value for each feature detection result data 113, but it may simultaneously perform other tasks in addition to calculating the object identification values.
  • For example, if the object detector 112 detects feature points without detecting the object category, the object identification value calculator 114 may also perform the task of detecting the object category. In this case, the detected object category information is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the object category detection task can be executed simultaneously.
  • A task of recognizing the attributes or behavior of an object may also be executed based on the feature detection results from the object detector 112.
  • The task of recognizing object attributes may be, for example, recognizing attributes such as a person's gender, age, or the type and color of the clothes worn.
  • The task of recognizing the behavior of an object may be, for example, recognizing behaviors of an object (person) such as running, walking, or making a phone call. In these cases, information on the attributes or behavior of the detected objects is added to the teacher data, the error with respect to the output values is calculated, and the network is trained by backpropagation to minimize this error, so that the attribute and behavior detection tasks can be executed simultaneously.
  • The object identification value calculator 114 may also, based on the feature detection results from the object detector 112, execute a task of recognizing the situation in each frame of the captured images 111 or the situation across a plurality of frames, in addition to calculating the object identification values.
  • The situation recognition task may be, for example, recognizing the danger of contact between objects, such as between a pedestrian and a car, or recognizing dangerous driving such as ignoring traffic lights. In these cases, information on the situation of the captured images is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the situation recognition task can be executed simultaneously.
  • In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of objects; OpenPose classifies the detected joint points by object when outputting them. For example, when person A and person B appear in an image, the joint points of person A and those of person B are output in a distinguishable form.
  • The group classifier 116 may use this classification result to classify the feature detection result data 113 by object. That is, since the feature points within each frame are already classified by object by the object detector 112, the group classifier 116 may first place the feature points within the same frame into the same group according to that classification, and then decide, based on the similarity of the object identification values, whether feature points from other frames belong to the same group.
  • An object tracking method according to one aspect of the present disclosure is a method for tracking the same object across a plurality of frames of video captured by a camera, and includes: an object detection step of acquiring detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each of the detection results, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input; and a detection result classification step of classifying the detection results over the plurality of frames based on the similarity of the object identification values.
  • In the above object tracking method, the detection result may include, in addition to time information, any of: information on the positions of feature points of the skeleton of the object, information on the positions of feature points of the circumscribed rectangle of the object, and information on the positions of feature points of the contour of the object.
  • The detection result may also include any of: likelihood information indicating the confidence with which the feature point was detected, object category information indicating the type of the object, feature point category information indicating the type of the feature point, and object appearance information characterizing the appearance of the object.
  • The object detection step may detect the features of an object in one frame by receiving a single image of that frame as input, or by receiving as input a plurality of images consisting of a plurality of frames including that frame.
  • the object detection step may calculate the detection result by an object detector that performs neural operations with the single image or the plurality of images as input.
  • the object detector may use DNN (Deep Neural Network).
  • the object detector may use a neural network with a self-attention mechanism.
  • The object identification value calculation step may collectively calculate the object identification values for the detection results of the plurality of frames using a DNN (Deep Neural Network) that receives the detection results in point cloud data format.
  • the DNN may be Permutation-Equivariant.
  • The DNN may be trained by contrastive learning so that two detection results yield the same object identification value when they belong to the same object and different values when they belong to different objects.
  • the DNN may be trained to recognize the situation in each frame of the video or the situation of the video in addition to the object identification value.
  • the object identification values may be vector values
  • the detection result classification step may associate object identification values with close distances as detection results of the same object.
  • The object detection step may classify the detection results by object within each frame of the video, and the detection result classification step may use this classification result to classify the detection results over the plurality of frames by object.
  • An object tracking device according to one aspect of the present disclosure is a device that tracks the same object across a plurality of frames of video captured by a camera, and comprises: an object detector that acquires detection results of features of an object in each frame of the video;
  • an object identification value calculator that calculates, for each detection result, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input;
  • and a detection result classifier that classifies the detection results over the plurality of frames based on the similarity of the object identification values.
  • According to this configuration, the object identification values for all detection results are calculated collectively by neural computation that takes the object feature detection results over a plurality of frames as input. Because the detection results of future frames are also used, the accuracy of object tracking can be improved.
  • The present disclosure is useful as an object tracking device installed in a surveillance camera system or the like.
  • Reference signs: 1 object tracking system; 10 object tracking device; 15 camera; 112 object detector; 114 object identification value calculator; 116 group classifier

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Provided is an object tracking method that allows high-precision object tracking. This object tracking method, which tracks the same object across multiple frames of video captured by a camera, has: an object detection step for acquiring detection results of features of objects in each frame of the video; an object identification value calculation step for calculating, for each detection result, object identification values that should take the same value when belonging to the same object, by neural computation in which the detection results across the multiple frames are used as input; and a detection result classification step for classifying the detection results across the multiple frames on the basis of the similarity of the object identification values.

Description

Object tracking method and object tracking device
The present disclosure relates to technology for detecting and tracking an object from captured images.
Object detection technology that detects objects such as people and vehicles from video captured by a camera and tracks the same object across multiple frames is used in applications such as surveillance camera systems and in-vehicle camera systems.
In recent years, deep learning has been used for object tracking. As an object detection method using deep learning, Non-Patent Document 1 can be cited, for example. Non-Patent Document 1 discloses a technique for tracking an object by using the object detection results at time t and the object tracking results up to time t-1 to associate each detection result at time t with one of the tracking results up to time t-1.
In Non-Patent Document 1, because only the error in the correspondence between the tracking results up to time t-1 and the detection results at time t is backpropagated, errors involving detection results at times later than t cannot be handled. There is therefore a possibility that object tracking accuracy can be improved by also performing object tracking using detection results from times later than t.
The present disclosure has been made in view of the above problem, and aims to provide an object tracking method and an object tracking device capable of performing object tracking with higher accuracy than conventional methods.
An object tracking method according to one aspect of the present disclosure is a method for tracking the same object across a plurality of frames of video captured by a camera, and includes: an object detection step of obtaining detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each detection result, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input; and a detection result classification step of classifying the detection results over the plurality of frames based on the similarity of the object identification values.
According to the present disclosure, the object identification values for all detection results are calculated collectively by neural computation that takes the object feature detection results over a plurality of frames as input. Because the identification value for any given frame is thus computed using the detection results of later (future) frames as well, the accuracy of object tracking can be improved.
FIG. 1 is a block diagram showing a schematic configuration of an object tracking system 1 according to Embodiment 1. FIG. 2 is a diagram showing an example of an image 111 captured by a camera 15. FIGS. 3(a) and 3(b) are diagrams for explaining the feature detection result data 113. FIGS. 4(a) and 4(b) are diagrams for explaining the object identification value data 115. FIGS. 5(a) and 5(b) are diagrams for explaining the group classification processing of the group classifier 116. FIGS. 6(a) and 6(b) are diagrams for explaining the group classification result data 117. FIG. 7 is a block diagram showing the configuration of a DNN.
1. Embodiment 1
An object tracking system 1 according to Embodiment 1 will be described below.
1.1 Configuration
(1) Object tracking system 1
FIG. 1 is a block diagram showing the configuration of the object tracking system 1. As shown in the figure, the object tracking system 1 comprises a camera 15 and an object tracking device 10.
(2) Camera 15
The camera 15 includes an imaging element such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge-Coupled Device) image sensor, and outputs an image of a predetermined size by photoelectrically converting the light focused on the imaging element into an electric signal.
(3) Object tracking device 10
The object tracking device 10 includes a control unit 11 and an input/output interface 12 for connecting to the camera 15. The control unit 11 includes a CPU (Central Processing Unit) 11a, a main storage device 11b, an auxiliary storage device 11c, and the like. The computer programs and data stored in the auxiliary storage device 11c are loaded into the main storage device 11b, and the CPU 11a operates according to the loaded computer programs and data, thereby realizing each processing unit (the object detector 112, the object identification value calculator 114, and the group classifier 116). The auxiliary storage device 11c is composed of, for example, a hard disk and/or a non-volatile semiconductor memory.
The auxiliary storage device 13 stores the image 111 captured by the camera 15, the feature detection result data 113, the object identification value data 115, the group classification result data 117, and the like.
(4) Captured image 111
The captured image 111 is image data of a plurality of frames captured by the camera 15. FIG. 2 shows an example of the image data 201 of one frame of the captured image 111 that is input to the object detector 112.
(5) Object detector 112
The object detector 112 receives the captured image 111, performs object detection processing, and outputs feature detection result data 113.
The object detector 112 is a neural network that has been machine-learned to detect the features of the objects to be detected. An existing neural network can be used as the object detector 112. In this embodiment, the object detector 112 uses OpenPose (see Non-Patent Document 2). OpenPose is a neural network that detects the joint points of a human body (feature points such as the face, neck, and shoulders) from image data.
FIG. 3(a) is a diagram schematically showing the feature points of objects detected by the object detector 112. It shows the detection result when the image data 201, in which two people appear, is input. As shown in the figure, a predetermined number of feature points 301 are detected for each detected person.
(6) Feature detection result data 113
The object detector 112 outputs feature detection result data 113 for each of the plurality of feature points 301 in FIG. 3(a). FIG. 3(b) shows an example of the data structure of the feature detection result data 113 for one feature point 301. As shown in FIG. 3(b), the feature detection result data 113 includes a feature point ID, position information, time information, likelihood information, object category information, and feature point category information.
The feature point ID is an identifier assigned to uniquely identify each of the plurality of feature points detected by the object detector 112.
The position information indicates the X and Y coordinates of the detected feature point in the detection image.
The time information is the frame number of the detection image.
The likelihood information indicates the confidence (likelihood) with which the feature point was detected.
The object category information indicates the category (type) of the object to which the detected feature point belongs, and is, for example, a value identifying a human, dog, cat, car, or the like.
The feature point category information indicates the category (type) of the detected feature point, and is, for example, a value identifying a head joint point, a neck joint point, a shoulder joint point, or the like.
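As a rough illustration of the record just described, one feature detection result could be represented as below. This is a hedged sketch only; the field names and the integer encodings of the categories are assumptions made for illustration, not the patent's data format.

```python
# Hedged sketch of one feature detection result record (field names and
# category encodings are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class FeatureDetection:
    point_id: int       # feature point ID (unique over all frames)
    x: float            # position information: X coordinate in the image
    y: float            # position information: Y coordinate in the image
    frame: int          # time information: frame number
    likelihood: float   # confidence with which the point was detected
    obj_category: int   # e.g. 0 = human, 1 = dog, 2 = cat, 3 = car
    pt_category: int    # e.g. 0 = head joint, 1 = neck joint, 2 = shoulder joint
```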
(7) Object identification value calculator 114
The object identification value calculator 114 is a machine-learned neural network that receives as input a plurality of feature detection result data 113 detected from a plurality of frames and calculates an object identification value for each of the input feature detection result data 113.
Here, the object identification value is a value that should be the same for detections belonging to the same object. For example, when the feature detection result data 113 detected from captured images 111 in which the same person appears over a plurality of frames are input, the object identification value calculator 114 ideally calculates and outputs object identification values that are all identical for the feature points belonging to that person.
An existing neural network can be used as the object identification value calculator 114. In this embodiment, the object identification value calculator 114 uses PointNet (see Non-Patent Document 5). PointNet is a neural network that takes point cloud data as input and executes a specific task. The object identification value calculator 114 is preferably a permutation-equivariant neural network that takes point cloud data as input.
The object identification value calculator 114 of this embodiment is trained, using a loss function designed around contrastive learning, so that feature points belonging to the same person yield the same object identification value and feature points belonging to different persons yield different object identification values.
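As an illustration of a permutation-equivariant, point-cloud-based calculator, a minimal PointNet-style sketch is shown below. It is an assumption-laden simplification (the layer sizes, feature dimensions, and max-pooled global feature are not taken from the patent), not the actual network of this embodiment.

```python
# Hedged sketch of a PointNet-style, permutation-equivariant embedding network.
# A shared MLP encodes each detection (point); a max-pooled global feature is
# concatenated back so every point sees context from all frames at once.
import torch
import torch.nn as nn

class PointEmbeddingNet(nn.Module):
    def __init__(self, in_dim: int = 8, embed_dim: int = 4):
        super().__init__()
        # per-point encoder, applied identically to every point (shared weights)
        self.local = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                   nn.Linear(64, 128), nn.ReLU())
        # head mapping [local feature, global feature] -> object identification value
        self.head = nn.Sequential(nn.Linear(128 + 128, 64), nn.ReLU(),
                                  nn.Linear(64, embed_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (P, in_dim), one row per feature detection over all frames
        local = self.local(points)                            # (P, 128)
        global_feat = local.max(dim=0, keepdim=True).values   # (1, 128), order-invariant
        global_feat = global_feat.expand(local.size(0), -1)   # broadcast to each point
        return self.head(torch.cat([local, global_feat], dim=1))  # (P, embed_dim)

# Usage: ids = PointEmbeddingNet()(torch.randn(420, 8))  # 420 detections -> 420 values
```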
The designed loss function is as follows.
[Math 1: formula image JPOXMLDOC01-appb-M000001]
In the above loss function, L_pull is as follows.
[Math 2: formula image JPOXMLDOC01-appb-M000002]
In the above loss function, L_push is as follows.
[Math 3: formula image JPOXMLDOC01-appb-M000003]
Here, K_n,i is the output (the estimated object identification value) for the i-th feature point (i is an integer from 1 to I) belonging to the object whose object ID is n (n is an integer from 1 to N). μ_m is the average of the outputs for the feature points belonging to the object whose object ID is m (m is an integer from 1 to N). N is the number of detected objects, and I is the number of feature points of the detected object whose object ID is n. For example, when person A and person B are detected in each frame of the captured images and no other objects are detected, N = 2. If person A is detected in 30 frames and 14 feature points of person A are detected per frame, then the value of I corresponding to person A is I = 14 × 30 = 420.
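The three formulas above are published only as images and are not reproduced in this text. Purely as a hedged sketch, a typical pull/push contrastive loss consistent with the symbols defined here, with an assumed margin Δ and equal weighting, could take the following form; the patent's actual equations may differ in detail.

```latex
% Hedged sketch only; not the patent's confirmed formulas.
L = L_{\mathrm{pull}} + L_{\mathrm{push}},\qquad
L_{\mathrm{pull}} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{I_n}\sum_{i=1}^{I_n}
  \left\lVert K_{n,i}-\mu_{n}\right\rVert^{2},\qquad
L_{\mathrm{push}} = \frac{1}{N(N-1)}\sum_{n=1}^{N}\sum_{\substack{m=1\\ m\neq n}}^{N}
  \max\!\bigl(0,\ \Delta-\left\lVert \mu_{n}-\mu_{m}\right\rVert\bigr)^{2}
```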
As teacher data, the object ID of the object to which each feature detection result data 113 belongs is given.
In the above loss function, the smaller the difference between the object identification values output for a plurality of feature points belonging to the same person, the smaller the value of L_pull. With this loss function, therefore, feature points belonging to the same person are trained to take the same object identification value.
Also, in the above loss function, the larger the difference between the object identification values output for feature points belonging to different persons, the smaller the value of L_push. With this loss function, therefore, feature points belonging to different persons are trained to take different object identification values.
Furthermore, because this loss function computes the loss values for the feature detection result data 113 obtained from many frames collectively, the network is trained so that the object identification value for feature detection result data 113 detected in a particular frame is computed using the feature detection result data 113 detected in later (future) frames as well.
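A batch implementation of the pull/push loss sketched above could look like the following; it processes the identification values of every feature point from every frame in one call, which is what lets future frames influence the training signal. This is a hedged sketch under the same assumptions as the formula sketch above.

```python
# Hedged sketch: pull/push loss computed over all frames at once.
import torch

def pull_push_loss(embeddings: torch.Tensor, object_ids: torch.Tensor,
                   margin: float = 1.0) -> torch.Tensor:
    # embeddings: (P, D) identification values K_{n,i} for every feature point
    # object_ids: (P,) teacher object IDs for those points
    ids = torch.unique(object_ids)
    means = torch.stack([embeddings[object_ids == n].mean(dim=0) for n in ids])  # mu_n

    # pull: draw each point's value toward the mean of its own object
    l_pull = torch.stack([
        ((embeddings[object_ids == n] - means[k]) ** 2).sum(dim=1).mean()
        for k, n in enumerate(ids)]).mean()

    # push: keep the means of different objects at least `margin` apart
    if len(ids) < 2:
        return l_pull
    dists = torch.cdist(means, means)                    # (N, N) pairwise distances
    off_diag = ~torch.eye(len(ids), dtype=torch.bool)
    l_push = torch.clamp(margin - dists[off_diag], min=0).pow(2).mean()
    return l_pull + l_push
```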
(8) Object identification value data 115
The object identification value calculator 114 calculates object identification value data 115 for each feature point (feature detection result data 113) detected by the object detector 112.
FIG. 4(a) is a diagram schematically showing a plurality of detected feature points 401. FIG. 4(b) shows the data structure of the object identification value data 115 calculated for a detected feature point 401. As shown in FIG. 4(b), the object identification value data 115 includes a feature point ID and an object identification value. The feature point ID is an identifier assigned to uniquely identify a feature point detected by the object detector 112. The object identification value is a vector value that should be the same for detections belonging to the same object.
(9) Group classifier 116
The group classifier 116 receives the object identification value data 115 of each feature point detected by the object detector 112 and classifies the feature points into groups based on the similarity of their object identification values. FIG. 5(a) is a diagram schematically showing a plurality of feature points 501 detected by the object detector 112, and FIG. 5(b) is a diagram schematically showing the feature points 501 after group classification.
An existing clustering method can be used for group classification. For example, the difference between the object identification values of two feature points may be calculated and, if the difference is smaller than a predetermined threshold, the two feature points are classified into the same group; performing this for all combinations groups the plurality of feature points.
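A minimal sketch of this threshold-based grouping, using a union-find structure so that the pairwise merges yield consistent groups, is shown below; the distance metric and threshold value are assumptions.

```python
# Hedged sketch: pairwise threshold grouping of object identification values
# (union-find keeps the merges consistent across all pairs).
import numpy as np

def group_by_threshold(id_values: np.ndarray, threshold: float) -> list:
    # id_values: (P, D) identification values; returns one group label per point
    parent = list(range(len(id_values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(id_values)):
        for j in range(i + 1, len(id_values)):
            if np.linalg.norm(id_values[i] - id_values[j]) < threshold:
                parent[find(i)] = find(j)   # same object -> same group

    return [find(i) for i in range(len(id_values))]
```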
As another group classification method, the K-means algorithm may be used: the value of k (the number of clusters) is varied over an arbitrary range to generate multiple clusterings, the optimum number of clusters is determined by the elbow method (based on the within-cluster sum of squared errors), and the clustering generated with the resulting optimum k is used as the group classification result.
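A sketch of this K-means plus elbow-method variant follows; the range of k and the simple second-difference elbow criterion are assumptions, and scikit-learn's KMeans is used only for illustration.

```python
# Hedged sketch: K-means over a range of k, with a simple elbow criterion
# (largest slowdown in the decrease of the within-cluster sum of squared errors).
import numpy as np
from sklearn.cluster import KMeans

def group_by_kmeans_elbow(id_values: np.ndarray, k_range=range(1, 10)):
    fits = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(id_values)
            for k in k_range if k <= len(id_values)}
    inertias = {k: f.inertia_ for k, f in fits.items()}
    ks = sorted(inertias)
    best_k, best_bend = ks[0], float("-inf")
    for i in range(1, len(ks) - 1):
        bend = ((inertias[ks[i - 1]] - inertias[ks[i]])
                - (inertias[ks[i]] - inertias[ks[i + 1]]))
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return fits[best_k].labels_             # one group label per feature point
```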
(10) Group classification result data 117
The group classifier 116 calculates group classification result data 117 for each feature point (feature detection result data 113) detected by the object detector 112.
FIG. 6(a) is a diagram schematically showing a plurality of detected feature points 501. FIG. 6(b) shows the data structure of the group classification result data 117 calculated for the detected feature points 501. As shown in FIG. 6(b), the group classification result data 117 includes a feature point ID and a group classification result. The feature point ID is an identifier assigned to uniquely identify a feature point detected by the object detector 112. The group classification result is an identifier indicating the group into which the feature point is classified; one group corresponds to one object (the same object detected across a plurality of frames).
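Putting the three steps together, a hedged end-to-end sketch of the flow in this embodiment is shown below; `detect`, `compute_id_values`, and `classify_groups` are hypothetical callables standing in for the object detector 112, the object identification value calculator 114, and the group classifier 116.

```python
# Hedged sketch of the overall tracking flow (detector, calculator, and
# classifier are passed in as hypothetical callables).
def track_objects(frames, detect, compute_id_values, classify_groups):
    # 1) object detection step: feature detection results for every frame
    detections = [d for t, frame in enumerate(frames)
                  for d in detect(frame, frame_no=t)]
    # 2) object identification value calculation step: all frames at once,
    #    so detections from future frames also shape each value
    id_values = compute_id_values(detections)
    # 3) detection result classification step: similar values -> same object
    groups = classify_groups(id_values)
    tracks = {}                               # one group = one tracked object
    for det, group in zip(detections, groups):
        tracks.setdefault(group, []).append(det)
    return tracks
```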
1.2 DNN
As described above, the object detector 112 and the object identification value calculator 114 are machine-learned deep neural networks (DNNs). Any DNN may be used as the object detector 112 as long as it detects feature points from an input image and outputs point cloud data. The object identification value calculator 114 receives point cloud data as input, and any DNN may be used as long as it is permutation-equivariant.
A neural network 700 shown in FIG. 7 will be described as an example of such a DNN.
(1) Structure of the neural network 700
A neural network is an information processing system modeled on the human neural network. In the neural network 700, an engineered neuron model corresponding to a nerve cell is called a neuron U. The neural network 700 has a structure in which a large number of neurons U are connected, and is composed of a plurality of layers 701, each of which is a collection of neurons. A weight indicating the strength of the connection between neurons is set between the neurons of adjacent layers.
A multi-input, single-output element is used as the neuron U. Signals propagate in one direction, and each input value is multiplied by the corresponding weight before being input to the neuron U. This weight can be changed by learning. From the neuron U, the sum of the weighted input values is transformed by an activation function and then output to each neuron U of the next layer. As the activation function, for example, ReLU or a sigmoid function can be used.
The first layer is called the input layer, and data is input to it. For example, the pixel value of each pixel forming one image is input to each neuron U of the input layer; likewise, the position information, time information, likelihood information, object category information, and feature point category information included in the point cloud data are input to the neurons U of the input layer. The last layer is called the output layer and outputs the results.
As a learning method for the neural network 700, for example, backpropagation is used: an error (loss value) is calculated with a predetermined error function (loss function) from the value indicating the correct answer (teacher data) and the output of the neural network 700 for the training data, and the weights between neurons are then updated sequentially, for example by the steepest descent method, so as to minimize this error.
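As a minimal, self-contained illustration of the neuron computation and the backpropagation update just described (and not of the actual networks used in this embodiment), consider the following sketch.

```python
# Hedged sketch: weighted sum -> ReLU activation per neuron, and one
# backpropagation step that updates the weights by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # hidden layer -> output layer

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # each neuron: weighted sum, then activation
    return h, h @ W2 + b2              # output layer

def train_step(x, target, lr=0.01):
    global W1, b1, W2, b2
    h, y = forward(x)
    err = y - target                    # gradient of 0.5 * (y - target)^2
    dW2, db2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (h > 0)         # propagate the error back through ReLU
    dW1, db1 = x.T @ dh, dh.sum(axis=0)
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
```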
2. Supplement
Although the present invention has been described above based on the embodiment, the present invention is of course not limited to the above embodiment, and the following modifications are also included in the technical scope of the present invention.
(1) In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of an object, but another neural network may be used. For example, a neural network that detects the circumscribed rectangle of an object (e.g., YOLO; see Non-Patent Document 3) may be used, and the vertices and center position of the circumscribed rectangle may be detected as the feature detection result. Alternatively, a neural network that detects the contour of an object (e.g., Deep Snake; see Non-Patent Document 4) may be used, and the detected contour points and the center point of the object may be used as the feature detection result.
It is also not necessary to treat each of the plurality of feature points detected for one object as a separate feature detection result; object information consisting of the plurality of feature points detected for one object may be used as the feature detection result. That is, skeleton information consisting of the position information of the plurality of joint points output by OpenPose may be used as the feature detection result; in this case, an object identification value is calculated for the skeleton information and group classification is performed on it. Similarly, rectangle information consisting of the position and size of the circumscribed rectangle of the object output by YOLO, or contour information consisting of the position information of the plurality of contour points of the object output by Deep Snake, may be used as the feature detection result.
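As an illustration of this modification, a hedged sketch of how a box-style detection (as from YOLO) or a whole skeleton could be packed into detection records follows; the field names are assumptions, not the patent's data format.

```python
# Hedged sketch: packing alternative detector outputs into detection records
# (field names are assumptions).
def box_to_points(x1, y1, x2, y2, frame, obj_cat):
    # circumscribed rectangle -> its four vertices and center as feature points
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2), (cx, cy)]
    return [{"x": px, "y": py, "frame": frame, "obj_cat": obj_cat, "pt_cat": k}
            for k, (px, py) in enumerate(corners)]

def skeleton_as_one_detection(joints, frame, obj_cat):
    # alternative: treat the whole skeleton as a single detection result,
    # so one identification value is computed per skeleton
    return {"frame": frame, "obj_cat": obj_cat,
            "joints": [(j["x"], j["y"]) for j in joints]}
```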
(2) In the above embodiment, the feature detection result data 113 includes likelihood information, object category information, and feature point category information, but it may include other information as well. For example, information on the appearance of the object (e.g., color information) may be included.
(3) In the above embodiment, the object detector 112 may receive a single image of one frame as input and detect the features of the objects in that frame, or may receive a plurality of images consisting of a plurality of frames including that frame as input and detect the features of the objects in that frame.
(4) In Non-Patent Document 1, the features of an object are detected by a neural network using a mechanism called self-attention. In the above embodiment, the object detector 112 may detect object features using the self-attention mechanism described in Non-Patent Document 1.
(5) In the above embodiment, the object identification value calculator 114 performs the task of calculating an object identification value for each feature detection result data 113, but it may simultaneously perform other tasks in addition to calculating the object identification values.
For example, if the object detector 112 detects feature points without detecting the object category (type of object), the object identification value calculator 114 may also perform the task of detecting the object category in addition to the object identification value. In this case, the detected object category information is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the object category detection task can be executed simultaneously.
A task of recognizing the attributes or behavior of an object may also be executed based on the feature detection results from the object detector 112. The task of recognizing object attributes may be, for example, recognizing attributes such as a person's gender, age, or the type and color of the clothes worn. The task of recognizing the behavior of an object may be, for example, recognizing behaviors of an object (person) such as running, walking, or making a phone call. In these cases, information on the attributes or behavior of the detected objects is added to the teacher data, the error with respect to the output values is calculated, and the network is trained by backpropagation to minimize this error, so that the attribute and behavior detection tasks can be executed simultaneously.
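A hedged sketch of adding such extra task heads next to the identification-value head is given below; the shared trunk, head sizes, and loss weighting are assumptions rather than the patent's design.

```python
# Hedged sketch: extra task heads (category, attribute/action) beside the
# object identification value head, trained jointly by backpropagation.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, feat_dim: int = 128, embed_dim: int = 4,
                 n_categories: int = 5, n_actions: int = 3):
        super().__init__()
        self.id_head = nn.Linear(feat_dim, embed_dim)       # identification value
        self.cat_head = nn.Linear(feat_dim, n_categories)   # object category logits
        self.act_head = nn.Linear(feat_dim, n_actions)      # attribute/action logits

    def forward(self, feats: torch.Tensor):
        # feats: (P, feat_dim) shared per-point features
        return self.id_head(feats), self.cat_head(feats), self.act_head(feats)

# Training would add cross-entropy losses on the extra heads to the pull/push
# loss on the identification values and backpropagate the total error.
```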
The object identification value calculator 114 may also, based on the feature detection results from the object detector 112, execute a task of recognizing the situation in each frame of the captured images 111 or the situation across a plurality of frames, in addition to calculating the object identification values. The situation recognition task may be, for example, recognizing the danger of contact between objects, such as between a pedestrian and a car, or recognizing dangerous driving such as ignoring traffic lights. In these cases, information on the situation of the captured images is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the situation recognition task can be executed simultaneously.
(7) In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of objects; OpenPose classifies the detected joint points by object when outputting them. For example, when person A and person B appear in an image, the joint points of person A and those of person B are output in a distinguishable form. The group classifier 116 may use this classification result to classify the feature detection result data 113 by object. That is, since the feature points within each frame are already classified by object by the object detector 112, the group classifier 116 may first place the feature points within the same frame into the same group according to that classification, and then decide, based on the similarity of the object identification values, whether feature points from other frames belong to the same group.
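As one hedged way to exploit the per-frame grouping, the sketch below first trusts the detector's within-frame person indices and then links frame-level groups across frames by comparing mean identification values; the greedy frame-by-frame linking and the threshold are simplifying assumptions, not the embodiment's procedure.

```python
# Hedged sketch: within-frame groups come from the detector; cross-frame
# linking compares per-group mean identification values (greedy, simplified).
import numpy as np

def link_frame_groups(frame_groups, threshold: float = 0.5):
    # frame_groups: list over frames; each entry maps a within-frame person
    #               index to the (K, D) identification values of its joints
    tracks = []        # list of (mean identification value, track id)
    assignments = []
    for groups in frame_groups:
        frame_assign = {}
        for person_idx, vals in groups.items():
            mean = vals.mean(axis=0)
            dists = [np.linalg.norm(mean - t_mean) for t_mean, _ in tracks]
            if dists and min(dists) < threshold:
                frame_assign[person_idx] = tracks[int(np.argmin(dists))][1]
            else:
                tracks.append((mean, len(tracks)))          # start a new track
                frame_assign[person_idx] = len(tracks) - 1
        assignments.append(frame_assign)
    return assignments   # per frame: within-frame person index -> track id
```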
 3.  Others
 An object tracking method according to one aspect of the present disclosure is an object tracking method for tracking the same object across a plurality of frames of video captured by a camera, the method comprising: an object detection step of acquiring detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and a detection result classification step of classifying the detection results over the plurality of frames based on similarity of the object identification values.
 In the above object tracking method, the detection result may include, in addition to time information, any of: information on positions of feature points of the skeleton of the object, information on positions of feature points of the circumscribed rectangle of the object, and information on positions of feature points of the outline of the object.
 In the above object tracking method, the detection result may include any of: likelihood information indicating how plausibly the feature points are detected, object category information indicating the type of the object, feature point category information indicating the type of each feature point, and object appearance information indicating appearance features of the object.
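 For illustration only, a detection result carrying the items listed above might be represented as a simple record such as the following; the field names are assumptions introduced here, not terms defined in this disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Detection:
    time: int                                      # frame time information
    keypoints: List[Tuple[float, float]]           # skeleton / circumscribed-rectangle / outline feature points
    likelihood: Optional[List[float]] = None       # per-feature-point detection likelihood
    object_category: Optional[int] = None          # type of object (e.g. person, vehicle)
    keypoint_category: Optional[List[int]] = None  # type of each feature point (e.g. head, elbow)
    appearance: Optional[List[float]] = None       # appearance feature of the object
```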
 In the above object tracking method, the object detection step may detect the features of the object in one frame by taking as input a single image of the one frame, or by taking as input a plurality of images consisting of a plurality of frames including the one frame.
 In the above object tracking method, the object detection step may calculate the detection results by an object detector that performs a neural operation taking the single image or the plurality of images as input.
 In the above object tracking method, the object detector may use a DNN (Deep Neural Network).
 In the above object tracking method, the object detector may use a neural network with a Self-Attention mechanism.
 In the above object tracking method, the object identification value calculation step may collectively calculate the object identification value for each of the detection results of the plurality of frames using a DNN (Deep Neural Network) that takes as input the detection results in a point cloud data format.
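 A minimal sketch of such a point-cloud-style network is shown below (assumed PyTorch; layer sizes are illustrative): a per-detection MLP with shared weights, combined with a symmetric pooling over the whole set, produces one identification vector per detection in a single pass and anticipates the permutation property stated next.

```python
import torch
import torch.nn as nn

class IdValueNet(nn.Module):
    """Maps a set of detection vectors (all frames at once, order-free) to one
    identification vector per detection."""
    def __init__(self, in_dim=8, hidden=128, id_dim=32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        self.out_mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, id_dim))

    def forward(self, points):                  # points: (N, in_dim)
        h = self.point_mlp(points)              # shared weights -> per-detection features
        g = h.max(dim=0, keepdim=True).values   # symmetric (max) pooling: global context
        g = g.expand_as(h)
        return self.out_mlp(torch.cat([h, g], dim=-1))  # (N, id_dim), one value per detection
```

 Max pooling is used here only because it is a symmetric function; any symmetric aggregation (sum, mean) preserves the same property.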
 In the above object tracking method, the DNN may be permutation-equivariant.
 In the above object tracking method, the DNN may be trained by contrastive learning so that two detection results have the same value when they belong to the same object and have different values when they belong to different objects.
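 A standard pairwise contrastive loss of the kind described here could be written as follows; the margin value and the Euclidean distance are assumptions, not parameters specified in this disclosure.

```python
import torch

def contrastive_loss(id_values, same_object, margin=1.0):
    """id_values: (N, D) identification vectors; same_object: (N, N) bool,
    True where two detections belong to the same object."""
    d = torch.cdist(id_values, id_values)               # pairwise distances
    pos = same_object.float() * d.pow(2)                # pull same-object pairs together
    neg = (~same_object).float() * torch.clamp(margin - d, min=0).pow(2)  # push others apart
    n = id_values.shape[0]
    off_diag = 1.0 - torch.eye(n, device=id_values.device)
    return ((pos + neg) * off_diag).sum() / off_diag.sum()
```

 Detections of the same object are driven toward identical identification values, while pairs from different objects are pushed at least `margin` apart.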
 In the above object tracking method, the DNN may be trained to recognize, in addition to the object identification value, at least one of the type, the attribute, and the action of the object for each of the detection results.
 In the above object tracking method, the DNN may be trained to recognize, in addition to the object identification value, the situation in each frame of the video or the situation of the video.
 In the above object tracking method, the object identification value may be a vector value, and the detection result classification step may associate object identification values whose distances from each other are small with one another as detection results of the same object.
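 As one possible realization of this distance-based association (not the one prescribed by the disclosure), the identification vectors can be grouped by agglomerative clustering with a distance threshold, assuming SciPy is available; the threshold is an assumed parameter.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def classify_by_id_value(id_values, threshold=0.5):
    """id_values: (N, D) array of identification vectors; returns one integer
    group label per detection, grouping detections whose vectors are close."""
    z = linkage(np.asarray(id_values, dtype=float), method="average", metric="euclidean")
    return fcluster(z, t=threshold, criterion="distance")
```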
 In the above object tracking method, the object detection step may classify the detection results by object in each frame of the video, and the detection result classification step may classify the detection results over the plurality of frames by object using the classification result of the object detection step.
 An object tracking device according to one aspect of the present disclosure is an object tracking device for tracking the same object across a plurality of frames of video captured by a camera, the device comprising: an object detector that acquires detection results of features of an object in each frame of the video; an object identification value calculator that calculates, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and a detection result classifier that classifies the detection results over the plurality of frames based on similarity of the object identification values.
 According to the present disclosure, since the object identification values for the respective detection results are calculated collectively by a neural operation that takes as input the object feature detection results over a plurality of frames, the object identification value of any given frame is computed using the detection results of frames later than that frame, and the accuracy of object tracking can therefore be improved.
 The present disclosure is useful as an object tracking device installed in a surveillance camera system or the like.
  1 Object tracking system
 10 Object tracking device
112 Object detector
114 Object identification value calculator
116 Group classifier
 15 Camera

Claims (15)

  1.  An object tracking method for tracking the same object across a plurality of frames of video captured by a camera, the method comprising:
     an object detection step of acquiring detection results of features of an object in each frame of the video;
     an object identification value calculation step of calculating, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and
     a detection result classification step of classifying the detection results over the plurality of frames based on similarity of the object identification values.
  2.  The object tracking method according to claim 1, wherein the detection result includes, in addition to time information, any of: information on positions of feature points of a skeleton of the object, information on positions of feature points of a circumscribed rectangle of the object, and information on positions of a plurality of feature points of an outline of the object.
  3.  The object tracking method according to claim 2, wherein the detection result includes any of: likelihood information indicating how plausibly the feature points are detected, object category information indicating a type of the object, feature point category information indicating a type of each feature point, and object appearance information indicating appearance features of the object.
  4.  The object tracking method according to claim 1, wherein the object detection step detects the features of the object in one frame by taking as input a single image of the one frame, or by taking as input a plurality of images consisting of a plurality of frames including the one frame.
  5.  The object tracking method according to claim 4, wherein the object detection step calculates the detection results by an object detector that performs a neural operation taking the single image or the plurality of images as input.
  6.  The object tracking method according to claim 5, wherein the object detector uses a DNN (Deep Neural Network).
  7.  The object tracking method according to claim 5, wherein the object detector uses a neural network with a Self-Attention mechanism.
  8.  The object tracking method according to claim 1, wherein the object identification value calculation step collectively calculates the object identification value for each of the detection results of the plurality of frames using a DNN (Deep Neural Network) that takes as input the detection results in a point cloud data format.
  9.  The object tracking method according to claim 8, wherein the DNN is permutation-equivariant.
  10.  The object tracking method according to claim 8, wherein the DNN is trained by contrastive learning so that two detection results have the same value when they belong to the same object and have different values when they belong to different objects.
  11.  The object tracking method according to claim 8, wherein the DNN is trained to recognize, in addition to the object identification value, at least one of a type of the object, an attribute of the object, and an action of the object for each of the detection results.
  12.  The object tracking method according to claim 8, wherein the DNN is trained to recognize, in addition to the object identification value, a situation in each frame of the video or a situation of the video.
  13.  The object tracking method according to claim 1, wherein the object identification value is a vector value, and
     the detection result classification step associates object identification values whose distances from each other are small with one another as detection results of the same object.
  14.  The object tracking method according to claim 1, wherein the object detection step classifies the detection results by object in each frame of the video, and
     the detection result classification step classifies the detection results over the plurality of frames by object using a classification result of the object detection step.
  15.  An object tracking device for tracking the same object across a plurality of frames of video captured by a camera, the device comprising:
     an object detector that acquires detection results of features of an object in each frame of the video;
     an object identification value calculator that calculates, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and
     a detection result classifier that classifies the detection results over the plurality of frames based on similarity of the object identification values.
PCT/JP2022/042682 2021-12-20 2022-11-17 Object tracking method and object tracking device WO2023119969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021206542 2021-12-20
JP2021-206542 2021-12-20

Publications (1)

Publication Number Publication Date
WO2023119969A1 true WO2023119969A1 (en) 2023-06-29

Family

ID=86902188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/042682 WO2023119969A1 (en) 2021-12-20 2022-11-17 Object tracking method and object tracking device

Country Status (1)

Country Link
WO (1) WO2023119969A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002077906A (en) * 2000-07-06 2002-03-15 Mitsubishi Electric Research Laboratories Inc Method and system for extracting high-level features from low-level features of multimedia content
JP2009211123A (en) * 2008-02-29 2009-09-17 Institute Of Physical & Chemical Research Classification device and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002077906A (en) * 2000-07-06 2002-03-15 Mitsubishi Electric Research Laboratories Inc Method and system for extracting high-level features from low-level features of multimedia content
JP2009211123A (en) * 2008-02-29 2009-09-17 Institute Of Physical & Chemical Research Classification device and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHARLES R. QI; HAO SU; KAICHUN MO; LEONIDAS J. GUIBAS: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 December 2016 (2016-12-02), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080736277, DOI: 10.1109/CVPR.2017.16 *
WEI-CHIH HUNG; HENRIK KRETZSCHMAR; TSUNG-YI LIN; YUNING CHAI; RUICHI YU; MING-HSUAN YANG; DRAGO ANGUELOV: "SoDA: Multi-Object Tracking with Soft Data Association", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 January 1900 (1900-01-01), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081743210 *

Similar Documents

Publication Publication Date Title
Zhang et al. Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things
Ijjina et al. Human action recognition in RGB-D videos using motion sequence information and deep learning
Pervaiz et al. Hybrid algorithm for multi people counting and tracking for smart surveillance
CN106845487B (en) End-to-end license plate identification method
Shami et al. People counting in dense crowd images using sparse head detections
Basly et al. CNN-SVM learning approach based human activity recognition
US20170351905A1 (en) Learning model for salient facial region detection
Shahzad et al. A smart surveillance system for pedestrian tracking and counting using template matching
CN109583315B (en) Multichannel rapid human body posture recognition method for intelligent video monitoring
DE112019005671T5 (en) DETERMINING ASSOCIATIONS BETWEEN OBJECTS AND PERSONS USING MACHINE LEARNING MODELS
Serpush et al. Complex human action recognition using a hierarchical feature reduction and deep learning-based method
Potdar et al. A convolutional neural network based live object recognition system as blind aid
CN111523559B (en) Abnormal behavior detection method based on multi-feature fusion
CN111611874A (en) Face mask wearing detection method based on ResNet and Canny
WO2022156317A1 (en) Video frame processing method and apparatus, electronic device, and storage medium
Waheed et al. A novel deep learning model for understanding two-person interactions using depth sensors
Acharya et al. Real-time detection and tracking of pedestrians in CCTV images using a deep convolutional neural network
Aldahoul et al. A comparison between various human detectors and CNN-based feature extractors for human activity recognition via aerial captured video sequences
Wang Three-dimensional convolutional restricted Boltzmann machine for human behavior recognition from RGB-D video
Rashwan et al. Action representation and recognition through temporal co-occurrence of flow fields and convolutional neural networks
Echoukairi et al. Improved Methods for Automatic Facial Expression Recognition.
Kale et al. Suspicious activity detection using transfer learning based resnet tracking from surveillance videos
Nigam et al. Multiview human activity recognition using uniform rotation invariant local binary patterns
Srividya et al. Deep learning techniques for physical abuse detection
Valappil et al. Vehicle detection in UAV videos using CNN-SVM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22910688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE