WO2023119969A1 - Object tracking method and object tracking device - Google Patents

Object tracking method and object tracking device

Info

Publication number
WO2023119969A1
Authority
WO
WIPO (PCT)
Prior art keywords
detection
detection result
tracking method
identification value
detection results
Prior art date
Application number
PCT/JP2022/042682
Other languages
French (fr)
Japanese (ja)
Inventor
文彬 佐藤
大気 関井
Original Assignee
コニカミノルタ株式会社
Priority date
Filing date
Publication date
Application filed by コニカミノルタ株式会社
Publication of WO2023119969A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion

Definitions

  • the present disclosure relates to technology for detecting and tracking an object from captured images.
  • Object detection technology that detects objects such as people and vehicles from images captured by cameras and tracks the same object in multiple frames is used in applications such as surveillance camera systems and in-vehicle camera systems.
  • In recent years, deep learning has been used for object tracking. As an object detection method using deep learning, Non-Patent Document 1 can be cited, for example. Non-Patent Document 1 discloses a technique for tracking an object by using the object detection results at time t and the object tracking results up to time t-1 to associate each detection result at time t with one of the tracking results up to time t-1.
  • In Non-Patent Document 1, because only the error in the correspondence between the tracking results up to time t-1 and the detection results at time t is backpropagated, errors involving detection results at times later than t cannot be handled. Object tracking accuracy could therefore potentially be improved by also using detection results from times later than t.
  • the present disclosure has been made in view of the above problems, and aims to provide an object tracking method and an object tracking device capable of performing object tracking with higher accuracy than conventional methods.
  • An object tracking method according to one aspect of the present disclosure is a method for tracking the same object across a plurality of frames of video captured by a camera, and includes: an object detection step of obtaining detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each detection result, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input; and a detection result classification step of classifying the detection results over the plurality of frames based on the similarity of the object identification values.
  • According to the present disclosure, the object identification values for all detection results are calculated collectively by neural computation that takes the object feature detection results over a plurality of frames as input. Because the identification value for any given frame is thus computed using the detection results of later (future) frames as well, the accuracy of object tracking can be improved.
  • FIG. 1 is a block diagram showing a schematic configuration of an object tracking system 1 according to Embodiment 1;
  • FIG. 2 is a diagram showing an example of an image 111 captured by a camera 15;
  • FIGS. 3A and 3B are diagrams for explaining the feature detection result data 113.
  • FIGS. 4A and 4B are diagrams for explaining the object identification value data 115.
  • FIGS. 5A and 5B are diagrams for explaining the group classification processing of the group classifier 116.
  • FIGS. 6A and 6B are diagrams for explaining the group classification result data 117.
  • FIG. 7 is a block diagram showing the configuration of a DNN.
  • Embodiment 1: An object tracking system 1 according to Embodiment 1 will be described below.
  • FIG. 1 is a block diagram showing the configuration of the object tracking system 1. As shown in the figure, the object tracking system 1 comprises a camera 15 and an object tracking device 10.
  • The camera 15 includes an imaging element such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge-Coupled Device) image sensor, and outputs an image of a predetermined size by photoelectrically converting the light focused on the imaging element into an electric signal.
  • The object tracking device 10 includes a control unit 11 and an input/output interface 12 for connecting to the camera 15.
  • The control unit 11 includes a CPU (Central Processing Unit) 11a, a main storage device 11b, an auxiliary storage device 11c, and the like.
  • The computer programs and data stored in the auxiliary storage device 11c are loaded into the main storage device 11b, and the CPU 11a operates according to the loaded computer programs and data, thereby realizing each processing unit (the object detector 112, the object identification value calculator 114, and the group classifier 116).
  • The auxiliary storage device 11c is composed of, for example, a hard disk and/or a non-volatile semiconductor memory.
  • the auxiliary storage device 13 stores an image 111 captured by the camera 15, feature detection result data 113, object identification value data 115, group classification result data 117, and the like.
  • Photographed image 111 is image data of a plurality of frames captured by the camera 15 .
  • FIG. 2 shows an example of image data 201 of one frame of the captured image 111 that is input to the object detector 112 .
  • the object detector 112 receives the captured image 111 , performs object detection processing, and outputs feature detection result data 113 .
  • the object detector 112 is a neural network that has performed machine learning to detect the features of the object to be detected.
  • An existing neural network can be used for the object detection unit 112 .
  • the object detection unit 112 uses OpenPose (see Non-Patent Document 2).
  • OpenPose is a neural network that detects joint points of a human body (characteristic points such as face, neck, shoulders, etc.) from image data.
  • FIG. 3(a) is a diagram schematically showing feature points of an object detected by the object detector 112.
  • FIG. 3(a) shows the detection result when the image data 201 in which two people are shown is input. As shown in the figure, a predetermined number of feature points 301 are detected for each detected person.
  • FIG. 3B shows an example of the data structure of the feature detection result data 113 for one feature point 301.
  • the feature detection result data 113 includes feature point IDs, position information, time information, likelihood information, object category information, and feature point category information.
  • a feature point ID is an identifier attached to uniquely identify a plurality of feature points detected by the object detector 112 .
  • the position information is information indicating the X-coordinate and Y-coordinate of the detected feature point in the detection image.
  • the time information is the frame number of the detected image.
  • The likelihood information indicates the confidence (likelihood) with which the feature point was detected.
  • the object category information is information indicating the category (type) of the object to which the detected feature points belong.
  • the object category information is, for example, values identifying humans, dogs, cats, cars, and the like.
  • the feature point category information is information indicating the category (type) of the detected feature point.
  • the feature point category information is, for example, values for identifying head joint points, neck joint points, shoulder joint points, and the like.
  • The object identification value calculator 114 is a machine-learned neural network that receives as input a plurality of feature detection result data 113 detected from a plurality of frames and calculates an object identification value for each of the input feature detection result data 113.
  • The object identification value should take the same value for detections belonging to the same object.
  • For example, when the feature detection result data 113 detected from captured images 111 in which the same person appears over a plurality of frames are input, the object identification value calculator 114 ideally calculates and outputs object identification values that are all identical for the feature points belonging to that person.
  • An existing neural network can be used for the object identification value calculator 114 .
  • the object identification value calculator 114 uses PointNet (see Non-Patent Document 5).
  • PointNet is a neural network for executing a specific task with point cloud data as input.
  • the object identification value calculator 114 is preferably a permutation-equivariant neural network that uses point cloud data as an input.
  • The object identification value calculator 114 of the present embodiment is trained, using a loss function designed around contrastive learning, so that feature points belonging to the same person yield the same object identification value and feature points belonging to different persons yield different object identification values.
  • The designed loss function combines a pull term L_pull and a push term L_push (the formulas themselves are given as images in the published application).
  • K_n,i is the output (the estimated object identification value) for the i-th feature point (i is an integer from 1 to I) belonging to the object whose object ID is n (n is an integer from 1 to N).
  • ⁇ m is the average value of outputs of feature points belonging to an object whose object ID is m (m is an integer equal to or greater than 1 and equal to or less than N).
  • N is the number of detected objects.
  • As teacher data, the object ID of the object to which each feature detection result data 113 belongs is given.
  • The object identification value calculator 114 calculates object identification value data 115 for each feature point (feature detection result data 113) detected by the object detector 112.
  • FIG. 4(a) is a diagram schematically showing a plurality of detected feature points 401.
  • FIG. 4(b) shows the data structure of the object identification value data 115 calculated for a detected feature point 401.
  • the object identification value data 115 includes feature point IDs and object identification values.
  • a feature point ID is an identifier attached to uniquely identify a feature point detected by the object detector 112 .
  • the object identification value is a vector value that should be the same value when belonging to the same object.
  • The group classifier 116 receives the object identification value data 115 of each feature point detected by the object detector 112 and classifies the feature points into groups based on the similarity of their object identification values. FIG. 5(a) is a diagram schematically showing a plurality of feature points 501 detected by the object detector 112, and FIG. 5(b) is a diagram schematically showing the feature points 501 after group classification.
  • An existing clustering method can be used for group classification. For example, the difference between the object identification values of two feature points may be calculated and, if the difference is smaller than a predetermined threshold, the two feature points are classified into the same group; performing this for all combinations groups the plurality of feature points.
  • As another group classification method, the K-means algorithm may be used: the value of k (the number of clusters) is varied over an arbitrary range to generate multiple clusterings, the optimum number of clusters is determined by the elbow method (based on the within-cluster sum of squared errors), and the clustering generated with the resulting optimum k is used as the group classification result.
  • The group classifier 116 calculates group classification result data 117 for each feature point (feature detection result data 113) detected by the object detector 112.
  • FIG. 6(a) is a diagram schematically showing a plurality of detected feature points 501.
  • FIG. 6(b) shows the data structure of the group classification result data 117 calculated for the detected feature points 501.
  • the group classification result data 117 includes feature point IDs and group classification results.
  • a feature point ID is an identifier attached to uniquely identify a feature point detected by the object detector 112 .
  • a group classification result is an identifier indicating a classified group, and one group indicates one object (same object detected in a plurality of frames).
  • As described above, the object detector 112 and the object identification value calculator 114 are machine-learned deep neural networks (DNNs). Any DNN may be used as the object detector 112 as long as it detects feature points from an input image and outputs point cloud data.
  • The object identification value calculator 114 receives point cloud data as input, and any DNN may be used as long as it is permutation-equivariant.
  • A neural network 700 shown in FIG. 7 will be described as an example of such a DNN.
  • a neural network is an information processing system that imitates a human neural network.
  • an engineered neuron model corresponding to a nerve cell is called a neuron U here.
  • a neural network 700 has a structure in which a large number of neurons U are connected. Further, the neural network 700 is composed of a plurality of layers 701 each having a plurality of neurons. A weight indicating the strength of connection between neurons is set between neurons in adjacent layers.
  • a multi-input single-output element is used as the neuron U.
  • the signal propagates in one direction, and the input value is multiplied by the above weight and input to the neuron U. This weight can be changed by learning. From the neuron U, the sum of the input values multiplied by the weights is transformed by the activation function and then output to each neuron U of the next layer.
  • the activation function for example, ReLU or a sigmoid function can be used.
  • the first layer is called the input layer, and data is input. For example, the pixel value of each pixel forming one image is input to each neuron U of the input layer. Position information, time information, likelihood information, object category information, and feature point information included in the point cloud data are input to each neuron U of the input layer.
  • the last layer called the output layer, is the layer that outputs the results.
  • As a learning method for the neural network 700, for example, backpropagation is used: an error (loss value) is calculated with a predetermined error function (loss function) from the value indicating the correct answer (teacher data) and the output of the neural network 700 for the training data, and the weights between neurons are then updated sequentially, for example by the steepest descent method, so as to minimize this error.
  • In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of an object, but another neural network may be used.
  • For example, a neural network that detects the circumscribed rectangle of an object (e.g., YOLO; see Non-Patent Document 3) may be used, and the vertices and center position of the circumscribed rectangle may be detected as the feature detection result.
  • Alternatively, a neural network that detects the contour of an object (e.g., Deep Snake; see Non-Patent Document 4) may be used, and the detected contour points and the center point of the object may be used as the feature detection result.
  • It is also not necessary to treat each of the plurality of feature points detected for one object as a separate feature detection result; object information consisting of the plurality of feature points detected for one object may be used as the feature detection result. For example, skeleton information consisting of the position information of the plurality of joint points output by OpenPose may be used as the feature detection result; in this case, an object identification value is calculated for the skeleton information and group classification is performed on it.
  • Similarly, rectangle information consisting of the position and size of the circumscribed rectangle of the object output by YOLO, or contour information consisting of the position information of the plurality of contour points of the object output by Deep Snake, may be used as the feature detection result.
  • The feature detection result data 113 includes likelihood information, object category information, and feature point category information, but may include other information as well; for example, information on the appearance of the object (e.g., color information) may be included.
  • The object detector 112 may receive a single image of one frame as input and detect the features of the objects in that frame, or may receive a plurality of images consisting of a plurality of frames including that frame as input and detect the features of the objects in that frame.
  • In Non-Patent Document 1, the features of an object are detected by a neural network using a mechanism called self-attention.
  • The object detector 112 may likewise detect object features using the self-attention mechanism described in Non-Patent Document 1.
  • In the above embodiment, the object identification value calculator 114 performs the task of calculating an object identification value for each feature detection result data 113, but it may simultaneously perform other tasks in addition to calculating the object identification values.
  • For example, if the object detector 112 detects feature points without detecting the object category, the object identification value calculator 114 may also perform the task of detecting the object category. In this case, the detected object category information is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the object category detection task can be executed simultaneously.
  • A task of recognizing the attributes or behavior of an object may also be executed based on the feature detection results from the object detector 112.
  • The task of recognizing object attributes may be, for example, recognizing attributes such as a person's gender, age, or the type and color of the clothes worn.
  • The task of recognizing the behavior of an object may be, for example, recognizing behaviors of an object (person) such as running, walking, or making a phone call. In these cases, information on the attributes or behavior of the detected objects is added to the teacher data, the error with respect to the output values is calculated, and the network is trained by backpropagation to minimize this error, so that the attribute and behavior detection tasks can be executed simultaneously.
  • The object identification value calculator 114 may also, based on the feature detection results from the object detector 112, execute a task of recognizing the situation in each frame of the captured images 111 or the situation across a plurality of frames, in addition to calculating the object identification values.
  • The situation recognition task may be, for example, recognizing the danger of contact between objects, such as between a pedestrian and a car, or recognizing dangerous driving such as ignoring traffic lights. In these cases, information on the situation of the captured images is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the situation recognition task can be executed simultaneously.
  • In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of objects; OpenPose classifies the detected joint points by object when outputting them. For example, when person A and person B appear in an image, the joint points of person A and those of person B are output in a distinguishable form.
  • The group classifier 116 may use this classification result to classify the feature detection result data 113 by object. That is, since the feature points within each frame are already classified by object by the object detector 112, the group classifier 116 may first place the feature points within the same frame into the same group according to that classification, and then decide, based on the similarity of the object identification values, whether feature points from other frames belong to the same group.
  • An object tracking method according to one aspect of the present disclosure is a method for tracking the same object across a plurality of frames of video captured by a camera, and includes: an object detection step of acquiring detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each of the detection results, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input; and a detection result classification step of classifying the detection results over the plurality of frames based on the similarity of the object identification values.
  • In the above object tracking method, the detection result may include, in addition to time information, any of: information on the positions of feature points of the skeleton of the object, information on the positions of feature points of the circumscribed rectangle of the object, and information on the positions of feature points of the contour of the object.
  • The detection result may also include any of: likelihood information indicating the confidence with which the feature point was detected, object category information indicating the type of the object, feature point category information indicating the type of the feature point, and object appearance information characterizing the appearance of the object.
  • The object detection step may detect the features of an object in one frame by receiving a single image of that frame as input, or by receiving as input a plurality of images consisting of a plurality of frames including that frame.
  • the object detection step may calculate the detection result by an object detector that performs neural operations with the single image or the plurality of images as input.
  • the object detector may use DNN (Deep Neural Network).
  • the object detector may use a neural network with a self-attention mechanism.
  • The object identification value calculation step may collectively calculate the object identification values for the detection results of the plurality of frames using a DNN (Deep Neural Network) that receives the detection results in point cloud data format.
  • the DNN may be Permutation-Equivariant.
  • The DNN may be trained by contrastive learning so that two detection results yield the same object identification value when they belong to the same object and different values when they belong to different objects.
  • the DNN may be trained to recognize the situation in each frame of the video or the situation of the video in addition to the object identification value.
  • the object identification values may be vector values
  • the detection result classification step may associate object identification values with close distances as detection results of the same object.
  • The object detection step may classify the detection results by object within each frame of the video, and the detection result classification step may use this classification result to classify the detection results over the plurality of frames by object.
  • An object tracking device according to one aspect of the present disclosure is a device that tracks the same object across a plurality of frames of video captured by a camera, and comprises: an object detector that acquires detection results of features of an object in each frame of the video;
  • an object identification value calculator that calculates, for each detection result, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input;
  • and a detection result classifier that classifies the detection results over the plurality of frames based on the similarity of the object identification values.
  • According to this configuration, the object identification values for all detection results are calculated collectively by neural computation that takes the object feature detection results over a plurality of frames as input. Because the detection results of future frames are also used, the accuracy of object tracking can be improved.
  • The present disclosure is useful as an object tracking device installed in a surveillance camera system or the like.
  • Reference signs: 1 object tracking system; 10 object tracking device; 15 camera; 112 object detector; 114 object identification value calculator; 116 group classifier

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Provided is an object tracking method that allows high-precision object tracking. This object tracking method, which tracks the same object across multiple frames of video captured by a camera, has: an object detection step for acquiring detection results of features of objects in each frame of the video; an object identification value calculation step for calculating, for each detection result, object identification values that should take the same value when belonging to the same object, by neural computation in which the detection results across the multiple frames are used as input; and a detection result classification step for classifying the detection results across the multiple frames on the basis of the similarity of the object identification values.

Description

Object tracking method and object tracking device
The present disclosure relates to technology for detecting and tracking an object from captured images.
Object detection technology that detects objects such as people and vehicles from video captured by a camera and tracks the same object across multiple frames is used in applications such as surveillance camera systems and in-vehicle camera systems.
In recent years, deep learning has been used for object tracking. As an object detection method using deep learning, Non-Patent Document 1 can be cited, for example. Non-Patent Document 1 discloses a technique for tracking an object by using the object detection results at time t and the object tracking results up to time t-1 to associate each detection result at time t with one of the tracking results up to time t-1.
In Non-Patent Document 1, because only the error in the correspondence between the tracking results up to time t-1 and the detection results at time t is backpropagated, errors involving detection results at times later than t cannot be handled. There is therefore a possibility that object tracking accuracy can be improved by also performing object tracking using detection results from times later than t.
The present disclosure has been made in view of the above problem, and aims to provide an object tracking method and an object tracking device capable of performing object tracking with higher accuracy than conventional methods.
An object tracking method according to one aspect of the present disclosure is a method for tracking the same object across a plurality of frames of video captured by a camera, and includes: an object detection step of obtaining detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each detection result, an object identification value that should take the same value for detections belonging to the same object, by neural computation using the detection results over the plurality of frames as input; and a detection result classification step of classifying the detection results over the plurality of frames based on the similarity of the object identification values.
According to the present disclosure, the object identification values for all detection results are calculated collectively by neural computation that takes the object feature detection results over a plurality of frames as input. Because the identification value for any given frame is thus computed using the detection results of later (future) frames as well, the accuracy of object tracking can be improved.
FIG. 1 is a block diagram showing a schematic configuration of an object tracking system 1 according to Embodiment 1. FIG. 2 is a diagram showing an example of an image 111 captured by a camera 15. FIGS. 3(a) and 3(b) are diagrams for explaining the feature detection result data 113. FIGS. 4(a) and 4(b) are diagrams for explaining the object identification value data 115. FIGS. 5(a) and 5(b) are diagrams for explaining the group classification processing of the group classifier 116. FIGS. 6(a) and 6(b) are diagrams for explaining the group classification result data 117. FIG. 7 is a block diagram showing the configuration of a DNN.
1. Embodiment 1
An object tracking system 1 according to Embodiment 1 will be described below.
1.1 Configuration
(1) Object tracking system 1
FIG. 1 is a block diagram showing the configuration of the object tracking system 1. As shown in the figure, the object tracking system 1 comprises a camera 15 and an object tracking device 10.
(2) Camera 15
The camera 15 includes an imaging element such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge-Coupled Device) image sensor, and outputs an image of a predetermined size by photoelectrically converting the light focused on the imaging element into an electric signal.
(3) Object tracking device 10
The object tracking device 10 includes a control unit 11 and an input/output interface 12 for connecting to the camera 15. The control unit 11 includes a CPU (Central Processing Unit) 11a, a main storage device 11b, an auxiliary storage device 11c, and the like. The computer programs and data stored in the auxiliary storage device 11c are loaded into the main storage device 11b, and the CPU 11a operates according to the loaded computer programs and data, thereby realizing each processing unit (the object detector 112, the object identification value calculator 114, and the group classifier 116). The auxiliary storage device 11c is composed of, for example, a hard disk and/or a non-volatile semiconductor memory.
The auxiliary storage device 13 stores the image 111 captured by the camera 15, the feature detection result data 113, the object identification value data 115, the group classification result data 117, and the like.
(4) Captured image 111
The captured image 111 is image data of a plurality of frames captured by the camera 15. FIG. 2 shows an example of the image data 201 of one frame of the captured image 111 that is input to the object detector 112.
(5) Object detector 112
The object detector 112 receives the captured image 111, performs object detection processing, and outputs feature detection result data 113.
The object detector 112 is a neural network that has been machine-learned to detect the features of the objects to be detected. An existing neural network can be used as the object detector 112. In this embodiment, the object detector 112 uses OpenPose (see Non-Patent Document 2). OpenPose is a neural network that detects the joint points of a human body (feature points such as the face, neck, and shoulders) from image data.
FIG. 3(a) is a diagram schematically showing the feature points of objects detected by the object detector 112. It shows the detection result when the image data 201, in which two people appear, is input. As shown in the figure, a predetermined number of feature points 301 are detected for each detected person.
(6) Feature detection result data 113
The object detector 112 outputs feature detection result data 113 for each of the plurality of feature points 301 in FIG. 3(a). FIG. 3(b) shows an example of the data structure of the feature detection result data 113 for one feature point 301. As shown in FIG. 3(b), the feature detection result data 113 includes a feature point ID, position information, time information, likelihood information, object category information, and feature point category information.
The feature point ID is an identifier assigned to uniquely identify each of the plurality of feature points detected by the object detector 112.
The position information indicates the X and Y coordinates of the detected feature point in the detection image.
The time information is the frame number of the detection image.
The likelihood information indicates the confidence (likelihood) with which the feature point was detected.
The object category information indicates the category (type) of the object to which the detected feature point belongs, and is, for example, a value identifying a human, dog, cat, car, or the like.
The feature point category information indicates the category (type) of the detected feature point, and is, for example, a value identifying a head joint point, a neck joint point, a shoulder joint point, or the like.
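As a rough illustration of the record just described, one feature detection result could be represented as below. This is a hedged sketch only; the field names and the integer encodings of the categories are assumptions made for illustration, not the patent's data format.

```python
# Hedged sketch of one feature detection result record (field names and
# category encodings are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class FeatureDetection:
    point_id: int       # feature point ID (unique over all frames)
    x: float            # position information: X coordinate in the image
    y: float            # position information: Y coordinate in the image
    frame: int          # time information: frame number
    likelihood: float   # confidence with which the point was detected
    obj_category: int   # e.g. 0 = human, 1 = dog, 2 = cat, 3 = car
    pt_category: int    # e.g. 0 = head joint, 1 = neck joint, 2 = shoulder joint
```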
(7) Object identification value calculator 114
The object identification value calculator 114 is a machine-learned neural network that receives as input a plurality of feature detection result data 113 detected from a plurality of frames and calculates an object identification value for each of the input feature detection result data 113.
Here, the object identification value is a value that should be the same for detections belonging to the same object. For example, when the feature detection result data 113 detected from captured images 111 in which the same person appears over a plurality of frames are input, the object identification value calculator 114 ideally calculates and outputs object identification values that are all identical for the feature points belonging to that person.
An existing neural network can be used as the object identification value calculator 114. In this embodiment, the object identification value calculator 114 uses PointNet (see Non-Patent Document 5). PointNet is a neural network that takes point cloud data as input and executes a specific task. The object identification value calculator 114 is preferably a permutation-equivariant neural network that takes point cloud data as input.
The object identification value calculator 114 of this embodiment is trained, using a loss function designed around contrastive learning, so that feature points belonging to the same person yield the same object identification value and feature points belonging to different persons yield different object identification values.
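As an illustration of a permutation-equivariant, point-cloud-based calculator, a minimal PointNet-style sketch is shown below. It is an assumption-laden simplification (the layer sizes, feature dimensions, and max-pooled global feature are not taken from the patent), not the actual network of this embodiment.

```python
# Hedged sketch of a PointNet-style, permutation-equivariant embedding network.
# A shared MLP encodes each detection (point); a max-pooled global feature is
# concatenated back so every point sees context from all frames at once.
import torch
import torch.nn as nn

class PointEmbeddingNet(nn.Module):
    def __init__(self, in_dim: int = 8, embed_dim: int = 4):
        super().__init__()
        # per-point encoder, applied identically to every point (shared weights)
        self.local = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                   nn.Linear(64, 128), nn.ReLU())
        # head mapping [local feature, global feature] -> object identification value
        self.head = nn.Sequential(nn.Linear(128 + 128, 64), nn.ReLU(),
                                  nn.Linear(64, embed_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (P, in_dim), one row per feature detection over all frames
        local = self.local(points)                            # (P, 128)
        global_feat = local.max(dim=0, keepdim=True).values   # (1, 128), order-invariant
        global_feat = global_feat.expand(local.size(0), -1)   # broadcast to each point
        return self.head(torch.cat([local, global_feat], dim=1))  # (P, embed_dim)

# Usage: ids = PointEmbeddingNet()(torch.randn(420, 8))  # 420 detections -> 420 values
```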
The designed loss function is as follows.
[Math 1: formula image JPOXMLDOC01-appb-M000001]
In the above loss function, L_pull is as follows.
[Math 2: formula image JPOXMLDOC01-appb-M000002]
In the above loss function, L_push is as follows.
[Math 3: formula image JPOXMLDOC01-appb-M000003]
Here, K_n,i is the output (the estimated object identification value) for the i-th feature point (i is an integer from 1 to I) belonging to the object whose object ID is n (n is an integer from 1 to N). μ_m is the average of the outputs for the feature points belonging to the object whose object ID is m (m is an integer from 1 to N). N is the number of detected objects, and I is the number of feature points of the detected object whose object ID is n. For example, when person A and person B are detected in each frame of the captured images and no other objects are detected, N = 2. If person A is detected in 30 frames and 14 feature points of person A are detected per frame, then the value of I corresponding to person A is I = 14 × 30 = 420.
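The three formulas above are published only as images and are not reproduced in this text. Purely as a hedged sketch, a typical pull/push contrastive loss consistent with the symbols defined here, with an assumed margin Δ and equal weighting, could take the following form; the patent's actual equations may differ in detail.

```latex
% Hedged sketch only; not the patent's confirmed formulas.
L = L_{\mathrm{pull}} + L_{\mathrm{push}},\qquad
L_{\mathrm{pull}} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{I_n}\sum_{i=1}^{I_n}
  \left\lVert K_{n,i}-\mu_{n}\right\rVert^{2},\qquad
L_{\mathrm{push}} = \frac{1}{N(N-1)}\sum_{n=1}^{N}\sum_{\substack{m=1\\ m\neq n}}^{N}
  \max\!\bigl(0,\ \Delta-\left\lVert \mu_{n}-\mu_{m}\right\rVert\bigr)^{2}
```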
As teacher data, the object ID of the object to which each feature detection result data 113 belongs is given.
In the above loss function, the smaller the difference between the object identification values output for a plurality of feature points belonging to the same person, the smaller the value of L_pull. With this loss function, therefore, feature points belonging to the same person are trained to take the same object identification value.
Also, in the above loss function, the larger the difference between the object identification values output for feature points belonging to different persons, the smaller the value of L_push. With this loss function, therefore, feature points belonging to different persons are trained to take different object identification values.
Furthermore, because this loss function computes the loss values for the feature detection result data 113 obtained from many frames collectively, the network is trained so that the object identification value for feature detection result data 113 detected in a particular frame is computed using the feature detection result data 113 detected in later (future) frames as well.
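A batch implementation of the pull/push loss sketched above could look like the following; it processes the identification values of every feature point from every frame in one call, which is what lets future frames influence the training signal. This is a hedged sketch under the same assumptions as the formula sketch above.

```python
# Hedged sketch: pull/push loss computed over all frames at once.
import torch

def pull_push_loss(embeddings: torch.Tensor, object_ids: torch.Tensor,
                   margin: float = 1.0) -> torch.Tensor:
    # embeddings: (P, D) identification values K_{n,i} for every feature point
    # object_ids: (P,) teacher object IDs for those points
    ids = torch.unique(object_ids)
    means = torch.stack([embeddings[object_ids == n].mean(dim=0) for n in ids])  # mu_n

    # pull: draw each point's value toward the mean of its own object
    l_pull = torch.stack([
        ((embeddings[object_ids == n] - means[k]) ** 2).sum(dim=1).mean()
        for k, n in enumerate(ids)]).mean()

    # push: keep the means of different objects at least `margin` apart
    if len(ids) < 2:
        return l_pull
    dists = torch.cdist(means, means)                    # (N, N) pairwise distances
    off_diag = ~torch.eye(len(ids), dtype=torch.bool)
    l_push = torch.clamp(margin - dists[off_diag], min=0).pow(2).mean()
    return l_pull + l_push
```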
(8) Object identification value data 115
The object identification value calculator 114 calculates object identification value data 115 for each feature point (feature detection result data 113) detected by the object detector 112.
FIG. 4(a) is a diagram schematically showing a plurality of detected feature points 401. FIG. 4(b) shows the data structure of the object identification value data 115 calculated for a detected feature point 401. As shown in FIG. 4(b), the object identification value data 115 includes a feature point ID and an object identification value. The feature point ID is an identifier assigned to uniquely identify a feature point detected by the object detector 112. The object identification value is a vector value that should be the same for detections belonging to the same object.
(9) Group classifier 116
The group classifier 116 receives the object identification value data 115 of each feature point detected by the object detector 112 and classifies the feature points into groups based on the similarity of their object identification values. FIG. 5(a) is a diagram schematically showing a plurality of feature points 501 detected by the object detector 112, and FIG. 5(b) is a diagram schematically showing the feature points 501 after group classification.
An existing clustering method can be used for group classification. For example, the difference between the object identification values of two feature points may be calculated and, if the difference is smaller than a predetermined threshold, the two feature points are classified into the same group; performing this for all combinations groups the plurality of feature points.
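A minimal sketch of this threshold-based grouping, using a union-find structure so that the pairwise merges yield consistent groups, is shown below; the distance metric and threshold value are assumptions.

```python
# Hedged sketch: pairwise threshold grouping of object identification values
# (union-find keeps the merges consistent across all pairs).
import numpy as np

def group_by_threshold(id_values: np.ndarray, threshold: float) -> list:
    # id_values: (P, D) identification values; returns one group label per point
    parent = list(range(len(id_values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(id_values)):
        for j in range(i + 1, len(id_values)):
            if np.linalg.norm(id_values[i] - id_values[j]) < threshold:
                parent[find(i)] = find(j)   # same object -> same group

    return [find(i) for i in range(len(id_values))]
```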
As another group classification method, the K-means algorithm may be used: the value of k (the number of clusters) is varied over an arbitrary range to generate multiple clusterings, the optimum number of clusters is determined by the elbow method (based on the within-cluster sum of squared errors), and the clustering generated with the resulting optimum k is used as the group classification result.
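A sketch of this K-means plus elbow-method variant follows; the range of k and the simple second-difference elbow criterion are assumptions, and scikit-learn's KMeans is used only for illustration.

```python
# Hedged sketch: K-means over a range of k, with a simple elbow criterion
# (largest slowdown in the decrease of the within-cluster sum of squared errors).
import numpy as np
from sklearn.cluster import KMeans

def group_by_kmeans_elbow(id_values: np.ndarray, k_range=range(1, 10)):
    fits = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(id_values)
            for k in k_range if k <= len(id_values)}
    inertias = {k: f.inertia_ for k, f in fits.items()}
    ks = sorted(inertias)
    best_k, best_bend = ks[0], float("-inf")
    for i in range(1, len(ks) - 1):
        bend = ((inertias[ks[i - 1]] - inertias[ks[i]])
                - (inertias[ks[i]] - inertias[ks[i + 1]]))
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return fits[best_k].labels_             # one group label per feature point
```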
(10) Group classification result data 117
The group classifier 116 calculates group classification result data 117 for each feature point (feature detection result data 113) detected by the object detector 112.
FIG. 6(a) is a diagram schematically showing a plurality of detected feature points 501. FIG. 6(b) shows the data structure of the group classification result data 117 calculated for the detected feature points 501. As shown in FIG. 6(b), the group classification result data 117 includes a feature point ID and a group classification result. The feature point ID is an identifier assigned to uniquely identify a feature point detected by the object detector 112. The group classification result is an identifier indicating the group into which the feature point is classified; one group corresponds to one object (the same object detected across a plurality of frames).
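Putting the three steps together, a hedged end-to-end sketch of the flow in this embodiment is shown below; `detect`, `compute_id_values`, and `classify_groups` are hypothetical callables standing in for the object detector 112, the object identification value calculator 114, and the group classifier 116.

```python
# Hedged sketch of the overall tracking flow (detector, calculator, and
# classifier are passed in as hypothetical callables).
def track_objects(frames, detect, compute_id_values, classify_groups):
    # 1) object detection step: feature detection results for every frame
    detections = [d for t, frame in enumerate(frames)
                  for d in detect(frame, frame_no=t)]
    # 2) object identification value calculation step: all frames at once,
    #    so detections from future frames also shape each value
    id_values = compute_id_values(detections)
    # 3) detection result classification step: similar values -> same object
    groups = classify_groups(id_values)
    tracks = {}                               # one group = one tracked object
    for det, group in zip(detections, groups):
        tracks.setdefault(group, []).append(det)
    return tracks
```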
1.2 DNN
As described above, the object detector 112 and the object identification value calculator 114 are machine-learned deep neural networks (DNNs). Any DNN may be used as the object detector 112 as long as it detects feature points from an input image and outputs point cloud data. The object identification value calculator 114 receives point cloud data as input, and any DNN may be used as long as it is permutation-equivariant.
A neural network 700 shown in FIG. 7 will be described as an example of such a DNN.
(1) Structure of the neural network 700
A neural network is an information processing system modeled on the human neural network. In the neural network 700, an engineered neuron model corresponding to a nerve cell is called a neuron U. The neural network 700 has a structure in which a large number of neurons U are connected, and is composed of a plurality of layers 701, each of which is a collection of neurons. A weight indicating the strength of the connection between neurons is set between the neurons of adjacent layers.
A multi-input, single-output element is used as the neuron U. Signals propagate in one direction, and each input value is multiplied by the corresponding weight before being input to the neuron U. This weight can be changed by learning. From the neuron U, the sum of the weighted input values is transformed by an activation function and then output to each neuron U of the next layer. As the activation function, for example, ReLU or a sigmoid function can be used.
The first layer is called the input layer, and data is input to it. For example, the pixel value of each pixel forming one image is input to each neuron U of the input layer; likewise, the position information, time information, likelihood information, object category information, and feature point category information included in the point cloud data are input to the neurons U of the input layer. The last layer is called the output layer and outputs the results.
As a learning method for the neural network 700, for example, backpropagation is used: an error (loss value) is calculated with a predetermined error function (loss function) from the value indicating the correct answer (teacher data) and the output of the neural network 700 for the training data, and the weights between neurons are then updated sequentially, for example by the steepest descent method, so as to minimize this error.
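As a minimal, self-contained illustration of the neuron computation and the backpropagation update just described (and not of the actual networks used in this embodiment), consider the following sketch.

```python
# Hedged sketch: weighted sum -> ReLU activation per neuron, and one
# backpropagation step that updates the weights by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # hidden layer -> output layer

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # each neuron: weighted sum, then activation
    return h, h @ W2 + b2              # output layer

def train_step(x, target, lr=0.01):
    global W1, b1, W2, b2
    h, y = forward(x)
    err = y - target                    # gradient of 0.5 * (y - target)^2
    dW2, db2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (h > 0)         # propagate the error back through ReLU
    dW1, db1 = x.T @ dh, dh.sum(axis=0)
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
```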
2. Supplement
Although the present invention has been described above based on the embodiment, the present invention is of course not limited to the above embodiment, and the following modifications are also included in the technical scope of the present invention.
(1) In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of an object, but another neural network may be used. For example, a neural network that detects the circumscribed rectangle of an object (e.g., YOLO; see Non-Patent Document 3) may be used, and the vertices and center position of the circumscribed rectangle may be detected as the feature detection result. Alternatively, a neural network that detects the contour of an object (e.g., Deep Snake; see Non-Patent Document 4) may be used, and the detected contour points and the center point of the object may be used as the feature detection result.
It is also not necessary to treat each of the plurality of feature points detected for one object as a separate feature detection result; object information consisting of the plurality of feature points detected for one object may be used as the feature detection result. That is, skeleton information consisting of the position information of the plurality of joint points output by OpenPose may be used as the feature detection result; in this case, an object identification value is calculated for the skeleton information and group classification is performed on it. Similarly, rectangle information consisting of the position and size of the circumscribed rectangle of the object output by YOLO, or contour information consisting of the position information of the plurality of contour points of the object output by Deep Snake, may be used as the feature detection result.
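As an illustration of this modification, a hedged sketch of how a box-style detection (as from YOLO) or a whole skeleton could be packed into detection records follows; the field names are assumptions, not the patent's data format.

```python
# Hedged sketch: packing alternative detector outputs into detection records
# (field names are assumptions).
def box_to_points(x1, y1, x2, y2, frame, obj_cat):
    # circumscribed rectangle -> its four vertices and center as feature points
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2), (cx, cy)]
    return [{"x": px, "y": py, "frame": frame, "obj_cat": obj_cat, "pt_cat": k}
            for k, (px, py) in enumerate(corners)]

def skeleton_as_one_detection(joints, frame, obj_cat):
    # alternative: treat the whole skeleton as a single detection result,
    # so one identification value is computed per skeleton
    return {"frame": frame, "obj_cat": obj_cat,
            "joints": [(j["x"], j["y"]) for j in joints]}
```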
(2) In the above embodiment, the feature detection result data 113 includes likelihood information, object category information, and feature point category information, but it may include other information as well. For example, information on the appearance of the object (e.g., color information) may be included.
(3) In the above embodiment, the object detector 112 may receive a single image of one frame as input and detect the features of the objects in that frame, or may receive a plurality of images consisting of a plurality of frames including that frame as input and detect the features of the objects in that frame.
(4) In Non-Patent Document 1, the features of an object are detected by a neural network using a mechanism called self-attention. In the above embodiment, the object detector 112 may detect object features using the self-attention mechanism described in Non-Patent Document 1.
(5) In the above embodiment, the object identification value calculator 114 performs the task of calculating an object identification value for each feature detection result data 113, but it may simultaneously perform other tasks in addition to calculating the object identification values.
For example, if the object detector 112 detects feature points without detecting the object category (type of object), the object identification value calculator 114 may also perform the task of detecting the object category in addition to the object identification value. In this case, the detected object category information is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the object category detection task can be executed simultaneously.
A task of recognizing the attributes or behavior of an object may also be executed based on the feature detection results from the object detector 112. The task of recognizing object attributes may be, for example, recognizing attributes such as a person's gender, age, or the type and color of the clothes worn. The task of recognizing the behavior of an object may be, for example, recognizing behaviors of an object (person) such as running, walking, or making a phone call. In these cases, information on the attributes or behavior of the detected objects is added to the teacher data, the error with respect to the output values is calculated, and the network is trained by backpropagation to minimize this error, so that the attribute and behavior detection tasks can be executed simultaneously.
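A hedged sketch of adding such extra task heads next to the identification-value head is given below; the shared trunk, head sizes, and loss weighting are assumptions rather than the patent's design.

```python
# Hedged sketch: extra task heads (category, attribute/action) beside the
# object identification value head, trained jointly by backpropagation.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, feat_dim: int = 128, embed_dim: int = 4,
                 n_categories: int = 5, n_actions: int = 3):
        super().__init__()
        self.id_head = nn.Linear(feat_dim, embed_dim)       # identification value
        self.cat_head = nn.Linear(feat_dim, n_categories)   # object category logits
        self.act_head = nn.Linear(feat_dim, n_actions)      # attribute/action logits

    def forward(self, feats: torch.Tensor):
        # feats: (P, feat_dim) shared per-point features
        return self.id_head(feats), self.cat_head(feats), self.act_head(feats)

# Training would add cross-entropy losses on the extra heads to the pull/push
# loss on the identification values and backpropagate the total error.
```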
The object identification value calculator 114 may also, based on the feature detection results from the object detector 112, execute a task of recognizing the situation in each frame of the captured images 111 or the situation across a plurality of frames, in addition to calculating the object identification values. The situation recognition task may be, for example, recognizing the danger of contact between objects, such as between a pedestrian and a car, or recognizing dangerous driving such as ignoring traffic lights. In these cases, information on the situation of the captured images is added to the teacher data, the error with respect to the output value is calculated, and the network is trained by backpropagation to minimize this error, so that the situation recognition task can be executed simultaneously.
(7) In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of objects; OpenPose classifies the detected joint points by object when outputting them. For example, when person A and person B appear in an image, the joint points of person A and those of person B are output in a distinguishable form. The group classifier 116 may use this classification result to classify the feature detection result data 113 by object. That is, since the feature points within each frame are already classified by object by the object detector 112, the group classifier 116 may first place the feature points within the same frame into the same group according to that classification, and then decide, based on the similarity of the object identification values, whether feature points from other frames belong to the same group.
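As one hedged way to exploit the per-frame grouping, the sketch below first trusts the detector's within-frame person indices and then links frame-level groups across frames by comparing mean identification values; the greedy frame-by-frame linking and the threshold are simplifying assumptions, not the embodiment's procedure.

```python
# Hedged sketch: within-frame groups come from the detector; cross-frame
# linking compares per-group mean identification values (greedy, simplified).
import numpy as np

def link_frame_groups(frame_groups, threshold: float = 0.5):
    # frame_groups: list over frames; each entry maps a within-frame person
    #               index to the (K, D) identification values of its joints
    tracks = []        # list of (mean identification value, track id)
    assignments = []
    for groups in frame_groups:
        frame_assign = {}
        for person_idx, vals in groups.items():
            mean = vals.mean(axis=0)
            dists = [np.linalg.norm(mean - t_mean) for t_mean, _ in tracks]
            if dists and min(dists) < threshold:
                frame_assign[person_idx] = tracks[int(np.argmin(dists))][1]
            else:
                tracks.append((mean, len(tracks)))          # start a new track
                frame_assign[person_idx] = len(tracks) - 1
        assignments.append(frame_assign)
    return assignments   # per frame: within-frame person index -> track id
```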
 3.  Others
 An object tracking method according to one aspect of the present disclosure is an object tracking method for tracking the same object across a plurality of frames of video captured by a camera, the method comprising: an object detection step of acquiring detection results of features of an object in each frame of the video; an object identification value calculation step of calculating, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and a detection result classification step of classifying the detection results over the plurality of frames based on similarity of the object identification values.
 In the above object tracking method, the detection result may include, in addition to time information, any of: information on positions of feature points of the skeleton of the object, information on positions of feature points of the circumscribed rectangle of the object, and information on positions of feature points of the outline of the object.
 In the above object tracking method, the detection result may include any of: likelihood information indicating how plausibly the feature points are detected, object category information indicating the type of the object, feature point category information indicating the type of each feature point, and object appearance information indicating appearance features of the object.
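 For illustration only, a detection result carrying the items listed above might be represented as a simple record such as the following; the field names are assumptions introduced here, not terms defined in this disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Detection:
    time: int                                      # frame time information
    keypoints: List[Tuple[float, float]]           # skeleton / circumscribed-rectangle / outline feature points
    likelihood: Optional[List[float]] = None       # per-feature-point detection likelihood
    object_category: Optional[int] = None          # type of object (e.g. person, vehicle)
    keypoint_category: Optional[List[int]] = None  # type of each feature point (e.g. head, elbow)
    appearance: Optional[List[float]] = None       # appearance feature of the object
```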
 In the above object tracking method, the object detection step may detect the features of the object in one frame by taking as input a single image of the one frame, or by taking as input a plurality of images consisting of a plurality of frames including the one frame.
 In the above object tracking method, the object detection step may calculate the detection results by an object detector that performs a neural operation taking the single image or the plurality of images as input.
 In the above object tracking method, the object detector may use a DNN (Deep Neural Network).
 In the above object tracking method, the object detector may use a neural network with a Self-Attention mechanism.
 In the above object tracking method, the object identification value calculation step may collectively calculate the object identification value for each of the detection results of the plurality of frames using a DNN (Deep Neural Network) that takes as input the detection results in a point cloud data format.
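 A minimal sketch of such a point-cloud-style network is shown below (assumed PyTorch; layer sizes are illustrative): a per-detection MLP with shared weights, combined with a symmetric pooling over the whole set, produces one identification vector per detection in a single pass and anticipates the permutation property stated next.

```python
import torch
import torch.nn as nn

class IdValueNet(nn.Module):
    """Maps a set of detection vectors (all frames at once, order-free) to one
    identification vector per detection."""
    def __init__(self, in_dim=8, hidden=128, id_dim=32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        self.out_mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, id_dim))

    def forward(self, points):                  # points: (N, in_dim)
        h = self.point_mlp(points)              # shared weights -> per-detection features
        g = h.max(dim=0, keepdim=True).values   # symmetric (max) pooling: global context
        g = g.expand_as(h)
        return self.out_mlp(torch.cat([h, g], dim=-1))  # (N, id_dim), one value per detection
```

 Max pooling is used here only because it is a symmetric function; any symmetric aggregation (sum, mean) preserves the same property.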
 In the above object tracking method, the DNN may be permutation-equivariant.
 In the above object tracking method, the DNN may be trained by contrastive learning so that two detection results have the same value when they belong to the same object and have different values when they belong to different objects.
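 A standard pairwise contrastive loss of the kind described here could be written as follows; the margin value and the Euclidean distance are assumptions, not parameters specified in this disclosure.

```python
import torch

def contrastive_loss(id_values, same_object, margin=1.0):
    """id_values: (N, D) identification vectors; same_object: (N, N) bool,
    True where two detections belong to the same object."""
    d = torch.cdist(id_values, id_values)               # pairwise distances
    pos = same_object.float() * d.pow(2)                # pull same-object pairs together
    neg = (~same_object).float() * torch.clamp(margin - d, min=0).pow(2)  # push others apart
    n = id_values.shape[0]
    off_diag = 1.0 - torch.eye(n, device=id_values.device)
    return ((pos + neg) * off_diag).sum() / off_diag.sum()
```

 Detections of the same object are driven toward identical identification values, while pairs from different objects are pushed at least `margin` apart.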
 In the above object tracking method, the DNN may be trained to recognize, in addition to the object identification value, at least one of the type, the attribute, and the action of the object for each of the detection results.
 In the above object tracking method, the DNN may be trained to recognize, in addition to the object identification value, the situation in each frame of the video or the situation of the video.
 In the above object tracking method, the object identification value may be a vector value, and the detection result classification step may associate object identification values whose distances from each other are small with one another as detection results of the same object.
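 As one possible realization of this distance-based association (not the one prescribed by the disclosure), the identification vectors can be grouped by agglomerative clustering with a distance threshold, assuming SciPy is available; the threshold is an assumed parameter.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def classify_by_id_value(id_values, threshold=0.5):
    """id_values: (N, D) array of identification vectors; returns one integer
    group label per detection, grouping detections whose vectors are close."""
    z = linkage(np.asarray(id_values, dtype=float), method="average", metric="euclidean")
    return fcluster(z, t=threshold, criterion="distance")
```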
 In the above object tracking method, the object detection step may classify the detection results by object in each frame of the video, and the detection result classification step may classify the detection results over the plurality of frames by object using the classification result of the object detection step.
 An object tracking device according to one aspect of the present disclosure is an object tracking device for tracking the same object across a plurality of frames of video captured by a camera, the device comprising: an object detector that acquires detection results of features of an object in each frame of the video; an object identification value calculator that calculates, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and a detection result classifier that classifies the detection results over the plurality of frames based on similarity of the object identification values.
 According to the present disclosure, since the object identification values for the respective detection results are calculated collectively by a neural operation that takes as input the object feature detection results over a plurality of frames, the object identification value of any given frame is computed using the detection results of frames later than that frame, and the accuracy of object tracking can therefore be improved.
 The present disclosure is useful as an object tracking device installed in a surveillance camera system or the like.
  1 Object tracking system
 10 Object tracking device
112 Object detector
114 Object identification value calculator
116 Group classifier
 15 Camera

Claims (15)

  1.  An object tracking method for tracking the same object across a plurality of frames of video captured by a camera, the method comprising:
     an object detection step of acquiring detection results of features of an object in each frame of the video;
     an object identification value calculation step of calculating, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and
     a detection result classification step of classifying the detection results over the plurality of frames based on similarity of the object identification values.
  2.  The object tracking method according to claim 1, wherein the detection result includes, in addition to time information, any of: information on positions of feature points of a skeleton of the object, information on positions of feature points of a circumscribed rectangle of the object, and information on positions of a plurality of feature points of an outline of the object.
  3.  The object tracking method according to claim 2, wherein the detection result includes any of: likelihood information indicating how plausibly the feature points are detected, object category information indicating a type of the object, feature point category information indicating a type of each feature point, and object appearance information indicating appearance features of the object.
  4.  The object tracking method according to claim 1, wherein the object detection step detects the features of the object in one frame by taking as input a single image of the one frame, or by taking as input a plurality of images consisting of a plurality of frames including the one frame.
  5.  The object tracking method according to claim 4, wherein the object detection step calculates the detection results by an object detector that performs a neural operation taking the single image or the plurality of images as input.
  6.  The object tracking method according to claim 5, wherein the object detector uses a DNN (Deep Neural Network).
  7.  The object tracking method according to claim 5, wherein the object detector uses a neural network with a Self-Attention mechanism.
  8.  The object tracking method according to claim 1, wherein the object identification value calculation step collectively calculates the object identification value for each of the detection results of the plurality of frames using a DNN (Deep Neural Network) that takes as input the detection results in a point cloud data format.
  9.  The object tracking method according to claim 8, wherein the DNN is permutation-equivariant.
  10.  The object tracking method according to claim 8, wherein the DNN is trained by contrastive learning so that two detection results have the same value when they belong to the same object and have different values when they belong to different objects.
  11.  The object tracking method according to claim 8, wherein the DNN is trained to recognize, in addition to the object identification value, at least one of a type of the object, an attribute of the object, and an action of the object for each of the detection results.
  12.  The object tracking method according to claim 8, wherein the DNN is trained to recognize, in addition to the object identification value, a situation in each frame of the video or a situation of the video.
  13.  The object tracking method according to claim 1, wherein the object identification value is a vector value, and
     the detection result classification step associates object identification values whose distances from each other are small with one another as detection results of the same object.
  14.  The object tracking method according to claim 1, wherein the object detection step classifies the detection results by object in each frame of the video, and
     the detection result classification step classifies the detection results over the plurality of frames by object using a classification result of the object detection step.
  15.  An object tracking device for tracking the same object across a plurality of frames of video captured by a camera, the device comprising:
     an object detector that acquires detection results of features of an object in each frame of the video;
     an object identification value calculator that calculates, for each of the detection results, an object identification value that should take the same value for detection results belonging to the same object, by a neural operation that takes as input the detection results over the plurality of frames; and
     a detection result classifier that classifies the detection results over the plurality of frames based on similarity of the object identification values.
PCT/JP2022/042682 2021-12-20 2022-11-17 Object tracking method and object tracking device WO2023119969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021206542 2021-12-20
JP2021-206542 2021-12-20

Publications (1)

Publication Number Publication Date
WO2023119969A1 true WO2023119969A1 (en) 2023-06-29

Family

ID=86902188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/042682 WO2023119969A1 (en) 2021-12-20 2022-11-17 Object tracking method and object tracking device

Country Status (1)

Country Link
WO (1) WO2023119969A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002077906A (en) * 2000-07-06 2002-03-15 Mitsubishi Electric Research Laboratories Inc Method and system for extracting high-level features from low-level features of multimedia content
JP2009211123A (en) * 2008-02-29 2009-09-17 Institute Of Physical & Chemical Research Classification device and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002077906A (en) * 2000-07-06 2002-03-15 Mitsubishi Electric Research Laboratories Inc Method and system for extracting high-level features from low-level features of multimedia content
JP2009211123A (en) * 2008-02-29 2009-09-17 Institute Of Physical & Chemical Research Classification device and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHARLES R. QI; HAO SU; KAICHUN MO; LEONIDAS J. GUIBAS: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 December 2016 (2016-12-02), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080736277, DOI: 10.1109/CVPR.2017.16 *
WEI-CHIH HUNG; HENRIK KRETZSCHMAR; TSUNG-YI LIN; YUNING CHAI; RUICHI YU; MING-HSUAN YANG; DRAGO ANGUELOV: "SoDA: Multi-Object Tracking with Soft Data Association", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 January 1900 (1900-01-01), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081743210 *

Similar Documents

Publication Publication Date Title
Zhang et al. Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things
Ijjina et al. Human action recognition in RGB-D videos using motion sequence information and deep learning
Pervaiz et al. Hybrid algorithm for multi people counting and tracking for smart surveillance
CN106845487B (en) End-to-end license plate identification method
Shami et al. People counting in dense crowd images using sparse head detections
Basly et al. CNN-SVM learning approach based human activity recognition
US20170351905A1 (en) Learning model for salient facial region detection
Shahzad et al. A smart surveillance system for pedestrian tracking and counting using template matching
CN109583315B (en) Multichannel rapid human body posture recognition method for intelligent video monitoring
DE112019005671T5 (en) DETERMINING ASSOCIATIONS BETWEEN OBJECTS AND PERSONS USING MACHINE LEARNING MODELS
Serpush et al. Complex human action recognition using a hierarchical feature reduction and deep learning-based method
Potdar et al. A convolutional neural network based live object recognition system as blind aid
CN111523559B (en) Abnormal behavior detection method based on multi-feature fusion
CN111611874A (en) Face mask wearing detection method based on ResNet and Canny
WO2022156317A1 (en) Video frame processing method and apparatus, electronic device, and storage medium
Waheed et al. A novel deep learning model for understanding two-person interactions using depth sensors
Acharya et al. Real-time detection and tracking of pedestrians in CCTV images using a deep convolutional neural network
Aldahoul et al. A comparison between various human detectors and CNN-based feature extractors for human activity recognition via aerial captured video sequences
Wang Three-dimensional convolutional restricted Boltzmann machine for human behavior recognition from RGB-D video
Rashwan et al. Action representation and recognition through temporal co-occurrence of flow fields and convolutional neural networks
Echoukairi et al. Improved Methods for Automatic Facial Expression Recognition.
Kale et al. Suspicious activity detection using transfer learning based resnet tracking from surveillance videos
Nigam et al. Multiview human activity recognition using uniform rotation invariant local binary patterns
Srividya et al. Deep learning techniques for physical abuse detection
Valappil et al. Vehicle detection in UAV videos using CNN-SVM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22910688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE