KR20180108123A

KR20180108123A - Method for tracking multi object

Info

Publication number: KR20180108123A
Application number: KR1020170037477A
Authority: KR
Inventors: 강행봉; 오상일
Original assignee: 가톨릭대학교 산학협력단
Priority date: 2017-03-24
Filing date: 2017-03-24
Publication date: 2018-10-04
Also published as: KR101916573B1

Abstract

The present invention relates to a method for tracking multiple objects, the method comprising the steps of: (a) generating a 2D image data set composed of a target object of a previous 2D image frame and a target candidate object of a current 2D image frame, and a depth data set composed of a target object of a previous depth frame and a target candidate object of a current depth frame; (b) applying the 2D image data set and the depth data set to a first matching network and a second matching network that are independent of each other, outputting a 2D image tracking result for the 2D image data set from the first matching network, and outputting a depth tracking result for the depth data set from the second matching network; and (c) fusing the 2D image tracking result and the depth tracking result by using a basic belief assignment to determine a final tracking result. Accordingly, when multiple target objects are tracked using 2D image frames and depth frames acquired from multiple sensors, tracking failures caused by data errors are overcome, and more accurate and stable tracking is possible.

Description

{METHOD FOR TRACKING MULTI OBJECT}

The present invention relates to a multi-object tracking method and, more particularly, to a multi-object tracking method that overcomes tracking failures caused by data errors and enables more accurate and stable tracking when multiple target objects are tracked using 2D image frames and depth frames acquired from multiple sensors.

Object tracking is an important task used in fields such as security, sports analysis, human-computer interaction, and autonomous driving systems. Accordingly, various types of trackers have been developed, including multi-object trackers, multi-sensor trackers, and model-free trackers.

The main goal of multi-object tracking is to track the states of target objects across the frames of a given video sequence. However, although various multi-object trackers have been proposed, improvements in tracking performance remain limited by disturbances such as illumination changes, changes in object scale, and occlusion.

One way to handle such disturbances is to model them a priori in the tracker. For example, Lucas, B.D. and Kanade, T., "An iterative image registration technique with an application to stereo vision" (IJCAI, 1981, Vol. 81, pp. 674-679) proposes an affine transformation; Nguyen, H.T. and Smeulders, A.W., "Robust tracking using foreground-background texture discrimination" (International Journal of Computer Vision, 2006, 69, 277-293) proposes illumination handling; and Pan, J. and Hu, B., "Robust occlusion handling in object tracking" (2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1-8) proposes a technique related to occlusion detection. However, a tracker in which a specific disturbance is modeled is robust to that disturbance but has limited ability to cope with other disturbances.

Another approach is to update the appearance model adaptively while the tracker is running. Even with adaptive updates, however, if a transient condition is absorbed into the newly updated appearance model, the tracker can lose a target whose appearance changes drastically.

Moreover, when a tracker operates only on RGB frames, the so-called bounding box shifting problem can easily occur, in which the bounding box of the target object being tracked is attached in the next frame to a different object with a similar shape or color.

Multi-sensor fusion has been proposed to compensate for tracking failures caused by various disturbances in each modality. A tracking failure on RGB frames can be compensated using depth information from a 3D point cloud or a stereo vision sensor. However, existing multi-sensor trackers are modeled on the assumption that all sensors operate normally, so they cannot handle noise arising in one or more sensors.

The present invention has been devised to solve the above problems, and its object is to provide a multi-object tracking method that overcomes tracking failures caused by data errors and enables more accurate and stable tracking when multiple target objects are tracked using 2D image frames and depth frames acquired from multiple sensors.

According to the present invention, the above object is achieved by a multi-object tracking method comprising the steps of: (a) generating a 2D image data set composed of a target object of a previous 2D image frame and a target candidate object of a current 2D image frame, and a depth data set composed of a target object of a previous depth frame and a target candidate object of a current depth frame; (b) applying the 2D image data set and the depth data set respectively to a first matching network and a second matching network that are independent of each other, outputting a 2D image tracking result for the 2D image data set from the first matching network, and outputting a depth tracking result for the depth data set from the second matching network; and (c) fusing the 2D image tracking result and the depth tracking result using a basic belief assignment to determine a final tracking result.

Here, in step (a), an instance comprising the target object and the target candidate object of the 2D image frame may be represented using a pre-trained convolutional neural network, and step (a) may include: (a1) extracting the outputs of the convolution layers for the 2D image frame from the pre-trained convolutional neural network to generate an output feature map of the 2D image frame; (a2) pooling the representation of each instance from the output feature map according to the scale of that instance using ROI pooling; and (a3) normalizing the representation of the instance.

In addition, in step (a1), max pooling may be applied for subsampling to instances whose scale is larger than the input of the first matching network, and a deconvolution operation may be applied for upsampling to instances whose scale is smaller than the input of the first matching network.

In addition, in step (a), supervision transfer may be applied to represent the instance comprising the target object and the target candidate object of the depth frame.

In step (b), a convolutional neural network composed of two sub-networks that share weights and a softmax layer that joins the two sub-networks and determines whether there is a match may be applied to each of the first matching network and the second matching network, and the target object and the target candidate object may be input separately to the respective sub-networks of the convolutional neural network.

The method may further include, after steps (a) to (c) have been performed for a preset number of the 2D image frames and the depth frames, fine-tuning the first matching network and the second matching network based on the matching scores of the final tracking results so that the target appearance models of the first matching network and the second matching network are updated.

With the above configuration, the present invention provides a multi-object tracking method that overcomes tracking failures caused by data errors and enables more accurate and stable tracking when multiple target objects are tracked using 2D image frames and depth frames acquired from multiple sensors.

FIG. 1 is a diagram illustrating the configuration of a multi-object tracking system according to the present invention.
FIG. 2 is a diagram for explaining a multi-object tracking method according to the present invention.
FIG. 3 is a diagram for explaining a method of representing an instance in the multi-object tracking method according to the present invention.
FIG. 4 is a diagram for explaining the structure of a convolutional neural network applied to the multi-object tracking method according to the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the configuration of a multi-object tracking system according to the present invention. Referring to FIG. 1, the multi-object tracking system according to the present invention includes a 2D image frame acquisition unit 11, a depth frame acquisition unit 12, a data set generation unit 20, a first matching network 30, a second matching network 40, and a tracking result fusion unit 50.

The 2D image frame acquisition unit 11 acquires 2D image frames, for example through a 2D image sensor such as a 2D camera or a laser scanner. In the following, the 2D image frame is described as an RGB frame by way of example, but the technical idea of the present invention is of course not limited thereto.

The depth frame acquisition unit 12 acquires depth frames. For example, a depth frame may be obtained from a three-dimensional image captured by a binocular camera using two image sensors, and the depth frame contains depth information.

The data set generation unit 20 receives RGB frames and depth frames from the 2D image frame acquisition unit 11 and the depth frame acquisition unit 12, respectively, and generates a 2D image data set corresponding to the RGB frames, that is, an RGB data set, and a depth data set corresponding to the depth frames.

Here, the RGB data set consists of the target object of the previous RGB frame (k) and the target candidate object of the current RGB frame (k+1); likewise, the depth data set consists of the target object of the previous depth frame (k) and the target candidate object of the current depth frame (k+1).
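
By way of illustration only, the sketch below shows one way such a data set might be assembled, pairing every target object of frame k with every target candidate object of frame k+1. The exhaustive pairing and the names `build_pairs`, `prev_targets`, and `curr_candidates` are assumptions made for this example; the description does not state how the pairs are enumerated.

```python
from itertools import product


def build_pairs(prev_targets, curr_candidates):
    """Pair each target of frame k with each candidate of frame k+1."""
    return list(product(prev_targets, curr_candidates))


# Hypothetical identifiers; real entries would be instance representations.
rgb_dataset = build_pairs(["t0", "t1"], ["c0", "c1", "c2"])
```

The same helper could be applied unchanged to depth-frame instances to obtain the depth data set.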

The first matching network 30 and the second matching network 40 operate independently of each other. The first matching network 30 receives the RGB data set, determines whether the target object and the target candidate object match, and outputs a 2D image tracking result, that is, an RGB tracking result. The second matching network 40 receives the depth data set, determines whether the target object and the target candidate object match, and outputs a depth tracking result.

The tracking result fusion unit 50 fuses the RGB tracking result and the depth tracking result to determine the final tracking result. In the present invention, the two tracking results are fused using a basic belief assignment by way of example.

As described above, the tracking result based on the RGB frames and the tracking result based on the depth frames are derived independently, and the independently derived results are fused in the final decision process. This minimizes the influence of an error occurring in any one sensor and enables more accurate and stable tracking of the target objects.

Hereinafter, the multi-object tracking method applied to the multi-object tracking system according to the present invention will be described in detail with reference to FIG. 2.

As shown in FIG. 2, in the multi-object tracking method according to the present invention, representations composed of the target object in frame k and the target candidate object in frame k+1 are generated as the RGB data set and the depth data set that are input to the first matching network 30 and the second matching network 40.

Here, to represent an instance in the RGB frame and the depth frame, that is, a target object or a target candidate object, the outputs of the convolutional layers are used adaptively as the representation of the instance according to the scale of the instance.

More specifically, the scale of a target object changes considerably depending on factors such as pose changes and motion. If a low-resolution instance, for example a small-scale instance, passes through the successive layers of a matching network, its fine details can gradually vanish through operations such as convolution and pooling. Therefore, in the present invention, instances are classified and represented according to their scale.

FIG. 3 is a diagram for explaining a method of representing an instance in the multi-object tracking method according to the present invention. Referring to FIG. 3, in the process of generating the RGB data set, instances are represented using a pre-trained convolutional neural network (hereinafter "CNN"). Because feeding each instance in its entirety into the CNN incurs a high computational cost, the present invention applies ROI pooling to the representation of instances.

First, all outputs of the convolution layers for the RGB frame are extracted from the pre-trained CNN. The representation of each instance is then pooled, using ROI pooling, from the output feature map of the RGB frame according to the scale level of the instance. Because the output size of each convolution layer differs, the representations are sampled individually by applying different sampling layers as matching functions. That is, max pooling is applied for subsampling to instances whose scale is larger than the input of the first matching network 30, and a deconvolution operation is applied for upsampling to instances whose scale is smaller than the input of the first matching network 30. The representation of the instance is then normalized; in the present invention, local response normalization (LRN) is applied by way of example.
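
The scale-dependent resampling and normalization described above can be sketched, purely as an illustration, as follows. Nearest-neighbour upsampling is used here as a fixed stand-in for the learned deconvolution, feature maps are assumed to be square with sizes that divide evenly, and the LRN form and its hyperparameters (`n`, `k`, `alpha`, `beta`) are commonly used values assumed for the example rather than values taken from the description.

```python
def max_pool(fmap, factor):
    """Non-overlapping factor x factor max pooling of a 2D feature map."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            for x in range(0, cols - factor + 1, factor)
        ]
        for y in range(0, rows - factor + 1, factor)
    ]


def upsample(fmap, factor):
    """Nearest-neighbour upsampling; a fixed stand-in for a learned deconvolution."""
    return [[v for v in row for _ in range(factor)] for row in fmap for _ in range(factor)]


def resample(fmap, target):
    """Bring a square ROI feature map to target x target."""
    size = len(fmap)
    if size > target:
        return max_pool(fmap, size // target)   # larger instance: subsample
    if size < target:
        return upsample(fmap, target // size)   # smaller instance: upsample
    return [row[:] for row in fmap]


def local_response_norm(channels, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Divide each channel activation by a power of the energy of its n neighbours."""
    half = n // 2
    out = []
    for i, a in enumerate(channels):
        window = channels[max(0, i - half): i + half + 1]
        out.append(a / (k + alpha * sum(v * v for v in window)) ** beta)
    return out


large = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
small = [[1, 2], [3, 4]]
pooled = resample(large, 2)      # 4x4 -> 2x2 by max pooling
upsampled = resample(small, 4)   # 2x2 -> 4x4 by upsampling
normalized = local_response_norm([1.0, 2.0, 3.0])
```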

Meanwhile, in the multi-object tracking method according to the present invention, instances in the depth frame are represented by applying supervision transfer, by way of example.

First, the RGB frame and the depth frame are each given a layered representation, where i is the number of layers. Supervision transfer trains the weight parameters sufficiently to represent unannotated depth images using a fixed CNN structure. Supervision transfer measures the similarity between the representations of the RGB frame and the depth frame using a loss function (in the present invention, the L2 distance is used as the loss function by way of example). The similarity is measured as in [Equation 1].

[Equation 1]

In [Equation 1], t() is a transformation function for embedding one representation into the same dimension as the other, and the weight term denotes the learned weight parameters. In the present invention, if the depth image is obtained from a 3D point cloud, an up-scaling method may be used as the transformation function.
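
As a hedged illustration of the similarity measure, the sketch below computes an L2 distance between per-layer RGB features and depth features after an up-scaling transform. The direction of the embedding (depth into the RGB dimension), the summation over layers, and the names `upscale`, `l2_distance`, and `transfer_loss` are assumptions for this example; the exact form of [Equation 1] is not reproduced here.

```python
import math


def upscale(features, factor):
    """Up-scaling transform t(): repeat each value so the dimensions agree."""
    return [v for v in features for _ in range(factor)]


def l2_distance(a, b):
    """Euclidean distance between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_loss(rgb_layers, depth_layers, factor):
    """Sum of per-layer L2 distances between RGB and transformed depth features."""
    return sum(l2_distance(r, upscale(d, factor)) for r, d in zip(rgb_layers, depth_layers))


# Hypothetical one-layer features, for illustration only.
loss_same = transfer_loss([[1.0, 1.0, 2.0, 2.0]], [[1.0, 2.0]], 2)
loss_diff = transfer_loss([[0.0, 0.0, 0.0, 0.0]], [[3.0, 4.0]], 2)
```

In training, the depth-network weights would be adjusted to reduce this loss while the RGB network stays fixed.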

Once the instances that make up the RGB data set and the depth data set, that is, the target objects and the target candidate objects, have been represented in this way, the RGB data set and the depth data set are input to the first matching network 30 and the second matching network 40, respectively. In the present invention, a convolutional neural network composed of two sub-networks that share weights is applied to each of the first matching network 30 and the second matching network 40, by way of example.

FIG. 4 is a diagram for explaining the structure of the convolutional neural network applied to the multi-object tracking method according to the present invention. The structure shown in FIG. 4 is applied to each of the first matching network 30 and the second matching network 40. The description below takes the convolutional neural network applied to the first matching network 30 as an example, and a separate description for the second matching network 40 is omitted.

The target object and the target candidate object that make up the RGB data set are input separately to the two sub-networks of the convolutional neural network. In the present invention, each sub-network of the convolutional neural network consists of three convolution layers and two fully-connected layers, by way of example.

In a conventional convolutional neural network structure, when max pooling is applied to the input, only the strong values in a local neighborhood are activated and passed on to the next layer; that is, the spatial resolution of the activations is considerably reduced. Max pooling has the advantage of robustness to local deformation, but it is important to preserve small changes in an object's appearance over time. Therefore, in the present invention, max pooling layers are excluded from each sub-network.

The outputs of the last fully-connected layers of the two sub-networks are concatenated into a single vector and passed to a two-way softmax layer. The softmax layer determines whether the target object and the target candidate object match. In the present invention, the classes 1 and 0, denoting matching (positive) and non-matching (negative) respectively, form the output of the first matching network 30, that is, the RGB tracking result for the RGB frame (or the depth tracking result in the case of the depth frame).
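
The weight-sharing arrangement can be illustrated with a deliberately small sketch. Each branch is reduced here to a single fully-connected layer with ReLU instead of the three convolution layers and two fully-connected layers of the description, and all parameter values and names (`shared`, `head`, `match_probabilities`) are hypothetical.

```python
import math


def linear(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(target, candidate, shared, head):
    """Return [P(non-matching), P(matching)] for one target/candidate pair."""
    # Both branches use the same parameters, which is what weight sharing means.
    f_target = relu(linear(target, *shared))
    f_candidate = relu(linear(candidate, *shared))
    # Concatenate the two branch outputs into one vector for the two-way softmax.
    return softmax(linear(f_target + f_candidate, *head))


# Hypothetical hand-picked parameters, for illustration only.
shared = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
head = ([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], [0.0, 0.0])
probs = match_probabilities([1.0, 2.0], [1.0, 2.0], shared, head)
label = probs.index(max(probs))  # 1 = matching, 0 = non-matching
```

Because the branch parameters are shared, swapping the roles of the two branches changes only the order of the concatenated features, not how each instance is embedded.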

Once the RGB tracking result for the RGB frame and the depth tracking result for the depth frame have been generated independently as described above, the two results are fused to obtain a more accurate tracking result. In the present invention, the RGB tracking result and the depth tracking result are fused using a basic belief assignment (hereinafter "BBA") to determine the final tracking result, by way of example.

Using the BBA, a discounting factor is evaluated for the RGB tracking result and for the depth tracking result. The discounting factor α represents the reliability of each tracking result.

Given the set of possible outcomes, and assuming that 0, 1, and Ω denote non-matching, matching, and uncertainty respectively, the BBA for a tracking result can be defined as in [Equation 2].

[Equation 2]

In [Equation 2], m(A) is the basic belief mass (BBM), which represents the part of belief committed to the subset A. The conjunctive rule is used to combine the BBA of the RGB frame with the BBA of the depth frame. Letting m_D and m_R be the BBMs for the tracking results of the depth frame and the RGB frame respectively, their conjunctive combination is defined as in [Equation 3].

[Equation 3]
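
For illustration, the sketch below implements the unnormalized conjunctive rule as it is commonly stated in the transferable belief model, where the combined mass of a set A is the sum of m_D(B)·m_R(C) over all B and C whose intersection is A. Whether [Equation 3] includes a normalization step is not recoverable from the text, so the unnormalized form and the example masses are assumptions.

```python
from itertools import product

NON_MATCH = frozenset({0})
MATCH = frozenset({1})
OMEGA = frozenset({0, 1})      # uncertainty: either outcome
CONFLICT = frozenset()         # empty set: mass assigned to contradictory evidence


def conjunctive(m_a, m_b):
    """Combine two BBAs: products of masses go to the intersection of their focal sets."""
    combined = {}
    for (set_a, mass_a), (set_b, mass_b) in product(m_a.items(), m_b.items()):
        focal = set_a & set_b
        combined[focal] = combined.get(focal, 0.0) + mass_a * mass_b
    return combined


# Hypothetical masses for one target/candidate pair.
m_R = {MATCH: 0.7, NON_MATCH: 0.1, OMEGA: 0.2}
m_D = {MATCH: 0.6, NON_MATCH: 0.2, OMEGA: 0.2}
m_RD = conjunctive(m_R, m_D)
```

In this example the mass on the empty set (0.20) measures how strongly the two sensors disagree.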

The fusion in the multi-object tracking method according to the present invention assigns weights according to the tracking results. To this end, a discounting factor is added to each BBM. The discounting factors for the RGB tracking result and the depth tracking result can be defined as in [Equation 4] and [Equation 5].

[Equation 4]

[Equation 5]

In the present invention, the discounting factor α is set, by way of example, using the normalized belief function disclosed in Smets, P., "The combination of evidence in the transferable belief model" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990, 12, 447-458).

Before the RGB tracking result and the depth tracking result are merged, each tracking result is discounted by its discounting factor. The combined BBM is computed from the discounted BBM of the RGB frame and the discounted BBM of the depth frame through [Equation 6].

[Equation 6]

In [Equation 6], α is computed through the linear and quadratic functions of the minimization program in the above paper by Smets, P. In the present invention, α_R and α_D are set to 0.22 and 0.31, respectively, by way of example.
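
A minimal sketch of the discounting step follows, assuming the classical discounting operation in which a fraction α of every mass is moved onto Ω. Treating α as a discount rate in this way, and building the initial BBA directly from a softmax matching probability (`bba_from_probability`), are assumptions for the example; only the values 0.22 and 0.31 come from the description.

```python
MATCH = frozenset({1})
NON_MATCH = frozenset({0})
OMEGA = frozenset({0, 1})

ALPHA_R = 0.22  # discounting factor for the RGB tracking result
ALPHA_D = 0.31  # discounting factor for the depth tracking result


def bba_from_probability(p_match):
    """Hypothetical BBA built from a softmax matching probability."""
    return {MATCH: p_match, NON_MATCH: 1.0 - p_match, OMEGA: 0.0}


def discount(bba, alpha):
    """Scale every mass by (1 - alpha) and move the remainder onto OMEGA."""
    out = {focal: (1.0 - alpha) * mass for focal, mass in bba.items() if focal != OMEGA}
    out[OMEGA] = alpha + (1.0 - alpha) * bba.get(OMEGA, 0.0)
    return out


m_rgb = discount(bba_from_probability(0.9), ALPHA_R)
m_depth = discount(bba_from_probability(0.4), ALPHA_D)
```

The discounted BBAs remain valid (their masses still sum to one) and can then be combined with the conjunctive rule.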

Referring again to FIGS. 1 and 2, the multi-object tracking method according to the present invention may further include a process of fine-tuning the first matching network 30 and the second matching network 40.

To make the first matching network 30 and the second matching network 40 robust, the networks, which are initialized on external video sequences, are fine-tuned within a structured model after tracking has been performed on at least a certain number of frames. Here, the tracking results are accumulated in a tracking result accumulation unit 60, and a fine-tuning unit 70 performs the fine-tuning process using the accumulated results.

The fine-tuned first matching network 30 and second matching network 40 can become robust to transient changes in target appearance while preserving model consistency. Here, previously fine-tuned matching networks are retained as a path of the target appearance model.

The structured target appearance model according to the present invention is organized into a hierarchical structure by adaptively fine-tuning the first matching network 30 and the second matching network 40.

Let each fine-tuned matching network (the first matching network 30 or the second matching network 40; the same applies hereinafter) be a node, and consider the vertices associated with the fine-tuned matching networks together with the directed edges that represent the path relationships between vertices. Two vertices (one edge) can then be defined as in [Equation 7].

[Equation 7]

In [Equation 7], the relationship score is defined between two vertices; the frame set is the set of consecutive frames on which tracking has been performed using the matching networks up to a given vertex; and the matching score is the score obtained when a candidate is judged, using a given vertex, to match a previous target object.

Given the set of target objects in one frame and the set of target candidate objects in the following frame, the present invention finds the set C of the most similar pairs between the two sets. This set satisfies [Equation 8].

[Equation 8]

In [Equation 8], the score term is the weighted average matching score obtained from the activated fine-tuning path of the n-th target object. Given the activated fine-tuning path of the n-th target appearance model, the weighted average matching score can be measured as in [Equation 9].

[Equation 9]

In Equation 9, the matching score is the score between a target object and a candidate computed at a vertex of the n-th target appearance model, and it is the probability of class 1 (matching). The weight is that of the vertex within the frame on the path of the n-th target object. If a candidate does not match any target object, it is regarded in tracking as a new deleted object. Likewise, if a target object has a matching score below the threshold for every candidate, it is regarded as a vanished object. In the present invention, the threshold is set to 0.6 by way of example, a value chosen experimentally.
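The weighted averaging and the 0.6 threshold can be illustrated as follows. This is a sketch under stated assumptions: Equations 8 and 9 are not reproduced in this text, so the normalization by the sum of weights and the greedy highest-score-first pairing in `select_pairs` are stand-ins, and all function and variable names are hypothetical.

```python
VANISH_THRESHOLD = 0.6  # example value given in the description


def weighted_average_score(path_scores, path_weights):
    """Average the per-vertex matching scores along one active fine-tuning path."""
    total = sum(path_weights)
    return sum(w * s for w, s in zip(path_weights, path_scores)) / total


def select_pairs(score_matrix, threshold=VANISH_THRESHOLD):
    """ASSUMPTION: greedy one-to-one pairing, highest score first.

    score_matrix[n][m] is the weighted average score of target n and candidate m.
    Returns (pairs, vanished_targets, unmatched_candidates).
    """
    entries = sorted(
        ((s, n, m) for n, row in enumerate(score_matrix) for m, s in enumerate(row)),
        key=lambda e: -e[0],
    )
    used_targets, used_candidates, pairs = set(), set(), []
    for s, n, m in entries:
        if s < threshold:
            break  # every remaining score is below the threshold
        if n in used_targets or m in used_candidates:
            continue
        pairs.append((n, m))
        used_targets.add(n)
        used_candidates.add(m)
    # A target whose score is below the threshold for every candidate has vanished.
    vanished = [n for n, row in enumerate(score_matrix)
                if all(s < threshold for s in row)]
    n_candidates = len(score_matrix[0]) if score_matrix else 0
    unmatched = [m for m in range(n_candidates) if m not in used_candidates]
    return pairs, vanished, unmatched


weights = [0.5, 0.5]                      # one weight per vertex on the path
path_scores = [                           # path_scores[n][m] = scores per vertex
    [[0.9, 0.7], [0.2, 0.4], [0.1, 0.1]],  # target 0 vs candidates 0..2
    [[0.3, 0.1], [0.5, 0.3], [0.2, 0.4]],  # target 1 vs candidates 0..2
]
matrix = [[weighted_average_score(s, weights) for s in row] for row in path_scores]
pairs, vanished, unmatched = select_pairs(matrix)
```

In this toy input, target 0 pairs with candidate 0, while target 1 scores below 0.6 against every candidate and is reported as vanished. An optimal assignment solver could replace the greedy loop if global optimality is required.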

To determine the weight, the present invention considers the reliability of the matching network. The weight is assigned to prevent fine-tuning from being driven by unreliable cases, such as a high matching score being measured despite a noisy instance. To measure the reliability of a matching network, the fine-tuning path (not all paths) is explored recursively as in Equation 10.

[Equation 10]

In Equation 10, the parent term denotes the parent node of a vertex.
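A recursive reliability measure over the fine-tuning path might look like the sketch below. Since Equation 10 is not reproduced in this text, the specific rule used here (the reliability of a vertex is the smaller of its incoming edge score and its parent's reliability, with the root fixed at 1.0) is an assumption chosen for illustration, as are the dictionary layout and the names `reliability` and `path_weights`.

```python
def reliability(vertex, parent_of, edge_score):
    """Recursively compute the reliability of a vertex from its ancestors.

    ASSUMPTION: bottleneck rule, min(edge score from parent, parent's reliability).
    parent_of maps each vertex to its parent (None for the root);
    edge_score maps (parent, child) to the relationship score of that edge.
    """
    parent = parent_of.get(vertex)
    if parent is None:
        return 1.0  # the root has no incoming edge
    return min(edge_score[(parent, vertex)],
               reliability(parent, parent_of, edge_score))


def path_weights(path, parent_of, edge_score):
    """Normalize reliabilities along one path so the weights sum to 1."""
    raw = [reliability(v, parent_of, edge_score) for v in path]
    total = sum(raw)
    return [r / total for r in raw]


parent_of = {"v0": None, "v1": "v0", "v2": "v1"}
edge_score = {("v0", "v1"): 0.9, ("v1", "v2"): 0.7}
r_v2 = reliability("v2", parent_of, edge_score)
w = path_weights(["v0", "v1", "v2"], parent_of, edge_score)
```

Under this rule a single weak edge lowers the reliability of every descendant, so a network fine-tuned from a noisy ancestor contributes less to the weighted average score.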

In the structured target appearance model, each node contains a fine-tuned matching network. The present invention applies an adaptive model update method that fine-tunes the matching network on new training frames.

Let z be a newly created node for fine-tuning the matching network. By way of example, the matching network is fine-tuned after tracking of 15 consecutive frames has been completed. The newly fine-tuned matching network has a parent node that satisfies Equation 11.

[Equation 11]

In Equation 11, the edge term is an interim edge. After the parent node of the newly created node z is found, fine-tuning is performed on two sets of frames. Finally, the newly fine-tuned matching network of node z is added to the hierarchical structure, which completes the fine-tuning.
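The update schedule can be sketched as a small bookkeeping class. This is illustrative only: Equation 11 is not reproduced in this text, so choosing the parent as the existing node with the highest interim edge score is an assumption, and `AppearanceModel`, `interim_edge_score`, and `fine_tune` are hypothetical names, with the last two supplied by the caller as stand-ins for the real scoring and training steps.

```python
FINE_TUNE_INTERVAL = 15  # example value given in the description


class AppearanceModel:
    """Tracks which frames each node was fine-tuned on and who its parent is."""

    def __init__(self):
        self.frames_of = {"root": []}
        self.parent_of = {"root": None}
        self.pending = []  # frames tracked since the last fine-tuning

    def add_tracked_frame(self, frame_id, interim_edge_score, fine_tune):
        """Record one tracked frame; create node z once enough frames accumulate."""
        self.pending.append(frame_id)
        if len(self.pending) < FINE_TUNE_INTERVAL:
            return None
        z = "z%d" % len(self.frames_of)
        # ASSUMPTION: the parent is the node with the best interim edge score.
        parent = max(self.frames_of,
                     key=lambda v: interim_edge_score(v, self.pending))
        # Fine-tune on two frame sets: the parent's frames and the new frames.
        fine_tune(parent, self.frames_of[parent], self.pending)
        self.frames_of[z] = self.pending
        self.parent_of[z] = parent
        self.pending = []
        return z


calls = []
model = AppearanceModel()
created = [
    model.add_tracked_frame(
        t,
        lambda v, frames: 1.0,                               # dummy scorer
        lambda p, a, b: calls.append((p, len(a), len(b))),   # dummy trainer
    )
    for t in range(15)
]
```

With the dummy callbacks, the first fourteen calls return `None` and the fifteenth creates node `z1` under `root`.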

Although several embodiments of the present invention have been shown and described, those of ordinary skill in the art to which the invention pertains will appreciate that the embodiments may be modified without departing from the principles or spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

11: 2D image frame acquisition unit 12: depth frame acquisition unit
20: data set generation unit 30: first matching network
40: second matching network 50: tracking result fusion unit
60: tracking result accumulation unit 70: fine tuning unit

Claims

A multi-object tracking method, comprising:
(a) generating a 2D image data set composed of a target object of a previous 2D image frame and a target candidate object of a current 2D image frame, and a depth data set composed of a target object of a previous depth frame and a target candidate object of a current depth frame;
(b) applying the 2D image data set and the depth data set to a first matching network and a second matching network, respectively, which are independent of each other, outputting a 2D image tracking result for the 2D image data set from the first matching network, and outputting a depth tracking result for the depth data set from the second matching network; and
(c) determining a final tracking result by fusing the 2D image tracking result and the depth tracking result using a basic belief assignment.

The method according to claim 1,
wherein, in step (a), instances including the target object and the target candidate object of the 2D image frame are represented using a pre-trained convolutional neural network, and
step (a) comprises:
(a1) extracting the output of a convolution layer for the 2D image frame from the pre-trained convolutional neural network to generate an output feature map of the 2D image frame;
(a2) pooling a representation of each instance from the output feature map according to the scale of that instance using ROI pooling; and
(a3) normalizing the representation of each instance.

The method according to claim 1,
wherein, in step (a1),
max pooling is applied for subsampling to an instance of a larger scale than the input of the first matching network, and
a deconvolution operation is applied for upsampling to an instance of a smaller scale than the input of the first matching network.

The method according to claim 1,
wherein, in step (a), instances including the target object and the target candidate object of the depth frame are represented by supervision transfer.

The method according to claim 1,
wherein, in step (b), each of the first matching network and the second matching network comprises two sub-networks that share weights and are connected to each other, and a softmax layer, and
the target object and the target candidate object are input separately to the respective sub-networks.

The method according to claim 1,
wherein, after steps (a) to (c) have been performed for a predetermined number of the 2D image frames and the depth frames, the first matching network and the second matching network are fine-tuned to update the target appearance models of the first matching network and the second matching network.