KR20230060214A

KR20230060214A - Apparatus and Method for Tracking Person Image Based on Artificial Intelligence

Info

Publication number: KR20230060214A
Application number: KR1020210144687A
Authority: KR
Inventors: 전광길
Original assignee: 인천대학교 산학협력단
Priority date: 2021-10-27
Filing date: 2021-10-27
Publication date: 2023-05-04

Abstract

An apparatus and method for tracking an image object based on artificial intelligence perform transfer learning on a weight obtained by estimating a human object through pre-training and a weight obtained by estimating the human object through learning top-view video frames, and estimate the presence of a human object using the two weights when extracting features of an input image frame. By detecting and tracking objects with these two weights, the invention improves object tracking, reporting a tracking accuracy of 96% and a detection accuracy of 95%. The apparatus includes a first object detection unit, a second object detection unit, and a transfer learning model unit.

Description

Apparatus and Method for Tracking Person Image Based on Artificial Intelligence

The present invention relates to an apparatus and method for tracking an image object, and more particularly to an artificial intelligence-based image object tracking apparatus and method that perform transfer learning on a weight obtained by estimating a human object through pre-training and a weight obtained by estimating a human object through learning top-view image frames, and that estimate the presence of a human object using the two weights when extracting features of an input image frame.

Today, 5G is having a profound impact on video surveillance and monitoring services by processing video streams at high speed over highly reliable, high-bandwidth, and secure network connections.

Deep learning-based tracking technologies built on multi-person tracking and detection frameworks that use this 5G infrastructure are being actively studied.

In video surveillance, person tracking must cope with the deformable nature of the human body and with environmental factors such as occlusion, lighting, and background conditions; in particular, it is important to distinguish one person's visual appearance from that of others.

Current object tracking technologies either have somewhat low detection and tracking accuracy, or require costly solutions, longer analysis times, and enormous bandwidth.

Korean Patent Application Publication No. 10-2021-0067498

To solve these problems, an object of the present invention is to provide an artificial intelligence-based image object tracking apparatus and method that perform transfer learning on a weight obtained by estimating a human object through pre-training and a weight obtained by estimating a human object through learning top-view image frames, and that estimate the presence of a human object using the two weights when extracting features of an input image frame.

To achieve the above object, an artificial intelligence-based image object tracking apparatus according to a feature of the present invention includes:

a first object detection unit that extracts a first feature map of an image frame from a dataset storing a training dataset, a validation dataset, and a test dataset, extracts, based on the extracted first feature map, at least one region of the image in which a human object is estimated to be present, and marks the region with a first bounding box;

a second object detection unit that receives a top-view image frame from a camera unit installed at a top-view position, extracts a second feature map of the top-view image frame, extracts, based on the extracted second feature map, at least one region of the image in which a human object is estimated to be present, and marks the region with a second bounding box; and

a transfer learning model unit that generates, from the first object detection unit, a first weight for the result learned by the pre-trained learning model, and generates, from the second object detection unit, a second weight for the result learned by the learning model trained on top-view image frames.

An artificial intelligence-based image object tracking method according to a feature of the present invention includes:

a step in which a first object detection unit extracts a first feature map of an image frame from a dataset storing a training dataset, a validation dataset, and a test dataset, extracts, based on the extracted first feature map, at least one region of the image in which a human object is estimated to be present, and marks the region with a first bounding box;

a step in which a second object detection unit receives a top-view image frame from a camera unit installed at a top-view position, extracts a second feature map of the top-view image frame, extracts, based on the extracted second feature map, at least one region of the image in which a human object is estimated to be present, and marks the region with a second bounding box; and

a step in which a transfer learning model unit generates, from the first object detection unit, a first weight for the result learned by the pre-trained learning model, and generates, from the second object detection unit, a second weight for the result learned by the learning model trained on top-view image frames.

With the above configuration, the present invention detects and tracks objects using the weight obtained by estimating a human object through pre-training and the weight obtained by estimating a human object through learning top-view image frames, thereby achieving high tracking accuracy (96%) and detection accuracy (95%).

FIGS. 1 and 2 are diagrams showing the configuration of a transfer learning-based top-view image object tracking apparatus according to an embodiment of the present invention.
FIGS. 3 and 4 are diagrams schematically showing the configuration of a bounding box detection unit according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of human objects detected in a top-view image using YOLO.
FIG. 6 is a block diagram schematically showing the internal configuration of an object tracking unit according to an embodiment of the present invention.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIGS. 1 and 2 are diagrams showing the configuration of a transfer learning-based top-view image object tracking apparatus according to an embodiment of the present invention, and FIGS. 3 and 4 are diagrams schematically showing the configuration of a bounding box detection unit according to an embodiment of the present invention.

In outline, the present invention is a series of processes that generate a top-view image sequence from a top view and perform object detection and object tracking on that sequence.

A transfer learning-based top-view image object tracking apparatus 100 according to an embodiment of the present invention includes a bounding box detection unit 200 and an object tracking unit 300.

The bounding box detection unit 200 according to an embodiment of the present invention includes a COCO dataset database unit 210, a first object detection unit 220, an input unit 230, a second object detection unit 240, a transfer learning model unit 250, an object detection model unit 260, a training image input unit 270, and an output unit 280.

A top-view image generator (not shown) generates a top-view image sequence using a camera unit installed at a top-view position. Here, a top view is a view looking down from above.

The top-view image generator converts the generated top-view image sequence into image frames.

The COCO dataset database unit 210 stores the COCO dataset, a dataset created for computer vision tasks such as object detection, segmentation, and keypoint detection, and holds a training dataset, a validation dataset, and a test dataset.

The input unit 230 receives image frames from the top-view image generator.

The first object detection unit 220 collects the training, validation, and test datasets from the COCO dataset database unit 210, extracts a feature map of an image frame using a pre-trained YOLOv3, and, based on the extracted feature map, extracts at least one region of the image in which a human object is estimated to be present.

The second object detection unit 240 collects, from the input unit 230, the image frames acquired by the top-view image generator, extracts a feature map of each top-view image frame using YOLOv3, and, based on the extracted feature map, extracts at least one region of the image in which a human object is estimated to be present.

The first object detection unit 220 and the second object detection unit 240 extract feature maps from an input image using any one of a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), and a deep belief network (DBN).

The first object detection unit 220 and the second object detection unit 240 may generate feature maps using a model that a learning unit has already trained with deep learning. Deep learning is defined as a set of machine learning algorithms that attempt high-level abstraction (summarizing the key content or functions in large or complex data) through a combination of several nonlinear transformations.

The first object detection unit 220 and the second object detection unit 240 extract, from an image frame, a region in which an object is estimated to be present, and extract a feature map representing features from the extracted region.

The first object detection unit 220 and the second object detection unit 240 extract, based on the extracted feature map, at least one region of the image in which a human object is estimated to be present. Methods for extracting regions include, for example, Faster R-CNN, SSD (Single Shot MultiBox Detector), and YOLO (You Only Look Once); the present invention uses the YOLO object recognition module (YOLOv3) as an example.

The first object detection unit 220 and the second object detection unit 240 select, from among the feature maps, a feature map containing the coordinates of the classes for each region of the image, identify from the selected feature map the coordinates that delimit a region, and extract the identified coordinates as a region in which an object is estimated to be present.

The first object detection unit 220 and the second object detection unit 240 may be set to detect one human object or two or more.

The first object detection unit 220 and the second object detection unit 240 may mark each of the extracted regions with a bounding box that encloses the outermost boundary of the object.

Each bounding box indicates that an object may be present at that location in the image.

The YOLOv3 model uses a single network architecture to predict class probabilities and the corresponding bounding boxes for the entire input image.

The YOLOv3 model contains 24 convolutional layers and 2 fully connected layers. The convolutional layers extract features from the image, while the fully connected layers compute class predictions and probabilities.

While the first object detection unit 220 and the second object detection unit 240 detect people, the model divides the input image frame into an S × S grid of regions, called grid cells, as shown in FIG. 5. These grid cells are associated with bounding box predictions and class probabilities. Each cell predicts the probability that the center of a person lies within it. If the prediction is positive, a bounding box and a confidence value are predicted for each positive detection.
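The grid-cell assignment described above can be illustrated with a minimal sketch. The function name `responsible_cell`, the pixel-coordinate convention, and the grid size S = 13 in the usage line are assumptions made for illustration; the specification does not state a value for S.

```python
def responsible_cell(cx, cy, img_w, img_h, s):
    """Return (row, col) of the S x S grid cell that contains a box center.

    cx, cy are the center of a person in pixels; img_w, img_h are the frame
    size in pixels; s is the number of cells per side (hypothetical value).
    """
    # Scale the center into grid units and clamp so that a center lying
    # exactly on the right or bottom edge still falls in the last cell.
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


# Usage: a person centered in a 640 x 480 top-view frame with a 13 x 13 grid.
print(responsible_cell(320, 240, 640, 480, 13))
```

Only the cell returned here would be asked to predict the bounding box and confidence value for that person.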

The first object detection unit 220 and the second object detection unit 240 define, by Equation 1, a confidence value Conf(person) that indicates how likely a detected bounding box is to contain a person.

Conf(person) = Pr(person) × IOU(Pred, Truth)   (Equation 1)

Here, Pr(person) indicates whether a person is present in the predicted bounding box (yes: 1, no: 0), and IOU(Pred, Truth) is the IoU overlap ratio, defined by Equation 2 as the overlapping area of the predicted and actual bounding boxes divided by the area of their union.

IOU(Pred, Truth) = area(BoxT ∩ BoxP) / area(BoxT ∪ BoxP)   (Equation 2)

Here, BoxT denotes the ground truth in the training set (Truth), and BoxP denotes the predicted bounding box (Pred).

Ground truth refers to the actual correct label, not a value predicted by the neural network model. That is, it indicates the actual location of an object.

In object detection, the location of an object is represented by a bounding box. The predicted bounding box is compared with the actual ground-truth bounding box, and the two are said to match when the IoU overlap ratio, the overlapping area divided by the area of the union of the two regions, is at or above a preset threshold (0.5).

If the IoU value is 0.5 or greater, the region is regarded as an object and labeled with the same class as the ground truth.
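The IoU ratio, the confidence value Conf(person) = Pr(person) × IOU(Pred, Truth), and the 0.5 matching rule can be sketched as follows. The corner-coordinate box format (x1, y1, x2, y2) and the function names are assumptions made for illustration; the specification describes boxes by center, width, and height.

```python
def iou(box_p, box_t):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Overlapping rectangle; width and height are clamped at zero
    # so that disjoint boxes produce an empty intersection.
    inter_w = max(0.0, min(box_p[2], box_t[2]) - max(box_p[0], box_t[0]))
    inter_h = max(0.0, min(box_p[3], box_t[3]) - max(box_p[1], box_t[1]))
    inter = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def confidence(pr_person, box_p, box_t):
    """Conf(person): Pr(person) is 1 if a person is present, else 0."""
    return pr_person * iou(box_p, box_t)


def is_match(box_p, box_t, threshold=0.5):
    """A prediction matches the ground truth when IoU >= threshold."""
    return iou(box_p, box_t) >= threshold


# Usage: the prediction covers the lower half of the ground-truth box.
print(iou((0, 0, 4, 2), (0, 0, 4, 4)), is_match((0, 0, 4, 2), (0, 0, 4, 4)))
```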

The first object detection unit 220 and the second object detection unit 240 select and predict regions suitable for person detection in the top-view image frame and, after prediction, use the confidence values to obtain the desired bounding boxes.

The first object detection unit 220 and the second object detection unit 240 predict five values for every bounding box: h, w, x, y, and a confidence value. Here, w and h denote the width and height, and x and y denote the center coordinates of the bounding box.

The first object detection unit 220 and the second object detection unit 240 define a threshold, discard bounding boxes with low confidence values, process the remaining high-confidence bounding boxes, and derive the final location parameters using non-maximum suppression.
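A minimal sketch of the confidence thresholding and non-maximum suppression step follows. The detection tuple layout (x1, y1, x2, y2, confidence), the function names, and both default thresholds of 0.5 are assumptions for illustration; the specification does not give the threshold values used at this step.

```python
def _iou(a, b):
    """IoU of two boxes whose first four entries are (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, conf_threshold=0.5, iou_threshold=0.5):
    """Keep the highest-confidence box among heavily overlapping boxes.

    detections: list of (x1, y1, x2, y2, confidence) tuples.
    """
    # Step 1: discard detections whose confidence is below the threshold.
    candidates = [d for d in detections if d[4] >= conf_threshold]
    # Step 2: visit the rest from most to least confident.
    candidates.sort(key=lambda d: d[4], reverse=True)
    kept = []
    for det in candidates:
        # Step 3: keep a box only if it does not overlap a kept box too much.
        if all(_iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Usage: two overlapping boxes on one person, one separate person, one weak box.
boxes = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8),
         (50, 50, 60, 60, 0.7), (0, 0, 10, 10, 0.2)]
print(non_max_suppression(boxes))
```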

The first object detection unit 220 and the second object detection unit 240 calculate a loss function for the detected bounding boxes.

The first object detection unit 220 and the second object detection unit 240 learn the features of the image frames so that the total of the calculated loss function is minimized.

The loss function is the sum of a regression loss and a classification loss. The first object detection unit 220 and the second object detection unit 240 consider only one target class, namely the person, and the loss function for this task is given by Equation 3 below.

In Equation 3, one term represents the coordinate loss of the predicted bounding box, and another is used to calculate the coordinate loss of the actual bounding box. People appear at different sizes in top-view image frames, whereas the YOLOv3 loss computes the same loss for every bounding box. However, large and small objects affect the loss function differently over the whole image. Contrast normalization is therefore used to improve the coordinate loss function. The coordinate loss function is given by Equation 4 below.

Equation 4 uses a scale parameter for bounding box coordinate prediction, the predicted position of the bounding box detected in the i-th cell, and the actual position of the bounding box in the i-th cell.

Equation 4 calculates the loss function associated with the predicted bounding box with coordinates x and y. One term gives the likelihood that a person is detected in the j-th bounding box, and another is a fixed constant. The summation computes the sum over each bounding box (j = 0 to B), used as a predictor, for each grid cell (i = 0 to S²).

The classification loss is calculated by Equation 5 below.

Here, Equation 5 represents the classification error. Its two confidence terms are the confidence value of the i-th grid cell for the predicted bounding box and the confidence value of the i-th grid cell for the original sliding window. An indicator denotes whether a person is detected in the j-th bounding box of grid cell i: it equals 1 if the target person is in the j-th bounding box and the i-th grid cell, and 0 otherwise.

A complementary indicator denotes whether no person is detected in the j-th bounding box of grid cell i.

The transfer learning model unit 250 generates, from the first object detection unit 220, a first weight for the result learned by the pre-trained learning model, and generates, from the second object detection unit 240, a second weight for the result learned by the learning model trained on top-view image frames.

The training image input unit 270 generates training image frames, which are top-view images.

The object detection model unit 260 receives a training image frame that is a top-view image, extracts a feature map of the input training image frame using the first and second weights generated by the transfer learning model unit 250, extracts, based on the extracted feature map, at least one region of the image in which a human object is estimated to be present, marks the region with a bounding box, and outputs the result to the output unit 280.

FIG. 6 is a block diagram schematically showing the internal configuration of an object tracking unit according to an embodiment of the present invention.

The object tracking unit 300 applies Deep SORT, a deep learning-based tracking algorithm, to track human objects in top-view image frames.

The object tracking unit 300 includes an input module 310, a bounding box tracking unit 320, a control unit 330, a CNN model unit 340, a distance calculation unit 350, and a tracking result unit 360.

The input module 310 receives, from the object detection model unit 260 of the bounding box detection unit 200, image frames in which human objects are marked with bounding boxes.

The bounding box tracking unit 320 uses a Kalman filter to track bounding box information.

Using the Kalman filter, the bounding box tracking unit 320 tracks the bounding box information contained in the image frames received over time.

The Kalman filter is a recursive filter that estimates the state of a linear system from measurements (observations) that contain error (noise).

The Kalman filter predicts the current position from previously accumulated tracking data. If one of the newly received detections matches, the final state is determined by weighting the predicted data and the observed data according to the ratio of their variances.

The Kalman filter predicts the state (position and size) of an object from previously accumulated tracking data. Because the Kalman filter is a well-known tracking algorithm, a detailed description is omitted.

Kalman filtering uses spatial coordinate information to track objects in the current image frame.

The spatial coordinate information is as follows.

The spatial coordinate information comprises the position of the bounding box and the tracking velocity of each coordinate of the detected bounding box. The position is given by the bounding box center (u, v), the aspect ratio γ, and the height h.

The bounding box tracking unit 320 performs tracking prediction using the spatial coordinate information of the bounding box.
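The predict and update cycle described above can be sketched with a deliberately reduced Kalman filter. This is not the filter of the specification, which tracks the full bounding box state; the sketch filters a single coordinate (for example the center coordinate u) with a constant-velocity model. The class name and all noise parameters are hypothetical.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, position, pos_var=1.0, vel_var=1.0,
                 process_var=0.01, meas_var=1.0):
        self.x = [float(position), 0.0]            # state estimate
        self.P = [[pos_var, 0.0], [0.0, vel_var]]  # state covariance
        self.q = process_var                       # process noise per step
        self.r = meas_var                          # measurement noise variance

    def predict(self, dt=1.0):
        """Propagate the state one frame ahead: x = F x, P = F P F^T + Q."""
        p, v = self.x
        (p00, p01), (p10, p11) = self.P
        self.x = [p + dt * v, v]
        self.P = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.x[0]

    def update(self, z):
        """Blend the prediction with a measured position z."""
        (p00, p01), (p10, p11) = self.P
        innovation = z - self.x[0]
        s = p00 + self.r              # predicted variance plus measurement variance
        k0, k1 = p00 / s, p10 / s     # Kalman gain: the ratio of variances
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        self.P = [
            [(1.0 - k0) * p00, (1.0 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]
        return self.x[0]


# Usage: a box center moving about 2 pixels per frame.
kf = ConstantVelocityKalman1D(position=100.0)
for z in [102.0, 104.0, 106.0, 108.0]:
    kf.predict()
    kf.update(z)
print(kf.x)
```

The gain `k0 = p00 / (p00 + r)` is the variance ratio mentioned in the text: a confident prediction (small `p00`) pulls the result toward the prediction, while a precise measurement (small `r`) pulls it toward the observation.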

The control unit 330 receives the spatial information and tracking information of the bounding boxes from the bounding box tracking unit 320 and transmits them to the CNN model unit 340.

The CNN model unit 340 includes a convolutional neural network (CNN) and a plurality of filters, and the CNN may include convolutional layers designed to perform convolution operations.

A convolutional layer of the CNN may perform a convolution operation on its input using a kernel.

The CNN model unit 340 uses the CNN to extract a feature vector containing appearance information of the bounding box.

The distance calculation unit 350 uses the Mahalanobis distance to calculate the distance between an existing track of a detected object and the next detection.

In the Deep SORT algorithm, the cost matrix is used to represent appearance.

The distance calculation unit 350 expresses the similarity between new detections and tracks using two distance values; the first, d⁽¹⁾(i, j), is calculated by Equation 6 and captures spatial similarity.

Here, (y_i, S_i) denotes the projection of the i-th track into measurement space, d_j denotes the j-th new bounding box detection, and T denotes the transpose.

The distance calculation unit 350 obtains the Mahalanobis distance by calculating the difference between a new detection d_j and the measured position y_i of a track.

The distance calculation unit 350 calculates a decision indicator by Equation 7 below, using the distance value of Equation 6 and the Mahalanobis distance threshold between the i-th track and the j-th detection.

Here, t⁽¹⁾ is a preset Mahalanobis distance threshold.

The indicator evaluates to 1 when the association between the i-th track and the j-th detection is admissible.
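The spatial distance and its gate can be sketched as below, assuming the standard Deep SORT form d⁽¹⁾(i, j) = (d_j − y_i)ᵀ S_i⁻¹ (d_j − y_i), a squared Mahalanobis distance, and a gate that is 1 when d⁽¹⁾(i, j) ≤ t⁽¹⁾. For brevity the sketch further assumes a diagonal covariance S_i, which is a simplification and not stated in the specification. The default threshold 9.4877 is the 95% chi-square quantile for four degrees of freedom used in the published Deep SORT work; the specification only says the threshold is preset.

```python
CHI2_95_4DOF = 9.4877  # assumed value for t(1); not given in the specification


def mahalanobis_sq(detection, track_mean, track_var):
    """Squared Mahalanobis distance for a diagonal covariance.

    detection, track_mean: measurements (u, v, aspect ratio, height).
    track_var: per-coordinate variances, i.e. the diagonal of S_i (assumed).
    """
    return sum((d - m) ** 2 / var
               for d, m, var in zip(detection, track_mean, track_var))


def spatial_gate(detection, track_mean, track_var, threshold=CHI2_95_4DOF):
    """Return 1 if the track-to-detection association is admissible, else 0."""
    return 1 if mahalanobis_sq(detection, track_mean, track_var) <= threshold else 0


# Usage: one detection close to the predicted track position, one far away.
mean, var = (102.0, 49.0, 0.5, 82.0), (4.0, 4.0, 0.01, 4.0)
print(spatial_gate((100.0, 50.0, 0.5, 80.0), mean, var))
print(spatial_gate((130.0, 50.0, 0.5, 80.0), mean, var))
```

Dividing by the variance is what distinguishes this from a plain Euclidean distance: a track whose position is uncertain tolerates a larger offset before the gate closes.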

The distance calculation unit 350 calculates a second distance value, which represents appearance information, by Equation 8 below.

As given by Equation 8, the second distance value d⁽²⁾(i, j) is the smallest cosine distance between the i-th track and the j-th detection.

Here, r_j and r_k are appearance descriptors, and R_i is used to represent the appearances of the 100 or more objects (people) in the i-th track. Equation 9 below is used to set a threshold on associations.

For each track k, the distance calculation unit 350 keeps a gallery R_k of the last L_k appearance descriptors associated with that track.
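The appearance distance can be sketched as the smallest cosine distance between a detection's descriptor and the descriptors stored in a track's gallery, assuming the standard Deep SORT form d⁽²⁾(i, j) = min over the gallery of (1 − r_jᵀ r_k) with unit-length descriptors. The function names, the gallery length of 100, and the gate threshold 0.2 are assumptions for illustration.

```python
import math
from collections import deque


def normalize(vec):
    """Scale a descriptor to unit length so a dot product is a cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("descriptor must be non-zero")
    return [x / norm for x in vec]


def appearance_distance(det_descriptor, gallery):
    """Smallest cosine distance between a detection and a track's gallery."""
    r_j = normalize(det_descriptor)
    return min(1.0 - sum(a * b for a, b in zip(r_j, normalize(r_k)))
               for r_k in gallery)


def appearance_gate(distance, threshold=0.2):
    """Return 1 when the appearance distance is small enough, else 0."""
    return 1 if distance <= threshold else 0


# A bounded gallery keeps only the most recent descriptors of a track.
gallery = deque(maxlen=100)      # assumed gallery length
gallery.append([1.0, 0.0])
gallery.append([0.0, 1.0])

# Usage: a descriptor pointing almost the same way as a stored one.
d2 = appearance_distance([0.9, 0.1], gallery)
print(d2, appearance_gate(d2))
```

Taking the minimum over the gallery means a person only has to resemble one of their own recent appearances, which helps when the view of the person changes between frames.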

The distance calculation unit 350 uses Equation 9 below to set a threshold for associating a detection with a track.

[Equation 9]
b^{(2)}_{i,j} = \mathbb{1}\left[ d^{(2)}(i,j) \le t^{(2)} \right]

The value is 1 when the distance value is small (within the threshold) and 0 when it is large. The distance calculation unit 350 then combines the two distance values into a cost function (cost matrix, C_{i,j}) by Equation 10 below. In other words, to build the association problem, the two metrics are combined using a weighted sum.

[Equation 10]
C_{i,j} = \lambda \, d^{(1)}(i,j) + (1 - \lambda) \, d^{(2)}(i,j)

The influence of each metric on the combined association cost can be controlled through the hyperparameter \lambda.
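The weighted sum of the two metrics can be sketched as below. This is an illustration under the assumption that both distance matrices are nested lists indexed as [track][detection]; the function name `combined_cost`, the argument name `lam`, and the sample values are hypothetical.

```python
def combined_cost(d1, d2, lam):
    """Weighted sum of two distance matrices of identical shape.

    d1: spatial (Mahalanobis) distances, d2: appearance (cosine) distances.
    lam: weight in [0, 1]; 1 uses only d1, 0 uses only d2.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(d1, d2)
    ]


spatial = [[2.0, 10.0]]
appearance = [[0.5, 0.25]]
print(combined_cost(spatial, appearance, 0.5))
```

Setting `lam` close to 0 makes the cost depend almost entirely on appearance, which can help when motion is hard to predict; setting it close to 1 favors the motion model.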

To match spatial information with shape information, the distance calculation unit 350 defines the gate function (gate matrix, b_{i,j}) of Equation 11 below.

[Equation 11]
b_{i,j} = \prod_{m=1}^{2} b^{(m)}_{i,j}

An association is called admissible if it lies within the gating region of both metrics in Equation 11.

The value of Equation 11 is 1 only when both the shape gate and the spatial gate admit the pair, and 0 otherwise.

A value of 1 therefore indicates that (i, j) is a true match in both shape and spatial information. Accordingly, detection and tracking in every new video frame rely on the cost and gate functions above.
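The combined gate can be sketched as the product of two threshold tests, one per metric. The code below is a minimal illustration; the function names, the thresholds `t1` and `t2`, and the sample matrices are hypothetical values rather than figures from this specification.

```python
def gate_matrix(d1, d2, t1, t2):
    """Gate b[i][j] is 1 only if both distances are within their thresholds."""
    return [
        [(1 if a <= t1 else 0) * (1 if b <= t2 else 0) for a, b in zip(row1, row2)]
        for row1, row2 in zip(d1, d2)
    ]


def admissible_pairs(gates):
    """List the (track, detection) index pairs that passed both gates."""
    return [
        (i, j)
        for i, row in enumerate(gates)
        for j, gate in enumerate(row)
        if gate == 1
    ]


spatial = [[2.0, 10.0], [1.0, 3.0]]      # rows: tracks, columns: detections
appearance = [[0.1, 0.05], [0.9, 0.2]]
gates = gate_matrix(spatial, appearance, t1=5.99, t2=0.3)
print(gates, admissible_pairs(gates))
```

In this example, track 0 and detection 1 look alike but are too far apart, and track 1 and detection 0 are close but look different, so neither pair is admitted.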

The tracking result unit 360 tracks through the video sequence, and when a new detection is successfully associated with a current track, tracking continues in the next video frame.

If a new detection is not linked to or does not match a track, the tracking result unit 360 sets the value to 0.

In this case, the association of the new detection fails in the tracking result unit 360: the new detection is not linked to an existing detection in frame f. The new detection is then initialized as a tentative track.

Accordingly, once the threshold between associated tracks is set (Equation 9), the Deep SORT algorithm in the tracking result unit 360 keeps verifying the tentative track by trying to link it with new detections in the following frames (f + 1), (f + 2), …, (f + t). If the link succeeds, the track is confirmed and updated; otherwise, it is deleted immediately.

If the threshold between associated tracks is not set, the tracking result unit 360 deletes the track immediately.
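The life cycle of a tentative track described above can be sketched as a small state machine. This is only an illustration: the class name `Track`, the state labels, and the parameter `confirm_after` (how many successful links are needed before confirmation) are assumptions, and the text does not say how a confirmed track that misses a detection is handled, so the sketch leaves such a track unchanged.

```python
class Track:
    """Tentative track that is confirmed after repeated links or deleted on a miss."""

    def __init__(self, track_id, confirm_after=3):
        self.track_id = track_id
        self.state = "tentative"            # every new detection starts here
        self.hits = 0                       # successful links so far
        self.confirm_after = confirm_after  # hypothetical confirmation count

    def update(self, matched):
        """Advance the track by one frame; matched says whether a detection linked."""
        if self.state == "deleted":
            return self.state
        if matched:
            self.hits += 1
            if self.hits >= self.confirm_after:
                self.state = "confirmed"
        elif self.state == "tentative":
            # A tentative track that fails to link is deleted immediately.
            self.state = "deleted"
        return self.state


track = Track(track_id=1, confirm_after=2)
print(track.update(True), track.update(True))
print(Track(track_id=2).update(False))
```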

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention, as defined in the following claims, also fall within the scope of the present invention.

100: top-view video object tracking device
200: bounding box detection unit
300: object tracking unit

Claims

An artificial intelligence-based video object tracking device comprising:
a first object detection unit that extracts a first feature map of an image frame from a learning dataset storing a training dataset, a validation dataset, and a test dataset, extracts at least one region of the image in which a human object is estimated to be present based on the extracted first feature map, and displays the region as a first bounding box;
a second object detection unit that receives a top-view image frame from a camera unit installed in a top-view position, extracts a second feature map of the top-view image frame, extracts at least one region of the image in which a human object is estimated to be present based on the extracted second feature map, and displays the region as a second bounding box; and
a transfer learning model unit that generates a first weight from the result learned by the pre-trained learning model of the first object detection unit, and a second weight from the result learned by the learning model of the second object detection unit using the top-view image frame.

The device of claim 1,
further comprising an object detection model unit that receives a training image frame, which is a top-view image, extracts a third feature map of the input training image frame using the first weight and the second weight generated by the transfer learning model unit, extracts at least one region of the image in which a human object is estimated to be present based on the extracted third feature map, and displays the extracted region as a third bounding box.

The device of claim 2,
further comprising an object tracking unit consisting of:
a bounding box tracking unit that receives, from the object detection model unit, image frames in which a human object is displayed as a bounding box, and uses a Kalman filter to track the bounding box information contained in the plurality of image frames received over time; and
a CNN model unit that is composed of a convolutional neural network (CNN) and a plurality of filters, receives spatial information and tracking information of the bounding box from the bounding box tracking unit, and extracts a feature vector including shape (appearance) information of the bounding box.

The device of claim 1,
wherein the first object detection unit and the second object detection unit define a confidence value (Conf(person)), representing the degree to which the detected bounding box contains a person, by Equations 1 and 2 below.
[Equation 1]
Conf(person) = Pr(person) \times IOU(Pred, Truth)
Here, Pr(person) indicates whether there is a person in the predicted bounding box (yes: 1, no: 0), and IOU(Pred, Truth) is the IoU overlap ratio, that is, the overlapping area of the predicted bounding box and the actual bounding box divided by the area of their union.
[Equation 2]
IOU(Pred, Truth) = \frac{area(BoxT \cap BoxP)}{area(BoxT \cup BoxP)}
Here, BoxT represents the ground truth bounding box in the training set (Truth), and BoxP represents the predicted bounding box (Pred).

The device of claim 1,
wherein the first object detection unit and the second object detection unit learn features of an image frame such that the total sum of the loss functions of Equations 3, 4, and 5 below is minimized.
[Equation 3]

here,

denotes the coordinate loss of the predicted bounding box,

is used to calculate the coordinate loss of the actual bounding box.
[Equation 4]

here,

is the scale parameter used for predicting the coordinates of the bounding box (

),

is the predicted position of the bounding box detected in the i-th cell,

is the actual position of the bounding box in the i-th cell.
[Equation 5]

here,

represents the classification error,

and

Indicates whether a person is detected in the j-th bounding box of grid cell i,

Indicates whether a person is not detected in the j-th bounding box of grid cell i,

An artificial intelligence-based video object tracking method comprising:
extracting, by a first object detection unit, a first feature map of an image frame from a learning dataset storing a training dataset, a validation dataset, and a test dataset, extracting at least one region of the image in which a human object is estimated to be present based on the extracted first feature map, and displaying the region as a first bounding box;
receiving, by a second object detection unit, a top-view image frame from a camera unit installed in a top-view position, extracting a second feature map of the top-view image frame, extracting at least one region of the image in which a human object is estimated to be present based on the extracted second feature map, and displaying the region as a second bounding box; and
generating, by a transfer learning model unit, a first weight from the result learned by the pre-trained learning model of the first object detection unit, and a second weight from the result learned by the learning model of the second object detection unit using the top-view image frame.

The method of claim 6,
further comprising receiving, by an object detection model unit, a training image frame that is a top-view image, extracting a third feature map of the input training image frame using the first weight and the second weight generated by the transfer learning model unit, extracting at least one region of the image in which a human object is estimated to be present based on the extracted third feature map, and displaying the extracted region as a third bounding box.

The method of claim 7, further comprising:
receiving, by a bounding box tracking unit, image frames in which a human object is displayed as a bounding box from the object detection model unit, and tracking, by using a Kalman filter, the bounding box information contained in the plurality of image frames received over time; and
receiving, by a CNN model unit composed of a convolutional neural network (CNN) and a plurality of filters, spatial information and tracking information of the bounding box from the bounding box tracking unit, and extracting a feature vector including shape (appearance) information of the bounding box.

The method of claim 6,
further comprising defining, by the first object detection unit and the second object detection unit, a confidence value (Conf(person)), representing the degree to which the detected bounding box contains a person, by Equations 1 and 2 below.
[Equation 1]

The method of claim 6,
further comprising learning, by the first object detection unit and the second object detection unit, features of an image frame such that the sum of the loss functions of Equations 3, 4, and 5 below is minimized.
[Equation 3]

here,

denotes the coordinate loss of the predicted bounding box,

here,

is the scale parameter used to predict the coordinates of the bounding box (

),

is the predicted position of the bounding box detected in the i-th cell,

is the actual position of the bounding box in the i-th cell.
[Equation 5]

here,

represents the classification error,

and

Indicates whether a person is detected in the j-th bounding box of grid cell i,