KR102314520B1

KR102314520B1 - Apparatus and method for detecting object

Info

Publication number: KR102314520B1
Application number: KR1020190032714A
Authority: KR
Inventors: 김재민
Original assignee: 홍익대학교 산학협력단; (주)파크인서울
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2021-10-20
Also published as: KR20200119369A

Abstract

The present invention relates to a deep learning-based real-time object detection apparatus and method specialized for extracting vehicle information, in order to detect parked vehicles and track moving vehicles in images acquired through fisheye lens cameras installed on the ceiling of a parking lot.
An object detection apparatus according to an embodiment of the present invention includes: a feature extraction unit that extracts features of an image by applying a deep learning-based convolutional neural network model to images captured by a plurality of fisheye lens cameras provided on the ceiling of a parking lot; a multi-preprocessing unit that merges features extracted from one or more of the top layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and a multi-classification unit that receives the merged features from the multi-preprocessing unit, represents one or more objects as bounding boxes, and estimates the center coordinates of each bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the heading direction of the object.

Description

Object detection apparatus and method {APPARATUS AND METHOD FOR DETECTING OBJECT}

The present invention relates to a deep learning-based real-time object detection apparatus and method specialized for extracting vehicle information, in order to detect parked vehicles and track moving vehicles in images acquired through fisheye lens cameras installed on the ceiling of a parking lot.

When convolutional neural network-based object recognition is used to recognize diverse objects in diverse environments, rich features must be extracted. Extracting rich features requires a very large number of convolution layers and a large number of filters in each layer, which in turn demands a large amount of computation.

The background art described above is technical information that the inventor possessed for, or acquired in the course of, deriving the present invention, and is not necessarily known art disclosed to the general public before the filing of the present application.

Korean Laid-Open Patent Publication No. 2018-0136720

The present invention has been devised to solve the problems and/or limitations described above. According to one aspect, an object of the present invention is to achieve a high detection rate with a small amount of computation by developing a deep learning-based real-time object detection technique specialized for extracting vehicle information, in order to detect parked vehicles and track moving vehicles in images acquired through fisheye lens cameras installed on the ceiling of a parking lot.

An object detection apparatus according to an embodiment of the present invention may include: a feature extraction unit that extracts features of an image by applying a deep learning-based convolutional neural network model to images captured by a plurality of fisheye lens cameras provided on the ceiling of a parking lot; a multi-preprocessing unit that merges features extracted from one or more of the top layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and a multi-classification unit that receives the merged features from the multi-preprocessing unit, represents one or more objects as bounding boxes, and estimates the center coordinates of the bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the heading direction of the object.

The feature extraction unit may include darknet-19, which comprises 19 convolution layers that filter the image output from the fisheye lens camera using consecutive 3×3 and 1×1 convolutions, and 5 max pooling layers that downsample the image using 2×2 filters.
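The effect of the five 2×2 max pooling layers on spatial resolution can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `feature_map_size` and the 416×416 input size are assumptions (the input size is not stated in this passage), and the convolution layers are assumed to be padded so that only pooling changes the resolution.

```python
def feature_map_size(input_size: int, num_pool_layers: int = 5, pool: int = 2) -> int:
    """Spatial size after a number of non-overlapping pool x pool max pooling steps.

    Assumes the convolution layers preserve width and height (padded),
    so only the pooling layers reduce the resolution.
    """
    size = input_size
    for _ in range(num_pool_layers):
        size //= pool  # each 2x2 max pooling layer halves width and height
    return size


# Hypothetical 416x416 input: five poolings reduce it by a factor of 32.
top_layer = feature_map_size(416)         # resolution of the top layer
second_layer = feature_map_size(416, 4)   # one pooling step earlier
print(top_layer, second_layer)
```

Under these assumptions the top layer has one cell per 32×32 pixel block of the input, and the layer one pooling step earlier has four cells for each top-layer cell.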

The multi-preprocessing unit may include: a first preprocessing unit that merges the 1024 features of a given cell output from the 19th convolution layer, which is the top layer of the feature extraction unit comprising 19 convolution layers and 5 max pooling layers, with 256 features extracted from the 4 cells of the second-highest layer that correspond to that top-layer cell; and a second preprocessing unit that merges the 1024 features of a given cell output from the 13th convolution layer, which is the second-highest layer of the feature extraction unit, with 256 features extracted from the 4 cells of the third-highest layer that correspond to that second-highest-layer cell.
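The merge performed by each preprocessing unit can be pictured as a per-cell concatenation. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the four lower-layer cells are assumed to be the 2×2 block covered by the upper cell, and each lower cell is assumed to contribute 64 features (4 × 64 = 256); the text only gives the totals 1024 and 256.

```python
def lower_cell_indices(row, col):
    """Indices of the 2x2 block of lower-layer cells covered by upper cell (row, col).

    Assumes the lower layer has exactly twice the resolution of the upper layer.
    """
    return [(2 * row, 2 * col), (2 * row, 2 * col + 1),
            (2 * row + 1, 2 * col), (2 * row + 1, 2 * col + 1)]


def merge_cell_features(upper_cell, lower_cells):
    """Concatenate an upper-layer cell's features with those of its four lower-layer cells."""
    if len(lower_cells) != 4:
        raise ValueError("expected the 2x2 block of lower-layer cells")
    merged = list(upper_cell)
    for cell in lower_cells:
        merged.extend(cell)
    return merged


upper = [0.0] * 1024                           # 1024 features of one upper-layer cell
lower = [[float(i)] * 64 for i in range(4)]    # 64 features per lower cell (assumed split)
merged = merge_cell_features(upper, lower)
print(len(merged))  # 1280
```

Concatenation keeps the coarse, semantically rich features and the finer-resolution features side by side, so the classifier that follows can use both for every cell.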

The multi-classification unit may include: a first classification unit that passes the 1280 features merged by the first preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the heading direction (θ) of the object; and a second classification unit that passes the 1280 features merged by the second preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the heading direction (θ) of the object.
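A 1×1 convolution applied to a single cell reduces to one dot product per filter, which is why a bank of 1024 filters of size 1×1×1280 maps 1280 merged features to 1024 features. The sketch below illustrates only that step with toy dimensions; `conv1x1_cell` is a hypothetical helper, and how the 1024 features are mapped to (x, y, w, h, C, Pi, θ) is not described in this passage and is not shown.

```python
def conv1x1_cell(features, filters, biases=None):
    """Apply a bank of 1x1 convolution filters to one cell's feature vector.

    Each filter has one weight per input feature, so the output has
    one value per filter.
    """
    if biases is None:
        biases = [0.0] * len(filters)
    outputs = []
    for weights, bias in zip(filters, biases):
        if len(weights) != len(features):
            raise ValueError("filter depth must match the number of input features")
        outputs.append(sum(w * f for w, f in zip(weights, features)) + bias)
    return outputs


# Toy example: 3 input features, 2 filters.
result = conv1x1_cell([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
print(result)  # [1.0, 5.0]

# Weight count of the filter bank described in the text (biases excluded).
weights_per_classifier = 1024 * 1280
print(weights_per_classifier)  # 1310720
```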

The apparatus may further include a non-maximum suppression (NMS) processing unit that removes, from the estimation results of the first classification unit and the second classification unit, the overlapping bounding boxes where multiple bounding boxes cover the same object.
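Non-maximum suppression is commonly implemented as a greedy filter over detections sorted by confidence. The following is a minimal sketch of that common formulation, not necessarily the exact procedure of the NMS processing unit: the tuple layout `(cx, cy, w, h, score)` and the 0.5 overlap threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h, ...)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the highest-scoring box of each group of heavily overlapping boxes."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Detections pooled from both classifiers: (cx, cy, w, h, score).
detections = [(10.0, 10.0, 4.0, 4.0, 0.9),
              (10.5, 10.0, 4.0, 4.0, 0.8),   # overlaps the first box heavily
              (30.0, 30.0, 4.0, 4.0, 0.7)]
kept = nms(detections)
print(len(kept))  # 2
```

Pooling the outputs of both classification units before suppression means a vehicle detected at two scales ends up with a single box.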

An object detection method according to an embodiment of the present invention may include: extracting, by a feature extraction unit, features of an image by applying a deep learning-based convolutional neural network model to images captured by a plurality of fisheye lens cameras provided on the ceiling of a parking lot; merging, by a multi-preprocessing unit, features extracted from one or more of the top layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and, by a multi-classification unit, receiving the merged features from the multi-preprocessing unit, representing one or more objects as bounding boxes, and estimating the center coordinates of the bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the heading direction of the object.

The extracting of the features may include extracting the features using darknet-19, which comprises 19 convolution layers that filter the image output from the fisheye lens camera using consecutive 3×3 and 1×1 convolutions, and 5 max pooling layers that downsample the image using 2×2 filters.

The merging may include: merging, by a first preprocessing unit, the 1024 features of a given cell output from the 19th convolution layer, which is the top layer of the feature extraction unit comprising 19 convolution layers and 5 max pooling layers, with 256 features extracted from the 4 cells of the second-highest layer that correspond to that top-layer cell; and merging, by a second preprocessing unit, the 1024 features of a given cell output from the 13th convolution layer, which is the second-highest layer of the feature extraction unit, with 256 features extracted from the 4 cells of the third-highest layer that correspond to that second-highest-layer cell.

The estimating may include: by a first classification unit, passing the 1280 features merged by the first preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimating the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the heading direction (θ) of the object; and, by a second classification unit, passing the 1280 features merged by the second preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimating the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the heading direction (θ) of the object.

The method may further include removing, by a non-maximum suppression (NMS) processing unit, from the estimation results of the first classification unit and the second classification unit, the overlapping bounding boxes where multiple bounding boxes cover the same object.

In addition, other methods and other systems for implementing the present invention, and a computer program for executing the method, may further be provided.

Aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to the embodiments, a high detection rate can be achieved with a small amount of computation by developing a deep learning-based real-time object detection technique specialized for recognizing vehicles in images acquired through fisheye lens cameras installed on the ceiling of a parking lot. In particular, detection performance similar to that of YOLOv3, which offers the best performance relative to computation among previously published detectors, can be achieved with about 70% of YOLOv3's computation.

In addition, recognizing the heading direction of a vehicle increases the reliability of vehicle tracking within the parking lot.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram showing performance versus computational cost for various object detectors based on convolutional neural networks.
FIG. 2 is a diagram schematically illustrating the configuration of an object detection apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the image quality improvement performed by the image processing unit of the object detection apparatus of FIG. 2.
FIGS. 4 and 5 are diagrams schematically illustrating the detailed configuration of the object detection unit of the object detection apparatus of FIG. 2.
FIG. 6 is a diagram showing an object detection result of the object detection apparatus of FIG. 2.
FIG. 7 is a flowchart illustrating an object detection method according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments presented below; it may be implemented in various different forms and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. The embodiments presented below are provided so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the present invention pertains. In describing the present invention, detailed descriptions of related known technologies are omitted where they could obscure the gist of the present invention.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. These terms are used only to distinguish one component from another.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

With advances in deep learning training techniques and the ability to collect large volumes of image data, techniques for recognizing objects in images have reached high recognition performance. Among deep learning-based object detection methods, one that offers fast detection and high performance is the Faster R-CNN method proposed by Ren et al. This method extracts features with a pyramid-structured convolutional neural network and moves k anchor boxes across the image while determining whether an object is present inside each anchor box. The presence of an object is judged from the features that the neural network extracts within the anchor box region. Finally, the method computes the size and position of the region containing the object, the confidence of the object present in the region, and the probability that the object belongs to a specific class. However, this method requires too much computation to be implemented in real time.

Instead of moving anchor boxes to find object regions, Redmon et al. used a fully connected network to compute, for B bounding boxes in each cell of the top layer (resolution: 7×7) of a convolutional neural network, the position and size of the object region, the confidence that the bounding box contains an object, and the probability that the contained object belongs to a specific class. Because this method performs object detection only once over a fixed set of 7×7×B bounding boxes rather than moving anchor boxes, it is called YOLO (you only look once). It requires very little computation and allows real-time object detection, but its detection performance on small objects is low.

Liu et al. proposed a method that can also detect small objects effectively. Because it processes the position and size of an object and the per-class probabilities within a bounding box all at once, it is called SSD (single-shot detector). This method implements a multi-scale detector that detects objects in several layers of the convolutional neural network, from high-resolution layers down to the lowest-resolution layer. It achieves a high detection rate with a computational cost similar to that of YOLO.

Redmon et al. built a convolutional neural network for feature extraction from 19 convolution layers, consisting of consecutive 3×3 and 1×1 convolutions, and 5 max pooling layers, raised the resolution of the top layer to 13×13, and named this network darknet-19. They proposed YOLOv2, in which the classifier for object detection is built from a convolutional neural network (CNN) instead of a fully connected network. This method achieved better detection performance than SSD with less computation than YOLO.

He et al. proposed a convolutional neural network composed of very deep layers that can extract the rich features needed to detect thousands of object types; it is called ResNet. Its structure allows very deep layers to be trained stably, and it shows high performance in object detection and recognition. However, the amount of computation grows as the network becomes deeper.

Lin et al. identified a problem in training neural networks for anchor box-based object detection: the number of object samples is very small compared with the number of background samples, which degrades the performance of the trained detector. They proposed a training method that compensates for this imbalance. The resulting detector is called RetinaNet.

To extract richer features, Redmon et al. used 53 convolution layers in a hybrid of darknet-19 and ResNet, which is called darknet-53. Darknet-53 uses more computation than darknet-19 and less than ResNet. Redmon et al. extracted features with darknet-53 and, to detect objects of various sizes, built classifiers for object detection at three layers: the top layer, the second-highest layer, and the third-highest layer. This method is called YOLOv3.

FIG. 1 is a diagram showing performance versus computational cost for various prior-art object detectors based on convolutional neural networks. Referring to FIG. 1, YOLOv3 offers markedly better detection performance relative to computation than the other methods.

The performance of such deep learning-based object recognition techniques drops when they are asked to recognize diverse objects in diverse environments. However, if the object to be recognized, the viewpoint and type of the camera capturing it, and the surrounding environment are all specialized, objects can be detected with a high detection rate. This embodiment proposes a deep learning-based real-time object detection technique specialized for detecting vehicles in images acquired through fisheye lens cameras installed on the ceiling of an underground parking lot.

FIG. 2 is a diagram schematically illustrating the configuration of an object detection apparatus according to an embodiment of the present invention. Referring to FIG. 2, the object detection apparatus (1) may include an image capturing unit (100), an image processing unit (200), an object detection unit (300), and a display unit (400).

The image capturing unit (100) may capture images through N fisheye lens cameras (100_1 to 100_N) (for example, 16 cameras) installed on the ceiling of an indoor parking lot. In this embodiment, each of the fisheye lens cameras (100_1 to 100_N) may capture an image by collecting light incident within a viewing angle of 150 degrees or more and converting it into an electrical signal.

The image processing unit (200) may improve the quality of the images captured by the image capturing unit (100). Even when lighting is distributed evenly inside an indoor parking lot, vehicles parked in corners of the building are dimly lit, so the outline of the rear of the object is hard to make out, as in FIG. 3(a). When an image containing an object is fed into the convolutional neural network-based object detection unit (300) and the outputs of the intermediate layers of the network are visualized, the edges of the object outline appear emphasized. In other words, improving the object detection rate requires image enhancement that sharpens object outlines. Data augmentation that varies the brightness (intensity), hue, and other properties of the input images is of course used when training the object detector, but in general detection performance degrades when the range of variation of the objects to be detected is large.

In this embodiment, the image processing unit (200) applies gamma correction and edge enhancement to improve the quality of the input image. The gamma correction is given by Equation 1.

In Equation 1, the four symbols (images in the original publication, not reproduced in this text) denote, in order: the brightness of the input image, the brightness range of the input image, the brightness of the gamma-corrected image, and the brightness range of the corrected image. The R, G, and B channels are each corrected independently.
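Because Equation 1 itself is not reproduced in this text, the sketch below uses the conventional power-law form of gamma correction as a stand-in: the output brightness is the output range multiplied by the normalized input brightness raised to the power gamma. The function names, the 0 to 255 ranges, and the value gamma = 0.5 are assumptions for illustration and are not taken from the patent.

```python
def gamma_correct(value, in_max=255.0, out_max=255.0, gamma=0.5):
    """Conventional power-law gamma correction of one brightness value.

    value   : input brightness in [0, in_max]
    in_max  : brightness range of the input image (assumed 255)
    out_max : brightness range of the corrected image (assumed 255)
    gamma   : exponent; values below 1 brighten dark regions (assumed 0.5)
    """
    return out_max * (value / in_max) ** gamma


def gamma_correct_pixel(pixel, gamma=0.5):
    """Correct the R, G, and B channels independently, as the text describes."""
    return tuple(gamma_correct(channel, gamma=gamma) for channel in pixel)


dark_pixel = (64.0, 32.0, 16.0)
corrected = gamma_correct_pixel(dark_pixel)
print(corrected)
```

With an exponent below 1, dark values are lifted more than bright ones while 0 and the maximum stay fixed, which matches the stated goal of making dimly lit vehicles easier to see.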

When enhancing edges, regions whose edges are already sharp need little enhancement, whereas regions with weak edges need more. The edge enhancement method applied in this embodiment is given by Equation 2.

In Equation 2, the two symbols (images in the original publication, not reproduced in this text) denote, in order, the local mean and the local standard deviation.
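Equation 2 is likewise not reproduced in this text. The sketch below shows one common way to build an edge enhancement from a local mean and a local standard deviation, in which the deviation of a pixel from its local mean is amplified by a gain that shrinks as the local standard deviation grows. The gain formula, the parameters `k` and `max_gain`, and the 3×3 window are assumptions, not the patented formula.

```python
from statistics import fmean, pstdev


def local_gain(std, k=10.0, max_gain=3.0):
    """Gain that is large where local contrast is weak and near 1 where it is strong."""
    return min(max_gain, 1.0 + k / (std + 1.0))


def enhance_edges(img, radius=1, k=10.0, max_gain=3.0):
    """Adaptive contrast enhancement of a grayscale image given as a list of rows."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            mean = fmean(window)   # local mean
            std = pstdev(window)   # local standard deviation
            out[y][x] = mean + local_gain(std, k, max_gain) * (img[y][x] - mean)
    return out


flat = [[5.0] * 3 for _ in range(3)]
weak_edge = [[10.0, 10.0, 12.0] for _ in range(3)]
print(enhance_edges(flat)[1][1])
print(enhance_edges(weak_edge)[1][2])
```

The output is not clipped to a valid brightness range here; a practical version would clamp the result, and each color channel could be processed separately.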

The image whose quality has been improved by gamma correction and edge enhancement in the image processing unit (200) is input to the object detection unit (300).

The object detection unit (300) extracts features from the image received from the image processing unit (200) (the input image) through convolutional neural network-based filtering and downsampling, and from the extracted features may detect information such as the presence or absence of an object, its position, its size, and its heading direction.

The display unit (400) displays the object detection result produced by the object detection unit (300), for example as shown in FIG. 6.

FIGS. 4 and 5 are diagrams schematically illustrating the detailed configuration of the object detection unit of the object detection apparatus of FIG. 2.

Among existing object detectors, YOLOv3 offers markedly better detection performance relative to computation than other methods. YOLOv2, which requires less computation than YOLOv3, performs much worse than YOLOv3 on small objects such as people, but its performance approaches that of YOLOv3 when detecting large objects such as buses.

This embodiment provides a detector specialized for recognizing only vehicles as objects in images acquired through the fisheye lens cameras 100_1 to 100_N installed on the ceiling of a parking lot. It therefore considers YOLOv2, which requires less computation than YOLOv3 while showing no large difference in detection performance for large objects. The features extracted by Darknet-19 are not as rich as those extracted by Darknet-53, but they are sufficient for detecting only vehicles.

When the fisheye lens cameras 100_1 to 100_N installed on the ceiling of a parking lot photograph vehicles on the parking surface below, a vehicle at the center of the image appears large, whereas a vehicle at the periphery of the image appears small. Detecting vehicles of varying sizes requires multi-scale object detection, as in YOLOv3. The detector proposed in this embodiment therefore has a structure that detects objects at multiple scales on top of the Darknet-19 convolutional neural network.

In this embodiment, the fisheye lens cameras 100_1 to 100_N are installed on the ceiling of the parking lot and photograph objects moving beneath them (for example, a driving vehicle), which makes it possible to obtain the moving direction of each object. Existing methods use images that capture the front, back, left, or right of an object from its side, so they cannot obtain the object's moving direction from the image as this embodiment does.

Referring to FIGS. 4 and 5, the object detection unit 300 may include a feature extraction unit 310, a multi-preprocessing unit 320, a multi-classification unit 330, and an NMS processing unit 340.

The feature extraction unit 310 may extract features from the input image through filtering and downsampling. It extracts visual features of the object, such as the outer edges and outer corners of a vehicle, and the multi-classification unit 330 described below uses these features to estimate values such as the vehicle's position, height, direction, and width.

As shown in FIG. 5, the feature extraction unit 310 uses the known Darknet-19 network. It may include 19 convolutional layers, which perform filtering with successive 3×3 and 1×1 convolutions, and 5 max pooling layers, which downsample the image. Batch normalization is applied to the input of each convolutional layer, and Leaky ReLU (a leaky rectified linear unit) is used as the activation function.

Among the 19 convolutional layers, the set of 6th to 8th convolutional layers may be called the third-highest layer (the after-next-higher layer), the set of 9th to 13th convolutional layers the second-highest layer (the next-higher layer, i.e., the next-lower-resolution layer), and the set of 14th to 19th convolutional layers the uppermost layer (the highest layer, i.e., the lowest-resolution layer). The input image of the feature extraction unit 310 consists of 416×416 cells, the third-highest layer of 52×52 cells, the second-highest layer of 26×26 cells, and the uppermost layer of 13×13 cells.

The first convolutional layer of the feature extraction unit 310 convolves the input image, which has a resolution of 416×416, with 32 filters of size 3×3 at a stride of 1 pixel. The features output by the first convolutional layer pass through batch normalization and Leaky ReLU, are downsampled to 208×208 by the first max pooling layer, which applies 2×2 filters at a stride of 2 pixels, and are then input to the second convolutional layer.

The second convolutional layer convolves the 208×208 output of the first max pooling layer with 64 filters of size 3×3 at a stride of 1 pixel. The features output by the second convolutional layer pass through batch normalization and Leaky ReLU, are downsampled to 104×104 by the second max pooling layer, which applies 2×2 filters at a stride of 2 pixels, and are then input to the third convolutional layer.

The third convolutional layer convolves the 104×104 output of the second max pooling layer with 128 filters of size 3×3 at a stride of 1 pixel. The features output by the third convolutional layer pass through batch normalization and Leaky ReLU and are then input to the fourth convolutional layer. Continuing in this manner, the third-highest layer ends up with 52×52 cells, the second-highest layer with 26×26 cells, and the uppermost layer with 13×13 cells.
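
The resolutions quoted above can be checked with a short calculation. The following is a minimal sketch, not the patented implementation: it assumes that each stride-1 convolution preserves the spatial size (i.e., that padding is used) and that the five max pooling layers follow convolutional layers 1, 2, 5, 8, and 13. The first two positions are stated in the text; the last three are inferred from the layer groups and their resolutions. The names `POOL_AFTER_CONV` and `feature_map_sizes` are hypothetical.

```python
# Index of the convolutional layer that each max pooling layer follows.
# Positions 1 and 2 come from the description; 5, 8 and 13 are an assumption
# inferred from the stated layer groups (6-8, 9-13, 14-19) and their sizes.
POOL_AFTER_CONV = (1, 2, 5, 8, 13)


def feature_map_sizes(input_size=416, num_convs=19, pool_after=POOL_AFTER_CONV):
    """Return {conv_index: side length of that convolutional layer's output}."""
    sizes = {}
    size = input_size
    for conv in range(1, num_convs + 1):
        # Assumed: a stride-1 convolution with padding keeps the side length.
        sizes[conv] = size
        if conv in pool_after:
            # 2x2 max pooling applied at a stride of 2 halves each side.
            size //= 2
    return sizes


sizes = feature_map_sizes()
print(sizes[6], sizes[9], sizes[14])  # 52 26 13
```

Under these assumptions the three layer groups come out at 52×52, 26×26, and 13×13 cells, matching the figures given in the description.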

The multi-preprocessing unit 320 merges the features of the second-highest and uppermost layers of the feature extraction unit 310 and outputs the result to the multi-classification unit 330. It likewise merges the features of the third-highest and second-highest layers and inputs the result to the multi-classification unit 330.

The multi-classification unit 330 receives the merged features from the multi-preprocessing unit 320 and represents the detected objects as bounding boxes. It estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the moving direction (θ) of the object, and the probability (Pi) that the object present in the cell belongs to a specific class.

In this embodiment, the multi-preprocessing unit 320 includes a first preprocessor 321 and a second preprocessor 322, and the multi-classification unit 330 includes a first classification unit 331 and a second classification unit 332. The output of the first preprocessor 321 is input to the first classification unit 331, which, as in YOLOv2, can detect six pieces of information from each of the 13×13 cells of the uppermost layer as the object detection result: the center coordinates (x, y), w, h, C, θ, and Pi. The output of the second preprocessor 322 is input to the second classification unit 332, which can detect the same six pieces of information from each of the 26×26 cells of the second-highest layer.
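
As an illustration of what each classification unit produces, the sketch below models the per-cell output as a small record and counts the cells handled by each unit. The class and field names (`CellPrediction`, `confidence`, `theta`, `class_probs`) and the dictionary keys are hypothetical, and the unit in which θ is expressed is not stated in the description, so none is assumed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellPrediction:
    """Values estimated for one cell; names are illustrative only."""
    x: float                  # bounding-box center, horizontal coordinate
    y: float                  # bounding-box center, vertical coordinate
    w: float                  # object width
    h: float                  # object height
    confidence: float         # C: confidence of the object in the box
    theta: float              # moving direction of the object
    class_probs: List[float]  # Pi: one probability per class


# Grid side length handled by each classification unit, from the description.
GRID_SIZES = {"first_classification_unit": 13, "second_classification_unit": 26}


def cells_per_unit(grid_sizes=GRID_SIZES):
    """Number of cells each classification unit evaluates."""
    return {name: side * side for name, side in grid_sizes.items()}


counts = cells_per_unit()
print(counts["first_classification_unit"],
      counts["second_classification_unit"],
      sum(counts.values()))  # 169 676 845
```

The two units together evaluate 845 cells per image; how many bounding boxes each cell predicts is not specified in this passage, so the sketch counts cells only.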

Referring to FIG. 5, the first preprocessor 321 merges the 1024 features of a cell output from the 19th convolutional layer of the uppermost layer with 256 (4×64) features extracted from the 4 cells of the second-highest layer that correspond to that uppermost-layer cell, and outputs the result to the first classification unit 331. The first classification unit 331 passes the 1280 merged features through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the moving direction (θ) of the object, and the probability (Pi) that the object present in the cell belongs to a specific class.

The second preprocessor 322 merges the 1024 features of a cell output from the 13th convolutional layer of the second-highest layer with 256 (4×64) features extracted from the 4 cells of the third-highest layer that correspond to that second-highest-layer cell, and outputs the result to the second classification unit 332. The second classification unit 332 passes the 1280 merged features through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the moving direction (θ) of the object, and the probability (Pi) that the object present in the cell belongs to a specific class.
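
The merging arithmetic (1024 + 4×64 = 1280 features per cell, followed by a 1×1 convolution) can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: it assumes that the 4 finer-layer cells corresponding to coarse cell (row, col) are the 2×2 block starting at (2·row, 2·col), and that each finer-layer cell has already been reduced to 64 features, a step the description does not detail. The helper names `merge_cell_features`, `conv1x1`, and `make_map` are hypothetical, and the toy example uses 2 filters instead of 1024.

```python
def merge_cell_features(coarse, fine, row, col):
    """Concatenate one coarse-layer cell with its 2x2 block of finer-layer cells.

    coarse[r][c] and fine[r][c] are lists of per-cell feature values; the finer
    map is assumed to have twice as many rows and columns as the coarse map.
    """
    merged = list(coarse[row][col])
    for dr in (0, 1):
        for dc in (0, 1):
            merged.extend(fine[2 * row + dr][2 * col + dc])
    return merged


def conv1x1(features, filters):
    """A 1x1 convolution at one cell: one dot product per filter."""
    return [sum(w * f for w, f in zip(filt, features)) for filt in filters]


def make_map(side, channels, fill):
    """Build a side x side feature map with `channels` values per cell."""
    return [[[fill] * channels for _ in range(side)] for _ in range(side)]


top = make_map(13, 1024, 0.0)   # uppermost layer: 13x13 cells, 1024 features
second = make_map(26, 64, 1.0)  # second-highest layer, 64 features per cell (assumed)

merged = merge_cell_features(top, second, row=12, col=12)
print(len(merged))  # 1280

# Two toy 1x1x1280 filters stand in for the 1024 filters of the description.
toy_filters = [[1.0] * 1280, [0.5] * 1280]
print(conv1x1(merged, toy_filters))  # [256.0, 128.0]
```

The same functions apply unchanged to the second preprocessor by passing a 26×26 map as `coarse` and a 52×52 map as `fine`.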

The non-maximum suppression (NMS) processing unit 340 is used to remove overlapping detections (for example, several bounding boxes drawn on a single vehicle) based on the class probabilities (Pi) of the bounding boxes detected by the first classification unit 331 and the second classification unit 332. The NMS processing unit 340 compares the current pixel with its neighboring pixels, keeps it if it is the maximum, and removes it otherwise (non-maximum). The result after the NMS processing unit 340 has removed the overlaps is output to the display unit 400, as shown in FIG. 6.
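
The description states the NMS rule only in general terms. The sketch below shows the greedy, intersection-over-union (IoU) based variant commonly used for bounding boxes, which is a substitute technique chosen for illustration and not necessarily the procedure used by the NMS processing unit 340. The function names, the (score, box) tuple layout, and the 0.5 threshold are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, (cx, cy, w, h)); best score first."""
    kept = []
    for det in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(det[1], k[1]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    (0.9, (50.0, 50.0, 40.0, 20.0)),   # vehicle A, best box
    (0.6, (52.0, 51.0, 40.0, 20.0)),   # vehicle A, duplicate box
    (0.8, (150.0, 80.0, 40.0, 20.0)),  # vehicle B
]
kept = nms(detections)
print([score for score, _ in kept])  # [0.9, 0.8]
```

In this example the duplicate box on vehicle A overlaps the best box with an IoU of about 0.82 and is discarded, leaving one box per vehicle.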

In YOLOv3, a classifier detects objects at each of the uppermost, second-highest, and third-highest layers. The object detector at each layer upsamples the features extracted from the uppermost layer, merges them, and then passes them through convolution filters to extract the features used by the classifier. Because the uppermost-layer features are extracted from a wide region, background outside the limited vehicle region can influence the feature values. The training process could, of course, learn to select only the necessary part of the uppermost-layer features, but this embodiment judges that the costs outweigh the benefits and therefore does not use uppermost-layer features in the classifier of the second-highest layer. The validity of this choice in the proposed method is verified experimentally.

In addition, because the vehicles to be detected are larger than a certain size, this embodiment, unlike YOLOv3, detects objects at only two layers: the uppermost layer and the second-highest layer.

For training the convolutional neural network, the loss function to be minimized is the one used in YOLOv2, as given in Equation 3.

In Equation 3, x, y, w, and h denote the center coordinates (x, y), width (w), and height (h) of the bounding box containing the object. The remaining symbols, which were images in the original and are not reproduced in this text, denote respectively:

the confidence of the object contained in the j-th bounding box of the i-th cell;

the probability that the object present in the i-th cell belongs to a specific class;

the moving direction of the object;

an indicator that is 1 if an object is present in the j-th bounding box of the i-th cell and 0 otherwise; and

an indicator that is 1 if the center of an object lies in the i-th cell and 0 otherwise.

In this embodiment, the value of one further parameter (symbol not reproduced) is set based on the ratio between the number of cells containing an object and the number of cells containing only background in the full training data.

FIG. 7 is a flowchart illustrating an object detection method according to an embodiment of the present invention. In the following description, explanations that duplicate those given for FIGS. 1 to 6 are omitted.

Referring to FIG. 7, in step S710 the object detection apparatus 1 captures images through N fisheye lens cameras (for example, 16) installed on the ceiling of an indoor parking lot.

In step S720, the object detection apparatus 1 improves the quality of the images captured by the fisheye lens cameras by applying gamma correction and edge enhancement.

In step S730, the object detection apparatus 1 extracts image features by applying a deep learning-based convolutional neural network model to the quality-improved images captured by the N fisheye lens cameras. In this embodiment, the object detection apparatus 1 may extract the features using Darknet-19, which includes 19 convolutional layers that perform filtering with successive 3×3 and 1×1 convolutions and 5 max pooling layers that downsample the image.

In step S740, the object detection apparatus 1 generates two preprocessing results. The first merges the 1024 features of a cell output from the 19th convolutional layer of the uppermost layer with 256 (4×64) features extracted from the 4 cells of the second-highest layer that correspond to that uppermost-layer cell. The second merges the 1024 features of a cell output from the 13th convolutional layer of the second-highest layer with 256 (4×64) features extracted from the 4 cells of the third-highest layer that correspond to that second-highest-layer cell.

In step S750, the object detection apparatus 1 generates two classification results. For the first, it passes the first preprocessing result, that is, the 1280 merged features, through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object. For the second, it applies the same processing to the second preprocessing result and estimates the same quantities.

In step S760, the object detection apparatus 1 performs non-maximum suppression (NMS) to remove the portions where multiple bounding boxes overlap on an object detected in the first and second classification results.

The embodiments of the present invention described above may be implemented as a computer program executable through various components on a computer, and such a computer program may be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

The computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer programs include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like.

In the specification of the present invention (particularly in the claims), the term "the" and similar referential terms may refer to both the singular and the plural. In addition, where a range is stated in the present invention, the invention includes embodiments to which each individual value within that range is applied (unless stated otherwise), which is equivalent to listing each individual value constituting the range in the detailed description.

Unless an order is explicitly stated for the steps constituting the method according to the present invention, or there is a statement to the contrary, the steps may be performed in any suitable order. The present invention is not necessarily limited to the order in which the steps are described. All examples and exemplary terms (for example, "etc.") in the present invention are used simply to describe the invention in detail, and the scope of the invention is not limited by them except as limited by the claims. Those skilled in the art will also recognize that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Accordingly, the spirit of the present invention should not be limited to the embodiments described above; not only the claims set out below but also everything equivalent to, or equivalently modified from, those claims falls within the scope of the spirit of the present invention.

1: object detection apparatus
100: image capturing unit
200: image processing unit
300: object detection unit
400: display unit

Claims

1. An object detection apparatus comprising:
a feature extraction unit that extracts features of an image by applying a deep learning-based convolutional neural network model to images captured by a plurality of fisheye lens cameras installed on the ceiling of a parking lot;
a multi-preprocessing unit that merges features extracted from at least one of the uppermost layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and
a multi-classification unit that receives the merged features from the multi-preprocessing unit, represents one or more objects as bounding boxes, and estimates the center coordinates of the bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability (Pi) that the one or more objects represented by the bounding box containing the center of a cell belong to a specific class, and the moving direction of the object,
wherein the feature extraction unit includes 19 convolutional layers that perform filtering with successive 3×3 and 1×1 convolutions and 5 max pooling layers that downsample the image, the set of 6th to 8th convolutional layers among the 19 convolutional layers being the third-highest layer, the set of 9th to 13th convolutional layers being the second-highest layer, and the set of 14th to 19th convolutional layers being the uppermost layer, and
wherein the input image of the feature extraction unit consists of 416×416 cells, the third-highest layer consists of 52×52 cells, the second-highest layer consists of 26×26 cells, the uppermost layer consists of 13×13 cells, and the bounding box is a bordered box with which the multi-classification unit marks the objects detected from the features of the uppermost, second-highest, and third-highest layers merged by the multi-preprocessing unit.

2. (Deleted)

3. The object detection apparatus of claim 1, wherein the multi-preprocessing unit comprises:
a first preprocessor that merges, within the feature extraction unit including the 19 convolutional layers and the 5 max pooling layers, 1024 features of a cell output from the 19th convolutional layer of the uppermost layer with 256 features extracted from the 4 cells of the second-highest layer corresponding to that uppermost-layer cell; and
a second preprocessor that merges 1024 features of a cell output from the 13th convolutional layer of the second-highest layer of the feature extraction unit with 256 features extracted from the 4 cells of the third-highest layer corresponding to that second-highest-layer cell.

4. The object detection apparatus of claim 3, wherein the multi-classification unit comprises:
a first classification unit that estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object; and
a second classification unit that estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object.

5. The object detection apparatus of claim 4, further comprising a non-maximum suppression (NMS) processing unit that removes, from the estimation result of the first classification unit and the estimation result of the second classification unit, the portions where multiple bounding boxes overlap on the object.

6. An object detection method comprising:
extracting, by a feature extraction unit, features of an image by applying a deep learning-based convolutional neural network model to images captured by a plurality of fisheye lens cameras installed on the ceiling of a parking lot;
merging, by a multi-preprocessing unit, features extracted from at least one of the uppermost layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and
receiving, by a multi-classification unit, the merged features from the multi-preprocessing unit, representing one or more objects as bounding boxes, and estimating the center coordinates of the bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability (Pi) that the one or more objects represented by the bounding box containing the center of a cell belong to a specific class, and the moving direction of the object,
wherein the feature extraction unit includes 19 convolutional layers that perform filtering with successive 3×3 and 1×1 convolutions and 5 max pooling layers that downsample the image, the set of 6th to 8th convolutional layers among the 19 convolutional layers being the third-highest layer, the set of 9th to 13th convolutional layers being the second-highest layer, and the set of 14th to 19th convolutional layers being the uppermost layer, and
wherein the input image of the feature extraction unit consists of 416×416 cells, the third-highest layer consists of 52×52 cells, the second-highest layer consists of 26×26 cells, the uppermost layer consists of 13×13 cells, and the bounding box is a bordered box with which the multi-classification unit marks the objects detected from the features of the uppermost, second-highest, and third-highest layers merged by the multi-preprocessing unit.

7. (Deleted)

8. The object detection method of claim 6, wherein the merging comprises:
merging, by a first preprocessor, within the feature extraction unit including the 19 convolutional layers and the 5 max pooling layers, 1024 features of a cell output from the 19th convolutional layer of the uppermost layer with 256 features extracted from the 4 cells of the second-highest layer corresponding to that uppermost-layer cell; and
merging, by a second preprocessor, 1024 features of a cell output from the 13th convolutional layer of the second-highest layer of the feature extraction unit with 256 features extracted from the 4 cells of the third-highest layer corresponding to that second-highest-layer cell.

9. The object detection method of claim 8, wherein the estimating comprises:
estimating, by a first classification unit, the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object; and
estimating, by a second classification unit, the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object.

10. The object detection method of claim 9, further comprising removing, by a non-maximum suppression (NMS) processing unit, the portions where multiple bounding boxes overlap on the object from the estimation result of the first classification unit and the estimation result of the second classification unit.

11. A computer program stored on a computer-readable recording medium for executing, using a computer, the method of any one of claims 6 and 8 to 10.