KR20200119369A

KR20200119369A - Apparatus and method for detecting object

Info

Publication number: KR20200119369A
Application number: KR1020190032714A
Authority: KR
Inventors: 김재민
Original assignee: 홍익대학교 산학협력단; (주)파크인서울
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2020-10-20
Also published as: KR102314520B1

Abstract

The present invention relates to a deep learning-based apparatus and method for detecting objects in real time, specialized in extracting vehicle information in order to detect parked vehicles and track moving vehicles in images acquired through fisheye lens cameras installed on the ceiling of a parking lot. According to an embodiment of the present invention, the apparatus for detecting an object comprises: a feature extraction unit that applies a deep learning-based convolutional neural network model to images captured through a plurality of fisheye lens cameras provided on the ceiling of a parking lot to extract features of the images; a multi-preprocessing unit that merges features extracted from one or more of the highest layer, second-highest layer, and third-highest layer of the feature extraction unit; and a multi-classification unit that receives the merged features from the multi-preprocessing unit, represents one or more objects with bounding boxes, and estimates the center coordinates of each bounding box, the width and height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the moving direction of the object.

Description

Apparatus and method for detecting object

The present invention relates to a deep learning-based real-time object detection apparatus and method specialized in extracting vehicle information in order to detect parked vehicles and track moving vehicles in images acquired through a fisheye lens camera installed on the ceiling of a parking lot.

Object recognition based on a convolutional neural network must extract rich features if it is to recognize a wide variety of objects in a wide variety of environments. Extracting rich features requires a very large number of convolution layers, and each convolution layer must also use a large number of filters. The drawback is that this requires a large amount of computation.

The background art described above is technical information that the inventor possessed for deriving the present invention or acquired in the course of deriving it, and is not necessarily prior art disclosed to the general public before the filing of the present application.

Korean Patent Laid-Open Publication No. 2018-0136720

The present invention has been devised to solve the above-described problems and/or limitations. An object of the present invention according to one aspect is to achieve a high detection rate with a small amount of computation by developing a deep learning-based real-time object detection technology specialized in extracting vehicle information, in order to detect parked vehicles and track moving vehicles in images acquired through a fisheye lens camera installed on the ceiling of a parking lot.

An object detection apparatus according to an embodiment of the present invention may include: a feature extraction unit that extracts features of an image by applying a deep learning-based convolutional neural network model to the image captured through a plurality of fisheye lens cameras provided on the ceiling of a parking lot; a multi-preprocessing unit that merges features extracted from one or more of the highest layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and a multi-classification unit that receives the merged features from the multi-preprocessing unit, represents one or more objects as bounding boxes, and estimates the center coordinates of each bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the moving direction of the object.

The feature extraction unit may include darknet-19, which comprises 19 convolution layers that filter the image output from the fisheye lens camera using successive 3×3 and 1×1 convolutions, and five max pooling layers that downsample the image using 2×2 filters.

The multi-preprocessing unit may include: a first preprocessing unit that merges the 1024 features of a cell output from the 19th convolution layer, which is the highest layer of the feature extraction unit comprising 19 convolution layers and five max pooling layers, with 256 features extracted from the four cells of the second-highest layer that correspond to that highest-layer cell; and a second preprocessing unit that merges the 1024 features of a cell output from the 13th convolution layer, which is the second-highest layer of the feature extraction unit, with 256 features extracted from the four cells of the third-highest layer that correspond to that second-highest-layer cell.
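The merge performed by each preprocessing unit can be pictured as a space-to-depth rearrangement followed by a per-cell concatenation, similar to the passthrough layer of YOLOv2. The following is a minimal sketch under stated assumptions, not the patented implementation: the grid sizes (13×13 and 26×26) and the figure of 64 features per lower-layer cell (so that four cells give 256) are assumptions chosen to match the 1024 + 256 = 1280 arithmetic, and the names `space_to_depth` and `merge_features` are hypothetical.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C map (nested lists) into (H/block) x (W/block) x (C*block*block).

    Each output cell stacks the feature vectors of the block x block lower-layer
    cells that cover the same image area.
    """
    h, w = len(fmap), len(fmap[0])
    assert h % block == 0 and w % block == 0
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


def merge_features(upper, lower_rearranged):
    """Concatenate the per-cell feature vectors of two maps with the same grid size."""
    assert len(upper) == len(lower_rearranged)
    assert len(upper[0]) == len(lower_rearranged[0])
    return [[u + l for u, l in zip(urow, lrow)]
            for urow, lrow in zip(upper, lower_rearranged)]


# Assumed shapes: 13x13 cells with 1024 features, 26x26 cells with 64 features.
upper = [[[0.0] * 1024 for _ in range(13)] for _ in range(13)]
lower = [[[1.0] * 64 for _ in range(26)] for _ in range(26)]
merged = merge_features(upper, space_to_depth(lower))
```

Under these assumptions every merged cell carries 1280 features, which is the input size stated for the classification units.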

The multi-classification unit may include: a first classification unit that passes the 1280 features merged by the first preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object; and a second classification unit that passes the 1280 features merged by the second preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the same quantities (x, y), (w), (h), (C), (Pi), and (θ).
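A 1×1 convolution acts on each cell independently: every filter is a dot product over that cell's channels. The sketch below illustrates this with toy sizes and also computes the weight count implied by 1024 filters of size 1×1×1280. The function name `conv1x1` and the bias term are assumptions for illustration; how the 1024 resulting features are mapped to (x, y, w, h, C, Pi, θ) is not specified in this passage and is not sketched here.

```python
def conv1x1(cell_features, filters, biases):
    """Apply 1x1 convolution filters to one cell.

    cell_features: list of C channel values for a single grid cell.
    filters: list of K filters, each a list of C weights.
    biases: list of K bias values (assumed; the text does not mention biases).
    Returns the K output features for that cell.
    """
    return [sum(w * x for w, x in zip(f, cell_features)) + b
            for f, b in zip(filters, biases)]


# Toy example: 3 input channels, 2 filters.
toy_out = conv1x1([1.0, 2.0, 3.0],
                  [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                  [0.0, 0.5])

# Weight count for the sizes stated in the text: 1024 filters of size 1x1x1280.
weights_per_classifier = 1024 * 1280
```

With the stated sizes, each classification unit would hold 1,310,720 filter weights before any bias terms.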

The apparatus may further include a non-maximum suppression (NMS) processing unit that removes overlapping bounding boxes on the same object from the estimation results of the first classification unit and the second classification unit.
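Non-maximum suppression is commonly implemented as a greedy procedure: sort detections by confidence, keep the best one, and discard any remaining box whose overlap with a kept box exceeds a threshold. The following is a minimal sketch of that common form, not necessarily the variant used in the patent. The tuple layout `(cx, cy, w, h, confidence)`, the IoU threshold of 0.5, and the idea of simply concatenating the detections of both classification units before suppression are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h, ...)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-confidence box, drop boxes that overlap it too much."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections from the two classification units, pooled together.
first_unit = [(10.0, 10.0, 4.0, 4.0, 0.9), (30.0, 30.0, 4.0, 4.0, 0.7)]
second_unit = [(10.5, 10.0, 4.0, 4.0, 0.8)]
final = nms(first_unit + second_unit)
```

In this example the second unit's box overlaps the first unit's highest-confidence box heavily and is suppressed, leaving two detections.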

An object detection method according to an embodiment of the present invention may include: extracting, by a feature extraction unit, features of an image by applying a deep learning-based convolutional neural network model to the image captured through a plurality of fisheye lens cameras provided on the ceiling of a parking lot; merging, by a multi-preprocessing unit, features extracted from one or more of the highest layer, the second-highest layer, and the third-highest layer of the feature extraction unit; and, by a multi-classification unit, receiving the merged features from the multi-preprocessing unit, representing one or more objects as bounding boxes, and estimating the center coordinates of each bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the moving direction of the object.

The extracting of the features may include extracting the features using darknet-19, which comprises 19 convolution layers that filter the image output from the fisheye lens camera using successive 3×3 and 1×1 convolutions, and five max pooling layers that downsample the image using 2×2 filters.

The merging may include: merging, by a first preprocessing unit, the 1024 features of a cell output from the 19th convolution layer, which is the highest layer of the feature extraction unit comprising 19 convolution layers and five max pooling layers, with 256 features extracted from the four cells of the second-highest layer that correspond to that highest-layer cell; and merging, by a second preprocessing unit, the 1024 features of a cell output from the 13th convolution layer, which is the second-highest layer of the feature extraction unit, with 256 features extracted from the four cells of the third-highest layer that correspond to that second-highest-layer cell.

The estimating may include: by a first classification unit, passing the 1280 features merged by the first preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimating the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object; and, by a second classification unit, passing the 1280 features merged by the second preprocessing unit through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimating the same quantities (x, y), (w), (h), (C), (Pi), and (θ).

The method may further include removing, by a non-maximum suppression (NMS) processing unit, overlapping bounding boxes on the same object from the estimation results of the first classification unit and the second classification unit.

In addition, other methods and other systems for implementing the present invention, and a computer program for executing the method, may be further provided.

Other aspects, features, and advantages besides those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to the embodiments, a high detection rate can be achieved with a small amount of computation by developing a deep learning-based real-time object detection technology specialized in recognizing vehicles in images acquired through a fisheye lens camera installed on the ceiling of a parking lot. In particular, detection performance similar to that of YOLOv3, which has the best performance relative to computation among previously published detectors, can be achieved with about 70% of its computation.

In addition, recognizing the moving direction of a vehicle increases the reliability of vehicle tracking within the parking lot.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram showing the performance versus the amount of computation of various object detectors based on convolutional neural networks.
FIG. 2 is a diagram schematically illustrating the configuration of an object detection apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the image quality improvement performed by the image processing unit of the object detection apparatus of FIG. 2.
FIGS. 4 and 5 are diagrams schematically illustrating the detailed configuration of the object detection unit of the object detection apparatus of FIG. 2.
FIG. 6 is a diagram showing an object detection result produced by the object detection apparatus of FIG. 2.
FIG. 7 is a flowchart illustrating an object detection method according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments presented below; it may be implemented in various different forms and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. The embodiments presented below are provided so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention belongs is fully informed of the scope of the invention. In describing the present invention, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the present invention.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. In the description with reference to the drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

With advances in deep learning techniques and the ability to collect large volumes of image data, technology for recognizing objects in images has reached high recognition performance. Among deep learning-based object detection methods, the Faster R-CNN method proposed by Ren et al. offers fast detection and high performance. It extracts features with a pyramid-structured convolutional neural network and, while sliding k anchor boxes over the image, determines whether an object is present inside each anchor box. The presence of an object is judged from the features that the neural network extracts within the anchor box region. Finally, the method computes the size and position of the region containing the object, the confidence of the object in that region, and the probability that the object belongs to a specific class. However, this method requires too much computation to run in real time.

Redmon et al., instead of moving anchor boxes to find object regions, used a fully connected network to compute, for b bounding boxes in each cell of the top layer (resolution: 7×7) of the convolutional neural network, the position and size of the object region, the confidence that the bounding box contains an object, and the probability that the contained object belongs to a specific class. Because this method performs object detection only once on a fixed set of 7×7×b bounding boxes rather than moving anchor boxes, it is called YOLO (you only look once). It requires very little computation and makes real-time object detection feasible. However, its detection performance on small objects is low.

Liu et al. proposed a method that can also detect small objects effectively. It is called SSD (single-shot detector) because it processes the position and size of an object and the per-class probabilities all at once within the bounding box. The method implements a multi-scale detector that detects objects at several layers of the convolutional neural network, from a high-resolution layer down to the lowest-resolution layer. It achieves a high detection rate with an amount of computation similar to that of YOLO.

Redmon et al. built a convolutional neural network for feature extraction from 19 convolution layers, consisting of successive 3×3 and 1×1 convolutions, and five max pooling layers, raised the resolution of the top layer to 13×13, and named this network darknet-19. They proposed YOLOv2, in which the classifier for object detection is a convolutional neural network (CNN) rather than a fully connected network. This method achieves better detection performance than SSD with less computation than YOLO.

He et al. proposed a convolutional neural network composed of very deep layers that can extract the rich features required for detecting thousands of kinds of objects; it is called ResNet. Its structure allows very deep layers to be trained stably, and it shows high performance in tasks such as object detection and recognition. However, the amount of computation grows as the network becomes deeper.

Lin et al. identified a problem in anchor-box-based object detection: when the neural network is trained, the number of object samples is very small compared to the number of background samples, which degrades the performance of the trained detector. They proposed a training method that compensates for this imbalance. The resulting detector is called RetinaNet.

To extract richer features, Redmon et al. used 53 convolution layers in a hybrid of darknet-19 and ResNet, called darknet-53. darknet-53 requires more computation than darknet-19 but less than ResNet. Redmon et al. extracted features with darknet-53 and, to detect objects of various sizes, built classifiers for object detection at three layers: the highest, the second-highest, and the third-highest. This method is called YOLOv3.

FIG. 1 is a diagram showing the performance versus the amount of computation of various prior-art object detectors based on convolutional neural networks. Referring to FIG. 1, YOLOv3 offers far better detection performance relative to computation than the other methods.

Such deep learning-based object recognition technologies have the limitation that their performance drops when they attempt to recognize a wide variety of objects in a wide variety of environments. However, if the objects to be recognized, the viewpoint and type of the camera that photographs them, and the surrounding environment are all specialized, object detection with a high detection rate becomes possible. The present embodiment proposes a deep learning-based real-time object detection technology specialized in detecting vehicles in images acquired through a fisheye lens camera installed on the ceiling of an underground parking lot.

FIG. 2 is a diagram schematically illustrating the configuration of an object detection apparatus according to an embodiment of the present invention. Referring to FIG. 2, the object detection apparatus 1 may include an image capture unit 100, an image processing unit 200, an object detection unit 300, and a display unit 400.

The image capture unit 100 may capture images through N fisheye lens cameras (100_1 to 100_N) (for example, 16) installed on the ceiling of an indoor parking lot. In the present embodiment, each of the fisheye lens cameras (100_1 to 100_N) captures an image by collecting light incident within a viewing angle of 150 degrees or more and converting it into an electrical signal.

The image processing unit 200 may improve the quality of the images captured by the image capture unit 100. Even when lighting is distributed evenly inside an indoor parking lot, a vehicle parked in a corner of the building is poorly lit, so the outline of the rear of the object is hard to make out, as in FIG. 3(a). When an image containing an object is fed to the convolutional neural network-based object detection unit 300, visualizing the outputs of the network's intermediate layers shows that the edges of the object's outline are emphasized. In other words, improving the object detection rate requires image enhancement that sharpens the outline of the object. Of course, when the object detector is trained, data augmentation that varies the brightness (intensity), hue, and so on of the input image is used, but in general, detection performance degrades when the range of variation of the objects to be detected is large.

In the present embodiment, the image processing unit 200 applies gamma correction and edge enhancement to improve the quality of the input image. Gamma correction is given by Equation 1.

Equation 1 relates four quantities: the brightness of the input image, the brightness range of the input image, the brightness of the gamma-corrected image, and the brightness range of the corrected image. The R, G, and B channels are each corrected independently.
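Equation 1 itself is not reproduced in this text. A common form of gamma correction that uses exactly the four quantities named above scales the input brightness to [0, 1] by the input range, raises it to a power γ, and rescales by the output range. The sketch below assumes that form; the function names, the 0 to 255 ranges, and the value γ = 0.5 are hypothetical and are not taken from the patent.

```python
def gamma_correct(value, in_range=255.0, out_range=255.0, gamma=0.5):
    """Assumed form: out = out_range * (value / in_range) ** gamma.

    With gamma < 1, dark values are brightened more than bright values.
    """
    return out_range * (value / in_range) ** gamma


def gamma_correct_pixel(rgb, gamma=0.5):
    """Correct the R, G, and B channels independently, as the text describes."""
    return tuple(gamma_correct(c, gamma=gamma) for c in rgb)


dark_pixel = (63.75, 63.75, 63.75)
brightened = gamma_correct_pixel(dark_pixel)
```

Under this assumed form, black and full-scale white are left unchanged while mid and dark tones are lifted, which matches the stated goal of making poorly lit vehicles easier to see.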

When enhancing edges, regions where the edges are already distinct need little enhancement, while regions where the edges are weak need strong enhancement. The edge enhancement method applied in the present embodiment is given by Equation 2.

Equation 2 uses two quantities computed over a neighborhood of each pixel: the local mean and the local standard deviation.
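Equation 2 is likewise not reproduced in this text. One well-known family of methods that fits the description (strong enhancement where edges are weak, little where they are distinct, based on local mean and local standard deviation) is adaptive contrast enhancement with a gain inversely proportional to the local standard deviation. The sketch below uses that technique as a stand-in; it is an assumption, not the patent's formula, and `target_std`, `max_gain`, `radius`, and the function names are hypothetical.

```python
import math


def local_stats(img, y, x, radius=1):
    """Mean and standard deviation of the (2*radius+1)^2 window around (y, x), clipped at borders."""
    vals = [img[j][i]
            for j in range(max(0, y - radius), min(len(img), y + radius + 1))
            for i in range(max(0, x - radius), min(len(img[0]), x + radius + 1))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, math.sqrt(var)


def enhance_edges(img, target_std=20.0, max_gain=4.0, radius=1):
    """Amplify each pixel's deviation from its local mean; the gain shrinks as local contrast grows."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            mean, std = local_stats(img, y, x, radius)
            gain = min(max_gain, max(1.0, target_std / (std + 1e-6)))
            value = mean + gain * (img[y][x] - mean)
            row.append(min(255.0, max(0.0, value)))
        out.append(row)
    return out


# A weak vertical edge (step of 10) and a strong one (step of 200), grayscale.
weak = [[100.0, 100.0, 110.0, 110.0] for _ in range(4)]
strong = [[0.0, 0.0, 200.0, 200.0] for _ in range(4)]
weak_out = enhance_edges(weak)
strong_out = enhance_edges(strong)
```

In this sketch the weak step is widened while the strong step is left essentially unchanged, which is the behavior the paragraph above asks of the edge enhancement.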

The image whose quality has been improved by gamma correction and edge enhancement in the image processing unit 200 is input to the object detection unit 300.

The object detection unit 300 extracts features from the image received from the image processing unit 200 (the input image) through convolutional neural network-based filtering and downsampling, and from the extracted features detects information such as the presence or absence of an object, its position, its size, and its moving direction.

The display unit 400 displays the object detection result produced by the object detection unit 300, for example as shown in FIG. 6.

FIGS. 4 and 5 are diagrams schematically illustrating the detailed configuration of the object detection unit of the object detection apparatus of FIG. 2.

Among existing object detectors, YOLOv3 offers far better detection performance relative to computation than other methods. YOLOv2, which requires less computation than YOLOv3, performs much worse than YOLOv3 on small objects such as people, but its performance approaches that of YOLOv3 on large objects such as buses.

In this embodiment, the detector is specialized for recognizing only vehicles as objects in images acquired through the fisheye lens cameras 100_1 to 100_N installed on the ceiling of a parking lot. It is therefore based on YOLOv2, which requires less computation than YOLOv3 while differing little in detection performance on large objects. The features extracted by darknet-19 are not as rich as those extracted by darknet-53, but they are sufficient for detecting only vehicles.

When vehicles in the parking spaces below are photographed by the fisheye lens cameras 100_1 to 100_N installed on the ceiling of the parking lot, a vehicle at the center of the image appears large, whereas a vehicle near the periphery appears small. Detecting vehicles of such varying sizes requires multi-scale detection, as in YOLOv3. The detector proposed in this embodiment therefore has a structure that detects objects at multiple scales on top of the darknet-19 convolutional neural network.

In this embodiment, the fisheye lens cameras 100_1 to 100_N are installed on the ceiling of the parking lot and photograph objects moving beneath them (for example, a driving vehicle), so that information on the direction in which an object is moving can be obtained. Existing methods use images of the front, rear, left, or right of an object taken from beside it, and therefore cannot obtain the object's moving direction from the image as this embodiment does.

Referring to FIGS. 4 and 5, the object detection unit 300 may include a feature extraction unit 310, a multi-preprocessing unit 320, a multi-classification unit 330, and an NMS processing unit 340.

The feature extraction unit 310 extracts features from the input image through filtering and down-sampling. For example, it extracts visual features of the vehicle such as its outer edges and outer corners, and the multi-classification unit 330 described below uses these features to estimate values such as the vehicle's position, height, direction, and width.

As shown in FIG. 5, the feature extraction unit 310 uses the known darknet-19 network. It may include 19 convolution layers, which perform filtering with successive 3×3 and 1×1 convolutions, and 5 max pooling layers, which down-sample the image. Batch normalization is applied to the input of each convolution layer, and Leaky ReLU (leaky rectified linear unit) is used as the activation function.

Of the 19 convolution layers, the 6th to 8th are referred to as the third-highest layer, the 9th to 13th as the second-highest layer (the next-lower-resolution layer), and the 14th to 19th as the highest layer (the lowest-resolution layer). The input image of the feature extraction unit 310 consists of 416×416 cells, the third-highest layer of 52×52 cells, the second-highest layer of 26×26 cells, and the highest layer of 13×13 cells.

The first convolution layer of the feature extraction unit 310 convolves the 416×416 input image with 32 filters of size 3×3 at a stride of 1 pixel. Its output passes through batch normalization and Leaky ReLU and is then processed by the first max pooling layer, which applies 2×2 filters at a stride of 2 pixels, down-sampling it to 208×208 before it is input to the second convolution layer.

The second convolution layer convolves the 208×208 feature map down-sampled by the first max pooling layer with 64 filters of size 3×3 at a stride of 1 pixel. Its output passes through batch normalization and Leaky ReLU and is then processed by the second max pooling layer, which applies 2×2 filters at a stride of 2 pixels, down-sampling it to 104×104 before it is input to the third convolution layer.

The third convolution layer convolves the 104×104 feature map down-sampled by the second max pooling layer with 128 filters of size 3×3 at a stride of 1 pixel. Its output passes through batch normalization and Leaky ReLU and is then input to the fourth convolution layer. Continuing in this way, the third-highest layer consists of 52×52 cells, the second-highest layer of 26×26 cells, and the highest layer of 13×13 cells.
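The resolutions quoted above follow directly from halving the 416×416 input at each of the five max pooling layers. A minimal sketch of that arithmetic (the function name `feature_map_sizes` is illustrative and does not appear in the source):

```python
def feature_map_sizes(input_size=416, num_pools=5, stride=2):
    """Return the spatial size of the input and after each stride-2 max pooling layer."""
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // stride)
    return sizes


sizes = feature_map_sizes()
print(sizes)  # [416, 208, 104, 52, 26, 13]
```

The last three entries are the grids of the third-highest (52×52), second-highest (26×26), and highest (13×13) layers.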

The multi-preprocessing unit 320 merges the features of the second-highest layer and the highest layer of the feature extraction unit 310 and outputs the result to the multi-classification unit 330. It also merges the features of the third-highest layer and the second-highest layer and outputs that result to the multi-classification unit 330.

The multi-classification unit 330 receives the merged features from the multi-preprocessing unit 320, represents each detected object with a bounding box, and estimates the center coordinates (x, y) of the bounding box, the width (w) and height (h) of the object, the confidence (C) of the object contained in the bounding box, the moving direction (θ) of the object, and the probability (Pi) that the object present in the cell belongs to a specific class.

In this embodiment, the multi-preprocessing unit 320 includes a first preprocessing unit 321 and a second preprocessing unit 322, and the multi-classification unit 330 includes a first classification unit 331 and a second classification unit 332. The output of the first preprocessing unit 321 is input to the first classification unit 331, which, as in YOLOv2, detects six pieces of information for each of the 13×13 cells of the highest layer: the center coordinates (x, y), w, h, C, θ, and Pi. The output of the second preprocessing unit 322 is input to the second classification unit 332, which detects the same six pieces of information for each of the 26×26 cells of the second-highest layer.

Referring to FIG. 5, the first preprocessing unit 321 merges the 1024 features of a cell output from the 19th convolution layer of the highest layer with 256 (4×64) features extracted from the 4 cells of the second-highest layer that correspond to that cell, and outputs the result to the first classification unit 331. The first classification unit 331 passes the 1280 merged features through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) and height (h) of the object, the confidence (C) of the object contained in the bounding box, the moving direction (θ) of the object, and the probability (Pi) that the object present in the cell belongs to a specific class.

The second preprocessing unit 322 merges the 1024 features of a cell output from the 13th convolution layer of the second-highest layer with 256 (4×64) features extracted from the 4 cells of the third-highest layer that correspond to that cell, and outputs the result to the second classification unit 332. The second classification unit 332 passes the 1280 merged features through 1024 convolution filters of size 1×1×1280 and, from the resulting 1024 features, estimates the center coordinates (x, y) of the bounding box, the width (w) and height (h) of the object, the confidence (C) of the object contained in the bounding box, the moving direction (θ) of the object, and the probability (Pi) that the object present in the cell belongs to a specific class.
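The merging performed by the first preprocessing unit can be sketched as a space-to-depth rearrangement followed by a per-cell concatenation. The source states only the counts (1024 + 4×64 = 1280); it does not say how the 64 features per finer-layer cell are obtained, so the sketch assumes the second-highest-layer map has already been reduced to 64 channels. The names `space_to_depth` and `merge_features` and the dummy values are illustrative.

```python
def space_to_depth(fmap, block=2):
    """Stack each block x block group of cells into a single cell's channel list.

    fmap is a list of rows; each row is a list of cells; each cell is a list of channels.
    """
    height, width = len(fmap), len(fmap[0])
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            merged = []
            for di in range(block):
                for dj in range(block):
                    merged.extend(fmap[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


def merge_features(coarse, fine):
    """Concatenate, cell by cell, coarse-layer features with the rearranged fine-layer features."""
    fine_stacked = space_to_depth(fine)
    return [
        [c + f for c, f in zip(coarse_row, fine_row)]
        for coarse_row, fine_row in zip(coarse, fine_stacked)
    ]


# Dummy maps with the sizes given in the text.
coarse = [[[0.0] * 1024 for _ in range(13)] for _ in range(13)]  # highest layer, 13x13x1024
fine = [[[1.0] * 64 for _ in range(26)] for _ in range(26)]      # second-highest layer, 64 channels assumed
merged = merge_features(coarse, fine)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 13 13 1280
```

Each of the 13×13 cells then carries 1280 values, which matches the depth of the 1×1×1280 convolution filters applied by the first classification unit. The second preprocessing unit follows the same pattern one level down, on the 26×26 and 52×52 grids.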

The non-maximum suppression (NMS) processing unit 340 removes overlapping detections (for example, several bounding boxes drawn on a single car) among the bounding boxes detected by the first classification unit 331 and the second classification unit 332, based on the class probability (Pi) that an object is present in each box. The NMS processing unit 340 keeps the current pixel if its value is the maximum compared with the surrounding pixels and removes it otherwise (non-maximum). The result after the NMS processing unit 340 has removed the overlaps is output to the display unit 400, as shown in FIG. 6.
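The source describes NMS only in general terms. A minimal sketch of the widely used greedy, box-based variant is shown below; the intersection-over-union (IoU) criterion, the 0.5 threshold, and the use of axis-aligned boxes that ignore the direction θ are assumptions made for illustration, not details taken from the source.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (score, box) pairs; returns the kept pairs, highest score first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap too much with any already-kept box.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


dets = [
    (0.9, (50.0, 50.0, 40.0, 20.0)),
    (0.6, (52.0, 51.0, 40.0, 20.0)),   # near-duplicate of the first box
    (0.8, (150.0, 80.0, 40.0, 20.0)),  # a different vehicle
]
kept = nms(dets)
print([score for score, _ in kept])  # [0.9, 0.8]
```

In the apparatus described here, the candidate list would combine the outputs of both classification units, so duplicates of one vehicle found at the 13×13 and 26×26 scales compete in the same pass.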

In YOLOv3, a classifier detects objects at each of the highest, second-highest, and third-highest layers. The object detector of each layer up-samples the features extracted from the highest layer, merges them, and passes them through convolution filters to extract the features used by its classifier. Because highest-layer features are extracted from a wide area, background outside the limited vehicle region can affect the feature values. The learning process could in principle select only the needed part of the highest-layer features, but in this embodiment this was judged to do more harm than good, so the highest-layer features are not used in the second-highest-layer classifier. The validity of this approach in the proposed method is verified experimentally.

In addition, because the vehicles to be detected are at least a certain size, this embodiment, unlike YOLOv3, detects objects at only two layers: the highest layer and the second-highest layer.

The loss function minimized when training the convolutional neural network is the one used in YOLOv2, and is given by Equation 3.

In Equation 3, x, y, w, and h denote the center coordinates (x, y), width (w), and height (h) of the bounding box containing the object. The remaining symbols are not reproduced in this text. In order, they denote: the confidence of the object contained in the j-th bounding box of the i-th cell; the probability that the object present in the i-th cell belongs to a specific class; the moving direction of the object; an indicator that is 1 if an object is present in the j-th bounding box of the i-th cell and 0 otherwise; and an indicator that is 1 if the center of an object lies in the i-th cell and 0 otherwise. In this embodiment, the value of one further symbol (also not reproduced here) is set based on the ratio of the number of cells containing an object to the number of cells containing only background in the object training data.

FIG. 7 is a flowchart illustrating an object detection method according to an embodiment of the present invention. Descriptions that overlap with those of FIGS. 1 to 6 are omitted below.

Referring to FIG. 7, in step S710 the object detection apparatus 1 captures images through N fisheye lens cameras (for example, 16) installed on the ceiling of an indoor parking lot.

In step S720, the object detection apparatus 1 improves the quality of the images captured through the fisheye lens cameras by applying gamma correction and edge enhancement.

In step S730, the object detection apparatus 1 extracts image features by applying a deep learning-based convolutional neural network model to the quality-improved images captured through the N fisheye lens cameras. In this embodiment, the object detection apparatus 1 may extract the features using darknet-19, which includes 19 convolution layers that perform filtering with successive 3×3 and 1×1 convolutions and 5 max pooling layers that down-sample the image.

In step S740, the object detection apparatus 1 generates a first preprocessing result and a second preprocessing result. The first preprocessing result merges the 1024 features of a cell output from the 19th convolution layer of the highest layer with 256 (4×64) features extracted from the 4 cells of the second-highest layer that correspond to that cell. The second preprocessing result merges the 1024 features of a cell output from the 13th convolution layer of the second-highest layer with 256 (4×64) features extracted from the 4 cells of the third-highest layer that correspond to that cell.

In step S750, the object detection apparatus 1 generates a first classification result and a second classification result. For the first classification result, the 1280 merged features of the first preprocessing result are passed through 1024 convolution filters of size 1×1×1280, and from the resulting 1024 features the apparatus estimates the center coordinates (x, y) of the bounding box, the width (w) and height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object. The second classification result is generated in the same way from the 1280 merged features of the second preprocessing result.

In step S760, the object detection apparatus 1 performs non-maximum suppression (NMS), which removes overlapping bounding boxes on an object detected from the first classification result and the second classification result.

The embodiments of the present invention described above may be implemented in the form of a computer program that can be executed through various components on a computer, and such a computer program may be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

The computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer programs include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

In the specification of the present invention (particularly in the claims), the term "the" and similar referring terms may cover both the singular and the plural. In addition, where a range is stated in the present invention, it includes inventions to which the individual values within that range are applied (unless stated otherwise), and is equivalent to stating each individual value constituting the range in the detailed description.

Unless an order is explicitly stated for the steps of the method according to the present invention, or stated to the contrary, the steps may be performed in any suitable order. The present invention is not necessarily limited by the order in which the steps are described. The use of any examples or exemplary terms (for example, "etc.") in the present invention is merely to describe the invention in detail, and the scope of the invention is not limited by those examples or terms unless limited by the claims. Those skilled in the art will also appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Therefore, the spirit of the present invention should not be limited to the embodiments described above. Not only the claims below but also all scopes equivalent to, or equivalently modified from, those claims fall within the spirit of the present invention.

1: object detection apparatus
100: image capturing unit
200: image processing unit
300: object detection unit
400: display unit

Claims

An object detection apparatus comprising:
a feature extraction unit that extracts features of an image by applying a deep learning-based convolutional neural network model to the image captured through a plurality of fisheye lens cameras provided on the ceiling of a parking lot;
a multi-preprocessing unit that merges features extracted from at least one of a highest layer, a second-highest layer, and a third-highest layer of the feature extraction unit; and
a multi-classification unit that receives the merged features from the multi-preprocessing unit, represents one or more objects with bounding boxes, and estimates the center coordinates of each bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the moving direction of the object.

The object detection apparatus of claim 1, wherein the feature extraction unit
comprises darknet-19, which processes the image output from the fisheye lens cameras with 19 convolution layers that perform filtering with 3×3 and 1×1 convolutions and five max pooling layers that down-sample the image with 2×2 filters.

The object detection apparatus of claim 1, wherein the multi-preprocessing unit comprises:
a first preprocessing unit that, in the feature extraction unit including 19 convolution layers and 5 max pooling layers, merges 1024 features of a cell output from the 19th convolution layer of the highest layer with 256 features extracted from 4 cells of the second-highest layer corresponding to that highest-layer cell; and
a second preprocessing unit that merges 1024 features of a cell output from the 13th convolution layer of the second-highest layer of the feature extraction unit with 256 features extracted from 4 cells of the third-highest layer corresponding to that second-highest-layer cell.

The object detection apparatus of claim 3, wherein the multi-classification unit comprises:
a first classification unit that, from 1024 features obtained by passing the 1280 features merged by the first preprocessing unit through 1024 convolution filters of size 1×1×1280, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object; and
a second classification unit that, from 1024 features obtained by passing the 1280 features merged by the second preprocessing unit through 1024 convolution filters of size 1×1×1280, estimates the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object.

The object detection apparatus of claim 4,
further comprising a non-maximum suppression (NMS) processing unit that removes overlapping bounding boxes on an object from the estimation result of the first classification unit and the estimation result of the second classification unit.

An object detection method comprising:
extracting, by a feature extraction unit, features of an image by applying a deep learning-based convolutional neural network model to the image captured through a plurality of fisheye lens cameras provided on the ceiling of a parking lot;
merging, by a multi-preprocessing unit, features extracted from at least one of a highest layer, a second-highest layer, and a third-highest layer of the feature extraction unit; and
estimating, by a multi-classification unit that receives the merged features from the multi-preprocessing unit and represents one or more objects with bounding boxes, the center coordinates of each bounding box, the width of the object, the height of the object, the confidence of the object contained in the bounding box, the probability that the object present in a cell belongs to a specific class, and the moving direction of the object.

The method of claim 6, wherein the extracting of the features comprises:
extracting the features using darknet-19, which processes the image output from the fisheye lens cameras with 19 convolution layers that perform filtering with 3×3 and 1×1 convolutions and five max pooling layers that down-sample the image with 2×2 filters.

The method of claim 6, wherein the merging comprises:
merging, by a first preprocessing unit, 1024 features of a cell output from the 19th convolution layer of the highest layer of the feature extraction unit, which includes 19 convolution layers and 5 max pooling layers, with 256 features extracted from 4 cells of the second-highest layer corresponding to that highest-layer cell; and
merging, by a second preprocessing unit, 1024 features of a cell output from the 13th convolution layer of the second-highest layer of the feature extraction unit with 256 features extracted from 4 cells of the third-highest layer corresponding to that second-highest-layer cell.

The method of claim 8, wherein the estimating comprises:
estimating, by a first classification unit, from 1024 features obtained by passing the 1280 features merged by the first preprocessing unit through 1024 convolution filters of size 1×1×1280, the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object; and
estimating, by a second classification unit, from 1024 features obtained by passing the 1280 features merged by the second preprocessing unit through 1024 convolution filters of size 1×1×1280, the center coordinates (x, y) of the bounding box, the width (w) of the object, the height (h) of the object, the confidence (C) of the object contained in the bounding box, the probability (Pi) that the object present in the cell belongs to a specific class, and the moving direction (θ) of the object.

The method of claim 9,
further comprising removing, by a non-maximum suppression (NMS) processing unit, overlapping bounding boxes on an object from the estimation result of the first classification unit and the estimation result of the second classification unit.

A computer program stored in a computer-readable recording medium to execute, using a computer, the method of any one of claims 6 to 10.