KR101995523B1

KR101995523B1 - Apparatus and method for object detection with shadow removed

Info

Publication number: KR101995523B1
Application number: KR1020170172513A
Authority: KR
Inventors: 박강령; 김동섭
Original assignee: 동국대학교 산학협력단
Priority date: 2017-12-14
Filing date: 2017-12-14
Publication date: 2019-10-01
Also published as: KR20190071452A

Abstract

The present invention relates to an object detection apparatus and method, and more particularly to an apparatus and method that detect the shadow of an object in a low-resolution image and remove the detected shadow in order to detect the object accurately. By classifying an object and its shadow in a low-resolution image with a convolutional neural network, the invention can accurately detect the object with its shadow removed. In addition, because object detection is performed after converting the low-resolution image into the HSV color model, the invention is robust to changes in the lighting environment.

Description

Object detection apparatus and method {APPARATUS AND METHOD FOR OBJECT DETECTION WITH SHADOW REMOVED}

The present invention relates to an object detection apparatus and method, and more particularly to an apparatus and method that detect the shadow of an object in a low-resolution image and remove the detected shadow in order to detect the object accurately.

As intelligent surveillance systems have become more widespread, technologies for detecting, tracking, and recognizing objects in images captured by surveillance cameras have been actively researched. Most intelligent surveillance systems detect objects in visible-light images, and accurate detection requires separating objects from their shadows precisely. To this end, existing intelligent surveillance systems use near-infrared (NIR) and thermal cameras, which have the drawback of requiring additional illumination or being expensive.

Existing shadow detection algorithms mainly rely on the color features of objects and shadows. However, because real environments involve many variables, it is difficult to detect objects accurately in this way.

Background art related to the present invention is disclosed in Korean Patent Laid-Open Publication No. 2015-01739.

The present invention provides an object detection apparatus and method that detect the shadow of an object in a low-resolution image using a convolutional neural network (CNN) and remove the detected shadow in order to detect the object accurately.

According to one aspect of the present invention, an object detection apparatus is provided.

An object detection apparatus according to an embodiment of the present invention includes: a foreground region detector that receives an input image and detects a foreground region by performing a difference operation between the input image and a background image; a window image generator that, with reference to the detected foreground region, generates window images in pixel units from the approximate object region in the input image; a convolutional neural network learner that trains a convolutional neural network (CNN) using training data constructed from a plurality of window images; and an object detector that resizes each window image to a preset size by interpolation, classifies the resized window image as object or shadow using the trained convolutional neural network, and detects the object region excluding regions classified as shadow.

According to another aspect of the present invention, an object detection method is provided.

An object detection method according to an embodiment of the present invention includes: receiving an input image and detecting a foreground region by performing a difference operation between the input image and a background image; generating, with reference to the detected foreground region, window images in pixel units from the approximate object region in the input image; training a convolutional neural network (CNN) using training data constructed from a plurality of window images; resizing each window image to a preset size by interpolation and classifying the resized window image as object or shadow using the trained convolutional neural network; and displaying, on the input image, the object region excluding regions classified as shadow.

By classifying an object and its shadow in a low-resolution image with a convolutional neural network, the present invention can accurately detect the object with its shadow removed.

In addition, because object detection is performed after converting the low-resolution image into the HSV color model, the present invention is robust to changes in the lighting environment.

FIGS. 1 to 4 are diagrams for describing an object detection apparatus according to an embodiment of the present invention.
FIGS. 5 and 6 are diagrams for describing a method of detecting an object using the object detection apparatus according to an embodiment of the present invention.
FIGS. 7 to 9 are diagrams illustrating the results of classifying objects and shadows with the CNN-based object detection method according to an embodiment of the present invention.
FIGS. 10 and 11 are diagrams showing a performance comparison between existing object detection methods and the object detection method according to an embodiment of the present invention.
FIG. 12 is a diagram comparing the object detection results of the object detection method according to an embodiment of the present invention with open data.

Because the present invention allows for various changes and can have numerous embodiments, particular embodiments are illustrated in the drawings and described in detail in the written description. This is not intended to limit the present invention to specific embodiments; the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the present invention, detailed descriptions of related known technology are omitted where they could unnecessarily obscure the subject matter of the invention. In addition, when a part is said to "include" a certain component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same elements regardless of the figure number.

FIGS. 1 to 4 are diagrams for describing an object detection apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the object detection apparatus 100 includes a foreground region detector 110, a window image generator 120, a convolutional neural network learner 130, and an object detector 140.

The foreground region detector 110 receives an input image and detects a foreground region by performing a difference operation between the input image and a background image. The input image and the background image may be RGB color model images acquired with a visible-light camera.
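The difference operation described above can be illustrated with a minimal sketch. The function name `detect_foreground`, the per-channel maximum difference, and the threshold of 30 are assumptions chosen for illustration; the patent does not specify how the difference is thresholded.

```python
def detect_foreground(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    frame and background are 2-D lists of (R, G, B) tuples of equal size.
    The threshold value is a hypothetical choice, not taken from the patent.
    """
    mask = []
    for row_f, row_b in zip(frame, background):
        mask_row = []
        for (r1, g1, b1), (r2, g2, b2) in zip(row_f, row_b):
            # Largest per-channel absolute difference for this pixel.
            diff = max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2))
            mask_row.append(1 if diff > threshold else 0)
        mask.append(mask_row)
    return mask


# Tiny 2x3 example: one pixel changes relative to a uniform background.
background = [[(10, 10, 10)] * 3 for _ in range(2)]
frame = [row[:] for row in background]
frame[0][1] = (200, 180, 160)
mask = detect_foreground(frame, background)
```

A mask produced this way still contains the object's shadow, which is why the later classification stage is needed.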

With reference to the detected foreground region, the window image generator 120 generates window images in pixel units from the approximate object region in the input image. In doing so, the window image generator 120 converts each window image into the HSV color model through Equation 1 and generates a three-channel window image.

Equation 1 is not reproduced in this text. Its terms are, in order: the i-th window image at coordinates (x, y) in frame k; each channel of F; the S and V values of the input image in frame k; and the V value of the reference image.
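As a rough illustration of the color-model step, the sketch below converts an RGB window to HSV with Python's standard `colorsys` module. This is only the standard RGB-to-HSV conversion; it is not Equation 1 itself, which also involves the V value of the reference image and is not recoverable from this text. The helper name `window_to_hsv` is hypothetical.

```python
import colorsys


def window_to_hsv(window_rgb):
    """Convert a window of 8-bit RGB pixels to (H, S, V) triples in [0, 1]."""
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in window_rgb
    ]


# A pure red pixel and a mid-gray pixel.
window = [[(255, 0, 0), (128, 128, 128)]]
hsv = window_to_hsv(window)
# Red has full saturation and value; gray has zero saturation.
```

Separating brightness (V) from hue and saturation is what makes an HSV representation less sensitive to illumination changes than raw RGB.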

The convolutional neural network learner 130 trains a VGG Net-16 based convolutional neural network (CNN) using training data constructed from a plurality of window images. The learner 130 divides the window images into object data and shadow data based on a ground truth image. Referring to FIG. 2, the ground truth image 220 is an image in which objects and shadows have been manually labeled with reference to the input image 210, with objects marked in blue and shadows in red. With reference to the ground truth image 220, the learner 130 assigns a window image to the object data when the coordinates of its center pixel correspond to an object in the ground truth image, and to the shadow data when the coordinates of its center pixel correspond to a shadow.
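The center-pixel labeling rule can be sketched as follows. The exact RGB values for blue and red, the function name `label_window`, and the choice to return `None` for pixels that are neither object nor shadow are assumptions made for illustration.

```python
OBJECT_COLOR = (0, 0, 255)  # assumed: pure blue marks object pixels
SHADOW_COLOR = (255, 0, 0)  # assumed: pure red marks shadow pixels


def label_window(ground_truth, center_x, center_y):
    """Label a window from the ground-truth pixel at its center."""
    pixel = ground_truth[center_y][center_x]
    if pixel == OBJECT_COLOR:
        return "object"
    if pixel == SHADOW_COLOR:
        return "shadow"
    return None  # assumed: other pixels yield no training sample


# Ground truth with one background, one object, and one shadow pixel per row.
ground_truth = [
    [(0, 0, 0), (0, 0, 255), (0, 0, 0)],
    [(0, 0, 0), (0, 0, 255), (255, 0, 0)],
]
labels = [label_window(ground_truth, x, 1) for x in range(3)]
```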

The convolutional neural network learner 130 resizes the window images, divided into object data and shadow data, to a preset size by bilinear interpolation and uses them as the input of the convolutional neural network. For example, the learner 130 may resize a 21×21×3 pixel window image to 224×224×3 pixels and use it as the input of the convolutional neural network. The structure of the convolutional neural network is described with reference to FIGS. 3 and 4.
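A minimal sketch of bilinear resizing for a single channel is shown below; a three-channel window would apply it to each channel. The corner-aligned coordinate mapping and the name `resize_bilinear` are assumptions, since the patent names the interpolation method but not its sampling convention.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D list of numbers with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        # Map the output row onto the input grid (corners aligned, an assumption).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate horizontally on the two neighboring rows, then vertically.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Upscale one 21x21 channel to the 224x224 network input size.
channel = [[float(x + y) for x in range(21)] for y in range(21)]
resized = resize_bilinear(channel, 224, 224)
```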

Referring to FIG. 3, the convolutional neural network according to an embodiment of the present invention is designed to pass the input image through 13 convolutional layers 310 and 3 hidden layers 320 and finally classify it as object or shadow.

Referring to FIG. 4, the convolutional neural network includes (1) an image input layer 400, (2) five convolutional layer groups 410 to 450, and (3) three hidden layer groups 460 to 480.

(1) The image input layer 400 receives a window image of a preset pixel size. The window image may be a three-channel image in the HSV color model and may be 224×224×3 pixels in size.

(2) The five convolutional layer groups 410 to 450 extract features by applying convolution to the input window image. Each of the five groups may include one or more convolutional layers, and each convolutional layer may be accompanied by a rectified linear unit (ReLU) layer and/or a max pooling layer.

The first feature extraction layer group 410 includes two convolutional layers. The first convolutional layer receives a 224×224×3 pixel image and convolves it with 64 filters of size 3×3×3 at a stride of 1 pixel and a padding of 1. In this case the number of weights per filter is 3×3×3 = 27, and the total number of parameters in the first convolutional layer is (27 + 1) × 64 = 1,792, where the 1 represents the bias. The feature map output from the first convolutional layer has size 224×224×64. The output height (or width) is calculated as (input height (or width) − filter height (or width) + 2 × padding) / stride + 1, which gives 224 (= (224 − 3 + 2 × 1)/1 + 1). The outputs pass through a ReLU layer and are used as the input of the second convolutional layer. The second convolutional layer convolves its input with 64 filters of size 3×3×64 at a stride of 1 pixel and a padding of 1; the result passes through a ReLU layer and is then processed by the first max pooling layer, which applies 2×2 filters at a stride of 2 pixels and downsamples the output to 112×112×64.
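The output-size formula and the parameter count quoted above can be checked with a short sketch. The function names are hypothetical; the arithmetic follows the formula stated in the text.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """(input - filter + 2 * padding) / stride + 1, as given in the text."""
    return (input_size - filter_size + 2 * padding) // stride + 1


def conv_param_count(filter_h, filter_w, in_channels, num_filters):
    """Weights per filter plus one bias, multiplied by the number of filters."""
    return (filter_h * filter_w * in_channels + 1) * num_filters


# First convolutional layer: 224x224x3 input, 64 filters of 3x3x3, stride 1, padding 1.
conv1_size = conv_output_size(224, 3, padding=1, stride=1)
conv1_params = conv_param_count(3, 3, 3, 64)

# First max pooling layer: 2x2 window, stride 2, no padding.
pool1_size = conv_output_size(conv1_size, 2, padding=0, stride=2)
```

The same formula covers the pooling layers, which is why each group halves the spatial size.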

The second feature extraction layer group 420 includes two convolutional layers. The third convolutional layer receives a 112×112×64 pixel image and convolves it with 128 filters of size 3×3×64 at a stride of 1 pixel and a padding of 1. The outputs pass through a ReLU layer and are used as the input of the fourth convolutional layer. The fourth convolutional layer convolves its input with 128 filters of size 3×3×128 at a stride of 1 pixel and a padding of 1; the result passes through a ReLU layer and is then processed by the second max pooling layer, which applies 2×2 filters at a stride of 2 pixels and downsamples the output to 56×56×128.

The third feature extraction layer group 430 includes three convolutional layers. The fifth convolutional layer receives a 56×56×128 pixel image and convolves it with 256 filters of size 3×3×128 at a stride of 1 pixel and a padding of 1. The outputs pass through a ReLU layer and are used as the input of the sixth convolutional layer. The sixth convolutional layer convolves its input with 256 filters of size 3×3×256 at a stride of 1 pixel and a padding of 1, and its outputs pass through a ReLU layer to become the input of the seventh convolutional layer. The seventh convolutional layer convolves its input with 256 filters of size 3×3×256 at a stride of 1 pixel and a padding of 1; the result passes through a ReLU layer and is then processed by the third max pooling layer, which applies 2×2 filters at a stride of 2 pixels and downsamples the output to 28×28×256.

The fourth feature extraction layer group 440 includes three convolutional layers. The eighth convolutional layer receives a 28×28×256 pixel image and convolves it with 512 filters of size 3×3×256 at a stride of 1 pixel and a padding of 1. The outputs pass through a ReLU layer and are used as the input of the ninth convolutional layer. The ninth convolutional layer convolves its input with 512 filters of size 3×3×512 at a stride of 1 pixel and a padding of 1, and its outputs pass through a ReLU layer to become the input of the tenth convolutional layer. The tenth convolutional layer convolves its input with 512 filters of size 3×3×512 at a stride of 1 pixel and a padding of 1; the result passes through a ReLU layer and is then processed by the fourth max pooling layer, which applies 2×2 filters at a stride of 2 pixels and downsamples the output to 14×14×512.

The fifth feature extraction layer group 450 includes three convolutional layers. The eleventh convolutional layer receives a 14×14×512 pixel image and convolves it with 512 filters of size 3×3×512 at a stride of 1 pixel and a padding of 1. The outputs pass through a ReLU layer and are used as the input of the twelfth convolutional layer. The twelfth convolutional layer convolves its input with 512 filters of size 3×3×512 at a stride of 1 pixel and a padding of 1, and its outputs pass through a ReLU layer to become the input of the thirteenth convolutional layer. The thirteenth convolutional layer convolves its input with 512 filters of size 3×3×512 at a stride of 1 pixel and a padding of 1; the result passes through a ReLU layer and is then processed by the fifth max pooling layer, which applies 2×2 filters at a stride of 2 pixels and downsamples the output to 7×7×512.
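To see how the feature-map size evolves across the five groups, the sketch below traces the shape after each group's max pooling layer. The `GROUPS` table (number of convolutional layers and filters per group) follows the standard VGG-16 layout, which is assumed here to match the described network; the names are hypothetical.

```python
# (number of 3x3 convolutional layers, number of filters) per group, per VGG-16.
GROUPS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_shapes(size=224, channels=3):
    """Return the (height, width, channels) shape after each group's pooling."""
    shapes = []
    for num_convs, num_filters in GROUPS:
        for _ in range(num_convs):
            # 3x3 convolution, stride 1, padding 1: spatial size is preserved.
            size = (size - 3 + 2 * 1) // 1 + 1
            channels = num_filters
        # 2x2 max pooling, stride 2: spatial size is halved.
        size = (size - 2) // 2 + 1
        shapes.append((size, size, channels))
    return shapes


shapes = trace_shapes()
total_conv_layers = sum(num_convs for num_convs, _ in GROUPS)
```

The final 7×7×512 feature map is what the first fully connected layer receives.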

(3) The three hidden layer groups 460 to 480 may include a fully connected layer, a ReLU layer, a dropout layer, a softmax layer, and a classification layer.

The first hidden layer group 460 outputs 4096×1 nodes through a first fully connected layer and a ReLU layer.

The second hidden layer group 470 outputs 4096×1 nodes through a second fully connected layer, a ReLU layer, and a dropout layer. The dropout layer is used to prevent overfitting and sets the output of each hidden node to 0 with a preset probability. The preset probability may be 0.5.
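The dropout behavior described above can be sketched in a few lines. The function name and the `rng` parameter are hypothetical. Many frameworks also rescale the surviving outputs during training; the text does not mention rescaling, so this sketch omits it.

```python
import random


def dropout(values, drop_probability=0.5, rng=random):
    """Set each value to 0 with the given probability (training time only)."""
    return [0.0 if rng.random() < drop_probability else v for v in values]


# Apply dropout to four hidden-node outputs with a seeded generator.
hidden = dropout([0.3, 1.2, 0.7, 2.5], drop_probability=0.5, rng=random.Random(0))
```

At inference time the layer would be bypassed so that all node outputs are used.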

The third hidden layer group 480 outputs 2×1 nodes through a third fully connected layer, a softmax layer, and a classification layer, and thereby computes the probability that the image fed to the image input layer 400 is classified as object or shadow.
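The softmax and classification step over the two output nodes can be sketched as follows. The class order in `CLASSES` and the example scores are assumptions for illustration only.

```python
import math

CLASSES = ("object", "shadow")  # assumed output-node order


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable class and the class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


label, probs = classify([2.0, 0.5])
```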

The object detector 140 classifies each input window image as object or shadow using the trained convolutional neural network, detects the object region excluding regions classified as shadow, and displays it on the input image.

FIGS. 5 and 6 are diagrams for describing a method of detecting an object using the object detection apparatus according to an embodiment of the present invention.

Referring to FIG. 5, in step S510 the object detection apparatus receives an input image and detects a foreground region by performing a difference operation between the input image and a background image. The input image and the background image may be RGB color model images acquired with a visible-light camera.

In step S520, the object detection apparatus generates, with reference to the detected foreground region, window images in pixel units from the approximate object region in the input image. Each window image may be converted into the HSV color model through Equation 1 to generate a three-channel window image.

Equation 1 is not reproduced in this text. Its terms are, in order: the i-th window image at coordinates (x, y) in frame k; each channel of F; the S and V values of the input image in frame k; and B, the V value of the reference image.

In step S530, the object detection apparatus trains a convolutional neural network (CNN) using training data constructed from a plurality of window images.

In step S540, the object detection apparatus classifies each input window image as object or shadow using the trained convolutional neural network.

In step S550, the object detection apparatus displays, on the input image, the object region excluding regions classified as shadow.
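The final step, excluding shadow-classified regions and marking what remains, might look like the following sketch. It assumes one classification label per foreground pixel (the center pixel of each window) and uses a bounding box as the displayed region; both choices, and the function names, are illustrative assumptions.

```python
def remove_shadow(foreground_mask, labels):
    """Keep only foreground pixels whose window was classified as 'object'."""
    return [
        [1 if fg and lab == "object" else 0 for fg, lab in zip(mask_row, label_row)]
        for mask_row, label_row in zip(foreground_mask, labels)
    ]


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


foreground = [[0, 1, 1, 0], [0, 1, 1, 1]]
labels = [
    [None, "object", "object", None],
    [None, "object", "shadow", "shadow"],
]
object_mask = remove_shadow(foreground, labels)
box = bounding_box(object_mask)
```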

FIG. 6 shows the result images obtained by performing the steps described above with reference to FIG. 5.

In FIG. 6, (a) is the image input to the object detection apparatus 100, and (b) is the foreground image detected by performing a difference operation between that image and the background image. (c) is the result of generating window images in pixel units from the approximate object region in the input image with reference to the detected foreground image. As shown in (c) of FIG. 6, the object region designated by the window images includes the object's shadow. To detect the object region with the shadow removed, the present invention applies the trained convolutional neural network to the window images and classifies shadow and object as shown in (d) of FIG. 6, thereby detecting the shadow-free object region as shown in (f) of FIG. 6.

To evaluate the performance of the object detection apparatus 100 according to an embodiment of the present invention, training and validation experiments were conducted using 1,802,147 data items captured with a Logitech C600 webcam in a variety of environments, covering day, night, and dawn as well as changes in weather, temperature, and lighting. As shown in Table 1, the data used for training and validation are window images generated by dividing the data into objects and shadows based on the ground truth images, and they were split into two groups in order to perform 2-fold cross validation.

| Window image | Object | Shadow |
|---|---|---|
| Number of data | 900,443 | 901,704 |
| Group 1 | 450,230 | 450,875 |
| Group 2 | 450,213 | 450,829 |
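The 2-fold cross validation over Group 1 and Group 2 can be sketched as below: each group serves once as training data and once as test data. The toy threshold classifier stands in for the CNN purely so the example runs; `train_threshold`, `accuracy`, and the sample values are hypothetical.

```python
def cross_validate(group1, group2, train_fn, evaluate_fn):
    """Train on one group, test on the other, then swap the roles."""
    scores = []
    for train, test in ((group1, group2), (group2, group1)):
        model = train_fn(train)
        scores.append(evaluate_fn(model, test))
    return scores, sum(scores) / len(scores)


def train_threshold(samples):
    """Toy stand-in for training: midpoint between the two class means."""
    obj = [v for v, lab in samples if lab == "object"]
    sha = [v for v, lab in samples if lab == "shadow"]
    return (sum(obj) / len(obj) + sum(sha) / len(sha)) / 2


def accuracy(threshold, samples):
    """Fraction of samples whose thresholded value matches the label."""
    hits = sum(("object" if v > threshold else "shadow") == lab for v, lab in samples)
    return hits / len(samples)


group1 = [(0.9, "object"), (0.8, "object"), (0.2, "shadow"), (0.1, "shadow")]
group2 = [(0.7, "object"), (0.6, "object"), (0.3, "shadow"), (0.4, "shadow")]
scores, mean_score = cross_validate(group1, group2, train_threshold, accuracy)
```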

Training and validation were performed on a computing device with an Intel® Core™ i7-6700 CPU @ 3.40 GHz (4 CPUs), 64 GB of memory, and an NVIDIA GeForce GTX TITAN X graphics card (3,072 CUDA cores) with 12 GB of memory.

The object detection apparatus 100 trained the convolutional neural network model, resizing the 21×21×3 pixel window images to 224×224×3 pixels by bilinear interpolation for use in training and validation. The convolutional neural network model was trained in the manner described above with reference to FIGS. 3 and 4.

FIGS. 7 to 9 are diagrams illustrating the results of classifying objects and shadows with the CNN-based object detection method according to an embodiment of the present invention.

Referring to FIG. 7, the y-axis of the graph represents classification training accuracy and the x-axis represents the training epoch. As for the training epoch, if the training data consist of 10,000 images, then 10 epochs correspond to training on 100,000 images. As FIG. 7 shows, the accuracy of the object detection method converges toward 100% and the loss converges toward 0.

Referring to FIG. 8, when the object detection apparatus 100 performed 2-fold cross validation on the window images according to the object detection method described above, it classified objects and shadows with accuracies of 98.13% and 99.15%.

Referring to FIG. 9, the object detection apparatus 100 showed good performance, averaging 98%, when TPR, PPV, ACC, and F_score were measured using Equations 2 to 5. In Equation 2, #TP, #TN, #FP, and #FN denote the numbers of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), respectively.
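Equations 2 to 5 are not reproduced in this text. The sketch below uses the standard definitions of these four metrics, on the assumption that the patent's equations match them; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute TPR, PPV, ACC, and F_score from confusion-matrix counts."""
    tpr = tp / (tp + fn)                   # true positive rate (recall)
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    f_score = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    return {"TPR": tpr, "PPV": ppv, "ACC": acc, "F_score": f_score}


# Illustrative counts only; these are not results from the patent.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```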

FIGS. 10 and 11 are diagrams showing a performance comparison between existing object detection methods and the object detection method according to an embodiment of the present invention.

Referring to FIG. 10, the object detection method of the present invention shows high accuracy with almost no error compared with the existing methods.

Referring to FIG. 11, when compared with the existing methods on the CAVIAR dataset, the object detection method of the present invention achieved an accuracy of 97.67%, higher than that of the existing methods.

FIG. 12 is a diagram comparing the object detection results of the object detection method according to an embodiment of the present invention with open data.

In FIG. 12, (a) and (b) are the input image provided by the open data and the ground truth image in which objects and shadows are labeled. (c) is the result of classifying objects and shadows in the open-data input image using the object detection method of the present invention. As FIG. 12 shows, the ground truth image (b) of the open data and the result image (c) of the present invention are very similar.

The object detection method according to various embodiments of the present invention may be implemented in the form of program instructions executable by computer means such as various servers. In addition, programs and applications that execute the object detection method according to the present invention may be installed on computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

So far, the present invention has been described with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in a descriptive sense rather than a limiting one. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: object detection apparatus
110: foreground region detection unit
120: window image generation unit
130: convolutional neural network training unit
140: object detection unit

Claims

An object detection apparatus comprising:
a foreground region detection unit that receives an input image and detects a foreground region by performing a difference operation between the input image and a background image;
a window image generation unit that generates, in pixel units, an object region as a window image from the input image with reference to the detected foreground region;
a convolutional neural network training unit that trains a convolutional neural network (CNN) using training data constructed from a plurality of window images; and
an object detection unit that resizes the window image to a predetermined size by interpolation, classifies the resized window image as an object or a shadow using the trained convolutional neural network, and detects an object region excluding the region classified as a shadow,
wherein the window image generation unit converts the window image to an HSV color model using Equation 1 below to generate a three-channel window image,

[Equation 1: formula image not reproduced in the extracted text]

and wherein the symbols of Equation 1, which are likewise not reproduced, denote, in the order given: the x, y coordinates; the i-th window image; each channel of f; the S and V values; and the V value of the reference image.
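
As an illustration of the first two elements of the claim above, the following is a minimal sketch of background differencing followed by cropping a window around the detected foreground. It assumes single-channel images stored as nested lists. The threshold value and the helper names `detect_foreground`, `bounding_window`, and `crop` are hypothetical and are not specified in the patent.

```python
def detect_foreground(image, background, threshold=30):
    """Return a binary mask: 1 where |image - background| exceeds threshold.

    The threshold of 30 is an assumed value for illustration only.
    """
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(img_row, bg_row)]
        for img_row, bg_row in zip(image, background)
    ]


def bounding_window(mask):
    """Return (top, left, bottom, right) enclosing all foreground pixels, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def crop(image, window):
    """Cut the window (inclusive bounds) out of the image."""
    top, left, bottom, right = window
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 4x5 grayscale scene: uniform background with three changed pixels.
background = [[10] * 5 for _ in range(4)]
image = [row[:] for row in background]
image[1][2], image[2][2], image[2][3] = 200, 180, 90

mask = detect_foreground(image, background)
window = bounding_window(mask)
window_image = crop(image, window)
```

In this toy scene the window covers rows 1 to 2 and columns 2 to 3, so the cropped window image contains the changed pixels plus one unchanged background pixel.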

(Deleted)

The object detection apparatus of claim 1,
wherein the convolutional neural network training unit divides the window images included in the training data into object data and shadow data based on a ground truth image, resizes the object data and the shadow data to a predetermined size by interpolation, and uses them as input to the convolutional neural network.
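
The claim above does not state which interpolation is used. As a minimal sketch, assuming bilinear interpolation on a single-channel image, resizing a window image to a fixed size could look like the following. The function name `resize_bilinear` and the 3x3 target size are illustrative choices, not values from the patent.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D list of numbers to out_h x out_w by bilinear interpolation.

    Output corners are aligned with input corners (an assumed convention).
    """
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Map the output column index to a fractional source column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


resized = resize_bilinear([[0, 10], [20, 30]], 3, 3)
```

Resizing every window to one fixed size lets windows of different shapes share the same network input layer.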

The object detection apparatus of claim 1,
wherein the convolutional neural network comprises:
an image input layer that receives a window image of a preset pixel size;
a convolutional layer group that extracts features by applying the convolutional neural network to the input window image; and
a hidden layer group that classifies the window image as an object or a shadow based on the extracted features.

The object detection apparatus of claim 4,
wherein the convolutional neural network includes five convolutional layer groups and three hidden layer groups,
each of the five convolutional layer groups includes one or more convolutional layers, each convolutional layer including a rectified linear unit (ReLU) layer and/or a max pooling layer, and
the three hidden layer groups include a fully connected layer, a rectified linear unit (ReLU) layer, a dropout layer, a softmax layer, and a classification layer.
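
To illustrate only the last two layers named above, the following minimal sketch applies a softmax to two raw class scores and picks the more probable class. The score values, the class order in `CLASSES`, and the function names are assumptions for illustration. The convolutional, fully connected, and dropout layers are not modeled here.

```python
import math

# Assumed class order; the patent does not specify an index assignment.
CLASSES = ("object", "shadow")


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable class label and the class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Hypothetical scores, as might come out of the last fully connected layer.
label, probs = classify([0.1, 3.0])
```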

An object detection method comprising:
receiving an input image and detecting a foreground region by performing a difference operation between the input image and a background image;
generating, in pixel units, an object region as a window image from the input image with reference to the detected foreground region;
training a convolutional neural network (CNN) using training data constructed from a plurality of window images;
resizing the window image to a predetermined size by interpolation, and classifying the resized window image as an object or a shadow using the trained convolutional neural network; and
displaying, on the input image, an object region excluding the region classified as a shadow,
wherein the generating of the object region as a window image converts the window image to an HSV color model using Equation 1 below to generate a three-channel window image,

[Equation 1: formula image not reproduced in the extracted text]

and wherein the symbols of Equation 1, which are likewise not reproduced, denote, in the order given: the x, y coordinates; the i-th window image; each channel of f; the S and V values; and the V value of the reference image.
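
Equation 1 is not reproduced in this text, so the following sketch does not implement it. It shows only the RGB to HSV conversion that the claim relies on, and how the S and V channels can be read out per pixel. It uses Python's standard `colorsys` module and assumes 8-bit RGB tuples; the helper names `to_hsv` and `sv_channels` are hypothetical.

```python
import colorsys


def to_hsv(image_rgb):
    """Convert rows of (r, g, b) tuples in 0..255 to rows of (h, s, v) in 0..1."""
    return [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in row]
        for row in image_rgb
    ]


def sv_channels(image_rgb):
    """Return the S and V channels of an RGB image as two 2-D lists."""
    hsv = to_hsv(image_rgb)
    s = [[pixel[1] for pixel in row] for row in hsv]
    v = [[pixel[2] for pixel in row] for row in hsv]
    return s, v


# One-row toy image: a pure red pixel and a black pixel.
s, v = sv_channels([[(255, 0, 0), (0, 0, 0)]])
```

The same `sv_channels` call could be applied to the reference image to obtain its V values. How Equation 1 combines these channels into the three-channel window image is not recoverable from this text.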

(Deleted)

The object detection method of claim 6,
wherein the training of the convolutional neural network (CNN) using the training data constructed from the plurality of window images
comprises dividing the window images included in the training data into object data and shadow data based on a ground truth image, resizing them to a predetermined size by interpolation, and using them as input to the convolutional neural network.

A computer program stored on a computer-readable recording medium for executing the object detection method of claim 6 or claim 8.