KR102103770B1

KR102103770B1 - Apparatus and method for pedestrian detection

Info

Publication number: KR102103770B1
Application number: KR1020180037899A
Authority: KR
Inventors: 박강령; 김종현
Original assignee: 동국대학교 산학협력단
Priority date: 2018-04-02
Filing date: 2018-04-02
Publication date: 2020-04-24
Also published as: KR20190115542A

Abstract

The present invention relates to a technique for detecting pedestrians in an image, and more particularly to a technique for detecting pedestrians in an image using deep learning based on Faster R-CNN. According to an embodiment of the present invention, a feature map is generated for each of a series of consecutive images using R-CNN, and the generated feature maps are then combined, so that pedestrian recognition performance is improved through spatio-temporal summation.

Description

Pedestrian detection device and method {APPARATUS AND METHOD FOR PEDESTRIAN DETECTION}

The present invention relates to a technique for detecting pedestrians in an image, and more particularly to a technique for detecting pedestrians in an image using deep learning based on Faster R-CNN.

With the recent rapid development of driverless vehicles and AI-based surveillance systems, research on accurately detecting human regions in images captured from a long distance has become increasingly important. Previous studies using visible-light cameras mainly addressed detecting people in the daytime, when external light is present. Because detecting people at night, when there is no external light, is difficult, methods that add near-infrared (NIR) illumination and an NIR camera, or that use a thermal camera, have mainly been used. However, NIR illumination is limited in irradiation angle and distance, and the illumination power must be adjusted adaptively depending on whether the subject is near or far. Thermal cameras are still expensive, which makes them difficult to install in a wide range of locations. In view of this, there have been studies on detecting people at night with a visible-light camera, but these mainly targeted indoor environments where the subject is close to the camera.

The background art of the present invention is disclosed in Korean Patent Registration No. 10-1818129.

The present invention provides a pedestrian detection apparatus and method that recognize people with higher accuracy even at night.

According to one aspect of the present invention, a pedestrian detection apparatus is provided.

A pedestrian detection apparatus according to an embodiment of the present invention may include a continuous image input unit that receives a plurality of consecutive images, an image normalization unit that normalizes the received images, and a modified R-CNN unit that classifies pedestrian candidates by applying a machine-learned modified R-CNN to the normalized images.

According to another aspect of the present invention, a pedestrian detection method is provided.

A pedestrian detection method according to an embodiment of the present invention may include: receiving a plurality of consecutive images; normalizing the received images; extracting a spatial feature map of each image by applying a machine-learned modified R-CNN to the normalized images; estimating pedestrian candidates by temporally combining the spatial feature maps extracted from the images; and classifying predicted pedestrians using the spatial feature maps extracted from the images and the estimated pedestrian candidates.

According to an embodiment of the present invention, pedestrians can be detected accurately in nighttime images captured by a visible-light camera.

According to an embodiment of the present invention, a feature map is generated for each of a series of consecutive images using R-CNN, and the generated feature maps are then combined, so that pedestrian recognition performance is improved through spatio-temporal summation.

FIG. 1 is a diagram illustrating a pedestrian detection apparatus according to an embodiment of the present invention.
FIGS. 2 to 7 are diagrams illustrating the modified R-CNN unit of the pedestrian detection apparatus according to an embodiment of the present invention.
FIGS. 8 and 9 are diagrams illustrating a pedestrian detection method according to an embodiment of the present invention.
FIGS. 10 to 13 are diagrams illustrating the machine learning of the pedestrian detection method according to an embodiment of the present invention.
FIGS. 14 to 17 are diagrams illustrating the effects of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the present invention, detailed descriptions of related known technologies are omitted where they could unnecessarily obscure the subject matter of the invention. In addition, singular expressions used in this specification and the claims should generally be interpreted to mean "one or more" unless stated otherwise.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a diagram illustrating a pedestrian detection apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a pedestrian detection apparatus according to an embodiment of the present invention includes a continuous image input unit 110, an image normalization unit 120, a modified R-CNN unit 130, and a pedestrian detection unit 140.

In general, a night image, that is, an image captured at night, has most of its pixel values concentrated at low intensities, so its contrast is low, and because not enough light is captured it is also very noisy. If the camera's exposure value is raised so that enough light is gathered in a short time, noise can be reduced considerably, but motion blur then increases, moving objects such as pedestrians become too blurred to recognize, and the approach is unusable on a moving camera in particular. Consequently, when the exposure value is left unchanged, it is difficult to extract the shape of a pedestrian from information in which true edges (edges actually present in the dark image) are interspersed with false edges (edges produced by noise). To detect pedestrians in night images with these characteristics, the pedestrian detection apparatus according to the present invention gathers the weak nighttime signals into a strong signal through spatio-temporal summation. Specifically, the apparatus uses a CNN to generate deep features from each of the consecutive nighttime video frames (using spatial information) and then combines them at the feature level (using temporal information), which improves the final pedestrian recognition performance.

The continuous image input unit 110 receives a plurality of consecutive images. The continuous image input unit 110 may be a visible-light camera and, depending on the application, may be used as a surveillance camera.

The image normalization unit 120 normalizes the plurality of consecutive images. If the input images were used as they are, their contrast and illuminance would vary widely with the time of capture, so the image normalization unit 120 normalizes them through appropriate preprocessing. More specifically, the illuminance (the mean of the image light intensity) and the contrast (the variance of the image light intensity) of the pixels differ greatly from one input image to another. For example, a night image has relatively low illuminance and contrast compared with a daytime image. If machine learning were performed on such varied input images, the model would have to learn robustness to varied illuminance and contrast in addition to detecting pedestrians that are already hard to discern because of noise, which would make convergence to a suitable model difficult. The image normalization unit 120 therefore preprocesses all input images so that they have similar illuminance and contrast.

The image normalization unit 120 may apply either of the following two preprocessing methods.

The first method is pixel normalization, in which the processing of Equation (1) below is applied to each RGB channel of each input image.

Equation (1)

$$\hat{x}_i = s \cdot \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$

Here, $s$ is the scaling value (255/2), $x_i$ is the pixel value at position $i$ in each channel (R, G, and B) of a mini-batch image, and $N$ is the number of pixels. That is, the pixel values of each channel are converted to a distribution with zero mean and unit standard deviation and then scaled, which normalizes the illuminance and contrast of each input image.
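As an illustration of the pixel normalization described for Equation (1), the following minimal Python sketch normalizes one color channel to zero mean and unit standard deviation and then scales it by 255/2. The function name `normalize_channel` and the guard for a constant channel are assumptions made for this example and are not part of the source.

```python
import math

def normalize_channel(pixels, scale=255 / 2):
    """Zero-mean, unit-std normalization of one channel, followed by scaling.

    pixels: flat list of pixel values of a single channel (R, G, or B).
    scale:  the scaling value; 255/2 is the value named in the text.
    """
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # Assumption: a constant channel (std == 0) is left centered, not divided.
    std = math.sqrt(variance) or 1.0
    return [scale * (p - mean) / std for p in pixels]

# A dark channel: low mean (illuminance) and low variance (contrast).
dark_channel = [10, 20, 30, 40]
normalized = normalize_channel(dark_channel)
```

After this step every channel has the same mean (0) and the same standard deviation (255/2) regardless of how dark the input was, which is the sense in which illuminance and contrast are normalized.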

The second method, histogram equalization (HE) followed by mean subtraction, converts each input image from RGB to HSV and applies histogram equalization to the value (V) channel image. Histogram equalization is applied only to the value channel because a night image can generally be assumed to have low overall brightness while retaining its inherent colors to some extent. Histogram equalization brings the pixel brightness values of each image closer to an approximately uniform distribution, so the image acquires a normalized contrast. The image is then converted back to RGB, and the mean RGB value of the entire image set is subtracted to zero-center the pixel values.
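The second preprocessing method can be sketched in plain Python as follows. This is only an illustrative sketch: the function names, the 256-level quantization of the V channel, and the representation of an image as a flat list of `(r, g, b)` tuples in 0..255 are assumptions of this example, and the per-image mean used at the end merely stands in for the mean RGB value of the entire image set.

```python
import colorsys

def equalize_value_channel(pixels, levels=256):
    """Histogram-equalize only the V channel of an RGB image.

    pixels: list of (r, g, b) tuples with components in 0..255.
    Returns a list of (r, g, b) float tuples in 0..255.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    # Quantize V so that a histogram can be built.
    v_idx = [min(levels - 1, int(v * (levels - 1) + 0.5)) for _, _, v in hsv]
    hist = [0] * levels
    for i in v_idx:
        hist[i] += 1
    # Cumulative distribution of the V histogram.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    denom = max(len(pixels) - cdf_min, 1)
    out = []
    for (h, s, _), i in zip(hsv, v_idx):
        v_eq = (cdf[i] - cdf_min) / denom  # equalized V in 0..1
        r, g, b = colorsys.hsv_to_rgb(h, s, v_eq)  # hue and saturation kept
        out.append((r * 255, g * 255, b * 255))
    return out

def subtract_mean(pixels, mean_rgb):
    """Zero-center pixels by subtracting a per-channel mean."""
    return [tuple(c - m for c, m in zip(p, mean_rgb)) for p in pixels]

dark = [(10, 10, 10), (20, 20, 20), (30, 30, 30), (40, 40, 40)]
equalized = equalize_value_channel(dark)
# Stand-in for the mean RGB value of the whole image set.
set_mean = tuple(sum(p[c] for p in equalized) / len(equalized) for c in range(3))
centered = subtract_mean(equalized, set_mean)
```

In this toy input the brightness values 10 to 40 are stretched over the full 0 to 255 range, while hue and saturation are left untouched.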

The modified R-CNN unit 130 extracts a spatial feature map of each image by applying the machine-learned modified R-CNN to the normalized consecutive images. It then estimates and classifies pedestrian candidates by temporally combining the spatial feature maps extracted from the images. The modified R-CNN unit 130 is described in more detail below with reference to FIGS. 2 to 7.

The pedestrian detection unit 140 detects pedestrians from the identified pedestrian candidates.

FIGS. 2 to 7 are diagrams illustrating the modified R-CNN unit of the pedestrian detection apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the modified R-CNN unit 130 of the pedestrian detection apparatus includes a feature map extraction unit 132, a feature map combining unit 134, a classifier unit 136, and a deep learning unit 138.

In Faster R-CNN, the final convolutional layer of the feature map extractor has a low feature resolution, so the ROI pooling layer produces plain, non-discriminative features, which makes small objects in particular difficult to detect. To solve this problem, the modified R-CNN unit 130 according to the present invention additionally removes the fourth max pooling layer to increase the resolution of the final feature map.

The feature map extraction unit 132 extracts a spatial feature map of each image by applying the machine-learned modified R-CNN to the normalized consecutive images.

Referring to FIG. 3, the feature map extraction unit 132 receives an image containing a plurality of pedestrians and extracts a feature map by passing it through 13 convolutional layers with ReLU activations and three max pooling layers. The feature map extraction unit 132 uses the layers of the VGG-Net 16 network that precede the last max pooling layer as the feature extraction network and, as described above, additionally removes the fourth max pooling layer to raise the resolution of the final feature map and so address the low-resolution feature map problem. As shown in Table 1 of FIG. 3, max pooling with a stride of 2×2 is performed three times in total, so the overall stride of the final feature extraction network is 8×8, that is, (2×2)^3. The extracted feature map is used as the input to the feature map combining unit 134 of FIG. 4 and the classifier unit 136 of FIG. 6.
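The stride arithmetic above can be checked with a short Python sketch. The helper `feature_map_size` is hypothetical, and rounding up at each size (ceil) is an assumption chosen because it reproduces the 90×113 feature map that the text reports for a 720×900 input.

```python
import math

def feature_map_size(height, width, num_pools=3, pool_stride=2):
    """Spatial size of the final feature map after `num_pools` poolings.

    Assumes the convolutions preserve spatial size and that each pooling
    halves it, rounding up (an assumption of this sketch).
    """
    stride = pool_stride ** num_pools  # overall stride, e.g. 2**3 = 8
    return math.ceil(height / stride), math.ceil(width / stride), stride

# Three max pooling layers (fourth removed): overall stride 8.
size_three_pools = feature_map_size(720, 900, num_pools=3)
# Four max pooling layers, for comparison: overall stride 16.
size_four_pools = feature_map_size(720, 900, num_pools=4)
```

Under these assumptions, removing the fourth pooling layer doubles the feature map resolution in each direction, which is what makes small pedestrians easier to separate after ROI pooling.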

The feature map combining unit 134 estimates pedestrian candidates by temporally combining the spatial feature maps extracted from the images.

The feature map combining unit 134 is a fully convolutional network. Through its final 1×1 convolutional layer, it obtains pedestrian probabilities and bounding box regression vectors for a total of nine anchor boxes centered on the image location corresponding to each position of the feature map. To match the shape of a pedestrian, the feature map combining unit 134 uses nine vertically elongated anchor boxes that share the same height-to-width ratio (for example, 7:3) but differ in scale. The bounding box regression vector $(t_x, t_y, t_w, t_h)$ parameterizes the transformation between an anchor box and a proposal box (the bounding box predicted to enclose a pedestrian), as in Equations (2) and (3) below.

Equation (2)

$$t_x = \frac{x - x_a}{w_a}, \qquad t_y = \frac{y - y_a}{h_a}$$

Equation (3)

$$t_w = \log\frac{w}{w_a}, \qquad t_h = \log\frac{h}{h_a}$$

Here, $x$, $y$, $w$, and $h$ denote the center coordinates $(x, y)$, the width, and the height of a box. In addition, $x$ and $x_a$ denote the center x-coordinate of the proposal box and of the anchor box, respectively (likewise for $y$, $w$, and $h$).

Equation (2) expresses a scale-invariant translation between the center coordinates, and Equation (3) expresses a log-space translation between the widths and heights.
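The parameterization described for Equations (2) and (3) can be sketched in Python as a pair of encode and decode functions. The equation images are not preserved in this text, so the sketch assumes the standard Faster R-CNN form, which matches the description (center offsets divided by the anchor size, log ratios for width and height). The names `encode_box` and `decode_box` and the `(x, y, w, h)` center format are choices made for this example.

```python
import math

def encode_box(box, anchor):
    """Regression vector (tx, ty, tw, th) that maps `anchor` onto `box`.

    Both boxes are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa,        # scale-invariant center offset in x
            (y - ya) / ha,        # scale-invariant center offset in y
            math.log(w / wa),     # log-space width change
            math.log(h / ha))     # log-space height change

def decode_box(t, anchor):
    """Inverse of encode_box: apply a regression vector to an anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 200.0, 30.0, 70.0)    # vertically elongated anchor (3:7)
proposal = (106.0, 193.0, 36.0, 84.0)  # box predicted to enclose a pedestrian
t = encode_box(proposal, anchor)
recovered = decode_box(t, anchor)
```

Because the offsets are divided by the anchor size and the size changes are expressed as log ratios, small and large anchors produce regression targets of comparable magnitude.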

Therefore, the network can be trained on targets (bounding box regression vectors) of similar scale, and once a bounding box regression vector is obtained, the anchor box $(x_a, y_a, w_a, h_a)$ can be converted into a refined proposal box $(x, y, w, h)$. Finally, non-maximum suppression (NMS) is performed using the pedestrian probability of each proposal box: proposal boxes that overlap a higher-probability box with an intersection over union (IoU) at or above a threshold are removed, and the top N proposal boxes by pedestrian probability (N is a natural number, e.g., 100) are selected.
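The NMS step can be sketched in Python as a greedy procedure. This is an illustrative sketch rather than the patented implementation: the corner format `(x1, y1, x2, y2)`, the IoU threshold of 0.7, and the function names are assumptions; only the top-N value of 100 comes from the example in the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7, top_n=100):
    """Greedy NMS: keep high-score boxes, drop lower-score overlapping ones.

    Returns the indices of at most `top_n` kept boxes, best score first.
    The threshold value is an assumption; the text only says "a threshold".
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box.
        if all(iou(boxes[i], boxes[k]) < iou_threshold for k in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 10, 20), (1, 1, 11, 21), (50, 50, 60, 70)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.75 and is suppressed, while the distant third box survives.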

A single night image has low brightness and contrast and therefore contains extremely little information about the target region, so detection performance can be improved by combining the information of several consecutive images. True edges of the target appear almost identically in consecutive frames, with only slight changes in position, whereas false edges caused by random noise change from frame to frame, so combining frames strengthens the influence of the true edges. However, when the camera or the target moves, objects occupy different positions in the individual images, and if the images themselves are merged it is difficult to estimate an accurate bounding box. For this reason, the feature map combining unit 134 does not combine the images of consecutive frames directly; instead, it combines the feature maps obtained from the feature map extraction unit 132 by weighted summation, as in Equation (4) below. The feature map obtained through the feature map extraction unit 132 (the 90×113×512 feature map at the 5_3rd convolutional layer (ReLU) in FIG. 3) has been subsampled horizontally and vertically relative to the input image (the 720×900×3 input image in FIG. 3) and is therefore more robust to translation of the target. This is because a CNN generally extracts features that are robust to small translations.

Equation (4)

$$\tilde{F}(x, y, c) = \sum_{i=1}^{n} w_i \, F_i(x, y, c)$$

Here, $i$ is the index of the consecutive frames, $n$ is the total number of frames used in the combination, and $F_i$ is the 90×113×512 feature map obtained from the feature map extraction unit 132 of FIG. 3 with the i-th image frame as input. $(x, y, c)$ denotes the horizontal, vertical, and channel position in the feature map. The weights $w_i$ are 1-D discrete Gaussian distribution coefficients. The feature maps $F_i$ of the consecutive frames are thus weighted and summed position by position to obtain the combined feature map $\tilde{F}$, which is then used as the input to the feature map combining unit 134 of FIG. 4 and the classifier unit 136 of FIG. 6.
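The weighted summation of Equation (4) can be sketched in plain Python. The text states only that the weights are 1-D discrete Gaussian coefficients; centering the Gaussian on the middle frame, normalizing the weights to sum to 1, and the `sigma` value are assumptions of this sketch, as are the function names. Feature maps are flattened to lists for brevity.

```python
import math

def gaussian_weights(n, sigma=1.0):
    """1-D discrete Gaussian coefficients for n frames.

    Assumptions: centered on the middle frame and normalized to sum to 1.
    """
    center = (n - 1) / 2
    raw = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def fuse_feature_maps(feature_maps, weights):
    """Position-wise weighted sum of equally sized (flattened) feature maps."""
    size = len(feature_maps[0])
    return [sum(w * fm[j] for w, fm in zip(weights, feature_maps))
            for j in range(size)]

# Three frames: position 0 holds a stable "true edge" response, while
# position 1 holds frame-dependent noise of changing sign.
frames = [[1.0, 0.3], [1.0, -0.4], [1.0, 0.2]]
weights = gaussian_weights(len(frames))
fused = fuse_feature_maps(frames, weights)
```

The response that is consistent across frames keeps its full strength after fusion, whereas the noise responses partially cancel, which is the effect the weighted summation is intended to achieve.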

Specifically, consecutive frame information is combined on the feature maps because the deeper the layer, the larger the receptive field of a feature on the actual image, so scalar features at the same position in consecutive frames are more likely to have receptive fields containing similar content (see FIG. 5). That is, the scalar features at the same position in each frame's feature map have receptive-field images with similar content and pass through the same CNN. Because they carry different noise, however, weighted summation as in Equation (4) can weaken the influence of the noise.

The classifier unit 136 classifies predicted pedestrians using the spatial feature maps extracted from the images and the estimated pedestrian candidates.

Referring to FIG. 6, the classifier unit 136 takes as input the feature maps obtained from the feature map extraction unit 132 and the proposal boxes obtained from the feature map combining unit 134. First, the region of the feature maps corresponding to each proposal box is cropped. Because the cropped regions all differ in size, ROI pooling is used to bring them to a uniform size. The ROI pooling size used in the present invention is a vertically elongated rectangle (e.g., 3x7) chosen with the shape of a pedestrian in mind. Although the bounding box regression vector of the feature map combining unit 134 slightly refines the shape of each anchor box, the output proposal boxes are still vertically elongated, and the pooling size is designed to match. In addition, because most of the weights belong to the first fully connected layer, reducing the ROI pooling size by half or more reduces the number of parameters of the whole network by about half, which reduces over-fitting and memory usage. After ROI pooling, the flattened features pass through fully connected layers to yield bounding box regression vectors and pedestrian probabilities. The bounding box regression vectors are then used once more to refine the proposal boxes into predicted boxes, and overlapping predicted boxes are removed by NMS to obtain the final detection result. The classification fully connected layer of FIG. 6 outputs 2×100 values, where 2 corresponds to the probabilities of pedestrian and background and 100 is the number of candidates.
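The effect of the ROI pooling size on the parameter count can be illustrated with a rough calculation. The text does not state the baseline pooling size or the width of the first fully connected layer; the sketch below assumes a 7×7 baseline and 4096 units, as in a typical VGG-16-based Faster R-CNN, and takes the 512 channels from the feature map size given earlier. The helper `first_fc_weights` is hypothetical.

```python
def first_fc_weights(pool_h, pool_w, channels=512, fc_units=4096):
    """Number of weights in the first fully connected layer after ROI pooling.

    fc_units=4096 is an assumption (VGG-16 style); channels=512 follows the
    90x113x512 feature map described in the text.
    """
    return pool_h * pool_w * channels * fc_units

baseline = first_fc_weights(7, 7)  # assumed baseline ROI pooling size
reduced = first_fc_weights(7, 3)   # vertically elongated 3x7 pooling
ratio = reduced / baseline
```

Under these assumptions the first fully connected layer shrinks from about 103 million to about 44 million weights, a ratio of 3/7, which is in line with the statement that the overall parameter count drops by roughly half.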

The deep learning unit 138 trains the modified R-CNN unit 130 by machine learning using a database.

The deep learning unit 138 uses a four-step alternating training method that alternates between feature map combining training and classifier training using the database. In the modified R-CNN unit 130, both the feature map combining training and the classifier training perform two-class classification. Accordingly, in both cases the weights are trained to minimize the same loss function (although $\lambda$ may differ) for each anchor box or proposal box in a mini-batch.

Equation (5)

$$L(p, t) = L_{cls}(p, p^*) + \lambda \, p^* \, L_{reg}(t, t^*)$$

Here, $p$ is the predicted probability that the anchor box (for feature map combining) or the proposal box (for the classifier) encloses a pedestrian, and $p^*$ is the corresponding ground-truth label (pedestrian = 1, background = 0). $L_{cls}$ (the classification loss function) is the log loss. $t$ is the bounding box regression vector of the anchor box or proposal box, and $t^*$ is the bounding box regression vector with respect to the corresponding ground-truth bounding box. $L_{reg}$ (the regression loss function) is a robust loss function (smooth $L_1$). The regression loss is incurred only when the ground truth is a pedestrian ($p^* = 1$). Thus pedestrian samples receive the sum of the classification and regression losses, whereas background samples receive only the classification loss. Finally, $\lambda$ adjusts the relative weighting of the classification and regression losses.
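The loss described for Equation (5) can be sketched per sample in Python. The equation image is not preserved, so this sketch assumes the usual Faster R-CNN form, with a log loss for classification and a smooth L1 loss for regression that is switched on only for pedestrian samples. The function names, the smooth L1 breakpoint at 1, and the small `eps` for numerical safety are assumptions of this example.

```python
import math

def smooth_l1(d):
    """Smooth L1 (robust) loss of a scalar difference, breakpoint at 1."""
    a = abs(d)
    return 0.5 * d * d if a < 1 else a - 0.5

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """Per-sample loss: log loss plus lam * p_star * smooth-L1 regression loss.

    p:      predicted pedestrian probability
    p_star: ground-truth label (pedestrian = 1, background = 0)
    t, t_star: predicted and ground-truth regression vectors (4 values each)
    lam:    balance between classification and regression (hypothetical value)
    """
    eps = 1e-12  # avoids log(0)
    cls = -(p_star * math.log(p + eps) + (1 - p_star) * math.log(1 - p + eps))
    reg = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls + lam * p_star * reg

zero = (0.0, 0.0, 0.0, 0.0)
# Background sample: the (large) regression error is ignored.
background_loss = detection_loss(0.1, 0, (5.0, 5.0, 5.0, 5.0), zero)
# Pedestrian sample: classification and regression losses are summed.
pedestrian_loss = detection_loss(0.9, 1, (0.5, 0.0, 0.0, 2.0), zero)
```

Multiplying the regression term by the ground-truth label is what makes background samples contribute only the classification loss.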

The deep learning unit 138 may train on daytime pedestrians as well as nighttime pedestrians. Because there are generally not enough nighttime pedestrian images for machine learning, data augmentation is performed to increase the number of nighttime training images: brightness reduction, noise addition, and vertical flipping are applied to daytime pedestrian images, and vertical flipping is applied to nighttime training images. The brightness reduction and noise addition are carried out according to Algorithm 1 below so that the results have characteristics similar to real night images.

Algorithm 1

A uniform random variable $d$ is drawn within a set range ($d$ represents the level of darkness). The RGB daytime image is converted to HSV, and the pixel values of the value channel image are normalized by min-max scaling. The image is darkened by dividing each pixel of the value channel image by a factor determined by $d$. The HSV image is converted back to an RGB image. A noise image of additive white Gaussian noise (AWGN), whose variance is determined by $d$, is added to each RGB channel of each pixel, and the entire image is then Gaussian-blurred so that it resembles a night image.
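The steps of Algorithm 1 can be sketched in plain Python as follows. The source does not preserve the range of the darkness variable or the formulas that map it to the divisor and the noise variance, so the range `(2.0, 6.0)`, dividing directly by the darkness level, the noise standard deviation `0.01 * darkness`, and the 3×3 blur kernel are all hypothetical choices made for this sketch. Images are lists of rows of `(r, g, b)` floats in 0..1.

```python
import colorsys
import random

def blur3x3(image):
    """Small Gaussian-like blur with a separable [1, 2, 1] / 4 kernel."""
    h, w = len(image), len(image[0])
    k = (0.25, 0.5, 0.25)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            px = [0.0, 0.0, 0.0]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp coordinates at the image border.
                    src = image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                    for c in range(3):
                        px[c] += k[dy + 1] * k[dx + 1] * src[c]
            row.append(tuple(px))
        out.append(row)
    return out

def simulate_night(image, darkness, rng):
    """Darken a daytime image, add Gaussian noise, and blur it."""
    hsv = [[colorsys.rgb_to_hsv(*px) for px in row] for row in image]
    values = [v for row in hsv for _, _, v in row]
    v_min, v_max = min(values), max(values)
    span = (v_max - v_min) or 1.0
    noise_std = 0.01 * darkness  # hypothetical mapping from darkness to noise
    noisy = []
    for row in hsv:
        out_row = []
        for h, s, v in row:
            v_dark = ((v - v_min) / span) / darkness  # min-max scale, then darken
            rgb = colorsys.hsv_to_rgb(h, s, v_dark)   # back to RGB
            out_row.append(tuple(min(1.0, max(0.0, c + rng.gauss(0.0, noise_std)))
                                 for c in rgb))
        noisy.append(out_row)
    return blur3x3(noisy)

rng = random.Random(0)
day = [[(0.2 + 0.05 * x, 0.5, 0.9 - 0.05 * y) for x in range(4)] for y in range(4)]
darkness = rng.uniform(2.0, 6.0)  # hypothetical range for the darkness level
night = simulate_night(day, darkness, rng)
```

A larger darkness level yields both a darker and a noisier image in this sketch, mirroring the description that the divisor and the noise variance both depend on the same random variable.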

FIG. 7 shows examples of night images obtained by applying Algorithm 1 described above, as data augmentation, to daytime images of the Caltech database at different levels of darkness.

FIGS. 8 and 9 are diagrams for explaining a pedestrian detection method according to an embodiment of the present invention. Each step described below is performed by one of the functional units constituting the pedestrian detection device; for a concise and clear description, the pedestrian detection device is referred to collectively as the subject performing each step.

Referring to FIG. 8, in step S810 the pedestrian detection device 100 receives a plurality of consecutive images.

In step S820, the pedestrian detection device 100 normalizes the plurality of consecutive images. As pixel normalization, for each input image and for each of the RGB channels, the pedestrian detection device 100 transforms the pixel values of the channel into a distribution with zero mean and unit standard deviation and then scales them, thereby normalizing the illuminance and contrast of each input image. In addition, as histogram equalization (HE) and mean subtraction, the pedestrian detection device 100 converts each input image from RGB color to HSV color, applies histogram equalization to the value channel image, converts the result back to RGB, and subtracts the RGB mean of the entire image set, thereby zero-centering the image pixel values.
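The normalization operations of step S820 can be sketched on a single channel as follows. This is a simplified illustration with hypothetical function names: a channel is represented as a flat list of pixel values, the RGB/HSV conversions are omitted, and the final scaling applied after standardization is left out because its target range is not stated in the text.

```python
import math


def standardize_channel(values):
    """Shift and scale one colour channel to zero mean and unit std."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # guard against a constant channel
    return [(v - mean) / std for v in values]


def equalize_histogram(values, levels=256):
    """Histogram-equalise integer intensities in [0, levels - 1]."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    # Cumulative distribution of the intensity histogram.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(values)
    if n == cdf_min:  # every pixel has the same intensity
        return list(values)
    return [round((cdf[v] - cdf_min) / (n - cdf_min) * (levels - 1))
            for v in values]


def subtract_mean(values, dataset_mean):
    """Zero-centre a channel using the mean computed over the image set."""
    return [v - dataset_mean for v in values]
```

In a full pipeline, `equalize_histogram` would be applied to the value channel of the HSV image and `subtract_mean` to each RGB channel after conversion back to RGB.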

Referring to FIGS. 8 and 9, in step S830 the pedestrian detection device applies the machine-learned modified R-CNN to the normalized consecutive images to extract a spatial feature map of each image.

In step S840, the pedestrian detection device temporally combines the spatial feature maps extracted from the respective images to estimate a pedestrian candidate group.

In step S850, the pedestrian detection device classifies predicted pedestrians using the spatial feature maps extracted from the respective images and the estimated pedestrian candidate group.

The pedestrian detection device may further perform a step of machine learning using a database. This step is described in detail below with reference to FIGS. 10 to 13.

FIGS. 10 to 13 are diagrams for explaining the machine learning of the pedestrian detection method according to an embodiment of the present invention.

To perform the pedestrian detection method according to an embodiment of the present invention, the KAIST database and the Caltech database may be used, for example.

Referring to FIG. 10, the KAIST database is a pedestrian image set containing daytime and nighttime images, and it provides thermal images in addition to visible-light camera images. In the present invention, only the visible-light camera images were used for machine learning and testing. Since every image in the KAIST database is 512 × 640, it can be resized at the same aspect ratio to 720 × 900.

The Caltech database is a daytime image set containing many pedestrians. Because the KAIST database includes nighttime images, it was sampled every 10 frames so that training would cover a variety of noise, whereas the Caltech database was sampled every 30 frames. Since every image in the Caltech database is 480 × 640, the height was first stretched to give 512 × 640, and the image was then resized at the same aspect ratio to 720 × 900.
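The resizing arithmetic can be checked with a short sketch. Sizes are taken here as (height, width), which matches the statement that the Caltech height is stretched from 480 to 512. The helper `rescale_box`, which maps a bounding box annotation into the resized frame, is a hypothetical illustration; the text does not describe how annotations were handled.

```python
KAIST_SIZE = (512, 640)    # (height, width)
CALTECH_SIZE = (480, 640)  # (height, width)
TARGET_SIZE = (720, 900)   # (height, width)


def scale_factors(src, dst=TARGET_SIZE):
    """Return (vertical, horizontal) scale factors from src to dst."""
    return dst[0] / src[0], dst[1] / src[1]


def rescale_box(box, src, dst=TARGET_SIZE):
    """Map an (x, y, w, h) box from the source frame to the target frame."""
    sy, sx = scale_factors(src, dst)
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)
```

Under these sizes the KAIST factors are equal in both directions (1.40625), while the Caltech vertical factor is 1.5 because the height is stretched before the uniform resize.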

Referring to FIG. 11, flipping was used as data augmentation during machine learning, and for daytime images, darkening by a random degree was also used. In addition, about 10% of the nighttime training images were set aside and used as a validation set. For testing, the "reasonable" pedestrian ground-truth setting was used, and performance was evaluated with the log-average miss rate over a range of false positives per image (FPPI).
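A log-average miss rate can be computed roughly as sketched below. The FPPI range is not reproduced in this text, so the defaults used here (nine reference points spaced evenly in log space between 0.01 and 1) are an assumption borrowed from common pedestrian-benchmark practice, and the function name is hypothetical.

```python
import math


def log_average_miss_rate(curve, fppi_low=1e-2, fppi_high=1.0, num_points=9):
    """curve: list of (fppi, miss_rate) pairs sorted by increasing fppi.

    Returns the geometric mean of the miss rates sampled at reference
    FPPI values spaced evenly in log space.
    """
    log_lo, log_hi = math.log10(fppi_low), math.log10(fppi_high)
    refs = [10 ** (log_lo + i * (log_hi - log_lo) / (num_points - 1))
            for i in range(num_points)]
    logs = []
    for ref in refs:
        # Use the miss rate at the largest FPPI not exceeding the reference;
        # if the curve starts later, count the point as a full miss (1.0).
        mr = 1.0
        for fppi, miss in curve:
            if fppi <= ref:
                mr = miss
            else:
                break
        logs.append(math.log(max(mr, 1e-10)))
    return math.exp(sum(logs) / len(logs))
```

Averaging in log space keeps a single very low miss rate at high FPPI from dominating the summary value.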

During machine learning, the weights were initialized from a VGG-16 model pre-trained on the ImageNet dataset. However, because the feature map after region-of-interest (ROI) pooling is 7×3 (taller than wide), the number of weights connected to the first layer after pooling does not match the number of weights in the pre-trained model. Therefore, only the weights connected to the central part of the width axis were cropped out and used for initialization. In addition, the pre-trained weights of the first through fourth convolutional layers of the VGG-16 model were frozen during machine learning.
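The center crop along the width axis can be illustrated as below. The sketch assumes the pre-trained layer expects a 7×7 spatial grid with 512 channels (the output depth of the last VGG-16 convolutional block) and that weights are arranged per channel as a height-by-width grid; the actual memory layout depends on the framework, and both function names are hypothetical.

```python
def crop_center_columns(grid, new_width):
    """Keep the central new_width columns of a height x width weight grid."""
    old_width = len(grid[0])
    start = (old_width - new_width) // 2
    return [row[start:start + new_width] for row in grid]


def fc_input_count(channels, height, width):
    """Number of inputs feeding one unit of the layer after ROI pooling."""
    return channels * height * width
```

For a 7×7 grid cropped to width 3, columns 2 through 4 are kept; under the 512-channel assumption the per-unit input count drops from 25088 to 10752.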

The machine learning process is described with reference to FIG. 12. Passing the proposal boxes produced by feature map combining to a training step is indicated by solid arrows, and initializing weights from previously learned weights is indicated by dotted arrows. The feature map combining and classifier stages in the first and second rows (solid boxes) are trained end-to-end, whereas the feature map combining and classifier stages in the last row (dotted boxes) train only their own unique layers, excluding feature map extraction, so that feature map extraction is shared.

In the specific machine learning procedure, feature map extraction and feature map combining were first fine-tuned using the Caltech pedestrian database. Proposal boxes were then generated for each training image using the trained feature map combining stage, and these were used to fine-tune the feature map extraction and classifier parts (learning rate = 0.001). Next, the network was fine-tuned once more on the KAIST database using a 4-step alternating training scheme (learning rate = 0.0001). In this way, the feature map extraction and classifier parts were trained on general pedestrians using the Caltech database, which contains many pedestrians, and the network was finally fine-tuned on the KAIST database to make it robust at night.

Stochastic gradient descent (SGD) was used for machine learning, with a momentum of 0.9 and a weight decay of 0.0005. In each of the three feature map combining (RPN) training steps, 80k SGD iterations were performed; likewise, in each of the three classifier training steps, 40k SGD iterations were performed.
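A single SGD update with these hyperparameters can be sketched as follows. The update rule shown, in which weight decay is added to the gradient before the momentum step, is one common formulation and is an assumption here; the text gives only the hyperparameter values. The function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr,
                      momentum=0.9, weight_decay=0.0005):
    """Apply one SGD step with momentum and L2 weight decay.

    Returns the updated (weights, velocity) as new lists.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Weight decay acts as an extra gradient term pulling w towards zero.
        v_next = momentum * v - lr * (g + weight_decay * w)
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity
```

The learning rate `lr` would be 0.001 for the Caltech stage and 0.0001 for the KAIST stage, following the values given in the text.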

Referring to FIG. 13, the mean mini-batch loss is plotted as it decreases during the final feature map combining (RPN) training step (a) and the final classifier training step (b). The loss of feature map combining is lower than that of the classifier because feature map combining averages the loss over all anchor boxes in the image, many of which are relatively easy samples.

After machine learning was completed for each of the six steps in total (two steps for the Caltech database and four steps for the KAIST database), the stored model from the current step with the lowest validation error was selected to proceed to the next step. In addition, only images containing at least one pedestrian were used for machine learning, so that the gradient would not be dominated by the background loss. All machine learning took approximately 2.5 days to complete.
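The two bookkeeping rules above, choosing the stored model with the lowest validation error and training only on images that contain a pedestrian, can be sketched as follows. The dictionary keys and function names are hypothetical and serve only as illustration.

```python
def select_best_checkpoint(checkpoints):
    """Return the stored model record with the lowest validation error."""
    return min(checkpoints, key=lambda ckpt: ckpt["val_error"])


def images_with_pedestrians(annotations):
    """Keep only image names that have at least one pedestrian box."""
    return [name for name, boxes in annotations.items() if len(boxes) >= 1]
```

For example, `select_best_checkpoint` would be called once at the end of each of the six steps, and its result used to initialize the next step.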

FIGS. 14 to 17 are diagrams for explaining the effects of the present invention.

Referring to FIG. 14, the performance of various feature combining methods was compared using the proposed model trained with the pixel normalization preprocessing method of the present invention. When combining consecutive frame features by weighted summation, 1-D discrete Gaussian filter coefficients were used to give a larger weight to the current frame (the image for which the detection result is actually obtained). In theory it is best to combine with all weights equal (appendix), but in practice the current frame must have the largest weight to obtain accurate bounding boxes. In this study, [0.25, 0.5, 0.25] was used by default so that the weight of the current frame equals the sum of the weights of the neighboring frames.
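The weighted summation of consecutive frame features can be sketched as below, with each feature map represented as a 2-D list for simplicity. The helper `binomial_weights` reproduces [0.25, 0.5, 0.25] for three frames; using binomial coefficients as the discrete Gaussian for other frame counts is an assumption, since the text lists only the three-frame default. Both function names are hypothetical.

```python
import math


def binomial_weights(num_frames):
    """Discrete Gaussian-like weights from binomial coefficients."""
    coeffs = [math.comb(num_frames - 1, k) for k in range(num_frames)]
    total = sum(coeffs)
    return [c / total for c in coeffs]


def combine_feature_maps(maps, weights=(0.25, 0.5, 0.25)):
    """Element-wise weighted sum of same-sized 2-D feature maps.

    maps are ordered in time, with the current frame in the middle.
    """
    if len(maps) != len(weights):
        raise ValueError("need exactly one weight per feature map")
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(wt * fmap[y][x] for wt, fmap in zip(weights, maps))
             for x in range(width)]
            for y in range(height)]
```

Because the weights sum to one, the combined map stays on the same scale as the individual feature maps.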

In consecutive frame combining, the number of consecutive frames to combine is an important variable. In theory (appendix), combining more frames improves the discriminative power of the features, but in practice the corresponding positions between images drift progressively out of alignment, so a moderate number of frames should be combined. As shown in the table below, the best performance was obtained when three frames were combined, and performance gradually declined from five frames onward. This is interpreted as being because the receptive fields of the frame features deviate greatly from one another. For a fixed camera such as a surveillance camera, increasing the number of consecutive frames is expected to yield a larger performance gain.

Next, the effect of consecutive frame feature combining on classification and bounding box regression was examined. Here, bounding box regression is considered more accurate the larger the intersection over union (IoU) between a pedestrian's ground-truth bounding box and the predicted bounding box.

Referring to FIG. 15, when the IoU threshold is small (a detection is regarded as a true positive even when its IoU is small), the miss rate (MR) is markedly lower with consecutive frames combined (1510) than without combining (1520). However, as the IoU threshold increases, there is almost no performance difference between combining and not combining. This is interpreted to mean that combining the features of consecutive frames improves classifier performance but has little effect on regression performance. When the IoU threshold is low, a predicted bounding box is regarded as a true positive even if it only roughly overlaps the ground-truth bounding box, so classifier performance strongly affects the miss rate; the miss rate therefore fell as combining improved the classifier. In contrast, when the IoU threshold is high, a predicted box becomes a true positive only if it almost exactly matches the ground-truth bounding box, so regression accuracy, which combining does not improve, dominates the result.
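The role of the IoU threshold can be made concrete with a short sketch. Boxes are given as (x1, y1, x2, y2) corner coordinates, which is an assumed convention, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(predicted, ground_truth, threshold):
    """A detection counts as a true positive when its IoU meets the threshold."""
    return iou(predicted, ground_truth) >= threshold
```

A prediction shifted by half a box width has an IoU of about 0.33, so it counts as a true positive at a threshold of 0.3 but as a miss at 0.5, which is why a higher threshold exposes regression error.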

Referring to FIG. 16, the performance of the proposed model using consecutive frame feature combining was compared with the KAIST database baselines.

The baseline algorithms are ACF-RGB (1620), which uses the RGB channels, and ACF-RGB+T+THOG (1630), which uses the RGB channels, the thermal channel (T), and the HOG features of the thermal image (THOG). Comparing the two baselines, ACF-RGB+T+THOG, which has more information, performed better at all times of day, as expected. ACF-RGB in particular performs very poorly at night, because nighttime noise is so strong that pedestrians are difficult to distinguish using hand-crafted features. However, the model according to the present invention (1610) uses deep features and combines such feature information over a short time span, and it therefore showed detection performance more than 30% higher than ACF-RGB (1620) on KAIST database nighttime images (FIG. 16(c)). Notably, it also performed more than 5% better than ACF-RGB+T+THOG (1630), which additionally uses thermal image information. It also showed stable performance on KAIST database daytime images (FIG. 16(b)). This is interpreted as a result of training on diverse images through data augmentation and of normalizing contrast and illumination through appropriate preprocessing. As a result, the model outperformed the baselines on the entire KAIST database (FIG. 16(a)).

FIG. 17 shows detection results for sample images from the KAIST database.

The first row of FIG. 17 shows detection results for daytime images, and the remaining rows show detection results for nighttime images. The green, yellow (1520), and red (1530) bounding boxes represent true positive boxes, false negative boxes (missed ground truths), and false positive boxes, respectively.

The pedestrian detection method according to the present invention was applied to images from all times of day, including nighttime, obtained with a visible-light camera. A Faster R-CNN model, slightly modified for pedestrian detection, was trained on daytime and nighttime images. To make the model robust to noise and illumination, the intensity of daytime images was randomly reduced and random AWGN was added. In addition, a preprocessing method that normalizes the illumination and contrast of all training images was applied for stable training. The same preprocessing was applied at test time, and temporal information was additionally used by weighted summation of consecutive frame features. As a result, the method outperformed the baselines at all times of day on the KAIST database and, in particular, performed better at night than a method that also uses thermal camera information.

Because the KAIST database used for performance measurement consists of images captured from a moving vehicle, the distance to pedestrians and the illumination varied. Accordingly, most error cases arose from pedestrians or backgrounds that were too dark or too distant (indistinguishable even to the human eye). However, in a surveillance system environment with some constraints (distance, illumination, and so on at a level where pedestrians are discernible), the effect of consecutive frame feature combining is expected to be greater, yielding highly accurate detection performance.

The pedestrian detection method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal wire or a waveguide, that carries a carrier wave transmitting signals specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The present invention has been described above with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

In the pedestrian detection device for detecting a pedestrian at night time,
A continuous image input unit that receives a plurality of consecutive images;
An image normalization unit that normalizes a plurality of input consecutive images; And
It includes a modified R-CNN unit for classifying a pedestrian candidate group by applying a machine-learned modified R-CNN to a plurality of normalized consecutive images,
Further comprising a deep learning unit for machine learning by performing an algorithm so that the image of the daytime environment has characteristics similar to the image of the nighttime environment,
The plurality of images
are night images in which contrast and illuminance vary greatly compared with daytime images, and are three consecutive images for combining consecutive frame features,
The image normalization unit
For each input image and for each of the RGB channels, transforms the pixel values of the channel into a distribution with zero mean and unit standard deviation and then scales them to normalize the illuminance and contrast of each input image, and
After converting the RGB color of each input image to HSV color, applies histogram equalization to the value channel image, converts the result back to RGB, and normalizes by subtracting the RGB mean of the entire image set,
The modified R-CNN unit
A feature map extracting unit that extracts a spatial feature map of each image by applying a machine-learned modified R-CNN to a plurality of normalized consecutive images;
A feature map combining unit for estimating a candidate group of pedestrians by performing feature map combining temporally on spatial feature maps extracted from each image; And
A pedestrian detection device including a classifier for classifying predicted pedestrians using a spatial feature map extracted from each image and an estimated pedestrian candidate group.

delete

According to claim 1,
The feature map extraction unit,
A pedestrian detection device that receives images containing a plurality of pedestrians as input and extracts feature maps by passing them through 13 convolution layers with ReLU activation functions and three max pooling layers.

The method of claim 5,
The feature map extraction unit,
A pedestrian detection device that, in order to address the low resolution of feature maps in the VGG-Net 16 network, increases the resolution of the last feature map by removing the fourth max pooling layer.

According to claim 1,
The feature map combining unit
A pedestrian detection device that performs combining by weighted summation of the spatial feature maps extracted from each image.

In the pedestrian detection method performed in the pedestrian detection device for the detection of pedestrians at night time,
Receiving a plurality of consecutive images;
Normalizing the input consecutive plurality of images;
Extracting a spatial feature map of each image by applying a machine-learned modified R-CNN to a plurality of normalized consecutive images;
Estimating a pedestrian candidate group by temporally combining the spatial feature maps extracted from each image; And
And classifying the expected pedestrian using the spatial feature map extracted from each image and the estimated pedestrian candidate group,
Further comprising the step of performing the machine learning by performing an algorithm so that the image of the daytime environment has characteristics similar to the image of the nighttime environment,
The plurality of images are night images in which contrast and illuminance fluctuate significantly compared to day images, and three consecutive images for combining continuous frame features,
The step of normalizing the input multiple consecutive images,
For each input image and for each of the RGB channels, the pixel values of the channel are transformed into a distribution with zero mean and unit standard deviation and then scaled to normalize the illuminance and contrast of each input image, and
A pedestrian detection method that converts the RGB color of each input image to HSV color, applies histogram equalization to the value channel image, converts the result back to RGB, and subtracts the RGB mean of the entire image set.

delete

A computer program recorded on a computer-readable recording medium that executes the pedestrian detection method of claim 8.