KR20180080081A - Method and system for robust face detection in wild environment based on CNN

Info

Publication number: KR20180080081A
Application number: KR1020170075826A
Authority: KR
Inventors: 노용만; 김형일; 송주남; 김학구
Original assignee: 한국과학기술원
Priority date: 2017-01-03
Filing date: 2017-06-15
Publication date: 2018-07-11
Also published as: KR102036963B1

Abstract

Provided are an improved face detection method based on a convolutional neural network (CNN) and a system thereof. Faces can be detected accurately and quickly, even in a wild environment where face pose changes and occlusion occur, by using a multi-scale fully convolutional network (FCN) whose branches share part of the neural network. Accordingly, the present invention can reduce computational complexity.

Description

CNN-based face detection method and system robust to wild environments

The present invention relates to a method for performing face detection in a wild environment where face pose changes and occlusion occur, and more particularly to a face detection technique based on a convolutional neural network (CNN).

With the recent emergence of various applications that use face information, interest in practical face detection methods is growing. Face recognition systems are used in security systems that grant access to specific individuals and for protecting personal privacy in surveillance environments. In addition, facial expression recognition analyzes expression changes in the face region and is used to interpret human emotion from outward changes in expression. As the range and number of applications using face information grow, practical face detection methods capable of accurately extracting the face region in various environments are being actively researched.

The Viola-Jones method, proposed in the 2000s, was the first model to demonstrate the practical feasibility of face detection. It efficiently extracts Haar-like features using the integral image technique and selects the final face regions using a cascade of AdaBoost classifiers. However, because this method relies on simple features, its detection performance degrades significantly under face pose changes or occlusion. To solve this problem, the deformable part model (DPM) was proposed. This method defines a face region as a combination of face components arranged according to their geometric relationships. Because a face region can be identified even when some face components are missing, the method is robust to pose changes and occlusion. However, in addition to the primary step of estimating whether each face component is present, determining how well the part model matches each of the numerous windows extracted by the sliding window method entails great complexity. In addition, training such a part model requires a large-scale database that includes accurate labels for each part.

Recently, learning-based convolutional neural network (CNN) methods have achieved great success in various fields of computer vision. CNN-based face detection has greatly improved detection performance, but the increased system complexity has raised doubts about its practicality. The number of windows that can be extracted from a 320×240 image reaches one billion. For each of these numerous patches, features are extracted with a CNN and the patch is classified as face or non-face. This illustrates the trade-off between face detection performance and system complexity. In addition, the convolution operation is performed repeatedly on the overlapping area between adjacent windows, which introduces redundant computation. Furthermore, because the input and output of the fully-connected layers of a convolutional neural network are fixed, all input data passing through the network must be resized to a fixed size, which increases the computational complexity of the system.
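The text does not state how the figure of one billion windows is counted. One reading that reproduces the order of magnitude, offered here only as an assumption, is to count every axis-aligned rectangle of any size and position with integer corners. The helper `count_subwindows` below is a hypothetical name used for illustration.

```python
def count_subwindows(width, height):
    """Count all axis-aligned rectangles with integer corners in a width x height image.

    Assumption (not stated in the source): a "window" is any rectangle of any
    size and position, so we choose a (left, right) pair and a (top, bottom) pair.
    """
    horizontal = width * (width + 1) // 2   # ways to pick left < right among width + 1 grid lines
    vertical = height * (height + 1) // 2   # ways to pick top < bottom among height + 1 grid lines
    return horizontal * vertical


total = count_subwindows(320, 240)
print(total)  # 1485331200, i.e. on the order of one billion
```

Under this counting, a 320×240 image yields roughly 1.5 billion candidate windows, which is consistent with the figure cited above.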

One embodiment may provide a face detection method with reduced computational complexity by using, at the input stage, a fully convolutional network (FCN) that has no fully-connected layer, thereby eliminating the step of resizing input data to a fixed size.

One embodiment may provide a face detection method that precisely detects the face region by combining a classification process, which determines whether a face is included, with a regression process based on face bound regression.

One embodiment may provide a face detection method that reduces complexity by having a plurality of layers perform convolution on a common feature map, and that is optimized for detecting faces of various sizes by giving the pooling layers strides of different sizes.

A face detection method according to an embodiment of the present invention may include: extracting a plurality of different face candidate regions from each of a plurality of heat maps representing face components included in an image, the plurality of heat maps being generated by applying different convolution or pooling schemes to the image; and detecting a face region included in the image based on the plurality of different face candidate regions.

The extracting of the plurality of different face candidate regions may include: converting the image into feature maps through a first layer that includes at least one first convolutional layer performing convolution and at least one first pooling layer performing pooling; and converting the feature maps into heat maps through each of a plurality of second layers, each including at least one second convolutional layer and at least one second pooling layer, in order to extract the plurality of different face candidate regions, wherein the plurality of second layers may use the feature maps in common to convert them into the heat maps.

Furthermore, the converting of the feature maps into heat maps may include each of the plurality of second layers successively performing convolution and pooling operations to convert the feature maps into heat maps, wherein a layer included in one of the plurality of second layers and the layer at the corresponding position in the order of operations of another second layer may have strides of different sizes.

The detecting of the face region may include: classifying each face candidate region as face or non-face by determining whether it contains a face region; and regressing to a more precise face candidate region based on the classification and the face candidate regions.

Furthermore, the classifying may include outputting a probability of 1 if a face region is included and a probability of 0 if no face region is included.

Furthermore, the regressing to the precise face candidate region may include providing position information of the face region if the classifying step classifies the region as containing a face, and assigning a label indicating that the position information of the face region is to be ignored if the classifying step classifies the region as containing no face.
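The labeling rule in the two preceding paragraphs can be illustrated with a minimal sketch. The sentinel `IGNORE`, the helper names, and the use of a mean squared error are assumptions introduced for illustration; the source only states that a face candidate receives probability 1 with position information, and a non-face candidate receives probability 0 with a label to ignore the position.

```python
IGNORE = None  # hypothetical sentinel meaning "skip this box in the regression loss"


def make_targets(is_face, box):
    """Return (classification target, regression target) for one candidate region."""
    cls_target = 1.0 if is_face else 0.0
    reg_target = tuple(box) if is_face else IGNORE
    return cls_target, reg_target


def masked_box_loss(predicted_boxes, reg_targets):
    """Mean squared error over candidates whose regression target is not ignored."""
    total, count = 0.0, 0
    for pred, target in zip(predicted_boxes, reg_targets):
        if target is IGNORE:
            continue  # non-face candidate: position information is ignored
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += 1
    return total / count if count else 0.0


targets = [make_targets(True, (10, 20, 30, 40)), make_targets(False, (0, 0, 5, 5))]
reg_targets = [reg for _, reg in targets]
loss = masked_box_loss([(12, 20, 30, 40), (99, 99, 99, 99)], reg_targets)
print(targets, loss)
```

In this sketch the second candidate contributes nothing to the regression loss, however far off its predicted box is, because its target carries the ignore label.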

The detecting of the face region may include detecting the face region by passing the face candidate regions through at least one convolutional layer, at least one pooling layer, and at least one fully-connected layer (FCL).

The extracting of the face candidate regions may be trained through: training to distinguish face regions from non-face regions, starting from a neural network model that distinguishes object regions from non-object regions; and training face candidate region extraction using an image database that includes one or more facial landmarks.

The detecting of the face region may be trained using the face candidate regions as a database through a negative example mining (hard sample mining) technique.

A machine-learning-based face detection system according to an embodiment of the present invention may include: a proposal unit that extracts a plurality of different face candidate regions from each of a plurality of heat maps representing face components included in an image, the plurality of heat maps being generated by applying different convolution and pooling schemes to the image, and proposes them to a detection unit; and the detection unit, which detects a face region included in the image based on the plurality of different face candidate regions.

The proposal unit may include: a first layer unit that converts the image into feature maps through at least one first convolutional layer performing convolution and at least one first pooling layer performing pooling; and a plurality of second layer units that convert the feature maps into heat maps through at least one second convolutional layer and at least one second pooling layer in order to extract the plurality of different face candidate regions, wherein the plurality of second layer units may use the feature maps in common to convert them into the heat maps.

In the proposal unit, each of the plurality of second layer units may successively perform convolution and pooling operations to convert the feature maps into heat maps, and a layer included in one of the plurality of second layer units and the layer at the corresponding position in the order of operations of another second layer unit may have strides of different sizes.

The detection unit may include: a classification unit that classifies each face candidate region as face or non-face by determining whether it contains a face; and a regression unit that regresses to a more precise face candidate region based on the classification and the face candidate regions.

Furthermore, the classification unit may output a probability of 1 if a face region is included and a probability of 0 if no face region is included.

Furthermore, the regression unit may provide position information of the face region if the classification unit classifies the region as containing a face, and may assign a label indicating that the position information of the face region is to be ignored if the classification unit classifies the region as containing no face.

The detection unit may include at least one convolutional layer, at least one pooling layer, and at least one fully-connected layer (FCL), through which the face candidate regions pass in order to detect the face region.

One embodiment can reduce computational complexity by using, at the input stage, a fully convolutional network (FCN) that has no fully-connected layer, thereby eliminating the step of resizing input data to a fixed size.

One embodiment can precisely detect the face region by combining a classification process, which determines whether a face is included, with a regression process based on face bound regression.

One embodiment can reduce complexity by having a plurality of layers perform convolution on a common feature map, and can be optimized for detecting faces of various sizes by giving the pooling layers strides of different sizes.

FIG. 1 is a diagram for explaining a face detection process according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining machine learning of a proposal network according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a proposal network according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a detection network according to an embodiment of the present invention.
FIG. 5 is a flowchart of a face detection method according to an embodiment of the present invention.
FIG. 6 is a block diagram of a face detection system according to an embodiment of the present invention.

Hereinafter, specific embodiments among the various embodiments of the present invention are illustrated in the accompanying drawings and described in detail. These specific embodiments do not, however, limit or restrict the present invention. Identical reference numerals denote identical components regardless of the figure in which they appear, and redundant descriptions are omitted.

FIG. 1 is a diagram for explaining a face detection process according to an embodiment of the present invention.

Referring to FIG. 1, an embodiment of the present invention may provide a network that detects a face region in a target image 100. The network according to an embodiment of the present invention is divided into two stages: a proposal network 110 and a detection network 120.

When the target image 100 is input to the proposal network 110, the first layer 130 converts the target image 100 into feature maps. A feature map is an image generated by applying convolution and pooling operations to the target image.

The first feature map is input to the plurality of second layers 141, 142, and 143, and each of the second layers converts the first feature map into heat maps 150 in a different manner. A heat map 150 is a probability map generated by applying convolution and pooling operations to the feature map.

A probability map maps each pixel value to the probability that a face is present, and is used to identify the regions in which a face exists.

The first layer 130 and the plurality of second layers 141, 142, and 143 may include convolutional layers or pooling layers.

A convolutional layer may have a kernel with learned weights and a bias, together with a user-defined stride, and may perform a convolution operation on the target image 100 or on a feature map.

A pooling layer may perform a pooling operation with a stride of a user-defined size. The pooling operation may be max-pooling or average-pooling.
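As a minimal illustration of how a pooling layer with a user-defined stride behaves, the sketch below applies max-pooling and average-pooling to a small feature map. The function name `pool2d` and its parameters are hypothetical, and the feature map is a plain nested list rather than the output of a real network.

```python
def pool2d(feature_map, size, stride, mode="max"):
    """Slide a size x size window over a 2D list with the given stride (no padding)."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(window))
            else:
                pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(pool2d(feature_map, size=2, stride=2, mode="max"))      # [[6, 8], [14, 16]]
print(pool2d(feature_map, size=2, stride=2, mode="average"))  # [[3.5, 5.5], [11.5, 13.5]]
print(pool2d(feature_map, size=2, stride=1, mode="max"))      # [[6, 7, 8], [10, 11, 12], [14, 15, 16]]
```

A larger stride yields a smaller output, so the same window size summarizes the input more coarsely.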

The proposal network 110 extracts (160), from the heat maps generated by the plurality of second layers 141, 142, and 143, n regions judged to be faces, that is, n face candidate regions 170, and proposes them to the detection network 120.

Each of the plurality of second layers 141, 142, and 143 successively performs convolution or pooling operations. The stride of a particular layer included in one second layer may differ from the stride of the layer that is included in another second layer and occupies the corresponding position in the order of operations.

By using a different stride size for each second layer, heat maps optimized for faces of various sizes can be generated. This makes it possible to produce highly accurate face candidate regions 170 and, ultimately, to improve face detection performance on the target image 100.

The face candidate regions 170 extracted by the proposal network 110 already show high face detection performance by themselves. Passing them through the detection network 120, however, raises the recall rate and reduces false positives, which contributes to still higher face detection performance.

When the face candidate regions 170 are input to the detection network 120, they are converted into one-dimensional data 191 through the convolutional layer 180 and the fully-connected layer 190, and the network outputs n values 192 from which each of the n face candidate regions 170 can be classified as containing a face or not. To generate the one-dimensional data 191, the detection network 120 may include a pooling layer in addition to the convolutional layer 180 and the fully-connected layer 190.

In addition, the detection network 120 regresses to a face position that is more precise than the one given by the face candidate region 170, and outputs coordinate values 192 for the refined face candidate region. To draw a box around each refined face candidate region, the coordinate values may be 4n values comprising an x coordinate, a y coordinate, a width, and a height. The algorithm for regressing to the precise face candidate region may be the face bound regression disclosed in D. Wang, J. Yang, and Q. Liu, "Hierarchical Convolutional Neural Network for Face Detection," Proceedings of the International Conference on Image and Graphics, pp. 373-384, 2015.

The generated one-dimensional data 191 can represent the face region more precisely than the face candidate regions 170, and the final face region can be obtained by post-processing the 5n values. The post-processing may be non-maximum suppression (NMS).
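The post-processing step can be sketched as a greedy non-maximum suppression over the 5n values, treated here as n tuples of (score, x, y, width, height). This is a generic NMS sketch, not the specific procedure of the embodiment; the overlap threshold of 0.5 and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much.

    detections: list of (score, x, y, width, height). The threshold is an assumed value.
    """
    ordered = sorted(detections, key=lambda det: det[0], reverse=True)
    kept = []
    for det in ordered:
        if all(iou(det[1:], other[1:]) <= iou_threshold for other in kept):
            kept.append(det)
    return kept


detections = [
    (0.9, 10, 10, 50, 50),
    (0.8, 12, 12, 50, 50),    # overlaps the first box heavily and is suppressed
    (0.7, 200, 200, 40, 40),  # far from the others and is kept
]
print(non_maximum_suppression(detections))
```

Each surviving tuple corresponds to one final face box with its classification score.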

FIG. 2 is a diagram for explaining machine learning of the proposal network 110 according to an embodiment of the present invention.

In the proposal network 110, each convolutional layer may perform a convolution operation on the target image 100 or on a feature map using a kernel that contains weights.

To generate feature maps in which the face region stands out, the weights must be determined so that the face components are brought out. A deep neural network structure such as that of an embodiment of the present invention contains a very large number of parameters. Consequently, if the initial weights are drawn from a Gaussian distribution, it is difficult to learn, from the labeled position information of facial landmarks, the information needed to localize the face components. In other words, it is difficult to generate feature maps in which the face components are emphasized.

To solve this, weights that already have the property of localizing components may be obtained through transfer learning and used as the initial weights for an embodiment of the present invention.

A facial landmark is a point marked on a characteristic part of the face, such as the eyes, nose, mouth, or ears. Localization is the process of making the face components stand out relative to other parts.

As shown in FIG. 2, an embodiment of the present invention may use the weights of a network 201 that localizes the face components of a cat as the initial weights of a network 202 that localizes the face components of a person.

After the weights obtained through transfer learning are set as the initial values, training to localize human face components can be performed. For this training, the AlexNet structure introduced in A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Proceedings of Advances in Neural Information Processing Systems, pp. 1097-1105, 2012, may be used as the model. The AlexNet structure is a network with the property of distinguishing object regions from background regions.

With this as the base structure, weights that distinguish the face region from the background region can be obtained easily, so the network 202 that localizes face components can be built effectively.

The AlexNet structure may include five convolutional layers (221, 222, 225), three pooling layers (231, 232, 233), and three fully-connected layers. The last fully-connected layer may produce the concatenated position coordinates 250 of the candidate facial landmarks. In one embodiment, the concatenated position coordinates 250 may include the position coordinates of a plurality of facial landmarks indicating the left eye, the right eye, the nose, and the mouth, respectively.

The weights can be updated so as to minimize a loss function that compares the concatenated position coordinates 250 of the candidate facial landmarks with the position coordinates of the facial landmarks present in the actual image. That is, through the loss function given below, the network can be trained to locate the facial landmarks present in the target image.

In one embodiment, a loss function that minimizes the Euclidean distance between the concatenated position coordinates 250 of the candidate facial landmarks and the position coordinates of the facial landmarks present in the actual image may be defined as follows.

(Equation (1) and its symbols appear as images in the original and are not reproduced in this text.)

Here, the symbols of Equation (1) denote, in order: the mini-batch size; the total number of facial landmarks; the concatenated position coordinates 250 of the candidate facial landmarks; and the position coordinates of the facial landmarks present in the actual image. The set of facial landmarks may be defined in vector form.
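Because the exact expression of the loss is not recoverable from this text, the following is only a sketch of one common form of a Euclidean loss over a mini-batch: the sum of squared coordinate differences divided by twice the batch size, which is the convention used by the Euclidean loss layer in Caffe. The normalization actually used in the embodiment may differ, and the function name and argument layout are hypothetical.

```python
def euclidean_loss(predicted, target):
    """Sum of squared coordinate differences over a mini-batch, divided by 2 * batch size.

    predicted, target: lists with one entry per image; each entry is a flat list of
    landmark coordinates [x1, y1, x2, y2, ...]. The 1 / (2N) factor is an assumed
    normalization, not taken from the source.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and the same length")
    batch_size = len(predicted)
    total = 0.0
    for pred_coords, true_coords in zip(predicted, target):
        total += sum((p - t) ** 2 for p, t in zip(pred_coords, true_coords))
    return total / (2 * batch_size)


# One image with a single landmark predicted at (3, 4) instead of (0, 0).
print(euclidean_loss([[3.0, 4.0]], [[0.0, 0.0]]))  # 12.5
```

The loss is zero only when every predicted landmark coincides with its ground-truth position, so minimizing it drives the concatenated output toward the labeled coordinates.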

One embodiment may use a total of 41 facial landmarks, of which 6, 6, 9, and 20 are used to localize the right eye, the left eye, the nose, and the mouth, respectively.

One embodiment may use stochastic gradient descent to minimize the loss function of Equation (1). The Caffe library may be used. The initial learning rate, the momentum, and the factor by which the learning rate is multiplied every epoch are given in the original as values that are not reproduced in this text. The dropout probability of the fully-connected layers may be 0.5.

FIG. 3 is a diagram for explaining a proposal network according to an embodiment of the present invention.

As shown in FIG. 3, the weights or biases obtained through the training described in the embodiment of FIG. 2 may be copied (340) and used as the weights or biases of each layer of the proposal network.

In one embodiment, the first layer 130 of the proposal network includes two convolutional layers 221 and 222 and a pooling layer 231, and each of the second layers 141, 142, and 143 may include three convolutional layers 223, 224, and 225 and one pooling layer 232. The numbers of convolutional and pooling layers in the first and second layers follow the AlexNet structure and are only illustrative, so they may be changed according to the developer's settings.

As shown in FIG. 3, unlike the network shown in FIG. 2, the proposal network (110) according to one embodiment may omit the last pooling layer (233) and the fully connected layer (240). Passing data through the fully connected layer (240) requires input and output data of a fixed size, so every input would first have to be resized to that fixed size. Because this resizing adds complexity, one embodiment omits the pooling layer (233) and the fully connected layer (240) and thereby reduces the complexity spent on resizing the input data.
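Why omitting the fully connected layer removes the need for resizing can be seen from the output-size arithmetic of convolution and pooling: each such layer maps whatever spatial size it receives to a correspondingly sized output. The sketch below uses the common floor-based size formula; the layer stack in `LAYERS` is a hypothetical example and does not reproduce the dimensions of the disclosed network.

```python
def out_size(size, kernel, stride, pad=0):
    """Spatial output size of one convolution or pooling layer (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1


# Hypothetical (kernel, stride, pad) stack; not the dimensions of the disclosed network.
LAYERS = [(11, 4, 0), (3, 2, 0), (5, 1, 2), (3, 2, 0)]


def feature_map_size(input_size, layers=LAYERS):
    """Side length of the final feature map for a square input of any side length."""
    size = input_size
    for kernel, stride, pad in layers:
        size = out_size(size, kernel, stride, pad)
    return size


# Inputs of different sizes pass through unchanged layers; only the map size differs.
sizes = {n: feature_map_size(n) for n in (227, 400, 640)}
```

A fully connected layer appended to this stack would accept only one of these map sizes, which is what would force every input to be resized first.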

As shown in FIG. 3, in the proposal network (110) according to one embodiment, the feature map generated by the first layer (130) may be used in common by the plurality of second layers (141, 142, 143).

In a convolutional neural network, lower layers extract simple features of the target image, such as edges, while higher layers extract complex features, such as object shapes. Therefore, instead of providing a separate stack of layers from the lowest layer upward for each branch, only one first layer (130) is provided and the feature map it generates is sent to the plurality of second layers (141, 142, 143), which removes redundant computation in the lower layers.

In one embodiment, the first pooling layer (232) in each of the plurality of second layers (141, 142, 143) may have a stride of a different size. Even if faces of various sizes are present in the target image (100), pooling with different strides yields a heat map optimized for each face size, which can improve face detection performance.

For example, a second layer whose pooling layer has a small stride is suited to representing a heat map for small faces, and a second layer whose pooling layer has a large stride is suited to representing a heat map for large faces. This is exemplary; heat maps optimized for faces of various sizes may also be obtained by varying the stride of layers other than the first pooling layer.
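The sharing of one feature map among several branches with different pooling strides can be sketched as follows. This is a simplified illustration under stated assumptions: only the pooling step of each branch is shown (the convolutional layers of the second layers are left out), the 6x6 input is a toy stand-in for the output of the first layer, and the names `max_pool2d` and `multi_scale_branches` and the stride values are hypothetical.

```python
def max_pool2d(fmap, kernel, stride):
    """Max-pool a 2-D list of numbers with a square window and no padding."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - kernel + 1, stride):
        pooled.append([
            max(fmap[r + dr][c + dc] for dr in range(kernel) for dc in range(kernel))
            for c in range(0, cols - kernel + 1, stride)
        ])
    return pooled


def multi_scale_branches(shared_fmap, strides, kernel=2):
    """Apply one pooling branch per stride to the same shared feature map."""
    return {s: max_pool2d(shared_fmap, kernel, s) for s in strides}


# Toy 6x6 "feature map" standing in for the output of the first layer.
shared = [[r * 6 + c for c in range(6)] for r in range(6)]
branches = multi_scale_branches(shared, strides=(1, 2, 4))
shapes = {s: (len(m), len(m[0])) for s, m in branches.items()}
```

The shared map is computed once, and each branch produces an output of a different resolution from it: the small stride keeps a fine grid and the large stride a coarse one.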

In one embodiment, the last convolutional layer (225) of each second layer may generate 256 feature maps, and a heat map may be obtained from them through normalizing and scaling. A value [...] that cleanly separates face regions from non-face regions may be set for this heat map, so that face regions can be determined from the heat map and face candidate regions can be generated.
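One way to turn feature maps into a heat map and then into face candidate regions is sketched below. The source does not specify how normalizing and scaling are performed or how regions are cut out, so everything here is an assumption made for illustration: the channels are averaged and min-max normalized, the separating value is treated as a threshold, connected above-threshold cells are grouped into one box, and `cell` is a hypothetical number of image pixels per heat-map cell.

```python
def to_heat_map(channels):
    """Average channel maps and min-max normalize the result to [0, 1]."""
    rows, cols = len(channels[0]), len(channels[0][0])
    mean = [[sum(ch[r][c] for ch in channels) / len(channels) for c in range(cols)]
            for r in range(rows)]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[(v - lo) / span for v in row] for row in mean]


def candidate_boxes(heat, threshold, cell):
    """Return (x, y, w, h) image-space boxes around 4-connected cells >= threshold."""
    rows, cols = len(heat), len(heat[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or heat[r0][c0] < threshold:
                continue
            stack, cells = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:  # flood fill one connected component
                r, c = stack.pop()
                cells.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and heat[nr][nc] >= threshold):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            rs = [r for r, _ in cells]
            cs = [c for _, c in cells]
            boxes.append((min(cs) * cell, min(rs) * cell,
                          (max(cs) - min(cs) + 1) * cell,
                          (max(rs) - min(rs) + 1) * cell))
    return boxes


# Two toy channels instead of 256; a strong 2x2 response and one weak cell.
channels = [
    [[0, 0, 0, 0], [0, 4, 4, 0], [0, 4, 4, 0], [0, 0, 0, 2]],
    [[0, 0, 0, 0], [0, 6, 6, 0], [0, 6, 6, 0], [0, 0, 0, 0]],
]
heat = to_heat_map(channels)
boxes = candidate_boxes(heat, threshold=0.5, cell=16)
```

In this toy input the weak response in the corner falls below the threshold, so only the strong 2x2 response becomes a candidate region.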

FIG. 4 is a diagram illustrating a detection network according to an embodiment of the present invention.

The input patches (410) received from the proposal network already give high face detection performance by themselves, but additional computation may be performed by the detection network to raise the recall rate and to reduce any false positives that may occur. A face candidate region input to the detection network may be referred to as an input patch (410).

As shown in FIG. 4, in one embodiment the detection network may include four convolutional layers (421, 422, 423, 424), four pooling layers (431, 432, 433), and one fully connected layer (440). For an input patch (410), the detection network may generate one-dimensional data (191) as the result of classification and regression, and may detect a face region that is more precise than the face region boxed in the input patch.

The detection network according to one embodiment may regress from a face candidate region to a more precise face region, and may be based on the structure for performing Face Bound Regression disclosed in D. Wang, J. Yang, and Q. Liu, "Hierarchical Convolutional Neural Network for Face Detection," Proceedings of International Conference on Image and Graphics, pp. 373-384, 2015.

In addition to this structure, one embodiment introduces a classification step that determines whether a face is present in the input patch (410), which raises the recall rate and reduces any false positives that may occur.

A value (192) that classifies whether a face is present in the input patch (410) may be presented: a probability of 1 if the input patch (410) contains a face, and 0 if it does not.

Through Face Bound Regression, one embodiment may present precise position information (193) of the face region. The position information may include an x coordinate, a y coordinate, a width, and a height.

In one embodiment, patch 1 (411) contains a face region, so a probability of 1 is presented (451) as the result of classification, and precise position information (452) of the face region may be presented as the result of regression.

Patch 2 (412) does not contain a face region, so a probability of 0 is presented (461) as the result of classification; because the patch is classified as having no face region, its position information may be given a label (462) indicating that the position information of patch 2 (412) is to be ignored.

A loss function is defined for classification and for regression, and the weights of the detection network may be learned by minimizing the loss function of Equation (4) below.

Here, [...] is a tuning parameter. In one embodiment, the loss function for classification may be a cross-entropy loss function as in Equation (5) below, and the loss function for regression may be designed, as in Equation (6) below, to minimize the Euclidean distance between the position information of the precise face region and the position information of the actual face region.

Here, [...] denotes the mini-batch size, [...] is the size of the matrix that defines the position information of the face region, [...] is the probability estimated in classification that the patch is a face, and [...] is the label indicating whether the target is a face region or a non-face region. In addition, [...] and [...] are, respectively, the position information of the precise face region and the actual face position information closest to it.
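A loss of the kind described, a cross-entropy term for classification combined through a tuning parameter with a Euclidean-distance term for regression, can be sketched as follows. The exact forms of Equations (4) to (6) are not reproduced in this text, so the sketch rests on assumptions: the combined loss is taken as the classification loss plus `lam` times the regression loss, the regression term uses the squared Euclidean distance, both terms are averaged, and non-face patches are excluded from the regression term. All function and parameter names are hypothetical.

```python
import math


def classification_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between face probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def regression_loss(pred_boxes, true_boxes, labels):
    """Mean squared Euclidean distance over face patches only.

    Patches labelled 0 (no face) are skipped, mirroring the "ignore" label.
    """
    total, count = 0.0, 0
    for pred, true, y in zip(pred_boxes, true_boxes, labels):
        if y == 1:
            total += sum((a - b) ** 2 for a, b in zip(pred, true))
            count += 1
    return total / count if count else 0.0


def detection_loss(probs, labels, pred_boxes, true_boxes, lam=1.0):
    """Classification loss plus lam times regression loss (assumed form)."""
    return (classification_loss(probs, labels)
            + lam * regression_loss(pred_boxes, true_boxes, labels))


# Mini-batch of two patches: one face, one non-face. Boxes are (x, y, w, h).
probs = [0.9, 0.2]
labels = [1, 0]
pred = [(10.0, 12.0, 40.0, 40.0), (0.0, 0.0, 0.0, 0.0)]
true = [(11.0, 12.0, 42.0, 40.0), (0.0, 0.0, 0.0, 0.0)]
loss = detection_loss(probs, labels, pred, true, lam=0.5)
```

Raising `lam` shifts the balance toward accurate box positions, and lowering it shifts the balance toward the face or non-face decision.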

Stochastic gradient descent may be used to minimize the loss function of Equation (4). The Caffe library may be used; the initial learning rate is [...], and, with a momentum of [...], the learning rate may be multiplied by [...] every given number of epochs. The dropout probability of the fully connected layer may be 0.5.

Negative example mining (hard sample mining) may be used to train the detection network. Rather than training a convolutional neural network on a large number of generic examples, this technique extracts a small set of informative examples that represent the objective well and trains a network that copes well with a specific situation. That is, the face candidate regions (170) output by the proposal network already show high face detection performance by themselves, so training the detection network on them can maximize performance.

Because the proposal network and the detection network are connected in series in the face detection method according to an embodiment of the present invention, the data that the detection network must process depends very directly on the performance of the proposal network. Most face candidate regions output by the proposal network are likely to be patches that closely resemble faces.

Therefore, under negative example mining, the detection network may be trained using those patches, among the face candidate regions generated by the proposal network, that definitely contain a face region.
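One possible way to build a training set for the detection network from the outputs of the proposal network is sketched below. The source does not state the selection criterion, so the use of intersection over union (IoU) with annotated faces, both thresholds, and the names `iou` and `mine_training_patches` are assumptions made for illustration. Proposals that overlap an annotated face strongly are kept as face patches, and proposals that barely overlap any face are kept as hard non-face patches, since the proposal network still judged them face-like.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def mine_training_patches(proposals, ground_truth, pos_iou=0.5, neg_iou=0.3):
    """Split proposals into face patches and hard (face-like) non-face patches."""
    positives, hard_negatives = [], []
    for box in proposals:
        best = max((iou(box, gt) for gt in ground_truth), default=0.0)
        if best >= pos_iou:
            positives.append(box)
        elif best < neg_iou:
            hard_negatives.append(box)
        # proposals in between are ambiguous and left out
    return positives, hard_negatives


gt = [(10, 10, 20, 20)]
props = [(12, 12, 20, 20), (100, 100, 20, 20), (15, 15, 20, 20)]
pos, neg = mine_training_patches(props, gt)
```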

FIG. 5 is a flowchart of a face detection method according to an embodiment of the present invention.

Referring to FIG. 5, one embodiment may perform training in advance in order to detect faces (510).

Training the proposal network may include taking as the base structure a pre-trained model that distinguishes object regions from background regions, setting initial weights through transfer learning, and then learning to distinguish face regions from non-face regions. It may also include training with a loss function that minimizes the Euclidean distance between the position coordinates (250) of the sequentially concatenated candidate facial feature points and the position coordinates of the facial feature points present in the actual image.

The detection network may be trained so that, after receiving face candidate regions from the proposal network, it generates more precise face regions.

Training the detection network may include using negative example mining (hard sample mining), using a cross-entropy loss function as the loss function for classification, and, for regression to a precise face region, using a function that minimizes the Euclidean distance between the position information of the precise face region and the position information of the actual face region.

When a target image to be subjected to face detection is input (520), the proposal network may extract a plurality of face candidate regions corresponding to the target image (530). The extracting may include converting the target image into a feature map in the first layer, converting the feature map into a heat map in each of the plurality of second layers, and converting the heat maps into face candidate regions.

Here, the plurality of second layers may use in common the feature map produced by the first layer. In addition, the pooling layer that may be included in each second layer has a stride of a different size for each second layer, so heat maps optimized for faces of various sizes can be generated. Because improved detection performance is thus obtained for faces of various sizes, a face detection method robust to wild environments can be provided.

The detection network may receive face candidate regions from the proposal network, classify whether a face region is present in each face candidate region, and regress to a face region more precise than the face candidate region (540). Regression here means adjusting the position coordinates of the face candidate region received from the proposal network so that they approach the position coordinates of the actual face region. The Face Bound Regression proposed in D. Wang, J. Yang, and Q. Liu, "Hierarchical Convolutional Neural Network for Face Detection," Proceedings of International Conference on Image and Graphics, pp. 373-384, 2015, may be used as the regression method.

The precise face candidate regions generated by the detection network may be presented as final face regions through post-processing (550). The post-processing may be Non-Maximum Suppression (NMS).
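Non-Maximum Suppression keeps the highest-scoring box among boxes that overlap heavily and discards the rest. The sketch below shows the usual greedy form of the algorithm; the overlap threshold of 0.5, the use of the face probability as the score, and the (x, y, w, h) box layout are assumptions, since the source names the technique without giving its parameters.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 20, 20), (12, 12, 20, 20), (100, 100, 20, 20)]
scores = [0.8, 0.95, 0.6]
kept = non_max_suppression(boxes, scores)
```

Here the first two boxes cover the same face, so only the higher-scoring one survives, while the distant third box is kept.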

FIG. 6 is a block diagram of a face detection system according to an embodiment of the present invention.

Referring to FIG. 6, the face detection system may include a proposal unit (610) and a detection unit (640). On receiving the target image (100), the proposal unit (610) may extract face candidate regions and propose them to the detection unit (640), and the detection unit (640) may receive the face candidate regions and detect more precise face candidate regions.
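The data flow between the two units can be sketched as a small pipeline. This is only a structural illustration: the three callables passed to `detect_faces` are hypothetical stand-ins for the proposal unit, the detection unit, and the post-processing step, and the toy implementations below exist only so that the sketch runs end to end.

```python
def detect_faces(image, propose, classify_and_regress, post_process):
    """Run proposal -> detection -> post-processing on one image.

    The three callables are hypothetical stand-ins for the proposal unit,
    the detection unit, and the post-processing step.
    """
    candidates = propose(image)
    refined = []
    for patch in candidates:
        face_prob, box = classify_and_regress(image, patch)
        if box is not None:  # None plays the role of the "ignore" label
            refined.append((face_prob, box))
    return post_process(refined)


# Toy stand-ins so the sketch runs end to end.
def toy_propose(image):
    return [(0, 0, 8, 8), (4, 4, 8, 8)]


def toy_classify_and_regress(image, patch):
    x, y, w, h = patch
    is_face = x == 0  # pretend only the first patch contains a face
    return (1.0, (x + 1, y + 1, w, h)) if is_face else (0.0, None)


def toy_post_process(refined):
    return [box for _, box in refined]


faces = detect_faces(None, toy_propose, toy_classify_and_regress, toy_post_process)
```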

The proposal unit (610) may include a first layer unit (620) and a second layer unit (630).

The first layer unit (620) may include a plurality of convolutional layers and a pooling layer, and may generate a feature map corresponding to the target image.

The plurality of second layer units (630) use in common the feature map generated by the first layer unit, and each of them may generate a heat map corresponding to the feature map. Each second layer unit (630) may include a plurality of serially connected convolutional layers and a pooling layer, and the stride of the pooling layer in one second layer unit may differ from the stride of the corresponding pooling layer in another second layer unit. Using different strides makes it possible to generate heat maps optimized for faces of various sizes in the target image. This is exemplary; besides the pooling layers, the strides of the convolutional layers in the second layer units may also differ.

In one embodiment, the proposal unit (610) may take as its base structure a pre-trained model that distinguishes object regions from background regions, set initial weights through transfer learning, and learn to distinguish face regions from non-face regions. It may also be trained with a loss function that minimizes the Euclidean distance between the position coordinates (250) of the sequentially concatenated candidate facial feature points and the position coordinates of the facial feature points present in the actual image.

The proposal unit (610) may generate face candidate regions from the generated heat maps and send them to the detection unit (640).

In one example, the detection unit (640) may include a plurality of convolutional layers, pooling layers, and a fully connected layer. The detection unit (640) may process the face candidate regions received from the proposal unit (610) through these layers and, as a result, generate one-dimensional data representing the position coordinates of the precise face candidate regions.

The detection unit (640) may include a classification unit (650) and a regression unit (660). In one embodiment, the classification unit (650) and the regression unit (660) may be the same network, including a plurality of convolutional layers and pooling layers. The classification unit (650) may determine whether a face is present in a face candidate region, presenting a probability of 1 if it determines that a face is present and a probability of 0 if it determines that none is present. The regression unit (660) may detect the face region more precisely than the face candidate region boxed in the input patch. More specifically, if the classification unit (650) determines that a face is present in the face candidate region, the regression unit (660) may present refined position information of the face region through Face Bound Regression. The position information may include an x coordinate, a y coordinate, a width, and a height. Conversely, if the classification unit (650) determines that no face is present in the face candidate region, the regression unit (660) may give the face candidate region a label indicating that the boxed position information is to be ignored.

The detection unit may be trained so that, after receiving face candidate regions from the proposal unit, it generates more precise face regions.

For example, the detection unit may be trained using negative example mining (hard sample mining), using a cross-entropy loss function as the loss function for classification, and, for regression to a precise face region, using a function that minimizes the Euclidean distance between the position information of the precise face region and the position information of the actual face region.

When the detection unit (640) generates information on the refined face candidate regions, the final face regions (670) may be presented through post-processing. The post-processing may be Non-Maximum Suppression (NMS).

Although the present invention has been described in detail above with reference to a limited set of embodiments, the present invention is not limited to those embodiments. A person of ordinary skill in the art to which the invention pertains can readily change and modify it on the basis of the claims and the description, and such implementations also fall within the scope of the claims of the present invention.

Claims

A face detection method comprising:
Extracting a plurality of different face candidate regions based on a plurality of heat maps representing face components included in an image, the plurality of heat maps being generated by applying different convolution and pooling schemes to the image; and
Detecting a face region included in the image based on the plurality of different face candidate regions.

The method according to claim 1,
Wherein the extracting of the plurality of different face candidate regions comprises:
Converting the image into feature maps through a first layer comprising at least one first convolutional layer that performs convolution and at least one first pooling layer that performs pooling; and
Converting the feature maps into heat maps through each of a plurality of second layers, each comprising at least one second convolutional layer and at least one second pooling layer, to extract the plurality of different face candidate regions,
Wherein the plurality of second layers use the feature maps in common to convert the feature maps into the heat maps.

The method according to claim 2,
Wherein the converting of the feature maps into heat maps comprises:
Each of the plurality of second layers successively performing convolution and pooling operations to convert the feature maps into heat maps,
Wherein a layer included in one of the plurality of second layers and the layer at the corresponding position in the operation order of another second layer have strides of different sizes, so that the plurality of second layers generate heat maps optimized for faces of different sizes.

The method according to claim 1,
Wherein the detecting of the face region comprises:
Classifying each face candidate region as to the presence or absence of a face by determining whether a face region exists in the face candidate region; and
Regressing to a precise face candidate region based on the classification and the face candidate regions.

The method according to claim 4,
Wherein the classifying comprises:
Presenting a probability of 1 if the face region is included and a probability of 0 if the face region is not included.

The method according to claim 4,
Wherein the regressing to the precise face candidate region comprises:
Presenting the position information of the face region if the face region is classified as present in the classifying, and giving a label indicating that the position information of the face region is to be ignored if the face region is classified as absent.

The method according to claim 1,
Wherein the detecting of the face region comprises:
Detecting the face region from the face candidate regions through at least one convolutional layer, at least one pooling layer, and at least one fully connected layer (FCL).

The method according to claim 1,
Wherein the extracting of the face candidate regions comprises:
Learning to distinguish face regions from non-face regions based on a neural network model that distinguishes object regions from non-object regions; and
Learning face candidate region extraction using an image database including one or more facial landmarks.

The method according to claim 1,
Wherein the detecting of the face region comprises:
Learning, through a negative example mining technique, using the face candidate regions as a database.

A machine learning based face detection system comprising:
A proposal unit that extracts a plurality of different face candidate regions based on a plurality of heat maps representing face components included in an image, the plurality of heat maps being generated by applying different convolution and pooling schemes to the image, and proposes them to a detection unit; and
The detection unit, which detects a face region included in the image based on the plurality of different face candidate regions.

The system according to claim 10,
Wherein the proposal unit comprises:
A first layer that converts the image into feature maps through at least one first convolutional layer that performs convolution and at least one first pooling layer that performs pooling; and
A plurality of second layers that convert the feature maps into heat maps through at least one second convolutional layer and at least one second pooling layer to extract the plurality of different face candidate regions,
Wherein the plurality of second layers use the feature maps in common to convert the feature maps into the heat maps.

The system according to claim 10,
Wherein, in the proposal unit,
Each of the plurality of second layers successively performs convolution and pooling operations to convert the feature maps into heat maps, and
A layer included in one of the plurality of second layers and the layer at the corresponding position in the operation order of another second layer have strides of different sizes.

The system according to claim 10,
Wherein the detection unit comprises:
A classification unit that classifies the presence or absence of a face by determining whether a face exists in the face candidate region; and
A regression unit that regresses to a precise face candidate region based on the classification and the face candidate regions.

The system according to claim 13,
Wherein the classification unit
Presents a probability of 1 if the face region is included and a probability of 0 if the face region is not included.

The system according to claim 13,
Wherein the regression unit
Presents the position information of the face region if the classification unit classifies the face candidate region as containing a face region, and gives a label indicating that the position information of the face region is to be ignored if the classification unit classifies it as not containing a face region.

The system according to claim 10,
Wherein the detection unit
Comprises at least one convolutional layer, at least one pooling layer, and at least one fully connected layer (FCL) for detecting the face region from the face candidate regions.