KR20160098581A

KR20160098581A - Method for certification using face recognition and speaker verification

Info

Publication number: KR20160098581A
Application number: KR1020150019467A
Authority: KR
Inventors: 조성원; 정성모
Original assignee: 홍익대학교 산학협력단
Priority date: 2015-02-09
Filing date: 2015-02-09
Publication date: 2016-08-19

Abstract

The present invention relates to an authentication method that fuses face recognition and speaker recognition: a face and a voice are each recognized, and the two results are fused to perform authentication. One embodiment of the present invention comprises: a face feature point extracting step of detecting a face image in an entire image and extracting a face feature vector from the detected face image; a voice feature point extracting step of detecting a voice in audio and extracting voice feature data from the detected voice; a face similarity calculating step of calculating a face similarity between a previously registered feature vector and the face feature vector extracted from the face image; a voice similarity calculating step of calculating a voice similarity between previously registered feature data and the voice feature data extracted from the voice; and an identity authentication step of performing identity authentication by applying the face similarity and the voice similarity to an artificial neural network model.

Description

[0001] The present invention relates to an authentication method that fuses face recognition and speaker recognition, and more particularly to a method that recognizes a face, recognizes a voice, and fuses the two results to perform authentication.

The 21st century is called the information age. With the spread of the Internet, collecting and analyzing the information one wants has become simple and convenient, enriching many aspects of daily life. However, as the Internet has formed a global network and made life more convenient, incidents in which important personal information is stolen or destroyed by others have occurred one after another and have become a serious problem. Thus, although the development of the Internet and IT has made life more comfortable, the rise in hacking and other cybercrime has led to personal information being leaked and shared and to financial losses.

Conventionally, user passwords or PINs (Personal Identification Numbers) have been widely used for user authentication. However, these methods carry many risks: a credential can be exposed through carelessness, and a lost credential can be used illegally. User authentication based solely on a password or PIN is therefore regarded as insufficient to safeguard important information belonging to individuals, industry, and the state, and alternative solutions are being sought.

Accordingly, there is demand for more secure and more convenient methods of identity authentication, for example for card use. As one solution, biometrics, that is, recognition technology that verifies a person's identity from physical or behavioral characteristics unique to the individual, has recently come to the fore.

In particular, biometric technologies that users do not find objectionable and that are convenient to use are being developed especially actively.

Korean Patent Publication No. 10-2006-0101244

A technical object of the present invention is to present a face feature vector and a voice feature vector and to apply an authentication algorithm that uses these values for user authentication to an artificial neural network model. Another technical object of the present invention is to provide a stable authentication method free of authentication errors.

An embodiment of the present invention may include: a face feature point extraction process of detecting a face image in an entire image and extracting a face feature vector from the detected face image; a voice feature point extraction process of detecting a voice in audio and extracting voice feature data from the detected voice; a face similarity calculation process of calculating a face similarity between a previously registered feature vector and the face feature vector extracted from the face image; a voice similarity calculation process of calculating a voice similarity between previously registered feature data and the voice feature data extracted from the voice; and an identity authentication process of performing identity authentication by applying the face similarity and the voice similarity to an artificial neural network model.

In the method of claim 1, the face feature point extraction process may include: a face image detection process of detecting a face image in the entire image in stages; a normalization process of normalizing the size and color of the detected face image to set values; and a face feature vector extraction process of extracting a face feature vector from the normalized face image.

The staged detection of the face image may be performed using the AdaBoost method.

The normalization process may include a size normalization process of converting the detected face image to a preset size and outputting it as a size-normalized face image.

After the size normalization process is complete, the normalization process may include an angle normalization process of determining the tilt angle and tilt direction of the resized face image, rotating the face image by the tilt angle in the direction opposite to the tilt direction, and outputting the result as a normalized face image.

In the angle normalization process, determining the tilt angle may include: a process of rotating the resized face image in 2.5° steps, extracting valleys and edges in the vertical direction at each step, and calculating a vertical-direction histogram of the valleys and edges; and a process of determining, as the tilt angle, the rotation angle at which the vertical-direction histogram calculated while rotating the resized face image in 2.5° steps has the largest variance. Here, a valley is a region of the resized face image in which an image pixel is darker than its surrounding pixels and the brightness difference between the image pixel and the surrounding pixels exceeds a predetermined threshold, and an edge is a facial contour line of the resized face image.

After the angle normalization process, the normalization process may include an illumination normalization process of removing the effect of illumination from the normalized face image and outputting the result as an illumination-normalized face image.

Removing the effect of illumination may consist of removing multiplicative noise and additive noise from the normalized face image and outputting the result as the illumination-normalized face image.

Removing the multiplicative noise and the additive noise from the normalized face image may be characterized in that, where x and y are pixel coordinates, I'(x, y) is the pixel function of the face image affected by illumination, I(x, y) is the pixel function of the normalized face image from which the effect of illumination has been removed, E is the mean pixel value of the normalized face image, and VAR is the variance, I(x, y) is calculated and output by the following equation.

[Equation not reproduced in the source text]

The face feature vector extraction process may include: a process of calculating, from the normalized face image, a plurality of Gabor magnitude feature vectors expressed in polar coordinates; and a process of determining a face graph, which is the collection of the Gabor magnitude feature vectors, as the face feature vector.

The plurality of Gabor magnitude feature vectors may be Gabor magnitude feature vectors extracted at grid positions in the normalized face image.
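The extraction of Gabor magnitude feature vectors at grid positions can be illustrated with a minimal sketch. The kernel size, wavelength, sigma, the two orientations, and the grid step are assumptions chosen for a toy 16 × 16 image; the function names are hypothetical, and the source does not specify these parameters.

```python
import cmath
import math

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel: Gaussian envelope times a complex plane wave."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)  # coordinate along the wave
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            carrier = cmath.exp(1j * 2.0 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_magnitude(image, cx, cy, kernel):
    """Magnitude (polar radius) of the complex response centred at (cx, cy)."""
    half = len(kernel) // 2
    h, w = len(image), len(image[0])
    acc = 0j
    for ky, row in enumerate(kernel):
        for kx, k in enumerate(row):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < h and 0 <= x < w:
                acc += image[y][x] * k
    return abs(acc)

def face_graph(image, step, kernels):
    """One magnitude vector per grid node; the list of vectors is the face graph."""
    h, w = len(image), len(image[0])
    graph = []
    for cy in range(step // 2, h, step):
        for cx in range(step // 2, w, step):
            graph.append([gabor_magnitude(image, cx, cy, k) for k in kernels])
    return graph

# Toy image: vertical stripes with a period of 4 pixels.
image = [[1.0 if (x // 2) % 2 == 0 else 0.0 for x in range(16)] for _ in range(16)]
# Two orientations: wave along x (theta = 0) and wave along y (theta = pi / 2).
kernels = [gabor_kernel(7, 4.0, t, 2.0) for t in (0.0, math.pi / 2)]
graph = face_graph(image, 8, kernels)
```

On this striped image the kernel whose wave runs across the stripes responds more strongly than the perpendicular one at every grid node, which is the orientation selectivity the magnitude vectors are meant to capture.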

The process of extracting the face graph as the face feature vector may include: a process of generating an eigenface Gabor space that is independent of illumination effects; a process of reducing the data size by projecting the Gabor magnitude feature vectors extracted at the grid positions onto the eigenface Gabor space; and a process of extracting, as the face feature vector, a face graph that is the collection of the Gabor magnitude feature vectors projected onto the eigenface Gabor space.

The voice feature point extraction process may include: a voice detection process of detecting a voice in the entire audio using the average energy method and the zero-crossing rate; and a voice feature data extraction process of extracting voice feature data from the detected voice.

The voice detection process may include: a process of detecting a voiced region in the entire audio by the average energy method; a process of determining, using the zero-crossing rate, whether a fricative is present immediately before the voiced region; and a process of judging the fricative to be part of the voice when a fricative is present immediately before the voiced region.
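The voice detection steps can be sketched as follows. The frame size, the two thresholds, and the synthetic signal are assumptions made for illustration, and `detect_speech` is a hypothetical helper; the source does not give concrete values.

```python
import math

def frames(signal, size):
    """Split the signal into non-overlapping frames of equal length."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def average_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_speech(signal, size, energy_thr, zcr_thr):
    """Return (first_frame, last_frame) of the detected voice, or None."""
    fr = frames(signal, size)
    # Step 1: voiced frames are those with high average energy.
    voiced = [i for i, f in enumerate(fr) if average_energy(f) > energy_thr]
    if not voiced:
        return None
    start, end = voiced[0], voiced[-1]
    # Step 2: a low-energy frame with a high zero-crossing rate just before
    # the voiced region is treated as a fricative and included in the voice.
    while start > 0 and zero_crossing_rate(fr[start - 1]) > zcr_thr:
        start -= 1
    return start, end

frame_size = 80
silence = [0.0] * frame_size
fricative = [0.05 * (-1) ** n for n in range(frame_size)]            # weak, noisy-like
voiced = [math.sin(2.0 * math.pi * n / 40.0) for n in range(2 * frame_size)]
signal = silence * 2 + fricative + voiced + silence
segment = detect_speech(signal, frame_size, energy_thr=0.1, zcr_thr=0.3)
```

In this toy signal the energy test alone finds frames 3 to 4, and the zero-crossing test pulls the start back to frame 2, where the fricative-like segment lies.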

The voice feature data extraction process may include: a process of applying a fast Fourier transform (FFT) to the detected voice; a process of filtering the FFT-transformed voice data with a Mel filter, which is a uniform filter; a process of applying a discrete cosine transform (DCT) that removes correlation from the filtered voice data; a process of extracting voice feature data from the DCT-transformed voice data as MFCCs (Mel-Frequency Cepstral Coefficients); and a process of normalizing the number of voice feature data items per frame obtained by the MFCC step.
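A minimal sketch of the FFT, Mel filter, and DCT chain for a single frame is given below. It assumes the common mel formula 2595·log10(1 + f/700), triangular filters whose centres are equally spaced on the mel axis, a naive DFT in place of an optimized FFT, and illustrative values for the sample rate, filter count, and coefficient count. The per-frame feature-count normalization is not shown.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (an FFT gives the same values)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the mel axis."""
    top = hz_to_mel(sample_rate / 2.0)
    points = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [p * n_fft / sample_rate for p in points]  # fractional bin positions
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, num_coeffs):
    """DCT-II, which decorrelates the log filter-bank energies."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(num_coeffs)]

def mfcc(frame, sample_rate, num_filters=10, num_coeffs=5):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
    return dct2(log_energies, num_coeffs)

# One 64-sample frame of a 1 kHz tone sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / 8000.0) for t in range(64)]
spec = power_spectrum(frame)
coeffs = mfcc(frame, 8000)
```

The 1 kHz tone falls exactly on DFT bin 8 for this frame length, so the spectrum peaks there before the mel filters pool it into band energies.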

The face similarity calculation process may calculate, as the face similarity, a correlation value between the previously registered feature vector and the face feature vector extracted from the face image.

The voice similarity calculation process may calculate, as the voice similarity, a correlation value between the previously registered feature data and the voice feature data extracted from the voice.
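The source does not state which correlation measure is used. As an assumption, the sketch below uses the Pearson correlation coefficient between a registered vector and a probe vector of equal length; the same function would serve for face and voice features.

```python
import math

def correlation(registered, probe):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(registered)
    mean_r = sum(registered) / n
    mean_p = sum(probe) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(registered, probe))
    var_r = sum((r - mean_r) ** 2 for r in registered)
    var_p = sum((p - mean_p) ** 2 for p in probe)
    if var_r == 0 or var_p == 0:
        return 0.0  # a constant vector carries no pattern to correlate
    return cov / math.sqrt(var_r * var_p)

same_shape = correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The value lies in [-1, 1]: it is 1 for vectors with the same shape regardless of scale and -1 for reversed ones.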

The identity authentication process may include: a process of outputting a normalized face similarity obtained by normalizing the face similarity to a number in the range 0 to 1, and a normalized voice similarity obtained by normalizing the voice similarity to a number in the range 0 to 1; and an artificial neural network model application process of performing identity authentication by applying the normalized face similarity and the normalized voice similarity to the artificial neural network model.

The artificial neural network model application process may include: a process of training an artificial neural network model consisting of an input layer, a hidden layer, and an output layer by applying a learning algorithm in which experimental data and the desired output values are fed to the network together; and a process of performing identity authentication by applying the normalized face similarity and the normalized voice similarity to the artificial neural network model trained by the learning algorithm.
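The normalization and neural network steps can be sketched with a tiny network that has two inputs, one hidden layer, and one output. Everything concrete here is an assumption for illustration: min-max normalization of a correlation in [-1, 1], three hidden units, sigmoid activations, stochastic gradient descent on a cross-entropy loss, and the made-up training pairs. `FusionNet` is a hypothetical name.

```python
import math
import random

def normalize(score, lo, hi):
    """Min-max scale a raw similarity into [0, 1], clipping out-of-range values."""
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class FusionNet:
    """Input layer (2 similarities), one hidden layer, one output unit."""

    def __init__(self, hidden=3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, h)) + self.b2)
        return h, y

    def loss(self, data):
        eps = 1e-12
        total = 0.0
        for x, t in data:
            y = min(1.0 - eps, max(eps, self.forward(x)[1]))
            total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
        return total / len(data)

    def train(self, data, epochs=5000, lr=0.5):
        for _ in range(epochs):
            for x, t in data:
                h, y = self.forward(x)
                dy = y - t  # cross-entropy gradient at the output pre-activation
                for j, hj in enumerate(h):
                    dh = dy * self.w2[j] * hj * (1.0 - hj)  # uses w2 before its update
                    self.w2[j] -= lr * dy * hj
                    self.w1[j][0] -= lr * dh * x[0]
                    self.w1[j][1] -= lr * dh * x[1]
                    self.b1[j] -= lr * dh
                self.b2 -= lr * dy

    def authenticate(self, face_sim, voice_sim, threshold=0.5):
        return self.forward((face_sim, voice_sim))[1] > threshold

# Made-up (normalized face similarity, normalized voice similarity) -> desired output.
data = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.95, 0.85), 1),   # genuine user
        ((0.2, 0.3), 0), ((0.3, 0.1), 0), ((0.1, 0.2), 0),     # impostor
        ((0.9, 0.2), 0), ((0.2, 0.9), 0)]                      # only one modality matches
net = FusionNet()
loss_before = net.loss(data)
net.train(data)
loss_after = net.loss(data)
face_score = normalize(0.8, -1.0, 1.0)  # e.g. a correlation of 0.8 becomes 0.9
```

Feeding both similarities to one trained model lets the decision depend on their combination, so a high score in a single modality need not be enough to accept the user.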

According to an embodiment of the present invention, stable user authentication can be performed by applying face recognition and voice recognition data to an artificial neural network model. Accordingly, by using a trained artificial neural network, errors in face recognition and voice recognition can be reduced.

FIG. 1 is a flowchart showing an authentication process that fuses face recognition and speaker recognition according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a face feature point extraction process according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a voice feature point extraction process according to an embodiment of the present invention.
FIG. 4 is a diagram showing how, among serially connected tree-structured classifiers according to an embodiment of the present invention, the front-stage classifier detects the face region using a simplified face pattern.
FIG. 5 is a diagram showing a detected face image size-normalized to 256 × 256 pixels according to an embodiment of the present invention.
FIG. 6 is a diagram showing edges and valleys in a tilted face region image according to an embodiment of the present invention.
FIG. 7 is a diagram showing vertical histograms of edges and valleys in a tilted face region image according to an embodiment of the present invention.
FIG. 8 is a diagram showing edges and valleys in an upright face region image according to an embodiment of the present invention.
FIG. 9 is a diagram showing vertical histograms of edges and valleys in an upright face region image according to an embodiment of the present invention.
FIG. 10 is a diagram showing the result of angle normalization according to an embodiment of the present invention.
FIG. 11 is a diagram showing a face image before and after removal of the illumination effect according to an embodiment of the present invention.
FIG. 12 is a diagram showing Gabor wavelet masks according to an embodiment of the present invention.
FIG. 13 is a diagram showing Gabor magnitude feature vectors extracted at grid positions of a normalized face image according to an embodiment of the present invention.
FIG. 14 is a diagram showing eigenface images used to identify facial differences between different persons according to an embodiment of the present invention.
FIG. 15 is a diagram showing illumination images of the same person extracted under various lighting environments according to an embodiment of the present invention.
FIG. 16 is a graph showing the relationship between mel frequency and frequency according to an embodiment of the present invention.
FIG. 17 is a diagram showing clustering using K-means according to an embodiment of the present invention.
FIG. 18 is a diagram showing images used for face recognition according to an embodiment of the present invention.
FIG. 19 is a diagram showing an example waveform of a voice used in an experiment according to an embodiment of the present invention.
FIG. 20 is a diagram showing similarity results for voice and face according to an embodiment of the present invention.
FIG. 21 is a diagram showing normalization of the face recognition and voice recognition similarities according to an embodiment of the present invention.
FIG. 22 is a diagram showing similarity results for voice recognition according to an embodiment of the present invention.
FIG. 23 is a diagram showing the structure of an artificial neural network.
FIG. 24 is a diagram showing the structure of an artificial neural network for identity verification according to an embodiment of the present invention.

Hereinafter, the most preferred embodiment of the present invention is described in detail with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the invention pertains can readily practice it. The objects and effects of the invention, as well as other objects, features, and operational advantages, will become clearer from the description of the preferred embodiment. Note that, in assigning reference numerals to the elements of each drawing, the same elements are given the same numerals wherever possible, even when they appear in different drawings.

FIG. 1 is a flowchart showing an authentication process that fuses face recognition and speaker recognition according to an embodiment of the present invention, FIG. 2 is a flowchart showing a face feature point extraction process according to an embodiment of the present invention, and FIG. 3 is a flowchart showing a voice feature point extraction process according to an embodiment of the present invention.

Biometric technologies make use of a variety of human physiological and behavioral characteristics, such as fingerprint, face, voice, and iris recognition. Of these, fingerprint recognition offers the best stability and has already been commercialized and widely adopted.

The most important requirements for a biometric trait are that everyone has it (universal), that it is unique to each person (unique), and that it does not change and cannot be altered (permanent). Fingerprints, however, can be worn away by manual labor and are hard to read when the skin is dry or when foreign matter is on the finger. Above all, people may be reluctant to use a device that must be touched by many other people, so research and development on other biometric modalities is under way.

Accordingly, the present invention proposes a stable authentication method with high security and reliability by applying the features and similarities of face recognition and speaker recognition, which users do not find objectionable and which are convenient to use, to an artificial neural network model. The authentication method fusing face recognition and speaker recognition is carried out on a device capable of computation, such as a CPU.

For authentication that fuses face recognition and speaker recognition, a face feature point extraction process (S100) is first performed, in which a face image is detected in the entire image acquired through a camera and a face feature vector is extracted from the detected face image.

The image used to recognize and detect a face contains both the face and the background, so accurate face detection in the entire input image must come first. Moreover, in a face image acquired in an uncontrolled environment, the size, pose, hairstyle, and color (chrominance) distribution vary with distance, position, race, and illumination, so face detection based on simple matching of facial brightness or color patterns (intensity or chrominance patterns) has its limits. Currently known face detection based on face color hue can detect faces with little computation but is sensitive to changes in the environment.

To overcome these drawbacks and to extract face regions of various sizes and poses from a complex background accurately and in real time, the face feature point extraction process (S100) of the present invention includes, as shown in FIG. 2, a face image detection process (S110), a normalization process (S120), and a face feature vector extraction process (S130).

The face image detection process (S110) detects the face image in the entire image in stages, using the AdaBoost method. The AdaBoost method is described in detail in "An efficient illumination normalization method for face recognition" by Xudong Xie and Kin-Man Lam, published in Pattern Recognition Letters.

For reference, the AdaBoost method is briefly described as follows. To detect a face image, the face region is detected in stages, as shown in FIG. 2. The face pattern used in each stage is one represented at a low resolution of M × M (for example, 21 × 21). This approach rests on the fact that the human perceptual system can find faces even in low-resolution images. Even at low resolution the number of possible face patterns is still enormous (a 21 × 21 binary image can represent 2⁴⁴¹ ≈ 5.7 × 10¹³² patterns), which makes learning difficult with conventional methods. To address this, the face pattern is simplified into combinations of simple basis patterns (for a 21 × 21 binary image there are about 100,000 combinations of Haar basis patterns); all face patterns that can be expressed this way are learned offline with a boosting algorithm, and the learned patterns are then used to detect faces online in real time. This approach has been applied successfully. Specifically, as in [Figure 3-1], several tree-structured classifiers (tree-classifiers) are connected in series to detect faces against the background.
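The serial connection of classifiers can be sketched as a cascade in which each stage takes an AdaBoost-style weighted vote of weak classifiers and a window is rejected as soon as one stage fails. The two features, the thresholds, the weights, and the 4 × 4 windows are hypothetical values chosen for illustration, not learned ones.

```python
def mean(rows):
    values = [v for row in rows for v in row]
    return sum(values) / len(values)

def lower_brighter(window):
    """Lower half minus upper half (a dark eye band above brighter cheeks)."""
    half = len(window) // 2
    return mean(window[half:]) - mean(window[:half])

def centre_brighter(window):
    """Centre columns minus outer columns."""
    centre = [row[1:-1] for row in window]
    outer = [[row[0], row[-1]] for row in window]
    return mean(centre) - mean(outer)

def stage_score(window, weak_classifiers):
    """Weighted vote: each weak classifier adds its weight when its feature fires."""
    return sum(alpha for feature, threshold, alpha in weak_classifiers
               if feature(window) > threshold)

def run_cascade(window, stages):
    """Return (accepted, stages_evaluated); stop at the first failing stage."""
    for index, (weak_classifiers, stage_threshold) in enumerate(stages, start=1):
        if stage_score(window, weak_classifiers) < stage_threshold:
            return False, index
    return True, len(stages)

stages = [
    ([(lower_brighter, 0.1, 1.0)], 1.0),                               # simple front stage
    ([(lower_brighter, 0.1, 0.6), (centre_brighter, 0.05, 0.6)], 1.0)  # stricter later stage
]

face_like = [[0.2, 0.3, 0.3, 0.2], [0.2, 0.3, 0.3, 0.2],
             [0.6, 0.9, 0.9, 0.6], [0.6, 0.9, 0.9, 0.6]]
flat = [[0.5] * 4 for _ in range(4)]
bands_only = [[0.2] * 4, [0.2] * 4, [0.8] * 4, [0.8] * 4]

results = [run_cascade(w, stages) for w in (face_like, flat, bands_only)]
```

Most background windows resemble `flat` and are discarded by the cheap first stage, so the more expensive later stages run only on the few windows that survive.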

As shown in FIG. 4, among the serially connected tree-structured classifiers, the front-stage classifier detects the face region using a simplified face pattern, and later stages detect it using progressively more complex face patterns. Each Haar-based basis pattern is composed of two to four rectangular blocks within the given face region. The value of each rectangular block is the sum of all pixel values inside the rectangle multiplied by a weight. The sum of all pixel values inside a rectangle can be computed with about ten addition operations using a sum image.
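The sum image (often called an integral image) can be sketched as follows. Once it is built, the sum over any rectangle needs only four table lookups, so a feature made of a few rectangles costs a handful of additions. The two-rectangle feature with weights -1 and +1 is an illustrative assumption.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii, x, y, w, h):
    """Haar-like feature: right half weighted +1, left half weighted -1."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return right - left

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
ii = integral_image(img)
inner = rect_sum(ii, 1, 1, 2, 2)            # 6 + 7 + 10 + 11
feature = two_rect_feature(ii, 0, 0, 4, 4)  # right columns minus left columns
```

The cost of `rect_sum` does not depend on the rectangle size, which is what makes evaluating many Haar-like features per window practical.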

When the face image detection process (S110) is complete, a normalization process (S120) normalizes the size and color of the detected face image to set values.

As shown in FIG. 2, the normalization process (S120) includes: a size normalization process (S121) that converts the detected face image to a preset size and outputs it as a size-normalized face image; an angle normalization process (S122) that determines the tilt angle and tilt direction of the resized face image, rotates the face image by the tilt angle in the direction opposite to the tilt direction, and outputs the result as a normalized face image; and an illumination normalization process (S123) that removes the effect of illumination from the normalized face image and outputs the result as an illumination-normalized face image.

Because the size of a face image detected in an input image is not constant, a size normalization process (S121) that makes the detected face images a constant size is needed to improve the efficiency of feature detection. That is, as shown in FIG. 5, the detected face image is normalized to 256 × 256 pixels. The following description takes 256 × 256 pixels as the example size of the detected face image.
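Size normalization can be sketched as a resize to a fixed 256 × 256 grid. The source does not say which interpolation is used; nearest-neighbour sampling is assumed here because it is the simplest to show, and `resize_nearest` is a hypothetical helper.

```python
def resize_nearest(img, out_w, out_h):
    """Resize a 2-D list of pixels to out_w x out_h by nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [[img[min(in_h - 1, y * in_h // out_h)][min(in_w - 1, x * in_w // out_w)]
             for x in range(out_w)]
            for y in range(out_h)]

# A 2 x 2 "face" blown up to the normalized size used in the description.
tiny = [[1, 2],
        [3, 4]]
normalized = resize_nearest(tiny, 256, 256)
```

Each output pixel copies the source pixel at the proportionally scaled position, so detected faces of any size end up on the same pixel grid for the later steps.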

Once the size normalization process (S121) is complete and before the illumination normalization process, an angle normalization process (S122) may additionally be performed, which determines the tilt angle and tilt direction of the resized face image, rotates the face image by the tilt angle in the direction opposite to the tilt direction, and outputs the result as a normalized face image.

The purpose is, when the face in the size-normalized face image is rotated in the image plane (rotation in plane), to obtain the tilt angle by which the face is rotated and to rotate the image by that angle so that the face is upright.

To determine the tilt angle, the angle normalization process may include a vertical-direction histogram calculation process and a tilt angle determination process.

The vertical-direction histogram calculation process rotates the resized face image in 2.5° steps, extracts the valleys and edges in the vertical direction at each step, and calculates the vertical-direction histogram of those valleys and edges.

The tilt angle determination process determines, as the tilt angle, the rotation angle at which the vertical-direction histogram calculated while rotating the resized face image in 2.5° steps has the largest variance.
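The search for the tilt angle can be sketched on a toy set of edge and valley pixels. Several details are assumptions: the pixels are represented as (x, y) points rather than an image, the vertical-direction histogram is taken to be the count of such pixels in each row, and the search range is limited to ±45°. The function returns the rotation that makes the face upright, which is the tilt with its sign reversed.

```python
import math

def rotate_points(points, angle_deg, cx, cy):
    """Rotate (x, y) points about (cx, cy) by angle_deg."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [((x - cx) * c - (y - cy) * s + cx, (x - cx) * s + (y - cy) * c + cy)
            for x, y in points]

def row_histogram(points, height):
    """Number of edge/valley pixels falling in each image row."""
    hist = [0] * height
    for _, y in points:
        r = int(round(y))
        if 0 <= r < height:
            hist[r] += 1
    return hist

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def estimate_correction(points, size, max_angle=45.0, step=2.5):
    """Try rotations in fixed steps and keep the one with the largest variance."""
    best_angle, best_var = 0.0, -1.0
    n = int(max_angle / step)
    for i in range(-n, n + 1):
        angle = i * step
        rotated = rotate_points(points, angle, size / 2, size / 2)
        v = variance(row_histogram(rotated, size))
        if v > best_var:
            best_angle, best_var = angle, v
    return best_angle

# Upright toy face: an "eye line" and a shorter "mouth line", both horizontal.
upright = [(x, 20) for x in range(10, 51)] + [(x, 44) for x in range(20, 41)]
tilted = rotate_points(upright, 10.0, 32.0, 32.0)   # simulate a 10 degree tilt
correction = estimate_correction(tilted, 64)
```

When the face is upright, the horizontal features collapse into a few rows, so the row histogram is sharply peaked and its variance is largest; any residual tilt smears each feature over several rows.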

Here, a valley is a region of the resized face image in which a pixel is darker than its surrounding pixels and the brightness difference between the pixel and its surroundings exceeds a predetermined threshold. In other words, a valley is a part where the pixel brightness changes abruptly and is lower than that of the surrounding pixels, that is, a part darker than its surroundings. In the present invention, valleys can be obtained by applying a gray-scale top-hat morphology operation to the face region image, binarizing the result, and then applying a binary closing morphology operation.

An edge is a facial contour line in the resized face image. Valleys and edges are reliable only when they are extracted from within the actual face region, so they are extracted from the inner part of the face region: when the detected face image is 256 × 256 pixels, the valley regions and edges are extracted from the central region cropped to 140 × 140 pixels. For reference, for the tilted face image of FIG. 6, the edges are shown in FIG. 6(b) and the valleys in FIG. 6(c).

For reference, FIGS. 6(b) and 8(b) show edges, FIGS. 6(c) and 8(c) show valleys, and FIGS. 7(b) and 9(b) show the vertical-direction histograms of the edge and valley images. Comparing them shows that the variance (the degree of variation about the mean) of the vertical-direction histogram of the edges and valleys is larger when the face is upright than when it is tilted. Using this fact, the tilt angle (the rotation angle with the largest variance) was obtained and the face was straightened as shown in FIG. 10.
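The tilt search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the input is taken to be a binary edge/valley map given as a list of rows, rotation uses nearest-neighbour sampling, and the ±45° search range and all function names are hypothetical. Only the 2.5° step and the maximum-variance criterion come from the description.

```python
import math

def rotate_nearest(img, angle_deg):
    """Rotate a binary map (list of rows) about its center, nearest-neighbour."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel that lands on (x, y).
            dx, dy = x - cx, y - cy
            sx = int(round(c * dx + s * dy + cx))
            sy = int(round(-s * dx + c * dy + cy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

def column_histogram_variance(img):
    """Variance of the per-column pixel counts (vertical-direction histogram)."""
    w = len(img[0])
    hist = [sum(row[x] for row in img) for x in range(w)]
    mean = sum(hist) / w
    return sum((v - mean) ** 2 for v in hist) / w

def estimate_correction_angle(feature_map, max_angle=45.0, step=2.5):
    """Rotation in degrees that maximizes the histogram variance.

    max_angle is an assumed search range; step follows the 2.5 degree interval.
    """
    n = int(max_angle / step)
    candidates = [i * step for i in range(-n, n + 1)]
    return max(
        candidates,
        key=lambda a: column_histogram_variance(rotate_nearest(feature_map, a)),
    )
```

For a map that contains a single upright vertical line, the histogram has one tall bin, so the variance peaks at a rotation of 0°; any tilt spreads the line over several columns and lowers the variance.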

Once the size and angle normalization processes (S121, S122) are complete, an illumination normalization process (S123) removes the influence of illumination from the normalized face image and outputs the result as the illumination-normalized face image.

Even for the same person, a face image can look very different depending on the lighting at the time of capture. Lighting can cast shadows near structures such as the nose, eyes, and mouth, and saturation can prevent parts of the face from being rendered properly in the image. Removing the influence of illumination from the face is therefore essential before authentication based on facial features. The distortion of a face image by illumination can be modeled as multiplicative noise A and additive noise B applied to the illumination-free image I(x,y), as in the following expression.

I'(x,y) = A × I(x,y) + B

where I'(x,y) is the face image affected by illumination, and

I(x,y) is the illumination-normalized face image.

Removing this illumination effect amounts to removing the multiplicative noise and the additive noise from the normalized face image. The present invention can remove both by [Equation 1] below, without computing the multiplicative noise and additive noise directly.

That is, let x, y be the pixel coordinates, I'(x,y) the pixel function of the face image affected by illumination, I(x,y) the pixel function of the normalized face image with the illumination effect removed, E the mean of the pixel values of the normalized face image, and VAR the variance. The illumination effect can then be removed by [Equation 1].

For reference, FIG. 11(a) shows an image before the illumination effect is removed, and FIG. 11(b) shows the image after the illumination effect is removed.
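The body of [Equation 1] is not reproduced in this text, so the following is only a sketch of one normalization that is consistent with the description: it uses a mean and a variance, and it cancels A and B without estimating them. The assumption is standardization to zero mean and unit variance; the function name is hypothetical, and A is assumed positive.

```python
import math

def normalize_illumination(pixels):
    """Zero-mean, unit-variance normalization of a flat list of gray values.

    Assumed form (not taken from the source): (I' - E[I']) / sqrt(VAR[I']).
    The image must not be constant, otherwise the variance is zero.
    """
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(var)
    return [(p - mean) / std for p in pixels]

# The same scene under two lighting conditions, I' = A * I + B.
base = [10.0, 20.0, 40.0, 80.0]
lit = [1.5 * p + 12.0 for p in base]
same = all(
    abs(u - v) < 1e-9
    for u, v in zip(normalize_illumination(base), normalize_illumination(lit))
)
```

Since E[A·I + B] = A·E[I] + B and VAR[A·I + B] = A²·VAR[I], the factor A and the offset B drop out of the standardized image, which is why `same` evaluates to true here.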

When the normalization process for size, angle, and illumination (S120; S121, S122, S123) is complete, a face feature vector extraction process (S130) extracts a face feature vector from the normalized face image.

The face feature vector extraction process includes a process (S131) of computing a plurality of Gabor magnitude feature vectors, expressed in polar coordinates, from the normalized face image, and a process (S132) of determining the face graph, which is the collection of those Gabor magnitude feature vectors, as the face feature vector.

The present invention uses the Gabor magnitude feature vector as the face feature vector. The Gabor magnitude feature vector is formed by taking only the magnitude values of the Gabor feature vector (Gabor jet) expressed in polar coordinates. The Gabor magnitude feature vector at a face image feature point is defined as the set of Gabor coefficients obtained by convolving the image at that point with different Gabor wavelet kernels configured by orientation, frequency, and phase. The Gabor wavelet kernel used in the present invention is the one published in L. Wiskott, J. M. Fellous, N. Krüger, C. von der Malsburg, "Face Recognition by Elastic Bunch Graph Matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, pp. 775-779, July 1997.

That is, the Gabor wavelet kernel used in the present invention can be expressed as [Equation 2] below.

[Equation 2]

The definitions accompanying [Equation 2] give the wave vector of the kernel in terms of a parameter for the orientation of the wavelet and a parameter for its wavelength (proportional to the inverse of the frequency), together with a parameter for the size of the Gaussian envelope. The present invention uses the Gabor wavelet kernels given by 40 combinations of these parameters.

In practice, the Gabor feature vectors were obtained as follows. The Gabor wavelet kernels for the 40 combinations are split into real and imaginary parts, and each part is discretized to produce Gabor wavelet masks. The real-part and imaginary-part masks for the j-th combination are then convolved with the image pixel values (gray values) at the points in the neighborhood of an image point to obtain the j-th Gabor feature vector, where j runs over the 40 combinations.

The Gabor feature vector at each image point is defined as the set of these 40 coefficients. Each complex Gabor wavelet coefficient can be expressed in polar form by a magnitude and a phase. The vector a_j, j = 1, ..., 40, formed by taking only the magnitude values from the Gabor feature vector, is called the Gabor magnitude feature vector. For reference, FIG. 12 shows the 80 Gabor wavelet masks used in the embodiment of the present invention (40 Gabor wavelet kernel combinations × 2 (real part, imaginary part)).
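The equation images for the kernel are not reproduced in this text. The sketch below therefore assumes the kernel and parameter set of the Wiskott et al. paper cited above: ψ_j(x) = (k_j²/σ²) exp(−k_j²|x|²/(2σ²)) [exp(i k_j·x) − exp(−σ²/2)], with 5 frequencies k_ν = 2^(−(ν+2)/2)·π, 8 orientations φ_μ = μπ/8, and σ = 2π, which gives the 40 combinations. The mask radius and all function names are hypothetical.

```python
import cmath
import math

SIGMA = 2 * math.pi  # assumed, following the cited reference

def wave_vector(nu, mu):
    """Wave vector for frequency index nu (0..4) and orientation index mu (0..7)."""
    k = 2 ** (-(nu + 2) / 2) * math.pi
    phi = mu * math.pi / 8
    return k * math.cos(phi), k * math.sin(phi)

def gabor_kernel(kx, ky, x, y):
    """Complex kernel value at offset (x, y); the second term removes the DC part."""
    k2 = kx * kx + ky * ky
    envelope = (k2 / SIGMA ** 2) * math.exp(-k2 * (x * x + y * y) / (2 * SIGMA ** 2))
    carrier = cmath.exp(1j * (kx * x + ky * y)) - math.exp(-SIGMA ** 2 / 2)
    return envelope * carrier

def gabor_magnitudes(image, px, py, radius=8):
    """40 magnitudes a_j at pixel (px, py) of a gray image given as a list of rows."""
    h, w = len(image), len(image[0])
    mags = []
    for nu in range(5):
        for mu in range(8):
            kx, ky = wave_vector(nu, mu)
            coeff = 0j
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = py + dy, px + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        # Convolution: the kernel is evaluated at x - x'.
                        coeff += image[yy][xx] * gabor_kernel(kx, ky, -dx, -dy)
            mags.append(abs(coeff))  # keep the magnitude, discard the phase
    return mags
```

Keeping only `abs(coeff)` corresponds to taking the magnitude a_j of each complex coefficient; the phase, which varies quickly with small shifts of the sampling point, is dropped.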

In addition, the Gabor magnitude feature vectors are extracted at grid positions in the normalized face image.

One approach forms a face graph from the Gabor feature vectors obtained by 2D convolution with Gabor wavelets at feature points such as the tip of the nose and the corners of the mouth, and uses it as facial feature information. However, that approach requires a large amount of computation to locate the feature points, and it is difficult to find the exact feature point positions when the initial estimates are far from the true positions. In the present invention, therefore, Gabor magnitude feature vectors are extracted at the grid positions of the normalized face image as shown in FIG. 13, and the face graph formed by the collection of these vectors is defined and used as one person's face feature vector. This approach is known to perform about as well as face feature vectors extracted at biometric feature points.

After the plurality of Gabor magnitude feature vectors expressed in polar coordinates has been computed from the normalized face image (S131), the face graph, which is the collection of those vectors, is determined as the face feature vector (S132).

The process of taking the face graph as the face feature vector (S132) includes generating an intrinsic face Gabor space that is independent of illumination effects, projecting the Gabor face feature vectors extracted at the grid positions onto the intrinsic face Gabor space to reduce the data size, and extracting the face graph, which is the collection of the Gabor face feature vectors projected onto the intrinsic face Gabor space, as the face feature vector.

The extracted face feature vector is a collection of length-40 Gabor magnitude feature vectors taken at 192 grid points, and each component is a 4-byte float, so in total it is a large feature vector of about 30 KB. Information of that size cannot be handled on a smart card, so the face feature vector must be reduced in size while minimizing the loss of information.
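The 30 KB figure follows directly from the numbers in the paragraph above, as this short check shows. The constant names are illustrative only.

```python
GRID_POINTS = 192      # grid positions on the normalized face image
JET_LENGTH = 40        # Gabor magnitudes per grid position
BYTES_PER_FLOAT = 4    # each component is a 4-byte float

raw_dims = GRID_POINTS * JET_LENGTH
raw_bytes = raw_dims * BYTES_PER_FLOAT

print(raw_dims, raw_bytes, raw_bytes / 1024)  # 7680 30720 30.0
```

The uncompressed face graph is thus a 7,680-dimensional vector occupying 30,720 bytes.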

To this end, the present invention generates an intrinsic face Gabor space that is independent of illumination, for example an intrinsic face Gabor PCA space obtained through PCA, and projects the face feature vector onto it to minimize the data size. In the following, the intrinsic face Gabor space is described using the intrinsic face Gabor PCA space obtained through PCA as an example. The intrinsic face Gabor space is obtained by exploiting two facts: a face image consists of an illumination image and a reflectance image, and PCA yields the principal component axes that best represent the distribution of the data and minimize information loss under projection. As given in [Equation 3] below, the face image I(x,y) is the product of a reflectance image R(x,y), which reflects the intrinsic characteristics of the face, and an illumination image L(x,y), which represents the influence of illumination.

[Equation 3]

I(x,y) = R(x,y) × L(x,y)

However, the reflectance image R(x,y), which reflects the intrinsic characteristics of the face, is not independent of the illumination image L(x,y): both derive from the face image I(x,y) and are related to each other through it. Consequently, the space spanned by the face feature vectors obtained from the reflectance image R(x,y) is not orthogonal to the space spanned by the face feature vectors obtained from the illumination image L(x,y).

Nevertheless, the face feature vectors extracted from a face image normalized by the illumination normalization technique, which is treated as the reflectance image, and the face feature vectors extracted from the illumination image can still be said to reflect well the intrinsic features of the face and the features caused by the reflection effect, respectively.

Furthermore, assuming that most human faces are similar, so that illumination affects most faces in a similar way, an intrinsic face Gabor PCA space orthogonal to the illumination effect can be generated as follows.

To this end, according to 김상훈, 설태인, 정선태, 조성원, "가버특징벡터 조명 PCA 모델 기반 강인한 얼굴인식" (robust face recognition based on a Gabor feature vector illumination PCA model), 전자공학회논문지-SC 45(6), pp. 67-76, 2008, which describes the Gabor PCA space obtained through PCA, the intrinsic face Gabor PCA space is obtained as follows. Face feature vectors are extracted from illumination-normalized face images of different people (FIG. 14), and PCA is performed on this set of face feature vectors. The matrix whose columns are the resulting principal component vectors is called Φ_R. PCA is likewise performed on the face feature vectors extracted from illumination images of the same person taken under various lighting conditions (FIG. 15), and the matrix whose columns are the resulting principal component vectors is called Φ_L. The principal axes in Φ_R are then the basis that best represents the differences between different people's faces, and the principal axes in Φ_L are the basis that represents how facial features change under illumination.

As stated above, if most people's faces are similar, so that illumination affects everyone in nearly the same way, Φ_L can be regarded as modeling not the illumination effect on one particular person but a general illumination effect applicable to everyone. The basis Φ*_R of a space that best represents the differences in facial features between people while being orthogonal to variations caused by illumination can therefore be obtained by [Equation 4] below.

[Equation 4]

That is, by removing from the non-orthogonal bases Φ_R and Φ_L the components that are not orthogonal to each other, a new basis Φ*_R is obtained that best represents the intrinsic characteristics of the face while being orthogonal to the illumination effect.
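[Equation 4] itself is not reproduced in this text. As a hedged illustration of the idea of removing the non-orthogonal components, the sketch below assumes one common construction: each column of Φ_R has its component inside the span of Φ_L subtracted, and the remainder is re-orthonormalized by Gram-Schmidt. This is an assumption, not the source's formula; the function names are hypothetical, and Φ_L is assumed to have orthonormal columns, as PCA axes do.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_illumination_subspace(phi_r, phi_l, eps=1e-10):
    """Return an orthonormal basis derived from phi_r and orthogonal to phi_l.

    phi_r, phi_l: lists of basis vectors (the columns of the matrices).
    phi_l is assumed to be orthonormal.
    """
    basis = []
    for r in phi_r:
        v = list(r)
        # Subtract the part of v inside the illumination subspace, then the
        # part already covered by the basis vectors kept so far.
        for b in list(phi_l) + basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > eps:  # drop directions that lie entirely in span(phi_l)
            basis.append([vi / norm for vi in v])
    return basis

phi_l = [[1.0, 0.0, 0.0]]
phi_r = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
phi_r_star = remove_illumination_subspace(phi_r, phi_l)
```

In this toy case the first axis of Φ_R overlaps the illumination axis; after the projection, both resulting vectors have zero inner product with Φ_L.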

Finally, projecting the face feature vector onto the newly constructed intrinsic face Gabor PCA space achieves information compression with minimal information loss. In the actual experiments, an intrinsic face Gabor PCA space with 292 basis vectors was generated, yielding a final 292-dimensional face feature vector. Moreover, because the intrinsic face Gabor PCA space is independent of illumination, robustness to illumination is obtained as a side benefit.

Meanwhile, in addition to the face feature point extraction process (S100), there is a voice feature point extraction process (S200) that detects a voice in audio acquired through a microphone and extracts voice feature data from the detected voice.

As shown in FIG. 3, the voice feature point extraction process (S200) includes a voice detection process (S210) and a voice feature data extraction process (S220).

The voice detection process (S210) detects the voice in the entire audio using the average energy method and the zero-crossing rate. That is, it includes a process (S211) of detecting a voiced region in the entire audio by the average energy method, a process (S212) of using the zero-crossing rate to determine whether a fricative is present immediately before the voiced region, and a process (S213) of judging the audio to be voice when a fricative is present immediately before the voiced region.

To extract voice features from a speech signal for speaker recognition, the portion that actually contains speech must first be extracted. Short-time analysis of the average energy and the zero-crossing rate is used for this purpose.

The energy of a speech signal varies over time, and short-time energy is a speech parameter that represents this variation effectively. In particular, voiced sounds have much larger energy than unvoiced sounds or noise, so the average energy method is used to extract the voiced portions. The average energy is computed as in [Equation 5] below.

[Equation 5]

Here, N is the number of samples in one frame, and w(n) is the window function that selects the interval.

The zero-crossing rate is the rate at which two consecutive samples of the speech signal have different signs, that is, a measure of how frequently the sampled speech signal crosses zero. The portions that are pronounced, that is, the portions regarded as speech, are not limited to voiced sounds. Unvoiced portions have little energy and cannot be extracted by the average energy method alone. They can, however, be separated by taking into account that unvoiced sounds have higher frequency content than voiced sounds and ordinary noise. The zero-crossing rate is high when the frequency of the speech signal is high and low when it is low. Because unvoiced sounds have a higher zero-crossing rate than voiced sounds, the zero-crossing rate is used to extract the unvoiced portions; it is computed as in [Equation 6].

[Equation 6]

If the measured zero-crossing rate exceeds a predetermined threshold, an unvoiced sound can be considered present in that portion. Therefore, to extract the speech portion, the voiced portion is first located by the average energy method, the zero-crossing rate is then used to check whether a fricative precedes the voiced portion, and the result is judged to be speech and extracted as the speech portion.
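The two-stage detection described above can be sketched as follows. [Equation 5] and [Equation 6] are not reproduced in this text, so the sketch assumes the simplest forms: a rectangular window, energy as the mean of the squared samples, and the zero-crossing rate as the fraction of adjacent sample pairs whose signs differ. The thresholds, frame length, labels, and function names are all hypothetical.

```python
def split_frames(signal, frame_len):
    """Non-overlapping frames (rectangular window assumed)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def average_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classify_frames(signal, frame_len, energy_thr, zcr_thr):
    labels = []
    for f in split_frames(signal, frame_len):
        if average_energy(f) > energy_thr:
            labels.append("voiced")      # high energy
        elif zero_crossing_rate(f) > zcr_thr:
            labels.append("unvoiced")    # low energy, many sign changes
        else:
            labels.append("silence")
    return labels

def speech_start(labels):
    """First voiced frame, moved back over directly preceding unvoiced frames."""
    if "voiced" not in labels:
        return None
    start = labels.index("voiced")
    while start > 0 and labels[start - 1] == "unvoiced":
        start -= 1
    return start

signal = [0.0] * 4 + [0.1, -0.1, 0.1, -0.1] + [1.0, 1.0, -1.0, -1.0]
labels = classify_frames(signal, frame_len=4, energy_thr=0.5, zcr_thr=0.5)
```

With these toy values the frames are labeled silence, unvoiced, voiced, and `speech_start(labels)` returns 1, so the low-energy fricative-like frame in front of the voiced frame is included in the speech segment.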

Once the voice detection process (S210) is complete, a voice feature data extraction process (S220) extracts voice feature data from the extracted voice, as shown in FIG. 3. This process includes a fast Fourier transform (FFT) process (S221), a Mel filter filtering process (S222), a DCT process (S223), a process (S224) of extracting the voice feature data by applying MFCC (Mel-Frequency Cepstral Coefficients), and a process (S225) of normalizing the number of voice feature data.

In detail, the extracted voice is first subjected to a fast Fourier transform (FFT) (S221).

Obtaining the DFT (Discrete Fourier Transform) spectrum of N discrete time-domain samples requires on the order of N² operations, because each of the N elements of the DFT requires N additions and complex multiplications. If N is 128, 16,384 additions and complex multiplications are needed; if N is 256, 65,536 are needed. It is therefore difficult to process the DFT, with its N² operation count, in real time. The fast Fourier transform (FFT) is an algorithm developed to solve this problem. With the FFT, the N² operations are reduced to N log₂ N. Replacing the DFT computation with the FFT reduces the operation count by a factor of N / log₂ N: about 18 times for N = 128 and about 32 times for N = 256. The principle of the FFT is as follows. If the number of samples N of the signal to be transformed is 2 to the power m, the N samples can be split, as in [Equation 7], into two sample sequences, one of the odd-numbered samples and one of the even-numbered samples.

[Equation 7]

The two sample sequences g and h each have N/2 samples and a period of NT/2.
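The even/odd split can be illustrated with a textbook radix-2 decimation-in-time FFT, compared against a direct DFT. This is a generic sketch, not code from the source; the function names are illustrative, and the input length is assumed to be a power of two. With 0-based indexing, `x[0::2]` holds the 1st, 3rd, ... samples and `x[1::2]` the 2nd, 4th, ... samples.

```python
import cmath
import math

def dft(x):
    """Direct DFT: N sums of N terms, so about N^2 operations."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    g = fft(x[0::2])   # half-length transform of one sample sequence
    h = fft(x[1::2])   # half-length transform of the other
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * h[k]
        out[k] = g[k] + t
        out[k + n // 2] = g[k] - t
    return out

def reduction_factor(n):
    """Ratio of N^2 to N*log2(N), i.e. N / log2(N)."""
    return (n * n) / (n * math.log2(n))

print(round(reduction_factor(128), 1), reduction_factor(256))  # 18.3 32.0
```

The printed ratios match the figures of about 18 times and 32 times quoted in the paragraph above.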

After the extracted voice has been transformed by the FFT, the transformed voice data is filtered with a Mel filter, whose bands are uniformly spaced on the Mel scale (S222). Because each frame contains too many data values, the FFT result is passed through a filter bank and the energy of each filter bank channel is computed in order to obtain suitable parameters. The simplest form of filter bank used in speech recognition is the uniform filter bank. Here the center frequency of the i-th band-pass filter is defined as in [Equation 8].

[Equation 8]

In [Equation 8], F_s is the sampling frequency of the speech signal and N is the number of uniformly spaced filters that divide the sampling frequency into equal intervals. Q, the number of filters actually used, satisfies the following condition.

[Equation 9]

Q ≤ N/2

In [Equation 9], equality holds when the entire frequency range of the speech signal is used in the analysis. The bandwidth b_i of the i-th filter generally satisfies [Equation 10] below.

[Equation 10]

b_i ≥ F_s/N

Here, equality means that adjacent filter channels do not overlap in frequency, and strict inequality means that adjacent filter channels overlap. If b_i < F_s/N, part of the speech spectrum is lost and a correct analysis cannot be made.

The Mel filter used in the embodiment of the present invention may be a Mel filter bank in which a plurality of Mel filters are combined. The Mel filter bank is a nonuniform filter bank. One way to design a nonuniform filter bank is to use the critical-band scale directly; the placement of the filters along the critical-band scale is based on perceptual studies. As shown in FIG. 16, the filter spacing is linear below 1000 Hz and approximately logarithmic above it. In the embodiment of the present invention, frequencies are mapped to the Mel frequency domain using the Mel formula of [Equation 11] below, divided at equal intervals in the Mel frequency domain, and then converted back to the frequency domain.

[Equation 11]
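[Equation 11] is not reproduced in this text. The sketch below assumes the widely used Mel formula mel = 2595 · log10(1 + f/700); whether the source uses exactly this variant is not confirmed. The function names, the filter count, and the frequency range are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Assumed Mel formula (common variant, not confirmed by the source)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(num_filters, f_low_hz, f_high_hz):
    """Centers equally spaced on the Mel scale, converted back to Hz."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]

centers = mel_center_frequencies(20, 0.0, 4000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Equal steps on the Mel scale turn into steadily widening steps in Hz, which is the nonuniform spacing described above: narrow filters at low frequencies and wide ones at high frequencies.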

After Mel filtering, a discrete cosine transform (DCT) is applied to remove the correlation in the filtered voice data (S223). The DCT removes the correlation between the outputs of the filter bank and concentrates the characteristics of the parameters.

After the DCT, the voice feature data is extracted from the DCT-transformed voice data as MFCC (Mel-Frequency Cepstral Coefficients) (S224). Applying MFCC to one frame yields as many feature values as there are filter bank channels. Feature extraction selects the required number of these coefficients, starting from the lowest order. The procedure for obtaining the MFCC is expressed by [Equation 12] below.

[Equation 12]

Here, y_t^(m) is the output of the Mel filter bank; taking its logarithm and applying the DCT gives the feature vector y_t^(m)(k).
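The log-then-DCT step can be sketched as below. [Equation 12] is not reproduced in this text, so the unnormalized DCT-II used here is an assumption about the scaling; the function name is hypothetical, and the default of 13 retained coefficients follows the per-frame count given later in this description.

```python
import math

def mfcc_from_filterbank(energies, num_coeffs=13):
    """DCT-II of the log filter bank energies, keeping the lowest-order terms.

    energies: positive Mel filter bank outputs for one frame.
    """
    m = len(energies)
    log_e = [math.log(e) for e in energies]
    return [
        sum(log_e[j] * math.cos(math.pi * k * (j + 0.5) / m) for j in range(m))
        for k in range(num_coeffs)
    ]

flat = mfcc_from_filterbank([math.e] * 20)
```

For a flat spectrum, all of the information lands in the 0th coefficient and the higher-order coefficients are numerically zero, which illustrates how the DCT concentrates the filter bank outputs into a few low-order terms.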

After the voice feature data has been extracted by applying MFCC (S224), the number of voice feature data per frame is normalized (S225). With MFCC, 13 feature values are obtained per frame. However, depending on the length of the utterance, the number of frames ranges from about 60 to as many as 120, so utterances cannot be compared directly. In the present invention, therefore, the K-means algorithm is used to cluster the voice data obtained through MFCC, as shown in FIG. 17, so that each utterance is reduced to normalized data of a single size. In the embodiment of the present invention, the K-means algorithm clusters the data into 13 × 20. K-means is a clustering technique. Clustering is widely used for segmentation, and also for anomaly detection and for splitting large data sets into smaller groups.

참고로, K-means 알고리즘은 가장 일반적으로 사용되는 분할 클러스터링 알고리즘이다. 하기의 [수학식 13]은 K-means 알고리즘을 간략히 나타내었다.For reference, the K-means algorithm is the most commonly used partitioned clustering algorithm. The following Equation (13) briefly shows the K-means algorithm.

[수학식 13]&Quot; (13) "

The idea of the K-means algorithm is to minimize the average Euclidean distance between the patterns and the center of the cluster to which each pattern belongs. The center of a cluster is the mean, or centroid, of the patterns belonging to that cluster, and is defined by [Equation 14] below.

[Equation 14]

$$\mu(w) = \frac{1}{|w|} \sum_{x \in w} x$$

In [Equation 14], w is the set of patterns belonging to the cluster, and x is a particular pattern in the cluster. Each pattern is represented as a real-valued vector. In K-means, a cluster is regarded as a sphere whose centroid acts as its center of gravity. The residual sum of squares (RSS), a measure of how well the centroids represent the patterns in their clusters, is the sum, over all patterns in every cluster, of the squared distance between each pattern and its centroid, as given in [Equation 15] below. RSS is the objective function of K-means, and the goal is to minimize it.

[Equation 15]

$$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in w_k} \lVert x - \mu(w_k) \rVert^2$$
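As a small worked example of the centroid and RSS definitions described above, the following sketch computes both for a hand-made set of clusters. The helper names and the toy data are assumptions for illustration only.

```python
def centroid(cluster):
    """Mean of the patterns in one cluster, computed per dimension."""
    n = len(cluster)
    return [sum(x[d] for x in cluster) / n for d in range(len(cluster[0]))]

def rss(clusters):
    """Sum of squared distances from every pattern to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        for x in cluster:
            total += sum((xd - md) ** 2 for xd, md in zip(x, mu))
    return total

# Toy data: one cluster with two patterns, one cluster with a single pattern.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],
    [[5.0, 5.0]],
]
print(centroid(clusters[0]))
print(rss(clusters))
```

The first cluster has centroid (1, 0), and each of its two patterns lies at squared distance 1 from it; the single-pattern cluster contributes nothing, so the total RSS is 2.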

Meanwhile, after the face feature points and the voice feature points are extracted, the face similarity between the pre-registered feature vector and the face feature vector extracted from the face image is calculated (S300).

In the face similarity calculation (S300), the correlation value between the pre-registered feature vector and the face feature vector extracted from the face image is calculated as the face similarity. The registered feature vector is the feature vector extracted from the person's own face registered in advance, and the face feature vector is the feature vector extracted from the face in the image captured for authentication.

The similarity between two face feature vectors is defined as the correlation value between the two vectors after each is normalized to magnitude 1. Using an experimentally obtained threshold, a pair whose similarity is at or above the threshold is regarded as the same person; otherwise it is regarded as a different person.

[Equation 16]

The symbols in [Equation 16] denote, respectively, the registrant's face feature vector and the face feature vector of the person being authenticated.
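The similarity test described above can be sketched as follows. This is an illustrative reading, not the patent's formula: it assumes the "correlation value" of two vectors normalized to magnitude 1 is their dot product (no mean subtraction), and the function names and the default threshold of 0.7 are choices made for the example.

```python
import math

def normalize(v):
    """Scale a vector to magnitude 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def similarity(registered, probe):
    """Dot product of the two unit-length vectors (assumed correlation value)."""
    return sum(a * b for a, b in zip(normalize(registered), normalize(probe)))

def is_same_person(registered, probe, threshold=0.7):
    """Accept when the similarity is at or above the threshold."""
    return similarity(registered, probe) >= threshold

print(similarity([3.0, 4.0], [6.0, 8.0]))      # same direction
print(similarity([1.0, 0.0], [0.0, 1.0]))      # orthogonal vectors
print(is_same_person([3.0, 4.0], [6.0, 8.0]))
```

Because both vectors are scaled to unit length first, the score depends only on their direction, which is why a single fixed threshold can be applied across image pairs.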

In the embodiment of the present invention, face images were captured for the experiment with the 8-megapixel built-in camera of a Galaxy Note 2, a LifeCam VX-7000 webcam, and the built-in camera of a 5G iPhone. To exclude the effects of pose and lighting, the experiment used two frontal face images from each of 50 people, 100 images in total, all taken under the same lighting conditions. FIG. 18 shows the images used for face recognition.

A face feature vector is extracted from each face image and compared with the face feature vector extracted from one image of the same person and with the face feature vectors extracted from 50 images of other people. The similarity is calculated for each comparison, and various threshold values are then applied to decide whether the two images show the same person. [Table 1] summarizes the face feature vector matching results for the various thresholds.

[Table 1]

Threshold    FAR (False Accept Rate)    FRR (False Reject Rate)
0.4          21% (21/100)               4% (4/100)
0.5          9% (9/100)                 4% (4/100)
0.6          2% (2/100)                 6% (6/100)
0.7          0% (0/100)                 10% (10/100)
0.8          0% (0/100)                 14% (14/100)

The experiments show that thresholds of 0.7 or higher give a low FAR but a relatively high FRR, while thresholds below 0.7 lower the FRR but raise the FAR.
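The trade-off between FAR and FRR can be illustrated with a short sketch. The scores below are synthetic and are not the patent's data; the function name and the convention that a score at or above the threshold counts as an acceptance are assumptions consistent with the description above.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for one threshold.

    FAR: fraction of impostor comparisons accepted (score >= threshold).
    FRR: fraction of genuine comparisons rejected (score < threshold).
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))

# Synthetic similarity scores, for illustration only.
genuine = [0.9, 0.8, 0.65, 0.75]
impostor = [0.3, 0.72, 0.5, 0.1]

for threshold in (0.4, 0.7, 0.8):
    far, frr = far_frr(genuine, impostor, threshold)
    print(threshold, far, frr)
```

Raising the threshold can only reject more comparisons, so FAR never increases and FRR never decreases as the threshold grows, which matches the pattern in the threshold experiment.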

Along with the face similarity calculation (S300), there is a voice similarity calculation step (S400) that calculates the voice similarity between pre-registered feature data and the voice feature data extracted from the voice. In the voice similarity calculation (S400), the correlation value between the pre-registered feature data and the voice feature data extracted from the voice is calculated as the voice similarity. Here, the registered feature data is the feature data extracted from the person's own voice registered in advance, and the voice feature data is the feature data extracted from the voice in the audio captured for authentication.

To keep the input variables of the artificial neural network consistent, the voice feature data is matched with the same correlation method used for face recognition. Using an experimentally obtained threshold, a value at or above the threshold is regarded as the same person; otherwise it is regarded as a different person.

In the embodiment of the present invention, an ABKO mp5000 speech microphone was used to obtain the feature data for speech recognition. For the experiment, 10 people each spoke the same word, "상민" (Sangmin), and two recordings were taken per person, giving 20 voice files in total. FIG. 19 shows an example of a voice waveform used in the experiment.

A voice feature vector is extracted from each voice waveform. To obtain the FRR, the two recordings of the same person are compared with each other. To obtain the FAR, each person's recordings are compared in turn with the voice feature vectors of the remaining nine people to decide whether they belong to the same person. [Table 2] below shows the voice feature vector matching results.

[Table 2]

Threshold    FAR (False Accept Rate)    FRR (False Reject Rate)
0.65         34.4% (124/360)            20% (4/20)
0.7          31.1% (112/360)            30% (6/20)

The experiments show that the FAR and FRR change with the threshold value. However, the error rates are higher than in the face matching results.

Meanwhile, after the face similarity and the voice similarity are calculated as described above (S300, S400), an identity authentication step (S500) applies the calculated face similarity and voice similarity to an artificial neural network model to authenticate the person.

The aim is to fuse the matching data for a person's face and voice and to find an artificial neural network that yields the best recognition rate. The simplest way to do this is to evaluate the error and the recognition rate on data different from the data used for training. The network is therefore trained to minimize the error on the training data, and its performance is then measured on new test data, which validates the selected network.

In the embodiment of the present invention, the face recognition and voice recognition similarities are applied to an artificial neural network model, implementing a system that can predict whether a person's identity is confirmed.

Before the face similarity and the voice similarity are applied to the artificial neural network model, the face similarity is first normalized to a number in the range 0 to 1 and output as the normalized face similarity, and the voice similarity is normalized to a number in the range 0 to 1 and output as the normalized voice similarity (S510). This converts the input variables of the artificial neural network to the same range of values, because the given input variables are on different scales. Data suitable for a neural network model should have categorical input variables that occur with at least a certain frequency in every category, and continuous input variables whose value ranges do not differ greatly from one variable to another.

Among the input variables chosen in the embodiment of the present invention, the face recognition similarity takes values on the order of 10⁹, while the speaker recognition similarity takes values between 0 and 1. Because the magnitudes and ranges differ so much, normalization is performed so that training starts with each input variable on an equal footing when the weights are determined. FIG. 20 shows the face recognition and speaker recognition similarity results. As FIG. 20 shows, the range and spread of the input values are too large, so normalization was required. The values are therefore normalized as shown in FIG. 21 so that their range and distribution are no longer widely spread. They are normalized to a single digit not only to align the scale and range of the distributions, but also because, in the face and speaker matching results, a slight difference can produce a wrong identity decision.

In FIG. 22, 3.txt and 4.txt, and likewise 15.txt and 16.txt, are matching results for voices of the same person. With a threshold of 0.7, however, they would be judged not to be the same person, so the normalization described above is applied so that they are recognized as the person's own voice.
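The source does not state which normalization formula is used. As one possible sketch, the following applies min-max scaling, which maps each input variable to the range 0 to 1 regardless of its original scale. The function name and the sample values (face scores on the order of 10⁹, voice scores between 0 and 1) are assumptions for illustration.

```python
def min_max_normalize(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw similarities on very different scales.
face_raw = [1.0e9, 2.5e9, 4.0e9]
voice_raw = [0.2, 0.5, 0.8]

face_norm = min_max_normalize(face_raw)
voice_norm = min_max_normalize(voice_raw)
print(face_norm)
print(voice_norm)
```

After scaling, both inputs cover the same range, so neither dominates the weighted sums computed by the network at the start of training.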

After the normalized face similarity and the normalized voice similarity are calculated as described above, they are applied to the artificial neural network model to authenticate the person.

Artificial neural networks are briefly described here. An artificial neural network is a network built to mimic the human brain and its nerve cells, and it is known for strong learning and inference capabilities. Like a biological nervous system, in which objects of the real world interact, the model is a parallel interconnected network of simple elements and their hierarchical organization. In other words, learning proceeds by adjusting, to suit the given problem, the transfer function of the processing units (neurons), the number of layers that define the network structure, and the connections and connection strengths between processing units.

An artificial neural network consists of an input layer, a hidden layer, and an output layer, each of which is a group of processing elements. The work of a processing element, which can be described as its activation function, has two parts. The first part takes inputs from several other processing elements and computes the net input using the connection weights; this is done by the combination function. The second part is carried out by the transfer function, which converts the weighted net input into an output value. Candidate transfer functions include the linear function, the sigmoid function, and the hyperbolic tangent function. The linear function resembles what is used in linear regression, while the latter two are nonlinear functions whose outputs are nonlinear.
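The two parts of a processing element described above can be sketched in a few lines. This is a generic illustration: the function names are hypothetical, and the bias term is an assumption that the text does not mention.

```python
import math

def sigmoid(x):
    """Sigmoid transfer function, with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    """Linear transfer function: passes the net input through unchanged."""
    return x

def neuron_output(inputs, weights, bias=0.0, transfer=sigmoid):
    # Part 1, combination function: weighted sum of the inputs (net input).
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Part 2, transfer function: convert the net input into the output value.
    return transfer(net)

inputs = [1.0, 2.0]
weights = [0.5, -0.25]   # net input = 0.5 * 1.0 - 0.25 * 2.0 = 0.0
print(neuron_output(inputs, weights))                      # sigmoid
print(neuron_output(inputs, weights, transfer=math.tanh))  # hyperbolic tangent
print(neuron_output(inputs, weights, transfer=linear))     # linear
```

With a net input of zero, the sigmoid returns 0.5 while the hyperbolic tangent and the linear function return 0, which shows how the choice of transfer function changes the output range.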

In an artificial neural network, processing elements are grouped into layers; FIG. 23 shows the structure of the artificial neural network. In FIG. 23, the input layer is the layer through which data is supplied to the network, and the output layer is the layer in which the network responds to the given input and produces a result. The hidden layer lies between the input layer and the output layer and is fully connected to every node of the input layer. Each node in the hidden layer computes a value by multiplying each input by its corresponding weight and passes the value on to the output layer.

In the embodiment of the present invention, applying the normalized face similarity and the normalized voice similarity to the artificial neural network model (S520) to authenticate the person includes training an artificial neural network model composed of an input layer, a hidden layer, and an output layer with a learning algorithm (the backpropagation learning algorithm) in which experimental data and the desired output values are supplied to the network together. The normalized face similarity and the normalized voice similarity are then applied to the trained model to authenticate the person, so that error-free identity authentication can be performed.

The learning algorithm used by the artificial neural network in the present invention is the well-known backpropagation learning algorithm, which extends the ADALINE (Adaptive Linear Neuron) method to multiple layers while compensating for its shortcomings. For reference, ADALINE is an early model of a neuron in an artificial neural network, in which an adaptive linear combiner and a quantizer are connected in series. Comparing ADALINE with a neuron, the adaptive weights correspond to the synapses, the components of the input vector to the axonal inputs, and the quantized output to the axon output. The output of ADALINE closely resembles what happens in a real nerve cell.

The basic principle of the backpropagation learning algorithm is as follows. An input pattern is given to each neuron of the input layer; the signal is transformed at each neuron and passed to the intermediate layer, and the output layer finally produces the output signal. The output value is compared with the expected value, and the weights are adjusted in the direction that reduces the difference: the error is propagated backward from the upper layer, and the lower layers adjust their weights on that basis. In supervised learning, the input pattern and the desired (target) output pattern, both vectors, are presented to the artificial neural network. Training an artificial neural network with the backpropagation learning algorithm takes three steps. In step 1, a training input pattern is fed to the network to obtain an output. In step 2, the difference between the output and the target value, that is, the error, is computed. In step 3, no learning takes place if there is no error; if there is an error, it is propagated backward while the weights of the output layer and of the hidden layer are changed in the direction that reduces the magnitude of the error.

Because of this backward propagation during training, the backpropagation learning algorithm is easily mistaken for a recurrent artificial neural network. It should be made clear, however, that the error-related output propagates backward only during training. Once training is complete and the network is in actual use, the input moves forward to produce the output, so the structure is a feedforward artificial neural network. The backpropagation learning algorithm also takes a considerable amount of time to train, but once training is finished, it produces results very quickly at the application stage.

In more detail, the backpropagation learning algorithm, like the general delta rule, uses the output layer error signal to change the weights between the hidden layer and the output layer, and also propagates the output layer error signal back to the hidden layer to change the weights between the input layer and the hidden layer.

First, p training pattern pairs (x_1, s_1), (x_2, s_2), ..., (x_p, s_p) are selected, the initial weights v and w are initialized to arbitrary small values, and an appropriate learning rate α is chosen according to a learning rate selection method. The training pattern pairs are then fed in one after another, and learning proceeds by changing the weights. The backpropagation learning algorithm uses the sigmoid function as the activation function. The squared error E is obtained by comparing the target value s with the final output y, as in [Equation 17] below.

[Equation 17]

In addition, the error signal δ_y of the output layer is obtained by [Equation 18] below.

[Equation 18]

In addition, the error signal δ₂ propagated to the hidden layer is obtained by [Equation 19] below.

[Equation 19]

In addition, [Equation 20] below gives the change in the weights between the hidden layer and the output layer, and the change in the weights between the input layer and the hidden layer, at learning step k.

[Equation 20]

In addition, [Equation 21] below gives the weights between the hidden layer and the output layer, and the weights between the input layer and the hidden layer, at step k+1.

[Equation 21]

The training pattern pairs are fed in repeatedly to change the connection strengths, and training ends when the error falls below a specified range.

Accordingly, in the present invention, when the normalized face similarity and the normalized voice similarity are applied to the artificial neural network model to authenticate the person, the training of the model composed of an input layer, a hidden layer, and an output layer, in which experimental data and the desired output values are supplied to the network together (the backpropagation learning algorithm), also follows the well-known learning algorithm described above.

That is, the backpropagation algorithm used in the artificial neural network model for predicting identity confirmation is a learning algorithm for multi-layer feedforward neural networks, and the learning method is supervised learning. As shown in FIG. 24, the present invention uses a three-layer structure consisting of an input layer, an output layer, and one intermediate layer, and the input layer receives the respective feature information.

As shown in FIG. 24, the input layer has two nodes, one each for the face similarity and the voice similarity, and the hidden layer has six nodes. The output layer has two nodes, and the network is trained so that the first node indicates that the person is genuine and the second node indicates that the person is not. The backpropagation algorithm used in the artificial neural network model for identity confirmation requires a training process that adjusts the weights so as to minimize the error between the desired output and the actual output of the network for a given input. The algorithm of the artificial neural network model provided in the embodiment of the present invention thus consists of initializing the weight vectors, updating the weight vectors, and generating new weight vectors. The learning algorithm is as follows.

STEP 1. Propagate the input vector through the network (feedforward propagation)

j = 1, 2, ..., 6

STEP 2. Compute the error at the output layer

δ_k = desired output, k = 1, 2, ..., 10

STEP 3. Compute the error at the intermediate layer (backward propagation of error)

STEP 4. Update the weights

α = learning rate

STEP 5. Repeat STEP 1 to STEP 4 until convergence

As the learning algorithm shows, the error is computed and the weights are updated through repeated iterations. In this study, the number of iterations was set to 2,000,000, and a threshold on the total error was defined so that training ends as soon as the threshold is reached during the iterations. The threshold was set to 0.01, and the learning rate α was set to 0.1. A higher learning rate allows faster learning, but it also raises the probability that training fails to converge and produces wrong answers.
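The training procedure can be sketched as a small network with 2 input, 6 hidden, and 2 output nodes, trained by backpropagation with a learning rate of 0.1 and an error threshold of 0.01. This is an illustrative sketch, not the patent's implementation. The update formulas are the standard sigmoid backpropagation rules, assumed here because the original equations did not survive extraction. The bias terms, the weight initialization range, the reduced epoch limit, and the training pairs are likewise assumptions, and all names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, v, bv, w, bw):
    """Feedforward pass: returns hidden activations z and outputs y."""
    z = [sigmoid(sum(v[j][i] * x[i] for i in range(len(x))) + bv[j])
         for j in range(len(v))]
    y = [sigmoid(sum(w[k][j] * z[j] for j in range(len(z))) + bw[k])
         for k in range(len(w))]
    return z, y

def train(samples, hidden=6, alpha=0.1, error_threshold=0.01,
          max_epochs=5000, seed=0):
    """Backpropagation training. max_epochs is far below the 2,000,000
    iterations mentioned in the text so that the sketch runs quickly."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n_out)]
    bv, bw = [0.0] * hidden, [0.0] * n_out   # bias terms are an assumption
    for _ in range(max_epochs):
        total_error = 0.0
        for x, s in samples:
            z, y = forward(x, v, bv, w, bw)                      # STEP 1
            total_error += 0.5 * sum((sk - yk) ** 2 for sk, yk in zip(s, y))
            # STEP 2: error signal at the output layer.
            delta_y = [(s[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(n_out)]
            # STEP 3: error signal propagated back to the hidden layer.
            delta_z = [z[j] * (1.0 - z[j]) *
                       sum(delta_y[k] * w[k][j] for k in range(n_out))
                       for j in range(hidden)]
            # STEP 4: weight updates scaled by the learning rate alpha.
            for k in range(n_out):
                for j in range(hidden):
                    w[k][j] += alpha * delta_y[k] * z[j]
                bw[k] += alpha * delta_y[k]
            for j in range(hidden):
                for i in range(n_in):
                    v[j][i] += alpha * delta_z[j] * x[i]
                bv[j] += alpha * delta_z[j]
        if total_error < error_threshold:                        # STEP 5
            break
    return v, bv, w, bw

def authenticate(face_sim, voice_sim, model):
    """Genuine when the first output node exceeds the second."""
    _, y = forward([face_sim, voice_sim], *model)
    return y[0] > y[1]

# Synthetic normalized (face, voice) similarities with targets
# [1, 0] for the genuine person and [0, 1] for an impostor.
samples = [
    ([0.90, 0.80], [1.0, 0.0]), ([0.80, 0.90], [1.0, 0.0]),
    ([0.85, 0.70], [1.0, 0.0]), ([0.20, 0.30], [0.0, 1.0]),
    ([0.10, 0.40], [0.0, 1.0]), ([0.30, 0.20], [0.0, 1.0]),
]
model = train(samples)
print(authenticate(0.88, 0.82, model))
print(authenticate(0.15, 0.25, model))
```

Training stops either when the summed squared error drops below the threshold or when the epoch limit is reached, mirroring the two stopping conditions described in the text. The final decision only compares the two output nodes, so it does not depend on the error having reached the threshold.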

Meanwhile, after the artificial neural network is trained, the normalized face similarity and the normalized voice similarity are applied to the trained model (S520) to authenticate the person, so that error-free identity authentication can be performed.

[Table 3] below shows the results of an experiment in which, of a person's two data items a and b (each containing the face and the voice), one item was used together with the data of the remaining nine people to train the artificial neural network, and the person's other item was then used for testing. [Table 4] shows the results when the network was trained only on the data of the remaining nine people, without the person's own data, and the person's two data items were then tested.

[Table 3]

Subject    FAR (False Accept Rate)    FRR (False Reject Rate)
A          0/36                       0/2
B          0/36                       0/2
C          0/36                       0/2
D          0/36                       0/2
E          0/36                       0/2
F          1/36                       0/2
G          0/36                       0/2
H          1/36                       0/2
I          0/36                       0/2
J          0/36                       0/2
Result     2/360                      0/20

[Table 4]

Subject    FAR (False Accept Rate)    FRR (False Reject Rate)
A          0/36                       0/2
B          0/36                       0/2
C          0/36                       0/2
D          0/36                       0/2
E          1/36                       0/2
F          1/36                       0/2
G          0/36                       0/2
H          1/36                       0/2
I          0/36                       0/2
J          0/36                       2/2
Result     3/360                      2/20

When the raw face and speaker similarity values were fed to the artificial neural network, accurate results could not be obtained. The input values were therefore normalized before training. In the subsequent tests, the network trained with the person's own data achieved a 100% recognition rate for the genuine person, whereas the network trained without the person's own data achieved a 90% recognition rate.

In the experiment in which the person's own data was not used for training, both FRR errors occurred for subject J. For everyone except J, the matching result for the face image tended to be higher than the matching result for the voice, but for J the voice matching result was higher than the face matching result, which caused these errors.

[Table 5]

            FAR (False Accept Rate)    FRR (False Reject Rate)
Result 1    0.56%                      0%
Result 2    0.83%                      10%

Referring to [Table 5] above, the fusion of face and voice gave better results than voice matching alone. Compared with face matching alone, the fusion gave better results when the network was trained with data that included the person's own face and voice, but when it was trained without the person's own data, the results were similar to those of face matching with a threshold of 0.7.

As the results above show, inaccurate speaker authentication spread out the distribution of values and produced misrecognitions such as that of subject J. If the speaker authentication results are improved, the fusion authentication should also give more stable and accurate results.

The embodiments described above were selected from among various possible embodiments as the most preferred examples, to help those skilled in the art understand the invention. The technical idea of the invention is not necessarily limited or restricted to these embodiments, and various changes, modifications, and other equivalent embodiments are possible without departing from the technical idea of the invention.

S100: Face feature point extraction
S200: Voice feature point extraction
S300: Face similarity calculation
S400: Voice similarity calculation
S500: Identity authentication process
S510: Calculation of normalized face similarity and normalized voice similarity
S520: Application of the artificial neural network model

Claims

1. An authentication method fusing face recognition and speaker recognition, comprising:
a face feature point extracting step of detecting a face image in an entire image and extracting a face feature vector from the detected face image;
a voice feature point extracting step of detecting a voice in audio and extracting voice feature data from the detected voice;
a face similarity calculating step of calculating a face similarity between a pre-registered feature vector and the face feature vector extracted from the face image;
a voice similarity calculating step of calculating a voice similarity between pre-registered feature data and the voice feature data extracted from the voice; and
an identity authentication step of performing identity authentication by applying the face similarity and the voice similarity to an artificial neural network model.

2. The method of claim 1, wherein the face feature point extracting step comprises:
a face image detecting process of detecting a face image stepwise in the entire image;
a normalization process of normalizing the size and color of the detected face image to set values; and
a face feature vector extracting process of extracting a face feature vector from the normalized face image.

3. The method of claim 2, wherein the stepwise detection of the face image is performed through an AdaBoost method.

4. The method of claim 2, wherein the normalization process comprises a size normalization process of converting the detected face image to a predetermined size and outputting it as a size-normalized face image.

5. The method of claim 4, further comprising, after the size normalization process is completed, an angle normalization process of determining the tilt angle and tilt direction of the size-converted face image, rotating the face image by the tilt angle in the direction opposite to the tilt direction, and outputting the rotated face image as a size-normalized face image.

6. The method of claim 5, wherein the angle normalization process comprises:
extracting valleys and edges in the vertical direction while rotating the size-converted face image at intervals of 2.5 degrees, and calculating a vertical-direction histogram of the valleys and edges; and
determining, as the tilt angle, the rotation angle at which the vertical-direction histogram calculated while rotating the size-converted face image at intervals of 2.5 degrees has the largest variance,
wherein a valley is a region in which the brightness of an image pixel in the size-converted face image is lower than that of the surrounding pixels and the brightness difference between the image pixel and the surrounding pixels is larger than a predetermined threshold, and an edge is a facial contour in the face image.

7. The method of claim 5, further comprising, after the angle normalization process, an illumination normalization process of removing illumination effects from the normalized face image and outputting the result as an illumination-normalized face image.

8. The method of claim 7, wherein the illumination normalization process removes multiplicative noise and additive noise from the normalized face image and outputs the result as an illumination-normalized face image.

9. The method of claim 8, wherein the multiplicative noise and additive noise are removed from the normalized face image by an equation in which (x, y) is the pixel function of the face image affected by illumination, I(x, y) is the pixel function of the illumination-normalized face image, E is the mean of the pixel values of the normalized face image, and VAR is their variance.

10. The method of claim 2, wherein the face feature vector extracting process comprises:
calculating a plurality of Gabor magnitude feature vectors, expressed in polar coordinates, from the normalized face image; and
determining a face graph, which is a set of the Gabor magnitude feature vectors, as the face feature vector.

11. The method of claim 10, wherein the Gabor magnitude feature vectors are extracted at grid-structure positions in the normalized face image.

12. The method of claim 11, wherein extracting the face graph as the face feature vector comprises:
generating an eigenface Gabor space that is unrelated to illumination effects;
reducing the data size by projecting the Gabor magnitude feature vectors extracted at the grid-structure positions onto the eigenface Gabor space; and
extracting a face graph, which is a set of the Gabor magnitude feature vectors projected onto the eigenface Gabor space, as the face feature vector.

13. The method of claim 1, wherein the voice feature point extracting step comprises:
a voice detecting process of detecting a voice in the entire audio using an average energy method and a zero-crossing rate; and
a process of extracting voice feature data from the detected voice.

14. The method of claim 13, wherein the voice detecting process comprises:
detecting a voiced region in the entire audio through the average energy method;
determining, using the zero-crossing rate, whether a fricative exists at the front end of the voiced region; and
determining the sound to be a voice when a fricative exists at the front end of the voiced region.
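
The voice detection described in claim 14 can be illustrated with a minimal Python sketch. It is not the patented implementation: the frame length, the two thresholds, and all function names are hypothetical, and the sketch adopts one possible reading of the claim in which high zero-crossing-rate frames immediately before the voiced region are treated as a leading fricative and included in the detected voice.

```python
def frame_signal(samples, frame_len):
    """Split samples into consecutive non-overlapping frames (tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def average_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_speech(samples, frame_len=80, energy_thresh=0.01, zcr_thresh=0.3):
    """Return (first_frame, last_frame) of the detected voice, or None.

    Thresholds are illustrative assumptions, not values from the patent.
    """
    frames = frame_signal(samples, frame_len)
    # Step 1: voiced region from average energy.
    voiced = [i for i, f in enumerate(frames) if average_energy(f) > energy_thresh]
    if not voiced:
        return None
    start, end = voiced[0], voiced[-1]
    # Step 2: low-energy, high-ZCR frames just before the voiced region are
    # treated as a fricative and folded into the voice segment.
    while start > 0 and zero_crossing_rate(frames[start - 1]) > zcr_thresh:
        start -= 1
    return start, end
```

Fricatives carry little energy but change sign often, which is why an energy threshold alone tends to cut them off and the zero-crossing rate is checked separately.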

15. The method of claim 13, wherein extracting the voice feature data comprises:
performing a fast Fourier transform (FFT) on the detected voice;
filtering the fast Fourier transformed voice data with a Mel filter, which is a uniform filter;
performing a discrete cosine transform (DCT) to remove correlation from the filtered voice data;
extracting voice feature data by applying MFCC (mel-frequency cepstral coefficients) to the DCT voice data; and
normalizing the number of voice feature data for each frame to which the MFCC is applied.
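
The FFT, Mel filtering, and DCT chain of claim 15 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented procedure: the triangular filter shape, the logarithm applied before the DCT, the filter and coefficient counts, and every function name are assumptions, and a naive DFT stands in for the FFT to keep the example dependency-free. Returning a fixed number of coefficients per frame is one simple way to keep the number of feature values per frame constant.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum; a real system would use an FFT."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    mels = [mel_max * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising slope
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II, used to decorrelate the filter-bank outputs."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(num_coeffs)]


def mfcc(frame, sample_rate, num_filters=12, num_coeffs=8):
    """Return num_coeffs cepstral coefficients for one frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log compression is assumed here; the floor avoids log(0).
    log_energies = [math.log(max(sum(f * p for f, p in zip(filt, spec)), 1e-10))
                    for filt in bank]
    return dct_ii(log_energies, num_coeffs)
```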

16. The method of claim 1, wherein the face similarity calculating step calculates the face similarity by combining the pre-registered feature vector and the face feature vector extracted from the face image.

17. The method of claim 1, wherein the voice similarity calculating step calculates, as the voice similarity, a correlation value between the pre-registered feature data and the voice feature data extracted from the voice.
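
As an illustration of a correlation-based voice similarity, the sketch below computes the Pearson correlation coefficient between two equal-length feature sequences. The claim does not state which correlation measure is used, so Pearson correlation and the function name are assumptions.

```python
import math


def correlation(registered, extracted):
    """Pearson correlation between two equal-length feature sequences."""
    if len(registered) != len(extracted):
        raise ValueError("feature sequences must have the same length")
    n = len(registered)
    mean_r = sum(registered) / n
    mean_e = sum(extracted) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(registered, extracted))
    norm = math.sqrt(sum((r - mean_r) ** 2 for r in registered)
                     * sum((e - mean_e) ** 2 for e in extracted))
    # A constant sequence has no defined correlation; report 0 in that case.
    return cov / norm if norm else 0.0
```

The equal-length requirement is one reason a fixed number of feature values per frame is useful.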

18. The method of claim 1, wherein the identity authentication step comprises:
outputting a normalized face similarity, in which the face similarity is normalized to a number in the range of 0 to 1, and a normalized voice similarity, in which the voice similarity is normalized to a number in the range of 0 to 1; and
an artificial neural network model applying process of performing authentication by applying the normalized face similarity and the normalized voice similarity to the artificial neural network model.

19. The method of claim 18, wherein the artificial neural network model applying process comprises:
training an artificial neural network model composed of an input layer, a hidden layer, and an output layer by applying a learning algorithm that inputs experimental data together with desired output values into the artificial neural network; and
performing authentication by applying the normalized face similarity and the normalized voice similarity to the artificial neural network model trained by the learning algorithm.
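
The score fusion of claims 18 and 19 can be sketched as a small network with two inputs, one hidden layer, and one output, trained by backpropagation. This is a minimal sketch under assumptions, not the patented model: min-max scaling as the normalization, sigmoid activations, squared-error loss, the hidden-layer size, the learning rate, the 0.5 decision threshold, and all names are hypothetical.

```python
import math
import random


def min_max_normalize(score, lo, hi):
    """Map a raw similarity onto [0, 1], clipping values outside [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class FusionNet:
    """Input layer (face, voice) -> hidden layer -> single output in (0, 1)."""

    def __init__(self, hidden=3, seed=0):
        rng = random.Random(seed)
        # Each hidden unit: [w_face, w_voice, bias]
        self.w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
        # Output unit: one weight per hidden unit, then a bias
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(self, face, voice):
        h = [sigmoid(w[0] * face + w[1] * voice + w[2]) for w in self.w1]
        out = sigmoid(sum(w * v for w, v in zip(self.w2, h)) + self.w2[-1])
        return h, out

    def train(self, samples, epochs=5000, lr=0.5):
        """samples: list of ((face, voice), target) with target 1 = genuine."""
        for _ in range(epochs):
            for (face, voice), target in samples:
                h, out = self.forward(face, voice)
                d_out = (out - target) * out * (1.0 - out)
                # Hidden deltas use the output weights from before the update.
                d_h = [d_out * self.w2[j] * h[j] * (1.0 - h[j])
                       for j in range(len(h))]
                for j in range(len(h)):
                    self.w2[j] -= lr * d_out * h[j]
                self.w2[-1] -= lr * d_out
                for j, w in enumerate(self.w1):
                    w[0] -= lr * d_h[j] * face
                    w[1] -= lr * d_h[j] * voice
                    w[2] -= lr * d_h[j]

    def authenticate(self, face, voice, threshold=0.5):
        return self.forward(face, voice)[1] >= threshold
```

Training pairs would consist of normalized face and voice similarities together with the desired output, for example 1 for a genuine user and 0 for an impostor. Compared with a fixed threshold on each score, a trained network can weight the more reliable modality more heavily.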