KR20150042674A

KR20150042674A - Multimodal user recognition robust to environment variation

Info

Publication number: KR20150042674A
Application number: KR20130123325A
Authority: KR
Inventors: 김동주; 이상헌; 손명규; 김현덕
Original assignee: 재단법인대구경북과학기술원
Priority date: 2013-10-10
Filing date: 2013-10-16
Publication date: 2015-04-21
Also published as: KR101514551B1

Abstract

The present invention provides a method for improving user recognition performance by combining or selecting modalities according to changes in the external environment. The multimodal user recognition method includes: calculating a face similarity for a face feature vector of an input image using face model templates; calculating a voice similarity for a voice feature vector of input audio using speech model templates; generating a first value by applying, to the face similarity, a first weight obtained from the magnitude of the illumination associated with the input image; generating a second value by applying, to the voice similarity, a second weight obtained from the magnitude of the noise associated with the input audio; and recognizing the user using a final score obtained by combining the first value and the second value.

Description

MULTIMODAL USER RECOGNITION ROBUST TO ENVIRONMENT VARIATION

The present invention relates to a multimodal user recognition system and method using face images and voice, and more particularly to a method for improving user recognition performance.

Techniques that independently use biometric information such as the human face, voice, iris, fingerprint, signature, or veins have been used to verify personal identity. However, recognition methods that rely on a single type of biometric information suffer degraded recognition performance because of the weaknesses inherent in each type.

For example, changes in external illumination in face-based user recognition, and noise in voice-based user recognition, are major factors that degrade the recognition performance of the respective single-biometric systems. To overcome these limitations and to improve recognition performance and reliability, multimodal approaches that use two or more types of biometric information, such as face and voice, fingerprint and face, iris and fingerprint, or vein and iris, have recently been actively studied.

However, even in multimodal user recognition systems, external environmental factors such as illumination and noise still degrade recognition performance. A solution to this degradation is therefore required.

The present invention provides a multimodal user recognition method that improves user recognition performance by combining or selecting modalities according to changes in the external environment.

A multimodal user recognition method according to an embodiment of the present invention may include: calculating a face similarity for a face feature vector of an input image using face model templates; calculating a voice similarity for a voice feature vector of input audio using speech model templates; generating a first value by applying, to the face similarity, a first weight obtained using the magnitude of the illumination associated with the input image; generating a second value by applying, to the voice similarity, a second weight obtained using the magnitude of the noise associated with the input audio; and recognizing a user using a final score obtained by combining the first value and the second value.

According to one aspect, generating the first value by applying, to the face similarity, the first weight obtained using the magnitude of the illumination associated with the input image may include: maintaining a face recognition rate table that stores recognition rates according to the illumination magnitude of face training data and the illumination magnitude of face test data; estimating the magnitude of the illumination associated with the input image; calculating the first weight using the face recognition rate table and the estimated illumination magnitude; and generating the first value by applying the first weight to the face similarity.

According to another aspect, estimating the magnitude of the illumination associated with the input image may include estimating that magnitude using the Retinex algorithm.

According to another aspect, generating the second value by applying, to the voice similarity, the second weight obtained using the magnitude of the noise associated with the input audio may include: maintaining a voice recognition rate table that stores recognition rates according to the noise magnitude of voice training data and the noise magnitude of voice test data; estimating the magnitude of the noise associated with the input audio; calculating the second weight using the voice recognition rate table and the estimated noise magnitude; and generating the second value by applying the second weight to the voice similarity.

According to another aspect, estimating the magnitude of the noise associated with the input audio may include estimating that magnitude using the SNNR.

According to another aspect, the face model templates may be associated with the face training data, and the speech model templates may be associated with the voice training data.

According to another aspect, calculating the first weight and calculating the second weight may include: maintaining a face recognition rate table that stores recognition rates according to the illumination magnitude of face training data and the illumination magnitude of face test data; maintaining a voice recognition rate table that stores recognition rates according to the noise magnitude of voice training data and the noise magnitude of voice test data; calculating a first recognition rate using the face recognition rate table, the face training data determined to have the highest face similarity, and the estimated illumination magnitude; calculating a second recognition rate using the voice recognition rate table, the voice training data determined to have the highest voice similarity, and the estimated noise magnitude; calculating the first weight for the face similarity using the first recognition rate and the second recognition rate; and calculating the second weight for the voice similarity using the first recognition rate and the second recognition rate.

A multimodal user recognition method according to an embodiment of the present invention may include: maintaining a face recognition rate table that stores recognition rates according to the illumination magnitude of face training data and the illumination magnitude of face test data; maintaining a voice recognition rate table that stores recognition rates according to the noise magnitude of voice training data and the noise magnitude of voice test data; estimating the magnitude of the illumination associated with the input image and the magnitude of the noise associated with the input audio; calculating a first recognition rate using the face recognition rate table and the estimated illumination magnitude; calculating a second recognition rate using the voice recognition rate table and the estimated noise magnitude; and, when the difference between the first recognition rate and the second recognition rate is greater than a predetermined threshold, recognizing the user using the modality with the higher recognition rate.

According to one aspect, the multimodal user recognition method may further include: detecting a face region in the input image; detecting a voice region in the input audio; extracting a face feature vector from the detected face region; extracting a voice feature vector from the detected voice region; calculating a face similarity for the face feature vector of the input image using face model templates; and calculating a voice similarity for the voice feature vector of the input audio using speech model templates.

The present invention can improve the recognition rate by combining the face and voice modalities, and can increase the reliability of a user recognition system by providing a method that is robust to the external environment.

FIG. 1 is a diagram illustrating a multimodal user recognition method of a multimodal user recognition apparatus.
FIG. 2 is a flowchart illustrating a method by which a multimodal user recognition apparatus according to an embodiment of the present invention calculates a face similarity.
FIG. 3 is a flowchart illustrating a method by which a multimodal user recognition apparatus according to an embodiment of the present invention calculates a voice similarity.
FIG. 4 is a diagram illustrating a multimodal user recognition method of a multimodal user recognition apparatus according to an embodiment of the present invention.
FIG. 5 is another example illustrating a multimodal user recognition method, using weights, of a multimodal user recognition apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a multimodal user recognition method of a multimodal user recognition apparatus.

The multimodal user recognition apparatus may calculate a face similarity (102) from the input image (101) and a voice similarity (112) from the input audio (110). The calculated face similarity and voice similarity may each be normalized, and the two normalized scores may be combined into a single score. Unlike the prior art, the present invention may calculate (103) a weight for the face and a weight for the voice according to the magnitudes of the external illumination and noise. The final score may be calculated (104) by combining, through the calculated weights, the score calculated from the input image and the score calculated from the input audio, and the user may be recognized (105) from the final score.

FIG. 2 is a flowchart illustrating a method by which a multimodal user recognition apparatus according to an embodiment of the present invention calculates a face similarity.

FIG. 2 relates to the multimodal user recognition method of the multimodal user recognition apparatus and describes in detail the face similarity calculation introduced in FIG. 1.

In step 210, the multimodal user recognition apparatus may detect a face region in the input image. An AdaBoost-based algorithm may be used for this detection. AdaBoost is an algorithm for designing a strong classifier from a linear combination of simple, weak classifiers. It is very simple to implement, can be applied to a variety of learning tasks as a general learning scheme, and both selects features and trains the classifier.

The AdaBoost algorithm according to one embodiment may be designed to find features that robustly detect faces against complex and varied backgrounds, that is, the features that best discriminate face images from non-face images.
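
The idea of building a strong classifier from a linear combination of weak classifiers can be sketched as follows. This is a minimal illustration rather than the detector used in the patent: the decision stumps, their thresholds, and the weights in `alphas` are hypothetical values chosen for the example, and the training that would normally produce them is omitted.

```python
from typing import Callable, List, Sequence

# A weak classifier maps a feature vector to +1 (face) or -1 (non-face).
WeakClassifier = Callable[[Sequence[float]], int]


def make_stump(index: int, threshold: float) -> WeakClassifier:
    """Return a decision stump that votes +1 when features[index] > threshold."""
    def stump(features: Sequence[float]) -> int:
        return 1 if features[index] > threshold else -1
    return stump


def strong_classify(features: Sequence[float],
                    weak_classifiers: List[WeakClassifier],
                    alphas: List[float]) -> int:
    """Return the sign of the weighted vote of all weak classifiers."""
    total = sum(a * h(features) for h, a in zip(weak_classifiers, alphas))
    return 1 if total >= 0 else -1


# Hypothetical stumps and vote weights (not taken from the patent).
stumps = [make_stump(0, 0.5), make_stump(1, 0.2), make_stump(0, 0.9)]
alphas = [0.9, 0.3, 0.4]
```

A single confident stump can outvote the other two here because its weight (0.9) exceeds the sum of the remaining weights (0.7).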

In step 220, the multimodal user recognition apparatus may extract a face feature vector from the detected face region, for example by arranging the features of the detected face region into a vector. Algorithms such as PCA, LDA, 2D-PCA, and 2D-DCT may be used to obtain the feature parameters of the face region.

In step 230, the multimodal user recognition apparatus may calculate the face similarity for the face feature vector of the input image using the face model templates. Using the face model templates and a pattern recognition algorithm, the apparatus may calculate a face similarity value such as a distance or a probability. The face model templates may be associated with the face training data, and may store the face training data used during training together with its illumination magnitude. At this point, the face training data with the highest similarity may be determined. Various methods such as NN, K-NN, SVM, GMM, HMM, and EHMM may be applied for the similarity calculation.
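
A nearest-neighbor (NN) version of this step might look like the sketch below. It assumes Euclidean distance as the similarity measure and a hypothetical `FaceTemplate` record that keeps the illumination level recorded at training time next to each feature vector; the names and values are illustrative only.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FaceTemplate:
    user_id: str
    features: Sequence[float]
    train_lux: float  # illumination level stored at training time (hypothetical field)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query: Sequence[float],
               templates: List[FaceTemplate]) -> Tuple[FaceTemplate, float]:
    """Return the closest template and its distance to the query vector."""
    best = min(templates, key=lambda t: euclidean(query, t.features))
    return best, euclidean(query, best.features)


# Hypothetical enrolled templates.
templates = [
    FaceTemplate("alice", [0.0, 0.0], 300.0),
    FaceTemplate("bob", [3.0, 4.0], 100.0),
]
```

Returning the whole template, not just the user identifier, keeps the stored training illumination available for a later recognition rate lookup.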

In step 240, the multimodal user recognition apparatus may normalize the face similarity.
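
The text does not say which normalization is used. As one common choice, min-max scaling maps raw scores from different modalities onto the same [0, 1] range so that they can be combined. The sketch below assumes that choice; the function names are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Scale scores linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal, so no candidate is preferred over another.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def distances_to_similarities(distances: Sequence[float]) -> List[float]:
    """Convert distances to [0, 1] similarities where 1 is the closest match."""
    return [1.0 - d for d in min_max_normalize(distances)]
```

The second helper matters when the matcher outputs distances, where smaller is better, while the fusion step expects similarities, where larger is better.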

The multimodal user recognition apparatus according to one embodiment is not limited to the algorithms described here for region detection, feature extraction, similarity calculation, and illumination or noise magnitude estimation.

FIG. 3 is a flowchart illustrating a method by which a multimodal user recognition apparatus according to an embodiment of the present invention calculates a voice similarity.

FIG. 3 relates to the multimodal user recognition method of the multimodal user recognition apparatus and describes in detail the voice similarity calculation introduced in FIG. 1.

In step 310, the multimodal user recognition apparatus may detect a voice region in the input audio. End-point detection techniques in the time and frequency domains may be used for this detection. End-point detection finds the start and end of speech. By using such techniques, the multimodal user recognition apparatus according to one embodiment can detect speech simply and quickly in the time and frequency domains.
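
A simple time-domain form of end-point detection thresholds the short-time energy of fixed-length frames. The sketch below assumes that approach; the frame length and energy threshold are hypothetical parameters that would be tuned to the sampling rate and microphone in practice.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def detect_endpoints(samples: Sequence[float],
                     frame_len: int = 4,
                     threshold: float = 0.01) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the speech region, or None.

    The region spans from the first to the last frame whose energy
    exceeds the threshold; `end` is exclusive.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Silence, a burst of "speech", then silence again.
samples = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
```

With these samples, `detect_endpoints(samples)` returns `(8, 16)`, the span of the burst.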

In step 320, the multimodal user recognition apparatus may extract a voice feature vector from the detected voice region, for example by arranging the features of the detected voice region into a vector. Pitch, LPC, MFCC, and similar methods may be used to obtain the voice feature parameters. Linear predictive coding (LPC), for example, is a technique used in many fields such as speech analysis and synthesis, and can accurately represent the characteristics of the speech spectrum with a relatively small number of parameters.

In step 330, the multimodal user recognition apparatus may calculate the voice similarity for the voice feature vector of the input audio using the speech model templates. Using the speech model templates and a pattern recognition algorithm, the apparatus may calculate a voice similarity value such as a distance or a probability. The speech model templates may be associated with the voice training data, and may store the voice training data used during training together with its noise magnitude. At this point, the voice training data with the highest similarity may be determined. Algorithms such as HMM, GMM, and SVM may be applied for the voice similarity measurement.

In step 340, the multimodal user recognition apparatus may normalize the voice similarity.

FIG. 4 is a diagram illustrating a multimodal user recognition method of a multimodal user recognition apparatus according to an embodiment of the present invention.

The multimodal user recognition method may be performed by the multimodal user recognition apparatus. The method of calculating the weights is described in detail below.

In step 401, the multimodal user recognition apparatus may maintain a face recognition rate table that stores recognition rates according to the illumination magnitude of the face training data and the illumination magnitude of the face test data.

Table 1 shows face recognition rates by illumination level. Illumination levels are expressed in lux; the vertical axis is the illumination level of the training data and the horizontal axis is the illumination level of the test data. The table shows that recognition performance is best when the training and test illumination levels are the same, and that the recognition rate drops sharply when they differ greatly.

In step 402, the multimodal user recognition apparatus may obtain the illumination magnitude of the face model template. From the face training data and illumination magnitudes stored during face training, it may obtain the illumination magnitude of the face training data determined to have the highest similarity.

In step 403, the multimodal user recognition apparatus may estimate the magnitude of the illumination associated with the input image, for example using the Retinex algorithm. The Retinex algorithm rests on two facts: the relationship between image brightness and visually perceived sensation is logarithmic, and image brightness is the product of a reflectance component (the actual brightness) and an illumination component. On that basis it reduces the illumination component and presents only the reflectance component, thereby increasing image contrast. In the present invention, the illumination magnitude is estimated within the Retinex algorithm. The Retinex algorithm according to one embodiment may also be used to improve image contrast or sharpness, or, when the dynamic range of pixel values is large, to compress it and relieve bottlenecks in image data transmission.
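
The text does not spell out how the illumination magnitude is read from Retinex. One plausible reading, sketched below under that assumption, follows single-scale Retinex: a smoothed copy of the image serves as the illumination component, its mean is taken as the illumination magnitude, and the log difference gives the reflectance. A box blur stands in for the Gaussian surround usually used, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def box_blur(image: Image, radius: int = 1) -> Image:
    """Average each pixel with its neighbors inside a (2r+1) x (2r+1) window."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[yy][xx]
                    for yy in range(max(0, y - radius), min(h, y + radius + 1))
                    for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def estimate_illumination(image: Image, radius: int = 1) -> Tuple[Image, float]:
    """Return the smoothed illumination map and its mean as a scalar magnitude."""
    illum = box_blur(image, radius)
    mean = sum(sum(row) for row in illum) / (len(illum) * len(illum[0]))
    return illum, mean


def log_reflectance(image: Image, illum: Image, eps: float = 1.0) -> Image:
    """log(I) - log(L): what remains after removing the illumination component."""
    return [[math.log(p + eps) - math.log(l + eps) for p, l in zip(ri, rl)]
            for ri, rl in zip(image, illum)]
```

For a uniformly lit image the reflectance comes out as zero everywhere and the scalar estimate equals the pixel value, which is the behavior expected from the product model of brightness.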

In step 404, the multimodal user recognition apparatus may calculate the first recognition rate using the face recognition rate table and the estimated illumination magnitude. The first recognition rate is obtained from the face recognition rate table using the illumination magnitude of the face model template obtained in step 402 and the illumination magnitude associated with the input image obtained in step 403. If the two illumination magnitudes are the same, the recognition rate may be high; if they differ, the recognition rate may be low.
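
The table lookup in this step can be sketched as a two-dimensional index by training level and test level. The illumination levels and recognition rates below are placeholder values made up for illustration (they are not the values of Table 1), and snapping to the nearest tabulated level is an assumption about how intermediate measurements are handled.

```python
from typing import List, Sequence

# Hypothetical illumination levels in lux (rows: training, columns: test).
LUX_LEVELS = [100.0, 300.0, 500.0]

# Hypothetical recognition rates in percent; the diagonal is highest because
# matching training and test conditions give the best performance.
FACE_RATE_TABLE = [
    [95.0, 80.0, 60.0],
    [82.0, 96.0, 81.0],
    [61.0, 83.0, 97.0],
]


def nearest_index(levels: Sequence[float], value: float) -> int:
    """Index of the tabulated level closest to the measured value."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))


def lookup_rate(table: List[List[float]], levels: Sequence[float],
                train_level: float, test_level: float) -> float:
    """Recognition rate for a (training level, test level) pair."""
    row = nearest_index(levels, train_level)
    col = nearest_index(levels, test_level)
    return table[row][col]
```

The same two functions would serve a voice recognition rate table indexed by noise level in dB.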

In step 411, the multimodal user recognition apparatus may maintain a voice recognition rate table that stores recognition rates according to the noise magnitude of the voice training data and the noise magnitude of the voice test data.

The voice recognition rate table in Table 2 shows user recognition rates by voice noise level. Noise levels are expressed in dB; the vertical axis is the noise level of the training data and the horizontal axis is the noise level of the test data. The table shows that the recognition rate is high when the noise levels of the test data and the training data are similar, and that it drops sharply when they differ.

In step 412, the multimodal user recognition apparatus may obtain the noise magnitude of the speech model template. From the voice training data and noise magnitudes stored during voice training, it may obtain the noise magnitude of the voice training data determined to have the highest similarity.

In step 413, the multimodal user recognition apparatus may estimate the magnitude of the noise associated with the input audio, for example using the SNNR (Signal Plus Noise to Noise Ratio).
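
As its name indicates, the SNNR compares the power of the noisy speech (signal plus noise) with the power of the noise alone. The sketch below assumes the noise power is measured on a non-speech segment, such as the samples outside the detected voice region, and reports the ratio in dB; the function names are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a segment."""
    return sum(s * s for s in samples) / len(samples)


def snnr_db(speech_segment: Sequence[float],
            noise_segment: Sequence[float]) -> float:
    """10 * log10((signal + noise power) / noise power).

    `speech_segment` holds noisy speech; `noise_segment` holds noise only.
    """
    noise_power = mean_power(noise_segment)
    if noise_power == 0.0:
        raise ValueError("noise segment has zero power; SNNR is undefined")
    return 10.0 * math.log10(mean_power(speech_segment) / noise_power)
```

For example, speech with amplitude 1.0 against noise with amplitude 0.1 gives a power ratio of 100, which corresponds to about 20 dB.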

In step 414, the multimodal user recognition apparatus may calculate the second recognition rate using the voice recognition rate table and the estimated noise magnitude. The second recognition rate may be obtained from the voice recognition rate table using the noise magnitude of the speech model template obtained in step 412 and the noise magnitude associated with the input audio obtained in step 413. If the two noise magnitudes are the same, the recognition rate may be high; if they differ, the recognition rate may be low.

In step 405, the multimodal user recognition apparatus may calculate the weights from the first recognition rate obtained from the face recognition rate table and the second recognition rate obtained from the voice recognition rate table. The first weight for the normalized face similarity and the second weight for the normalized voice similarity are each calculated using both the first recognition rate and the second recognition rate. Let the recognition rate for the face be $R_1$ and the recognition rate for the voice be $R_2$. The weight $W_n$ can then be expressed as Equation 1:

$$W_n = \frac{R_n}{\sum_{i=1}^{N} R_i} \qquad (1)$$

Here, N denotes the number of modalities. For example, when the two modalities of voice and face are used, N is 2.

For example, if the recognition rate for the face is 93.12 and the recognition rate for the voice is 89.15, the face weight W₁ can be calculated as 93.12 / (93.12 + 89.15) and the voice weight W₂ as 89.15 / (93.12 + 89.15), so that W₁ is about 0.51 and W₂ is about 0.49. The first value may then be generated by applying the first weight to the normalized face similarity, and the second value by applying the second weight to the normalized voice similarity.
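
The figures in this example are consistent with each weight being a modality's recognition rate divided by the sum of the recognition rates of all modalities. The sketch below computes the weights under that assumption; `recognition_weights` is a hypothetical name.

```python
from typing import List, Sequence


def recognition_weights(rates: Sequence[float]) -> List[float]:
    """Weight of each modality: its recognition rate over the sum of all rates."""
    total = sum(rates)
    if total <= 0:
        raise ValueError("recognition rates must sum to a positive value")
    return [r / total for r in rates]


# Rates from the worked example: face 93.12, voice 89.15.
w_face, w_voice = recognition_weights([93.12, 89.15])
```

Because the weights are normalized by the same sum, they always add up to 1, and the function extends unchanged to more than two modalities.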

In step 406, the multimodal user recognition apparatus may generate the first value by applying the first weight obtained in step 405 to the normalized face similarity, and the second value by applying the second weight obtained in step 405 to the normalized voice similarity. The final score may be calculated by combining the first value and the second value, and the user may be recognized using the final score.
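
The text says only that the two values are combined. A weighted sum is one natural reading, and the sketch below assumes it: each enrolled user gets a fused score, and the user with the highest score is taken as the recognition result. The user names and scores are hypothetical.

```python
from typing import Dict


def fuse_scores(face_scores: Dict[str, float],
                voice_scores: Dict[str, float],
                w_face: float, w_voice: float) -> Dict[str, float]:
    """Weighted sum of normalized face and voice similarities per user."""
    users = face_scores.keys() & voice_scores.keys()
    return {u: w_face * face_scores[u] + w_voice * voice_scores[u]
            for u in users}


def recognize(fused: Dict[str, float]) -> str:
    """Return the user with the highest final score."""
    return max(fused, key=fused.get)


# Hypothetical normalized similarities for two enrolled users.
face_scores = {"alice": 0.9, "bob": 0.4}
voice_scores = {"alice": 0.3, "bob": 0.8}
```

With nearly equal weights the face evidence for "alice" prevails, while a strongly voice-weighted fusion, as would follow from poor lighting, flips the decision to "bob".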

The multimodal user recognition apparatus according to one embodiment may recognize a user using a plurality of modalities, not only face and voice.

FIG. 5 shows another example of a multimodal user recognition method of the multimodal user recognition apparatus according to an embodiment of the present invention.

With reference to FIG. 5, a method is described in which |first recognition rate - second recognition rate| is compared with a predetermined threshold and, when the difference exceeds the threshold, the user is recognized using the modality having the higher recognition rate.

In step 501, the multimodal user recognition apparatus may maintain a face recognition rate table that stores the recognition rate according to the illumination level of the face training data and the illumination level of the face test data. The illumination level is expressed in lux; the vertical axis of the table represents the illumination level of the training data, and the horizontal axis represents the illumination level of the test data.

In step 502, the multimodal user recognition apparatus may obtain the magnitude of the illumination of the face model template. Specifically, from the face training data and illumination magnitudes stored during face training, it may obtain the illumination magnitude of the face training data determined to have the highest similarity.

In step 503, the multimodal user recognition apparatus may estimate the magnitude of the illumination associated with the input image, for example by using the Retinex algorithm. The Retinex algorithm rests on two observations: the relationship between image brightness and visually perceived sensation is logarithmic, and image brightness is given by the product of a reflectance component, which is the actual brightness, and an illumination component. On this basis it reduces the illumination component of the image and represents only the reflectance component, with the aim of increasing the image contrast.

The Retinex algorithm according to an embodiment may be used to improve image contrast or sharpness, and also to relieve the bottleneck in image data transmission when the dynamic range of pixel values is large.
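As an illustration of the Retinex idea described above, the following is a minimal single-scale sketch, not the apparatus's actual implementation. It makes several assumptions not stated in the source: the illumination component is approximated by a local mean (a box blur stands in for the Gaussian surround commonly used with Retinex), the illumination magnitude is taken as the mean of that smoothed image, and no conversion to lux is attempted. All function names are hypothetical.

```python
import math


def box_blur(image, radius):
    """Local mean over a (2*radius+1)^2 window, clipped at the image borders."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            blurred[y][x] = sum(window) / len(window)
    return blurred


def single_scale_retinex(image, radius=1, eps=1.0):
    """Return (log_reflectance, illumination) for a grayscale image.

    The image is modeled as reflectance * illumination, so in the log domain
    log(reflectance) = log(image) - log(illumination).
    """
    illumination = box_blur(image, radius)
    log_reflectance = [
        [math.log(p + eps) - math.log(l + eps) for p, l in zip(row, lrow)]
        for row, lrow in zip(image, illumination)
    ]
    return log_reflectance, illumination


def illumination_magnitude(image, radius=1):
    """Assumed scalar summary: mean of the estimated illumination component."""
    _, illumination = single_scale_retinex(image, radius)
    values = [v for row in illumination for v in row]
    return sum(values) / len(values)


# Toy 3x3 grayscale images (made-up pixel values for illustration only).
dark = [[10, 12, 11], [9, 10, 12], [11, 10, 9]]
bright = [[200, 210, 205], [190, 200, 210], [205, 200, 195]]
print(illumination_magnitude(dark) < illumination_magnitude(bright))  # True
```

The scalar returned here is only a relative brightness indicator; mapping it to the lux levels used by the face recognition rate table would need a calibration step that the source does not describe.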

In step 504, the multimodal user recognition apparatus may calculate the first recognition rate using the face recognition rate table and the estimated illumination magnitude. That is, the first recognition rate may be obtained from the face recognition rate table using the illumination magnitude of the face model template obtained in step 502 and the illumination magnitude associated with the input image obtained in step 503.

According to the face recognition rate table of Table 1 presented above, the multimodal user recognition apparatus according to an embodiment may exhibit a high recognition rate when the illumination magnitude of the face model template equals the illumination magnitude associated with the input image, and a low recognition rate when the two differ.
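The table lookup of step 504 can be sketched as follows. The illumination levels and recognition rates below are placeholders chosen for illustration, not the values of Table 1, and the nearest-level matching rule is an assumption; the source only states that the rate is obtained from the table using the two illumination magnitudes.

```python
# Hypothetical illumination levels in lux; rows = training data, columns = test data.
LEVELS = [100, 300, 500]

# Placeholder recognition rates in percent (NOT the values of Table 1).
FACE_RATE_TABLE = [
    [90.0, 70.0, 50.0],
    [70.0, 90.0, 70.0],
    [50.0, 70.0, 90.0],
]


def nearest_index(levels, value):
    """Index of the tabulated level closest to the measured value (assumed rule)."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))


def lookup_recognition_rate(table, levels, train_level, test_level):
    """Look up the rate for a (template level, input level) pair."""
    row = nearest_index(levels, train_level)
    col = nearest_index(levels, test_level)
    return table[row][col]


# Matched conditions give a high rate, mismatched conditions a low one.
print(lookup_recognition_rate(FACE_RATE_TABLE, LEVELS, 300, 320))  # 90.0
print(lookup_recognition_rate(FACE_RATE_TABLE, LEVELS, 100, 480))  # 50.0
```

The same lookup applies to the voice recognition rate table, with noise levels in dB in place of illumination levels in lux.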

In step 511, the multimodal user recognition apparatus may maintain a voice recognition rate table that stores the recognition rate according to the noise level of the voice training data and the noise level of the voice test data. The voice recognition rate table shows the user recognition rate according to the noise level of the voice. The noise level is expressed in dB; the vertical axis of the table represents the noise level of the training data, and the horizontal axis represents the noise level of the test data.

In step 512, the multimodal user recognition apparatus may obtain the magnitude of the noise of the voice model template. Specifically, from the voice training data and noise magnitudes stored during voice training, it may obtain the noise magnitude of the voice training data determined to have the highest similarity.

In step 513, the multimodal user recognition apparatus may estimate the magnitude of the noise associated with the input audio, for example by using SNNR.
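The source does not expand the abbreviation SNNR or describe the computation. As a rough sketch only, the code below assumes SNNR denotes a signal-plus-noise-to-noise ratio in dB, computed from a segment containing speech and a noise-only segment. How the apparatus finds noise-only segments, and how the ratio maps to the noise levels of the voice recognition rate table, is not specified; the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def snnr_db(noisy_speech, noise_only):
    """Assumed SNNR: 10*log10(power of speech-plus-noise / power of noise)."""
    noise_power = mean_power(noise_only)
    if noise_power == 0:
        raise ValueError("noise-only segment has zero power")
    return 10.0 * math.log10(mean_power(noisy_speech) / noise_power)


# Toy samples (made up): the speech segment has 10x the amplitude of the noise.
speech_segment = [1.0, -1.0, 1.0, -1.0]
noise_segment = [0.1, -0.1, 0.1, -0.1]
print(round(snnr_db(speech_segment, noise_segment), 2))  # 20.0
```

A higher ratio indicates a cleaner input; a lower ratio indicates a noisier environment.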

In step 514, the multimodal user recognition apparatus may calculate the second recognition rate using the voice recognition rate table and the estimated noise magnitude. That is, the second recognition rate may be obtained from the voice recognition rate table using the noise magnitude of the voice model template obtained in step 512 and the noise magnitude associated with the input audio obtained in step 513.

According to the voice recognition rate table of Table 2 presented above, the multimodal user recognition apparatus according to an embodiment may exhibit a high recognition rate when the noise magnitude of the voice model template equals the noise magnitude associated with the input audio, and a low recognition rate when the two differ.

In step 505, the multimodal user recognition apparatus may compare |first recognition rate - second recognition rate| with a predetermined threshold.

The weights may be calculated from the first recognition rate obtained from the face recognition rate table and the second recognition rate obtained from the voice recognition rate table. Using the first and second recognition rates, a first weight for the normalized face similarity and a second weight for the normalized voice similarity may be calculated. Given the recognition rate for the face and the recognition rate for the voice, the weight W_n may be expressed as Equation (1) above.

In addition, |first recognition rate - second recognition rate| may be compared with a predetermined threshold, as expressed in Equation (2).

If Equation (2) is satisfied, the modality having the lower recognition rate is excluded and the user may be recognized using only the remaining modality. For example, when the first recognition rate is greater than the second recognition rate and |first recognition rate - second recognition rate| is greater than the threshold, the user may be recognized using only the face image. When the first recognition rate is smaller than the second recognition rate and |first recognition rate - second recognition rate| is greater than the threshold, the user may be recognized using only the voice.

In step 506, the multimodal user recognition apparatus may recognize the user using the face image. For example, assuming that the first recognition rate is 78.89, the second recognition rate is 29.63, and the threshold is 30, |first recognition rate - second recognition rate| is 49.26, which is greater than the predetermined threshold. The voice modality, which has the lower recognition rate, is therefore excluded, and the user may be recognized using the face image.

In step 507, the multimodal user recognition apparatus may recognize the user by calculating the weights. The weights are calculated as in Equation (1). The multimodal user recognition apparatus may generate a first value by applying the first weight to the normalized face similarity, and a second value by applying the second weight to the normalized voice similarity. The final score may be calculated by combining the first value and the second value, and the user may be recognized using the final score. The final score S is obtained by combining the weights W_n with the normalized score S_n of each modality, and may be expressed as Equation (3).

[Equation 3]
$$S = \sum_{n=1}^{N} W_n S_n$$

That is, the final score is obtained by calculating S = W₁S₁ + W₂S₂. Here N is 2 in the present invention, and it may vary with the number of modalities.
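The weighted-sum fusion S = W₁S₁ + W₂S₂ can be sketched as follows. The candidate names and similarity scores are made up for illustration, and selecting the enrolled user with the highest final score is an assumption; the source states only that the user is recognized using the final score.

```python
def fuse_scores(weights, scores):
    """Final score S = sum over modalities of W_n * S_n."""
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def recognize(weights, candidate_scores):
    """Return the candidate with the highest fused score (assumed decision rule)."""
    return max(candidate_scores,
               key=lambda name: fuse_scores(weights, candidate_scores[name]))


# Hypothetical normalized (face, voice) similarities per enrolled user.
candidates = {
    "user_a": (0.80, 0.60),
    "user_b": (0.55, 0.70),
}
weights = (0.51, 0.49)  # rounded weights for the face and the voice

print(f"{fuse_scores(weights, candidates['user_a']):.3f}")  # 0.702
print(recognize(weights, candidates))  # user_a
```

Because the weights sum to one and the similarities are normalized, the final score stays in the same range as the individual similarities.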

In step 508, the multimodal user recognition apparatus may recognize the user using the voice. For example, assuming that the first recognition rate is 48.94, the second recognition rate is 79.1, and the threshold is 30, |first recognition rate - second recognition rate| is 30.16, which is greater than the predetermined threshold. The face modality, which has the lower recognition rate, is therefore excluded, and the user may be recognized using the voice.
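The branching among steps 506, 507, and 508 can be summarized in a short sketch. The function name and the returned labels are hypothetical; the numeric cases are taken from the examples in the text.

```python
def select_strategy(face_rate, voice_rate, threshold):
    """Choose how to recognize the user from the two recognition rates.

    If the rates differ by more than the threshold, only the modality with the
    higher rate is used; otherwise both are fused with weights.
    """
    if abs(face_rate - voice_rate) > threshold:
        return "face" if face_rate > voice_rate else "voice"
    return "fusion"


print(select_strategy(78.89, 29.63, 30))  # face   (step 506, difference 49.26)
print(select_strategy(48.94, 79.1, 30))   # voice  (step 508, difference 30.16)
print(select_strategy(93.12, 89.15, 30))  # fusion (step 507, difference 3.97)
```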

The multimodal user recognition apparatus according to an embodiment may combine two modalities to improve the recognition rate under external environmental factors such as illumination and noise.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but those skilled in the art will recognize that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as a parallel processor, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to limited embodiments and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims

1. A multimodal user recognition method comprising:
calculating a face similarity for a face feature vector of an input image using face model templates;
calculating a voice similarity for a voice feature vector of input audio using voice model templates;
generating a first value by applying, to the face similarity, a first weight obtained using a magnitude of illumination associated with the input image;
generating a second value by applying, to the voice similarity, a second weight obtained using a magnitude of noise associated with the input audio; and
recognizing a user using a final score obtained by combining the first value and the second value.

2. The method of claim 1,
wherein generating the first value by applying, to the face similarity, the first weight obtained using the magnitude of the illumination associated with the input image comprises:
maintaining a face recognition rate table storing a recognition rate according to an illumination magnitude of face training data and an illumination magnitude of face test data;
estimating the magnitude of the illumination associated with the input image;
calculating the first weight using the face recognition rate table and the estimated magnitude of the illumination; and
generating the first value by applying the first weight to the face similarity.

3. The method of claim 2,
wherein estimating the magnitude of the illumination associated with the input image comprises:
estimating the magnitude of the illumination associated with the input image using a Retinex algorithm.

4. The method of claim 1,
wherein generating the second value by applying, to the voice similarity, the second weight obtained using the magnitude of the noise associated with the input audio comprises:
maintaining a voice recognition rate table storing a recognition rate according to a noise magnitude of voice training data and a noise magnitude of voice test data;
estimating the magnitude of the noise associated with the input audio;
calculating the second weight using the voice recognition rate table and the estimated magnitude of the noise; and
generating the second value by applying the second weight to the voice similarity.

5. The method of claim 4,
wherein estimating the magnitude of the noise associated with the input audio comprises:
estimating the magnitude of the noise associated with the input audio using SNNR.

6. The method of claim 2 or 4,
wherein the face model template is associated with the face training data, and
the voice model template is associated with the voice training data.

7. The method of claim 1,
wherein calculating the first weight and calculating the second weight comprise:
maintaining a face recognition rate table storing a recognition rate according to an illumination magnitude of face training data and an illumination magnitude of face test data;
maintaining a voice recognition rate table storing a recognition rate according to a noise magnitude of voice training data and a noise magnitude of voice test data;
calculating a first recognition rate using the face recognition rate table, the face training data determined to have the highest face similarity, and the estimated magnitude of the illumination;
calculating a second recognition rate using the voice recognition rate table, the voice training data determined to have the highest voice similarity, and the estimated magnitude of the noise;
calculating the first weight for the face similarity using the first recognition rate and the second recognition rate; and
calculating the second weight for the voice similarity using the first recognition rate and the second recognition rate.

8. A multimodal user recognition method comprising:
maintaining a face recognition rate table storing a recognition rate according to an illumination magnitude of face training data and an illumination magnitude of face test data;
maintaining a voice recognition rate table storing a recognition rate according to a noise magnitude of voice training data and a noise magnitude of voice test data;
estimating a magnitude of illumination associated with an input image and a magnitude of noise associated with input audio;
calculating a first recognition rate using the face recognition rate table and the estimated magnitude of the illumination;
calculating a second recognition rate using the voice recognition rate table and the estimated magnitude of the noise; and
when the difference between the first recognition rate and the second recognition rate is greater than a predetermined threshold, recognizing a user using the modality having the higher recognition rate.

9. The method of claim 1, further comprising:
detecting a face region from the input image;
detecting a voice region from the input audio;
extracting the face feature vector from the detected face region;
extracting the voice feature vector from the detected voice region;
calculating the face similarity for the face feature vector of the input image using the face model templates; and
calculating the voice similarity for the voice feature vector of the input audio using the voice model templates.