KR20140146840A

KR20140146840A - Visual speech recognition system using multiple lip movement features extracted from lip image

Info

Publication number: KR20140146840A
Application number: KR1020130069665A
Authority: KR
Inventors: 윤인찬; 이연주; 추준욱; 송현아; 서준교; 최귀원; 고혜승
Original assignee: 한국과학기술연구원
Priority date: 2013-06-18
Filing date: 2013-06-18
Publication date: 2014-12-29
Also published as: KR101480816B1

Abstract

A speech recognition system according to the present invention comprises: an image acquisition device that acquires an image of a user; an area detection module that detects a face area and a lip area in the image of the user; a lip shape feature extraction module that extracts a lip shape feature from the positions of a plurality of feature points inside and outside the lip area; a lip texture feature extraction module that extracts a lip texture feature from brightness information of the lip area; and a speech recognition module that, from pre-stored lip shape feature and lip texture feature information for each speech unit, extracts and recognizes the speech unit having both the lip shape feature extracted by the lip shape feature extraction module and the lip texture feature extracted by the lip texture feature extraction module.

Description

[0001] The present invention relates to a visual speech recognition system using a plurality of lip movement features extracted from a lip image.

The present invention relates to a visual speech recognition system, and more particularly to a visual speech recognition system that uses a plurality of lip movement features extracted from lip images for speech recognition.

Speech recognition technology is applied in a variety of applications related to human-computer interaction.

However, most conventional speech recognition techniques recognize the word or sentence intended by the user from a voice signal input to a microphone. Speech recognition based on such a voice signal suffers a large drop in recognition performance in noisy environments.

Research has shown that human speech recognition relies on vision as well as hearing. Building on this finding, visual speech recognition, which recognizes speech from images, has been studied; this is known as lipreading.

However, conventional lipreading research uses a mono camera to extract lip movement features in two-dimensional space and attempts to distinguish visemes, the minimal units of visual speech. Because the lip features are extracted in two-dimensional space, the misrecognition rate is high, and most research results have been used only as an auxiliary to speech recognition methods based on voice signals.

Korean Patent Laid-Open Publication No. 10-2011-0048209

The present invention has been made to solve the above problems of the prior art. Its object is to provide a speech recognition system that achieves accurate and stable speech recognition performance by simultaneously using lip shape information and the brightness information of image pixels in a two-dimensional image.

To achieve the above object, a speech recognition system according to the present invention includes: an image acquisition device that acquires an image of a user; an area detection module that detects a face area and a lip area in the user's image; a lip shape feature extraction module that extracts a lip shape feature from the positions of a plurality of feature points inside and outside the lip area; a lip texture feature extraction module that extracts a lip texture feature from brightness information of the lip area; and a speech recognition module that, from pre-stored lip shape feature and lip texture feature information for each speech unit, extracts and recognizes the speech unit having both the lip shape feature extracted by the lip shape feature extraction module and the lip texture feature extracted by the lip texture feature extraction module.

According to one embodiment, the image acquisition device is a stereo camera that acquires left and right stereo images of the user.

In addition, the lip shape feature extraction module may calculate the three-dimensional coordinates of the feature points from the left and right stereo images and extract a three-dimensional lip shape feature from those coordinates.

In addition, the speech recognition system according to one embodiment further includes a database that acquires, through a pre-learning process on a simulated user, the coordinate values of the feature points inside and outside the simulated user's lip area for each speech unit, and that uses the coordinate values acquired from the simulated user to generate and store three-dimensional lip shape models representing lip movement information for each speech unit.

The lip shape models have model parameters, and the lip shape feature extraction module may fit the coordinate values of the feature points of the actual user's face to the lip shape models and extract the model parameters of the most similar model as the lip shape feature.

In addition, the lip texture feature may be the gradient of the brightness values of the lip area or the distribution of the gradient directions.

According to one embodiment, the system further includes a feature-level integration module that generates an integrated lip feature combining the lip shape feature and the lip texture feature, and the speech recognition module extracts and recognizes, from the integrated lip feature information for each speech unit pre-stored in the database, the speech unit having the integrated lip feature input from the feature-level integration module.

According to another embodiment, the speech recognition module includes a first speech recognition module that extracts, from the lip shape feature information for each speech unit pre-stored in the database, the speech unit having the lip shape feature extracted by the lip shape feature extraction module, and a second speech recognition module that extracts, from the lip texture feature information for each speech unit pre-stored in the database, the speech unit having the lip texture feature extracted by the lip texture feature extraction module. The system further includes a score-level integration module that assigns scores to the speech units extracted by the first and second speech recognition modules according to preset weights and recognizes and outputs a high-scoring speech unit as the final speech unit.

FIG. 1 is a conceptual diagram showing the configuration of a speech recognition system and its information processing flow according to an embodiment of the present invention.
FIG. 2 shows an embodiment in which a face area is detected using the area detection module.
FIG. 3 shows an embodiment in which lip shape models are generated through pre-learning.
FIG. 4 shows an embodiment in which lip shape features are extracted.
FIG. 5 shows an embodiment in which lip texture features are extracted.
FIG. 6 shows an embodiment in which words are recognized by integrating lip shape features and lip texture features.

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings. Although the present invention is described with reference to the embodiments shown in the drawings, these are described as examples only, and the technical idea of the invention and its core configuration and operation are not limited by them.

A speech recognition system according to an embodiment of the present invention includes an image acquisition device capable of acquiring an image of a user; an area detection module that detects a face area and a lip area in the user's image; a lip shape feature extraction module that extracts a lip shape feature from the positions of a plurality of feature points inside and outside the lip area; a lip texture feature extraction module that extracts a lip texture feature from brightness information of the lip area; and a speech recognition module that, from pre-stored lip shape feature and lip texture feature information for each speech unit, extracts and recognizes the speech unit having both the lip shape feature extracted by the lip shape feature extraction module and the lip texture feature extracted by the lip texture feature extraction module.

Hereinafter, a speech recognition method using the speech recognition system according to this embodiment is described with reference to FIG. 1.

Referring to FIG. 1, an image of the user is acquired using the image acquisition device (1). The image acquisition device (1) according to this embodiment is a stereo camera capable of acquiring left and right stereo images of the user simultaneously.

The area detection module detects the face area and the lip area in the image acquired from the stereo camera (1).

FIG. 2 is a diagram for explaining a method of detecting the face area and the lip area from an image (10) acquired from the stereo camera (1).

According to this embodiment, the face is detected by applying the AdaBoost face detection algorithm to the acquired image (10) of the user.

Next, as indicated by reference numeral 20, the vertical edge components in the detected face area are projected, and in the resulting profile the maxima (21, 22) to the left and right of the center of the x-axis are found to locate the left and right boundaries of the face. Then, as indicated by reference numeral 30, the horizontal edge components are projected onto the y-axis, and in the resulting profile the maximum (31) within the range below the center of the y-axis is found to select the initial lip region (31).
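The projection step can be illustrated with a minimal sketch. It assumes the face area is available as a list of rows of grayscale values and approximates edge strength by simple neighbor differences; the function names, the difference-based edge measure, and the half-image search ranges are illustrative assumptions, not details given in the patent.

```python
def edge_projections(gray):
    """Return (column_profile, row_profile) of absolute edge strength.

    Vertical edges are approximated by horizontal neighbor differences and
    summed per column; horizontal edges by vertical neighbor differences
    and summed per row. (Assumed edge measure, for illustration only.)
    """
    h, w = len(gray), len(gray[0])
    col_profile = [0] * w
    row_profile = [0] * h
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                col_profile[x] += abs(gray[y][x + 1] - gray[y][x])
            if y + 1 < h:
                row_profile[y] += abs(gray[y + 1][x] - gray[y][x])
    return col_profile, row_profile


def argmax_in(profile, start, stop):
    """Index of the largest profile value within [start, stop)."""
    return max(range(start, stop), key=lambda i: profile[i])


def locate_face_sides_and_lip_row(gray):
    """Find left/right face boundaries and an initial lip row."""
    col_profile, row_profile = edge_projections(gray)
    w, h = len(col_profile), len(row_profile)
    left = argmax_in(col_profile, 0, w // 2)     # maximum left of x center
    right = argmax_in(col_profile, w // 2, w)    # maximum right of x center
    lip_row = argmax_in(row_profile, h // 2, h)  # maximum below y center
    return left, right, lip_row
```

In practice the profiles would usually be smoothed before taking maxima, since raw edge sums are sensitive to noise.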

Although not shown, according to this embodiment, the LUX (Simplified Logarithmic Hue Extension) method, which converts the RGB color space into a nonlinear color space, is applied to the image within the initial lip region, and the exact position of the lip area is found using an adaptive threshold.

Referring again to FIG. 1, a lip shape feature is extracted by the lip shape feature extraction module, and a lip texture feature is extracted by the lip texture feature extraction module.

First, lip shape feature extraction is described.

According to this embodiment, three-dimensional lip shape models are generated through a pre-learning process for use in lip shape feature extraction.

The pre-learning process is performed on a simulated user (100). Reflective markers are attached to a plurality of feature points inside and outside the lip area of the simulated user (100) to acquire the coordinate values of those feature points, and the coordinate values acquired from the simulated user (100) are used to generate three-dimensional lip shape models representing lip movement information for each speech unit.

Here, a "speech unit" may be a word, a syllable, and/or a phoneme.

This is described in more detail below with reference to FIG. 3.

A plurality of reflective markers (101 to 111) are attached to the simulated user (100) at preset feature points inside and outside the lip area. When the simulated user (100) moves the lips to pronounce a speech unit, several motion cameras (201 to 205) track the movement of the reflective markers to obtain accurate three-dimensional coordinates of each feature point for that speech unit.

In FIG. 3, the positions of the feature points for two different speech units are shown by way of example, with a prime (') and a double prime (") appended to the reference numerals.

Once a database of accurate coordinate values of the three-dimensional lip feature points for the lip movements of the simulated user (100) has been acquired, a three-dimensional lip shape model (300) capable of representing various lip movements is formed through a learning process. According to this embodiment, the PCA algorithm is used for the three-dimensional lip shape modeling.

This process is described mathematically as follows.

The matrix consisting of the x, y, and z coordinates of M reflective markers for T speech units is expressed as [Equation 1] below.

[Equation 1]

When the lip shape model is built by applying the PCA algorithm, it can be expressed as [Equation 2] below.

[Equation 2]

Here, the lip shape model has a parameter b_i.
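The equation images themselves are not reproduced in this text. For orientation only, a PCA shape model of the kind described is commonly written in the following form. Apart from T, M, and b_i, every symbol here (the mean shape, the principal components p_i, and the number of retained components K) is an assumption taken from the standard PCA formulation, not the patent's own notation.

```latex
% Training data: one 3M-dimensional shape vector per speech unit (assumed layout)
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_T \end{bmatrix},
\qquad
\mathbf{x}_t = \left( x_{t,1},\, y_{t,1},\, z_{t,1},\, \ldots,\, x_{t,M},\, y_{t,M},\, z_{t,M} \right)^{\top}

% Standard PCA shape model: mean shape plus a weighted sum of principal components
\mathbf{x} \approx \bar{\mathbf{x}} + \sum_{i=1}^{K} b_i\, \mathbf{p}_i,
\qquad
\bar{\mathbf{x}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t
```

In this standard form the weights b_i play the role of the model parameters that the description later uses as the lip shape feature.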

The lip shape model for each speech unit and its parameters are stored in the database (2) (see FIG. 1).

Referring to FIG. 4, an image of the actual user is acquired through the stereo camera (1). Once the exact position of the lips has been found in the acquired image, it is used as the initial value for a feature point extraction algorithm such as AAM (Active Appearance Model) or ASM (Active Shape Model), which extracts the feature points located inside and outside the lip area at the same positions in the left and right stereo images.

The feature points extracted from the actual user's image are at positions corresponding to those of the reflective markers attached to the simulated user's face during the learning process.

Then, triangulation is applied to the feature points extracted from the left and right stereo images to calculate the three-dimensional coordinate values of each feature point, and these are reconstructed into a three-dimensional model.
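As an illustration of the triangulation step, the following minimal sketch recovers a 3D point from one matched feature point in the left and right images. It assumes an ideal rectified stereo pair with a shared focal length and principal point; the function name and all parameters (focal_px, baseline, cx, cy) are hypothetical, and the patent does not state which triangulation formulation it uses.

```python
def triangulate_rectified(left_pt, right_pt, focal_px, baseline, cx, cy):
    """Triangulate one feature point from a rectified stereo pair.

    left_pt, right_pt: (x, y) pixel coordinates of the same feature point.
    focal_px: focal length in pixels; baseline: camera separation in meters.
    cx, cy: principal point in pixels. All are assumed calibration values.
    Returns (X, Y, Z) in the left camera frame, in meters.
    """
    (xl, yl), (xr, _) = left_pt, right_pt
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline / disparity  # depth from similar triangles
    x = (xl - cx) * z / focal_px         # back-project pixel offset to meters
    y = (yl - cy) * z / focal_px
    return (x, y, z)
```

Applying this to every lip feature point yields the set of 3D coordinates that forms the reconstructed lip model.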

FIG. 4 shows, by way of example, three-dimensional models (401, 402) in which the actual user's lip shape for two speech units is reconstructed in three dimensions.

The calculated three-dimensional coordinate values of the feature points extracted from the actual user's image are fitted to the three-dimensional lip shape models (300) stored in the database to find the model parameters of the most similar model. According to this embodiment, the model parameters found in this way are selected as the lip shape feature of the speech unit pronounced by the actual user.

If the three-dimensional coordinate values of the feature points extracted from the input image of the actual user are expressed as [Equation 3] below, the process of fitting them to the lip shape model of [Equation 2] can be expressed as [Equation 4] below, and the model parameters found for the speech unit can be expressed as [Equation 5] below.

[Equation 3]

[Equation 4]

[Equation 5]

The lip shape feature selected through this process is used to recognize the speech unit that exhibits the feature.
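The fitting step can be sketched as follows, under stated assumptions: each stored model is taken to be a mean shape plus an orthonormal set of principal components, fitting is done by projecting the measured shape onto those components, and "most similar" is judged by reconstruction error. The function names, the per-speech-unit model dictionary, and the error criterion are illustrative choices; the patent does not specify the similarity measure.

```python
def project(shape, mean, basis):
    """Model parameters b_i = p_i . (x - mean), for an orthonormal basis."""
    centered = [s - m for s, m in zip(shape, mean)]
    return [sum(p * c for p, c in zip(vec, centered)) for vec in basis]


def reconstruct(params, mean, basis):
    """Rebuild a shape from model parameters: mean + sum_i b_i * p_i."""
    out = list(mean)
    for b, vec in zip(params, basis):
        for k, p in enumerate(vec):
            out[k] += b * p
    return out


def fit_most_similar(shape, models):
    """Fit a shape to every stored model and keep the best match.

    models: dict mapping a speech-unit label to (mean, basis).
    Returns (label, params, squared_error) for the model whose
    reconstruction is closest to the measured shape.
    """
    best = None
    for label, (mean, basis) in models.items():
        params = project(shape, mean, basis)
        approx = reconstruct(params, mean, basis)
        err = sum((a - s) ** 2 for a, s in zip(approx, shape))
        if best is None or err < best[2]:
            best = (label, params, err)
    return best
```

The returned parameters correspond to what the description calls the lip shape feature; a real system would use the full 3M-dimensional shape vectors rather than the toy dimensions this sketch is convenient to test with.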

Meanwhile, according to this embodiment, to compensate for the difficulty of distinguishing various syllables and words from lip shape alone, speech recognition also uses the appearance of the lips, that is, the lip texture.

It is well known to those skilled in the art that the gradient of the brightness values in a given region of a camera image, or the distribution of the gradient directions, can be determined.

As shown in FIG. 5, according to this embodiment, the gradient of the brightness values of all or part of the lip area found in the user's image, or the distribution of the gradient directions, is extracted by the HOG (Histogram of Oriented Gradients) method and selected as the lip texture feature.
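The idea behind the texture feature can be shown with a simplified sketch that builds a single magnitude-weighted histogram of gradient orientations over a grayscale lip patch. This is a reduced illustration, not full HOG: it uses one cell with no block normalization, and the function name, central-difference gradients, and the choice of 9 unsigned orientation bins are assumptions.

```python
import math


def orientation_histogram(gray, bins=9):
    """Magnitude-weighted histogram of gradient directions over a patch.

    gray: list of rows of brightness values. Gradients use central
    differences on interior pixels; directions are unsigned (0 to 180
    degrees). Returns a histogram normalized to sum to 1, or all zeros
    if the patch has no gradient.
    """
    h, w = len(gray), len(gray[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            idx = min(int(angle / (180.0 / bins)), bins - 1)
            hist[idx] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist
```

A full HOG descriptor would divide the lip area into cells, compute one such histogram per cell, and normalize over overlapping blocks before concatenating them.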

Through the pre-learning process, the lip texture features for each speech unit are stored in the database (2).

The speech recognition module recognizes the speech unit using the extracted lip shape feature and lip texture feature. According to this embodiment, the speech recognition module may be a pattern classifier for speech recognition such as an HMM (Hidden Markov Model), an SVM (Support Vector Machine), or an ANN (Artificial Neural Network).

According to this embodiment, a lip shape feature and a lip texture feature are extracted simultaneously for each speech unit the user pronounces, so a process of integrating them is required.

As shown in FIG. 6(a), according to one embodiment, a feature-level integration module that generates an integrated lip feature combining the lip shape feature and the lip texture feature is provided upstream of the speech recognition module.

The database (2) stores integrated lip features for each speech unit, in which the lip shape features and lip texture features extracted through the learning process are combined. The speech recognition module extracts and recognizes, from the integrated lip features for each speech unit stored in the database (2), the speech unit having the integrated lip feature input from the feature-level integration module.
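Feature-level integration can be sketched as follows. The patent does not say how the two features are combined or matched, so this sketch assumes simple concatenation and uses a nearest-neighbor lookup as a stand-in for the HMM, SVM, or ANN classifier; both function names and the stored-feature dictionary are hypothetical.

```python
def integrate_features(shape_params, texture_hist):
    """Combine both features into one vector (concatenation is assumed)."""
    return list(shape_params) + list(texture_hist)


def nearest_unit(feature, stored):
    """Return the speech unit whose stored integrated feature is closest.

    stored: dict mapping a speech-unit label to its integrated feature.
    Squared Euclidean distance stands in for a trained classifier.
    """
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(stored, key=lambda unit: dist(feature, stored[unit]))
```

Because shape parameters and histogram values can have very different ranges, a practical system would normally scale each part before concatenation so that neither dominates the distance.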

The speech unit information recognized by the speech recognition module may be converted into a voice signal by a TTS module and output as speech through a speaker (3) (see FIG. 1). The recognized speech may of course also be output as text on a monitor.

Meanwhile, according to another embodiment of the present invention, the speech recognition module may consist of a first speech recognition module that extracts, from the lip shape feature information for each speech unit pre-stored in the database, candidate speech units having the lip shape feature extracted by the lip shape feature extraction module, and a second speech recognition module that extracts, from the lip texture feature information for each speech unit pre-stored in the database, candidate speech units having the lip texture feature extracted by the lip texture feature extraction module.

In this case, a score-level integration module is provided downstream of the two speech recognition modules to integrate their extraction results. The score-level integration module assigns scores to the speech units extracted by the first and second speech recognition modules according to preset weights, and recognizes and outputs the speech unit with the highest score as the final speech unit.
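Score-level integration can be illustrated with a weighted sum over the candidates returned by the two recognizers. The weight values, the function name, and the convention that a candidate missing from one recognizer contributes a score of zero are assumptions for this sketch; the patent only states that preset weights are used.

```python
def fuse_scores(shape_scores, texture_scores, w_shape=0.6, w_texture=0.4):
    """Combine per-candidate scores from the two recognizers.

    shape_scores, texture_scores: dicts mapping candidate speech units
    to scores from the first and second recognition modules.
    Returns (best_unit, fused_scores).
    """
    units = set(shape_scores) | set(texture_scores)
    fused = {
        u: w_shape * shape_scores.get(u, 0.0)
        + w_texture * texture_scores.get(u, 0.0)
        for u in units
    }
    best = max(fused, key=fused.get)
    return best, fused
```

For example, with shape scores {"a": 0.7, "o": 0.3} and texture scores {"o": 0.9, "u": 0.1}, the candidate "o" wins because it is supported by both recognizers.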

According to this embodiment, by simultaneously using three-dimensional lip shape information for the lip movement and the brightness information of image pixels in a two-dimensional image, a plurality of features can be extracted in real time using a stereo camera, and the speaker's intended speech can be recognized more accurately.

The three-dimensional lip shape information and the brightness information of the two-dimensional image provide information for recognizing the visual information (visemes) that appears when a speaker pronounces speech units to express an intention. The speech intended by the speaker can therefore be recognized accurately even when the speaker's actual voice signal cannot be collected accurately in a noisy environment.

Accordingly, the system can be used not only as a speech recognition and smart interface for general users but also, in various ways, as a communication aid for patients with speech disorders, people with severe disabilities, and the elderly and infirm.

Claims

1. A speech recognition system comprising:
an image acquisition device that acquires an image of a user;
an area detection module that detects a face area and a lip area in the user's image;
a lip shape feature extraction module that extracts a lip shape feature from the positions of a plurality of feature points inside and outside the lip area;
a lip texture feature extraction module that extracts a lip texture feature from brightness information of the lip area; and
a speech recognition module that, from pre-stored lip shape feature and lip texture feature information for each speech unit, extracts and recognizes the speech unit having both the lip shape feature extracted by the lip shape feature extraction module and the lip texture feature extracted by the lip texture feature extraction module.

2. The speech recognition system according to claim 1,
wherein the image acquisition device is a stereo camera that acquires left and right stereo images of the user.

3. The speech recognition system according to claim 2,
wherein the lip shape feature extraction module calculates three-dimensional coordinates of the feature points from the left and right stereo images and extracts a three-dimensional lip shape feature from the three-dimensional coordinates.

4. The speech recognition system according to claim 3,
further comprising a database that acquires, through pre-learning on a simulated user, coordinate values of the feature points inside and outside the lip area of the simulated user for each speech unit,
and that generates and stores three-dimensional lip shape models representing lip movement information for each speech unit using the coordinate values of the feature points acquired from the simulated user.

5. The speech recognition system according to claim 4,
wherein the lip shape models have model parameters, and
wherein the lip shape feature extraction module fits the coordinate values of the feature points of the actual user's face to the lip shape models and extracts the model parameters of the most similar model as the lip shape feature.

6. The speech recognition system according to claim 1,
wherein the lip texture feature is the gradient of the brightness values of the lip area or the distribution of the gradient directions.

7. The speech recognition system according to claim 1,
further comprising a feature-level integration module that generates an integrated lip feature combining the lip shape feature and the lip texture feature,
wherein the speech recognition module extracts and recognizes, from integrated lip feature information for each speech unit pre-stored in a database, the speech unit having the integrated lip feature input from the feature-level integration module.

8. The speech recognition system according to claim 1,
wherein the speech recognition module comprises:
a first speech recognition module that extracts, from lip shape feature information for each speech unit pre-stored in a database, a speech unit having the lip shape feature extracted by the lip shape feature extraction module; and
a second speech recognition module that extracts, from lip texture feature information for each speech unit pre-stored in the database, a speech unit having the lip texture feature extracted by the lip texture feature extraction module,
the system further comprising a score-level integration module that assigns scores to the speech units extracted by the first speech recognition module and the second speech recognition module according to preset weights and recognizes and outputs a high-scoring speech unit as the final speech unit.