KR20140093459A

KR20140093459A - Method for automatic speech translation

Info

Publication number: KR20140093459A
Application number: KR1020130005844A
Authority: KR
Inventors: 이수종; 김상훈; 김정세; 윤승; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2013-01-18
Filing date: 2013-01-18
Publication date: 2014-07-28

Abstract

An automatic translation method of the present invention may comprise the steps of authenticating a facial image of a user through the comparison of multiple facial feature vectors and facial image templates which are related to the preset facial image after obtaining the facial image of the user; extracting voice features after separation of the voice segment from the input audio signal; converting the extracted voice features to text sentences by using an acoustic model constructed by speaker adaptation; converting the converted text sentence to target language sentences by using an automatic translation engine; and converting the converted target language sentence to target language voice by using a voice synthesis engine and outputting the target language voice.

Description

{METHOD FOR AUTOMATIC SPEECH TRANSLATION}

The present invention relates to an automatic interpretation technique and, more particularly, to an automatic interpretation method suited to providing an automatic interpretation service on a portable terminal by combining face image authentication of the user, speaker adaptation, and voice segment verification based on lip motion image tracking.

As is well known, automatic interpretation is a technology implemented to enable communication between people who speak different languages, and it is realized by combining element technologies such as speech recognition, automatic translation, and speech synthesis.

Recently, as automatic interpretation technology has begun to be installed on portable terminals such as smartphones and smart devices and has become exposed to acoustically noisy environments, the need to use image information in the speech recognition process, which is vulnerable to acoustic noise, has been raised. This is because image information can be acquired and processed independently of acoustic noise, despite the drawbacks that it is sensitive to illumination and computationally expensive, which constrains real-time processing.

Speaker adaptation is a technique that modifies a speaker-independent speech recognition model, built for many speakers, to fit a particular user's voice, thereby obtaining performance close to that of speaker-dependent speech recognition. Moreover, with online speaker adaptation, the speech recognition model can be continuously improved with new utterances.

Voice segment setting specifies the start and end of speech in an acoustic signal, and it determines the success or failure of speech recognition. In practice, however, accurately setting the voice segment in a noisy environment remains a very difficult problem.

Korean Patent Laid-Open Publication No. 2005-0015585 (published February 21, 2005)

The present invention is intended to use image information in automatic interpretation. Given that the front cameras of portable terminals (or mobile terminals) such as smartphones and smart devices have made image information easy to acquire, and that image information can be acquired and processed independently of acoustic noise, the invention seeks to use image information in the speech recognition process of automatic interpretation, which is vulnerable to acoustic noise. The objects of the invention can be divided as follows.

First, the front camera of the portable terminal is used to authenticate the user's face image. Because a portable terminal may be used by an individual or by a small number of family members, face images can be useful for distinguishing these users from others more precisely.

Second, the user's face image is used in the speech recognition process to achieve improved recognition performance. Once the user is identified by face image authentication, a speech recognition model that has been speaker-adapted to the identified speaker can be used. That is, the invention can tailor speech recognition models to a specific small group of users and provide improved recognition performance for each speaker.

Third, the result of tracking the lip motion image is used to set the voice segment. Because portable terminals are mostly used on the move, the inflow of various ambient acoustic noise is unavoidable, and accurate voice segment setting is difficult in a noisy environment. The invention therefore uses lip motion image tracking, which can be acquired and processed independently of acoustic noise.

Fourth, the invention can replace the conventional push-to-talk (PTT) scheme, in which the start and end of an utterance are activated and terminated manually.

Fifth, by introducing multiple feature vectors and multiple templates for face images, face images captured in diverse environments can be used in the automatic interpretation process.

The present invention provides an automatic interpretation method comprising: acquiring a face image of a user and authenticating it by comparison with multiple face feature vectors and face image templates of preset face images; separating a voice segment from an input voice signal and extracting voice features; converting the extracted voice features into a text sentence using an acoustic model built by speaker adaptation; converting the text sentence into a target language sentence using an automatic translation engine; and converting the target language sentence into target language speech using a speech synthesis engine and outputting it.

According to the present invention, effectively using face images in the speech recognition process of automatic interpretation yields the following effects, and the method can be applied effectively under diverse lighting conditions while also reducing the computation required for image processing.

First, the user can be identified through face image authentication on the portable terminal. To this end, the invention presents a technique that sets a face image window on the user interface screen for automatic interpretation and compares the acquired face image with the multiple feature vectors and multiple templates of preset face images.

Second, improved speech recognition performance can be expected from speaker adaptation for the identified user. Using only a small amount of user data, performance close to that of speaker-dependent speech recognition, which offers the highest recognition performance, can be expected.

Third, the result of tracking the lip motion image can be used to set the voice segment. Because the lips necessarily move when a user speaks for automatic interpretation, checking for lip motion while processing the voice signal makes it possible to set the voice segment effectively.

Fourth, the invention can be used to replace push-to-talk (PTT), which manually activates the start and end of an utterance. Here, PTT is an operating mode that prevents acoustic noise from being processed as a voice signal and sets the voice segment manually.

FIG. 1 is a block diagram of an automatic interpretation apparatus according to an embodiment of the present invention.
FIG. 2 is a detailed block diagram of the lip detection block shown in FIG. 1.
FIG. 3 is a flowchart showing the main process of providing an automatic interpretation service through face image authentication of the user, speaker adaptation, and voice segment verification by lip motion image tracking, according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the main process of detecting lip motion according to the present invention.
FIG. 5 is a conceptual diagram illustrating the concept of providing an automatic interpretation service based on a face image, a lip motion image, and an acoustic signal according to the present invention.

First, the advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. The present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only by way of example, so that the disclosure is complete and so that those of ordinary skill in the art to which the invention pertains can clearly understand the scope of the invention. The technical scope of the invention should therefore be defined by the claims.

In describing the present invention below, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intentions or customs of users, operators, and the like. Their definitions should therefore be based on the technical idea described throughout this specification.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an automatic interpretation apparatus according to an embodiment of the present invention. The apparatus may include an image acquisition block 102, a contour and element extraction block 104, an image authentication block 106, a speaker adaptation DB 108, a speech and language DB 110, a lip detection block 112, a sound acquisition block 114, a voice feature extraction block 116, a speech recognition and sentence generation block 118, a target language sentence generation block 120, and a target language speech generation block 122.

The automatic interpretation apparatus of the present invention may be built as an automatic interpretation app (application) and installed (loaded) removably on a portable terminal (or mobile terminal). Such a portable terminal may be, for example, a mobile phone, smartphone, smart pad, note pad, or tablet PC.

Referring to FIG. 1, the image acquisition block 102 may acquire and recognize a face image of the user (the speaker who wishes to receive the automatic interpretation service) from a camera not shown in the drawing (for example, the front camera of the portable terminal), and may display the frames of the recognized face image in a face image window (for example, reference numeral 502 in FIG. 5) on the user interface screen (the display panel of the portable terminal).

The face image displayed in the face image window of the user interface screen can guide the user to adjust the face position so that it fits the face approach guide line 504, shown by way of example as a dotted line in FIG. 5. The face approach guide line 504 is provided to reduce the computation required for contour extraction.

Next, the contour and element extraction block 104 extracts the face contour and the face elements from the face image window. The face elements may be, for example, the eyebrows, eyes, nose, and mouth, and may include the length, width, and relative position of each element. The extracted face contour and face element information is passed to the image authentication block 106 at the next stage, and the information containing the lips among the face elements (that is, the lower-face image frame) is passed to the lip detection block 112.

The image authentication block 106 compares the speaker's face contour provided by the contour and element extraction block 104 with the face feature vectors (multiple face feature vectors) of the enrolled persons (for example, the owner of the portable terminal and family members) stored in the speaker adaptation DB 108, and identifies (selects) the speaker whose match exceeds a predetermined threshold (matching threshold). It also computes the maximum matching rate by matching the speaker's face elements against predefined face image templates. Here, template matching means matching that compares images pixel by pixel.

The face feature vectors may be set in various forms to account for the diverse conditions in which the portable terminal is used. The computed maximum matching rate, that is, the template matching rate between face images, may be calculated as a percentage (%) and displayed on the user interface screen.

That is, the image authentication block 106 determines that authentication of the user's face image is complete when the maximum matching rate against the predefined face image templates exceeds the preset matching threshold, and it generates a speaker identification signal accordingly and passes it to the speech and language DB 110.

The user's face image template may be added as an option when precise user authentication is needed. In view of the computation involved, the template is reduced as far as possible to a region inside the face that represents the facial features well yet deforms little, which reduces the computation needed to calculate the template matching rate. For example, the region whose width spans the outer edges of both eyebrows and which extends down to the lip contour may be designated as the face image template.
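By way of illustration only, the two-stage authentication described above (feature-vector comparison followed by pixel-wise template matching) might be sketched as follows. The specification does not fix a similarity measure, a pixel tolerance, or threshold values; the cosine similarity, the `tolerance` parameter, both thresholds, and the layout of the `enrolled` dictionary are assumptions introduced for this sketch.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def template_match_rate(image, template, tolerance=10):
    """Percentage of same-position pixels whose gray values differ by at most `tolerance`."""
    pairs = [(p, t) for irow, trow in zip(image, template) for p, t in zip(irow, trow)]
    if not pairs:
        return 0.0
    hits = sum(1 for p, t in pairs if abs(p - t) <= tolerance)
    return 100.0 * hits / len(pairs)


def authenticate(face_vector, face_patch, enrolled,
                 vector_threshold=0.9, match_threshold=80.0):
    """Return (user, rate) for the best enrolled user, or (None, rate) if none passes.

    `enrolled` maps a user name to {"vectors": [...], "templates": [...]}
    (hypothetical layout standing in for the speaker adaptation DB).
    """
    best_user, best_rate = None, 0.0
    for user, record in enrolled.items():
        # Stage 1: at least one of the user's multiple feature vectors must match.
        similarity = max((cosine_similarity(face_vector, v) for v in record["vectors"]),
                         default=0.0)
        if similarity <= vector_threshold:
            continue
        # Stage 2: maximum pixel-wise matching rate over the user's templates.
        rate = max((template_match_rate(face_patch, t) for t in record["templates"]),
                   default=0.0)
        if rate > best_rate:
            best_user, best_rate = user, rate
    if best_rate > match_threshold:
        return best_user, best_rate
    return None, best_rate
```

Storing several vectors and templates per user mirrors the "multiple feature vectors and multiple templates" of the description, which are meant to cover different usage and lighting conditions.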

The speech and language DB 110 holds an acoustic model, a pronunciation dictionary, a language model, and the like, together with a recognition network that combines them. When the user (speaker) has been confirmed through face image recognition (that is, when the speaker identification signal has been generated), these models are used to convert the voice features extracted from the input voice signals into a text sentence.

Next, the lip detection block 112 extracts the lip motion image, that is, the speaker's lip motion, by comparing temporally adjacent image frames (for example, by comparing the differences between pixel values at the same positions). To this end, it may include the configuration shown in FIG. 2.

FIG. 2 is a detailed block diagram of the lip detection block shown in FIG. 1, which may include a motion extraction block 202, a frame memory 204, a motion separation block 206, a feature value extraction block 208, and a motion determination block 210.

Referring to FIG. 2, the motion extraction block 202 may extract the lip motion image by comparing the incoming current image frame (that is, the lower-face image frame) with a temporally adjacent previous image frame (for example, by comparing the differences between pixel values at the same positions). To this end, the frame memory 204 may store at least one previous image frame temporally adjacent to the current image frame.
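The frame comparison described above can be sketched as simple frame differencing. This is a minimal illustration, not the implementation of the motion extraction block: frames are assumed to be equal-sized lists of gray-value rows, and the names `FrameMemory`, `frame_difference`, `extract_lip_motion`, and the `pixel_threshold` value are hypothetical.

```python
from collections import deque


class FrameMemory:
    """Holds the most recent previous frames (stand-in for the frame memory)."""

    def __init__(self, capacity=1):
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        self._frames.append(frame)

    def previous(self):
        return self._frames[-1] if self._frames else None


def frame_difference(prev_frame, curr_frame, pixel_threshold=15):
    """Binary mask: 1 where the gray value at the same position changed by more than the threshold."""
    return [[1 if abs(c - p) > pixel_threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev_frame, curr_frame)]


def motion_ratio(mask):
    """Fraction of pixels flagged as moving."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0


def extract_lip_motion(memory, curr_frame, pixel_threshold=15):
    """Compare the current lower-face frame with the stored previous one.

    Returns None for the very first frame, otherwise the motion mask.
    """
    prev = memory.previous()
    memory.push(curr_frame)
    if prev is None:
        return None
    return frame_difference(prev, curr_frame, pixel_threshold)
```

A capacity of one frame is enough for adjacent-frame comparison; a larger capacity would allow comparison against several earlier frames, which the description permits ("at least one previous image frame").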

Next, the motion separation block 206 may remove noise components from the lip motion image delivered by the motion extraction block 202 through noise filtering, applying a conventional filtering technique well known in the art, and may separate from the image frame only the motion image (the image showing the lips and the region above them).
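The specification leaves the filtering technique open. As one possible reading, the sketch below removes isolated motion pixels from a binary motion mask with a 3x3 neighborhood test and then takes the bounding box of what remains as the separated motion region. The neighbor-count filter, the `min_neighbors` value, and the function names are assumptions made for illustration.

```python
def despeckle(mask, min_neighbors=2):
    """Keep a motion pixel only if at least `min_neighbors` of its 8 neighbors are also set."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Sum over the 3x3 window clipped to the image, minus the pixel itself.
            neighbors = sum(mask[ny][nx]
                            for ny in range(max(0, y - 1), min(height, y + 2))
                            for nx in range(max(0, x - 1), min(width, x + 2))) - 1
            if neighbors >= min_neighbors:
                out[y][x] = 1
    return out


def bounding_box(mask):
    """Return (top, left, bottom, right) of the set pixels, or None if the mask is empty."""
    coords = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return min(ys), min(xs), max(ys), max(xs)
```

The bounding box can then be used to crop the current frame so that later stages only process the moving region.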

The feature value extraction block 208 may extract multiple feature values from the lip motion image of the separated motion image using an image feature extraction algorithm widely known in the art, and pass them to the motion determination block 210.

Meanwhile, the motion determination block 210 determines whether lip motion is present by comparing the multiple feature values provided by the feature value extraction block 208 with predefined lip motion feature vectors (multiple lip motion feature vectors) stored in an internal memory (not shown). When lip motion is indicated, it makes a final determination of whether the speaker's lips are moving by comparing a partial image of the region above the lips (for example, the nose, philtrum, and eyes) with predefined upper-lip templates (multiple upper-lip templates). It then generates the result (a lip motion detection signal or a lip motion non-detection signal) and passes it to the voice feature extraction block 116 of FIG. 1.

That is, the motion determination block 210 computes the maximum matching rate by matching the partial image of the moving upper-lip region against the predefined upper-lip templates, and when the computed maximum matching rate is equal to or greater than a preset threshold, it makes a final determination of lip motion and generates the corresponding lip motion detection signal. The maximum matching rate between the partial upper-lip image and the predefined upper-lip templates may be calculated as a percentage (%) and displayed on the user interface screen. The upper-lip templates are used in determining lip motion in order to reflect the diverse lighting conditions in which users operate the portable terminal.
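The two-stage decision described above might be sketched as follows. The Euclidean distance test for the first stage, the pixel `tolerance`, and both thresholds are assumptions; the specification only states that feature values are compared with stored lip motion feature vectors and that the maximum template matching rate is compared with a preset threshold.

```python
from math import dist


def template_match_rate(image, template, tolerance=10):
    """Percentage of same-position pixels whose gray values differ by at most `tolerance`."""
    pairs = [(p, t) for irow, trow in zip(image, template) for p, t in zip(irow, trow)]
    if not pairs:
        return 0.0
    hits = sum(1 for p, t in pairs if abs(p - t) <= tolerance)
    return 100.0 * hits / len(pairs)


def decide_lip_motion(features, motion_vectors, upper_patch, upper_templates,
                      distance_threshold=1.0, match_threshold=70.0):
    """Return (detected, max_match_rate_percent).

    Stage 1: the extracted feature values must lie close to at least one
             stored lip motion feature vector.
    Stage 2: the upper-lip patch must match at least one stored template
             with a rate at or above `match_threshold`.
    """
    nearest = min((dist(features, v) for v in motion_vectors), default=float("inf"))
    if nearest > distance_threshold:
        return False, 0.0
    best_rate = max((template_match_rate(upper_patch, t) for t in upper_templates),
                    default=0.0)
    return best_rate >= match_threshold, best_rate
```

The returned rate is already a percentage, so it could be shown on the user interface screen as the description suggests.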

Referring again to FIG. 1, the sound acquisition block 114 may acquire the acoustic signal (for example, an acoustic signal containing a voice signal) input through the microphone of the portable terminal (not shown) and pass it to the voice feature extraction block 116.

Next, when the lip detection block 112 provides a lip motion detection signal, the voice feature extraction block 116 may separate the voice segment from the voice signal delivered by the sound acquisition block 114, extract voice features from the voice signal of the separated segment using a voice feature extraction algorithm widely known in the art, and pass them to the speech recognition and sentence generation block 118.

When the lip detection block 112 provides a lip motion non-detection signal, the voice feature extraction block 116 may treat the input voice signal as noise and skip the voice feature extraction process, even if a voice signal (acoustic signal) is delivered from the sound acquisition block 114.
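The gating behavior described above can be illustrated with a short sketch in which one lip motion flag is assumed to be available per audio frame. Audio frames that coincide with lip motion are grouped into voice segments; everything else is discarded as noise. The per-frame alignment of flags and audio, and the function names, are assumptions made for this example.

```python
def lip_gated_segments(lip_flags):
    """Return (start, end) index pairs, end exclusive, for runs of frames with lip motion."""
    segments, start = [], None
    for i, moving in enumerate(lip_flags):
        if moving and start is None:
            start = i                      # lip motion begins: open a segment
        elif not moving and start is not None:
            segments.append((start, i))    # lip motion ends: close the segment
            start = None
    if start is not None:
        segments.append((start, len(lip_flags)))
    return segments


def gate_audio(audio_frames, lip_flags):
    """Keep only the audio frames inside lip motion segments; the rest is treated as noise."""
    return [audio_frames[s:e] for s, e in lip_gated_segments(lip_flags)]
```

Because the segment boundaries come from the image stream rather than the audio, this plays the role that a push-to-talk button would otherwise play.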

The speech recognition and sentence generation block 118 may convert the voice features delivered by the voice feature extraction block 116 into a text sentence using the speech recognition model built for the identified (selected) speaker and stored in the speech and language DB 110. The converted text sentence may be displayed (output) on the user interface screen (the display panel of the portable terminal).

Next, the target language sentence generation block 120 includes, for example, an automatic translation engine, and may use it to convert the text sentence delivered by the speech recognition and sentence generation block 118 into a target language sentence. The converted target language sentence may be displayed (output) on the user interface screen.

Finally, the target language speech generation block 122 includes, for example, a speech synthesis engine, and may use it to convert the target language sentence delivered by the target language sentence generation block 120 into target language speech and output it through the speaker (not shown) of the portable terminal.
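The chain formed by blocks 118, 120, and 122 (recognition, translation, synthesis) can be sketched as a composition of three interchangeable engines. The specification does not name any particular engines, so the `InterpretationPipeline` class and its callable fields are hypothetical placeholders for whatever engines an implementation supplies.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class InterpretationPipeline:
    # (voice features, identified speaker id) -> source language text
    recognize: Callable[[Sequence[float], str], str]
    # source language text -> target language text
    translate: Callable[[str], str]
    # target language text -> synthesized audio bytes
    synthesize: Callable[[str], bytes]

    def run(self, features: Sequence[float], speaker_id: str) -> Tuple[str, str, bytes]:
        """Run recognition, translation, and synthesis in order.

        Both text results are returned alongside the audio so that they can
        be shown on the user interface screen.
        """
        source_text = self.recognize(features, speaker_id)
        target_text = self.translate(source_text)
        audio = self.synthesize(target_text)
        return source_text, target_text, audio
```

Passing the speaker identifier to the recognizer reflects the use of a speaker-adapted recognition model selected by face authentication.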

Next, the sequence of processes for providing an automatic interpretation service using the automatic interpretation apparatus of the present invention, configured as described above, is described in detail.

FIG. 3 is a flowchart showing the main process of providing an automatic interpretation service through face image authentication of the user, speaker adaptation, and voice segment verification by lip motion image tracking, according to an embodiment of the present invention.

Referring to FIG. 3, when an image signal of the user is input from the front camera of the portable terminal (not shown) (step 302), the image acquisition block 102 recognizes it and passes the acquired face image to the contour and element extraction block 104.

The frames of the recognized face image are displayed in the face image window on the user interface screen (step 304). The face image displayed there can guide the user to adjust the face position so that it fits the face approach guide line 504, shown by way of example as a dotted line in FIG. 5.

In response, the contour and element extraction block 104 extracts the face contour and the face elements (for example, the eyebrows, eyes, nose, and mouth, and the length, width, and relative position of each element) from the face image window (step 306).

Next, the image authentication block 106 computes matching rates by comparing the extracted face contour of the speaker with the face feature vectors (multiple face feature vectors) of the enrolled persons (for example, the owner of the portable terminal and family members) stored in the speaker adaptation DB 108, computes the maximum matching rate by matching the speaker's face elements against the predefined face image templates (step 308), and selects (identifies) the speaker whose maximum matching rate exceeds the predetermined threshold (matching threshold) (step 310). Here, template matching means matching that compares images pixel by pixel. The template matching rate between face images is calculated as a percentage (%) and displayed on the user interface screen.

Then the sound acquisition block 114 acquires the acoustic signal (for example, an acoustic signal containing a voice signal) input through the microphone of the portable terminal (not shown) and passes it to the voice feature extraction block 116 (step 312).

In response, the voice feature extraction block 116 extracts voice features from the voice segment separated from the voice signal on the basis of the lip motion extraction result provided by the lip detection block 112. It checks whether lip motion has been detected (step 314). When no lip motion is detected, it treats the input acoustic signal as acoustic noise and does not run the voice feature extraction process (step 316); when lip motion is detected, it proceeds with the subsequent process. That is, even if a voice signal (acoustic signal) is input, the present invention treats it as noise and does not extract voice features unless lip motion is detected.

To this end, the lip detection block 112 extracts the lip motion image, that is, the speaker's lip motion, by comparing temporally adjacent image frames (for example, by comparing the differences between pixel values at the same positions), generates a lip motion detection signal or a lip motion non-detection signal as the result, and provides it to the voice feature extraction block 116. This is described in detail with reference to FIG. 4, which shows the detailed procedure.

FIG. 4 is a flowchart illustrating the main process of detecting lip motion according to the present invention.

Referring to FIG. 4, the motion extraction block 202 extracts a lip motion image by comparing the input current image frame (i.e., the lower-face image frame) with a temporally adjacent previous image frame, for example by comparing the differences between pixel values at the same positions (step 402). For this purpose, the frame memory 204 stores at least one previous image frame that is temporally adjacent to the current image frame.
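A minimal sketch of the same-position pixel comparison in step 402, assuming grayscale frames stored as lists of rows. The names `frame_difference` and `FrameMemory`, and the `min_delta` threshold, are hypothetical; the source does not specify how the pixel differences are thresholded.

```python
from collections import deque
from typing import Deque, List, Optional

Frame = List[List[int]]  # grayscale image: rows of 0-255 intensities


def frame_difference(current: Frame, previous: Frame, min_delta: int = 20) -> Frame:
    """Binary motion mask: 1 where same-position pixels differ by at least min_delta."""
    return [
        [1 if abs(c - p) >= min_delta else 0 for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]


class FrameMemory:
    """Keeps the most recent frames, playing the role of frame memory 204."""

    def __init__(self, capacity: int = 1) -> None:
        self._frames: Deque[Frame] = deque(maxlen=capacity)

    def push_and_diff(self, frame: Frame, min_delta: int = 20) -> Optional[Frame]:
        """Diff the new frame against the latest stored one, then store it."""
        mask = None
        if self._frames:
            mask = frame_difference(frame, self._frames[-1], min_delta)
        self._frames.append(frame)
        return mask
```

The first call returns `None` because there is no previous frame to compare against; every later call returns a mask of the pixels that changed.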

Next, the motion separation block 206 removes noise components from the extracted lip motion image through noise filtering, applying a conventional filtering technique well known in the art, and then separates only the motion image (an image showing the lip region and the region above it) from the image frame (step 404).

Then, the feature value extraction block 208 extracts multiple feature values from the lip motion image within the separated motion image, using an image feature extraction algorithm well known in the art (step 406).

Next, the motion determination block 210 determines whether lip motion is present by comparing the extracted multiple feature values with predefined lip motion feature vectors (multiple lip motion feature vectors) stored in advance in an internal memory (not shown) (step 408). When lip motion is determined to be present, the block makes a final determination of whether the speaker's lips are moving by comparing a partial image of the region above the lips (e.g., the nose, philtrum, or eyes) with predefined upper-lip-region templates (multiple upper-lip-region templates). Specifically, it matches the partial image of the region above the moving lips against the predefined templates to compute a maximum matching rate, and checks whether the computed maximum matching rate is equal to or greater than a preset threshold (step 410).

If the determination at step 410 finds that the maximum matching rate is below the preset threshold, processing proceeds to step 412, where the motion determination block 210 generates a lip motion non-detection signal and passes it to the voice feature extraction block 116.

If the determination at step 410 finds that the maximum matching rate is equal to or greater than the preset threshold, the motion determination block 210 generates a corresponding lip motion detection signal and passes it to the voice feature extraction block 116 (step 414).
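The two-stage decision of steps 408 to 414 can be sketched as below. The source does not say how feature values are compared or how the matching rate is computed, so cosine similarity, the pixel-tolerance matching rate, and both threshold values are assumptions chosen only for illustration; all function names are hypothetical.

```python
import math
from typing import List, Sequence

Patch = List[List[int]]  # grayscale image patch


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_rate(patch: Patch, template: Patch, tolerance: int = 10) -> float:
    """Fraction of same-position pixels that differ by at most tolerance."""
    pairs = [(p, t) for prow, trow in zip(patch, template) for p, t in zip(prow, trow)]
    if not pairs:
        return 0.0
    return sum(1 for p, t in pairs if abs(p - t) <= tolerance) / len(pairs)


def detect_lip_motion(
    features: Sequence[float],
    motion_vectors: Sequence[Sequence[float]],
    upper_patch: Patch,
    upper_templates: Sequence[Patch],
    feature_threshold: float = 0.9,
    match_threshold: float = 0.8,
) -> bool:
    """True corresponds to the detection signal, False to the non-detection signal."""
    # Step 408: compare feature values with the predefined lip motion vectors.
    if not any(cosine_similarity(features, v) >= feature_threshold for v in motion_vectors):
        return False
    # Step 410: maximum matching rate of the region above the lips over all templates.
    best = max((match_rate(upper_patch, t) for t in upper_templates), default=0.0)
    # Steps 412/414: signal detection only at or above the threshold.
    return best >= match_threshold
```

The second stage is what separates the speaker's own lips from other motion in the frame: motion that resembles lip movement is accepted only if the region above it also matches a stored template.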

Referring again to FIG. 3, if the check at step 314 determines that lip motion has been detected (i.e., a lip motion detection signal has been generated), the voice feature extraction block 116 separates the voice segment from the acoustic signal and then extracts voice features from the voice signal of the separated segment, using a voice feature extraction algorithm well known in the art (step 318).

In response, the speech recognition and sentence generation block 118 converts the extracted voice features into a text sentence using the speech recognition model that was built for the identified (selected) speaker and stored in advance in the voice and language DB 110 (step 320). The converted text sentence is displayed (output) on the user interface screen (the display panel of the portable terminal).

Next, the target language sentence generation block 120 converts the recognized text sentence into a target language sentence using an automatic translation engine (step 322). The converted target language sentence is displayed (output) on the user interface screen.

Thereafter, the target language speech generation block 122 uses a speech synthesis engine to convert the target language sentence received from the target language sentence generation block 120 into target language speech, and outputs it through the speaker of the portable terminal (not shown) (step 324).
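The recognition, translation, and synthesis chain of steps 320 to 324 can be sketched as a pipeline of pluggable engines. The class `InterpretationPipeline` and its field names are hypothetical, and the engines are passed in as plain callables because the source names no specific recognizer, translation engine, or synthesizer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class InterpretationPipeline:
    recognize: Callable[[Sequence[float]], str]  # voice features -> source text
    translate: Callable[[str], str]              # source text -> target text
    synthesize: Callable[[str], bytes]           # target text -> audio data
    display: Callable[[str], None]               # shows text on the UI screen

    def run(self, voice_features: Sequence[float]) -> bytes:
        source_text = self.recognize(voice_features)  # step 320
        self.display(source_text)
        target_text = self.translate(source_text)     # step 322
        self.display(target_text)
        return self.synthesize(target_text)           # step 324
```

A usage example with stub engines: `InterpretationPipeline(recognize=lambda f: "hello", translate=str.upper, synthesize=str.encode, display=print).run([0.1])` prints both sentences and returns the encoded bytes of the translated one.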

FIG. 5 is a conceptual diagram illustrating how an automatic interpretation service is provided based on a face image, a lip motion image, and an acoustic signal according to the present invention.

Referring to FIG. 5, a face image window has been added to the user interface screen of a portable terminal (mobile terminal) on which automatic interpretation is implemented. The left side of the figure shows the user (speaker), and the right side shows the flow of the component technologies that drive automatic interpretation.

Reference numeral 506 denotes the front camera of the portable terminal, which acquires the user's face image and displays it in the face image window 502. A face approach guide line 504 is drawn as a dotted line in the face image window 502. The face approach guide line 504 serves to reduce the amount of computation required for face contour extraction. The face contour is extracted from the face image window 502 and compared with predefined face contour features; the maximum matching rate against the face image templates is then computed to perform face image authentication, and the authentication result is recorded in the voice/image shared memory.

In practice, when uttering the source language for automatic interpretation, the user moves closer to the screen, and facial elements move, including blinking of the eyes. Lip motion is detected among these various motion images, a partial image of the region above it (e.g., the nose) is matched against pre-stored templates to compute the maximum matching rate, and the result is recorded in the voice/image shared memory. The template matching rate from face image authentication and the template matching rate for the region above the moving lips are then displayed on the user interface screen.

The source language utterance is input through the microphone, and at the same time the lip motion image signal is checked. If there is no lip motion image signal, the input is regarded as acoustic noise and the process does not proceed any further. Through processing of the lip motion image signal and the voice signal, the voice segment is separated, and voice features are extracted from that segment. Here, an acoustic model, a pronunciation dictionary, and a language model are built from the voice DB and the language DB, and a recognition network that combines them is formed.
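One way to picture the voice segment separation described above is to mark an audio frame as speech only when its energy is high and lip motion is reported for the same frame. This is a sketch under strong assumptions: the source does not describe the separation algorithm, and the per-frame energy input, the per-frame lip motion flags, the `energy_threshold` value, and the function name `voice_segments` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def voice_segments(
    frame_energies: Sequence[float],
    lip_motion_flags: Sequence[bool],
    energy_threshold: float = 0.1,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, where audio is loud and lips move."""
    segments: List[Tuple[int, int]] = []
    start = None
    n = min(len(frame_energies), len(lip_motion_flags))
    for i in range(n):
        active = lip_motion_flags[i] and frame_energies[i] >= energy_threshold
        if active and start is None:
            start = i  # a voice segment begins
        elif not active and start is not None:
            segments.append((start, i))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, n))  # close a segment that runs to the last frame
    return segments
```

Frames that are loud but show no lip motion are dropped, which matches the rule that sound without lip motion is regarded as acoustic noise.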

Meanwhile, once the user has been confirmed (identified) through face image authentication, the voice features extracted during voice signal processing are converted into a text sentence through the acoustic model built by speaker adaptation and the recognition network, and the sentence is output on the user interface screen. As an example, the figure shows the result of uttering “서울역까지 얼마나 걸립니까?” ("How long does it take to get to Seoul Station?") in the source language. For speaker adaptation, a small amount of data obtained from the user is used: common characteristics are extracted from the existing large pool of speakers, models with similar characteristics are grouped, and the recognition model is transformed accordingly.

Next, the recognized sentence is converted into a target language sentence by the automatic translation engine and output on the user interface screen. As an example, it is translated and output as “How long does it take to the Seoul station?”. The translated sentence is converted into target language speech by the speech synthesis engine and output through the speaker, so that the target language user hears it and communication takes place.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will readily appreciate that various substitutions, modifications, and changes are possible without departing from the essential characteristics of the invention. That is, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments.

Accordingly, the scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

102: image acquisition block 104: contour and element extraction block
106: image authentication block 108: speaker adaptation DB
110: voice and language DB 112: lip detection block
114: sound acquisition block 116: voice feature extraction block
118: speech recognition and sentence generation block 120: target language sentence generation block
122: target language speech generation block 202: motion extraction block
204: frame memory 206: motion separation block
208: feature value extraction block 210: motion determination block

Claims

An automatic interpretation method comprising:
acquiring a face image of a user and then authenticating the face image through comparison with multiple facial feature vectors and face image templates associated with a preset face image;
extracting voice features from an input voice signal;
converting the extracted voice features into a text sentence using an acoustic model built by speaker adaptation;
converting the converted text sentence into a target language sentence using an automatic translation engine; and
converting the converted target language sentence into target language speech using a speech synthesis engine and outputting the target language speech.