KR100873920B1

KR100873920B1 - Speech Recognition Method and Device Using Image Analysis

Info

Publication number: KR100873920B1
Application number: KR1020060109438A
Authority: KR
Inventors: 이수종; 박준; 이영직; 김응규
Original assignee: 한국전자통신연구원
Priority date: 2006-11-07
Filing date: 2006-11-07
Publication date: 2008-12-17
Also published as: KR20080041397A

Abstract

The present invention provides a speech recognition method combined with image analysis. The method comprises: receiving sound and an utterance screen of a speaker (an image of the speaker captured while speaking) from the outside; analyzing the received sound; selecting at least one moving zone in the received utterance screen; selecting at least one utterance screen candidate zone corresponding to the at least one selected moving zone; determining an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen; and performing speech recognition by treating the analyzed sound as a speech signal according to the utterance screen zone, wherein the preset reference screen is a screen including a nose portion.

Description

Method and apparatus for speech recognition using image analysis {Method and Apparatus for voice recognition using image analysis}

FIG. 1 is a schematic illustration of a setting to which the present invention is applied.

FIG. 2 is a conceptual diagram illustrating how a speech recognition device recognizes speech according to a preferred embodiment of the present invention.

FIG. 3 is a schematic flowchart of speech recognition according to a preferred embodiment of the present invention.

FIG. 4 is a flowchart illustrating the procedure for analyzing an utterance screen for speech recognition according to a preferred embodiment of the present invention.

FIG. 5 is an exemplary view of utterance screen analysis for speech recognition through lip movement, shown for comparison with the present invention.

FIG. 6 is an exemplary view of utterance screen analysis for speech recognition through movement of the nose according to a preferred embodiment of the present invention.

FIG. 7 is an overall flowchart of a speech recognition method using image analysis according to a preferred embodiment of the present invention.

<Explanation of symbols for main parts of the drawings>

101: speech recognition device

103: voice input device

105: screen input device

107: service device

The present invention relates to a speech recognition method and device using image analysis.

With the rapid development of information technology, convergence among individual technologies is progressing quickly. As PCs and various other information devices have become widespread, it is necessary to make use of the information they can share with one another. In particular, PC video cameras are becoming essential for long-distance video communication, so video information is now readily available.

Meanwhile, one of the most troublesome problems in speech recognition is the acoustic noise collected along with the voice. Such noise ranges from the operating noise of the device itself to communication network noise and environmental noise. Static noise, whose amplitude and frequency are constant, is mostly removed, but dynamic noise, whose amplitude and frequency are not constant, is difficult to remove with conventional speech recognition methods that rely on voice alone.

Such dynamic noise is particularly problematic in continuous speech recognition.

An object of the present invention is to provide a speech recognition method and device using image analysis.

Another object of the present invention is to provide a speech recognition method and device capable of removing unnecessary voice noise by using screen information captured during utterance, in particular screen information that includes the nose.

To achieve the above objects, one aspect of the present invention provides a speech recognition method combined with image analysis, comprising: a) receiving sound and an utterance screen of a speaker from the outside; b) analyzing the received sound; c) selecting at least one moving zone in the received utterance screen of the speaker; d) selecting at least one utterance screen candidate zone corresponding to the at least one selected moving zone; e) determining an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen; and f) performing speech recognition by treating the sound analyzed in step b) as a speech signal according to the utterance screen zone, wherein the preset reference screen is a screen including a nose portion.

In a preferred embodiment, the received sound may be divided into speech signal sections and noise signal sections. The utterance screen candidate zone may be selected from above the selected moving zone. The utterance screen zone may be the utterance screen candidate zone that has the highest matching rate with the reference screen.

Another aspect of the present invention provides a speech recognition device combined with image analysis, comprising: a) an input unit that receives sound and an utterance screen of a speaker from the outside; b) an acoustic analysis unit that analyzes the received sound; c) an image selection unit that selects at least one moving zone in the received utterance screen of the speaker; d) an utterance screen candidate zone selection unit that selects at least one utterance screen candidate zone corresponding to the at least one selected moving zone; e) an utterance screen zone selection unit that determines an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen; and f) a speech recognition unit that performs speech recognition by treating the sound analyzed by the acoustic analysis unit as a speech signal according to the utterance screen zone, wherein the preset reference screen is a screen including a nose portion.

In a preferred embodiment of the device, the received sound may be divided into speech signal sections and noise signal sections. The utterance screen candidate zone may be selected from above the selected moving zone. The utterance screen zone may be the utterance screen candidate zone that has the highest matching rate with the reference screen.

Yet another aspect of the present invention provides a computer-readable recording medium on which a program for speech recognition combined with image analysis is recorded. The program receives sound and an utterance screen of a speaker from the outside, analyzes the received sound, selects at least one moving zone in the received utterance screen of the speaker, selects at least one utterance screen candidate zone corresponding to the at least one selected moving zone, determines an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen, and performs speech recognition by treating the analyzed sound as a speech signal according to the utterance screen zone, wherein the preset reference screen is a screen including a nose portion.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic illustration of a setting to which the present invention is applied.

Referring to FIG. 1, the speaker 100 first speaks into a voice input device 103, such as a microphone, so that the spoken words can be recognized and processed.

Because the voice input device 103 receives sound continuously, it collects not only the voice of the speaker 100 but also noise from the surrounding environment.

Meanwhile, while the speaker 100 is speaking, a screen input device 105, such as a video camera, captures the utterance screen of the speaker 100. The captured screen is used by the speech recognition device 101 to recognize the voice of the speaker 100 clearly.

The acoustic signal received by the voice input device 103 and the utterance screen of the speaker 100 captured by the screen input device 105 are both passed to the speech recognition device 101, which isolates the voice of the speaker 100 and recognizes it.

The service device 107 can then provide various services using the speech signal recognized by the speech recognition device 101.

FIG. 2 is a conceptual diagram illustrating how a speech recognition device recognizes speech according to a preferred embodiment of the present invention.

FIG. 2 shows the acoustic signal input to the voice input device. In the signal graph, the waveform labeled 201 represents the speech intended by the speaker, and the waveform labeled 203 represents ambient noise.

When sound containing such a noise section is fed to a speech recognition device 220 that has no screen input device, the noise signal 203 is mistaken for a speech signal 201 and recognized as incorrect speech 205.

For example, suppose the speech signal 201 is "stop", (silence), "start", and the noise signal 203 occurs during the silence. The speech recognition device 220 may then recognize "stop", "start", "start". This happens because the speech database stored in the speech recognition device 220 is necessarily limited, so when the device processes the noise signal 203 it outputs whichever database entry it judges most similar to the noise.

Because such a misrecognition can strongly affect the actual operation being controlled, noise can cause very serious problems.

By contrast, when a screen input device is available, the speaker's utterance screen 213 can be analyzed to determine whether the speaker actually spoke. From the audio alone, the noise signal 203 may be analyzed as if it were speech, but analysis of the screen 213 shows that the speaker was not speaking during the noise signal 203 section.

Accordingly, a speech recognition device 230 equipped with a screen input device recognizes only the signal intended by the speaker even in this case (215).

Thus, if the speech signal 201 is "stop", (silence), "start" and the noise signal 203 occurs during the silence, the speech recognition device 230 according to the present invention analyzes the speaker's utterance screen 213 and does not treat the sound occurring during the silence as speech, so it correctly recognizes "stop", (silence), "start".
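The "stop", (silence), "start" example can be illustrated with a minimal Python sketch. It assumes, hypothetically, that an audio-only recognizer emits time-stamped candidate segments and that the image analysis yields a list of intervals in which the speaker was judged to be speaking. The `Segment` type, the `gate` function, and the timings are illustrative and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    candidate: str    # best database match found from the audio alone


def overlaps(seg, intervals):
    """True if the segment overlaps any (start, end) speaking interval."""
    return any(seg.start < end and start < seg.end for start, end in intervals)


def gate(segments, speaking_intervals):
    """Keep only candidates that coincide with visually confirmed speech."""
    return [s.candidate for s in segments if overlaps(s, speaking_intervals)]


# Audio-only result: the noise burst at 1.0-1.4 s was matched to "start".
segments = [
    Segment(0.0, 0.5, "stop"),
    Segment(1.0, 1.4, "start"),   # actually noise during the silence
    Segment(2.0, 2.5, "start"),
]
# Hypothetical intervals in which the image analysis found the speaker speaking.
speaking = [(0.0, 0.6), (1.9, 2.6)]

print(gate(segments, speaking))  # ['stop', 'start']
```

The candidate produced by the noise burst is dropped because no speaking interval overlaps it, which mirrors the behavior attributed to device 230.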

FIG. 3 is a schematic flowchart of speech recognition according to a preferred embodiment of the present invention.

Referring to FIG. 3, an acoustic signal is first received from the voice input device (step 301). The received signal may be speech or noise. Because the voice input device does not operate only while the speaker is speaking, noise is inevitably captured before and after the speaker's utterance.

The acoustic signal from the voice input device is then analyzed (step 303). This analysis prepares the signal for an existing speech recognition method and is therefore not described in detail here.

Once the acoustic signal has been analyzed, the information on whether the speaker was speaking (step 311), obtained by analyzing the speaker's utterance screen from the screen input device, is compared with the extracted speaker voice candidate to check whether the speaker actually spoke (step 307). If so, the analyzed acoustic signal is taken to be speech intentionally uttered by the speaker (step 309) and speech recognition proceeds. If not, the analyzed section is a noise section, so speech recognition is skipped and the process returns to the sound analysis step (step 303).

FIG. 4 is a flowchart illustrating the procedure for analyzing an utterance screen for speech recognition according to a preferred embodiment of the present invention.

Referring to FIG. 4, the screen input device first receives the speaker's utterance video (step 401). The video is analyzed to select the moving parts, for example the eyes or the lips (step 403). The selected moving parts are compared with previously analyzed lip movement characteristics to determine the utterance screen candidate zones (step 405), which are later compared with a reference screen to select the utterance screen zone. Because lip movement is so varied that matching against it would require excessive computation, the present invention uses as the reference screen the nose, which hardly moves and has clearly distinguishable light and dark areas.

Next, the reference screen to be compared with the utterance screen candidate zones is retrieved (step 413), and the candidate zones are compared with it to determine whether any part matches the reference screen well (step 407). If a part whose matching rate meets or exceeds a given threshold exists, that part is determined to be the utterance screen zone (step 409).
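As a rough illustration of steps 407 and 409, the sketch below slides a small reference screen over each candidate zone and keeps the candidate with the highest matching rate, provided it reaches a threshold. The patent does not specify how the matching rate is computed; the mean-absolute-difference measure, the 0.9 threshold, the tiny grayscale arrays, and all function names here are assumptions chosen for simplicity.

```python
def match_rate(patch, reference):
    """Similarity in [0, 1]: 1 minus the mean absolute gray-level difference / 255."""
    total = 0
    count = 0
    for row_p, row_r in zip(patch, reference):
        for p, r in zip(row_p, row_r):
            total += abs(p - r)
            count += 1
    return 1.0 - total / (255.0 * count)


def best_match(zone, reference):
    """Slide the reference over the zone; return (best rate, (row, col))."""
    rh, rw = len(reference), len(reference[0])
    zh, zw = len(zone), len(zone[0])
    best = (-1.0, None)  # stays at -1.0 if the zone is smaller than the reference
    for top in range(zh - rh + 1):
        for left in range(zw - rw + 1):
            patch = [row[left:left + rw] for row in zone[top:top + rh]]
            rate = match_rate(patch, reference)
            if rate > best[0]:
                best = (rate, (top, left))
    return best


def pick_utterance_zone(candidates, reference, threshold=0.9):
    """Index of the best-matching candidate at or above threshold, else None."""
    best_index, best_rate = None, threshold
    for i, zone in enumerate(candidates):
        rate, _ = best_match(zone, reference)
        if rate >= best_rate:
            best_index, best_rate = i, rate
    return best_index


# Toy grayscale data: a dark stripe between bright skin stands in for the nose.
reference = [[200, 40, 200],
             [200, 40, 200]]
eye_zone = [[90, 90, 90, 90],
            [90, 90, 90, 90]]
nose_zone = [[10, 200, 40, 200],
             [10, 200, 40, 200]]

print(pick_utterance_zone([eye_zone, nose_zone], reference))  # 1
```

A production system would more likely use a normalized correlation measure that tolerates lighting changes; the simple difference measure is used here only to keep the sketch dependency-free.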

Because the present invention uses part of the nose as the reference screen, the screen compared with the reference screen is the upper part of the utterance screen candidate zone.

Finally, the movement of the utterance screen candidate zone is analyzed to decide whether the speaker is actually speaking (step 411).

FIG. 5 is an exemplary view of utterance screen analysis for speech recognition through lip movement, shown for comparison with the present invention.

Referring to FIG. 5, the speaker's utterance screen 510 received through the screen input device contains the entire face. The speech recognition device first selects the parts of the captured utterance screen 510 where movement occurs (520). In an utterance screen of a person speaking, the parts that move most are typically the two eye areas (521, 523) and the lip area (525). Each selected part then becomes an utterance screen candidate zone and is compared with the features of a preset lip screen (530) to evaluate the candidate zones.

The drawback of this approach is that lip movement during speech is highly varied, so the reference screen would need a very large number of lip movement images to cover each movement. To overcome this drawback, the present invention can determine the utterance screen in the manner shown in FIG. 6.

FIG. 6 is an exemplary view of utterance screen analysis for speech recognition through movement of the nose according to a preferred embodiment of the present invention.

Referring to FIG. 6, the speaker's utterance screen 610 received through the screen input device again contains the entire face.

The speech recognition device according to the present invention then selects the parts of the captured utterance screen 610 where movement occurs (620). As in FIG. 5, these are the eye areas and the lip area. Next, however, the area above each selected moving part is chosen as an utterance screen candidate zone to be compared with the reference screen (630). The areas above the eye movement zones (631, 633) and the area above the lip movement zone (635) are therefore selected for comparison. The zone labeled 635 contains the nose of the speaker's face.
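The selection of candidate zones above the moving parts (630) might be sketched as follows. The `Box` type, the pixel coordinates (origin at the top-left, y increasing downward), the example positions, and the fixed zone height are hypothetical; the patent states only that the area above each moving part is selected.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels (y grows downward)
    w: int  # width in pixels
    h: int  # height in pixels


def zone_above(moving: Box, height: int) -> Optional[Box]:
    """Return the box of the given height directly above a moving zone.

    The box is clipped at the top of the frame; None is returned when
    there is no room above the moving zone.
    """
    top = max(0, moving.y - height)
    if top == moving.y:
        return None
    return Box(moving.x, top, moving.w, moving.y - top)


# Hypothetical moving zones: two eyes and the lips.
moving_zones = [Box(30, 40, 20, 10), Box(70, 40, 20, 10), Box(45, 90, 30, 12)]

candidates = [zone_above(m, 25) for m in moving_zones]
print(candidates[2])  # Box(x=45, y=65, w=30, h=25)
```

Only the zone above the lips would be expected to contain the nose, so it is the one that should match a nose reference screen.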

Each selected part then becomes an utterance screen candidate zone and is compared with a preset nose screen (640) to determine the utterance screen zone in which movement is actually evaluated.

The candidates are compared with a nose screen rather than directly with lip movement screens because the nose has clear contrast and hardly moves, which makes the comparison computationally easier than comparing lip movements directly.

FIG. 7 is an overall flowchart of a speech recognition method using image analysis according to a preferred embodiment of the present invention.

Referring to FIG. 7, an acoustic signal is first received from the voice input device (step 701) and then analyzed (step 703).

After the acoustic signal has been analyzed, whether to run speech recognition on it is decided by comparing the utterance interval obtained by analyzing the speaker's utterance screen from the screen input device (step 721) with the extracted speaker voice candidate, which confirms whether the speaker actually spoke (step 707). If the speaker did speak, the speaker voice candidate is recognized as speech intentionally uttered by the speaker (step 709); otherwise the process returns to the sound analysis step (step 703) and analyzes the sound again.

Meanwhile, while the acoustic signal is being received from the voice input device, the screen input device receives the speaker's utterance video (step 711). The video is analyzed to select the moving parts, for example the eyes or the lips (step 713). A fixed portion above each selected moving part is then set as an utterance screen candidate zone (step 715).
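One simple way to pick out moving parts as in step 713 is frame differencing, sketched below. The patent does not say which motion detection technique is used, so the block-wise mean absolute difference, the block size, the threshold, and the function name are assumptions for illustration only.

```python
def moving_blocks(prev, curr, block=2, threshold=20.0):
    """Return (block_row, block_col) for blocks whose mean absolute
    gray-level change between two frames exceeds the threshold."""
    height, width = len(curr), len(curr[0])
    moving = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            diff = sum(
                abs(curr[r][c] - prev[r][c])
                for r in range(top, top + block)
                for c in range(left, left + block)
            )
            if diff / (block * block) > threshold:
                moving.append((top // block, left // block))
    return moving


# Two toy 4x4 grayscale frames; only the bottom-right 2x2 block changes.
prev_frame = [[100] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
for r in (2, 3):
    for c in (2, 3):
        curr_frame[r][c] = 180

print(moving_blocks(prev_frame, curr_frame))  # [(1, 1)]
```

Adjacent moving blocks would then be merged into moving zones such as the eye and lip areas before the candidate zones above them are chosen.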

Next, the reference screen to be compared with the utterance screen candidate zones is retrieved (step 723), and the candidate zones are compared with it to determine whether any part matches the reference screen well (step 717). If a part matching at or above a given matching rate exists, it is determined to be the utterance screen zone (step 719). The movement of the utterance screen zone is then analyzed to determine the degree to which the speaker is speaking (step 721). The result is used to decide whether to run speech recognition on the sound analyzed in step 703.
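To connect the image branch with step 707, the per-frame result of step 721 can be turned into time intervals during which the speaker is judged to be speaking. The sketch below assumes a hypothetical per-frame motion score for the utterance screen zone and a fixed threshold; neither the score nor the threshold is defined in the patent.

```python
def speaking_intervals(motion, fps, threshold):
    """Convert per-frame motion scores into (start, end) intervals in seconds.

    A frame counts as "speaking" when its score exceeds the threshold;
    consecutive speaking frames are merged into one interval.
    """
    intervals = []
    start = None
    for i, score in enumerate(motion):
        if score > threshold and start is None:
            start = i                                  # interval opens
        elif score <= threshold and start is not None:
            intervals.append((start / fps, i / fps))   # interval closes
            start = None
    if start is not None:                              # still open at the end
        intervals.append((start / fps, len(motion) / fps))
    return intervals


# Hypothetical motion scores sampled at 2 frames per second.
scores = [0, 5, 6, 0, 0, 7, 8, 9]
print(speaking_intervals(scores, fps=2, threshold=3))
# [(0.5, 1.5), (2.5, 4.0)]
```

Sound sections that fall outside every returned interval would be treated as noise and excluded from recognition.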

The present invention is not limited to the above embodiments, and those of ordinary skill in the art can of course make many variations within the spirit of the invention.

The present invention makes it possible to provide a speech recognition method and device using image analysis.

The present invention also makes it possible to provide a speech recognition method and device capable of removing unnecessary voice noise by using screen information captured during utterance, in particular screen information that includes the nose.

Claims

A speech recognition method combined with image analysis, comprising:

a) receiving sound and an utterance screen of a speaker from the outside;

b) analyzing the received sound;

c) selecting at least one moving zone in the received utterance screen of the speaker;

d) selecting at least one utterance screen candidate zone corresponding to the at least one selected moving zone;

e) determining an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen; and

f) performing speech recognition by treating the sound analyzed in step b) as a speech signal according to the utterance screen zone,

wherein the preset reference screen is a screen including a nose portion, and the utterance screen zone is the utterance screen candidate zone having the highest matching rate with the reference screen.

The speech recognition method combined with image analysis of claim 1, wherein the received sound is divided into a speech signal section and a noise signal section.

The speech recognition method combined with image analysis of claim 1, wherein the utterance screen candidate zone is selected above the selected moving zone.

delete

A speech recognition device combined with image analysis, comprising:

a) an input unit that receives sound and an utterance screen of a speaker from the outside;

b) an acoustic analysis unit that analyzes the received sound;

c) an image selection unit that selects at least one moving zone in the received utterance screen of the speaker;

d) an utterance screen candidate zone selection unit that selects at least one utterance screen candidate zone corresponding to the at least one selected moving zone;

e) an utterance screen zone selection unit that determines an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen; and

f) a speech recognition unit that performs speech recognition by treating the sound analyzed by the acoustic analysis unit as a speech signal according to the utterance screen zone,

wherein the preset reference screen is a screen including a nose portion, and the utterance screen zone is the utterance screen candidate zone having the highest matching rate with the reference screen.

The device of claim 5,

Speech recognition device combined with image analysis, characterized in that.

The device of claim 5,

Speech recognition device combined with image analysis, characterized in that.

delete

A computer-readable recording medium having recorded thereon a program for speech recognition combined with image analysis,

wherein the program:

receives sound and an utterance screen of a speaker from the outside; analyzes the received sound; selects at least one moving zone in the received utterance screen of the speaker; selects at least one utterance screen candidate zone corresponding to the at least one selected moving zone; determines an utterance screen zone among the at least one utterance screen candidate zone according to a preset reference screen; and performs speech recognition by treating the analyzed sound as a speech signal according to the utterance screen zone, wherein the preset reference screen is a screen including a nose portion, and the utterance screen zone is the utterance screen candidate zone having the highest matching rate with the reference screen.