KR20060020363A

KR20060020363A - Apparatus and method for recognizing a voice for an audio-visual (av) system

Info

Publication number: KR20060020363A
Application number: KR1020040069193A
Authority: KR
Inventors: 신종근; 유창동; 김상균; 김종욱; 진민호
Original assignee: 엘지전자 주식회사
Priority date: 2004-08-31
Filing date: 2004-08-31
Publication date: 2006-03-06
Also published as: KR100651940B1

Abstract

The present invention is intended to recognize a user's voice accurately. It comprises: an echo canceling unit that receives sounds including echo from an AV system, a user's voice, and noise, and removes the echo from the received sounds based on user-voice-related information; a voice detection unit that detects the user's voice in the received sounds and generates the user-voice-related information based on the detected voice; and a voice recognition unit that compares the detected user's voice with voice patterns contained in at least one model in order to recognize it.

Echo, adaptive filter

Description

Apparatus and method for recognizing a voice for an audio-visual (AV) system

FIG. 1 is a view showing an AV device that includes the voice recognition apparatus of the present invention.

FIG. 2 is a block diagram showing the configuration of the voice recognition apparatus of the present invention.

FIG. 3 is a detailed view of the echo canceling unit of FIG. 2.

FIG. 4 is a flowchart illustrating the voice recognition method of the present invention.

FIG. 5 is a view showing an example of a detected user voice waveform.

* Explanation of reference numerals for the main parts of the drawings *

21: echo canceling unit    22: voice detection unit

23: voice recognition unit    24: memory

211: first filter bank    212: second filter bank

213: third filter bank    215: adaptive filter unit

The present invention relates to voice recognition, and more particularly to a voice recognition apparatus and method for recognizing a user's commands for controlling a device such as an audio-visual (AV) device.

In general, controlling a home appliance such as a TV or an audio system requires using a remote controller or pressing buttons provided on the appliance. A remote controller must always be kept near the user and must be operated by hand, which is inconvenient. If the remote controller is lost, the user has to operate the appliance directly. For people with disabilities and the elderly in particular, even using a remote controller is difficult. To solve these problems, voice recognition technology is being applied to home appliances.

Voice recognition is a technology that conveys a person's intent to a machine or computer through voice, the most natural human means of communication, and makes the machine act on that intent. Voice recognition can already offer people considerable convenience in many fields, but controlling an AV device with it remains difficult. When an AV device is in use, the device itself outputs sound in addition to the ambient noise, and the user is not close to the device, so recognition accuracy is low. Several problems must therefore be overcome before an AV device can be controlled accurately by the user's voice.

The present invention has been made to solve the above problems of the prior art, and its object is to provide a voice recognition apparatus and method capable of accurately recognizing a user's voice even in an environment where echo and noise are present.

To solve the above technical problem, the voice recognition apparatus of the present invention comprises: an echo canceling unit that receives sounds including echo from an AV system, a user's voice, and noise, and removes the echo from the received sounds based on user-voice-related information; a voice detection unit that detects the user's voice in the received sounds and generates the user-voice-related information based on the detected voice; and a voice recognition unit that compares the detected user's voice with voice patterns contained in at least one model in order to recognize it.

The echo canceling unit comprises: a first filter bank for dividing the output sound of the AV system into frequency bands; a second filter bank for dividing the received sounds into frequency bands; adaptive filters that use the signals output from the first filter bank to remove the echo from each of the band-divided sounds; and a third filter bank that combines the signals output from the adaptive filters.

The echo canceling unit controls at least one of the cutoff frequency, pass frequency, and echo cancellation rate of the filters it contains, based on the generated user-voice-related information. The user-voice-related information includes the start-point timing and end-point timing of the user's voice, and further includes at least one of the frequency band, amplitude, and waveform of the user's voice.

The voice recognition unit calculates a first probability that the detected user's voice matches a voice pattern contained in a first model, and a second probability that the detected user's voice matches a voice pattern contained in a second model. The first model contains voice patterns corresponding to preset words, and the second model accumulates voice patterns corresponding to words that are not preset.

The voice recognition unit calculates, for each voice pattern in the first model, the probability that the pattern matches the detected user's voice, and takes the largest of these probabilities as the first probability. Likewise, it calculates, for each voice pattern in the second model, the probability that the pattern matches the detected user's voice, and takes the largest of these probabilities as the second probability.

The voice recognition unit determines whether to recognize the detected user's voice according to the ratio of the first probability to the second probability. Specifically, it compares this ratio with reference values and recognizes the detected user's voice according to the result of the comparison.

The voice recognition method of the present invention comprises the steps of: (a) receiving sounds including echo from an AV system, a user's voice, and noise; (b) detecting the user's voice in the received sounds and generating user-voice-related information based on the detected voice; (c) removing the echo by passing the received sounds through a plurality of filters controlled according to the user-voice-related information; and (d) comparing the detected user's voice with voice patterns contained in at least one model.

In the echo removal step, the parameter values of the filters are held fixed during the interval between the start-point timing and the end-point timing of the user's voice. The filter parameters include at least one of the cutoff frequency, pass frequency, and echo cancellation rate of the filters. Also in the echo removal step, at least one of the cutoff frequency, pass frequency, and echo cancellation rate of the filters is controlled based on the user-voice-related information.

The step of comparing the detected user's voice with voice patterns contained in at least one model comprises: calculating a first probability that the detected user's voice matches a voice pattern contained in a first model and a second probability that the detected user's voice matches a voice pattern contained in a second model; calculating the ratio of the first probability to the second probability; comparing that ratio with reference values; and recognizing the detected user's voice according to the result of the comparison.

A voice recognition apparatus according to another aspect of the present invention comprises: a voice detection unit that detects a user's voice in received sounds; and a voice recognition unit that calculates a first probability that the detected user's voice matches a voice pattern contained in a first model and a second probability that the detected user's voice matches a voice pattern contained in a second model, and recognizes the detected user's voice according to the ratio of the first probability to the second probability.

A voice recognition method according to another aspect of the present invention comprises the steps of: (a) detecting a user's voice in received sounds; (b) calculating a first probability that the detected user's voice matches a voice pattern contained in a first model and a second probability that the detected user's voice matches a voice pattern contained in a second model; and (c) recognizing the detected user's voice according to the ratio of the first probability to the second probability.

The step of calculating the first and second probabilities comprises: calculating, for each voice pattern in the first model, the probability that the pattern matches the detected user's voice; and taking the largest of the calculated probabilities as the first probability.

The step of calculating the first and second probabilities further comprises: calculating, for each voice pattern in the second model, the probability that the pattern matches the detected user's voice; and taking the largest of the calculated probabilities as the second probability.

The step of recognizing the detected user's voice according to the ratio of the first probability to the second probability comprises: comparing the ratio with first and second reference values; and determining, based on the result of the comparison, whether to recognize the detected user's voice. For example, if the ratio is greater than or equal to the first reference value, the detected user's voice is recognized. If the ratio is less than the first reference value and greater than or equal to the second reference value, the word corresponding to the detected user's voice is displayed on the screen. If the ratio is less than the second reference value, the detected user's voice is not recognized.

Hereinafter, a preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows an example of an AV device that includes the voice recognition apparatus of the present invention, and FIG. 2 shows the configuration of the voice recognition apparatus of FIG. 1. As shown in FIG. 1, the voice recognition apparatus 20 receives the user's voice together with echo and noise. The echo is sound from the TV 10 that has been reflected by walls or objects. To recognize the user's voice most effectively, the voice recognition apparatus 20 is preferably located at the front of the TV 10.

As shown in FIG. 2, the microphone of the TV 10 or of the voice recognition apparatus 20 receives the user's voice together with various other sounds such as TV echo and noise, and converts them into frequency signals. The echo canceling unit 21 receives the frequency signals through the microphone of the TV 10 and outputs only the user's voice signal (vocal signal) from them. Low-volume noise can be separated from the user's voice easily, but high-volume TV echo is difficult to separate.

Accordingly, as shown in FIG. 3, the echo canceling unit 21 of the present invention includes infinite impulse response (IIR) filter banks 211, 212, and 213 in order to separate and remove the TV echo effectively. The first filter bank 211 has M channel filters (H0, H1, H2, ..., HM-1) for dividing the audio signals of the broadcast signal into frequency bands (subbands), and the second filter bank 212 has M channel filters (H0, H1, H2, ..., HM-1) for dividing the signals received from the microphone into frequency bands. The channel filters (H0, H1, H2, ..., HM-1) each pass signals of a different frequency band. The adaptive filters (W0, W1, W2, ..., WM-1) of the adaptive filter unit 215 receive the signals separated by the first and second filter banks 211 and 212, each adaptive filter receiving the signals of its own frequency band from both filter banks. For example, the adaptive filter W0 receives the signals output from the channel filter H0 of the first filter bank 211 and from the channel filter H0 of the second filter bank 212. To detect TV echo, the adaptive filters (W0, W1, W2, ..., WM-1) compare the signals output from the first filter bank 211 with those output from the second filter bank 212 and look for signals whose frequency, amplitude, and so on are identical or similar. Among the signals output from the second filter bank 212, the identical or similar signals are regarded as echo signals, and the adaptive filters (W0, W1, W2, ..., WM-1) remove these echo signals from the signals received from the second filter bank 212. As a result, the adaptive filter unit 215 outputs per-band signals that do not contain echo signals. The third filter bank 213 receives these echo-free per-band signals from the adaptive filter unit 215 and combines them. The combined signal contains the user's voice received through the microphone and may occasionally contain noise as well.

The present invention may therefore include a noise canceller for removing noise and residual echo. The noise canceller filters the signal output from the third filter bank 213 according to a set threshold value.
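For illustration only, the per-band echo removal described above can be sketched as follows. The specification does not name the adaptation algorithm, so the normalized LMS (NLMS) update used here, the function name `nlms_echo_cancel`, and the parameters `taps`, `mu`, and `eps` are assumptions rather than part of the disclosed embodiment. The sketch handles a single subband; the analysis and synthesis filter banks are omitted.

```python
import random


def nlms_echo_cancel(reference, microphone, taps=8, mu=0.5, eps=1e-8):
    """Subtract from `microphone` the part that is predictable from `reference`.

    reference  : samples of the AV output in one subband (first filter bank)
    microphone : samples picked up by the microphone in the same subband
    Returns (residual, weights): the echo-reduced signal and the final taps.
    NLMS is an assumed choice; the patent only says "adaptive filter".
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # microphone signal with estimated echo removed
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Toy signals (hypothetical): the echo is the reference delayed by two
# samples and attenuated to 60 percent, with no user voice present.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo = [0.6 * reference[n - 2] if n >= 2 else 0.0 for n in range(2000)]
residual, weights = nlms_echo_cancel(reference, echo)
```

In this toy case the filter should settle on a single dominant tap at index 2, which corresponds to the simulated two-sample echo delay.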

The voice detection unit 22 receives the signal output from the echo canceling unit 21, detects the user's voice in the received signal, and extracts voice-related information. For example, to detect the voice interval, the voice detection unit 22 determines the start point and end point of the voice and detects voice features such as its frequency band, amplitude, and waveform. The voice detection unit 22 provides the voice-related information to the echo canceling unit 21 and to the voice recognition unit 23.

The memory 24 stores statistical information about the patterns corresponding to voices in the form of probability models. When the user's voice is detected, the voice recognition unit 23 compares the detected voice with the probability models stored in the memory 24 and calculates, for each model, the probability (similarity) that the model corresponds to the detected voice. Based on the calculated probabilities, it determines whether the user's voice is a command and, if so, which command. The control unit 11 controls the TV 10 according to the command corresponding to the user's voice.

The voice recognition method according to the present invention is described in detail below. FIG. 4 is a flowchart illustrating the voice recognition method of the present invention. When the user speaks a command (for example, "volume-down") while watching the TV 10, the command "volume-down" is provided to the echo canceling unit 21 through the microphone. At this time, echo and noise are input to the echo canceling unit 21 together with the user's voice.

The user's voice, echo, and noise input to the echo canceling unit 21 are separated by frequency band. The separated echoes are removed by the adaptive filters (W0, W1, W2, ..., WM-1), and the noise is removed by the noise canceller (S30). The echo canceling unit 21 removes the echoes by subband adaptive filtering using the IIR filter banks 211, 212, and 213.

The voice detection unit 22 receives the signal from which echo and noise have been removed from the echo canceling unit 21 and, as shown in FIG. 5, detects the waveform of the voice frame by frame in the received signal in order to detect the user's voice (S31). The voice detection unit 22 also detects the start point and end point of the voice, determines their timing, and detects the frequency band, amplitude, and other features of the voice. To detect the start and end points, the voice detection unit 22 compares the energy of the voice with preset threshold values. For example, if the energy of the signal input to the voice detection unit 22 is greater than a first threshold value, that point is determined to be the start point of the voice, and if the energy of the input signal stays below a second threshold value for a certain period, that point is determined to be the end point of the voice.
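A minimal sketch of this two-threshold endpoint detection is given below for illustration. It assumes the signal has already been cut into fixed-length frames; the function names, the `hangover_frames` parameter (standing in for the "certain period"), and all numeric values are hypothetical and do not come from the specification.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(energies, start_threshold, end_threshold, hangover_frames):
    """Return (start_frame, end_frame), or None if no voice start is found.

    start_frame: first frame whose energy exceeds start_threshold.
    end_frame:   first frame of the first run of `hangover_frames` consecutive
                 frames below end_threshold after the start; if no such run
                 occurs, the end is taken to be the end of the input.
    """
    start = None
    quiet_run = 0
    for i, energy in enumerate(energies):
        if start is None:
            if energy > start_threshold:
                start = i
        elif energy < end_threshold:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return start, i - hangover_frames + 1
        else:
            quiet_run = 0  # energy came back up, so the voice is still going
    if start is None:
        return None
    return start, len(energies)


# Hypothetical frame energies with a brief dip in the middle of the utterance.
energies = [0.0, 0.0, 0.9, 0.8, 0.05, 0.7, 0.01, 0.01, 0.01, 0.0]
endpoints = detect_endpoints(energies, start_threshold=0.5,
                             end_threshold=0.1, hangover_frames=3)
```

Requiring several consecutive quiet frames keeps a short pause inside a command from being mistaken for its end.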

When the start point of the voice is detected, the adaptive filter unit 215 of the echo canceling unit 21 stops adapting, and when the end point of the voice is detected, the adaptive filter unit 215 resumes adapting. Here, adaptation means estimating in real time the echo path of the sound output from the TV 10 and changing the filtering parameter values, for example the cutoff frequency, pass frequency, and echo cancellation rate of the adaptive filters (W0, W1, W2, ..., WM-1), according to the changing echo path (for example, because of a viewer's movement).

However, when the user's voice and echo arrive at the echo canceling unit 21 together, the correct filtering parameter values cannot be determined. The filtering parameter values of the adaptive filter unit 215 are therefore held fixed during that time, and are changed according to the echo path only when echo alone enters the echo canceling unit 21.
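The freeze-and-resume behavior can be sketched as a simple gate on the filter update, shown here for illustration. The helper names `adaptation_mask` and `gated_update`, the frame-index representation of the voice interval, and the NLMS-style update are all assumptions; the specification states only that the parameters are held fixed between the start and end points of the voice.

```python
def adaptation_mask(num_frames, utterances):
    """Per-frame flags: True where the adaptive filters may adapt.

    utterances: list of (start_frame, end_frame) pairs reported by the voice
    detector, with end_frame exclusive (a hypothetical representation).
    """
    mask = [True] * num_frames
    for start, end in utterances:
        for i in range(max(start, 0), min(end, num_frames)):
            mask[i] = False  # user voice present: hold parameters fixed
    return mask


def gated_update(weights, history, error, adapt, mu=0.5, eps=1e-8):
    """Return updated filter weights, or an unchanged copy when adapt is False.

    The NLMS-style step is an assumed stand-in for whatever update the
    adaptive filter unit actually applies.
    """
    if not adapt:
        return list(weights)
    norm = sum(h * h for h in history) + eps
    return [w + mu * error * h / norm for w, h in zip(weights, history)]


# Voice detected in frames 2 and 3 of a six-frame stretch (hypothetical).
mask = adaptation_mask(6, [(2, 4)])
frozen = gated_update([0.1, 0.2], [1.0, 0.0], 0.5, adapt=False)
```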

The voice recognition unit 23 calculates the probability that the detected user's voice is a keyword (a preset word) and the probability that it is a non-keyword (a word that is not preset) (S32). To obtain the keyword probability, the voice recognition unit 23 compares the detected voice with the preset voice patterns contained in the first model (for example, "channel change", "volume-up", "volume-down", "external input") and calculates, for each voice pattern, the probability (i.e., similarity) that the pattern corresponds to the detected voice. The first model is a hidden Markov model (HMM) trained on speech corresponding to each of these words. The voice recognition unit 23 takes the largest of the calculated probabilities as the keyword probability. To obtain the non-keyword probability, the voice recognition unit 23 uses the second model, which is a filler model formed by accumulating voice patterns that are not preset. The voice recognition unit 23 calculates the probabilities (similarities) that the voice patterns of the second model correspond to the detected voice, and takes the largest of these as the non-keyword probability.
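The selection of the keyword and non-keyword probabilities can be illustrated with the sketch below. It assumes the per-pattern probabilities have already been produced by the HMM and filler models, which are not implemented here; the function name `best_match`, the filler labels, and every numeric value are hypothetical.

```python
def best_match(scores):
    """Return (label, probability) of the highest-scoring voice pattern.

    scores: dict mapping a pattern label to the probability (similarity)
    that the pattern corresponds to the detected voice.
    """
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical per-pattern probabilities for one detected utterance.
keyword_scores = {
    "channel change": 0.02,
    "volume-up": 0.10,
    "volume-down": 0.61,
    "external input": 0.01,
}
filler_scores = {"filler-1": 0.05, "filler-2": 0.12}

keyword, keyword_prob = best_match(keyword_scores)  # first model
_, filler_prob = best_match(filler_scores)          # second (filler) model
```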

The voice recognition unit 23 calculates the ratio of the keyword probability to the non-keyword probability (keyword probability / non-keyword probability) and compares this ratio with a first reference value and a second reference value (S33, S35). The first reference value is the value of the ratio at which the probability of malfunction (misrecognition) is 0.5%, and the second reference value is the value of the ratio at which the probability of malfunction is 5%. The first and second reference values are obtained by experiment.
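The specification does not describe the experiment used to obtain the reference values. As one possible reading, offered only as an assumption, a reference value could be chosen as the smallest candidate ratio at which the fraction of non-command utterances that would still be accepted does not exceed the target malfunction rate. The function names and the sample ratios below are hypothetical.

```python
def false_accept_rate(non_command_ratios, threshold):
    """Fraction of non-command utterances whose ratio reaches the threshold."""
    accepted = sum(1 for r in non_command_ratios if r >= threshold)
    return accepted / len(non_command_ratios)


def calibrate(non_command_ratios, target_rate, candidates):
    """Smallest candidate threshold whose empirical false-accept rate does
    not exceed target_rate, or None if no candidate qualifies."""
    for threshold in sorted(candidates):
        if false_accept_rate(non_command_ratios, threshold) <= target_rate:
            return threshold
    return None


# Hypothetical ratios measured on utterances that are not commands.
ratios = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0]
candidates = [1, 2, 3, 5, 7]
loose_reference = calibrate(ratios, 0.30, candidates)
strict_reference = calibrate(ratios, 0.10, candidates)
```

A lower target rate yields a higher reference value, which is consistent with the first reference value (0.5%) lying above the second (5%).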

If the ratio of the keyword probability to the non-keyword probability is greater than or equal to the first reference value, the voice recognition unit 23 recognizes the detected user's voice. For example, the voice recognition unit 23 identifies the voice pattern with the highest probability (similarity) among the voice patterns of the first model and passes the command corresponding to that pattern to the control unit 11. The control unit 11 then controls the TV 10 according to the received command (S34).

If the ratio of the keyword probability to the non-keyword probability is less than the first reference value and greater than or equal to the second reference value, the voice recognition unit 23 does not recognize the detected user's voice immediately, but instead has the word corresponding to the user's voice displayed on the screen for the user. For example, the voice recognition unit 23 identifies the voice pattern with the highest probability (similarity) among the voice patterns of the first model and requests the control unit 11 to display that pattern on the screen (S36). The control unit 11 then displays the identified voice pattern, for example "volume-down", on the screen and waits for the user's confirmation. If a confirmation such as "yes", "OK", or "select" is received from the user (S37), the control unit 11 lowers the volume of the TV 10 (S38). If the user says something such as "No" or "cancel", or a new user command is received, the control unit 11 removes the voice pattern "volume-down" from the screen.

If the ratio of the keyword probability to the non-keyword probability is less than the second reference value, the voice recognition unit 23 does not look for a voice pattern matching the detected user's voice and provides no signal to the control unit 11 (S39). The control unit 11 therefore does not respond to the detected user's voice.
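The three outcomes of the ratio comparison (steps S33 to S39) can be summarized in the following sketch. The function name `decide`, the string labels, and the reference values used in the example are hypothetical; only the comparison logic follows the description.

```python
def decide(keyword_prob, filler_prob, first_ref, second_ref):
    """Map the keyword / non-keyword probability ratio to an action.

    Returns "execute" (ratio >= first_ref), "confirm" (second_ref <= ratio
    < first_ref), or "ignore" (ratio < second_ref). Assumes
    first_ref > second_ref.
    """
    if filler_prob <= 0:
        raise ValueError("filler_prob must be positive")
    ratio = keyword_prob / filler_prob
    if ratio >= first_ref:
        return "execute"   # pass the command to the control unit
    if ratio >= second_ref:
        return "confirm"   # display the word and wait for the user
    return "ignore"        # send nothing to the control unit


# Hypothetical reference values; the real ones are obtained by experiment.
FIRST_REF, SECOND_REF = 4.0, 2.0
action = decide(0.6, 0.1, FIRST_REF, SECOND_REF)
```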

As described above, because the present invention recognizes the user's voice based on the ratio of the keyword probability to the non-keyword probability, it accurately recognizes only the user's voice and carries out the corresponding command even in an environment where echo and noise are present.

The present invention can be applied not only to AV devices but also to a variety of other fields, such as automatic interpretation devices, various home appliances, mobile phones, and toys.

From the foregoing description, those skilled in the art will appreciate that various changes and modifications can be made without departing from the technical spirit of the present invention. Therefore, the technical scope of the present invention should not be limited to the contents described in the embodiments but should be defined by the claims.

Claims

An echo sound removing unit for receiving sounds including echo sound, user voice, and noise of the AV system and removing the echo sound from the received sounds based on user voice related information;

A voice detector for detecting a voice of the user from the received sounds and generating the user voice related information based on the detected user voice;

And a speech recognizer configured to compare the detected user's voice with voice patterns included in at least one model to recognize the detected user's voice.

The apparatus of claim 1,

The echo sound removing unit,

A first filter bank for dividing the output sound of the AV system by frequency bands;

A second filter bank for dividing the received sounds into frequency bands;

Adaptive filters for removing echoes from the sounds divided by frequency bands using the signals output from the first filter bank;

And a third filter bank for integrating the signals output from the adaptive filters.

The apparatus of claim 1,

And a noise canceller for removing noise and residual echo from the echo-removed sounds.

The apparatus of claim 1,

The echo sound removing unit,

Controls at least one of a cutoff frequency, a pass frequency, and an echo cancellation rate of the filters included in the echo sound removing unit based on the generated user voice related information.

The apparatus of claim 1,

And the user voice related information comprises a start point timing and an end point timing of the user voice.

The apparatus of claim 5, wherein

And the user voice related information further comprises at least one of a frequency band, an amplitude, and a waveform of the user voice.

The apparatus of claim 1,

The speech recognition unit,

Calculates a first probability that the detected user's voice matches a voice pattern included in a first model and a second probability that the detected user's voice matches a voice pattern included in a second model.

The apparatus of claim 7, wherein

The speech recognition unit,

And determining whether to recognize the detected user's voice based on a ratio of the first probability and the second probability.

The apparatus of claim 7, wherein

The first model includes voice patterns corresponding to preset words.

The apparatus of claim 7, wherein

The second model accumulates speech patterns corresponding to words that are not preset.

The apparatus of claim 7, wherein

The speech recognition unit,

And comparing the ratio of the first probability to the second probability with reference values, and recognizing the detected user's voice according to the comparison result.

The apparatus of claim 7, wherein

The speech recognition unit,

Calculates the probabilities that the voice patterns included in the first model match the detected user's voice, and determines the largest of the calculated probabilities as the first probability.

The apparatus of claim 7, wherein

The speech recognition unit,

Calculates the probabilities that the voice patterns included in the second model match the detected user's voice, and determines the largest of the calculated probabilities as the second probability.

(a) receiving sounds including an echo sound of an AV system, a user voice, and noise;

(b) detecting a user voice from the received sounds and generating user voice related information based on the detected user voice;

(c) removing echoes by passing the received sounds through a plurality of filters controlled according to the user voice related information;

(d) comparing the detected user's voice with voice patterns included in at least one model.

The method of claim 14,

In the step of removing the echo,

And a parameter value of the filters is fixed during a period between a start point timing and an end point timing of the user voice.
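As a rough illustration of holding filter parameters fixed while the user is speaking, the sketch below uses a single full-band normalized LMS (NLMS) adaptive filter whose coefficient update is skipped for samples flagged as user speech. NLMS, the single-band structure, the choice of filter coefficients as the frozen parameter, and the names `cancel_echo`, `speech_active`, `taps`, and `mu` are all assumptions made for this example; the claims only state that a parameter value of the filters is fixed between the start point and end point timings.

```python
def cancel_echo(reference, mic, speech_active, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `reference` from `mic`.

    reference     : samples of the AV system's own output (the echo source)
    mic           : samples picked up by the microphone
    speech_active : per-sample flags, True between the start point and end
                    point timings of the user's voice (adaptation is frozen)
    Returns (echo-removed samples, final filter coefficients).
    """
    weights = [0.0] * taps
    output = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first, zero-padded at the start.
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(w * xi for w, xi in zip(weights, x))
        error = mic[n] - echo_estimate
        output.append(error)

        # Freeze the coefficients while the user is speaking so the filter
        # does not adapt to (and start cancelling) the user's voice.
        if not speech_active[n]:
            norm = sum(xi * xi for xi in x) + eps
            weights = [w + mu * error * xi / norm for w, xi in zip(weights, x)]
    return output, weights
```

In this sketch the `speech_active` flags would be derived from the start point and end point timings reported by the voice detector; outside that interval the filter keeps tracking the echo path.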

The method of claim 15,

The parameter of the filters is at least one of a cutoff frequency, a pass frequency, and an echo cancellation rate of the filters.

The method of claim 14,

Removing noise and residual echo from the echo-removed sounds.

The method of claim 14,

In the step of removing the echo,

And controlling at least one of a cutoff frequency, a pass frequency, and an echo cancellation rate of the filters based on the user voice related information.

The method of claim 14,

Comparing the detected voice of the user with voice patterns included in at least one model,

Calculating a first probability that the detected user's voice matches a voice pattern included in a first model and a second probability that the detected user's voice matches a voice pattern included in a second model.

The method of claim 19,

Calculating a ratio of the first probability and the second probability;

Comparing the ratio of the first probability to the second probability with reference values;

And recognizing the detected voice of the user according to the comparison result.

The method of claim 19,

The first model includes voice patterns corresponding to preset words.

The method of claim 19,

A voice detector for detecting a voice of the user from the received sounds;

And a speech recognizer configured to calculate a first probability that the detected user's voice matches a voice pattern included in a first model and a second probability that the detected user's voice matches a voice pattern included in a second model, and to recognize the detected user's voice based on a ratio of the first probability to the second probability.

The apparatus of claim 23,

The first model includes voice patterns corresponding to preset words.

The apparatus of claim 23,

The speech recognition unit,

(a) detecting a user's voice in received sounds;

(b) calculating a first probability that the detected user's voice matches a voice pattern included in a first model and a second probability that the detected user's voice matches a voice pattern included in a second model;

and (c) recognizing the detected user's voice according to a ratio of the first probability to the second probability.

The method of claim 27,

The first model includes voice patterns corresponding to preset words.

The method of claim 27,

Computing the first and second probabilities,

Calculating a probability that voice patterns included in the first model match the detected voice of the user;

And determining the largest probability among the calculated probabilities as the first probability.

The method of claim 27,

Computing the first and second probabilities,

Calculating a probability that the voice patterns included in the second model match the detected voice of the user;

And determining the largest probability among the calculated probabilities as the second probability.
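The selection of the first and second probabilities recited in the two preceding claims can be sketched as follows. The function name `probability_ratio` and the dictionary representation of per-pattern probabilities are hypothetical; how the individual match probabilities are computed is outside the scope of this sketch.

```python
def probability_ratio(first_model_probs, second_model_probs):
    """Return the best-matching first-model word and the first/second probability ratio.

    first_model_probs  : dict mapping each preset word to its match probability
    second_model_probs : dict mapping each non-preset pattern to its match probability
    """
    if not first_model_probs or not second_model_probs:
        raise ValueError("both models must contain at least one voice pattern")

    # First probability: largest match probability among the first model's patterns.
    best_word, first_probability = max(first_model_probs.items(), key=lambda kv: kv[1])

    # Second probability: largest match probability among the second model's patterns.
    second_probability = max(second_model_probs.values())
    if second_probability <= 0:
        raise ValueError("second probability must be positive to form a ratio")

    return best_word, first_probability / second_probability
```

The returned ratio is the quantity that is subsequently compared with the first and second reference values, and `best_word` is the candidate that would be recognized or displayed.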

The method of claim 27,

Recognizing the detected user's voice according to the ratio of the first probability to the second probability,

Comparing the ratio of the first probability to the second probability with first and second reference values;

And determining whether to recognize the detected user's voice based on the comparison result.

The method of claim 32,

And if the ratio of the first probability to the second probability is greater than or equal to the first reference value, recognizing the detected user's voice.

The method of claim 32,

And when the ratio of the first probability and the second probability is less than the first reference value and greater than or equal to the second reference value, displaying a word corresponding to the detected user's voice on the screen.

The method of claim 32,

If the ratio of the first probability and the second probability is less than the second reference value, the detected voice of the user is not recognized.