KR20130022637A

KR20130022637A - Voice recognition system and voice recognition method

Info

Publication number: KR20130022637A
Application number: KR1020110085368A
Authority: KR
Inventors: 임종진; 이부열; 한민수; 홍정표
Original assignee: 엘지디스플레이 주식회사; 한국과학기술원
Priority date: 2011-08-25
Filing date: 2011-08-25
Publication date: 2013-03-07
Also published as: KR101899398B1

Abstract

PURPOSE: A voice recognition system and a voice recognition method are provided that improve noise removal performance by accurately removing the first and second audio signals contained in a synthesized signal. CONSTITUTION: Signal delay compensation units (120-126) output a synthesized compensation signal by compensating for the time delay between a synthesized signal and an audio signal. Noise removing units (130-136) remove the first and second audio signals from each synthesized compensation signal. A multi noise removing unit (140) removes additional noise from the resulting voice signals. [Reference numerals] (110) Audio signal modulation unit; (120) First signal delay compensation unit; (122) Second signal delay compensation unit; (124) Third signal delay compensation unit; (126) Fourth signal delay compensation unit; (130, 132, 134, 136, 140) Noise removal unit

Description

Voice recognition system and voice recognition method

Embodiments relate to a voice recognition system.

Embodiments also relate to a voice recognition method.

Voice recognition systems, which recognize a user's voice and perform processing according to the recognition result, are being actively researched.

Voice recognition systems are employed in navigation devices, televisions, and the like.

An audio signal may be output through the speaker of the navigation device or television.

In this case, the user's voice, the audio signal, and ambient noise may be input together as a synthesized signal. To recognize the voice from the synthesized signal, the audio signal and the ambient noise must be removed.

Methods of removing all signals other than a desired signal from a synthesized signal are disclosed in Patent Publication No. 10-2005-0039535 and Patent Publication No. 10-2009-0056598.

Meanwhile, to remove the audio signal from the synthesized signal, the audio signal before it is output to the speaker (the first audio signal) must match the audio signal contained in the synthesized signal (the second audio signal). However, a time difference arises between the first and second audio signals.

The first and second audio signals are the same audio signal.

This time difference is presumed to result from instability of the sampling rate. That is, the sampling rate varies in the range of 15.9 kHz to 16.1 kHz.
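As a rough illustration of the scale of this effect, the sketch below computes how far a clock running at either edge of the stated range drifts from a nominal clock. The nominal rate of 16 kHz and the one-second window are assumptions chosen for illustration; the text gives only the 15.9 kHz to 16.1 kHz range.

```python
def drift_samples(actual_hz: float, nominal_hz: float, seconds: float) -> float:
    """Samples gained (+) or lost (-) relative to the nominal clock."""
    return (actual_hz - nominal_hz) * seconds


NOMINAL_HZ = 16_000.0  # assumed nominal rate (midpoint of the stated range)

for rate in (15_900.0, 16_100.0):
    d = drift_samples(rate, NOMINAL_HZ, 1.0)
    ms = 1000.0 * d / NOMINAL_HZ
    print(f"{rate:.0f} Hz: {d:+.0f} samples after 1 s ({ms:+.2f} ms)")
```

Under these assumptions the offset reaches 100 samples (6.25 ms) per second in either direction, which is large enough to misalign a reference signal and its captured copy if left uncorrected.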

As shown in FIGS. 1A and 1B, the second audio signal lags behind the first audio signal in time.

In FIG. 1B, because the user's voice and ambient noise are not mixed into the synthesized signal, the synthesized signal has almost the same waveform as the first audio signal.

Thus, in some cases the second audio signal leads the first audio signal in time, and in other cases it lags behind.

Because of this time difference between the first and second audio signals, the audio signal cannot be completely removed from the synthesized signal, which degrades the accuracy of voice recognition.

The embodiment provides a voice recognition system and a voice recognition method capable of improving noise removal performance.

The embodiment provides a voice recognition system and a voice recognition method capable of improving the accuracy of voice recognition.

According to an embodiment, a voice recognition system includes: an audio signal generator for generating an audio signal; a speaker for outputting the audio signal; a microphone array including a plurality of microphones, each of which receives a synthesized signal in which a user's voice signal and the audio signal are mixed; a signal processor for compensating for a time delay between the audio signal from the audio signal generator and the synthesized signal from each of the microphones, and for extracting the voice signal; and a processing response unit that responds to the voice signal.

According to an embodiment, a voice recognition method includes: outputting an audio signal through a speaker; receiving, through each microphone, a synthesized signal in which a user's voice signal and the audio signal are mixed; compensating for a time delay between the audio signal from the speaker and the synthesized signal from each microphone; extracting the voice signal from the compensated synthesized signal; and responding to the voice signal.

The time difference between the first and second audio signals before they are output to the first and second speakers and the first and second audio signals contained in the synthesized signal, which is input to the microphones after output from the first and second speakers, can be compensated by each of the first to fourth signal delay compensators.

Accordingly, the first and second audio signals contained in the synthesized signal can be accurately removed, so noise removal performance improves and the accuracy of voice recognition can be increased.

FIGS. 1A and 1B are diagrams showing a conventional audio signal before speaker output and the audio signal input through a microphone after speaker output.
FIG. 2 is a block diagram illustrating a voice recognition system according to an embodiment.
FIG. 3 is a block diagram illustrating the signal processor of FIG. 2.
FIG. 4A is a diagram showing a conventional case in which the time difference between the audio signal before speaker output and the audio signal input through the microphone after speaker output is not corrected.
FIG. 4B is a diagram showing the time difference between the audio signal before speaker output and the audio signal input through the microphone after speaker output corrected according to an embodiment.
FIGS. 5A and 5B are diagrams showing how the audio signal input through the microphone after speaker output is corrected to match the audio signal before speaker output, according to an embodiment.

In the description of embodiments according to the invention, where an element is described as being formed "on (above)" or "under (below)" another element, this includes both the case where the two elements are in direct contact with each other and the case where one or more other elements are disposed between them. In addition, the expression "on (above) or under (below)" may refer to the downward direction as well as the upward direction with respect to one element.

FIG. 2 is a block diagram illustrating a voice recognition system according to an embodiment, and FIG. 3 is a block diagram illustrating the signal processor of FIG. 2.

As shown in FIG. 2, the voice recognition system 10 according to the embodiment may include an audio signal generator 30, a speaker (not shown), a microphone array 20, a signal processor 40, and a processing response unit 50.

The audio signal generator 30 may generate the sound to be output to the speaker, that is, an audio signal.

When the voice recognition system 10 is mounted on a television, an image may be displayed on the screen of the television and sound may be output through the speaker.

The speaker may include a first speaker installed on the left side of the television and a second speaker installed on the right side of the television.

To correspond to the first and second speakers, the audio signal generator 30 may include first and second audio signal generators 31 and 34. That is, the first audio signal generator 31 generates a first audio signal (V_L) to be provided to the first speaker, and the second audio signal generator 34 generates a second audio signal (V_R) to be provided to the second speaker.

The first and second audio signals (V_L, V_R) generated by the first and second audio signal generators 31 and 34 may be provided to the signal processor 40.

The microphone array 20 may include first to fourth microphones 21, 23, 25, and 27.

In the embodiment, the first to fourth microphones 21, 23, 25, and 27 are described for convenience of explanation, but four or more microphones may be provided.

Each of the first to fourth microphones 21, 23, 25, and 27 serves as an input terminal of the voice recognition system 10 and may receive the user's voice.

In general, however, each of the first to fourth microphones 21, 23, 25, and 27 may receive not only the user's voice but also the first and second audio signals (V_L, V_R) output from the first and second speakers, as well as ambient noise.

The closer the first and second speakers are to the first to fourth microphones 21, 23, 25, and 27, the larger the amplitude with which the first and second audio signals (V_L, V_R) output from the speakers are picked up.

Because there is a limit to how far the first and second speakers can be placed from the first to fourth microphones 21, 23, 25, and 27, it is very likely that the first and second audio signals (V_L, V_R) output from the speakers will be input to the microphones.

As a result, the first to fourth microphones 21, 23, 25, and 27 receive synthesized signals (x₀, x₁, x₂, x₃) in which the user's voice, the first and second audio signals (V_L, V_R) output from the first and second speakers, and ambient noise are mixed, and the synthesized signals may be provided to the signal processor 40.

As shown in FIG. 3, the signal processor 40 may include an audio signal modulator 110, first to fourth signal delay compensators 120, 122, 124, and 126, first to fourth noise cancellers 130, 132, 134, and 136, and a multi-noise canceller 140.

The audio signal modulator 110 generates an audio modulated signal (V_M) by modulating the first audio signal (V_L) and the second audio signal (V_R).

The audio modulated signal (V_M) may be the average of the first and second audio signals (V_L, V_R). That is, the audio modulated signal (V_M) may be the sum of the first and second audio signals (V_L, V_R) divided by 2.

That is, it can be expressed as Equation 1:

$$V_M = \frac{V_L + V_R}{2}$$
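The averaging described above can be sketched in a few lines. The function name `modulate` and the use of plain Python lists of samples are assumptions for illustration; the text specifies only that V_M is the mean of V_L and V_R.

```python
from typing import List, Sequence


def modulate(v_left: Sequence[float], v_right: Sequence[float]) -> List[float]:
    """Return the sample-wise average of two equal-length channels."""
    if len(v_left) != len(v_right):
        raise ValueError("channels must have the same length")
    return [(left + right) / 2.0 for left, right in zip(v_left, v_right)]


v_m = modulate([0.0, 1.0, -1.0], [1.0, 1.0, 0.5])
print(v_m)  # [0.5, 1.0, -0.25]
```

A single mono reference of this kind lets every microphone channel share the same reference signal in the later delay-compensation and cancellation stages.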

Because the first and second audio signals mixed into the synthesized signals (x₀, x₁, x₂, x₃) are likewise combined into a single audio signal at each microphone 21, 23, 25, and 27, the audio modulated signal (V_M) may be similar to the first and second audio signals mixed into the synthesized signals (x₀, x₁, x₂, x₃).

The audio modulated signal (V_M) may later be used in the signal processor 40 to remove the first and second audio signals mixed into the synthesized signals (x₀, x₁, x₂, x₃). That is, the first and second audio signals mixed into the synthesized signals (x₀, x₁, x₂, x₃) may be removed on the basis of the audio modulated signal (V_M).

The audio modulated signal (V_M) may be provided in common to the first to fourth signal delay compensators 120, 122, 124, and 126.

In the embodiment, the audio signal modulator 110 is described as being included in the signal processor 40, but the embodiment is not limited thereto. That is, the audio signal modulator 110 may be disposed upstream of the signal processor 40. In this case, the audio modulated signal (V_M) generated by the audio signal modulator 110 may be provided to the first to fourth signal delay compensators 120, 122, 124, and 126 of the signal processor 40.

Each of the first to fourth signal delay compensators 120, 122, 124, and 126 may perform the following three operations.

1) First operation: the cross-correlation (corr_i(τ)) between each synthesized signal (x₀, x₁, x₂, x₃) and the audio modulated signal (V_M) may be calculated.

In other words, each of the first to fourth signal delay compensators 120, 122, 124, and 126 may calculate the normalized cross-correlation (corr_i(τ)) on the basis of Equation 2.

where 0 < τ < F₀ and i = 0, ..., M-1.

Here, L is the cross-correlation length, M is the number of microphones, τ is the time delay value, n is the sample index, i is the channel index, and F₀ is the number of samples corresponding to the pitch.
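Equation 2 itself is not reproduced in this text. One standard normalized cross-correlation that is consistent with the symbols defined above is given below. It is an assumption rather than the confirmed formula, and the sign convention for τ is chosen to match the compensated signal x_i(n-τ_i) used in the description.

$$\mathrm{corr}_i(\tau) = \frac{\sum_{n=0}^{L-1} x_i(n-\tau)\,V_M(n)}{\sqrt{\sum_{n=0}^{L-1} x_i^2(n-\tau)\,\sum_{n=0}^{L-1} V_M^2(n)}}, \quad 0 < \tau < F_0,\; i = 0, \ldots, M-1$$

With this normalization the value lies between -1 and 1 and reaches 1 only when the shifted microphone signal is a positive scalar multiple of V_M over the window.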

2) Second operation: on the basis of the calculated normalized cross-correlation (corr_i(τ)), the time delay value (τ) that maximizes the cross-correlation may be calculated.

In other words, each of the first to fourth signal delay compensators 120, 122, 124, and 126 may calculate, by Equation 3, the time delay value (τ) that maximizes the cross-correlation (corr_i(τ)) of Equation 2:

$$\tau_i = \underset{0 < \tau < F_0}{\arg\max}\ \mathrm{corr}_i(\tau)$$

From Equation 3, the amount of time delay between each synthesized signal (x₀, x₁, x₂, x₃) and the audio modulated signal (V_M) can be determined.

In other words, the amount of time delay between the first and second audio signals contained in the synthesized signals (x₀, x₁, x₂, x₃) and the audio modulated signal (V_M) can be determined.

The first and second audio signals contained in the synthesized signals (x₀, x₁, x₂, x₃) may lead or lag the audio modulated signal (V_M) by the time delay value (τ_i) of Equation 3.

3) Third operation: on the basis of the time delay value (τ_i) calculated from Equation 3, the first and second audio signals contained in the synthesized signals (x₀, x₁, x₂, x₃) and the audio modulated signal (V_M) may be synchronized, that is, aligned in time.

Either the first and second audio signals contained in the synthesized signals (x₀, x₁, x₂, x₃) or the audio modulated signal (V_M) may serve as the reference for the synchronization.

For example, with the first and second audio signals of the synthesized signals (x₀, x₁, x₂, x₃) as the reference, the audio modulated signal (V_M) may be synchronized to them on the basis of the time delay value (τ_i) of Equation 3.

Alternatively, with the audio modulated signal (V_M) as the reference, the first and second audio signals of the synthesized signals (x₀, x₁, x₂, x₃) may be synchronized to the audio modulated signal (V_M) on the basis of the time delay value (τ_i) of Equation 3.

As shown in FIG. 4A, a time difference occurs between the synthesized signal (x₀) and the audio modulated signal (V_M).

As shown in FIG. 4B, the time delay between the synthesized signal (x₀) and the audio modulated signal (V_M) is compensated by the first signal delay compensator 120, so that the synthesized signal (x₀) and the audio modulated signal (V_M) are synchronized.

As shown in FIGS. 5A and 5B, the signal delay between the audio modulated signal (FIG. 5A) and the first and second audio signals of the synthesized signal (FIG. 5B) is compensated by the first to fourth signal delay compensators 120, 122, 124, and 126, so that the two are synchronized.

The three operations above may be performed individually in each of the first to fourth signal delay compensators 120, 122, 124, and 126.
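The three operations above can be sketched for one channel as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the normalization over the overlapping samples, and the symmetric search range `[-max_lag, max_lag]` are all hypothetical choices (the text restricts τ to 0 < τ < F₀, but also notes that the microphone signal may lead or lag). The audio modulated signal is taken as the synchronization reference.

```python
import math
import random
from typing import List, Sequence


def norm_xcorr(x: Sequence[float], v: Sequence[float], tau: int) -> float:
    """Operation 1: normalized correlation of x(n - tau) with v(n) over the overlap."""
    num = ex = ev = 0.0
    for n in range(len(v)):
        k = n - tau
        if 0 <= k < len(x):
            num += x[k] * v[n]
            ex += x[k] * x[k]
            ev += v[n] * v[n]
    denom = math.sqrt(ex * ev)
    return num / denom if denom > 0.0 else 0.0


def estimate_delay(x: Sequence[float], v: Sequence[float], max_lag: int) -> int:
    """Operation 2: the lag in [-max_lag, max_lag] that maximizes the correlation."""
    return max(range(-max_lag, max_lag + 1), key=lambda t: norm_xcorr(x, v, t))


def compensate(x: Sequence[float], tau: int) -> List[float]:
    """Operation 3: return x(n - tau), zero-padded where x is undefined."""
    return [x[n - tau] if 0 <= n - tau < len(x) else 0.0 for n in range(len(x))]


rng = random.Random(0)
v_m = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # stand-in reference signal
x0 = [0.0] * 3 + v_m[:-3]                          # microphone copy lagging by 3 samples
tau0 = estimate_delay(x0, v_m, max_lag=8)
aligned = compensate(x0, tau0)
print(tau0)  # -3 under this sign convention
```

Keeping `max_lag` small relative to the window length matters here: when the overlap shrinks to a handful of samples, the normalized correlation becomes unreliable and can produce spurious peaks.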

Meanwhile, each of the first to fourth signal delay compensators 120, 122, 124, and 126 may provide the synchronized, that is, delay-compensated, synthesized signal (hereinafter, the synthesized compensation signal) (x₀(n-τ₀), x₁(n-τ₁), x₂(n-τ₂), x₃(n-τ₃)) together with the audio modulated signal (V_M) to the corresponding one of the first to fourth noise cancellers 130, 132, 134, and 136.

The first synthesized compensation signal (x₀(n-τ₀)) from the first signal delay compensator 120 and the audio modulated signal (V_M) may be provided to the first noise canceller 130. The first noise canceller 130 may remove the first and second audio signals from the first synthesized compensation signal (x₀(n-τ₀)) on the basis of the audio modulated signal (V_M).

The second synthesized compensation signal (x₁(n-τ₁)) from the second signal delay compensator 122 and the audio modulated signal (V_M) may be provided to the second noise canceller 132. The second noise canceller 132 may remove the first and second audio signals from the second synthesized compensation signal (x₁(n-τ₁)) on the basis of the audio modulated signal (V_M).

The third synthesized compensation signal (x₂(n-τ₂)) from the third signal delay compensator 124 and the audio modulated signal (V_M) may be provided to the third noise canceller 134. The third noise canceller 134 may remove the first and second audio signals from the third synthesized compensation signal (x₂(n-τ₂)) on the basis of the audio modulated signal (V_M).

The fourth synthesized compensation signal (x₃(n-τ₃)) from the fourth signal delay compensator 126 and the audio modulated signal (V_M) may be provided to the fourth noise canceller 136. The fourth noise canceller 136 may remove the first and second audio signals from the fourth synthesized compensation signal (x₃(n-τ₃)) on the basis of the audio modulated signal (V_M).

Each of the first to fourth noise cancellers 130, 132, 134, and 136 may include an adaptive filter. For example, a normalized least mean square (NLMS) filter may be used as the adaptive filter.

The NLMS filter removes not only the first and second audio signals but also ambient noise from the synthesized compensation signals (x₀(n-τ₀), x₁(n-τ₁), x₂(n-τ₂), x₃(n-τ₃)), so that voice signals (S₁₀, S₁₁, S₁₂, S₁₃), which correspond to the user's voice, may be output from the first to fourth noise cancellers 130, 132, 134, and 136.
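A minimal NLMS canceller for one channel might look like the sketch below. The tap count, step size `mu`, regularization `eps`, and the function name `nlms_cancel` are illustrative assumptions; the text names only the filter type. The sketch shows removal of the reference audio from a delay-compensated microphone signal and makes no attempt to model ambient noise.

```python
import random
from typing import List, Sequence


def nlms_cancel(primary: Sequence[float], reference: Sequence[float],
                taps: int = 8, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Subtract the part of `primary` that is linearly predictable from `reference`."""
    w = [0.0] * taps
    out = []
    for n in range(len(primary)):
        # Most recent `taps` reference samples, newest first, zero-padded at the start.
        u = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))   # estimate of the audio component
        e = primary[n] - y                         # residual, ideally the voice
        norm = eps + sum(uk * uk for uk in u)      # input power for normalization
        w = [wk + mu * e * uk / norm for wk, uk in zip(w, u)]
        out.append(e)
    return out


rng = random.Random(1)
ref = [rng.uniform(-1.0, 1.0) for _ in range(600)]   # stand-in for V_M
mic = [0.8 * r for r in ref]                         # audio only, no voice present
residual = nlms_cancel(mic, ref)
print(max(abs(e) for e in residual[-50:]))           # small once the filter has converged
```

Because the filter has only a few taps, it can absorb the audio component only if the reference and the microphone signal are already aligned to within the filter length, which is the role of the preceding delay compensation.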

The first to fourth voice signals (S₁₀, S₁₁, S₁₂, S₁₃) from the first to fourth noise cancellers 130, 132, 134, and 136 may be provided to the multi-noise canceller 140.

The first to fourth noise cancellers may perform primary removal of the first and second audio signals (V_L, V_R).

The multi-noise canceller 140 receives the first to fourth voice signals (S₁₀, S₁₁, S₁₂, S₁₃) and removes the additional noise remaining in them, thereby restoring a voice signal (S) that closely resembles the user's voice.

The multi-noise canceller 140 may include an adaptive beamforming filter.

For example, a generalized sidelobe canceller (GSC) may be used as the beamforming filter.
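The structure of a generalized sidelobe canceller can be sketched as below, following the classic Griffiths-Jim arrangement of a fixed beamformer, a blocking matrix, and an adaptive canceller. This is a simplified illustration under several assumptions that are not in the text: the channels are already time-aligned toward the speaker (no steering delays), the blocking matrix uses adjacent-channel differences, the adaptive stage uses NLMS, and the name `gsc` and all parameter values are hypothetical.

```python
import math
from typing import List, Sequence


def gsc(channels: Sequence[Sequence[float]], taps: int = 4,
        mu: float = 0.1, eps: float = 1e-8) -> List[float]:
    """Griffiths-Jim style GSC for time-aligned channels of equal length."""
    m, length = len(channels), len(channels[0])
    # One adaptive FIR filter per blocking-matrix output.
    w = [[0.0] * taps for _ in range(m - 1)]
    out = []
    for n in range(length):
        # Fixed beamformer: average of all channels.
        fixed = sum(ch[n] for ch in channels) / m
        # Blocking matrix: adjacent differences cancel the aligned target signal.
        u = [[(channels[k][n - j] - channels[k + 1][n - j]) if n - j >= 0 else 0.0
              for j in range(taps)] for k in range(m - 1)]
        noise_est = sum(wk * uk for wr, ur in zip(w, u) for wk, uk in zip(wr, ur))
        e = fixed - noise_est
        norm = eps + sum(uk * uk for ur in u for uk in ur)
        w = [[wk + mu * e * uk / norm for wk, uk in zip(wr, ur)]
             for wr, ur in zip(w, u)]
        out.append(e)
    return out


s = [math.sin(0.3 * n) for n in range(50)]   # stand-in for an aligned voice signal
out = gsc([s, s, s, s])                      # identical channels: nothing to cancel
```

When every channel carries the same aligned signal, the blocking outputs are zero and the result is simply the fixed-beamformer output, so the target passes through unchanged; only components that differ between channels reach the adaptive stage.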

Only the user's voice signal (S) may be extracted by the first to fourth noise cancellers 130, 132, 134, and 136 and the multi-noise canceller 140.

Referring again to FIG. 2, the processing response unit 50 carries out what the user intends in response to the voice signal (S) from the signal processor 40.

For example, the processing response unit 50 may perform volume control, screen splitting, power on/off, channel change, and the like in response to the voice signal (S).

As described above, the time difference between the first and second audio signals (V_L, V_R) before they are output to the first and second speakers and the first and second audio signals contained in the synthesized signals (x₀, x₁, x₂, x₃), which are input to the microphones 21, 23, 25, and 27 after output from the first and second speakers, can be compensated by each of the first to fourth signal delay compensators 120, 122, 124, and 126. Accordingly, the first and second audio signals contained in the synthesized signals (x₀, x₁, x₂, x₃) can be accurately removed, so noise removal performance improves and the accuracy of voice recognition can be increased.

10: voice recognition system 20: microphone array
21, 23, 25, 27: microphone 30: audio signal generator
31: first audio signal generator 34: second audio signal generator
40: signal processor 50: processing response unit
110: audio signal modulator 120, 122, 124, 126: signal delay compensator
130, 132, 134, 136: noise canceller 140: multi-noise canceller

Claims

A voice recognition system comprising:
an audio signal generator for generating an audio signal;
a speaker for outputting the audio signal;
a microphone array including a plurality of microphones, each receiving a synthesized signal in which a user's voice signal and the audio signal are mixed;
a signal processor for compensating for a time delay between the audio signal from the audio signal generator and the synthesized signal from each of the microphones, and for extracting the voice signal; and
a processing response unit that responds to the voice signal.

The system of claim 1,
wherein the signal processor comprises:
a plurality of signal delay compensators for compensating for the time delay between the audio signal and the synthesized signal and outputting a synthesized compensation signal;
a plurality of noise cancellers for primarily removing noise from the synthesized compensation signal from each of the signal delay compensators to extract the voice signal; and
a multi-noise canceller for secondarily removing noise from the voice signal from each of the noise cancellers.

The system of claim 2,
wherein the audio signal generator generates first and second audio signals,
the system further comprising an audio signal modulator for modulating the first and second audio signals to generate an audio modulated signal.

The system of claim 3,
wherein the audio modulated signal generated by the audio signal modulator
is the average value of the first and second audio signals.

The system of claim 3,
wherein the audio modulated signal is provided in common to each of the signal delay compensators.

The system of claim 2,
wherein each signal delay compensator comprises:
means for calculating a cross-correlation between the synthesized signal and the audio signal;
means for calculating a time delay value that maximizes the cross-correlation; and
means for synchronizing the synthesized signal with the audio signal on the basis of the time delay value.

The system of claim 6,
wherein the cross-correlation is calculated from the following equation.

where 0 < τ < F₀ and i = 0, ..., M-1,
and where L is the cross-correlation length, M is the number of microphones, τ is the time delay value, n is the sample index, i is the channel index, and F₀ is the number of samples corresponding to the pitch.

The system of claim 6,
wherein the time delay value is calculated from the following equation.

The system of claim 6,
wherein the synchronization is performed with either the synthesized signal or the audio signal as the reference.

The system of claim 2,
wherein each noise canceller removes the audio signal contained in the synthesized signal on the basis of the audio signal.

A voice recognition method comprising:
outputting an audio signal through a speaker;
receiving, through each microphone, a synthesized signal in which a user's voice signal and the audio signal are mixed;
compensating for a time delay between the audio signal from the speaker and the synthesized signal from each microphone;
extracting the voice signal from the compensated synthesized signal; and
responding to the voice signal.

The method of claim 11,
wherein compensating for the time delay comprises:
calculating a cross-correlation between the synthesized signal and the audio signal;
calculating a time delay value that maximizes the cross-correlation; and
synchronizing the synthesized signal with the audio signal on the basis of the time delay value.

The method of claim 11,
wherein extracting the voice signal comprises:
extracting the voice signal by primarily removing noise from the compensated synthesized signal; and
secondarily removing noise from the extracted voice signal.

The method of claim 11,
wherein the synchronization is performed with either the synthesized signal or the audio signal as the reference.