KR19980037008A

KR19980037008A - Remote speech input apparatus using a microphone array and remote speech input processing method therefor

Info

Publication number: KR19980037008A
Application number: KR1019960055690A
Authority: KR
Inventors: 서영주; 이영직
Original assignee: 양승택; 한국전자통신연구원
Priority date: 1996-11-20
Filing date: 1996-11-20
Publication date: 1998-08-05
Also published as: KR100198019B1

Abstract

The present invention relates to a remote speech input apparatus using a microphone array, and a remote speech input processing method therefor, in the application field of computer-based speech recognition systems. In the prior art, when speech is input through a single microphone, the user must always pay close attention to the position of the microphone, which causes considerable inconvenience. To solve this problem, the invention comprises: a microphone array unit having a multi-channel microphone array, which simultaneously receives speech signals with different time delay values from the multi-channel microphones and converts them into digital signals; an automatic speech detection unit that detects the speech signal from the multi-channel signals input from the microphone array; a time delay estimation unit that estimates the time delay information between the detected speech signals of the channels; and a time delay and signal addition unit that, using the estimated time delay information, cancels the time delays between the speech signals of the channels and then adds the speech signals of all channels to generate speech with an improved signal-to-noise ratio.

The apparatus was developed so that, even when the speaker is about 40 to 80 cm from the microphones and speaks without paying attention to their position, speech with a signal-to-noise ratio at or above the acceptable level (25 dB) is received automatically. When this speech is used as the input to a speech recognition system, good recognition performance can be obtained.

Description

Remote speech input apparatus using a microphone array and remote speech input processing method therefor

FIG. 1 is a block diagram of the remote speech input apparatus of the present invention.

FIG. 2 is a flowchart of the overall remote speech input processing of the present invention.

FIG. 3 is a detailed flowchart of the time delay estimation of FIG. 2.

FIG. 4 is a detailed flowchart of the time delay and signal addition of FIG. 2.

* Description of reference numerals for the main parts of the drawings *

10: microphone array unit

11: microphone array

12: A/D signal converter

20: automatic speech detection unit

30: time delay estimation unit

40: time delay and signal addition unit

Most speech recognition systems developed and published to date take their speech input through a single microphone.

However, when speech is input through a single microphone, close attention must always be paid to the position of the microphone during input. In particular, a speech recognition system that must maintain a high signal-to-noise ratio requires the user to speak with the microphone close by. This makes such systems inconvenient in many ways for practical use by ordinary users.

The remote speech input apparatus of the present invention was devised to solve this problem.

That is, it was developed so that speech with a signal-to-noise ratio at or above the acceptable level (25 dB) is received automatically even when the speaker is about 40 to 80 cm from the microphones and speaks without paying attention to their position. To achieve this, a noise reduction technique using a microphone array was applied.

[Object of the invention]

The object of the present invention is to give a speech recognition system high recognition performance by keeping the signal-to-noise ratio of the input speech signal high even when the user pays relatively little attention to the position of the microphone.

[Technical field of the invention and prior art in that field]

The present invention relates to computer-based speech recognition and related fields. In particular, it relates to a remote speech input apparatus using a microphone array, and a remote speech input processing method therefor, in which a microphone array made up of several microphones is placed in front of the speaker, the speaker's speech is received simultaneously through the several microphones, and noise-reduced speech is reproduced from these inputs using multi-channel beamforming, so that a speech recognition system can receive clean, noise-reduced speech even when the speaker is relatively far from the microphones.

Prior art in this field includes single-channel speech input and multi-channel microphone speech input, which are described below.

Most speech recognition systems developed and published to date take their speech input through a single microphone. However, when speech is input through a single microphone, the user must always pay close attention to the position of the microphone when speaking. In particular, a speech recognition system that must maintain a high signal-to-noise ratio requires the user to speak with the microphone close by. This makes such systems inconvenient in many ways for practical use by ordinary users.

In addition, when conventional multi-channel beamforming is used to reproduce a noise-reduced speech signal from speech signals input on several channels, it has been difficult to detect accurately the time delays that arise between the speech signals of the channels.

[Technical problem to be solved by the invention]

Accordingly, to solve the above problems, the present invention aims to input cleaner speech into a speech recognition system by maintaining a high signal-to-noise ratio for speech input simultaneously through a microphone array, conveniently and even from a distance. To improve the accuracy of time delay detection, it provides a method that applies center clipping to the speech signals and then uses the cross-correlation between the speech signals of adjacent channels. Furthermore, a noise level normalization step is added before the final step, in which the speech signals of all channels are added, so that the best signal-to-noise ratio achievable in the given situation is obtained.

[Structure and operation of the invention]

The present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of the remote speech input apparatus of the present invention, which comprises a microphone array unit (10) made up of multi-channel microphones, an automatic speech detection unit (20), a time delay estimation unit (30), and a time delay and signal addition unit (40).

As shown in FIG. 1, the microphone array unit (10) consists of a microphone array (11) made up of several (multi-channel) microphones, and an analog-to-digital converter (12) that receives the speech signals, which have different time delay values, from the microphones at their various positions and digitizes them.

The automatic speech detection unit (20) detects, from the input signal, the speech portion needed to estimate the time delay information between the speech signals input from the microphones of the channels.

The time delay estimation unit (30) estimates the time delay information between the per-channel speech signals detected by the automatic speech detection unit (20).

The time delay and signal addition unit (40) cancels the time delays present between the speech signals of the channels and then adds the speech signals of all channels to generate speech with an improved signal-to-noise ratio.

FIG. 2 is a flowchart of the overall remote speech input processing of the present invention.

In this process, the speech signals input from the microphone array of multi-channel microphones are converted into digital signals (S10); only the speech portion of each converted channel is detected (S20); the time delays between the detected per-channel speech signals are estimated (S30); and, based on this estimate, the speech signals, whose time delays differ by channel, are processed so that their time delays are equal, after which the speech signals of all channels are added and the resulting speech is output (S40).

The time delay estimation process (S30) is described in more detail below with reference to FIG. 3.

Time delay information is detected by simultaneously extracting a speech segment of fixed length from every channel and then estimating the time delay between the speech signal of the reference channel and that of each other channel by computing their cross-correlation function.

In the time delay estimation process, step 31 (S31) is the speech frame extraction step: after the automatic speech detection unit has detected the speech portion, a speech frame of N speech samples is extracted simultaneously from each channel.

Step 32 (S32) is the maximum and minimum detection step: for the speech frame of N samples extracted from each channel, the maximum and minimum sample values are found per channel.

Step 33 (S33) is the center clipping step: to improve the accuracy of the time delay estimate, a center clipping function is applied to the speech frame of each channel. Center clipping acts on the speech samples whose amplitude lies between 0 and about 50 to 80% of the maximum and minimum values of the speech frame; as Equation 1 shows, these samples are set to zero. Expressed as a formula:

[Equation 1]

\[
y(n) =
\begin{cases}
0, & C_{\min} \le x(n) \le C_{\max} \\
x(n), & \text{otherwise}
\end{cases}
\qquad n = 0, \ldots, N-1
\]

Here, x(n) is the speech sample value, y(n) is the output of the center clipping function, and C_min and C_max are the clipping levels, obtained by multiplying the minimum and maximum values of the speech samples in the frame, respectively, by a fixed ratio (50 to 80%).
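The center clipping of Equation 1 can be illustrated with a short Python sketch. The function name `center_clip`, the `ratio` parameter, and the sample values are illustrative assumptions rather than part of the disclosure; the ratio of 0.6 is simply one value inside the stated 50 to 80% range.

```python
from typing import List, Sequence


def center_clip(frame: Sequence[float], ratio: float = 0.6) -> List[float]:
    """Zero every sample that lies between the two clipping levels (Equation 1)."""
    if not frame:
        return []
    c_max = ratio * max(frame)  # upper clipping level C_max
    c_min = ratio * min(frame)  # lower clipping level C_min
    # Samples inside [C_min, C_max] become 0; all others keep their value.
    return [0.0 if c_min <= x <= c_max else float(x) for x in frame]


# Hypothetical frame of six samples: C_max = 0.6, C_min = -0.48.
frame = [0.1, -0.2, 1.0, -0.8, 0.5, -0.3]
clipped = center_clip(frame, ratio=0.6)
print(clipped)  # [0.0, 0.0, 1.0, -0.8, 0.0, 0.0]
```

Only the two large-amplitude samples survive, which is the intended effect: low-amplitude samples, where noise dominates, no longer contribute to the later cross-correlation.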

Step 34 (S34) is the cross-correlation function calculation step: for the center-clipped speech frames of the channels, a reference channel is first chosen, and then the cross-correlation function between the speech frame of the reference channel and the speech frame of each other channel is computed. The cross-correlation function between the speech of the reference channel and the speech of another channel is given by:

[Equation 2]

Here, x_0 denotes the speech samples of the reference channel and x_k denotes the speech samples of another channel. M denotes the number of channels, and N denotes the size of the speech frame.
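Equation 2 itself is not reproduced in this text. A common form of the cross-correlation between a reference frame and another channel's frame at lag τ is the sum over n of x_0(n)·x_k(n + τ); the sketch below assumes that form, which may differ from the patent's exact expression in sign convention or normalization. The function name `cross_correlation` and the sample frames are hypothetical. In the described method the inputs would be the center-clipped frames.

```python
from typing import Sequence


def cross_correlation(ref: Sequence[float], other: Sequence[float], lag: int) -> float:
    """Sum of ref[n] * other[n + lag] over the samples where both exist.

    Assumed form only: the patent's Equation 2 is not available here.
    """
    total = 0.0
    for n in range(len(ref)):
        m = n + lag
        if 0 <= m < len(other):  # skip products that fall outside the frame
            total += ref[n] * other[m]
    return total


ref = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0]  # ref shifted later by two samples
for lag in range(0, 4):
    print(lag, cross_correlation(ref, delayed, lag))
# The largest value (6.0) occurs at lag 2, the shift between the two frames.
```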

Step 35 (S35) is the time delay estimation step: the maximum value of each computed cross-correlation function is found, and the time τ at which it occurs is taken as the time delay between the reference channel and the other channel. This can be expressed as follows.

[Equation 3]

Here, k denotes a particular channel other than the reference channel, and the function being maximized is the cross-correlation function between the speech signal of the reference channel and the speech signal of channel k.
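The selection of the lag that maximizes the cross-correlation can be sketched as follows. This is a minimal illustration, not the patent's Equation 3 (which is not reproduced in this text): the function name `estimate_delay`, the search bound `max_lag`, and the product-sum form of the correlation are assumptions. With the sign convention used here, a positive result means the other channel lags the reference channel.

```python
from typing import Sequence


def estimate_delay(ref: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Return the lag in [-max_lag, max_lag] with the largest cross-correlation."""

    def correlation(lag: int) -> float:
        # Assumed form: sum of ref[n] * other[n + lag] over valid indices.
        return sum(
            ref[n] * other[n + lag]
            for n in range(len(ref))
            if 0 <= n + lag < len(other)
        )

    # max() returns the first lag that attains the largest correlation value.
    return max(range(-max_lag, max_lag + 1), key=correlation)


ref = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0]  # ref shifted later by two samples
print(estimate_delay(ref, delayed, max_lag=3))  # 2
```

Bounding the search with `max_lag` is a design choice of this sketch; for a fixed array geometry the physically possible delays are limited, so a small search range is usually enough.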

The time delay and signal addition process (S40) is described below with reference to FIG. 4.

To cancel the time delay between the reference channel and each other channel, each channel is delayed by the negative of its time delay value so that the speech signals of all channels are synchronized in time. The signals are then scaled so that the noise level is the same in every channel, and added to generate the final enhanced speech signal.

Step 41 (S41) cancels the time delay present in each channel: each speech signal is delayed by the negative of its time delay value, so that no time delay remains between the speech signals of the channels.

Step 42 (S42) normalizes the differences in noise level between channels: the speech signal of each channel is multiplied by a noise level normalization coefficient. The per-channel noise level normalization coefficient W_k is obtained as follows.

[Equation 4]

Here, k denotes a particular channel of the microphone array, and the two noise terms are the noise signals of channel k and of the reference channel, respectively, in a non-speech interval. N is the size of the non-speech noise frame.
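Equation 4 is not reproduced in this text, so the following is only one plausible reading of it. The sketch assumes that W_k is the square root of the ratio of the reference channel's noise energy to channel k's noise energy, both measured over a non-speech frame of the same size N, so that multiplying channel k by W_k brings its noise level to that of the reference channel. The function name `noise_weight` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def noise_weight(noise_ref: Sequence[float], noise_k: Sequence[float]) -> float:
    """Amplitude scale that matches channel k's noise level to the reference.

    Assumed definition: sqrt(sum(noise_ref^2) / sum(noise_k^2)).
    """
    energy_ref = sum(v * v for v in noise_ref)
    energy_k = sum(v * v for v in noise_k)
    if energy_k == 0.0:
        raise ValueError("channel k has no noise energy in this frame")
    return math.sqrt(energy_ref / energy_k)


# Channel k carries twice the noise amplitude of the reference channel.
noise_ref = [1.0, -1.0, 1.0, -1.0]
noise_k = [2.0, -2.0, 2.0, -2.0]
w_k = noise_weight(noise_ref, noise_k)
print(w_k)  # 0.5
```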

Step 43 (S43) adds the speech signals, whose time delays have been canceled and whose noise levels have been normalized, over all channels and divides the sum by the number of channels. Steps 41 to 43 can be expressed by the following equation.

[Equation 5]

Here, M denotes the number of microphone array channels, and τ_k denotes the time delay between the reference channel and channel k.
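Steps 41 to 43 amount to a weighted delay-and-sum. Since Equation 5 is not reproduced in this text, the sketch below assumes the form y(n) = (1/M) · Σ_k W_k · x_k(n + τ_k), where a positive τ_k means channel k lags the reference channel; the patent's own sign convention may differ. The function name `delay_and_sum` and the sample data are hypothetical, and samples shifted outside a frame are simply treated as zero.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]],
    delays: Sequence[int],
    weights: Sequence[float],
) -> List[float]:
    """Align each channel by its delay, scale it, add, and divide by M."""
    num_channels = len(channels)
    length = len(channels[0])
    output = []
    for n in range(length):
        acc = 0.0
        for x, tau, w in zip(channels, delays, weights):
            m = n + tau  # undo the delay of this channel (step 41)
            if 0 <= m < len(x):
                acc += w * x[m]  # noise-level scaling (step 42)
        output.append(acc / num_channels)  # addition and averaging (step 43)
    return output


ref = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
lagging = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0]  # same pulse, two samples later
print(delay_and_sum([ref, lagging], delays=[0, 2], weights=[1.0, 1.0]))
# [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
```

After alignment the two pulses coincide, so the average reproduces the reference waveform at full amplitude.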

The operation of the present invention is described below.

To allow speech input regardless of microphone position, the present invention uses several microphones instead of one. Distributing several microphones in front of the speaker widens the range over which speech can be input. To generate a clean, noise-reduced speech signal from the multi-channel speech signals input from the microphone array (11), the speech signals, whose time delays differ from channel to channel, are processed so that their time delays are equal, and then the speech signals of all channels are added. When the time delays are equalized across all channels before the addition, the speech signals of the channels, which have the same waveform shape, superimpose on one another, so the amplitude of the summed waveform is proportional to the number of channels. The power of the speech signal, defined as the sum of the squares of the speech waveform, therefore increases in proportion to the square of the number of channels.

In contrast, when the noise components present alongside the speech in each channel are statistically independent in time, the inherent randomness of noise means that they do not superimpose completely when the channels are added. The noise power therefore increases only in proportion to the number of channels. Consequently, the signal-to-noise ratio of the summed signal (speech power / noise power) improves by a factor equal to the number of channels. Thus, if the noise is statistically independent across channels, the final processed speech signal obtains a signal-to-noise ratio gain equal to the number of microphones.
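The argument above can be written out as a short derivation. The symbols s(n) for the common speech waveform, v_k(n) for the noise in channel k, P_s for the speech power, and σ² for the per-channel noise power are notation introduced here for illustration. The derivation assumes the channels are already time-aligned, the noise is zero-mean and uncorrelated between channels, and every channel has the same noise power.

```latex
\[
\begin{aligned}
x_k(n) &= s(n) + v_k(n), \qquad k = 0, \ldots, M-1 \\
\sum_{k=0}^{M-1} x_k(n) &= M\,s(n) + \sum_{k=0}^{M-1} v_k(n) \\
P_{\text{speech}} &= M^2 P_s, \qquad P_{\text{noise}} = M\,\sigma^2 \\
\frac{P_{\text{speech}}}{P_{\text{noise}}} &= M \cdot \frac{P_s}{\sigma^2}
\end{aligned}
\]
```

In decibels the gain is 10 log10(M), so under these assumptions an array of, say, four microphones would improve the signal-to-noise ratio by about 6 dB. Dividing the sum by M, as in step 43, scales speech and noise equally and leaves this ratio unchanged.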

The present invention improves the signal-to-noise ratio of the speech signal in this way.

For this method to work, the differences in time delay between the speech signals of the channels must be detected accurately.

To this end, the automatic speech detection unit (20) first detects the speech portion of the signal input on each channel. The time delay estimation unit (30) then estimates the inter-channel time delays using the per-channel speech frames detected by the automatic speech detection unit (20).

The maximum and minimum detection applied in the time delay estimation unit (30) is used to set the center clipping levels needed in the next step, center clipping. In center clipping, the part of the speech waveform that lies inside the center clipping levels is set to 0, and only the part outside them keeps its original waveform value. Removing the low-amplitude part of the waveform, which is the part of the speech signal most sensitive to noise, yields a much more accurate result in the next step, where the inter-channel time delay is estimated from the cross-correlation.

In the cross-correlation calculation, the cross-correlation function is computed for the speech frames of two different channels. The function is evaluated while the two frames are shifted back and forth relative to each other on the time axis, and it reaches its maximum at the position where the two frames coincide (that is, when the frame of the channel with no time delay has been shifted by the time delay of the other channel's speech signal). In the following step, time delay estimation, the position of the maximum of the computed cross-correlation function is located and taken as the estimated time delay.

After time delay estimation, the time delay and signal addition unit (40) applies an inverse delay to each channel to cancel its time delay, normalizes the speech signal of each channel, and then adds the speech signals of all channels as fully superimposed waveforms. The normalization step scales the signal level of each channel so that the noise level is the same in all channels; the speech signal generated in this way obtains the highest signal-to-noise ratio gain.

[Effects of the invention]

As described above, when a speech recognition system is used, the present invention applies multi-channel digital signal processing to feed speech with a high signal-to-noise ratio into the system even when the user speaks without regard to the position of the microphone. Compared with conventional speech input devices that use a single microphone, it allows high-quality speech with little noise to be input more conveniently, and thus has the advantage of improving the performance of the speech recognition system.

Claims

1. A remote speech input apparatus using a microphone array, comprising:

a microphone array unit (10) having a multi-channel microphone array, which simultaneously receives speech signals having different time delay values from the multi-channel microphones and converts them into digital signals;

an automatic speech detection unit (20) for detecting a speech signal from the multi-channel signals input from the microphone array unit (10);

a time delay estimation unit (30) for estimating time delay information between the detected speech signals of the channels; and

a time delay and signal addition unit (40) for canceling, using the estimated time delay information, the time delays present between the speech signals of the channels and then adding the speech signals of all channels, so as to generate speech with an improved signal-to-noise ratio.

2. The apparatus of claim 1, wherein the number of channels of the microphone array is configured to be proportional to the ratio of the power component of the speech signal to the power component of the noise.

3. The apparatus of claim 1, wherein the time delay estimation unit (30) estimates the inter-channel time delay using the per-channel speech frames.

4. A remote speech input processing method using a microphone array, comprising:

a first step of converting speech signals input from a microphone array of multi-channel microphones into digital signals;

a second step of detecting only the speech signal portion of each converted channel;

a third step of estimating the time delays between the detected per-channel speech signals; and

a fourth step of processing, based on the estimate, the speech signals having different time delays per channel so that their time delays are equal, and then adding the speech signals of all channels to output speech.

5. The method of claim 4, wherein the third step comprises:

a first step of simultaneously extracting, from each channel, a speech frame composed of N speech samples;

a second step of obtaining, for each channel, the maximum and minimum values of the speech samples in the extracted speech frame;

a third step of applying center clipping to the speech frame of each channel for accurate estimation of the time delay information;

a fourth step of determining a reference channel for the center-clipped per-channel speech frames and then obtaining the cross-correlation function between the speech frame of the reference channel and the speech frame of each other channel; and

a fifth step of finding the maximum value of each obtained cross-correlation function and taking the time at which it occurs as the time delay value between the reference channel and the other channel.

6. The method of claim 5, wherein the center clipping in the third step is applied to speech samples whose amplitude lies between zero and a predetermined ratio of the maximum and minimum values of the speech frame.

7. The method of claim 4, wherein the fourth step comprises:

a first step of canceling the time delay present in each channel;

a second step of normalizing the noise level differences between channels after the cancellation; and

a third step of adding, over all channels, the speech signals whose time delays have been canceled and whose noise levels have been normalized, and dividing the sum by the number of channels.

8. The method of claim 7, wherein the first step delays each speech signal by the negative of its time delay value so that no time delay remains between the speech signals of the channels.

9. The method of claim 7, wherein the second step multiplies the speech signal of each channel by a noise level normalization coefficient.