KR100587568B1

KR100587568B1 - Speech enhancement system and method

Info

Publication number: KR100587568B1
Application number: KR1020030016896A
Authority: KR
Inventors: 오광철
Original assignee: 삼성전자주식회사
Priority date: 2003-03-18
Filing date: 2003-03-18
Publication date: 2006-06-08
Also published as: KR20040082207A

Abstract

The present invention relates to a speech enhancement system and method that can strengthen the speech signal component while minimizing the generation of musical noise during noise suppression. The input speech signal is divided into frames. If a frame is a noise frame containing only noise components, noise suppression is performed separately on the real part and the imaginary part of that noise frame. Voice reinforcement is then performed separately on the real part and the imaginary part of the frame.

Noise, suppression, voice, enhancement, musical noise

Description

Speech Enhancement System and Method {SPEECH ENHANCEMENT SYSTEM AND METHOD}

FIG. 1 is a diagram illustrating a conventional speech enhancement method.

FIG. 2 is a block diagram of a speech enhancement system according to the present invention.

FIG. 3 is a detailed block diagram of the noise average calculator shown in FIG. 2.

FIG. 4 is a flowchart of a speech enhancement method according to the present invention.

FIG. 5 shows an example in which the speech component is enhanced by the present invention.

* Explanation of reference numerals for the main parts of the drawings *

10 ... preprocessor    20 ... fast Fourier transform unit

30 ... noise suppression unit    40 ... voice detector

50 ... noise average calculator    51 ... first noise average calculator

52 ... second noise average calculator    53 ... Hilbert transform unit

54 ... envelope magnitude calculator    55 ... mean calculator

60 ... noise weight calculator    70 ... multiplier

80 ... voice reinforcement unit    90 ... voice weight calculator

100 ... multiplier    110 ... inverse fast Fourier transform unit

120 ... overlap unit

The present invention relates to a speech enhancement system and method, and more particularly to a speech enhancement system and method capable of strengthening the speech signal component while minimizing the generation of musical noise during noise suppression.

A common problem in speech signal processing is suppressing the background noise component while enhancing the speech signal component. A representative speech enhancement method is spectral subtraction: as shown in FIG. 1, the noisy input signal is Fourier transformed, and a subtraction filter (S) then subtracts the noise spectrum from the Fourier-transformed signal spectrum to suppress the background noise.

However, because the actual noise spectrum cannot be estimated accurately, subtracting the estimated noise spectrum from the original speech signal spectrum can leave the speech spectrum with values of 0 or less. Spectral subtraction handles this by forcibly assigning a small value to every such region. This produces musical noise, a characteristic frequency noise that can make the noise-suppressed speech sound extremely unnatural.
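The flooring step described above can be sketched as follows. This is a minimal illustration of conventional magnitude spectral subtraction, not code from the patent; the function name `spectral_subtract` and the `floor` fraction are assumptions made for the example.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Magnitude spectral subtraction with a spectral floor.

    noisy_mag: magnitude spectrum of the noisy frame (one value per bin).
    noise_mag: estimated noise magnitude spectrum (same length).
    floor:     hypothetical fraction of the noisy magnitude that is used
               when the difference would be zero or negative.
    """
    out = []
    for y, n in zip(noisy_mag, noise_mag):
        diff = y - n
        # Bins where the noise estimate overshoots are forced to a small
        # positive value. Isolated peaks left by this step are what is
        # heard as musical noise.
        out.append(diff if diff > 0 else floor * y)
    return out


noisy = [1.0, 0.2, 0.8, 0.1]
noise = [0.3, 0.3, 0.3, 0.3]   # made-up noise estimate
cleaned = spectral_subtract(noisy, noise)
```

In this example the second and fourth bins fall below the noise estimate, so they are replaced by floored values rather than by a true difference.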

To address these problems, US Patent No. 5,742,927 (published August 8, 1998) discloses a method that suppresses noise by spectral subtraction and then emphasizes only the formants of the speech using an LPC (Linear Prediction Coefficient) spectrum estimator. That method, however, is likewise unable to prevent musical noise.

The present invention was devised to solve the above problems. Its object is to strengthen the speech signal component while suppressing, as far as possible, the musical noise that is regarded as a major problem in noise suppression.

To achieve this object, a speech enhancement system according to the present invention comprises: a preprocessor that divides an input speech signal into frames; a noise suppression unit that, when a frame input from the preprocessor is a noise frame containing only noise components, performs noise suppression on the real part and the imaginary part of the noise frame; and a voice reinforcement unit that performs voice reinforcement on the real part and the imaginary part of the frame input from the noise suppression unit.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram of a speech enhancement system according to the present invention. As shown in FIG. 2, the speech enhancement system 1 includes a preprocessor 10, a fast Fourier transform unit 20, a noise suppression unit 30, a voice reinforcement unit 80, an inverse fast Fourier transform unit 110, and an overlap unit 120.

The preprocessor 10 samples the noisy input signal at a predetermined frequency to convert it into a digital signal, pre-emphasizes the digital signal with a simple high-pass filter to slightly boost its high frequencies, and then divides the filtered signal into frames, the basic unit of speech processing.
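The pre-emphasis and framing performed by the preprocessor can be sketched as below. The patent gives no filter coefficient, frame length, or frame shift, so `alpha=0.97`, `frame_len`, and `hop` are assumed values chosen only for illustration, as are the function names.

```python
def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def split_frames(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing by hop samples.

    A trailing remainder shorter than frame_len is dropped.
    """
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


signal = [float(n % 8) for n in range(32)]   # stand-in for sampled input
frames = split_frames(pre_emphasize(signal), frame_len=16, hop=8)
```

Choosing `hop` smaller than `frame_len` makes consecutive frames overlap, which is what later allows the overlap unit to join them smoothly.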

The fast Fourier transform (FFT) unit 20 applies a window to each frame input from the preprocessor 10 and then performs an N-point fast Fourier transform. The N-point FFT is given by Equation 1 below.
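The equation itself is not reproduced in this text. For reference, the standard N-point discrete Fourier transform that an FFT computes is shown below; the symbols $x(n)$ for the windowed frame and $X(k)$ for its spectrum are assumed notation and may differ from the patent's original Equation 1.

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1
```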

The fast-Fourier-transformed signal is complex valued, so the spectrum of a transformed frame is expressed in terms of a real part and an imaginary part, as in Equation 2 below.
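Equation 2 is likewise not reproduced in this text. Under the standard DFT definition, the decomposition into real and imaginary parts takes the following form; the symbols $X_R(k)$ and $X_I(k)$ are assumed notation and may differ from the patent's.

```latex
X(k) = X_R(k) + j\, X_I(k), \qquad
X_R(k) = \sum_{n=0}^{N-1} x(n) \cos\!\left(\frac{2\pi k n}{N}\right), \qquad
X_I(k) = -\sum_{n=0}^{N-1} x(n) \sin\!\left(\frac{2\pi k n}{N}\right)
```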

As noted above, spectral subtraction can generate musical noise during noise suppression. To prevent this, the present invention uses the noise suppression unit 30 to apply noise suppression suited to the characteristics of the noise component contained in a noise frame. The noise suppression unit 30 is described in more detail below.

Referring to FIG. 2, the noise suppression unit 30 includes: a voice detector 40 that determines whether speech is present in a frame input from the fast Fourier transform unit 20; a noise average calculator 50 that, when the frame input from the fast Fourier transform unit 20 is determined to be a noise frame containing only noise components, obtains average spectra for the real part and the imaginary part of the noise frame; a noise weight calculator 60 that calculates weights for the real part and the imaginary part of the noise frame according to the average spectrum values calculated by the noise average calculator 50; and a multiplier 70 that multiplies the real part and the imaginary part of the frame input from the fast Fourier transform unit 20 by the respective weights calculated by the noise weight calculator 60.

The voice detector 40 determines whether speech is present in the frame input from the fast Fourier transform unit 20. If the frame is determined to be a noise frame containing only noise components, the voice detector 40 outputs it to the noise average calculator 50. If speech is present in the frame, the frame is not output to the noise average calculator 50.

The voice detector 40 may be implemented as a voice activity detector (VAD) that calculates the energy of the input frame, finds the most suitable threshold, and uses it to decide whether speech is present. Other methods of determining the presence of speech may also be used.
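An energy-based decision of this kind can be sketched as follows. The patent does not say how the threshold is chosen, so the rule used here (a multiple `ratio` of a reference `noise_energy`) is an assumption, and the function names are hypothetical.

```python
def frame_energy(spectrum):
    """Sum of squared magnitudes over all bins of one frame."""
    return sum(abs(c) ** 2 for c in spectrum)


def is_noise_frame(spectrum, noise_energy, ratio=2.0):
    """Return True when the frame energy stays below ratio * noise_energy.

    noise_energy and ratio are hypothetical tuning inputs; the patent
    only states that a suitable threshold is found.
    """
    return frame_energy(spectrum) < ratio * noise_energy


quiet = [0.1 + 0.1j, -0.1 + 0.0j, 0.05 - 0.1j, 0.0 + 0.1j]
loud = [2.0 + 1.0j, -1.5 + 0.5j, 1.0 - 2.0j, 0.5 + 0.5j]
noise_energy = frame_energy(quiet)   # reference taken from a quiet frame
```

With these made-up frames, `quiet` is classified as a noise frame and `loud` is not, so only `quiet` would be passed on to the noise average calculator.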

The noise average calculator 50 obtains average spectra for the real part and the imaginary part of the noise frame input from the voice detector 40, each separately. The noise average calculator 50 is described in more detail below with reference to FIG. 3.

FIG. 3 is a detailed block diagram of the noise average calculator 50 shown in FIG. 2. As shown in FIG. 3, the noise average calculator 50 consists of a first noise average calculator 51, which calculates the average of the real part of the noise frame input from the voice detector 40, and a second noise average calculator 52, which calculates the average of the imaginary part of that noise frame. The first noise average calculator 51 and the second noise average calculator 52 each include a Hilbert transform unit 53, an envelope magnitude calculator 54, and a mean calculator 55.

The Hilbert transform unit 53 produces an analytic complex signal for each of the real part and the imaginary part of the noise frame input from the voice detector 40 by applying the Hilbert transform to each part. The Hilbert transform here yields an analytic complex signal for an input signal: the real part of the resulting signal equals the input signal, and its imaginary part is the input signal shifted in phase by 90 degrees. Because the Hilbert transform is a standard way to detect the envelope of a signal, a detailed description is omitted.

The envelope magnitude calculator 54 takes the absolute value of each complex signal obtained by the Hilbert transform unit 53 to calculate the envelope magnitude of the real part and of the imaginary part of the noise frame. The mean calculator 55 averages each envelope magnitude calculated by the envelope magnitude calculator 54 and outputs the result as the average spectrum value for the real part and the imaginary part of the noise frame. To make noise suppression more effective, the average spectrum value is preferably an average of the envelope magnitudes over several noise frames (about 100 msec).
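The Hilbert transform, envelope, and mean chain can be sketched as below. This is a pure-Python illustration under stated assumptions: the analytic signal is built with the usual DFT-based construction, frames have even length, and the mean is taken bin by bin across several noise frames. The patent does not specify whether the average is per bin or a single value per frame, and all function names are hypothetical. In practice a numerical library would replace the naive `dft` and `idft`.

```python
import cmath
import math


def dft(x):
    """Naive N-point discrete Fourier transform."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts)) for k in range(n_pts)]


def idft(spec):
    """Naive N-point inverse discrete Fourier transform."""
    n_pts = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * n / n_pts)
                for k in range(n_pts)) / n_pts for n in range(n_pts)]


def envelope(seq):
    """Magnitude of the analytic signal of a real sequence of even length."""
    half = len(seq) // 2
    # Keep DC and Nyquist, double positive frequencies, zero negative ones.
    gains = [1.0] + [2.0] * (half - 1) + [1.0] + [0.0] * (half - 1)
    analytic = idft([s * g for s, g in zip(dft(seq), gains)])
    return [abs(a) for a in analytic]


def mean_envelopes(noise_frames):
    """Average envelope of the real and imaginary parts over noise frames."""
    env_re = [envelope([c.real for c in f]) for f in noise_frames]
    env_im = [envelope([c.imag for c in f]) for f in noise_frames]
    count = len(noise_frames)
    mean_re = [sum(col) / count for col in zip(*env_re)]
    mean_im = [sum(col) / count for col in zip(*env_im)]
    return mean_re, mean_im


# A pure cosine with a whole number of cycles has a flat envelope.
tone = [math.cos(2 * math.pi * 2 * n / 16) for n in range(16)]
env = envelope(tone)
mean_re, mean_im = mean_envelopes([[complex(t, 0.0) for t in tone]] * 2)
```

The real and imaginary parts are processed by the same routine but kept separate throughout, mirroring the first and second noise average calculators.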

In this way, the first noise average calculator 51 and the second noise average calculator 52 use the Hilbert transform unit 53, the envelope magnitude calculator 54, and the mean calculator 55 to calculate the average spectra of the real part and the imaginary part, respectively, of the noise frame input from the voice detector 40. The resulting average spectrum values are passed to the noise weight calculator 60.

Referring again to FIG. 2, the noise weight calculator 60 applies an inverse transform to the average spectrum values calculated by the noise average calculator 50 to obtain the weights to be applied to the real part and the imaginary part of the noise frame. The weights serve to suppress the noise component contained in the noise frame. When the average spectrum value input from the noise average calculator 50 is large, that is, when the noise frame contains a large noise component, the weight is made small so that the noise component is reduced substantially. When the average spectrum value is small, that is, when the noise frame contains little noise, the weight is made large so that the noise component is suppressed by a relatively small amount.

The multiplier 70 multiplies the real part and the imaginary part of the frame input from the fast Fourier transform unit 20 by the weights for the real part and the imaginary part of the noise frame, respectively, calculated by the noise weight calculator 60.
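The weighting and multiplication can be sketched as follows. The patent gives no formula for the "inverse transform", only that a larger average must yield a smaller weight. The mapping `1 / (1 + m)` used here is therefore an assumption chosen because it decreases monotonically and stays in the range (0, 1]; the function names and input values are also hypothetical.

```python
def noise_weights(mean_spectrum):
    """Map each average spectrum value m to the weight 1 / (1 + m).

    Assumed mapping: a large average gives a small weight, and a small
    average gives a weight close to 1.
    """
    return [1.0 / (1.0 + m) for m in mean_spectrum]


def apply_weights(frame, w_real, w_imag):
    """Scale the real and imaginary parts of each bin by separate weights."""
    return [complex(c.real * wr, c.imag * wi)
            for c, wr, wi in zip(frame, w_real, w_imag)]


frame = [2.0 + 4.0j, -1.0 + 0.5j]
w_re = noise_weights([1.0, 3.0])   # real-part averages (made-up values)
w_im = noise_weights([0.0, 1.0])   # imaginary-part averages (made-up values)
suppressed = apply_weights(frame, w_re, w_im)
```

Because every weight is strictly positive, scaling can shrink a component but never drives it to zero or below, which is the property the description relies on to avoid the flooring step of spectral subtraction.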

As described above, the noise suppression unit 30 suppresses the real part and the imaginary part of a noise frame containing only noise components at different ratios, so that the spectrum of the speech signal does not take values of 0 or less during noise suppression. The noise component in the noise frame is thereby suppressed effectively while the generation of musical noise is kept to a minimum.

The frame input from the noise suppression unit 30 to the voice reinforcement unit 80 then undergoes voice reinforcement. The voice reinforcement unit 80 is described in more detail below.

The voice reinforcement unit 80 includes a voice weight calculator 90 that calculates weights for speech in the frame input from the noise suppression unit 30, and a multiplier 100 that multiplies the frame input from the noise suppression unit 30 by the weights calculated by the voice weight calculator 90.

The voice weight calculator 90 calculates weights for strengthening the speech component in the frame input from the noise suppression unit 30. It calculates the standard deviation of the real part and of the imaginary part of the input frame and outputs these as the weights for the real part and the imaginary part of the frame. The standard deviation is used as the weight because the standard deviation of a speech component is larger than that of a noise component. Setting the weight to the standard deviation therefore strengthens the speech component in relative terms and suppresses the noise component in the high-frequency band.

The multiplier 100 multiplies the real part and the imaginary part of the frame input from the noise suppression unit 30 by the weights for the real part and the imaginary part, respectively, calculated by the voice weight calculator 90, thereby strengthening the speech component of the speech signal.
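The standard-deviation weighting can be sketched as below. The patent does not say whether the population or the sample standard deviation is meant, nor whether the weight is one value per frame or per band. This sketch assumes a single population standard deviation per frame for each of the real and imaginary parts, and the function names are hypothetical.

```python
import statistics


def voice_weights(frame):
    """Standard deviation of the real parts and of the imaginary parts."""
    w_real = statistics.pstdev([c.real for c in frame])
    w_imag = statistics.pstdev([c.imag for c in frame])
    return w_real, w_imag


def reinforce(frame):
    """Multiply the real and imaginary parts by their own weights."""
    w_real, w_imag = voice_weights(frame)
    return [complex(c.real * w_real, c.imag * w_imag) for c in frame]


frame = [1.0 + 0.0j, 3.0 + 2.0j, 5.0 + 0.0j, 7.0 + 2.0j]   # made-up bins
weights = voice_weights(frame)
reinforced = reinforce(frame)
```

A frame whose components vary widely, as speech frames tend to, receives a larger gain than a frame whose components are nearly constant.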

As described above, the voice reinforcement unit 80 strengthens the speech component at different ratios for the real part and the imaginary part of the input frame, so that only the speech component is strengthened while the noise component contained in the frame is suppressed.

The inverse fast Fourier transform (IFFT) unit 110 applies an inverse Fourier transform to the frame input from the voice reinforcement unit 80 to return it to a time-domain frame, and the overlap unit 120 overlaps the time-domain frames output from the inverse fast Fourier transform unit 110 so that consecutive frames join smoothly. The inverse fast Fourier transform is given by Equation 3 below.
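Equation 3 is not reproduced in this text. For reference, the standard N-point inverse discrete Fourier transform is shown below; the symbols $X(k)$ for the processed spectrum and $x(n)$ for the recovered time-domain frame are assumed notation and may differ from the patent's original equation.

```latex
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\,j 2\pi k n / N}, \qquad n = 0, 1, \ldots, N-1
```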

Accordingly, the speech enhancement system 1 according to the present invention performs noise suppression or voice reinforcement suited to the characteristics of each frame, thereby strengthening the speech component and suppressing the noise component while minimizing the generation of musical noise.

A speech enhancement method according to the present invention comprises: dividing an input speech signal into frames; when a frame is a noise frame containing only noise components, suppressing noise in the real part and the imaginary part of the noise frame; and reinforcing speech in the real part and the imaginary part of each frame.

The speech enhancement method according to the present invention is described in detail below with reference to the accompanying drawings.

First, when a speech signal is input, the preprocessor 10 samples the input signal at a predetermined frequency to convert it into a digital signal, pre-emphasizes the digital signal with a simple high-pass filter to slightly boost its high frequencies, and then divides the filtered signal into frames, the basic unit of speech processing (S10).

Next, the fast Fourier transform unit 20 applies a window to each frame input from the preprocessor 10 and performs an N-point fast Fourier transform (S20). The fast Fourier transform was described in detail in connection with FIG. 2, so a detailed description is omitted here.

Next, the voice detector 40 determines whether speech is present in the frame input from the fast Fourier transform unit 20 (S30). The voice detector 40 calculates the energy of the input frame, finds the most suitable threshold, and uses it to decide whether speech is present. Other methods of determining the presence of speech may also be used.

If the voice detector 40 determines that the frame contains no speech, that is, that the frame is a noise frame containing only noise components, the noise frame is output to the noise average calculator 50 and undergoes the noise suppression step (S40), which is described in more detail below.

First, the noise average calculator 50 separately calculates the average spectra of the real part and the imaginary part of the noise frame input from the voice detector 40 (S41). In more detail, the noise average calculator 50 uses the Hilbert transform to obtain an analytic complex signal for each of the real part and the imaginary part of the noise frame, takes the absolute value of each complex signal to obtain the envelope magnitude of the real part and of the imaginary part, and averages each envelope magnitude to output the average spectrum values for the real part and the imaginary part of the noise frame.

Next, the noise weight calculator 60 applies an inverse transform to the average spectrum values for the real part and the imaginary part of the noise frame to calculate the weights for the real part and the imaginary part (S42). The weights serve to suppress the noise component contained in the noise frame. When the average spectrum value of the noise frame is large, that is, when the noise frame contains a large noise component, the weight is made small so that the noise component is reduced substantially. When the average spectrum value is small, that is, when the noise frame contains little noise, the weight is made large so that the generation of musical noise during noise suppression is kept to a minimum.

Next, the multiplier 70 multiplies the real part and the imaginary part of the frame input from the fast Fourier transform unit 20 by the respective weights for the real part and the imaginary part of the noise frame calculated by the noise weight calculator 60 (S43). The noise component contained in the frame is thus suppressed at different ratios in the real part and the imaginary part.

That is, when the frame input from the fast Fourier transform unit 20 is a noise frame containing only noise components, the noise suppression step (S40) applies noise suppression suited to the characteristics of the noise component, so the noise component in the noise frame is suppressed effectively while the generation of musical noise is minimized.

The frame output from the noise suppression unit 30 is input to the voice reinforcement unit 80 and undergoes the voice reinforcement step (S50), which is described in more detail below.

First, the voice weight calculator 90 calculates the standard deviation of the real part and of the imaginary part of the input frame and sets these as the weights for the real part and the imaginary part of the frame (S51). The standard deviation is used as the weight for the speech component because the standard deviation of a speech component is larger than that of a noise component. Setting the weight to the standard deviation therefore strengthens the speech component in relative terms and suppresses the noise component in the high-frequency band.

Next, the multiplier 100 multiplies the real part and the imaginary part of the frame input from the noise suppression unit 30 by the respective weights calculated by the voice weight calculator 90 (S52). As a result, the speech component, which occupies a relatively low frequency band within the frame, is strengthened, and the noise component, which occupies a relatively high frequency band, is suppressed.

That is, the voice reinforcement step (S50) strengthens speech at different ratios in the real part and the imaginary part of the frame, so speech is strengthened while the noise component mixed into a speech frame is suppressed effectively.

A frame that has undergone the noise suppression step (S40) or the voice reinforcement step (S50) is inverse Fourier transformed by the inverse fast Fourier transform unit 110 and converted back into a time-domain frame (S60).

Then, once noise suppression or voice reinforcement has been performed on all frames of the speech signal, the overlap unit 120 overlaps the time-domain frames output from the inverse fast Fourier transform unit 110, joins consecutive frames smoothly, and outputs the result (S70 to S80).
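The overlapping step can be sketched as an overlap-add of the time-domain frames. The patent states only that frames are overlapped so that they join smoothly. The overlap-add scheme, the `hop` parameter, and the function name are assumptions made for this illustration.

```python
def overlap_add(frames, hop):
    """Sum equal-length time-domain frames placed hop samples apart."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for n, value in enumerate(frame):
            # Samples covered by two frames receive both contributions.
            out[start + n] += value
    return out


frames = [[1.0, 1.0, 1.0, 1.0],
          [2.0, 2.0, 2.0, 2.0],
          [3.0, 3.0, 3.0, 3.0]]   # made-up frames with 50% overlap
joined = overlap_add(frames, hop=2)
```

With a tapered analysis window and a matching `hop`, the overlapping regions cross-fade from one frame to the next instead of producing a step at each frame boundary.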

As described above, according to the present invention, noise suppression or voice reinforcement suited to the characteristics of each frame is performed, so the speech component is strengthened and the noise component suppressed while the generation of musical noise is minimized.

FIG. 5 shows an example in which the speech component is enhanced by the present invention. When the real part and the imaginary part of a noisy speech signal such as that of FIG. 5(a) are separated and subjected to noise suppression or voice reinforcement, the noise component is suppressed and only the speech component is strengthened, as shown in FIG. 5(b).

Although the present invention has been described with reference to the embodiment shown in the drawings, that embodiment is merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. The true scope of technical protection of the present invention should therefore be determined by the technical spirit of the appended claims.

According to the present invention, the speech component is strengthened and the noise component suppressed while the generation of musical noise is minimized, which improves the performance of speech recognition and speaker recognition systems in noisy environments.

Claims

A speech enhancement system comprising:

a preprocessor that divides an input speech signal into frames;

a fast Fourier transform unit that fast Fourier transforms the frame input from the preprocessor;

a noise suppression unit that, when the frame transformed by the fast Fourier transform unit is a noise frame containing only noise components, performs noise suppression on the real part and the imaginary part of the noise frame; and

a voice reinforcement unit that performs voice reinforcement on the real part and the imaginary part of the frame input from the noise suppression unit.

The system of claim 1, wherein the noise suppression unit comprises:

a speech detector that determines whether speech is present in the frame input from the preprocessor;

a noise average calculator that obtains an average spectral value for each of the real part and the imaginary part of the noise frame when the speech detector determines that the input frame is a noise frame;

a noise weight calculator that calculates weights for the real part and the imaginary part of the noise frame according to the average spectral values obtained by the noise average calculator; and

a multiplier that multiplies the real part and the imaginary part of the frame input from the preprocessor by the weights for the real part and the imaginary part of the noise frame, respectively, calculated by the noise weight calculator.

The system of claim 2, wherein the noise average calculator comprises:

a first noise average calculator that calculates an average spectral value of the real part of the noise frame; and

a second noise average calculator that calculates an average spectral value of the imaginary part of the noise frame.

The system of claim 3, wherein each of the first noise average calculator and the second noise average calculator comprises:

a Hilbert transform unit that uses the Hilbert transform to obtain a complex signal for the real part and the imaginary part of the input noise frame;

an envelope magnitude calculator that calculates envelope magnitudes for the real part and the imaginary part of the input noise frame from each complex signal obtained by the Hilbert transform unit; and

a mean calculator that calculates the average of the envelope magnitudes of the real part and the imaginary part of the noise frame calculated by the envelope magnitude calculator.
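
The envelope averaging recited above can be illustrated with a minimal sketch. It assumes that the real part (or the imaginary part) of one noise frame is treated as a real-valued sequence, that the complex signal is the usual analytic signal, and that a naive DFT is an acceptable dependency-free stand-in for an FFT library. The function names `dft`, `analytic_signal`, `mean_envelope` and `noise_averages` are hypothetical and do not come from the specification.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Complex signal x + j*Hilbert(x), built by zeroing negative frequencies."""
    n = len(x)
    spectrum = dft(x)
    gain = [0.0] * n
    gain[0] = 1.0                      # keep the DC term once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist term once
        for k in range(1, n // 2):
            gain[k] = 2.0              # double the positive frequencies
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return dft([s * g for s, g in zip(spectrum, gain)], inverse=True)


def mean_envelope(x):
    """Average envelope magnitude of a real-valued sequence."""
    envelope = [abs(z) for z in analytic_signal(x)]
    return sum(envelope) / len(envelope)


def noise_averages(real_part, imag_part):
    """Process the real and imaginary parts separately (assumed reading)."""
    return mean_envelope(real_part), mean_envelope(imag_part)
```

For a pure cosine of amplitude 0.5 the envelope is constant at 0.5, which makes a convenient sanity check for the sketch.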

The system of claim 1, wherein the speech enhancement unit comprises:

a speech weight calculator that calculates weights for the speech components in the real part and the imaginary part of the frame input from the noise suppression unit; and

a multiplier that multiplies the real part and the imaginary part of the frame input from the noise suppression unit by the weights for the real part and the imaginary part of the frame, respectively, calculated by the speech weight calculator.

The system of claim 5, wherein the speech weight calculator calculates standard deviations of the real part and the imaginary part of the frame input from the noise suppression unit, and calculates the weight for the speech component according to the calculated standard deviations.
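
As a rough illustration of a standard-deviation-based weight followed by the multiplication step, the sketch below computes one weight for the real part and one for the imaginary part of a frame. The claims state only that the weight depends on the standard deviation, so the mapping `1 + alpha * sigma / (sigma + reference)` and the parameters `alpha` and `reference` are assumptions made for this example, not the formula of the specification.

```python
import statistics


def speech_weight(values, alpha=0.5, reference=1.0):
    """Hypothetical gain that grows with the standard deviation of the values.

    The gain is 1.0 when the values are constant and approaches 1 + alpha
    as the standard deviation becomes large compared with `reference`.
    """
    sigma = statistics.pstdev(values)
    return 1.0 + alpha * sigma / (sigma + reference)


def enhance_speech(real_part, imag_part, alpha=0.5, reference=1.0):
    """Weight the real and imaginary parts of one frame independently."""
    w_real = speech_weight(real_part, alpha, reference)
    w_imag = speech_weight(imag_part, alpha, reference)
    enhanced_real = [w_real * r for r in real_part]
    enhanced_imag = [w_imag * i for i in imag_part]
    return enhanced_real, enhanced_imag
```

With this assumed mapping, a part whose values vary strongly is amplified more than a flat one, and the two parts never share a weight.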

delete

The speech enhancement system of claim 1, further comprising an inverse fast Fourier transform unit that performs an inverse fast Fourier transform on the frame input from the speech enhancement unit.

The speech enhancement system of claim 8, further comprising an overlap unit that overlaps the frames input from the inverse fast Fourier transform unit.

A speech enhancement method comprising:

dividing an input speech signal into frames;

performing a fast Fourier transform on each frame;

suppressing noise in each of the real part and the imaginary part of a noise frame when the fast Fourier transformed frame is a noise frame having only a noise component; and

enhancing speech in each of the real part and the imaginary part of the frame.
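
The first two steps of the method, and the separation into real and imaginary parts on which the later steps operate, might be sketched as follows. The frame length, the hop size and the use of a naive DFT in place of a fast Fourier transform are assumptions chosen to keep the example dependency-free; `split_into_frames` and `transform_frame` are hypothetical names.

```python
import cmath
import math


def split_into_frames(signal, frame_len, hop):
    """Cut the signal into frames of frame_len samples, advancing by hop."""
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def transform_frame(frame):
    """Naive DFT of one frame, returned as separate real and imaginary lists."""
    n = len(frame)
    real_part, imag_part = [], []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        real_part.append(acc.real)
        imag_part.append(acc.imag)
    return real_part, imag_part
```

Returning two plain lists rather than one complex list mirrors the way the claims process the real part and the imaginary part separately.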

The method of claim 10, wherein the noise suppression step comprises:

determining whether speech is present in the frame;

obtaining an average spectral value for each of the real part and the imaginary part of the noise frame when the frame is determined to be a noise frame having only a noise component;

calculating weights for the real part and the imaginary part of the noise frame according to the obtained average spectral values; and

multiplying the real part and the imaginary part of the frame by the calculated weights for the real part and the imaginary part of the noise frame, respectively.
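
The control flow of this noise suppression step can be sketched under several assumptions. Speech detection is reduced to a hypothetical energy threshold, the average spectral value is approximated by the mean absolute value of each part, and the weight `reference / (reference + average)` is an invented mapping that merely decreases as the average noise level rises. None of these choices is taken from the specification.

```python
def is_noise_frame(real_part, imag_part, energy_threshold):
    """Assumed detector: a frame with low mean energy contains no speech."""
    energy = sum(r * r + i * i for r, i in zip(real_part, imag_part))
    return energy / len(real_part) < energy_threshold


def average_magnitude(values):
    """Stand-in for the average spectral value of one part of the frame."""
    return sum(abs(v) for v in values) / len(values)


def noise_weight(noise_average, reference):
    """Hypothetical weight in (0, 1] that shrinks as the noise average grows."""
    return reference / (reference + noise_average)


def suppress_noise(real_part, imag_part, energy_threshold=1.0, reference=1.0):
    """Attenuate noise frames; frames judged to contain speech pass unchanged."""
    if not is_noise_frame(real_part, imag_part, energy_threshold):
        return list(real_part), list(imag_part)
    w_real = noise_weight(average_magnitude(real_part), reference)
    w_imag = noise_weight(average_magnitude(imag_part), reference)
    return [w_real * r for r in real_part], [w_imag * i for i in imag_part]
```

Because each part receives its own weight, a frame whose noise sits mostly in the real part is attenuated there without disturbing the imaginary part.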

12. The method of claim 11, wherein obtaining the average spectral values for the real part and the imaginary part of the noise frame comprises:

obtaining a complex signal for each of the real part and the imaginary part of the input frame using a Hilbert transform;

calculating envelope magnitudes for the real part and the imaginary part of the input frame from each complex signal obtained by the Hilbert transform; and

calculating an average of the calculated envelope magnitudes.

The method of claim 10, wherein the speech enhancement step comprises:

calculating weights for the speech components in the real part and the imaginary part of the frame, respectively; and

multiplying the real part and the imaginary part of the frame by the calculated weights for the real part and the imaginary part of the frame, respectively.

The method of claim 13, wherein the weight for the speech component is calculated according to the standard deviations of the real part and the imaginary part of the frame.

delete

16. The method of claim 10, further comprising performing an inverse fast Fourier transform on the speech-enhanced frame.

17. The method of claim 16, further comprising overlapping the inverse fast Fourier transformed frames.
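
The overlapping of inverse-transformed frames can be illustrated with a simple overlap-add routine. The claims say only that the frames are overlapped, so summing the overlapping samples and the hop size `hop` are assumptions of this sketch, and `overlap_add` is a hypothetical name.

```python
def overlap_add(frames, hop):
    """Sum equally sized time-domain frames, each shifted by hop samples."""
    if not frames:
        return []
    frame_len = len(frames[0])
    output = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for index, frame in enumerate(frames):
        start = index * hop
        for n, sample in enumerate(frame):
            output[start + n] += sample
    return output
```

For example, two four-sample frames with a hop of two samples share two samples, so the result has six samples and the shared region holds the sum of both frames.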