KR19990070595A

KR19990070595A - Method for classifying voiced/unvoiced bands in a flattened spectrum

Info

Publication number: KR19990070595A
Application number: KR1019980005530A
Authority: KR
Inventors: 박영호; 양재찬; 배명진; 이상효
Original assignee: 이봉훈; 서울이동통신 주식회사
Priority date: 1998-02-23
Filing date: 1998-02-23
Publication date: 1999-09-15
Also published as: KR100283604B1

Abstract

The present invention relates to a method of dividing the spectrum of a speech signal into bands and classifying the speech signal in each band as voiced or unvoiced. The log-amplitude spectrum and the formant spectrum of the speech signal are obtained, the difference between the two gives a flattened spectrum, the energy of each band of the flattened spectrum is computed, and a band whose energy exceeds a given threshold is judged to be voiced. The log-amplitude spectrum is obtained by passing the original speech signal through a Hanning window, applying an FFT (Fast Fourier Transform), and taking the logarithm. The formant spectrum is obtained by passing the original speech signal through a Hanning window, applying an FFT, taking the logarithm, and passing the result through a lifter.

Description

Method for classifying voiced/unvoiced bands in a flattened spectrum

The present invention relates to a method of classifying speech as voiced or unvoiced, and more particularly to a method of classifying voiced/unvoiced bands in a flattened spectrum.

Analysis of speech shows that even in a segment that can confidently be called voiced, part of the amplitude spectrum is filled with noise energy. Moreover, in noisy speech, or in a mixed voicing segment of clean speech, periodic harmonics of the fundamental frequency and a noise-like spectrum are present at the same time. A vocoder that assigns each speech frame a single binary voiced/unvoiced decision therefore cannot guarantee the quality of the synthesized speech.

For this reason, the IMBE (Improved Multi-Band Excitation) method divides the spectrum into several bands and makes a voiced/unvoiced decision for each band. The parameters of the IMBE method are therefore the fundamental frequency, the spectral envelope, and a voiced/unvoiced decision for each band of the spectrum. In the analysis stage, the pitch and spectral envelope of the windowed speech signal are estimated and used to decide whether each band is voiced or unvoiced; these values become the system parameters for that speech frame. In the synthesis stage, speech is synthesized from these parameters and output. The window function is used to split the speech signal into frames; Hamming, Hanning, Blackman, or Kaiser windows are generally used to reduce blocking effects.

During analysis, these parameters must be chosen so that the error between the spectrum of the original speech and the spectrum of the synthesized speech is minimized. Finding the parameters that minimize this error energy requires solving a highly nonlinear optimization problem. An approximation is therefore used: the fundamental frequency and spectral envelope are first found under the assumption that all input speech is voiced, and the voiced/unvoiced information is then optimized.

First, an accurate fundamental frequency of the original speech must be obtained. Using this fundamental frequency, the spectral envelope is then represented by a set of complex harmonic coefficients, which correspond to the values of the spectral envelope at the harmonics of the fundamental frequency. Once the fundamental frequency is known, the harmonic coefficients that minimize the error are given by linear equations that do not depend on the other parameters and are easy to solve.

The voiced/unvoiced decision is made from this minimum-error spectrum. First, the error spectrum between the original spectrum and the synthetic spectrum with minimum error energy is computed. The spectrum is divided into bands three times the fundamental frequency wide, and the average error in each band is calculated. If this error exceeds a set threshold the band is declared unvoiced; otherwise it is declared voiced.
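
The band decision described above can be sketched as follows. This is a minimal illustration rather than the IMBE reference procedure: the function name, the use of mean squared error as the "average error", and the bin-based band width `f0_bins` are assumptions made for the example.

```python
def band_voicing(original, synthetic, f0_bins, threshold):
    """Return one flag per band: True = voiced, False = unvoiced.

    original, synthetic: magnitude spectra of equal length (lists of floats).
    f0_bins: fundamental frequency expressed in FFT bins (assumed integer).
    threshold: hypothetical error threshold, tuned experimentally.
    """
    band_width = 3 * f0_bins  # each band spans three times the fundamental
    flags = []
    for start in range(0, len(original), band_width):
        o = original[start:start + band_width]
        s = synthetic[start:start + band_width]
        # Average error in the band (mean squared error is an assumption).
        error = sum((a - b) ** 2 for a, b in zip(o, s)) / len(o)
        flags.append(error <= threshold)  # error above threshold -> unvoiced
    return flags
```

A band that the harmonic model reproduces well yields a small error and is flagged voiced; a band the model cannot reproduce is flagged unvoiced.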

For this method to succeed, the pitch must first be detected accurately, to within about ±1 Hz, and the spectral envelope must then be estimated from it. If the pitch is detected incorrectly, the error propagates and strongly affects the voiced/unvoiced decision.

In the IMBE method, the spectrum is divided into several bands and a voiced/unvoiced decision is made for each band. The parameters of the IMBE method are therefore the fundamental frequency, the spectral envelope, and the voiced/unvoiced decision for each band of the spectrum. In the analysis stage, the pitch and spectral envelope of the windowed speech are estimated and used to decide whether each band is voiced or unvoiced; these values become the system parameters for that speech frame, and the synthesis stage uses them to synthesize and output speech. During analysis, the parameters must be chosen so that the error between the spectrum of the original speech and the spectrum of the synthesized speech is minimized. For this process to succeed, however, the pitch must first be detected accurately, to within about ±1 Hz, and the spectral envelope must be estimated from it. If the pitch is detected incorrectly, the error propagates and strongly affects the detection of voiced/unvoiced bands.

Accordingly, an object of the present invention is to provide a method for accurately classifying speech as voiced or unvoiced.

FIG. 1 is a block diagram of the spectral voiced/unvoiced (UV/V) classification algorithm proposed in the present invention.

FIG. 2 illustrates the process of detecting voiced/unvoiced bands in the flattened spectrum.

FIG. 3 shows voiced/unvoiced detection results obtained with the method of the present invention.

The configuration and operation of the present invention are described in detail below with reference to one embodiment.

Extracting the spectrum of the original speech signal and deciding the voiced/unvoiced bands in that spectrum requires an accurate pitch detected in the time domain. The accuracy of pitch detection strongly affects not only the spectral UV/V band decision but also the sound quality of the IMBE vocoder. Pitch detection methods studied to date fall into time-domain, frequency-domain, and hybrid time-frequency methods.

Time-domain methods include the parallel processing method, the AMDF method, and the autocorrelation method (ACM). They emphasize the periodicity of the waveform and then find the pitch by decision logic. Because they operate in the time domain, no domain transform is needed and the resolution is high. However, when a frame spans a phoneme transition, the fundamental period is not constant, which makes detection difficult. For noisy speech in particular, the decision logic becomes complicated and the detection error grows.
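
As an illustration of the time-domain approach, the sketch below estimates the pitch period with a plain AMDF on a synthetic frame. The function names, the search range, and the test signal are assumptions for the example; real detectors add pre-processing and decision logic that are omitted here.

```python
import math

def amdf(frame, lag):
    """Average magnitude difference between the frame and itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) with the smallest AMDF value inside the search range."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))

# Synthetic "voiced" frame: a sine wave with a 40-sample period.
frame = [math.sin(2 * math.pi * i / 40) for i in range(200)]
period = pitch_period(frame, 20, 60)
```

The AMDF dips toward zero when the lag matches the period of the waveform, which is why the minimum is taken as the pitch period.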

Frequency-domain methods include harmonic analysis, the lifter method, and comb filtering; they detect the fundamental frequency of voiced speech by measuring the harmonic spacing in the speech spectrum. Because the spectrum is generally computed over one frame (20-40 ms), phoneme transitions, fluctuations, or background noise within the frame are averaged out and have little effect. However, the transform to the frequency domain makes the computation complex, and increasing the number of FFT (Fast Fourier Transform) points to improve the precision of the fundamental frequency both lengthens the processing time and makes variations in the fundamental frequency undetectable.
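
The trade-off between frequency precision and time resolution can be made concrete with a little arithmetic. The sketch below assumes that the FFT length equals the analysis frame length (no zero-padding) and uses a hypothetical 8 kHz sampling rate, which the source does not state.

```python
def bin_spacing_hz(sample_rate, n_fft):
    """Frequency distance between adjacent FFT bins."""
    return sample_rate / n_fft

def frame_duration_ms(sample_rate, n_fft):
    """Duration of speech covered by one n_fft-sample frame."""
    return 1000.0 * n_fft / sample_rate

SAMPLE_RATE = 8000  # hypothetical value, not given in the source

for n_fft in (256, 1024):
    print(n_fft, bin_spacing_hz(SAMPLE_RATE, n_fft), frame_duration_ms(SAMPLE_RATE, n_fft))
```

Under these assumptions, going from 256 to 1024 points narrows the bin spacing from 31.25 Hz to 7.8125 Hz but stretches the frame from 32 ms to 128 ms, long enough for the fundamental frequency to change within the frame.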

Hybrid time-frequency methods combine the advantages of both: the low computation time and pitch precision of time-domain methods, and the ability of frequency-domain methods to find an accurate pitch despite background noise or phoneme changes. Examples include the cepstrum method and the spectral comparison method. However, the windowing error is compounded on each pass between the time and frequency domains, which can affect pitch extraction, and using both domains makes the computation complex.

Of these, frequency-domain pitch detection operates on the spectrum and is known to work even in noise at an SNR of 0 dB, but its accuracy and resolution are degraded by the spectral emphasis and decision logic steps. If the causes of this degradation can be identified and removed, frequency-domain detection becomes a noise-robust pitch detection method. The present invention therefore applies, for IMBE coding, the spectral AMDF (SAMDF) method, which measures the spacing between harmonics while reducing the influence of the formant spectrum and of noise.

The SAMDF method passes the log-amplitude spectrum through the AMDF function to remove the effects of background noise and formants, and can be expressed as Equation (4).

Here F_MAX is the position of the spectral energy maximum due to the first formant, and size is the length of one frame.

According to Equation (4), SAMDF(w) behaves like the AMDF function: its value increases across the spectral domain and then reaches a minimum at the first frequency equal to F₀, where the valleys of the harmonic spectrum line up with the neighboring valleys. Finding this minimum valley point yields the fundamental frequency F₀. Even when formants make the spectral envelope uneven, the slope of SAMDF(w) estimates and compensates for the envelope, so its influence is reduced.
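
The idea of applying an AMDF along the frequency axis can be sketched as below. This is not a reconstruction of Equation (4): the summation limits involving F_MAX and size are not reproduced, and the function names, the search range, and the toy spectrum are assumptions for the example.

```python
def samdf(log_spectrum, lag):
    """AMDF taken along the frequency axis of a log-amplitude spectrum."""
    n = len(log_spectrum) - lag
    return sum(abs(log_spectrum[k] - log_spectrum[k + lag]) for k in range(n)) / n

def harmonic_spacing(log_spectrum, min_lag, max_lag):
    """First lag (in bins) with the smallest SAMDF value, taken as F0 in bins."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: samdf(log_spectrum, lag))

# Toy log-amplitude spectrum with a harmonic peak every 8 bins.
toy_spectrum = [1.0 if k % 8 == 0 else 0.0 for k in range(128)]
spacing = harmonic_spacing(toy_spectrum, 4, 12)
```

When the shift equals the harmonic spacing, peaks land on peaks and valleys on valleys, so the difference function drops to its minimum at that lag.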

The present invention proposes first flattening the log-amplitude spectrum and then classifying the voiced/unvoiced bands.

Before the voiced/unvoiced decision, the log-amplitude spectrum is flattened. If the fundamental frequency is known, the formant spectrum can be obtained by passing the log-amplitude spectrum through a lifter; otherwise the fundamental frequency must be found beforehand. The present invention applies the spectral AMDF (Average Magnitude Difference Function) method, which measures the spacing between harmonics while reducing the influence of the formant spectrum and of noise. Once the formant spectrum is obtained, its difference from the original speech spectrum is computed. A voiced/unvoiced classification function is then applied to the flattened harmonic components in units of the fundamental frequency. The classification function is built from the harmonic spectrum and a fundamental harmonic. The spectrum of the fundamental harmonic can be modeled in several ways; here the sinc function (sin x / x) is used, by analogy with the spectrum of a rectangular pulse train in the time domain. A band for which the output of the classification function exceeds an experimentally determined threshold is declared a voiced band.

In particular, the present invention proposes a method of classifying voiced/unvoiced bands in a spectrum flattened in this way. The fundamental frequency is detected first and used to obtain the flattened harmonic spectrum, and the voiced/unvoiced bands are then determined from the similarity to a sinc-shaped harmonic pulse.

Compared with the conventional method of classifying bands directly on the speech spectrum, the proposed method improves the band classification rate by 2.93% on average, and it detects the pitch period and spectral envelope information along with the voiced/unvoiced bands.

FIG. 1 is a block diagram of the proposed spectral UV/V classification algorithm, FIG. 2 shows the process of detecting voiced/unvoiced bands in the flattened spectrum, FIG. 3 shows voiced/unvoiced detection results obtained with the proposed method, and Table 1 gives the spectral voiced/unvoiced band classification results.

Before the voiced/unvoiced decision, the log-amplitude spectrum is first flattened. If the fundamental frequency F₀ is known, an approximate formant spectrum F(w) for the amplitude spectrum S_W(w) of the speech can be obtained by passing it through a lifter function as follows.
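
A lifter of this kind is commonly realized as low-quefrency cepstral smoothing, and the sketch below follows that reading. It is an assumption rather than the patent's own equation: the function names, the hard cutoff, and the naive DFT (used only to keep the example dependency-free) are illustrative, and `cutoff` is assumed to lie below the pitch quefrency implied by F₀.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (kept dependency-free)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def formant_spectrum(log_spectrum, cutoff):
    """Smooth a full-length log-amplitude spectrum by keeping low quefrencies."""
    n = len(log_spectrum)
    cepstrum = dft(log_spectrum, inverse=True)
    # Keep quefrencies below `cutoff` and their mirror images; zero the rest.
    liftered = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    return [value.real for value in dft(liftered)]

n = 64
slow = [2.0 * math.cos(2 * math.pi * k / n) for k in range(n)]         # envelope part
ripple = [0.5 * math.cos(2 * math.pi * 16 * k / n) for k in range(n)]  # harmonic part
log_spec = [a + b for a, b in zip(slow, ripple)]
envelope = formant_spectrum(log_spec, cutoff=8)
```

In this toy case the harmonic ripple sits at quefrency 16, above the cutoff of 8, so only the slowly varying envelope survives the lifter.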

Once the formant spectrum is obtained, its difference from the original speech spectrum is computed. The flattened harmonic spectrum E(w) can be expressed as Equation (2).
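
The equation itself does not appear in this text. From the description (the approximate formant spectrum is subtracted from the log-amplitude spectrum), one plausible form is given below; reading S_W(w) as the windowed amplitude spectrum and leaving the base of the logarithm unspecified are assumptions.

```latex
E(w) = \log S_W(w) - F(w)
```

Subtracting the smooth envelope F(w) leaves mainly the harmonic ripple, so harmonics in strong and weak formant regions end up at comparable levels.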

The voiced/unvoiced band classification of Equation (3) is then applied to the flattened harmonic components in units of the fundamental frequency F₀:

Here E(w) is the flattened harmonic spectrum and F₀(d) is the spectral component forming the fundamental harmonic. The spectral component of the fundamental harmonic can be modeled in several ways; here the sinc function is used, by analogy with the spectrum of a rectangular pulse train in the time domain. A band for which the output of the voiced/unvoiced classification function D(w) exceeds an experimentally determined threshold is declared a voiced band.
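
Since Equation (3) is not reproduced here, the sketch below only illustrates the general idea of scoring each F₀-wide band by its similarity to a sinc-shaped pulse. Everything specific is an assumption: the cosine-similarity measure, the alignment of each band with the pulse centre, the pulse width, the function names, and the 0.7 threshold used in the demonstration.

```python
import math

def sinc_pulse(width):
    """Sinc-shaped harmonic pulse sampled on `width` bins, peak at the centre."""
    centre = (width - 1) / 2
    pulse = []
    for d in range(width):
        x = math.pi * (d - centre) / (width / 2)
        pulse.append(1.0 if x == 0 else math.sin(x) / x)
    return pulse

def similarity(band, pulse):
    """Cosine similarity between a band and the pulse (1.0 = same shape)."""
    dot = sum(b * p for b, p in zip(band, pulse))
    norm = math.sqrt(sum(b * b for b in band) * sum(p * p for p in pulse))
    return dot / norm if norm > 0 else 0.0

def classify_bands(flattened, f0_bins, threshold):
    """One flag per F0-wide band: True = voiced, False = unvoiced."""
    pulse = sinc_pulse(f0_bins)
    flags = []
    for start in range(0, len(flattened) - f0_bins + 1, f0_bins):
        band = flattened[start:start + f0_bins]
        flags.append(similarity(band, pulse) > threshold)
    return flags

pulse = sinc_pulse(8)
# First band: a scaled harmonic pulse. Second band: an alternating noise-like pattern.
flattened = [2.0 * p for p in pulse] + [1.0, -1.0] * 4
flags = classify_bands(flattened, 8, 0.7)
```

A band that contains a clean harmonic lobe scores close to 1 and is flagged voiced, while a band without that shape scores near 0.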

FIG. 1 shows the proposed spectral UV/V classification algorithm as a processing block diagram. A Hanning window is first applied to the speech, with a frame size of 51 samples and successive frames overlapping by 384 samples. An FFT is applied and the logarithm taken to give the log-amplitude spectrum, and this log-amplitude spectrum is passed through a lifter to give the formant spectrum. The approximate formant spectrum is subtracted from the log-amplitude spectrum of the speech to give the flattened harmonic spectrum. The voiced/unvoiced band classification function is then applied to the harmonic spectrum to decide whether each spectral band is voiced or unvoiced. The length of each band was set to 16 frequency samples. The threshold on the band classification value was set to 70% of the energy of one fundamental frequency; a band exceeding it was declared voiced, and otherwise unvoiced.
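
The framing and thresholding steps of this pipeline can be sketched as follows. The stated frame size of 51 samples is smaller than the stated 384-sample overlap, so the sketch assumes a 512-sample frame; that value is a guess, not something the text confirms. The function names and the `reference_energy` argument (standing in for "the energy of one fundamental frequency") are likewise assumptions.

```python
import math

FRAME_LEN = 512  # assumption: a 51-sample frame cannot hold a 384-sample overlap
OVERLAP = 384    # from the text
BAND_LEN = 16    # from the text: 16 frequency samples per band
RATIO = 0.7      # from the text: 70% of the reference energy

def hann(n):
    """Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal, frame_len=FRAME_LEN, overlap=OVERLAP):
    """Split the signal into overlapping frames and apply the window to each."""
    hop = frame_len - overlap
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

def voiced_bands(flattened, reference_energy, band_len=BAND_LEN, ratio=RATIO):
    """Flag a band voiced when its energy exceeds ratio * reference_energy."""
    flags = []
    for start in range(0, len(flattened) - band_len + 1, band_len):
        energy = sum(v * v for v in flattened[start:start + band_len])
        flags.append(energy > ratio * reference_energy)
    return flags

frames = windowed_frames([1.0] * 1024)
flags = voiced_bands([1.0] * 16 + [0.1] * 16, reference_energy=16.0)
```

With a 512-sample frame and a 384-sample overlap the hop is 128 samples, so each new frame advances by a quarter of the frame length.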

The results of the proposed processing are shown in FIG. 2. In the figure, (a) is the waveform of one frame of the vowel /i/ from the speech samples, (b) is its log-amplitude spectrum, (c) is the flattened harmonic spectrum, and (d) is the spectrum after passing through the voiced/unvoiced band decision function in units of the fundamental frequency. Bands that exceed the threshold are coded as voiced and bands that do not are coded as unvoiced. FIG. 3 shows the original spectrum and the spectra classified as voiced and unvoiced by the decision function. In FIG. 3, (a) is the spectrum of the original speech, (b) is the spectrum of the voiced bands, and (c) is the spectrum of the unvoiced bands.

To compare the accuracy of the detected voiced/unvoiced bands, the deviation between classification directly on the original spectrum and classification on the flattened spectrum was assessed frame by frame and is shown in Table 1. Taking the voiced/unvoiced bands detected in the flattened spectrum as the reference, a frame was counted as incorrectly detected if the bands found directly from the speech spectrum differed from the reference by ±F₀ or more.

The results varied with the utterance and the speaker, but on average 2.93% of frames were improved. The flattened-spectrum detection was taken as the reference because the formant characteristics are reduced in it, and because visual inspection of several of the deviating frames confirmed that the flattened-spectrum result was the correct one.

Compared with the conventional method of classifying bands directly on the speech spectrum, the proposed method improves the band classification rate by 2.93% on average (see Table 1), and it detects the pitch period and spectral envelope information along with the voiced/unvoiced bands.

Claims

A method of classifying voiced/unvoiced bands in a flattened spectrum, comprising: obtaining the log-amplitude spectrum and the formant spectrum of a speech signal; obtaining a flattened spectrum from the difference between the two spectra; obtaining the energy of each band of the flattened spectrum; and judging a band to be voiced if its energy exceeds a given threshold.

The method of claim 1, wherein the log-amplitude spectrum is obtained by passing the original speech signal through a Hanning window, applying an FFT (Fast Fourier Transform), and taking the logarithm.

The method of claim 1, wherein the formant spectrum is obtained by passing the original speech signal through a Hanning window, applying an FFT (Fast Fourier Transform), taking the logarithm, and passing the result through a lifter.

The method of claim 1, wherein the threshold is a constant value relative to the energy of the fundamental frequency.