KR100322704B1

KR100322704B1 - Method for varying voice signal duration time

Info

Publication number: KR100322704B1
Application number: KR1019950026931A
Authority: KR
Inventors: 기석철; 배명진
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1995-08-28
Filing date: 1995-08-28
Publication date: 2002-06-20
Also published as: KR970012286A

Abstract

PURPOSE: To provide a method for varying the duration of a speech signal that classifies the speech signal into voiced sounds, unvoiced sounds, and unvoiced plosives in the time domain and changes the duration of each classified sound according to the pitch period and pitch time point obtained through a pitch search method and a GCI detection method, respectively. CONSTITUTION: A speech signal is classified into voiced sounds, unvoiced sounds, and unvoiced plosives in the time domain (202). With the pitch time point as reference, the voiced sound is repeated and removed in pitch-period units to change its duration. The unvoiced sound is repeated and removed in frame units to change its duration. The unvoiced plosive is repeated and removed in silence units to change its duration (208).

Description

How to change the duration of voice signals

The present invention relates to speech signal processing such as speech recognition, synthesis, and analysis, and more particularly to a method of changing the duration of a speech signal that classifies the speech signal into voiced sounds, unvoiced sounds, and unvoiced plosives and changes the duration of each.

Conventional pitch detection methods, which detect the pitch period of a speech signal, can be divided into time-domain methods, frequency-domain methods, and time-frequency domain methods.

The time-domain method detects the pitch by decision logic after emphasizing the periodicity of the waveform. Because it normally operates in the time domain, no domain transformation is needed and the pitch resolution is high. However, when a phoneme spans a transition interval, the level changes sharply within the frame and the pitch period fluctuates, which makes pitch detection difficult. In particular, for noisy speech the decision logic for pitch detection becomes complicated and detection errors increase.

The frequency-domain method detects the fundamental frequency of voiced sound by measuring the harmonic spacing of the speech spectrum; harmonic analysis, the lifter method, the comb-filtering method, and others have been proposed. Because the spectrum is generally computed per frame (20 to 40 ms), phoneme transitions or fluctuations and background noise within that interval are averaged out and have little influence. However, the required transformation to the frequency domain makes the computation complex, and increasing the number of FFT (Fast Fourier Transform) points to improve the precision of the fundamental frequency lengthens the processing time.

The time-frequency hybrid method combines the advantages of the time-domain method (reduced computation time and precise pitch) with that of the frequency-domain method (accurate pitch even under background noise or phoneme changes). Such methods include the cepstrum method and the spectrum comparison method. Because errors accumulate each time the signal passes between the time and frequency domains, pitch extraction can be affected, and because both domains are applied together the computation is complex, although no domain transformation is needed and only simple operations such as sums, differences, and comparison logic are required. However, when a phoneme spans a transition interval, the level changes sharply within the frame and the pitch period fluctuates, which makes pitch detection difficult. In particular, for noisy speech the decision logic for pitch detection becomes complicated and detection errors increase.

Meanwhile, there are three conventional methods for detecting the pitch time point: a method that measures the log determinant of the covariance matrix of the speech waveform, a method that analyzes the P-pole (P being the order of the linear prediction coefficients) linear prediction error sequence within an M-point window of the speech waveform, and a method that uses the electroglottograph (EGG: Electro Glottal Graph) signal. The first method cannot be applied to all signals because the glottal closure instant (GCI) is very difficult to determine for some vowels, and it requires a long processing time. With the second method it is difficult to obtain an accurate closed phase for high-pitched or breathy speech, in which the closed phase of the glottis is very short. The third method requires a microphone to be attached to the larynx to measure the movement of the vocal cords directly, so the speaker's freedom of movement is restricted.

Conventional methods for detecting an unvoiced plosive interval include a method that uses the change in signal slope within roughly segmented voiced intervals, and a method that uses the autocorrelation of the residual signal of linear prediction coefficient (LPC) analysis to express the relative amounts of the periodic component and the noise source. In the first method, the test for unvoiced plosives is performed only on intervals already judged to be voiced, so unvoiced plosives in which the unvoiced component dominates cannot be found. The second method is strongly speaker dependent, so autocorrelation alone cannot reliably distinguish voiced, unvoiced, and unvoiced plosive intervals.

Conventional methods of changing the duration of synthesized speech include a method based on source coding and a method based on waveform coding.

The method based on source coding separates the signal into its components during analysis and synthesizes speech from the information of the separated components, so the analysis error and the synthesis error add up and the intelligibility of the synthesized speech is greatly reduced. The method based on waveform coding has the difficulty that stop sounds and plosives, for which no pitch can be determined, must be handled separately.

In both methods, moreover, the accuracy of pitch detection strongly affects the naturalness of the synthesized speech, and the pitch time point is an important variable for pitch repetition. In short, conventional methods of changing the duration of a speech signal have the problems that the pitch and the pitch time point are difficult to detect accurately and that the sound source is difficult to classify.

An object of the present invention is to provide a method of changing the duration of a speech signal that classifies the speech signal into voiced sounds, unvoiced sounds, and unvoiced plosives in the time domain and changes the duration of each classified sound source according to the pitch period and the pitch time point obtained by a pitch search method and a GCI detection method, respectively.

To achieve the above object, the method of changing the duration of a speech signal according to the present invention comprises: a speech classification step of classifying the speech signal into voiced, unvoiced, and unvoiced plosive intervals in the time domain; and a changing step of changing the duration of the voiced sound by repeating and removing the voiced sound in pitch-period units with the pitch time point as reference, changing the duration of the unvoiced sound by repeating and removing the unvoiced sound in frame units, and changing the duration of the unvoiced plosive by repeating and removing the unvoiced plosive in silence units.

Hereinafter, the method of changing the duration of a speech signal according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a conventional speech signal processing apparatus for carrying out the method of changing the duration of a speech signal according to the present invention. It comprises a microphone (100), a first amplifier (102), a first low-pass filter (LPF) (104), an analog/digital (A/D) converter (106), an input port (108), a memory (110), a controller (112), a first output port (114), a D/A converter (116), a second LPF (118), a second amplifier (120), a speaker (122), and a second output port (124). Reference numeral 126 denotes a transmission channel.

FIG. 2 is a flowchart of the method of changing the duration of a speech signal according to the present invention. It consists of classifying the speech signal into voiced, unvoiced, and unvoiced plosive intervals (steps 200 to 202), obtaining the pitch period and the pitch time point (steps 204 to 206), and changing the duration of each classified speech interval (step 208).

Before describing the operation and effect of the method shown in FIG. 2, the operation of the speech signal processing apparatus of FIG. 1 is described. When a sound wave input through the microphone (100) is converted into an electrical signal, the first amplifier (102) amplifies it to a constant level. The speech signal input through the microphone (100) consists of components with frequencies in the range of 20 Hz to 20 kHz. To obtain the pitch, the speech signal only needs to contain the components that carry communicative information, so the low-pass filter removes frequency components above 4 kHz, the upper limit of that band. Components above this frequency are removed in order to reduce the amount of data processed per second once the speech signal is converted to digital form.

The low-pass filtered signal, which retains only components below 4 kHz, must then be converted into a digital signal so that it can be processed by a computer. It is sampled by the A/D converter (106), which converts the analog signal into a digital signal.

According to Nyquist's sampling theorem, the sampling rate must be at least twice the maximum signal frequency (4 kHz here), that is, 8 kHz, so a clock frequency of 8 kHz is supplied through input terminal IN1. The voltage level of each sample must also be quantized; 16-bit quantization (2¹⁶ = 65,536 levels) was used.
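The sampling figures above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are made up for this example, and the resulting bit rate is a derived figure that the description itself does not state.

```python
# Values taken from the description: 4 kHz band limit, 16-bit quantization.
MAX_SIGNAL_HZ = 4_000
BITS_PER_SAMPLE = 16


def nyquist_rate(max_signal_hz: int) -> int:
    """Minimum sampling rate that avoids aliasing: twice the highest frequency."""
    return 2 * max_signal_hz


sample_rate_hz = nyquist_rate(MAX_SIGNAL_HZ)      # 8000 samples per second
quantization_levels = 2 ** BITS_PER_SAMPLE        # 65536 levels
bit_rate = sample_rate_hz * BITS_PER_SAMPLE       # 128000 bits per second (derived)
```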

The digital speech signal thus obtained is input to the controller (112) through the input port (108) for computation and processing. The input speech data is processed by the duration changing method according to the present invention and then stored in the memory (110) as needed; alternatively, a speech signal is synthesized through a decoding process using data received through the transmission channel (126).

The synthesized speech signal decoded by the controller (112) is passed to the first output port (114) so that it can be played through the speaker (122) to check whether it was processed correctly.

The first output port (114) passes the digital signal to the D/A converter (116), which converts it into an analog signal. Here too the digital signal is processed at the 8 kHz sampling rate and converted into analog values. Because the converted signal still appears as a discrete signal containing harmonics of the sampling rate, it is passed through the second low-pass filter (118) so that only the baseband signal remains. The second amplifier (120) then amplifies the signal enough to drive the speaker (122) and outputs it to the speaker (122). The speaker (122) converts the duration-modified signal into a sound pressure wave, so that it can be heard by the human ear.

The method of changing the duration of a speech signal according to the present invention, as carried out on the conventional apparatus described above, is as follows.

First, a window function is applied to the speech signal that is input through the input port (108) or was stored in the memory (110) (step 200). That is, to divide the analysis interval of the speech signal into small units in time, a window function is applied that cuts out and processes segments of about 20 ms on average from the continuously varying speech signal.
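As a rough illustration of step 200, the sketch below cuts a sampled signal into 20 ms frames. It assumes a rectangular, non-overlapping window and the 8 kHz sampling rate given earlier; the description does not specify the window shape or overlap, and the function name is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Cut `samples` into consecutive frames of `frame_ms` milliseconds.

    Assumption: rectangular window, no overlap. A trailing partial frame
    is dropped for simplicity.
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # 160 samples at 8 kHz
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]
```

At 8 kHz a 20 ms frame holds 160 samples, so a 400-sample signal yields two full frames and the remaining 80 samples are ignored.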

After step 200, the speech signal is classified in the time domain into voiced intervals, unvoiced intervals, and unvoiced plosive intervals (step 202). The unvoiced plosive interval is detected using the characteristics of unvoiced plosives.

One method of detecting the unvoiced plosive interval uses the property that unvoiced plosives have a high peak-valley rate (PVR). Here, PVR means the number of peaks and valleys in the waveform during a given sound interval. Another method detects the unvoiced plosive interval using the following features of unvoiced plosives: the speech signal shows transition and discontinuity characteristics; a silent interval is present within one frame; the energy function increases because the sound is moving from unvoiced to voiced; the linear prediction error is large; and a periodic interval is detected ahead of a stable pitch interval.
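A minimal sketch of the PVR measure as defined above (the number of peaks and valleys in an interval) is given below. The zero-crossing count is included because ZCR is listed alongside PVR as a classification parameter in the claims. Both function names are hypothetical, and the description gives no decision thresholds, so none are assumed here.

```python
def peak_valley_count(frame):
    """Count local maxima and minima, i.e. reversals of the waveform slope."""
    count = 0
    prev_slope = 0
    for a, b in zip(frame, frame[1:]):
        slope = (b > a) - (b < a)          # +1 rising, -1 falling, 0 flat
        if slope != 0:
            if prev_slope != 0 and slope != prev_slope:
                count += 1
            prev_slope = slope
    return count


def zero_crossing_count(frame):
    """Count sign changes of the waveform, skipping samples that are exactly zero."""
    count = 0
    prev_sign = 0
    for x in frame:
        sign = (x > 0) - (x < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                count += 1
            prev_sign = sign
    return count
```

Dividing either count by the frame length (or frame duration) turns it into a rate that can be compared across frames of different sizes.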

In the voiced intervals classified in step 202, the pitch period is detected using a pitch search method (step 204). Unlike pitch detection, the pitch search method detects the pitch period by repeatedly comparing the pitch duration (lag) values and pitch gains that best satisfy the pitch duration condition (the pitch period range) between the original speech and the synthesized speech.
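The description does not give the search procedure in detail. As one possible reading, the sketch below tries every candidate lag in a given range, scores each by the normalized correlation between the frame and its delayed copy, and returns the best lag together with the corresponding gain. This is a generic long-term-prediction style search used here as a stand-in, not necessarily the procedure intended by the patent; the function name and the lag range are assumptions.

```python
import math


def search_pitch_period(frame, min_lag, max_lag):
    """Return (lag, gain) for the lag in [min_lag, max_lag] with the highest
    normalized correlation between frame[i] and frame[i + lag]."""
    best_lag, best_score, best_gain = None, float("-inf"), 0.0
    for lag in range(min_lag, max_lag + 1):
        n = len(frame) - lag
        if n <= 0:
            break
        cross = sum(frame[i] * frame[i + lag] for i in range(n))
        energy_past = sum(frame[i] ** 2 for i in range(n))
        energy_now = sum(frame[i + lag] ** 2 for i in range(n))
        if energy_past == 0 or energy_now == 0:
            continue
        score = cross / math.sqrt(energy_past * energy_now)
        if score > best_score:
            best_lag, best_score = lag, score
            best_gain = cross / energy_past   # gain predicting frame[i + lag] from frame[i]
    return best_lag, best_gain
```

For an 8 kHz signal, a lag range of roughly 20 to 80 samples within a 20 ms frame covers pitch frequencies between 100 Hz and 400 Hz; lower-pitched voices would need a longer analysis frame.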

After step 204, the G-peak method is applied to detect the pitch time point (step 206). Within one pitch period of a voiced waveform there appears a peak whose shape closely resembles the glottal wave; this peak is called the G-peak (G for "glottal"). In particular, because the G-peak coincides with the instant at which the glottis opens in the voiced waveform, detecting the G-peaks in voiced sound gives the pitch period of the voiced sound as the interval between G-peaks, and the glottal onset point as the position of each peak.
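The description does not spell out how G-peaks are located. The following is a simplified peak-picking sketch under stated assumptions: the pitch period is already known from step 204, the G-peak is the largest sample in each period, and each subsequent peak lies within a tolerance window around one period after the previous peak. The function name and the 20% tolerance are illustrative choices, not values from the patent.

```python
def find_pitch_marks(samples, period, tolerance=0.2):
    """Return sample indices of one dominant peak per pitch period.

    Assumes a voiced segment, a known pitch period in samples, and a
    tolerance below 1.0.
    """
    if not samples or period <= 0:
        return []
    half = max(1, int(period * tolerance))
    # First mark: largest sample within the first pitch period.
    first = max(range(min(period, len(samples))), key=samples.__getitem__)
    marks = [first]
    while True:
        center = marks[-1] + period
        if center >= len(samples):
            break
        lo = center - half
        hi = min(center + half + 1, len(samples))
        marks.append(max(range(lo, hi), key=samples.__getitem__))
    return marks
```

Because each search is centered one period after the previously found peak, small cycle-to-cycle variations in the pitch period are followed rather than accumulated.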

After step 206, the duration of each speech interval is changed according to its class (step 208).

To change the duration of a voiced interval, waveforms are inserted or deleted in units of the pitch period obtained in step 204, with the pitch time point obtained in step 206 as reference, until the desired duration is reached. Lengthening the duration means repeating the waveform of one pitch period; conversely, shortening the duration means removing the waveform of one pitch period while keeping the period of the remaining waveforms unchanged. To lengthen or shorten the current speech waveform, waveforms must be inserted or deleted in pitch-period units while preserving the speech characteristics of the waveform; that is, waveforms of adjacent pitch periods are inserted or deleted. When one pitch-period waveform is repeated or deleted, it is preferable to take the waveform of one pitch period starting from the point at which the glottis opens, so that it joins naturally with the neighboring waveforms. Otherwise, even when waveforms of the same pitch period are joined, the pitch period may change to an unexpected value.
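The voiced-interval procedure can be sketched as follows, assuming the pitch time points are available as a list of sample indices. Each span between consecutive marks is treated as one pitch-period waveform, and whole periods are repeated or dropped to approach the requested scale factor. The function name and the simple index-mapping rule for choosing which period to repeat or drop are assumptions made for illustration.

```python
def change_voiced_duration(samples, marks, scale):
    """Lengthen (scale > 1) or shorten (scale < 1) a voiced segment by
    repeating or dropping whole pitch periods cut at the pitch marks."""
    if len(marks) < 2 or scale <= 0:
        return list(samples)
    periods = [samples[a:b] for a, b in zip(marks, marks[1:])]
    head, tail = samples[:marks[0]], samples[marks[-1]:]
    n_out = max(1, round(len(periods) * scale))
    out = list(head)
    for k in range(n_out):
        # Map each output period back to the nearest earlier source period.
        src = min(int(k / scale), len(periods) - 1)
        out.extend(periods[src])
    out.extend(tail)
    return out
```

Cutting at the pitch marks means every inserted or removed piece starts and ends at the same phase of the glottal cycle, which is the reason the description gives for taking each period from the glottal onset point.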

Likewise, the duration of an unvoiced interval is changed by inserting or deleting frames until the desired duration is reached, and the duration of an unvoiced plosive interval is changed by inserting or deleting silence units until the desired duration is reached.
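The same repeat-or-drop idea can be sketched for the other two classes. The code below assumes that an unvoiced interval is supplied as a list of frames and that an unvoiced plosive is supplied as a list of silence units followed by a burst that is left untouched. That split of the plosive into silence and burst is an interpretation of "silence units", not something the description states, and all names are hypothetical.

```python
def repeat_or_drop_units(units, scale):
    """Return a list of units whose count is about scale times the input count."""
    if not units or scale <= 0:
        return []
    n_out = max(1, round(len(units) * scale))
    return [units[min(int(k / scale), len(units) - 1)] for k in range(n_out)]


def change_unvoiced_duration(frames, scale):
    """Change the duration of an unvoiced interval in whole-frame units."""
    out = []
    for frame in repeat_or_drop_units(frames, scale):
        out.extend(frame)
    return out


def change_plosive_duration(silence_units, burst, scale):
    """Change the duration of an unvoiced plosive by resizing only its silence.

    Assumption: the burst that follows the silence is copied unchanged.
    """
    out = []
    for unit in repeat_or_drop_units(silence_units, scale):
        out.extend(unit)
    out.extend(burst)
    return out
```

Unvoiced frames are noise-like, so repeating or dropping them whole does not create a pitch discontinuity, which is presumably why frame units suffice for this class.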

An embodiment of the present invention is described as follows.

The durations of 250 current-affairs words pronounced by male and female announcers were changed to 50%, 70%, 100%, 150%, and 200%. Twenty randomly selected words were then written down from dictation by ten untrained listeners (five male and five female), and intelligibility was evaluated as shown in Table 1.

In addition, the listeners were asked to mark any obtrusiveness or unnaturalness in the sounds they heard, and naturalness was evaluated as shown in Table 2.

As Tables 1 and 2 show, intelligibility averaged 96.9% and naturalness 96.6%, indicating very good performance.

As described above, the method of changing the duration of a speech signal according to the present invention classifies the sound source before changing the duration and detects the pitch and the pitch time point accurately, so that naturalness and intelligibility remain excellent even when the duration of the speech signal is changed.

FIG. 1 is a block diagram of a conventional speech signal processing apparatus for carrying out the method of changing the duration of a speech signal according to the present invention.

FIG. 2 is a flowchart of the method of changing the duration of a speech signal according to the present invention.

Claims

A method of changing the duration of a speech signal, comprising:

a speech classification step of classifying the speech signal into voiced, unvoiced, and unvoiced plosive intervals in the time domain; and

a changing step of changing the duration of the voiced sound by repeating and removing the voiced sound in pitch-period units with a pitch time point as reference, changing the duration of the unvoiced sound by repeating and removing the unvoiced sound in frame units, and changing the duration of the unvoiced plosive by repeating and removing the unvoiced plosive in silence units.

The method of claim 1, wherein the speech classification step

detects the unvoiced plosive using the peak-valley rate (PVR), the zero-crossing rate (ZCR), the pitch gradient, and the silent interval as parameters.

The method of claim 1, wherein, in said changing step,

the pitch period is detected by a pitch search method, and the pitch time point is detected by a glottal closure instant (GCI) detection method.