KR100484666B1

KR100484666B1 - Voice Color Converter using Transforming Vocal Tract Characteristic and Method

Info

Publication number: KR100484666B1
Application number: KR10-2002-0087997A
Authority: KR
Inventors: 강동규
Original assignee: (주) 코아보이스; 정보통신연구진흥원
Priority date: 2002-12-31
Filing date: 2002-12-31
Publication date: 2005-04-22
Also published as: KR20040061709A

Abstract

The present invention relates to a voice color conversion apparatus and method using vocal tract characteristic transformation. It provides a voice color conversion apparatus, and a corresponding method, that converts voice color by changing the vocal tract spectrum of a standard speaker, comprising: means for extracting linear prediction (LP) based formant information from the voiced speech signals of the standard speaker and the target speaker for each group of similar phonemes, and deriving a spectrum scaling factor SSF(f) that represents the difference between the corresponding formants of the two speakers; means for performing Non-Uniform Spectrum Rescaling (NUSR) on the real and imaginary parts of the fast Fourier transform (FFT) based spectrum of each pitch period of the standard speaker, using SSF(f), which represents the frequency difference between the two speakers' vocal tract spectra, so as to change the standard speaker's vocal tract spectrum into a spectrum close to that of the target speaker; and means for performing an inverse fast Fourier transform (IFFT) to restore the NUSR-processed spectrum to a time-domain speech signal.

Description

Voice Color Converter using Transforming Vocal Tract Characteristic and Method

The present invention relates to voice color conversion, and in particular to voice color conversion by transforming vocal tract characteristics. More specifically, it relates to a voice color conversion apparatus and method that analyze the formant information of a standard speaker and a target speaker to obtain a Spectrum Scaling Factor (SSF) for changing the standard speaker's spectrum into the target speaker's spectrum, perform Non-Uniform Spectrum Rescaling (NUSR) according to the SSF values in the Fast Fourier Transform (FFT) based spectral domain to obtain a spectrum close to that of the target speaker, and then perform an Inverse Fast Fourier Transform (IFFT) to obtain the voice-color-converted speech signal, thereby achieving a high-quality conversion result.

In general, voice color conversion methods are broadly classified into time-domain methods and frequency-domain methods. The time-domain method changes the voice color by convolving a signal with specific characteristics with the speech signal in the time domain, and is commonly called voice color modulation.

This method is widely used because its processing is very simple and can run in real time. However, the quality of the converted speech is degraded and only simple changes are possible, so it cannot produce a variety of voice color conversions.

The frequency-domain method, by contrast, changes the voice color by modifying the spectrum, and can convert the voice color of a standard speaker into that of a target speaker. The Linear Multi-Variate Regression (LMVR) method is widely used for this purpose. In this method, Linear Prediction (LP) coefficients are obtained for the standard speaker and the target speaker, a transition matrix from the standard speaker to the target speaker is derived from these coefficients, and the standard speaker's LP coefficients are multiplied by the transition matrix to yield LP coefficients close to those of the target speaker. LP synthesis with the resulting coefficients then produces the voice-color-converted speech signal.

Although this method is systematic because it rests on a mathematical approach, it uses the LP method for the final speech reconstruction. As a result, stability is degraded by LP analysis errors, and because the characteristics of the excitation source are difficult to reflect, the reconstructed speech does not rise above the level of machine-like sound.

The present invention was devised to solve the above problems of the prior art. Its first object is to improve the stability of voice color conversion: when the mathematically grounded LP method is used as the analysis method for extracting the feature parameters for conversion, the invention extracts and uses SSF(f), a feature parameter that expresses the frequency difference between the two speakers' spectra and serves to convert the spectrum toward the target speaker, instead of an inter-speaker transition matrix.

A second object is to overcome the loss of sound quality during speech reconstruction. To allow reconstruction at a quality close to the original sound, the spectrum is modified in the FFT-based frequency domain and the speech is reconstructed by an IFFT, so that the final converted speech retains high quality. That is, NUSR is performed in the FFT-based spectral domain according to the SSF calculated by LP analysis of the two speakers' speech, yielding a spectrum close to that of the target speaker, and an IFFT is then performed to obtain the voice-color-converted speech signal, so that high-quality converted speech can be reconstructed.

To achieve these objects, the voice color conversion apparatus using vocal tract characteristic transformation according to the present invention is an apparatus that converts voice color by changing the vocal tract spectrum of a standard speaker, and is characterized by comprising: means for extracting LP-based formant information from the voiced speech signals of the standard speaker and the target speaker for each group of similar phonemes, and deriving a Spectrum Scaling Factor SSF(f) that represents the difference between the corresponding formants of the two speakers; means for performing Non-Uniform Spectrum Rescaling (NUSR) on the real and imaginary parts of the FFT-based spectrum of each pitch period of the standard speaker, using SSF(f), which represents the frequency difference between the two speakers' vocal tract spectra, so as to change the standard speaker's vocal tract spectrum into a spectrum close to that of the target speaker; and means for performing an Inverse Fast Fourier Transform (IFFT) to restore the NUSR-processed spectrum to a time-domain speech signal.

The method according to the present invention is characterized by comprising: obtaining a Spectrum Scaling Factor (SSF), a feature parameter representing the scale difference between the vocal tract spectra of the two speakers; performing Non-Uniform Spectrum Rescaling (NUSR) in the FFT domain using the SSF; and restoring the voice-color-converted speech signal by applying an Inverse Fast Fourier Transform (IFFT) to the spectrum modified by NUSR.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram showing the configuration of a voice color conversion system to which the present invention is applied. Referring to FIG. 1, the invention is applied to a general-purpose computer (100) that includes a voice color conversion device (200), or to a device with equivalent functions, which performs the voice color conversion and plays back the converted speech. The converted digital speech signal is converted into an analog speech signal by a digital-to-analog converter (300) and reproduced as sound through a speaker (400).

The operation of the embodiment configured as above is described below.

First, FIG. 2 is an exemplary diagram showing the principle of resonance when the vocal tract is modeled as a tube of length L. As shown, the frequencies that resonate in a tube of length L can be defined by [Equation 1].

Here, F(i) is in Hz, i = 0, 1, 2, 3, ..., and C is the speed of sound.
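Equation 1 itself did not survive in this text. A plausible reconstruction, assuming the standard model of a uniform tube closed at the glottis and open at the lips, is given here; this is an assumption, although it reproduces the 500, 1500, and 2500 Hz values that the description quotes for a 17 cm vocal tract when C is taken as about 340 m/s.

```latex
F(i) = \frac{(2i + 1)\,C}{4L}, \qquad i = 0, 1, 2, 3, \ldots
```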

For example, when the vocal tract length is 17 cm, the first resonance frequency is 500 Hz, the second is 1500 Hz, and the third is 2500 Hz; when the length is 15 cm, the resonance frequencies are approximately 566 Hz, 1698 Hz, and 2830 Hz, respectively.
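The figures above can be checked with a short calculation. The sketch below assumes the quarter-wavelength tube model F(i) = (2i + 1)C / (4L) and a speed of sound of 340 m/s; neither the constant nor the function name `tube_resonances` comes from the patent.

```python
SPEED_OF_SOUND_M_S = 340.0  # assumed value; the description does not state C


def tube_resonances(length_m, count=3, c=SPEED_OF_SOUND_M_S):
    """Resonances F(i) = (2i + 1) * c / (4 * L) of a uniform closed-open tube, in Hz."""
    return [(2 * i + 1) * c / (4.0 * length_m) for i in range(count)]


for length_cm in (17, 15):
    freqs = tube_resonances(length_cm / 100.0)
    print(length_cm, "cm:", [round(f) for f in freqs])
```

For 17 cm this gives 500, 1500, and 2500 Hz. For 15 cm it gives about 567, 1700, and 2833 Hz; the quoted 566, 1698, and 2830 Hz are what results from truncating the first resonance to 566 Hz and multiplying by 3 and 5.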

Thus, as the vocal tract length changes, the resonance frequencies in the vocal tract shift more at the high end than at the low end.

FIG. 3 is an exemplary diagram showing the processing for the same phoneme spoken by the standard speaker and the target speaker according to the present invention.

Referring to FIG. 3, the vocal tract length varies with the phoneme even for the same speaker, so better performance can be expected when phonemes with similar vocal tract lengths are grouped into phoneme groups and each group is processed separately.

To this end, speech from the standard speaker and the target speaker is first input, and the spectrum and formant information are extracted by LP analysis over several pitch periods (S100 to S101).
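The patent does not say how formants are read off the LP analysis. One common approach, sketched here as an assumption, is to compute LP coefficients with the autocorrelation method (Levinson-Durbin recursion) and take the local maxima of the LP spectral envelope as formant candidates. The function names and the toy resonator used as input are hypothetical.

```python
import cmath
import math


def lpc(signal, order):
    """Return LP coefficients [1, a1, ..., ap] (autocorrelation method, Levinson-Durbin)."""
    n = len(signal)
    r = [sum(signal[t] * signal[t + k] for t in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_formants(a, fs, n_points=512):
    """Frequencies (Hz) of the local maxima of the LP envelope 1/|A(e^{jw})|."""
    freqs = [k * (fs / 2.0) / n_points for k in range(n_points + 1)]
    env = []
    for f in freqs:
        w = 2.0 * math.pi * f / fs
        denom = sum(c * cmath.exp(-1j * w * m) for m, c in enumerate(a))
        env.append(1.0 / abs(denom))
    return [freqs[k] for k in range(1, n_points) if env[k - 1] < env[k] >= env[k + 1]]


def resonator_impulse_response(freq_hz, fs, radius=0.95, length=400):
    """Toy input only: impulse response of a two-pole resonator near freq_hz."""
    a1 = -2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = radius * radius
    h = []
    for n in range(length):
        x = 1.0 if n == 0 else 0.0
        y = x - a1 * (h[n - 1] if n >= 1 else 0.0) - a2 * (h[n - 2] if n >= 2 else 0.0)
        h.append(y)
    return h


frame = resonator_impulse_response(1000.0, fs=8000.0)
print(lp_formants(lpc(frame, order=2), fs=8000.0))
```

For the toy resonator the sketch reports a single peak close to 1000 Hz. Real voiced speech would need a higher LP order and some screening of spurious peaks, which is outside this sketch.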

Next, the formant information of the two speakers is analyzed to calculate the SSF (S102). As shown in FIGS. 4 and 5, the formant positions in the voiced-sound spectra extracted by LP analysis are compared to calculate SSF(f), which represents the frequency difference between the standard speaker's spectrum and the target speaker's spectrum. In speech production, the vocal tract is excited by the vibration of the glottis, and the speech is shaped by the characteristic resonances determined by the length and shape of the vocal tract. The resonance frequencies that appear in the vocal tract are called formants and are referred to, from lowest to highest, as the first formant, second formant, third formant, and so on. When the spectra of a standard speaker (L1) and a target speaker (L2) with different vocal tract lengths are drawn as in FIG. 4(a), SSF(F_L1(i)), which represents the difference at the i-th formant, can be expressed as in [Equation 2]. As explained with reference to FIG. 2, the difference is larger at higher frequencies than at lower ones.

Here, L1 and L2 are the vocal tract lengths of the two speakers in cm, and i = 0, 1, 2, 3, ....
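Equation 2 is also absent from this text. Assuming the quarter-wavelength tube model F(i) = (2i + 1)C / (4L) for both speakers, and choosing the sign so that SSF is positive when the target speaker's vocal tract is shorter (as the description of NUSR states), one consistent reconstruction is:

```latex
SSF\bigl(F_{L1}(i)\bigr) = F_{L2}(i) - F_{L1}(i)
  = \frac{(2i + 1)\,C}{4}\left(\frac{1}{L_2} - \frac{1}{L_1}\right),
  \qquad i = 0, 1, 2, 3, \ldots
```

With L1 and L2 in cm, C would have to be expressed in cm/s. Under this form the difference grows in proportion to (2i + 1), which agrees with the statement that higher formants differ more than lower ones.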

Differences in voice color between people arise mainly from differences in vocal tract length, and therefore appear most clearly in voiced sounds. However, SSF(F_L1(i)) can be estimated only at the position of each i-th formant, so for stable processing over the entire frequency range it is preferable to obtain a straight-line equation. To model SSF(F_L1(i)) as a linear equation SSF(f) over the whole frequency axis, SSF(F_L1(i)) is computed from a large amount of speech for each group of phonemes with similar vocal tract lengths, and the frequency difference at each formant is then modeled statistically. This yields a linear function SSF(f) defined at all frequencies, as shown in FIG. 4(b).
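The statistical modeling is not specified further. A minimal sketch, assuming an ordinary least-squares line through the per-formant differences, is shown below; the function names are hypothetical, and the example uses the 17 cm and 15 cm resonance values quoted in the description as stand-ins for measured formants.

```python
def ssf_at_formants(source_formants, target_formants):
    """Pairs (F_L1(i), F_L2(i) - F_L1(i)): the SSF sampled at each source formant."""
    return [(f_src, f_tgt - f_src) for f_src, f_tgt in zip(source_formants, target_formants)]


def fit_linear_ssf(points):
    """Least-squares line SSF(f) = slope * f + intercept through (f, ssf) points."""
    n = len(points)
    mean_f = sum(f for f, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    sxx = sum((f - mean_f) ** 2 for f, _ in points)
    sxy = sum((f - mean_f) * (s - mean_s) for f, s in points)
    slope = sxy / sxx
    return slope, mean_s - slope * mean_f


# Resonance values quoted in the description for 17 cm (source) and 15 cm (target).
points = ssf_at_formants([500.0, 1500.0, 2500.0], [566.0, 1698.0, 2830.0])
slope, intercept = fit_linear_ssf(points)
print(slope, intercept)
```

Pooling points from many utterances of the same phoneme group only lengthens the `points` list; the fit itself is unchanged.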

Next, an FFT is performed on one pitch period of the standard speaker to obtain its spectrum (S103). As shown in FIG. 6, the target speaker's spectrum SP_T(f) can be expressed in terms of the standard speaker's spectrum SP_S(f) and SSF(f) as in [Equation 3].

[Equation 3]  SP_T(f) = SP_S(f + SSF(f))

Here, f is the frequency.

When SSF is positive, the target speaker's vocal tract is shorter than the standard speaker's and the standard speaker's spectrum is stretched; wherever this leaves an intermediate spectral value undefined, interpolation must be performed. Conversely, when SSF is negative, the target speaker's vocal tract is longer than the standard speaker's and the standard speaker's spectrum is compressed; wherever intermediate spectral values overlap, decimation must be performed. This process is called Non-Uniform Spectrum Rescaling (NUSR).

In the step of performing NUSR in the FFT domain using SSF(f) (S104), the FFT of one pitch period of the standard speaker's voiced sound is computed, NUSR is applied to its real part and imaginary part according to SSF(f) by [Equation 2], and interpolation or decimation is performed.
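A minimal sketch of NUSR on the one-sided FFT bins follows, under stated assumptions. SSF(f) is taken as the line `slope * f + intercept`. Each bin is moved from f to f + SSF(f), which is one reading of the stretch and compress description, and the real and imaginary parts are then resampled separately onto the uniform bin grid by linear interpolation. The name `nusr`, its parameters, and the choice of linear interpolation are illustrative and not taken from the patent; for a compressed spectrum the resampling simply skips source bins, which stands in for the decimation step.

```python
def nusr(spectrum, bin_hz, slope, intercept):
    """Warp one-sided FFT bins from f to f + SSF(f), then resample on the uniform grid.

    spectrum: complex bins 0..N/2 (at least two), bin k at frequency k * bin_hz.
    SSF(f) = slope * f + intercept (assumed linear model).
    """
    n = len(spectrum)
    if 1.0 + slope <= 0.0:
        raise ValueError("the mapping f -> f + SSF(f) must be increasing")
    # Position (Hz) that every source bin is moved to.
    warped = [k * bin_hz + (slope * k * bin_hz + intercept) for k in range(n)]
    out = [0j] * n
    k = 0
    for m in range(n):
        f = m * bin_hz
        if f < warped[0] or f > warped[-1]:
            continue  # no source bin lands here; leave the output bin at zero
        while k < n - 2 and warped[k + 1] < f:
            k += 1
        t = (f - warped[k]) / (warped[k + 1] - warped[k])
        # Real and imaginary parts are interpolated independently.
        re = (1.0 - t) * spectrum[k].real + t * spectrum[k + 1].real
        im = (1.0 - t) * spectrum[k].imag + t * spectrum[k + 1].imag
        out[m] = complex(re, im)
    return out


# Toy spectrum with a single peak at bin 10 (1 Hz per bin).
toy = [0j] * 33
toy[10] = 1 + 1j
stretched = nusr(toy, bin_hz=1.0, slope=0.5, intercept=0.0)
compressed = nusr(toy, bin_hz=1.0, slope=-0.5, intercept=0.0)
print(max(range(33), key=lambda m: abs(stretched[m])),
      max(range(33), key=lambda m: abs(compressed[m])))
```

With a positive slope the peak moves up in frequency and its neighbors are filled by interpolation; with a negative slope it moves down and the bins above the highest warped position stay at zero.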

NUSR using the SSF thus changes the standard speaker's spectrum so that it approaches the target speaker's spectrum. An IFFT is then performed on the modified spectrum to restore the speech signal, and the procedure is applied to the entire speech signal to generate a continuous voice-color-converted speech signal (S105). As shown in FIG. 7, performing NUSR on the standard speaker's speech yields a spectrum close to that of the target speaker; the IFFT converts it back into a time-domain speech signal, and repeating these steps produces continuous speech. The reconstructed speech exhibits vocal tract characteristics close to those of the target speaker while retaining high quality.
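The transform, rescale, and inverse-transform loop can be sketched as follows. This is an illustration under assumptions, not the patented implementation: a plain O(N^2) DFT stands in for the FFT so that the block needs only the standard library, `rescale` is a placeholder for whatever NUSR routine is used on bins 0..N/2, and converted pitch periods are simply concatenated because the text describes no overlap or smoothing step.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform (direct form; an FFT would give the same result)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def convert_pitch_period(samples, rescale):
    """Transform one pitch period, rescale bins 0..N/2, rebuild symmetry, invert."""
    n = len(samples)
    if n % 2:
        raise ValueError("this sketch assumes an even number of samples")
    half = rescale(dft(samples)[: n // 2 + 1])
    full = [0j] * n
    full[0] = complex(half[0].real, 0.0)            # DC bin must be real
    full[n // 2] = complex(half[n // 2].real, 0.0)  # Nyquist bin must be real
    for k in range(1, n // 2):
        full[k] = half[k]
        full[n - k] = half[k].conjugate()  # Hermitian symmetry keeps the output real
    return [v.real for v in idft(full)]


def convert_signal(pitch_periods, rescale):
    """Convert each pitch period in turn and concatenate the results."""
    out = []
    for period in pitch_periods:
        out.extend(convert_pitch_period(period, rescale))
    return out


period = [math.sin(2.0 * math.pi * 3 * t / 16) for t in range(16)]
unchanged = convert_signal([period, period], rescale=lambda bins: bins)
print(len(unchanged))
```

Passing the identity function as `rescale` returns the input samples unchanged (up to rounding), which is a convenient check that the symmetry handling is right before a real rescaling step is plugged in.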

FIG. 8 is an exemplary diagram showing the processing for standard-speaker speech that is labeled with phonemes and phoneme groups. It differs from FIG. 3 in two respects: the LPC database (DB) for each phoneme group, which holds the voice color parameters of the target speaker, is computed and stored in advance, and each phoneme of the standard speaker's speech is already labeled with its phoneme group.

FIG. 9 is an exemplary diagram showing the processing for standard-speaker speech whose phonemes and phoneme groups are not known. It differs from FIG. 8 in that the phoneme group of each phoneme in the standard speaker's speech is unknown, so a step (S300) of recognizing the phonemes of the input speech and identifying their phoneme groups is added.

As described above, the voice color conversion apparatus and method using vocal tract characteristic transformation according to the present invention use SSF(f), a linear function on the frequency axis, in the step of extracting the feature parameter for voice color conversion, which reduces the loss of stability caused by analysis errors and yields a more stable conversion. In addition, the standard speaker's vocal tract characteristics are changed by performing NUSR in the spectral domain obtained by an FFT of the standard speaker's speech and then reconstructing the speech by an IFFT, so that high-quality speech can be reconstructed.

FIG. 1 is an exemplary diagram showing the configuration of a voice color conversion system to which the present invention is applied.

FIG. 2 is an exemplary diagram showing the principle of resonance when the vocal tract is modeled as a tube of length L.

FIG. 3 is an exemplary diagram showing the processing for the same phoneme spoken by the standard speaker and the target speaker according to the present invention.

FIG. 4 is a graph showing the relationship between the spectra and SSF(f) when the target speaker's vocal tract is shorter than the standard speaker's.

FIG. 5 is a graph showing the relationship between the spectra and SSF(f) when the target speaker's vocal tract is longer than the standard speaker's.

FIG. 6 is a graph showing the relationship when NUSR is performed in the FFT frequency domain using SSF(f) according to the present invention.

FIG. 7 is a graph showing the result of restoring the NUSR-processed spectrum to speech according to the present invention.

FIG. 8 is an exemplary diagram showing the processing for standard-speaker speech that is labeled with phonemes and phoneme groups.

FIG. 9 is an exemplary diagram showing the processing for standard-speaker speech whose phonemes and phoneme groups are not known.

*** Explanation of reference numerals for the main parts of the drawings ***

100: general-purpose computer
200: voice color conversion device
300: digital-to-analog converter
400: speaker

Claims

A voice color conversion apparatus using vocal tract characteristic transformation, which converts voice color by changing the vocal tract spectrum of a standard speaker, the apparatus comprising:

means for extracting linear prediction (LP) based formant information from the voiced speech signals of the standard speaker and the target speaker for each group of similar phonemes, and deriving a Spectrum Scaling Factor (SSF) representing the difference between the corresponding formants of the two speakers;

means for performing Non-Uniform Spectrum Rescaling (NUSR) on the real part and the imaginary part of the Fast Fourier Transform (FFT) based spectrum of each pitch period of the standard speaker, using SSF(f), which represents the frequency difference between the two speakers' vocal tract spectra, so as to change the standard speaker's vocal tract spectrum into a spectrum close to that of the target speaker; and

means for performing an Inverse Fast Fourier Transform (IFFT) to restore the NUSR-processed spectrum to a time-domain speech signal.

The apparatus of claim 1,

further comprising means for modeling the function that represents the difference between the vocal tract characteristics of the standard speaker and the target speaker as SSF(f), a linear function on the frequency axis, from statistical data extracted over a large amount of speech for each phoneme group, so that the vocal tract characteristics can be changed more stably.

The apparatus of claim 1,

further comprising means for performing NUSR using SSF(f), a linear function on the frequency axis, to change the vocal tract spectrum of the standard speaker into a spectrum close to the vocal tract spectrum of the target speaker, and for using the IFFT so that the restored speech retains high quality.

A voice color conversion method using vocal tract characteristic transformation, comprising:

obtaining a Spectrum Scaling Factor (SSF), a feature parameter representing the scale difference between the vocal tract spectra of two speakers;

performing Non-Uniform Spectrum Rescaling (NUSR) in the FFT domain using the SSF; and

restoring the voice-color-converted speech signal by applying an Inverse Fast Fourier Transform (IFFT) to the spectrum modified by NUSR.

The method of claim 4, wherein obtaining the SSF comprises:

obtaining formant information for the two speakers through linear prediction (LP) analysis; and obtaining SSF(f), which represents, for each frequency, the frequency difference by which the standard speaker's spectrum can be converted into a spectrum close to that of the target speaker.

The method of claim 4, wherein performing NUSR in the FFT domain using SSF(f) comprises:

a first step of calculating a vocal tract spectrum by performing an FFT on one pitch period of the standard speaker's speech; and

a second step of performing interpolation or decimation on the real part and the imaginary part of the calculated vocal tract spectrum, according to whether SSF(f) stretches or compresses it.

The method of claim 6, wherein, in the second step,

interpolation is performed when the value of SSF(f) is positive ("+") and SSF(f) is not defined at the frequency f.

The method of claim 6, wherein, in the second step,

decimation is performed when the value of SSF(f) is negative ("-") and SSF(f) is not defined at the frequency f.

The method of claim 4, wherein restoring the vocal tract spectrum modified by NUSR into a speech signal by the IFFT comprises:

generating a spectrum close to that of the target speaker by performing NUSR on the speech signal of the standard speaker; and

performing an IFFT to convert the generated spectrum back into a time-domain speech signal.