KR20030031936A

KR20030031936A - Multiple Speech Synthesizer using Pitch Alteration Method

Info

Publication number: KR20030031936A
Application number: KR1020030009198A
Authority: KR
Inventors: 배명진; 박현영
Original assignee: 배명진
Priority date: 2003-02-13
Filing date: 2003-02-13
Publication date: 2003-04-23
Also published as: WO2004072951A1

Abstract

PURPOSE: To provide a synthesizer that turns a single voice into multiple voices by changing its pitch. CONSTITUTION: The pitch is detected by applying the amplitude and period characteristics of the glottal signal. Using the PSOLA (Pitch Synchronous Overlap and Add) pitch changing method, the detected pitch is stretched to 140% and 120% and compressed to 80% and 60%. The stretched and compressed versions are combined with a slight delay between them, so that a multiple-voice synthesized sound is generated. In hardware, the analog voice signal waveform is amplified by an amplifier, passed through a low-pass filter to prevent aliasing, and then quantized and coded by an analog-to-digital converter, which converts it into a digital signal in PCM (Pulse Code Modulation) form. The digital signal is processed by software or firmware on a CPU or a digital signal processor.

Description

Multi Speech Synthesizer using Pitch Alteration Method

The present invention synthesizes a single voice into multiple voices by changing its pitch, and belongs to the field of voice communication technology or audio signal processing. Existing technology takes one person's voice as input, changes its pitch, and resynthesizes it as a single voice rather than as multiple voices. It therefore cannot synthesize a variety of voices.

The present invention remedies this shortcoming and synthesizes a variety of voices.

The present invention proposes a synthesizer that turns a single voice into multiple voices by changing the pitch, an important parameter of speech. FIG. 1 shows a general speech production model. The input that travels from the lungs through the vocal cords into the vocal tract can be divided into two types: voiced sound can be modeled as an impulse train based on the pitch period, and unvoiced sound as random noise. The signal obtained by switching between these two sources is multiplied by a gain that depends on the energy of the input signal, and passing it through a filter that models the vocal tract produces the speech signal. Analyzing a speech signal according to this production model shows that it consists of excitation information, which conveys the speaker's individuality and emotion, and formant information of the vocal tract filter, which conveys the intended content. Pitch, which represents the excitation information, is generated by the periodic vibration of the vocal cords. It is a parameter to which human hearing is very sensitive, it is used to distinguish the speaker of a speech signal, and it strongly affects the naturalness of the speech signal. Changing the pitch, which carries this prosodic information, makes it possible to produce a variety of synthesized sounds.
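The speech production model described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the 8 kHz sampling rate, the single 500 Hz resonance, and the gain values are all assumptions chosen for the example.

```python
import math
import random

def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def vocal_tract(x, gain, a):
    """All-pole filter: y[n] = gain * x[n] - sum_k a[k] * y[n - 1 - k]."""
    y = []
    for n, xn in enumerate(x):
        acc = gain * xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

# One resonance (formant) near 500 Hz at fs = 8000 Hz; values are illustrative.
fs = 8000
r, f = 0.95, 500.0
a = [-2 * r * math.cos(2 * math.pi * f / fs), r * r]

voiced_frame = vocal_tract(excitation(240, True, pitch_period=80), 0.5, a)
unvoiced_frame = vocal_tract(excitation(240, False), 0.1, a)
```

Changing `pitch_period` alters only the excitation while the filter coefficients `a` stay fixed, which is the separation between pitch and formants that the text relies on.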

FIG. 1 is a block diagram illustrating a conventional speech production model.

FIG. 2 is a block diagram of a general pitch change system.

FIG. 3 is a block diagram of the pitch change system applied to the present invention.

FIG. 4 is a block diagram of the pitch point detection method applied to the present invention.

FIG. 5 illustrates the pitch change method (PSOLA synthesis) applied to the present invention.

FIG. 6 is a hardware configuration diagram of the multi-voice synthesis system.

FIG. 7 is a software flow chart of the multi-voice synthesis system.

The pitch change system is configured as shown in FIG. 2. The analysis stage detects the pitch of the original signal input through the microphone and the pitch of the target signal, and passes them to the change rule generation stage. The change rule generation stage uses them to determine the pitch change ratio and a suitable pitch change method. This pitch change rule is supplied to the pitch change stage, which applies the selected method to change the pitch of the original signal by the change ratio, and the synthesis stage uses the result to generate the pitch-changed synthesized speech. This process requires both an accurate pitch detection technique and a low-distortion pitch change technique. Many pitch detection methods for speech signals have been proposed over the last 40 years (see references). For example, the autocorrelation method is widely used: it computes the correlation between neighboring segments of the speech waveform to detect the period of the repeating waveform (see references). The pitch is changed only after it has been detected reliably, using the detected pitch as the basis. Many pitch change methods have also been proposed (see references). One example is the PSOLA (Pitch Synchronous Overlap and Add) method, which segments the speech waveform in the time domain into wide segments in units of the pitch period and then reconstructs the waveform by overlapping and adding them at the changed pitch period (see references).
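The autocorrelation approach mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the function name `autocorr_pitch`, the lag search range, and the synthetic test tone are all hypothetical, and the patent itself uses a different detector for the invention.

```python
import math

def autocorr_pitch(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest normalized
    autocorrelation, or None for a silent frame."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return None
    best_lag, best_r = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag

# Synthetic voiced frame: fundamental with a 50-sample period plus one harmonic.
frame = [math.sin(2 * math.pi * n / 50) + 0.5 * math.sin(4 * math.pi * n / 50)
         for n in range(400)]
period = autocorr_pitch(frame, min_lag=20, max_lag=150)
```

At an assumed sampling rate of 8 kHz, a 50-sample period corresponds to a 160 Hz fundamental. Restricting the lag range to plausible pitch periods helps avoid picking a harmonic or a multiple of the true period.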

FIG. 3 is a block diagram of the pitch change system used in the present invention. For pitch detection, the invention uses the detection method suited to prosody control shown in FIG. 4. First, the signal, with its high-frequency region emphasized by a pre-emphasis filter, is passed through the inverse of the filter described by the linear prediction coefficients. Pitch detection is then performed by applying the amplitude and period characteristics of the glottal signal obtained for each analysis frame (see references). With the pitch detected in this way, the PSOLA pitch change method shown in FIG. 5 is used to produce versions with the pitch stretched to 140% and 120% and compressed to 80% and 60%. Combining these versions with a slight delay between them generates the multiple-voice synthesized sound.
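The multi-voice step can be sketched as a simplified time-domain PSOLA followed by a delayed mix. Several points are assumptions rather than details from the patent: the pitch marks are taken as already detected, the percentages are interpreted as scale factors on the pitch period, the delay is 80 samples per voice, and the helper names `psola` and `mix_with_delays` are hypothetical. The sketch also omits the amplitude normalization a production PSOLA would apply when grains overlap more or less densely.

```python
import math

def hann(length):
    """Symmetric Hann window; copies spaced (length - 1) / 2 apart sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1)) for i in range(length)]

def psola(x, marks, period_scale):
    """Overlap-add two-period grains so the pitch period is multiplied by
    period_scale. marks are pitch-mark sample indices (at least two)."""
    out = [0.0] * len(x)
    t = float(marks[0])                      # synthesis time of the next grain
    while t <= marks[-1]:
        # analysis mark closest to the current synthesis time
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - t))
        period = marks[i + 1] - marks[i] if i + 1 < len(marks) else marks[i] - marks[i - 1]
        window = hann(2 * period + 1)
        center = int(round(t))
        for j in range(-period, period + 1):
            src, dst = marks[i] + j, center + j
            if 0 <= src < len(x) and 0 <= dst < len(out):
                out[dst] += window[j + period] * x[src]
        t += period * period_scale
    return out

def mix_with_delays(voices, delay_step):
    """Average the voices, delaying the k-th voice by k * delay_step samples."""
    length = max(len(v) + k * delay_step for k, v in enumerate(voices))
    out = [0.0] * length
    for k, v in enumerate(voices):
        for n, s in enumerate(v):
            out[n + k * delay_step] += s / len(voices)
    return out

x = [math.sin(2 * math.pi * n / 50) for n in range(1000)]   # period of 50 samples
marks = list(range(50, 1000, 50))                           # assumed pitch marks
voices = [psola(x, marks, s) for s in (1.4, 1.2, 0.8, 0.6)]
chorus = mix_with_delays(voices, delay_step=80)
```

Because each grain is taken from the analysis mark nearest to the current synthesis time, the duration of the utterance is preserved while the spacing of the grains, and hence the perceived pitch, changes.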

[Configuration of Hardware Device]

FIG. 6 shows the apparatus that receives the analog voice signal (600) from a microphone, changes its pitch, and synthesizes it into multiple voices. The voice signal waveform (600), input in analog form, is amplified by the amplifier (601) and passed through the low-pass filter (602) to prevent aliasing. It then passes through the analog-to-digital converter (603), which performs quantization and coding, and is thereby converted into a digital signal in linear pulse code modulation (PCM) form. This signal is processed (604) by software or firmware on a general-purpose CPU or a digital signal processor (DSP).
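The quantization and coding performed by the analog-to-digital converter can be illustrated in software. This is only a sketch: the 16-bit word length and the function name `to_linear_pcm` are assumptions, since the patent does not state the resolution of the converter.

```python
def to_linear_pcm(samples, bits=16):
    """Quantize floats in [-1, 1] to signed integer codes (linear PCM)."""
    full_scale = 2 ** (bits - 1) - 1
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))            # clip out-of-range input
        codes.append(int(round(s * full_scale)))
    return codes

codes = to_linear_pcm([0.0, 1.0, -1.0, 2.0])
```

The low-pass filter ahead of the converter has to remove components above half the sampling rate, because any such component would fold back into the band once the signal is sampled.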

During signal processing, the processor (604) may access peripheral devices (609) installed internally or externally, and may also use the peripheral memory (605) to store the input digital signal or the processing results.

The digital signal that the software on the CPU has pitch-changed and synthesized into multiple voices is converted into a sampled analog signal by the digital-to-analog converter (608). Passing this signal through the low-pass filter (607) yields an analog signal with the quantization noise removed, and suitable amplification (606) yields an analog signal (610) that can be heard through a speaker or similar device.

[Software Process]

The multi-voice synthesizer using the pitch change method adds software or firmware that applies multiple pitch changes in place of the conventional single pitch change. FIG. 7 shows the software flow chart of the multi-voice synthesizer used in the present invention.

The data samples (701) input from the analog-to-digital converter (ADC) are processed together one frame at a time. First, it is determined whether the data in the current frame belong to a voiced section. If they do not (703), the occupancy of the ring buffer (Buffer Rate, BR) is calculated. The ring buffer (710) is the memory buffer in which processed data wait for output.

The ring buffer occupancy (BR) expresses, as a ratio, how long processed data wait in the ring buffer. If the current frame is a non-voiced section and the waiting time in the ring buffer exceeds a set value (for example, BT = 1.5 or more), the processing time of the utterance is shortened (708) so that processing catches up. This removes the processing delay that arises when multiple pitch changes are performed. In other words, data are output slowly in voiced sections so that the pitch change proceeds smoothly, and quickly in non-voiced sections so that the overall time delay is eliminated.
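The buffer-occupancy rule can be sketched as below. The text does not define BR precisely, so this example assumes it is the number of queued samples divided by a nominal queue length, and it assumes that "shortening" means keeping every other sample of a non-voiced frame. Both choices, and the names `buffer_rate` and `enqueue`, are hypothetical.

```python
from collections import deque

BT = 1.5  # example threshold quoted in the text

def buffer_rate(ring, nominal_len):
    """Occupancy ratio BR: queued samples relative to a nominal queue length
    (assumed definition)."""
    return len(ring) / nominal_len

def enqueue(ring, frame, voiced, nominal_len, bt=BT):
    """Append a processed frame; shorten non-voiced frames while the buffer
    is over-full. Returns the number of samples written."""
    if not voiced and buffer_rate(ring, nominal_len) >= bt:
        frame = frame[::2]          # hypothetical shortening: keep every other sample
    ring.extend(frame)
    return len(frame)

ring = deque()
frame = [0.0] * 160
for _ in range(3):
    enqueue(ring, frame, voiced=True, nominal_len=160)         # BR grows to 3.0
written = enqueue(ring, frame, voiced=False, nominal_len=160)  # 80 samples written
```

Voiced frames are never shortened in this sketch, which matches the stated intent of letting the pitch change proceed undisturbed and recovering the delay only in non-voiced sections.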

Many methods (702) for determining whether the current frame is a voiced or a non-voiced section are described in speech processing textbooks (see references); for example, it can easily be determined by measuring the energy level. That is, if the average energy of the current frame is at or below a set threshold, the frame is a non-voiced section.
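The energy test described above amounts to a one-line comparison. In this sketch the function name `is_voiced` and the threshold value are assumptions; a real system would tune the threshold to the input level.

```python
def is_voiced(frame, threshold):
    """Treat the frame as voiced only if its average energy exceeds the threshold."""
    average_energy = sum(s * s for s in frame) / len(frame)
    return average_energy > threshold

loud = is_voiced([0.5] * 160, threshold=0.01)
quiet = is_voiced([0.01] * 160, threshold=0.01)
```

Energy alone cannot separate loud unvoiced sounds from voiced ones, so it is best read as the simple example the text offers rather than a complete detector.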

If the input data belong to a voiced section, the pitch period must be detected using the pitch point detection method (705). Many pitch period detection methods for speech signals have been proposed over the last 40 years (see references). For example, the autocorrelation method is widely used: it computes the correlation between neighboring segments of the speech waveform to detect the period of the repeating waveform (see references).

The present invention uses the detection method suited to prosody control described above.
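The front end of that detection method, pre-emphasis followed by inverse filtering with the linear prediction coefficients, can be sketched as follows. The pre-emphasis coefficient 0.95, the prediction order 10, and the synthetic test frame are assumptions, and the later step that picks pitch points from the amplitude and period characteristics of the residual is not shown.

```python
def pre_emphasis(x, alpha=0.95):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n - 1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def autocorr(x, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(max_lag + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion. Returns (a, err) with a[0] = 1 so that
    A(z) = sum_k a[k] z^-k is the prediction-error (inverse) filter."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a, err = new_a, err * (1.0 - k * k)
    return a, err

def inverse_filter(x, a):
    """Residual e[n] = sum_k a[k] * x[n - k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

# Synthetic voiced frame: impulse train (period 80) through one resonance.
frame, y1, y2 = [], 0.0, 0.0
for n in range(320):
    y = (1.0 if n % 80 == 0 else 0.0) + 1.7554 * y1 - 0.9025 * y2
    frame.append(y)
    y1, y2 = y, y1

emphasized = pre_emphasis(frame)
a, err = levinson(autocorr(emphasized, 10), 10)
residual = inverse_filter(emphasized, a)
```

Inverse filtering removes most of the vocal-tract resonance, so the residual tends to concentrate its energy near the excitation instants, which is what makes it a convenient signal for locating pitch points.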

In addition, to limit the change of intonation within a voiced section to a certain extent (for example, within a factor of 1.5), the pitch periods of consecutive voiced sections are detected and the change per frame is computed. If the change is large, the pitch period is modified to stabilize the voice (706). The pitch period is changed only after it has been detected reliably, using the detected period as the basis. Many methods of changing the pitch period have been proposed (see references). The present invention performs the multiple pitch changes with the PSOLA (Pitch Synchronous Overlap and Add) method (see references), which segments the speech waveform in the time domain into wide segments in units of the pitch period and then reconstructs the waveform by overlapping and adding them at the changed pitch period.
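The intonation limit can be illustrated with a simple clamp on the frame-to-frame pitch period. The text gives only the example factor of 1.5, so the clamping rule and the name `limit_period_change` are assumptions about how the stabilization step (706) might work.

```python
def limit_period_change(periods, max_ratio=1.5):
    """Clamp each pitch period to within max_ratio of the previous
    (already limited) period."""
    limited = []
    for p in periods:
        if limited:
            prev = limited[-1]
            p = min(max(p, prev / max_ratio), prev * max_ratio)
        limited.append(p)
    return limited

smoothed = limit_period_change([80, 160, 40])
```

In this example the jump from 80 to 160 samples is capped at 120, and the following drop to 40 is held at 80, so no frame differs from its predecessor by more than a factor of 1.5.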

The processed speech data are stored in the ring buffer (709) and, in the order stored, output sample by sample through the digital-to-analog converter (DAC) to the speakerphone (710). The multi-voice synthesizer operates in real time. That is, the processing (709) of a frame received from the analog-to-digital converter (ADC) (701) must finish before the data of the next frame arrive.

[References]

[1] Myung-Jin Bae and Sang-Hyo Lee, Digital Speech Analysis, Dong Young Publishing Co., 1998.

[2] Myung-Jin Bae, Digital Speech Synthesis, Dong Young Publishing Co., 1999.

[3] Myung-Jin Bae, Digital Speech Coding, Dong Young Publishing Co., 2000.

[4] Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.

[5] Hyung-Bin Park and Myung-Jin Bae, "A Study on Pitch Point Detection for Timbre Change," Acoustical Society of Korea, Summer Conference, Vol. 19, No. 1(s), pp. 149-152, July 7-8, 2000.

As described above, the present invention synthesizes a single voice into multiple voices by changing the pitch, an important parameter that carries the prosodic information of speech. Speech information technology has been named one of the ten key technologies of the 21st century by MIT and one of the ten most promising technologies of the 21st century by the Samsung Economic Research Institute. Beyond the importance of the technology, the market for speech technology is expected to grow very rapidly. The domestic speech technology market is still at an early stage and was estimated at about 20 billion won last year, but it is forecast to keep growing by more than 50% per year and to reach about 100 billion won in 2005. The present invention can be applied in various fields within this growing market. Examples include a cheering synthesizer that makes one person's cheering at a sports stadium sound like a crowd, a celebration synthesizer for birthdays and parties, and toys that sing rounds. It can also be used for sound effects in films and plays, and as an anti-theft system for dual-income households whose homes are empty for long periods. It can further be applied to voice modulation that imitates the voice of Zolaman, a currently popular character, or of celebrities. The invention can thus be applied in many fields, and its ripple effect is expected to be very large.

Claims

The present invention synthesizes a single voice into multiple voices by changing the pitch, an important speech parameter that carries prosodic information, while preserving the formant components. The pitch change is applied in the time domain so that prosody can be controlled in real time. To preserve the speaker's individuality and clarity under a time-domain pitch change, the change must be made on the basis of the speaker's central pitch points. For this purpose, a pitch point detection method based on linear predictive analysis, which can locate the speaker's pitch points, is used. A synthesizer is then implemented that applies the PSOLA synthesis method for real-time, time-domain pitch change and simultaneously combines several pitch-changed voices into multiple voices.