KR100202539B1

KR100202539B1 - Speech synthesis method

Info

Publication number: KR100202539B1
Application number: KR1019950030543A
Authority: KR
Inventors: 김상수
Original assignee: 구자홍; 엘지전자주식회사
Priority date: 1995-09-18
Filing date: 1995-09-18
Publication date: 1999-06-15
Also published as: KR970017171A

Abstract

The present invention relates to a speech synthesis method. Conventionally, synthesis units such as diphones or half-syllables must be cut in a stable section of a phoneme, but such stable sections are rarely found in practice. As a result, formant differences prevent half-syllables from joining naturally, and in waveform-coding synthesis in particular these differences distort the sound quality. The invention therefore selects units that keep the data volume reasonable while keeping the sound quality close to the original speech, and then obtains the phonemes that actually occur. Of the resulting phoneme data, unvoiced sounds are stored as PCM, and voiced sounds are converted into single-period waveform groups whose low-frequency phase is set to zero. These waveform groups are overlap-added according to the prosodic information, which improves sound quality and produces natural-sounding synthesized speech.

Description

Speech synthesis method

FIG. 1 is a table of phonemes classified into syllable units, taking into account the phonemes that precede and follow each syllable to reflect coarticulation.

FIG. 2 is a flowchart showing the speech synthesis method of the present invention.

FIG. 3 is a flowchart showing the single-period waveform generation process of FIG. 2.

The present invention relates to a speech synthesis method that uses syllable-based synthesis units, and more particularly to a method that selects synthesis units so that they join naturally and uses them to synthesize Korean speech with improved sound quality.

A conventional speech synthesis method consists of a first step of analyzing an input sentence to extract prosodic information, a second step of generating the units needed for synthesis, and a third step of synthesizing speech from the prosodic information and units obtained in the first and second steps.

The conventional technique is described step by step below.

When a sentence to be synthesized is input, all possible candidates are generated for it using a lexical dictionary and a morpheme dictionary.

Each candidate word phrase (eojeol) is analyzed into a word and the particles or endings attached to it. The analyzed word phrase is then passed on, and the prosody needed for synthesis is generated by referring to the parts of speech of the preceding and following word phrases.

Once the prosody has been generated, the next stage generates the synthesis units. The unit most widely used in Korean speech synthesis is the diphone, which spans from the middle of one phoneme to the middle of the next.

Other units include the half-syllable and the half-phoneme.

A combination of phonemes is adopted as the unit, rather than the phoneme itself as the smallest unit, in order to preserve the transition between adjacent phonemes.

Speech is then synthesized from the prosodic information and the synthesis units as described above. However, analysis of human utterances shows that the same phoneme takes different forms under the influence of its neighboring phonemes. If these effects are ignored, the units join unnaturally, the original character of the sound is lost, and the result sounds awkward.

The diphone and half-syllable units rest on the assumption that each phoneme has a stable section: if phonemes are split in that stable section, a unit will join naturally with other units.

Taking the half-syllable as an example, a syllable is divided into a first half-syllable and a second half-syllable at the stable section of the vowel. The word '나는' is built by combining the first half-syllable of '나', one half-syllable each from '아느', and the second half-syllable of '은'.

In actually pronounced '나는', however, stable sections are rarely found. Formant differences therefore prevent the half-syllables from joining naturally, and in waveform-coding synthesis in particular these differences distort the sound quality.

Accordingly, to solve these problems, the object of the present invention is to provide a speech synthesis method that obtains the phonemes that actually occur and, on that basis, partitions the data into extended syllable units. Of the resulting phoneme data, unvoiced sounds are stored in PCM (Pulse Code Modulation) form, and voiced sounds are converted into single-period waveform groups whose low-frequency phase is set to zero. These waveform groups are overlap-added according to the prosodic information, which yields natural joins between synthesis units and improved sound quality.

To achieve this object, the speech synthesis method of the present invention, as shown in FIG. 2, comprises: a first step of preprocessing the input text to convert it into a Hangul sentence; a second step of processing the phonological changes of the Hangul sentence from the first step, building a syllable unit string, and generating prosodic information; a third step of, when the speech data of each syllable unit in the syllable unit string from the second step is input, multiplying that speech data by a Hamming window to obtain a short-time signal; a fourth step of applying an FFT to the short-time signal of the third step to obtain a PSE (Power Spectrum Envelope); a fifth step of assigning phase only to the high-frequency portion of the PSE from the fourth step and applying an inverse FFT to generate a one-period waveform, from which a single-period waveform group is obtained; and a sixth step of overlap-adding the single-period waveform groups from the fifth step according to the prosodic information to synthesize speech.

The operation and effects of the present invention, organized in these steps, are described in detail below.

The choice of synthesis unit has an important effect on synthesized sound quality and must be made carefully. The best unit would be the word phrase (eojeol), but the number of word phrases and the resulting data volume are too large for practical use. The criteria for selecting a unit that keeps the data volume reasonable while keeping the sound quality close to the original speech are broadly as follows.

First, no heterogeneous sound caused by formant differences should arise when units are joined.

Second, energy differences between units should not affect the join.

Third, there should be no pitch difference between units.

Fourth, the unit must be able to accommodate coarticulation between phonemes.

The unit that satisfies all of these conditions is the extended syllable unit, which takes the preceding and following phonemes into account.

In Korean, a syllable is formed from an initial consonant (choseong), a medial vowel (jungseong), and a final consonant (jongseong), and syllables combine into words. Between syllables, coarticulation occurs: formants shift under the influence of adjacent phonemes.
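The initial, medial, and final structure described above can be illustrated with a minimal Python sketch. It relies on the standard Unicode composition of precomposed Hangul syllables, which is an assumption brought in for illustration and is not part of the patent's method; the function name `decompose` is hypothetical. Note that the Unicode jamo tables are orthographic (19 initials, 21 medials, 28 finals including "none") and differ from the phonetic classes the patent defines.

```python
# Jamo tables in Unicode composition order (orthographic, not the patent's classes).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 medial vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 incl. none

HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable
HANGUL_LAST = 0xD7A3   # code point of the last precomposed syllable


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    index = code - HANGUL_BASE
    initial, rest = divmod(index, len(MEDIALS) * len(FINALS))
    medial, final = divmod(rest, len(FINALS))
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


if __name__ == "__main__":
    for ch in "나는":
        print(ch, decompose(ch))
```

An empty string in the third position means the syllable has no final consonant, which is the case the patent treats separately when it classifies the following initial.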

Because coarticulation is especially pronounced in vowels, joining units at consonant-to-consonant boundaries reduces the heterogeneous sound produced at the join.

The effect of energy differences is also smaller at consonant-to-consonant joins than at vowel-to-vowel joins.

FIG. 1 shows the syllable units classified with the phonemes before and after each syllable taken into account to reflect coarticulation.

That is, the initial consonants comprise 21 types: the 19 basic Korean consonants, plus the consonant ㄹ (l) that newly arises in '알라' and the ㅇ (ng) that arises in '앙아'. An initial consonant also varies depending on whether the preceding syllable has a final consonant. To reflect this, each type is further divided into three cases: the first syllable, a preceding syllable with a final consonant, and a preceding syllable that has no final consonant and ends in a vowel. This gives 3 × 21 = 63 initial classes in total.

The medial vowels are the 21 vowels, grouped by shared phonetic symbol into 17 classes.

The finals comprise the 7 that can occur phonetically. In addition, when a syllable has no final consonant, its vowel is affected by the initial consonant of the next syllable, so that case is also classified as a final. This gives 15 final classes in total.

The number of phoneme units that actually occur is 3 × 21 × 17 = 1071 for initials, 21 × 17 × 15 = 5355 for medials, and 17 × 4 = 68 for finals. Combinations of these three kinds of phoneme units can produce the syllables needed for any unrestricted text.
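The unit counts stated above can be checked with a short Python sketch. The variable names are hypothetical, and the meaning attached to each factor is only one reading of the description; in particular, the text gives the factor 4 in 17 × 4 without explanation, so it is carried over as-is.

```python
def unit_inventory():
    """Recompute the unit counts stated in the description (illustrative reading)."""
    n_initial_types = 19 + 2   # 19 basic consonants + ㄹ(l) + ㅇ(ng)
    n_contexts = 3             # first syllable / after a final / after a vowel
    n_medial_classes = 17      # 21 vowels grouped into 17 classes
    n_final_classes = 15       # final classes, as stated in the text
    n_final_factor = 4         # factor from the text's "17 x 4"; not explained there

    initial_classes = n_contexts * n_initial_types
    counts = {
        "initial": n_contexts * n_initial_types * n_medial_classes,
        "medial": n_initial_types * n_medial_classes * n_final_classes,
        "final": n_medial_classes * n_final_factor,
    }
    return initial_classes, counts


if __name__ == "__main__":
    classes, counts = unit_inventory()
    print("initial classes:", classes)
    print("unit counts:", counts, "total:", sum(counts.values()))
```

Summing the three figures gives the size of the waveform inventory that the method would need to store under this reading.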

Of the phoneme data partitioned in this way, unvoiced sounds are stored in PCM (Pulse Code Modulation) form, while voiced sounds go through the process of FIG. 3 to generate a one-period waveform (impulse response) whose low-frequency phase is set to zero.

In the generation process, as shown in FIG. 3, when syllable-unit speech data is input, it is multiplied by a Hamming window to obtain a short-time signal.

An FFT (Fast Fourier Transform) is then applied to the short-time signal to obtain the PSE (Power Spectrum Envelope). Phase is assigned only to the high-frequency portion of the PSE, and an inverse FFT then produces a one-period waveform.
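The windowing, transform, and phase steps just described can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: a naive DFT stands in for the FFT, the raw magnitude spectrum stands in for the PSE (the text does not say how the envelope is estimated), and the boundary between "low" and "high" frequency is a hypothetical parameter `cutoff_bin` that the text does not specify.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def single_period_waveform(frame, cutoff_bin):
    """Window a frame, zero the phase below cutoff_bin, keep it above, and invert."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]   # short-time signal
    spectrum = dft(windowed)
    shaped = [0j] * n
    for k in range(n // 2 + 1):
        magnitude = abs(spectrum[k])             # assumption: magnitude used as the envelope
        phase = cmath.phase(spectrum[k]) if k >= cutoff_bin else 0.0
        shaped[k] = cmath.rect(magnitude, phase)
    for k in range(1, (n + 1) // 2):             # mirror bins so the output is real
        shaped[n - k] = shaped[k].conjugate()
    return idft_real(shaped)


if __name__ == "__main__":
    n = 32
    frame = [math.sin(2 * math.pi * 3 * t / n + 0.7) for t in range(n)]
    wave = single_period_waveform(frame, cutoff_bin=8)
    print([round(v, 4) for v in wave[:8]])
```

Zeroing the phase of the low-frequency bins concentrates their energy around the start of the frame, which is consistent with the description's reference to an impulse response; the magnitudes themselves are left unchanged.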

A waveform is generated for each phoneme in this way, and the generated waveform data are overlapped at regular intervals to form a single-period waveform group.

The overall speech synthesis process that uses the procedure described above is as follows, with reference to FIG. 2.

When text for speech synthesis is input, a preprocessing stage first converts all of the text into a Hangul sentence. Phonological changes are then applied to the Hangul sentence, and the corresponding syllable unit string is built.

At the same time, prosodic information is generated by morphological and syntactic analysis of the syllables to which the phonological changes have been applied.

From the syllable unit string generated above, the single-period waveform group of each syllable is produced, and these waveform groups are overlap-added according to the prosodic information to produce the synthesized speech.
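The overlap-add step can be sketched in Python as below. The description does not say how the prosodic information is represented, so this sketch assumes it has already been reduced to one pitch period per waveform, expressed in samples; the function name `overlap_add` and the parameter `pitch_periods` are hypothetical.

```python
def overlap_add(waveforms, pitch_periods):
    """Sum one-period waveforms, starting each one pitch period after the previous.

    waveforms: list of sample lists, one per pitch mark.
    pitch_periods: spacing in samples from each waveform's start to the next
                   (assumed to be derived from the prosodic information).
    """
    if len(waveforms) != len(pitch_periods):
        raise ValueError("need exactly one pitch period per waveform")

    starts = []
    position = 0
    for period in pitch_periods:
        starts.append(position)
        position += period

    length = max((s + len(w) for s, w in zip(starts, waveforms)), default=0)
    output = [0.0] * length
    for start, wave in zip(starts, waveforms):
        for i, sample in enumerate(wave):
            output[start + i] += sample   # overlapping regions add together
    return output


if __name__ == "__main__":
    pulse = [1.0, 1.0, 1.0, 1.0]
    print(overlap_add([pulse, pulse], [2, 2]))
```

Shorter pitch periods pack the waveforms closer together and raise the perceived pitch, while longer ones lower it, which is how a spacing taken from the prosody can shape intonation without changing the stored waveforms.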

The synthesized speech produced in this way has natural joins between synthesis units and higher sound quality.

As described in detail above, the present invention selects units that keep the data volume reasonable while keeping the sound quality close to the original speech, and then obtains the phonemes that actually occur. Of the resulting phoneme data, unvoiced sounds are stored as PCM, and voiced sounds are converted into single-period waveform groups whose low-frequency phase is set to zero. Overlap-adding these waveform groups according to the prosodic information improves sound quality and produces natural-sounding synthesized speech.

Claims

A speech synthesis method comprising: a first step of preprocessing the input text to convert it into a Hangul sentence; a second step of processing the phonological changes of the Hangul sentence from the first step, building a syllable unit string, and generating prosodic information; a third step of, when the speech data of each syllable unit in the syllable unit string from the second step is input, multiplying that speech data by a Hamming window to obtain a short-time signal; a fourth step of applying an FFT to the short-time signal of the third step to obtain a PSE (Power Spectrum Envelope); a fifth step of assigning phase only to the high-frequency portion of the PSE from the fourth step and applying an inverse FFT to obtain a single-period waveform group; and a sixth step of overlap-adding the single-period waveform groups from the fifth step according to the prosodic information to synthesize speech.