KR20030083903A

KR20030083903A - Phoneme boundary adjustment method for text/speech conversion

Info

Publication number: KR20030083903A
Application number: KR1020020022300A
Authority: KR
Inventors: 신원호
Original assignee: 엘지전자 주식회사
Priority date: 2002-04-23
Filing date: 2002-04-23
Publication date: 2003-11-01

Abstract

PURPOSE: A phoneme boundary adjustment method for text/speech conversion is provided that corrects errors in the phoneme boundaries obtained from a speech recognizer, yielding more accurate phoneme boundary information. CONSTITUTION: Phoneme information recognized by a speech recognizer and its boundary values are input sequentially (S1), and the forms 'unvoiced consonant + vowel' and 'vowel + unvoiced consonant' are detected separately (S2, S3). The boundary of a 'consonant + vowel' or 'consonant + consonant' pair is moved to the position where the smoothed energy change rate reaches its maximum (S7). The boundary of a 'vowel + consonant' pair is moved to the position where the slope of the smoothed ZCR reaches its maximum.

Description

Phoneme boundary adjustment method for text/speech conversion {PHONEME BOUNDARY ADJUSTMENT METHOD FOR TEXT / SPEECH CONVERSION}

The present invention relates to a technique for providing phoneme boundary information to a text/speech converter, and more particularly to a phoneme boundary adjustment method for text/speech conversion that corrects errors in the phoneme boundaries obtained from a speech recognizer so that more accurate phoneme boundary information can be provided.

In general, speech synthesis converts text into speech by appropriately combining small units of speech such as phonemes, which requires speech data to be held for each unit.

Recently, large databases have come into wide use for speech synthesis. In such cases, rather than manually segmenting each phoneme interval, a common approach is to obtain segmentation information from a speech recognizer and build the database needed for synthesis from it. However, the phoneme boundary information supplied by a speech recognizer contains many inaccurate boundaries.

In the conventional phoneme segmentation post-processing method for speech synthesizers, the units needed for synthesis were labeled one by one. This made the work slow and meant that only one or two units could be held for each required phoneme or synthesis unit. In addition, when the phoneme boundary information supplied by a speech recognizer was used, the inaccurate boundaries were used as they were, which degraded the sound quality of the synthesizer.

Accordingly, an object of the present invention is to provide a phoneme boundary adjustment method for text/speech conversion that detects phoneme boundaries automatically using a speech recognizer and corrects boundary errors based on smoothed change rates.

FIG. 1 is a signal flow diagram illustrating the phoneme segmentation post-processing method according to the present invention.

*** Description of the symbols for the main parts of the drawings ***

S1-S9: Steps 1 to 9

A first feature of the present invention is that incorrect boundaries are detected in the phoneme boundary output of a speech recognizer and corrected to obtain accurate phoneme boundaries.

A second feature of the present invention is that correcting the phoneme boundaries improves the sound quality of the speech synthesizer.

A third feature of the present invention is that it can be used widely wherever the boundary between an unvoiced consonant and a voiced sound must be found.

The phoneme boundary adjustment method for text/speech conversion according to the present invention comprises: a first process of sequentially inputting the phoneme information recognized by a speech recognizer and its boundary values, and separately detecting the forms "unvoiced consonant + vowel" and "vowel + unvoiced consonant"; a second process of adjusting the "consonant + vowel" or "consonant + consonant" boundary to the position where the smoothed energy change rate reaches its maximum; a third process of adjusting the "vowel + consonant" boundary to the position where the slope of the smoothed ZCR reaches its maximum; and a fourth process of synthesizing speech based on the adjusted boundaries. The operation of the invention is described in detail below with reference to the accompanying FIG. 1.

The phoneme information and its boundary values are input in the order in which the speech recognizer recognized the phonemes, and the following phoneme boundary adjustment process is carried out.

If the previous, current, and next phonemes satisfy the relevant condition, the boundary is updated to the adjusted value; otherwise the boundary value obtained from the speech recognizer is used unchanged.
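
The update-or-keep rule above can be sketched as a single pass over the recognizer output. This is a minimal illustration, not the patented implementation: the phoneme sets, the pattern labels, and the names `classify_boundary` and `adjust_boundaries` are hypothetical, and treating "CC" as two unvoiced consonants is an assumption.

```python
# Hypothetical phoneme inventories; the source does not list the actual sets.
VOWELS = {"a", "e", "i", "o", "u"}
UNVOICED = {"p", "t", "k", "s", "h"}


def classify_boundary(left, right):
    """Label the boundary between two adjacent phonemes, or return None."""
    if left in UNVOICED and right in VOWELS:
        return "CV"
    if left in VOWELS and right in UNVOICED:
        return "VC"
    if left in UNVOICED and right in UNVOICED:
        return "CC"  # assumption: CC means two unvoiced consonants
    return None  # e.g. voiced consonant + vowel: left untouched


def adjust_boundaries(phonemes, boundaries, adjusters):
    """boundaries[i] lies between phonemes[i] and phonemes[i + 1].

    adjusters maps a pattern label to a function taking the recognizer
    position and returning the adjusted position. Boundaries whose pattern
    has no adjuster keep the recognizer value.
    """
    adjusted = list(boundaries)
    for i, pos in enumerate(boundaries):
        pattern = classify_boundary(phonemes[i], phonemes[i + 1])
        adjust = adjusters.get(pattern)
        if adjust is not None:
            adjusted[i] = adjust(pos)
    return adjusted
```

In practice each adjuster would search a window around the recognizer position; here they are left as plug-in functions so the control flow stays visible.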

In general, an accurate boundary is difficult to derive for a voiced consonant followed by a vowel, even with manual labeling. The boundaries to which the phoneme boundary adjustment method of the present invention is applied are therefore limited to the forms "unvoiced consonant + vowel" and "vowel + unvoiced consonant".

In the case of "closed final consonant + unvoiced consonant", the final consonant has almost no phonetic value, so the pair can be classified under "vowel + unvoiced consonant". That is, "VCC" (V (vowel) + C (consonant) + C (consonant)) is handled in the same way as "VC". To adjust a phoneme boundary, a fixed interval before and after the boundary obtained from the speech recognizer is searched, and the size of that interval is a preset value chosen according to the classification "CC", "CV", "VC", or "VCC".

"CC" is handled in almost the same way as "CV". After the phoneme boundary has been adjusted for "CV", "VC", and so on, the process advances to the next phoneme string and repeats until the last phoneme string from the speech recognizer is reached.

The phoneme boundary adjustment process for "CV" and "VC" is described in more detail below. It uses the modified parameters given in [Equation 1] and [Equation 2].

[Equation 1] below gives the band energy change rate (Eng_Chg) and the smoothed energy change rate (Sm_Eng). The energy (Eng) of a predetermined frequency band (200 to 800 Hz) is computed with a 128-point FFT on speech sampled at 16 kHz. Eng_Chg expresses the smoothed energy change rate over a fixed interval as an average value, and A and B are constants applied for the smoothing and for computing the amount of change.
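
As a rough illustration of the band energy step, the sketch below evaluates the 200 to 800 Hz energy of one 128-sample frame at 16 kHz with a direct DFT over the relevant bins, then smooths and differences the per-frame energies. Equation 1 itself is not reproduced in this text, so the moving average and the symmetric difference are assumed stand-ins, and `half_width` and `lag` are hypothetical parameters playing the role of the constants A and B.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, as stated in the description
FFT_SIZE = 128       # points, as stated in the description


def band_energy(frame, lo_hz=200.0, hi_hz=800.0):
    """Sum of squared DFT magnitudes for bins inside [lo_hz, hi_hz]."""
    if len(frame) != FFT_SIZE:
        raise ValueError("frame must hold exactly FFT_SIZE samples")
    bin_hz = SAMPLE_RATE / FFT_SIZE        # 125 Hz per bin
    k_lo = math.ceil(lo_hz / bin_hz)       # first bin at or above lo_hz
    k_hi = math.floor(hi_hz / bin_hz)      # last bin at or below hi_hz
    energy = 0.0
    for k in range(k_lo, k_hi + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * n / FFT_SIZE)
            for n, x in enumerate(frame)
        )
        energy += abs(coeff) ** 2
    return energy


def moving_average(values, half_width):
    """Centered moving average, with the window truncated at the edges."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def energy_change_rate(energies, half_width=2, lag=1):
    """Assumed stand-in for Equation 1: smooth, then take a difference."""
    smoothed = moving_average(energies, half_width)
    last = len(smoothed) - 1
    return [
        smoothed[min(i + lag, last)] - smoothed[max(i - lag, 0)]
        for i in range(len(smoothed))
    ]
```

With these settings the band covers DFT bins 2 through 6 (250 Hz to 750 Hz). A production version would normally use an FFT library rather than the direct sum.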

[Equation 2] below gives the slope of the smoothed ZCR (Zcr_Slope) and the smoothed ZCR (sm_zcr), where ZCR is the zero crossing rate. ZCR is the number of zero crossings within a fixed interval. Because it takes high values in unvoiced sounds, it serves as a time-domain parameter that makes voiced sounds easy to distinguish. Conversely, 1-ZCR takes a value close to 0 in unvoiced sounds. Zcr_Slope is the slope of the smoothed 1-ZCR value. Likewise, the constants C and D are values obtained through experiments.
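
The ZCR parameters can be sketched in the same spirit. Equation 2 is not reproduced in this text, so the normalization of ZCR to the range 0 to 1, the moving-average smoothing, and the sign convention of the slope are all assumptions; `half_width` and `lag` are hypothetical stand-ins for the constants C and D.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def zcr_slope(zcr_values, half_width=2, lag=1):
    """Slope of the smoothed 1-ZCR contour (assumed form of Zcr_Slope)."""
    inverted = [1.0 - z for z in zcr_values]  # near 0 in unvoiced frames
    smoothed = []
    for i in range(len(inverted)):
        lo = max(0, i - half_width)
        hi = min(len(inverted), i + half_width + 1)
        smoothed.append(sum(inverted[lo:hi]) / (hi - lo))
    last = len(smoothed) - 1
    return [
        smoothed[min(i + lag, last)] - smoothed[max(i - lag, 0)]
        for i in range(len(smoothed))
    ]
```

With this forward-difference convention the slope is negative where a vowel gives way to an unvoiced consonant, so a search for the steepest transition would look at the magnitude or flip the sign; which convention the original equation uses is not recoverable from the text.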

As in [Equation 3] below, the CV boundary is adjusted by setting the position at which the smoothed change rate reaches its maximum (arg max Eng_Chg) as the new boundary (NewPosCV). Because the phonemes are joined in the order unvoiced sound then voiced sound, the ZCR of the left part must be greater than the ZCR of the right part.
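
A minimal sketch of this CV search follows, assuming per-frame sequences of Eng_Chg and ZCR are already available. The function name, the `search_radius` parameter, the use of mean ZCR on each side, and the fallback to the recognizer position when the ZCR condition fails are assumptions made for illustration; the source states only the arg max rule and the left-greater-than-right ZCR requirement.

```python
def adjust_cv_boundary(eng_chg, zcr, recognizer_pos, search_radius):
    """Return the frame index chosen as the CV boundary.

    eng_chg and zcr are per-frame sequences of equal length, and
    recognizer_pos is the frame index reported by the recognizer.
    """
    lo = max(0, recognizer_pos - search_radius)
    hi = min(len(eng_chg) - 1, recognizer_pos + search_radius)
    # Position of the maximum change rate inside the search window.
    candidate = max(range(lo, hi + 1), key=lambda i: eng_chg[i])
    left = zcr[lo:candidate]
    right = zcr[candidate:hi + 1]
    if not left or not right:
        return recognizer_pos  # nothing to compare on one side
    # Unvoiced then voiced: ZCR should drop across the boundary.
    if sum(left) / len(left) > sum(right) / len(right):
        return candidate
    return recognizer_pos  # assumed fallback when the check fails
```

For example, with a change-rate peak one frame before the recognizer position and ZCR falling from 0.8 to 0.2 across it, the function moves the boundary onto the peak.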

Likewise, when the VC boundary is adjusted, the position at which the slope of the smoothed ZCR (Zcr_Slope) reaches its maximum is found as in [Equation 4] below and is taken as the new boundary (NewPosCV).

The boundary adjustment process for CC is the same as that for CV.

As described in detail above, the present invention detects phoneme boundaries automatically through a speech recognizer and corrects boundary errors based on the smoothed change rates, which improves the sound quality of the speech synthesizer. The same correction principle can also be applied widely to correcting errors in general voiced/unvoiced boundaries.

Claims

A phoneme boundary adjustment method for text/speech conversion, comprising: a first step of sequentially inputting phoneme information recognized by a speech recognizer and its boundary values, and separately detecting the forms "unvoiced consonant + vowel" and "vowel + unvoiced consonant"; a second step of adjusting the boundary of "consonant + vowel" or "consonant + consonant" to the position at which the smoothed energy change rate reaches its maximum; and a third step of adjusting the boundary of "vowel + consonant" to the position at which the slope of the smoothed ZCR reaches its maximum.

The phoneme boundary adjustment method of claim 1, wherein the smoothed energy change rate is calculated using the following Equation.

The method of claim 1, wherein the slope of the smoothed ZCR is obtained by using the following Equation.