KR100346790B1 - Postprocessing method for automatic phoneme segmentation

Info

Publication number: KR100346790B1
Application number: KR1019990023811A
Authority: KR
Inventors: 김상훈; 박은영; 이영직
Original assignee: 한국전자통신연구원
Priority date: 1999-06-23
Filing date: 1999-06-23
Publication date: 2002-08-01
Also published as: KR20010003502A

Abstract

The present invention relates to automatic synthesis-unit generation for a text-to-speech converter, and more specifically to a post-processing method for phoneme segmentation that reduces phoneme boundary errors in the output of an automatic phoneme segmenter. Phoneme boundary errors are first corrected by a statistical method, a phoneme boundary search interval is then set by a statistical method, and post-processing is performed with a neural network using feature parameters that are robust for each phonological environment. This narrows the range of boundary errors, so the method can be used to build large-scale prosody databases for synthesis and to generate synthesis units automatically. Segmentation performance is improved by combining correction based on statistical information, feature parameter sets (Feature Sets) applied according to the phonological environment, and neural-network learning of acoustic regularities. On a synthesis database of isolated-word utterances, correction by the post-processor outperforms the automatic phoneme segmenter and improves the absolute error. On a synthesis database of sentence utterances, applying post-processing separately for each phonological environment outperforms applying a single post-processor, and improves both the frame segmentation rate and the absolute error relative to the automatic segmenter.

Description

Postprocessing method for automatic phoneme segmentation

The present invention relates to automatic synthesis-unit generation for a text-to-speech converter, and more particularly to a phoneme segmentation post-processing method for reducing phoneme boundary errors in segmentation performed by an automatic phoneme segmenter.

In general, automatically segmented phoneme boundaries always contain errors. These errors yield inaccurate data when phoneme-level information is used, which makes the results difficult to use for prosody modeling or as synthesis units.

Recently, speech synthesis based on large-scale prosody databases has produced better synthesized speech than conventional synthesis methods, and synthesis systems using this approach are being developed in Korea and abroad.

However, building a large database takes considerable time and cost, so automating database (DB) construction is urgently needed, and automatic phoneme segmentation technology is becoming very important.

The speech recognition system of the Electronics and Telecommunications Research Institute (ETRI), currently used for automatic phoneme segmentation, segments about 80% or more of the target phonemes in FM radio news sentences, conversational sentences, and read-speech text to within 30 msec of the manually segmented boundaries. For isolated words, the figure is about 60%.

However, when phonemes segmented at this level of accuracy are used as speech synthesis units, the resulting synthesized speech is somewhat degraded, so the phoneme boundaries need to be corrected with a post-processor.

Conventional post-processors mainly extract spectral feature parameters and train a phoneme segmenter (neural network or HMM), so their structure is almost the same as that of the automatic phoneme segmenter.

It has therefore been difficult for a post-processor to improve performance beyond that of the automatic phoneme segmenter.

As described above, conventional automatically segmented phoneme boundaries always contain errors, and these errors yield inaccurate data when phoneme-level information is used, making the results difficult to use for prosody modeling or as synthesis units. In particular, with the automatic phoneme segmenter currently in use, about 20% of boundaries in sentence-level utterances have an absolute error of 30 msec or more, and performance on isolated words is considerably worse.

To solve the above problems, an object of the present invention is to provide a method for improving the performance of the automatic phoneme segmentation used in the automatic synthesis-unit generator of a text-to-speech converter.

To achieve this object, the present invention is characterized by training a neural network on a manually corrected phoneme boundary database (DB) and then post-processing with the trained neural network.

The present invention applies the segmentation characteristics of the automatic phoneme segmenter to the post-processing scheme. The automatic phoneme segmenter has characteristics such as "the type of phoneme boundary error is consistent for a given phonological environment" and "segmentation almost always includes the stable interval of the phoneme".

That the type of phoneme boundary error is consistent for a given phonological environment means that, compared with manual segmentation by a speech expert, the magnitude and direction of the boundary error are nearly constant within each phonological environment.

Therefore, comparing the automatic segmentation results with the manually corrected segmentation results reveals the error characteristics of the automatic phoneme segmenter.

To quantify these error characteristics, the error (= |manually corrected phoneme position - automatically segmented phoneme position|) is averaged for each phonological environment, and its standard deviation is computed.

The per-environment mean is applied to the remaining automatic segmentation results that were excluded from the training data. The statistics (mean, standard deviation) of the phoneme boundary error for each phonological environment allow a first-pass correction of the boundary error, and the standard deviation is used to set the phoneme boundary search interval during post-processing.
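The per-environment statistics described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the environment labels, and the function name `boundary_error_stats` are hypothetical; the error is taken as the signed difference (manual minus automatic), because the later search-interval rule depends on the sign of the mean; and the population standard deviation is used, which the text does not specify.

```python
from collections import defaultdict
from statistics import mean, pstdev


def boundary_error_stats(records):
    """Return {environment: (mean_error_ms, std_error_ms)}.

    records: iterable of (environment, manual_ms, auto_ms) tuples.
    Assumption: error = manual_ms - auto_ms (signed), population std.
    """
    errors = defaultdict(list)
    for environment, manual_ms, auto_ms in records:
        errors[environment].append(manual_ms - auto_ms)
    return {env: (mean(vals), pstdev(vals)) for env, vals in errors.items()}


# Hypothetical boundary positions in msec, for illustration only.
records = [
    ("vowel+nasal", 100.0, 90.0),
    ("vowel+nasal", 200.0, 180.0),
    ("nasal+vowel", 50.0, 55.0),
]
stats = boundary_error_stats(records)
```

The resulting table plays the role of the statistics registered per phonological environment: the mean feeds the first-pass correction and the standard deviation feeds the search interval.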

That segmentation almost always includes the stable interval of the phoneme means that large errors crossing the neighboring left or right phoneme boundary rarely occur.

In speech synthesis in particular, the phonemes to be segmented are known in advance, so this property makes it possible to use different feature parameters depending on the phonological environment.

For example, boundary correction in a "voiced + unvoiced" environment can use time-domain feature parameters such as zero-crossing rate and energy; a "non-vowel + vowel" environment can use the ratio of high-band to low-band energy; and a "vowel + vowel" environment can use frequency-domain spectral information.
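As a rough illustration of the time-domain features named above, the sketch below computes a zero-crossing rate and a short-time energy for one frame of samples. The function names, the per-frame normalization, and the sample values are assumptions made for the example; the patent does not give formulas for these features, and the band-energy and spectral features are not covered here.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    if not frame:
        raise ValueError("frame must not be empty")
    return sum(x * x for x in frame) / len(frame)


# Hypothetical frames: a rapidly alternating one and a constant one.
noisy_frame = [1.0, -1.0, 1.0, -1.0]
flat_frame = [1.0, 1.0, 1.0, 1.0]
noisy_zcr = zero_crossing_rate(noisy_frame)
flat_zcr = zero_crossing_rate(flat_frame)
noisy_energy = short_time_energy(noisy_frame)
```

Unvoiced speech tends to show a higher zero-crossing rate and lower energy than voiced speech, which is why these two features suit voiced/unvoiced boundaries.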

Compared with the conventional approach of using the same feature parameters for all phonemes, feature parameters that are robust for each phonological environment can thus be used, which improves phoneme segmentation performance.

When information about the segmented phonemes is already known, a knowledge-based expert system approach, in which an expert formulates rules for the acoustic changes between phonemes and segmentation uses those rules, could yield results close to manual segmentation. However, analyzing and formulating rules for the highly varied acoustic characteristics at phoneme boundaries takes considerable time, so these acoustic changes are instead learned through a training algorithm (a multilayer neural network).

Training is the process of extracting rules from manually segmented results: the manual segmentation results serve as input to the neural network, and the rules are formed through neural network training.

In the present invention, the performance of the automatic phoneme segmenter and of the post-processor was compared for each phonological environment, and the environments were divided into those where the automatic segmenter performs better and those where the post-processor performs better.

This is because using the post-processor does not always improve performance; its performance can fall below that of the automatic segmenter.

Looking at the automatic phoneme segmenter's performance by phonological environment, it outperformed the post-processor at all "phoneme + nasal" and "vowel + liquid" boundaries.

In the remaining phonological environments the post-processor performs better. For these, robust feature parameters were selected again for each phonological environment, and post-processing applies different feature parameters depending on the environment.

To select robust feature parameters for each phonological environment, the present invention defines Feature Set I as a 9-dimensional set comprising the zero-crossing rate and energy extracted in the time domain and band-wise energy information, which is spectral information in the frequency domain. Feature Set II consists of the 13th-order Perceptual Linear Prediction (hereinafter PLP) coefficients, which model human auditory characteristics. Speech features were extracted with each set, a neural network was trained on each, and the segmentation performance was then examined.

Feature Set I performed well for "[liquid, voiceless consonant] + vowel" and "any phoneme + voiceless consonant", whereas Feature Set II performed well for "[nasal, liquid, voiceless consonant] + vowel", "nasal + nasal", "liquid + liquid", and "nasal + voiceless consonant".

Thus the automatic segmenter's performance varied with the phoneme or phonological environment, and Feature Sets I and II showed different segmentation performance for particular phonological environments. Feature Set I was generally strong at voiced/unvoiced distinctions, while Feature Set II performed well on vowels, liquids, and nasals.

In the final post-processing stage, the new phoneme boundary is determined through the elements described above: correction using statistical information, application of robust feature parameters for each phonological environment, and neural network training.

FIG. 1 is a flowchart of neural network training according to the present invention.

FIG. 2 is a flowchart of phoneme segmentation post-processing to which the present invention is applied.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of neural network training according to the present invention, showing the training configuration used to learn the rules of acoustic change at phoneme boundaries.

In the training process, the mean and standard deviation of the phoneme boundary error, which constitute the statistical information, are extracted (S2) from the manual phoneme segmentation database (DB) (S1); feature parameters for training the neural network are extracted (S3); and a neural network is trained for each of the extracted feature parameter sets I and II (S4).

Training ends when the neural network output value reaches a threshold; otherwise, training is repeated.

That is, it is determined whether training is complete (S5). If it is not, the neural network is trained again on the feature parameters. If it is, the weight values I and II, which are the trained parameters of the neural networks, are extracted and stored on the hard disk (S6), and training ends (S7).

FIG. 2 is a flowchart of phoneme segmentation post-processing to which the present invention is applied, showing the configuration for post-processing with the trained neural network.

In post-processing, a first-pass correction of the error is performed using the statistical information extracted during training (corrected phoneme position = automatically segmented phoneme position + mean error) (S8); the automatic phoneme segmentation database (DB) is received as input (S9); and two types of parameters are extracted from the manually segmented speech data used to train the neural network (S10).
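The first-pass correction in S8 is a single addition per boundary. A minimal sketch, assuming positions in msec, a hypothetical dictionary of per-environment mean errors, and (as a design choice not stated in the text) no shift for environments without statistics:

```python
def first_pass_correction(auto_ms, environment, mean_error_by_env):
    """corrected position = automatic position + mean error of the environment.

    Assumption: an environment with no recorded statistics is left unshifted.
    """
    return auto_ms + mean_error_by_env.get(environment, 0.0)


# Hypothetical mean errors in msec, for illustration only.
mean_error_by_env = {"vowel+nasal": 15.0, "nasal+vowel": -5.0}
corrected = first_pass_correction(90.0, "vowel+nasal", mean_error_by_env)
unchanged = first_pass_correction(90.0, "vowel+vowel", mean_error_by_env)
```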

After parameter extraction, the phonological environment is selected (S11). In the present invention, speech features are extracted using Feature Set I (S11a), a 9-dimensional set comprising the zero-crossing rate and energy extracted in the time domain and band-wise energy information (spectral information in the frequency domain), and Feature Set II (S11b), the 13th-order perceptual linear prediction (PLP) coefficients, which model human auditory characteristics.

Different feature parameters and weight values are applied for each phonological environment (S11, S12). As shown in Table 1 below, when the phonological environment is Type III the automatic phoneme segmentation result is better, so post-processing ends without any further processing (S13).

Table 1. Phonological environments by type

Type I: [liquid, voiceless consonant] + vowel; any phoneme except a nasal + voiceless consonant
Type II: [nasal, liquid, voiceless consonant] + vowel; nasal + nasal; liquid + liquid; nasal + voiceless consonant
Type III: any phoneme except a nasal + nasal; vowel + liquid
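The Type III rule (environments where the automatic segmentation result is kept and post-processing is skipped) can be expressed as a small predicate. This is a sketch only: the class labels "vowel", "nasal", "liquid", and "voiceless" and the function name are hypothetical, and how each phoneme is assigned to a class is outside the scope of the example.

```python
def skips_postprocessing(left_class, right_class):
    """True for Type III environments in Table 1.

    Type III: any phoneme except a nasal followed by a nasal,
    or a vowel followed by a liquid.
    """
    non_nasal_then_nasal = right_class == "nasal" and left_class != "nasal"
    vowel_then_liquid = left_class == "vowel" and right_class == "liquid"
    return non_nasal_then_nasal or vowel_then_liquid


type3_examples = [
    skips_postprocessing("vowel", "nasal"),
    skips_postprocessing("vowel", "liquid"),
]
other_examples = [
    skips_postprocessing("nasal", "nasal"),
    skips_postprocessing("liquid", "vowel"),
]
```

Only Type III is encoded here because the Type I and Type II rows overlap (both list "[liquid, voiceless consonant] + vowel"), so a single-valued lookup for those two types would require a tie-breaking rule the text does not give.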

After the above, the phoneme boundary search interval is determined according to the feature parameters (S15). In step 1, two frames to the left and right of the automatically segmented boundary position (±25 msec) are taken as the search interval for neural network output values. Among the output values in the interval that are at or above the threshold, the two largest are found, and whichever lies closer to the automatic segmentation position is taken as the post-processed boundary.
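Step 1 can be sketched as below, assuming the neural network outputs are available as (offset from the automatic boundary in msec, output value) pairs. The function name, the data layout, and the threshold value are hypothetical; returning `None` stands for the case where no output reaches the threshold.

```python
def pick_boundary(candidates, threshold, lo_ms=-25.0, hi_ms=25.0):
    """Return the offset (msec) of the post-processed boundary, or None.

    candidates: iterable of (offset_ms, nn_output) pairs, where offset_ms is
    measured from the automatically segmented boundary.
    """
    in_range = [
        (offset, output)
        for offset, output in candidates
        if lo_ms <= offset <= hi_ms and output >= threshold
    ]
    if not in_range:
        return None
    # Keep the two largest outputs, then prefer the one nearer offset 0.
    top_two = sorted(in_range, key=lambda c: c[1], reverse=True)[:2]
    return min(top_two, key=lambda c: abs(c[0]))[0]


# Hypothetical network outputs, for illustration only.
candidates = [(-20.0, 0.9), (5.0, 0.8), (10.0, 0.6), (30.0, 0.95)]
chosen = pick_boundary(candidates, threshold=0.7)
none_found = pick_boundary(candidates, threshold=0.99)
```

In this example the candidate at +30 msec is ignored because it lies outside the ±25 msec interval, and the candidate at +5 msec wins over the one at -20 msec because it is closer to the automatic boundary.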

In step 2, if no neural network output value at or above the threshold exists in the search interval, the search interval is extended to the 90% confidence interval using the table of statistics registered for each phonological environment, and step 1 is repeated over the extended interval. The extended range is given by Equation 1 below.

Equation 1:

if mean error < 0: [mean - 1.645 × standard deviation, 25]

otherwise: [-25, mean + 1.645 × standard deviation]
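Equation 1 translates directly into a small function. In this sketch the function and parameter names are hypothetical, and all values are assumed to be in msec (the text gives the fixed bound as 25, matching the ±25 msec interval of step 1). The factor 1.645 is the standard normal quantile that bounds a two-sided 90% interval.

```python
def extended_search_interval(mean_err_ms, std_err_ms, base_ms=25.0, z=1.645):
    """Return (low, high) offsets of the extended search interval.

    The interval is widened only on the side the mean error points to;
    the other side keeps the original bound of base_ms.
    """
    if mean_err_ms < 0:
        return (mean_err_ms - z * std_err_ms, base_ms)
    return (-base_ms, mean_err_ms + z * std_err_ms)


# Hypothetical statistics in msec, for illustration only.
neg_low, neg_high = extended_search_interval(-10.0, 20.0)
pos_low, pos_high = extended_search_interval(10.0, 20.0)
```

With a mean error of -10 msec and a standard deviation of 20 msec, the interval extends to about -42.9 msec on the left while the right bound stays at 25 msec.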

Once the search interval is determined, if neural network output values at or above the threshold exist within it, the position with the largest such value is selected as the post-processed phoneme boundary (S16).

It is then determined whether a neural network output value at or above the threshold exists in the search interval (S17). If none exists, the automatic phoneme segmentation result is used as is (S15). If one exists, the correction using statistical information is performed in the post-processing part (S18), and post-processing ends (S13).

As described above, the present invention post-processes the results of automatic segmentation with a multilayer neural network, thereby narrowing the range of boundary errors, and can ultimately be used to build large-scale prosody databases for synthesis and to generate synthesis units automatically. It was confirmed that phoneme segmentation performance is improved by the correction using statistical information in the post-processing part, the Feature Sets applied according to the phonological environment, and neural-network learning of acoustic regularities.

On a synthesis database of isolated-word utterances, the results corrected by the post-processor showed an improvement of about 25% over the automatic phoneme segmenter, and the absolute error improved by about 39%.

On a synthesis database of sentence utterances, applying post-processing separately for each phonological environment outperformed applying a single post-processor. Relative to the automatic segmenter, the frame segmentation rate improved by about 17.7% and the absolute error by 28.6%. Post-processing with a neural network thus narrows the range of automatic segmentation errors and can be used to build large-scale prosody databases for synthesis and to generate synthesis units automatically.

Claims

A phoneme segmentation post-processing method for reducing phoneme boundary errors in phoneme segmentation using an automatic phoneme segmenter, the method comprising:

a first step of correcting phoneme boundary errors by a statistical method;

a second step of setting a phoneme boundary search interval by a statistical method after performing the first step;

a third step of applying feature parameters robust to the phonological environment after performing the second step; and

a fourth step of performing post-processing using a neural network after performing the third step.

The method of claim 1, wherein the first step comprises:

a first substep of extracting the mean and standard deviation of the phoneme boundary error, which constitute the statistical information, from a manual phoneme segmentation database;

a second substep of extracting feature parameters for training a neural network after performing the first substep; and

a third substep of training a neural network for each of the extracted feature parameter sets I and II after performing the second substep.

The method of claim 2, wherein, in the third substep,

training is completed when the neural network output value reaches a threshold value and is repeated otherwise.

The method of claim 1, wherein the second step comprises:

a first substep of determining whether training is complete and, if it is not, training the neural network again on the feature parameters; and

a second substep of, if training is complete, extracting the weight values I and II, which are the trained parameters of the neural network, storing them on a hard disk, and finishing the training.

The method of claim 1, wherein the third step comprises:

a first substep of performing a first-pass correction of errors using the statistical information extracted in the training process;

a second substep of receiving the automatic phoneme segmentation database after performing the first substep and extracting a plurality of types of parameters from the manually segmented speech data used to train the neural network;

a third substep of selecting a phonological environment after performing the second substep, with speech features extracted using Feature Set I, a 9-dimensional set comprising the zero-crossing rate and energy extracted in the time domain and band-wise energy information (spectral information in the frequency domain), and Feature Set II, the 13th-order perceptual linear prediction (PLP) coefficients modeling human auditory characteristics; and

a fourth substep of performing post-processing by applying different feature parameters and weight values to each phonological environment after performing the third substep.

The method of claim 5, wherein, in the first substep,

the corrected phoneme position is determined from the automatically segmented phoneme position and the mean error.

The method of claim 1, wherein the fourth step comprises:

a first substep of, once the search interval is determined, selecting as the post-processed phoneme boundary the position with the maximum value among neural network output values at or above a threshold, if such values exist within the interval;

a second substep of determining whether a neural network output value at or above the threshold exists in the search interval and, if none exists, using the automatic phoneme segmentation result as is; and

a third substep of, if a neural network output value at or above the threshold exists in the search interval, performing the correction using statistical information in the post-processing part and then finishing the post-processing.