KR100477980B1

KR100477980B1 - Method of removing the inferior speech synthesis units to improve naturalness of the synthetic speech

Info

Publication number: KR100477980B1
Application number: KR10-2003-0013411A
Authority: KR
Inventors: 엄기완; 김정수; 주기현
Original assignee: 삼성전자주식회사
Priority date: 2003-03-04
Filing date: 2003-03-04
Publication date: 2005-03-23
Also published as: KR20040078460A

Abstract

The present invention relates to a method of removing defective speech synthesis units in corpus-based speech synthesis. By removing defective units in advance, the method improves the stability of the synthesized speech, that is, its naturalness and intelligibility. The method comprises: extracting spectral information of each candidate speech synthesis unit within the same speech synthesis unit; calculating, for each candidate unit, an average spectral similarity value by comparing its spectral information with that of the other candidate units; determining whether the average spectral similarity value of each candidate unit lies within a set threshold range; and, according to the result of the determination, removing from the speech synthesis database the candidate units that fall outside the set threshold range.

Description

Method of removing the inferior speech synthesis units to improve naturalness of the synthetic speech

The present invention relates to the removal of defective speech synthesis units and, in particular, to a method of removing defective speech synthesis units that improves the stability of synthesized speech, that is, its naturalness and intelligibility, by removing such units in advance in corpus-based speech synthesis.

In general, corpus-based speech synthesis keeps a plurality of candidate units for each speech synthesis unit. It selects a sequence of optimal candidate units according to the predicted prosody (pitch, intensity, duration) of the text to be synthesized, and then generates the synthesized speech by concatenating the candidates that have the smallest concatenation cost between them. Some of the candidate units within the same speech synthesis unit, however, are defective units that can degrade the quality of the synthesized speech, either because of phoneme labeling errors made when the synthesis data were created or because of poor articulation by the speaker. When such a defective unit is analyzed in a spectrogram, its formant characteristics have a structure very different from that of the other units. Defective units with such a different structure can therefore be removed by measuring the spectral similarity between candidate units and removing the units whose value is large, that is, the units least similar to the others.
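The selection step described above can be pictured as a shortest-path search over the candidate lists. The following is only a minimal sketch: it assumes the prosody-based preselection has already produced one candidate list per position, and the names `select_units` and `join_cost` are hypothetical, not taken from the patent.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")


def select_units(
    candidates: Sequence[Sequence[U]],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one candidate per position so that the summed join cost is minimal."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[k]: cheapest total cost of any path ending in candidate k
    # of the position processed so far.
    best = [0.0] * len(candidates[0])
    back: List[List[int]] = []  # back-pointers, one list per transition

    for prev, cur in zip(candidates, candidates[1:]):
        new_best: List[float] = []
        pointers: List[int] = []
        for unit in cur:
            costs = [best[k] + join_cost(p, unit) for k, p in enumerate(prev)]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min])
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the last position.
    k = min(range(len(best)), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]
```

A defective candidate can still win this search if it happens to join cheaply, which is why the method removes such candidates from the database before synthesis.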

Among the conventional techniques for securing the stability of synthesized speech is a method that removes, from a plurality of candidate units, those whose prosodic characteristics differ greatly from the average of the candidate units. However, because the formant characteristics of a phoneme and its prosody are almost independent of each other, this method cannot remove units that are defective because of phoneme labeling errors or poor articulation by the speaker.

Korean Registered Patent No. 327903 (February 26, 2002), "Method of removing unnecessary synthesis units to reduce the size of a synthesis database," removes rarely used candidate units from the database to the extent that the quality of the synthesized speech is not affected. In practice, a large number of sentences are synthesized and the units used least often in synthesis are removed. However, even if a particular unit is actually used in synthesis, whether it harms naturalness or intelligibility cannot be known without listening to the synthesized speech. The method is therefore valid for reducing the size of the speech synthesis database, but it does not address the improvement of naturalness and intelligibility.

In addition, a paper on reducing the size of a speech synthesis database by clustering a plurality of candidate speech synthesis units has been published (Alan W Black et al., Proc. Eurospeech97). According to the paper, clustering frequently occurring similar units and using only a few representative units as synthesis data has the side effect of also removing defective units. To compute the acoustic distance between units, the clustering uses a "weighted Mahalanobis distance metric" based on spectral and prosodic information. However, it is difficult to determine the weights of the acoustic feature vectors used in the distance function, and it is difficult to find a correlation between the formant characteristics of a phoneme and its prosody, so this approach does not remove defective units effectively.
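To make the weighting problem concrete, the sketch below shows one common diagonal-covariance form of a weighted Mahalanobis-style distance. It is an illustrative assumption and not necessarily the exact formula of the cited paper; the function name and the `weights` and `variances` parameters are hypothetical.

```python
import math
from typing import Sequence


def weighted_mahalanobis(
    x: Sequence[float],
    y: Sequence[float],
    variances: Sequence[float],
    weights: Sequence[float],
) -> float:
    """sqrt(sum_k w_k * (x_k - y_k)^2 / var_k), assuming a diagonal covariance."""
    if not (len(x) == len(y) == len(variances) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for xk, yk, var, w in zip(x, y, variances, weights):
        if var <= 0:
            raise ValueError("variances must be positive")
        total += w * (xk - yk) ** 2 / var
    return math.sqrt(total)
```

Every feature, spectral or prosodic, needs its own weight here, and the result depends strongly on how those weights are chosen. This is the difficulty the passage above points to.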

The present invention was devised to solve the problems of the prior art described above. Its object is to provide a method of removing defective speech synthesis units that improves the stability of synthesized speech, that is, its naturalness and intelligibility, by removing defective units in advance in corpus-based speech synthesis.

To achieve the above object, the method of removing defective speech synthesis units according to the present invention comprises: extracting spectral information of each candidate speech synthesis unit within the same speech synthesis unit; calculating, for each candidate unit, an average spectral similarity value by comparing its spectral information with that of the other candidate units; determining whether the average spectral similarity value of each candidate unit lies within a set threshold range; and, according to the result of the determination, removing from the speech synthesis database the candidate units that fall outside the set threshold range.
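The four steps can be sketched as a single function with the spectrum extractor and the comparison measure passed in as parameters. This is a minimal sketch under stated assumptions: the source speaks of a "threshold range", which is simplified here to a single upper bound, and the names `prune_candidates`, `extract_spectrum`, and `dissimilarity` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # a candidate unit
S = TypeVar("S")  # its spectral representation


def prune_candidates(
    units: Sequence[U],
    extract_spectrum: Callable[[U], S],
    dissimilarity: Callable[[S, S], float],
    threshold: float,
) -> Tuple[List[U], List[U]]:
    """Return (kept, removed) for the candidates of one speech synthesis unit."""
    # Step 1: extract spectral information for every candidate.
    spectra = [extract_spectrum(u) for u in units]
    n = len(units)
    if n < 2:
        return list(units), []  # nothing to compare against

    kept: List[U] = []
    removed: List[U] = []
    for i in range(n):
        # Step 2: average value against all other candidates.
        values = [dissimilarity(spectra[i], spectra[j]) for j in range(n) if j != i]
        mean_value = sum(values) / len(values)
        # Steps 3 and 4: compare with the threshold and keep or remove.
        if mean_value <= threshold:
            kept.append(units[i])
        else:
            removed.append(units[i])
    return kept, removed
```

With toy scalar "spectra", `prune_candidates([1.0, 1.1, 0.9, 5.0], lambda u: u, lambda a, b: abs(a - b), 2.0)` keeps the first three candidates and removes the outlier `5.0`.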

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system for removing defective speech synthesis units to improve the naturalness of synthesized speech according to the present invention. The system comprises a speech synthesis database unit (1), an extraction unit (3), a calculation unit (5), a comparison/determination unit (7), and a control unit (9).

The speech synthesis database unit (1) stores information on the candidate units within the same speech synthesis unit. The extraction unit (3) extracts the spectral information of each candidate unit stored in the database unit (1). The calculation unit (5) calculates, for each candidate unit, an average spectral similarity value by comparing its spectral information with that of the other candidate units. The comparison/determination unit (7) determines whether the average spectral similarity value of each candidate unit lies within the set threshold range. The control unit (9) removes from the speech synthesis database unit (1), according to the result of the determination, the candidate units that fall outside the set threshold range.

The method of removing defective speech synthesis units according to the present invention, using the system configured as described above, is as follows.

FIG. 2 is a flowchart illustrating the method of removing defective speech synthesis units to improve the naturalness of synthesized speech according to the present invention.

As shown in FIG. 2, the extraction unit (3) extracts the spectral information of each candidate unit within the same speech synthesis unit. To extract more accurate information, it is first determined whether the candidate unit is voiced or unvoiced (S101). If the candidate unit is voiced, its spectrum is extracted by pitch-synchronous analysis (S103); if it is unvoiced, its spectrum is extracted by pitch-asynchronous analysis (S105).
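The difference between the two analysis modes lies mainly in where the analysis windows are placed. The sketch below illustrates one plausible reading: for voiced units, windows are centred on pitch marks and span two pitch periods; for unvoiced units, windows have a fixed length and hop. The source does not specify how voicing is decided or how windows are chosen, so the `voiced` flag, the `pitch_marks` input, and the default `frame_len` and `hop` values are all assumptions.

```python
from typing import List, Sequence, Tuple


def analysis_windows(
    n_samples: int,
    voiced: bool,
    pitch_marks: Sequence[int] = (),
    frame_len: int = 256,
    hop: int = 128,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges over which spectra would be computed."""
    if voiced:
        # Pitch-synchronous: one window per interior pitch mark,
        # running from the previous mark to the next one.
        marks = sorted(m for m in pitch_marks if 0 <= m < n_samples)
        if len(marks) < 3:
            raise ValueError("need at least three pitch marks for a voiced unit")
        return [(marks[i - 1], marks[i + 1]) for i in range(1, len(marks) - 1)]

    # Pitch-asynchronous: fixed-length windows at a fixed hop.
    if n_samples < frame_len:
        return [(0, n_samples)]
    return [(s, s + frame_len) for s in range(0, n_samples - frame_len + 1, hop)]
```

A spectrum (for example, an FFT magnitude or cepstral vector) would then be computed for each returned window; that step is omitted here.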

Based on the extracted spectral information, the calculation unit (5) calculates, for each candidate unit, the average spectral similarity value against the other candidate units, using dynamic time warping (DTW) to compensate for differences in length between candidate units. Specifically, the spectral information of each candidate unit is compared with that of the other candidate units to assess their mutual spectral similarity (S107); the similarity values of each candidate unit against each reference unit are calculated (S109); and the average of these similarity values is calculated for each candidate unit (S111). The comparison/determination unit (7) then determines whether the average spectral similarity value of each candidate unit lies within the threshold range set as the criterion for a defective unit (S113). According to the result, the control unit (9) keeps in the database unit (1) the candidate units that lie within the threshold range (S115) and removes from the database unit (1) those that fall outside it (S117).
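Steps S107 to S111 can be sketched with a textbook DTW recurrence. The source names DTW but gives no local distance, step pattern, or normalisation, so the Euclidean frame distance, the three-way step pattern, and the division by the summed lengths below are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two spectral frames of equal dimension."""
    if len(a) != len(b):
        raise ValueError("frames must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(s: Sequence[Frame], t: Sequence[Frame]) -> float:
    """Length-normalised DTW alignment cost between two frame sequences."""
    if not s or not t:
        raise ValueError("sequences must be non-empty")
    n, m = len(s), len(t)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(s[i - 1], t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def mean_distances(spectra: Sequence[Sequence[Frame]]) -> List[float]:
    """For each candidate, the mean DTW distance to all other candidates."""
    n = len(spectra)
    if n < 2:
        raise ValueError("need at least two candidates")
    means = []
    for i in range(n):
        total = sum(dtw_distance(spectra[i], spectra[j]) for j in range(n) if j != i)
        means.append(total / (n - 1))
    return means
```

Because DTW lets one frame align with several frames of the other sequence, two candidates that differ only in duration yield a small value. Each entry of `mean_distances` would then be compared with the threshold in S113.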

FIG. 3 is an example comparing the average spectral similarity of a given candidate speech synthesis unit with that of the other candidate units according to the present invention.

As shown in FIG. 3, the candidate units all share the same left and right phonetic context. In the spectral information, the formant characteristics of the candidate unit actually used in synthesis differ greatly from those of the other candidate units. When the average spectral similarity values calculated from the spectral information of the candidate units are compared, the value for the candidate unit actually used in synthesis is 3.83, far larger than the values of the other candidate units. Because the extracted spectrum of a defective unit shows formant characteristics very different from those of the other candidate units, the naturalness and intelligibility of synthesized speech can be improved by measuring, for each candidate unit, the average spectral similarity value against the other candidate units and removing from the speech synthesis database the units whose average value is large compared with the others.
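The criterion illustrated by FIG. 3 can be written compactly. The notation is an assumption introduced here for illustration: $N$ is the number of candidate units, $S_i$ is the spectral sequence of candidate $u_i$, $D$ is the DTW-based comparison measure (larger means less similar), and $\theta$ stands for the upper end of the threshold range, whose value the source does not state.

```latex
\bar{D}_i = \frac{1}{N-1} \sum_{j \neq i} D(S_i, S_j),
\qquad
u_i \text{ is removed if } \bar{D}_i > \theta
```

Under this reading, the unit with $\bar{D}_i = 3.83$ in FIG. 3 is removed whenever $\theta$ is set below 3.83.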

Although the method of removing defective speech synthesis units according to the present invention has been described with reference to the illustrated drawings, the present invention is not limited by the embodiments and drawings disclosed in this specification, and various modifications may of course be made by those skilled in the art within the scope of the technical idea of the invention.

As described above, the method of removing defective speech synthesis units according to the present invention has the advantage of improving the stability of synthesized speech, that is, its naturalness and intelligibility, by removing defective units in advance in corpus-based speech synthesis.

In addition, because the defective units are removed in advance, the memory they would occupy in the speech synthesis database is saved, and the speed of speech synthesis can also be improved.

FIG. 1 is a block diagram of a system for removing defective speech synthesis units to improve the naturalness of synthesized speech according to the present invention.

FIG. 3 is an example comparing the average spectral similarity of a given candidate speech synthesis unit with that of the other candidate units according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

1: speech synthesis database unit

3: extraction unit

5: calculation unit

7: comparison/determination unit

9: control unit

Claims

A method of removing defective speech synthesis units to improve the naturalness of synthesized speech, the method comprising:

extracting spectral information of each candidate speech synthesis unit within the same speech synthesis unit;

calculating, for each candidate speech synthesis unit, an average spectral similarity value by comparing its spectral information with that of the other candidate units;

determining whether the average spectral similarity value of each candidate speech synthesis unit is within a set threshold range; and

removing from the speech synthesis database, according to the determination result, the candidate speech synthesis units that fall outside the set threshold range.

The method of claim 1, wherein the extracting of the spectral information comprises:

determining whether the candidate speech synthesis unit is voiced or unvoiced; and

according to the determination result, generating the corresponding spectrum by pitch-synchronous analysis if the candidate speech synthesis unit is voiced, and by pitch-asynchronous analysis if it is unvoiced.

The method of claim 1, wherein the calculating of the average spectral similarity value comprises:

calculating, for each candidate speech synthesis unit, similarity values against the reference by comparing its spectral information with that of the other candidate units to assess their mutual spectral similarity; and

calculating, for each candidate speech synthesis unit, the average of its similarity values against the reference.