KR101567566B1

KR101567566B1 - System and Method for Statistical Speech Synthesis with Personalized Synthetic Voice

Info

Publication number: KR101567566B1
Application number: KR1020140061532A
Authority: KR
Inventors: 김형순; 반성민; 최영호
Original assignee: 부산대학교 산학협력단
Priority date: 2014-05-22
Filing date: 2014-05-22
Publication date: 2015-11-06

Abstract

The present invention relates to a statistical speech synthesis system and method reflecting a personal voice timbre, in which the statistical speech synthesis system generates synthetic speech reflecting the timbre characteristics of a patient with a speech disorder so as to improve the patient's communication. The system includes: a monophthong speech collection unit that collects a monophthong (simple vowel) utterance from the patient; a speaker adaptation parameter extraction unit that compares the collected monophthong speech with the corresponding monophthong model among the acoustic models of the statistical speech synthesis system and extracts a formant-based bilinear transform speaker adaptation parameter; a synthetic speech generation unit that applies the extracted speaker adaptation parameter to generate synthetic speech reflecting the speaker's characteristics; and a synthetic speech tuning unit that selectively tunes the timbre of the generated synthetic speech.

Description

System and method for statistical speech synthesis reflecting a personal voice timbre

The present invention relates to statistical speech synthesis, and more specifically to a statistical speech synthesis system and method reflecting a personal voice timbre, in which the statistical speech synthesis system generates synthetic speech reflecting the timbre characteristics of a patient with a speech disorder so that communication can be improved.

Currently commercialized speech synthesis systems mainly use corpus-based synthesis, in which expressing the timbre of a specific speaker requires a large amount of speech data and a great deal of work.

Nevertheless, if a speech disorder progresses gradually, a large amount of speech can be recorded from the patient before the disorder becomes severe, and synthetic speech with the patient's own timbre can be built from those recordings.

However, this approach cannot be applied to patients whose speech disorder has already progressed considerably.

Statistical speech synthesis, developed after corpus-based synthesis, synthesizes speech using the statistical acoustic models previously used for speech recognition. It provides a speaker adaptation function that generates synthetic speech close to a specific speaker's timbre from a relatively small amount of that speaker's speech data.

FIG. 1 shows the configuration of a prior-art statistical speech synthesis system with a speaker adaptation function.

The prior-art statistical speech synthesis system with a speaker adaptation function includes: collecting large-scale speech data from a specific speaker or from multiple speakers (S100) and performing acoustic modeling for statistical speech synthesis (S101); building a specific-speaker or average-voice acoustic model (S102) and performing speaker adaptation (S104) using that model together with small-scale speech data from the target speaker (S103); and generating synthetic speech (S106) from the acoustic model adapted to the target speaker (S105).

Even with this statistical speech synthesis approach, the amount of speech data needed for speaker adaptation is known to be roughly 6 to 7 minutes (about 100 sentences). Patients with speech disorders have great difficulty uttering this amount. In addition, the adaptation process captures not only the timbre characteristics the patient wants but also the unintelligibility and noise characteristics of the disordered speech, so the intelligibility of the speaker-adapted synthetic speech is seriously degraded.

To address this problem, the prior art includes a method that replaces some parameters of the acoustic model adapted to the patient's speech with the parameters of the existing average-voice model, and a method that selects and edits the highly intelligible portions of the patient's speech and builds a speaker-adapted model from them.

Both methods improve the intelligibility of the synthetic speech to some extent compared with simply adapting to the patient's speech as it is, but the intelligibility remains much lower than that of synthetic speech built for a speaker without a speech disorder.

Recently, a method was proposed that applies vocal tract length normalization (VTLN) to statistical speech synthesis and performs speaker adaptation on an utterance from a specific speaker as short as about one sentence. Applying this method to speaker adaptation for a patient with a speech disorder has the advantage of greatly reducing the patient's recording burden.

However, even with this method, the patient's unclear pronunciation makes it difficult to align the one-sentence utterance with the existing phoneme-level acoustic models, so the process is not easy to automate. Moreover, because VTLN is performed directly on the MGC (Mel-Generalized Cepstrum) parameters, which are the spectral feature parameters of statistical speech synthesis, the method is not effective at capturing the individuality of a patient with a speech disorder, as opposed to a typical speaker.

Korean Patent Laid-Open Publication No. 10-2011-0021944
Korean Patent Laid-Open Publication No. 10-2013-0109902

The present invention is intended to solve these problems of prior-art statistical speech synthesis systems. Its object is to provide a statistical speech synthesis system and method reflecting a personal voice timbre, in which the statistical speech synthesis system generates synthetic speech reflecting the timbre characteristics of a patient with a speech disorder so that communication can be improved.

Another object of the present invention is to provide a method for patients with speech disorders, excluding those so severely affected that listeners cannot identify which vowel a monophthong utterance is meant to be. The method learns the patient's personal timbre characteristics from a single monophthong that the patient can utter most clearly, and generates synthetic speech reflecting those characteristics in a statistical speech synthesis system.

Because patients with speech disorders have difficulty communicating normally with others, they sometimes use augmentative and alternative communication (AAC) devices with a speech synthesis function as an aid. However, the limited range of synthetic voice timbres offered by AAC devices prevents patients from expressing the personal timbre they want. To overcome this problem, a further object of the present invention is to provide a statistical speech synthesis system and method reflecting a personal voice timbre, which learns the patient's timbre characteristics from the patient's own speech data and generates synthetic speech reflecting the patient's individuality.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve these objects, a statistical speech synthesis system reflecting a personal voice timbre according to the present invention includes: a monophthong speech collection unit that collects monophthong speech from a patient with a speech disorder; a speaker adaptation parameter extraction unit that compares the monophthong speech collected by the monophthong speech collection unit with the corresponding monophthong model among the acoustic models of the statistical speech synthesis system and extracts a formant-based bilinear transform speaker adaptation parameter; a synthetic speech generation unit that applies the speaker adaptation parameter extracted by the speaker adaptation parameter extraction unit to generate synthetic speech reflecting the speaker's characteristics; and a synthetic speech tuning unit that selectively tunes the timbre of the synthetic speech generated by the synthetic speech generation unit.

Here, the speaker adaptation parameter extraction unit includes: a framing unit that divides the collected monophthong speech data into frames; a formant frequency extraction unit that extracts three formant frequencies, namely the first formant (F1), second formant (F2), and third formant (F3) frequencies, for the 2K + 1 frames consisting of the frame with the highest frame energy and the K frames on either side of it; a median determination unit that takes the median of the per-frame formant frequencies extracted by the formant frequency extraction unit as the formant frequencies of the monophthong; and a speaker adaptation parameter determination unit that obtains, as the speaker adaptation parameter, the bilinear transform coefficient α_SA that minimizes the weighted sum of squared errors when the formant frequencies (F1, F2, and F3) extracted from the patient's specific monophthong are expressed as the bilinear transform of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that monophthong.

To achieve another object, a statistical speech synthesis method reflecting a personal voice timbre according to the present invention includes: collecting monophthong speech from a patient with a speech disorder; comparing the collected monophthong speech with the corresponding monophthong model among the acoustic models of the statistical speech synthesis system and extracting a formant-based bilinear transform speaker adaptation parameter; and applying the extracted speaker adaptation parameter to generate synthetic speech reflecting the speaker's characteristics.

Here, in the step of applying the extracted speaker adaptation parameter to generate synthetic speech reflecting the speaker's characteristics, a step of tuning the timbre based on listening to the generated synthetic speech is optionally also performed.

The step of extracting the formant-based bilinear transform speaker adaptation parameter includes: dividing the collected monophthong speech data into frames; extracting three formant frequencies, namely the first formant (F1), second formant (F2), and third formant (F3) frequencies, for the 2K + 1 frames consisting of the frame with the highest frame energy and the K frames on either side of it; taking the median of the extracted per-frame formant frequencies as the formant frequencies of the monophthong; and obtaining, as the speaker adaptation parameter, the bilinear transform coefficient α_SA that minimizes the weighted sum of squared errors when the formant frequencies (F1, F2, and F3) extracted from the patient's specific monophthong are expressed as the bilinear transform of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that monophthong.

In the step of obtaining, as the speaker adaptation parameter, the bilinear transform coefficient α_SA that minimizes the weighted sum of squared errors, the bilinear transform in the frequency (Hz) domain is

$$ e^{-j 2\pi \tilde{f}/f_s} = \frac{e^{-j 2\pi f/f_s} - \alpha}{1 - \alpha\, e^{-j 2\pi f/f_s}}, $$

where $f$ and $\tilde{f}$ are the frequencies before and after the bilinear transform, respectively, $f_s$ is the sampling frequency, and α is the bilinear transform parameter.

Rearranging the bilinear transform in the frequency (Hz) domain for $\tilde{f}$ gives the function $B_\alpha(f)$:

$$ \tilde{f} = B_\alpha(f) = f + \frac{f_s}{\pi}\arctan\!\left(\frac{\alpha \sin(2\pi f/f_s)}{1 - \alpha \cos(2\pi f/f_s)}\right). $$

Using the bilinear transform function $B_\alpha(f)$, the weighted sum of squared errors between the formant frequencies (F1, F2, and F3) extracted from the patient's specific monophthong and the bilinear-transformed values of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that monophthong is

$$ WSE(\alpha) = \sum_{i=1}^{3} w_i \left( F_i - B_\alpha(F_{i\_M}) \right)^2, $$

where the weight $w_i$ reflects the reliability of the estimated i-th formant frequency, given that formant frequency extraction is less reliable for the speech of a patient with a speech disorder than for typical speech.

The weight $w_i$ reflecting the reliability of each formant frequency is computed as

$$ w_i = g\!\left( \frac{\mu_i}{\sigma_i} \right), $$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the i-th formant frequencies $F_i(k)$ over the 2K + 1 frames used, and g(x) is a function that is monotonically increasing for x ≥ 0.

In the step of tuning the timbre based on listening to the generated synthetic speech, the intonation characteristic, which among the prosodic characteristics of the synthetic speech represents how high or low the pitch is, is expressed by the log F0 parameter of the acoustic model. If, for the j-th frame of the synthetic speech, the existing log F0 value is LF0(j) and the converted log F0 value is $\widetilde{LF0}(j)$, the conversion is

$$ \widetilde{LF0}(j) = LF0(j) + LF0_{SA}, $$

where LF0_SA is a user-specified parameter for converting the intonation characteristic of the synthetic speech. The pitch rises when LF0_SA > 0 and falls when LF0_SA < 0, and the LF0_SA value is chosen by listening to the synthetic speech while adjusting it.
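The log F0 conversion described above amounts to adding a constant offset to every frame. A minimal Python sketch follows, under stated assumptions: the function name `shift_log_f0` is hypothetical, and unvoiced frames are represented here as `None` and left untouched, which is a choice made for this illustration rather than something the text specifies.

```python
import math
from typing import List, Optional


def shift_log_f0(lf0_track: List[Optional[float]], lf0_sa: float) -> List[Optional[float]]:
    """Add the user-chosen offset lf0_sa to every voiced log F0 value.

    Assumption for this sketch: unvoiced frames are marked with None
    and are passed through unchanged.
    """
    return [None if v is None else v + lf0_sa for v in lf0_track]


# Example: raise the pitch of a short track by 10 percent.
track = [math.log(100.0), None, math.log(120.0)]
raised = shift_log_f0(track, math.log(1.1))
```

Because the offset is applied in the log domain, a positive LF0_SA multiplies F0 by a constant factor; an offset of log(1.1), for instance, raises the pitch by 10 percent in every voiced frame.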

In the step of tuning the timbre based on listening to the generated synthetic speech, when the bilinear transform using the obtained speaker adaptation parameter and the bilinear transform for mel-scale frequency warping are cascaded and expressed as a single bilinear transform, the bilinear transform parameter α_F is

$$ \alpha_F = \frac{\alpha_{SA} + \alpha_M}{1 + \alpha_{SA}\,\alpha_M}, $$

where α_SA is the bilinear transform coefficient that minimizes the weighted sum of squared errors, and α_M is the bilinear transform coefficient for mel-scale frequency warping.

For the spectral conversion using the bilinear transform parameter α_F as well, if the patient wishes to change the spectral characteristics further, the patient listens to the synthetic speech while the α_F value is adjusted, and the α_F value that the patient finds most satisfactory is selected.

The statistical speech synthesis system and method reflecting a personal voice timbre according to the present invention have the following effects.

First, synthetic speech carrying the main timbre characteristics of a patient with a speech disorder can be generated from the patient's monophthong utterance alone.

Second, communication can be improved by generating, in a statistical speech synthesis system, synthetic speech that reflects the patient's timbre characteristics.

Third, the patient's timbre characteristics are learned from the patient's own speech data to generate synthetic speech reflecting the patient's individuality, so the patient is not restricted to the synthetic voice timbres provided by an AAC device.

FIG. 1 is a flowchart of a prior-art statistical speech synthesis method with a speaker adaptation function.
FIG. 2 is a block diagram of a statistical speech synthesis system reflecting a personal voice timbre according to the present invention.
FIG. 3 is a detailed block diagram of the speaker adaptation parameter extraction unit according to the present invention.
FIG. 4 is a flowchart of a statistical speech synthesis method reflecting a personal voice timbre according to the present invention.

Preferred embodiments of the statistical speech synthesis system and method reflecting a personal voice timbre according to the present invention are described in detail below.

The features and advantages of the statistical speech synthesis system and method reflecting a personal voice timbre according to the present invention will become apparent from the following detailed description of each embodiment.

FIG. 2 is a block diagram of a statistical speech synthesis system reflecting a personal voice timbre according to the present invention, and FIG. 3 is a detailed block diagram of the speaker adaptation parameter extraction unit according to the present invention.

The present invention learns a person's individual timbre characteristics from a single monophthong utterance that the person can pronounce most clearly, and generates synthetic speech reflecting those characteristics in a statistical speech synthesis system, so that communication can be improved.

To this end, as shown in FIG. 2, the statistical speech synthesis system reflecting a personal voice timbre according to the present invention includes: a monophthong speech collection unit (20) that collects monophthong speech from a patient with a speech disorder; a speaker adaptation parameter extraction unit (30) that compares the collected monophthong speech with the corresponding monophthong model among the acoustic models of the statistical speech synthesis system and extracts a formant-based bilinear transform speaker adaptation parameter; a synthetic speech generation unit (40) that applies the extracted speaker adaptation parameter to generate synthetic speech reflecting the speaker's characteristics; and a synthetic speech tuning unit (50) that selectively tunes the timbre according to the user's preference, based on listening to the generated synthetic speech.

As shown in FIG. 3, the speaker adaptation parameter extraction unit (30) includes: a framing unit (31) that divides the collected monophthong speech data into frames; a formant frequency extraction unit (32) that extracts three formant frequencies, namely the first formant (F1), second formant (F2), and third formant (F3) frequencies, for the 2K + 1 frames consisting of the frame with the highest frame energy and the K frames on either side of it; a median determination unit (33) that takes the median of the per-frame formant frequencies extracted by the formant frequency extraction unit (32) as the formant frequencies of the monophthong; and a speaker adaptation parameter determination unit (34) that obtains, as the speaker adaptation parameter, the bilinear transform coefficient α_SA that minimizes the weighted sum of squared errors when the formant frequencies (F1, F2, and F3) extracted from the patient's specific monophthong are expressed as the bilinear transform of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that monophthong.

The speech synthesis process in the statistical speech synthesis system reflecting a personal voice timbre according to the present invention is as follows.

FIG. 4 is a flowchart of a statistical speech synthesis method reflecting a personal voice timbre according to the present invention.

The statistical speech synthesis method reflecting a personal voice timbre according to the present invention consists broadly of three steps: collecting monophthong speech from a patient with a speech disorder (S401); comparing the collected monophthong speech with the corresponding monophthong model among the acoustic models (S402) of the statistical speech synthesis system and extracting a formant-based bilinear transform speaker adaptation parameter (S403); and applying the extracted speaker adaptation parameter to generate synthetic speech reflecting the speaker's characteristics (S404). A step of tuning the timbre according to the user's preference, based on listening to the generated synthetic speech (S405), can optionally be added.

Each step of the present invention is described in detail below.

First, in the step of collecting monophthong speech from a patient with a speech disorder (S401), the patient utters whichever monophthong (/아/ (a), /어/ (eo), /오/ (o), /우/ (u), /이/ (i), /에/ (e), etc.) he or she can pronounce most clearly, and the utterance is collected with a digital recording device such as a personal computer.

The recording is played back, and if it is judged to be unclear or unnatural, recording and listening are repeated until the clearest utterance is collected.

Next, the step (S403) of comparing the collected monophthong speech with the corresponding monophthong model among the acoustic models of the statistical speech synthesis system and extracting the formant-based bilinear transform speaker adaptation parameter proceeds as follows.

First, the collected monophthong speech data are divided into frames in the same way as in ordinary statistical speech synthesis, and three formant frequencies, namely the first formant (F1), second formant (F2), and third formant (F3) frequencies, are extracted for the 2K + 1 frames consisting of the frame with the highest frame energy and the K frames on either side of it.

The value of K is determined experimentally, and the formant frequencies are expressed in Hz.
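The frame selection just described can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the function names, the frame length and hop arguments, and the rule of sliding the window inward when the highest-energy frame lies within K frames of either end are all assumptions introduced here.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a signal into frames of frame_len samples, advancing by hop.

    An incomplete frame at the end of the signal is dropped.
    """
    return [list(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, hop)]


def frame_energy(frame: Sequence[float]) -> float:
    """Frame energy as the sum of squared samples."""
    return sum(x * x for x in frame)


def select_center_frames(frames: List[List[float]], K: int) -> List[int]:
    """Indices of the highest-energy frame and K frames on either side.

    Assumption for this sketch: if the peak is closer than K frames to
    an edge, the 2K + 1 window is shifted inward so it stays in range.
    """
    width = 2 * K + 1
    if len(frames) < width:
        raise ValueError("need at least 2K + 1 frames")
    peak = max(range(len(frames)), key=lambda i: frame_energy(frames[i]))
    start = min(max(peak - K, 0), len(frames) - width)
    return list(range(start, start + width))
```

With K = 2, for example, the function returns five consecutive frame indices centred on the loudest frame, and formant analysis is then run on those frames only.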

These formant frequencies are obtained by conventional linear prediction (LP) analysis. Letting F1(k), F2(k), and F3(k) denote the formant frequencies of the k-th of the 2K + 1 frames, the formant frequencies F1, F2, and F3 of the collected monophthong are determined as the medians of the per-frame formant frequencies, as in Equation 1:

$$ F_i = \mathrm{Median}\{ F_i(k) \}, \quad k = 1, \dots, 2K+1, \quad i = 1, 2, 3 \tag{1} $$

Here, Median{ } denotes the median of the values inside { }.
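The median step can be illustrated with a short Python sketch. It assumes the per-frame formants have already been estimated (the LP analysis itself is not shown), and the function name `vowel_formants` and the sample values are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def vowel_formants(per_frame: List[Tuple[float, float, float]]) -> Tuple[float, float, float]:
    """Median of F1, F2 and F3 (Hz) over the selected 2K + 1 frames."""
    f1, f2, f3 = zip(*per_frame)
    return median(f1), median(f2), median(f3)


# Hypothetical per-frame estimates; the third frame has a spurious F1.
frames = [
    (700.0, 1200.0, 2500.0),
    (710.0, 1250.0, 2600.0),
    (2000.0, 1190.0, 2550.0),
    (690.0, 1210.0, 2580.0),
    (705.0, 1230.0, 2570.0),
]
f1, f2, f3 = vowel_formants(frames)
```

Taking the median rather than the mean means that a single badly estimated frame, such as the 2000 Hz F1 value in the sample data, does not pull the result away from the other frames.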

The acoustic model of the statistical speech synthesis system is obtained from the speech of a specific speaker or of multiple speakers, following ordinary statistical speech synthesis practice.

From this acoustic model, the triphone model corresponding to the collected monophthong, that is, the silence-monophthong-silence model, is selected. The mean MGC parameters of its most central state are converted to line spectral pair (LSP) parameters, that is, MGC-LSP parameters, from which the first, second, and third formant frequencies are extracted in the same way. These are called the acoustic-model formant frequencies F1_M, F2_M, and F3_M.

In the speaker adaptation process of the present invention, the bilinear transform coefficient α_SA that minimizes the weighted sum of squared errors, when the formant frequencies (F1, F2, and F3) extracted from the patient's specific monophthong are expressed as the bilinear transform of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that monophthong, is obtained as the speaker adaptation parameter.

This process is described in more detail below.

First, the bilinear transform in the frequency (Hz) domain is given by Equation 2:

$$ e^{-j 2\pi \tilde{f}/f_s} = \frac{e^{-j 2\pi f/f_s} - \alpha}{1 - \alpha\, e^{-j 2\pi f/f_s}} \tag{2} $$

Here, $f$ and $\tilde{f}$ denote the frequencies before and after the bilinear transform, respectively, $f_s$ is the sampling frequency, and α is the bilinear transform parameter. Rearranging Equation 2 for $\tilde{f}$ gives the function $B_\alpha(f)$ in Equation 3:

$$ \tilde{f} = B_\alpha(f) = f + \frac{f_s}{\pi}\arctan\!\left(\frac{\alpha \sin(2\pi f/f_s)}{1 - \alpha \cos(2\pi f/f_s)}\right) \tag{3} $$
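As an illustration of the frequency-domain bilinear transform, the Python sketch below implements the standard phase response of a first-order all-pass filter, expressed in Hz. It is offered under assumptions: the function name `bilinear_warp_hz` and the explicit sampling-rate argument `fs_hz` are introduced here, and the sign convention is the one under which a positive α stretches low frequencies, as in mel-scale warping.

```python
import math


def bilinear_warp_hz(f_hz: float, alpha: float, fs_hz: float) -> float:
    """Warp a frequency in Hz with a first-order all-pass (bilinear) transform.

    alpha must satisfy |alpha| < 1; alpha = 0 leaves the frequency unchanged.
    fs_hz is the sampling rate (an assumed extra input for this sketch).
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    w = 2.0 * math.pi * f_hz / fs_hz  # normalized angular frequency
    warped_w = w + 2.0 * math.atan(alpha * math.sin(w) / (1.0 - alpha * math.cos(w)))
    return warped_w * fs_hz / (2.0 * math.pi)
```

The transform keeps 0 Hz and the Nyquist frequency fixed and shifts everything in between: upward for α > 0 and downward for α < 0, which is why a single α can model an overall shift of the formant pattern.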

Using the bilinear transform function $B_\alpha(f)$ of Equation 3, the weighted sum of squared errors between the formant frequencies (F1, F2, and F3) extracted from the patient's specific monophthong and the bilinear-transformed values of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that monophthong is given by Equation 4:

$$ WSE(\alpha) = \sum_{i=1}^{3} w_i \left( F_i - B_\alpha(F_{i\_M}) \right)^2 \tag{4} $$

Here, the weight $w_i$ reflects the reliability of the estimated i-th formant frequency, given that formant frequency extraction is less reliable for the speech of a patient with a speech disorder than for typical speech.
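The weighted sum of squared errors can be sketched in Python as below. The warping helper assumes the standard first-order all-pass form with an explicit sampling rate, and all function and argument names are hypothetical; the formant values used in the example are made up for illustration.

```python
import math
from typing import Sequence


def bilinear_warp_hz(f_hz: float, alpha: float, fs_hz: float) -> float:
    """First-order all-pass frequency warping in Hz (assumed standard form)."""
    w = 2.0 * math.pi * f_hz / fs_hz
    return f_hz + (fs_hz / math.pi) * math.atan(
        alpha * math.sin(w) / (1.0 - alpha * math.cos(w)))


def weighted_squared_error(alpha: float,
                           patient: Sequence[float],
                           model: Sequence[float],
                           weights: Sequence[float],
                           fs_hz: float) -> float:
    """Weighted squared error between patient formants and warped model formants."""
    return sum(w * (f - bilinear_warp_hz(fm, alpha, fs_hz)) ** 2
               for f, fm, w in zip(patient, model, weights))


# Made-up model formants, and patient formants generated with alpha = 0.05.
model = (700.0, 1200.0, 2500.0)
patient = tuple(bilinear_warp_hz(fm, 0.05, 16000.0) for fm in model)
weights = (1.0, 1.0, 1.0)
```

In this constructed example the error vanishes at α = 0.05 and is positive elsewhere, so minimizing it over α recovers the warping that maps the model formants onto the patient's.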

Equation 5 can be used to compute the weight $w_i$ reflecting the reliability of each formant frequency:

$$ w_i = g\!\left( \frac{\mu_i}{\sigma_i} \right) \tag{5} $$

Here, $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the i-th formant frequencies $F_i(k)$ over the 2K + 1 frames used in Equation 1.

The function g(x) is monotonically increasing for x ≥ 0.

Equation 5 shows that the smaller the standard deviation is relative to the mean of the i-th formant frequency, the more reliable the formant estimate is judged to be, and the larger the weight becomes. The specific form of g(x) is determined experimentally.
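A small Python sketch of the reliability weight follows. The text leaves g(x) to be chosen experimentally, so the default g used here, log(1 + x), is only an illustrative choice of a function that increases monotonically for x ≥ 0. The use of the population standard deviation and the small `eps` guard against a zero deviation are likewise assumptions of this sketch.

```python
import math
from statistics import mean, pstdev
from typing import Callable, Sequence


def formant_weight(values: Sequence[float],
                   g: Callable[[float], float] = math.log1p,
                   eps: float = 1e-9) -> float:
    """Reliability weight of one formant from its per-frame estimates (Hz).

    The weight is g(mean / standard deviation); g and eps are
    illustrative assumptions, not values given in the text.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return g(mu / (sigma + eps))


stable = [700.0, 705.0, 695.0, 702.0, 698.0]   # consistent F1 estimates
noisy = [700.0, 900.0, 500.0, 800.0, 600.0]    # scattered F1 estimates
```

Both sample tracks have the same mean of 700 Hz, but the stable one receives the larger weight because its estimates barely vary from frame to frame.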

The value of α that minimizes the weighted sum of squared errors of Equation (4) is found as follows. The search range from α_min to α_max is divided evenly into n − 1 intervals, which gives n candidate α values. Each candidate is substituted into Equation (4), and the α value that minimizes WSE(α) is taken as α_SA.

Here, the values of α_min, α_max, and n are determined experimentally. The α_SA value obtained in this way is used as the single speaker adaptation parameter that represents the timbre characteristics of the patient with a speech disorder.
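The grid search can be sketched in Python as below. Several things are assumptions rather than the patent's own definitions: the warping function uses the standard first-order all-pass phase response, since Equation (3) is not reproduced in this text; the 16 kHz sampling rate, the formant values, and the search range are illustrative; and `warp_hz`, `wse`, and `search_alpha_sa` are hypothetical names.

```python
import math


def warp_hz(f_hz, alpha, fs_hz=16000.0):
    """Warp a frequency in Hz with a first-order all-pass (bilinear) transform.

    Assumed form: w' = w + 2*atan(alpha*sin(w) / (1 - alpha*cos(w))),
    where w = 2*pi*f/fs is the normalized angular frequency.
    """
    w = 2.0 * math.pi * f_hz / fs_hz
    w_warped = w + 2.0 * math.atan2(alpha * math.sin(w), 1.0 - alpha * math.cos(w))
    return w_warped * fs_hz / (2.0 * math.pi)


def wse(alpha, patient_hz, model_hz, weights, fs_hz=16000.0):
    """Weighted sum of squared errors between patient and warped model formants."""
    return sum(
        w * (f - warp_hz(f_m, alpha, fs_hz)) ** 2
        for f, f_m, w in zip(patient_hz, model_hz, weights)
    )


def search_alpha_sa(patient_hz, model_hz, weights, alpha_min, alpha_max, n, fs_hz=16000.0):
    """Evaluate n evenly spaced alpha values and return the one minimizing WSE."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (alpha_max - alpha_min) / (n - 1)
    candidates = [alpha_min + k * step for k in range(n)]
    return min(candidates, key=lambda a: wse(a, patient_hz, model_hz, weights, fs_hz))


# Illustrative data: "patient" formants are the model formants warped by 0.05.
model = (730.0, 1090.0, 2440.0)
patient = tuple(warp_hz(f, 0.05) for f in model)
alpha_sa = search_alpha_sa(patient, model, (1.0, 1.0, 1.0), -0.1, 0.1, 21)
```

Because the synthetic patient formants were generated with α = 0.05 and that value lies on the search grid, the search should return the grid point at 0.05. An exhaustive search is affordable here because only a single scalar is estimated from three formants.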

The step (S404) of generating a synthesized sound that reflects the speaker characteristics by applying the extracted speaker adaptation parameter is as follows.

The main feature parameters that make up the acoustic model of the statistical speech synthesis system include the MGC (Mel-Generalized Cepstrum) parameters, the log F0 parameter, and the state duration parameter.

In the conventional speech synthesis step using MGC parameters, a bilinear transformation with α_M = 0.42 is used for the mel-scale frequency conversion.

Therefore, when the bilinear transformation using the speaker adaptation parameter obtained in step 2 of the present invention and the bilinear transformation for the mel-scale frequency conversion are connected in series and expressed as a single bilinear transformation, the resulting bilinear transformation parameter α_F is expressed by the following Equation (6).

Equation (6) is the same as in existing work that applies conventional VTLN-based speaker adaptation to statistical speech synthesis.

In the speech synthesis step using the MGC parameters of the statistical speech synthesis system, performing the bilinear transformation with α_F from Equation (6) instead of the conventional bilinear transformation with α_M = 0.42 is a simple way to generate a synthesized sound that reflects the timbre characteristics of the patient with a speech disorder.
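The following Python sketch illustrates how two bilinear transformations in series collapse into one. It assumes the well-known composition rule for first-order all-pass transforms, α_F = (α_M + α_SA) / (1 + α_M·α_SA); Equation (6) itself is not reproduced in this text, so this form is an assumption, as are the helper names `combine_alpha` and `warp` and the example value α_SA = 0.05.

```python
import math


def combine_alpha(alpha_sa, alpha_m=0.42):
    """Single warping parameter equivalent to two all-pass warpings in series.

    Assumed composition rule: (alpha_m + alpha_sa) / (1 + alpha_m * alpha_sa).
    """
    return (alpha_m + alpha_sa) / (1.0 + alpha_m * alpha_sa)


def warp(w, alpha):
    """All-pass frequency warping of a normalized angular frequency w in (0, pi)."""
    return w + 2.0 * math.atan2(alpha * math.sin(w), 1.0 - alpha * math.cos(w))


alpha_sa = 0.05  # illustrative speaker adaptation parameter
alpha_f = combine_alpha(alpha_sa)

# Numerical check: warping twice in series matches one warp with alpha_f.
test_points = [0.1, 0.5, 1.0, 2.0, 3.0]
max_gap = max(
    abs(warp(warp(w, alpha_sa), 0.42) - warp(w, alpha_f)) for w in test_points
)
```

Under this rule α_SA = 0 leaves the mel-scale warping unchanged (α_F = 0.42), and the order of the two transformations does not matter because the expression is symmetric in α_SA and α_M.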

The step (S405) of tuning the timbre according to the user's preference, based on listening to the generated synthesized sound, can optionally be performed as follows.

Steps S401 to S404 alone are sufficient for the statistical speech synthesis system to generate a synthesized sound that reflects the timbre characteristics of the patient with a speech disorder.

In this embodiment of the statistical speech synthesis system and method reflecting a personal voice according to the present invention, the timbre is changed using only a transformation of the spectral characteristics of the synthesized sound. The prosodic characteristics of the synthesized sound are not modified, and the prosody of the existing speech synthesizer is used as is. The reason is that patients with speech disorders have difficulty producing natural prosody, so prosodic characteristics obtained directly from the disordered speech often differ greatly from the prosody the patient actually wants in the synthesized sound.

However, in case the timbre change based on spectral transformation alone does not fully satisfy the patient's own preference, a step can optionally be added in which both the prosodic and the spectral characteristics of the synthesized sound are tuned according to that preference.

Specifically, among the prosodic characteristics of the synthesized sound, the intonation, which represents the pitch of the sound, is expressed by the log F0 parameter of the acoustic model. With the existing log F0 value for the j-th frame of the synthesized sound denoted LF0(j), the converted log F0 value is obtained as shown in the following Equation (7).

Here, LF0_SA is a user-specified parameter for transforming the intonation of the synthesized sound. If LF0_SA > 0 the pitch is raised, and if LF0_SA < 0 the pitch is lowered.

The LF0_SA value is adjusted while listening to the synthesized sound, and the value that the patient with a speech disorder finds most satisfactory is selected.

Instead of being applied frame by frame in the synthesized sound generation step, the process of Equation (7) can be implemented by modifying the log F0 model parameters among the acoustic model parameters.
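A minimal Python sketch of the frame-wise pitch adjustment is given below. It assumes Equation (7) is a plain additive offset in the log F0 domain, which is consistent with the sign behaviour of LF0_SA described above but is not shown in this text. The function name `shift_log_f0` and the use of `None` to mark unvoiced frames are illustrative conventions, not part of the source.

```python
import math


def shift_log_f0(lf0_frames, lf0_sa):
    """Add the user-chosen offset lf0_sa to the log F0 of every voiced frame.

    lf0_frames: per-frame log F0 values; None marks an unvoiced frame
                (an assumed convention for this sketch).
    lf0_sa > 0 raises the pitch, lf0_sa < 0 lowers it.
    """
    return [None if v is None else v + lf0_sa for v in lf0_frames]


# Raising the pitch by 20 percent corresponds to an offset of log(1.2).
frames = [math.log(100.0), None, math.log(200.0)]
shifted = shift_log_f0(frames, math.log(1.2))
shifted_hz = [None if v is None else math.exp(v) for v in shifted]
```

Because the offset is a constant, the same effect could be obtained by adding it once to the mean of each log F0 distribution in the acoustic model, which is the model-level alternative mentioned above.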

Likewise, for the spectral transformation using the bilinear transformation parameter α_F obtained from Equation (6), if the patient with a speech disorder wishes to change the spectral characteristics further, the α_F value is adjusted while listening to the synthesized sound, and the value that the patient finds most satisfactory is selected.

As described, the statistical speech synthesis system and method reflecting a personal voice according to the present invention generate a synthesized sound that has the patient's main timbre characteristics from nothing more than short vowel utterances by the patient with a speech disorder. The short vowel speech is compared with the corresponding short vowel model among the acoustic models of the statistical speech synthesis system, a speaker adaptation parameter in the form of a formant-frequency-based bilinear transformation parameter is extracted, and a synthesized sound reflecting the individual's timbre is generated, which makes improved communication possible.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the scope of equivalents should be construed as being included in the present invention.

20. Short vowel voice collection unit
30. Speaker adaptation parameter extraction unit
40. Synthesized sound generation unit
50. Synthesized sound tuning unit

Claims

A short vowel voice collection unit that collects short vowel speech from a patient with a speech disorder;
A speaker adaptation parameter extraction unit that extracts a formant-based bilinear transformation speaker adaptation parameter, comprising: a formant frequency extraction unit that divides the short vowel speech data collected by the short vowel voice collection unit into frames and extracts the frequencies of the first formant (F1), second formant (F2), and third formant (F3) for 2K + 1 frames consisting of the frame with the largest frame energy and the K frames before and after it; a median determination unit that determines the median of the formant frequencies extracted from the frames by the formant frequency extraction unit; and a speaker adaptation parameter determination unit that obtains, as the speaker adaptation parameter, the bilinear transformation coefficient α_SA that minimizes the weighted sum of squared errors when the formant frequencies (F1, F2, and F3) extracted from a specific short vowel of the patient are expressed as a bilinear transformation of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that short vowel;
A synthesized sound generation unit that generates a synthesized sound reflecting the speaker characteristics by applying the speaker adaptation parameter extracted by the speaker adaptation parameter extraction unit; and
A synthesized sound tuning unit that selectively tunes the timbre of the synthesized sound generated by the synthesized sound generation unit.

delete

Collecting short vowel speech from a patient with a speech disorder;
Extracting a formant-based bilinear transformation speaker adaptation parameter, comprising: dividing the collected short vowel speech data into frames and extracting the frequencies of the first formant (F1), second formant (F2), and third formant (F3) for 2K + 1 frames consisting of the frame with the largest frame energy and the K frames before and after it; determining the median of the formant frequencies extracted from the frames; and obtaining, as the speaker adaptation parameter, the bilinear transformation coefficient α_SA that minimizes the weighted sum of squared errors when the formant frequencies (F1, F2, and F3) extracted from a specific short vowel of the patient are expressed as a bilinear transformation of the formant frequencies (F1_M, F2_M, and F3_M) of the acoustic model corresponding to that short vowel; and
Generating a synthesized sound that reflects the speaker characteristics by applying the extracted speaker adaptation parameter.

The method according to claim 3, further comprising, after the step of generating a synthesized sound reflecting the speaker characteristics by applying the extracted speaker adaptation parameter,
selectively tuning the timbre based on the result of listening to the generated synthesized sound.

delete

The method according to claim 3, wherein, in the step of obtaining as the speaker adaptation parameter the bilinear transformation coefficient α_SA that minimizes the weighted sum of squared errors,
the bilinear transformation in frequency (Hz) is given by Equation (2), in which the two frequency variables respectively denote the frequencies before and after the bilinear transformation, and α is the bilinear transformation parameter.

The method according to claim 6, wherein the function obtained by rearranging the bilinear transformation in frequency (Hz) is given by Equation (3),
the weighted sum of squared errors computed with this bilinear transformation function is given by Equation (4),
and the weight reflects the reliability of the estimated i-th formant frequency, given that formant frequency extraction is less reliable for the speech of a patient with a speech disorder than for typical speech.

The method according to claim 7, wherein the weight that reflects the reliability of each individual formant frequency is calculated by Equation (5),
in which the mean and standard deviation are those of the i-th formant frequencies of the 2K + 1 frames used,
and the function g(x) is monotonically increasing over the range x ≥ 0.

The method according to claim 4, wherein, in tuning the timbre based on the result of listening to the generated synthesized sound,
among the prosodic characteristics of the synthesized sound, the intonation, which represents the pitch of the sound, is expressed by the log F0 parameter of the acoustic model, and, with the existing log F0 value for the j-th frame of the synthesized sound denoted LF0(j), the converted log F0 value is obtained by Equation (7),
where LF0_SA is a user-specified parameter for transforming the intonation of the synthesized sound, the pitch is raised if LF0_SA > 0 and lowered if LF0_SA < 0, and the LF0_SA value is selected by adjusting it while listening to the synthesized sound.

The method according to claim 4, wherein, in tuning the timbre based on the result of listening to the generated synthesized sound,
when the bilinear transformation using the speaker adaptation parameter and the bilinear transformation for the mel-scale frequency conversion are connected in series and expressed as a single bilinear transformation,
the bilinear transformation parameter α_F is given by Equation (6),
where α_SA is the bilinear transformation coefficient that minimizes the weighted sum of squared errors and α_M is the bilinear transformation coefficient for the mel-scale frequency conversion, and
for the spectral transformation using the bilinear transformation parameter α_F, if the patient with a speech disorder wishes to change the spectral characteristics further, the α_F value is adjusted while listening to the synthesized sound and the α_F value that the patient finds most satisfactory is selected.