KR20180078252A

KR20180078252A - Method of forming excitation signal of parametric speech synthesis system based on glottal pulse model

Info

Publication number: KR20180078252A
Application number: KR1020187012944A
Authority: KR
Inventors: Rajesh Dachiraju; E. Veera Raghavendra; Aravind Ganapathiraju
Original assignee: Interactive Intelligence Group, Inc.
Priority date: 2015-10-06
Filing date: 2015-10-06
Publication date: 2018-07-09
Also published as: WO2017061985A1; AU2015411306A1; CN108369803A; CA3004700A1; CA3004700C; CN108369803B; EP3363015A4; EP3363015A1

Abstract

A system and method are provided for forming the excitation signal of a glottal pulse model based parametric speech synthesis system. The excitation signal may be formed using a plurality of sub-band templates instead of a single one. The sub-band templates may be combined to form the excitation signal, with the proportion in which each template is added determined dynamically from energy coefficients. These coefficients can vary from frame to frame and are learned, along with the spectral parameters, during feature training. The coefficients are appended to the feature vector containing the spectral parameters and modeled using HMMs, and the excitation signal is determined.

Description

Method of forming excitation signal of parametric speech synthesis system based on glottal pulse model

The present invention generally relates to telecommunications systems and methods, as well as speech synthesis. More particularly, the present invention relates to the formation of the excitation signal in a hidden Markov model based statistical parametric speech synthesis system.

In one embodiment, a method is provided for creating parametric models for use in training a speech synthesis system, wherein the system comprises at least a training text corpus, a speech database, and a model training module, the method comprising: obtaining, by the model training module, speech data for the training text corpus, wherein the speech data comprises recorded speech signals and corresponding transcriptions; converting, by the model training module, the training text corpus into context-dependent phone labels; extracting, by the model training module, for each frame of speech in the speech signal from the speech training database, at least one of spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values; forming, by the model training module, a feature vector stream for each frame of speech using the at least one of spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values; labeling the speech with context-dependent phones; extracting the duration of each context-dependent phone from the labeled speech; performing parameter estimation of the speech signal, wherein the parameter estimation comprises the features, HMMs, and decision trees; and identifying a plurality of sub-band eigenvalue glottal pulses, which comprise separate models used to form the excitation during synthesis.

In another embodiment, a method is provided for identifying sub-band eigenvalue pulses from a glottal pulse database for training a speech synthesis system, the method comprising: receiving pulses from the glottal pulse database; decomposing each pulse into a plurality of sub-band components; dividing the sub-band components into a plurality of databases based on the decomposing; determining a vector representation of each database; determining, from the vector representation, eigenvalue pulse values for each database; and selecting the best eigenvalue pulse for each database for use in synthesis.

FIG. 1 is a diagram illustrating an embodiment of a hidden Markov model based text-to-speech system.
FIG. 2 is a flowchart illustrating an embodiment of a feature vector extraction process.
FIG. 3 is a flowchart illustrating an embodiment of a feature vector extraction process.
FIG. 4 is a flowchart illustrating an embodiment of an eigenvalue pulse identification process.
FIG. 5 is a flowchart illustrating an embodiment of a speech synthesis process.
[Cross reference to related application]
This application is a continuation-in-part of U.S. patent application Ser. No. 14/288,745, filed May 28, 2014, entitled "Method for Forming the Excitation Signal for a Glottal Pulse Model Based Parametric Speech Synthesis System", the contents of which are incorporated in part herein by reference.

For the purposes of promoting an understanding of the principles of the invention, reference will now be made to the embodiments illustrated in the drawings, and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications of the described embodiments, and any further applications of the principles of the invention as described herein, are contemplated as would normally occur to one skilled in the art to which the invention relates.

In speech synthesis, the excitation is generally assumed to be a quasi-periodic sequence of impulses in voiced regions. Each impulse is separated from the previous one by some duration, such as $T_0 = 1/F_0$, where $T_0$ represents the pitch period and $F_0$ represents the fundamental frequency. In unvoiced regions, the excitation is modeled as white noise. In voiced regions, however, the excitation is not actually a sequence of impulses. Instead, it is a sequence of voice source pulses produced by the vibration of the vocal folds, with a shape determined by that vibration. Moreover, the shape of the pulses may vary with factors such as the speaker, the speaker's mood, the linguistic context, and emotion.
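The classical excitation assumption described above can be sketched as follows. This is a minimal illustration, not the method of the invention: the function name `impulse_excitation`, the frame length, the sample rate, and the convention that an F0 of zero marks an unvoiced frame are all assumptions made for the example.

```python
import random

def impulse_excitation(f0_per_frame, frame_len, sample_rate, seed=0):
    """Classical excitation: unit impulses spaced T0 = 1/F0 apart in voiced
    frames (F0 > 0) and white Gaussian noise in unvoiced frames (F0 <= 0)."""
    rng = random.Random(seed)
    out = [0.0] * (frame_len * len(f0_per_frame))
    next_pulse = 0  # sample index at which the next impulse is due
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * frame_len, (i + 1) * frame_len
        if f0 <= 0:
            # Unvoiced frame: fill with white noise.
            for n in range(start, end):
                out[n] = rng.gauss(0.0, 1.0)
            next_pulse = end
            continue
        period = max(1, round(sample_rate / f0))  # pitch period T0 in samples
        next_pulse = max(next_pulse, start)
        while next_pulse < end:
            out[next_pulse] = 1.0
            next_pulse += period
    return out

# Two voiced frames at 100 Hz followed by one unvoiced frame (16 kHz, 25 ms frames).
excitation = impulse_excitation([100.0, 100.0, 0.0], frame_len=400, sample_rate=16000)
```

At 16 kHz, a 100 Hz fundamental gives a pitch period of 160 samples, so the impulses in the voiced frames fall 160 samples apart, continuing across the frame boundary.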

Source pulses have been treated mathematically as vectors by length normalization (through resampling) and impulse alignment, as described, for example, in European Patent EP 2242045 (granted June 27, 2012, inventors Thomas Drugman et al.). The final length of the normalized source pulse signal is resampled to meet the target pitch. The source pulse is not chosen from a database, but is obtained through a series of calculations that compromise the pulse characteristics in the frequency domain. Traditionally, voice source pulses for HMM based systems have been modeled using acoustic parameters or excitation models. However, these models interpolate or resample the glottal/residual pulse to meet the target pitch period, which compromises the characteristics of the model pulse in the frequency domain. Other methods have used a canonical way of choosing the pulse, but they convert the residual pulses into equal-length vectors by length normalization. These methods also perform PCA over those vectors, so that the final pulse is a computed pulse rather than one selected directly from the training data.

To obtain the final pulse by selecting it directly from the training data, as opposed to by computation, glottal pulses may be modeled by defining a metric and providing a vector representation. Excitation formation that does not resample or interpolate the pulse, given a glottal pulse and a fundamental frequency, is also presented.

In statistical parametric speech synthesis, speech unit signals are represented by a set of parameters that can be used to synthesize speech. The parameters may be learned by statistical models such as HMMs. In an embodiment, speech may be represented as a source-filter model, wherein the source/excitation is a signal that, when passed through an appropriate filter, produces a given sound. FIG. 1 is a diagram illustrating an embodiment of a hidden Markov model (HMM) based text-to-speech system, indicated generally at 100. An embodiment of an exemplary system may contain two phases, for example, a training phase and a synthesis phase, which are described in greater detail below.

The speech database 105 may contain a large amount of speech data for use in speech synthesis. The speech data may comprise recorded speech signals and corresponding transcriptions. During the training phase, a speech signal 106 is converted into parameters. The parameters may comprise excitation parameters, F0 parameters, and spectral parameters. Excitation parameter extraction 110a, spectral parameter extraction 110b, and F0 parameter extraction 110c are performed on the speech signal 106 delivered from the speech database 105. A hidden Markov model may be trained by the training module 115 using these extracted parameters and the labels 107 from the speech database 105. Any number of HMM models may result from the training, and these context-dependent HMMs are stored in a database 120.

The synthesis phase begins as the context-dependent HMMs 120 are used to generate parameters 135. Parameter generation 135 may use input from a corpus of text 125 from which speech is to be synthesized. Before its use in parameter generation 135, the text 125 may undergo analysis 130. During analysis 130, labels 131 are extracted from the text 125 for use in parameter generation 135. In one embodiment, excitation parameters and spectral parameters may be generated in the parameter generation module 135.

The excitation parameters may be used to generate the excitation signal 140, which is input, along with the spectral parameters, into a synthesis filter 145. Filter parameters are generally Mel frequency cepstral coefficients (MFCC) and are often modeled as a statistical time series using HMMs. The predicted values of the filter and of the fundamental frequency, as time series values, may be used to synthesize the filter by creating an excitation signal from the fundamental frequency values, with the MFCC values used to form the filter. Synthesized speech 150 is produced when the excitation signal passes through the filter.

The formation of the excitation signal 140 in FIG. 1 is integral to the quality of the output, or synthesized, speech 150. Generally, the spectral parameters used in statistical parametric speech synthesis systems comprise MCEPS, MGC, Mel-LPC, or Mel-LSP. In an embodiment, the spectral parameters are mel-generalized cepstral (MGC) coefficients computed from the pre-emphasized speech signal, but the zeroth energy coefficient is computed from the original speech signal. In traditional systems, the fundamental frequency value alone is considered the source parameter and the entire spectrum is considered a system parameter. However, the spectral tilt, or gross spectral shape, of the speech spectrum is actually a characteristic of the glottal pulse and is thus considered a source parameter. The spectral tilt is captured and modeled for glottal pulse based excitation and is excluded as a system parameter. Instead, pre-emphasized speech is used to compute the spectral parameters (MGC), except for the zeroth energy coefficient (the energy of the speech). This coefficient varies slowly over time and may be considered a prosodic parameter computed directly from the unprocessed speech.

Training and model building

FIG. 2 is a flowchart illustrating an embodiment of a feature vector extraction process, indicated generally at 200. This process may occur during spectral parameter extraction 110b of FIG. 1. As previously described, the parameters may be used for model training, such as HMM model training.

In step 205, a speech signal is received for conversion into parameters. As illustrated in FIG. 1, the speech signal may be received from the speech database 105. Control passes to steps 210 and 220, and process 200 continues. In an embodiment, steps 210 and 215 occur simultaneously with step 220, and the results are all passed to step 225.

In step 210, pre-emphasis is performed on the speech signal. Pre-emphasizing the speech signal at this stage prevents low-frequency source information from being captured when the MGC coefficients are determined in the next step. Control passes to step 215, and process 200 continues.
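As a rough illustration of the pre-emphasis step, a first-order high-pass difference is the usual choice. The source does not state the filter or its coefficient, so the form y[n] = x[n] - alpha * x[n-1] and the value alpha = 0.97 below are assumptions.

```python
def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged. The value of alpha is an
    assumed, commonly used setting, not one taken from the source.
    """
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

# A constant (zero-frequency) input is almost entirely suppressed after the first sample.
flat = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```

The filter attenuates low frequencies, which is why slowly varying source information is largely kept out of the coefficients computed afterwards.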

In step 215, the spectral parameters are determined for each frame of speech. In an embodiment, MGC coefficients 1 through 39 may be determined for each frame. Alternatively, MFCC and LSP may also be used. Control passes to step 225, and process 200 continues.

In step 220, the zeroth coefficient is determined for each frame of speech. In an embodiment, this coefficient may be determined using the unprocessed speech, as opposed to the pre-emphasized speech. Control passes to step 225, and process 200 continues.

In step 225, the coefficients from steps 220 and 215 are appended to MGC coefficients 1 through 39 to form the 39 coefficients for each frame of speech. The spectral coefficients for a frame may be referred to below as the spectral vector. Process 200 ends.

FIG. 3 is a flowchart illustrating an embodiment of a feature vector extraction process, indicated generally at 300. This process may occur during excitation parameter extraction 110a of FIG. 1. As previously described, the parameters may be used for model training, such as HMM model training.

In step 305, a speech signal is received for conversion into parameters. As illustrated in FIG. 1, the speech signal may be received from the speech database 105. Control passes to steps 310, 320, and 325, and process 300 continues.

In step 310, pre-emphasis is performed on the speech signal. Pre-emphasizing the speech signal at this stage prevents low-frequency source information from being captured when the MGC coefficients are determined in the next step. Control passes to step 315, and process 300 continues.

In step 315, linear predictive coding (LPC) analysis is performed on the pre-emphasized speech signal. The LPC analysis produces the coefficients used in the next step to perform inverse filtering. Control passes to step 320, and process 300 continues.

In step 320, inverse filtering is performed on the analyzed signal and the original speech signal. In an embodiment, step 320 is not performed unless pre-emphasis (step 310) has been performed. Control passes to step 330, and process 300 continues.
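The LPC analysis and inverse filtering of steps 315 and 320 can be sketched as below. The source does not say how the LPC coefficients are computed; the autocorrelation method with the Levinson-Durbin recursion is assumed here, and the function names `lpc` and `inverse_filter` are hypothetical.

```python
def lpc(signal, order):
    """Prediction-error polynomial a[0..order] with a[0] = 1, estimated by the
    autocorrelation method and the Levinson-Durbin recursion (an assumed choice)."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) input
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def inverse_filter(signal, a):
    """Residual e[n] = sum_k a[k] * x[n-k], i.e. filtering by A(z)."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]

# Synthetic check: a one-pole decay x[n] = 0.9**n is the impulse response of
# 1 / (1 - 0.9 z^-1), so inverse filtering should recover a single impulse.
demo = [0.9 ** n for n in range(200)]
coeffs = lpc(demo, order=1)
residual = inverse_filter(demo, coeffs)
```

In the process described above, the coefficients would be estimated from the pre-emphasized signal and the resulting filter applied to obtain the residual that approximates the source.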

In step 325, the fundamental frequency value is determined from the original speech signal. It may be determined using any standard technique known in the art. Control passes to step 330, and process 300 continues.

In step 330, the glottal cycles are segmented. Control passes to step 335, and process 300 continues.

In step 335, the glottal cycles are decomposed. For each frame, in an embodiment, the corresponding glottal cycles are decomposed into sub-band components. In an embodiment, the sub-band components may comprise a plurality of bands, wherein the bands may comprise lower and higher components.

In the spectrum of a typical glottal pulse, there may be a high-energy bulge at low frequencies and a typically flat structure at higher frequencies. The demarcation between these bands varies from pulse to pulse, as does the energy ratio. Given a glottal pulse, the cut-off frequency that separates the higher and lower bands is determined. In an embodiment, the ZFR method may be used with a suitable window size, but applied to the spectral magnitude. This yields a zero crossing at the edge of the low-frequency bulge, which is taken as the demarcation frequency between the lower and higher bands. The two components in the time domain may be obtained by placing zeros in the higher-band region of the spectrum before taking the inverse FFT, which gives the time-domain version of the low-frequency component of the glottal pulse, and vice versa to obtain the high-frequency component. Control passes to step 340, and process 300 continues.
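The band splitting by spectral zeroing can be sketched as follows. The cut-off is taken as a given bin index here; finding it with the ZFR method on the spectral magnitude is not implemented. A plain DFT stands in for the FFT to keep the example dependency-free, and `split_bands` is a hypothetical name.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spec):
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def split_bands(pulse, cutoff_bin):
    """Split a pulse into low- and high-band time-domain components by zeroing
    the complementary region of its spectrum before the inverse transform."""
    n = len(pulse)
    spec = dft(pulse)
    low = [0j] * n
    high = [0j] * n
    for k in range(n):
        freq_bin = min(k, n - k)  # fold the mirrored (negative-frequency) bins
        if freq_bin < cutoff_bin:
            low[k] = spec[k]
        else:
            high[k] = spec[k]
    return idft_real(low), idft_real(high)

# Synthetic check: a 2-cycle tone plus a 10-cycle tone, split at bin 5.
n = 64
low_tone = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
high_tone = [math.cos(2 * math.pi * 10 * t / n) for t in range(n)]
pulse = [a + b for a, b in zip(low_tone, high_tone)]
low_part, high_part = split_bands(pulse, cutoff_bin=5)
```

Because every spectral bin is assigned to exactly one band, the two components sum back to the original pulse.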

In step 340, the energies of the sub-band components are determined. For example, the energy of each sub-band component may be determined to form the energy coefficients for each frame. In an embodiment, the number of sub-band components may be two. The energies of the sub-band components may be determined using any standard technique known in the art. The energy coefficients of a frame are referred to below as the energy vector. Process 300 ends.
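One simple reading of the energy computation is the sum of squared samples of each sub-band component. The source leaves the technique open ("any standard technique"), so this definition and the name `energy_vector` are assumptions.

```python
def energy_vector(subband_components):
    """Energy of each sub-band component as the sum of its squared samples.

    Returns one coefficient per band, in the order the bands are given
    (for example, low band first, then high band).
    """
    return [sum(sample * sample for sample in band) for band in subband_components]

# Two toy sub-band components for one frame.
frame_energy = energy_vector([[3.0, 4.0], [1.0, 0.0]])
```

With two bands, the result is the two-element energy vector for the frame.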

In an embodiment, two band energy coefficients are determined for each frame from the inverse filtered speech. The energy coefficients may represent the dynamic nature of the glottal excitation. The inverse filtered speech contains an approximation of the source signal, which is then segmented into glottal cycles. The two band energy coefficients comprise the energies of the low and high band components of the corresponding glottal cycle in the source signal: the energy of the lower-frequency component gives the energy coefficient of the lower band, and similarly the energy of the higher-frequency component gives the energy coefficient of the higher band. The coefficients may be modeled by including them in the feature vector of the corresponding frame, which is then modeled using HMM-GMM in HTS.

In a non-limiting example, the two band energy coefficients of the source signal are appended to the spectral parameters determined in process 200 to form the feature stream, along with the fundamental frequency values, as in a typical HMM-GMM (HTS) based TTS system. The model may then be used in process 500, described below, for speech synthesis.

Eigenvalue pulse identification training

FIG. 4 is a flowchart illustrating an embodiment of an eigenvalue pulse identification process, indicated generally at 400. Eigenvalue pulses may be identified for each sub-band glottal pulse database and used in synthesis, as described in greater detail below.

In step 405, a glottal pulse database is created. In an embodiment, the glottal pulse database is created automatically from training data (speech data) obtained from a voice talent. Given a speech signal $s(n)$, linear prediction analysis is performed. The signal $s(n)$ undergoes inverse filtering to obtain the integrated linear prediction residual signal, which is an approximation of the glottal excitation. The integrated linear prediction residual is then segmented into glottal cycles using a technique such as zero frequency filtering. A number of small signals, referred to as glottal pulses, are obtained, which may be denoted $g_i(n)$, $i = 1, 2, 3, \ldots$. The glottal pulses are pooled to create the database. Control passes to step 410, and process 400 continues.
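The pooling of glottal pulses into a database can be sketched as below. The sketch assumes the cycle boundaries (epoch indices) are already known, since zero frequency filtering is not implemented here. The leaky integrator and its `leak` value are also assumptions, as are the names `integrate` and `segment_cycles`.

```python
def integrate(residual, leak=0.99):
    """Leaky running sum of the LP residual (an assumed way of forming the
    integrated linear prediction residual)."""
    out, acc = [], 0.0
    for sample in residual:
        acc = leak * acc + sample
        out.append(acc)
    return out

def segment_cycles(signal, epochs):
    """Cut the signal into one segment per glottal cycle, using consecutive
    epoch indices as cycle boundaries."""
    return [signal[a:b] for a, b in zip(epochs, epochs[1:]) if b > a]

# Toy data: boundaries at samples 0, 4, 7 and 10 give three pulses.
toy_signal = [float(v) for v in range(10)]
pulse_database = segment_cycles(toy_signal, [0, 4, 7, 10])
smoothed = integrate([1.0, 0.0, 0.0], leak=0.5)
```

Each entry of `pulse_database` corresponds to one glottal pulse; pulses from all training utterances would be pooled into the same list.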

In step 410, pulses from the database are decomposed into sub-band components. In an embodiment, the glottal pulses may be decomposed into a plurality of sub-band components, such as low and high band components, and the two band energy coefficients. In the spectrum of a typical glottal pulse, there is a high-energy bulge at low frequencies and a typically flat structure at higher frequencies. However, the demarcation between the bands varies from pulse to pulse, as does the energy ratio between the two bands. As a result, separate models may be needed for each of the bands.

Given a glottal pulse, the cut-off frequency is determined. In an embodiment, the cut-off frequency is the one that separates the higher and lower bands, found using the Zero Frequency Resonator (ZFR) method with a suitable window size, but applied to the spectral magnitude. This yields a zero crossing at the edge of the low-frequency bulge, which is taken as the demarcation frequency between the lower and higher bands. The two components in the time domain may be obtained by placing zeros in the higher-band region of the spectrum before taking the inverse FFT, which gives the time-domain version of the low-frequency component of the glottal pulse, and vice versa to obtain the high-frequency component. Control passes to step 415, and process 400 continues.

In step 415, the pulse databases are formed. For example, a plurality of glottal pulse databases, such as a low-band glottal pulse database and a high-band glottal pulse database, result from step 410. In an embodiment, the number of databases formed corresponds to the number of bands formed. Control passes to step 420, and process 400 continues.

In step 420, the vector representation of each database is determined. In an embodiment, two separate models result, one for the lower-band and one for the higher-band components of the glottal pulses, but the same method is applied to each model, as described below. In this context, a sub-band glottal pulse refers to one component of a glottal pulse.

The space of sub-band glottal pulse signals may be treated as a novel mathematical metric space, as described below.

Consider the function space $M$ of functions that are continuous, of bounded variation, and of unit energy. Translations in this space are identified, so that $f$ and $g$ are the same if $g$ is a time-shifted (delayed) version of $f$. An equivalence relation is imposed on this space: given $f$ and $g$, where $f$ and $g$ represent any two sub-band glottal pulses, $f$ is equivalent to $g$ if there exists a real constant $\theta \in \mathbb{R}$ such that $g = f\cos(\theta) + f_h\sin(\theta)$, where $f_h$ is the Hilbert transform of $f$.

A distance metric $d$ may be defined over the function space $M$. Given $f, g \in M$, the normalized cross-correlation between the two functions may be denoted as

$$r(\tau) = \int f(t)\, g(t + \tau)\, dt.$$

Let $R(\tau) = \sqrt{r(\tau)^2 + r_h(\tau)^2}$, where $r_h$ is the Hilbert transform of $r$. The angle $\theta(f, g)$ between $f$ and $g$ is defined by

$$\cos\theta(f, g) = \sup_{\tau} R(\tau),$$

where $\sup_{\tau} R(\tau)$ is the maximum value of the function $R(\tau)$. The distance metric between $f$ and $g$ becomes

$$d(f, g) = \sqrt{2\,\bigl(1 - \cos\theta(f, g)\bigr)}.$$

Together with the function space $M$, the metric $d$ forms the metric space $(M, d)$.

If the metric $d$ is a Hilbertian metric, the space can be isometrically embedded in a Hilbert space. Thus, a given signal $x \in M$ in the function space can be mapped to a vector $\Psi_x(\cdot)$ in a Hilbert space, expressed as

$$\Psi_x(\cdot) = -\tfrac{1}{2}\left(d^2(x, \cdot) - d^2(x, x_0) - d^2(\cdot, x_0)\right),$$

where $x_0$ is a fixed element of $M$. The zero element is represented as $\Psi_{x_0} = 0$. The mappings $\{\Psi_x \mid x \in M\}$ represent a total set in the Hilbert space. The mapping is isometric, meaning $\lVert \Psi_x - \Psi_y \rVert = d(x, y)$.

The vector representation $\Psi_x(\cdot)$ of a given signal $x$ in the metric space depends on the set of distances of $x$ from every other signal in the metric space. Determining the distances from all other points of the metric space is impractical, so the vector representation may depend only on the distances from a fixed number of points $\{c_i\}$ of the metric space, namely the centroids obtained after metric-based clustering of a large set of signals from the metric space. Control passes to step 425 and process 400 continues.

In step 425, Eigen pulses are determined and process 400 ends. In an embodiment, to determine a metric for sub-band glottal pulses, a metric or notion of distance $d(x, y)$ between any two sub-band glottal pulses $x$ and $y$ is defined. The metric between two pulses $f$ and $g$ is defined as follows. The normalized circular cross-correlation between $f$ and $g$ is defined as

$$R(n) = \sum_{m} f(m)\, g(m + n),$$

where the index $m + n$ is taken modulo the correlation period. The period of the circular correlation is taken to be the greater of the lengths of $f$ and $g$. The shorter signal is zero-extended for the purpose of computing the metric and is not modified in the database. The discrete Hilbert transform $R_h(n)$ of $R(n)$ is determined.

Next, a signal is obtained through the following equation:

$$H(n) = \sqrt{R(n)^2 + R_h(n)^2}.$$

The cosine of the angle $\theta$ between the two signals $f$ and $g$ may be defined as

$$\cos\theta = \sup_{n} H(n),$$

where $\sup_{n} H(n)$ denotes the maximum value among all samples of the signal $H(n)$. The distance metric may be given as

$$d(f, g) = \sqrt{2\,(1 - \cos\theta)}.$$
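The metric computation for two sub-band glottal pulses can be sketched as below. This is an illustration under stated assumptions, not the patent's reference implementation: the names `unit_energy` and `pulse_distance` are hypothetical, both pulses are normalized to unit energy after zero-extension, the Hilbert envelope is computed through the FFT-based analytic signal, and the final distance is assumed to take the chord-length form $\sqrt{2(1 - \cos\theta)}$.

```python
import numpy as np

def unit_energy(x):
    """Scale a signal so that the sum of its squared samples is 1."""
    x = np.asarray(x, dtype=float)
    return x / np.sqrt(np.sum(x ** 2))

def pulse_distance(f, g):
    """Distance between two pulses from the peak Hilbert envelope of
    their circular cross-correlation (assumed form of the metric)."""
    n = max(len(f), len(g))  # correlation period: the longer length
    f = unit_energy(np.pad(np.asarray(f, dtype=float), (0, n - len(f))))
    g = unit_energy(np.pad(np.asarray(g, dtype=float), (0, n - len(g))))
    # Spectrum of the circular cross-correlation R(n) = sum_m f(m) g(m + n)
    cross = np.conj(np.fft.fft(f)) * np.fft.fft(g)
    # Analytic-signal weights: keep DC (and Nyquist), double positive bins
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    # |R(n) + j R_h(n)| = sqrt(R(n)^2 + R_h(n)^2)
    envelope = np.abs(np.fft.ifft(cross * weights))
    cos_theta = min(1.0, float(envelope.max()))  # clamp rounding overshoot
    return float(np.sqrt(2.0 * (1.0 - cos_theta)))

t = np.arange(64)
f = np.hanning(64) * np.sin(2 * np.pi * 4 * t / 64)
h = np.hanning(48)                         # a shorter, differently shaped pulse
d_shift = pulse_distance(f, np.roll(f, 5))  # circularly delayed copy of f
d_other = pulse_distance(f, h)
```

A circularly delayed copy of a pulse comes out at essentially zero distance, which matches the identification of translations in the text.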

The $k$-means clustering algorithm, which is well known in the art, may be modified to determine $k$ cluster centroid glottal pulses from the entire glottal pulse database $G$. The first modification replaces the Euclidean distance metric with the metric $d(x, y)$ defined for glottal pulses as described above. The second modification concerns updating the centroids of the clusters. The centroid glottal pulse of a cluster of glottal pulses whose elements are denoted $\{g_1, g_2, \ldots, g_N\}$ is the element $g_c$ such that

$$D_m = \sum_{i=1}^{N} d^2(g_i, g_m)$$

is minimized at $m = c$. The clustering iterations terminate when none of the centroids of the $k$ clusters shifts any further.
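The two modifications to k-means can be sketched on a precomputed distance matrix. This is a hedged illustration: the function name `cluster_medoids`, the random initialization, and the use of the sum of squared distances as the centroid cost are assumptions made for the example, and a plain absolute difference stands in for the glottal pulse metric in the toy data.

```python
import numpy as np

def cluster_medoids(dist, k, max_iter=100, seed=0):
    """Cluster items given an (N, N) matrix of pairwise distances.

    Each centroid is constrained to be a member of its own cluster: the
    member with the smallest sum of squared distances to the others.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    centroids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the supplied metric
        labels = np.argmin(dist[:, centroids], axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old centroid if a cluster empties
            # Cost of each member acting as the centroid of its cluster
            cost = np.sum(dist[np.ix_(members, members)] ** 2, axis=0)
            new_centroids[c] = members[np.argmin(cost)]
        if np.array_equal(new_centroids, centroids):
            break  # no centroid shifted: stop iterating
        centroids = new_centroids
    labels = np.argmin(dist[:, centroids], axis=1)
    return centroids, labels

# Toy data: two well-separated groups on a line
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
dist = np.abs(points[:, None] - points[None, :])
centroids, labels = cluster_medoids(dist, k=2)
```

Restricting each centroid to an actual cluster member avoids averaging pulses, which would not be meaningful under a non-Euclidean metric.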

A vector representation for sub-band glottal pulses may then be determined. Consider a glottal pulse $x_i$, let $c_1, c_2, \ldots, c_j, \ldots, c_k$ be the centroid glottal pulses determined by clustering as described above, and let $L$ be the size of the glottal pulse database. Each glottal pulse is assigned to one of the cluster centroids $c_j$ based on the distance metric, and the total number of elements assigned to centroid $c_j$ is denoted $n_j$. With $x_0$ denoting a fixed sub-band glottal pulse selected from the database, the vector representation may be defined as follows:

[equation omitted in source]

Here, with $V_i$ denoting the vector representation of the sub-band glottal pulse $x_i$, $V_i$ may be given as:

[equation omitted in source]

For every glottal pulse in the database, a corresponding vector is determined and stored in the database.

PCA is performed in the vector space and Eigen glottal pulses are identified. Principal component analysis (PCA) is performed on the set of vectors associated with the glottal pulse database to obtain eigenvectors. The mean vector of the entire vector database is subtracted from each vector to obtain mean-subtracted vectors. The eigenvectors of the covariance matrix of the set of vectors are then determined. For each eigenvector obtained, the glottal pulse whose mean-subtracted vector has the minimum Euclidean distance from that eigenvector is associated with it and is called the corresponding Eigen glottal pulse. Eigen pulses are thus determined for each sub-band glottal pulse database. From each database, one Eigen pulse is selected based on listening tests, and the selected Eigen pulses may be used in synthesis as described in more detail below.
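The PCA step can be sketched as follows, assuming the vector representations are already available as the rows of a matrix. The function name `eigen_pulse_indices` and the random toy data are hypothetical, and the sign of each eigenvector is left as returned by the solver, which affects the nearest-vector search.

```python
import numpy as np

def eigen_pulse_indices(vectors, num_components):
    """Return, for each leading eigenvector, the index of the pulse whose
    mean-subtracted vector lies closest to it in Euclidean distance."""
    centered = vectors - vectors.mean(axis=0)        # mean-subtracted vectors
    cov = np.cov(centered, rowvar=False)             # covariance of the set
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:num_components]
    indices = []
    for j in order:
        # Eigenvector sign is arbitrary; it is used here as returned
        dists = np.linalg.norm(centered - eigvecs[:, j], axis=1)
        indices.append(int(np.argmin(dists)))
    return indices

# Toy stand-in for the stored vector representations (20 pulses, 4 dims)
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 4))
idx = eigen_pulse_indices(vectors, num_components=2)
```

The returned indices point back into the pulse database, so the corresponding Eigen glottal pulses are actual recorded pulses rather than synthetic combinations.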

Use in speech synthesis

FIG. 5 is a flowchart illustrating an embodiment of a process for speech synthesis, indicated generally at 500. This process may be used to train the model obtained in process 100 (see FIG. 1). In an embodiment, the glottal pulse used as the excitation in a particular pitch cycle is formed by combining the low-band glottal template pulse and the high-band glottal template pulse after scaling each to the corresponding one of the two band energy coefficients. The two band energy coefficients for a particular cycle are taken to be those of the frame to which the pitch cycle corresponds. The excitation is formed from the glottal pulses and filtered to obtain the output speech.
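Forming the excitation pulse for one pitch cycle from the two band templates might look like the sketch below. The function name `combine_templates` is hypothetical, and scaling each unit-energy template by the square root of its band energy coefficient is an assumption about how "scaling to the energy coefficient" is realized. The templates are assumed to have equal length.

```python
import numpy as np

def combine_templates(low_template, high_template, low_energy, high_energy):
    """Combine band templates so each band carries its target energy."""
    low = low_template / np.linalg.norm(low_template)     # unit energy
    high = high_template / np.linalg.norm(high_template)  # unit energy
    # Amplitude scale is sqrt(energy) so the band energy equals the target
    return np.sqrt(low_energy) * low + np.sqrt(high_energy) * high

n = np.arange(64)
low_template = np.sin(2 * np.pi * 1 * n / 64)    # toy low-band template
high_template = np.sin(2 * np.pi * 10 * n / 64)  # toy high-band template
excitation_pulse = combine_templates(low_template, high_template, 0.8, 0.2)
```

Because the coefficients can change from frame to frame, the mix of low-band and high-band content in the excitation varies over time even though the two templates are fixed.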

Synthesis may be performed in the frequency domain and in the time domain. In the frequency domain, for each pitch period, the corresponding spectral parameter vector is converted into a spectrum and multiplied by the spectrum of the glottal pulse. Applying the inverse discrete Fourier transform (DFT) to the result yields the speech segment corresponding to that pitch cycle. Overlap-add is applied in the time domain to all of the pitch-synchronous speech segments obtained, yielding the synthesized speech.
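A minimal sketch of the frequency-domain path follows. It assumes the spectral parameter vectors have already been converted into one-sided spectra, one per pitch cycle; that conversion is not shown. The function name `synthesize_frequency_domain` and its arguments are hypothetical.

```python
import numpy as np

def synthesize_frequency_domain(pulse, envelopes, boundaries, total_len):
    """Multiply each per-cycle spectrum by the pulse spectrum, invert,
    and overlap-add the segments at the given pitch boundaries.

    envelopes: array of shape (cycles, n_fft // 2 + 1), assumed given.
    """
    n_fft = 2 * (envelopes.shape[1] - 1)
    pulse_spec = np.fft.rfft(pulse, n_fft)         # zero-padded pulse spectrum
    out = np.zeros(total_len + n_fft)
    for env, start in zip(envelopes, boundaries):
        segment = np.fft.irfft(env * pulse_spec, n_fft)
        out[start:start + n_fft] += segment        # overlap-add
    return out[:total_len]

pulse = np.hanning(8)
envelopes = np.ones((2, 9))   # flat spectra, so each segment is the pulse itself
speech = synthesize_frequency_domain(pulse, envelopes, [0, 4], 32)
```

With flat spectra the output is just the pulse repeated at the two boundaries, which makes the overlap-add easy to check by hand.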

In the time domain, the excitation signal is constructed and filtered using a Mel Log Spectrum Approximation (MLSA) filter to obtain the synthesized speech signal. The given glottal pulse is normalized to unit energy. For unvoiced regions, white noise of fixed energy is placed in the excitation signal. For voiced regions, the excitation signal is initialized with zeros. Pitch boundaries are computed using the fundamental frequency values, which are given, for example, for every 5 ms frame. Glottal pulses are placed starting from every pitch boundary and overlap-added to the zero-initialized excitation signal to obtain the signal. Overlap-add is performed on the glottal pulse at each pitch boundary, and a small fixed amount of band-pass-filtered white noise is added to ensure that a small random/stochastic component is present in the excitation signal. To avoid a windy noise in the synthesized speech, a stitching mechanism is applied in which a number of excitation signals are formed using right-shifted pitch boundaries and circularly left-shifted glottal pulses. The right shift applied to the pitch boundaries used for construction is a fixed constant, and the glottal pulse used for construction is circularly left-shifted by the same amount. The final stitched excitation is the arithmetic mean of the excitation signals. Passing this mean through the MLSA filter yields the speech signal.
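The voiced part of the time-domain excitation can be sketched as below. The helper names `pitch_boundaries` and `build_excitation` are hypothetical, a fundamental frequency of zero is assumed to mark an unvoiced frame, and the white-noise components, the stitching mechanism, and the MLSA filter are deliberately left out.

```python
import numpy as np

def pitch_boundaries(f0_per_frame, sample_rate, frame_shift=0.005):
    """Step through the signal one pitch period at a time, reading the
    fundamental frequency of the frame that contains each boundary."""
    hop = int(round(frame_shift * sample_rate))   # samples per frame
    total = hop * len(f0_per_frame)
    boundaries, pos = [], 0
    while pos < total:
        f0 = f0_per_frame[min(pos // hop, len(f0_per_frame) - 1)]
        if f0 <= 0:          # assumed convention: unvoiced frame
            pos += hop
            continue
        boundaries.append(pos)
        pos += max(1, int(round(sample_rate / f0)))
    return boundaries, total

def build_excitation(pulse, boundaries, total):
    """Overlap-add a unit-energy pulse at every pitch boundary."""
    pulse = pulse / np.linalg.norm(pulse)          # normalize to unit energy
    excitation = np.zeros(total + len(pulse))      # zero-initialized
    for b in boundaries:
        excitation[b:b + len(pulse)] += pulse
    return excitation[:total]

boundaries, total = pitch_boundaries([100.0] * 4, sample_rate=16000)
excitation = build_excitation(np.hanning(40), boundaries, total)
```

At 16 kHz with a constant 100 Hz fundamental, the boundaries fall every 160 samples, so four 5 ms frames hold two pitch cycles.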

In step 505, text is input into the model of the speech synthesis system. For example, the model obtained in FIG. 1 (the context-dependent HMMs 120) receives the input text and provides the features that are subsequently used to synthesize the speech corresponding to the input text, as described below. Control passes to steps 510 and 515 and process 500 continues.

In step 510, the feature vector for each frame is predicted. This prediction may be performed using methods that are standard in the art, such as context-dependent decision trees. Control passes to steps 525 and 540 and process 500 continues.

In step 515, the fundamental frequency value(s) are determined. Control passes to step 520 and process 500 continues.

In step 520, the pitch boundaries are determined. Control passes to step 560 and process 500 continues.

In step 525, the MGCs are determined for each frame. For example, 0 to 39 MGCs are determined. Control passes to step 530 and process 500 continues.

In step 530, the MGCs are converted into a spectrum. Control passes to step 535 and process 500 continues.

In step 540, the energy coefficients for each frame are determined. Control passes to step 545 and process 500 continues.

In step 545, the Eigen pulses are determined and normalized. Control passes to step 550 and process 500 continues.

In step 550, the FFT is applied. Control passes to step 535 and process 500 continues.

In step 535, data multiplication may be performed. For example, the data from step 550 is multiplied with the data at step 535. In an embodiment, this may be done as a sample-by-sample multiplication. Control passes to step 555 and process 500 continues.

In step 555, the inverse FFT is applied. Control passes to step 560 and process 500 continues.

In step 560, overlap-add is performed on the speech signal. Control passes to step 565 and process 500 continues.

In step 565, the output speech signal is received and process 500 ends.

While the invention has been illustrated and described in detail in the drawings and the foregoing description, the same is to be considered illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes, and modifications that come within the spirit of the invention as described herein and/or by the following claims are desired to be protected.

Hence, the proper scope of the present invention should be determined only by the broadest interpretation of the appended claims, so as to encompass all such modifications as well as all relationships equivalent to those illustrated in the drawings and described in the specification.

Claims

A method for creating parametric models for use in training a speech synthesis system, wherein the system comprises at least a training text corpus, a speech database, and a model training module, the method comprising:
(A) obtaining, by the model training module, speech data for the training text corpus, wherein the speech data comprises recorded speech signals and corresponding transcriptions;
(B) converting, by the model training module, the training text corpus into context-dependent phoneme labels;
(C) extracting, by the model training module, for each frame of speech in the speech signal from the speech training database, at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;
(D) forming, by the model training module, a feature vector stream for each frame of speech using the at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;
(E) labeling the speech with context-dependent phonemes;
(F) extracting the duration of each context-dependent phoneme from the labeled speech;
(G) performing parameter estimation of the speech signal, wherein the parameter estimation is performed comprising the features, HMMs, and decision trees; and
(H) identifying a plurality of sub-band Eigen glottal pulses, wherein the sub-band Eigen glottal pulses comprise separate models used to form the excitation during synthesis.

The method of claim 1, wherein determining the spectral features comprises:
(A) determining an energy coefficient from the speech signal;
(B) pre-emphasizing the speech signal and determining MGC coefficients for each frame of the pre-emphasized speech signal;
(C) appending the energy coefficient and the MGC coefficients to form the MGC for each frame of the signal; and
(D) extracting a spectral vector for each frame.

The method of claim 1, wherein determining the plurality of band excitation energy coefficients comprises:
(A) determining fundamental frequency values from the speech signal;
(B) performing pre-emphasis on the speech signal;
(C) performing LPC analysis on the pre-emphasized speech signal;
(D) performing inverse filtering on the speech signal and the LPC-analyzed signal;
(E) segmenting glottal cycles using the fundamental frequency values and the inverse-filtered speech signal;
(F) decomposing the corresponding glottal cycles for each frame into sub-band components;
(G) computing the energy of each sub-band component to form a plurality of energy coefficients for each frame; and
(H) extracting an excitation vector for each frame using the energy coefficients.

The method of claim 3, wherein the sub-band components comprise at least two bands.

The method of claim 4, wherein the sub-band components comprise at least a high-band component and a low-band component.

The method of claim 1, wherein identifying the plurality of sub-band Eigen glottal pulses further comprises:
(A) creating a glottal pulse database using the speech data;
(B) decomposing each pulse into a plurality of sub-band components;
(C) dividing the sub-band components into a plurality of databases based on the decomposing;
(D) determining a vector representation of each database;
(E) determining Eigen pulse values for each database from the vector representation; and
(F) selecting the best Eigen pulse for each database for use in synthesis.

The method of claim 6, wherein the plurality of sub-band components comprise a low band and a high band.

The method of claim 6, wherein the glottal pulse database is obtained by:
(A) performing linear prediction analysis on a speech signal;
(B) performing inverse filtering of the signal to obtain an integrated linear prediction residual; and
(C) segmenting the integrated linear prediction residual into glottal cycles to obtain a plurality of glottal pulses.

The method of claim 6, wherein the decomposing further comprises:
(A) determining a cut-off frequency, wherein the cut-off frequency separates the sub-band components into groupings;
(B) obtaining a zero crossing at the edge of the low-frequency bulge;
(C) placing zeros in the high-band region of the spectrum and obtaining a time-domain version of the low-frequency component of the glottal pulse, wherein the obtaining comprises performing an inverse FFT; and
(D) placing zeros in the low-band region of the spectrum prior to obtaining the time-domain version of the high-frequency component of the glottal pulse, wherein the obtaining comprises performing an inverse FFT.

The method of claim 9, wherein the groupings comprise a low-band grouping and a high-band grouping.

The method of claim 9, wherein the separating of the sub-band components into groupings is performed using the ZFR method with a suitably sized window and applied to the spectral magnitude.

The method of claim 6, wherein determining the vector representation of each database further comprises a set of distances from a fixed number of points of the metric space, the points being obtained as centroids after metric-based clustering of a large set of signals from the metric space.

A method for identifying sub-band Eigen pulses from a glottal pulse database for training a speech synthesis system, the method comprising:
(A) receiving pulses from the glottal pulse database;
(B) decomposing each pulse into a plurality of sub-band components;
(C) dividing the sub-band components into a plurality of databases based on the decomposing;
(D) determining a vector representation of each database;
(E) determining Eigen pulse values for each database from the vector representation; and
(F) selecting the best Eigen pulse for each database for use in synthesis.

The method of claim 13, wherein the plurality of sub-band components comprise a low band and a high band.

The method of claim 13, wherein the glottal pulse database is obtained by:
(A) performing linear prediction analysis on a speech signal;
(B) performing inverse filtering of the signal to obtain an integrated linear prediction residual; and
(C) segmenting the integrated linear prediction residual into glottal cycles to obtain a plurality of glottal pulses.

The method of claim 13, wherein the decomposing further comprises:
(A) determining a cut-off frequency, wherein the cut-off frequency separates the sub-band components into groupings;
(B) obtaining a zero crossing at the edge of the low-frequency bulge;
(C) placing zeros in the high-band region of the spectrum prior to obtaining the time-domain version of the low-frequency component of the glottal pulse, wherein the obtaining comprises performing an inverse FFT; and
(D) placing zeros in the low-band region of the spectrum prior to obtaining the time-domain version of the high-frequency component of the glottal pulse, wherein the obtaining comprises performing an inverse FFT.

The method of claim 16, wherein the groupings comprise a low-band grouping and a high-band grouping.

The method of claim 16, wherein the separating of the sub-band components into groupings is performed using the ZFR method with a suitably sized window and applied to the spectral magnitude.

The method of claim 13, wherein determining the vector representation of each database further comprises a set of distances from a fixed number of points of the metric space, the points being obtained as centroids after metric-based clustering of a large set of signals from the metric space.