JP6523423B2

JP6523423B2 - Speech synthesizer, speech synthesis method and program

Info

Publication number: JP6523423B2
Application number: JP2017241425A
Authority: JP
Inventors: 眞弘森田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2017-12-18
Filing date: 2017-12-18
Publication date: 2019-05-29
Anticipated expiration: 2034-02-10
Also published as: JP2018041116A

Description

Embodiments of the present invention relate to a speech synthesizer, a speech synthesis method, and a program.

In speech synthesis, there is a growing need not only to have text read aloud in a voice chosen from a small number of voices prepared in advance, but also to newly generate a speech synthesis dictionary for the voice of a specific speaker, such as a celebrity or someone familiar, and have it read various text content. To meet this need, techniques have been proposed for automatically generating a speech synthesis dictionary from the speech data of the target speaker, that is, the speaker for whom the dictionary is generated. In addition, as a technique for generating a speech synthesis dictionary from a small amount of speech data of the target speaker, there is speaker adaptation, which generates a model of the target speaker by transforming a model prepared in advance, representing the average characteristics of a plurality of speakers, so that it approaches the characteristics of the target speaker.

Conventional techniques for automatically generating a speech synthesis dictionary aim mainly at resembling the target speaker's voice and way of speaking as closely as possible. However, target speakers include not only professional narrators and voice actors but also ordinary speakers who have had no vocal training at all. Consequently, if the target speaker's speech skill is low, that low skill is faithfully reproduced, and the resulting speech synthesis dictionary is hard to use for some applications.

There is also a need to generate a speech synthesis dictionary in the target speaker's voice not only for the speaker's native language but also for a foreign language. If speech of the target speaker reading the foreign language can be recorded, a speech synthesis dictionary for that language can be generated from the recording. However, if the dictionary is generated from recordings containing utterances that are incorrect for the language, or unnatural utterances with a strong accent, those characteristics are reflected in the dictionary, and the synthesized speech may be unintelligible even to native speakers.

Patent Document 1: JP 2013-72903 A
Patent Document 2: JP 2002-244689 A

The problem to be solved by the present invention is to provide a speech synthesizer, a speech synthesis method, and a program capable of generating a speech synthesis dictionary in which the degree of similarity in speaker individuality is adjusted according to the speech skill or degree of nativeness aimed for.

A speech synthesizer according to an embodiment includes a speech analysis unit, a speaker adaptation unit, a desired speaker level designation unit, a determination unit, and a speech synthesis unit. The speech analysis unit analyzes speech data of an arbitrary target speaker and generates a speech database containing data that represents the characteristics of the target speaker's utterances. The speaker adaptation unit performs, based on the speech database, speaker adaptation that transforms a predetermined base model so that it approaches the characteristics of the target speaker, thereby generating a model of the target speaker. The desired speaker level designation unit accepts designation of a desired speaker level, which is the speaker level aimed for, where a speaker level represents at least one of a speaker's speech skill and the speaker's degree of nativeness in the language of the speech synthesis dictionary. The determination unit determines the value of a parameter related to the fidelity with which speaker individuality is reproduced in the speaker adaptation, according to the relationship between the designated desired speaker level and the target speaker level, which is the speaker level of the target speaker. The speech synthesis unit generates a speech waveform according to the value of the parameter.

FIG. 1 is a block diagram showing a configuration example of a speech synthesis dictionary generation device according to a first embodiment.
FIG. 2 is a block diagram showing a schematic configuration of a speech synthesizer.
FIG. 3 is a conceptual diagram of the piecewise linear regression used in HMM-based speaker adaptation.
FIG. 4 is a diagram showing an example of how the determination unit determines the parameter value.
FIG. 5 is a block diagram showing a configuration example of a speech synthesis dictionary generation device according to a second embodiment.
FIG. 6 is a block diagram showing a configuration example of a speech synthesis dictionary generation device according to a third embodiment.
FIG. 7 is a diagram showing a display example of a GUI for designating the desired speaker level.
FIG. 8 is a conceptual diagram of speaker adaptation using a model trained by cluster adaptive training.
FIG. 9 is a conceptual diagram showing the relationship between the interpolation ratio r in equation (2) and the target weight vector.
FIG. 10 is a block diagram showing a configuration example of a speech synthesis dictionary generation device according to a sixth embodiment.

(First Embodiment)
FIG. 1 is a block diagram showing a configuration example of a speech synthesis dictionary generation device 100 according to this embodiment. As shown in FIG. 1, the speech synthesis dictionary generation device 100 includes a speech analysis unit 101, a speaker adaptation unit 102, a target speaker level designation unit 103 (for the speaker level of the target speaker), a desired speaker level designation unit 104 (for the speaker level aimed for), and a determination unit 105. When the device receives recorded speech 10 of an arbitrary target speaker for whom a dictionary is to be generated, together with text 20 corresponding to what was read aloud (hereinafter, "recorded text"), it generates a speech synthesis dictionary 30 containing a model of the target speaker that captures the speaker's voice quality and way of speaking.

Of these components, the target speaker level designation unit 103, the desired speaker level designation unit 104, and the determination unit 105 are specific to this embodiment; the rest are common to speech synthesis dictionary generation devices that use speaker adaptation.

The speech synthesis dictionary 30 generated by the speech synthesis dictionary generation device 100 is the data a speech synthesizer needs. It contains an acoustic model of voice quality, a prosody model of prosody such as intonation and rhythm, and various other information required for speech synthesis. As shown in FIG. 2, a speech synthesizer usually consists of a language processing unit 40 and a speech synthesis unit 50 and, given input text, generates the corresponding speech waveform. The language processing unit 40 analyzes the input text to obtain linguistic information such as readings, accents, pause positions, word boundaries, and parts of speech, and passes it to the speech synthesis unit 50. Based on this information, the speech synthesis unit 50 generates prosodic patterns such as intonation and rhythm using the prosody model in the speech synthesis dictionary 30, and then generates the speech waveform using the acoustic model in the speech synthesis dictionary 30.

In a method based on hidden Markov models (HMMs), such as that described in Patent Document 2, the prosody model and acoustic model in the speech synthesis dictionary 30 model the correspondence between the phonological and linguistic information obtained by linguistic analysis of text and the parameter sequences for prosody, acoustics, and so on. Specifically, each model consists of decision trees in which each parameter is clustered, state by state, according to phonological and linguistic context, and of the probability distributions of the parameter assigned to the leaf nodes of those trees. Prosodic parameters include a pitch parameter representing the height of the voice and a duration representing the length of a sound. Acoustic parameters include spectral parameters representing vocal tract characteristics and an aperiodicity index representing the degree of aperiodicity of the excitation signal. A state here is an internal state of the HMM that models how each parameter changes over time. Each phoneme segment is usually modeled by a left-to-right HMM with three to five states and no backward transitions, so it contains three to five states. For example, in the decision tree for the first state of the pitch parameter, the probability distributions of pitch values in the initial part of a phoneme segment are clustered by phonological and linguistic context; tracing this tree using the phonological and linguistic information of the phoneme segment of interest yields the probability distribution of the pitch parameter for the initial part of that phoneme. A normal distribution is often used for these distributions, in which case each is represented by a mean vector giving the center of the distribution and a covariance matrix giving its spread.
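The decision-tree lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the two context questions, the context keys (`phoneme`, `accented`), and the log-F0 values are all hypothetical, and a scalar Gaussian stands in for the mean vector and covariance matrix.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Node:
    # Internal nodes hold a yes/no question on the context; leaves hold a Gaussian.
    question: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    mean: Optional[float] = None      # set on leaves only
    variance: Optional[float] = None  # set on leaves only


def lookup(node: Node, context: dict) -> Tuple[float, float]:
    """Follow the context questions down to a leaf and return (mean, variance)."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.mean, node.variance


# Hypothetical tree for the first HMM state of the pitch parameter
# (the log-F0 means and variances are made-up numbers).
tree = Node(
    question=lambda c: c["phoneme"] in {"a", "i", "u", "e", "o"},
    yes=Node(
        question=lambda c: c["accented"],
        yes=Node(mean=5.4, variance=0.02),
        no=Node(mean=5.1, variance=0.03),
    ),
    no=Node(mean=4.9, variance=0.05),
)

print(lookup(tree, {"phoneme": "a", "accented": True}))
```

A real dictionary would hold one such tree per parameter and per state, with questions derived from the phonological and linguistic context rather than hand-written lambdas.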

The speech synthesis unit 50 selects the probability distribution for each state of each parameter using the decision trees described above, generates for each parameter the sequence that maximizes the probability under these distributions, and generates a speech waveform from the resulting parameter sequences. In a typical HMM-based method, an excitation waveform is generated from the generated pitch parameter and aperiodicity index, and the speech waveform is obtained by convolving this excitation waveform with a vocal tract filter whose characteristics vary over time according to the generated spectral parameters.

The speech analysis unit 101 analyzes the recorded speech 10 and recorded text 20 input to the speech synthesis dictionary generation device 100 and generates a speech database (hereinafter, speech DB) 110. The speech DB 110 contains the acoustic and prosodic data needed for speaker adaptation, that is, data representing the characteristics of the target speaker's utterances. Specifically, it contains time series (for example, per frame) of parameters such as spectral parameters representing the spectral envelope, an aperiodicity index representing the ratio of aperiodic components in each frequency band, and a pitch parameter representing the fundamental frequency (F0); sequences of labels such as phonemes, with time information for each label (phoneme start and end times, etc.) and linguistic information (the accent, headword, and part of speech of the word containing the phoneme, the connection strength with neighboring words, etc.); and information on pause positions and lengths. The speech DB 110 contains at least some of this information and may contain other information as well. Mel-frequency cepstrum (mel-cepstrum) and mel-frequency line spectral pairs (mel-LSP) are commonly used as spectral parameters, but any parameter that represents the spectral envelope may be used.

To generate the information held in the speech DB 110, the speech analysis unit 101 automatically performs processing such as phoneme labeling, fundamental frequency extraction, spectral envelope extraction, aperiodicity index extraction, and linguistic information extraction. Several existing methods are available for each of these processes; any of them may be used, or a new method may be used. For example, HMM-based methods are commonly used for phoneme labeling. Fundamental frequency extraction can use the autocorrelation of the speech waveform, the cepstrum, the harmonic structure of the spectrum, and many other methods. Spectral envelope extraction can use pitch-synchronous analysis, the cepstrum, the method known as STRAIGHT, and others. Aperiodicity index extraction can use the autocorrelation of the speech waveform in each frequency band, or can split the speech waveform into periodic and aperiodic components with the method known as PSHF and compute the power ratio in each frequency band. Linguistic information extraction obtains accent information, parts of speech, connection strength between words, and so on from the results of language processing such as morphological analysis.
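Of the extraction methods listed, the autocorrelation approach to fundamental frequency is the simplest to illustrate. The sketch below is a bare-bones version under stated assumptions: the function name `estimate_f0`, the 50 to 400 Hz search range, and the synthetic 200 Hz test tone are illustrative choices, and the patent does not prescribe this particular procedure. Practical extractors add windowing, voicing decisions, and octave-error handling.

```python
import math


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Return the F0 (Hz) whose lag maximizes the frame's raw autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test signal: a pure 200 Hz sine sampled at 16 kHz (period = 80 samples).
sample_rate = 16000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1024)]
f0 = estimate_f0(frame, sample_rate)
print(f0)
```

The frame must be longer than the largest lag searched (here 320 samples), otherwise the inner sum runs over an empty range.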

The speech DB 110 generated by the speech analysis unit 101 is used by the speaker adaptation unit 102, together with a speaker adaptation base model 120, to generate the model of the target speaker.

Like the models in the speech synthesis dictionary 30, the speaker adaptation base model 120 models the correspondence between the phonological and linguistic information obtained by linguistic analysis of text and parameter sequences such as spectral parameters, the pitch parameter, and the aperiodicity index. Usually, a model representing the average characteristics of multiple speakers is trained on a large amount of speech data from those speakers, and a model covering a wide range of phonological and linguistic contexts is used as the speaker adaptation base model 120. In an HMM-based method such as that described in Patent Document 2, for example, the speaker adaptation base model 120 consists of decision trees in which each parameter is clustered by phonological and linguistic context, and of the probability distributions of the parameter assigned to the leaf nodes of those trees.

The speaker adaptation base model 120 can be trained, as described in Patent Document 2, by training a "speaker-independent model" from the speech data of multiple speakers using the standard model training procedure for HMM speech synthesis, or, as described in Reference 1 below, by using speaker adaptive training (SAT), which normalizes the variation in characteristics between speakers during training.
(Reference 1) J. Yamagishi and T. Kobayashi, "Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training", IEICE Trans. Information and Systems, Vol. No. 2, pp. 533-543 (2007-2).

In this embodiment, the speaker adaptation base model 120 is, in principle, trained on speech data from multiple speakers who are native speakers of the language and have high speech skill.

The speaker adaptation unit 102 uses the speech DB 110 to perform speaker adaptation, transforming the speaker adaptation base model 120 so that it approaches the characteristics of the target speaker (the speaker of the recorded speech 10), and thereby generates a model whose voice quality and way of speaking are close to the target speaker's. Using a method such as maximum likelihood linear regression (MLLR), constrained maximum likelihood linear regression (cMLLR), or structural maximum a posteriori linear regression (SMAPLR), the probability distributions of the speaker adaptation base model 120 are optimized to fit the parameters in the speech DB 110, which brings the base model closer to the target speaker's characteristics. With maximum likelihood linear regression, for example, the mean vector $\mu_i$ of the probability distribution assigned to leaf node $i$ of a decision tree is transformed according to equation (1):

$$\hat{\mu}_i = A\mu_i + b = W\xi_i \qquad (1)$$

Here $A$ and $W$ are matrices, $b$ and $\xi_i$ are vectors, $\xi_i = [1, \mu_i^{T}]^{T}$ ($T$ denotes transpose), and $W = [b\ A]$; $W$ is called the regression matrix.
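As a small numerical illustration of the mean transformation in equation (1), the sketch below applies a regression matrix W = [b A] to the extended vector ξ = [1, μ^T]^T in plain Python. The function name `mllr_transform_mean` and the 2-dimensional values of A, b, and μ are hypothetical. Estimating W by maximum likelihood from the adaptation data is the substantial part of MLLR and is not shown here.

```python
def mllr_transform_mean(W, mu):
    """Apply equation (1): mu_hat = W @ xi, with xi = [1, mu^T]^T and W = [b A]."""
    xi = [1.0] + list(mu)  # extended mean vector
    return [sum(w * x for w, x in zip(row, xi)) for row in W]


# Hypothetical 2-dimensional example with b = [0.5, -1.0] and A = [[2, 0], [0, 3]].
W = [
    [0.5, 2.0, 0.0],   # first column is b, remaining columns are A
    [-1.0, 0.0, 3.0],
]
mu = [1.0, 2.0]
mu_hat = mllr_transform_mean(W, mu)  # equals A @ mu + b
print(mu_hat)
```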

In the transformation of equation (1), the regression matrix W is first optimized so that the likelihood of the transformed probability distributions with respect to the parameters of the target speaker's model is maximized, and the transformation is then applied. The covariance matrices may be transformed in addition to the mean vectors, but the details are omitted here.

In this kind of MLLR transformation, the probability distributions of all leaf nodes of a decision tree may be transformed with a single common regression matrix. However, differences in speaker individuality generally vary with the phoneme and other factors, so a single matrix gives a very coarse transformation: the target speaker's individuality may not be reproduced adequately, and even the phonological characteristics may be degraded. Conversely, when a large amount of speech data is available for the target speaker, very precise speaker adaptation is possible by preparing a separate regression matrix for the probability distribution of each leaf node. In most situations where speaker adaptation is used, however, only a small amount of target speaker data is available, so the data assigned to each leaf node is very sparse or absent, and the regression matrix cannot be computed for many leaf nodes.

For this reason, the source probability distributions are usually clustered into a number of regression classes, and a transformation matrix is obtained for each regression class and used to transform the distributions in that class. This kind of transformation is called piecewise linear regression; FIG. 3 illustrates the idea. Clustering into regression classes usually uses either the decision tree (normally a binary tree) of the speaker adaptation base model 120, which is clustered by phonological and linguistic context as in FIG. 3, or a binary tree obtained by clustering the probability distributions of all leaf nodes as physical quantities based on the distance between distributions (hereinafter, these decision trees and binary trees are called regression class trees). In these methods, a minimum threshold is set on the amount of target speaker speech data per regression class, so that the granularity of the regression classes is controlled according to the amount of target speaker data.

Specifically, each sample of the target speaker's parameters is first checked to see which leaf node of the regression class tree it is assigned to, and the number of samples assigned to each leaf node is counted. If a leaf node has fewer samples than the threshold, the procedure goes back to its parent node and merges the leaf nodes under that parent. This is repeated until no leaf node falls below the minimum threshold, and each leaf node that remains becomes a regression class. As a result, when the amount of target speaker speech data is small, each regression class is large (that is, there are few transformation matrices) and the adaptation is coarse-grained; when the amount of speech data is large, each regression class is small (that is, there are many transformation matrices) and the adaptation is fine-grained.
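The merging procedure can be sketched as a bottom-up pass over a binary regression class tree. This is an illustrative reading of the procedure rather than the patent's implementation: the `RNode` class, the leaf names, and the sample counts are hypothetical. If even the root has too few samples, the sketch simply returns one global class.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RNode:
    name: str
    count: int = 0                    # adaptation samples assigned (leaves only)
    left: Optional["RNode"] = None
    right: Optional["RNode"] = None


def regression_classes(node: RNode, min_samples: int) -> List[Tuple[List[str], int]]:
    """Return regression classes as (leaf names, sample count) pairs."""
    if node.left is None:             # original leaf of the regression class tree
        return [([node.name], node.count)]
    classes = (regression_classes(node.left, min_samples)
               + regression_classes(node.right, min_samples))
    # If any class below this node is short of data, merge everything under it.
    if any(count < min_samples for _, count in classes):
        names = [n for cls_names, _ in classes for n in cls_names]
        return [(names, sum(count for _, count in classes))]
    return classes


# Hypothetical tree with four leaves and made-up sample counts.
root = RNode("root",
             left=RNode("A", left=RNode("a1", 120), right=RNode("a2", 30)),
             right=RNode("B", left=RNode("b1", 200), right=RNode("b2", 150)))

for threshold in (10, 100, 180):
    print(threshold, regression_classes(root, threshold))
```

With these counts, a threshold of 10 keeps all four leaves as separate classes, 100 merges `a1` and `a2`, and 180 collapses everything into a single class, which shows how a larger threshold gives coarser adaptation.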

In this embodiment, the speaker adaptation unit 102 obtains a transformation matrix for each regression class and transforms the probability distributions as described above, and it has a parameter through which the granularity of the regression classes (that is, the fidelity with which speaker individuality is reproduced in speaker adaptation) can be controlled from outside, such as the minimum threshold on the amount of target speaker speech data per regression class. When the granularity is controlled through such a minimum threshold, a fixed value determined empirically for each type of prosodic or acoustic parameter is normally used, and it is often set to a relatively small value within the range that leaves enough data to compute the transformation matrices. In that case, the target speaker's voice quality and vocal characteristics are reproduced as faithfully as the available amount of speech data allows.

If this minimum threshold is instead set to a larger value, the regression classes become larger and the adaptation coarser. The generated model then approaches the target speaker's voice quality and way of speaking overall, but reflects the speaker adaptation base model 120 in its finer details. In other words, raising the minimum threshold lowers the fidelity with which speaker individuality is reproduced in speaker adaptation. In this embodiment, the determination unit 105 described later determines the value of this parameter based on the relationship between the speaker level of the target speaker and the desired speaker level (the speaker level expected of speech synthesized with the speech synthesis dictionary 30), and the value is input to the speaker adaptation unit 102.

The term "speaker level" as used in this embodiment represents at least one of a speaker's speech skill and the speaker's degree of nativeness in the language of the speech synthesis dictionary 30 to be generated. The speaker level of the target speaker is called the "target speaker level", and the speaker level aimed for is called the "desired speaker level". A speaker's speech skill is a numerical value or category representing the accuracy of the speaker's pronunciation and accent and the fluency of the speaker's delivery; for example, a speaker with very halting delivery might be given 10, and a professional announcer who speaks accurately and fluently might be given 100. A speaker's degree of nativeness is a numerical value or category representing whether the language in question is the speaker's mother tongue and, if not, how skilled the speaker is at speaking it; for example, 100 for a mother tongue and 0 for a language the speaker has never studied. Depending on the application, the speaker level may be the speech skill alone, the degree of nativeness alone, or both. An index combining speech skill and degree of nativeness may also be used as the speaker level.

The target speaker level designation unit 103 accepts designation of the target speaker level and passes the designated level to the determination unit 105. For example, when a user, such as the target speaker, specifies the target speaker level through some user interface, the target speaker level designation unit 103 accepts the designation made by that operation and passes it to the determination unit 105. If the target speaker level can be assumed from, for example, the intended use of the speech synthesis dictionary 30, a fixed assumed value may be set in advance as the target speaker level. In this case, the speech synthesis dictionary generation device 100 includes, instead of the target speaker level designation unit 103, a storage unit that stores the preset target speaker level.

The desired speaker level designation unit 104 accepts designation of the desired speaker level, that is, the speaker level aimed for, and passes the designated level to the determination unit 105. For example, when a user, such as the target speaker, specifies the desired speaker level through some user interface, the desired speaker level designation unit 104 accepts the designation made by that operation and passes it to the determination unit 105. When the target speaker's speech skill or degree of nativeness is low, for instance, the user may want speech in a voice resembling the target speaker's but sounding more professional or more native than the target speaker actually does. In such a case, the user can specify a higher desired speaker level.

The determination unit 105 determines the value of the parameter related to the fidelity of speaker-characteristic reproduction in the speaker adaptation performed by the speaker adaptation unit 102 described above. It does so according to the relationship between the goal speaker level (the level to be achieved) passed from the goal speaker level designation unit 104 and the target speaker level (the level of the speaker being modeled) passed from the target speaker level designation unit 103.

FIG. 4 shows an example of how the determination unit 105 determines the parameter value. FIG. 4 is a two-dimensional plane that classifies the relationship between the goal speaker level and the target speaker level: the horizontal axis corresponds to the target speaker level and the vertical axis to the goal speaker level. The diagonal broken line in the figure marks the positions where the goal speaker level equals the target speaker level. The determination unit 105 determines, for example, which of regions A to D in FIG. 4 contains the relationship between the goal speaker level passed from the goal speaker level designation unit 104 and the target speaker level passed from the target speaker level designation unit 103. When the relationship falls in region A, the determination unit 105 sets the parameter related to the fidelity of speaker-characteristic reproduction to a predetermined default value, namely the value at which that fidelity is maximal. Region A covers the cases where the goal speaker level is equal to or lower than the target speaker level, or where the goal speaker level is higher than the target speaker level but the difference is less than a predetermined value. The latter case is included in region A to give the default-value region a margin that allows for uncertainty in the speaker levels. Such a margin is not essential, however; region A may consist only of the region where the goal speaker level is equal to or lower than the target speaker level (the region to the lower right of the diagonal broken line in the figure).

When the relationship between the goal speaker level and the target speaker level falls in region B, the determination unit 105 sets the parameter related to the fidelity of speaker-characteristic reproduction to a value that gives lower fidelity than the default value. When the relationship falls in region C, the determination unit 105 sets the parameter to a value that gives still lower fidelity than in region B. When the relationship falls in region D, the determination unit 105 sets the parameter to a value that gives still lower fidelity than in region C.
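The region test can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not say where the boundaries between regions A, B, C and D lie, so `MARGIN`, `B_UPPER` and `C_UPPER` are hypothetical values, and treating the regions as bands ordered by the level difference is likewise an assumption. The names `Region` and `classify_region` are invented for this sketch.

```python
from enum import Enum


class Region(Enum):
    A = "A"  # default parameter value (maximum fidelity)
    B = "B"  # lower fidelity than the default
    C = "C"  # lower fidelity than B
    D = "D"  # lower fidelity than C


# Hypothetical boundaries on (goal level - target speaker level).
# The source gives no numeric values; these are placeholders.
MARGIN = 1.0   # differences below this still count as region A
B_UPPER = 3.0  # upper bound of region B
C_UPPER = 6.0  # upper bound of region C


def classify_region(goal_level: float, target_speaker_level: float) -> Region:
    """Classify the pair (目標話者レベル, 対象話者レベル) into a region.

    goal_level: level the generated dictionary should reach.
    target_speaker_level: level of the speaker being modeled.
    """
    diff = goal_level - target_speaker_level
    if diff < MARGIN:      # goal <= target, or only slightly higher
        return Region.A
    if diff < B_UPPER:
        return Region.B
    if diff < C_UPPER:
        return Region.C
    return Region.D
```

Setting `MARGIN` to a value just above zero would correspond to the variant without a margin, where region A is only the area on or below the diagonal.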

As described above, when the goal speaker level is higher than the target speaker level, the determination unit 105 sets the parameter related to the fidelity of speaker-characteristic reproduction to a value that gives lower fidelity than the default value, and it chooses the value so that the fidelity decreases as the difference grows. In doing so, the degree of change may differ between the parameter used to generate the acoustic model and the parameter used to generate the prosody model, both of which belong to the target speaker's model generated by speaker adaptation.

For many speakers, speaker characteristics show up more strongly in voice quality than in prosody. Voice quality therefore has to be reproduced faithfully, whereas for prosody it is often enough to match only the average level to the speaker for the speaker characteristics to be reproduced to a reasonable degree. In addition, most speakers find it relatively easy to pronounce each syllable of a sentence so that it is heard correctly, but reading with prosody (accent, intonation, rhythm) as natural and easy to listen to as a professional narrator's is difficult without considerable training. The same holds when reading a foreign language. For example, a Japanese speaker who has never studied Chinese can pronounce each syllable more or less correctly by reading Chinese pinyin or its kana transcription, but reading with the correct tones (the four tones, in the case of Mandarin) is almost impossible. Accordingly, when the parameter values related to the fidelity of speaker-characteristic reproduction are set to give lower fidelity than the default values, the change from the default is made larger for the parameter used to generate the prosody model than for the parameter used to generate the acoustic model. This makes it easier to generate a speech synthesis dictionary 30 that combines reproduction of speaker characteristics with a high level of speech skill.

For example, suppose the parameter related to the fidelity of speaker-characteristic reproduction is the minimum threshold, described above, on the amount of target speaker speech data per regression class. If the relationship between the goal speaker level and the target speaker level falls in region B of FIG. 4, the parameter used to generate the acoustic model is set to 10 times its default value and the parameter used to generate the prosody model is set to 10 times its default value. If the relationship falls in region C of FIG. 4, the parameter used to generate the acoustic model is set to 30 times its default value and the parameter used to generate the prosody model to 100 times its default value. If the relationship falls in region D of FIG. 4, the parameter used to generate the acoustic model is set to 100 times its default value and the parameter used to generate the prosody model to 1000 times its default value. This is one conceivable method.
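The multipliers in this example can be written as a small lookup table. The Python sketch below is illustrative only: the function name `fidelity_thresholds` and the use of region labels as strings are assumptions, while the multipliers for regions B, C and D are the ones stated in the text and region A keeps the default values.

```python
# Multipliers applied to the default minimum data threshold per regression
# class, as (acoustic model, prosody model). Region A keeps the defaults.
MULTIPLIERS = {
    "A": (1, 1),
    "B": (10, 10),
    "C": (30, 100),
    "D": (100, 1000),
}


def fidelity_thresholds(region: str,
                        default_acoustic: float,
                        default_prosody: float) -> tuple[float, float]:
    """Return (acoustic, prosody) minimum data thresholds for a region.

    A larger threshold means fewer regression classes receive their own
    transform, i.e. coarser adaptation and lower speaker fidelity.
    """
    if region not in MULTIPLIERS:
        raise ValueError(f"unknown region: {region!r}")
    acoustic_mult, prosody_mult = MULTIPLIERS[region]
    return default_acoustic * acoustic_mult, default_prosody * prosody_mult
```

For instance, with hypothetical defaults of 2.0 and 3.0, region C yields thresholds of 60.0 for the acoustic model and 300.0 for the prosody model.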

As described above, in the speech synthesis dictionary generation device 100 of this embodiment, designating a goal speaker level higher than the target speaker level automatically lowers the fidelity of speaker-characteristic reproduction in speaker adaptation. The resulting speech synthesis dictionary 30 is close overall to the speaker's voice quality and manner of speaking, but in its finer details it has the features of the speaker adaptation base model 120, that is, features of high speech skill and high nativeness in the language. In this way, the speech synthesis dictionary generation device 100 of this embodiment can generate a speech synthesis dictionary 30 whose speaker similarity is adjusted according to the desired speech skill and nativeness. It can thus achieve highly skilled synthesized speech even when the target speaker's speech skill is low, and near-native synthesized speech even when the target speaker's nativeness is low.

(Second Embodiment)
In the first embodiment, the target speaker level is designated by a user such as the target speaker, or a fixed assumed value is set in advance. However, it is very difficult to designate or set a target speaker level that properly matches the actual speech skill and nativeness in the recorded speech 10. In this embodiment, therefore, the target speaker level is estimated from the results of the speech analysis unit 101's analysis of the target speaker's speech data, and the value of the parameter related to the fidelity of speaker-characteristic reproduction is determined according to the relationship between the designated goal speaker level and the estimated target speaker level.

FIG. 5 is a block diagram showing a configuration example of the speech synthesis dictionary generation device 200 of this embodiment. As shown in FIG. 5, the speech synthesis dictionary generation device 200 includes a target speaker level estimation unit 201 in place of the target speaker level designation unit 103 shown in FIG. 1. The rest of the configuration is the same as in the first embodiment, so components shared with the first embodiment carry the same reference numerals in the figure and their description is not repeated.

The target speaker level estimation unit 201 judges the target speaker's speech skill and nativeness from the phoneme labeling results produced by the speech analysis unit 101 and from extracted information such as pitch and pauses. For example, a target speaker with low speech skill tends to pause more often than a fluent speaker, so pause frequency can be used to judge the target speaker's speech skill. Various techniques for automatically judging a speaker's speech skill from recorded speech already exist, for purposes such as language learning; one example is disclosed in Reference 2 below.
(Reference 2) Japanese Patent Application Laid-Open No. 2006-201491
In the technique described in Reference 2, an HMM model is used as reference (teacher) data, and a rating of the speaker's pronunciation level is calculated from the probability values obtained by aligning the speaker's speech against it. Any such existing technique may be used.
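As a rough illustration of the pause-frequency idea only, the Python sketch below maps the share of pause labels in a phoneme label sequence to a level. Everything specific here is an assumption rather than something the source defines: the pause label `"pau"`, the rate bounds `best_rate` and `worst_rate`, the 0 to `max_level` scale, and the linear mapping itself. It is not the method of Reference 2.

```python
def pause_rate(labels: list[str], pause_label: str = "pau") -> float:
    """Fraction of labels in the sequence that are pauses."""
    if not labels:
        raise ValueError("label sequence is empty")
    return sum(1 for label in labels if label == pause_label) / len(labels)


def estimate_level_from_pauses(labels: list[str],
                               best_rate: float = 0.05,
                               worst_rate: float = 0.30,
                               max_level: float = 10.0) -> float:
    """Map pause rate linearly to a level in [0, max_level].

    A rate at or below best_rate gives max_level; a rate at or above
    worst_rate gives 0. All bounds are hypothetical placeholders.
    """
    rate = pause_rate(labels)
    score = (worst_rate - rate) / (worst_rate - best_rate)
    return max_level * min(1.0, max(0.0, score))
```

A practical estimator would presumably combine several cues (pitch, alignment likelihoods and so on) rather than rely on pause rate alone.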

As described above, the speech synthesis dictionary generation device 200 of this embodiment automatically determines a target speaker level that matches the actual speaker level in the recorded speech 10, and can therefore generate a speech synthesis dictionary 30 that properly reflects the designated goal speaker level.

(Third Embodiment)
The goal speaker level designated by the user not only affects the speech level and nativeness of the generated speech synthesis dictionary 30 (the target speaker's model); in practice it also adjusts a trade-off against similarity to the target speaker. That is, setting a goal speaker level higher than the target speaker's own speech level or nativeness sacrifices some of the similarity to the target speaker's characteristics. In the first and second embodiments, however, the user merely designates a goal speaker level, which makes it hard to picture what kind of speech synthesis dictionary 30 will finally be generated. Moreover, the range over which this trade-off can actually be adjusted is limited to some extent by the speech level and nativeness of the recorded speech 10, yet the user has to set the goal speaker level without knowing this range in advance.

In this embodiment, therefore, the device presents to the user, according to the input recorded speech 10 and for example by display on a GUI, both the relationship between the goal speaker level to be designated and the speaker similarity expected of the resulting speech synthesis dictionary 30 (the target speaker's model), and the range over which the goal speaker level can be designated. This lets the user picture what kind of speech synthesis dictionary 30 will be generated for a given goal speaker level.

FIG. 6 is a block diagram showing a configuration example of the speech synthesis dictionary generation device 300 of this embodiment. As shown in FIG. 6, the speech synthesis dictionary generation device 300 includes a goal speaker level presentation/designation unit 301 in place of the goal speaker level designation unit 104 shown in FIG. 5. The rest of the configuration is the same as in the first and second embodiments, so components shared with those embodiments carry the same reference numerals in the figure and their description is not repeated.

In the speech synthesis dictionary generation device 300 of this embodiment, when the recorded speech 10 is input, the target speaker level estimation unit 201 estimates the target speaker level and passes the estimate to the goal speaker level presentation/designation unit 301.

Based on the target speaker level estimated by the target speaker level estimation unit 201, the goal speaker level presentation/designation unit 301 obtains the range of goal speaker levels that can be designated and the relationship between the goal speaker levels in that range and the speaker similarity expected of the speech synthesis dictionary 30. It displays these, for example on a GUI, and accepts the user's operation of designating a goal speaker level through that GUI.

FIG. 7 shows display examples of this GUI. FIG. 7(a) is a display example for the case where the target speaker level is estimated to be relatively high, and FIG. 7(b) for the case where it is estimated to be low. Each GUI has a slider S indicating the range over which the goal speaker level can be designated, and the user designates the goal speaker level by moving a pointer P within the slider S. The slider S is drawn diagonally on the GUI, so that the position of the pointer P represents the relationship between the designated goal speaker level and the speaker similarity expected of the generated speech synthesis dictionary 30 (the target speaker's model). The broken-line circles in the figure show the speaker level and speaker similarity for two cases: using the speaker adaptation base model 120 as is, and reproducing the recorded speech 10 faithfully. The speaker adaptation base model 120 has a high speaker level but a voice and speaking style entirely different from the target speaker's, so it sits at the upper left of the figure. The recorded speech 10 is the target speaker's own, so it sits at the right edge of the figure, and its vertical position depends on the target speaker level. The slider S lies between the two broken-line circles. It shows that a setting that reproduces the target speaker faithfully brings both the speaker level and the speaker similarity close to those of the recorded speech 10, whereas a high goal speaker level means that speaker adaptation is performed at a coarse granularity and some speaker similarity is sacrificed. As FIG. 7 shows, the larger the difference in speaker level between the speaker adaptation base model 120 and the recorded speech 10, the wider the range of goal speaker levels that can be set.

The goal speaker level designated by the user through the GUI illustrated in FIG. 7 is passed to the determination unit 105, which determines the value of the parameter related to the fidelity of speaker-characteristic reproduction in speaker adaptation based on its relationship to the target speaker level passed from the target speaker level estimation unit 201. The speaker adaptation unit 102 then performs speaker adaptation according to the determined parameter value, which makes it possible to generate a speech synthesis dictionary 30 with the speaker level and speaker similarity the user intended.

(Fourth Embodiment)
The first to third embodiments describe examples that use a general speaker adaptation method for HMM speech synthesis. However, a speaker adaptation method different from that of the first to third embodiments may be used, provided it has a parameter related to the fidelity of speaker-characteristic reproduction.

One such alternative is a speaker adaptation method that uses a model trained by cluster adaptive training (CAT), as in Reference 3 below. This embodiment uses a speaker adaptation method based on a model trained by cluster adaptive training.
(Reference 3) K. Yanagisawa, J. Latorre, V. Wan, M. Gales and S. King, "Noise Robustness in HMM-TTS Speaker Adaptation", Proc. of 8th ISCA Speech Synthesis Workshop, pp. 119-124, 2013-9

In cluster adaptive training, a model is expressed as a weighted sum of multiple clusters, and during training the model of each cluster and the weights are optimized simultaneously to fit the data. In the multi-speaker modeling for speaker adaptation used in this embodiment, as shown in FIG. 8, the decision trees that model the individual clusters and the cluster weights are optimized simultaneously from a large amount of speech data covering multiple speakers. Setting the weights of the resulting model to the values optimized for a given speaker used in training reproduces that speaker's characteristics. The model obtained in this way is referred to below as a CAT model.

In practice, as with the decision trees described in the first embodiment, the CAT model is trained separately for each parameter type, such as spectral parameters and pitch parameters. The decision tree of each cluster clusters the parameter by phonological and linguistic context. The leaf nodes of the bias cluster, a cluster whose weight is always fixed at 1, are assigned probability distributions (mean vectors and covariance matrices) of the parameter in question. The leaf nodes of the other clusters are assigned mean vectors that are added, with weights, to the mean vector of the probability distribution from the bias cluster.
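The weighted sum described here can be illustrated with a short Python sketch. It assumes that the relevant leaf node of each cluster has already been selected for the current context and is given as a plain list of floats; the function name `cat_mean` and this data layout are assumptions made for illustration, not part of the source.

```python
def cat_mean(bias_mean: list[float],
             cluster_means: list[list[float]],
             weights: list[float]) -> list[float]:
    """Combine cluster mean vectors into one mean vector.

    bias_mean: mean vector from the bias cluster (its weight is fixed at 1).
    cluster_means: one mean vector per non-bias cluster.
    weights: one speaker-dependent weight per non-bias cluster.
    """
    if len(cluster_means) != len(weights):
        raise ValueError("need exactly one weight per non-bias cluster")
    mean = list(bias_mean)  # start from the bias cluster, weight 1
    for weight, vector in zip(weights, cluster_means):
        if len(vector) != len(mean):
            raise ValueError("all mean vectors must share one dimension")
        for d, value in enumerate(vector):
            mean[d] += weight * value
    return mean
```

The covariance matrix is taken from the bias cluster in the description above, so only the mean is combined here.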

In this embodiment, a CAT model trained by cluster adaptive training in this way is used as the speaker adaptation base model 120. Speaker adaptation in this case optimizes the weights to fit the target speaker's speech data, yielding a model whose voice quality and speaking style are close to the target speaker's. However, a CAT model can normally represent only characteristics within the space expressible as a linear sum of the characteristics of the training speakers. If the training speakers are all professional narrators, for example, the voice quality and speaking style of an ordinary person may not be reproduced well. In this embodiment, therefore, the CAT model is trained on multiple speakers who span a range of speaker levels and a variety of voice qualities and speaking styles.

In this case, let W_{opt} be the weight vector optimized for the target speaker's speech data. Speech synthesized with W_{opt} is close to the target speaker, but its speaker level also reproduces the target speaker's level. Now, from among the weight vectors optimized for those training speakers of the CAT model who have a high speaker level, select the one closest to W_{opt} and call it W_{s(near)}. Speech synthesized with W_{s(near)} is relatively close to the target speaker and has a high speaker level. Although W_{s(near)} is taken here to be the vector closest to W_{opt}, it need not be selected by weight-vector distance; it may instead be selected on the basis of other information, such as the speaker's gender or characteristics.
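A minimal Python sketch of the distance-based selection follows. The Euclidean distance, the `min_level` cutoff that decides which training speakers count as high level, and the name `nearest_high_level_weights` are all assumptions; the source only says that the closest weight vector among high-level speakers is chosen and that other criteria are also possible.

```python
import math


def nearest_high_level_weights(
        w_opt: list[float],
        trained: list[tuple[list[float], float]],
        min_level: float) -> list[float]:
    """Pick W_{s(near)}: the high-level speaker weight vector closest to W_{opt}.

    trained: (weight_vector, speaker_level) pairs for the training speakers.
    min_level: hypothetical cutoff for a "high" speaker level.
    """
    candidates = [w for w, level in trained if level >= min_level]
    if not candidates:
        raise ValueError("no training speaker reaches min_level")
    # Euclidean distance between weight vectors (an assumed metric).
    return min(candidates, key=lambda w: math.dist(w_opt, w))
```

Selecting by gender or other speaker attributes instead would only change how `candidates` is filtered or ranked.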

In this embodiment, furthermore, a weight vector W_{target} that interpolates between W_{opt} and W_{s(near)} is newly defined as in equation (2) below, and W_{target} is taken as the weight vector resulting from speaker adaptation (the goal weight vector).

W_{target} = r · W_{opt} + (1 − r) · W_{s(near)}  (0 ≤ r ≤ 1)  (2)

FIG. 9 is a conceptual diagram showing the relationship between the interpolation ratio r in equation (2) and the goal weight vector W_{target} that it determines. Here, for example, an interpolation ratio r of 1 is the setting that reproduces the target speaker most faithfully, and an interpolation ratio r of 0 is the setting with the highest speaker level. The interpolation ratio r can therefore be used as a parameter representing the fidelity of speaker-characteristic reproduction. In this embodiment, the determination unit 105 determines the value of the interpolation ratio r based on the relationship between the goal speaker level and the target speaker level. As in the first to third embodiments, this makes it possible to generate a speech synthesis dictionary 30 whose speaker similarity is adjusted according to the desired speech skill and nativeness, achieving highly skilled synthesized speech even when the target speaker's speech skill is low and near-native synthesized speech even when the target speaker's nativeness is low.
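The interpolation can be sketched in Python as below. The linear form used here is an assumption inferred from the endpoints the text describes (r = 1 gives W_{opt}, r = 0 gives W_{s(near)}), and the function name `interpolate_weights` is invented for illustration.

```python
def interpolate_weights(w_opt: list[float],
                        w_near: list[float],
                        r: float) -> list[float]:
    """Return r * W_{opt} + (1 - r) * W_{s(near)} element by element.

    r = 1 reproduces the target speaker most faithfully;
    r = 0 gives the nearest high-level speaker's weights.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("interpolation ratio r must be in [0, 1]")
    if len(w_opt) != len(w_near):
        raise ValueError("weight vectors must have the same length")
    return [r * a + (1.0 - r) * b for a, b in zip(w_opt, w_near)]
```

Under this sketch, a determination step would pick a smaller r as the gap between the goal speaker level and the target speaker level widens.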

(Fifth Embodiment)
The first to fourth embodiments describe examples of generating a speech synthesis dictionary 30 for HMM speech synthesis. The speech synthesis method is not limited to HMM speech synthesis, however, and may be a different method such as unit-selection speech synthesis. Speaker adaptation methods exist for unit-selection speech synthesis as well, such as the one disclosed in Reference 4 below.
(Reference 4) Japanese Patent Application Laid-Open No. 2007-193139

In the speaker adaptation method disclosed in Reference 4, the speech units of a base speaker are converted to match the characteristics of the target speaker. Specifically, the speech waveform of each speech unit is analyzed and converted into spectral parameters, these spectral parameters are converted in the spectral domain to the target speaker's characteristics, and the converted spectral parameters are then turned back into a time-domain speech waveform, yielding a speech waveform of the target speaker.

The conversion rules are generated as follows. Pairs of base-speaker speech units and target-speaker speech units are formed using a unit-selection technique, the units are analyzed and converted into pairs of spectral parameters, and the conversion is modeled from these pairs by regression analysis, vector quantization, or a Gaussian mixture model (GMM). In other words, as with speaker adaptation in HMM speech synthesis, the conversion is carried out in the domain of parameters such as the spectrum. Some of these conversion methods have a parameter related to the fidelity of speaker-characteristic reproduction.

For example, among the conversion methods listed in Reference 4, the method using vector quantization clusters the base speaker's spectral parameters into C clusters and generates a conversion matrix for each cluster by, for example, maximum likelihood linear regression. In this case, the number of clusters C can be used as a parameter related to the fidelity of speaker-characteristic reproduction: a larger C gives higher fidelity and a smaller C gives lower fidelity. In the conversion method using a GMM, the conversion rule from the base speaker to the target speaker is expressed by C Gaussian distributions, and here the number of mixture components C can be used as the parameter related to the fidelity of speaker-characteristic reproduction.

In this embodiment, the number of clusters C in the vector-quantization-based conversion method, or the number of Gaussian mixture components C in the GMM-based conversion method, is used as the parameter related to the fidelity of speaker-characteristic reproduction. The determination unit 105 determines the value of C based on the relationship between the goal speaker level (the speaker level to be aimed at) and the target speaker level (the speaker level of the target speaker). Thus, even when speech is synthesized by a method other than HMM speech synthesis, such as unit-selection speech synthesis, a speech synthesis dictionary 30 in which the similarity of speaker characteristics is adjusted according to the desired speaking skill and native degree can be generated, as in the first to fourth embodiments. This makes it possible to synthesize speech with high speaking skill even when the target speaker's speaking skill is low, and speech close to native pronunciation even when the target speaker's native degree is low.
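
One way the determination could work is sketched below. The text only states that C is decided from the relationship between the level being aimed at (called the goal level here) and the target speaker's own level; the linear mapping, the numeric level scale, and the defaults `c_min`, `c_max`, and `level_span` are assumptions made for illustration.

```python
def choose_num_clusters(goal_level, target_level, c_min=1, c_max=32, level_span=10):
    """Pick the cluster (or mixture) count C from the two speaker levels.

    Hypothetical rule: C stays at c_max while the goal level does not exceed
    the target speaker's level, and shrinks linearly toward c_min as the goal
    level rises above it, so that fidelity to the target speaker is reduced.
    """
    if c_min < 1 or c_max < c_min or level_span <= 0:
        raise ValueError("require 1 <= c_min <= c_max and level_span > 0")
    gap = max(0, goal_level - target_level)   # how far the goal exceeds the speaker
    shrink = min(1.0, gap / level_span)       # fraction of the C range to give up
    return max(c_min, c_max - int(shrink * (c_max - c_min)))


if __name__ == "__main__":
    for goal in (3, 5, 7, 10):
        print(goal, choose_num_clusters(goal_level=goal, target_level=5))
```

Any monotone mapping would serve the same purpose; the linear form is chosen here only because it is easy to inspect.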

(Sixth Embodiment)
When the speaker's native degree is low, for example when generating a speech synthesis dictionary 30 for a language the speaker cannot speak, recording speech in that language is expected to be very difficult. For example, a voice recording tool cannot reasonably ask a Japanese speaker who does not know Chinese to read Chinese text displayed as is. In this embodiment, therefore, the reading of the text is converted into reading notation in the language the target speaker normally uses and presented to the target speaker while the speech is recorded, and the presented information is switched according to the target speaker's native degree.

FIG. 10 is a block diagram showing a configuration example of the speech synthesis dictionary generation device 400 of this embodiment. As shown in FIG. 10, the speech synthesis dictionary generation device 400 includes a voice recording/presentation unit 401 in addition to the configuration of the first embodiment shown in FIG. 1. The rest of the configuration is the same as in the first embodiment, so components common to the first embodiment are given the same reference numerals in the figure and redundant description is omitted.

When the target speaker reads aloud recording text 20 in a language other than the one the target speaker normally uses, the voice recording/presentation unit 401 presents display text 130, in which the recording text 20 has been converted into reading notation in the language the target speaker normally uses, and records the speech of the target speaker reading the recording text 20. For example, when generating a Chinese speech synthesis dictionary 30 from a Japanese speaker, the voice recording/presentation unit 401 displays not the Chinese text itself but display text 130 in which, for example, the Chinese reading has been converted into katakana. This allows even a Japanese speaker to produce pronunciation close to Chinese.

At this time, the voice recording/presentation unit 401 switches the display text 130 presented to the target speaker according to the target speaker's native degree. A speaker who has studied the language can utter the correct accents and tones. However, for a speaker with a very low native degree who has never studied the language, it is very difficult to reflect accent positions and tone types in the utterance even when they are displayed appropriately. For example, it is nearly impossible for a Japanese speaker who has never studied Chinese to correctly utter the four tones of Chinese.

Therefore, the voice recording/presentation unit 401 of this embodiment switches whether to display accent positions, tone types, and the like according to the target speaker's own native degree as designated by the target speaker. Specifically, the voice recording/presentation unit 401 receives, from the target speaker level designation unit 103, the native degree contained in the target speaker level designated by the target speaker. When the native degree is higher than a predetermined level, the voice recording/presentation unit 401 displays accent positions and tone types in addition to the reading notation. When the native degree is lower than the predetermined level, the voice recording/presentation unit 401 displays the reading notation but not the accent positions or tone types.
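
The switching rule can be sketched as a small function. The function name, the numeric native-degree scale, the default threshold, and the output format are all hypothetical. The text does not say what happens when the native degree equals the predetermined level; this sketch hides the marks in that case.

```python
def build_display_text(reading, tone_marks, native_degree, threshold=3):
    """Return the text shown to the target speaker during recording.

    reading       -- reading notation in the speaker's usual language (e.g. katakana)
    tone_marks    -- accent positions or tone types for the same unit of text
    native_degree -- native degree designated by the target speaker
    threshold     -- hypothetical predetermined level
    """
    if native_degree > threshold:
        # The speaker is assumed able to use the marks, so show them.
        return f"{reading} ({tone_marks})"
    # Otherwise show only the reading so the speaker can focus on pronunciation.
    return reading


if __name__ == "__main__":
    print(build_display_text("ニーハオ", "3-3", native_degree=5))
    print(build_display_text("ニーハオ", "3-3", native_degree=1))
```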

When accent positions and tone types are not displayed, accents and tones cannot be expected to be uttered correctly. On the other hand, the target speaker is likely to ignore accents and tones and concentrate on pronouncing correctly, so the pronunciation can be expected to be correct to some degree. Accordingly, when the determination unit 105 determines the parameter values, it is desirable to set the parameter used for generating the acoustic model to a somewhat high value and the parameter used for generating the prosody model to a considerably low value. This raises the likelihood of generating a speech synthesis dictionary 30 that produces reasonably correct speech while reflecting the speaker's characteristics, even for a target speaker with a very low native degree.
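
A minimal sketch of setting the two parameters separately follows. It assumes each fidelity parameter is normalized to the range 0 to 1, with larger values meaning closer reproduction of the target speaker. The concrete numbers and the names `FidelityParams` and `choose_fidelity` are illustrative assumptions, not values from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FidelityParams:
    acoustic: float  # parameter used when generating the acoustic model
    prosody: float   # parameter used when generating the prosody model


def choose_fidelity(marks_shown, default=FidelityParams(acoustic=0.5, prosody=0.5)):
    """Choose fidelity parameters depending on whether accent/tone marks were shown.

    When the marks were hidden, pronunciation is expected to be fairly correct
    but accents and tones are not, so the acoustic parameter is set somewhat
    high and the prosody parameter considerably low (hypothetical values).
    When the marks were shown, the caller-supplied default is returned unchanged.
    """
    if marks_shown:
        return default
    return FidelityParams(acoustic=0.7, prosody=0.2)


if __name__ == "__main__":
    print(choose_fidelity(marks_shown=False))
    print(choose_fidelity(marks_shown=True))
```

Keeping the two values in one immutable record makes it explicit that the determination unit hands a pair of parameters, not a single one, to speaker adaptation.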

The target speaker level used by the determination unit 105 to determine the parameter values may be the one designated by the target speaker, that is, the target speaker level including the native degree passed from the target speaker level designation unit 103 to the voice recording/presentation unit 401. Alternatively, a target speaker level estimation unit 201 like that of the second embodiment may be provided separately, and the target speaker level it estimates from the recorded speech 10 recorded by the voice recording/presentation unit 401 may be used. The determination unit 105 may also determine the parameter values using both the target speaker level designated by the target speaker and the target speaker level estimated from the recorded speech 10.

As in this embodiment, linking the switching of the display text 130 presented to the target speaker during recording with the method of determining the value of the parameter representing the fidelity of speaker reproduction in speaker adaptation makes it possible to generate, more appropriately, a speech synthesis dictionary 30 with a certain degree of nativeness from the recorded speech 10 of a target speaker with a low native degree.

As described above in detail with specific examples, the speech synthesis dictionary generation device of the embodiments can generate a speech synthesis dictionary in which the similarity of speaker characteristics is adjusted according to the desired speaking skill and native degree.

The speech synthesis dictionary generation device of the embodiments described above can use, for example, a hardware configuration in which output devices (a display, a speaker, etc.) and input devices (a keyboard, a mouse, a touch panel, etc.) serving as the user interface are connected to a general-purpose computer having a processor, a main storage device, an auxiliary storage device, and the like. In this configuration, the functional components described above, such as the speech analysis unit 101, the speaker adaptation unit 102, the target speaker level designation unit 103, the goal speaker level designation unit 104, the determination unit 105, the target speaker level estimation unit 201, the goal speaker level presentation/designation unit 301, and the voice recording/presentation unit 401, are realized by the processor of the computer executing a predetermined program. The device may be realized by installing the program on the computer in advance, or by storing the program on a storage medium such as a CD-ROM or distributing it over a network and installing it on the computer as appropriate. It may also be realized by running the program on a server computer and receiving the results on a client computer over a network.

The program executed by the computer has a module configuration that includes the functional components of the speech synthesis dictionary generation device of the embodiments (the speech analysis unit 101, the speaker adaptation unit 102, the target speaker level designation unit 103, the goal speaker level designation unit 104, the determination unit 105, the target speaker level estimation unit 201, the goal speaker level presentation/designation unit 301, the voice recording/presentation unit 401, and so on). In terms of actual hardware, for example, the processor reads the program from the storage medium and executes it, whereby each of these units is loaded onto and generated on the main storage device. Some or all of the functional components described above can also be realized with dedicated hardware such as an ASIC or an FPGA.

The various kinds of information used by the speech synthesis dictionary generation device of the embodiments can be stored using, as appropriate, a memory or hard disk built into or externally attached to the computer, or a recording medium such as a CD-R, CD-RW, DVD-RAM, or DVD-R. For example, the speech DB 110 and the base model 120 for speaker adaptation used by the device can be stored on such recording media.

Although embodiments of the present invention have been described above, these embodiments are presented as examples and are not intended to limit the scope of the invention. The novel embodiments described here can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

10 recorded speech
20 recording text
30 speech synthesis dictionary
100 speech synthesis dictionary generation device
101 speech analysis unit
102 speaker adaptation unit
103 target speaker level designation unit
104 goal speaker level designation unit
105 determination unit
110 speech database (speech DB)
120 base model for speaker adaptation
200 speech synthesis dictionary generation device
201 target speaker level estimation unit
300 speech synthesis dictionary generation device
301 goal speaker level presentation/designation unit
400 speech synthesis dictionary generation device
401 voice recording/presentation unit

Claims

A speech synthesizer comprising:
a speech analysis unit that analyzes speech data of an arbitrary target speaker and generates a speech database containing data representing features of the target speaker's utterances;
a speaker adaptation unit that generates a model of the target speaker by performing, based on the speech database, speaker adaptation that converts a predetermined base model so as to approach the characteristics of the target speaker;
a goal speaker level designation unit that accepts designation of a goal speaker level, which is the speaker level to be aimed at, the speaker level representing at least one of a speaker's speaking skill and the speaker's native degree with respect to the language of the speech synthesis dictionary;
a determination unit that determines the value of a parameter related to the fidelity of speaker-characteristic reproduction in the speaker adaptation according to the relationship between the designated goal speaker level and a target speaker level, which is the speaker level of the target speaker; and
a speech synthesis unit that generates a speech waveform according to the value of the parameter.

The speech synthesizer according to claim 1, wherein, when the designated goal speaker level is higher than the target speaker level, the determination unit determines the value of the parameter so that the fidelity is lower than when the designated goal speaker level is equal to or lower than the target speaker level.

The speech synthesizer according to claim 1 or 2, further comprising a target speaker level designation unit that accepts designation of the target speaker level,
wherein the determination unit determines the value of the parameter according to the relationship between the designated goal speaker level and the designated target speaker level.

The speech synthesizer according to claim 1 or 2, further comprising a target speaker level estimation unit that automatically estimates the target speaker level based on at least part of the data of the speech database,
wherein the determination unit determines the value of the parameter according to the relationship between the designated goal speaker level and the estimated target speaker level.

The speech synthesizer according to any one of claims 1 to 4, wherein the goal speaker level designation unit displays, based on the target speaker level, the relationship between the goal speaker level and the similarity of speaker characteristics expected in the model of the target speaker to be generated, together with the range in which the goal speaker level can be designated, and accepts an operation designating the goal speaker level within the displayed range.

The speech synthesizer according to any one of claims 1 to 5, wherein the speaker adaptation unit uses, as the base model, an average voice model that models speakers with a high speaker level.

The speech synthesizer according to any one of claims 1 to 6, wherein the parameter defines the number of transformation matrices used to convert the base model in the speaker adaptation, and the fidelity decreases as the number of transformation matrices decreases.

The speech synthesizer according to any one of claims 1 to 5, wherein the speaker adaptation unit uses, as the base model, a model expressed as a weighted sum of a plurality of clusters trained by cluster adaptive training on data of a plurality of speakers with different speaker levels, and performs the speaker adaptation by fitting a weight vector, which is the set of weights for the plurality of clusters, to the target speaker,
the weight vector is obtained by interpolating between the optimal weight vector for the target speaker and the optimal weight vector of the one speaker with the highest speaker level among the plurality of speakers, and
the parameter is the interpolation ratio used to obtain the weight vector.

The speech synthesizer according to any one of claims 2 to 8, wherein the model of the target speaker includes a prosody model and an acoustic model,
the parameter includes a first parameter used to generate the prosody model and a second parameter used to generate the acoustic model, and
when the determination unit determines the value of the parameter so as to lower the fidelity, the degree of change of the first parameter from a default value at which the fidelity is high is larger than the degree of change of the second parameter from its default value.

The speech synthesizer according to any one of claims 1 to 9, further comprising a recording unit that records the speech data,
wherein the recording unit records the speech data while presenting to the target speaker, for each unit of reading, at least information on the reading of the sentence to be read, and
the reading information is expressed not in the reading notation of the language to be read but in reading notation converted into the language the target speaker normally uses, and, at least when the target speaker's native degree is lower than a predetermined value, does not include symbols relating to intonation such as accents and tones.

A speech synthesis method comprising:
a speech analysis step of analyzing speech data of an arbitrary target speaker and generating a speech database containing data representing features of the target speaker's utterances;
a speaker adaptation step of generating a model of the target speaker by performing, based on the speech database, speaker adaptation that converts a predetermined base model so as to approach the characteristics of the target speaker;
a goal speaker level designation step of accepting designation of a goal speaker level, which is the speaker level to be aimed at, the speaker level representing at least one of a speaker's speaking skill and the speaker's native degree with respect to the language of the speech synthesis dictionary;
a determination step of determining the value of a parameter related to the fidelity of speaker-characteristic reproduction in the speaker adaptation according to the relationship between the designated goal speaker level and a target speaker level, which is the speaker level of the target speaker; and
a speech synthesis step of generating a speech waveform according to the value of the parameter.

A program for causing a computer to execute:
a speech analysis step of analyzing speech data of an arbitrary target speaker and generating a speech database containing data representing features of the target speaker's utterances;
a speaker adaptation step of generating a model of the target speaker by performing, based on the speech database, speaker adaptation that converts a predetermined base model so as to approach the characteristics of the target speaker;
a goal speaker level designation step of accepting designation of a goal speaker level, which is the speaker level to be aimed at, the speaker level representing at least one of a speaker's speaking skill and the speaker's native degree with respect to the language of the speech synthesis dictionary;
a determination step of determining the value of a parameter related to the fidelity of speaker-characteristic reproduction in the speaker adaptation according to the relationship between the designated goal speaker level and a target speaker level, which is the speaker level of the target speaker; and
a speech synthesis step of generating a speech waveform according to the value of the parameter.