JP2012141354A

JP2012141354A - Method, apparatus and program for voice synthesis

Info

Publication number: JP2012141354A
Application number: JP2010292223A
Authority: JP
Inventors: Yusuke Ijima; 勇祐井島; Mitsuaki Isogai; 光昭磯貝; Hideyuki Mizuno; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-12-28
Filing date: 2010-12-28
Publication date: 2012-07-26
Anticipated expiration: 2030-12-28
Also published as: JP5411845B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis method, apparatus, and program capable of synthesizing speech that corresponds to a target text and has the characteristics of a target speaker, using a similar speaker speech database obtained from a small amount of the target speaker's speech data.
SOLUTION: The method includes the steps of: pre-storing a similar speaker speech database comprising plural pieces of partial speech data having a high speaker similarity to the speech data of the target speaker, together with speech segments that indicate at least a similar speaker identifier identifying the speaker who uttered each piece of partial speech data and phoneme information indicating its uttered phoneme; searching the similar speaker speech database, on the basis of the phoneme information, for speech segment candidates that match the phoneme context of the target text in each synthesis unit; and calculating, as a total cost, the degree of conformity between the target text in each synthesis unit and each speech segment candidate, using at least the speaker similarity corresponding to the similar speaker identifier of that candidate.

Description

The present invention relates to a speech synthesis method, a speech synthesis apparatus, and a speech synthesis program that synthesize speech corresponding to a target text and having the characteristics of an arbitrary speaker.

The speech synthesizer 10 described in Patent Document 1 is known as prior art. An outline of the speech synthesizer 10 is given with reference to FIG. 1.

When a text to be synthesized (hereinafter, "target text") is input, the text analysis unit 11 first performs morphological analysis such as dependency and part-of-speech analysis, kanji-to-kana conversion, and accent processing. It outputs a symbol string that distinguishes the phonemes to the segment selection connection unit 14, and outputs the number of morae in each breath group, the accent type, and the speaking rate to the prosody generation unit 12.

Next, the prosody generation unit 12 uses the prosody model 13 to generate a pitch pattern, a duration pattern for each phoneme, and an amplitude pattern from the received information, and outputs them to the segment selection connection unit 14.
Finally, the segment selection connection unit 14 selects the optimum waveforms from the speech database 15 on the basis of the phoneme symbol string, the pitch pattern, the duration pattern, and the amplitude pattern, concatenates them to synthesize speech, and outputs the result.

In Patent Document 1, if the speech database 15 contains a large number of segments for the same context, the variety of pitch, duration, and amplitude patterns increases and the quality of the synthesized speech improves. Obtaining synthesized speech of sufficient quality, however, requires a large amount of speech. For this reason, in many speech synthesizers the speakers that can be synthesized are limited to the few prepared in advance. If a user wants to freely generate or select the voice of a preferred speaker, a large amount of speech (at least several hours) from the speaker to be synthesized (hereinafter, "target speaker") is required.

The speech synthesizer 20 described in Non-Patent Document 1 is known as prior art that addresses this problem. The speech synthesizer 20 is described with reference to FIG. 2.
The multi-speaker speech database 21 stores speech data of a large number of speakers in advance.
The model learning unit 22 receives the speech data of the many speakers from the multi-speaker speech database 21 and trains an average voice model that has the average speech characteristics of those speakers.

The conversion rule learning unit 23 learns, from the average voice model and the target speaker's speech data, a conversion rule for converting the average voice model into an adapted model, and outputs the rule to the adaptation unit 24. The adapted model is a speech model that resembles the model that would be obtained from a large amount of the target speaker's speech data.

The adaptation unit 24 applies the conversion rule to the average voice model and converts it into the adapted model.
When the target text is input, the synthesis unit 25 generates and outputs synthesized speech on the basis of the adapted model.

Patent Document 1: Japanese Patent No. 2761552

Non-Patent Document 1: Masatsugu Tamura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, "Speaker Adaptation of Pitch and Spectrum for HMM-Based Speech Synthesis", IEICE Transactions, April 2002, vol. J85-D-II, no. 4, pp. 545-553

Compared with Patent Document 1, Non-Patent Document 1 greatly reduces the amount of target speaker speech data needed to create the database and models required for speech synthesis. Nevertheless, Non-Patent Document 1 still needs several minutes of the target speaker's speech data to learn the conversion rule, so the target speaker must be tied up for a long time during recording. For example, recording 5 minutes of speech data requires about 30 minutes of the speaker's time.

An object of the present invention is therefore to provide a speech synthesis method, a speech synthesis apparatus, and a speech synthesis program that synthesize speech corresponding to a target text and having the characteristics of a target speaker, using a similar speaker speech database obtained from an even smaller amount of the target speaker's speech data.

To solve the above problem, according to a first aspect of the present invention, synthesized speech that corresponds to a target text and has the speech characteristics of a target speaker is generated. Speaker similarity is defined as an index of whether two pieces of speech data are similar. A similar speaker speech database is stored in advance. It consists of partial speech data, obtained by dividing plural pieces of speech data having a high speaker similarity to the target speaker's speech data into synthesis units suitable for assembling synthesized speech, and speech segments, which are information attached to the partial speech data and indicate at least a similar speaker identifier identifying the speaker who uttered the partial speech data and phoneme information indicating its uttered phoneme. The target text is analyzed to obtain its reading information. The reading information is converted into a phoneme context, which is a sequence of phonemes. On the basis of the phoneme information, the similar speaker speech database is searched for speech segment candidates that match the phoneme context in each synthesis unit. Using at least the speaker similarity corresponding to the similar speaker identifier of each speech segment candidate, the degree of conformity between the target text in each synthesis unit and the speech segment candidate is calculated as a total cost, and the candidate with the best total cost is selected as the selected speech segment for that unit. The partial speech data corresponding to the selected speech segments is read from the similar speaker speech database and concatenated to obtain the synthesized speech.

To solve the above problem, according to a second aspect of the present invention, synthesized speech that corresponds to a target text and has the speech characteristics of a target speaker is generated by an apparatus that includes: a similar speaker speech database consisting of partial speech data in synthesis units suitable for assembling synthesized speech and speech segments, which are information attached to the partial speech data and indicate at least a similar speaker identifier identifying the speaker who uttered the partial speech data and phoneme information indicating its uttered phoneme; a speaker similarity storage unit that stores each similar speaker identifier together with the corresponding speaker similarity; a text analysis unit that analyzes the target text and obtains its reading information; a phoneme context conversion unit that converts the reading information into a phoneme context, which is a sequence of phonemes; a speech segment candidate search unit that searches the similar speaker speech database, on the basis of the phoneme information, for speech segment candidates that match the phoneme context in each synthesis unit; a segment selection unit that calculates, as a total cost, the degree of conformity between the target text in each synthesis unit and each speech segment candidate, using at least the speaker similarity corresponding to the similar speaker identifier of the candidate, and selects the candidate with the best total cost as the selected speech segment for that unit; and a segment connection unit that reads the partial speech data corresponding to the selected speech segments from the similar speaker speech database and concatenates it to obtain the synthesized speech.

Because the present invention performs speech synthesis using both a similar speaker speech database composed of plural pieces of speech data with a high speaker similarity and the speaker similarity itself, speech data of speakers who are more similar to the target speaker is more likely to be used during synthesis, which improves the similarity of the synthesized speech to the target speaker.
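The way speaker similarity can steer segment selection may be easier to see in a small sketch. This is only an illustration under stated assumptions: the cost formula, the weights `w_f0` and `w_spk`, and the field names `f0_mean` and `speaker_id` are hypothetical and are not taken from the patent, which at this point only says that the total cost uses at least the speaker similarity.

```python
def total_cost(candidate, target, similarity, w_f0=1.0, w_spk=1.0):
    """Hypothetical total cost: a prosody mismatch term plus a speaker penalty."""
    # Relative F0 mismatch between the candidate segment and the target prosody.
    f0_cost = abs(candidate["f0_mean"] - target["f0_mean"]) / target["f0_mean"]
    # Penalty that is 0 for the most similar speaker and grows for less similar ones.
    spk_cost = max(similarity.values()) - similarity[candidate["speaker_id"]]
    return w_f0 * f0_cost + w_spk * spk_cost


def select_segment(candidates, target, similarity):
    """Pick the candidate whose total cost is lowest (best)."""
    return min(candidates, key=lambda c: total_cost(c, target, similarity))


# Speaker similarities as they might be kept in the speaker similarity storage unit.
similarity = {"s1": -1.0, "s2": -3.0}
target = {"f0_mean": 200.0}
candidates = [
    {"speaker_id": "s1", "f0_mean": 220.0},  # closer speaker, slightly worse F0
    {"speaker_id": "s2", "f0_mean": 200.0},  # perfect F0, less similar speaker
]
selected = select_segment(candidates, target, similarity)
```

With these toy numbers the candidate from the more similar speaker is chosen even though its F0 match is slightly worse, which is the behaviour the paragraph above describes.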

FIG. 1 is a block diagram showing the configuration of the conventional speech synthesizer 10.
FIG. 2 is a block diagram showing the configuration of the conventional speech synthesizer 20.
FIG. 3 is a block diagram showing an example functional configuration of the speech synthesizers 100, 200, and 300.
FIG. 4 is a diagram showing the processing flow of the speech synthesizers 100, 200, and 300.
FIG. 5 is a diagram showing the data structure of speech segments.
FIG. 6 is a block diagram showing an example functional configuration of the similar speaker speech database construction unit 110.
FIG. 7 is a block diagram showing an example functional configuration of the similar speaker selection unit 111.
FIG. 8 is a diagram showing the processing flow of the similar speaker selection unit 111.
FIG. 9 is a diagram showing the processing flow of the speaker integration unit 115.
FIG. 10 is a block diagram showing an example functional configuration of the speech synthesis unit 150.
FIG. 11 is a diagram showing the processing flow of the speech synthesis unit 150.
FIG. 12 is a block diagram showing an example functional configuration of the similar speaker speech database construction unit 210.
FIG. 13 is a diagram showing the processing flow of the similar speaker speech database construction unit 210.
FIG. 14 is a diagram showing the processing flow of the speaker conversion rule learning unit 213 and the speaker unit conversion unit 214.
FIG. 15 is a block diagram showing an example functional configuration of the similar speaker speech database construction unit 310.
FIG. 16 is a diagram showing the processing flow of the similar speaker speech database construction unit 310.

Embodiments of the present invention are described below.
<Speech Synthesizer 100 according to First Embodiment>
The speech synthesizer 100 according to the first embodiment is described with reference to FIGS. 3 and 4. The speech synthesizer 100 includes a multi-speaker speech database construction unit 101, a multi-speaker speech database 103, a similar speaker speech database construction unit 110, a similar speaker speech database 130, and a speaker similarity storage unit 140.

<Multi-speaker speech database construction unit 101 and multi-speaker speech database 103>
The multi-speaker speech database construction unit 101 records the speech of a large number of speakers (N speakers) in advance and constructs the multi-speaker speech database 103 used by the similar speaker speech database construction unit 110 (s101).

Because the recorded multi-speaker speech is used by the similar speaker speech database construction unit 110 and the speech synthesis unit 150, it should preferably satisfy the following requirements (1) and (2).
(1) The amount of speech data recorded per speaker (the duration of speech excluding silent intervals) is at least the amount needed to train a model for speech synthesis. This amount depends on the speech synthesis system used; unit selection synthesis, for example, requires on the order of several hours of speech data.
(2) The number N of recorded speakers is preferably at least 100 per gender, that is, 200 or more in total.

After recording is finished, speech segments are attached to the recorded speech data. The multi-speaker speech database 103 therefore holds speech data with speech segments for N speakers.
Here, speech data means speech waveform data obtained by A/D-converting the signal of a natural voice reading words or sentences aloud into digital data. This waveform data can be used as material for waveform-concatenation speech synthesis.

FIG. 5 shows an example of a data structure (table) made up of speech segments. A speech segment holds information about speech data (hereinafter, "partial speech data") in a unit suitable for assembling synthesized speech (hereinafter, "synthesis unit"), and includes at least a speaker identifier indicating the speaker who uttered the partial speech data and phoneme information indicating the uttered phoneme of the synthesis unit. It may also include, for example, position information (start time, end time) indicating where the partial speech data lies within the whole speech data, and F0 pattern information of the partial speech data. Speech segments may be attached manually or automatically by computer. For example, the phoneme information and position information may be attached automatically by computer using the technique described in Reference 1 below.
(Reference 1) Japanese Patent Application Laid-Open No. 2004-77901
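One possible in-memory form of a row in the speech segment table is sketched below. The class and field names are illustrative assumptions; the text only requires a speaker identifier and phoneme information, with position and F0 information as optional extras.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SpeechSegment:
    """One row of a speech segment table (field names are hypothetical)."""
    segment_no: int                     # running index within the database
    speaker_id: str                     # speaker who uttered the partial speech data
    phoneme: str                        # uttered phoneme of the synthesis unit
    start_sec: Optional[float] = None   # optional position information
    end_sec: Optional[float] = None
    f0_pattern: Optional[Sequence[float]] = None  # optional F0 pattern information

    def duration(self) -> Optional[float]:
        """Length of the partial speech data, if position information is present."""
        if self.start_sec is None or self.end_sec is None:
            return None
        return self.end_sec - self.start_sec


labelled = SpeechSegment(1, "spk001", "a", start_sec=0.50, end_sec=0.75)
minimal = SpeechSegment(2, "spk001", "k")  # only the mandatory fields
```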

In this embodiment, to keep the description concrete, the synthesis unit is the phoneme. The synthesis unit may instead be, for example, a syllable or a demisyllable, or a combination of phonemes, syllables, and demisyllables; it can be chosen freely.

<Similar Speaker Speech Database Construction Unit 110, Similar Speaker Speech Database 130, and Speaker Similarity Storage Unit 140>
The similar speaker speech database construction unit 110 uses plural pieces of speech data to construct a similar speaker speech database composed of speech data similar to the target speaker's speech data (s110). As shown in FIG. 6, the similar speaker speech database construction unit 110 includes a similar speaker selection unit 111 and a speaker integration unit 115.

<Similar speaker selection unit 111>
The similar speaker selection unit 111 takes as input the plural pieces of speech data stored in the multi-speaker speech database 103, obtains the speaker similarity between each speaker's speech data and the target speaker's speech data, selects several pieces of speech data with a high speaker similarity (s111), and outputs the similar speakers' speech data with speech segments to the speaker integration unit 115. Speaker similarity is an index of whether two pieces of speech data are similar.

For example, the speaker similarity is obtained with the technique described in Reference 2, which uses the Gaussian mixture model (GMM) employed in speaker identification and verification.
(Reference 2) D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 1995, vol. 17, pp. 91-108
In this case, as shown in FIG. 7, the similar speaker selection unit 111 includes a mixed normal distribution learning unit 111a, a multi-speaker mixed normal distribution storage unit 111b, a speaker similarity calculation unit 111c, and a similar speaker extraction unit 111d.

(Mixed normal distribution learning unit 111a and multi-speaker mixed normal distribution storage unit 111b)
The mixed normal distribution learning unit 111a receives speech data for N speakers and performs the following processing (s111a-2 and s111a-3 in FIG. 8) on the speech data of every speaker (s111a-1, s111a-4, s111a-5). Spectral parameters (cepstrum, mel-cepstrum, etc.) are extracted from the speech data of each speaker n (s111a-2). Using these spectral parameters, a Gaussian mixture model λ_n is trained and its model parameters, namely the mixture weights w_n(m), mean vectors μ_n(m), and variance vectors ν_n(m), are estimated (s111a-3) and output to the multi-speaker mixed normal distribution storage unit 111b. Here m = 1, 2, ..., M, where M is the number of mixture components.

In the same way, the mixed normal distribution learning unit 111a uses the spectral parameters obtained from all of the speech data (all N speakers) to train a Gaussian mixture model λ_U for the whole data set (Universal Background Model; UBM), estimates its model parameters, namely the mixture weights w_U(m), mean vectors μ_U(m), and variance vectors ν_U(m) (s111a-6), and outputs these values to the multi-speaker mixed normal distribution storage unit 111b.
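The training of the per-speaker models and the UBM can be sketched as follows. This is a minimal illustration, assuming diagonal covariances (suggested by the "variance vectors") and plain expectation-maximization; the text does not state the training algorithm, and the function name, initialization, variance floor, and toy data are all hypothetical.

```python
import math
import random


def train_diag_gmm(frames, n_mix, n_iter=20, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance GMM with EM; returns (weights, means, variances)."""
    rng = random.Random(seed)
    n, dim = len(frames), len(frames[0])
    # Initialise means from randomly chosen frames, variances from the global variance.
    means = [list(x) for x in rng.sample(frames, n_mix)]
    g_mean = [sum(x[r] for x in frames) / n for r in range(dim)]
    g_var = [max(sum((x[r] - g_mean[r]) ** 2 for x in frames) / n, var_floor)
             for r in range(dim)]
    variances = [list(g_var) for _ in range(n_mix)]
    weights = [1.0 / n_mix] * n_mix

    for _ in range(n_iter):
        # E-step: responsibility of each mixture component for each frame.
        resp = []
        for x in frames:
            logs = []
            for m in range(n_mix):
                lp = math.log(weights[m])
                for r in range(dim):
                    v = variances[m][r]
                    lp -= 0.5 * (math.log(2.0 * math.pi * v)
                                 + (x[r] - means[m][r]) ** 2 / v)
                logs.append(lp)
            peak = max(logs)  # subtract the peak before exponentiating for stability
            probs = [math.exp(lp - peak) for lp in logs]
            z = sum(probs)
            resp.append([p / z for p in probs])
        # M-step: re-estimate weights, means, and variances.
        for m in range(n_mix):
            occ = max(sum(g[m] for g in resp), 1e-12)
            weights[m] = occ / n
            for r in range(dim):
                means[m][r] = sum(g[m] * x[r] for g, x in zip(resp, frames)) / occ
                variances[m][r] = max(
                    sum(g[m] * (x[r] - means[m][r]) ** 2
                        for g, x in zip(resp, frames)) / occ,
                    var_floor)
    return weights, means, variances


# Toy 1-D "spectral parameters" for two speakers (real features are multi-dimensional).
speaker_frames = {
    "spk_a": [[0.0], [0.2], [-0.1], [0.1], [1.0], [1.1]],
    "spk_b": [[5.0], [5.2], [4.9], [5.1], [6.0], [6.1]],
}
# One GMM per speaker (lambda_n), and a UBM (lambda_U) trained on the pooled frames.
speaker_gmms = {sid: train_diag_gmm(f, n_mix=2) for sid, f in speaker_frames.items()}
pooled = [x for f in speaker_frames.values() for x in f]
ubm = train_diag_gmm(pooled, n_mix=2)
```

In practice a library implementation would normally replace the hand-written EM loop; it is spelled out here only to keep the sketch dependency-free.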

(Speaker similarity calculation unit 111c)
The speaker similarity calculation unit 111c obtains the mixture weights w_n(m) and w_U(m), the mean vectors μ_n(m) and μ_U(m), and the variance vectors ν_n(m) and ν_U(m) from the multi-speaker mixed normal distribution storage unit 111b, and takes these values and the target speaker's speech data as input. First, it extracts the spectral parameter sequence X from the target speaker's speech data. Next, it calculates the speaker similarity L_n of each speaker n as the log likelihood defined by equations (1) to (4) (s111c-2). It calculates L_n for all speakers n (s111c-1, s111c-3, s111c-4) and outputs the results to the similar speaker extraction unit 111d.

The spectral parameter sequence X has R dimensions and T frames; x(t) is the spectral parameter vector of the t-th frame, and χ_t(r) is the r-th dimension of the spectral parameter in the t-th frame. μ_i(m, r) and σ_i(m, r) are parameters of the Gaussian mixture model λ_i and denote the mean and standard deviation of the r-th dimension of the m-th mixture component. From equations (1) to (4), the speaker similarity L_n is larger for speech data whose speech characteristics are closer to those of the target speaker's speech data.
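Equations (1) to (4) themselves do not appear in this text. One standard GMM-UBM formulation that is consistent with the symbols defined above is given here as an assumption, not as the patent's exact equations. In particular, the normalization by the UBM, the averaging over T frames, and the helper symbol b_{i,m} are choices made for this sketch, with i standing for either a speaker n or the UBM U.

```latex
\begin{align}
L_n &= \log P(X \mid \lambda_n) - \log P(X \mid \lambda_U) \\
\log P(X \mid \lambda_i) &= \frac{1}{T} \sum_{t=1}^{T} \log p\bigl(x(t) \mid \lambda_i\bigr) \\
p\bigl(x(t) \mid \lambda_i\bigr) &= \sum_{m=1}^{M} w_i(m)\, b_{i,m}\bigl(x(t)\bigr) \\
b_{i,m}\bigl(x(t)\bigr) &= \prod_{r=1}^{R} \frac{1}{\sqrt{2\pi}\,\sigma_i(m,r)}
  \exp\!\left( -\frac{\bigl(\chi_t(r) - \mu_i(m,r)\bigr)^2}{2\,\sigma_i(m,r)^2} \right)
\end{align}
```

Under this form, L_n grows as the target speaker's frames become more probable under speaker n's model than under the background model, which matches the statement that L_n is larger for more similar speech data.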

(Similar speaker extraction unit 111d)
The similar speaker extraction unit 111d receives the speaker similarities L_n and extracts the top S speakers with the highest speaker similarity (s111d). These top S speakers are called similar speakers s, where 2 ≤ S ≤ N and s = 1, 2, ..., S. The unit outputs the speaker similarities L_s of the top S speakers to the speaker similarity storage unit 140, and outputs their speech data with speech segments to the speaker integration unit 115. The speaker similarity storage unit 140 stores, for example, the speaker identifier of each similar speaker (hereinafter, "similar speaker identifier") and the corresponding speaker similarity.
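The scoring and top-S extraction steps can be sketched together. This is an illustration under assumptions: the similarity is taken to be the average per-frame log likelihood under a speaker's diagonal-covariance GMM minus that under the UBM, and the function names and toy parameters are hypothetical.

```python
import math


def avg_loglik(frames, gmm):
    """Average per-frame log likelihood under a diagonal-covariance GMM."""
    weights, means, variances = gmm
    total = 0.0
    for x in frames:
        logs = []
        for w, mu, var in zip(weights, means, variances):
            lp = math.log(w)
            for xr, mr, vr in zip(x, mu, var):
                lp -= 0.5 * (math.log(2.0 * math.pi * vr) + (xr - mr) ** 2 / vr)
            logs.append(lp)
        peak = max(logs)  # log-sum-exp over mixture components
        total += peak + math.log(sum(math.exp(lp - peak) for lp in logs))
    return total / len(frames)


def top_similar_speakers(target_frames, speaker_gmms, ubm, top_s):
    """Return the top_s (speaker_id, similarity) pairs, most similar first."""
    ubm_ll = avg_loglik(target_frames, ubm)
    scores = {sid: avg_loglik(target_frames, g) - ubm_ll
              for sid, g in speaker_gmms.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_s]


# Each model is (weights, means, variances); single-component, 1-D toy models.
speaker_gmms = {
    "A": ([1.0], [[0.0]], [[1.0]]),
    "B": ([1.0], [[5.0]], [[1.0]]),
    "C": ([1.0], [[1.0]], [[1.0]]),
}
ubm = ([1.0], [[2.0]], [[9.0]])
target_frames = [[0.1], [-0.2], [0.05]]   # spectral parameter sequence X of the target
similar = top_similar_speakers(target_frames, speaker_gmms, ubm, top_s=2)
```

The returned pairs correspond to what would be written to the speaker similarity storage unit 140: a similar speaker identifier together with its similarity.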

<Speaker integration unit 115>
The speaker integration unit 115 integrates the selected pieces of speech data with speech segments and constructs a similar speaker speech database consisting of partial speech data and the speech segments of that partial speech data (s115).

Speech data is integrated, for example, as shown in FIG. 9. First, all partial speech data for phoneme p (the synthesis unit) contained in the speech data of similar speaker s, together with the corresponding speech segments, is extracted (s115c). This is done for all similar speakers (s115b, s115d, s115e), and the extracted partial speech data for phoneme p is added to the similar speaker speech database 130. The speech segments for this partial speech data have the same structure as in the multi-speaker speech database (see FIG. 5), except that the speech segment numbers are reassigned in the order of addition and the start and end times are changed to the position of each piece of partial speech data in the similar speaker speech database 130. This processing is performed for all phonemes (s115a, s115g, s115h) to integrate the speech data of the similar speakers.
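The integration loop can be sketched as below: phonemes in the outer loop, similar speakers in the inner loop, with segment numbers and positions rewritten as data is appended. The data layout (a sample list plus `(phoneme, start, end)` labels per speaker) and all names are assumptions made for illustration.

```python
def integrate_speakers(speakers, phonemes, sample_rate):
    """Merge per-speaker audio into one database, grouped by phoneme.

    speakers: {speaker_id: (samples, [(phoneme, start_sec, end_sec), ...])}
    Returns (merged_samples, merged_segments).
    """
    merged_audio = []
    merged_segments = []
    for p in phonemes:                                   # loop over all phonemes
        for sid, (samples, labels) in speakers.items():  # loop over similar speakers
            for phoneme, start, end in labels:
                if phoneme != p:
                    continue
                # Cut the partial speech data out of the speaker's recording.
                chunk = samples[round(start * sample_rate):round(end * sample_rate)]
                new_start = len(merged_audio) / sample_rate
                merged_audio.extend(chunk)
                new_end = len(merged_audio) / sample_rate
                merged_segments.append({
                    "segment_no": len(merged_segments) + 1,  # renumbered in order added
                    "speaker_id": sid,
                    "phoneme": p,
                    "start_sec": new_start,   # position in the merged database
                    "end_sec": new_end,
                })
    return merged_audio, merged_segments


# Toy data at 10 samples per second so the slicing is easy to follow.
speakers = {
    "s1": (list(range(10)), [("a", 0.0, 0.3), ("k", 0.3, 0.5)]),
    "s2": (list(range(100, 110)), [("k", 0.0, 0.2), ("a", 0.2, 0.6)]),
}
audio, segments = integrate_speakers(speakers, ["a", "k"], sample_rate=10)
```

After the call, all "a" segments from both speakers precede all "k" segments, and each segment's start and end times refer to the merged audio rather than to the original recordings.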

Normally, when the speech data of several speakers is integrated into a single speech database, the speech features differ greatly between speakers, so artifacts such as abnormal sounds can occur where waveforms are concatenated and the quality of the synthesized speech drops. Because the similar speaker selection unit 111 selects similar speakers, however, the differences in speech features between speakers are small, and degradation of the synthesized speech is less likely. Furthermore, integrating the speech data of several similar speakers increases the variety of speech data in the similar speaker speech database 130, such as intonation and surrounding phoneme environments, which improves the naturalness of the synthesized speech.

<Speech synthesis unit 150>
The speech synthesis unit 150 generates synthesized speech corresponding to the target text using the similar speakers' speech data with speech segments stored in the similar speaker speech database 130 and the speaker similarities stored in the speaker similarity storage unit 140 (s150 in FIG. 4).

As shown in FIG. 10, the speech synthesis unit 150 includes a text analysis unit 151, a prosody generation unit 152, a prosody model storage unit 153, a phoneme context conversion unit 154, a speech segment candidate search unit 155, a segment selection unit 156, and a segment connection unit 157.

The target text input to the speech synthesis unit 150 may be entered through an input unit (not shown) or may be stored in advance in a storage unit (not shown). The present invention places no particular restriction on the type of target text; in this embodiment it is Japanese text containing a mixture of kanji and kana.

<Text analysis unit 151>
First, the text analysis unit 151 acquires the target text, performs morphological analysis on it, and outputs the reading information for the target text to the phoneme context conversion unit 154 and the prosodic information to the prosody generation unit 152 (s151).

An outline of the morphological analysis is as follows. The text analysis unit 151 converts the target text into kana by referring to a word model, a kanji-to-kana conversion model, and so on (stored as needed in a storage unit, not shown); this yields the reading information. In Japanese, when several words combine into a phrase, accents may shift or disappear. Rules for this (accent combination rules) are stored in the storage unit as data in advance, and the text analysis unit 151 determines the accent type of the target text according to them. Japanese also tends to carry one accent per semantic or grammatical group. Rules for this (phrase rules) are likewise stored in advance, and the text analysis unit 151 follows them to determine each breath group as a sequence of several such single-accent groups; this yields the prosodic information. Pause positions may also be included in the prosodic information.

なお、ここで説明した形態素解析の概要は、形態素解析の一例であって、その他の形態素解析手法を排除する趣旨のものではない。本発明の音声合成装置・方法では、種々の形態素解析を用いることができ、これらは従来手法（例えば参考文献３、４参照）によって達成されるので、その詳細を省略する。
（参考文献３）特許３３７９６４３号公報
（参考文献４）特許３５１８３４０号公報
The outline of morphological analysis given here is only one example and is not intended to exclude other morphological analysis methods. The speech synthesis apparatus and method of the present invention can use various morphological analysis techniques; since these are achieved by conventional methods (see, for example, References 3 and 4), their details are omitted.
(Reference 3) Japanese Patent No. 3379643
(Reference 4) Japanese Patent No. 3518340

＜韻律生成部１５２及び韻律モデル記憶部１５３＞
次に、韻律生成部１５２が、テキスト解析部１５１が出力した韻律情報を入力として、韻律モデル記憶部１５３を参照して、韻律に関する情報である韻律パラメータを推定してこれを出力する（ｓ１５２）。
<Prosody generation unit 152 and prosody model storage unit 153>
Next, the prosody generation unit 152 takes the prosodic information output by the text analysis unit 151 as input, refers to the prosody model storage unit 153, estimates the prosodic parameters (information about the prosody), and outputs them (s152).

韻律パラメータとして、Ｆ０パターン(基本周波数パターン)、Ｆ０パターンの平均値、Ｆ０パターンの傾き、音素継続時間長(音素の発声の長さ)等を例示できる。例えば、音素継続時間長は、予め規則化された、呼気段落内における音素の位置、発声速度、当該音素の前後の音素環境などに従って適宜に設定することができる。また、Ｆ０パターンについては、いわゆる藤崎モデルなどによって求めることができる。なお、藤崎モデル等の韻律モデルは、予め韻律モデル記憶部１５３に記憶しておく。なお、「推定」とは、音声合成のために必要となる情報（Ｆ０パターン、音素継続時間長等）を、ある特定のものに決定することを意味する。 Examples of prosodic parameters include the F0 pattern (fundamental frequency pattern), the average value of the F0 pattern, the slope of the F0 pattern, and the phoneme duration (the length of time a phoneme is uttered). For example, the phoneme duration can be set according to predefined rules based on the position of the phoneme within the breath group, the speaking rate, the phonemic environment before and after the phoneme, and so on. The F0 pattern can be obtained with, for example, the so-called Fujisaki model. Prosody models such as the Fujisaki model are stored in the prosody model storage unit 153 in advance. Here, "estimation" means fixing the information needed for speech synthesis (F0 pattern, phoneme duration, etc.) to one specific value.

ここで説明した韻律パラメータ取得の概要は一例に過ぎず、その他の手法を排除する趣旨のものではない。本発明の音声合成装置・方法では、韻律パラメータの取得には、従来の韻律パラメータ取得手法を用いることができるので、その詳細を省略する。Ｆ０パターンの取得については例えば参考文献５、６を、音素継続時間長については例えば参考文献７、８を参照されたい。
（参考文献５）特許３４２０９６４号公報
（参考文献６）特許３３４４４８７号公報
（参考文献７）海木佳延、武田一哉、匂坂芳典、「言語情報を利用した母音継続時間長の制御」、電子情報通信学会誌 Vol. J75-A, No.3, pp.467-473, 1992.
（参考文献８）M.D.Riley, "Tree-based modeling for speech synthesis", In G. Bailly, C. Benoit, and T. R. Sawallis, editors, Talking Machines: Theories, Models, and Designs, pages 265-273. Elsevier, 1992.
The outline of prosodic parameter acquisition given here is only an example and is not intended to exclude other methods. The speech synthesis apparatus and method of the present invention can acquire the prosodic parameters with conventional techniques, so the details are omitted. For acquiring the F0 pattern see, for example, References 5 and 6; for the phoneme duration see, for example, References 7 and 8.
(Reference 5) Japanese Patent No. 3420964
(Reference 6) Japanese Patent No. 3344487
(Reference 7) Yoshinobu Kaiki, Kazuya Takeda, Yoshinori Sagisaka, "Control of vowel duration using linguistic information", Journal of the Institute of Electronics, Information and Communication Engineers (IEICE), Vol. J75-A, No. 3, pp. 467-473, 1992.
(Reference 8) M. D. Riley, "Tree-based modeling for speech synthesis", in G. Bailly, C. Benoit, and T. R. Sawallis, editors, Talking Machines: Theories, Models, and Designs, pages 265-273. Elsevier, 1992.

＜音素コンテキスト変換部１５４＞
音素コンテキスト変換部１５４は、テキスト解析部１５１が出力した読み情報を入力として、音素コンテキストを求めてこれを出力する（ｓ１５４）。
<Phoneme context conversion unit 154>
The phoneme context conversion unit 154 obtains the phoneme context by using the reading information output by the text analysis unit 151 as input, and outputs it (s154).

音素コンテキストとは音素の並びのことであり、例えば、読み情報が、“キョウワハレ”であれば音素コンテキストは、“／ｋ／／ｙ／／Ｏ／／Ｗ／／Ａ／／Ｈ／／Ａ／／Ｒ／／Ｅ／”となる。音素コンテキスト変換部１５４は、かな音素変換モデルなど（必要に応じて記憶部に記憶されている。）を参照して、読み情報を音素列に変換する（音素コンテキストの取得）。 A phoneme context is a sequence of phonemes. For example, if the reading information is “キョウワハレ” (kyō wa hare), the phoneme context is “/k//y//O//W//A//H//A//R//E/”. The phoneme context conversion unit 154 converts the reading information into a phoneme string by referring to a kana-to-phoneme conversion model or the like (stored in the storage unit as necessary), thereby acquiring the phoneme context.

＜音声素片候補探索部１５５＞
次に、音声素片候補探索部１５５が、音素コンテキストを入力として、音素情報に基づいて、音素コンテキストに合成単位で適合する音声素片の候補（以下、「音声素片候補」という）を類似話者音声データベース１３０から探索してこれを出力する（ｓ１５５）。
<Speech unit candidate search unit 155>
Next, the speech unit candidate search unit 155 takes the phoneme context as input and, based on the phoneme information, searches the similar speaker speech database 130 for speech units that match the phoneme context in synthesis units (hereinafter "speech unit candidates") and outputs them (s155).

音声素片候補の探索方法として、種々の方法を採用できる。例えば、参考文献９記載の方法により実施することができる。
（参考文献９）特開２００９−１２２３８１号公報 Various methods can be adopted as a method for searching for speech element candidates. For example, it can be performed by the method described in Reference 9.
(Reference 9) JP 2009-122381 A

音素情報が音素コンテキストの一部と一致する音声素片を類似話者音声データベース１３０から全て探索して、音声素片候補とする。 All speech units whose phoneme information matches a part of the phoneme context are searched from the similar speaker speech database 130 and set as speech unit candidates.

合成単位が音素の例では、音素コンテキストが“／ｋ／／ｙ／／Ｏ／／Ｗ／／Ａ／／Ｈ／／Ａ／／Ｒ／／Ｅ／”である場合を例にすると、音素コンテキストの各音素“／ｋ／”、“／ｙ／”、“／Ｏ／”、“／Ｗ／”、“／Ａ／”、“／Ｈ／”、“／Ａ／”、“／Ｒ／”、“／Ｅ／”毎に、当該音素に一致する音素情報を持つ音声素片を類似話者音声データベース１３０から全て探索して、これら音声素片を音素コンテキストの音素毎に音声素片候補とする。つまり、この例では、音素コンテキストの音素毎に一つまたは複数の音声素片候補が決まる。 Taking the phoneme as the synthesis unit and the phoneme context “/k//y//O//W//A//H//A//R//E/” as an example, for each phoneme “/k/”, “/y/”, “/O/”, “/W/”, “/A/”, “/H/”, “/A/”, “/R/”, “/E/” in the phoneme context, all speech units whose phoneme information matches that phoneme are retrieved from the similar speaker speech database 130 and taken as the speech unit candidates for that phoneme. In this example, therefore, one or more speech unit candidates are determined for each phoneme in the phoneme context.
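The per-phoneme search described above can be sketched as a simple lookup. This is a minimal illustration, not the method of Reference 9; the record fields `unit_id`, `speaker_id`, and `phoneme` and the function name `search_candidates` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def search_candidates(phoneme_context, database):
    """For each phoneme in the context, return every unit whose phoneme information matches it."""
    # Index the database once so that each lookup is a dictionary access.
    by_phoneme = defaultdict(list)
    for unit in database:
        by_phoneme[unit["phoneme"]].append(unit)
    # One candidate list per position in the phoneme context (possibly empty).
    return [list(by_phoneme.get(p, [])) for p in phoneme_context]


# Toy database: two speakers, three units (illustrative values only).
database = [
    {"unit_id": 0, "speaker_id": "s1", "phoneme": "k"},
    {"unit_id": 1, "speaker_id": "s2", "phoneme": "k"},
    {"unit_id": 2, "speaker_id": "s1", "phoneme": "A"},
]
candidates = search_candidates(["k", "A", "k"], database)
```

Each position in the context keeps its own candidate list, so a phoneme that occurs twice (such as /A/ in the example context) is searched and selected independently at each occurrence.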

＜素片選択部１５６＞
素片選択部１５６は、各音声素片候補の類似話者識別子に対応する話者類似度Ｌ_ｓを少なくとも用いて、合成単位の対象テキストと音声素片候補との適合度を総合コストとして算出し、この総合コストが最良となるときの音声素片候補を、それぞれ選択音声素片として選択する（ｓ１５６）。
<Unit selection unit 156>
The unit selection unit 156 calculates, as a total cost, the degree of matching between the target text in synthesis units and each speech unit candidate, using at least the speaker similarity L_s corresponding to the similar speaker identifier of each candidate, and selects the candidate with the best total cost as the selected speech unit for each synthesis unit (s156).

例えば、音声素片候補を入力として、一つまたは複数のサブコスト関数を用いて、音声素片候補それぞれのサブコストを計算し、さらにサブコストからなる総合コストを計算し、総合コストを用いて、波形接続に用いる選択音声素片を特定して、これを出力する。 For example, taking the speech unit candidates as input, the unit selection unit 156 computes the sub-costs of each candidate using one or more sub-cost functions, computes a total cost from those sub-costs, uses the total cost to identify the selected speech units to be used for waveform concatenation, and outputs them.

例えば、サブコストそれぞれは、対象テキストから得られる韻律パラメータと、音声素片候補の韻律パラメータとの適合度を表す。 For example, each sub-cost represents the degree of matching between the prosodic parameter obtained from the target text and the prosodic parameter of the speech segment candidate.

サブコストの計算方法であるが、任意に種々の方法を採用できる。一例として、参考文献１０に示されるようなサブコスト関数を用いて計算することができる。
（参考文献１０）「波形編集型合成方式におけるスペクトル連続性を考慮した波形選択法」、日本音響学会講演論文集、2-6-10, pp.239-240, 1990/9 Various methods may be adopted to calculate the sub-costs. As one example, they can be calculated with sub-cost functions such as those shown in Reference 10.
(Reference 10) “Waveform Selection Method Considering Spectral Continuity in Waveform Editing Type Synthesis Method”, Proc. Of the Acoustical Society of Japan, 2-6-10, pp.239-240, 1990/9

音声素片候補の韻律パラメータのＦ０パターン平均値Ｖｐと、対象テキストの合成単位の音声素片候補のＦ０パターン平均値Ｖｓに対応するサブコスト関数は、
Ｃ_１（Ｖｐ，Ｖｓ）＝（Ｖｐ−Ｖｓ）^２（１１）
である。 The sub-cost function corresponding to the F0 pattern average value Vp of the prosodic parameters of the speech unit candidate and the F0 pattern average value Vs of the speech unit candidate for the synthesis unit of the target text is
$$C_1(V_p, V_s) = (V_p - V_s)^2 \tag{11}$$

音声素片候補の韻律パラメータのＦ０パターンの傾きＦｐと、対象テキストの合成単位の音声素片候補のＦ０パターンの傾きＦｓに対応するサブコスト関数は、
Ｃ_２（Ｆｐ，Ｆｓ）＝（Ｆｐ−Ｆｓ）^２（１２）
である。 The sub-cost function corresponding to the slope Fp of the F0 pattern of the prosodic parameters of the speech unit candidate and the slope Fs of the F0 pattern of the speech unit candidate for the synthesis unit of the target text is
$$C_2(F_p, F_s) = (F_p - F_s)^2 \tag{12}$$

音声素片候補の韻律パラメータの音素継続時間長Ｔｐと、対象テキストの合成単位の音声素片候補の音素継続時間長Ｔｓに対応するサブコスト関数は、
Ｃ_３（Ｔｐ，Ｔｓ）＝（Ｔｐ−Ｔｓ）^２（１３）
である。 The sub-cost function corresponding to the phoneme duration Tp of the prosodic parameters of the speech unit candidate and the phoneme duration Ts of the speech unit candidate for the synthesis unit of the target text is
$$C_3(T_p, T_s) = (T_p - T_s)^2 \tag{13}$$

話者類似度をサブコスト関数の一つとして使用する場合、サブコスト関数は、
Ｃ_４（Ｌ_１，Ｌ_ｓ）＝（Ｌ_１−Ｌ_ｓ）^２（１４）
である。なお、Ｌ_１は類似話者選択部１１１でＳ個の話者類似度Ｌ_ｓの中で最も大きい話者類似度であり、Ｌ_ｓはサブコスト計算の対象となる音声素片候補の類似話者ｓ（ｓ＝１，２，…，Ｓ）の話者類似度である。サブコスト計算の対象となる音声素片が最も話者類似度が高い話者の場合、Ｃ_４（Ｌ_１，Ｌ_ｓ）は０となり、話者類似度が低い話者ほどＣ（Ｌ_１，Ｌ_ｓ）は大きな値となる。なお、話者類似度は、音声素片候補の類似話者識別子をキーとして、話者類似度記憶部１４０から取得する。 When the speaker similarity is used as one of the sub-cost functions, the sub-cost function is
$$C_4(L_1, L_s) = (L_1 - L_s)^2 \tag{14}$$
Here, L_1 is the largest of the S speaker similarities L_s obtained by the similar speaker selection unit 111, and L_s is the speaker similarity of similar speaker s (s = 1, 2, ..., S), the speaker of the speech unit candidate for which the sub-cost is being calculated. When the speech unit in question belongs to the speaker with the highest speaker similarity, C_4(L_1, L_s) is 0, and the lower a speaker's similarity, the larger C_4(L_1, L_s) becomes. The speaker similarity is obtained from the speaker similarity storage unit 140 using the similar speaker identifier of the speech unit candidate as a key.
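Equations (11) to (14) are all squared differences, so they can be sketched with one helper. The dictionary keys (`f0_mean`, `f0_slope`, `duration`, `speaker_id`) and the lookup table standing in for the speaker similarity storage unit 140 are assumptions made for this illustration.

```python
def squared_difference(a, b):
    """Common form of sub-cost functions (11) to (14)."""
    return (a - b) ** 2


def sub_costs(target, candidate, best_similarity, similarity_by_speaker):
    """Return [C_1, C_2, C_3, C_4] for one speech unit candidate."""
    # L_s is looked up by the candidate's similar speaker identifier.
    l_s = similarity_by_speaker[candidate["speaker_id"]]
    return [
        squared_difference(target["f0_mean"], candidate["f0_mean"]),    # C_1, eq. (11)
        squared_difference(target["f0_slope"], candidate["f0_slope"]),  # C_2, eq. (12)
        squared_difference(target["duration"], candidate["duration"]),  # C_3, eq. (13)
        squared_difference(best_similarity, l_s),                       # C_4, eq. (14)
    ]


# Illustrative values only (F0 in Hz, duration in ms).
target = {"f0_mean": 120.0, "f0_slope": 0.5, "duration": 80.0}
candidate = {"speaker_id": "s2", "f0_mean": 118.0, "f0_slope": 0.5, "duration": 100.0}
similarity = {"s1": 1.0, "s2": 0.5}
costs = sub_costs(target, candidate, max(similarity.values()), similarity)
```

Because the sub-costs are on very different numeric scales (Hz squared, ms squared, similarity squared), the weights applied when they are combined matter in practice.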

次に、素片選択部１５６が、これらのサブコストからなる総合コストを計算する。総合コストには種々の方式を採用することができる。一例として、以下のように、各サブコスト値に重みｗ_ｋを掛けて総和を計算することで、これを総合コストＱとする。 Next, the segment selection unit 156 calculates an overall cost including these sub costs. Various methods can be adopted for the total cost. As an example, the total cost Q is calculated by multiplying each sub-cost value by the weight w _k as follows.
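The weighted sum described in this paragraph can be written as follows. The expression is reconstructed from the prose (each sub-cost value multiplied by its weight w_k and summed over the K sub-costs); no equation number is given here because the source text does not show one.

```latex
Q = \sum_{k=1}^{K} w_k \, C_k
```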

但し、Ｋはサブコストの個数である（例えばＫ＝４）。総合コストＱは、対象テキストの合成単位毎に、一つまたは複数の音声素片候補に対してそれぞれ求められる。但し、重みｗ_ｋは何れも正値とし、任意に設定することができる。上記の例では、各サブコストＣ_ｋは０以上の値をとり、音素コンテキストに対して優れた音声素片候補ほどそれらの値は０に近いから、総合コストＱは０以上の値をとり、総合コストＱが０に近いほど良好な素片候補と判定することができる。 Here, K is the number of sub-costs (for example, K = 4). The total cost Q is obtained for each of the one or more speech unit candidates of every synthesis unit of the target text. Each weight w_k is positive and can be set arbitrarily. In the above example, each sub-cost C_k takes a value of 0 or more, and the better a speech unit candidate fits the phoneme context, the closer its values are to 0; the total cost Q therefore also takes a value of 0 or more, and the closer Q is to 0, the better the candidate is judged to be.

そして、素片選択部１５６は、総合コストＱの総和が最良（この例では最小）となるように、対象テキストの合成単位毎に、音声素片候補を一つ特定し、対象テキストの音素コンテキストに対応する一連の音声素片を決定する。この特定された音声素片候補が、選択音声素片である。 The unit selection unit 156 then identifies one speech unit candidate for each synthesis unit of the target text so that the sum of the total costs Q is the best (in this example, the minimum), and thereby determines the sequence of speech units corresponding to the phoneme context of the target text. Each identified speech unit candidate is a selected speech unit.
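The weighted total cost and the selection step can be sketched as below. The sketch assumes that each candidate's total cost Q depends only on that candidate, as in the sub-costs listed above; under that assumption, minimizing the sum of Q over all synthesis units reduces to taking the minimum independently for each unit. The function names are hypothetical.

```python
def total_cost(sub_cost_values, weights):
    """Weighted sum of the K sub-costs of one candidate (weights are assumed positive)."""
    if len(sub_cost_values) != len(weights):
        raise ValueError("one weight is needed per sub-cost")
    return sum(w * c for w, c in zip(weights, sub_cost_values))


def select_units(costs_per_unit):
    """costs_per_unit[i] lists Q for every candidate of the i-th synthesis unit.

    Returns the index of the lowest-cost candidate for each synthesis unit.
    """
    return [min(range(len(costs)), key=costs.__getitem__) for costs in costs_per_unit]


q = total_cost([4.0, 0.0, 400.0, 0.25], [1.0, 1.0, 0.01, 2.0])
chosen = select_units([[3.0, 1.0, 2.0], [0.5], [2.0, 2.0]])
```

If a cost that couples neighboring units were added (for example, a join cost between adjacent units), the per-unit minimum would no longer be sufficient and a search over sequences, such as dynamic programming, would be needed instead.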

＜素片接続部１５７＞
最後に、素片接続部１５７が、類似話者音声データベース１３０から選択音声素片に対応する部分音声データを読み込み、この部分音声データを一連の音声素片の並びに従って接続することで合成音声を生成する（ｓ１５７）。
<Unit connection unit 157>
Finally, the unit connection unit 157 reads the partial speech data corresponding to the selected speech units from the similar speaker speech database 130 and connects the partial speech data in the order of the sequence of speech units to generate synthesized speech (s157).

選択音声素片に対応する部分音声データを時間的な順に単に接続してもよいが、異なる部分音声波形データ間を時間的又は周波数的に補間して波形接続してもよい（参考文献１１参照）。
（参考文献１１）特開平７−０７２８９７号公報 The partial speech data corresponding to the selected speech units may simply be connected in temporal order, or the waveforms may be connected by interpolating between different pieces of partial speech waveform data in time or frequency (see Reference 11).
(Reference 11) Japanese Patent Laid-Open No. 7-072897
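Both connection styles mentioned above can be illustrated with a short sketch. Plain concatenation corresponds to `overlap=0`; a linear cross-fade over a few samples is used here as one simple form of time-domain interpolation. This is an assumption chosen for illustration, not the method of Reference 11.

```python
def concatenate(segments, overlap=0):
    """Join waveform segments (lists of samples) in order.

    With overlap > 0, the last `overlap` samples of the output so far are
    cross-faded linearly with the first `overlap` samples of the next segment.
    """
    output = list(segments[0]) if segments else []
    for segment in segments[1:]:
        n = min(overlap, len(output), len(segment))
        for i in range(n):
            fade_in = (i + 1) / (n + 1)  # weight of the incoming segment
            j = len(output) - n + i
            output[j] = (1.0 - fade_in) * output[j] + fade_in * segment[i]
        output.extend(segment[n:])
    return output


plain = concatenate([[1, 2], [3, 4]])
faded = concatenate([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], overlap=1)
```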

＜効果＞
このような構成とすることで、目標話者により類似している話者の音声データが音声合成の際に使用されやすくなり、合成音声の目標話者に対する類似性を向上させることができる。
<Effect>
With such a configuration, the voice data of a speaker that is more similar to the target speaker can be easily used in speech synthesis, and the similarity of the synthesized speech to the target speaker can be improved.

また、類似話者の選択は、非特許文献１の平均声モデルからの変換規則の学習に比べ、数秒〜数十秒程度の少量の音声データで十分な性能が得られるため、音声合成を行うために必要な目標話者の音声データ量を削減することができる。また、目標話者の多量の音声データが必要でなくなるため、音声収録の拘束時間を極めて短時間とすることができる。 In addition, compared with learning conversion rules from the average voice model of Non-Patent Document 1, similar-speaker selection achieves sufficient performance with only a small amount of speech data, on the order of several seconds to several tens of seconds, so the amount of target-speaker speech data required for speech synthesis can be reduced. Because a large amount of speech data from the target speaker is no longer needed, the time the target speaker must spend on recording can be made extremely short.

さらに、複数名の音声データを統合することで、統合後の音声データベース中に存在する同一コンテキストの素片が増加させ、合成音声の自然性が向上させることができる。 Furthermore, integrating the speech data of multiple speakers increases the number of units with the same context in the integrated speech database, which improves the naturalness of the synthesized speech.

＜変形例１＞
第一実施形態と異なる部分についてのみ説明する。
図１０の素片選択部１５６’において、各音声素片候補の類似話者識別子に対応する話者類似度Ｌ_ｓを少なくとも用いて、合成単位の前記対象テキストと音声素片候補との適合度を、以下のようにして総合コストとして算出し、この総合コストが最良となるときの音声素片候補を、それぞれ選択音声素片として選択する（図１１のｓ１５６’）。
<Modification 1>
Only the parts that differ from the first embodiment are described.
The unit selection unit 156′ in FIG. 10 calculates, as a total cost, the degree of matching between the target text in synthesis units and each speech unit candidate as follows, using at least the speaker similarity L_s corresponding to the similar speaker identifier of each candidate, and selects the candidate with the best total cost as the selected speech unit for each synthesis unit (s156′ in FIG. 11).

なお、重みｗ_ｓは、話者類似度が最も高い話者の場合に１となり、それ以外の類似話者の場合には、話者類似度に応じて１より大きくなれば、別の計算式で求めても構わない。また、この重みは総合コスト全体に使用しても、個別のサブコスト関数の重みとして使用しても構わない（例えば、Ｆ０平均値のサブコスト関数Ｃ_１のみに使用する等）。 The weight w_s is 1 for the speaker with the highest speaker similarity; for the other similar speakers, it may be obtained by a different formula as long as it becomes larger than 1 according to the speaker similarity. This weight may be applied to the total cost as a whole or used as the weight of an individual sub-cost function (for example, applied only to the sub-cost function C_1 for the F0 average value).
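The text only constrains w_s: it must equal 1 for the most similar speaker and exceed 1 for the others. One weight that satisfies these constraints is the ratio L_1 / L_s. This is a hypothetical choice for illustration, and it assumes that the similarities are positive and that larger values mean greater similarity.

```python
def speaker_weight(l_best, l_s):
    """Hypothetical weight w_s = L_1 / L_s: 1 for the best speaker, above 1 otherwise."""
    if l_s <= 0 or l_best < l_s:
        raise ValueError("expects 0 < L_s <= L_1")
    return l_best / l_s


def weighted_total_cost(q, l_best, l_s):
    """Apply w_s to the whole total cost Q, which is one of the two usages the text allows."""
    return speaker_weight(l_best, l_s) * q


w_best = speaker_weight(0.8, 0.8)
w_other = speaker_weight(0.8, 0.4)
```

If the similarity were a log-likelihood, which can be negative, a ratio would not be appropriate and a different monotone mapping would be needed.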

また、この場合、話者類似度Ｌ_ｓをサブコストとして利用してもよいし、利用しなくてもよい。
このような構成とすることで第一実施形態と同様の効果を得ることができ、さらに、柔軟性のある素片選択が可能となる。 In this case, the speaker similarity L_s may or may not also be used as a sub-cost.
This configuration provides the same effects as the first embodiment and, in addition, enables more flexible unit selection.

＜その他の変形例＞
音声合成装置１００は、必ずしも多数話者音声データベース構築部１０１を備えなくともよい。その場合、他の装置等で構成した多数話者音声データベース１０３を、記録媒体から、あるいは通信回線を介してダウンロードして取得し記憶すればよい。さらに、音声合成装置１００は、多数話者音声データベース１０３及び類似話者音声データベース構築部１１０を備えなくともよい。その場合、他の装置等で構成した類似話者音声データベース１３０及び類似話者の話者類似度を、記録媒体から、あるいは通信回線を介してダウンロードして取得し記憶すればよい。音声合成部１５０は、このように得られた類似話者音声データベース１３０と話者類似度を用いても第一実施形態と同様の音声合成を行うことができる。
<Other modifications>
The speech synthesizer 100 does not necessarily include the multi-speaker speech database construction unit 101. In that case, the multi-speaker voice database 103 constituted by another device or the like may be obtained from a recording medium or downloaded via a communication line and stored. Furthermore, the speech synthesizer 100 may not include the multi-speaker speech database 103 and the similar speaker speech database construction unit 110. In that case, the similar speaker voice database 130 constituted by another device or the like and the speaker similarity of the similar speaker may be obtained from a recording medium or downloaded via a communication line and stored. The speech synthesizer 150 can perform speech synthesis similar to that of the first embodiment even using the similar speaker speech database 130 and speaker similarity obtained in this way.

また、音声データは、音声波形データではなく、音声波形データに対して信号処理を行った結果、得られる音声特徴量（音高パラメータ（基本周波数等）、スペクトルパラメータ（ケプストラム、メルケプストラム等））でもよい。この場合、類似話者選択部１１１内で音声波形データからスペクトルパラメータ（ケプストラム、メルケプストラム等）を取得する処理を省略することができる。また、素片接続部１５７では、接続した部分音声データ（音声特徴量）を用いて、音声波形データを生成し、出力する。 The speech data need not be speech waveform data; it may instead be speech features obtained by applying signal processing to speech waveform data (pitch parameters such as the fundamental frequency, and spectral parameters such as the cepstrum or mel-cepstrum). In this case, the process of extracting spectral parameters (cepstrum, mel-cepstrum, etc.) from the speech waveform data in the similar speaker selection unit 111 can be omitted. The unit connection unit 157 then generates and outputs speech waveform data from the connected partial speech data (speech features).

音声素片は、アクセント情報（アクセント型、アクセント句長）や品詞情報等を含んでもよい。また、多数話者音声データベース１０３には、話者の情報（性別、年齢、出身地）、収録環境（マイクロホンの種類、収録ブースの情報）等を含んでもよい。目標話者の音声データにもこのような情報を付加することで、より精度の高い合成音声を生成することができる。 The speech segment may include accent information (accent type, accent phrase length), part of speech information, and the like. The multi-speaker voice database 103 may include speaker information (gender, age, hometown), recording environment (microphone type, recording booth information), and the like. By adding such information to the speech data of the target speaker, it is possible to generate synthesized speech with higher accuracy.

類似話者選択部１１１における話者類似度の求め方は、他の方法によってもよい。例えば、スペクトルパラメータを用いずに、音声波形データから直接類似度を求めてもよい。 The method for obtaining the speaker similarity in the similar speaker selection unit 111 may be another method. For example, the similarity may be obtained directly from speech waveform data without using spectral parameters.

類似話者音声データベースには、必ずしも音声波形データを記憶しなくともよい。音声素片のみからなるデータベースであってもよい。この場合、音声波形データは、類似話者音声データベース内の音声素片をキーとして、多数話者音声データベース１０３から取得する構成とする。 It is not always necessary to store speech waveform data in the similar speaker speech database. It may be a database consisting only of speech segments. In this case, the speech waveform data is obtained from the multi-speaker speech database 103 using speech units in the similar speaker speech database as keys.

本実施形態では、サブコストとして、Ｆ０パターンの平均値、Ｆ０パターンの傾き、音素継続時間長、話者類似度を用いているが、少なくとも話者類似度を用いていればよい。最も目標話者に近いと判定された話者の音声素片が選択されやすくなり、合成音声の目標話者に対する類似性が向上する。さらに他の情報からサブコストを求めてもよい。例えば、素片選択部１５６において、音素コンテキストを入力として（図１０中破線で示す）、読みに対応するサブコストを計算してもよい。なお、読みに対応するサブコスト関数は、
Ｃ（ｊ）＝１／ｅ^ｊ（２３）
である。但し、対象テキストの音素コンテキストと、合成単位の音声素片候補の音素コンテキストが一致する音素数をｊとする。 In this embodiment, the average value of the F0 pattern, the slope of the F0 pattern, the phoneme duration, and the speaker similarity are used as sub-costs, but it suffices to use at least the speaker similarity. Doing so makes the speech units of the speaker judged closest to the target speaker more likely to be selected, which improves the similarity of the synthesized speech to the target speaker. Sub-costs may also be derived from other information. For example, the unit selection unit 156 may take the phoneme context as input (indicated by the broken line in FIG. 10) and calculate a sub-cost corresponding to the reading. The sub-cost function corresponding to the reading is
$$C(j) = \frac{1}{e^{j}} \tag{23}$$
where j is the number of phonemes for which the phoneme context of the target text matches the phoneme context of the speech unit candidate of the synthesis unit.
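Equation (23) decays exponentially with the number of matching phonemes j, so a candidate with a longer matching context receives a smaller sub-cost. A minimal sketch, with a hypothetical function name:

```python
import math


def reading_sub_cost(matching_phonemes):
    """C(j) = 1 / e^j, where j is the number of matching phonemes (eq. 23)."""
    if matching_phonemes < 0:
        raise ValueError("j must be non-negative")
    return math.exp(-matching_phonemes)


cost_no_match = reading_sub_cost(0)
cost_two_match = reading_sub_cost(2)
```

Unlike sub-costs (11) to (14), this one never reaches 0; it only approaches 0 as j grows.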

＜第二実施形態＞
図３を用いて第二実施形態に係る音声合成装置２００を説明する。第一実施形態と異なる部分についてのみ説明する。音声合成装置２００は、多数話者音声データベース構築部１０１と多数話者音声データベース１０３と類似話者音声データベース構築部２１０と類似話者音声データベース１３０と話者類似度記憶部１４０とを備える。類似話者音声データベース構築部２１０の構成が第一実施形態と異なる。
<Second embodiment>
A speech synthesizer 200 according to the second embodiment will be described with reference to FIG. Only parts different from the first embodiment will be described. The speech synthesizer 200 includes a multi-speaker speech database construction unit 101, a multi-speaker speech database 103, a similar speaker speech database construction unit 210, a similar speaker speech database 130, and a speaker similarity storage unit 140. The configuration of the similar speaker voice database construction unit 210 is different from that of the first embodiment.

［ポイント］
第一実施形態では、話者統合部１１５において、抽出したＳ名の話者を統合する際に、話者間の音声特徴量の差が大きいと、合成音声の品質の劣化を引き起こす恐れがある。そのため、第二実施形態では、話者統合部２１５で目標話者の音声特徴量を用いて、抽出した話者の音声特徴量を目標話者の特徴量へ正規化することで、話者間の音声特徴量の差を軽減し、合成音声の品質劣化を防ぐ。
[Point]
In the first embodiment, when the speaker integration unit 115 integrates the S extracted speakers, a large difference in speech features between the speakers may degrade the quality of the synthesized speech. In the second embodiment, therefore, the speaker integration unit 215 uses the speech features of the target speaker to normalize the speech features of the extracted speakers toward those of the target speaker, which reduces the differences in speech features between speakers and prevents degradation of the synthesized speech.

＜類似話者音声データベース構築部２１０＞
図１２及び図１３を用いて、類似話者音声データベース構築部２１０を説明する。
類似話者音声データベース構築部２１０は、複数の音声データを用いて、目標話者の音声データに類似した音声データからなる類似話者音声データベースを構築する（ｓ２１０）。図１２に示すように類似話者音声データベース構築部２１０は、類似話者選択部１１１と、さらに、音声素片付与部２１２と、話者変換規則学習部２１３と話者単位変換部２１４と話者統合部２１５を有する。
<Similar speaker speech database construction unit 210>
The similar speaker speech database construction unit 210 will be described with reference to FIGS. 12 and 13.
The similar speaker speech database construction unit 210 uses a plurality of pieces of speech data to construct a similar speaker speech database composed of speech data similar to the speech data of the target speaker (s210). As shown in FIG. 12, the similar speaker speech database construction unit 210 includes the similar speaker selection unit 111 and, in addition, a speech unit assignment unit 212, a speaker conversion rule learning unit 213, a speaker unit conversion unit 214, and a speaker integration unit 215.

＜音声素片付与部２１２＞
音声素片付与部２１２は、入力された目標話者の音声データに対して、音声素片を付与する（ｓ２１２）。音声素片として付与される情報は、多数話者音声データベース構築部１０１で付与される音声素片と同様である。但し、音素番号や各音素の開始時間、終了時間は、目標話者の音声データに対するものとする。この音声素片は、人手により付与するか、音声データと発話テキストから自動で付与してもよい。
<Speech unit assignment unit 212>
The speech unit assignment unit 212 assigns speech units to the input speech data of the target speaker (s212). The information assigned as a speech unit is the same as that assigned by the multi-speaker speech database construction unit 101, except that the phoneme numbers and the start and end times of each phoneme refer to the speech data of the target speaker. The speech units may be assigned manually or automatically from the speech data and the utterance text.

＜話者変換規則学習部２１３＞
話者変換規則学習部２１３は、目標話者の音声データと複数選択した類似話者の音声データを用いて、各類似話者の音声データを目標話者の音声特徴量を持つ音声データに変換する話者変換規則を学習する（ｓ２１３）。例えば、参考文献１２記載の方法により話者変換規則を学習する。
（参考文献１２） M. J. F Gales and P. C. Woodland, “Mean and variance adaptation within the MLLR framework,” Computer Speech and Language, 1996, vol.10, pp.249-264
<Speaker conversion rule learning unit 213>
The speaker conversion rule learning unit 213 uses the speech data of the target speaker and the speech data of the selected similar speakers to learn, for each similar speaker, a speaker conversion rule that converts that speaker's speech data into speech data having the speech features of the target speaker (s213). For example, the speaker conversion rule is learned by the method described in Reference 12.
(Reference 12) M. J. F. Gales and P. C. Woodland, “Mean and variance adaptation within the MLLR framework,” Computer Speech and Language, 1996, vol. 10, pp. 249-264

まず、類似話者選択部１１１で選択されたＳ名分の類似話者の音声データから得られる音声特徴量と、その音声データに対応する音声素片から統計モデル（例えば、Hidden Markov Model（ＨＭＭ）やＧＭＭ等）を学習する（図１４中のｓ２１３ｂ）。なお、多数話者音声データベース１０３の音声データを用いて、事前に全ての話者（Ｎ名分）の統計モデルを求めておき、多数話者音声データベース内に記憶しておいてもよい。 First, a statistical model (for example, Hidden Markov Model (HMM) is obtained from speech feature amounts obtained from speech data of S similar speakers selected by the similar speaker selection unit 111 and speech units corresponding to the speech data. ) Or GMM) (s213b in FIG. 14). Note that a statistical model of all speakers (for N names) may be obtained in advance using voice data in the multi-speaker voice database 103 and stored in the multi-speaker voice database.

次に、目標話者の全スペクトルパラメータ及び音声素片と、類似話者ｓの統計モデルとを用いて、類似話者ｓのスペクトルパラメータを目標話者のスペクトルパラメータへ変換するＣＭＬＬＲ変換行列Ｗ（話者変換規則φ_ｓ）を学習する（ｓ２１３ｃ）。話者単位変換部２１４に話者変換規則φ_ｓを出力する。変換行列Ｗは以下の方程式を解くことで求める。 Next, using all the spectral parameters and speech units of the target speaker together with the statistical model of similar speaker s, a CMLLR transformation matrix W (the speaker conversion rule φ_s) that converts the spectral parameters of similar speaker s into those of the target speaker is learned (s213c). The speaker conversion rule φ_s is output to the speaker unit conversion unit 214. The transformation matrix W is obtained by solving the following equation.

ここで、（・）’は・の転置行列を表す。ｘ_ｔは時刻ｔの目標話者のスペクトルパラメータ、μ_ｇ，Ｕ_ｇ ^−１はそれぞれＨＭＭの状態ｇの平均と共分散行列の逆行列である。また、γ_ｇ（ｔ）は状態ｇにおいてｘ_ｔが出力される確率であり、ｘ_ｔとμ_ｇ，Ｕ_ｇ ^−１とから得られる。この変換行列Ｗは、ＨＭＭの状態毎に求めることが可能であるが、本実施形態では全ての状態を共有することにより、一つの変換行列を求める。なお、各類似話者ｓに対して、話者変換規則φ_ｓを学習する。 Here, (·)′ denotes the transpose. x_t is the spectral parameter of the target speaker at time t, and μ_g and U_g^−1 are the mean and the inverse covariance matrix of HMM state g, respectively. γ_g(t) is the probability that x_t is output in state g, and is obtained from x_t, μ_g, and U_g^−1. The transformation matrix W could be obtained separately for each HMM state, but in this embodiment a single transformation matrix is obtained by sharing it across all states. A speaker conversion rule φ_s is learned for each similar speaker s.

＜話者単位変換部２１４＞
話者単位変換部２１４は各類似話者の音声データを話者変換規則φ_ｓに従って変換する（ｓ２１４）。例えば、類似話者ｓの音声データベース中の時刻ｔにおけるスペクトルパラメータｘ_ｓ，ｔを、変換行列Ｗ（話者変換規則φ_ｓ）を用いて変換することで、目標話者の特徴へ変換した類似話者ｓのスペクトルパラメータｘ￣_ｓ，ｔを得る。
<Speaker unit conversion unit 214>
The speaker unit conversion unit 214 converts the speech data of each similar speaker according to the speaker conversion rule φ_s (s214). For example, the spectral parameter x_{s,t} at time t in the speech database of similar speaker s is converted with the transformation matrix W (speaker conversion rule φ_s) to obtain x̄_{s,t}, the spectral parameter of similar speaker s converted toward the characteristics of the target speaker.
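Applying a learned transformation matrix to a spectral parameter vector can be sketched as an affine map. The layout used here, W = [b A] applied to the extended vector [1, x], is a common convention for MLLR-style transforms and is an assumption for this illustration; the conversion equation itself does not appear in this text.

```python
def apply_speaker_transform(W, x):
    """Return A x + b, where each row of W holds [b_i, A_i1, ..., A_id].

    W has d rows and d + 1 columns; x is a d-dimensional spectral parameter vector.
    """
    extended = [1.0] + list(x)  # the leading 1 picks up the bias column
    if any(len(row) != len(extended) for row in W):
        raise ValueError("each row of W must have len(x) + 1 entries")
    return [sum(w * v for w, v in zip(row, extended)) for row in W]


# Toy 2-dimensional example (values are illustrative only).
W = [[1.0, 2.0, 0.0],
     [-1.0, 0.0, 0.5]]
converted = apply_speaker_transform(W, [3.0, 4.0])
```

The same function would be called once per frame t for every similar speaker s, each with that speaker's own matrix.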

この処理を各類似話者の全時刻の音声データに対して行う。
以下、Ｆ０パラメータの変換の一例として、以下の処理で行う線形変換手法について説明する。目標話者の音声データから得られる全ての対数Ｆ０値から平均μ_ｈと分散ν_ｈを求める。また類似話者ｓの音声データから得られる全ての対数Ｆ０値から平均μ_ｓと分散ν_ｓを求める。そして、類似話者ｓの変換後の対数Ｆ０値を以下の式により求める。 This process is applied to the speech data of each similar speaker at every time t.
As an example of converting the F0 parameter, a linear conversion method is described below. The mean μ_h and variance ν_h are computed from all log F0 values obtained from the speech data of the target speaker, and the mean μ_s and variance ν_s are computed from all log F0 values obtained from the speech data of similar speaker s. The converted log F0 values of similar speaker s are then obtained by the following equation.

ここで、ｚ_ｔは変換前の類似話者ｓの第ｔ番目の対数Ｆ０値であり、ｙ_ｔは変換後の類似話者ｓの第ｔ番目の対数Ｆ０値である。Ｔは類似話者ｓの対数Ｆ０の全フレーム数であり、ｔ＝１，２，…，Ｔである。 Here, z_t is the t-th log F0 value of similar speaker s before conversion, and y_t is the t-th log F0 value of similar speaker s after conversion. T is the total number of log F0 frames of similar speaker s, and t = 1, 2, ..., T.
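A linear conversion built from these means and variances is commonly written as y_t = sqrt(ν_h / ν_s) (z_t − μ_s) + μ_h, which gives the converted values the target speaker's mean and variance. The sketch below assumes that form; it is a standard mean-variance normalization and may differ in detail from the equation in the publication. The function name and the use of population variance are also assumptions.

```python
import math
from statistics import fmean, pvariance


def convert_log_f0(z, target_log_f0):
    """Map the log F0 values z of a similar speaker onto the target speaker's mean and variance."""
    mu_s, var_s = fmean(z), pvariance(z)
    mu_h, var_h = fmean(target_log_f0), pvariance(target_log_f0)
    if var_s == 0:
        raise ValueError("the source log F0 values have zero variance")
    scale = math.sqrt(var_h / var_s)
    return [scale * (z_t - mu_s) + mu_h for z_t in z]


# Illustrative log F0 values (voiced frames only).
y = convert_log_f0([4.0, 5.0, 6.0], [4.5, 5.0, 5.5])
```

Log F0 is defined only for voiced frames, so unvoiced frames would need to be excluded before computing the statistics.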

全ての類似話者の音声データに対して、同様の処理（ｓ２１３ｂ〜ｓ２１４）を行い（ｓ２１３ａ，ｓ２１３ｄ，ｓ２１３ｅ）、話者単位の変換処理を行った音声データを話者統合部２１５に出力する。 The same processing (s213b to s214) is performed on the speech data of every similar speaker (s213a, s213d, s213e), and the speech data converted speaker by speaker is output to the speaker integration unit 215.

＜話者統合部２１５＞
話者統合部２１５は、類似話者ｓの音声データそのものではなく、話者変換規則φ_ｓを使って変換された音声データを統合して、類似話者音声データベースを構築する（ｓ２１５）。構築方法は、第一実施形態と同様である。
<Speaker integration unit 215>
The speaker integration unit 215 integrates not the speech data of similar speaker s itself but the speech data converted with the speaker conversion rule φ_s, and thereby constructs the similar speaker speech database (s215). The construction method is the same as in the first embodiment.

＜効果＞
このような構成とすることで、第一実施形態に比べ、話者間の音声特徴量の差を軽減し、合成音声の品質劣化を防ぐ。
<Effect>
Compared with the first embodiment, this configuration reduces the differences in speech features between speakers and prevents degradation of the quality of the synthesized speech.

＜変形例＞
話者変換規則学習部２１３において、目標話者の音声データと複数選択した類似話者の音声データのみを用いて、各類似話者の音声データを目標話者の音声データに変換する話者変換規則を学習してもよい。その場合、音声素片付与部２１２は設けなくともよい。
<Modification>
The speaker conversion rule learning unit 213 uses only the target speaker's voice data and a plurality of selected similar speaker's voice data, and converts each similar speaker's voice data into the target speaker's voice data. You may learn the rules. In that case, the speech element providing unit 212 may not be provided.

＜第三実施形態＞
図３を用いて第三実施形態に係る音声合成装置３００を説明する。第二実施形態と異なる部分についてのみ説明する。音声合成装置３００は、多数話者音声データベース構築部１０１と多数話者音声データベース１０３と類似話者音声データベース構築部３１０と類似話者音声データベース１３０と話者類似度記憶部１４０とを備える。類似話者音声データベース構築部３１０の構成が第一実施形態と異なる。
<Third embodiment>
A speech synthesis apparatus 300 according to the third embodiment will be described with reference to FIG. Only parts different from the second embodiment will be described. The speech synthesizer 300 includes a multi-speaker speech database construction unit 101, a multi-speaker speech database 103, a similar speaker speech database construction unit 310, a similar speaker speech database 130, and a speaker similarity storage unit 140. The configuration of the similar speaker voice database construction unit 310 is different from that of the first embodiment.

［ポイント］
第一実施形態及び第二実施形態では、目標話者に類似した複数名の話者から類似話者音声データベース１３０を生成したが、本実施形態では、話者統合部２１５で得られる音声データを基として、さらに音声合成単位毎にモデル適応技術を用いて、音声データを変換することで、目標話者により近いモデルを生成することが可能である。
[Point]
In the first and second embodiments, the similar speaker speech database 130 is generated from a plurality of speakers similar to the target speaker. In this embodiment, the speech data obtained by the speaker integration unit 215 is used as a basis and is further converted with a model adaptation technique for each speech synthesis unit, which makes it possible to generate a model closer to the target speaker.

＜類似話者音声データベース構築部３１０＞
図１５及び図１６を用いて、類似話者音声データベース構築部３１０を説明する。
類似話者音声データベース構築部３１０は、複数の音声データを用いて、目標話者の音声データに類似した音声データからなる類似話者音声データベースを構築する（ｓ３１０）。図１５に示すように類似話者音声データベース構築部３１０は、類似話者選択部１１１と、音声素片付与部２１２と、話者変換規則学習部２１３と話者単位変換部２１４と話者統合部２１５と、さらに、合成単位変換規則学習部３１７と、合成単位変換部３１８を有する。
<Similar speaker speech database construction unit 310>
The similar speaker speech database construction unit 310 will be described with reference to FIGS. 15 and 16.
The similar speaker speech database construction unit 310 uses a plurality of pieces of speech data to construct a similar speaker speech database composed of speech data similar to the speech data of the target speaker (s310). As shown in FIG. 15, the similar speaker speech database construction unit 310 includes the similar speaker selection unit 111, the speech unit assignment unit 212, the speaker conversion rule learning unit 213, the speaker unit conversion unit 214, and the speaker integration unit 215 and, in addition, a synthesis unit conversion rule learning unit 317 and a synthesis unit conversion unit 318.

<Synthesis unit conversion rule learning unit 317>
The synthesis unit conversion rule learning unit 317 uses the speech data of the target speaker and the partial speech data in the first similar speaker speech database 130 to learn, for each identical state, a synthesis unit conversion rule that converts the partial speech data of each similar speaker into partial speech data having the speech features of the target speaker (s317). For example, the synthesis unit conversion rule is learned by the method described in Non-Patent Document 1. Note that the speech data in the first similar speaker speech database 130 is constructed in the same way as the speech data in the similar speaker speech database 130 of the first and second embodiments.

First, using the first similar speaker speech database 130, the synthesis unit conversion rule learning unit 317 trains a statistical model (HMM) for each group of partial speech data having the same state, using the spectral parameters and speech segments of that data. This statistical model is an average statistical model of the similar speakers.
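The pooling idea behind this "average" model can be sketched as follows. This is a deliberately reduced illustration, not the training procedure of the embodiment: it replaces HMM training with a single mean vector per state, and the function name `train_state_means` and the `(state_id, feature_vector)` input layout are assumptions made for the example.

```python
from collections import defaultdict


def train_state_means(units):
    """Pool spectral vectors from all similar speakers by state.

    units: iterable of (state_id, feature_vector) pairs (assumed layout).
    Returns a dict mapping each state_id to the mean feature vector,
    a one-Gaussian stand-in for the per-state average model.
    """
    sums = {}
    counts = defaultdict(int)
    for state_id, vec in units:
        if state_id not in sums:
            sums[state_id] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[state_id][i] += x
        counts[state_id] += 1
    return {s: [v / counts[s] for v in total] for s, total in sums.items()}
```

Because vectors from every similar speaker are pooled before averaging, the resulting parameters describe no single speaker, which is the sense in which the model is an average of the similar speakers.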

Next, the synthesis unit conversion rule learning unit 317 learns a transformation matrix from two inputs: the spectral parameters obtained from the partial speech data of the target speaker, together with the speech segments for that partial speech data, and the statistical model trained using the first similar speaker speech database 130. The transformation matrix converts the spectral parameters and speech segments obtained from the average speech data of all similar speakers into the spectral parameters and speech segments of the target speaker. A transformation matrix is learned for each identical state. The learning method is the same as in the second embodiment. The resulting MLLR transformation matrix is taken as the synthesis unit conversion rule.
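For reference, the usual textbook form of an MLLR mean transform is given below. The text names MLLR but does not write the transform out at this point, so the notation (state or class index $r$, mean vector $\mu_r$ of the average model, matrix $A_r$, bias $b_r$) is assumed here for illustration rather than taken from the specification.

$$\hat{\mu}_r = A_r \mu_r + b_r = W_r \xi_r, \qquad W_r = [\, b_r \;\; A_r \,], \qquad \xi_r = [\, 1,\ \mu_r^{\top} \,]^{\top}$$

Under this form, $W_r$ is estimated so as to maximize the likelihood of the target speaker's data assigned to $r$, and one $W_r$ per state corresponds to one synthesis unit conversion rule.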

The synthesis unit conversion rule is learned for each identical state. However, when only a small amount of target speaker speech data is available, conversion rules cannot be learned for the speech data of every state. In this embodiment, a binary tree whose leaf nodes are the individual speech sounds is therefore created, and a synthesis unit conversion rule is learned at the lowest node at which the amount of target speaker data is equal to or greater than a fixed value. As a result, a synthesis unit conversion rule is available for every state even when the target speaker's speech data is small, because a state with too little data of its own uses the rule learned at a higher node.
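The back-off described above can be sketched as a walk from a leaf toward the root. This is a minimal illustration under assumed names: the `Node` class, `add_frames`, `transform_node`, and the frame-count threshold are hypothetical, and how the tree itself is built is not covered.

```python
class Node:
    """A node of the binary tree; leaves stand for individual states."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.target_frames = 0  # target-speaker frames at or below this node


def add_frames(leaf, n):
    """Credit n target-speaker frames to a leaf and to all its ancestors."""
    node = leaf
    while node is not None:
        node.target_frames += n
        node = node.parent


def transform_node(leaf, min_frames):
    """Return the lowest node on the path to the root with enough data.

    Falls back to the root when no node meets the threshold.
    """
    node = leaf
    while node.parent is not None and node.target_frames < min_frames:
        node = node.parent
    return node
```

A state with plenty of target data gets its own rule, while a state with none shares the rule learned at the nearest ancestor that has enough data.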

<Synthesis unit conversion unit 318>
The synthesis unit conversion unit 318 converts the speech data of each similar speaker according to the synthesis unit conversion rule for each synthesis unit (s318). Specifically, the learned synthesis unit conversion rules are applied to the first similar speaker speech database 130 to obtain the second similar speaker speech database 330. The technique described in Non-Patent Document 1 can be used to apply the synthesis unit conversion rules.
With this configuration, synthesized speech that is even closer to the target speaker than in the second embodiment can be generated.
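Applying per-state rules to a database can be sketched as below, assuming each rule is an affine map (matrix `A`, bias `b`) on a spectral feature vector. The record layout (`state`, `speaker`, `features` keys) and the function names are hypothetical; the technique of Non-Patent Document 1 is not reproduced here.

```python
def apply_affine(A, b, x):
    """Compute A @ x + b for a list-of-rows matrix A and vectors b, x."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


def convert_database(units, rules):
    """Apply the rule of each unit's state to that unit's features.

    units: list of dicts with 'state', 'speaker' and 'features' keys.
    rules: dict mapping a state to its (A, b) pair.
    Returns new records; the input database is left unchanged.
    """
    converted = []
    for unit in units:
        A, b = rules[unit["state"]]
        new_unit = dict(unit)
        new_unit["features"] = apply_affine(A, b, unit["features"])
        converted.append(new_unit)
    return converted
```

Returning a new list mirrors the text, in which the first database 130 is kept and the converted result forms the second database 330. The speaker field is carried over so that the speaker similarity of each unit can still be looked up during selection.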

<Program and storage medium>
The speech synthesis apparatus described above can also be implemented on a computer. In that case, a program for causing the computer to function as the intended apparatus (an apparatus having the functional configuration shown in the figures of the embodiments), or a program for causing the computer to execute each step of the processing procedure shown in the embodiments, is loaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded via a communication line, and then executed.

<Others>
The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, according to the processing capability of the apparatus executing them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

100, 200, 300 Speech synthesis apparatus
101 Multi-speaker speech database construction unit
103 Multi-speaker speech database
110, 210, 310 Similar speaker speech database construction unit
111 Similar speaker selection unit
111a Mixed normal distribution learning unit
111b Multi-speaker mixed normal distribution storage unit
111c Speaker similarity calculation unit
111d Similar speaker extraction unit
115, 215 Speaker integration unit
130 Similar speaker speech database
130 First similar speaker speech database
140 Speaker similarity storage unit
150 Speech synthesis unit
151 Text analysis unit
152 Prosody generation unit
153 Prosody model storage unit
154 Phoneme context conversion unit
155 Speech segment candidate search unit
156 Segment selection unit
157 Segment connection unit
212 Speech segment assignment unit
213 Speaker conversion rule learning unit
214 Speaker unit conversion unit
317 Synthesis unit conversion rule learning unit
318 Synthesis unit conversion unit
330 Second similar speaker speech database

Claims

A speech synthesis method for generating synthesized speech corresponding to a target text and having speech characteristics of a target speaker,
A similar speaker selection step of, with speaker similarity serving as an index of whether two pieces of speech data are similar, using speech data of a plurality of speakers to obtain the speaker similarity between the speech data of each speaker and the speech data of the target speaker, and selecting a plurality of pieces of speech data having high speaker similarity;
A speaker integration step of integrating the selected speech data and constructing a similar speaker speech database comprising partial speech data in synthesis units suitable for assembling synthesized speech, and speech units, which are information given to the partial speech data and indicate at least a similar speaker identifier indicating the speaker who uttered the partial speech data and phoneme information indicating the uttered phoneme of the partial speech data;
A text analysis step of analyzing the target text and obtaining reading information of the target text;
A phoneme context conversion step for converting the reading information into a phoneme context that is a sequence of phonemes;
Based on the phoneme information, a speech unit candidate search step for searching the similar speaker speech database for speech unit candidates that match the phoneme context in a synthesis unit;
A segment selection step of calculating, as a total cost, the degree of matching between the target text in the synthesis unit and the speech unit candidates, using at least the speaker similarity corresponding to the similar speaker identifier of each speech unit candidate, and selecting, as a selected speech segment, the speech unit candidate for which the total cost is best;
A segment connection step of reading partial speech data corresponding to the selected speech segment from the similar speaker speech database and connecting the partial speech data to obtain the synthesized speech;
Speech synthesis method.

A speech synthesis method for generating synthesized speech corresponding to a target text and having speech characteristics of a target speaker,
With speaker similarity serving as an index of whether two pieces of speech data are similar, a similar speaker speech database is stored in advance, the database comprising partial speech data obtained by dividing a plurality of pieces of speech data having high speaker similarity to the speech data of the target speaker into synthesis units suitable for assembling synthesized speech, and speech segments, which are information given to the partial speech data and indicate at least a similar speaker identifier indicating the speaker who uttered the partial speech data and phoneme information indicating the uttered phoneme of the partial speech data;
A text analysis step of analyzing the target text and obtaining reading information of the target text;
A phoneme context conversion step for converting the reading information into a phoneme context that is a sequence of phonemes;
Based on the phoneme information, a speech unit candidate search step for searching the similar speaker speech database for speech unit candidates that match the phoneme context in a synthesis unit;
A segment selection step of calculating, as a total cost, the degree of matching between the target text in the synthesis unit and the speech unit candidates, using at least the speaker similarity corresponding to the similar speaker identifier of each speech unit candidate, and selecting, as a selected speech segment, the speech unit candidate for which the total cost is best;
A segment connection step of reading partial speech data corresponding to the selected speech segment from the similar speaker speech database and connecting the partial speech data to obtain the synthesized speech;
Speech synthesis method.

The speech synthesis method according to claim 1 or 2,
The speaker similarity of the speech data of each similar speaker $s$ is $L_s$, the speaker similarity of the speaker with the highest speaker similarity is $L_1$,
In the segment selection step,
The sum of $K$ sub-costs $C_k$ is defined as the total cost $Q$, and
$$C(L_1, L_s) = (L_1 - L_s)^2$$
is used as at least one of the sub-costs,
Speech synthesis method.

The speech synthesis method according to claim 1 or 2,
The speaker similarity of the speech data of each similar speaker $s$ is $L_s$, the speaker similarity of the speaker with the highest speaker similarity is $L_1$,
In the segment selection step,
The total cost $Q$ is the sum of $K$ sub-costs $C_k$, the weight for each sub-cost is $w_k$, and the total cost is obtained as
$$Q = \sum_{k=1}^{K} w_k C_k,$$
Speech synthesis method.

The speech synthesis method according to claim 1,
A speaker conversion rule learning step of learning, using the speech data of the target speaker and the speech data of the plurality of selected similar speakers, a speaker conversion rule for converting the speech data of each similar speaker into speech data having the speech characteristics of the target speaker; and
A speaker unit conversion step of converting the speech data of each similar speaker according to the speaker conversion rule,
Wherein the speaker integration step integrates the speech data obtained by converting the speech data of the plurality of selected similar speakers, and constructs a similar speaker speech database comprising the partial speech data and the speech segments of the partial speech data,
Speech synthesis method.

The speech synthesis method according to claim 1,
A speaker conversion rule learning step of learning, using the speech data of the target speaker and the speech data of the plurality of selected similar speakers, a speaker conversion rule for converting the speech data of each similar speaker into speech data having the speech characteristics of the target speaker;
A speaker unit conversion step of converting the speech data of each similar speaker according to the speaker conversion rule,
Wherein the speaker integration step integrates the speech data obtained by converting the speech data of the plurality of selected similar speakers, and constructs a similar speaker speech database comprising the partial speech data and the speech segments of the partial speech data;
A synthesis unit conversion rule learning step of learning, using the speech data of the target speaker and the partial speech data of the similar speaker speech database, a synthesis unit conversion rule for converting, for each identical state, the speech data of each similar speaker into speech data having the speech characteristics of the target speaker;
A synthesis unit conversion step of converting the partial speech data of each similar speaker according to a synthesis unit conversion rule for each synthesis unit;
Speech synthesis method.

A speech synthesizer that generates synthesized speech corresponding to a target text and having speech characteristics of a target speaker,
A similar speaker selection unit that, with speaker similarity serving as an index of whether two pieces of speech data are similar, uses speech data of a plurality of speakers to obtain the speaker similarity between the speech data of each speaker and the speech data of the target speaker, and selects a plurality of pieces of speech data having high speaker similarity;
A speaker integration unit that integrates the selected speech data and constructs a similar speaker speech database comprising partial speech data in synthesis units suitable for assembling synthesized speech, and speech units, which are information given to the partial speech data and indicate at least a similar speaker identifier indicating the speaker who uttered the partial speech data and phoneme information indicating the uttered phoneme of the partial speech data;
A speaker similarity storage unit for storing the similar speaker identifier and the speaker similarity corresponding to the similar speaker identifier;
A text analysis unit that analyzes the target text and obtains reading information of the target text;
A phoneme context conversion unit that converts the reading information into a phoneme context that is a sequence of phonemes;
Based on the phoneme information, a speech unit candidate search unit that searches the similar speaker speech database for speech unit candidates that match the phoneme context in a synthesis unit;
A speech segment selection unit that calculates, as a total cost, the degree of matching between the target text in the synthesis unit and the speech unit candidates, using at least the speaker similarity corresponding to the similar speaker identifier of each speech unit candidate, and selects, as a selected speech segment, the speech unit candidate for which the total cost is best;
A segment connection unit that reads partial speech data corresponding to the selected speech segment from the similar speaker speech database and connects the partial speech data to obtain the synthesized speech;
Speech synthesizer.

A speech synthesizer that generates synthesized speech corresponding to a target text and having speech characteristics of a target speaker,
A similar speaker speech database comprising partial speech data in synthesis units suitable for assembling synthesized speech, and speech segments, which are information given to the partial speech data and indicate at least a similar speaker identifier indicating the speaker who uttered the partial speech data and phoneme information indicating the uttered phoneme of the partial speech data;
A speaker similarity storage unit for storing the similar speaker identifier and the speaker similarity corresponding to the similar speaker identifier;
A text analysis unit that analyzes the target text and obtains reading information of the target text;
A phoneme context conversion unit that converts the reading information into a phoneme context that is a sequence of phonemes;
Based on the phoneme information, a speech unit candidate search unit that searches the similar speaker speech database for speech unit candidates that match the phoneme context in a synthesis unit;
A speech segment selection unit that calculates, as a total cost, the degree of matching between the target text in the synthesis unit and the speech unit candidates, using at least the speaker similarity corresponding to the similar speaker identifier of each speech unit candidate, and selects, as a selected speech segment, the speech unit candidate for which the total cost is best;
A segment connection unit that reads partial speech data corresponding to the selected speech segment from the similar speaker speech database and connects the partial speech data to obtain the synthesized speech;
Speech synthesizer.

A speech synthesis program for causing a computer to execute the speech synthesis method according to claim 1.