JP5366919B2

JP5366919B2 - Speech synthesis method, apparatus, and program

Info

Publication number: JP5366919B2
Application number: JP2010272560A
Authority: JP
Inventors: 光昭磯貝; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-12-07
Filing date: 2010-12-07
Publication date: 2013-12-11
Anticipated expiration: 2030-12-07
Also published as: JP2012123096A

Abstract

PROBLEM TO BE SOLVED: To maintain the continuity of intonation of synthesized speech and to select candidate segments that yield high-quality synthesized speech, even for targets in which voiced sounds are connected to each other via unvoiced sounds.

SOLUTION: A speech synthesis method includes: a preceding voiced candidate segment search sub-step of, when the i-th target (i being a natural number of 3 or more) is a voiced sound and the (i-1)-th target is an unvoiced sound, searching for the candidate segment selected as suitable for the (i-k)-th target, where k is the smallest value of 2 or more for which the (i-k)-th target is a voiced sound (hereinafter, the (i-k)-th candidate segment); a connection F0 sub-cost calculation sub-step of calculating a connection F0 sub-cost from the F0 value at the end position of the (i-k)-th candidate segment and the F0 values at the start positions of the respective candidate segments of the i-th candidate segment group; and a selection sub-step of selecting, based on the connection F0 sub-cost, one candidate segment suitable for the i-th target from the i-th candidate segment group.

COPYRIGHT: (C)2012, JPO&INPIT

Description

The present invention relates to a speech synthesis method, apparatus, and program that are used, for example, in the speech waveform segment selection process of waveform-concatenation speech synthesis and that enable speech synthesis with high-quality intonation.

In the field of speech synthesis in recent years, speech synthesis methods and speech synthesizers have been proposed in which a large amount of natural speech data, from several tens of minutes to several tens of hours, is stored in a large-capacity storage device as a speech waveform database; speech waveforms of appropriate length are cut out of the database according to the input text and suitable criteria to form speech waveform segments; and these segments are concatenated to create synthesized speech (Patent Document 1). A conventional speech synthesizer 90 is described in detail below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the conventional speech synthesizer 90. FIG. 2 is a diagram illustrating the speech information index 952 provided in the conventional speech synthesizer 90.

The conventional speech synthesizer 90 includes a text analysis unit 91, a prosody generation unit 92, a candidate segment selection unit 93, a speech synthesis unit 94, and a speech waveform database 95. The speech waveform database 95 includes speech waveform data 951 and a speech information index 952. The speech synthesizer 90 receives text information as input and outputs synthesized speech. The speech waveform database 95 is implemented on a storage medium such as a hard disk and stores the speech waveform data 951 and the speech information index 952. The speech waveform data 951 stores speech waveform segments obtained by applying AD conversion to speech data of words and sentences read aloud. Because these stored speech waveform segments are the candidates for speech synthesis, they are hereinafter called candidate segments, and two or more candidate segments taken together are called a candidate segment group.

The speech information index 952 is a table of entries, one per phoneme, as illustrated in FIG. 2. Each entry corresponds to the speech waveform data 951 and holds a segment number, which is the serial number of the candidate segment; a phoneme label, which classifies the phoneme of the candidate segment; a phoneme duration (ms), which gives the duration of the candidate segment; F0 pattern information (Hz), which represents the time course of the pitch of the candidate segment; and a segment data position, which gives the storage location of the candidate segment within the speech waveform data 951. Specifically, the segment data position indicates, for example, a memory address on the hard disk that stores the speech waveform data 951.

For example, the candidate segment with segment number 1 is data classified under the phoneme label "a". In other words, it is one of the candidate segments classified under the phoneme label "a". Its phoneme duration is 85 (ms), and its F0 changes over time as 300 → 302 → 303 → ... → 301 (Hz). It is stored at segment data position 0, that is, at memory address 0 of the hard disk that stores the speech waveform data 951. The candidate segment with segment number 2 is one of the candidate segments classified under the phoneme label "s", and the phoneme "s" is an unvoiced sound produced without vocal cord vibration. Because vocal cord vibration stops while an unvoiced sound is uttered, an unvoiced sound has no fundamental frequency (F0). For this reason, the F0 pattern information (Hz) of an unvoiced entry such as segment number 2 stores a value, for example -1, that signifies that no F0 information exists.
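The index layout described above can be pictured with a small data structure. The following is only an illustrative sketch: the names `IndexEntry`, `candidate_group`, and `NO_F0` are assumptions, the F0 pattern of segment 1 keeps only the four values quoted in the text (the elided middle values are not filled in), and the duration and data position given for segment 2 are hypothetical placeholders that the text does not state.

```python
from dataclasses import dataclass
from typing import List, Tuple

NO_F0 = -1.0  # value stored when an entry has no F0 information (unvoiced sound)


@dataclass(frozen=True)
class IndexEntry:
    """One phoneme-level entry of the speech information index (illustrative)."""
    segment_number: int
    phoneme_label: str
    duration_ms: float
    f0_pattern_hz: Tuple[float, ...]
    data_position: int

    @property
    def is_voiced(self) -> bool:
        # An entry counts as voiced if any stored F0 value is a real frequency.
        return any(f0 != NO_F0 for f0 in self.f0_pattern_hz)


def candidate_group(index: List[IndexEntry], label: str) -> List[IndexEntry]:
    """Return all entries whose phoneme label matches the requested label."""
    return [entry for entry in index if entry.phoneme_label == label]


speech_information_index = [
    # Segment 1: values quoted in the text (middle of the F0 pattern is elided there).
    IndexEntry(1, "a", 85.0, (300.0, 302.0, 303.0, 301.0), 0),
    # Segment 2: unvoiced "s"; duration and data position are hypothetical.
    IndexEntry(2, "s", 60.0, (NO_F0,), 85),
]

print([e.segment_number for e in candidate_group(speech_information_index, "a")])
```

Storing a sentinel such as -1 keeps every entry the same shape, at the price that every consumer of the F0 pattern has to check for it before doing arithmetic.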

Next, the speech synthesis operation performed by the conventional speech synthesizer 90 is described in detail with reference to FIG. 3, together with FIGS. 1 and 2 described above. FIG. 3 is a flowchart showing the operation of the conventional speech synthesizer 90. The description assumes that the text to be synthesized has been input to the speech synthesizer 90. The text analysis unit 91 acquires the text input to the speech synthesizer 90, performs morphological analysis on it, generates a phoneme string and an accent type from the analysis result, and outputs the phoneme string and accent type to the prosody generation unit 92 (S91). The prosody generation unit 92 acquires the phoneme string and accent type output by the text analysis unit 91, estimates an F0 pattern and a phoneme duration for each phoneme, numbers the estimated F0 pattern and phoneme duration for each phoneme, and outputs them as targets to the candidate segment selection unit 93 (S92). The candidate segment selection unit 93 acquires the targets output by the prosody generation unit 92, selects from the speech information index 952 a combination of candidate segments that has small distortion with respect to these targets and small connection distortion when the candidate segments are concatenated, and outputs the segment numbers of the selected candidate segments to the speech synthesis unit 94 (S93). The distance measure that defines these distortions is called the cost. The candidate segment selection unit 93 determines the combination of candidate segments that minimizes the cost, using dynamic programming or a similar method. The cost calculation performed by the candidate segment selection unit 93 is described in detail later. The speech synthesis unit 94 acquires the segment numbers output by the candidate segment selection unit 93, reads the corresponding candidate segments from the speech waveform data 951, concatenates them to generate speech, and outputs the result as synthesized speech (S94).

Next, the cost calculated by the candidate segment selection unit 93 is described concretely. Let t(j) denote the target of the j-th phoneme (j is an integer of 1 or more) in the phoneme string output by the text analysis unit 91. As described above, a target consists of F0 pattern information and phoneme duration information for each phoneme. Among the speech waveform segments stored in the speech waveform data 951, the candidate segment group corresponding to the target t(j) is denoted U(j). The candidate segment group U(j) represents all entries stored in the speech information index 952 whose phoneme label matches that of the target t(j). An arbitrary candidate segment in U(j) is denoted u(j) and is used in the following description. The distance measure representing the distortion between t(j) and u(j) is the target cost Ct(t(j),u(j)), defined as a weighted sum of the sub-costs described below:

Ct(t(j),u(j)) = Wtf・Stf(t(j),u(j)) + Wtdur・Stdur(t(j),u(j))

The connection cost representing the connection distortion between u(j-1) and u(j) is Cc(u(j-1),u(j)), likewise defined as a weighted sum of the sub-costs described below:

Cc(u(j-1),u(j)) = Wcf・Scf(u(j-1),u(j)) + Wcenv・Scenv(u(j-1),u(j))

Here Wtf is the weight for Stf(t(j),u(j)), Wtdur is the weight for Stdur(t(j),u(j)), Wcf is the weight for Scf(u(j-1),u(j)), and Wcenv is the weight for Scenv(u(j-1),u(j)).

Stf(t(j),u(j)) represents the distortion of the F0 pattern between the target t(j) and the candidate segment u(j). With Ft(t(j)) the F0 pattern of t(j) and Fu(u(j)) the F0 pattern of u(j), it is the square of their difference:

Stf(t(j),u(j)) = {Ft(t(j)) - Fu(u(j))}²

This is hereinafter called the target F0 sub-cost. When the candidate segment is an unvoiced sound, it has no F0 pattern and the target F0 sub-cost cannot be obtained, so Stf(t(j),u(j)) is set to a constant value (for example, 0).

Stdur(t(j),u(j)) represents the distortion of the duration between the target t(j) and the candidate segment u(j). With DURt(t(j)) the duration of t(j) and DURu(u(j)) the duration of u(j), it is the square of their difference:

Stdur(t(j),u(j)) = {DURt(t(j)) - DURu(u(j))}²

This is hereinafter called the target duration sub-cost.

Scf(u(j-1),u(j)) represents the F0 distortion at the connection point between the candidate segment u(j) and the preceding candidate segment u(j-1). With FSu(u(j)) the F0 at the start point of u(j) and FEu(u(j-1)) the F0 at the end point of u(j-1), it is the square of their difference:

Scf(u(j-1),u(j)) = {FSu(u(j)) - FEu(u(j-1))}²

This is hereinafter called the connection F0 sub-cost. When either or both of u(j-1) and u(j) are unvoiced sounds, no F0 value exists and the connection F0 sub-cost cannot be obtained, so Scf(u(j-1),u(j)) is set to a constant value (for example, 0).

Scenv(u(j-1),u(j)) represents the difference in the surrounding phoneme environment between the candidate segment u(j) and the preceding candidate segment u(j-1). It is defined from the acoustic similarity between the target t(j) and the phoneme label that follows u(j-1) in the speech information index 952, and the acoustic similarity between the target t(j-1) and the phoneme label that precedes u(j) in the speech information index 952. This is hereinafter called the connection phoneme environment sub-cost. The higher these two acoustic similarities, the smaller the sub-cost. For example, if the phoneme label that follows u(j-1) in the speech information index 952 matches t(j), and the phoneme label that precedes u(j) in the speech information index 952 matches t(j-1), then Scenv(u(j-1),u(j)) = 0.

Among these sub-costs, Stf(t(j),u(j)) and Stdur(t(j),u(j)) measure how far the F0 pattern and phoneme duration of the candidate segments deviate from the targets estimated by the prosody generation unit 92, whereas Scf(u(j-1),u(j)) and Scenv(u(j-1),u(j)) measure the differences in F0 and phoneme environment between candidate segments. The F0 pattern and duration of u(j) needed to calculate the sub-costs can be obtained from the speech information index 952. When the candidate segment group U(j) contains two or more candidate segments u(j) (that is, when there are two or more candidates for the same phoneme), the above calculation is repeated for each candidate segment. With the total cost C for the entire sentence to be synthesized defined by accumulating the target costs and connection costs over the sentence (where N is the number of phonemes in the sentence to be synthesized), the optimal candidate segments for the targets are determined by finding the combination of candidate segments that minimizes C, for example by dynamic programming.
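The sub-costs and weighted sums defined above can be written out as a short sketch. Several points here are assumptions rather than part of the description: the text does not say how the difference between two F0 patterns is reduced to a single number, so this sketch compares their mean values; the function names and default weights of 1.0 are illustrative; and the phoneme environment sub-cost is passed in as a precomputed number because its acoustic similarity measure is not specified.

```python
from statistics import mean
from typing import Sequence

NO_F0 = -1.0            # stored value meaning "no F0 information" (unvoiced)
CONSTANT_SUBCOST = 0.0  # constant used when an F0 sub-cost cannot be computed


def is_voiced(f0_pattern: Sequence[float]) -> bool:
    return bool(f0_pattern) and all(v != NO_F0 for v in f0_pattern)


def s_tf(target_f0: Sequence[float], cand_f0: Sequence[float]) -> float:
    """Target F0 sub-cost. Assumption: patterns are compared by their means."""
    if not is_voiced(cand_f0):
        return CONSTANT_SUBCOST
    return (mean(target_f0) - mean(cand_f0)) ** 2


def s_tdur(target_dur_ms: float, cand_dur_ms: float) -> float:
    """Target duration sub-cost: squared duration difference."""
    return (target_dur_ms - cand_dur_ms) ** 2


def s_cf(prev_f0: Sequence[float], cur_f0: Sequence[float]) -> float:
    """Connection F0 sub-cost: squared gap between the start F0 of the
    current segment and the end F0 of the preceding segment."""
    if not (is_voiced(prev_f0) and is_voiced(cur_f0)):
        return CONSTANT_SUBCOST
    return (cur_f0[0] - prev_f0[-1]) ** 2


def target_cost(target_f0, target_dur_ms, cand_f0, cand_dur_ms,
                w_tf: float = 1.0, w_tdur: float = 1.0) -> float:
    return (w_tf * s_tf(target_f0, cand_f0)
            + w_tdur * s_tdur(target_dur_ms, cand_dur_ms))


def connection_cost(prev_f0, cur_f0, s_cenv: float,
                    w_cf: float = 1.0, w_cenv: float = 1.0) -> float:
    return w_cf * s_cf(prev_f0, cur_f0) + w_cenv * s_cenv


print(target_cost([200.0, 200.0], 80.0, [210.0, 210.0], 85.0))
print(connection_cost([300.0, 301.0], [305.0, 306.0], s_cenv=0.0))
print(connection_cost([NO_F0], [305.0, 306.0], s_cenv=0.0))
```

The last call shows the behavior the description singles out: as soon as one side of the junction is unvoiced, the connection F0 term contributes nothing, whatever the F0 of the voiced side.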

Next, the operation performed by the candidate segment selection unit 93 is described in detail with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing details of the candidate segment selection unit 93 provided in the conventional speech synthesizer 90. FIG. 5 is a flowchart showing the operation of the candidate segment selection unit 93. The candidate segment selection unit 93 includes target F0 sub-cost calculation means 931, target duration sub-cost calculation means 932, connection F0 sub-cost calculation means 934, connection phoneme environment sub-cost calculation means 935, search hypothesis expansion means 936, and selection means 937.

The target F0 sub-cost calculation means 931 calculates the target F0 sub-cost using the F0 pattern of the j-th target t(j) and the F0 patterns of the j-th candidate segment group U(j) (S931). The target duration sub-cost calculation means 932 calculates the target duration sub-cost using the duration of the j-th target t(j) and the durations of the j-th candidate segment group U(j) (S932). Here, the search hypothesis group corresponding to the target t(j) is denoted H(j), and an arbitrary search hypothesis in H(j) is denoted h(j). The connection F0 sub-cost calculation means 934 calculates the connection F0 sub-cost using the end-point F0 of the candidate segment u(j-1) of the search hypothesis group H(j-1) and the start-point F0 of the j-th candidate segment group U(j) (S934). The connection phoneme environment sub-cost calculation means 935 calculates, as the connection phoneme environment sub-cost, the acoustic similarity between the candidate segment u(j-1) of the search hypothesis group H(j-1) and the j-th candidate segment group U(j) (S935). Next, the search hypothesis expansion means 936 appends the candidate segment u(j) to the one search hypothesis h(j-1) that would have the lowest cost if the sub-costs calculated above were added to each hypothesis h(j-1) in the search hypothesis group H(j-1), and takes the result as a new search hypothesis h(j) (S936).

Steps S931 to S936 are repeated in this way for each candidate segment u(j) of the candidate segment group U(j), so that the sub-costs are calculated and the search hypotheses expanded (S93b, S93c). Steps S931 to S936 are further repeated for each target (S93b, S93c), and the search hypothesis group corresponding to the candidate segment group of each target is expanded (S93a, S93d). Next, the selection means 937 refers to the expanded search hypothesis groups and outputs to the speech synthesis unit 94 the segment numbers of the candidate segments on the path of the search hypothesis with the lowest final cost. The operation of the speech synthesis unit 94 after acquiring the segment numbers is as described above. In this way, the speech synthesizer 90 selects the optimal candidate segment for each phoneme target generated from the input text and concatenates the selected candidate segments to generate synthesized speech corresponding to the input text.
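The hypothesis expansion of steps S931 to S936 amounts to a dynamic-programming (Viterbi-style) search, which can be sketched as follows. This is a simplified illustration under assumptions, not the patented implementation: the function `select_segments` and its callback signatures are hypothetical, and each candidate is reduced to a single F0 number so that the toy costs stay readable.

```python
from typing import Callable, List, Sequence, Tuple


def select_segments(
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[int, float], float],
    connection_cost: Callable[[float, float], float],
) -> Tuple[float, List[float]]:
    """Return (total cost, path) of the cheapest candidate sequence.

    candidates[j] plays the role of the candidate group U(j); each kept
    (cost, path) pair plays the role of one search hypothesis h(j).
    """
    # One initial hypothesis per candidate of the first target.
    hypotheses = [(target_cost(0, u), [u]) for u in candidates[0]]
    for j in range(1, len(candidates)):
        expanded = []
        for u in candidates[j]:
            # Pick the previous hypothesis that is cheapest once the
            # connection cost to u is added, then extend it with u.
            best_cost, best_path = min(
                ((cost + connection_cost(path[-1], u), path)
                 for cost, path in hypotheses),
                key=lambda pair: pair[0],
            )
            expanded.append((best_cost + target_cost(j, u), best_path + [u]))
        hypotheses = expanded
    # Final selection: the hypothesis with the lowest accumulated cost.
    return min(hypotheses, key=lambda pair: pair[0])


# Toy data (hypothetical): one F0 value per target and per candidate.
targets = [100.0, 110.0, 120.0]
groups = [[100.0], [105.0, 130.0], [120.0]]

cost, path = select_segments(
    groups,
    target_cost=lambda j, u: (targets[j] - u) ** 2,
    connection_cost=lambda prev, cur: (cur - prev) ** 2,
)
print(cost, path)
```

Keeping one hypothesis per candidate of the current target is what keeps the search linear in the number of targets instead of enumerating every combination of candidates.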

Patent Document 1: Japanese Patent No. 2761552

In the connection F0 sub-cost calculation means 934, when either or both of two adjacent candidate segments are unvoiced sounds, the connection F0 sub-cost cannot be calculated because unvoiced candidate segments have no F0 pattern. In this case, as described above, the connection F0 sub-cost is treated as a constant value, for example 0. Consequently, the F0 distance between candidate segments is not evaluated around an unvoiced phoneme, such as the second phoneme /S/ in the phoneme string /A/-/S/-/U/, and the continuity of intonation of the candidate segments selected by the selection means 937 is not necessarily guaranteed. A typical example is described concretely with reference to FIG. 6. FIG. 6 is a diagram illustrating candidate segments selected by the candidate segment selection unit 93 of the conventional speech synthesizer 90. In the graph of FIG. 6, the horizontal axis is time (ms) and the vertical axis is frequency (Hz). The curve drawn with a broken line in FIG. 6 is the target F0 pattern generated by the prosody generation unit 92.

The target F0 pattern is divided into phoneme intervals and compared with the candidate segments. In FIG. 6, the portions of the broken-line curve delimited by phoneme are called target 31, target 32, target 33, target 35, and target 36. Specifically, the portion of the target F0 pattern delimited by the phoneme /E/ is target 31, by /S/ is target 32, by /A/ is target 33, by /K/ is target 35, and by /I/ is target 36. The candidate segment corresponding to target 31 is candidate segment 21. The candidate segment group corresponding to target 33 consists of two candidate segments, candidate segment 23 and candidate segment 24. The candidate segment corresponding to target 36 is candidate segment 26. The candidate segments for target 32 and target 35 are not shown.

Here, the phoneme environments of candidate segment 23 and candidate segment 24 are both assumed to be /S/ as the preceding environment and /K/ as the following environment, so that the connection phoneme environment sub-cost is 0. The phoneme durations of candidate segment 23 and candidate segment 24 are also assumed to be equal, so that their target duration sub-costs are equal. In this case, the selectable combinations of candidate segments (speech waveform segment sequences) are combination A, candidate segment 21 → candidate segment 23 → candidate segment 26, and combination B, candidate segment 21 → candidate segment 24 → candidate segment 26. Combination A preserves the continuity of the overall F0 pattern better, so natural intonation can be expected from it. However, the target F0 sub-cost calculation means 931 described above calculates a smaller target F0 sub-cost for candidate segment 24 than for candidate segment 23. In addition, because the phoneme preceding both candidate segment 23 and candidate segment 24 is the unvoiced /S/, the connection F0 sub-cost cannot be calculated and its value is 0.

As described above, the connection phoneme environment sub-costs and target duration sub-costs of candidate segment 23 and candidate segment 24 are equal. Therefore, when the sums of target cost and connection cost are compared, candidate segment 24 has the smaller sum, and candidate segment 24, from which natural intonation cannot be expected, is selected. An object of the present invention is to provide a speech synthesizer that can select candidate segments so that the continuity of intonation of the synthesized speech is maintained and the synthesized speech is of high quality, even for targets in which voiced sounds are connected via unvoiced sounds as in FIG. 6.
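The failure case of FIG. 6 can be reproduced with a few lines of arithmetic. All F0 numbers below are hypothetical, chosen only to mirror the situation described (candidate segment 24 lies closer to the target, candidate segment 23 fits its neighbors better); the figure's actual values are not given in the text, and the helper name is an assumption.

```python
NO_F0 = -1.0  # stored value meaning "no F0 information" (unvoiced)


def conventional_connection_f0_subcost(prev_end_f0: float, cur_start_f0: float) -> float:
    """Conventional rule: the sub-cost is a constant 0 if either side is unvoiced."""
    if prev_end_f0 == NO_F0 or cur_start_f0 == NO_F0:
        return 0.0
    return (cur_start_f0 - prev_end_f0) ** 2


# Hypothetical values for target 33 (/A/) and its two candidates.
target_33_mean_f0 = 200.0
candidates = {
    23: {"mean_f0": 220.0, "start_f0": 221.0},  # continuous with its neighbors
    24: {"mean_f0": 205.0, "start_f0": 200.0},  # closer to the target curve
}
preceding_s_end_f0 = NO_F0  # the preceding phoneme /S/ is unvoiced

totals = {}
for number, seg in candidates.items():
    target_f0_subcost = (target_33_mean_f0 - seg["mean_f0"]) ** 2
    connection = conventional_connection_f0_subcost(preceding_s_end_f0, seg["start_f0"])
    # Duration and phoneme environment sub-costs are equal for both
    # candidates in the example, so they are left out of the comparison.
    totals[number] = target_f0_subcost + connection

chosen = min(totals, key=totals.get)
print(totals, chosen)
```

Because the connection term is 0 for both candidates, the comparison is decided by the target F0 sub-cost alone, and candidate segment 24 wins regardless of how well it joins candidate segments 21 and 26.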

The speech synthesizer of the present invention selects, from a plurality of candidate segments (a candidate segment group) stored in a speech waveform database, candidate segments suited to synthesized-speech targets numbered for each phoneme label, and concatenates the selected candidate segments to generate synthesized speech. The speech synthesizer of the present invention comprises at least preceding voiced candidate segment search means, connection F0 sub-cost calculation means, and selection means. When the i-th target (i is a natural number of 3 or more) is a voiced sound and the (i-1)-th target is an unvoiced sound, the preceding voiced candidate segment search means searches for the candidate segment that has been selected as suitable for the (i-k)-th target, where k is the smallest value of 2 or more for which the (i-k)-th target is a voiced sound (hereinafter, the (i-k)-th candidate segment). The connection F0 sub-cost calculation means calculates the connection F0 sub-cost from the F0 value at the end position of the (i-k)-th candidate segment and the F0 values at the start positions of the respective candidate segments in the candidate segment group corresponding to the i-th target (hereinafter, the i-th candidate segment group). The selection means selects, based on the connection F0 sub-cost, one candidate segment suitable for the i-th target from the i-th candidate segment group and takes it as the i-th candidate segment.
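The three means described above can be sketched as three small functions. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, list indices are 0-based (so the condition "i is 3 or more" becomes index 2 or more), the squared-difference form is carried over from the conventional connection F0 sub-cost, and all F0 numbers are invented to mirror the FIG. 6 situation.

```python
from typing import List, Optional, Sequence


def find_preceding_voiced_offset(voiced: Sequence[bool], i: int) -> Optional[int]:
    """Return the smallest k >= 2 such that target i-k is voiced, provided
    target i is voiced and target i-1 is unvoiced; otherwise return None."""
    if i < 2 or not voiced[i] or voiced[i - 1]:
        return None
    for k in range(2, i + 1):
        if voiced[i - k]:
            return k
    return None


def bridged_connection_f0_subcosts(prev_end_f0: float,
                                   cand_start_f0s: Sequence[float]) -> List[float]:
    """Connection F0 sub-cost of every candidate of target i, measured against
    the end F0 of the candidate already chosen for the voiced target i-k."""
    return [(start - prev_end_f0) ** 2 for start in cand_start_f0s]


def select_candidate(other_costs: Sequence[float], bridged: Sequence[float],
                     w_cf: float = 1.0) -> int:
    """Return the index of the candidate with the lowest combined cost."""
    totals = [c + w_cf * b for c, b in zip(other_costs, bridged)]
    return min(range(len(totals)), key=totals.__getitem__)


# Targets /E/ /S/ /A/ /K/ /I/; True marks voiced targets.
voiced = [True, False, True, False, True]
k = find_preceding_voiced_offset(voiced, 2)  # target /A/ looks back past /S/

end_f0_of_segment_21 = 222.0   # hypothetical end F0 of the candidate chosen for /E/
start_f0s = [221.0, 200.0]     # hypothetical start F0 of candidate segments 23 and 24
target_f0_subcosts = [400.0, 25.0]  # hypothetical; candidate 24 is closer to the target

bridged = bridged_connection_f0_subcosts(end_f0_of_segment_21, start_f0s)
winner = select_candidate(target_f0_subcosts, bridged)
print(k, bridged, ["segment 23", "segment 24"][winner])
```

With these numbers the bridged connection term penalizes the candidate that jumps away from the preceding voiced segment, so candidate segment 23 is preferred even though its target F0 sub-cost is larger. How strongly it counts depends on the weight, shown here as `w_cf`.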

According to the speech synthesizer of the present invention, candidate segments can be selected so that the continuity of intonation of the synthesized speech is maintained and the synthesized speech is of high quality, even for targets in which voiced sounds are connected via unvoiced sounds.

FIG. 1 is a block diagram showing the configuration of a conventional speech synthesizer.
FIG. 2 is a diagram illustrating the speech information index provided in the conventional speech synthesizer.
FIG. 3 is a flowchart showing the operation of the conventional speech synthesizer.
FIG. 4 is a block diagram showing details of the candidate segment selection unit provided in the conventional speech synthesizer.
FIG. 5 is a flowchart showing the operation of the candidate segment selection unit provided in the conventional speech synthesizer.
FIG. 6 is a diagram illustrating candidate segments selected by the candidate segment selection unit of the conventional speech synthesizer.
FIG. 7 is a block diagram showing a configuration example of the speech synthesizer according to Example 1.
FIG. 8 is a block diagram showing details of the candidate segment selection unit provided in the speech synthesizer according to Example 1.
FIG. 9 is a flowchart showing the operation of the candidate segment selection unit provided in the speech synthesizer according to Example 1.
FIG. 10 is a diagram illustrating candidate segments selected by the candidate segment selection unit of the speech synthesizer according to Example 1.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numbers, and duplicate description is omitted.

The speech synthesizer 10 according to Embodiment 1 is described in detail with reference to FIG. 7. FIG. 7 is a block diagram showing a configuration example of the speech synthesizer 10 according to this embodiment. The speech synthesizer 10 of this embodiment includes a text analysis unit 91, a prosody generation unit 92, a candidate segment selection unit 13, a speech synthesis unit 94, and a speech waveform database 95. The speech waveform database 95 includes speech waveform data 951 and a speech information index 952. The candidate segment selection unit 13 (drawn with a thick frame in the figure) carries a different reference numeral from the conventional speech synthesizer 90. All other components operate exactly as in the conventional speech synthesizer 90, so their description is omitted.

Next, the candidate segment selection unit 13, which differs from the prior art, is described in detail with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing details of the candidate segment selection unit 13 included in the speech synthesizer 10 according to this embodiment. FIG. 9 is a flowchart showing the operation of the candidate segment selection unit 13 included in the speech synthesizer 10 according to this embodiment. The candidate segment selection unit 13 includes target F0 sub-cost calculation means 931, target duration sub-cost calculation means 932, preceding voiced candidate segment search means 133, connection F0 sub-cost calculation means 134, connection phoneme environment sub-cost calculation means 935, search hypothesis expansion means 936, and selection means 937. The preceding voiced candidate segment search means 133 and the connection F0 sub-cost calculation means 134 (drawn with thick frames in the figure) carry different reference numerals from the conventional speech synthesizer 90. All other components operate exactly as in the conventional speech synthesizer 90, so their description is omitted.

When the i-th target (i being a natural number of 3 or more) is a voiced sound and the (i-1)-th target is an unvoiced sound (S13aY), the preceding voiced candidate segment search means 133 searches for the candidate segment that has been selected as suited to the (i-k)-th target, where k is the smallest value of 2 or more for which the (i-k)-th target is a voiced sound (the (i-k)-th candidate segment). It does so by tracing the path of the search hypothesis of the (i-1)-th candidate segment back toward t(1) (S133). When a preceding voiced candidate segment exists (S13bY), the connection F0 sub-cost calculation means 134 calculates the connection F0 sub-cost from the F0 value at the end position of the (i-k)-th candidate segment and the group of F0 values at the start positions of the candidate segments in the candidate segment group corresponding to the i-th target (the i-th candidate segment group) (S134).

When the condition of step S13a is not satisfied (for example, when u(i) is an unvoiced sound, or when u(i) is a voiced sound but the preceding phoneme is also a voiced sound; S13aN), processing proceeds to step S134. There, as in the conventional speech synthesizer 90, the connection F0 sub-cost is calculated from the F0 value at the end position of the (i-1)-th candidate segment and the group of F0 values at the start positions of the candidate segments in the i-th candidate segment group (S134). Likewise, when the condition of step S13b is not satisfied (for example, when the search finds that the unvoiced sound is at the beginning of the sentence and no voiced sound precedes it; S13bN), the connection F0 sub-cost is calculated as in the prior art from the F0 value at the end position of the (i-1)-th candidate segment and the group of F0 values at the start positions of the candidate segments in the i-th candidate segment group (S134).
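The branching among steps S13a, S133, S13b, and S134 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Segment` class, the function names, and the use of an absolute F0 difference as the sub-cost are all assumptions, since the description here does not specify the form of the cost function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One candidate segment. All field names are hypothetical."""
    phoneme: str
    voiced: bool
    f0_start: float = 0.0  # F0 (Hz) at the start position
    f0_end: float = 0.0    # F0 (Hz) at the end position


def find_preceding_voiced(path: List[Segment]) -> Optional[Segment]:
    """S133: trace the hypothesis path back toward t(1), starting at the
    (i-2)-th segment, and return the nearest voiced segment (smallest k >= 2)."""
    for seg in reversed(path[:-1]):
        if seg.voiced:
            return seg
    return None  # S13bN: only unvoiced segments precede the i-th target


def reference_segment(path: List[Segment], target_voiced: bool) -> Segment:
    """Pick the segment whose end-position F0 is compared with the candidates."""
    prev = path[-1]  # the (i-1)-th candidate segment
    if target_voiced and not prev.voiced:      # S13aY
        found = find_preceding_voiced(path)
        if found is not None:                  # S13bY
            return found
    return prev  # S13aN or S13bN: same reference as the conventional method


def connection_f0_subcosts(path: List[Segment], target_voiced: bool,
                           candidates: List[Segment]) -> List[float]:
    """S134: one sub-cost per candidate in the i-th candidate segment group.
    The absolute difference is an assumed cost form."""
    ref = reference_segment(path, target_voiced)
    return [abs(ref.f0_end - c.f0_start) for c in candidates]


# Usage with made-up values: /E/ (voiced), /S/ (unvoiced), then two /A/ candidates.
path = [Segment("E", True, 210.0, 200.0), Segment("S", False)]
group = [Segment("A", True, 195.0, 180.0), Segment("A", True, 150.0, 140.0)]
costs = connection_f0_subcosts(path, True, group)
```

Here `path` holds the segments already chosen for targets 1 to i-1 along one search hypothesis, so the last element is the (i-1)-th candidate segment.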

Next, with reference to FIG. 10, the example of FIG. 6 is used again to describe concretely a typical case in which the speech synthesizer of this embodiment performs better than the conventional speech synthesizer 90. FIG. 10 is a diagram illustrating candidate segments selected by the candidate segment selection unit 13 of the speech synthesizer 10 according to this embodiment. It differs from FIG. 6 only in that the boundary between phoneme /E/ and phoneme /S/ is marked as position L, the boundary between phoneme /S/ and phoneme /A/ is marked as position T, the end of candidate segment 21 is marked as *21a, the start of candidate segment 23 is marked as ○23a, and the start of candidate segment 24 is marked as △24a.

When determining the candidate segment for phoneme /A/, the preceding voiced candidate segment search means 133 of the candidate segment selection unit 13 of this embodiment searches for the candidate segment of the voiced sound preceding phoneme /A/, because the phoneme preceding /A/ is the unvoiced sound /S/ (S133). Candidate segment 21, which has been selected as suited to the target of phoneme /E/, exists as the preceding candidate segment. The connection F0 sub-cost calculation means 134 therefore calculates the connection F0 sub-cost from the F0 value at the end position L of candidate segment 21 (*21a in the figure) and the F0 value at the start position T of candidate segment 23 (○23a in the figure) (S134). Similarly, the connection F0 sub-cost calculation means 134 calculates the connection F0 sub-cost from the F0 value at the end position L of candidate segment 21 (*21a in the figure) and the F0 value at the start position T of candidate segment 24 (△24a in the figure) (S134).

The sums of the target cost and the connection cost are then compared using the connection F0 sub-cost calculated in this way together with the target F0 sub-cost, target duration sub-cost, and connection phoneme environment sub-cost calculated as in the prior art. The sum is smaller for candidate segment 23 than for candidate segment 24, so candidate segment 23, from which natural intonation can be expected, is selected. In this way, the speech synthesizer 10 according to this embodiment can select candidate segments so that the synthesized speech has natural intonation even for a target in which voiced sounds are connected to each other via an unvoiced sound.
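The comparison in this example can be illustrated numerically. The sketch below is hypothetical throughout: the patent gives no F0 values for FIG. 10, so the numbers, the weight `W_CONNECT_F0`, the absolute-difference cost, and the lumping of the remaining three sub-costs into a single `other` term are assumptions chosen only to show the mechanics.

```python
# Assumed F0 (Hz) at the end of candidate segment 21 (/E/), position L, mark *21a.
F0_END_21 = 220.0

# Assumed values for the two /A/ candidates at position T.
# "other" stands in for the sum of the target F0, target duration, and
# connection phoneme environment sub-costs, which are computed as in the prior art.
candidates = {
    "23": {"f0_start": 215.0, "other": 1.0},  # mark ○23a
    "24": {"f0_start": 160.0, "other": 0.8},  # mark △24a
}

W_CONNECT_F0 = 0.5  # assumed weight of the connection F0 sub-cost


def total_cost(cand: dict) -> float:
    """Weighted connection F0 sub-cost across the unvoiced /S/ plus the other sub-costs."""
    connection_f0 = abs(F0_END_21 - cand["f0_start"])
    return W_CONNECT_F0 * connection_f0 + cand["other"]


# Choose the candidate with the smallest total cost.
selected = min(candidates, key=lambda name: total_cost(candidates[name]))
print(selected)  # prints "23" for these assumed values
```

With these values, candidate segment 24 has slightly lower remaining sub-costs but a much larger F0 jump across /S/, so candidate segment 23 has the lower total.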

Claims

A speech synthesis method that selects, from a plurality of candidate segments (hereinafter referred to as a candidate segment group) stored in a speech waveform database, candidate segments suited to targets of synthesized speech numbered for each phoneme label (hereinafter referred to as targets), and generates synthesized speech by connecting the selected candidate segments, the method comprising a candidate segment selection step having:
a preceding voiced candidate segment search sub-step of searching, when the i-th target (i being a natural number of 3 or more) is a voiced sound and the (i-1)-th target is an unvoiced sound, for the candidate segment selected as suited to the (i-k)-th target, where k is the smallest value of 2 or more for which the (i-k)-th target is a voiced sound (hereinafter referred to as the (i-k)-th candidate segment);
a connection F0 sub-cost calculation sub-step of calculating a connection F0 sub-cost from the F0 value at the end position of the (i-k)-th candidate segment and the group of F0 values at the start positions of the candidate segments in the candidate segment group corresponding to the i-th target (hereinafter referred to as the i-th candidate segment group); and
a selection sub-step of selecting, based on the connection F0 sub-cost, one candidate segment suited to the i-th target from the i-th candidate segment group as the i-th candidate segment.

The speech synthesis method according to claim 1,
wherein the candidate segment selection step further has:
a target F0 sub-cost calculation sub-step of calculating a target F0 sub-cost using the F0 pattern of the i-th target and the group of F0 patterns of the i-th candidate segment group;
a target duration sub-cost calculation sub-step of calculating a target duration sub-cost using the duration of the i-th target and the group of durations of the i-th candidate segment group; and
a connection phoneme environment sub-cost calculation sub-step of calculating the acoustic similarity between the (i-1)-th candidate segment and the i-th candidate segment group as a connection phoneme environment sub-cost,
and wherein the selection sub-step selects one candidate segment suited to the i-th target from the i-th candidate segment group as the i-th candidate segment based on the target F0 sub-cost, the target duration sub-cost, and the connection phoneme environment sub-cost in addition to the connection F0 sub-cost.

The speech synthesis method according to claim 1 or 2, further comprising:
a text analysis step of acquiring text, performing morphological analysis on the acquired text, generating a phoneme string and an accent type from the morphological analysis result, and outputting the phoneme string and the accent type; and
a prosody generation step of acquiring the output phoneme string and accent type, estimating an F0 pattern and a phoneme duration for each phoneme, numbering the estimated F0 pattern and phoneme duration for each phoneme, and outputting them as the targets.

A speech synthesizer that selects, from a plurality of candidate segments (hereinafter referred to as a candidate segment group) stored in a speech waveform database, candidate segments suited to targets of synthesized speech numbered for each phoneme label (hereinafter referred to as targets), and generates synthesized speech by connecting the selected candidate segments, the speech synthesizer comprising a candidate segment selection unit having:
preceding voiced candidate segment search means for searching, when the i-th target (i being a natural number of 3 or more) is a voiced sound and the (i-1)-th target is an unvoiced sound, for the candidate segment selected as suited to the (i-k)-th target, where k is the smallest value of 2 or more for which the (i-k)-th target is a voiced sound (hereinafter referred to as the (i-k)-th candidate segment);
connection F0 sub-cost calculation means for calculating a connection F0 sub-cost from the F0 value at the end position of the (i-k)-th candidate segment and the group of F0 values at the start positions of the candidate segments in the candidate segment group corresponding to the i-th target (hereinafter referred to as the i-th candidate segment group); and
selection means for selecting, based on the connection F0 sub-cost, one candidate segment suited to the i-th target from the i-th candidate segment group as the i-th candidate segment.

A program that causes a computer to execute the speech synthesis method according to any one of claims 1 to 3.