JP2010008922A

JP2010008922A - Speech processing device, speech processing method and program

Info

Publication number: JP2010008922A
Application number: JP2008170973A
Authority: JP
Inventors: Shinko Morita; 眞弘森田; Takehiko Kagoshima; 岳彦籠嶋; Takeshi Hirabayashi; 剛平林
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-06-30
Filing date: 2008-06-30
Publication date: 2010-01-14
Anticipated expiration: 2028-06-30
Also published as: JP5106274B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech processing device for generating high quality synthetic speech which is not artificial with less booming and buzzing noise.

SOLUTION: A phoneme selecting section 43 selects multiple phonemes for each segment of a phonological sequence of a target speech. A phoneme integrating section 44 generates an integrated phoneme by integrating multiple phonemes for each segment. A formant emphasis level estimating section 46 estimates a formant emphasis level by using at least either of a feature amount regarding the multiple phonemes and a feature amount regarding the integrated phonemes, for each segment. A formant emphasis filter section 45 emphasizes formant based on the emphasis level estimated for the integrated phoneme for each segment.

COPYRIGHT: (C)2010, JPO&INPIT

Description

本発明は、音声処理装置、音声処理方法及びプログラムに関する。 The present invention relates to a voice processing device, a voice processing method, and a program.

任意の文章から人工的に音声信号を作り出すことを、テキスト音声合成という。テキスト音声合成は、一般的に、言語処理部、韻律処理部及び音声合成部の３つ段階によって行われる。 Artificially creating speech signals from arbitrary sentences is called text-to-speech synthesis. Text-to-speech synthesis is generally performed in three stages: a language processing unit, a prosody processing unit, and a speech synthesis unit.

入力されたテキストは、まず言語処理部において、形態素解析や構文解析が行われ、次に韻律処理部において、アクセントやイントネーションの処理が行われて、音韻系列・韻律情報（基本周波数、音韻継続時間長、パワーなど）が出力される。最後に、音声合成部において、音韻系列・韻律情報から音声信号を合成する。そこで、音声合成部に用いる音声合成方法は、韻律処理部で生成される任意の音韻系列を、任意の韻律で音声合成することが可能な方法でなければならない。 The input text first undergoes morphological and syntactic analysis in the language processing unit. The prosody processing unit then performs accent and intonation processing and outputs a phoneme sequence and prosodic information (fundamental frequency, phoneme duration, power, and so on). Finally, the speech synthesis unit synthesizes a speech signal from the phoneme sequence and prosodic information. The speech synthesis method used in the speech synthesis unit must therefore be able to synthesize any phoneme sequence generated by the prosody processing unit with any prosody.

従来、このような音声合成方法として、入力の音韻系列を分割して得られる複数の合成単位（合成単位列）のそれぞれに対して、入力された音韻系列・韻律情報を目標にして、予め記憶された大量の音声素片の中から音声素片を選択し、選択した音声素片を合成単位間で接続することによって、音声を合成する、音声合成方法（素片選択型の音声合成方法）が知られている。例えば、特許文献１に開示された素片選択型の音声合成方法では、音声を合成することで生じる音声合成の劣化の度合いを、コストで表すこととし、予め定義されたコスト関数を用いて計算されるコストが小さくなるように、音声素片を選択する。例えば、音声素片を編集・接続することで生じる変形歪み及び接続歪みを、コストを用いて数値化し、このコストに基づいて、音声合成に使用する音声素片系列を選択し、選択した音声素片系列に基づいて、合成音声を生成する。 Conventionally, a known speech synthesis method of this kind is the unit-selection speech synthesis method: for each of the synthesis units (the synthesis unit sequence) obtained by dividing the input phoneme sequence, speech units are selected from a large number of pre-stored speech units with the input phoneme sequence and prosodic information as the target, and the selected speech units are concatenated across synthesis units to synthesize speech. For example, in the unit-selection speech synthesis method disclosed in Patent Document 1, the degree of degradation of the synthesized speech caused by synthesis is expressed as a cost, and speech units are selected so that the cost computed with a predefined cost function is small. For example, the modification distortion and concatenation distortion caused by editing and concatenating speech units are quantified as costs, the speech unit sequence used for synthesis is selected based on these costs, and synthesized speech is generated from the selected speech unit sequence.

特許文献１に開示された音声合成方法のように、音声を合成することで生じる音声合成の劣化の度合いを考慮して、大量の音声素片の中から適切な音声素片系列を選択することによって、音声素片の編集及び接続による音質の劣化を抑えた合成音声を生成することができる。 As in the speech synthesis method disclosed in Patent Document 1, in consideration of the degree of speech synthesis degradation caused by speech synthesis, an appropriate speech unit sequence is selected from a large number of speech units. Thus, it is possible to generate synthesized speech in which deterioration of sound quality due to editing and connection of speech segments is suppressed.

しかしながら、特許文献１に開示された素片選択型の音声合成方法には、部分的に合成音の音質が劣化する問題点がある。この理由は次のようなものである。 However, the segment selection type speech synthesis method disclosed in Patent Document 1 has a problem in that the quality of synthesized speech is partially degraded. The reason for this is as follows.

第１の理由は、予め記憶された音声素片が非常に多い場合であっても、様々な音韻・韻律環境に対して適切な音声素片が存在するとは限らないことである。 The first reason is that even if there are a large number of speech segments stored in advance, speech segments suitable for various phoneme / prosodic environments are not always present.

第２の理由は、人が実際に感じる合成音声の劣化の度合いをコスト関数が完全に表現できないため、必ずしも最適な素片系列が選ばれない場合があるからである。 The second reason is that an optimal unit sequence may not always be selected because the cost function cannot completely represent the degree of deterioration of the synthesized speech that a person actually feels.

第３の理由は、音声素片が非常に多いために予め不良な音声素片を排除しておくことが困難であり、また不良な音声素片を取り除くためのコスト関数の設計も難しいため、選択された音声素片系列中に、突発的に不良な音声素片が混入する場合があるからである。 The third reason is that because there are so many speech segments, it is difficult to eliminate bad speech segments in advance, and it is also difficult to design a cost function for removing bad speech segments. This is because a defective speech unit may be suddenly mixed in the selected speech unit sequence.

そこで、合成単位当たり１つずつの音声素片を選ぶのではなく、合成単位当たり複数個の音声素片を選択し、これを融合することによって新たな音声素片を生成し、こうして生成された音声素片を使って音声を合成する方法が開示されている（特許文献２参照）。以下、この方法を「複数素片選択融合型の音声合成方法」と呼ぶ。 A method has therefore been disclosed in which, instead of selecting one speech unit per synthesis unit, a plurality of speech units are selected per synthesis unit and fused to generate a new speech unit, and speech is synthesized using the speech units generated in this way (see Patent Document 2). Hereinafter, this method is referred to as the “multiple-unit selection and fusion speech synthesis method”.

特許文献２に開示された複数素片選択融合型の音声合成方法では、合成単位毎に複数の音声素片を融合することによって、目標とする音韻・韻律環境に合う適切な音声素片が存在しない場合や、最適な音声素片が選択されない場合、不良素片が選択されてしまった場合でも、高品質な音声素片を新たに生成することができ、さらに、この新たに生成した音声素片を使用して音声合成を行うことで、前述した素片選択型の音声合成方法の問題点を改善することができ、より安定性を増した高音質の音声合成を実現することができる。 In the multiple-unit selection and fusion speech synthesis method disclosed in Patent Document 2, fusing a plurality of speech units for each synthesis unit makes it possible to generate a new high-quality speech unit even when no speech unit suits the target phoneme and prosodic environment, when the optimal speech unit is not selected, or when a defective unit is selected. Synthesizing speech with these newly generated speech units mitigates the problems of the unit-selection speech synthesis method described above and achieves more stable, higher-quality speech synthesis.

この複数素片選択融合型の音声合成方法においては、音声素片の融合による平均化の副作用によってスペクトル包絡が原音に比べて若干鈍る傾向があり、その結果、こもり感やブザー感が生じる場合がある。こうしたこもり感やブザー感の主観的な改善には、音声符号化や音声合成でよく用いられるようなフォルマント強調フィルタを、融合された素片に対して適用することが効果的である。 In this multiple-unit selection and fusion speech synthesis method, averaging is a side effect of fusing speech units, so the spectral envelope tends to be slightly duller than that of the original speech, which can produce a muffled or buzzy quality. Applying a formant emphasis filter of the kind commonly used in speech coding and speech synthesis to the fused units is an effective way to reduce this muffled or buzzy quality perceptually.

フォルマント強調フィルタは、入力音声波形のスペクトル包絡のフォルマントによる山・谷を強調したような音声波形を出力するフィルタで、適度な度合いでフォルマントを強調できれば、スペクトル包絡が鈍ったことによって生じるこもり感やブザー感を改善できる。一般的に、フォルマント強調フィルタは入力波形のスペクトル特性に応じてフィルタ特性を変える点では適応的だが、どの程度フォルマントを強調するかについては、適切な強調度合いを決めるための客観尺度が存在しないため、主観評価などによって実験的に決めるしかなく、ハイパーパラメータなどの値を外部から指定することによって制御することが多い。 A formant emphasis filter outputs a speech waveform in which the peaks and valleys that the formants produce in the spectral envelope of the input waveform are emphasized. If the formants are emphasized to an appropriate degree, the muffled or buzzy quality caused by a dulled spectral envelope can be reduced. In general, a formant emphasis filter is adaptive in that its filter characteristics change with the spectral characteristics of the input waveform. However, no objective measure exists for choosing how strongly to emphasize the formants, so the degree of emphasis has to be determined experimentally, for example by subjective evaluation, and is usually controlled by externally specified values such as hyperparameters.

そのため、複数素片選択型の音声合成方法で用いる場合には、フォルマントの強調度合いは、合成音声の主観的な音質が総合的に良くなるように、主観評価などによって実験的に決める。すなわち、フォルマントの強調度合いは、融合されたあらゆる素片に対して共通のものが適用される。 For this reason, when the filter is used in the multiple-unit selection speech synthesis method, the degree of formant emphasis is determined experimentally, for example by subjective evaluation, so that the subjective quality of the synthesized speech improves overall. In other words, the same degree of formant emphasis is applied to every fused unit.
特開２００１−２８２２７８公報、特開２００５−１６４７４９公報 JP 2001-282278 A; JP 2005-164749 A

しかしながら、音声素片の融合によるスペクトル包絡の鈍り具合は、通常、合成単位によって異なり、一様ではない。例えば、合成単位に対して選ばれた複数の素片が類似のスペクトル包絡を持つ場合は、融合してもさほどスペクトル包絡は鈍らないと考えられるが、フォルマントの位置が素片間で大きく異なるなど、選ばれた音声素片のスペクトル包絡がそれぞれ異なる特徴を持つ場合には、融合するとスペクトル包絡が鈍ってしまう可能性が高い。 However, the degree of blunting of the spectral envelope due to the fusion of speech elements usually varies depending on the synthesis unit and is not uniform. For example, if multiple segments selected for the synthesis unit have similar spectral envelopes, it is considered that the spectral envelope will not be dulled even if they are merged, but the position of the formant varies greatly between the segments, etc. If the spectral envelopes of the selected speech segments have different characteristics, there is a high possibility that the spectral envelope will become dull when they are merged.
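The dependence of envelope dulling on how similar the selected units are can be illustrated numerically. The following is a toy sketch and is not taken from the patent: each unit's spectral envelope is modeled as a single Gaussian peak on a flat floor (a hypothetical stand-in for one formant), and fusion is approximated by a bin-wise mean. The names `envelope`, `average`, and `peak_to_valley` are illustrative assumptions.

```python
import math


def envelope(peak_bin, n_bins=64, width=3.0):
    """Toy spectral envelope: one Gaussian 'formant' peak on a flat floor."""
    return [
        0.1 + math.exp(-((k - peak_bin) ** 2) / (2 * width ** 2))
        for k in range(n_bins)
    ]


def average(envelopes):
    """Bin-wise mean, standing in for the averaging effect of unit fusion."""
    n = len(envelopes)
    return [sum(values) / n for values in zip(*envelopes)]


def peak_to_valley(env):
    """Ratio of the highest to the lowest envelope value (larger = sharper)."""
    return max(env) / min(env)


# Units whose formant positions nearly coincide.
similar = average([envelope(20), envelope(21), envelope(20)])
# Units whose formant positions differ widely.
different = average([envelope(12), envelope(20), envelope(28)])

print("similar units:  ", round(peak_to_valley(similar), 2))
print("different units:", round(peak_to_valley(different), 2))
```

Under these assumptions the fused envelope of the dissimilar units has a clearly lower peak-to-valley ratio, which is the kind of dulling the paragraph describes.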

このような状況において、全音声素片に対して同じ強調度合いのフォルマント強調フィルタを一様に適用すると、融合によってスペクトル包絡が大きく鈍った箇所にはフォルマント強調の程度が不十分であるのに対し、逆に融合によるスペクトル包絡の鈍りが小さい箇所はフォルマントが強調されすぎて人工的な音になる問題がある。 In such a situation, if a formant emphasis filter with the same degree of emphasis is applied uniformly to all speech units, the emphasis is insufficient where fusion has dulled the spectral envelope considerably, whereas where fusion has dulled the envelope only slightly the formants are over-emphasized and the speech sounds artificial.

本発明は、上記事情を考慮してなされたもので、こもり感やブザー感が少なく、かつ人工的でない高音質な合成音声を生成できる音声処理装置、音声処理方法及びプログラムを提供することを目的とする。 The present invention has been made in view of the above circumstances, and its object is to provide a speech processing device, a speech processing method, and a program that can generate high-quality synthesized speech that has little muffled or buzzy quality and does not sound artificial.

本発明に係る音声処理装置は、目標音声に対応する音韻系列を合成単位で区切って得られる複数のセグメントを取得する第１の取得部と、前記目標音声に対応する各々の前記セグメントの韻律情報を取得する第２の取得部と、各々の前記セグメントごとに、当該セグメントに対し、当該セグメントの前記韻律情報に基づいて、予め用意された複数の音声素片のうちから、複数個の音声素片を選択する選択部と、各々の前記セグメントごとに、当該セグメントに対して選択された複数個の前記音声素片を融合することによって、融合素片を生成する融合部と、各々の前記セグメントごとに、前記選択部により選択された複数個の前記音声素片に関する特徴量と、前記融合部により生成された前記融合素片に関する特徴量との少なくとも一方を用いて、当該セグメントに係る前記融合素片に対して行うべきフォルマント強調における強調度合いを推定する推定部と、各々の前記セグメントごとに、当該セグメントに係る前記融合素片に対して、前記推定部が推定した前記強調度合い基づくフォルマント強調を行うフォルマント強調フィルタ部とを備えたことを特徴とする。 A speech processing device according to the present invention comprises: a first acquisition unit that acquires a plurality of segments obtained by dividing a phoneme sequence corresponding to a target speech into synthesis units; a second acquisition unit that acquires prosodic information for each of the segments corresponding to the target speech; a selection unit that, for each segment, selects a plurality of speech units from among a plurality of speech units prepared in advance, based on the prosodic information of that segment; a fusion unit that, for each segment, generates a fused unit by fusing the plurality of speech units selected for that segment; an estimation unit that, for each segment, estimates the degree of emphasis for the formant emphasis to be applied to the fused unit of that segment, using at least one of a feature quantity of the plurality of speech units selected by the selection unit and a feature quantity of the fused unit generated by the fusion unit; and a formant emphasis filter unit that, for each segment, applies to the fused unit of that segment formant emphasis based on the degree of emphasis estimated by the estimation unit.

本発明によれば、こもり感やブザー感が少なく、かつ人工的でない高音質な合成音声を生成できる。 According to the present invention, it is possible to generate high-quality synthesized speech that has little muffled or buzzy quality and does not sound artificial.

以下、図面を参照しながら本発明の実施形態について説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.

（第１の実施形態） (First embodiment)
本発明の第１の実施形態に係るテキスト音声合成装置（音声処理装置）について説明する。 A text-to-speech synthesizer (speech processing device) according to a first embodiment of the present invention will be described.

図１に、本実施形態に係るテキスト音声合成を行うテキスト音声合成装置（音声処理装置）の全体構成例を示す。 FIG. 1 shows an example of the overall configuration of a text-to-speech synthesizer (speech processor) that performs text-to-speech synthesis according to the present embodiment.

図１に示されるように、本実施形態のテキスト音声合成装置は、テキスト入力部１、言語処理部２、韻律処理部３、音声合成部４を備えている。 As shown in FIG. 1, the text-to-speech synthesizer of this embodiment includes a text input unit 1, a language processing unit 2, a prosody processing unit 3, and a speech synthesis unit 4.

テキスト入力部１は、テキストを入力する。 The text input unit 1 inputs text.

言語処理部２は、テキスト入力部１から入力されるテキストの形態素解析・構文解析を行い、これら言語解析により得られた言語解析結果を韻律処理部３へ出力する。 The language processing unit 2 performs morphological analysis and syntax analysis of the text input from the text input unit 1, and outputs the language analysis results obtained by the language analysis to the prosody processing unit 3.

韻律制御部３は、該言語解析結果を入力し、該言語解析結果からアクセントやイントネーションの処理を行って、音韻系列及び韻律情報を生成し、生成した音韻系列及び韻律情報を音声合成部へ出力する。 The prosody control unit 3 receives the language analysis result, performs accent and intonation processing based on it to generate a phoneme sequence and prosodic information, and outputs the generated phoneme sequence and prosodic information to the speech synthesis unit.

音声合成部４は、該音韻系列及び韻律情報を入力し、該音韻系列及び韻律情報から音声波形を生成して出力する。 The speech synthesizer 4 inputs the phoneme sequence and prosody information, generates a speech waveform from the phoneme sequence and prosody information, and outputs it.

以下、音声合成部４を中心に、その構成及び動作について詳細に説明する。 Hereinafter, the configuration and operation of the speech synthesizer 4 will be described in detail.

図２に、本実施形態の音声合成部４の構成例を示す。 FIG. 2 shows a configuration example of the speech synthesizer 4 of the present embodiment.

図２に示されるように、音声合成部４は、音韻系列・韻律情報入力部４１、音声素片記憶部４２、素片選択部４３、素片融合部４４、フォルマント強調フィルタ部４５、フォルマント強調度合い推定部４６、素片編集・接続部４７、音声波形出力部４８を備えている。 As shown in FIG. 2, the speech synthesizer 4 includes a phoneme sequence / prosodic information input unit 41, a speech unit storage unit 42, a unit selection unit 43, a unit fusion unit 44, a formant enhancement filter unit 45, and formant enhancement. A degree estimation unit 46, a segment editing / connection unit 47, and a speech waveform output unit 48 are provided.

音韻系列・韻律情報入力部（以下、情報入力部と略記する。）４１は、音声合成部４への入力として、韻律制御部３から音韻系列・韻律情報を受理する。 A phoneme sequence / prosodic information input unit (hereinafter abbreviated as an information input unit) 41 receives phoneme sequence / prosodic information from the prosody control unit 3 as an input to the speech synthesis unit 4.

音声素片記憶部（以下、素片記憶部と略記する。）４２は、大量の音声素片を蓄積している。また、素片記憶部４２は、それら蓄積されている音声素片の全てについて、それぞれ、当該音声素片に対する音韻・韻律環境を併せて蓄積している。 A speech unit storage unit (hereinafter abbreviated as a unit storage unit) 42 stores a large amount of speech units. In addition, the unit storage unit 42 stores the phoneme / prosodic environment for each of the speech units together with the stored speech units.

素片選択部４３は、素片記憶部４２に蓄積された音声素片の中から、複数の音声素片を選択する。 The unit selection unit 43 selects a plurality of speech units from the speech units stored in the unit storage unit 42.

素片融合部４４は、素片選択部４３により選択された複数の音声素片を融合して、新たな音声素片（以下、「融合素片」とも呼ぶ。）を生成する。 The unit fusion unit 44 fuses the plurality of speech units selected by the unit selection unit 43 to generate a new speech unit (hereinafter also called a “fused unit”).

フォルマント強調フィルタ部４５は、（次のフォルマント強調度合い推定部４６により推定された、強調の程度に応じて）素片融合部４４により生成された音声素片に対して、フォルマント強調を行う（すなわち、フォルマント強調された融合素片を生成する）。 The formant emphasis filter unit 45 applies formant emphasis to the speech unit generated by the unit fusion unit 44, according to the degree of emphasis estimated by the formant emphasis degree estimation unit 46 described next (that is, it generates a formant-emphasized fused unit).

フォルマント強調度合い推定部４６は、フォルマント強調フィルタ部４５においてフォルマントを強調する程度を推定する。 The formant emphasis degree estimation unit 46 estimates the degree of emphasizing formants in the formant emphasis filter unit 45.

素片編集・接続部４７は、フォルマント強調フィルタ部４５から得られた音声素片を韻律変形及び接続して、合成音声の音声波形を生成する。 The unit editing/concatenation unit 47 applies prosodic modification to the speech units obtained from the formant emphasis filter unit 45 and concatenates them to generate the waveform of the synthesized speech.

音声波形出力部４８は、素片編集・接続部４７で生成した音声波形を出力する。 The speech waveform output unit 48 outputs the speech waveform generated by the segment editing / connection unit 47.

なお、情報入力部４１〜音声波形出力部４８の各部の機能は、コンピュータに格納されたプログラムに実現できる。 Note that the functions of the information input unit 41 to the voice waveform output unit 48 can be realized by a program stored in a computer.

次に、図２の音声合成部４の各ブロックについて詳しく説明する。 Next, each block of the speech synthesizer 4 in FIG. 2 will be described in detail.

＜情報入力部＞ <Information input unit>
まず、情報入力部４１は、韻律制御部３から入力された音韻系列・韻律情報を、素片選択部４３へ出力する。音韻系列は、例えば、音韻記号の系列である。また、韻律情報は、例えば、基本周波数、音韻継続時間長、パワーなどである。 First, the information input unit 41 outputs the phoneme sequence and prosodic information received from the prosody control unit 3 to the unit selection unit 43. The phoneme sequence is, for example, a sequence of phoneme symbols. The prosodic information includes, for example, the fundamental frequency, phoneme duration, and power.

以下、情報入力部４１に入力される音韻系列、韻律情報を、それぞれ、入力音韻系列、入力韻律情報と呼ぶ。 Hereinafter, the phoneme sequence and prosody information input to the information input unit 41 are referred to as input phoneme sequence and input prosody information, respectively.

＜素片記憶部＞ <Unit storage unit>
次に、素片記憶部４２には、合成音声を生成するときに用いられる音声の単位（以下、「合成単位」と称する。）で、音声素片が大量に蓄積されている。 Next, the unit storage unit 42 stores a large number of speech units, each in the unit of speech used when generating synthesized speech (hereinafter referred to as a “synthesis unit”).

ここで、「合成単位」とは、音素あるいは音素を分割したもの（例えば、半音素など）の組み合わせ、例えば、半音素、音素（Ｃ、Ｖ）、ダイフォン（ＣＶ、ＶＣ、ＶＶ）、トライフォン（ＣＶＣ、ＶＣＶ）、音節（ＣＶ、Ｖ）、などであり（ここで、Ｖは母音、Ｃは子音を表す。）、また、これらが混在しているなど可変長であってもよい。 Here, the “synthesis unit” is a phoneme or a combination of phonemes (for example, semiphonemes), for example, semiphonemes, phonemes (C, V), diphones (CV, VC, VV), and triphones. (CVC, VCV), syllables (CV, V), and the like (where V represents a vowel and C represents a consonant), and may be variable length such as a mixture of these.

また、「音声素片」は、合成単位に対応する音声信号の波形もしくはその特徴を表すパラメータ系列などを表すものとする。 Further, “speech segment” represents a waveform of a speech signal corresponding to a synthesis unit or a parameter series representing the characteristics thereof.

図３に、素片記憶部４２に蓄積される音声素片の例を示す。図３に示すように、素片記憶部４２には、各音素の音声信号の波形である音声素片が、当該音声素片を識別するための素片番号とともに記憶されている。これらの音声素片は、別途収録された多数の音声データに対して音素毎にラベル付けし、ラベルにしたがって音素毎に音声波形を切り出したものである。 FIG. 3 shows an example of speech segments accumulated in the segment storage unit 42. As shown in FIG. 3, in the segment storage unit 42, a speech unit that is a waveform of a speech signal of each phoneme is stored together with a unit number for identifying the speech unit. These speech segments are obtained by labeling a large number of separately recorded speech data for each phoneme and cutting out a speech waveform for each phoneme according to the label.

また、素片記憶部４２には、大量の音声素片とともに、各音声素片に対応した音韻・韻律環境が蓄積されている。 The unit storage unit 42 stores a large amount of speech units and phoneme / prosodic environments corresponding to each speech unit.

ここで、「音韻・韻律環境」とは、対応する音声素片にとって環境となる要因の組み合わせである。要因としては、例えば、当該音声素片の音素名、先行音素、後続音素、後々続音素、基本周波数、音韻継続時間長、パワー、ストレスの有無、アクセント核からの位置、息継ぎからの時間、発声速度、感情などがある。 Here, the “phoneme / prosodic environment” is a combination of factors that become an environment for the corresponding speech segment. Factors include, for example, the phoneme name of the speech unit, the preceding phoneme, the subsequent phoneme, the subsequent phoneme, the fundamental frequency, the phoneme duration, power, the presence or absence of stress, the position from the accent core, the time from breathing, the utterance There are speed, feelings, etc.

また、素片記憶部４２には、上記の他、音声素片の始端・終端でのケプストラム係数など、音声素片の音響特徴のうち音声素片の選択に用いる情報も蓄積されている。 In addition to the above, the unit storage unit 42 also stores information used for selecting speech units among the acoustic features of the speech units, such as cepstrum coefficients at the start and end of the speech units.

以下では、素片記憶部４２に蓄積される音声素片の音韻・韻律環境と音響特徴量とを総称して、「素片環境」と呼ぶ。 Hereinafter, the phoneme / prosodic environment and the acoustic feature amount of the speech unit stored in the unit storage unit 42 are collectively referred to as “unit environment”.

図４に、素片記憶部４２に蓄積される素片環境の例を示す。図４に示す環境記憶部４３には、素片記憶部４２に蓄積される各音声素片の素片番号に対応して素片環境が記憶されている。ここでは、音韻・韻律環境として、音声素片に対応した音韻（音素名）、隣接音韻（この例では、当該音韻の前後それぞれ２音素ずつ）、基本周波数、音韻継続時間長が記憶され、音響特徴量として、音声素片始終端のケプストラム係数が記憶されている。 FIG. 4 shows an example of the unit environments stored in the unit storage unit 42. In the environment storage unit 43 shown in FIG. 4, a unit environment is stored for the unit number of each speech unit stored in the unit storage unit 42. Here, the phoneme (phoneme name) of the speech unit, the adjacent phonemes (in this example, two phonemes on each side of that phoneme), the fundamental frequency, and the phoneme duration are stored as the phoneme and prosodic environment, and the cepstrum coefficients at the start and end of the speech unit are stored as the acoustic features.

なお、これらの素片環境は、音声素片を切り出す元になった音声データを分析して抽出することによって得られる。また、図４では、音声素片の合成単位が音素である場合を示しているが、半音素、ダイフォン、トライフォン、音節、あるいはこれらの組み合わせや可変長であってもよい。 These segment environments are obtained by analyzing and extracting speech data from which speech segments are cut out. Further, FIG. 4 shows a case where the synthesis unit of a speech unit is a phoneme, but it may be a semiphoneme, a diphone, a triphone, a syllable, or a combination or variable length thereof.

＜素片選択部＞ <Unit selection unit>
次に、図２の音声合成部４の動作を詳しく説明する。 Next, the operation of the speech synthesis unit 4 in FIG. 2 will be described in detail.

図２において、情報入力部４１を介して素片選択部４３に入力された音韻系列は、素片選択部４３において、合成単位毎に区切られる。以下、この区切られた合成単位を、「セグメント」と呼ぶ。 In FIG. 2, the phoneme sequence input to the unit selection unit 43 via the information input unit 41 is divided into synthesis units in the unit selection unit 43. Hereinafter, each of these divided synthesis units is called a “segment”.

素片選択部４３は、入力された入力音韻系列と入力韻律情報を基に、素片記憶部４２を参照し、各セグメントに対して、それぞれ、融合する複数個の音声素片の組み合わせを選択する。 The segment selection unit 43 refers to the segment storage unit 42 based on the input input phoneme sequence and the input prosody information, and selects a combination of a plurality of speech units to be fused for each segment. To do.

このとき素片選択部４３は、各音声素片候補を用いて音声を合成した場合の合成音声と目標音声との歪みができるだけ小さくなるように、融合する音声素片の組み合わせを選択する。ここでは、素片選択部４３は、一般の素片選択型音声合成方法や従来の複数素片選択融合型音声合成方法と同様に、音声素片の選択の尺度として、各音声素片候補を用いて音声を合成した場合の合成音声と目標音声との歪みの大きさを間接的に表すコストを用い、このコストができるだけ小さくなるように、融合する音声素片の組み合わせを選択する。 At this time, the unit selection unit 43 selects a combination of speech units to be merged so that distortion between the synthesized speech and the target speech when the speech is synthesized using each speech unit candidate is minimized. Here, similar to the general unit selection type speech synthesis method and the conventional multiple unit selection fusion type speech synthesis method, the unit selection unit 43 selects each speech unit candidate as a scale for selecting a speech unit. Using a cost that indirectly represents the magnitude of distortion between the synthesized speech and the target speech when the speech is synthesized using the speech, a combination of speech units to be merged is selected so that this cost is minimized.

ここで、「目標音声」とは、音声を合成する際の目標となる（仮想的な）音声、すなわち、入力された音韻の並びと韻律を実現し、かつ、理想的に自然な音声をいう。 Here, the “target speech” refers to a (virtual) speech that is a target when synthesizing speech, that is, an ideal natural speech that realizes the input phoneme sequence and prosody. .

最初に、コストについて説明する。 First, the cost will be described.

合成音声の目標音声に対する歪みの度合いを表すコストには、大きく分けて、目標コストと接続コストの２種類のコストがある。 Costs representing the degree of distortion of the synthesized speech with respect to the target speech are roughly divided into two types of costs: target cost and connection cost.

目標コストは、コストの算出対象である音声素片（対象素片）を目標の音韻・韻律環境で使用することによって生じるコストである。 The target cost is a cost generated by using a speech segment (target segment) that is a cost calculation target in a target phoneme / prosodic environment.

接続コストは、対象素片を隣接する音声素片と接続したときに生じるコストである。 The connection cost is a cost that occurs when the target segment is connected to an adjacent speech segment.

具体的には、次の通りである。 Specifically, it is as follows.

目標コストとしては、音声素片が持つ基本周波数と目標の基本周波数の違い（差）によって生じる歪み（基本周波数コスト）、音声素片の音韻継続時間長と目標の音韻継続時間長の違い（差）によって生じる歪み（継続時間長コスト）、音声素片が属していた音韻環境と目標の音韻環境の違いによって生じる歪み（音韻環境コスト）などがある。接続コストとしては、音声素片境界でのスペクトルの違い（差）によって生じる歪み（スペクトル接続コスト）や、音声素片境界での基本周波数の違い（差）によって生じる歪み（基本周波数接続コスト）などがある。 Target costs include the distortion caused by the difference between the fundamental frequency of the speech unit and the target fundamental frequency (fundamental frequency cost), the distortion caused by the difference between the phoneme duration of the speech unit and the target phoneme duration (duration cost), and the distortion caused by the difference between the phonemic environment the speech unit came from and the target phonemic environment (phonemic environment cost). Connection costs include the distortion caused by the spectral difference at the speech unit boundary (spectral connection cost) and the distortion caused by the fundamental frequency difference at the speech unit boundary (fundamental frequency connection cost).
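The cost terms listed above can be sketched as simple functions. The patent names the sub-costs but this passage gives no formulas, so everything below is an assumption made for illustration: the `Target` and `Unit` records, the log-frequency difference, the mismatch count for the phonemic environment, the Euclidean cepstral distance, and the weights `w_*`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical target description for one segment."""
    prev_phoneme: str
    next_phoneme: str
    f0: float        # target fundamental frequency (Hz)
    duration: float  # target phoneme duration (s)


@dataclass(frozen=True)
class Unit:
    """Hypothetical stored speech unit with its unit environment."""
    prev_phoneme: str
    next_phoneme: str
    f0: float
    duration: float
    start_cep: tuple  # cepstrum coefficients at the unit start
    end_cep: tuple    # cepstrum coefficients at the unit end


def target_cost(unit, target, w_f0=1.0, w_dur=1.0, w_env=1.0):
    """Weighted sum of F0, duration, and phonemic environment sub-costs."""
    f0_cost = abs(math.log(unit.f0) - math.log(target.f0))
    dur_cost = abs(unit.duration - target.duration)
    # Count mismatching neighbours as a crude phonemic environment cost.
    env_cost = (unit.prev_phoneme != target.prev_phoneme) + (
        unit.next_phoneme != target.next_phoneme
    )
    return w_f0 * f0_cost + w_dur * dur_cost + w_env * env_cost


def connection_cost(left, right, w_spec=1.0, w_f0=1.0):
    """Weighted sum of spectral and F0 discontinuity at the unit boundary."""
    spec_cost = math.dist(left.end_cep, right.start_cep)
    f0_cost = abs(math.log(left.f0) - math.log(right.f0))
    return w_spec * spec_cost + w_f0 * f0_cost
```

A unit that matches its target exactly gets a target cost of zero, and two units whose boundary cepstra and fundamental frequencies agree get a connection cost of zero. In practice the weights would need tuning, which this passage of the patent does not cover.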

コストを用いて、一セグメント当たり複数個の音声素片を選択する方法については、どのような方法を用いても構わない。 Any method may be used for selecting a plurality of speech segments per segment using cost.

例えば、特許文献２に開示された方法を用いても良い。ここでは、この選択方法の概要について、図５の処理手順例を参照しながら、一セグメント当たりＭ個の音声素片を選ぶ場合について説明する。 For example, the method disclosed in Patent Document 2 may be used. An outline of this selection method is given here for the case of selecting M speech units per segment, with reference to the example procedure in FIG. 5.

まず、ステップＳ１０１において、素片選択部４３は、入力された音韻系列を、合成単位毎のセグメントに分割する。ここで、分割されたセグメントの数をＮとする。 First, in step S101, the segment selection unit 43 divides the input phoneme sequence into segments for each synthesis unit. Here, the number of divided segments is N.

次に、ステップＳ１０２において、素片記憶部４２に記憶されている音声素片群の中から、各セグメントにつき１つずつの音声素片の系列を選択する。このときの選択においては、入力された目標の音韻系列・韻律情報と、素片記憶部４２の音声素片環境の情報を基に、系列としてのコストの総和（トータルコスト）が最小となるような音声素片の系列（最適素片系列）を求める。この最適素片系列の探索は、動的計画法（DP(dynamic programming)）を用いることで、効率的に行うことができる。 Next, in step S102, a sequence of speech units, one per segment, is selected from the speech units stored in the unit storage unit 42. Based on the input target phoneme sequence and prosodic information and on the unit environment information in the unit storage unit 42, the sequence of speech units that minimizes the sum of costs over the sequence (the total cost) is found; this is the optimal unit sequence. The search for the optimal unit sequence can be performed efficiently with dynamic programming (DP).

次に、ステップＳ１０３において、セグメント番号を表すカウンターｉに、初期値「１」をセットする。 Next, in step S103, an initial value “1” is set in the counter i representing the segment number.

次に、ステップＳ１０４において、セグメントｉに対する複数の音声素片候補の各々に対してコストを算出する。このときに用いるコストには、当該音声素片候補での目標コストと、当該音声素片候補の前後のセグメントの最適音声素片（最適素片系列に含まれる音声素片）と当該音声素片候補との接続コストとの和を用いる。 Next, in step S104, a cost is calculated for each of a plurality of speech element candidates for segment i. The cost used at this time includes the target cost of the speech unit candidate, the optimal speech unit of the segment before and after the speech unit candidate (the speech unit included in the optimal unit sequence), and the speech unit. The sum of the connection costs with the candidates is used.

次に、ステップＳ１０５において、ステップＳ１０４で算出したコストを用いて、セグメントｉについて、コストの小さい上位Ｍ個の音声素片を選択する。 Next, in step S105, using the cost calculated in step S104, the top M speech elements with the lowest cost are selected for segment i.

次に、ステップＳ１０６において、カウンターｉがＮ未満かどうかを判定する。 Next, in step S106, it is determined whether the counter i is less than N.

カウンターｉがＮ未満である場合（ステップＳ１０６のＹＥＳ）には、ステップＳ１０７に進んで、カウンターｉの値を１つ増やした後に、ステップＳ１０４に進んで、次のセグメントに係る処理を行う。 If the counter i is less than N (YES in step S106), the process proceeds to step S107, the counter i is incremented by one, and the process returns to step S104 to handle the next segment.

カウンターｉがＮに達した場合（ステップＳ１０６のＮＯ）には、この素片選択の処理を終了する。 If the counter i reaches N (NO in step S106), the segment selection process is terminated.
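The selection procedure of steps S101 to S107 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of Patent Document 2: candidates are passed in per segment (0-based here, whereas the text counts from 1), and `target_cost(i, unit)` and `connection_cost(left, right)` are hypothetical callables supplied by the caller.

```python
def optimal_sequence(candidates, target_cost, connection_cost):
    """Step S102: dynamic programming search for the minimum-total-cost sequence."""
    n = len(candidates)
    # cum[i][j]: lowest cumulative cost of a path ending at candidate j of segment i.
    cum = [[target_cost(0, u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            prev = candidates[i - 1]
            k = min(
                range(len(prev)),
                key=lambda p: cum[i - 1][p] + connection_cost(prev[p], u),
            )
            row.append(cum[i - 1][k] + connection_cost(prev[k], u) + target_cost(i, u))
            ptr.append(k)
        cum.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cum[-1])), key=cum[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


def select_top_m(candidates, target_cost, connection_cost, m):
    """Steps S103 to S107: keep the M lowest-cost candidates for each segment."""
    best = optimal_sequence(candidates, target_cost, connection_cost)
    n = len(candidates)
    selected = []
    for i in range(n):
        def cost(u, i=i):
            # Target cost plus connection costs to the optimal neighbours (step S104).
            c = target_cost(i, u)
            if i > 0:
                c += connection_cost(best[i - 1], u)
            if i < n - 1:
                c += connection_cost(u, best[i + 1])
            return c

        selected.append(sorted(candidates[i], key=cost)[:m])  # step S105
    return selected
```

For a quick check, the units can be plain numbers, with `target_cost` measuring the distance to a per-segment target value and `connection_cost` the distance between neighbouring units.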

このように、素片選択部４３は、各セグメントに対してＭ個ずつの音声素片を選択し、選択した音声素片を素片融合部４４に出力する。 In this way, the unit selection unit 43 selects M speech units for each segment and outputs the selected speech units to the unit fusion unit 44.

素片選択部４３においてセグメント当たり複数個の音声素片を選択する方法は、上記した方法に限定する必要はなく、コストであっても、コスト以外であっても、何らかの評価尺度の下で、適切な音声素片の組を選べる方法であれば、いかなる方法を用いても良い。 The method by which the unit selection unit 43 selects a plurality of speech units per segment need not be limited to the one described above. Any method may be used as long as it can choose an appropriate set of speech units under some evaluation measure, whether that measure is a cost or something else.

＜素片融合部＞ <Unit fusion unit>
素片融合部４４は、それぞれのセグメント毎に、素片選択部４３から入力された複数個の音声素片を融合して、新たな音声素片を生成する。 For each segment, the unit fusion unit 44 fuses the plurality of speech units input from the unit selection unit 43 to generate a new speech unit.

音声素片を融合する方法については、どのような方法を用いても構わない。 Any method may be used as a method for fusing speech segments.

例えば、特許文献２に開示された方法を用いても良い。ここでは、この方法について図６及び図７を参照しながら説明する。 For example, the method disclosed in Patent Document 2 may be used. Here, this method will be described with reference to FIGS.

図６は、一つのセグメントに対する複数個の音声素片の波形を融合して、新たな音声波形を生成する手順を示すフローチャートである。図７は、あるセグメントに対して選択された３つの音声素片からなる素片組み合わせ候補（図中、６０）を融合して、新たな音声素片（図中、６３）を生成する例を示す図である。 FIG. 6 is a flowchart showing the procedure for fusing the waveforms of a plurality of speech units for one segment to generate a new speech waveform. FIG. 7 shows an example in which a unit combination candidate consisting of three speech units selected for a segment (60 in the figure) is fused to generate a new speech unit (63 in the figure).

まず、ステップＳ２０１において、（ある一つのセグメントについて）選択されたそれぞれの音声素片からピッチ波形を切り出す。 First, in step S201, a pitch waveform is cut out from each selected speech segment (for a certain segment).

ここで、「ピッチ波形」とは、その長さが音声の基本周期の数倍程度で、それ自身は基本周期を持たない比較的短い波形であって、そのスペクトルが音声信号のスペクトル包絡を表すものである。 Here, a “pitch waveform” is a relatively short waveform whose length is about several times the fundamental period of the speech, which has no fundamental period of its own, and whose spectrum represents the spectral envelope of the speech signal.

このようなピッチ波形を抽出する方法には、どのような方法が用いられても良いが、その一つの方法として、基本周期同期窓を用いる方法があり、ここでは、この方法が用いられる場合を例にとって説明する。 Any method may be used to extract such a pitch waveform, but one method is to use a fundamental period synchronization window. Here, the case where this method is used is used. Let's take an example.

具体的には、それぞれの音声素片の音声波形に対して基本周期間隔毎にマーク（ピッチマーク）を付し、このピッチマークを中心にして、窓長が基本周期の２倍のハニング窓で窓掛けすることによって、ピッチ波形を切り出す。図７のピッチ波形系列６１は、素片組み合わせ候補６０の各音声素片から切り出して得られたピッチ波形の系列の例を示している。 Specifically, a mark (pitch mark) is attached to the speech waveform of each speech unit at every basic period interval, and a Hanning window whose window length is twice the basic period centered on this pitch mark. A pitch waveform is cut out by windowing. A pitch waveform series 61 of FIG. 7 shows an example of a series of pitch waveforms obtained by cutting out from each speech unit of the unit combination candidate 60.
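
The windowing in step S201 can be sketched as follows. This is only an illustrative sketch: the function names `hann` and `cut_pitch_waveforms` are hypothetical, the fundamental period is assumed to be constant within the unit, and the pitch marks are assumed to be given.

```python
import math

def hann(n):
    """Symmetric Hanning window of n samples (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def cut_pitch_waveforms(signal, pitch_marks, period):
    """Cut one windowed pitch waveform around each pitch mark.

    signal      : list of float samples of one speech unit
    pitch_marks : sample indices of the pitch marks
    period      : fundamental period in samples (assumed constant here)
    """
    win = hann(2 * period + 1)  # window spans about twice the fundamental period
    waveforms = []
    for mark in pitch_marks:
        frame = []
        for k, w in enumerate(win):
            idx = mark - period + k
            # Samples outside the unit are treated as zero (an assumption).
            x = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(w * x)
        waveforms.append(frame)
    return waveforms
```

In practice the period varies from mark to mark, so the window length would be chosen per pitch mark; the constant period here only keeps the sketch short.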

次に、ステップＳ２０２において、それぞれの音声素片に対するピッチ波形の個数が、音声素片間で同一になるように、ピッチ波形の数を揃える。 Next, in step S202, the number of pitch waveforms is aligned so that the number of pitch waveforms for each speech unit is the same between speech units.

The number to which the pitch waveforms are aligned is the number of pitch waveforms needed to generate synthesized speech with the target phoneme duration; alternatively, for example, the counts may be aligned to that of the sequence with the most pitch waveforms.

For a sequence with too few pitch waveforms, the count is increased by duplicating some of the pitch waveforms in the sequence; for a sequence with too many, the count is reduced by thinning out some of them. The pitch waveform sequence 62 in FIG. 7 shows an example in which the number of pitch waveforms has been aligned to six.
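
One simple way to realize the duplication and thinning of step S202 is a proportional index mapping. The text does not say which pitch waveforms are duplicated or dropped, so the rule below, like the name `equalize_count`, is an assumption made for illustration.

```python
def equalize_count(pitch_waveforms, target):
    """Return `target` pitch waveforms by duplicating or thinning out.

    Each output position k is mapped proportionally onto an input position,
    so short sequences repeat some entries and long sequences skip some.
    """
    n = len(pitch_waveforms)
    if n == 0 or target <= 0:
        return []
    return [pitch_waveforms[min(n - 1, (k * n) // target)] for k in range(target)]
```

For example, a sequence of four pitch waveforms stretched to six repeats the first and third entries, and a sequence of nine reduced to six drops three of them at roughly even spacing.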

次に、ステップＳ２０３において、ピッチ波形数を揃えた後、それぞれの音声素片に対応するピッチ波形系列中のピッチ波形を、その位置毎に融合することによって、新たなピッチ波形系列を生成する。 Next, in step S203, after arranging the number of pitch waveforms, a new pitch waveform sequence is generated by fusing pitch waveforms in the pitch waveform sequence corresponding to each speech unit for each position.

For example, the pitch waveform 63a included in the new pitch waveform sequence 63 generated in FIG. 7 is obtained by fusing the sixth pitch waveforms 62a, 62b, and 62c of the pitch waveform sequences 62. The new pitch waveform sequence 63 generated in this way is used as the fused speech unit.

ここで、ピッチ波形を融合する方法としては、例えば、次のような方法がある。 Here, as a method of merging pitch waveforms, for example, there are the following methods.

第１の方法は、単純にピッチ波形の平均を計算する方法である。 The first method is simply a method of calculating the average of the pitch waveform.

第２の方法は、ピッチ波形間の相関が最大になるよう時間方向に各ピッチ波形の位置を補正してから平均化する方法である。 The second method is a method of averaging after correcting the position of each pitch waveform in the time direction so that the correlation between the pitch waveforms is maximized.

The third method divides each pitch waveform into frequency bands, corrects the positions of the pitch waveforms within each band so that the correlation between them is maximized, averages them, and then sums the per-band averages across the bands.

いずれの方法を用いても良いが、本実施形態では、最後に説明した第３の方法を用いる場合を例にとって説明する。 Any method may be used, but in this embodiment, the case of using the third method described last will be described as an example.
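
The first two fusion methods can be sketched together, since plain averaging is the special case in which no time shift is allowed. The sketch assumes equal-length pitch waveforms, aligns every waveform to the first one, and searches integer lags only; all function names are hypothetical. The third method would apply the same alignment to each frequency band separately and then sum the band results, which is not shown here.

```python
def shift(wave, lag):
    """Shift `wave` by `lag` samples (positive = delay), zero-padded."""
    n = len(wave)
    return [wave[k - lag] if 0 <= k - lag < n else 0.0 for k in range(n)]

def best_lag(ref, wave, max_lag):
    """Lag in [-max_lag, max_lag] that maximizes the correlation with `ref`."""
    def corr(lag):
        return sum(r * s for r, s in zip(ref, shift(wave, lag)))
    return max(range(-max_lag, max_lag + 1), key=corr)

def fuse_aligned(waves, max_lag=3):
    """Average equal-length pitch waveforms after aligning them to the first.

    With max_lag=0 this reduces to the plain average (the first method).
    """
    ref = waves[0]
    aligned = [ref] + [shift(w, best_lag(ref, w, max_lag)) for w in waves[1:]]
    return [sum(col) / len(aligned) for col in zip(*aligned)]
```

Without alignment, two pulses one sample apart average into a wider, lower pulse, which is the kind of blunting the later formant emphasis is meant to compensate.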

素片融合部４４は、上記した方法を用いて、各セグメントについて、複数の音声素片を融合して新たな音声素片を生成し、フォルマント強調フィルタ部４５に出力する。 Using the above-described method, the unit fusion unit 44 unites a plurality of speech units for each segment to generate a new speech unit, and outputs the new speech unit to the formant emphasis filter unit 45.

<Formant emphasis filter unit>
The speech waveform of a speech unit generated by fusion as described above tends to have a duller spectral envelope than the waveforms of the source speech units, and as a result some formants are weakened and clarity is often reduced. The formant emphasis filter unit 45 therefore filters the fused unit input from the unit fusion unit 44 so as to emphasize the formants, and outputs the result to the unit editing/connecting unit 47.

As the formant emphasis filter used here, for example, the filter disclosed by J. Chen et al. ("Adaptive Postfiltering for Quality Enhancement of Coded Speech", IEEE Trans. Speech and Audio Processing, vol. 3, Jan. 1995; hereinafter referred to as Document 3) can be used.

こうしたフォルマント強調フィルタを、融合素片の音声波形に対して適用することによって、スペクトル包絡中のフォルマントを強調し、融合による明瞭性の低下を補償することが可能である。 By applying such a formant emphasis filter to the speech waveform of the fusion unit, it is possible to emphasize the formant in the spectrum envelope and compensate for the decrease in clarity due to the fusion.

An outline of the formant emphasis filter is given below, using the filter disclosed in Document 3 as an example. That filter has the transfer function of Equation (1).

$$H(z) = G\,\frac{1 - P(z/\beta)}{1 - P(z/\alpha)}\,\bigl(1 - \mu z^{-1}\bigr) \tag{1}$$

Here, P(z) is given by Equation (2), where a_i is the i-th linear prediction coefficient (LPC) obtained by linear prediction analysis of the input waveform and M is the linear prediction order.

$$P(z) = \sum_{i=1}^{M} a_i z^{-i} \tag{2}$$

In Equation (1), 1/[1 − P(z/α)] is the linear prediction filter when α = 1 and has the same frequency response as the LPC spectrum of the input waveform. As α is decreased, the frequency response becomes a blunted version of the LPC spectrum, and it approaches a flat response as α approaches 0. Passing the input through this term therefore makes frequency components with large power larger and components with small power smaller, which emphasizes the peaks and valleys of the spectrum. However, because the spectral envelope of typical speech slopes downward from low to high frequencies, the frequency response of 1/[1 − P(z/α)] has a similar overall downward slope. In other words, in addition to emphasizing the spectral peaks and valleys, the term has a low-pass characteristic as a side effect. The terms [1 − P(z/β)] and [1 − μz⁻¹] correct this low-pass characteristic. [1 − P(z/β)] is a filter with zeros at the same frequencies as the poles of the LPC spectrum, and it compensates for the spectral tilt introduced by 1/[1 − P(z/α)]. [1 − μz⁻¹] is a simple high-pass filter that is adjusted to remove the remaining spectral tilt. G is a gain for power adjustment that prevents the power from changing before and after filtering, and it can be determined automatically from the input waveform by the method disclosed in Document 3.
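
A minimal time-domain sketch of such a filter is shown below, assuming the transfer function H(z) = G · [1 − P(z/β)] / [1 − P(z/α)] · [1 − μz⁻¹] assembled from the terms described above, with P(z) = Σ a_i z⁻ⁱ. The function name is hypothetical, and the gain G is computed here by simple power matching over the whole input, which is an assumption standing in for the method of Document 3.

```python
import math

def formant_emphasis(x, lpc, alpha, beta, mu):
    """Apply H(z) = G * (1 - P(z/beta)) / (1 - P(z/alpha)) * (1 - mu z^-1).

    x   : input samples
    lpc : [a_1, ..., a_M] with P(z) = sum_i a_i z^-i
    """
    # P(z/g) = sum_i a_i g^i z^-i, so each coefficient is scaled by g^i.
    num = [1.0] + [-a * beta ** (i + 1) for i, a in enumerate(lpc)]
    den = [1.0] + [-a * alpha ** (i + 1) for i, a in enumerate(lpc)]
    # Pole-zero section: y[n] = sum_k num[k] x[n-k] - sum_{k>=1} den[k] y[n-k]
    y = []
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    # Tilt compensation: first-order high-pass (1 - mu z^-1).
    z = [y[n] - mu * (y[n - 1] if n > 0 else 0.0) for n in range(len(y))]
    # Gain G (assumption): rescale so that output power equals input power.
    pin, pout = sum(v * v for v in x), sum(v * v for v in z)
    g = math.sqrt(pin / pout) if pout > 0.0 else 1.0
    return [g * v for v in z]
```

Setting beta equal to alpha and mu to 0 cancels the numerator against the denominator, so the sketch returns the input unchanged, which is a convenient sanity check.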

With this formant emphasis filter, the degree of formant emphasis can be changed by changing the parameter α (appropriate values of β and μ that compensate for the low-pass characteristic must also be chosen according to the value of α). The closer α is to 1, the stronger the emphasis; the emphasis weakens as α decreases, and there is almost no emphasis when α is 0.5 or less. How strongly the formants should be emphasized depends on the characteristics of the speech waveform, but there is no objective measure for deciding this. For that reason, when a formant emphasis filter is used in speech coding or speech synthesis, the degree of emphasis is usually determined experimentally, for example by subjective evaluation.

However, in synthesis methods that select and fuse a plurality of units, the degree to which fusion blunts the spectral envelope can vary greatly from segment to segment. If the same parameters are applied to an entire sentence or the like, formant emphasis is insufficient where fusion has blunted the spectral envelope severely, whereas the formants are over-emphasized where the blunting is slight, resulting in an artificial sound.

In the present embodiment, therefore, the formant emphasis degree estimation unit 46 estimates an appropriate degree of formant emphasis for each fused speech unit (or for each pitch waveform of each fused unit), and the formant emphasis filter unit 45 changes the coefficients of the formant emphasis filter according to the estimated degree. That is, the degree of formant emphasis is controlled adaptively for each fused unit (or for each pitch waveform of a fused unit). The formant emphasis degree supplied by the formant emphasis degree estimation unit 46 may vary continuously, for example from 0 (no emphasis) to 100 (the strongest emphasis within the controllable range of the filter), or it may be discrete, for example specified in five levels from 0 (no emphasis) to 4 (very strong emphasis). When the formant emphasis filter disclosed in Document 3 is used, α is brought close to 1 when the estimated formant emphasis degree is large and close to 0.5 when it is small. The values of β and μ are also changed according to α; appropriate values of β and μ for each value of α can be determined experimentally.
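
As a rough illustration of how an emphasis degree might be turned into the filter parameter α, the sketch below interpolates linearly between 0.5 and an upper value. The linear form, the upper value 0.95, and the function name are all assumptions; the text only says that the mapping, like the matching β and μ, is found experimentally.

```python
def alpha_from_degree(degree, degree_max=100.0, alpha_min=0.5, alpha_max=0.95):
    """Linearly map an emphasis degree in [0, degree_max] to alpha.

    degree 0          -> alpha_min (almost no emphasis)
    degree degree_max -> alpha_max (strongest emphasis; value is hypothetical)
    """
    d = min(max(degree, 0.0), degree_max) / degree_max
    return alpha_min + (alpha_max - alpha_min) * d
```

The same function covers the five-level discrete case by passing `degree_max=4`.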

また、フォルマント強調度合い推定部４６で推定されたフォルマント強調度合いを、フィルタ係数に具体的に反映するためのマッピングは、主観評価などによって実験的に得ることができる。 Also, mapping for specifically reflecting the formant enhancement degree estimated by the formant enhancement degree estimation unit 46 in the filter coefficient can be obtained experimentally by subjective evaluation or the like.

本実施形態においては、文献３で開示されているフォルマント強調フィルタを用いる場合について説明したが、フォルマントが強調でき、フォルマントの強調度合いがパラメータなどで制御できるフォルマント強調フィルタであれば、いかなるものでも用いることができる。 In this embodiment, the case where the formant emphasis filter disclosed in Document 3 is used has been described. However, any formant emphasis filter can be used as long as the formant can be emphasized and the emphasis degree of the formant can be controlled by a parameter or the like. be able to.

<Formant emphasis degree estimation unit>
The formant emphasis degree estimation unit 46 estimates an appropriate formant emphasis degree for each fused unit, based on information about the fused unit and the plurality of source speech units supplied from the unit selection unit 43 and the unit fusion unit 44, and outputs the estimated degree to the formant emphasis filter unit 45.

前述のように、ある波形に対して適切なフォルマント強調度合いを決めるような客観尺度は存在しないが、融合素片と融合元の複数の音声素片の間でスペクトル包絡に関する特徴量を比較することによって、音声素片の融合によってどの程度スペクトル包絡が鈍ったかをある程度見積もることは可能である。そこで、フォルマント強調度合い推定部４６では、融合によるスペクトル包絡の鈍り具合を以下のような方法で推定し、これに基づいてフォルマントの強調度合いを決める。 As described above, there is no objective scale that determines the appropriate formant enhancement level for a certain waveform, but the feature values related to the spectral envelope are compared between the fusion unit and the speech units of the fusion source. Thus, it is possible to estimate to some extent how much the spectral envelope has become dull due to the fusion of speech segments. Therefore, the formant emphasis degree estimation unit 46 estimates the dullness of the spectrum envelope due to the fusion by the following method, and determines the formant emphasis degree based on this.

融合によるスペクトル包絡の鈍りが大きいほど、融合元の各音声素片と融合素片との間でスペクトル包絡の形状の差が大きくなると考えられる。そこで、融合元の各音声素片と融合素片との間でのスペクトル包絡の形状の差を見積もることができれば、音声素片の融合によるスペクトル包絡の鈍り具合を推定できると考えられる。 It is considered that the difference in the shape of the spectral envelope between each speech unit and the fusion unit becomes larger as the spectral envelope becomes dull due to the fusion. Therefore, if the difference in the shape of the spectrum envelope between each speech unit of the fusion source and the fusion unit can be estimated, it is considered that the dullness of the spectrum envelope due to the fusion of the speech units can be estimated.

Parameters that represent the characteristics of the spectral envelope include the cepstrum and LSP (line spectrum pairs). The following describes, as an example, a case in which mel-frequency cepstral coefficients (MFCC), one kind of cepstrum, are used to estimate indirectly the difference in spectral envelope shape between each source speech unit and the fused unit.

MFCC is an acoustic feature widely used in speech recognition, and in speech synthesis it is also often used as the evaluation measure for the "spectrum connection cost" mentioned above. MFCC takes human auditory characteristics into account and has the advantage of representing the spectral envelope well even with few dimensions. The low-order MFCC coefficients represent the overall shape of the spectral envelope, and the high-order coefficients represent its details. If the i-th order MFCCs of unit 1 and unit 2 are c_1i and c_2i, respectively, the MFCC distance between unit 1 and unit 2 can be calculated by Equation (3).

ただし、ｐはＭＦＣＣの次元を表す。 However, p represents the dimension of MFCC.

In this example, the MFCC has about 20 dimensions.

次に、このＭＦＣＣ距離を使って、音声素片の融合によるスペクトル包絡の鈍り具合を推定する方法について説明する。 Next, a method of estimating the dullness of the spectrum envelope due to the fusion of speech units using the MFCC distance will be described.

図８に、この場合の処理手順の一例を示す。 FIG. 8 shows an example of the processing procedure in this case.

Here, N denotes the number of source units from which the fused unit was generated.

First, the MFCC of the fused unit (c_0) is calculated (step S301).

Next, the counter i is initialized to 1 and D_sum is initialized to 0 (steps S302 and S303), and the process proceeds to step S304.

In step S304, the MFCC (c_i) of the i-th of the N source speech units is calculated.

Next, the MFCC distance (D_i) between c_0 and c_i is calculated using Equation (3) (step S304).

In the next step S305, the calculated D_i is added to D_sum, and the process proceeds to step S307.

In step S307, it is determined whether the counter i is less than N.

If the counter i is less than N (YES in step S307), the process proceeds to step S308, where the counter i is incremented by one, and then returns to step S304 to process the next speech unit.

If the counter i has reached N (NO in step S307), the loop ends and the process proceeds to step S309.

In step S309, the average MFCC distance (D_mean) is obtained by dividing D_sum by N, and the processing ends.
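
The procedure of FIG. 8 amounts to averaging a distance over the N source units, as sketched below. The MFCC vectors are assumed to be computed elsewhere, and because the exact form of the distance is not reproduced in this text, a Euclidean distance is used purely as an assumption. The function names are hypothetical.

```python
import math

def mfcc_distance(c1, c2):
    """Euclidean distance between two MFCC vectors of equal dimension p
    (assumed form of the MFCC distance)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(c1, c2)))

def mean_mfcc_distance(c0, source_mfccs):
    """Average distance between the fused unit's MFCC c0 and the MFCC of
    each of the N source units (the loop of FIG. 8)."""
    d_sum = 0.0
    for ci in source_mfccs:  # i = 1 .. N
        d_sum += mfcc_distance(c0, ci)
    return d_sum / len(source_mfccs)
```

A larger return value is then read as stronger blunting of the spectral envelope by fusion.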

本実施形態では、このようにして求めた平均ＭＦＣＣ距離を、融合によるスペクトル包絡の鈍り具合を反映する評価尺度として用いる。すなわち、平均ＭＦＣＣ距離が小さいほどスペクトル包絡の鈍り具合が小さく、平均ＭＦＣＣ距離が大きいほどスペクトル包絡の鈍り具合が大きいとして、平均ＭＦＣＣ距離をそのままスペクトル包絡の鈍り具合とするか、平均ＭＦＣＣ距離の分布などに基づいて何らかの変換を行って得た値をスペクトル包絡の鈍り具合とする。 In the present embodiment, the average MFCC distance obtained in this way is used as an evaluation scale that reflects the dullness of the spectral envelope due to fusion. That is, the smaller the average MFCC distance, the smaller the spectral envelope dullness, and the larger the average MFCC distance, the larger the spectral envelope dullness. The average MFCC distance is used as it is or the distribution of the average MFCC distance. A value obtained by performing some conversion based on the above is defined as a dullness of the spectrum envelope.

次に、このようにして得たスペクトル包絡の鈍り具合に基づいて、フォルマント強調度合いを求める必要があるが、スペクトル包絡の鈍り具合が大きいほど強いフォルマント強調を施すべきと考えられるため、ここでは、スペクトル包絡の鈍り具合が増すとともに単調増加するような関数（ただし、フォルマント強調度合いが離散値の場合は、階段状に変化）を用いてフォルマント強調度合いに変換する。関数の形状については、例えば、スペクトル包絡の鈍り具合に対して途中まで線形に増加し、ある閾値を超えるとフォルマント強調度合いの上限値をとるようなものであっても良いし、シグモイド関数のように増加率がスペクトル包絡の鈍り具合に応じて変化するような形状のものであっても良く、それらの関数のパラメータ（傾き、など）は実験的に適切なものを得れば良い。 Next, it is necessary to obtain the formant emphasis degree based on the spectral envelope dullness obtained in this way, but it is thought that stronger formant emphasis should be applied as the spectral envelope dullness is greater, so here, The function is converted into a formant enhancement degree using a function that increases monotonously as the spectral envelope becomes dull (however, if the formant enhancement degree is a discrete value, it changes stepwise). As for the shape of the function, for example, it may increase linearly halfway with respect to the degree of bluntness of the spectrum envelope, and when it exceeds a certain threshold, it may take the upper limit of the formant enhancement degree, or it may be a sigmoid function. In addition, it may be of a shape such that the increase rate changes according to the degree of dullness of the spectrum envelope, and the parameters (slope, etc.) of those functions may be obtained experimentally.
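
The two function shapes mentioned above, and the stepwise variant for discrete degrees, might look like the following sketch. The threshold, center, and slope values are parameters to be tuned experimentally, as the text says; the function names and the 0 to 100 range used as a default are assumptions.

```python
import math

def degree_linear(dullness, threshold, degree_max=100.0):
    """Rise linearly up to `threshold`, then stay at the upper limit."""
    if dullness <= 0.0:
        return 0.0
    return degree_max * min(dullness / threshold, 1.0)

def degree_sigmoid(dullness, center, slope, degree_max=100.0):
    """Sigmoid-shaped mapping whose growth rate varies with the dullness."""
    return degree_max / (1.0 + math.exp(-slope * (dullness - center)))

def quantize(degree, levels=5, degree_max=100.0):
    """Discrete variant: map to an integer level 0 .. levels-1 (stepwise)."""
    return min(levels - 1, int(degree / degree_max * levels))
```

Both continuous mappings are monotonically non-decreasing in the dullness, which is the only property the text requires.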

なお、本実施形態においては、融合によるスペクトル包絡の鈍り具合を推定する方法の一例として、上記のＭＦＣＣを用いる方法を例にとって説明したが、スペクトル包絡の形状の差を適切に見積もれる音響パラメータであれば、どのようなものを用いてもよい。例えば、ＬＳＰ係数の二乗誤差を用いても良いし、ＦＦＴ（高速フーリエ変換）によって得られたＦＦＴスペクトルを確率分布のように見なすことによって、確率分布の差を計算するのによく用いられるＫＬ距離（Kullback-Leibler距離）を算出して、これを用いても良い。 In this embodiment, as an example of a method for estimating the dullness of the spectrum envelope due to fusion, the method using the above MFCC has been described as an example. However, the acoustic parameters that can appropriately estimate the difference in the shape of the spectrum envelope are used. Any one may be used as long as it is present. For example, the square error of the LSP coefficient may be used, or the KL distance often used to calculate the difference in probability distribution by regarding the FFT spectrum obtained by FFT (Fast Fourier Transform) as a probability distribution. (Kullback-Leibler distance) may be calculated and used.
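
The KL-distance alternative can be sketched by normalizing two power spectra so that each sums to one and then applying the usual divergence formula. The power spectra are assumed to come from an FFT computed elsewhere; the small constant `eps` and the function name are assumptions added to keep the sketch safe against empty bins.

```python
import math

def kl_distance(power_a, power_b, eps=1e-12):
    """KL divergence D(a || b) between two power spectra treated as
    probability distributions over frequency bins."""
    sa, sb = sum(power_a), sum(power_b)
    total = 0.0
    for a, b in zip(power_a, power_b):
        p = a / sa + eps  # normalized bin of spectrum a
        q = b / sb + eps  # normalized bin of spectrum b
        total += p * math.log(p / q)
    return total
```

The divergence is not symmetric, so a practical measure might average D(a || b) and D(b || a); that choice is not specified in the text.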

また、融合によるスペクトル包絡の鈍り具合を推定する方法として、素片選択部４３で算出された目標コストを用いる方法も考えられる。融合元の複数の音声素片がいずれも適切な音韻・韻律環境から選ばれた場合、目標コストは小さくなり、かつ、融合によるスペクトル包絡の鈍り具合も小さくなると考えられる。逆に、目標の音韻・韻律環境と異なる音声素片ばかりが選ばれた場合、目標コストは大きくなり、融合によるスペクトル包絡の鈍り具合も大きくなると考えられる。そこで、融合によるスペクトル包絡の鈍り具合を表す一つの指標として、融合元の音声素片が選ばれた際の目標コストを用いてもよいと考えられる。この方法は、前述の音響パラメータを用いる方法よりは間接的だが、非常に単純である。 Further, as a method for estimating the dullness of the spectrum envelope due to fusion, a method using the target cost calculated by the segment selection unit 43 is also conceivable. When a plurality of speech units of the fusion source are all selected from an appropriate phoneme / prosodic environment, it is considered that the target cost is reduced and the degree of blunting of the spectral envelope due to the fusion is also reduced. Conversely, if only speech segments that are different from the target phoneme / prosodic environment are selected, the target cost will increase, and the degree of blunting of the spectral envelope due to fusion will also increase. Therefore, it is considered that the target cost when the fusion source speech unit is selected may be used as one index indicating the degree of dullness of the spectrum envelope due to fusion. This method is indirect but much simpler than the method using the acoustic parameters described above.

フォルマント強調度合い推定部４６は、上述のようにして推定した、融合によるスペクトル包絡の鈍り具合を、フォルマント強調フィルタ部４５に出力する。 The formant emphasis degree estimation unit 46 outputs to the formant emphasis filter unit 45 the spectral envelope dullness caused by the fusion estimated as described above.

<Unit editing/connecting unit>
The unit editing/connecting unit 47 generates the speech waveform of the synthesized speech by modifying the speech unit of each segment passed from the formant emphasis filter unit 45 according to the input prosodic information and connecting the results.

図９は、素片編集・接続部４７での処理を説明するための図である。図９には、フォルマント強調部４５から入力された、音素「ａ」「Ｎ」「ｓ」「ａ」「ａ」の各合成単位に対する音声素片を、変形・接続して、「ａＮｓａａ」という音声波形を生成する場合を示している。 FIG. 9 is a diagram for explaining processing in the segment editing / connecting unit 47. In FIG. 9, speech units for the synthesis units of phonemes “a”, “N”, “s”, “a”, and “a” input from the formant emphasizing unit 45 are transformed and connected, and are referred to as “aNsaa”. The case where a speech waveform is generated is shown.

この例では、有声音の音声素片はピッチ波形の系列で表現されている。一方、無声音の音声素片は、フレーム毎の波形として表現されている。 In this example, a voiced speech segment is represented by a series of pitch waveforms. On the other hand, an unvoiced speech segment is represented as a waveform for each frame.

図９の点線は、目標の音韻継続時間長に従って分割した音素毎のセグメントの境界を表し、白い三角は、目標の基本周波数に従って配置した各ピッチ波形を重畳する位置（ピッチマーク）を示している。 A dotted line in FIG. 9 represents a segment boundary for each phoneme divided according to the target phoneme duration, and a white triangle represents a position (pitch mark) at which each pitch waveform arranged according to the target fundamental frequency is superimposed. .

図９のように、有声音については音声素片のそれぞれのピッチ波形を対応するピッチマーク上の重畳し、無声音については各フレームの波形をセグメント中の各フレームに対応する部分に貼り付けることによって、所望の韻律（ここでは、基本周波数、音韻継続時間長）を持った音声波形を生成する。 As shown in FIG. 9, for voiced sound, each pitch waveform of the speech unit is superimposed on the corresponding pitch mark, and for unvoiced sound, the waveform of each frame is pasted on the part corresponding to each frame in the segment. Then, a speech waveform having a desired prosody (here, fundamental frequency, phoneme duration length) is generated.
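
For voiced segments, the superposition described above is an overlap-add of pitch waveforms at the target pitch marks. The sketch below assumes that each pitch waveform is centered on its pitch mark and that the target marks have already been placed according to the target fundamental frequency; the function name is hypothetical, and unvoiced frames are not handled.

```python
def overlap_add(pitch_waveforms, pitch_marks, length):
    """Superimpose each pitch waveform centered on its target pitch mark.

    pitch_waveforms : list of sample lists, one per pitch mark
    pitch_marks     : target sample positions (white triangles in FIG. 9)
    length          : number of samples in the output waveform
    """
    out = [0.0] * length
    for wave, mark in zip(pitch_waveforms, pitch_marks):
        start = mark - len(wave) // 2
        for k, v in enumerate(wave):
            idx = start + k
            if 0 <= idx < length:  # ignore samples falling outside the output
                out[idx] += v
    return out
```

Placing the marks closer together raises the fundamental frequency of the result, and placing them further apart lowers it, without changing the spectral envelope carried by each pitch waveform.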

As described above, according to the present embodiment, formant emphasis of appropriate strength is applied to each segment according to how much the formants have been blunted by unit fusion, so that high-quality synthesized speech can be generated that has little muffled or buzzy quality and does not sound artificial.

(Second Embodiment)
A text-to-speech synthesizer (speech processing device) that performs text-to-speech synthesis according to a second embodiment of the present invention will be described.

第１の実施形態では、音声素片の融合処理およびフォルマント強調の処理に大きな計算量を要するため、ＣＰＵスペックが比較的低いミドルウェア向けの応用などには適用が向かないこともあり得る。 In the first embodiment, a large amount of calculation is required for the speech unit fusion processing and the formant emphasis processing. Therefore, the first embodiment may not be applicable to middleware applications with relatively low CPU specifications.

そこで、本実施形態では、音声素片の融合およびフォルマント強調の処理を予め行った音声素片をオフラインで作成しておき、実際の動作時には、こうして作成された音声素片から適切な音声素片を選択して接続するだけの処理で合成波形を生成する。 Therefore, in this embodiment, a speech unit in which speech unit fusion and formant emphasis processing has been performed in advance is created offline, and an appropriate speech unit is created from the created speech unit during actual operation. A composite waveform is generated simply by selecting and connecting.

本実施形態に係るテキスト音声合成装置の全体構成例は、図１と同様であり、テキスト入力部１、言語処理部２、韻律処理部３、音声合成部４を備えている。 The entire configuration example of the text-to-speech synthesizer according to this embodiment is the same as that in FIG. 1, and includes a text input unit 1, a language processing unit 2, a prosody processing unit 3, and a speech synthesis unit 4.

図１０に、本実施形態の音声合成部４の構成例を示す。 FIG. 10 shows a configuration example of the speech synthesis unit 4 of the present embodiment.

以下、図１０を参照しながら、本実施形態について、第１の実施形態と相違する点を中心に説明する。 Hereinafter, with reference to FIG. 10, the present embodiment will be described focusing on differences from the first embodiment.

図１０に示されるように、本実施形態の音声合成部４は、情報入力部４１、素片記憶部４２、素片選択部４３、素片編集・接続部４７、音声波形出力部４８を備えている。 As shown in FIG. 10, the speech synthesizer 4 of the present embodiment includes an information input unit 41, a segment storage unit 42, a segment selection unit 43, a segment editing / connection unit 47, and a speech waveform output unit 48. ing.

第１の実施形態（図２）と比較すると、本実施形態の音声合成部４は、図２の素片融合部４４、フォルマント強調フィルタ部４５、フォルマント強調度合い推定部４６が省かれている。 Compared to the first embodiment (FIG. 2), the speech synthesizer 4 of this embodiment does not include the unit fusion unit 44, the formant enhancement filter unit 45, and the formant enhancement degree estimation unit 46 of FIG.

また、本実施形態の素片記憶部４２には、後述の方法によって生成された融合済みの音声素片が格納されている。 In the segment storage unit 42 of the present embodiment, fused speech segments generated by a method described later are stored.

Whereas the unit selection unit 43 of the first embodiment selects a plurality of speech units for each segment, the unit selection unit 43 of the present embodiment selects an optimal sequence of fused speech units, one for each segment.

As for the operation of the unit selection unit 43, compared with the first embodiment when the flowchart of FIG. 5 is used, the present embodiment only needs to execute steps S101 and S102 of that flowchart. Of course, the method of selecting the optimal sequence of fused speech units, one per segment, is not limited to this, and various methods are possible.

なお、素片編集・接続部４７および音声波形出力部４８の動作は、第１の実施形態のものと同様である。 The operations of the segment editing / connecting unit 47 and the speech waveform output unit 48 are the same as those in the first embodiment.

次に、音声素片記憶部４２に格納する融合済みの音声素片を学習する方法について、図１１及び図１２を参照しながら説明する。 Next, a method for learning the fused speech unit stored in the speech unit storage unit 42 will be described with reference to FIGS. 11 and 12.

本実施形態では、融合済みの音声素片を作成する融合済み音声素片作成部５を用いる。融合済み音声素片作成部５は、図１０のテキスト音声合成装置に含まれても良い。この場合、テキスト音声合成に供するための「フォルマント強調された融合素片」の生成時には、図１の音声合成部４を融合済み音声素片作成部５に置き換えた構成で用いれば良い。 In the present embodiment, a fused speech element creation unit 5 that creates a fused speech element is used. The fused speech element creation unit 5 may be included in the text-to-speech synthesizer of FIG. In this case, when the “formant-emphasized fused segment” for use in text-to-speech synthesis is generated, the speech synthesizer 4 in FIG. 1 may be replaced with the fused speech segment creation unit 5.

また、融合済み音声素片作成部５は、テキスト音声合成装置に含まれなくても良い。この場合、例えば、融合済み音声素片作成部５を、独立した音声処理装置（テキスト音声合成に供するための「フォルマント強調された融合素片」を生成する音声処理装置）として構成しても良い。この場合、独立した音声処理装置は、図１の音声合成部４を融合済み音声素片作成部５に置き換えた構成にすれば良い。 Further, the fused speech segment creation unit 5 may not be included in the text speech synthesizer. In this case, for example, the fused speech segment creation unit 5 may be configured as an independent speech processing device (speech processing device that generates a “formant-emphasized fused segment” for use in text speech synthesis). . In this case, the independent speech processing apparatus may have a configuration in which the speech synthesis unit 4 in FIG. 1 is replaced with the fused speech unit creation unit 5.

図１１に、融合済み音声素片作成部５の構成例を示す。 FIG. 11 shows a configuration example of the fused speech element creation unit 5.

融合済み音声素片作成部５の構成は、第１の実施形態の音声合成部４の構成とほとんど同じであるため、ここでは相違する点について説明する。 Since the configuration of the fused speech unit creation unit 5 is almost the same as the configuration of the speech synthesis unit 4 of the first embodiment, only the differences will be described here.

融合済み音声素片作成部５は、第１の実施形態の音声合成部４の素片編集・接続部４７および音声波形出力部４８の代わりに、音声素片出力部４９を持つ。第１の実施形態の音声合成部４の素片編集・接続部４７および音声波形出力部４８は、フォルマント強調部４５から入力された各セグメントに対する音声素片を接続して、入力テキストに対する合成波形を生成するのに対し、音声素片出力部４９は、フォルマント強調部４５から入力された音声素片をそのまま出力する。 The fused speech unit creating unit 5 has a speech unit output unit 49 instead of the unit editing / connecting unit 47 and the speech waveform output unit 48 of the speech synthesis unit 4 of the first embodiment. The segment editing / connecting unit 47 and the speech waveform output unit 48 of the speech synthesizer 4 of the first embodiment connect speech segments for each segment input from the formant emphasizing unit 45 to generate a synthesized waveform for the input text. The speech unit output unit 49 outputs the speech unit input from the formant emphasizing unit 45 as it is.

すなわち、融合済み音声素片作成部５は、音声素片（フォルマント強調された融合素片）を、図１０のテキスト音声合成装置の音声素片記憶部４２へ出力し、音声素片（フォルマント強調された融合素片）は、音声素片記憶部４２に記憶される。 That is, the fused speech unit creating unit 5 outputs the speech unit (formant-enhanced fused unit) to the speech unit storage unit 42 of the text-to-speech synthesizer in FIG. The fused segment) is stored in the speech segment storage unit 42.

次に、音声素片記憶部４２に格納する融合済みの音声素片を学習する手順について説明する。 Next, a procedure for learning the fused speech unit to be stored in the speech unit storage unit 42 will be described.

図１２に、この場合の処理手順の一例を示す。 FIG. 12 shows an example of the processing procedure in this case.

まず、ステップＳ５０１において、融合済み音声素片作成部５を備えたテキスト音声合成装置又は独立した音声処理装置に対して、大量の文を入力する。 First, in step S501, a large number of sentences are input to the text-to-speech synthesizer provided with the fused speech unit creation unit 5 or an independent speech processing device.

次に、ステップＳ５０２において、入力された各文の各セグメントに対して生成された融合済み音声素片が、融合済み音声素片生成部５から出力される。 Next, in step S502, the fused speech units generated for each segment of each input sentence are output from the fused speech unit creation unit 5.

次に、ステップＳ５０３において、外部から指定された音声素片記憶部４２に格納する音声素片の総数のうち、それぞれの素片種別に対して幾つずつ配分するかを決める。 Next, in step S503, it is determined how the externally specified total number of speech units to be stored in the speech unit storage unit 42 is allocated among the unit types.

ここで、素片種別とは、音声素片の音韻環境などで分類された種別を指す。例えば、素片種別／ａ／は、音素／ａ／に対応する音声素片のこととする。 Here, the unit type refers to a type classified by the phoneme environment of the speech unit. For example, the segment type / a / is a speech segment corresponding to the phoneme / a /.

各素片種別に何個ずつ素片を配分するかは、各素片種別の音声素片の出現頻度などに応じて決める。例えば、素片種別／ａ／の素片が素片種別／ｕ／の素片よりも出現頻度が高い場合は、素片種別／ａ／に多めの素片を配分することとする。 The number of segments allocated to each segment type is determined according to the appearance frequency of the speech segment of each segment type. For example, when the segment type / a / has a higher appearance frequency than the segment type / u / segment, a larger number of segments is allocated to the segment type / a /.

素片種別ｉに配分する音声素片の個数をＮ_ｉとする。 Let N_i denote the number of speech units allocated to unit type i.

次に、ステップＳ５０４において、素片種別番号ｉに初期値１をセットする。 Next, in step S504, the unit type number i is set to its initial value 1.

次に、ステップＳ５０５において、素片種別ｉの融合済み音声素片を、ステップＳ５０２で出力された素片種別ｉの音声素片の中から、出現頻度が上位のものをＮ_ｉずつ抽出する。 Next, in step S505, the N_i fused speech units with the highest appearance frequency are extracted from the speech units of unit type i output in step S502.

次に、ステップＳ５０６において、ｉと素片種別数を比較する。 Next, in step S506, i is compared with the number of unit types.

ｉが素片種別数以下である場合（ステップＳ５０６のＹＥＳ）には、ステップＳ５０７に進んで、ｉの値を１つ増やし、そして、ステップＳ５０５〜ステップＳ５０６を繰り返す。 If i is less than or equal to the number of element types (YES in step S506), the process proceeds to step S507, the value of i is incremented by 1, and steps S505 to S506 are repeated.

ｉが素片種別数を超えている場合（すなわち、全ての素片種別に対する処理が完了している場合）（ステップＳ５０６のＮＯ）には、全ての処理を終了する。 When i exceeds the number of segment types (that is, when processing for all segment types is completed) (NO in step S506), all processing is terminated.

上記のようにして抽出した融合済み音声素片を、音声素片記憶部４２に格納する。 The fused speech unit extracted as described above is stored in the speech unit storage unit 42.
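
As an illustration only, steps S503 to S507 can be sketched in Python as follows. The sketch assumes that each fused speech unit carries a hashable identifier (for example, the tuple of source unit IDs it was fused from) so that occurrences can be counted. The function names, the proportional allocation rule, and the minimum of one unit per type are assumptions made for the example, not details given in the embodiment.

```python
from collections import Counter


def allocate_counts(type_frequency, total_units):
    """S503: split total_units across unit types in proportion to frequency.

    Assumption: floor of the proportional share, but at least one unit per
    type, so the sum may differ slightly from total_units.
    """
    total_freq = sum(type_frequency.values())
    return {
        unit_type: max(1, total_units * freq // total_freq)
        for unit_type, freq in type_frequency.items()
    }


def select_frequent_units(generated, total_units):
    """S504-S507: keep the N_i most frequent fused units of each unit type.

    generated: list of (unit_type, unit_id) pairs, one per segment of the
    input sentences (hypothetical representation of the S502 output).
    """
    per_type = {}
    for unit_type, unit_id in generated:
        per_type.setdefault(unit_type, Counter())[unit_id] += 1

    type_frequency = {t: sum(c.values()) for t, c in per_type.items()}
    counts = allocate_counts(type_frequency, total_units)

    return {
        t: [unit_id for unit_id, _ in per_type[t].most_common(counts[t])]
        for t in per_type
    }
```

With six occurrences of type /a/ and three of type /u/, a budget of three units gives two slots to /a/ and one to /u/ under this allocation rule.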

ここで、音声素片記憶部４２に格納するために選択する音声素片の個数は、トータルでの音声素片サイズと合成音声の音質とのトレードオフで、任意に決めることができる。より多くの音声素片を選択して格納すれば、サイズは大きくなるが、合成音声の音質を高くすることができ、音声素片の数を減らせば、合成音声の音質は犠牲になるが、サイズを小さくすることができる。 Here, the number of speech units selected for storage in the speech unit storage unit 42 can be arbitrarily determined by a trade-off between the total speech unit size and the quality of the synthesized speech. If you select and store more speech units, the size will increase, but the quality of the synthesized speech can be increased, and if you reduce the number of speech units, the quality of the synthesized speech will be sacrificed, The size can be reduced.

なお、上記では出現頻度の高い素片を抽出する方法を説明したが、音声素片の両端で算出したメルケプストラムなどの音声素片の特徴量を用いて抽出しても良い。 In addition, although the method of extracting the segment with high appearance frequency has been described above, it may be extracted using the feature amount of the speech unit such as a mel cepstrum calculated at both ends of the speech unit.

この場合、各素片種別に対して出力された融合済み音声素片をそれぞれ、音声素片の特徴量を用いてクラスタリングし、分割された各クラスタの中心（セントロイド）に最も近い素片を抽出する。クラスタリングにおけるクラスタ数は、各素片種別に配分する素片数に応じて決める。 In this case, the fused speech units output for each unit type are clustered using the feature values of the speech units, and the unit closest to the center (centroid) of each divided cluster is selected. Extract. The number of clusters in clustering is determined according to the number of segments allocated to each segment type.

出現頻度に基づいて素片を抽出する場合は、出現頻度が低いコンテキストに対して適切な素片が抽出されない可能性があり、入力テキストによっては音質が大きく劣化してしまう可能性があるが、本方法によって素片を抽出した場合、特徴量空間をできるだけ広く覆うような音声素片のセットが抽出できるため、出現頻度に基づいて抽出した場合より安定した合成音が生成できる。 When extracting a segment based on the appearance frequency, an appropriate segment may not be extracted for a context with a low appearance frequency, and depending on the input text, the sound quality may be greatly degraded. When a segment is extracted by this method, a set of speech segments that covers the feature amount space as much as possible can be extracted, so that a more stable synthesized sound can be generated than when extracted based on the appearance frequency.
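
The clustering-based extraction can be sketched as below. This is a minimal example that assumes plain k-means with Euclidean distance and a deterministic, evenly spaced initialisation; the embodiment does not specify the clustering algorithm, and the function name and parameters are hypothetical. Each feature vector stands for, e.g., the mel-cepstra computed at both ends of one fused speech unit of a given unit type.

```python
import math


def nearest_to_centroids(features, n_clusters, n_iter=20):
    """Return indices of the units closest to each k-means centroid.

    features: list of equal-length feature vectors, one per fused unit.
    n_clusters: number of units allocated to this unit type.
    """
    dim = len(features[0])
    # Assumed initialisation: evenly spaced samples (deterministic).
    step = len(features) / n_clusters
    centroids = [list(features[int(i * step)]) for i in range(n_clusters)]

    clusters = [[] for _ in range(n_clusters)]
    for _ in range(n_iter):
        # Assignment step: each unit joins its nearest centroid.
        clusters = [[] for _ in range(n_clusters)]
        for idx, vec in enumerate(features):
            nearest = min(
                range(n_clusters),
                key=lambda c: math.dist(vec, centroids[c]),
            )
            clusters[nearest].append(idx)
        # Update step: keep the old centroid if a cluster is empty.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [
                    sum(features[m][d] for m in members) / len(members)
                    for d in range(dim)
                ]

    # Pick, per cluster, the actual unit nearest to the centroid.
    selected = [
        min(members, key=lambda m: math.dist(features[m], centroids[c]))
        for c, members in enumerate(clusters)
        if members
    ]
    return sorted(selected)
```

Because one representative is taken per cluster rather than per frequency rank, sparsely populated regions of the feature space still contribute a unit.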

以上のように本実施形態によれば、複数の音声素片を融合する処理とフォルマント強調の処理を予めオフラインで行うので、第１の実施形態よりも少ない計算量で実現でき、ＣＰＵスペックが比較的低いミドルウェア向けなどの応用にも適用可能である。 As described above, according to the present embodiment, the fusion of multiple speech units and the formant emphasis are performed offline in advance, so synthesis requires less computation than in the first embodiment, and the embodiment is also applicable to uses such as middleware for CPUs with relatively low specifications.

また、合成音声の音質とのトレードオフで、格納する音声素片のトータルのサイズもスケーラブルに決めることができる。 In addition, the total size of the speech segments to be stored can be determined in a scalable manner by a trade-off with the sound quality of the synthesized speech.

（第３の実施形態） (Third embodiment)
本発明の第３の実施形態に係るテキスト音声合成装置について説明する。 A text-to-speech synthesizer according to a third embodiment of the present invention will be described.

本実施形態は、フォルマント強調度合い推定部４６の推定方法が、第１の実施形態で説明した例とは相違するものであり、以下、この相違点を中心に説明する。 In the present embodiment, the estimation method of the formant enhancement degree estimation unit 46 is different from the example described in the first embodiment, and this difference will be mainly described below.

第１の実施形態では、フォルマント強調度合い推定部４６でフォルマント強調度合いを推定する方法として、融合元の各音声素片と融合素片の間でのスペクトル包絡の差を算出することによって推定する方法を説明したが、融合元の各音声素片と融合素片との間でのスペクトル包絡の差と、融合によるスペクトル包絡の鈍り具合の間には、高い相関はあると考えられるものの、直接的な関係があるわけではない。そこで、スペクトル包絡の鈍り具合を、より直接的に求められる方法があれば、より確度の高い推定を行うことが可能と考えられる。 In the first embodiment, the formant emphasis degree estimation unit 46 estimates the formant emphasis degree by calculating the difference in spectral envelope between each source speech unit and the fused unit. Although this difference is considered to be highly correlated with the degree to which fusion blunts the spectral envelope, the two are not directly related. A method that determines the bluntness of the spectral envelope more directly should therefore allow more accurate estimation.

その一つの方法として、線形予測極（ＬＰ極）を用いる方法が考えられる。ＬＰ極は、数式（２）のＰ（ｚ）について（１−Ｐ（ｚ））＝０とおいたときに得られる解（複素数）のことで、この解のｚ平面上での位置と単位円との関係から、各フォルマントの周波数とバンド幅を推定することができる。それぞれの極が各フォルマントに対応すると考えられ、ｉ番目の極に関して、極と原点を結ぶ線の角度をθ_ｉ、極と原点の距離をｒ_ｉとした場合、ｉ番目のフォルマントの周波数Ｆ_ｉおよびバンド幅ＢＷ_ｉは、数式（４）のように推定できる。 One such method uses linear prediction poles (LP poles). The LP poles are the (complex) solutions of (1−P(z)) = 0 for P(z) in Equation (2); the frequency and bandwidth of each formant can be estimated from the position of each solution in the z-plane relative to the unit circle. Each pole is considered to correspond to one formant. For the i-th pole, let θ_i be the angle of the line connecting the pole and the origin and r_i the distance between the pole and the origin; the frequency F_i and bandwidth BW_i of the i-th formant can then be estimated as in Equation (4).
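
Equation (4) is not reproduced in this text. As a hedged illustration, the sketch below uses the relation commonly used in LPC formant analysis, F_i = θ_i·f_s/(2π) and BW_i = −ln(r_i)·f_s/π, where f_s is the sampling frequency; whether Equation (4) has exactly this form is an assumption, and the function name is hypothetical.

```python
import cmath
import math


def formant_from_pole(pole, sample_rate):
    """Estimate formant frequency and bandwidth (both in Hz) from one LP pole.

    Assumed relation (standard in LPC analysis, not confirmed as Eq. (4)):
        F  = theta * fs / (2 * pi)
        BW = -ln(r) * fs / pi
    """
    theta = cmath.phase(pole)  # angle of the line from the origin to the pole
    radius = abs(pole)         # distance between the pole and the origin
    frequency = theta * sample_rate / (2.0 * math.pi)
    bandwidth = -math.log(radius) * sample_rate / math.pi
    return frequency, bandwidth
```

Under this relation, a pole closer to the unit circle (r nearer 1) yields a narrower bandwidth, i.e. a sharper formant peak.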

このようにして推定した各フォルマントの周波数とバンド幅を用いれば、スペクトル包絡のうち、特にフォルマントに関する鈍り具合がより正確に推定できると考えられる。 If the frequency and bandwidth of each formant estimated in this way are used, it is considered that the degree of bluntness related to the formant in the spectral envelope can be estimated more accurately.

以下、ＬＰ極から推定される各フォルマントのバンド幅を用いて、スペクトル包絡の鈍り具合を推定する方法の一例を、図１３を参照しながら説明する。 Hereinafter, an example of a method of estimating the dullness of the spectrum envelope using the bandwidth of each formant estimated from the LP pole will be described with reference to FIG.

図１３に、ＬＰ極から推定される各フォルマントのバンド幅を用いてスペクトル包絡の鈍り具合を推定する手順の一例を示す。 FIG. 13 shows an example of a procedure for estimating the dullness of the spectral envelope using the bandwidth of each formant estimated from the LP pole.

まず、ステップＳ６０１において、融合素片のＬＰ極を算出する。具体的には、融合素片の音声波形に対してＬＰＣ分析を行い、得られた線形予測係数を係数に持つ数式（２）のＰ（ｚ）について、（１−Ｐ（ｚ））＝０とおいたときの解を得る。 First, in step S601, the LP poles of the fused unit are calculated. Specifically, LPC analysis is performed on the speech waveform of the fused unit, and the solutions of (1−P(z)) = 0 are obtained for P(z) in Equation (2) whose coefficients are the resulting linear prediction coefficients.

次のステップＳ６０２では、融合元の音声素片それぞれに対するＬＰ極を、ステップＳ６０１と同様の方法で算出する。 In the next step S602, the LP pole for each of the fusion source speech units is calculated by the same method as in step S601.

次に、ステップＳ６０３では、フォルマントバンド幅比率の和Ｒ_ｓｕｍを０に、ステップＳ６０４では、用いたＬＰ極の個数Ｎ_ＬＰを０に、ステップＳ６０５では、カウンターｉを１に、それぞれ初期化して、ステップＳ６０６に進む。 Next, the sum of formant bandwidth ratios R_sum is initialized to 0 in step S603, the number of LP poles used N_LP is initialized to 0 in step S604, and the counter i is initialized to 1 in step S605; the process then proceeds to step S606.

ステップＳ６０６では、融合素片のｉ番目のＬＰ極が実軸上（すなわち虚数項が０）かどうかを判定し、実軸上である場合（ステップＳ６０６のＹＥＳ）には、ステップＳ６２０に進んで、カウンターｉの値を１つ増やした後に、再びＳ６０６に進む。 In step S606, it is determined whether the i-th LP pole of the fused unit lies on the real axis (that is, whether its imaginary part is 0). If it does (YES in step S606), the process proceeds to step S620, increments the counter i by 1, and returns to step S606.

これは、実軸上のＬＰ極がフォルマントには対応しない（スペクトル包絡全体の形状に寄与）ため、実軸上である場合については、ステップＳ６０７以降の処理をスキップし、フォルマントに対応したＬＰ極のみを考慮するためのものである。 This is because an LP pole on the real axis does not correspond to a formant (it contributes to the overall shape of the spectral envelope); for such a pole, the processing from step S607 onward is skipped so that only LP poles corresponding to formants are considered.

ＬＰ極が実軸上でない場合（ステップＳ６０６のＮＯ）には、ステップＳ６０７に進む。 If the LP pole is not on the real axis (NO in step S606), the process proceeds to step S607.

ステップＳ６０７では、Ｎ_ＬＰの値を１つ増やした後に、ステップＳ６０８に進む。 In step S607, the value of N_LP is incremented by 1, and the process proceeds to step S608.

ステップＳ６０８では、融合素片のｉ番目のＬＰ極に対するフォルマントのバンド幅ＢＷ_ｉを、数式（４）を用いて算出する。 In step S608, the formant bandwidth BW_i for the i-th LP pole of the fused unit is calculated using Equation (4).

次のステップＳ６０９では、融合元の音声素片のフォルマントに関するバンド幅の和ＢＷ_{ｉ＿ｏｒｇ＿ｓｕｍ}を０に初期化し、ステップＳ６１０に進む。 In the next step S609, the sum of formant bandwidths of the source speech units, BW_{i_org_sum}, is initialized to 0, and the process proceeds to step S610.

ステップＳ６１０では、カウンターｋを１に初期化して、ステップＳ６１１に進む。 In step S610, the counter k is initialized to 1, and the process proceeds to step S611.

ステップＳ６１１では、融合元の音声素片（計Ｎ_{ｆｕｓｅｄ}個）のうちｋ番目の音声素片（「音声素片ｋ」と呼ぶ。）について、この音声素片のＬＰ極の中で、融合素片のｉ番目のＬＰ極が表すフォルマントに対応するようなＬＰ極を選択する。具体的には、音声素片ｋのＬＰ極の中で、融合素片のｉ番目のＬＰ極に最も近いものを選択する。ＬＰ極の間の距離については、例えば数式（５）（文献“Goncharoff, et al., 「Interpolation of LPC spectra via pole shifting.」, IEEE ICASSP, Detroit, MI, Vol.1, pp.780-783, 1995”参照）を用いて算出できる。ただし、ｐ_ｉはＬＰ極の複素数表現、ｒ_ｉはＬＰ極と原点の距離を表し、Ｄ（ｐ_０，ｐ_１）がＬＰ極ｐ_０とｐ_１の距離を表す。 In step S611, for the k-th of the N_fused source speech units (referred to as "speech unit k"), the LP pole corresponding to the formant represented by the i-th LP pole of the fused unit is selected from among the LP poles of speech unit k. Specifically, the LP pole of speech unit k closest to the i-th LP pole of the fused unit is selected. The distance between LP poles can be calculated, for example, with Equation (5) (see Goncharoff et al., "Interpolation of LPC spectra via pole shifting," IEEE ICASSP, Detroit, MI, Vol. 1, pp. 780-783, 1995), where p_i is the complex representation of an LP pole, r_i is the distance between the LP pole and the origin, and D(p_0, p_1) is the distance between LP poles p_0 and p_1.

この数式（５）によって、融合素片のｉ番目のＬＰ極との距離を、融合元の音声素片のＬＰ極のそれぞれについて算出し、最も距離が小さいＬＰ極を選択する。 Using this equation (5), the distance from the i-th LP pole of the fusion unit is calculated for each of the LP poles of the voice unit of the fusion source, and the LP pole with the shortest distance is selected.

次のステップＳ６１２では、ステップＳ６１１で選択されたＬＰ極に対するバンド幅ＢＷ_{ｉ＿ｏｒｇ＿ｋ}を、数式（４）を用いて算出する。 In the next step S612, the bandwidth BW _{i_org_k} for the LP pole selected in step S611 is calculated using Equation (4).

次に、ステップＳ６１３において、ステップＳ６１２で算出したＢＷ_{ｉ＿ｏｒｇ＿ｋ}をＢＷ_{ｉ＿ｏｒｇ＿ｓｕｍ}に加算する。 Next, in step S613, BW _{i_org_k} calculated in step S612 is added to BW _{i_org_sum} .

続いて、ステップＳ６１４において、カウンターｋが融合元の音声素片数Ｎ_{ｆｕｓｅｄ}以下かどうかを判定する。 Subsequently, in step S614, it is determined whether the counter k is less than or equal to the number of source speech units N_fused.

カウンターｋがＮ_{ｆｕｓｅｄ}以下である場合（ステップＳ６１４のＹＥＳ）には、ステップＳ６１９に進んで、カウンターｋの値を１つ増やした後に、ステップＳ６１１からのステップを繰り返す。一方、カウンターｋがＮ_{ｆｕｓｅｄ}を超える場合（ステップＳ６１４のＮＯ）には、ステップＳ６１５に進む。 If the counter k is less than or equal to N_fused (YES in step S614), the process proceeds to step S619, increments the counter k by 1, and repeats the steps from step S611. If the counter k exceeds N_fused (NO in step S614), the process proceeds to step S615.

ステップＳ６１５では、ＢＷ_{ｉ＿ｏｒｇ＿ｓｕｍ}をＮ_{ｆｕｓｅｄ}で割ることによって、融合素片のｉ番目のＬＰ極に対応するような、融合元の各音声素片のＬＰ極についての、フォルマントのバンド幅の平均値ＢＷ_{ｉ＿ｏｒｇ＿ｍｅａｎ}を算出する。 In step S615, BW_{i_org_sum} is divided by N_fused to obtain BW_{i_org_mean}, the mean formant bandwidth over the LP poles of the source speech units that correspond to the i-th LP pole of the fused unit.

次のステップＳ６１６では、ステップＳ６１５で算出したＢＷ_{ｉ＿ｏｒｇ＿ｍｅａｎ}に対する、融合素片のｉ番目のＬＰ極のバンド幅ＢＷ_ｉの比率を、フォルマントバンド幅比率の和Ｒ_ｓｕｍに加算する。 In the next step S616, the ratio of the bandwidth BW_i of the i-th LP pole of the fused unit to the BW_{i_org_mean} calculated in step S615 is added to the sum of formant bandwidth ratios R_sum.

続いて、ステップＳ６１７では、カウンターｉが、Ｎ_{ｍａｘＬＰ}という設定値以下かどうかを判定する。 Subsequently, in step S617, it is determined whether the counter i is less than or equal to a set value N_maxLP.

ここで、Ｎ_{ｍａｘＬＰ}は、フォルマントの鈍り具合を推定するのに用いるＬＰ極の個数の最大値を表す。 Here, N_maxLP is the maximum number of LP poles used to estimate the bluntness of the formants.

この値は、例えば、ＬＰＣ分析での分析次数の１／２などに設定する。 This value is set to, for example, one half of the analysis order of the LPC analysis.

カウンターｉがＮ_{ｍａｘＬＰ}以下である場合（ステップＳ６１７のＹＥＳ）には、ステップＳ６２０に進んで、カウンターｉの値を１つ増やした後に、ステップＳ６０６からの処理を繰り返す。一方、カウンターｉがＮ_{ｍａｘＬＰ}を越える場合（ステップＳ６１７のＮＯ）には、ステップＳ６１８に進む。 If the counter i is less than or equal to N_maxLP (YES in step S617), the process proceeds to step S620, increments the counter i by 1, and repeats the processing from step S606. If the counter i exceeds N_maxLP (NO in step S617), the process proceeds to step S618.

ステップＳ６１８では、フォルマントバンド幅比率の和Ｒ_ｓｕｍを、用いたＬＰ極の個数Ｎ_ＬＰで割ることによって、フォルマントバンド幅比率の平均値Ｒ_ｍｅａｎを算出し、全ての処理を終了する。 In step S618, the mean formant bandwidth ratio R_mean is calculated by dividing the sum of formant bandwidth ratios R_sum by the number of LP poles used, N_LP, and the processing ends.
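
The loop of steps S603 to S618 can be summarised by the following sketch, which takes already computed LP poles as input. Two points are assumptions made only for the example: the bandwidth uses the standard relation BW = −ln(r)·f_s/π in place of Equation (4), and the pole-to-pole distance is the plain Euclidean distance in the z-plane, standing in for Equation (5), which is not reproduced here. All function and parameter names are hypothetical.

```python
import math


def bandwidth(pole, sample_rate):
    """Formant bandwidth in Hz from an LP pole (assumed form of Eq. (4))."""
    return -math.log(abs(pole)) * sample_rate / math.pi


def bandwidth_ratios(fused_poles, source_poles, sample_rate, max_poles=None):
    """Per-formant ratios R_i = BW_i / BW_{i_org_mean}.

    fused_poles: LP poles of the fused unit (complex numbers).
    source_poles: one list of LP poles per source speech unit.
    max_poles: N_maxLP, the number of fused-unit poles examined (None = all).
    """
    ratios = []
    for p in fused_poles[:max_poles]:
        if p.imag == 0:                      # S606: skip real-axis poles
            continue
        bw_fused = bandwidth(p, sample_rate)  # S608
        bw_sum = 0.0                          # S609
        for poles_k in source_poles:          # S610-S614, S619
            # S611: nearest pole; Euclidean distance is a stand-in for Eq. (5).
            closest = min(poles_k, key=lambda q: abs(q - p))
            bw_sum += bandwidth(closest, sample_rate)  # S612-S613
        bw_mean = bw_sum / len(source_poles)  # S615
        ratios.append(bw_fused / bw_mean)     # term added to R_sum in S616
    return ratios


def mean_bandwidth_ratio(fused_poles, source_poles, sample_rate, max_poles=None):
    """S618: R_mean = R_sum / N_LP."""
    ratios = bandwidth_ratios(fused_poles, source_poles, sample_rate, max_poles)
    if not ratios:
        raise ValueError("no complex LP poles to evaluate")
    return sum(ratios) / len(ratios)
```

Since LP poles of a real signal occur in conjugate pairs, a caller might pass only the poles with positive imaginary part so that each formant is counted once; the flowchart itself does not state this.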

本実施形態では、上記のような方法で算出したフォルマントバンド幅比率の平均値Ｒ_ｍｅａｎを、音声素片の融合によるスペクトル包絡の鈍り具合を表す尺度として用いる。この値は、フォルマントのバンド幅がほぼ変わらずスペクトル包絡がほとんど鈍らなかった場合には１に近い値、フォルマントのバンド幅が融合元の音声素片より広がってスペクトル包絡が鈍った場合には１より大きい値となり、スペクトル包絡の鈍り具合が大きければ大きいほど大きな値になると考えられる。そこで、本実施形態においては、Ｒ_ｍｅａｎが１以下の場合は強調無しで、１より大きい場合は強調度合いが単調増加するような何らかの関数を用いることによって、このＲ_ｍｅａｎからフォルマント強調度合いを算出することとする。 In the present embodiment, the mean formant bandwidth ratio R_mean calculated as described above is used as a measure of how much the spectral envelope is blunted by the fusion of speech units. This value is close to 1 when the formant bandwidths are almost unchanged and the spectral envelope is hardly blunted, and greater than 1 when the formant bandwidths are wider than those of the source speech units and the spectral envelope is blunted; the more the envelope is blunted, the larger the value is expected to be. In the present embodiment, therefore, the formant emphasis degree is calculated from R_mean using a function that gives no emphasis when R_mean is 1 or less and an emphasis degree that increases monotonically when R_mean is greater than 1.
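
The embodiment leaves the mapping from R_mean to the emphasis degree open ("some function"). One possible choice, shown purely as an assumption, is 1 − 1/R_mean scaled by a hypothetical maximum degree: it is zero at R_mean = 1, strictly increasing above 1, and bounded.

```python
def emphasis_degree(r_mean, max_degree=1.0):
    """Map the mean bandwidth ratio to a formant emphasis degree.

    Assumed mapping (not specified in the embodiment):
      r_mean <= 1  -> 0 (no emphasis)
      r_mean >  1  -> max_degree * (1 - 1 / r_mean), strictly increasing
    """
    if r_mean <= 1.0:
        return 0.0
    return max_degree * (1.0 - 1.0 / r_mean)
```

How the resulting degree is applied depends on the parameterisation of the formant emphasis filter described in the first embodiment.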

このように、融合素片と融合元の各音声素片に対して推定されたフォルマントのバンド幅を用いることによって、スペクトル包絡形状の差を用いる場合よりも、融合によるスペクトル包絡の鈍り具合をより直接的に求められるので、フォルマント強調度合いをより高い確度で推定することが可能である。 In this way, by using the formant bandwidth estimated for the fusion unit and each speech unit of the fusion source, the degree of bluntness of the spectrum envelope due to fusion is more than that when using the difference in spectrum envelope shape. Since it is obtained directly, it is possible to estimate the formant emphasis degree with higher accuracy.

（第４の実施形態） (Fourth embodiment)
本発明の第４の実施形態に係るテキスト音声合成装置について説明する。 A text-to-speech synthesizer according to a fourth embodiment of the present invention will be described.

本実施形態は、フォルマント強調度合い推定部４６の推定方法が、第１、第３の実施形態で説明した例とは相違するものであり、以下、この相違点を中心に説明する。 In the present embodiment, the estimation method of the formant emphasis degree estimation unit 46 is different from the examples described in the first and third embodiments. Hereinafter, this difference will be mainly described.

第３の実施形態においては、融合素片の各ＬＰ極に対して求めたフォルマントバンド幅比率を平均化することによって、スペクトル包絡全体での鈍り具合を推定しているが、実際にはスペクトルの鈍り具合がフォルマント毎で異なる場合も考えられる。そこで、各ＬＰ極に対して求めたフォルマントバンド幅比率（以下、Ｒ_ｉとする。）をそのまま用いることによって、フォルマントごとに強調度合いが異なるようなフォルマント強調を行うことも可能である。 In the third embodiment, the degree of dullness in the entire spectrum envelope is estimated by averaging the formant bandwidth ratio obtained for each LP pole of the fusion unit. It may be possible that the dullness varies depending on the formant. Therefore, by using the formant bandwidth ratio (hereinafter referred to as R _i ) obtained for each LP pole as it is, it is possible to perform formant enhancement in which the degree of enhancement differs for each formant.

ここで、融合素片のｉ番目のＬＰ極をｐ_ｉとすると、数式（２）のＰ（ｚ）に関して、数式（６）のように表せる。 Here, letting p_i be the i-th LP pole of the fused unit, P(z) in Equation (2) can be expressed as in Equation (6).

Ｈ（ｚ）＝１／（１−Ｐ（ｚ））という伝達関数を持つフィルタ（線形予測フィルタ）に、ＬＰＣ分析したときの予測残差を入力すると完全に元の波形が再現できるが、上記のｐ_ｉをｚ平面上の単位円に近づくように変更したフィルタに予測残差を入力すると、ｉ番目のＬＰ極に対応するフォルマントのバンド幅が狭まり、結果的に、このフォルマントを強調することができる。すなわち、Ｒ_ｉに応じて適切にｐ_ｉを変更したフィルタをフォルマント強調フィルタとして用いれば、フォルマントごとに適切なフォルマント強調を行うことができる。 When the prediction residual from the LPC analysis is input to a filter with the transfer function H(z) = 1/(1−P(z)) (a linear prediction filter), the original waveform is reproduced exactly. If the prediction residual is instead input to a filter in which p_i has been moved closer to the unit circle in the z-plane, the bandwidth of the formant corresponding to the i-th LP pole narrows, and as a result that formant is emphasized. In other words, if a filter in which each p_i has been modified appropriately according to R_i is used as the formant emphasis filter, appropriate formant emphasis can be applied to each formant.

図１４に、本実施形態のフォルマント強調フィルタ部４５の構成例を示す。 FIG. 14 shows a configuration example of the formant emphasis filter unit 45 of this embodiment.

ＬＰＣ分析部４５１は、入力された波形に対してＬＰＣ分析を行い、算出されたＬＰＣをＬＰＣ変形部４５２に、予測残差を線形予測フィルタ部４５３に出力する。 The LPC analysis unit 451 performs LPC analysis on the input waveform, and outputs the calculated LPC to the LPC deformation unit 452 and the prediction residual to the linear prediction filter unit 453.

ＬＰＣ変形部４５２では、フォルマント強調度合い推定部４６から入力された各ＬＰ極に対するフォルマントバンド幅比率Ｒ_ｉに応じてＬＰＣ係数を変形し、この変形されたＬＰＣ係数を線形予測フィルタ部４５３に与える。 The LPC deforming unit 452 modifies the LPC coefficient according to the formant bandwidth ratio R _i for each LP pole input from the formant enhancement degree estimating unit 46, and gives the deformed LPC coefficient to the linear prediction filter unit 453.

線形予測フィルタ部４５３では、ＬＰＣ変形部４５２から与えられたＬＰＣ係数をフィルタ係数に用いて、ＬＰＣ分析部４５１から入力された予測残差をフィルタリングすることによって、フォルマント強調された波形を出力する。 The linear prediction filter unit 453 outputs a waveform with formant emphasis by filtering the prediction residual input from the LPC analysis unit 451 using the LPC coefficient given from the LPC deformation unit 452 as a filter coefficient.

なお、ＬＰＣ変形部４５２においては、まず、入力されたＬＰＣ係数から数式Ｐ（ｚ）を得た後、（１−Ｐ（ｚ））を数式（６）のように因数分解することによってＬＰ極ｐ_ｉを得る。 The LPC deformation unit 452 first obtains P(z) from the input LPC coefficients and then obtains the LP poles p_i by factorizing (1−P(z)) as in Equation (6).

次に、ＬＰ極ｐ_ｉをＲ_ｉに応じて変更する。 Next, the LP pole p _i is changed according to R _i .

例えば、数式（７）のように変更すれば、フォルマントのバンド幅は１／Ｒ_ｉ倍となり、融合元の音声素片での平均的なフォルマントのバンド幅に近づくようバンド幅を狭めることが可能である。 For example, if the pole is modified as in Equation (7), the formant bandwidth becomes 1/R_i times its original value, so the bandwidth can be narrowed toward the average formant bandwidth of the source speech units.

このような方法でＲ_ｉに応じて変更したＬＰ極ｐ_ｉを数式（６）に代入して、この数式を展開することによって、変形されたＬＰＣ係数を得ることができる。 A modified LPC coefficient can be obtained by substituting the LP pole p _i changed according to R _i in this way into Equation (6) and expanding the equation.
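
Equations (6) and (7) are not reproduced in this text, so the following sketch rests on stated assumptions: that P(z) = Σ a_k z^(−k), that Equation (6) is the factorization 1 − P(z) = Π (1 − p_i z^(−1)), and that Equation (7) replaces the pole radius r_i by r_i^(1/R_i) while keeping the angle. Under the bandwidth relation BW = −ln(r)·f_s/π, that radius change scales the bandwidth by exactly 1/R_i, which matches the behaviour described above. Leaving poles with R_i ≤ 1 and real-axis poles unchanged is a further assumption; all names are hypothetical.

```python
import cmath


def sharpen_poles(poles, ratios):
    """Move each LP pole toward the unit circle according to its ratio R_i.

    Assumed form of Eq. (7): r -> r ** (1 / R), angle unchanged.
    Conjugate poles must be given the same ratio so the filter stays real.
    """
    modified = []
    for p, ratio in zip(poles, ratios):
        if ratio <= 1.0 or p.imag == 0:
            modified.append(p)  # assumption: no change when not blunted
        else:
            modified.append(cmath.rect(abs(p) ** (1.0 / ratio), cmath.phase(p)))
    return modified


def poles_to_lpc(poles):
    """Expand prod(1 - p_i z^-1) and return [a_1, ..., a_n].

    Assumes the convention 1 - P(z) = 1 - sum_k a_k z^-k.
    """
    poly = [1 + 0j]  # coefficients of z^0, z^-1, ...
    for p in poles:
        nxt = poly + [0j]
        for k in range(1, len(nxt)):
            nxt[k] -= p * poly[k - 1]  # multiply by (1 - p z^-1)
        poly = nxt
    # Imaginary parts cancel for conjugate pairs; keep the real part.
    return [-c.real for c in poly[1:]]
```

For a single conjugate pair with radius r and angle θ, the expansion gives a_1 = 2r·cos θ and a_2 = −r², which offers a quick way to check the result.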

本実施形態においては、融合素片および融合元の各音声素片に対して求めたＬＰ極を用いてフォルマントごとに強調度合いを変える方法を説明したが、この方法以外にも、フォルマントごとあるいは周波数帯域によって強調度合いが変わるようなフォルマント強調を行うことも可能である。 The present embodiment has described a method that varies the degree of emphasis for each formant using the LP poles obtained for the fused unit and each source speech unit. Other methods can also provide formant emphasis whose degree varies from formant to formant or from one frequency band to another.

例えば、フォルマント強調度合い推定部４６において、融合素片および融合元の各音声素片の波形を複数の周波数帯域に分割し、それぞれの帯域においてスペクトル包絡の鈍り具合を推定することによって、それぞれの帯域でのフォルマント強調度合いを推定する。そして、フォルマント強調フィルタ部４５において、融合素片の波形を帯域分割して得た各周波数帯域の波形に対し、フォルマント強調度合い推定部４６から入力された各帯域の強調度合いに従ってフォルマント強調した後、周波数帯域間で波形を足し合わせれば、各周波数帯域でのスペクトル包絡の鈍り具合に応じたスペクトル強調を行うことが可能である。 For example, the formant emphasis degree estimation unit 46 divides the waveforms of the fused unit and of each source speech unit into a plurality of frequency bands and estimates the bluntness of the spectral envelope in each band, thereby estimating the formant emphasis degree for each band. The formant emphasis filter unit 45 then applies formant emphasis to the waveform of each frequency band, obtained by band-splitting the waveform of the fused unit, according to the emphasis degree of that band input from the formant emphasis degree estimation unit 46, and sums the waveforms across the frequency bands. This makes it possible to perform spectral emphasis that matches the bluntness of the spectral envelope in each frequency band.

なお、以上の各機能は、ソフトウェアとして記述し適当な機構をもったコンピュータに処理させても実現可能である。 Each of the above functions can also be realized by describing it as software and having it processed by a computer with an appropriate mechanism.
また、本実施形態は、コンピュータに所定の手順を実行させるための、あるいはコンピュータを所定の手段として機能させるための、あるいはコンピュータに所定の機能を実現させるためのプログラムとして実施することもできる。加えて該プログラムを記録したコンピュータ読取り可能な記録媒体として実施することもできる。 The present embodiment can also be implemented as a program for causing a computer to execute a predetermined procedure, to function as predetermined means, or to realize a predetermined function. In addition, it can be implemented as a computer-readable recording medium on which the program is recorded.

なお、本発明は上記実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。 Note that the present invention is not limited to the above-described embodiment as it is, and can be embodied by modifying the constituent elements without departing from the scope of the invention in the implementation stage. In addition, various inventions can be formed by appropriately combining a plurality of components disclosed in the embodiment. For example, some components may be deleted from all the components shown in the embodiment. Furthermore, constituent elements over different embodiments may be appropriately combined.

Brief description of the drawings

FIG. 1: 本発明の一実施形態に係るテキスト音声合成装置の構成例を示すブロック図 Block diagram showing a configuration example of a text-to-speech synthesizer according to an embodiment of the present invention.
FIG. 2: 同実施形態に係る音声合成部の構成例を示すブロック図 Block diagram showing a configuration example of the speech synthesis unit according to the embodiment.
FIG. 3: 同実施形態に係る音声素片記憶部に蓄積される音声素片の例を示す図 Diagram showing an example of speech units stored in the speech unit storage unit according to the embodiment.
FIG. 4: 同実施形態に係る音声素片記憶部に蓄積される素片属性情報の例を示す図 Diagram showing an example of unit attribute information stored in the speech unit storage unit according to the embodiment.
FIG. 5: 音声素片の選択手順の一例を示すフローチャート Flowchart showing an example of a procedure for selecting speech units.
FIG. 6: 音声波形を融合して新たな音声波形を生成する手順の一例を示すフローチャート Flowchart showing an example of a procedure for generating a new speech waveform by fusing speech waveforms.
FIG. 7: 選択された３つの音声素片からなる素片組み合わせ候補を融合して新たな音声素片を生成する例について説明するための図 Diagram for explaining an example of generating a new speech unit by fusing a unit combination candidate consisting of three selected speech units.
FIG. 8: 音声素片の融合によるスペクトル包絡の鈍り具合を推定する手順の一例を示すフローチャート Flowchart showing an example of a procedure for estimating the bluntness of the spectral envelope caused by fusing speech units.
FIG. 9: 同実施形態に係る素片編集・接続部での処理を説明するための図 Diagram for explaining the processing in the unit editing/connection unit according to the embodiment.
FIG. 10: 同実施形態に係る音声合成部の他の構成例を示すブロック図 Block diagram showing another configuration example of the speech synthesis unit according to the embodiment.
FIG. 11: 同実施形態に係る融合済み音声素片作成部の構成例を示すブロック図 Block diagram showing a configuration example of the fused speech unit creation unit according to the embodiment.
FIG. 12: 融合済みの音声素片を学習する手順の一例を示すフローチャート Flowchart showing an example of a procedure for learning fused speech units.
FIG. 13: フォルマント強調度合いを推定する手順の一例を示すフローチャート Flowchart showing an example of a procedure for estimating the formant emphasis degree.
FIG. 14: 同実施形態に係るフォルマント強調フィルタ部の構成例を示すブロック図 Block diagram showing a configuration example of the formant emphasis filter unit according to the embodiment.

Explanation of symbols

１…テキスト入力部、２…言語処理部、３…韻律処理部、４…音声合成部、４１…音韻系列・韻律情報入力部、４２…音声素片記憶部、４３…素片選択部、４４…素片融合部、４５…フォルマント強調フィルタ部、４６…フォルマント強調度合い推定部、４７…素片編集・接続部、４８…音声波形出力部、４９…音声素片出力部、５…融合済み音声素片作成部、４５１…ＬＰＣ分析部、４５２…ＬＰＣ変形部、４５３…線形予測フィルタ部 1: text input unit, 2: language processing unit, 3: prosody processing unit, 4: speech synthesis unit, 41: phoneme sequence/prosodic information input unit, 42: speech unit storage unit, 43: unit selection unit, 44: unit fusion unit, 45: formant emphasis filter unit, 46: formant emphasis degree estimation unit, 47: unit editing/connection unit, 48: speech waveform output unit, 49: speech unit output unit, 5: fused speech unit creation unit, 451: LPC analysis unit, 452: LPC deformation unit, 453: linear prediction filter unit

Claims

A speech processing apparatus comprising:
A first acquisition unit that acquires a plurality of segments obtained by dividing a phoneme sequence corresponding to a target speech into synthesis units;
A second acquisition unit that acquires prosodic information of each of the segments corresponding to the target speech;
A selection unit that selects, for each segment, a plurality of speech units from a plurality of speech units prepared in advance, based on the prosodic information of the segment;
A fusion unit that generates, for each segment, a fused unit by fusing the plurality of speech units selected for the segment;
An estimation unit that estimates, for each segment, the degree of emphasis of the formant emphasis to be applied to the fused unit for the segment, using at least one of a feature amount of the plurality of speech units selected by the selection unit and a feature amount of the fused unit generated by the fusion unit; and
A formant emphasis filter unit that applies, for each segment, formant emphasis to the fused unit for the segment based on the degree of emphasis estimated by the estimation unit.

The speech processing apparatus according to claim 1, further comprising a generation unit that generates synthesized speech based on the speech waveforms of the formant-emphasized fused units obtained for each segment by the formant emphasis filter unit.

The speech processing apparatus according to claim 1, further comprising an output unit that directly outputs the fusion element that has been subjected to formant emphasis obtained by the formant emphasis filter unit for each of the segments.

The speech processing apparatus according to claim 3, wherein the output unit outputs the fusion unit to a storage unit that stores a speech unit for use in text-to-speech synthesis.

5. The speech processing apparatus according to claim 1, further comprising a speech unit storage unit that stores the plurality of speech units prepared in advance.

The estimation unit estimates, for each of the segments, how blunt the spectrum envelope of the fusion unit generated by the fusion unit is from the spectrum envelope of the speech unit selected by the selection unit. 6. The speech processing apparatus according to claim 1, wherein a stronger formant emphasis degree is estimated for a segment having a greater degree of bluntness of the estimated spectral envelope.

The estimation unit estimates a difference between a spectrum envelope of the fusion unit generated by the fusion unit and a shape of a spectrum envelope of the speech unit selected by the selection unit for each segment. The speech processing apparatus according to claim 1, wherein a stronger formant enhancement degree is estimated for a segment having a larger difference in estimated spectrum envelope shape.

The estimation unit selects, for each segment, a difference between a target speech and a speech generated by the fusion unit generated by the fusion unit, by the prosody information corresponding to the target speech of the segment and the selection unit. 2. The formant emphasis degree is estimated for a segment that is estimated from the prosodic information of the speech segment and has a larger difference between the estimated target speech and the speech of the fusion segment. 6. The sound processing device according to any one of 5 above.

The estimation unit estimates the formant emphasis degree for each of the plurality of segments for each formant or for each frequency band divided into a plurality of segments,
6. The formant emphasizing filter unit performs formant emphasis having different strengths between formants or frequency bands in accordance with the formant emphasis degree estimated for each formant or frequency band. The voice processing device according to claim 1.

A speech processing method of a speech processing apparatus including a first acquisition unit, a second acquisition unit, a selection unit, a fusion unit, an estimation unit, and a formant emphasis filter unit, the method comprising:
a step in which the first acquisition unit acquires a plurality of segments obtained by dividing a phoneme sequence corresponding to a target speech into synthesis units;
a step in which the second acquisition unit acquires prosodic information of each of the segments corresponding to the target speech;
a step in which the selection unit selects, for each of the segments, a plurality of speech units from among speech units prepared in advance, based on the prosodic information of the segment;
a step in which the fusion unit generates, for each of the segments, a fused speech unit by fusing the plurality of speech units selected for the segment;
a step in which the estimation unit estimates, for each of the segments, the degree of emphasis of the formant emphasis to be performed on the fused speech unit of the segment, using at least one of a feature amount related to the plurality of speech units selected by the selection unit and a feature amount related to the fused speech unit generated by the fusion unit; and
a step in which the formant emphasis filter unit performs, for each of the segments, formant emphasis on the fused speech unit of the segment based on the degree of emphasis estimated by the estimation unit.
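The sequence of steps in this method claim can be sketched, under heavy simplification, as a small pipeline. Everything concrete here is an assumption for illustration: speech units are reduced to a phoneme label, a fundamental frequency, and a log-spectral envelope; selection picks the units closest in fundamental frequency; fusion is a plain average; the emphasis degree comes from a variance ratio; and emphasis stretches the envelope around its mean. The names `SpeechUnit`, `select_units`, `fuse_units`, `estimate_degree`, `emphasize`, and `process` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class SpeechUnit:
    phoneme: str
    f0: float        # fundamental frequency in Hz
    envelope: list   # log-magnitude spectral envelope samples


def select_units(inventory, phoneme, target_f0, count):
    """Selection step: pick the units closest to the target prosody."""
    candidates = [u for u in inventory if u.phoneme == phoneme]
    return sorted(candidates, key=lambda u: abs(u.f0 - target_f0))[:count]


def fuse_units(units):
    """Fusion step: average the selected envelopes sample by sample."""
    return [mean(column) for column in zip(*(u.envelope for u in units))]


def estimate_degree(units, fused):
    """Estimation step: stronger emphasis when fusion flattened the envelope."""
    unit_var = mean(pvariance(u.envelope) for u in units)
    if unit_var == 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - pvariance(fused) / unit_var))


def emphasize(fused, degree):
    """Emphasis step: stretch the envelope away from its mean."""
    center = mean(fused)
    return [center + (1.0 + degree) * (v - center) for v in fused]


def process(segments, prosody, inventory, count=2):
    """Run all steps per segment.

    Assumes the inventory holds at least one unit for every phoneme.
    """
    result = []
    for phoneme, target_f0 in zip(segments, prosody):
        units = select_units(inventory, phoneme, target_f0, count)
        fused = fuse_units(units)
        degree = estimate_degree(units, fused)
        result.append(emphasize(fused, degree))
    return result


inventory = [
    SpeechUnit("a", 100.0, [0.0, 8.0, 0.0, 8.0]),
    SpeechUnit("a", 110.0, [0.0, 8.0, 8.0, 0.0]),
    SpeechUnit("a", 200.0, [9.0, 9.0, 9.0, 9.0]),
]
output = process(["a"], [105.0], inventory)
```

In this toy run the two units nearest 105 Hz are fused; the averaging halves the envelope variance, so the estimated degree is 0.5 and the surviving peak and valley are pushed further apart.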

The speech processing method according to claim 10, wherein the speech processing apparatus further includes a generation unit,
and the method further comprises a step in which the generation unit generates synthesized speech based on the speech waveform of the formant-emphasized fused speech unit obtained by the formant emphasis filter unit for each of the segments.

The speech processing apparatus further includes an output unit,
and the speech processing method further includes a step in which the output unit outputs, as it is, the formant-emphasized fused speech unit obtained by the formant emphasis filter unit for each of the segments. The speech processing method described in 1.

A program for causing a computer to function as a speech processing apparatus including a first acquisition unit, a second acquisition unit, a selection unit, a fusion unit, an estimation unit, and a formant emphasis filter unit, the program causing the computer to execute:
a step in which the first acquisition unit acquires a plurality of segments obtained by dividing a phoneme sequence corresponding to a target speech into synthesis units;
a step in which the second acquisition unit acquires prosodic information of each of the segments corresponding to the target speech;
a step in which the selection unit selects, for each of the segments, a plurality of speech units from among speech units prepared in advance, based on the prosodic information of the segment;
a step in which the fusion unit generates, for each of the segments, a fused speech unit by fusing the plurality of speech units selected for the segment;
a step in which the estimation unit estimates, for each of the segments, the degree of emphasis of the formant emphasis to be performed on the fused speech unit of the segment, using at least one of a feature amount related to the plurality of speech units selected by the selection unit and a feature amount related to the fused speech unit generated by the fusion unit; and
a step in which the formant emphasis filter unit performs, for each of the segments, formant emphasis on the fused speech unit of the segment based on the degree of emphasis estimated by the estimation unit.

The program according to claim 13, wherein the speech processing apparatus further includes a generation unit,
and the program further causes the computer to execute a step in which the generation unit generates synthesized speech based on the speech waveform of the formant-emphasized fused speech unit obtained by the formant emphasis filter unit for each of the segments.

The program according to claim 13, wherein the speech processing apparatus further includes an output unit,
and the program further causes the computer to execute a step in which the output unit outputs, as it is, the formant-emphasized fused speech unit obtained by the formant emphasis filter unit for each of the segments.