JP3601974B2

JP3601974B2 - Voice synthesis device and voice synthesis method

Info

Publication number: JP3601974B2
Application number: JP14397898A
Authority: JP
Inventors: 哲也酒寄
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-05-26
Filing date: 1998-05-26
Publication date: 2004-12-15
Anticipated expiration: 2018-05-26
Also published as: JPH11338488A

Description

【０００１】
【発明の属する技術分野】
本発明は音声合成装置及び方法に関し、特に、定型句中の一部の語句をその都度変更して使用するようなテキストを音声に変換する用途に利用できる、音声合成装置及び方法に関するものである。
【０００２】
【従来の技術】
（１）特開平９−３４４９０号公報「音声合成装置および音声合成方法、ナビゲーションシステム、ならびに記録媒体」は、「ここは○○付近です。」のような定型文に対して自然音声の韻律を、「品川」などの非定型情報に対して規則によって計算した韻律を用い、両者を滑らかに聞こえるように整合性を取って接続し、「ここは品川付近です。」という置き換え文の音声を合成するものが開示されている。
（２）日本音響学会平成１０年度春季研究発表会講演論文集、１−７−２４「単語および文韻律データベースを用いた韻律制御方式の検討」、自然音声から抽出した定型句の韻律と規則によって計算した非定常部分の韻律を以下の式によって接続するものが開示されている。
ｐ′_ｉ＝ｐｉ＊（（１−ｉ／Ｎ）＊（ｐ_０／Ｐ_ｓ）＋（ｉ／Ｎ）＊（ｐ_Ｎ／Ｐ_Ｅ））
ここで、Ｎは可変部にはいる単語のモーラ数、ｐ′_ｉ（ｉ＝０，１，２，３…，Ｎ）は求めるＦ_０データ、ｐｉは単語Ｆ_０パターンＤＢから得られたＦ_０データ、Ｐ_ｓとＰ_Ｅは各々可変部の始端Ｆ_０データ、終端Ｆ_０データである。
【０００３】
【発明が解決しようとする課題】
従来、音声合成装置の合成方式には録音編集方式と規則合成方式があった。前者はアナウンサーなどがフレーズ毎に音声を登録しておき、これを適宜選択結合してメッセージを作成するもので、肉声に近い良好な音声が得られる可能性があるが、データ量が多い、登録外のフレーズには対応できない、新たにフレーズを追加するために同一話者の確保が必要、等の問題がある。一方、後者は、音素や音節などの細かい単位で音声データを蓄積して任意語彙の合成を可能とするものであるが、音質的に録音編集方式に劣り、特に基本周波数、音韻継続時間長、振幅などの韻律パターンを規則によって付与するためどうしても機械的で不自然なものになる。
このため任意語彙の出力が不要な定型的なメッセージには、音質の良い録音編集方式が用いられ、テキストからの音声変換が必要な場面では規則合成が用いられる。しかし、カーナビゲーションの音声案内で定型的メッセージの中に地名が埋め込まれるなど、定型文の中に一部任意語彙が埋め込まれるようなアプリケーションも多く存在する。このような場合ごく一部の任意語彙のために全体の音質を落として規則合成を採用するか、任意語彙の出力をあきらめて録音編集方式を用いるか、あるいは定型部分を録音編集で行い、任意語彙部分のみを規則合成で行うという混在方式を採るかの選択をせざるを得ない。
録音編集と規則合成を混在させる場合の問題点は２つの方式で出力音声の声質がまったく異なるため、聞いていて違和感があるばかりでなく非常に聞取り難いものとなる点である。
【０００４】
これに対し従来技術（１）では定型文にも規則合成的に音素あるいは音節等をつないで音韻パラメータを生成し、これに自然音声から抽出した基本周波数及び音韻継続時間長を付与することにより、任意語部分との話者連続性を保持しつつ自然性を向上している。ここで問題となるのは自然音声から抽出された定型部分の韻律パターンと規則によって計算された非定型部分の接続の際に、いかにして不連続性を生じさせずに韻律パターンを自然に接続するかということである。特にピッチパターンの接続をどのように行うかが合成音声の自然性にとって重要である。従来技術（１）ではこの問題に対して「接続部分が滑らかになるように」とするのみでなんら具体的方法を開示していない。
【０００５】
従来技術（２）では上述の問題に対して、接続部分のピッチを定型部分に合わせるように規則で計算する、非定型部分のピッチパターン全体を線型変換する方法が示されている。この方法の問題点を図７を用いて以下に説明する。この例では破線で示した定型文のピッチパターンには３つのアクセント成分による山が見られる。山の大きさは３つ目がもっとも大きく２つ目が最も小さい。自然音声はこのようなメリハリのあるピッチパターンによって豊かな表情が付けられている。これを従来技術（２）による方法で、実線で示した規則で計算されたパターンで置き換えると、全ての山の大きさはほぼ等しくなる。結果的に定型部分の人間的で自然なピッチパターンの特徴を十分生かすことが出来ない。
【０００６】
従って、本発明では定型部分の自然性を極力損なうことなく非定型部分のピッチパターンを置き換えることを目的とするものである。
【０００７】
【課題を解決するための手段】
請求項１は、人間が発声した音声フレーズから抽出した原ピッチパターンを用いて音声を合成する音声合成装置であって、前記音声フレーズの一部を異なる置換語句に置き換えて合成するために、置換区間に対して予め複数の代表点を設定した原ピッチパターンを含む韻律パターンを記憶する原韻律パターン記憶部と、置換語句の韻律とピッチパターンの代表点の時刻を計算する置換語句韻律計算部と、前記置換語句のピッチパターンの代表点の時刻に合うように原ピッチパターンの代表点間距離を伸縮する代表点移動部と、前記代表点移動部で求められた代表点間のパターンを、原ピッチパターンの概形を保つように補間する代表点間補間部を具備し、置換区間の原ピッチパターンを前記代表点移動部および前記代表点間補間部で変換した変換パターンを利用して、置換語句に対応するピッチパターンを生成することを特徴とする音声合成装置である。
【０００８】
請求項２の発明は、請求項１に記載された音声合成装置において、ピッチパターンの代表点としてモーラ毎の中心点あるいは重心点を用いることを特徴とした音声合成装置である。
【０００９】
請求項３の発明は、請求項１に記載された音声合成装置において、ピッチパターンの代表点としてピッチパターンの上昇部、下降部、定常部の各始終点を用いることを特徴とした音声合成装置である。
【００１０】
請求項４の発明は、請求項１乃至３のいずれかに記載された音声合成装置において、前記代表点移動部は、パターン概形を保つような代表点間距離の伸縮を主としてパターンの定常部の伸縮によって行い、代表点の置換後の語句に対応する時刻への移動を行うことを特徴とする音声合成装置である。
【００１１】
請求項５の発明は、請求項１乃至４のいずれかに記載された音声合成装置において、代表点間補間部は、代表点間のパターンの補間方法として直線補間を用いるものであることを特徴とした音声合成装置である。
【００１２】
請求項６の発明は、請求項１乃至４のいずれかに記載された音声合成装置において、代表点間補間部は、代表点間のパターンの補間方法として置換前のピッチパターンを伸縮する方法を用いることを特徴とした音声合成装置である。
【００１３】
請求項７の発明は、人間が発声した音声フレーズから抽出した原ピッチパターンを用いて音声を合成する音声合成方法において、前記音声フレーズの一部を異なる語句に置き換えて合成するため、置換部分のピッチパターンの代表点の時刻を計算し、前記置換部分のピッチパターンの代表点の時刻に合うように代表点間距離を伸縮するとともに、該代表点間のパターンを、原ピッチパターンの概形を保つように補間して求められた変換パターンを利用して、置換部分に対応するピッチパターンを生成することを特徴とする音声合成方法である。
【００１４】
請求項８の発明は、請求項７に記載された音声合成方法において、ピッチパターンの代表点としてモーラ毎の中心点あるいは重心点を用いることを特徴とした音声合成方法である。
【００１５】
請求項９の発明は、請求項７に記載された音声合成方法において、ピッチパターンの代表点としてピッチパターンの上昇部、下降部、定常部の各始終点を用いることを特徴とした音声合成方法である。
【００１６】
請求項１０の発明は、請求項７乃至９のいずれかに記載された音声合成装置において、前記代表点間距離の伸縮方法として、主としてパターンの定常部の伸縮によって、代表点の置換後の語句に対応する時刻への移動を行うことを特徴とする音声合成方法である。
【００１７】
請求項１１の発明は、請求項７乃至１０のいずれかに記載された音声合成方法において、前記代表点間のパターンの補間方法として直線補間を用いることを特徴とした音声合成方法である。
【００１８】
請求項１２の発明は、請求項７乃至９のいずれかに記載された音声合成方法において、代表点間のパターンの補間方法として置換前のピッチパターンを伸縮する方法を用いることを特徴とした音声合成方法である。
【００１９】
【発明の実施の形態】
以下の説明では、予め人間が発声した音声フレーズを原音声、原音声から抽出した韻律パターンを原韻律パターンと呼ぶ。韻律パターンには継続時間長、ピッチ、パワーなどのパターンが含まれるが、ここでは主としてピッチパターンについて説明するので、単にパターンとした場合にはピッチパターンを指すものとする。原音声中、異なる語句に置き換えられる部分を置換区間、置き換える語句を置換語句、置き換えた後のパターンを変換パターンと呼ぶ。
【００２０】
まず、具体的な動作例から本発明の特徴を説明する。原音声として例えば「この先、横浜駅周辺、渋滞しています。」というフレーズがあるものとする。この中の「横浜駅」の部分置換区間とし、「横浜アリーナ」を置換語句とする。置換語句に関しては人間が発声した音声フレーズはなく、文字列だけからパターンを合成しなければならない。ここで規則合成的にパターンを作らず、置換語句に合うように原パターンを変形して対応することが本発明の特徴である。原パターン中の「横浜駅」は６モーラ４型のアクセント句、かつ、変換パターン中の「横浜アリーナ」は８モーラ６型のアクセント句であり、音韻系列も異なるので少なくとも時間軸方向の変形が必要である。本発明では主な代表点の時刻のみを規則によって計算し、原パターンの概形を保つような代表点の移動を行って変換パターンを作成することによって、原パターンの自然性を出来るだけ生かすことを目指すものである。
【００２１】
本発明の一実施例について説明する。図１は本発明の音声合成装置の実施例の構成を示す。音声合成のための全体の処理の流れを図３に示す。図１及び図３を用いて音声合成装置の各部の動作を以下に説明する。
【００２２】
原韻律パターン選択部１５は入力される原音声番号１１から、使用する原韻律パターンをデータベース（原韻律パターン記憶部）１６から検索し選択する（Ｓ１０１）。一方、音素片選択部１２は入力文字列１０から音素片ラベルを得、これを基に音素片記憶部１４から必要な音素片を検索する。置換区間以外の区間に関してはこのまま選択された原韻律パターンと音素片から波形合成部１７によって波形生成処理を行う。ただし、置換区間の後の区間に関しては置換区間の伸縮に伴って時間軸およびピッチ軸上での移動があり得るが変形はない。
【００２３】
置換区間に関する処理を以下に説明する。置換語句韻律計算部１３では一般的な規則音声合成の手法により、音韻継続時間長およびピッチパターン代表点の時刻を計算する（Ｓ１０２）。代表点は、入力文字列から得られる韻律的な句境界やアクセント核位置などの情報と先に計算されて音韻継続時間長から設定される。
【００２４】
図２は代表点の設定例を表わしている。縦線は点線ＢＬがモーラ境界、実線Ｌがアクセント句境界である。モーラの重心点を代表点に選んだ例を○で、アクセントパターンの上昇区間、下降区間、定常区間の各始終点の中から代表点に選んだ例を×で示した。アクセント句のモーラ数をＮ、アクセント核位置を第Ｍモーラとすると、後者の例で具体的に代表点に選んだのは、第１モーラ開始点付近、第１・２モーラ境界付近、第Ｍ・Ｍ＋１モーラ境界付近、第Ｍ＋１・Ｍ＋２境界付近、第Ｎモーラ終端付近である。付近とは具体的には音韻種別、アクセント型、アクセント変化量などから計算したオフセットを、各モーラ境界位置に加えるなどの方法で計算された位置を指す。なお、ここではピッチパターン生成は行わないので、入力文字列の情報から代表点の時刻を決定するだけで、ピッチは計算しない。図に曲線で示されたピッチパターンは代表点の性格を示すために補助的に添えたものである。
【００２５】
このような代表点は原パターンに対しても同様の定義によって予め設定されている置換語句の場合と異なり、原パターンではピッチパターンも既に存在するので代表点は時刻だけでなくピッチも意味を持つ。代表点移動部では置換語句の代表点時刻に合わせた原パターンの代表点の移動を行う。具体的な処理は図３中の代表点移動の処理（Ｓ１０３）にあたり、この内部処理を図４に示す。ここでＴｉ，Ｐｉはそれぞれ原パターンの第ｉ代表点の時刻とピッチ、Ｔ′ｉ，Ｐ′ｉはそれぞれ変換パターンの第ｉ代表点の時刻とピッチ、区間［ｉ，ｊ］は代表点ｉとｊの間の区間、Ｐ（ｔ）は時刻ｔのピッチを表わす。Ｔ′ｉには当初規則によって計算された置換語句の代表点時刻が代入されている。Ｔｄ，Ｐｄは処理途中の原パターンの移動量を補正するための補助変数である。
【００２６】
代表点移動処理は、ｉ＝０、Ｔｄ＝０（Ｓ２０１）から処理を始め、ｉ＜総代表点数であれば（Ｓ２０２）、次のステップにおいて、ｉ＝０もしくは区間［ｉ−１，ｉ］は伸縮区間か否かを判断し（Ｓ２０３）、伸縮区間であれば更にＴ′ｉ＞Ｔｉ＋Ｔｄであるか否かを判断し（Ｓ２０４）、ＹＥＳであれば区間［ｉ−１，ｉ］を時刻Ｔ′ｉ−Ｔｄまで外挿する（Ｓ２０６）。前記ステップＳ２０３において伸縮区間でなければ、Ｔ′ｉ＝Ｔｉ、即ち、変換パターンの第ｉ代表点の時刻は原パターンの第ｉ代表点の時刻と一致させる（Ｓ２０５）。前記ステップ２０６において前記外挿を行うと、次に、変換パターンの第ｉ代表点のピッチをＰ（Ｔ′ｉ−Ｔｄ）＋Ｐｄにし（Ｓ２０７）、さらに、次のステップにおいて、補助変数ＴｄをそれぞれＴｄ＋Ｔ′ｉ−Ｔｉに、また、補助変数ＰｄをＰｄ＋Ｐ′ｉ−Ｐｉにして処理を終了する。また、前記ステップＳ２０２において、ｉ＜総代表点数でなければ処理を終了する。
ここでは定常区間を伸縮区間とし、それ以外の上昇区間および下降区間を非伸縮区間として扱うことで原パターンの概形を保存する。
【００２７】
つまり、以上の処理において、伸縮区間の場合（ステップＳ２０３において、ＹＥＳの場合）は原パターンの代表点時刻を置換語句の代表点時刻へ移動する（Ｓ２０４）。負方向への移動すなわち区間の短縮の場合は、その時刻の原パターンのピッチを変換代表点のピッチとする。正方向への移動すなわち区間の伸長の場合は、原パターンを外挿（Ｓ２０６）した上でその時刻におけるピッチを変換代表点のピッチとする（Ｓ２０７）。非伸縮区間では原パターンの代表点を伸縮区間の伸縮量に応じて平行移動する（Ｓ２０７）。
【００２８】
代表点間補間部１８では代表点間のパターンを計算する。図５，図６を用いてその方法の一例を示す。図５では伸縮区間に対して原パターンの始点終点間の直線に沿った線形伸縮によって区間パターンを計算している。パターンの伸縮によって補間する方法はこの例以外にも、短縮時の切りつめ、伸長時の終端付近繰り返しなど様々な方法が考えられる。図６では非伸縮区間に対して原パターンを平行移動してそのまま用いる方法を示している。この他により簡単な補間方法としては代表点間を直線でつなぐ方法が考えられる。
【００２９】
波形合成部１７ではこのようにして得られたピッチパターンとその他の韻律パターンと音素片系列によって波形生成処理を行う。なお、ここで行われる音素片の伸縮及び接続、基本周波数の付与などの処理に関しては規則音声合成の一般的技術を用いることが出来るため、ここでは詳細な説明は省略する。
【００３０】
【発明の効果】
本発明によれば、定型区間のみならず置換区間に関しても原パターンの特徴を生かし自然性劣化を防ぐことができる。
【００３２】
また、簡易な方法によって聴感上違和感の少ないパターンを生成できる。
【００３３】
また、少ないデータ量と処理量でありながら原パターンの特徴を再現することができる。
【００３４】
また、少ない処理量でありながら原パターンの特徴損失を低く抑えることができる。
【図面の簡単な説明】
【図１】本発明の音声合成装置の一実施例を示すブロック図である。
【図２】ピッチパターン代表点の設定例を示すための図である。
【図３】音声合成のための処理を説明するフロー図である。
【図４】音声合成処理フローにおける代表点移動行程の内部処理を説明するためのフロー図である。
【図５】代表点間のパターンを計算する方法を説明するための図である。
【図６】代表点間のパターンを計算する他の方法を説明するための図である。
【図７】自然音声から抽出された定型部分と従来の規則によって計算された非定型部分の相違を説明するための図である。
【符号の説明】
１０…入力文字列、１１…原音声番号、１２…音素片選択部、１３…置換語句韻律計算部、１４…音素片記憶部、１５…原韻律パターン選択部、１６…原韻律パターン記憶部、１７…波形合成部、１８…代表点間補間部、１９…代表点移動部、２０…合成音声。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech synthesis apparatus and method, and more particularly to a speech synthesis apparatus and method that can be used to convert into speech a text in which some words of a fixed phrase are changed on each use.
[0002]
[Prior art]
(1) Japanese Patent Application Laid-Open No. 9-34490, "Speech synthesizer and speech synthesis method, navigation system, and recording medium", discloses a technique that uses the prosody of natural speech for a fixed sentence such as "Here is near XX." and prosody calculated by rule for non-fixed information such as "Shinagawa", connects the two consistently so that they sound smooth, and synthesizes the speech of the substituted sentence "Here is near Shinagawa."
(2) Proceedings of the 1998 Spring Meeting of the Acoustical Society of Japan, 1-7-24, "Examination of Prosody Control Method Using Word and Sentence Prosody Database", discloses connecting the prosody of a fixed phrase extracted from natural speech and the rule-calculated prosody of the variable part by the following equation.
$$p'_i = p_i \left( \left(1 - \frac{i}{N}\right) \frac{p_0}{P_s} + \frac{i}{N} \cdot \frac{p_N}{P_E} \right)$$
Here, N is the number of morae of the word that enters the variable part, p'_i (i = 0, 1, 2, 3, ..., N) is the F_0 data to be obtained, p_i is the F_0 data obtained from the word F_0 pattern DB, and P_s and P_E are the F_0 data at the start and at the end of the variable part, respectively.
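As a rough illustration of the prior-art expression above, the following Python sketch evaluates it exactly as printed. The function name `scale_word_f0` and the sample values are hypothetical and do not come from the cited paper; only the expression itself is taken from the text.

```python
def scale_word_f0(p, P_s, P_E):
    """Apply the prior-art (2) expression, as printed, to word F0 data.

    p   : F0 values p_0..p_N from a word F0 pattern DB (hypothetical data)
    P_s : F0 at the start of the variable part
    P_E : F0 at the end of the variable part
    """
    N = len(p) - 1
    if N < 1:
        raise ValueError("need at least two F0 values")
    scaled = []
    for i, p_i in enumerate(p):
        w = i / N  # 0 at the start of the word, 1 at its end
        # Weighted mix of the two ratios, following the printed formula.
        factor = (1 - w) * (p[0] / P_s) + w * (p[N] / P_E)
        scaled.append(p_i * factor)
    return scaled
```

When the word's own first and last values already equal P_s and P_E, every factor is 1 and the contour is returned unchanged. In every case a single smoothly varying factor multiplies the whole word contour, which is the linear transformation of the entire pattern that the description goes on to criticize.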
[0003]
[Problems to be solved by the invention]
Conventionally, speech synthesizers have used either a recording-and-editing method or a rule-based synthesis method. In the former, an announcer or the like records speech phrase by phrase, and the phrases are selected and concatenated as needed to create a message. This can yield good speech close to a natural voice, but the amount of data is large, phrases that have not been registered cannot be handled, and the same speaker must be secured whenever new phrases are added. The latter stores speech data in fine units such as phonemes and syllables so that arbitrary vocabulary can be synthesized, but its sound quality is inferior to that of the recording-and-editing method. In particular, because prosody patterns such as fundamental frequency, phoneme duration, and amplitude are assigned by rule, the result is inevitably mechanical and unnatural.
For this reason, the recording-and-editing method, with its good sound quality, is used for fixed messages that do not require output of arbitrary vocabulary, and rule-based synthesis is used where conversion from text to speech is required. However, there are many applications in which arbitrary vocabulary is partly embedded in a fixed sentence, such as a place name embedded in a fixed message in car navigation voice guidance. In such cases one is forced to choose among three options: adopting rule-based synthesis and lowering the overall sound quality for the sake of a small amount of arbitrary vocabulary, giving up the output of arbitrary vocabulary and using the recording-and-editing method, or adopting a mixed method in which the fixed part is produced by recording and editing and only the arbitrary-vocabulary part by rule-based synthesis.
The problem with mixing recording-and-editing and rule-based synthesis is that the voice quality of the output speech differs completely between the two methods, so the result not only sounds unnatural but is also very hard to understand.
[0004]
In contrast, prior art (1) generates phonological parameters for the fixed sentence as well, by concatenating phonemes, syllables, or the like in the manner of rule-based synthesis, and assigns to them the fundamental frequency and phoneme durations extracted from natural speech. This improves naturalness while maintaining speaker continuity with the arbitrary-word part. The issue here is how to connect the prosody pattern of the fixed part, extracted from natural speech, with that of the non-fixed part, calculated by rule, naturally and without discontinuity. How the pitch patterns are connected is particularly important for the naturalness of the synthesized speech. Prior art (1) says only that the connection is made "so that the connecting portion becomes smooth" and discloses no specific method.
[0005]
For the above problem, prior art (2) describes a method of linearly transforming the entire rule-calculated pitch pattern of the non-fixed part so that the pitch at the connecting portions matches the fixed part. The problem with this method is explained below with reference to FIG. 7. In this example, the pitch pattern of the fixed sentence, shown by the broken line, has three peaks due to accent components. The third peak is the largest and the second is the smallest. Natural speech gains rich expressiveness from such a well-modulated pitch pattern. If this pattern is replaced, by the method of prior art (2), with the rule-calculated pattern shown by the solid line, all the peaks become nearly equal in size. As a result, the human, natural characteristics of the pitch pattern of the fixed part cannot be fully exploited.
[0006]
Accordingly, an object of the present invention is to replace the pitch pattern of the non-fixed part while impairing the naturalness of the fixed part as little as possible.
[0007]
[Means for Solving the Problems]
The invention of claim 1 is a speech synthesizer that synthesizes speech using an original pitch pattern extracted from a speech phrase uttered by a human, comprising, in order to synthesize the speech phrase with a part of it replaced by a different replacement phrase: an original prosody pattern storage unit that stores a prosody pattern including an original pitch pattern in which a plurality of representative points are set in advance for the replacement section; a replacement phrase prosody calculation unit that calculates the prosody of the replacement phrase and the times of the representative points of its pitch pattern; a representative point moving unit that expands or contracts the distances between the representative points of the original pitch pattern to match the times of the representative points of the pitch pattern of the replacement phrase; and an inter-representative-point interpolation unit that interpolates the pattern between the representative points obtained by the representative point moving unit so as to preserve the outline of the original pitch pattern, wherein a pitch pattern corresponding to the replacement phrase is generated using a conversion pattern obtained by converting the original pitch pattern of the replacement section with the representative point moving unit and the inter-representative-point interpolation unit.
[0008]
According to a second aspect of the present invention, there is provided the voice synthesizer according to the first aspect, wherein a center point or a center of gravity of each mora is used as a representative point of the pitch pattern.
[0009]
According to a third aspect of the present invention, in the speech synthesizer according to the first aspect, the start and end points of each rising part, falling part, and steady part of the pitch pattern are used as the representative points of the pitch pattern.
[0010]
According to a fourth aspect of the present invention, in the speech synthesizer according to any one of the first to third aspects, the representative point moving unit expands or contracts the distances between representative points, in a manner that preserves the outline of the pattern, mainly by expanding or contracting the steady parts of the pattern, and thereby moves the representative points to the times corresponding to the replacement phrase.
[0011]
According to a fifth aspect of the present invention, in the speech synthesizer according to any one of the first to fourth aspects, the inter-representative-point interpolation unit uses linear interpolation as the method of interpolating the pattern between representative points.
[0012]
According to a sixth aspect of the present invention, in the speech synthesizer according to any one of the first to fourth aspects, the inter-representative-point interpolation unit uses, as the method of interpolating the pattern between representative points, a method of expanding or contracting the pitch pattern before replacement.
[0013]
According to a seventh aspect of the present invention, in a speech synthesis method for synthesizing speech using an original pitch pattern extracted from a speech phrase uttered by a human, in order to synthesize the speech phrase with a part of it replaced by a different phrase, the times of the representative points of the pitch pattern of the replacement part are calculated, the distances between representative points are expanded or contracted to match those times, the pattern between the representative points is interpolated so as to preserve the outline of the original pitch pattern, and a pitch pattern corresponding to the replacement part is generated using the conversion pattern thus obtained.
[0014]
An eighth aspect of the present invention is the voice synthesizing method according to the seventh aspect, wherein a center point or a center of gravity of each mora is used as a representative point of the pitch pattern.
[0015]
According to a ninth aspect of the present invention, in the speech synthesis method according to the seventh aspect, the start and end points of each rising part, falling part, and steady part of the pitch pattern are used as the representative points of the pitch pattern.
[0016]
According to a tenth aspect of the present invention, in the speech synthesizer according to any one of the seventh to ninth aspects, as the method of expanding or contracting the distances between the representative points, the representative points are moved to the times corresponding to the replacement phrase mainly by expanding or contracting the steady parts of the pattern.
[0017]
An eleventh aspect of the present invention is the voice synthesizing method according to any one of the seventh to tenth aspects, wherein linear interpolation is used as a method of interpolating the pattern between the representative points.
[0018]
According to a twelfth aspect of the present invention, in the speech synthesis method according to any one of the seventh to ninth aspects, a method of expanding or contracting the pitch pattern before replacement is used as the method of interpolating the pattern between representative points.
[0019]
BEST MODE FOR CARRYING OUT THE INVENTION
In the following description, a speech phrase uttered in advance by a human is referred to as the original speech, and a prosody pattern extracted from the original speech is referred to as the original prosody pattern. A prosody pattern includes patterns of duration, pitch, power, and so on, but since the pitch pattern is mainly described here, the word "pattern" used alone refers to the pitch pattern. In the original speech, the portion that is replaced with a different phrase is called the replacement section, the phrase that replaces it is called the replacement phrase, and the pattern after replacement is called the conversion pattern.
[0020]
First, the features of the present invention will be described using a specific operation example. Suppose the original speech is a phrase such as "Ahead, there is congestion around Yokohama Station." Let "Yokohama Station" be the replacement section and "Yokohama Arena" be the replacement phrase. No human-uttered speech phrase exists for the replacement phrase, so its pattern must be synthesized from the character string alone. A feature of the present invention is that, rather than creating the pattern by rule-based synthesis, the original pattern is deformed to fit the replacement phrase. "Yokohama Station" in the original pattern is a 6-mora accent phrase of accent type 4, whereas "Yokohama Arena" in the conversion pattern is an 8-mora accent phrase of accent type 6, and the phoneme sequences also differ, so at least a deformation along the time axis is necessary. The present invention calculates only the times of the main representative points by rule and creates the conversion pattern by moving the representative points so as to preserve the outline of the original pattern, with the aim of retaining as much of the naturalness of the original pattern as possible.
[0021]
An embodiment of the present invention will be described. FIG. 1 shows the configuration of an embodiment of the speech synthesizer of the present invention. FIG. 3 shows the overall processing flow for speech synthesis. The operation of each unit of the speech synthesizer will be described below with reference to FIGS. 1 and 3.
[0022]
The original prosody pattern selection unit 15 searches the database (original prosody pattern storage unit) 16 for the original prosody pattern to be used, based on the input original speech number 11, and selects it (S101). Meanwhile, the phoneme segment selection unit 12 obtains phoneme segment labels from the input character string 10 and, based on these, searches the phoneme segment storage unit 14 for the necessary phoneme segments. For sections other than the replacement section, the waveform synthesizing unit 17 performs waveform generation directly from the selected original prosody pattern and phoneme segments. The section following the replacement section may be shifted along the time axis and the pitch axis as the replacement section expands or contracts, but it is not deformed.
[0023]
The processing for the replacement section is described below. The replacement phrase prosody calculation unit 13 calculates the phoneme durations and the times of the pitch pattern representative points by a general rule-based speech synthesis technique (S102). The representative points are set from information obtained from the input character string, such as prosodic phrase boundaries and accent nucleus positions, together with the previously calculated phoneme durations.
[0024]
FIG. 2 shows an example of setting representative points. Among the vertical lines, the dotted lines BL are mora boundaries and the solid lines L are accent phrase boundaries. An example in which the center of gravity of each mora is chosen as a representative point is marked with ○, and an example in which representative points are chosen from among the start and end points of the rising, falling, and steady sections of the accent pattern is marked with ×. Letting N be the number of morae in the accent phrase and the M-th mora be the accent nucleus position, the representative points chosen in the latter example are: near the start of the first mora, near the boundary between the first and second morae, near the boundary between the M-th and (M+1)-th morae, near the boundary between the (M+1)-th and (M+2)-th morae, and near the end of the N-th mora. "Near" means a position calculated, for example, by adding to each mora boundary position an offset calculated from the phoneme type, accent type, amount of accent change, and so on. Since no pitch pattern is generated here, only the times of the representative points are determined from the information in the input character string; the pitch is not calculated. The pitch pattern drawn as a curve in the figure is included only as an aid to show the nature of the representative points.
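The second way of choosing representative points (the × marks) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `representative_times`, the uniform 150 ms mora durations used in the usage note, and the default zero offsets are all hypothetical, and the sketch only handles accent nucleus positions for which the five points are distinct.

```python
from itertools import accumulate

def representative_times(mora_durations, accent_nucleus, offsets=(0, 0, 0, 0, 0)):
    """Return the times of the five representative points of one accent phrase.

    mora_durations : duration of each of the N morae (e.g. in ms)
    accent_nucleus : M, the 1-based position of the accent nucleus mora
    offsets        : per-point corrections standing in for the "near" offsets
                     (hypothetical values; the text does not give a formula)
    """
    n = len(mora_durations)
    m = accent_nucleus
    if not 2 <= m <= n - 2:
        raise ValueError("this sketch needs 2 <= M <= N - 2")
    # b[k] is the end time of the k-th mora; b[0] is the phrase start.
    b = [0] + list(accumulate(mora_durations))
    # Start of mora 1, boundary 1/2, boundary M/M+1, boundary M+1/M+2, end of mora N.
    anchors = [b[0], b[1], b[m], b[m + 1], b[n]]
    return [t + d for t, d in zip(anchors, offsets)]
```

With hypothetical 150 ms morae, a 6-mora type-4 phrase such as "Yokohama Station" gives `[0, 150, 600, 750, 900]` and an 8-mora type-6 phrase such as "Yokohama Arena" gives `[0, 150, 900, 1050, 1200]`. Only the section between the second and third points differs in length.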
[0025]
Such representative points are also set in advance for the original pattern, according to the same definition. Unlike the case of the replacement phrase, the pitch pattern already exists in the original pattern, so a representative point has a meaningful pitch as well as a time. The representative point moving unit moves the representative points of the original pattern to match the representative point times of the replacement phrase. The specific processing corresponds to the representative point moving process in FIG. 3 (S103), and its internal processing is shown in FIG. 4. Here, Ti and Pi are respectively the time and pitch of the i-th representative point of the original pattern, T'i and P'i are respectively the time and pitch of the i-th representative point of the conversion pattern, section [i, j] is the section between representative points i and j, and P(t) is the pitch at time t. T'i is initialized with the representative point time of the replacement phrase calculated by rule. Td and Pd are auxiliary variables for correcting the amount of movement of the original pattern during processing.
[0026]
The representative point moving process starts with i = 0 and Td = 0 (S201). If i < the total number of representative points (S202), the next step determines whether i = 0 or the section [i-1, i] is a stretchable section (S203). If so, it is further determined whether T'i > Ti + Td (S204), and if YES, the section [i-1, i] is extrapolated up to time T'i - Td (S206). If the section is not a stretchable section in step S203, T'i = Ti is set, that is, the time of the i-th representative point of the conversion pattern is made to coincide with the time of the i-th representative point of the original pattern (S205). After the extrapolation in step S206, the pitch of the i-th representative point of the conversion pattern is set to P(T'i - Td) + Pd (S207), and in the following step the auxiliary variable Td is updated to Td + T'i - Ti and the auxiliary variable Pd to Pd + P'i - Pi, after which the processing ends. If i < the total number of representative points does not hold in step S202, the processing ends.
Here, the outline of the original pattern is preserved by treating the steady sections as stretchable sections and the remaining rising and falling sections as non-stretchable sections.
[0027]
That is, in the above processing, for a stretchable section (YES in step S203), the representative point time of the original pattern is moved to the representative point time of the replacement phrase (S204). For a movement in the negative direction, that is, a shortening of the section, the pitch of the original pattern at that time is used as the pitch of the converted representative point. For a movement in the positive direction, that is, a lengthening of the section, the original pattern is extrapolated (S206) and the pitch at that time is used as the pitch of the converted representative point (S207). In a non-stretchable section, the representative point of the original pattern is translated according to the amount of expansion or contraction of the stretchable sections (S207).
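The behavior summarized in this paragraph can be sketched in Python as follows. This is not a transcription of the flowchart in FIG. 4 but a simplified illustration under explicit assumptions: the function name `move_representative_points` and its arguments are hypothetical, the contour inside a stretchable section is approximated by the straight line through its two representative points (so shortening and lengthening use the same formula), and a non-stretchable section keeps its original duration and pitch change and is simply shifted.

```python
def move_representative_points(orig, target_times, stretchable):
    """Move representative points of the original pattern (illustrative sketch).

    orig         : list of (time, pitch) representative points, times increasing
    target_times : rule-computed representative point times of the replacement phrase
    stretchable  : stretchable[i] is True if section [i-1, i] is a steady
                   (stretchable) section; stretchable[0] is unused
    """
    if not (len(orig) == len(target_times) == len(stretchable)):
        raise ValueError("inputs must have the same length")
    moved = [(target_times[0], orig[0][1])]
    for i in range(1, len(orig)):
        (t_prev, p_prev), (t_cur, p_cur) = orig[i - 1], orig[i]
        new_t_prev, new_p_prev = moved[-1]
        orig_len = t_cur - t_prev
        if stretchable[i]:
            # The section end is moved to the replacement phrase's time.
            new_len = target_times[i] - new_t_prev
            if new_len <= 0:
                raise ValueError("target time precedes the previous moved point")
            # Assumption: read (or extrapolate) the pitch on the straight line
            # through the section's two original representative points.
            rise = (p_cur - p_prev) * new_len / orig_len
        else:
            # Non-stretchable: keep duration and pitch change, i.e. translate.
            new_len = orig_len
            rise = p_cur - p_prev
        moved.append((new_t_prev + new_len, new_p_prev + rise))
    return moved
```

For example, with hypothetical original points `[(0, 120.0), (150, 180.0), (600, 171.0), (750, 111.0), (900, 101.0)]`, target times `[0, 150, 900, 1050, 1200]`, and the second and fourth sections marked stretchable, the rise and the fall keep their original size and slope while the steady section between them is lengthened from 450 to 750 time units.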
[0028]
The inter-representative-point interpolation unit 18 calculates the pattern between the representative points. An example of the method is described with reference to FIGS. 5 and 6. In FIG. 5, for a stretchable section, the section pattern is calculated by linear expansion or contraction along the straight line between the start and end points of the original pattern. Besides this example, various other methods of interpolating by expanding or contracting the pattern are conceivable, such as truncation when shortening and repetition of the portion near the end when lengthening. FIG. 6 shows a method in which, for a non-stretchable section, the original pattern is translated and used as it is. As a simpler interpolation method, the representative points may be connected by straight lines.
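One possible reading of the FIG. 5 style of interpolation is sketched below in Python. The interpretation is an assumption, not something the text specifies: the original section is split into the straight line between its end points plus a residual, the residual is stretched linearly in time, and it is added back onto the straight line between the moved end points. The names `sample` and `warp_section` are hypothetical.

```python
from bisect import bisect_right

def sample(times, values, t):
    """Piecewise-linear read of a sampled contour at time t (clamped at the ends)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    k = bisect_right(times, t)
    w = (t - times[k - 1]) / (times[k] - times[k - 1])
    return values[k - 1] + w * (values[k] - values[k - 1])

def warp_section(times, values, new_start, new_end, n):
    """Stretch one section of the original contour between moved end points.

    times, values      : samples of the original section (first and last samples
                         are its representative points)
    new_start, new_end : (time, pitch) of the moved representative points
    n                  : number of output samples (at least 2)
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    t0, t1 = times[0], times[-1]
    p0, p1 = values[0], values[-1]
    (nt0, np0), (nt1, np1) = new_start, new_end
    out = []
    for k in range(n):
        u = k / (n - 1)  # normalized position within the section
        # Deviation of the original contour from its own end-to-end line.
        residual = sample(times, values, t0 + u * (t1 - t0)) - (p0 + u * (p1 - p0))
        out.append((nt0 + u * (nt1 - nt0), np0 + u * (np1 - np0) + residual))
    return out
```

Under this reading, a section whose moved end points are the original end points shifted by the same time and pitch offsets is reproduced as a pure translation, which corresponds to the FIG. 6 case. Dropping the residual term gives the simpler straight-line interpolation mentioned at the end of the paragraph.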
[0029]
The waveform synthesizing unit 17 performs waveform generation from the pitch pattern obtained in this manner, the other prosody patterns, and the phoneme segment sequence. General techniques of rule-based speech synthesis can be used for the processing performed here, such as expanding, contracting, and connecting phoneme segments and assigning the fundamental frequency, so a detailed description is omitted.
[0030]
[Effects of the Invention]
According to the present invention, it is possible to prevent naturalness degradation not only in the fixed section but also in the replacement section by utilizing the features of the original pattern.
[0032]
In addition, a pattern with less auditory discomfort can be generated by a simple method.
[0033]
Further, the feature of the original pattern can be reproduced with a small amount of data and a small amount of processing.
[0034]
In addition, the feature loss of the original pattern can be suppressed to a low level with a small amount of processing.
[Brief description of the drawings]
FIG. 1 is a block diagram showing one embodiment of a speech synthesizer of the present invention.
FIG. 2 is a diagram showing a setting example of a pitch pattern representative point.
FIG. 3 is a flowchart illustrating a process for speech synthesis.
FIG. 4 is a flowchart for explaining internal processing of a representative point moving process in a voice synthesis processing flow.
FIG. 5 is a diagram for explaining a method of calculating a pattern between representative points.
FIG. 6 is a diagram for explaining another method of calculating a pattern between representative points.
FIG. 7 is a diagram for explaining a difference between a fixed part extracted from natural speech and an unfixed part calculated according to a conventional rule.
[Explanation of symbols]
10: input character string, 11: original voice number, 12: phoneme selection unit, 13: replacement phrase prosody calculation unit, 14: phoneme storage unit, 15: original prosody pattern selection unit, 16: original prosody pattern storage unit, 17: Waveform synthesis unit, 18: Interpolation unit between representative points, 19: Representative point moving unit, 20: Synthesized voice.

Claims

1. A speech synthesizer that synthesizes speech using an original pitch pattern extracted from a speech phrase uttered by a human, comprising, in order to synthesize the speech phrase with a part of it replaced by a different replacement phrase: an original prosody pattern storage unit that stores a prosody pattern including an original pitch pattern in which a plurality of representative points are set in advance for the replacement section; a replacement phrase prosody calculation unit that calculates the prosody of the replacement phrase and the times of the representative points of its pitch pattern; a representative point moving unit that expands or contracts the distances between the representative points of the original pitch pattern to match the times of the representative points of the pitch pattern of the replacement phrase; and an inter-representative-point interpolation unit that interpolates the pattern between the representative points obtained by the representative point moving unit so as to preserve the outline of the original pitch pattern, wherein a pitch pattern corresponding to the replacement phrase is generated using a conversion pattern obtained by converting the original pitch pattern of the replacement section with the representative point moving unit and the inter-representative-point interpolation unit.

2. The speech synthesizer according to claim 1, wherein a center point or a center of gravity of each mora is used as a representative point of the pitch pattern.

3. The speech synthesizer according to claim 1, wherein the start and end points of each rising part, falling part, and steady part of the pitch pattern are used as representative points of the pitch pattern.

4. The speech synthesizer according to any one of claims 1 to 3, wherein the representative point moving unit expands or contracts the distances between representative points, in a manner that preserves the outline of the pattern, mainly by expanding or contracting the steady parts of the pattern, and moves the representative points to the times corresponding to the replacement phrase.

5. The speech synthesizer according to any one of claims 1 to 4, wherein the inter-representative-point interpolation unit uses linear interpolation as the method of interpolating the pattern between representative points.

6. The speech synthesizer according to any one of claims 1 to 4, wherein the inter-representative-point interpolation unit uses, as the method of interpolating the pattern between representative points, a method of expanding or contracting the pitch pattern before replacement.

7. A speech synthesis method for synthesizing speech using an original pitch pattern extracted from a speech phrase uttered by a human, wherein, in order to synthesize the speech phrase with a part of it replaced by a different phrase, the times of the representative points of the pitch pattern of the replacement part are calculated, the distances between representative points are expanded or contracted to match the times of the representative points of the pitch pattern of the replacement part, the pattern between the representative points is interpolated so as to preserve the outline of the original pitch pattern, and a pitch pattern corresponding to the replacement part is generated using the conversion pattern thus obtained.

8. The speech synthesis method according to claim 7, wherein a center point or a center of gravity of each mora is used as a representative point of the pitch pattern.

9. The speech synthesis method according to claim 7, wherein the start and end points of each rising part, falling part, and steady part of the pitch pattern are used as representative points of the pitch pattern.

10. The speech synthesizer according to any one of claims 7 to 9, wherein, as the method of expanding or contracting the distances between the representative points, the representative points are moved to the times corresponding to the replacement phrase mainly by expanding or contracting the steady parts of the pattern, this being a speech synthesis method.

11. The speech synthesis method according to any one of claims 7 to 10, wherein linear interpolation is used as the method of interpolating the pattern between the representative points.

12. The speech synthesis method according to any one of claims 7 to 9, wherein a method of expanding or contracting the pitch pattern before replacement is used as the method of interpolating the pattern between representative points.