JP4326251B2 - Text-to-speech synthesizer, text-to-speech synthesis method and program thereof

Info

Publication number: JP4326251B2
Application number: JP2003102148A
Authority: JP
Inventors: 一浩三木; 治木村; 智一森尾
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2003-04-04
Filing date: 2003-04-04
Publication date: 2009-09-02
Anticipated expiration: 2023-04-04
Also published as: JP2004309724A

Abstract

PROBLEM TO BE SOLVED: A correction history lets an engineer see the trend of past corrections and manually improve the prosody generation rules afterwards. However, automatically improving rules built from a very large number of conditions by analyzing a very large correction history is complicated and highly specialized, and it is difficult for an end user to adjust the prosody generation rules themselves to his or her preference.

SOLUTION: A text-to-speech synthesizer holds the contents and conditions of past corrections made in a prosody correction part 103 as correction history information in a correction history holding part 106. It is equipped with a dynamic prosody generation part 102 that thereafter generates prosodic information by referring to the held correction history information and, when an entry matching the correction condition exists, automatically reflecting the correction contents associated with that condition.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、テキスト音声合成装置に関する。詳しくは、修正履歴を自動的に反映して韻律情報を生成する機能を備えたテキスト音声合成装置に関する。また、本発明は、修正履歴を自動的に反映して韻律情報を生成する方法を含むテキスト音声合成方法と、そのテキスト音声合成方法を実現するプログラムとに関する。
【０００２】
【従来の技術】
従来の音声合成においては、音声を構成する音素片を指定されたピッチパターン（韻律情報）にしたがって接続することにより、定められたピッチパターンで合成音声を生成していた。また、従来において、テキスト音声合成のピッチパターンは、入力されたテキストを解析し、その解析結果に基づき、予め定められた韻律生成規則にしたがって生成されていた。
【０００３】
生成される韻律は、テキスト解析時や韻律生成時の誤りなどを含むこともあり、この場合は、合成音声作成者（以後、作者と称する）の意図と異なる合成音声が生成されることになる。そのような韻律の誤りを修正するため、また、作者の好む韻律に調整するためには、技術者が韻律情報の記述された韻律ファイルの内容を書き換えるなど、経験に基づいた専門的な操作を行って韻律を決定するパラメータ（ピッチパターン、パワー等）を直接変更する必要があった。
【０００４】
このようなピッチパターンの修正を容易に行う方法として、韻律を制御する韻律ファイルの内容をディスプレイ上にグラフィカルに表示し、表示されたパラメータ（ピッチパターン、パワー等）のパターンをマウスで変更する方法が知られている（例えば、特許文献１参照）。この韻律修正方法では、処理された修正が記憶装置に記憶され、同じ韻律ファイルを用いて再度の音声合成を行う場合には、現在のパターンと修正履歴パターンとが表示される。これにより、修正の手間を低減し、かつ、修正の傾向をつかむことで、その後の韻律作成規則の情報を得ようとしている。
【０００５】
特許文献１に記載された韻律修正の手順を、図１８を参照して簡単に説明する。入力されたテキストは言語処理部４０１において、読みの情報、品詞の情報、係り受け情報などの言語関連情報が抽出される。その後、それらの言語関連情報を用いて韻律生成部４０２は、音声合成の基本情報となる韻律ファイル４０９を生成する。韻律修正部４０３においては、生成された韻律ファイル４０９の内容がグラフファイル表示生成部４１２によって画面表示され、画面表示されたグラフファイルを修正することにより、韻律ファイル４０９の修正を行う。この修正作業を行う部分がパターン修正部４１３である。
【０００６】
パターン修正部４１３によって修正された各パラメータの値は、修正履歴データとして修正履歴ＤＢ（修正履歴データベース）４０６に記録される。したがって、再度同じ韻律ファイル４０９を修正する機会には、過去の修正履歴もグラフファイル表示生成部４１２によって画面に表示されるため、同等の韻律の修正が容易に行える。また、修正の傾向なども修正履歴を解析することにより得ることができるため、その後の韻律生成における情報の一つとして扱うことができる。
【０００７】
このようにして作成された修正後韻律ファイル４１９に基づき、素片選択部４１４にて合成用の素片が選択され、音声合成部４２４はその素片を韻律ファイルにしたがって変形、接続することで合成音声を作成する。
【０００８】
【特許文献１】
上記特開平５-２３２９８０号公報
【０００９】
【発明が解決しようとする課題】
図１８に示された従来の方法では、音声合成を行うたびに、作成される韻律ファイル４０９に対して同じ修正を手動で行っており、その修正履歴も韻律ファイル４０９ごとに保持されている。したがって、同じ文章を合成する場合には過去の修正履歴を参考にすることができるものの、異なる文章を合成する場合には、同じ修正を過去の修正履歴を参考にすることなく行わなければならなかった。
【００１０】
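In addition, the prosody file 409 is turned into patterns so that the trend of corrections in the correction history can be identified, and that trend is used as one piece of information for later improvement of the prosody generation rules. However, the prosody generation rules are built from an enormous number of conditions, and analyzing an enormous correction history to improve the rules themselves requires far more specialized knowledge than correcting individual prosody files. In other words, even if a rough trend can be grasped, it has been difficult for an end user to easily modify the prosody generation rules themselves according to personal preference.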
【００１１】
本発明は上記に鑑みなされたものであり、その目的は、エンドユーザが、韻律生成規則そのものを直接修正することなく、また、膨大な韻律情報を音声合成のたびに修正することもなく、嗜好に合った音声合成を簡便に行える方法と、その方法を適用したテキスト音声合成装置と、その方法を実現するプログラムとを提供することにある。
【００１２】
【課題を解決するための手段】
上記の課題を解決するために、本発明は、エンドユーザによって韻律情報が修正されたときの修正内容及び修正条件を一対にして、修正履歴情報を構成する修正履歴データとして保持しておき、その修正以降に行われる韻律情報の生成においては、保持されている修正履歴情報を参照し、修正条件に合致する修正履歴データがあれば、その修正条件に対応付けられた修正内容が自動的に反映された韻律情報を生成する機能を有する構成である。
【００１３】
上記の構成であれば、修正履歴情報に登録された過去の各修正と同一の修正を改めて行う必要がなくなり、かつ、エンドユーザによって行われた修正に応じて修正履歴情報の学習が逐次進むため、エンドユーザの嗜好に合った韻律情報を簡便に生成することができる。
【００１４】
具体的には、本発明に係るテキスト音声合成装置は、テキストデータに対して言語解析を行い、言語情報を抽出する言語解析部と、修正履歴情報を保持する修正履歴保持部と、上記修正履歴情報を管理する修正履歴管理部と、上記修正履歴管理部を介して上記修正履歴情報を参照して、上記言語情報に基づき動的韻律情報を生成する動的韻律生成部と、外部修正命令に応じて上記動的韻律情報に修正を行うことにより確定韻律情報を生成し、かつ、上記外部修正命令に応じた修正に基づいて、上記修正履歴管理部を介して上記修正履歴情報を更新する韻律修正部と、上記言語情報及び上記確定韻律情報に基づいて合成音声を生成する合成音声生成部と、上記修正履歴管理部が、上記動的韻律生成部で参照される修正履歴情報を抽出する修正履歴抽出手段と、上記修正履歴保持部に保持された修正履歴情報を更新する修正履歴更新手段と、を有し、上記修正履歴保持部が、韻律スタイルの互いに異なる複数の修正履歴データベースを有し、上記修正履歴管理部が、選択命令に応じて、上記複数の修正履歴データベースの選択を制御するデータベース選択制御手段を更に有する構成である。
【００１５】
また、本発明に係るテキスト音声合成方法は、テキストデータに対して言語解析を行い、言語情報を抽出する言語解析ステップと、修正履歴情報を保持する修正履歴保持ステップと、上記修正履歴情報を管理する修正履歴管理ステップと、上記修正履歴管理部を介して上記修正履歴情報を参照して、上記言語情報に基づき動的韻律情報を生成する動的韻律生成ステップと、外部修正命令に応じて上記動的韻律情報に修正を行うことにより確定韻律情報を生成し、かつ、上記外部修正命令に応じた修正に基づいて、上記修正履歴管理部を介して上記修正履歴情報を更新する韻律修正ステップと、上記言語情報及び上記確定韻律情報に基づいて合成音声を生成する合成音声生成ステップと、上記修正履歴管理ステップが、上記動的韻律生成ステップで参照される修正履歴情報を抽出する修正履歴抽出ステップと、上記修正履歴保持ステップに保持された修正履歴情報を更新する修正履歴更新ステップと、を有し、上記修正履歴保持ステップが、韻律スタイルの互いに異なる複数の修正履歴データベースをさらに保持し、上記修正履歴管理ステップが、選択命令に応じて、上記複数の修正履歴データベースの選択を制御するデータベース選択制御ステップを更に有する。
【００１６】
また、本発明に係るテキスト音声合成プログラムは、テキストデータに対して言語解析を行い、言語情報を抽出する言語解析ステップと、修正履歴情報を保持する修正履歴保持ステップと、上記修正履歴情報を管理する修正履歴管理ステップと、上記修正履歴管理部を介して上記修正履歴情報を参照して、上記言語情報に基づき動的韻律情報を生成する動的韻律生成ステップと、外部修正命令に応じて上記動的韻律情報に修正を行うことにより確定韻律情報を生成し、かつ、上記外部修正命令に応じた修正に基づいて、上記修正履歴管理部を介して上記修正履歴情報を更新する韻律修正ステップと、上記言語情報及び上記確定韻律情報に基づいて合成音声を生成する合成音声生成ステップと、上記修正履歴管理ステップが、上記動的韻律生成ステップで参照される修正履歴情報を抽出する修正履歴抽出ステップと、上記修正履歴保持ステップに保持された修正履歴情報を更新する修正履歴更新ステップと、をコンピュータに実行させるテキスト音声合成プログラムにおいて、上記修正履歴保持ステップが、韻律スタイルの互いに異なる複数の修正履歴データベースをさらに保持し、上記修正履歴管理ステップが、選択命令に応じて、上記複数の修正履歴データベースの選択を制御するデータベース選択制御ステップを更に有することを特徴とする。
【００１７】
【発明の実施の形態】
本発明の内容を説明すると共に、好ましい実施の形態を記述する。なお、必要に応じて図１及び図２を参照する。図１は、本発明に係るテキスト音声合成装置の構成を概念的に示すブロック図である。図２は、本発明に係るテキスト音声合成方法を概念的に示すブロック図である。
【００１８】
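The text-to-speech synthesizer shown in FIG. 1 includes: a language analysis part 101 that performs language analysis on text data and extracts linguistic information; a correction history holding part 106 that holds correction history information; a correction history management part 105 that manages the correction history information; a dynamic prosody generation part 102 that refers to the correction history information via the correction history management part 105 and generates dynamic prosodic information based on the linguistic information; a prosody correction part 103 that generates finalized prosodic information by correcting the dynamic prosodic information in response to an external correction command and, in accordance with that correction, updates the correction history information via the correction history management part 105; and a synthesized speech generation part 104 that generates synthesized speech based on the linguistic information and the finalized prosodic information.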
【００１９】
図２に示されたテキスト音声合成方法（テキスト音声合成プログラムコード）は、テキストデータに対して言語解析を行って言語情報を抽出する言語解析ステップ２０１（言語解析プログラムコード）と、修正履歴情報を参照する修正履歴参照ステップ２１５（修正履歴参照プログラムコード）と、修正履歴参照ステップ２１５（修正履歴参照プログラムコード）と連携して、言語情報に基づき動的韻律情報を生成する動的韻律生成ステップ２０２（動的韻律生成プログラムコード）と、外部修正命令に応じて動的韻律情報に修正を行って確定韻律情報を生成する韻律修正ステップ２０３（韻律修正プログラムコード）と、動的韻律情報の修正に応じて修正履歴情報を更新する修正履歴更新ステップ２２５（修正履歴更新プログラムコード）と、言語情報及び確定韻律情報に基づいて合成音声を生成する合成音声生成ステップ２０４（合成音声生成プログラムコード）と、を含む構成である。
【００２０】
韻律情報を生成する一般的な方法としては、例えば、言語情報を引数にして、予め定められた韻律生成規則（静的な韻律生成規則）にしたがって韻律情報（静的韻律情報）を生成する方法、及び、予め用意された複数の韻律パターン片の中から１つの韻律パターン片を予め定められた規則（静的な韻律パターン片選択規則）にしたがって選択することにより韻律情報（静的韻律情報）を生成する方法が挙げられる。これに対して、本発明においては、修正履歴情報を参照することにより韻律情報（動的韻律情報）を動的に生成することを本質的な特徴としている。
【００２１】
本明細書において、「静的」及び「動的」とは、それぞれ、「修正履歴情報に依存せず固定的」及び「修正履歴情報に依存し、その情報に応じて可変的」を意味する。また、「静的韻律情報」とは、従来の如く修正履歴情報を参照せずに生成された韻律情報を意味する。また、「動的韻律情報」とは、修正履歴情報を参照して生成された韻律情報を意味し、修正履歴情報に合致する場合には、基本の韻律情報と異なる韻律情報となり、修正履歴情報に合致しない場合には、基本の韻律情報と同一の韻律情報となる。
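The static/dynamic distinction defined above can be illustrated with a minimal Python sketch. The record layout, the `delta_hz` field, the function names, and the pitch values returned by the stand-in rule are illustrative assumptions and do not come from the patent. The sketch only shows that dynamic prosodic information equals the base (static) prosodic information when no history entry matches, and differs from it when one does.

```python
from dataclasses import dataclass


@dataclass
class CorrectionRecord:
    """One history entry: a condition paired with a correction (hypothetical layout)."""
    condition: dict   # e.g. {"position": "sentence_end"}
    delta_hz: float   # correction content, here a pitch offset in Hz


def static_prosody(context: dict) -> float:
    """Stand-in for a fixed rule: it never looks at the correction history."""
    return 300.0 if context.get("position") == "sentence_end" else 400.0


def dynamic_prosody(context: dict, history: list) -> float:
    """Static result plus every stored correction whose condition matches."""
    pitch = static_prosody(context)
    for record in history:
        if all(context.get(k) == v for k, v in record.condition.items()):
            pitch += record.delta_hz
    return pitch
```

With an empty history, `dynamic_prosody` returns exactly what `static_prosody` returns, which is the "no matching entry" case in the definition.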
【００２２】
まず、テキスト音声合成装置の言語解析部１０１について説明する。言語解析部１０１は、音声合成を行う対象のテキストデータに対して言語解析を行う。この言語解析によって、様々な言語情報が抽出される（言語解析ステップ２０１）。言語情報としては、例えば、読みを特定する情報（音素記号列等）、品詞を特定する情報、係り受けを特定する情報が挙げられる。
【００２３】
次に、テキスト音声合成装置の動的韻律生成部１０２について説明する。動的韻律生成部１０２は、修正履歴情報を参照して（修正履歴参照ステップ２１５）、言語情報に基づき動的韻律情報を生成する（動的韻律生成ステップ２０２）。動的韻律生成部１０２は、修正履歴情報を参照して動的に韻律情報を生成する限りにおいて、どのような方式で修正履歴情報を参照してもよく、例えば、下記の３つの参照方式が挙げられる。
【００２４】
第１の修正履歴参照方式は、動的韻律生成部１０２において、静的な韻律生成規則にしたがって静的韻律情報を生成し、かつ、生成された静的韻律情報を修正履歴情報に応じて修正することによって、動的韻律情報を生成する方式である。
【００２５】
第２の修正履歴参照方式は、動的韻律生成部１０２において、修正履歴情報に応じて韻律生成規則の韻律生成パラメータの設定を修正し、韻律生成パラメータの設定により変化する動的な韻律生成規則にしたがって動的韻律情報を生成する方式である。
【００２６】
第３の修正履歴参照方式は、動的韻律生成部１０２において、静的な韻律選択規則にしたがって、言語情報に基づき複数の韻律パターン片から１つの最適な韻律パターン片を選択韻律パターン片として選択し、かつ、選択韻律パターン片を修正履歴情報に応じて修正することにより動的韻律情報を生成する方式である。
【００２７】
次に、テキスト音声合成装置の韻律修正部１０３について説明する。韻律修正部１０３は、動的韻律生成部１０２で生成された動的韻律情報に対して、外部修正命令に応じた修正を行って確定韻律情報を生成する（韻律修正ステップ２０３）。
【００２８】
韻律修正部１０３に外部修正命令が入力されなければ、動的韻律情報は、修正されずに確定韻律情報となる。他方、外部修正命令を受信すれば、外部修正命令に応じた修正が動的韻律情報に施されて、修正後の動的韻律情報が確定韻律情報となる。
【００２９】
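The correction content of a correction made in response to an external correction command, together with the correction condition at the time of that correction (hereinafter, a pair of correction content and correction condition is called correction history data), is passed to the correction history management part 105 in order to update the correction history information stored in the correction history holding part 106 (correction history update step 225). Here, updating the correction history information means adding correction history data to the correction history information, or changing part of the correction history data that makes up the correction history information.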
【００３０】
修正履歴情報を構成する各修正履歴データにおける修正内容の修正要素としては、修正前後における変化が特定できればいかなる韻律パラメータでもよい。修正要素としては、修正前後における修正量が規定できる韻律パラメータを用いることが好ましい。修正量が規定できる韻律パラメータとしては、例えば、１又は複数の音素記号からなる音素記号列単位や呼気段落単位やアクセント句単位に対する継続時間長、強度パターン（パワーパターン）又は基本周波数パターン（ピッチパターン）が挙げられる。また、各修正履歴データの修正内容の修正要素は、１種類の韻律パラメータのみを含む構成であってもよいし、複数種類の韻律パラメータを含む構成であってもよい。
【００３１】
他方、修正履歴データにおける修正条件は、言語情報を用いて条件設定できる。修正条件における条件要素としては、例えば、１又は複数の音素記号からなる音素記号列、品詞、文中位置、アクセント型が挙げられる。動的韻律生成部１０２が、第１の履歴参照方式を有する場合には、更に、静的韻律情報に含まれる少なくとも１種の韻律パラメータを条件要素として用いることもできる。修正履歴データの修正条件は、１種類の条件要素のみを含む構成であってもよいし、複数種類の条件要素を含む構成であってもよい。
【００３２】
韻律修正部１０３において動的韻律情報を修正する方法としては、例えば、修正可能な韻律パラメータのパターンをグラフィカルに表示し、表示されたパターンに対してマウスなどを用いて修正する方法や、動的韻律情報をテキストで表示し、表示されたテキストを編集することによって修正する方法が挙げられる。
【００３３】
更に、動的韻律情報の修正において、修正の反映された修正韻律情報を用いて生成される合成音声を逐次聞きながら修正をインタラクティブに調整してもよい。最終的に調整を完了した状態での修正韻律情報が、確定韻律情報として合成音声生成部１０４に送られる。この場合には、韻律修正部１０３が、修正対象となっているテキストデータの断片に対する言語情報及び修正韻律情報に基づいて合成音声を生成するサンプル音声合成手段を有するように構成する。
【００３４】
次に、テキスト音声合成装置の修正履歴保持部１０６及び修正履歴管理部１０５について説明する。修正履歴保持部１０６は、韻律修正部１０３において修正された修正内容をその修正条件と共に修正履歴情報として保持する。また、修正履歴管理部１０５は、韻律修正部１０３における修正に応じて修正履歴情報を更新する修正履歴更新手段や、動的韻律生成部１０２で参照する修正履歴情報を抽出する修正履歴抽出手段を備えた構成である。
【００３５】
修正履歴管理部１０５の修正履歴抽出手段は、動的韻律生成部１０２から送られてきた修正条件に基づいて、修正履歴保持部１０６から修正条件に合致する修正履歴データを抽出し、修正条件と対応付けられた修正内容を動的韻律生成部１０２に送る。修正条件に合致する修正履歴データが修正履歴保持部１０６に複数存在する場合には、それらすべてに対応する修正内容を抽出して動的韻律生成部１０２に送る。
【００３６】
修正履歴管理部１０５の修正履歴更新手段は、韻律修正部１０３から修正内容及び修正条件からなる修正履歴データを受け取ったとき、修正履歴保持部１０６に保持されている修正履歴情報を更新する。
【００３７】
ここで、受け取った修正条件を満たす修正条件を有する修正履歴データが修正履歴保持部１０６に保持されていない場合には、修正履歴保持部１０６に修正履歴データを追加することにより修正履歴情報を更新する。
【００３８】
また、受け取った修正条件を満たす修正条件と、受け取った修正内容と修正要素が異なる修正内容とを有する既存の修正履歴データが修正履歴保持部１０６に保持されている場合には、受け取った修正履歴データを追加することにより修正履歴情報を更新してもよいし、それらを統合して複数の修正要素を含む修正内容を有する１つの新たな修正履歴データに置き換えてもよい。
【００３９】
また、修正履歴管理部１０５は、韻律修正部１０３から修正履歴データを受け取ったとき、受け取った修正条件を満たす修正条件と、受け取った修正内容と修正要素が同一でありかつ修正処理は異なる修正内容とを有する既存の修正履歴データが修正履歴保持部１０６に保持されている場合には、基本の韻律情報に対して相対的に決定される最終の修正内容を次回以降の動的韻律生成において反映させることができるように更新する。この場合の更新においては、既に存在する修正履歴データと関連付けて修正履歴データを追加してもよいし、過去の修正内容との差分を考慮して新たな１つの修正履歴データに置き換えてもよい。
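The three update cases described above (no matching record, a matching record with a different correction element, and a matching record with the same element) might be sketched as follows. This is one possible reading under stated assumptions: conditions are compared by exact equality rather than by a more general "satisfies" test, the dictionary layout and the names `update_history`, `element`, and `delta` are hypothetical, and the same-element case is handled by accumulation, which is only one of the options the text allows.

```python
def update_history(history: list, condition: dict, element: str, delta: float) -> list:
    """Apply one manual correction to the history (a list of dict records).

    condition -- linguistic condition under which the correction was made
    element   -- corrected prosodic parameter, e.g. "pitch_hz" or "duration_ms"
    delta     -- extra change the user applied on top of the dynamic prosody
    """
    for record in history:
        if record["condition"] == condition:
            content = record["content"]
            # Same condition: a new element is merged into the record, while an
            # existing element is accumulated so that the stored total stays
            # relative to the base prosody.
            content[element] = content.get(element, 0.0) + delta
            return history
    # No record with this condition yet: add a new one.
    history.append({"condition": dict(condition), "content": {element: delta}})
    return history
```

Accumulating is chosen here because the user corrects prosody that already reflects earlier corrections, so a further change of -20 Hz on top of a stored -100 Hz corresponds to -120 Hz relative to the base prosody.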
【００４０】
修正履歴保持部１０６は、単一の修正履歴ＤＢで構成されていてもよいし、韻律スタイルの互いに異なる複数の修正履歴ＤＢで構成されていてもよい。韻律スタイルとは、例えば、大阪弁や京都弁などの方言に応じた口調のスタイル、及び、悲しい口調、楽しい口調、激しい口調、優しい口調などの感情に応じた口調のスタイルを意味する。なお、修正履歴保持部１０６が複数の修正履歴ＤＢを有する構成の場合、修正履歴情報とは、すべての修正履歴ＤＢに含まれる韻律修正データ全体、つまり、修正履歴保持部に保持された韻律修正データ全体を意味することに注意を要する。
【００４１】
以下においては、修正履歴保持部１０６が複数の修正履歴ＤＢを有する場合について説明する。必要に応じて図３及び図４を参照する。図３は、複数の修正履歴ＤＢからなる修正履歴保持部を備えたテキスト音声合成装置の構成を概念的に示すブロック図である。図４（ａ）〜（ｃ）は、複数の修正履歴ＤＢの選択を制御するＤＢ選択制御手段の構成例を概念的に示すブロック図である。
【００４２】
図３に示されるように、修正履歴保持部１０６が複数の修正履歴ＤＢ１１６を有する場合には、修正履歴管理部１０５が、修正履歴抽出手段１１５及び修正履歴更新手段１２５と共に、修正履歴ＤＢからの修正履歴データの抽出又は修正履歴ＤＢへの修正履歴情報の更新において、いずれの修正履歴ＤＢに対して行うかを制御するためのＤＢ選択制御手段１３５を有する構成とする。
【００４３】
ＤＢ選択制御手段１３５は、動的韻律生成部１０２によって参照される修正履歴ＤＢ（以下、参照用修正履歴ＤＢとも称す）及び韻律修正部１０３における修正に基づいて更新される修正履歴ＤＢ（以下、更新用修正履歴ＤＢとも称す）として、同一の修正履歴ＤＢを選択する手段であってもよいし、参照用修正履歴ＤＢと更新用修正履歴ＤＢとを互いに独立に選択する手段であってもよい。以下に、ＤＢ選択制御手段１３５の具体的な構成について説明する。
【００４４】
図４（ａ）に示されたように、修正履歴管理部１０５は、共通選択命令に応じて、複数の修正履歴ＤＢ１１６の少なくとも１つを共通修正履歴ＤＢとして選択する共通ＤＢ選択制御手段１４５を有する構成（第１の構成）とすることができる。第１の構成の場合、動的韻律生成部１０２は、共通修正履歴ＤＢに含まれる修正履歴情報を選択的に参照し、韻律修正部１０３は、共通修正履歴ＤＢに含まれる修正履歴情報を選択的に更新することとなる。なお、選択命令として、共通選択命令が用いられている。
【００４５】
上記の構成であれば、目的に応じて複数の修正履歴ＤＢ１１６のうち１つ又は複数の修正履歴ＤＢの修正履歴情報を選択的に動的韻律生成に反映させることができ、かつ、１回の韻律修正によって１つ又は複数の修正履歴ＤＢの修正履歴情報を選択的に更新できる。また、動的韻律生成部１０２において、所望の韻律スタイルを反映させた動的韻律情報を簡便に生成することができる。
【００４６】
また、図４（ｂ）に示されたように、修正履歴管理部１０５は、参照選択命令に応じて複数の修正履歴ＤＢ１１６の少なくとも１つを参照用修正履歴ＤＢとして選択する参照用ＤＢ選択制御手段１５５と、更新選択命令に応じて複数の修正履歴ＤＢ１１６の少なくとも１つを更新用修正履歴ＤＢとして選択をする更新用ＤＢ選択制御手段１６５とを有する構成（第２の構成）とすることができる。第２の構成の場合、動的韻律生成部１０２は、参照用修正履歴ＤＢに含まれる修正履歴情報を選択的に参照し、韻律修正部１０３は、更新用修正履歴ＤＢに含まれる修正履歴情報を選択的に更新することとなる。なお、選択命令として、参照選択命令と更新選択命令が用いられている。
【００４７】
修正履歴管理部１０５が第１の構成の場合には、参照用修正履歴ＤＢと更新用修正履歴ＤＢとに対して共通の制御がなされるが、第２の構成の場合は、参照用修正履歴ＤＢと更新用修正履歴ＤＢとに対して独立した制御をすることができる。これにより、修正履歴情報を柔軟かつ効果的に更新させることができる。つまり、修正履歴の学習を効率良く行うことができる。
【００４８】
また、修正履歴管理部１０５は、共通選択命令に応じて、複数の修正履歴ＤＢ１１６の少なくとも１つを共通修正履歴ＤＢとして選択する共通ＤＢ選択制御手段１４５を有し、かつ、選択変更命令に応じて、共通ＤＢ選択制御手段１４５で選択された修正履歴ＤＢのいずれかに対する選択の解除及び／又は共通ＤＢ選択手段１４５で選択された修正履歴ＤＢ以外の修正履歴ＤＢの追加選択を行う選択ＤＢ変更手段１７５を有する構成とすることができる。
【００４９】
選択ＤＢ変更手段１７５は、動的韻律生成部１０２によって参照される修正履歴ＤＢと韻律修正部１０３における修正に基づいて更新される修正履歴ＤＢの双方に対して共通の変更又は双方に独立な変更を加えてもよい。更に、動的韻律生成部１０２によって参照される修正履歴ＤＢ及び韻律修正部１０３における修正に基づいて更新される修正履歴ＤＢの一方のみに対して変更を加えてもよい。
【００５０】
図４（ｃ）には、修正履歴管理部１０５が、韻律修正部１０３における修正に基づいて更新される修正履歴ＤＢに対して変更を加える選択ＤＢ変更手段１７５を有する構成（第３の構成）が示されている。第３の構成の場合、動的韻律生成部１０２は、共通ＤＢ選択制御手段１４５により選択された少なくとも１つの修正履歴ＤＢ（参照用修正履歴ＤＢ）に含まれる修正履歴情報を選択的に参照し、韻律修正部１０３は、共通ＤＢ選択制御手段１４５及び選択ＤＢ変更手段１７５で決定された少なくとも１つの修正履歴ＤＢ（更新用修正履歴ＤＢ）に含まれる修正履歴情報を選択的に更新することとなる。なお、選択命令として、共通選択命令及び選択変更命令が用いられている。
【００５１】
第３の構成であれば、更新用修正履歴ＤＢを、参照用修正履歴ＤＢから独立させ、かつ、修正履歴保持手段１０６における複数の韻律履歴ＤＢ１１６から任意に選択することができる。韻律修正部１０３で動的韻律情報に対して修正を行う場合、通常、動的韻律情報の生成において参照された韻律スタイル（修正履歴ＤＢ）に対しては修正を反映させるため、上記の第２の構成に比べて、修正履歴情報を簡便、柔軟、かつ、効果的に更新させることができる。
【００５２】
また、修正履歴管理部１０５は、共通選択命令に応じて、複数の修正履歴データベース１１６の少なくとも１つを共通修正履歴ＤＢとして選択する共通ＤＢ選択手段１４５を有し、かつ、選択変更命令に応じて、共通修正履歴ＤＢで構成された更新用修正履歴ＤＢに新たな修正履歴ＤＢの追加のみを行う選択ＤＢ変更手段１７５を有する構成（第４の構成）としてもよい。第４の構成の場合、動的韻律生成部１０２は、共通ＤＢ選択制御手段１４５により選択された共通修正履歴ＤＢで構成される参照用修正履歴ＤＢに含まれる修正履歴情報を選択的に参照し、韻律修正部１０３は、共通修正履歴ＤＢと選択ＤＢ変更手段で追加された少なくとも１つの修正履歴ＤＢとで構成される更新用修正履歴ＤＢに含まれる修正履歴情報を選択的に更新することとなる。なお、選択命令として、共通選択命令及び選択変更命令が用いられている。
【００５３】
韻律修正部１０３で動的韻律情報に対して修正を行う場合、通常、動的韻律情報の生成において参照された韻律スタイル（修正履歴ＤＢ）に対しては修正を反映させるため、更新用修正履歴ＤＢには、参照用修正履歴ＤＢを構成するすべての修正履歴ＤＢが含まれていることがより好ましい。したがって、第４の構成であれば、第３の構成に比べて構成が簡素であるにも関わらず、第３の構成と同等の効果を発現する。
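The relation between the reference DBs and the update DBs in the fourth configuration can be sketched in a few lines of Python. The function names and the use of sets keyed by style name are assumptions made for illustration. The point shown is that the update set is the common set plus any added DBs, so every referenced DB is always updated as well.

```python
def select_databases(common: set, added_for_update: set) -> tuple:
    """Return (reference_dbs, update_dbs) for the fourth configuration.

    The reference set is exactly the commonly selected set; the update set
    can only grow, so it always contains every referenced database.
    """
    reference_dbs = set(common)
    update_dbs = set(common) | set(added_for_update)
    return reference_dbs, update_dbs


def record_correction(databases: dict, update_dbs: set, record: dict) -> None:
    """Append the same correction record to every database selected for update."""
    for name in sorted(update_dbs):
        databases.setdefault(name, []).append(record)
```

Because the selection-change step can only add DBs, no extra check is needed to guarantee that a correction reaches the style that produced the prosody being corrected.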
【００５４】
複数の修正履歴ＤＢ１１６のいずれを選択するかは、装置又はアプリケーションの立ち上げごとに決定してもよいし、テキストデータごとに決定してもよい。更に、アプリケーション上で、修正履歴ＤＢを韻律修正部１０３における動的韻律情報の修正ごとに適宜決定してもよい。更に、修正履歴管理部１０５のＤＢ選択制御手段１３５が上記第２の構成、上記第３の構成、上記第４の構成などである場合（少なくとも２種の手段を有する構成の場合）には、複数の決定方法を併用することもできる。
【００５５】
テキストデータごとに修正履歴ＤＢを選択する場合には、アプリケーション上で作者が選択してもよいし、テキストデータと共にテキストファイルに含まれる制御コード（スタイル選択情報）等に応じて選択してもよい。前者の場合、選択命令を入力する選択命令入力部を、後者の場合、テキストファイルを解析して選択命令を生成する選択命令生成部を更に含むテキスト音声合成装置とする。
【００５６】
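Finally, the synthesized speech generation part 104 of the text-to-speech synthesizer is described. The synthesized speech generation part 104 generates synthesized speech by selecting, modifying, and concatenating speech segments based on the linguistic information and the finalized prosodic information (synthesized speech generation step 204). Any known conventional technique may be used to generate synthesized speech from the linguistic information and the finalized prosodic information.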
【００５７】
（実施の形態１）
本実施の形態１においては、第１の修正履歴参照方式を適用した動的韻律生成部を有するテキスト音声合成装置について説明する。必要に応じて、図５及び図６を参照する。図５は、本実施の形態１に係るテキスト音声合成装置の構成を概念的に示すブロック図である。図６は、本実施の形態１に係るテキスト音声合成装置における特徴部分を詳細に説明するための説明図である。
【００５８】
図５に示されたテキスト音声合成装置は、テキスト保持部１０７と、言語解析部１０１と、静的韻律生成手段１１２、フィルタリング手段１２２及びフィルタ制御手段１３２を有する動的韻律生成部１０２と、韻律修正部１０３と、表示部１１７と、修正命令入力部１２７と、単一の修正履歴ＤＢを有する修正履歴保持部１０６と、修正履歴抽出手段１１５及び修正履歴更新手段１２５を有する修正履歴管理部１０５と、素片保持手段１３４、素片選択手段１１４及び音声合成手段１２４とを有する合成音声作成部１０４と、音声出力部１３７とを含む構成である。
【００５９】
図５に示されたテキスト音声合成装置の動作について説明する。言語解析部１０１において、テキスト保持部１０７に保持されたテキストデータに対して、所定の単位で言語解析が行われ、その結果、言語情報が抽出される（言語処理ステップ）。抽出された言語情報は、動的韻律生成部１０２の静的韻律生成手段１１２に送られる。
【００６０】
静的韻律生成手段１１２において、送られてきた言語情報に基づき、静的な韻律生成規則にしたがって静的韻律情報が生成される（静的韻律生成ステップ）。生成された静的韻律情報は、言語情報と共にフィルタリング手段１２２に送られる。
【００６１】
フィルタリング手段１２２において、送られてきた言語情報と静的韻律情報から修正条件が生成される。生成された修正条件は、フィルタリング処理の処理内容を決定するために、フィルタ制御手段１３２に送られる。
【００６２】
フィルタ制御手段１３２において、送られてきた修正条件に合致する韻律修正データが修正履歴保持部１０６の韻律修正ＤＢに含まれているか否かを確認するために、その修正条件が修正履歴管理部１０５の修正履歴抽出手段１１５に送られる。ここに、修正履歴保持部１０６が１種の韻律修正ＤＢのみを有するため、修正履歴ＤＢに含まれる情報全体が修正履歴情報である。
【００６３】
修正履歴抽出手段１１５において、修正履歴保持部１０６の修正履歴ＤＢが検索され、その結果、送られてきた修正条件に合致する修正履歴データが存在すれば、その修正履歴データが抽出される。抽出された韻律修正データは、フィルタ制御手段１３２に送られる。ここに、修正条件に合致する修正履歴データが複数存在していれば、それらすべての修正履歴データがフィルタ制御手段１３２に送られる。また、修正条件に合致する修正履歴データが存在していなければ、その旨をフィルタ制御手段１３２に通知する。
【００６４】
フィルタ制御手段１３２において、送られてきた修正履歴データから静的韻律情報における各韻律パラメータの修正量を決定し、フィルタリング手段１２２に通知する。ここに、修正履歴抽出手段１１５から複数の修正履歴データが送られてきた場合は、それらすべてを反映するように各韻律パラメータの修正量を決定する。また、修正履歴データが存在しない旨の通知があれば、修正を行わない旨、又は、各韻律パラメータの修正量がゼロである旨の通知をフィルタリング手段１２２に通知する。
【００６５】
フィルタリング手段１２２において、フィルタ制御手段１３２で決定された各韻律パラメータの修正量に基づき、静的韻律情報を修正して、動的韻律情報を生成する。生成された動的韻律情報は、言語情報と共に、韻律修正部１０３に送られる。ここに、フィルタリング手段１２２における修正は、修正履歴情報に応じて自動的に行われること、及び、動的韻律情報は、修正履歴情報の反映された韻律情報であることに注意を要する。
【００６６】
韻律修正部１０３は、送られてきた動的韻律情報を表示部１１７にグラフィカルな画像として表示させる。表示部１１７に表示された動的韻律情報に対して、修正命令入力部１２７からの外部修正命令にしたがって、所望の追加修正が行われる。ここに、韻律修正部１０３における修正は、従来技術の如く手作業によって行われることに注意を要する。
【００６７】
韻律修正部１０３における追加修正に対して、修正条件及び修正内容をセットにした修正履歴データが生成される。生成された修正履歴データは、修正履歴保持部１０６における修正履歴ＤＢを更新するために、修正履歴管理部１０５の修正履歴更新手段１２５に送られる。また、所望の修正が加えられた動的韻律情報は、確定韻律情報として、言語情報と共に合成音声生成部１０４の素片選択手段１１４に送られる。
【００６８】
修正履歴管理部１０５の修正履歴更新手段１２５は、送られてきた修正履歴データに基づき、修正履歴保持部１０６の修正履歴ＤＢを更新する。ここに、フィルタリング手段１２２での次回からのフィルタリング処理において、更新された修正履歴ＤＢが参照されることに注意を要する。また、同一ファイル内に限らず、第１のファイルに対する修正は、第１のファイルと異なる第２のファイルに含まれるテキストデータをテキスト音声合成する場合のフィルタリング処理にも反映されることに注意を要する。
【００６９】
合成音声生成部１０４の素片選択手段１１４において、韻律修正部１０３から送られてきた言語情報に基づき素片保持部１３４に保持された素片群から最適な素片が選択される。選択された素片は、音声合成手段１２４に送られる。
【００７０】
素片保持部１３４に保持された素片は、単素片であってもよいし、合成素片であってもよい。合成素片としては、例えば、ＣＶ単位（Ｃ：子音、Ｖ：母音）の素片、ＶＣ単位の素片、ＣＶＣ単位の素片及びＶＣＶ単位の素片が挙げられる。素片群は、単素片のみからなる構成、１種類の合成素片のみからなる構成、複数種類の合成素片からなる構成、及び、単素片及び１又は複数の合成素片からなる構成であってもよい。
【００７１】
合成音声生成部１０４の音声合成手段１２４において、送られてきた素片が確定韻律情報に基づき変形されかつ接続されることにより、合成音声が生成される。生成された合成音声は、音声出力部１３７において出力される。
【００７２】
以上の処理を経ると、作者の過去の修正が自動的に反映されて好みに近い合成音声を生成する動的韻律情報に対して、手動の追加修正が行われるため、基本の韻律情報（静的韻律情報）に対して手動の修正を行う従来のテキスト音声合成又は韻律生成規則を手動で修正する従来のテキスト音声合成に比べて、任意のテキストデータを好みに合った合成音声として簡便に出力させることができる。
【００７３】
ここで、図６に示された具体例に基づいて、フィルタリング手段１２２におけるフィルタリング処理を詳細に説明する。自動的に修正される韻律情報としては継続時間長、パワーパターン、ピッチパターンなどが考えられるが、この具体例では、生成される韻律情報に含まれる韻律パラメータがピッチパターンである場合について説明する。なお、以下の説明において、各部材及び各手段については、図５に示された参照符号を付す。
【００７４】
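First, the prosody correction made in the past is described. In the prosody correction part 103, either a correction 305a made with a mouse on a graphically displayed pitch pattern, or a correction 305b made by editing a pitch pattern displayed as text, was performed. The pitch pattern before correction means that the pitch of the first phoneme symbol /a/ is 400 Hz and the pitch of the last phoneme symbol /a/ is 300 Hz. Because pitch is not defined for consonants, no value is given for /k/. In the past prosody correction, the pitch of the last phoneme symbol /a/ was changed from 300 Hz to 200 Hz.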
【００７５】
上記のいずれの方法で韻律修正を行っても、同一の修正履歴データ３０６が生成される。生成された修正履歴データ３０６は、修正履歴ＤＢ１１６に格納される。
【００７６】
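The correction history data 306 registered in the correction history DB 116 contains, as its correction condition, that the position in the sentence is the sentence end, that the phoneme symbol sequence (target phoneme symbol, preceding phoneme symbol, following phoneme symbol) is (/a/, /k/, /-/), that the mora count is 2, and that the accent type is type 0. Its correction content contains a correction process meaning that the pitch of the target phoneme /a/ is lowered by 100 Hz.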
【００７７】
次に、過去の韻律修正後における動的韻律情報の生成について説明する。静的韻律生成手段１１２により生成された静的韻律情報３０１として、音素記号列（／ａ／、／ｋ／、／ａ／）に対応するピッチパターン（４００Ｈｚ、−、３００Ｈｚ）がフィルタリング手段１２２に与えられた場合、修正条件３０２が生成される。
【００７８】
フィルタ制御手段１３２においては、修正条件３０２に基づいて修正履歴ＤＢ１１６を参照し、修正条件の合致する修正履歴データ３０６を得る。修正履歴データ３０６に修正内容（−１００Ｈｚ：１００Ｈｚ下げる）が存在するため、「最後の音素記号／ａ／のピッチを１００Ｈｚ下げる」との修正内容３０３をフィルタリング手段１２２に送る。
【００７９】
フィルタリング手段１２２においては、送られてきた修正内容３０３に基づき、最後の音素記号／ａ／に対するピッチパターンを１００Ｈｚ下げる。つまり、ピッチパターンを（４００Ｈｚ、−、３００Ｈｚ）から（４００Ｈｚ、−、２００Ｈｚ）に修正する。修正されたピッチパターンは、動的韻律情報３０４として韻律修正部１０３に送られる。
【００８０】
これにより、静的韻律生成手段１１２において生成された修正条件に合致する修正履歴データが既に修正履歴情報に含まれている場合には、過去に行った修正と同じ修正が自動的に施された動的韻律情報を生成できる。
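The filtering walk-through above (static pitch pattern (400 Hz, -, 300 Hz) for /a/, /k/, /a/ becoming (400 Hz, -, 200 Hz)) can be reproduced with a short Python sketch. The dictionary keys, the function names, and the use of `None` for the undefined consonant pitch are assumptions for illustration; only the numeric values and the condition elements come from the example.

```python
# Hypothetical encoding of correction history data 306.
HISTORY_DB = [
    {
        "condition": {"position": "sentence_end", "target": "a", "preceding": "k",
                      "following": "-", "mora_count": 2, "accent_type": 0},
        "delta_hz": -100.0,
    }
]


def build_conditions(phonemes, position, mora_count, accent_type):
    """Generate one correction condition per phoneme from the linguistic information."""
    conditions = []
    for i, phoneme in enumerate(phonemes):
        conditions.append({
            "position": position,
            "target": phoneme,
            "preceding": phonemes[i - 1] if i > 0 else "-",
            "following": phonemes[i + 1] if i + 1 < len(phonemes) else "-",
            "mora_count": mora_count,
            "accent_type": accent_type,
        })
    return conditions


def apply_filter(phonemes, static_pitch, history, position, mora_count, accent_type):
    """Turn static prosody into dynamic prosody by applying matching corrections."""
    dynamic = list(static_pitch)
    conditions = build_conditions(phonemes, position, mora_count, accent_type)
    for i, condition in enumerate(conditions):
        if dynamic[i] is None:          # consonant: no pitch is defined
            continue
        for record in history:
            if record["condition"] == condition:
                dynamic[i] += record["delta_hz"]
    return dynamic


result = apply_filter(["a", "k", "a"], [400.0, None, 300.0], HISTORY_DB,
                      position="sentence_end", mora_count=2, accent_type=0)
print(result)  # [400.0, None, 200.0]
```

The first /a/ is left unchanged because its preceding and following phonemes do not match the stored condition, which is why only the final vowel drops by 100 Hz.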
【００８１】
（実施の形態２）
本実施の形態２においては、第１の修正履歴参照方式を適用した動的韻律生成部を有し、かつ、韻律スタイルの互いに異なる複数の韻律ＤＢを有する修正履歴保持部を有するテキスト音声合成装置について説明する。なお、必要に応じて図７及び図８を参照する。図７は、本実施の形態２に係るテキスト音声合成装置の構成を概念的に示すブロック図である。図８は、本実施の形態２に係るテキスト音声合成装置の特徴部分を詳細に説明するための説明図である。
【００８２】
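The text-to-speech synthesizer shown in FIG. 7 includes: a text holding part 107; a language analysis part 101; a dynamic prosody generation part 102 having static prosody generation means 112, filtering means 122, and filter control means 132; a prosody correction part 103; a display part 117; a correction command input part 127; a correction history holding part 106 having a plurality of types of correction history DBs 116; a correction history management part 105 having correction history extraction means 115, correction history update means 125, and DB selection control means 135; a selection command input part 147; a synthesized speech generation part 104 having segment holding means 134, segment selection means 114, and speech synthesis means 124; and a speech output part 137. Because the correction history holding part 106 has a plurality of correction history DBs 116, this configuration can automatically apply prosody corrections that selectively reflect a desired prosodic style chosen from a plurality of prosodic styles.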
【００８３】
図７に示されたテキスト音声合成装置の動作について説明する。なお、図７に示されたテキスト音声合成装置における動作は、修正履歴管理部における修正履歴情報の更新及び参照の方法が異なる以外、上記実施の形態１のテキスト音声合成装置と基本的に同様であるので、共通部分についての説明は省略する。
【００８４】
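First, when the correction history information is updated, the correction history update means 125 responds to new correction history data from the prosody correction part 103 by updating the correction history information in at least one correction history DB selected according to a selection command from the selection command input part 147.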
【００８５】
次に、修正履歴情報の参照においては、動的韻律生成部１０２（フィルタ制御手段１３２）からの要求に応じて修正履歴抽出手段１１５は、選択命令入力部１４７からの選択命令に応じて選択された少なくとも１つの修正履歴ＤＢに対して検索を行い、修正条件に合致する修正履歴データを抽出する。
【００８６】
ここで、図８に示された具体例に基づいて、修正履歴管理部１０５における修正履歴情報の更新及び参照の方法について詳細に説明する。自動的に修正される韻律情報としては継続時間長、パワーパターン、ピッチパターンなどが考えられるが、この具体例では、生成される韻律情報に含まれる韻律パラメータがピッチパターンである場合について説明する。なお、以下の説明において、各部材及び各手段については、図７に示された参照符号を付す。
【００８７】
まず、過去において行われた韻律修正について説明する。韻律修正部１０３において韻律修正３１４が行われ、修正履歴データ３１５が生成された。生成された修正履歴データ３１５には、音素記号列（／ａ／、／ｋ／、／ａ／）に対応するピッチパターン（４００Ｈｚ、−、３００Ｈｚ）をピッチパターン（４００Ｈｚ、−、２００Ｈｚ）に修正する修正内容が含まれている。この修正履歴データ３１５は、選択命令によって、修正履歴ＤＢ−Ａ１２６に格納された。なお、修正条件の図示は省略した。
【００８８】
更に、韻律修正部１０３において韻律修正３１６が行われ、修正履歴データ３１７が生成された。生成された修正履歴データ３１７には、音素記号列（／ａ／、／ｋ／、／ａ／）に対応するピッチパターン（４００Ｈｚ、−、３００Ｈｚ）をピッチパターン（４００Ｈｚ、−、３５０Ｈｚ）に修正する修正内容が含まれている。この修正履歴データ３１７は、選択命令によって、修正履歴ＤＢ−Ｂ１３６に格納された。なお、修正履歴データ３１７における修正条件の図示は省略したが、上記修正履歴データ３１５における修正条件と同一であるとする。
【００８９】
次に、過去の韻律修正後における動的韻律情報の生成について説明する。静的韻律生成手段１１２により生成された静的韻律情報３１１として、音素記号列（／ａ／、／ｋ／、／ａ／）に対応するピッチパターン３１１（４００Ｈｚ、−、３００Ｈｚ）がフィルタリング手段１２２に与えられた場合、修正条件を生成し、フィルタ制御手段１３２を介して修正履歴管理部１０５の修正履歴抽出手段１１５に生成された修正条件を送る。なお、生成された修正条件は、上記の修正履歴データ３１５における修正条件及び修正履歴データ３１７における修正条件と同一であるとする。
【００９０】
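In the DB selection control means 135, the selection 312 of the correction history DB-B 136 has already been made by a previously input selection command, so the correction history extraction means 115 searches only the correction history data held in the correction history DB-B 136 and extracts the correction history data 317 that matches the correction condition. Note that although correction history data matching the correction condition also exists in the correction history DB-A 126, the correction history data 315 in the correction history DB-A 126 is not extracted. The extracted correction history data 317 is sent to the filter control means 132 of the dynamic prosody generation part.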
【００９１】
フィルタ制御手段１３２では、修正履歴データ３１７には、修正条件に対応付けられた修正処理（＋５０Ｈｚ：５０Ｈｚ上げる）が含まれているため、「最後の音素記号／ａ／のピッチを５０Ｈｚ上げる」との修正内容を決定する。決定された修正内容は、フィルタリング手段１２２に送られる。
【００９２】
フィルタリング手段１２２においては、送られてきた修正内容に基づき、最後の音素記号／ａ／に対応するピッチパターンを５０Ｈｚ上げる。つまり、ピッチパターン（静的韻律情報）３１１を（４００Ｈｚ、−、３００Ｈｚ）から（４００Ｈｚ、−、３５０Ｈｚ）に修正する。修正されたピッチパターン（動的韻律情報）３１３は、韻律修正部１０３に送られる。
【００９３】
韻律修正部１０３において、送られてきたピッチパターン３１３に更なる修正を行う場合には、改めて選択命令によって修正履歴ＤＢの選択を変更しない限り、修正履歴ＤＢ−Ｂ１３６が選択されている。
【００９４】
これにより、複数の韻律スタイルのうち所望の韻律スタイルが反映された動的韻律情報を生成することができる。また、複数の韻律スタイルのうち所望の韻律スタイルに対応する修正履歴情報のみを更新することができる。
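The style-selection example above can be checked numerically with a small Python sketch. The table `STYLE_DBS`, the tuple used as a condition key, and the function name are assumptions for illustration; the offsets of -100 Hz (DB-A, history data 315) and +50 Hz (DB-B, history data 317) are the values from the example.

```python
# Condition key: (target phoneme, preceding phoneme, following phoneme).
STYLE_DBS = {
    "A": {("a", "k", "-"): -100.0},   # history data 315: 300 Hz -> 200 Hz
    "B": {("a", "k", "-"): +50.0},    # history data 317: 300 Hz -> 350 Hz
}


def final_vowel_pitch(static_hz, condition_key, selected):
    """Apply corrections only from the selected style databases."""
    pitch = static_hz
    for name in selected:
        pitch += STYLE_DBS[name].get(condition_key, 0.0)
    return pitch


print(final_vowel_pitch(300.0, ("a", "k", "-"), ["B"]))  # 350.0
```

Although DB-A also holds a record for the same condition, it contributes nothing while only DB-B is selected.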
【００９５】
ここで、修正履歴ＤＢを選択する方法について説明する。選択命令入力部１４７においては、アプリケーション上の韻律スタイル選択ボタン（修正履歴ＤＢ選択ボタン）の押下などによって、外部から入力される選択命令により、修正履歴ＤＢ１１６を切り替えることとなる。
【００９６】
図７に示されたテキスト音声合成装置は、修正履歴ＤＢ１１６を選択するための選択命令を入力する選択命令入力部１４７を備える構成であるが、他の構成によって、ＤＢ選択制御手段１３５に選択命令を入力することもできる。図９は、図７における選択命令入力部１４７に代えて、選択命令生成部を備えたテキスト音声合成装置の構成を概念的に示すブロック図である。
【００９７】
図９に示されたように、テキスト音声合成装置は、テキスト保持手段１０７に保持されたテキストデータから制御コードを解析して韻律スタイル情報を抽出し、抽出された韻律スタイル情報に基づいて選択命令を生成する選択命令生成部１０８を備えた構成である。
【００９８】
選択命令生成部１０８は、テキストデータに韻律スタイル情報を１つだけ含み、テキストデータごとに韻律スタイルを決定する手段であっても、テキストデータに複数の韻律スタイル情報を含み、テキストデータの断片ごとに韻律スタイルを決定する手段であってもよい。例えば、解析された制御コードに、同一ファイル内の文章１及び文章２に対して、それぞれ、韻律スタイルＡ及び韻律スタイルＢを適用することを記載した内容を含む場合、動的韻律生成部１０２において、文章１に対して韻律スタイルＡでピッチパターンを生成させ、文章２に対して韻律スタイルＢでピッチパターンを生成させることができる。
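The patent does not specify the syntax of the control codes, so the following Python sketch invents a tag of the form `[style:NAME]` purely for illustration. Under that assumption, it shows how a selection command generation part could split a text file into fragments and attach a prosodic style to each one, with a tag remaining in force until the next tag appears.

```python
import re

# Hypothetical control-code syntax; the real format is not given in the source.
STYLE_TAG = re.compile(r"\[style:(\w+)\]")


def split_by_style(text, default_style):
    """Return (style, fragment) pairs; a tag applies until the next tag."""
    segments = []
    style = default_style
    pos = 0
    for match in STYLE_TAG.finditer(text):
        fragment = text[pos:match.start()].strip()
        if fragment:
            segments.append((style, fragment))
        style = match.group(1)
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments


print(split_by_style("[style:A]Sentence one. [style:B]Sentence two.", "A"))
# [('A', 'Sentence one.'), ('B', 'Sentence two.')]
```

A file with a single tag at the top corresponds to the per-text-data case, and a file with several tags corresponds to the per-fragment case.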
【００９９】
（実施の形態３）
本実施の形態３においては、第２の修正履歴参照方式を適用した動的韻律生成部を有するテキスト音声合成装置について説明する。必要に応じて、図１０及び１１を参照する。図１０は、本実施の形態３に係るテキスト音声合成装置の構成を概念的に示すブロック図である。図１１は、本実施の形態３に係るテキスト音声合成装置における特徴部分を詳細に説明するための説明図である。
【０１００】
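The text-to-speech synthesizer shown in FIG. 10 includes: a text holding part 107; a language analysis part 101; a dynamic prosody generation part 102 having dynamic prosody generation means 142 and prosody generation rule control means 152; a prosody correction part 103; a display part 117; a correction command input part 127; a correction history holding part 106 having a correction history DB; a correction history management part 105 having correction history extraction means 115 and correction history update means 125; a synthesized speech generation part 104 having segment holding means 134, segment selection means 114, and speech synthesis means 124; and a speech output part 137.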
【０１０１】
図１０に示されたテキスト音声合成装置の動作について説明する。なお、動的韻律生成部の構成以外については、上記の実施の形態１と同様であるため、その説明を省略する。
【０１０２】
動的韻律生成部１０２の動的韻律生成手段１４２においては、言語解析部１０１から送られてきた言語情報に基づき、韻律情報の生成に必要な基本の韻律生成パラメータが決定され、かつ、修正条件が生成される。生成された修正条件は、動的韻律情報の生成に用いる韻律生成パラメータを確定するために、韻律生成規則制御手段１５２に送られる。
【０１０３】
韻律生成規則制御手段１５２において、送られてきた修正条件は、その修正条件に合致する韻律修正データが修正履歴保持部１０６の修正履歴ＤＢ（修正履歴情報）に含まれているか否かを確認するため、修正履歴管理部１０５の修正履歴抽出手段１１５に送られる。
【０１０４】
修正履歴抽出手段１１５において、送られてきた修正条件に基づいて修正履歴保持部１０６の修正履歴ＤＢを検索し、その結果、その修正条件に合致する修正履歴データが存在すれば、その修正履歴データを抽出する。抽出された修正履歴データは韻律生成規則制御手段１５２に送られる。ここに、修正条件に合致する修正履歴データが複数存在していれば、それらすべての修正履歴データが韻律生成規則制御手段１５２に送られる。また、修正条件に合致する修正履歴データが存在していなければ、その旨を韻律生成規則制御手段１５２に通知する。
【０１０５】
韻律生成規則制御手段１５２において、送られてきた修正履歴データに基づき各韻律生成パラメータの修正を確定し、動的韻律生成手段１４２に通知する。また、修正履歴データが存在しない旨の通知があれば、修正を行わない旨、又は、各韻律パラメータの修正がゼロである旨の通知を動的韻律生成手段１４２に通知する。ここに、修正履歴抽出手段１１５から複数の修正履歴データが送られてきた場合は、それらすべての修正履歴データに含まれる修正内容を反映するように各韻律生成パラメータの修正を確定する。
【０１０６】
動的韻律生成手段１４２において、言語解析部１０１で生成された言語情報に基づき、韻律生成規則制御手段１５２で決定された韻律生成パラメータを用いた韻律生成規則にしたがって動的韻律情報を生成する。生成された動的韻律情報は、言語情報と共に、韻律修正部１０３に送られる。ここに、韻律生成規則制御手段１５２で決定された各韻律生成パラメータに基づき動的な韻律生成規則が決定されていることに注意を要する。また、動的韻律生成手段１４２における韻律生成規則の変更は、修正履歴情報に応じて自動的に行われること、及び、動的韻律情報は、修正履歴情報の反映された韻律情報であることに注意を要する。
【０１０７】
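Here, the method by which the dynamic prosody generation means 142 controls the prosody generation rule is described in detail, based on the specific example shown in FIG. 11. The generated dynamic prosodic information could concern duration, power, pitch pattern, and so on; this example covers the case where the prosodic parameter contained in the generated dynamic prosodic information is the pitch pattern. In the following description, each member and each means carries the reference numeral shown in FIG. 10.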
【０１０８】
動的韻律生成手段１４２におけるピッチパターンを生成するモデルとしては、藤崎モデルを用いる。藤崎モデルとは、ピッチパターンを呼気段落の減衰を表現するフレーズ成分とアクセントごとのピッチ変動を表すアクセント成分との重畳で表現するものである。また、藤崎モデルでは、フレーズの減衰の傾きなど様々なパラメータを調整可能であるが、本具体例ではフレーズの大きさを表すフレーズ指令とアクセントの大きさを表すアクセント指令の２つのパラメータによりピッチパターンの生成を行う。
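For reference, the Fujisaki model is commonly written in the following form. The symbols below (baseline frequency F_b, time constants alpha and beta, ceiling gamma, and the command timings) follow the usual textbook notation and are not taken from the patent; in the example here, the phrase command "h" and the accent command "a" would play the roles of the magnitudes A_p and A_a.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j} A_{a,j}\, \bigl[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \bigr]

G_p(t) = \begin{cases} \alpha^{2}\, t\, e^{-\alpha t}, & t \ge 0 \\ 0, & t < 0 \end{cases}
\qquad
G_a(t) = \begin{cases} \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\ 0, & t < 0 \end{cases}
```

The first sum is the slowly decaying phrase component and the second sum is the accent component, so changing only an accent magnitude reshapes the local pitch movement while leaving the overall declination untouched.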
【０１０９】
まず、過去において行われた韻律修正について説明する。韻律修正部１０３において、グラフィカルに表示されたピッチパターンに対してマウスを用いた修正３２５が行われた。ここに、フレーズ指令「ｈ」及びアクセント指令「ａ」を用いて生成された修正前のピッチパターンは点線で、修正後のピッチパターンは実線で記されている。
【０１１０】
この修正に伴い、修正後のピッチパターンを生成するために必要な韻律生成パラメータの算出３２６が行われる。この具体例においては、フレーズ指令に対しては「ｈ」のままで変化はなく、アクセント指令に対しては「ａ」が「ａ’」に変化する。
【０１１１】
修正後のピッチパターンを生成するために必要な韻律生成パラメータを算出する方法としては、公知の方法を用いることができる。例えば、ピッチパターンを生成する韻律生成パラメータは、修正後のピッチパターンをターゲットとした最小自乗法などを用いて、推定したい韻律生成パラメータを未知数とした線形方程式を解くことによって得られる。
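The least-squares estimation mentioned above can be sketched in Python for the two-parameter case of this example. The sketch assumes the standard Fujisaki formulation with one phrase command and one accent command, and assumes that the command timings, the baseline frequency, and the time constants are known and fixed; under those assumptions the log pitch is linear in the two magnitudes, so they follow from a 2x2 system of normal equations. All names and the constants `ALPHA`, `BETA`, `GAMMA` (commonly quoted values) are illustrative assumptions.

```python
import math

ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9   # commonly quoted values, assumed here


def phrase_response(t):
    """Impulse response of the phrase control mechanism."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0


def accent_response(t):
    """Step response of the accent control mechanism, clipped at GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)


def basis(t, t0, t1, t2):
    """Phrase and accent basis values at time t for fixed command timings."""
    return phrase_response(t - t0), accent_response(t - t1) - accent_response(t - t2)


def log_f0(t, fb, ap, aa, t0, t1, t2):
    """Log pitch generated by phrase magnitude ap and accent magnitude aa."""
    gp, ga = basis(t, t0, t1, t2)
    return math.log(fb) + ap * gp + aa * ga


def fit_commands(times, target_log_f0, fb, t0, t1, t2):
    """Least-squares estimate of (ap, aa) via the 2x2 normal equations."""
    spp = spa = saa = spy = say = 0.0
    for t, y in zip(times, target_log_f0):
        gp, ga = basis(t, t0, t1, t2)
        residual = y - math.log(fb)
        spp += gp * gp
        spa += gp * ga
        saa += ga * ga
        spy += gp * residual
        say += ga * residual
    det = spp * saa - spa * spa
    ap = (spy * saa - say * spa) / det
    aa = (say * spp - spy * spa) / det
    return ap, aa
```

Fitting a target pattern that was generated with an unchanged phrase magnitude and a changed accent magnitude recovers exactly that change, which mirrors the case where "h" stays the same and "a" becomes "a'".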
【０１１２】
修正後のピッチパターンを生成するために必要な韻律生成パラメータが算出されると、引き続き、修正履歴データ３２７が生成される。生成された修正履歴データ３２７は、修正履歴ＤＢ１１６に格納される。
【０１１３】
生成された修正履歴データ３２７は、修正条件に、文中位置が文末であり、音素記号列（修正音素、先行音素、後続音素）が（／ａ／、／ｋ／、／−／）であり、モーラ数が３であり、かつ、アクセント型が０型であることを含んでおり、修正内容には、アクセント指令を「ａ」から「ａ’」に修正することを意味する処理内容を含んでいる。
【０１１４】
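Next, the generation of dynamic prosodic information after the past prosody correction is described. In the dynamic prosody generation means 142, the phrase command "h" and the accent command "a" are generated as the basic prosody generation parameters 321, and a correction condition 322 is generated as well. The generated correction condition 322 is sent to the prosody generation rule control means 152. The generated correction condition contains that the position in the sentence is the sentence end, that the phoneme symbol sequence (target phoneme symbol, preceding phoneme symbol, following phoneme symbol) is (/a/, /k/, /-/), that the mora count is 3, and that the accent type is type 0.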
【０１１５】
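The prosody generation rule control means 152 searches the correction history DB 116 based on the received correction condition and, as a result, obtains the correction history data 327 whose correction condition matches. Because the correction history data 327 contains the correction content 323 associated with the received correction condition (change a to a': change the pitch pattern from the dotted-line pattern to the solid-line pattern in the upper-left diagram), the accent command "a" is corrected to "a'" and the corrected prosody generation parameters 324 are fixed. The corrected prosody generation parameters 324 are sent to the dynamic prosody generation means 142.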
【０１１６】
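In the dynamic prosody generation means 142, based on the received corrected prosody generation parameters 324, a prosody generation rule that uses the phrase command "h" and the accent command "a'" as the corrected prosody generation parameters is determined, and dynamic prosodic information is generated from the linguistic information according to the determined rule. The generated dynamic prosodic information is sent to the prosody correction part 103.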
【０１１７】
これにより、動的韻律生成手段１４２において生成された修正条件に合致する修正履歴データが既に修正履歴情報に含まれている場合には、過去に行った修正と同じ修正が自動的に施された動的韻律情報を生成できる。
【０１１８】
上記においては、ピッチパターンを生成するモデルとして、藤崎モデルを用いて説明を行ったが、その他のモデルを用いても同様に、過去に行った修正を自動的に反映させることができる。ピッチパターン以外の制御を行う場合には、例えば、韻律情報における継続時間長やパワーパターンなどを制御する場合は、制御する韻律パラメータごとに適したモデルを利用する必要がある。
【０１１９】
また、上記においては、修正履歴保持手段が単一の修正履歴ＤＢを有する構成について説明したが、上記実施の形態２で説明したように複数の修正履歴ＤＢを有する構成とすることもできる。図１２及び図１３は、第２の履歴参照方式の動的韻律生成部と、韻律スタイルの互いに異なる複数の韻律修正ＤＢを有する修正履歴保持部とを含むテキスト音声合成装置の概念的な構成を示すブロック図である。なお、修正履歴管理部がＤＢ選択制御手段を有し、修正履歴保持手段が複数の修正履歴ＤＢを有すること以外、図１０に示されたテキスト音声合成装置と概ね同一の構成である。
【０１２０】
また、図１２及び図１３に示されたテキスト音声合成装置は、それぞれ、図７及び図９に示されたテキスト音声合成装置における静的韻律生成手段１１２、フィルタリング手段１２２及びフィルタ制御手段１３２を有する動的韻律生成部を、動的韻律生成手段１４２及び韻律生成規則制御手段１５２を有する動的韻律生成部に変更した以外同じ構成である。
【０１２１】
図１２及び図１３に示されたテキスト音声合成装置における複数の修正履歴ＤＢの参照方法及び更新方法は、上記実施の形態２と同等であるため、その説明は省略する。
【０１２２】
（実施の形態４）
本実施の形態４においては、第３の修正履歴参照方式を適用した動的韻律生成部を有するテキスト音声合成装置について説明する。必要に応じて、図１４及び図１５を参照する。図１４は、本実施の形態４に係るテキスト音声合成装置の構成を概念的に示すブロック図である。図１５は、本実施の形態４に係るテキスト音声合成装置における特徴部分を詳細に説明するための説明図である。
【０１２３】
図１４に示されたテキスト音声合成装置は、テキスト保持部１０７と、言語解析部１０１と、韻律パターン選択手段１６２、韻律パターン修正手段１７２、パターン片修正制御手段１８２及び韻律パターン片保持手段１９２を有する動的韻律生成部１０２と、韻律修正部１０３と、表示部１１７と、修正命令入力部１２７と、修正履歴ＤＢを有する修正履歴保持部１０６と、修正履歴抽出手段１１５及び修正履歴更新手段１２５を有する修正履歴管理部１０５と、素片保持手段１３４、素片選択手段１１４及び音声合成手段１２４とを有する合成音声生成部１０４と、音声出力部１３７とを含む構成である。
【０１２４】
図１４に示されたテキスト音声合成装置の動作について説明する。なお、動的韻律生成部の構成以外については、上記の実施の形態１と同様であるため、その説明を省略する。
【０１２５】
韻律パターン片選択手段１６２においては、送られてきた言語情報に基づいて、韻律パターン片保持手段１９２から最適な１つの韻律パターン片が選択韻律パターン片として選択されると共に、修正条件が生成される。選択された選択韻律パターン片及び生成された修正条件は、韻律パターン片修正手段１７２に送られる。
【０１２６】
韻律パターン片修正手段１７２においては、送られてきた修正条件は、選択韻律パターン片に対する修正を決定するために、パターン片修正制御手段１８２に送られる。
【０１２７】
パターン片修正制御手段１８２においては、送られてきた修正条件は、その修正条件に合致する修正履歴データが修正履歴保持部１０６の韻律修正ＤＢに含まれているか否かを確認するために、修正履歴管理部１０５の修正履歴抽出手段１１５に送られる。
【０１２８】
修正履歴抽出手段１１５においては、送られてきた修正条件に基づいて修正履歴保持部１０６の修正履歴ＤＢを検索し、その結果、送られてきた修正条件に合致する修正履歴データが存在すれば、その修正履歴データを抽出する。抽出された韻律修正データは、パターン片修正制御手段１８２に送られる。ここに、修正条件に合致する修正履歴データが複数存在していれば、それらすべての修正履歴データがパターン片修正制御手段１８２に送られる。また、修正条件に合致する修正履歴データが存在していなければ、その旨をパターン片修正制御手段１８２に通知する。
【０１２９】
パターン片修正制御手段１８２においては、送られてきた修正履歴データに基づいて選択韻律パターン片に対する修正内容を決定する。決定された修正内容は韻律パターン片修正手段１７２に送られる。ここに、修正履歴抽出手段１１５から複数の修正履歴データが送られてきた場合は、それらすべての修正履歴データの修正内容を反映するように韻律パターン片の修正内容を決定する。また、修正履歴データが存在しない旨の通知があれば、修正を行わない旨、又は、各韻律パターン片の修正がなしである旨の通知を韻律パターン片修正手段１７２に通知する。
【０１３０】
韻律パターン片修正手段１７２においては、パターン片修正制御手段１８２で決定された修正内容に基づき、選択韻律パターン片を修正して、動的韻律情報を生成する。生成された動的韻律情報は、言語情報と共に、韻律修正部１０３に送られる。ここに、韻律パターン片修正手段１７２における修正は、修正履歴情報に応じて自動的に行われること、及び、動的韻律情報は、修正履歴情報の反映された韻律情報であることに注意を要する。
【０１３１】
ここで、図１５に示された具体例に基づいて、韻律パターン片修正手段１７２における選択韻律パターンの修正方法について詳細に説明する。なお、修正される韻律情報としては継続時間長、パワー、ピッチパターンなどが考えられるが、この具体例では、生成される韻律情報に含まれる韻律パラメータがピッチパターンである場合について説明する。なお、以下の説明において、各部材及び各手段については、図１４に示された参照符号を付す。
【０１３２】
まず、過去において行われた韻律修正について説明する。韻律修正部１０３において、グラフィカルに表示されたピッチパターンに対してマウスを用いた修正３３７が行われた。ここに、修正前のピッチパターン３３４ａ（選択韻律パターン片）は点線で、修正後のピッチパターン３３４ｂ（修正韻律パターン片）は実線で記されている。
【０１３３】
この韻律修正に伴い、修正履歴データ３３８が生成され、生成された修正履歴データ３３８は、修正履歴ＤＢ１１６に格納される。
【０１３４】
生成された修正履歴データ３３８は、修正条件として、文中位置が文末であり、モーラ数が３であり、かつ、アクセント型が２型であることを含んでおり、修正内容として、修正前のピッチパターン３３４ａを修正後のピッチパターン３３４ｂに修正することを含んでいる。ここに、この具体例においては、修正後のピッチパターンそのものを修正内容として保持している。
【０１３５】
次に、過去の韻律修正後における動的韻律情報の生成について説明する。韻律パターン片選択手段１６２において、言語情報に基づいて韻律パターン片ＤＢ３３４から最適な韻律パターン片が選択韻律パターン片として選択されると共に、修正条件３３１が生成される。選択された選択韻律パターン片及び生成された修正条件は、韻律パターン片修正手段１７２に送られる。生成された修正条件３３１には、文中位置が文末であり、モーラ数が３であり、かつ、アクセント型が２型であることが含まれている。
【０１３６】
韻律パターン片修正手段１７２においては、送られてきた修正条件をパターン片修正制御手段１８２に送る。
【０１３７】
パターン片修正制御手段１８２においては、修正条件に基づいて修正履歴ＤＢ１１６を検索し、その結果、修正条件の合致する修正履歴データ３３８を得る。修正条件を満たす修正内容（選択されたピッチパターン３３４ａを保持されたピッチパターン３３４ｂに変更する：ピッチパターンを左上段図の点線から実線のパターンに変更する）が存在するため、その修正内容を韻律パターン修正手段１７２に送る。
【０１３８】
韻律パターン片修正手段１７２においては、送られてきた修正内容に基づき、韻律パターン３３４ａを韻律パターン３３４ｂに修正し、動的韻律情報３３６を生成する。生成された動的韻律情報は韻律修正部１０３に送られる。
【０１３９】
これにより、韻律パターン片選択手段１６２において生成された修正条件に合致する修正履歴データが既に修正履歴情報に含まれている場合には、過去に行った修正と同じ修正が自動的に施された動的韻律情報を生成できる。
【０１４０】
上記においては、変更後のピッチパターンそのものを保持しているが、最終的に変更後のピッチパターンを再現しえる保持方法であれば、どのような方法を用いて変形後のピッチパターンを保持してもよい。例えば、韻律パターン片ＤＢ３３４から選択されたピッチパターンに対する時刻毎の差分値を保存することでも変更後のピッチパターンを再現することができる。
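The alternative of storing per-time differences instead of the corrected pattern itself can be sketched as follows. The function names and the representation of a pitch pattern as a list of values sampled at fixed time points are assumptions for illustration.

```python
def encode_difference(selected, corrected):
    """Store the correction as per-time-point differences from the selected pattern."""
    if len(selected) != len(corrected):
        raise ValueError("patterns must have the same number of time points")
    return [c - s for s, c in zip(selected, corrected)]


def reproduce(selected, difference):
    """Rebuild the corrected pattern from the selected pattern and the stored differences."""
    return [s + d for s, d in zip(selected, difference)]
```

For example, a selected pattern `[200.0, 220.0, 210.0]` corrected to `[200.0, 240.0, 190.0]` is stored as `[0.0, 20.0, -20.0]`, and adding those differences back to the selected pattern reproduces the corrected one.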
【０１４１】
また、上記においては、修正履歴保持手段１０６が単一の修正履歴ＤＢ１１６を有する構成について説明したが、上記実施の形態２で説明したように複数の修正履歴ＤＢを有する構成とすることもできる。図１６及び図１７は、第３の修正履歴参照方式の動的韻律生成部と、韻律スタイルの互いに異なる複数の韻律修正ＤＢを有する修正履歴保持部とを含むテキスト音声合成装置の概念的な構成を示すブロック図である。なお、修正履歴管理部がＤＢ選択制御手段を有し、修正履歴保持手段が複数の修正履歴ＤＢを有すること以外、図１４に示されたテキスト音声合成装置と概ね同一の構成である。
【０１４２】
また、図１６及び図１７に示されたテキスト音声合成装置は、それぞれ、図７及び図９に示されたテキスト音声合成装置における静的韻律生成手段１１２、フィルタリング手段１２２及びフィルタ制御手段１３２を有する動的韻律生成部を、韻律パターン片選択手段１６２、韻律パターン修正手段１７２、パターン片修正制御手段１８２及び韻律パターン片保持手段１９２を有する動的韻律生成部に変更した以外同じ構成である。
【０１４３】
図１６及び図１７に示されたテキスト音声合成装置における複数の修正履歴ＤＢの参照方法及び更新方法は、上記実施の形態２と同等であるため、その説明は省略する。
【０１４４】
【発明の効果】
以上で説明したように、本発明のテキスト音声合成装置では、修正履歴情報を保持する修正履歴保持部と、修正履歴情報を管理する修正履歴管理部と、修正履歴管理部を介して修正履歴情報を参照して、言語情報に基づき動的韻律情報を生成する動的韻律生成部とを含む構成としたことにより、修正履歴情報に登録された過去の修正と同一の修正を改めて行う必要がなく、かつ、過去の修正に応じて修正履歴情報の学習が逐次進むため、エンドユーザの嗜好に合った韻律情報を簡便に生成する装置となる。
【０１４５】
動的韻律生成部において、静的な韻律生成規則にしたがって静的韻律情報を生成し、かつ、生成された静的韻律情報を修正履歴情報に応じて修正することによって、動的韻律情報を生成する第１の修正履歴参照方式を採用する。
【０１４６】
また、動的韻律生成部において、修正履歴情報に応じて韻律生成規則の韻律生成パラメータの設定を修正し、韻律生成パラメータの設定により変化する動的な韻律生成規則にしたがって動的韻律情報を生成する第２の修正履歴参照方式を採用する。
【０１４７】
また、動的韻律生成部において、静的な韻律選択規則にしたがって、言語情報に基づき複数の韻律パターン片から１つの最適な韻律パターン片を選択韻律パターン片として選択し、かつ、選択韻律パターン片を修正履歴情報に応じて修正することにより動的韻律情報を生成する第３の修正履歴参照方式を採用する。
【０１４８】
上記の第１〜第３の修正履歴参照方式のいずれか採用した動的韻律生成部を備えたテキスト音声合成装置であれば、エンドユーザの嗜好に合った韻律情報の簡便な生成を確実に実現できる。
【０１４９】
更に、修正履歴保持部を韻律スタイルの互いに異なる複数の修正履歴ＤＢで構成し、修正履歴管理部において、選択的な修正履歴ＤＢの参照及び選択的な修正履歴ＤＢの更新を行うことにより、複数の韻律スタイルから所望の韻律スタイルを選択的に反映した韻律修正を自動的に施すことができ、かつ、修正履歴情報全体を柔軟かつ効果的に更新（学習）させることができる。
【０１５０】
また、本発明のテキスト音声合成方法では、修正履歴情報を参照する修正履歴参照ステップと、修正履歴参照ステップと連携して、言語情報に基づき動的韻律情報を生成する動的韻律生成ステップと、修正に応じて修正履歴情報を更新する修正履歴更新ステップとを含む構成としたことにより、修正履歴情報に登録された過去の修正と同一の修正を改めて行う必要がなくなり、かつ、過去の修正に応じて修正履歴情報の学習が逐次進むため、エンドユーザの嗜好に合った韻律情報を簡便に生成することができる。
【０１５１】
また、本発明のテキスト音声合成プログラムは、修正履歴情報を参照する修正履歴参照プログラムコードと、修正履歴参照プログラムコードと連携して、言語情報に基づき動的韻律情報を生成する動的韻律生成プログラムコードと、修正に応じて修正履歴情報を更新する修正履歴更新プログラムコードとを含む構成としたことにより、修正履歴情報に登録された過去の修正と同一の修正を改めて行う必要がなくなるため、かつ、過去の韻律修正に応じて修正履歴情報の学習が逐次進むため、エンドユーザの嗜好に合った韻律情報を簡便に生成するテキスト音声合成方法を実現することができる。
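[Brief description of the drawings]
[FIG. 1] FIG. 1 is a block diagram conceptually showing the configuration of a text-to-speech synthesizer according to the present invention.
[FIG. 2] FIG. 2 is a block diagram conceptually showing the configuration of a text-to-speech synthesis method according to the present invention.
[FIG. 3] FIG. 3 is a block diagram conceptually showing the configuration of a text-to-speech synthesizer having a plurality of correction history DBs in the correction history holding unit.
[FIG. 4] FIG. 4 is a block diagram conceptually showing the configuration of the correction history management unit in a text-to-speech synthesizer having a plurality of correction history DBs in the correction history holding unit; FIG. 4(a) is a block diagram showing the configuration of a correction history management unit having common DB selection control means, FIG. 4(b) is a block diagram showing the configuration of a correction history management unit having reference DB selection control means and update DB selection control means, and FIG. 4(c) is a block diagram showing the configuration of a correction history management unit having common DB selection control means and selection DB changing means.
[FIG. 5] FIG. 5 is a block diagram conceptually showing the configuration of a text-to-speech synthesizer provided with a dynamic prosody generation unit of the first correction history reference method.
[FIG. 6] FIG. 6 is an explanatory diagram for explaining the first correction history reference method.
[FIG. 7] FIG. 7 is a block diagram conceptually showing a first configuration example of a text-to-speech synthesizer including a correction history holding unit with a plurality of correction history DBs and a dynamic prosody generation unit of the first correction history reference method.
[FIG. 8] FIG. 8 is an explanatory diagram for explaining the method of selectively updating correction history information and the method of selectively extracting correction history information in the correction history management unit.
[FIG. 9] FIG. 9 is a block diagram conceptually showing a second configuration example of a text-to-speech synthesizer including a correction history holding unit with a plurality of correction history DBs and a dynamic prosody generation unit of the first correction history reference method.
[FIG. 10] FIG. 10 is a block diagram conceptually showing the configuration of a text-to-speech synthesizer provided with a dynamic prosody generation unit of the second correction history reference method.
[FIG. 11] FIG. 11 is an explanatory diagram for explaining the second correction history reference method.
[FIG. 12] FIG. 12 is a block diagram conceptually showing a first configuration example of a text-to-speech synthesizer including a correction history holding unit with a plurality of correction history DBs and a dynamic prosody generation unit of the second correction history reference method.
[FIG. 13] FIG. 13 is a block diagram conceptually showing a second configuration example of a text-to-speech synthesizer including a correction history holding unit with a plurality of correction history DBs and a dynamic prosody generation unit of the second correction history reference method.
[FIG. 14] FIG. 14 is a block diagram conceptually showing the configuration of a text-to-speech synthesizer provided with a dynamic prosody generation unit of the third correction history reference method.
[FIG. 15] FIG. 15 is an explanatory diagram for explaining the third correction history reference method.
[FIG. 16] FIG. 16 is a block diagram conceptually showing a first configuration example of a text-to-speech synthesizer including a correction history holding unit with a plurality of correction history DBs and a dynamic prosody generation unit of the third correction history reference method.
[FIG. 17] FIG. 17 is a block diagram conceptually showing a second configuration example of a text-to-speech synthesizer including a correction history holding unit with a plurality of correction history DBs and a dynamic prosody generation unit of the third correction history reference method.
[FIG. 18] FIG. 18 is a block diagram conceptually showing the configuration of a conventional text-to-speech synthesizer.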
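[Explanation of symbols]
101 Language analysis unit
102 Dynamic prosody generation unit
112 Static prosody generation means
122 Filtering means
132 Filter control means
142 Dynamic prosody generation means
152 Prosody generation rule control means
162 Prosodic pattern piece selection means
172 Prosodic pattern piece correction means
182 Prosodic pattern piece correction control means
192 Prosodic pattern piece holding means
103 Prosody correction unit
104 Synthesized speech generation unit
114 Segment selection means
124 Speech synthesis means
134 Segment holding means
105 Correction history management unit
115 Correction history extraction means
125 Correction history update means
135 DB selection control means
145 Common DB selection control means
155 Reference DB selection control means
165 Update DB selection control means
175 Selection DB changing means
106 Correction history holding unit
116 Correction history DB
107 Text holding unit
117 Display unit
127 Correction command input unit
137 Audio output unit
147 Selection command input unit
108 Selection command generation unit
201 Language analysis step (language analysis program code)
202 Dynamic prosody generation step (dynamic prosody generation program code)
203 Prosody correction step (prosody correction program code)
204 Synthesized speech generation step (synthesized speech generation program code)
215 Correction history extraction step (correction history extraction program code)
225 Correction history update step (correction history update program code)
[0001]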
BACKGROUND OF THE INVENTION
The present invention relates to a text-to-speech synthesizer. More specifically, the present invention relates to a text-to-speech synthesizer having a function of generating prosodic information by automatically reflecting a correction history. The present invention also relates to a text-to-speech synthesis method that includes a method of generating prosodic information by automatically reflecting a correction history, and to a program for realizing the text-to-speech synthesis method.
[0002]
[Prior art]
In conventional speech synthesis, synthesized speech is generated with a predetermined pitch pattern by connecting phonemes constituting speech in accordance with a designated pitch pattern (prosodic information). Conventionally, a pitch pattern for text-to-speech synthesis has been generated according to a predetermined prosody generation rule based on an analysis result of an input text.
[0003]
The generated prosody may include errors introduced during text analysis or prosody generation. In this case, the synthesized speech differs from what the creator of the synthesized speech (hereinafter referred to as the author) intended. To correct such prosodic errors, or to adjust the prosody to the author's preference, an engineer had to directly change the parameters that determine the prosody (pitch pattern, power, etc.) through specialized, experience-based operations such as rewriting the contents of the prosody file in which the prosodic information is described.
[0004]
As a method for easily correcting the pitch pattern, a method is known in which the contents of the prosody file that controls the prosody are displayed graphically on a display and the displayed parameter patterns (pitch pattern, power, etc.) are changed with a mouse (see, for example, Patent Document 1). In this prosody correction method, the corrections made are stored in a storage device, and when speech synthesis is performed again using the same prosody file, both the current pattern and the correction history pattern are displayed. This reduces the effort of correction and, because the tendency of the corrections can be grasped, provides information for subsequent prosody generation rules.
[0005]
The prosody modification procedure described in Patent Document 1 will be briefly described with reference to FIG. The language processing unit 401 extracts language-related information such as reading information, part-of-speech information, and dependency information from the input text. The prosody generation unit 402 then uses the language-related information to generate a prosody file 409 that serves as the basic information for speech synthesis. In the prosody modification unit 403, the graph file display generation unit 412 displays the content of the generated prosody file 409 on the screen, and the prosody file 409 is modified by editing the graph displayed on the screen. The pattern correction unit 413 performs this correction work.
[0006]
The value of each parameter corrected by the pattern correction unit 413 is recorded in the correction history DB (correction history database) 406 as correction history data. Therefore, when the same prosody file 409 is corrected again, the past correction history is also displayed on the screen by the graph file display generation unit 412, so that the equivalent prosody can be easily corrected. Further, since the correction tendency can be obtained by analyzing the correction history, it can be treated as one of information in the subsequent prosody generation.
[0007]
Based on the modified prosody file 419 created in this way, the segment selection unit 414 selects segments for synthesis, and the speech synthesis unit 424 transforms and connects the segments according to the prosody file to create synthesized speech.
[0008]
[Patent Document 1]
JP-A-5-232980
[0009]
[Problems to be solved by the invention]
In the conventional method shown in FIG. 18, the same corrections must be made manually to the generated prosody file 409 every time speech synthesis is performed, and the correction history is held separately for each prosody file 409. Therefore, the past correction history can be referred to when the same sentence is synthesized, but when a different sentence is synthesized, the same corrections must be made again without reference to the past correction history.
[0010]
In addition, the tendency of the corrections can be learned by surveying the correction history of the prosody file 409, and that tendency can be used as one piece of information for subsequently improving the prosody generation rules. However, analyzing an enormous correction history and improving the prosody generation rules themselves, which are built from a very large number of conditions, requires far more specialized knowledge than correcting individual prosody files. In other words, even if the general tendency can be grasped, it is difficult for an end user to easily modify the prosody generation rules themselves according to his or her preference.
[0011]
The present invention has been made in view of the above. Its purpose is to provide a method by which an end user can easily perform speech synthesis suited to his or her preference, without directly changing the prosody generation rules themselves and without correcting a large amount of prosodic information every time speech synthesis is performed, as well as a text-to-speech synthesizer to which the method is applied and a program for realizing the method.
[0012]
[Means for Solving the Problems]
In order to solve the above-mentioned problem, the present invention pairs the correction content with the correction condition whenever an end user corrects prosodic information, and holds the pair as correction history data constituting the correction history information. The configuration has a function of generating prosodic information such that, in any generation of prosodic information performed after the correction, if there is correction history data that matches the correction condition, the correction content associated with that condition is automatically reflected.
[0013]
With the above configuration, it is not necessary to perform the same correction as each past correction registered in the correction history information, and the learning of the correction history information sequentially proceeds according to the correction made by the end user. Prosody information that suits the end user's preference can be easily generated.
[0014]
Specifically, the text-to-speech synthesizer according to the present invention includes: a language analysis unit that performs language analysis on text data and extracts language information; a correction history holding unit that holds correction history information; a correction history management unit that manages the correction history information; a dynamic prosody generation unit that generates dynamic prosody information based on the language information with reference to the correction history information via the correction history management unit; a prosody correction unit that generates definite prosody information by correcting the dynamic prosody information according to an external correction command, and updates the correction history information via the correction history management unit based on the correction made according to the external correction command; and a synthesized speech generation unit that generates synthesized speech based on the language information and the definite prosody information. The correction history management unit has a correction history extraction unit that extracts the correction history information referred to by the dynamic prosody generation unit and a correction history update unit that updates the correction history information held in the correction history holding unit. The correction history holding unit includes a plurality of correction history databases having different prosodic styles, and the correction history management unit further has database selection control means that controls selection among the plurality of correction history databases according to a selection command.
[0015]
The text-to-speech synthesis method according to the present invention includes: a language analysis step of performing language analysis on text data and extracting language information; a correction history holding step of holding correction history information; a correction history management step of managing the correction history information; a dynamic prosody generation step of generating dynamic prosody information based on the language information with reference to the correction history information via the correction history management unit; a prosody correction step of generating definite prosody information by correcting the dynamic prosody information according to an external correction command, and updating the correction history information via the correction history management unit based on the correction made according to the external correction command; and a synthesized speech generation step of generating synthesized speech based on the language information and the definite prosody information. The correction history management step includes a correction history extraction step of extracting the correction history information referred to in the dynamic prosody generation step and a correction history update step of updating the correction history information held in the correction history holding step. The correction history holding step further holds a plurality of correction history databases having different prosodic styles, and the correction history management step further includes a database selection control step of controlling selection among the plurality of correction history databases in response to a selection command.
[0016]
In addition, the text-to-speech synthesis program according to the present invention causes a computer to execute: a language analysis step of performing language analysis on text data and extracting language information; a correction history holding step of holding correction history information; a correction history management step of managing the correction history information; a dynamic prosody generation step of generating dynamic prosody information based on the language information with reference to the correction history information via the correction history management unit; a prosody correction step of generating definite prosody information by correcting the dynamic prosody information according to an external correction command, and updating the correction history information via the correction history management unit based on the correction made according to the external correction command; and a synthesized speech generation step of generating synthesized speech based on the language information and the definite prosody information. The correction history management step includes a correction history extraction step of extracting the correction history information referred to in the dynamic prosody generation step and a correction history update step of updating the correction history information held in the correction history holding step. In this program, the correction history holding step further holds a plurality of correction history databases having different prosodic styles, and the correction history management step further includes a database selection control step of controlling selection among the plurality of correction history databases in response to a selection command.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
While describing the contents of the present invention, a preferred embodiment is described. Note that FIG. 1 and FIG. 2 are referred to as necessary. FIG. 1 is a block diagram conceptually showing the structure of a text-to-speech synthesizer according to the present invention. FIG. 2 is a block diagram conceptually showing the text-to-speech synthesis method according to the present invention.
[0018]
The text-to-speech synthesizer shown in FIG. 1 includes a language analysis unit 101 that performs language analysis on text data and extracts language information, a correction history holding unit 106 that holds correction history information, a correction history management unit 105 that manages the correction history information, a dynamic prosody generation unit 102 that generates dynamic prosody information based on the language information with reference to the correction history information via the correction history management unit 105, a prosody modification unit 103 that modifies the dynamic prosody information according to an external modification command to generate definite prosody information and updates the correction history information via the correction history management unit 105 according to the modification, and a synthesized speech generation unit 104 that generates synthesized speech based on the language information and the definite prosody information.
[0019]
The text-to-speech synthesis method (text-to-speech synthesis program code) shown in FIG. 2 includes a language analysis step 201 (language analysis program code) of performing language analysis on text data and extracting language information, a correction history reference step 215 (correction history reference program code) of referring to correction history information, a dynamic prosody generation step 202 (dynamic prosody generation program code) of generating dynamic prosody information based on the language information in cooperation with the correction history reference step 215 (correction history reference program code), a prosody modification step 203 (prosody modification program code) of generating definite prosody information by modifying the dynamic prosody information according to an external modification command, a correction history update step 225 (correction history update program code) of updating the correction history information according to the modification of the dynamic prosody information, and a synthesized speech generation step 204 (synthesized speech generation program code) of generating synthesized speech based on the language information and the definite prosody information.
[0020]
General methods for generating prosodic information include, for example, a method of generating prosodic information (static prosodic information) according to a predetermined prosody generation rule (static prosody generation rule) that takes the language information as an argument, and a method of generating prosodic information (static prosodic information) by selecting one prosodic pattern piece from a plurality of prosodic pattern pieces prepared in advance according to a predetermined rule (static prosodic pattern piece selection rule). In contrast, the essential feature of the present invention is that the prosodic information (dynamic prosodic information) is generated dynamically by referring to the correction history information.
[0021]
In this specification, “static” means “fixed, independent of the correction history information”, and “dynamic” means “variable depending on the correction history information”. “Static prosodic information” means prosodic information generated without referring to correction history information, as in the prior art. “Dynamic prosodic information” means prosodic information generated by referring to the correction history information: when the correction history information contains matching data, it differs from the basic prosodic information, and when it does not, it is the same as the basic prosodic information.
[0022]
First, the language analysis unit 101 of the text-to-speech synthesizer will be described. The language analysis unit 101 performs language analysis on text data to be subjected to speech synthesis. Various language information is extracted by this language analysis (language analysis step 201). Examples of language information include information for specifying reading (phoneme symbol string, etc.), information for specifying part of speech, and information for specifying dependency.
[0023]
Next, the dynamic prosody generation unit 102 of the text-to-speech synthesizer will be described. The dynamic prosody generation unit 102 refers to the correction history information (correction history reference step 215) and generates dynamic prosody information based on the language information (dynamic prosody generation step 202). The dynamic prosody generation unit 102 may refer to the correction history information by any method, as long as the prosody information is generated dynamically by referring to the correction history information. Examples include the following three reference methods.
[0024]
In the first revision history reference method, the dynamic prosody generation unit 102 generates static prosody information according to a static prosody generation rule, and corrects the generated static prosody information according to the correction history information. In this way, dynamic prosodic information is generated.
[0025]
In the second correction history reference method, the dynamic prosody generation unit 102 modifies the setting of the prosody generation parameter of the prosody generation rule according to the correction history information, and generates dynamic prosody information according to a dynamic prosody generation rule that changes with the setting of the prosody generation parameter.
[0026]
In the third revision history reference method, the dynamic prosody generation unit 102 selects one optimal prosodic pattern piece as a selected prosodic pattern piece from a plurality of prosodic pattern pieces based on linguistic information according to a static prosodic selection rule. In addition, the dynamic prosodic information is generated by correcting the selected prosodic pattern piece according to the correction history information.
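The first reference method above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the patent's implementation: the flat 120 Hz static rule, the `HistoryEntry` type, the dictionary-based language information, and the use of a multiplicative pitch scale as the correction content are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    """One piece of correction history data (hypothetical layout)."""
    condition: dict      # correction condition, e.g. {"phonemes": "ka"}
    pitch_scale: float   # correction content: factor applied to the pitch


def static_prosody(units):
    """Stand-in static prosody generation rule: a flat 120 Hz pitch per unit."""
    return [dict(u, pitch_hz=120.0) for u in units]


def generate_dynamic_prosody(units, history):
    """First reference method: generate static prosody, then correct it
    with every history entry whose condition matches the unit."""
    dynamic = []
    for unit in static_prosody(units):
        corrected = dict(unit)
        for entry in history:
            if all(unit.get(k) == v for k, v in entry.condition.items()):
                corrected["pitch_hz"] *= entry.pitch_scale
        dynamic.append(corrected)
    return dynamic
```

With an empty history the function returns the static prosody unchanged, which corresponds to the definition of dynamic prosodic information given earlier in this description.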
[0027]
Next, the prosody modification unit 103 of the text-to-speech synthesizer will be described. The prosody modification unit 103 generates definite prosody information by modifying the dynamic prosody information generated by the dynamic prosody generation unit 102 according to the external modification command (prosody modification step 203).
[0028]
If an external correction command is not input to the prosody modification unit 103, the dynamic prosody information becomes definite prosody information without modification. On the other hand, if an external correction command is received, the correction according to the external correction command is applied to the dynamic prosodic information, and the corrected dynamic prosodic information becomes definite prosodic information.
[0029]
The correction content of the correction made according to the external correction command and the correction condition at the time of the correction (hereinafter, a pair of correction content and correction condition is referred to as correction history data) are delivered to the correction history management unit 105 in order to update the correction history information stored in the correction history holding unit 106 (correction history update step 225). Here, updating the correction history information means adding correction history data to the correction history information or changing part of the correction history data constituting the correction history information.
[0030]
As a correction element of the correction content in each piece of correction history data constituting the correction history information, any prosodic parameter may be used as long as the change before and after the correction can be specified. It is preferable to use a prosodic parameter for which the correction amount before and after the correction can be defined. Such prosodic parameters include, for example, the duration, the intensity pattern (power pattern), or the fundamental frequency pattern (pitch pattern) of a phoneme symbol string unit composed of one or more phoneme symbols, a breath group unit, or an accent phrase unit. The correction content of each piece of correction history data may include only one type of prosodic parameter as a correction element, or may include a plurality of types.
[0031]
On the other hand, the correction condition in the correction history data can be set using language information. Examples of condition elements in the correction condition include a phoneme symbol string including one or more phoneme symbols, a part of speech, a position in a sentence, and an accent type. When the dynamic prosody generation unit 102 uses the first correction history reference method, at least one prosodic parameter included in the static prosodic information can also be used as a condition element. The correction condition of the correction history data may include only one type of condition element, or may include a plurality of types.
[0032]
Methods of correcting the dynamic prosody information in the prosody correcting unit 103 include, for example, a method in which the patterns of the correctable prosody parameters are displayed graphically and the displayed patterns are corrected using a mouse or the like, and a method in which the prosodic information is displayed as text and corrected by editing the displayed text.
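One possible way to represent a piece of correction history data, with its condition elements and correction elements, is sketched below in Python. The field names (`phonemes`, `part_of_speech`, `accent_type`) and the dictionary of correction amounts are assumptions chosen for illustration; the description only requires that a condition be expressible from language information and that a correction amount be definable.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class CorrectionCondition:
    """Correction condition; a field left as None is not checked."""
    phonemes: Optional[str] = None
    part_of_speech: Optional[str] = None
    accent_type: Optional[int] = None

    def matches(self, lang_info: dict) -> bool:
        """True when every condition element that is set equals the
        corresponding item of the language information."""
        for name in ("phonemes", "part_of_speech", "accent_type"):
            wanted = getattr(self, name)
            if wanted is not None and lang_info.get(name) != wanted:
                return False
        return True


@dataclass
class CorrectionHistoryData:
    """A pair of correction condition and correction content."""
    condition: CorrectionCondition
    # correction element -> correction amount, e.g. {"duration_ms": 20.0}
    content: Dict[str, float] = field(default_factory=dict)
```

A condition with a single element, such as `CorrectionCondition(part_of_speech="particle")`, applies to every particle, while adding more elements narrows the set of units it matches.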
[0033]
Further, in the modification of the dynamic prosodic information, the modification may be adjusted interactively while sequentially listening to the synthesized speech generated using the modified prosodic information reflecting the modification. The modified prosody information in a state where the adjustment is finally completed is sent to the synthesized speech generation unit 104 as definite prosody information. In this case, the prosody modification unit 103 is configured to include sample speech synthesis means for generating synthesized speech based on the language information and the modified prosody information for the text data fragment to be modified.
[0034]
Next, the correction history holding unit 106 and the correction history management unit 105 of the text-to-speech synthesizer will be described. The correction history holding unit 106 holds the correction content corrected by the prosody correction unit 103, together with the correction condition, as correction history information. The correction history management unit 105 includes a correction history update unit that updates the correction history information according to the correction in the prosody correction unit 103 and a correction history extraction unit that extracts the correction history information referred to by the dynamic prosody generation unit 102.
[0035]
Based on the correction condition sent from the dynamic prosody generation unit 102, the correction history extraction unit of the correction history management unit 105 extracts correction history data that matches the correction condition from the correction history holding unit 106 and sends the associated correction content to the dynamic prosody generation unit 102. If the correction history holding unit 106 holds a plurality of pieces of correction history data that match the correction condition, the correction contents corresponding to all of them are extracted and sent to the dynamic prosody generation unit 102.
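The extraction behavior described above, including the return of every matching correction content, might be sketched as follows. The representation of the history as a list of `(condition, content)` pairs and the matching rule (every element of the stored condition must appear with the same value in the query) are assumptions made for this example.

```python
def extract_corrections(history, query):
    """Return the correction contents of all history entries whose
    stored condition is satisfied by the query.

    history: list of (condition_dict, content_dict) pairs (assumed layout)
    query:   dict of condition elements sent by the prosody generator
    """
    return [
        content
        for condition, content in history
        if all(query.get(key) == value for key, value in condition.items())
    ]
```

Returning a list rather than a single item lets the caller decide how several matching corrections are combined, a decision this part of the description leaves open.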
[0036]
When the correction history update unit of the correction history management unit 105 receives the correction history data including the correction contents and the correction conditions from the prosody correction unit 103, the correction history update unit updates the correction history information held in the correction history holding unit 106.
[0037]
Here, when the correction history holding unit 106 does not hold correction history data whose correction condition matches the received correction condition, the correction history information is updated by adding the received correction history data to the correction history holding unit 106.
[0038]
Further, when the correction history holding unit 106 already holds correction history data whose correction condition matches the received correction condition but whose correction content has a different correction element from the received correction content, the correction history information may be updated by adding the received correction history data, or the two may be integrated and replaced with one new piece of correction history data whose correction content includes a plurality of correction elements.
[0039]
When the correction history management unit 105 receives correction history data from the prosody correction unit 103 and the correction history holding unit 106 already holds correction history data whose correction condition matches the received correction condition and whose correction content has the same correction element as the received correction content but a different correction, the update is performed so that the final correction content, determined relative to the basic prosody information, can be reflected in subsequent dynamic prosody generation. In this case, the received correction history data may be added in association with the existing correction history data, or the existing data may be replaced with one new piece of correction history data that takes the difference from the past correction content into account.
[0040]
The correction history holding unit 106 may be constituted by a single correction history DB or by a plurality of correction history DBs having different prosodic styles. A prosodic style means a tone style according to a dialect, such as the Osaka dialect or the Kyoto dialect, or a tone style according to an emotion, such as a sad tone, a pleasant tone, an intense tone, or a gentle tone. Note that when the correction history holding unit 106 has a plurality of correction history DBs, the correction history information means the entirety of the correction history data included in all of the correction history DBs, that is, all of the correction history data held in the correction history holding unit.
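The three update cases described above (no matching condition, matching condition with a different correction element, matching condition with the same correction element) can be sketched in Python as below. This sketch picks the "integrate and replace" option in each case and assumes that correction amounts are additive offsets, and that a newly received amount was applied on top of prosody that already reflected the stored amount; both are assumptions of the example, not statements of the patent.

```python
def update_history(history, condition, content):
    """Update the correction history in place.

    history:   list of {"condition": dict, "content": dict} (assumed layout)
    condition: correction condition of the received correction history data
    content:   correction element -> additive correction amount (assumed)
    """
    for entry in history:
        if entry["condition"] == condition:
            for element, amount in content.items():
                # Different element: integrated into the same entry.
                # Same element: accumulated, so the stored amount stays
                # relative to the basic (uncorrected) prosody.
                entry["content"][element] = entry["content"].get(element, 0.0) + amount
            return
    # No entry with a matching condition: add new correction history data.
    history.append({"condition": dict(condition), "content": dict(content)})
```

For example, storing a +10 Hz pitch correction and later receiving a further -4 Hz correction under the same condition leaves a single entry with a +6 Hz correction relative to the basic prosody.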
[0041]
In the following, a case where the correction history holding unit 106 has a plurality of correction history DBs will be described. Refer to FIGS. 3 and 4 as necessary. FIG. 3 is a block diagram conceptually showing the structure of the text-to-speech synthesizer provided with a correction history holding unit composed of a plurality of correction history DBs. FIGS. 4A to 4C are block diagrams conceptually showing a configuration example of DB selection control means for controlling selection of a plurality of correction history DBs.
[0042]
As shown in FIG. 3, when the revision history holding unit 106 has a plurality of revision history DBs 116, the revision history management unit 105 together with the revision history extraction unit 115 and the revision history update unit 125 reads from the revision history DB. It is configured to include DB selection control means 135 for controlling which correction history DB is extracted in the correction history data extraction or correction history information update to the correction history DB.
[0043]
The DB selection control unit 135 includes a correction history DB (hereinafter also referred to as a reference correction history DB) referred to by the dynamic prosody generation unit 102 and a correction history DB (hereinafter referred to as a correction history DB updated based on the correction in the prosody correction unit 103). As the update correction history DB, the same correction history DB may be selected, or the reference correction history DB and the update correction history DB may be selected independently of each other. . Hereinafter, a specific configuration of the DB selection control means 135 will be described.
[0044]
As shown in FIG. 4A, the correction history management unit 105 includes a common DB selection control unit 145 that selects at least one of the plurality of correction history DBs 116 as a common correction history DB in response to the common selection command. It can be set as the structure which it has (1st structure). In the case of the first configuration, the dynamic prosody generation unit 102 selectively refers to the correction history information included in the common correction history DB, and the prosody correction unit 103 selects the correction history information included in the common correction history DB. Will be updated. A common selection instruction is used as the selection instruction.
[0045]
With the above configuration, the correction history information of one or a plurality of correction history DBs among the plurality of correction history DBs 116 can be selectively reflected in dynamic prosody generation according to the purpose, and one time Correction history information of one or a plurality of correction history DBs can be selectively updated by prosody correction. In addition, the dynamic prosody generation unit 102 can easily generate dynamic prosody information reflecting a desired prosodic style.
[0046]
As shown in FIG. 4B, the correction history management unit 105 selects a reference DB selection control that selects at least one of the plurality of correction history DBs 116 as a reference correction history DB in response to a reference selection command. A configuration (second configuration) including means 155 and update DB selection control means 165 that selects at least one of the plurality of correction history DBs 116 as an update correction history DB in response to an update selection command. it can. In the case of the second configuration, the dynamic prosody generation unit 102 selectively refers to the correction history information included in the reference correction history DB, and the prosody correction unit 103 includes the correction history information included in the update correction history DB. Will be updated selectively. Note that a reference selection instruction and an update selection instruction are used as selection instructions.
[0047]
When the correction history management unit 105 is in the first configuration, common control is performed for the reference correction history DB and the update correction history DB. In the second configuration, the reference correction history is used. The DB and the update correction history DB can be controlled independently. Thereby, the correction history information can be updated flexibly and effectively. That is, the correction history can be learned efficiently.
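The difference between the first configuration (common selection) and the second configuration (independent reference and update selection) can be sketched as a small Python class. The class name, method names, style names, and the use of plain lists as correction history DBs are all hypothetical choices made for this illustration.

```python
class CorrectionHistoryManager:
    """Holds one correction history DB per prosodic style and tracks which
    DBs are used for reference and which for update (illustrative only)."""

    def __init__(self, style_names):
        self.dbs = {name: [] for name in style_names}
        self.reference = set()
        self.update = set()

    def _check(self, names):
        unknown = set(names) - set(self.dbs)
        if unknown:
            raise KeyError(f"unknown correction history DB: {sorted(unknown)}")
        return set(names)

    def select_common(self, names):
        """First configuration: the same DBs serve reference and update."""
        self.reference = self.update = self._check(names)

    def select_reference(self, names):
        """Second configuration: choose the reference DBs independently."""
        self.reference = self._check(names)

    def select_update(self, names):
        """Second configuration: choose the update DBs independently."""
        self.update = self._check(names)

    def extract(self):
        """Collect the history data of every selected reference DB."""
        return [e for name in sorted(self.reference) for e in self.dbs[name]]

    def record(self, entry):
        """Add history data to every selected update DB."""
        for name in sorted(self.update):
            self.dbs[name].append(entry)
```

Under a common selection, a recorded correction is immediately visible to later extraction. With independent selection, a correction made while listening to one style can be stored under another style without affecting the style currently in use.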
[0048]
The correction history management unit 105 includes a common DB selection control unit 145 that selects at least one of the plurality of correction history DBs 116 as a common correction history DB in response to the common selection command, and responds to the selection change command. The selection DB change for canceling the selection for any of the correction history DBs selected by the common DB selection control unit 145 and / or for additional selection of a correction history DB other than the correction history DB selected by the common DB selection unit 145 A configuration having the means 175 can be adopted.
[0049]
The selection DB changing means 175 may apply a change that is common to both the correction history DB referred to by the dynamic prosody generation unit 102 and the correction history DB updated based on the correction in the prosody correction unit 103, or it may apply an independent change to each of them. Alternatively, it may change only one of the correction history DB referred to by the dynamic prosody generation unit 102 and the correction history DB updated based on the correction in the prosody correction unit 103.
[0050]
FIG. 4C shows a configuration (third configuration) in which the correction history management unit 105 includes selection DB changing means 175 that changes the correction history DB updated based on the correction in the prosody correction unit 103. In the third configuration, the dynamic prosody generation unit 102 selectively refers to the correction history information included in at least one correction history DB (reference correction history DB) selected by the common DB selection control unit 145, and the prosody correction unit 103 selectively updates the correction history information included in at least one correction history DB (update correction history DB) determined by the common DB selection control unit 145 and the selection DB changing means 175. A common selection command and a selection change command are used as the selection commands.
[0051]
With the third configuration, the update correction history DB can be made independent of the reference correction history DB and can be selected arbitrarily from the plurality of correction history DBs 116 in the correction history holding unit 106. When the prosody correction unit 103 corrects the dynamic prosody information, the correction is usually reflected in the prosodic style (correction history DB) that was referred to in generating that dynamic prosody information. The third configuration therefore allows the correction history information to be updated more easily than the configuration described above, while remaining flexible and effective.
[0052]
The correction history management unit 105 may also adopt a configuration (fourth configuration) that includes a common DB selection control unit 145, which selects at least one of the plurality of correction history DBs 116 as a common correction history DB in response to a common selection command, and selection DB changing means 175, which, in response to a selection change command, only adds a new correction history DB to the update correction history DB formed by the common correction history DB. In the fourth configuration, the dynamic prosody generation unit 102 selectively refers to the correction history information included in the reference correction history DB, which consists of the common correction history DB selected by the common DB selection control unit 145, and the prosody correction unit 103 selectively updates the correction history information included in the update correction history DB, which consists of the common correction history DB and at least one correction history DB added by the selection DB changing means 175. A common selection command and a selection change command are used as the selection commands.
[0053]
When the prosody correction unit 103 corrects the dynamic prosody information, the correction is usually reflected in the prosodic style (correction history DB) that was referred to in generating that dynamic prosody information. It is therefore preferable that the update correction history DB include all the correction history DBs that make up the reference correction history DB. Accordingly, the fourth configuration, although simpler than the third configuration, provides the same effect as the third configuration.
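The relationship between the reference and update selections in the third and fourth configurations can be written as simple set operations. This is only a sketch: the function names and the use of DB names as set elements are assumptions, not part of the specification.

```python
def third_configuration(common, cancelled=(), added=()):
    """Reference DBs follow the common selection; the update DBs may
    drop some of them and/or add others (selection change command)."""
    reference = set(common)
    update = (set(common) - set(cancelled)) | set(added)
    return reference, update


def fourth_configuration(common, added=()):
    """The update DBs can only grow, so they always contain every
    reference DB."""
    reference = set(common)
    update = set(common) | set(added)
    return reference, update
```

In the fourth configuration the update set is a superset of the reference set by construction, which is the property the text above describes as preferable.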
[0054]
Which of the plurality of correction history DBs 116 is selected may be determined each time the apparatus or application is started, or may be determined for each piece of text data. The correction history DB may also be determined within the application each time the prosody correction unit 103 corrects the dynamic prosody information. Furthermore, when the DB selection control means 135 of the correction history management unit 105 has the second, third, or fourth configuration (that is, a configuration having at least two kinds of means), a plurality of determination methods can be used in combination.
[0055]
When the correction history DB is selected for each piece of text data, the author may make the selection in the application, or the selection may be made according to a control code (style selection information) included in the text file together with the text data. In the former case, a selection command input unit for inputting the selection command is used. In the latter case, the text-to-speech synthesizer further includes a selection command generation unit that analyzes the text file and generates the selection command.
[0056]
Finally, the synthesized speech generation unit 104 of the text-to-speech synthesizer will be described. The synthesized speech generation unit 104 generates synthesized speech by selecting segments, deforming them, and connecting them based on the linguistic information and the definite prosodic information (synthesized speech creation step). Any conventionally known technique may be used to generate synthesized speech from linguistic information and definite prosodic information.
[0057]
(Embodiment 1)
In the first embodiment, a text-to-speech synthesizer having a dynamic prosody generation unit to which the first correction history reference method is applied will be described. Refer to FIG. 5 and FIG. 6 as necessary. FIG. 5 is a block diagram conceptually showing the structure of the text-to-speech synthesizer according to the first embodiment. FIG. 6 is an explanatory diagram for explaining in detail the characteristic part in the text-to-speech synthesizer according to the first embodiment.
[0058]
The text-to-speech synthesizer shown in FIG. 5 includes: a text holding unit 107; a language analysis unit 101; a dynamic prosody generation unit 102 having static prosody generation means 112, filtering means 122, and filter control means 132; a prosody correction unit 103; a display unit 117; a correction command input unit 127; a correction history holding unit 106 having a single correction history DB; a correction history management unit 105 having correction history extraction means 115 and correction history update means 125; a synthesized speech generation unit 104 having segment holding means 134, segment selection means 114, and speech synthesis means 124; and a speech output unit 137.
[0059]
The operation of the text-to-speech synthesizer shown in FIG. 5 will be described. The language analysis unit 101 performs language analysis on the text data held in the text holding unit 107 in a predetermined unit, and as a result, language information is extracted (language processing step). The extracted language information is sent to the static prosody generation means 112 of the dynamic prosody generation unit 102.
[0060]
In the static prosody generation means 112, static prosody information is generated according to a static prosody generation rule based on the transmitted language information (static prosody generation step). The generated static prosodic information is sent to the filtering unit 122 together with the language information.
[0061]
In the filtering means 122, a correction condition is generated from the transmitted language information and static prosodic information. The generated correction condition is sent to the filter control means 132 in order to determine the processing content of the filtering process.
[0062]
To check whether correction history data matching the received correction condition is included in the correction history DB of the correction history holding unit 106, the filter control means 132 sends the correction condition to the correction history extraction means 115 of the correction history management unit 105. Here, since the correction history holding unit 106 has only one correction history DB, all of the information in that correction history DB constitutes the correction history information.
[0063]
The correction history extraction means 115 searches the correction history DB of the correction history holding unit 106. If correction history data matching the received correction condition exists, that correction history data is extracted and sent to the filter control means 132. If a plurality of correction history data match the correction condition, all of them are sent to the filter control means 132. If no correction history data matches the correction condition, the filter control means 132 is notified to that effect.
[0064]
The filter control unit 132 determines the correction amount of each prosodic parameter in the static prosody information from the received correction history data, and notifies the filtering unit 122 of the correction amount. Here, when a plurality of correction history data is sent from the correction history extraction means 115, the correction amount of each prosodic parameter is determined so as to reflect all of them. If there is a notification that the correction history data does not exist, the filtering unit 122 is notified that the correction is not performed or that the correction amount of each prosodic parameter is zero.
[0065]
The filtering means 122 corrects the static prosodic information based on the correction amount of each prosodic parameter determined by the filter control means 132 to generate dynamic prosodic information. The generated dynamic prosody information is sent to the prosody modification unit 103 together with the language information. Here, it should be noted that the correction in the filtering unit 122 is automatically performed according to the correction history information, and that the dynamic prosody information is prosodic information in which the correction history information is reflected.
[0066]
The prosody modification unit 103 causes the display unit 117 to display the transmitted dynamic prosody information as a graphical image. The desired additional correction is performed on the dynamic prosody information displayed on the display unit 117 according to the external correction command from the correction command input unit 127. Here, it should be noted that the correction in the prosody correction unit 103 is performed manually as in the prior art.
[0067]
For the additional correction in the prosody correction unit 103, correction history data in which a correction condition and correction content are set is generated. The generated correction history data is sent to the correction history update unit 125 of the correction history management unit 105 in order to update the correction history DB in the correction history holding unit 106. The dynamic prosody information with the desired correction is sent to the segment selection means 114 of the synthesized speech generation unit 104 together with the language information as definite prosody information.
[0068]
The correction history update means 125 of the correction history management unit 105 updates the correction history DB of the correction history holding unit 106 based on the received correction history data. Note that the updated correction history DB is referred to in the next filtering process by the filtering means 122. Note also that this is not limited to a single file: corrections made to a first file are also reflected in the filtering process when text data contained in a second file, different from the first file, is converted to speech.
[0069]
The segment selection means 114 of the synthesized speech generation unit 104 selects optimal segments from the segment group held in the segment holding means 134 based on the language information sent from the prosody correction unit 103. The selected segments are sent to the speech synthesis means 124.
[0070]
The segments held by the segment holding means 134 may be single segments or composite segments. Examples of composite segments include CV segments (C: consonant, V: vowel), VC segments, CVC segments, and VCV segments. The segment group may consist of single segments only, of one kind of composite segment only, of several kinds of composite segments, or of single segments together with one or more kinds of composite segments.
[0071]
In the speech synthesis means 124 of the synthesized speech generation unit 104, the received segments are deformed and connected based on the definite prosodic information, thereby generating synthesized speech. The generated synthesized speech is output from the speech output unit 137.
[0072]
Through the above processing, additional manual corrections are applied to dynamic prosodic information that already reflects the author's past corrections automatically and therefore already yields synthesized speech close to the author's preference. Compared with conventional text-to-speech synthesis in which the basic prosodic information (static prosodic information) is corrected manually, or in which the prosody generation rules are corrected manually, any text data can easily be output as synthesized speech that suits the author's preference.
[0073]
Here, based on the specific example shown in FIG. 6, the filtering process in the filtering means 122 will be described in detail. The prosodic information that is automatically corrected may be a duration, power pattern, pitch pattern, or the like. In this specific example, a case where the prosodic parameter included in the generated prosodic information is a pitch pattern will be described. In the following description, the reference numerals shown in FIG. 5 are assigned to the members and the means.
[0074]
First, prosody correction performed in the past will be described. In the prosody modification unit 103, modification 305a using the mouse is performed on the graphically displayed pitch pattern, or modification 305b is performed on the pitch pattern displayed in text by text editing. Here, the pitch pattern before correction means that the pitch of the first phoneme symbol / a / is 400 Hz, and the pitch of the last phoneme symbol / a / is 300 Hz. Since no pitch is defined for consonants, no numerical value is given to / k /. In past prosody correction, the pitch of the last phoneme symbol / a / is changed from 300 Hz to 200 Hz.
[0075]
Whichever of the above methods is used for the prosody correction, the same correction history data 306 is generated. The generated correction history data 306 is stored in the correction history DB 116.
[0076]
In the correction history data 306 registered in the correction history DB 116, the correction condition is that the position in the sentence is the end of the sentence, the phoneme symbol string (target phoneme symbol, preceding phoneme symbol, subsequent phoneme symbol) is (/a/, /k/, /−/), the number of morae is 2, and the accent type is type 0. The correction content is a correction process that lowers the pitch of the target phoneme /a/ by 100 Hz.
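As a rough illustration of how the correction content can be derived from the pitch patterns before and after a manual edit, the following sketch computes the per-phoneme offset. The function name and the list-of-values representation (with `None` for consonants that carry no pitch) are assumptions for illustration.

```python
def correction_offsets(before, after):
    """Return {index: offset_hz} for every pitch value the author changed."""
    offsets = {}
    for i, (old, new) in enumerate(zip(before, after)):
        if old is None or new is None:   # consonants carry no pitch value
            continue
        if new != old:
            offsets[i] = new - old
    return offsets
```

For the pitch pattern of the example, changing the last value from 300 Hz to 200 Hz gives an offset of −100 Hz at index 2, which corresponds to the correction content stored in the correction history data 306.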
[0077]
Next, generation of dynamic prosodic information after the past prosody correction will be described. When the pitch pattern (400 Hz, −, 300 Hz) corresponding to the phoneme symbol string (/a/, /k/, /a/) is given to the filtering means 122 as the static prosody information 301 generated by the static prosody generation means 112, a correction condition 302 is generated.
[0078]
The filter control means 132 refers to the correction history DB 116 based on the correction condition 302 and obtains correction history data 306 that matches the correction condition. Since there is a correction content (−100 Hz: lower by 100 Hz) in the correction history data 306, the correction content 303 that “lowers the pitch of the last phoneme symbol / a / by 100 Hz” is sent to the filtering means 122.
[0079]
The filtering unit 122 lowers the pitch pattern for the last phoneme symbol / a / by 100 Hz based on the correction content 303 sent. That is, the pitch pattern is corrected from (400 Hz, −, 300 Hz) to (400 Hz, −, 200 Hz). The corrected pitch pattern is sent to the prosody correcting unit 103 as dynamic prosody information 304.
[0080]
Thus, when correction history data matching the correction condition generated from the static prosodic information is already included in the correction history information, dynamic prosodic information to which the same correction as the one performed in the past has been automatically applied can be generated.
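The filtering example of FIG. 6 can be traced with a minimal sketch. The `Condition` fields follow the correction condition described above, but the type names, the function names, and the simplification that one sentence position applies to the whole phoneme string are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Condition(NamedTuple):
    position: str                 # e.g. "sentence_end"
    target: str                   # target phoneme symbol
    preceding: Optional[str]
    following: Optional[str]      # None stands for "/-/" (no phoneme)
    mora_count: int
    accent_type: int


# Correction history DB: correction condition -> pitch offset in Hz.
history_db = {
    Condition("sentence_end", "a", "k", None, 2, 0): -100.0,
}


def make_condition(phonemes, index, position, mora_count, accent_type):
    """Build the correction condition for one phoneme."""
    preceding = phonemes[index - 1] if index > 0 else None
    following = phonemes[index + 1] if index + 1 < len(phonemes) else None
    return Condition(position, phonemes[index], preceding, following,
                     mora_count, accent_type)


def filter_pitch(phonemes, pitch, position, mora_count, accent_type, db):
    """Apply every matching stored correction to the static pitch pattern."""
    corrected = list(pitch)
    for i, hz in enumerate(pitch):
        if hz is None:            # consonants carry no pitch value
            continue
        cond = make_condition(phonemes, i, position, mora_count, accent_type)
        corrected[i] = hz + db.get(cond, 0.0)   # no match: zero correction
    return corrected
```

Calling `filter_pitch(["a", "k", "a"], [400.0, None, 300.0], "sentence_end", 2, 0, history_db)` lowers only the last /a/, because the first /a/ has a different preceding and subsequent phoneme and therefore does not match the stored condition.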
[0081]
(Embodiment 2)
In the second embodiment, a text-to-speech synthesizer will be described that includes a dynamic prosody generation unit to which the first correction history reference method is applied and a correction history holding unit having a plurality of correction history DBs with different prosodic styles. FIG. 7 and FIG. 8 are referred to as necessary. FIG. 7 is a block diagram conceptually showing the structure of the text-to-speech synthesizer according to the second embodiment. FIG. 8 is an explanatory diagram for explaining in detail the characteristic part of the text-to-speech synthesizer according to the second embodiment.
[0082]
The text-to-speech synthesizer shown in FIG. 7 includes: a text holding unit 107; a language analysis unit 101; a dynamic prosody generation unit 102 having static prosody generation means 112, filtering means 122, and filter control means 132; a prosody correction unit 103; a display unit 117; a correction command input unit 127; a correction history holding unit 106 having a plurality of types of correction history DBs 116; a correction history management unit 105 having correction history extraction means 115, correction history update means 125, and DB selection control means 135; a selection command input unit 147; a synthesized speech generation unit 104 having segment holding means 134, segment selection means 114, and speech synthesis means 124; and a speech output unit 137. Because the correction history holding unit 106 includes a plurality of correction history DBs 116, prosody correction that selectively reflects a desired prosodic style among a plurality of prosodic styles can be performed automatically.
[0083]
The operation of the text-to-speech synthesizer shown in FIG. 7 will be described. This operation is basically the same as that of the text-to-speech synthesizer of the first embodiment, except for the updating and referencing of the correction history information by the correction history management unit, so description of the common parts is omitted.
[0084]
First, in updating the correction history information, the correction history update means 125 updates, in accordance with new correction history data from the prosody correction unit 103, the correction history information in at least one correction history DB selected according to the selection command from the selection command input unit 147.
[0085]
Next, in referring to the correction history information, the correction history extraction means 115, in response to a request from the dynamic prosody generation unit 102 (filter control means 132), searches at least one correction history DB selected according to the selection command from the selection command input unit 147 and extracts the correction history data that matches the correction condition.
[0086]
Here, based on the specific example shown in FIG. 8, a method of updating and referring to the revision history information in the revision history management unit 105 will be described in detail. The prosodic information that is automatically corrected may be a duration, power pattern, pitch pattern, or the like. In this specific example, a case where the prosodic parameter included in the generated prosodic information is a pitch pattern will be described. In the following description, the reference numerals shown in FIG. 7 are assigned to the members and the means.
[0087]
First, prosody correction performed in the past will be described. In the prosody correction unit 103, prosody correction 314 is performed and correction history data 315 is generated. The generated correction history data 315 includes correction content that corrects the pitch pattern (400 Hz, −, 300 Hz) corresponding to the phoneme symbol string (/a/, /k/, /a/) to the pitch pattern (400 Hz, −, 200 Hz). The correction history data 315 is stored in the correction history DB-A 126 in accordance with a selection command. The correction condition is not shown.
[0088]
Further, in the prosody correction unit 103, prosody correction 316 is performed and correction history data 317 is generated. The generated correction history data 317 includes correction content that corrects the pitch pattern (400 Hz, −, 300 Hz) corresponding to the phoneme symbol string (/a/, /k/, /a/) to the pitch pattern (400 Hz, −, 350 Hz). The correction history data 317 is stored in the correction history DB-B 136 in accordance with a selection command. Although the correction condition in the correction history data 317 is not shown, it is assumed to be the same as the correction condition in the correction history data 315.
[0089]
Next, generation of dynamic prosodic information after the past prosody corrections will be described. When the pitch pattern (400 Hz, −, 300 Hz) corresponding to the phoneme symbol string (/a/, /k/, /a/) is given to the filtering means 122 as the static prosody information 311 generated by the static prosody generation means 112, a correction condition is generated, and the generated correction condition is sent to the correction history extraction means 115 of the correction history management unit 105 via the filter control means 132. It is assumed that the generated correction condition is the same as the correction conditions in the correction history data 315 and the correction history data 317.
[0090]
Because the correction history DB-B 136 has been selected (312) in the DB selection control means 135 by a selection command input in advance, the correction history extraction means 115 searches only the correction history data held in the correction history DB-B 136 and extracts the correction history data 317 that matches the correction condition. Note that although correction history data matching the correction condition also exists in the correction history DB-A 126, the correction history data 315 of the correction history DB-A 126 is not extracted. The extracted correction history data 317 is sent to the filter control means 132 of the dynamic prosody generation unit 102.
[0091]
Because the correction history data 317 includes a correction process (+50 Hz: raise by 50 Hz) associated with the correction condition, the filter control means 132 determines the correction content "raise the pitch of the last phoneme symbol /a/ by 50 Hz". The determined correction content is sent to the filtering means 122.
[0092]
In the filtering means 122, the pitch pattern corresponding to the last phoneme symbol / a / is raised by 50 Hz based on the sent correction contents. That is, the pitch pattern (static prosodic information) 311 is corrected from (400 Hz, −, 300 Hz) to (400 Hz, −, 350 Hz). The corrected pitch pattern (dynamic prosody information) 313 is sent to the prosody correcting unit 103.
[0093]
When the prosody correction unit 103 performs a further correction on the pitch pattern 313 sent to it, the correction history DB-B 136 remains the selected correction history DB unless the selection is changed again by a selection command.
[0094]
As a result, dynamic prosodic information reflecting a desired prosodic style among a plurality of prosodic styles can be generated. In addition, it is possible to update only correction history information corresponding to a desired prosodic style among a plurality of prosodic styles.
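The effect of selecting one of several correction history DBs can be sketched as follows. The abbreviated condition tuple, the dictionary layout, and the choice to sum multiple matching offsets are assumptions for illustration; the specification states only that all matching corrections are reflected, not how they are combined.

```python
# Two correction history DBs holding different prosodic styles.
# The key is an abbreviated, hypothetical correction condition;
# values are pitch offsets in Hz for the target phoneme.
CONDITION = ("sentence_end", "a", "k", None)

history_dbs = {
    "DB-A": {CONDITION: -100.0},   # 300 Hz -> 200 Hz
    "DB-B": {CONDITION: +50.0},    # 300 Hz -> 350 Hz
}


def extract(condition, selected, dbs):
    """Search only the selected DBs for corrections matching the condition."""
    return [dbs[name][condition] for name in selected
            if condition in dbs[name]]


def apply_to_last_vowel(pitch, offsets):
    """Add the extracted offsets to the final pitch value (assumed: summed)."""
    corrected = list(pitch)
    corrected[-1] += sum(offsets)
    return corrected
```

With only DB-B selected, `extract` returns the +50 Hz correction and the matching entry in DB-A is ignored, so the same static pitch pattern yields a different dynamic pitch pattern depending on the selected style.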
[0095]
Here, a method for selecting the correction history DB will be described. In the selection command input unit 147, the correction history DB 116 is switched by a selection command input from the outside by pressing a prosodic style selection button (correction history DB selection button) on the application.
[0096]
The text-to-speech synthesizer shown in FIG. 7 includes a selection command input unit 147 through which a selection command for selecting the correction history DB 116 is input, but the selection command can also be supplied in other ways. FIG. 9 is a block diagram conceptually showing the structure of a text-to-speech synthesizer that includes a selection command generation unit instead of the selection command input unit 147 in FIG. 7.
[0097]
As shown in FIG. 9, the text-to-speech synthesizer includes a selection command generation unit 108 that extracts prosodic style information by analyzing the control codes in the text data held in the text holding unit 107 and generates a selection command based on the extracted prosodic style information.
[0098]
The selection command generation unit 108 may be means that determines the prosodic style for each piece of text data when the text data contains only one piece of prosodic style information, or means that determines the prosodic style for each portion of the text data when the text data contains a plurality of pieces of prosodic style information. For example, when the analyzed control codes specify that prosodic style A and prosodic style B are applied to sentence 1 and sentence 2 in the same file, respectively, the dynamic prosody generation unit 102 can generate the pitch pattern with prosodic style A for sentence 1 and with prosodic style B for sentence 2.
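A sketch of per-sentence style selection from embedded control codes is given below. The specification does not define the control-code format, so the `<style=NAME>` syntax, the regular expression, and the function name are purely hypothetical.

```python
import re

# Hypothetical control-code syntax: "<style=NAME>" switches the prosodic
# style for the text that follows it. The actual format is not specified.
STYLE_CODE = re.compile(r"<style=(\w+)>")


def split_by_style(text, default_style=None):
    """Return (style, text) pairs, one per run of text under a single style."""
    segments = []
    style = default_style
    pos = 0
    for match in STYLE_CODE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((style, chunk))
        style = match.group(1)
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments
```

Each returned style name would then be turned into a selection command for the corresponding correction history DB before the dynamic prosody information for that run of text is generated.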
[0099]
(Embodiment 3)
In the third embodiment, a text-to-speech synthesizer having a dynamic prosody generation unit to which the second correction history reference method is applied will be described. Refer to FIGS. 10 and 11 as necessary. FIG. 10 is a block diagram conceptually showing the structure of the text-to-speech synthesizer according to the third embodiment. FIG. 11 is an explanatory diagram for explaining in detail the characteristic part in the text-to-speech synthesizer according to the third embodiment.
[0100]
The text-to-speech synthesizer shown in FIG. 10 includes a text holding unit 107, a language analysis unit 101, a dynamic prosody generation unit 102 including a prosody generation parameter control unit 142 and a prosody generation rule control unit 152, and a prosody modification unit. 103, a display unit 117, a correction command input unit 127, a correction history holding unit 106 having a correction history DB, a correction history management unit 105 having a correction history extracting unit 115 and a correction history updating unit 125, and a unit holding A synthesized speech generation unit 104 having a unit 134, a segment selection unit 114 and a speech synthesis unit 124, and a speech output unit 137 are included.
[0101]
The operation of the text-to-speech synthesizer shown in FIG. 10 will be described. Since the configuration other than the dynamic prosody generation unit is the same as that of the first embodiment, the description thereof is omitted.
[0102]
In the dynamic prosody generation means 142 of the dynamic prosody generation unit 102, the basic prosody generation parameters necessary for generating prosody information are determined based on the linguistic information sent from the language analysis unit 101, and a correction condition is generated. The generated correction condition is sent to the prosody generation rule control means 152 in order to determine the prosody generation parameters used for generating the dynamic prosody information.
[0103]
To check whether correction history data matching the received correction condition is included in the correction history DB (correction history information) of the correction history holding unit 106, the prosody generation rule control means 152 sends the correction condition to the correction history extraction means 115 of the correction history management unit 105.
[0104]
The correction history extraction means 115 searches the correction history DB of the correction history holding unit 106 based on the received correction condition. If correction history data matching the correction condition exists, that correction history data is extracted and sent to the prosody generation rule control means 152. If a plurality of correction history data match the correction condition, all of them are sent to the prosody generation rule control means 152. If no correction history data matches the correction condition, the prosody generation rule control means 152 is notified to that effect.
[0105]
The prosody generation rule control unit 152 determines the correction of each prosody generation parameter based on the received correction history data and notifies the dynamic prosody generation unit 142 of the correction. If there is a notification that the correction history data does not exist, the dynamic prosody generation unit 142 is notified that the correction is not performed, or that the correction of each prosody parameter is zero. Here, when a plurality of correction history data is sent from the correction history extraction means 115, the correction of each prosody generation parameter is determined so as to reflect the correction contents included in all the correction history data.
[0106]
The dynamic prosody generation means 142 generates dynamic prosody information from the language information produced by the language analysis unit 101, in accordance with the prosody generation rules and using the prosody generation parameters determined by the prosody generation rule control means 152. The generated dynamic prosody information is sent to the prosody correction unit 103 together with the language information. Note that the dynamic prosody generation rule is determined by the prosody generation parameters determined by the prosody generation rule control means 152. Note also that the change of the prosody generation rule in the dynamic prosody generation means 142 is performed automatically according to the correction history information, and that the dynamic prosody information is prosody information in which the correction history information is reflected.
[0107]
Here, based on the specific example shown in FIG. 11, the method of controlling the prosody generation rule in the dynamic prosody generation means 142 will be described in detail. The generated dynamic prosodic information may be a duration, power, pitch pattern, or the like. In this specific example, a case where the prosodic parameter included in the generated dynamic prosodic information is a pitch pattern will be described. In the following description, the reference numerals shown in FIG. 10 are assigned to the members and the means.
[0108]
The Fujisaki model is used as the model for generating the pitch pattern in the dynamic prosody generation means 142. The Fujisaki model expresses a pitch pattern by superimposing a phrase component, which represents the declination over a breath group, and an accent component, which represents the pitch movement of each accent. Various parameters of the Fujisaki model, such as the slope of the phrase decay, can be adjusted. In this example, the pitch pattern is generated from two parameters: a phrase command indicating the magnitude of the phrase component and an accent command indicating the magnitude of the accent component.
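For reference, the commonly cited formulation of the Fujisaki model can be sketched as below. The specification does not give the equations or constants, so the time constants `alpha` and `beta`, the ceiling `gamma`, and the command timings in any call are assumed values; the phrase command magnitude and accent command amplitude play the roles of the phrase command and accent command in the text.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(base_hz)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(log_f0)
```

Because the phrase and accent components are added in the logarithmic frequency domain, changing only the accent command amplitude raises or lowers the contour inside the accented interval while leaving the phrase component untouched.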
[0109]
First, prosody correction performed in the past will be described. In the prosody modification unit 103, modification 325 using the mouse is performed on the pitch pattern displayed graphically. Here, the pitch pattern before correction generated by using the phrase command “h” and the accent command “a” is indicated by a dotted line, and the pitch pattern after correction is indicated by a solid line.
[0110]
Along with this correction, calculation 326 of prosody generation parameters necessary for generating a corrected pitch pattern is performed. In this specific example, “h” remains unchanged for the phrase command, and “a” changes to “a ′” for the accent command.
[0111]
A publicly known method can be used as a method for calculating the prosody generation parameter necessary for generating the corrected pitch pattern. For example, the prosody generation parameter for generating the pitch pattern is obtained by solving a linear equation with the prosody generation parameter to be estimated as an unknown using a least square method targeting the pitch pattern after correction.
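As one possible reading of the least-squares step, the sketch below fits a phrase command `h` and an accent command `a` to a corrected contour, assuming the contour (in the log-frequency domain, with the base value already subtracted) is a linear combination of two known basis curves. The function name and the basis-curve inputs are hypothetical and are not taken from the source.

```python
def fit_commands(target, phrase_basis, accent_basis):
    """Least-squares fit of (h, a) in target[i] ~ h*phrase_basis[i] + a*accent_basis[i].

    Solves the 2x2 normal equations directly with Cramer's rule.
    """
    spp = sum(p * p for p in phrase_basis)
    saa = sum(q * q for q in accent_basis)
    spa = sum(p * q for p, q in zip(phrase_basis, accent_basis))
    spy = sum(p * y for p, y in zip(phrase_basis, target))
    say = sum(q * y for q, y in zip(accent_basis, target))
    det = spp * saa - spa * spa
    if abs(det) < 1e-12:
        raise ValueError("basis curves are linearly dependent")
    h = (spy * saa - say * spa) / det
    a = (say * spp - spy * spa) / det
    return h, a
```

When the corrected contour was itself produced by the same basis curves, the fit recovers the underlying commands; for a hand-drawn correction it returns the closest pair in the least-squares sense.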
[0112]
When the prosody generation parameters necessary for generating the corrected pitch pattern are calculated, the correction history data 327 is subsequently generated. The generated correction history data 327 is stored in the correction history DB 116.
[0113]
The generated correction history data 327 holds, as the correction condition, that the position in the sentence is the end of the sentence, that the phoneme symbol string (corrected phoneme, preceding phoneme, subsequent phoneme) is (/a/, /k/, /-/), that the number of morae is 3, and that the accent type is type 0. As the correction content, it holds the processing content indicating that the accent command is corrected from “a” to “a′”.
[0114]
Next, generation of dynamic prosodic information after the past prosody correction will be described. The dynamic prosody generation means 142 generates the phrase command “h” and the accent command “a” as the basic prosody generation parameter 321, and also generates the correction condition 322. The generated correction condition 322 is sent to the prosody generation rule control means 152. Here, the generated correction condition states that the position in the sentence is the end of the sentence, that the phoneme symbol string (target phoneme symbol, preceding phoneme symbol, subsequent phoneme symbol) is (/a/, /k/, /-/), that the number of morae is 3, and that the accent type is type 0.
[0115]
The prosody generation rule control means 152 searches the correction history DB 116 based on the received correction condition and, as a result, obtains the correction history data 327 that matches the correction condition. The correction history data 327 includes the correction content 323 associated with the received correction condition (“a” is changed to “a′”, that is, the pitch pattern is changed from the dotted-line pattern to the solid-line pattern in the upper left diagram). Accordingly, the accent command “a” is corrected to “a′”, and the modified prosody generation parameter 324 is determined. The modified prosody generation parameter 324 is sent to the dynamic prosody generation means 142.
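The lookup performed here can be pictured as a keyed search. The following is a minimal sketch under the assumption that a correction condition is an exact-match key and that the correction content is a set of replacement parameter values; the class name, field names, and dictionary layout are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorrectionCondition:
    position: str      # e.g. "sentence_end"
    phonemes: tuple    # (target, preceding, following)
    mora_count: int
    accent_type: int

def apply_history(params, condition, history_db):
    """Return prosody generation parameters with matching past corrections applied.

    history_db maps a CorrectionCondition to {parameter_name: corrected_value}.
    Parameters without a matching entry are returned unchanged.
    """
    corrected = dict(params)
    for name, value in history_db.get(condition, {}).items():
        corrected[name] = value
    return corrected
```

For instance, with an entry stored for the sentence-final condition (/a/, /k/, /-/), 3 morae, type 0, a later request with the same condition has its accent command replaced while the phrase command passes through untouched.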
[0116]
Based on the received modified prosody generation parameter 324, the dynamic prosody generation means 142 determines the prosody generation rule using the phrase command “h” and the accent command “a′” as the modified prosody generation parameters, and generates dynamic prosodic information from the language information in accordance with the determined prosody generation rule. The generated dynamic prosody information is sent to the prosody modification unit 103.
[0117]
As a result, when correction history data matching the correction condition generated by the dynamic prosody generation means 142 is already included in the correction history information, dynamic prosodic information can be generated with the same correction as the one performed in the past applied automatically.
[0118]
In the above description, the Fujisaki model has been described as a model for generating a pitch pattern. However, corrections made in the past can be automatically reflected using other models as well. When performing control other than the pitch pattern, for example, when controlling the duration or power pattern in the prosodic information, it is necessary to use a model suitable for each prosodic parameter to be controlled.
[0119]
In the above description, the configuration in which the correction history holding unit has a single correction history DB has been described. However, as described in the second embodiment, a configuration having a plurality of correction history DBs may be employed. FIGS. 12 and 13 are block diagrams conceptually showing the configuration of a text-to-speech synthesizer including a dynamic prosody generation unit of the second correction history reference method and a correction history holding unit having a plurality of correction history DBs with different prosodic styles. The configuration is substantially the same as that of the text-to-speech synthesizer shown in FIG. 10, except that the correction history management unit has DB selection control means and the correction history holding unit has a plurality of correction history DBs.
[0120]
The text-to-speech synthesizers shown in FIGS. 12 and 13 have the same configuration as the text-to-speech synthesizers shown in FIGS. 7 and 9, respectively, except that the dynamic prosody generation unit having the static prosody generation means 112, the filtering means 122, and the filter control means 132 is replaced with a dynamic prosody generation unit having the dynamic prosody generation means 142 and the prosody generation rule control means 152.
[0121]
A method for referring to and updating a plurality of correction history DBs in the text-to-speech synthesizer shown in FIGS. 12 and 13 is the same as that in the second embodiment, and a description thereof will be omitted.
[0122]
(Embodiment 4)
In the fourth embodiment, a text-to-speech synthesizer having a dynamic prosody generation unit to which the third correction history reference method is applied will be described. Reference is made to FIGS. 14 and 15 as necessary. FIG. 14 is a block diagram conceptually showing the structure of the text-to-speech synthesizer according to the fourth embodiment. FIG. 15 is an explanatory diagram for explaining in detail the characteristic part of the text-to-speech synthesizer according to the fourth embodiment.
[0123]
The text-to-speech synthesizer shown in FIG. 14 includes: a text holding unit 107; a language analysis unit 101; a dynamic prosody generation unit 102 including prosodic pattern piece selection means 162, prosodic pattern piece correcting means 172, pattern piece correction control means 182, and prosodic pattern piece holding means 192; a prosody modification unit 103; a display unit 117; a correction command input unit 127; a correction history holding unit 106 including a correction history DB; a correction history management unit 105 including correction history extraction means 115 and correction history update means 125; a synthesized speech generation unit 104 including segment holding means 134, segment selection means 114, and speech synthesis means 124; and an audio output unit 137.
[0124]
The operation of the text-to-speech synthesizer shown in FIG. 14 will be described. Since the configuration other than the dynamic prosody generation unit is the same as that of the first embodiment, the description thereof is omitted.
[0125]
The prosodic pattern piece selection means 162 selects, based on the received language information, the one optimal prosodic pattern piece from the prosodic pattern piece holding means 192 as the selected prosodic pattern piece, and generates a correction condition. The selected prosodic pattern piece and the generated correction condition are sent to the prosodic pattern piece correcting means 172.
[0126]
The prosodic pattern piece correcting means 172 forwards the received correction condition to the pattern piece correction control means 182 so that the correction to be applied to the selected prosodic pattern piece can be determined.
[0127]
The pattern piece correction control means 182 sends the received correction condition to the correction history extraction means 115 of the correction history management unit 105 in order to check whether correction history data matching the correction condition is included in the correction history DB of the correction history holding unit 106.
[0128]
The correction history extraction means 115 searches the correction history DB of the correction history holding unit 106 based on the received correction condition. If correction history data matching the correction condition exists, that correction history data is extracted and sent to the pattern piece correction control means 182. If a plurality of correction history data match the correction condition, all of them are sent to the pattern piece correction control means 182. If no correction history data matches the correction condition, the pattern piece correction control means 182 is notified of this fact.
[0129]
The pattern piece correction control means 182 determines the correction content for the selected prosodic pattern piece based on the received correction history data. The determined correction content is sent to the prosodic pattern piece correcting means 172. Here, when a plurality of correction history data is sent from the correction history extraction means 115, the correction content for the prosodic pattern piece is determined so as to reflect the correction contents of all of the correction history data. When there is a notification that no correction history data exists, the prosodic pattern piece correcting means 172 is notified that no correction is to be performed, or that the prosodic pattern piece is to be left uncorrected.
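The branching described above can be sketched as follows. The source does not state how several matching corrections are combined, so the averaging used here is purely an assumption for illustration, as is the representation of each correction as a list of per-frame pitch offsets; both function names are hypothetical.

```python
def decide_correction(matches):
    """Decide the correction content from the matching history entries.

    matches: list of per-frame pitch offset lists (empty if nothing matched).
    Returns None to signal "no correction", otherwise one combined offset list.
    Combining by averaging is an assumption, not stated in the source.
    """
    if not matches:
        return None
    n = len(matches)
    return [sum(frame) / n for frame in zip(*matches)]

def correct_piece(piece, matches):
    """Apply the decided correction to a prosodic pattern piece (list of Hz values)."""
    offsets = decide_correction(matches)
    if offsets is None:
        return list(piece)  # no matching history: the piece passes through unchanged
    return [p + d for p, d in zip(piece, offsets)]
```

A different combination policy, such as applying only the most recent match, would fit in `decide_correction` without changing the caller.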
[0130]
The prosodic pattern piece correcting means 172 corrects the selected prosodic pattern piece based on the correction content determined by the pattern piece correction control means 182 to generate dynamic prosodic information. The generated dynamic prosody information is sent to the prosody modification unit 103 together with the language information. Note that the correction in the prosodic pattern piece correcting means 172 is performed automatically according to the correction history information, and that the dynamic prosody information is prosodic information in which the correction history information is reflected.
[0131]
Here, a method of correcting the selected prosodic pattern in the prosodic pattern piece correcting means 172 will be described in detail based on the specific example shown in FIG. Note that the prosodic information to be corrected may be duration time, power, pitch pattern, and the like. In this specific example, a case where the prosodic parameter included in the generated prosodic information is a pitch pattern will be described. In the following description, the reference numerals shown in FIG. 14 are assigned to the members and the means.
[0132]
First, prosody correction performed in the past will be described. In the prosody modification unit 103, modification 337 using a mouse is performed on the graphically displayed pitch pattern. Here, the pitch pattern 334a (selected prosodic pattern piece) before correction is indicated by a dotted line, and the corrected pitch pattern 334b (corrected prosodic pattern piece) is indicated by a solid line.
[0133]
Along with this prosody correction, correction history data 338 is generated, and the generated correction history data 338 is stored in the correction history DB 116.
[0134]
The generated correction history data 338 includes, as the correction condition, that the position in the sentence is the end of the sentence, that the number of morae is 3, and that the accent type is type 2. As the correction content, it includes that the pitch pattern 334a is corrected to the corrected pitch pattern 334b. In this specific example, the corrected pitch pattern itself is held as the correction content.
[0135]
Next, generation of dynamic prosodic information after the past prosody correction will be described. The prosodic pattern piece selection means 162 selects, based on the language information, the optimal prosodic pattern piece from the prosodic pattern piece DB 334 as the selected prosodic pattern piece, and generates the correction condition 331. The selected prosodic pattern piece and the generated correction condition are sent to the prosodic pattern piece correcting means 172. The generated correction condition 331 states that the position in the sentence is the end of the sentence, that the number of morae is 3, and that the accent type is type 2.
[0136]
The prosody pattern piece correcting means 172 sends the sent correction condition to the pattern piece correction control means 182.
[0137]
The pattern piece correction control means 182 searches the correction history DB 116 based on the correction condition and, as a result, obtains the correction history data 338 that matches the correction condition. Because the correction history data 338 contains correction content matching the correction condition (the selected pitch pattern 334a is changed to the held pitch pattern 334b, that is, from the dotted-line pattern to the solid-line pattern in the upper left diagram), this correction content is sent to the prosodic pattern piece correcting means 172.
[0138]
The prosodic pattern piece correcting means 172 corrects the prosodic pattern 334a to the prosodic pattern 334b based on the received correction contents, and generates dynamic prosodic information 336. The generated dynamic prosody information is sent to the prosody modification unit 103.
[0139]
Thereby, when correction history data matching the correction condition generated by the prosodic pattern piece selection means 162 is already included in the correction history information, dynamic prosodic information can be generated with the same correction as the one made in the past applied automatically.
[0140]
In the above, the corrected pitch pattern itself is held, but any method of holding it may be used as long as the corrected pitch pattern can ultimately be reproduced. For example, the corrected pitch pattern can also be reproduced by storing, for each time point, the difference from the pitch pattern selected from the prosodic pattern piece DB 334.
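The difference-based alternative can be illustrated with a short sketch. It assumes that both pitch patterns are sampled at the same time points and stored as equal-length lists; the function names are hypothetical.

```python
def to_difference(selected, corrected):
    """Per-time-point difference between the corrected and the selected pitch pattern."""
    return [c - s for s, c in zip(selected, corrected)]

def from_difference(selected, diff):
    """Reproduce the corrected pitch pattern from the selected pattern and the stored difference."""
    return [s + d for s, d in zip(selected, diff)]
```

Storing the difference rather than the full contour keeps the history entry meaningful only together with the pattern piece it was derived from, so the entry would need to record which piece it applies to.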
[0141]
In the above description, the configuration in which the correction history holding unit 106 has a single correction history DB 116 has been described. However, as described in the second embodiment, a configuration having a plurality of correction history DBs may be employed. FIGS. 16 and 17 are block diagrams conceptually showing the configuration of a text-to-speech synthesizer including a dynamic prosody generation unit of the third correction history reference method and a correction history holding unit having a plurality of correction history DBs with different prosodic styles. The configuration is substantially the same as that of the text-to-speech synthesizer shown in FIG. 14, except that the correction history management unit has DB selection control means and the correction history holding unit has a plurality of correction history DBs.
[0142]
The text-to-speech synthesizers shown in FIGS. 16 and 17 have the same configuration as the text-to-speech synthesizers shown in FIGS. 7 and 9, respectively, except that the dynamic prosody generation unit having the static prosody generation means 112, the filtering means 122, and the filter control means 132 is replaced with a dynamic prosody generation unit having the prosodic pattern piece selection means 162, the prosodic pattern piece correcting means 172, the pattern piece correction control means 182, and the prosodic pattern piece holding means 192.
[0143]
The reference method and the update method of the plurality of correction history DBs in the text-to-speech synthesizer shown in FIGS. 16 and 17 are the same as those in the second embodiment, and a description thereof will be omitted.
[0144]
[Effects of the invention]
As described above, the text-to-speech synthesizer of the present invention includes a correction history holding unit that holds correction history information, a correction history management unit that manages the correction history information, and a dynamic prosody generation unit that generates dynamic prosody information based on language information with reference to the correction history information via the correction history management unit. With this configuration, a correction identical to a past correction registered in the correction history information need not be performed again, and because learning of the correction history information progresses step by step with each correction, the device can easily generate prosodic information that suits the end user's preference.
[0145]
The dynamic prosody generation unit may adopt the first correction history reference method, in which static prosody information is generated according to a static prosody generation rule and the generated static prosody information is corrected according to the correction history information to generate dynamic prosody information.
[0146]
The dynamic prosody generation unit may adopt the second correction history reference method, in which the prosody generation parameter settings of the prosody generation rule are corrected according to the correction history information and dynamic prosody information is generated according to a dynamic prosody generation rule that changes with the prosody generation parameter settings.
[0147]
The dynamic prosody generation unit may adopt the third correction history reference method, in which one optimal prosodic pattern piece is selected from a plurality of prosodic pattern pieces as the selected prosodic pattern piece based on the language information according to a static prosody selection rule, and the selected prosodic pattern piece is corrected according to the correction history information to generate dynamic prosodic information.
[0148]
A text-to-speech synthesizer equipped with a dynamic prosody generation unit adopting any one of the first to third correction history reference methods can reliably and easily generate prosodic information that matches the end user's preference.
[0149]
Further, the correction history holding unit is configured with a plurality of correction history DBs having different prosodic styles, and the correction history management unit selectively refers to and selectively updates the correction history DBs. As a result, prosody correction that reflects a desired prosodic style chosen from among the plurality of prosodic styles can be applied automatically, and the correction history information as a whole can be updated (learned) flexibly and effectively.
[0150]
The text-to-speech synthesis method of the present invention includes a correction history reference step for referring to correction history information, a dynamic prosody generation step for generating dynamic prosody information based on language information in cooperation with the correction history reference step, and a correction history update step for updating the correction history information according to a correction. As a result, a correction identical to a past correction registered in the correction history information need not be performed again, and because learning of the correction history information progresses step by step with each past correction, prosodic information suited to the end user's preference can be generated easily.
[0151]
The text-to-speech synthesis program according to the present invention includes correction history reference program code for referring to correction history information, dynamic prosody generation program code for generating dynamic prosody information based on language information in cooperation with the correction history reference program code, and correction history update program code for updating the correction history information according to a correction. As a result, a correction identical to a past correction registered in the correction history information need not be performed again, and because learning of the correction history information progresses step by step with each past prosody correction, a text-to-speech synthesis method that easily generates prosodic information matching the end user's preference can be realized.
[Brief description of the drawings]
FIG. 1 is a block diagram conceptually showing the structure of a text-to-speech synthesizer according to the present invention.
FIG. 2 is a block diagram conceptually showing the structure of the text-to-speech synthesis method according to the present invention.
FIG. 3 is a block diagram conceptually showing the structure of a text-to-speech synthesizer having a plurality of correction history DBs in a correction history holding unit.
FIG. 4 is a set of block diagrams conceptually showing configurations of a correction history management unit in a text-to-speech synthesizer having a plurality of correction history DBs in a correction history holding unit: FIG. 4A shows a correction history management unit including DB selection control means, FIG. 4B shows a correction history management unit including reference DB selection control means and update DB selection control means, and FIG. 4C shows a correction history management unit including common DB selection control means and selection DB changing means.
FIG. 5 is a block diagram conceptually showing the structure of a text-to-speech synthesizer provided with a dynamic prosody generation unit of the first correction history reference method.
FIG. 6 is an explanatory diagram for explaining the first correction history reference method.
FIG. 7 is a block diagram conceptually illustrating a first configuration example of a text-to-speech synthesizer including a correction history holding unit including a plurality of correction history DBs and a dynamic prosody generation unit of the first correction history reference method.
FIG. 8 is an explanatory diagram for explaining a selective update method of correction history information and a selective extraction method of correction history information in a correction history management unit.
FIG. 9 is a block diagram conceptually illustrating a second configuration example of a text-to-speech synthesizer including a correction history holding unit including a plurality of correction history DBs and a dynamic prosody generation unit of the first correction history reference method.
FIG. 10 is a block diagram conceptually showing the structure of a text-to-speech synthesizer including a dynamic prosody generation unit of the second correction history reference method.
FIG. 11 is an explanatory diagram for explaining the second correction history reference method.
FIG. 12 is a block diagram conceptually illustrating a first configuration example of a text-to-speech synthesizer including a correction history holding unit including a plurality of correction history DBs and a dynamic prosody generation unit of the second correction history reference method.
FIG. 13 is a block diagram conceptually illustrating a second configuration example of a text-to-speech synthesizer including a correction history holding unit including a plurality of correction history DBs and a dynamic prosody generation unit of the second correction history reference method.
FIG. 14 is a block diagram conceptually showing the structure of a text-to-speech synthesizer including a dynamic prosody generation unit of the third correction history reference method.
FIG. 15 is an explanatory diagram for explaining the third correction history reference method.
FIG. 16 is a block diagram conceptually illustrating a first configuration example of a text-to-speech synthesizer including a correction history holding unit including a plurality of correction history DBs and a dynamic prosody generation unit of the third correction history reference method.
FIG. 17 is a block diagram conceptually illustrating a second configuration example of a text-to-speech synthesizer including a correction history holding unit including a plurality of correction history DBs and a dynamic prosody generation unit of the third correction history reference method.
FIG. 18 is a block diagram conceptually showing the structure of a conventional text-to-speech synthesizer.
[Explanation of symbols]
101 Language analysis unit
102 Dynamic prosody generation unit
112 Static prosody generation means
122 Filtering means
132 Filter control means
142 Dynamic prosody generation means
152 Prosody generation rule control means
162 Prosodic pattern piece selection means
172 Prosodic pattern piece correcting means
182 Pattern piece correction control means
192 Prosodic pattern piece holding means
103 Prosody modification unit
104 Synthesized speech generation unit
114 Segment selection means
124 Speech synthesis means
134 Segment holding means
105 Correction history management unit
115 Correction history extraction means
125 Correction history update means
135 DB selection control means
145 Common DB selection control means
155 Reference DB selection control means
165 Update DB selection control means
175 Selection DB changing means
106 Correction history holding unit
116 Correction history DB
107 Text holding unit
117 Display unit
127 Correction command input unit
137 Audio output unit
147 Selection command input unit
108 Selection command generation unit
201 Language analysis step (language analysis program code)
202 Dynamic prosody generation step (dynamic prosody generation program code)
203 Prosody modification step (prosody modification program code)
204 Synthesized speech generation step (synthesized speech generation program code)
215 Correction history extraction step (correction history extraction program code)
225 Correction history update step (correction history update program code)

Claims

A language analysis unit that performs language analysis on text data and extracts language information;
A correction history holding unit for holding correction history information;
A correction history management unit for managing the correction history information;
A dynamic prosody generation unit that generates dynamic prosody information based on the language information with reference to the correction history information via the correction history management unit;
A prosody modification unit that generates definite prosodic information by correcting the dynamic prosody information according to an external correction command, and updates the correction history information via the correction history management unit based on the correction made according to the external correction command; and
A synthesized speech generation unit that generates synthesized speech based on the language information and the definite prosodic information;
The correction history management unit has correction history extraction means for extracting correction history information referred to by the dynamic prosody generation unit, and correction history update means for updating correction history information held in the correction history holding unit,
The correction history holding unit has a plurality of correction history databases having different prosodic styles,
A text-to-speech synthesizer, wherein the correction history management unit further has database selection control means for controlling selection of the plurality of correction history databases in accordance with a selection command.

The text-to-speech synthesizer according to claim 1.
The dynamic prosody generation unit has static prosody generation means for generating static prosody information according to a static prosody generation rule, filtering means for generating the dynamic prosody information by correcting the static prosody information generated by the static prosody generation means, and filter control means for controlling the correction by the filtering means in accordance with the correction history information.

The text-to-speech synthesizer according to claim 1.
The dynamic prosody generation unit has dynamic prosody generation means for generating the dynamic prosody information according to a dynamic prosody generation rule that changes according to the setting of a prosody generation parameter, and prosody generation rule control means for controlling the prosody generation parameter according to the correction history information.

The text-to-speech synthesizer according to claim 1.
The dynamic prosody generation unit has prosodic pattern piece holding means for holding a plurality of prosodic pattern pieces, prosodic pattern piece selection means for selecting one of the plurality of prosodic pattern pieces as a selected prosodic pattern piece according to a static prosody selection rule, prosodic pattern piece correcting means for generating the dynamic prosodic information by correcting the selected prosodic pattern piece, and pattern piece correction control means for controlling the correction of the selected prosodic pattern piece according to the correction history information.

The text-to-speech synthesizer according to claim 4.
The text-to-speech synthesizer further comprises a selection command input unit for inputting the selection command.

The text-to-speech synthesizer according to claim 4.
The text-to-speech synthesizer further comprises a selection command generation unit that detects style selection information included in the text data and generates the selection command based on the style selection information.

The text-to-speech synthesizer according to claim 4.
The correction history management unit has a common database selection control means for selecting at least one of the plurality of correction history databases as a common correction history database in response to the selection command,
The dynamic prosody generation unit selectively refers to correction history information included in the common correction history database;
The text-to-speech synthesizer, wherein the prosody modification unit selectively updates modification history information included in the common modification history database.

The text-to-speech synthesizer according to claim 4.
The selection command includes a reference selection command and an update selection command;
The correction history management unit has reference database selection control means for selecting at least one of the plurality of correction history databases as a reference correction history database according to the reference selection command, and update database selection control means for selecting at least one of the plurality of correction history databases as an update correction history database according to the update selection command;
The dynamic prosody generation unit selectively refers to correction history information included in the reference correction history database,
The text-to-speech synthesizer, wherein the prosody modification unit selectively updates modification history information included in the update modification history database.

A text-to-speech synthesis method comprising:
a language analysis step of performing language analysis on text data and extracting language information;
a correction history holding step of holding correction history information;
a correction history management step of managing the correction history information;
a dynamic prosody generation step of generating dynamic prosody information based on the language information with reference to the correction history information via the correction history management step;
a prosody modification step of generating fixed prosody information by correcting the dynamic prosody information according to an external correction command, and updating the correction history information via the correction history management step based on the correction made according to the external correction command; and
a synthesized speech generation step of generating synthesized speech based on the language information and the fixed prosody information,
wherein the correction history management step includes a correction history extraction step of extracting the correction history information referred to in the dynamic prosody generation step, and a correction history update step of updating the correction history information held in the correction history holding step,
the correction history holding step further holds a plurality of correction history databases having different prosodic styles, and
the correction history management step further includes a database selection control step of controlling selection among the plurality of correction history databases in response to a selection command.
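
The flow of steps in this method claim can be sketched end to end as below. Everything concrete here is an assumption made for illustration: language analysis is reduced to word splitting, prosody to one relative pitch value per word, the correction condition to the word itself, and synthesis to pairing words with pitch values. Database selection by style is omitted to keep the sketch short.

```python
def analyze(text):
    """Language analysis step (toy version): lower-case and split into words."""
    return text.lower().split()


def generate_dynamic_prosody(words, history, default_pitch=1.0):
    """Dynamic prosody generation: reuse a past correction when the word matches."""
    return [history.get(word, default_pitch) for word in words]


def modify_prosody(words, prosody, corrections, history):
    """Prosody modification: apply external corrections and record them."""
    fixed = list(prosody)
    for index, pitch in corrections.items():
        fixed[index] = pitch
        history[words[index]] = pitch  # correction history update
    return fixed


def synthesize(words, fixed_prosody):
    """Synthesis stub: pair each word with its fixed pitch value."""
    return list(zip(words, fixed_prosody))


history = {}
words = analyze("Good morning")
fixed = modify_prosody(words, generate_dynamic_prosody(words, history), {1: 1.3}, history)
first = synthesize(words, fixed)

# A later text reuses the correction recorded for "morning" automatically.
later_words = analyze("Morning news")
later = generate_dynamic_prosody(later_words, history)
```

The second call shows the point of the method: once a correction is recorded, matching input receives it without a new external correction command.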

The text-to-speech synthesis method according to claim 9, wherein
the dynamic prosody generation step generates static prosody information according to a static prosody generation rule, and generates the dynamic prosody information by correcting the static prosody information according to the correction history information.
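
One way to read this claim is as a two-stage process: a fixed rule produces a baseline contour, and the correction history is then applied on top of it. In the sketch below, the linear declination rule, the values in Hz, and the per-word additive offsets are all assumptions; the claim does not state what form the rule or the corrections take.

```python
def static_prosody(count, start_hz=200.0, step_hz=-10.0):
    """Hypothetical static rule: pitch declines linearly across the units."""
    return [start_hz + step_hz * i for i in range(count)]


def apply_history(words, static, history):
    """Correct the static contour with per-word offsets from the history."""
    return [f0 + history.get(word, 0.0) for word, f0 in zip(words, static)]


words = ["a", "b", "c"]
dynamic = apply_history(words, static_prosody(len(words)), {"b": 15.0})
```

The static rule itself is never altered in this arrangement; only its output is adjusted.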

The text-to-speech synthesis method according to claim 9, wherein
the dynamic prosody generation step generates a modified prosody generation parameter by correcting, according to the correction history information, a prosody generation parameter that determines a prosody generation rule, and generates the dynamic prosody information according to the prosody generation rule using the modified prosody generation parameter.
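
In contrast to correcting the output of a rule, this claim corrects the parameters that drive the rule. The sketch below assumes a linear rule with two hypothetical parameters, `start_hz` and `step_hz`, and assumes the correction history stores replacement values for those parameters; none of these details come from the claim.

```python
def prosody_rule(count, params):
    """Hypothetical rule: linear pitch contour driven by two parameters."""
    return [params["start_hz"] + params["step_hz"] * i for i in range(count)]


def modified_parameters(params, history):
    """Return a corrected copy of the parameters; the originals are untouched."""
    return {**params, **history}


base = {"start_hz": 200.0, "step_hz": -10.0}
modified = modified_parameters(base, {"step_hz": -5.0})
dynamic = prosody_rule(3, modified)
```

Keeping the base parameters unchanged means the correction can be dropped again simply by generating with an empty history.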

The text-to-speech synthesis method according to claim 9, wherein
the dynamic prosody generation step selects one of a plurality of prosodic pattern pieces as a selected prosodic pattern piece based on a static prosody selection rule, and generates the dynamic prosody information by correcting the selected prosodic pattern piece according to the correction history information.
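
The pattern piece variant of this claim can be sketched as a lookup followed by a correction. The two pieces, the punctuation-based selection rule, and the per-piece additive offsets stored in the history are hypothetical; the claim leaves the inventory of pieces and the selection criteria open.

```python
# Hypothetical inventory of prosodic pattern pieces (pitch values in Hz).
PIECES = {"flat": [100.0, 100.0], "rise": [100.0, 120.0]}


def select_piece(sentence):
    """Hypothetical static selection rule: questions get a rising piece."""
    return "rise" if sentence.endswith("?") else "flat"


def dynamic_prosody(sentence, history):
    """Select a piece, then correct it with offsets recorded for that piece."""
    name = select_piece(sentence)
    piece = PIECES[name]
    offsets = history.get(name, [0.0] * len(piece))
    return [value + delta for value, delta in zip(piece, offsets)]
```

For example, a history entry `{"rise": [0.0, 10.0]}` raises the final pitch of every question while leaving statements unchanged.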

A text-to-speech synthesis program that causes a computer to execute:
a language analysis step of performing language analysis on text data and extracting language information;
a correction history holding step of holding correction history information;
a correction history management step of managing the correction history information;
a dynamic prosody generation step of generating dynamic prosody information based on the language information with reference to the correction history information via the correction history management step;
a prosody modification step of generating fixed prosody information by correcting the dynamic prosody information according to an external correction command, and updating the correction history information via the correction history management step based on the correction made according to the external correction command; and
a synthesized speech generation step of generating synthesized speech based on the language information and the fixed prosody information,
wherein the correction history management step includes a correction history extraction step of extracting the correction history information referred to in the dynamic prosody generation step, and a correction history update step of updating the correction history information held in the correction history holding step,
the correction history holding step further holds a plurality of correction history databases having different prosodic styles, and
the correction history management step further includes a database selection control step of controlling selection among the plurality of correction history databases in response to a selection command.