JP3568972B2

JP3568972B2 - Voice synthesis method and apparatus

Info

Publication number: JP3568972B2
Application number: JP12926393A
Authority: JP
Inventors: 義幸原
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1993-05-31
Filing date: 1993-05-31
Publication date: 2004-09-22
Anticipated expiration: 2019-09-22
Also published as: JPH06337698A

Description

【０００１】
【産業上の利用分野】
本発明は文字コード列、または韻律情報と音韻系列とから合成音声を生成する音声合成方法および装置に関する。
【０００２】
【従来の技術】
近時、漢字かな混じりの文を解析し、その文が示す音声情報を規則合成法により音声合成して出力する音声合成装置が種々開発されている。そして、この種の音声合成装置は、銀行業務における電話紹介サービスや、新聞校閲システム、文書読み上げ装置等として幅広く利用され始めている。
【０００３】
この種の規則合成法を採用した音声合成装置は、基本的には人間が発声した音声を予めある単位、例えばＣＶ（子音、母音）、ＣＶＣ（子音、母音、子音）、ＶＣＶ（母音、子音、母音）、ＶＣ（母音、子音）毎にＬＳＰ（線スペクトル対）分析やケプストラム分析等の手法を用いて分析して求められる音韻情報を音声素片ファイルに登録しておき、この音声素片ファイルを参照して合成パラメータ（音韻パラメータと韻律パラメータ）を生成し、これらの合成パラメータをもとにして音源の生成と合成フィルタリング処理を行うことにより合成音声を生成するものである。
【０００４】
従来、このような音声合成装置は、リアルタイムに処理するために専用のハードウェアを必要としている。この音声合成装置のシステム構成には大きく分けて次の２種がある。
【０００５】
第１の構成は、パーソナルコンピュータ（ＰＣ）などのホスト計算機が漢字かな混じり文を韻律情報と音韻系列に変換し（言語処理）、専用のハードウェアで合成パラメータの生成、音源の生成、合成フィルタリング、Ｄ／Ａ（ディジタル／アナログ）変換を行うものである。これに対して第２の構成は、漢字かな混じり文から音声を生成するまでの全ての処理を専用のハードウェアで行うものである。いずれの構成における専用ハードウェアも、積和演算が高速なＤＳＰ（ディジタル・シグナル・プロセッサ）と呼ばれるＬＳＩあるいは合成ＬＳＩと汎用のＭＰＵ（マイクロプロセッサユニット）で構成されるのが殆どである。
【０００６】
このような専用装置では、漢字かな混じり文を単語辞書を用いて形態素解析や係り受け解析を行って、韻律情報と音韻系列からなる音声記号列（韻律情報と音韻系列を記号やカタカナで表現したもの）が生成される。この音声記号列の韻律情報に基づいて基本周波数の列（韻律パラメータ）が生成され、音韻系列の音韻に対応する音韻パラメータを音声素片ファイルから取り出して音韻間を接続して音韻パラメータが生成される。このとき、音声記号列にポーズ区間を表す記号が含まれているときはそのポーズ部に対して韻律パラメータ、音韻パラメータとも「０」を設定する。
【０００７】
以上に述べた処理は汎用のＭＰＵ（ＣＰＵ）で行われ、生成された合成パラメータ（韻律パラメータと音韻パラメータ）は音声合成部に入力される。音声合成部には、ＤＳＰや合成ＬＳＩが使用される。この音声合成部では、合成パラメータをＤＳＰや合成ＬＳＩがフレーム周期毎に入力し、そのパラメータに基づいて音源生成や合成フィルタリングを行い、サンプリング周期毎にＤ／Ａ変換器に出力する。
【０００８】
一方、パーソナルコンピュータ（ＰＣ）やエンジニアリング・ワーク・ステーション（ＥＷＳ）においても、処理能力が高まったことと、標準でＤ／Ａ変換器、アナログ出力部およびスピーカを搭載したことで、上記の処理をリアルタイムにソフトウェア処理だけで行えるようになりつつある。また、近時、ＥＷＳだけでなくＰＣにも、マルチタスク可能なＯＳ（オペレーティングシステム）が採用され始めてきている。
【０００９】
しかしながら、このようなＯＳはリアルタイム性が保証されていないものが殆どである。つまり、音声合成以外の処理タスクが少ない場合は問題ないが、タスクが多くなると音声合成以外の処理にＣＰＵが使用され、音声合成がリアルタイムに処理できなくこともある。
【００１０】
このような理由から、ＰＣやＥＷＳのマルチタスクＯＳ下のもとでソフトウェア処理のみでリアルタイムに音声合成するためには、音声合成に要する時間をできるだけ短縮させることが重要である。
【００１１】
【発明が解決しようとする課題】
このように、上記した従来の音声合成装置にあっては、専用の装置では音声合成部のリアルタイム性が保証されているものの、マルチタスク可能なＯＳを採用したパーソナルコンピュータ（ＰＣ）やエンジニアリング・ワーク・ステーション（ＥＷＳ）のソフトウェア処理により実現される装置では、処理タスクが少ないときにはリアルタイムに音声合成できていたものが、タスクが多くなるとリアルタイムにできない等の不具合があった。
【００１２】
本発明はこのような事情を考慮してなされたもので、その目的とするところは、音声合成に要する時間を短縮でき、もって音声合成がリアルタイムに行える音声合成方法および装置を提供することにある。
【００１３】
【課題を解決するための手段】
本発明に係る音声合成方法および装置は、ポーズ部は出力される合成音声が無音であり、有音とならないような信号（出力「０」の信号）を出力すればよいことに着目し、入力された文字コード列から生成された音声記号列のうちポーズ記号部分を除外し、音韻系列に従って対応する音韻パラメータを生成すると共に韻律情報に従って韻律パラメータを生成し、このポーズ記号部分のパラメータを含まない音韻パラメータおよび韻律パラメータから合成フィルタリング処理により音声波形データを生成し、ポーズ記号部分に対しては、生成された音声波形データから対応する音声を出力する音声出力器のサンプリング周波数に基づき、そのポーズ長分の時間だけ無音となるようなポーズデータを別に生成し、このポーズデータを上記の合成フィルタリング処理により生成された音声波形データに付加して音声を生成するようにしたことを特徴とするものである。
【００１５】
【作用】
上記の構成においては、入力された文字コード列からポーズ記号を含む音声記号列が生成される。そして、生成された音声記号列のうちポーズ記号部分を除外して、音韻パラメータおよび韻律パラメータ（合成パラメータ）が生成される。これにより、有音部に対しては合成フィルタリング処理が実行されて音声波形データ（音声ディジタルデータ）が生成されるものの、ポーズ部分に対しては、合成パラメータ中に対応するパラメータが含まれないことから、合成フィルタリング処理は実行されない。そして、これに代えて、音声出力器のサンプリング周波数に基づき、ポーズ記号部分のポーズ長分の時間だけ無音となるようなポーズデータが生成されて、有音部の音声波形データに付加される。したがって、このポーズデータが付加された音声波形データを音声出力することにより、ポーズ区間を含む音声出力が正しく行われる。
【００１７】
このように、上記の構成によれば、音声合成処理の中で最も時間を要する合成フィルタリングを有音部についてのみ実行し、無音部（ポーズ部）については、対応する時間（期間）中有音となるような信号が出力されないように、その時間分のポーズデータを生成するようにしているため、無音部を含む音声出力を行いながらも、無音部の合成フィルタリングを行わないで済む分だけ、音声合成の処理時間を短縮させることが可能となる。
【００１８】
【実施例】
［第１実施例］
まず、本発明の第１実施例を説明する。
図１は同実施例に係る音声合成装置の概略構成を示すブロック図である。
【００１９】
図１に示す音声合成装置は、音声合成の対象とする漢字かな混じりの文字コード列の入力を司る入力部１と、音声合成の対象となる単語や句等についてのアクセント型、読み、品詞情報等が予め登録されている単語辞書２と、言語処理部３とを有する。この言語処理部３は、入力部１により入力された文字コード列を単語辞書２を用いて解析し、対応する音韻系列および韻律情報を生成する言語処理を司る。
【００２０】
図１に示す音声合成装置はまた、予め任意の音声単位毎に入力音声を分析することにより求められたケプストラムパラメータ群が格納されている音声素片ファイル４と、言語処理部３にて生成された音韻系列に従う音韻パラメータ（ここでは、音韻のケプストラムパラメータ）の生成、および言語処理部３にて生成された韻律情報に従う韻律パラメータの生成を行う合成パラメータ生成部５と、ポーズ生成部６とを有する。このポーズ生成部６は、言語処理部３から合成パラメータ生成部５に与えられる音韻系列および韻律情報のうち、ポーズ区間を表す記号に基づいて、その長さだけ「０」を表すポーズデータ（Ｄ／Ａ変換器の出力信号が「０」になるようなデータ）を作成する。このポーズデータは、次に述べる音声合成部７から出力される音声波形データ（音声ディジタルデータ）に付加される。
【００２１】
図１に示す音声合成装置はまた、合成パラメータ生成部５によって生成された音韻パラメータおよび韻律パラメータをもとに、音源の生成と、合成フィルタリング処理を行って合成音声を生成する音声合成部７と、オーディオデバイス８と、音声出力用のスピーカ９とを有する。オーディオデバイス８には、音声合成部７から出力される音声ディジタルデータおよびポーズ生成部６から出力されて同データに付加されるポーズデータ（ディジタルデータ）をアナログ信号に変換するＤ／Ａ変換器、折り返し雑音除去フィルタ、およびパワーアンプなど周知の構成（図示せず）が含まれている。
【００２２】
以上の構成の音声合成装置は、マルチタスクを実行するパーソナルコンピュータ（ＰＣ）やエンジニアリング・ワーク・ステーション（ＥＷＳ）によって実現されるもので、入力部１、言語処理部３、合成パラメータ生成部５、ポーズ生成部６および音声合成部７（内の音源生成、フィルタリング処理部分）は、ＣＰＵのプログラム処理（音声合成処理用タスクの実行）によって実現される機能ブロックである。
【００２３】
次に、図１に示す音声合成装置の動作を図２のフローチャートを参照して説明する。
まず入力部１により、音声合成の対象とする漢字かな混じりの文字コード列、例えば「明日は説明会があります。よろしくお願いします。」が入力されたとする。
【００２４】
入力部１は、句点「。」で１文「明日は説明会があります。」を切り出す（図２ステップＳ１）。この入力部１により切り出された１文は、言語処理部３に与えられる。
【００２５】
言語処理部３は、入力部１により切り出された文字コード列（１文）「明日は説明会があります。」と単語辞書２とを照合し、この入力文字コード列が示す音声合成の対象となっている単語や句等についてのアクセント型、読み、品詞情報を求め、その品詞情報に従うアクセント型・境界の決定、ポーズ記号の挿入、および漢字かな混じり文の読みの形式への変換を行い、例えば「ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」なる音韻系列と韻律情報（音声記号列）を生成する（図２ステップＳ２）。ここで、「＾」はアクセス位置、「／」はアクセス句の区切り、「／／」は文の終端、「．」は２００ｍｓのポーズ区間を表す記号（ポーズ記号）、＜＞内は鼻濁音、（）は無声化音を表すものとする。また、サンプリング周波数は８ｋＨｚとする。
【００２６】
言語処理部３によって生成された音声記号列「ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」は合成パラメータ生成部５に与えられる。
【００２７】
合成パラメータ生成部５は、言語処理部３から与えられた音声記号列から、最初に出現するポーズ記号（ポーズ区間記号）「．」の先行アクセント句である「ア（シ）タ＾ワ」なる音韻系列を切り出し、その音韻系列に対応する音韻のケプストラムパラメータを音声素片ファイル４より抽出して音韻パラメータを生成する（図２ステップＳ３）。同時に合成パラメータ生成部５は、その音韻系列に対応する韻律情報に従って韻律パラメータを生成する。
【００２８】
合成パラメータ生成部５は、生成した「ア（シ）タ＾ワ」に対応する音韻パラメータおよび韻律パラメータを音声合成部７に与える。また合成パラメータ生成部５は、この「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」をポーズ生成部６に与える。
【００２９】
なお、従来であれば、合成パラメータ生成部５において、「ア（シ）タ＾ワ」とそれに後続するポーズ記号「．」までの「ア（シ）タ＾ワ．」の音韻パラメータおよび韻律パラメータ（ポーズ記号で表されるポーズ部に対しては音韻パラメータおよび韻律パラメータとも「０」が設定される）が生成されて、それが音声合成部７に与えられることに注意されたい。
【００３０】
音声合成部７は、合成パラメータ生成部５から「ア（シ）タ＾ワ」に対応する音韻パラメータおよび韻律パラメータ（合成パラメータ）が与えられると、それを入力して一時保持する。そして音声合成部７は、入力した「ア（シ）タ＾ワ」の合成パラメータに従い、音源の生成とディジタルフィルタリング処理とを行うことにより、「ア（シ）タ＾ワ」の音声ディジタルデータ（前記入力文字コード列「明日は説明会があります。よろしくお願いします。」中の「明日は」に示される音声ディジタルデータ）を生成する（図２ステップＳ４）。
【００３１】
この音声合成部７により生成された音声ディジタルデータはオーディオデバイス８に与えられる。但し、オーディオデバイス８による音声出力中の場合には、生成した音声ディジタルデータは、音声出力の終了を待って与えられる。
【００３２】
一方、ポーズ生成部６は、合成パラメータ生成部５から与えられた「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」に基づいて、２００ｍｓ分のポーズデータ（「０」データ）を生成する（図２ステップＳ５）。ここでは、サンプリング周波数が８ｋＨｚであるため、ポーズ生成部６は１６００個のポーズデータを生成する。
【００３３】
ポーズ生成部６は、音声合成部７からオーディオデバイス８に音声ディジタルデータ（ここでは「ア（シ）タ＾ワ」の音声ディジタルデータ）が与えられると、自身が生成したポーズデータ（ここでは、「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」により示される２００ｍｓのポーズ区間に対応した１６００個のポーズデータ）を同デバイス８に与える（図２ステップＳ６）。これにより、「ア（シ）タ＾ワ」の音声ディジタルデータの後に１６００個のポーズデータが付加されたことになる。
【００３４】
オーディオデバイス８は、音声合成部７から与えられた「ア（シ）タ＾ワ」の音声ディジタルデータとポーズ生成部６から与えられて同データに付加された１６００個のポーズデータを、Ｄ／Ａ変換器により順にアナログ信号に変換し、折り返し雑音除去フィルタを介してスピーカ９に出力することにより、「ア（シ）タ＾ワ．」に対応する音声をスピーカ９から出力させる（図２ステップＳ７，Ｓ８）。
【００３５】
このように本実施例では、ポーズ部は出力される合成音が無音であり、オーディオデバイス８からは出力「０」の信号を出力させればよいことに着目して、当該ポーズ部については、ポーズ記号「．」をもとにポーズ区間分のポーズデータ（「０」データ）を生成するようにし、合成パラメータ生成部５による合成パラメータ作成の対象外とすることで、音声合成部７にてポーズ部の合成フィルタリングが行われないようにしている。こうすることで、音声合成部７での合成フィルタリングに要する時間が短縮され、リアルタイムでの音声合成が可能となる。
【００３６】
なお、従来であれば、ポーズのパラメータを含む合成パラメータを合成パラメータ生成部５にて生成して音声合成部７に与えていたため、当該音声合成部７では、合成音が無音となるポーズ部についても合成フィルタリングが行われ、音声合成に長時間要していた。
【００３７】
さて、合成パラメータ生成部５は、オーディオデバイス８によるスピーカ９からの音声出力が開始されると、「／／」で示される１文の終りまで処理したか否かを判断する（図２ステップＳ９）。この例のように１文の終りに達していない場合には、後続の「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」について、ステップＳ３からステップＳ８まで上記と同様の処理が行われる。このときポーズ生成部６では、５個のポーズ記号「．．．．．」により示される１０００ｍｓ（１秒）のポーズ区間に対応した８０００個のポーズデータが生成される。
【００３８】
そして、先の「ア（シ）タ＾ワ．」に対応する音声出力が終了すると、新たに音声合成部７により生成される「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）」の音声ディジタルデータとポーズ生成部６により生成される８０００個のポーズデータが順にオーディオデバイス８に与えられ、次の音声出力に供される。
【００３９】
このとき、１文の処理が終了していることから、今度はステップＳ９からステップＳ１０に進み、入力部１に制御が戻る。
入力部１は、ステップＳ１０において、文章の終りまで処理したか否かを判断し、この例のように終りでなければ、ステップＳ１の処理に戻る。このステップＳ１では、次の文「よろしくお願いします。」が入力部１により切り出され、以後、前記した処理と同様な処理が行われる。
【００４０】
さて本実施例においては、「ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」の例では、６個のポーズ記号、即ち１．２秒分のポーズ記号があることから、音声合成部７の合成フィルタリング処理により例えば１秒分の音声ディジタルデータを生成するのに単純に１秒かかるものとすると、ポーズ区間１．２秒分だけ処理時間が短縮できたことになる、但し、ポーズ生成部６によるポーズデータ生成に要する時間は音声合成部７での合成フィルタリングに要する時間より極めて少ないため無視する。
【００４１】
以上に述べた第１実施例では、ポーズ生成部６を設けて当該ポーズ生成部６にてポーズデータを生成するようにしていたが、一般にＰＣ（パーソナルコンピュータ）やＥＷＳ（エンジニアリング・ワーク・ステーション）は、オーディオデバイス（８）に対して音声ディジタルデータの書き込みを行わなければ、自動的にポーズの状態が続く構造となっている。したがって、ポーズ生成部６がなくてもポーズ（ポーズ区間）が生成でき、その時間は、オーディオデバイス（８）に対して何も書き込まない期間を設定することにより、任意に制御できる。また、ＰＣやＥＷＳにはタイマ機能が組み込まれており、この機能を利用することで、上記の何も書き込まない期間、即ちポーズ区間の設定が可能となる。
【００４２】
そこで、このタイマ機能を利用してポーズ区間を設定するようにした第２実施例につき、以下に説明する。
［第２実施例］
図３は同実施例に係る音声合成装置の概略構成を示すブロック図である。なお、図１と同様の部分には、便宜上同一符号を付してある。
【００４３】
図３に示す音声合成装置が、図１に示した音声合成装置と異なるのは、主として次の３点である。
第１は、ポーズ生成部６に代えて、タイマ１１と同タイマ１１にタイマ値を設定するタイマ設定部１２を用いている点である。
【００４４】
第２は、図１では合成パラメータ生成部５からポーズ生成部６に与えられたポーズ記号が、タイマ設定部１２に与えられる点である。
第３は、音声合成部７が生成した音声ディジタルデータをオーディオデバイス８に出力できる条件が、オーディオデバイス８が音声出力中でないことに加えて、タイマ１１のタイマ値が「０」である点である。
【００４５】
タイマ１１は、１カウント当たり例えば１ｍｓ（即ちタイマクロックの周期は１ｍｓ）であり、「０」より小さい値にはならないものとする。
タイマ設定部１２は、オーディオデバイス８が音声出力中にないこと（したがって音声合成部７からの音声ディジタルデータ入力が可能なこと）を示すレディ信号１３の出力時に、合成パラメータ生成部５から与えられているポーズ記号（ポーズ記号列）の示すポーズ長をタイマ１１に設定する。
【００４６】
次に、図３に示す音声合成装置の動作を図４のフローチャートを参照して説明する。
まず、前記した第１実施例の場合と同様に、入力部１により、音声合成の対象とする漢字かな混じりの文字コード列「明日は説明会があります。よろしくお願いします。」が入力されたとする。
【００４７】
入力部１は、句点「。」で１文「明日は説明会があります。」を切り出す（図４ステップＳ１１）。
言語処理部３は、入力部１により切り出された文字コード列（１文）「明日は説明会があります。」と単語辞書２とを照合し、この入力文字コード列が示す音声合成の対象となっている単語や句等についてのアクセント型、読み、品詞情報を求め、その品詞情報に従うアクセント型・境界の決定、ポーズ記号の挿入、および漢字かな混じり文の読みの形式への変換を行い、前記したような「ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」なる音韻系列と韻律情報（音声記号列）を生成する（図４ステップＳ１２）。
【００４８】
言語処理部３によって生成された音声記号列「ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」は合成パラメータ生成部５に与えられる。これにより合成パラメータ生成部５および音声合成部７により次に述べるステップＳ１３の処理が行われる。
【００４９】
即ち、まず合成パラメータ生成部５は、言語処理部３から与えられた音声記号列から、最初に出現するポーズ記号（ポーズ区間記号）「．」の先行アクセント句である「ア（シ）タ＾ワ」なる音韻系列を切り出し、その音韻系列に対応する音韻のケプストラムパラメータを音声素片ファイル４より抽出して音韻パラメータを生成する。同時に合成パラメータ生成部５は、その音韻系列に対応する韻律情報に従って韻律パラメータを生成する。
【００５０】
合成パラメータ生成部５は、生成した「ア（シ）タ＾ワ」に対応する音韻パラメータおよび韻律パラメータを音声合成部７に与える。また合成パラメータ生成部５は、この「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」をタイマ設定部１２に与える。
【００５１】
タイマ設定部１２は、合成パラメータ生成部５から与えられたポーズ記号「．」を入力して一時保持する。
音声合成部７は、合成パラメータ生成部５から与えられた「ア（シ）タ＾ワ」に対応する音韻パラメータおよび韻律パラメータ（合成パラメータ）を入力して一時保持する。そして音声合成部７は、入力した「ア（シ）タ＾ワ」の合成パラメータに従い、音源の生成とディジタルフィルタリング処理とを行うことにより、「ア（シ）タ＾ワ」の音声ディジタルデータ（前記入力文字コード列「明日は説明会があります。よろしくお願いします。」中の「明日は」に示される音声ディジタルデータ）を生成する。
【００５２】
音声合成部７は、「ア（シ）タ＾ワ」の音声ディジタルデータを生成すると、オーディオデバイス８が音声出力中であるか否かを、同デバイス８からのレディ信号１３により調べる（図４ステップＳ１４）。もし、音声出力中でなければ、音声合成部７はステップＳ１５に進み、音声出力中であるならば、音声出力中でなくなる（即ち音声出力が終了してレディ信号１３が真となる）のを待つ。
【００５３】
ここでは、オーディオデバイス８は音声出力中でないため、音声合成部７はステップＳ１５に進む。音声合成部７は、このステップＳ１５において、タイマ１１の値（タイマ値）が「０」であるか否かを調べる。もし、タイマ値が「０」であるならば、音声合成部７はステップＳ１６に進み、「０」でないならば、「０」になるのを待つ。
【００５４】
ここでは、タイマ１１には何も設定されていないため、タイマ値は「０」であり、音声合成部７は次のステップＳ１６に進む。音声合成部７は、このステップＳ１６において、先のステップＳ１３で生成した「ア（シ）タ＾ワ」の音声ディジタルデータを、スピーカ９からの音声出力のために、オーディオデバイス８に与える。これにより、前記した第１実施例の場合と同様にして、「ア（シ）タ＾ワ．」に対応する音声のスピーカ９からの出力が開始される。このときオーディオデバイス８からのレディ信号１３は、音声出力中を示す偽値に設定される。
【００５５】
さて、音声合成部７で生成された「ア（シ）タ＾ワ」の音声ディジタルデータがオーディオデバイス８に与えられると、合成パラメータ生成部５は、「／／」で示される１文の終りまで処理したか否かを判断する（図４ステップＳ１７）。この例のように１文の終りに達していない場合には、後続の「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／」について、音声合成部７およびオーディオデバイス８によるステップＳ１３の処理が、上記した「ア（シ）タ＾ワ」に対するのと同様に行われる。
【００５６】
このステップＳ１３では、「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）」の合成パラメータの生成と、それに基づく音声ディジタルデータの生成とが行われると共に、それに後続する５個のポーズ記号「．．．．．」が合成パラメータ生成部５からタイマ設定部１２に与えられる。
【００５７】
タイマ設定部１２は、このポーズ記号「．．．．．」を入力し、先に保持した「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」の後に保持する。
一方、「ア（シ）タ＾ワ」の音声出力が終了すると、オーディオデバイス８はレディ信号１３を真にする。するとタイマ設定部１２は、自身が保持しているポーズ記号（ポーズ記号列）のうち、その時点で最も古いポーズ記号（ポーズ記号列）、即ち「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」を取り出し、それに対応するポーズ区間２００ｍｓを示す値「２００」をタイマ１１に設定する。このタイマ設定部１２の動作は、ＰＣやＥＷＳ上では、音声出力終了に応じて発生する割り込み（音声出力終了割り込み）に従う音声出力終了割り込み処理により実現される。
【００５８】
タイマ１１は、１ｍｓ毎にカウントダウンを行う。
さて音声合成部７は、ステップＳ１３の処理で、「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）」の音声ディジタルデータを生成すると、前記したように、オーディオデバイス８が音声出力中であるか否かを調べる（図４ステップＳ１４）。
【００５９】
もし、既に「ア（シ）タ＾ワ」に対する音声出力が終了しているならば、音声合成部７はステップＳ１５に進み、終了していなければ、終了するのを待つ。
ここで、「ア（シ）タ＾ワ」に対する音声出力が終了したものとすると、音声合成部７は、ステップＳ１５において、タイマ１１の値が「０」であるか否か、即ち音声出力が終了してからタイマ設定部１２により設定された期間（ここでは、「ア（シ）タ＾ワ」に後続する１個のポーズ記号「．」に対応する２００ｍｓ）が経過したかを判別する。
【００６０】
そしてタイマ１１の値が「０」になったとき、即ち音声出力終了後、ポーズ区間の時間分（２００ｍｓ）が経過したとき、音声合成部７は、ステップＳ１３で生成した「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）」の音声ディジタルデータをオーディオデバイス８に与え、スピーカ９からの音声出力を行わせる（図４ステップＳ１６）。
【００６１】
このように、音声出力が終了しても、タイマ１１の値が「０」になるまでは、即ち音声出力終了時にタイマ設定部１２によって設定された、その出力音声に後続するポーズ区間に相当する時間が経過するまでは、次の音声出力対象となる音声ディジタルデータの出力は待たされる。オーディオデバイス８の出力は、音声出力終了後から次の音声ディジタルデータが与えられるまでの期間、ポーズ状態となるため、ポーズを生成したのと等価となる。
【００６２】
ステップＳ１６にて、「セツメ＾ーカイ＜ガ＞／アリマ＾（ス）」の音声ディジタルデータが音声合成部７からオーディオデバイス８に与えられると、１文の処理が終了していることから、今度はステップＳ１７からステップＳ１８に進み、入力部１に制御が戻る。
【００６３】
入力部１は、ステップＳ１８において、文章の終りまで処理したか否かを判断し、この例のように終りでなければ、ステップＳ１１の処理に戻る。このステップＳ１１では、次の文「よろしくお願いします。」が入力部１により切り出され、以後、前記した処理と同様な処理が行われる。
【００６４】
以上に述べた第２実施例においても、ポーズ部については、合成パラメータ生成部５による合成パラメータ作成の対象外とすることで、音声合成部７にてポーズ部の合成フィルタリングが行われないようにしているため、音声合成に係わる処理時間が短縮できる。
【００６５】
なお、本発明は上述した実施例に限定されるものではない。即ち、実施例では、ポーズデータとして「０」を用いて説明したが、オーディオデバイス８内のＤ／Ａ変換器の仕様によっては「０」データを入力してもアナログ信号が「０」にならないものがあるため、特に「０」に限定する必要はなく、アナログ信号が「０」になるようなディジタルデータをポーズデータとして使用すればよい。
要するに本発明はその要旨を逸脱しない範囲で種々変形して実施することができる。
【００６６】
【発明の効果】
以上説明したように本発明によれば、ポーズ区間のデータが合成パラメータ（音韻パラメータと韻律パラメータ）に含まれず、したがってポーズ区間については、処理に最も時間を要する音源生成や合成フィルタリングが実行されない構成とすると共に、ポーズ区間に対応する時間（期間）中は有音となるような信号が出力されない構成としたので、ポーズ区間を含む音声出力を正しく行いながらも、ポーズ区間の合成フィルタリングを行わないで済む分だけ、音声合成に要する処理時間を短縮させることができ、パーソナルコンピュータ（ＰＣ）やエンジニアリング・ワーク・ステーション（ＥＷＳ）のマルチタスクＯＳ下のもとでソフトウェア処理で音声合成する場合にも、リアルタイムに行える等の実用上多大なる効果が奏せられる。
【図面の簡単な説明】
【図１】本発明の第１実施例を示す音声合成装置のブロック構成図。
【図２】上記第１実施例における処理の流れを説明するためのフローチャート。
【図３】本発明の第２実施例を示す音声合成装置のブロック構成図。
【図４】上記第２実施例における処理の流れを説明するためのフローチャート。
【符号の説明】
１…入力部、２…単語辞書、３…言語処理部、４…音声素片ファイル、５…合成パラメータ生成部、６…ポーズ生成部、７…音声合成部、８…オーディオデバイス（音声出力手段）、９…スピーカ、１１…タイマ、１２…タイマ設定部、１３…レディ信号。
[0001]
[Field of Industrial Application]
The present invention relates to a speech synthesis method and apparatus for generating a synthesized speech from a character code string or prosodic information and a phoneme sequence.
[0002]
[Prior art]
In recent years, various speech synthesizers have been developed that analyze text written in a mixture of kanji and kana and synthesize and output the speech it represents by rule-based synthesis. Such synthesizers have begun to be widely used for telephone introduction services in banking, newspaper proofreading systems, text read-aloud devices, and the like.
[0003]
A speech synthesizer employing this type of rule-based synthesis basically works as follows. Speech uttered by a human is analyzed in advance, in units such as CV (consonant, vowel), CVC (consonant, vowel, consonant), VCV (vowel, consonant, vowel), or VC (vowel, consonant), using a method such as LSP (line spectrum pair) analysis or cepstrum analysis, and the resulting phoneme information is registered in a speech unit file. Synthesis parameters (phoneme parameters and prosodic parameters) are then generated by referring to this speech unit file, and synthesized speech is produced by generating a sound source and performing synthesis filtering based on these parameters.
[0004]
Conventionally, such a speech synthesizer requires dedicated hardware for processing in real time. The system configuration of this speech synthesizer is roughly divided into the following two types.
[0005]
In the first configuration, a host computer such as a personal computer (PC) converts the mixed kanji-kana text into prosodic information and a phoneme sequence (language processing), and dedicated hardware performs synthesis parameter generation, sound source generation, synthesis filtering, and D/A (digital/analog) conversion. In the second configuration, by contrast, all processing from the mixed kanji-kana text through to speech generation is performed by dedicated hardware. In either configuration, the dedicated hardware almost always consists of an LSI called a DSP (digital signal processor), which performs fast multiply-accumulate operations, or a synthesis LSI, together with a general-purpose MPU (microprocessor unit).
[0006]
In such a dedicated device, the mixed kanji-kana text undergoes morphological analysis and dependency analysis using a word dictionary, producing a phonetic symbol string that consists of prosodic information and a phoneme sequence (that is, the prosodic information and phoneme sequence expressed in symbols and katakana). A sequence of fundamental frequencies (the prosodic parameters) is generated from the prosodic information in this symbol string, and the phoneme parameters are generated by retrieving from the speech unit file the parameters corresponding to each phoneme in the phoneme sequence and connecting them across phoneme boundaries. If the phonetic symbol string contains a symbol representing a pause section, both the prosodic parameters and the phoneme parameters are set to "0" for that pause portion.
[0007]
The processing described above is performed by a general-purpose MPU (CPU), and the generated synthesis parameters (prosodic parameters and phonemic parameters) are input to the speech synthesis unit. A DSP or a synthesis LSI is used for the voice synthesis unit. In this voice synthesis unit, a DSP or a synthesis LSI inputs synthesis parameters for each frame period, performs sound source generation and synthesis filtering based on the parameters, and outputs the generated sound to the D / A converter for each sampling period.
[0008]
Meanwhile, personal computers (PCs) and engineering workstations (EWSs) are becoming able to perform the above processing in real time by software alone, because their processing power has increased and because they now include a D/A converter, an analog output section, and a speaker as standard equipment. In addition, operating systems (OSs) capable of multitasking have recently begun to be adopted not only on EWSs but also on PCs.
[0009]
However, most of such OSs do not guarantee real-time properties. That is, although there is no problem when the number of processing tasks other than voice synthesis is small, when the number of tasks increases, the CPU is used for processing other than voice synthesis, and voice synthesis may not be performed in real time.
[0010]
For these reasons, to synthesize speech in real time by software processing alone under a multitasking OS on a PC or EWS, it is important to shorten the time required for speech synthesis as much as possible.
[0011]
[Problems to be solved by the invention]
As described above, in the conventional speech synthesizers, real-time operation of the speech synthesis unit is guaranteed in a dedicated device. In a device implemented by software processing on a personal computer (PC) or engineering workstation (EWS) running a multitasking OS, however, speech that can be synthesized in real time when there are few processing tasks can no longer be synthesized in real time when the number of tasks increases.
[0012]
The present invention has been made in view of these circumstances, and its object is to provide a speech synthesis method and apparatus that shorten the time required for speech synthesis and thereby allow speech synthesis to be performed in real time.
[0013]
[Means for Solving the Problems]
The speech synthesis method and apparatus according to the present invention are based on the observation that, in a pause portion, the synthesized speech to be output is silent, so it suffices to output a signal that produces no sound (a signal whose output is "0"). The pause symbol portions are excluded from the phonetic symbol string generated from the input character code string; phoneme parameters are generated according to the phoneme sequence, and prosodic parameters are generated according to the prosodic information. Speech waveform data is generated by synthesis filtering from these phoneme and prosodic parameters, which contain no parameters for the pause symbol portions. For each pause symbol portion, pause data that is silent for the duration of the pause is generated separately, based on the sampling frequency of the audio output device that outputs speech from the generated waveform data, and this pause data is appended to the speech waveform data generated by the synthesis filtering to produce the speech.
[0015]
[Action]
In the above configuration, a phonetic symbol string containing pause symbols is generated from the input character code string. Phoneme parameters and prosodic parameters (synthesis parameters) are then generated with the pause symbol portions of that string excluded. As a result, synthesis filtering is performed on the non-silent portions to generate speech waveform data (digital speech data), but it is not performed on the pause portions, because the synthesis parameters contain no parameters for them. Instead, pause data that is silent for the pause length of each pause symbol portion is generated based on the sampling frequency of the audio output device and appended to the speech waveform data of the non-silent portion. Outputting the speech waveform data with this pause data appended therefore yields correct speech output including the pause sections.
[0017]
Thus, with the above configuration, synthesis filtering, the most time-consuming step in speech synthesis, is performed only on the non-silent portions. For the silent (pause) portions, pause data covering the corresponding time is generated so that no audible signal is output during that period. Speech including silent portions is therefore output, while the processing time for speech synthesis is shortened by the amount that would otherwise be spent on synthesis filtering of the silent portions.
[0018]
[Embodiments]
[First embodiment]
First, a first embodiment of the present invention will be described.
FIG. 1 is a block diagram showing a schematic configuration of the speech synthesizer according to the embodiment.
[0019]
The speech synthesizer shown in FIG. 1 has an input unit 1 that handles input of a character code string of mixed kanji and kana to be synthesized; a word dictionary 2 in which accent types, readings, part-of-speech information, and the like are registered in advance for the words and phrases to be synthesized; and a language processing unit 3. The language processing unit 3 analyzes the character code string input through the input unit 1 using the word dictionary 2 and performs the language processing that generates the corresponding phoneme sequence and prosodic information.
[0020]
The speech synthesizer shown in FIG. 1 also has a speech unit file 4 storing groups of cepstrum parameters obtained in advance by analyzing input speech for each arbitrary speech unit; a synthesis parameter generation unit 5 that generates phoneme parameters (here, phoneme cepstrum parameters) according to the phoneme sequence generated by the language processing unit 3 and prosodic parameters according to the prosodic information generated by the language processing unit 3; and a pause generation unit 6. Based on the symbols representing pause sections in the phoneme sequence and prosodic information passed from the language processing unit 3 to the synthesis parameter generation unit 5, the pause generation unit 6 creates pause data representing "0" for the length of the pause (data that makes the output signal of the D/A converter "0"). This pause data is appended to the speech waveform data (digital speech data) output from the speech synthesis unit 7 described next.
[0021]
The speech synthesizer shown in FIG. 1 further has a speech synthesis unit 7 that generates a sound source and performs synthesis filtering to produce synthesized speech based on the phoneme and prosodic parameters generated by the synthesis parameter generation unit 5; an audio device 8; and a speaker 9 for speech output. The audio device 8 includes well-known components (not shown) such as a D/A converter, an aliasing noise removal filter, and a power amplifier. The D/A converter converts into an analog signal the digital speech data output from the speech synthesis unit 7 and the pause data (digital data) output from the pause generation unit 6 and appended to it.
[0022]
The speech synthesizer configured as above is implemented on a multitasking personal computer (PC) or engineering workstation (EWS). The input unit 1, language processing unit 3, synthesis parameter generation unit 5, pause generation unit 6, and speech synthesis unit 7 (specifically its sound source generation and filtering portions) are functional blocks realized by program processing on the CPU (execution of a speech synthesis task).
[0023]
Next, the operation of the speech synthesizer shown in FIG. 1 will be described with reference to the flowchart of FIG. 2.
First, suppose that a character code string of mixed kanji and kana to be synthesized, for example "明日は説明会があります。よろしくお願いします。" ("There will be a briefing session tomorrow. Thank you in advance."), is input through the input unit 1.
[0024]
The input unit 1 extracts one sentence, "明日は説明会があります。", using the full stop "。" as the delimiter (step S1 in FIG. 2). The extracted sentence is passed to the language processing unit 3.
[0025]
The language processing unit 3 matches the character code string (one sentence) "明日は説明会があります。" extracted by the input unit 1 against the word dictionary 2 and obtains the accent type, reading, and part-of-speech information for the words and phrases to be synthesized. Based on the part-of-speech information, it determines accent types and boundaries, inserts pause symbols, and converts the mixed kanji-kana text into its reading, generating a phoneme sequence and prosodic information (a phonetic symbol string) such as "ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／" (step S2 in FIG. 2). Here "＾" marks an accent position, "／" is an accent phrase delimiter, "／／" is the end of the sentence, "．" is a symbol representing a 200 ms pause section (a pause symbol), "＜＞" encloses a nasalized sound, and "（）" encloses a devoiced sound. The sampling frequency is assumed to be 8 kHz.
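The notation above can be illustrated with a minimal sketch that splits a phonetic symbol string into non-silent segments and the number of pause symbols that follow each one. The function name `split_symbol_string` and the use of a regular expression are assumptions made for illustration only; the description does not specify how the symbol string is parsed.

```python
import re
from typing import List, Tuple

PAUSE_SYMBOL = "．"    # full-width period: one 200 ms pause symbol
SENTENCE_END = "／／"  # end-of-sentence marker

def split_symbol_string(symbols: str) -> List[Tuple[str, int]]:
    """Return (speech_segment, pause_symbol_count) pairs (hypothetical helper)."""
    body = symbols
    if body.endswith(SENTENCE_END):
        body = body[: -len(SENTENCE_END)]  # drop the sentence terminator
    pairs = []
    # A segment is a run of non-pause symbols followed by zero or more pause symbols.
    for match in re.finditer(r"([^．]+)(．*)", body):
        pairs.append((match.group(1), len(match.group(2))))
    return pairs

example = "ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／"
for segment, pause_count in split_symbol_string(example):
    print(segment, pause_count)
```

For the example string this yields two segments, the first followed by one pause symbol and the second by five. The single "／" inside the second segment is an accent phrase delimiter and is kept as part of the segment.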
[0026]
The phonetic symbol string "ア（シ）タ＾ワ．セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／" generated by the language processing unit 3 is passed to the synthesis parameter generation unit 5.
[0027]
From the phonetic symbol string passed from the language processing unit 3, the synthesis parameter generation unit 5 extracts the phoneme sequence "ア（シ）タ＾ワ", which is the accent phrase preceding the first pause symbol (pause section symbol) "．", retrieves the cepstrum parameters of the corresponding phonemes from the speech unit file 4, and generates phoneme parameters (step S3 in FIG. 2). At the same time, the synthesis parameter generation unit 5 generates prosodic parameters according to the prosodic information corresponding to that phoneme sequence.
[0028]
The synthesis parameter generation unit 5 passes the phoneme and prosodic parameters generated for "ア（シ）タ＾ワ" to the speech synthesis unit 7. It also passes the single pause symbol "．" that follows "ア（シ）タ＾ワ" to the pause generation unit 6.
[0029]
Note that, conventionally, the synthesis parameter generation unit 5 would generate phoneme and prosodic parameters for "ア（シ）タ＾ワ．", that is, "ア（シ）タ＾ワ" together with the pause symbol "．" that follows it (with both the phoneme and prosodic parameters set to "0" for the pause portion represented by the pause symbol), and would pass them to the speech synthesis unit 7.
[0030]
When the phoneme and prosodic parameters (synthesis parameters) for "ア（シ）タ＾ワ" are passed from the synthesis parameter generation unit 5, the speech synthesis unit 7 receives and temporarily holds them. It then generates a sound source and performs digital filtering according to these synthesis parameters, thereby generating the digital speech data for "ア（シ）タ＾ワ" (the digital speech data for "明日は" in the input character code string "明日は説明会があります。よろしくお願いします。") (step S4 in FIG. 2).
[0031]
The digital speech data generated by the speech synthesis unit 7 is passed to the audio device 8. If the audio device 8 is currently outputting speech, however, the data is passed only after that output has finished.
[0032]
Meanwhile, the pause generation unit 6 generates 200 ms of pause data ("0" data) based on the single pause symbol "．" following "ア（シ）タ＾ワ" passed from the synthesis parameter generation unit 5 (step S5 in FIG. 2). Because the sampling frequency is 8 kHz, the pause generation unit 6 generates 1600 pause data values.
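The sample count follows directly from the sampling frequency and the pause length: 8000 samples per second times 0.2 seconds gives 1600 samples. The following is a minimal sketch of that calculation; the function name `make_pause_data` and its parameters are assumptions for illustration. The `silence_value` parameter reflects the remark in the description that, depending on the D/A converter, a value other than "0" may be needed to produce a zero analog output.

```python
from typing import List

def make_pause_data(pause_symbols: int,
                    sample_rate_hz: int = 8000,
                    pause_ms_per_symbol: int = 200,
                    silence_value: int = 0) -> List[int]:
    """Build silent samples for a run of pause symbols (hypothetical helper)."""
    # samples = sampling frequency [Hz] * pause length [ms] / 1000
    sample_count = sample_rate_hz * pause_ms_per_symbol * pause_symbols // 1000
    return [silence_value] * sample_count

print(len(make_pause_data(1)))  # one pause symbol at 8 kHz -> 1600
print(len(make_pause_data(5)))  # five pause symbols at 8 kHz -> 8000
```

No filtering is involved here; the cost is only that of filling a buffer with a constant.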
[0033]
When the digital speech data (here, that of "ア（シ）タ＾ワ") is passed from the speech synthesis unit 7 to the audio device 8, the pause generation unit 6 passes the pause data it has generated (here, the 1600 values corresponding to the 200 ms pause section indicated by the single pause symbol "．" following "ア（シ）タ＾ワ") to the audio device 8 (step S6 in FIG. 2). As a result, 1600 pause data values are appended after the digital speech data of "ア（シ）タ＾ワ".
[0034]
The audio device 8 uses its D/A converter to convert into an analog signal, in order, the digital speech data of "ア（シ）タ＾ワ" passed from the speech synthesis unit 7 and the 1600 pause data values passed from the pause generation unit 6 and appended to it. It outputs the signal to the speaker 9 through the aliasing noise removal filter, so that the speech corresponding to "ア（シ）タ＾ワ．" is output from the speaker 9 (steps S7 and S8 in FIG. 2).
[0035]
Thus, this embodiment is based on the observation that the synthesized sound in a pause portion is silent, so the audio device 8 need only output a "0" signal. For each pause portion, pause data ("0" data) covering the pause section is generated from the pause symbol "．", and the pause portion is excluded from synthesis parameter creation by the synthesis parameter generation unit 5, so that the speech synthesis unit 7 performs no synthesis filtering for it. This shortens the time required for synthesis filtering in the speech synthesis unit 7 and makes real-time speech synthesis possible.
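The flow of the first embodiment can be sketched as follows, under stated assumptions. The function `render_sentence` and the stand-in `placeholder_synthesize` are hypothetical; the latter merely returns nonzero samples in place of the real sound source generation and cepstrum-based synthesis filtering, which the description does not detail. The point of the sketch is only that the synthesis function is called for the non-silent segments, while pause samples are appended directly.

```python
from typing import Callable, List, Tuple

SAMPLE_RATE_HZ = 8000      # sampling frequency used in the example
PAUSE_MS_PER_SYMBOL = 200  # duration represented by one pause symbol

def render_sentence(segments: List[Tuple[str, int]],
                    synthesize: Callable[[str], List[int]]) -> List[int]:
    """Synthesize each non-silent segment, then append its pause data unfiltered."""
    output: List[int] = []
    for symbols, pause_count in segments:
        # Synthesis filtering is applied to the non-silent segment only.
        output.extend(synthesize(symbols))
        # The pause is produced as plain "0" data, with no filtering.
        pause_samples = SAMPLE_RATE_HZ * PAUSE_MS_PER_SYMBOL * pause_count // 1000
        output.extend([0] * pause_samples)
    return output

def placeholder_synthesize(symbols: str) -> List[int]:
    # Hypothetical stand-in for the speech synthesis unit: 100 nonzero samples per symbol.
    return [1] * (100 * len(symbols))

segments = [("ア（シ）タ＾ワ", 1), ("セツメ＾ーカイ＜ガ＞／アリマ＾（ス）", 5)]
samples = render_sentence(segments, placeholder_synthesize)
print(samples.count(0))  # pause samples appended without filtering
```

In the conventional approach described earlier, the pause portions would instead be passed through the same synthesis function with all parameters set to "0".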
[0036]
Conventionally, by contrast, the synthesis parameter generation unit 5 generated synthesis parameters that included pause parameters and passed them to the speech synthesis unit 7, so the speech synthesis unit 7 performed synthesis filtering even on pause portions, where the synthesized sound is silent, and speech synthesis took a long time.
[0037]
Once speech output from the speaker 9 by the audio device 8 has started, the synthesis parameter generation unit 5 determines whether processing has reached the end of the sentence indicated by "／／" (step S9 in FIG. 2). If the end has not been reached, as in this example, steps S3 to S8 are carried out in the same way for the remainder, "セツメ＾ーカイ＜ガ＞／アリマ＾（ス）．．．．．／／". In this case the pause generation unit 6 generates 8000 pause data values corresponding to the 1000 ms (1 second) pause section indicated by the five pause symbols "．．．．．".
[0038]
Then, when the speech output corresponding to the earlier "ア（シ）タ＾ワ．" finishes, the digital speech data of "セツメ＾ーカイ＜ガ＞／アリマ＾（ス）" newly generated by the speech synthesis unit 7 and the 8000 pause data values generated by the pause generation unit 6 are passed in order to the audio device 8 for the next speech output.
[0039]
At this time, since the processing of one sentence has been completed, the process proceeds from step S9 to step S10, and the control returns to the input unit 1.
In step S10, the input unit 1 determines whether the entire text has been processed. If not, as in this example, the process returns to step S1, where the input unit 1 extracts the next sentence, "よろしくお願いします。", and the same processing as described above is performed.
[0040]
In the present embodiment, the example “A (shi) tawa. Setsumeikai <ga> / arima (su) ..... //” contains six pause symbols, that is, 1.2 seconds of pause. If, for example, the synthesis filtering of the speech synthesis unit 7 takes one second to generate one second of digital audio data, the processing time is reduced by the 1.2 seconds of the pause sections. The time the pause generation unit 6 needs to generate pause data is ignored here, because it is far shorter than the time the speech synthesis unit 7 needs for synthesis filtering.
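The saving estimated above can be written as a one-line calculation. This is a sketch only: the parameter `seconds_per_audio_second` (processing time needed per second of generated audio) is a name introduced here for illustration, and the default of 1.0 is the example figure from the text, not a measured value.

```python
def filtering_time_saved(pause_symbol_count,
                         pause_ms_per_symbol=200,
                         seconds_per_audio_second=1.0):
    """Seconds of synthesis filtering avoided by skipping the pause sections."""
    pause_seconds = pause_symbol_count * pause_ms_per_symbol / 1000
    return pause_seconds * seconds_per_audio_second


saved = filtering_time_saved(6)  # six pause symbols in the example sentence
```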
[0041]
In the first embodiment described above, the pause generation unit 6 is provided to generate the pause data. In general, however, a PC (personal computer) or EWS (engineering work station) is built so that the pause state simply continues as long as no digital audio data is written to the audio device (8). A pause (pause section) can therefore be produced without the pause generation unit 6, and its length can be controlled freely by setting the period during which nothing is written to the audio device (8). A PC or EWS also has a built-in timer function, which can be used to set the period during which nothing is written, that is, the pause section.
[0042]
Therefore, a second embodiment in which a pause section is set using this timer function will be described below.
[Second embodiment]
FIG. 3 is a block diagram showing a schematic configuration of the speech synthesizer according to the second embodiment. Parts identical to those in FIG. 1 are given the same reference numerals for convenience.
[0043]
The speech synthesizer shown in FIG. 3 differs from the speech synthesizer shown in FIG. 1 mainly in the following three points.
First, in place of the pause generation unit 6, a timer 11 and a timer setting unit 12 for setting a timer value to the timer 11 are used.
[0044]
Second, the pause symbol that in FIG. 1 is given from the synthesis parameter generation unit 5 to the pause generation unit 6 is given instead to the timer setting unit 12.
Third, the digital audio data generated by the speech synthesis unit 7 can be output to the audio device 8 only when the audio device 8 is not outputting audio and the timer value of the timer 11 is “0”.
[0045]
The timer 11 counts, for example, 1 ms per count (that is, the timer clock period is 1 ms), and its value never falls below “0”.
When the audio device 8 outputs the ready signal 13 indicating that it is not outputting audio (that is, that it can accept digital audio data from the speech synthesis unit 7), the timer setting unit 12 sets in the timer 11 the pause length indicated by the pause symbol (pause symbol string) given from the synthesis parameter generation unit 5.
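The behaviour of the timer 11 described above (1 ms per count, never below “0”) can be modelled as follows. This is a minimal software stand-in written for illustration; the class and method names are hypothetical, and a real PC or EWS would use its own hardware or OS timer.

```python
class CountdownTimer:
    """Illustrative stand-in for timer 11: one count is 1 ms, value never below 0."""

    def __init__(self):
        self.value = 0  # nothing set yet, so the timer reads "0"

    def set(self, pause_ms):
        """Load a pause length in milliseconds (done by the timer setting unit)."""
        self.value = max(0, pause_ms)

    def tick(self, elapsed_ms=1):
        """Count down by the elapsed milliseconds, saturating at 0."""
        self.value = max(0, self.value - elapsed_ms)

    def expired(self):
        return self.value == 0


timer = CountdownTimer()
timer.set(200)    # one pause symbol "." = 200 ms
timer.tick(150)
remaining_after_150 = timer.value
timer.tick(100)   # overshoots the remaining 50 ms; value stays at 0
```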
[0046]
Next, the operation of the speech synthesizer shown in FIG. 3 will be described with reference to the flowchart in FIG.
First, as in the first embodiment described above, assume that the input unit 1 has received the character code string of mixed kanji and kana text to be synthesized, "There will be a briefing session tomorrow. Thank you."
[0047]
The input unit 1 cuts out one sentence, "There will be a briefing session tomorrow.", delimited by the period "." (step S11 in FIG. 4).
The language processing unit 3 collates the character code string (one sentence) cut out by the input unit 1, "There will be a briefing session tomorrow.", with the word dictionary 2 to obtain the accent type, reading, and part-of-speech information of each word or phrase to be synthesized. It then determines accent types and boundaries according to the part-of-speech information, inserts pause symbols, converts the mixed kanji and kana text into its reading, and generates the phoneme sequence and prosodic information (speech symbol string) “A (shi) tawa. Setsumeikai <ga> / arima (su) ..... //” as described above (step S12 in FIG. 4).
[0048]
The speech symbol string “A (shi) tawa. Setsumeikai <ga> / arima (su) ..... //” generated by the language processing unit 3 is given to the synthesis parameter generation unit 5. The synthesis parameter generation unit 5 and the speech synthesis unit 7 then perform the processing of step S13 described below.
[0049]
That is, the synthesis parameter generation unit 5 first cuts out, from the speech symbol string given by the language processing unit 3, the phoneme sequence “A (shi) ^ tawa”, which is the accent phrase preceding the first pause symbol (pause section symbol) “.”, and extracts the cepstrum parameters of the corresponding phonemes from the speech unit file 4 to generate phoneme parameters. At the same time, the synthesis parameter generation unit 5 generates prosody parameters according to the prosodic information corresponding to the phoneme sequence.
[0050]
The synthesis parameter generation unit 5 gives the speech synthesis unit 7 the generated phoneme parameters and prosody parameters for “A (shi) tawa”. It also gives the timer setting unit 12 the one pause symbol “.” that follows “A (shi) tawa”.
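The cut-out described in the preceding paragraphs, in which each phoneme sequence goes to the speech synthesis unit and the run of pause symbols after it goes to the timer setting unit, can be sketched as a simple split. The exact notation of the speech symbol string is an assumption here: “.” is taken as the pause symbol and a trailing “//” as the sentence end, as in the example, and the helper name is hypothetical.

```python
import re


def split_at_pauses(symbol_string):
    """Split a speech symbol string into (phoneme_sequence, pause_symbols) pairs."""
    body = symbol_string.rstrip("/")  # drop the sentence-end marker "//"
    # Each match is a run of non-pause symbols followed by zero or more "." symbols.
    return re.findall(r"([^.]+)(\.*)", body)


segments = split_at_pauses("A(shi)^tawa.Setsumeikai<ga>/arima(su).....//")
# segments[0] -> first phoneme sequence and its single pause symbol
# segments[1] -> second phoneme sequence and its five pause symbols
```

Only the first element of each pair would be turned into synthesis parameters; the second element carries the pause length and never reaches the synthesis filter.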
[0051]
The timer setting unit 12 receives the pause symbol “.” from the synthesis parameter generation unit 5 and holds it temporarily.
The speech synthesis unit 7 receives and temporarily holds the phoneme parameters and prosody parameters (synthesis parameters) for “A (shi) tawa” given by the synthesis parameter generation unit 5. It then generates the sound source and performs digital filtering according to those synthesis parameters to produce the digital audio data for “A (shi) tawa”, that is, for the "tomorrow" portion of the input character code string "There will be a briefing session tomorrow. Thank you."
[0052]
When the speech synthesis unit 7 has generated the digital audio data for “A (shi) tawa”, it checks the ready signal 13 from the audio device 8 to see whether the device is outputting audio (step S14 in FIG. 4). If audio is not being output, the speech synthesis unit 7 proceeds to step S15. If audio is being output, it waits until the output ends and the ready signal 13 becomes true.
[0053]
Here, since the audio device 8 is not outputting audio, the speech synthesis unit 7 proceeds to step S15. In step S15, the speech synthesis unit 7 checks whether the value of the timer 11 (timer value) is “0”. If the timer value is “0”, the speech synthesis unit 7 proceeds to step S16; if not, it waits for the value to become “0”.
[0054]
Here, since nothing has been set in the timer 11, the timer value is “0”, and the speech synthesis unit 7 proceeds to step S16. In step S16, the speech synthesis unit 7 gives the audio device 8 the digital audio data for “A (shi) tawa” generated in step S13 so that it is output from the speaker 9. As in the first embodiment described above, output of the audio for “A (shi) tawa” from the speaker 9 then starts. At this point, the ready signal 13 from the audio device 8 becomes false, indicating that audio is being output.
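Steps S14 to S16 amount to two waits followed by a write. The sketch below expresses that control flow with plain callables so that it does not depend on any particular audio API; `is_ready`, `timer_value`, `write`, and `wait` are all hypothetical names introduced for illustration.

```python
def output_when_allowed(is_ready, timer_value, write, samples, wait=lambda: None):
    """Hand samples to the audio device once it is ready and the timer reads 0."""
    while not is_ready():        # step S14: wait until no audio is being output
        wait()
    while timer_value() != 0:    # step S15: wait until the pause timer reaches "0"
        wait()
    write(samples)               # step S16: give the data to the audio device
```

With the default `wait`, the loops poll continuously; in practice `wait` would yield to other tasks (for example a short sleep), which suits the multitasking OS the description has in mind.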
[0055]
When the digital audio data for “A (shi) tawa” generated by the speech synthesis unit 7 has been given to the audio device 8, the synthesis parameter generation unit 5 determines whether processing has reached the end of one sentence, indicated by “//” (step S17 in FIG. 4). If the end of the sentence has not been reached, as in this example, the subsequent “Setsumeikai <ga> / arima (su) ..... //” is processed from step S13 in the same way as “A (shi) tawa.” was.
[0056]
In this step S13, the synthesis parameters for “Setsumeikai <ga> / arima (su)” and the digital audio data based on them are generated, and the five pause symbols “.....” that follow are given from the synthesis parameter generation unit 5 to the timer setting unit 12.
[0057]
The timer setting unit 12 receives the pause symbols “.....” and holds them behind the one pause symbol “.” following “A (shi) tawa” that it is already holding.
Meanwhile, when the audio output of “A (shi) tawa” ends, the audio device 8 sets the ready signal 13 to true. The timer setting unit 12 then takes out the oldest of the pause symbols (pause symbol strings) it is holding at that time, namely the one pause symbol “.” following “A (shi) tawa”, and sets in the timer 11 the value “200”, representing the 200 ms pause section that corresponds to that pause symbol. On a PC or EWS, this operation of the timer setting unit 12 is implemented as interrupt processing for the interrupt generated when audio output ends (audio output end interrupt).
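The holding and release behaviour of the timer setting unit 12 described above is first-in, first-out: pause symbol strings are queued as they arrive, and the oldest one is converted to milliseconds and loaded into the timer each time audio output ends. A minimal sketch, assuming 200 ms per pause symbol; the class name, method names, and the `set_timer_ms` callback are illustrative and not taken from the patent.

```python
from collections import deque


class TimerSettingUnit:
    """Illustrative model of timer setting unit 12 as a FIFO of pause symbol strings."""

    def __init__(self, set_timer_ms, pause_ms_per_symbol=200):
        self._pending = deque()            # held pause symbol strings, oldest first
        self._set_timer_ms = set_timer_ms  # callback that loads a value into the timer
        self._pause_ms_per_symbol = pause_ms_per_symbol

    def hold(self, pause_symbols):
        """Queue a pause symbol string received from the synthesis parameter generator."""
        self._pending.append(pause_symbols)

    def on_audio_output_end(self):
        """Handler for the audio-output-end interrupt: load the oldest held pause."""
        if self._pending:
            oldest = self._pending.popleft()
            self._set_timer_ms(len(oldest) * self._pause_ms_per_symbol)


timer_loads = []                      # records each value that would be set in the timer
unit = TimerSettingUnit(timer_loads.append)
unit.hold(".")                        # pause after the first phrase
unit.hold(".....")                    # pause after the second phrase
unit.on_audio_output_end()            # first phrase finished playing
unit.on_audio_output_end()            # second phrase finished playing
```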
[0058]
The timer 11 counts down every 1 ms.
When the speech synthesis unit 7 has generated the digital audio data for “Setsumeikai <ga> / arima (su)” in step S13, it checks, as described above, whether the audio device 8 is outputting audio (step S14 in FIG. 4).
[0059]
If the audio output for “A (shi) tawa” has already ended, the speech synthesis unit 7 proceeds to step S15; if not, it waits for the output to end.
Assuming here that the audio output for “A (shi) tawa” has ended, the speech synthesis unit 7 determines in step S15 whether the value of the timer 11 is “0”, that is, whether the period set by the timer setting unit 12 (the 200 ms corresponding to the one pause symbol “.” following “A (shi) tawa”) has elapsed since the audio output ended.
[0060]
Then, when the value of the timer 11 becomes “0”, that is, when the pause section (200 ms) has elapsed since the end of the audio output, the speech synthesis unit 7 gives the digital audio data for “Setsumeikai <ga> / arima (su)” that it has generated to the audio device 8 so that it is output from the speaker 9 (step S16 in FIG. 4).
[0061]
In this way, even after audio output has ended, output of the next digital audio data waits until the value of the timer 11 becomes “0”, that is, until the time set by the timer setting unit 12 at the end of the audio output, corresponding to the pause section that follows the audio just output, has elapsed. The output of the audio device 8 remains in a pause state from the end of audio output until the next digital audio data is supplied, which is equivalent to generating a pause.
[0062]
When, in step S16, the digital audio data for “Setsumeikai <ga> / arima (su)” has been given from the speech synthesis unit 7 to the audio device 8, the processing of one sentence is complete, so the process proceeds from step S17 to step S18 and control returns to the input unit 1.
[0063]
In step S18, the input unit 1 determines whether processing has reached the end of the input text. If it has not, as in this example, the process returns to step S11. In step S11, the input unit 1 cuts out the next sentence, "Thank you.", and the same processing as described above is then performed.
[0064]
In the second embodiment described above as well, the pause section is excluded from synthesis parameter generation by the synthesis parameter generation unit 5, so the speech synthesis unit 7 performs no synthesis filtering for the pause section. The processing time required for speech synthesis can therefore be reduced.
[0065]
Note that the present invention is not limited to the embodiments described above. In the embodiments, “0” is used as the pause data. Depending on the specifications of the D/A converter in the audio device 8, however, the analog signal may not become “0” even when “0” data is input. The pause data therefore need not be limited to “0”; any digital data that makes the analog signal “0” may be used as pause data.
In short, the present invention can be variously modified and implemented without departing from the gist thereof.
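As an illustration of the remark above about pause data other than “0”: in linear PCM, signed sample formats represent silence as 0, while unsigned formats represent it as the mid-scale value (128 for unsigned 8-bit). The patent does not name any sample format, so the helper below is a hypothetical example of choosing the pause value from the converter's format rather than part of the described apparatus.

```python
def silence_sample(bits, signed):
    """Digital value that corresponds to analog zero for a linear PCM format."""
    return 0 if signed else 1 << (bits - 1)


def make_pause_data(n_samples, bits=16, signed=True):
    """Pause data filled with the silence value for the given format."""
    return [silence_sample(bits, signed)] * n_samples


unsigned_8bit_pause = make_pause_data(4, bits=8, signed=False)
signed_16bit_pause = make_pause_data(4)
```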
[0066]
[Effects of the Invention]
As described above, according to the present invention, the synthesis parameters (phoneme parameters and prosody parameters) contain no data for pause sections, so the sound source generation and synthesis filtering, which take the most processing time, are not executed for pause sections, and no audible signal is output during the time (period) corresponding to a pause section. Speech including pause sections is thus synthesized correctly, while the processing time required for speech synthesis is shortened by the amount that synthesis filtering of the pause sections would have taken. This yields a significant practical benefit: speech can be synthesized in real time by software processing alone, even under a multitasking OS on a personal computer (PC) or engineering work station (EWS).
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 2 is a flowchart for explaining the flow of processing in the first embodiment.
FIG. 3 is a block diagram of a speech synthesizer according to a second embodiment of the present invention.
FIG. 4 is a flowchart for explaining the flow of processing in the second embodiment.
[Explanation of symbols]
1 ... input unit, 2 ... word dictionary, 3 ... language processing unit, 4 ... speech unit file, 5 ... synthesis parameter generation unit, 6 ... pause generation unit, 7 ... speech synthesis unit, 8 ... audio device (audio output means), 9 ... speaker, 11 ... timer, 12 ... timer setting unit, 13 ... ready signal.

Claims

A speech synthesis method for synthesizing speech, characterized by:
generating a speech symbol string including a pause symbol from an input character code string;
cutting out a phoneme sequence that does not include the pause symbol portion by removing the pause symbol portion from the generated speech symbol string, generating phoneme parameters corresponding to the cut-out phoneme sequence, and generating prosody parameters according to prosodic information corresponding to the phoneme sequence;
generating speech waveform data by synthesis filtering from the generated phoneme parameters and prosody parameters, which do not include parameters for the pause symbol portion; and
separately generating, based on the sampling frequency of an audio output device that outputs the corresponding audio from the generated speech waveform data, pause data that produces silence for a time corresponding to the pause length of the pause symbol portion excluded from the generation of the phoneme parameters and prosody parameters, and adding the pause data to the speech waveform data generated by the synthesis filtering to generate speech.

A speech synthesizer for synthesizing speech, characterized by comprising:
language processing means for generating a speech symbol string including a pause symbol from an input character code string;
synthesis parameter generation means for cutting out a phoneme sequence that does not include the pause symbol portion by removing the pause symbol portion from the generated speech symbol string, generating phoneme parameters corresponding to the cut-out phoneme sequence, and generating prosody parameters according to prosodic information corresponding to the phoneme sequence;
speech synthesis means for generating speech waveform data by synthesis filtering from the phoneme parameters and prosody parameters generated by the synthesis parameter generation means, which do not include parameters for the pause symbol portion;
audio output means for outputting the corresponding audio from the speech waveform data generated by the speech synthesis means; and
pause generation means for generating, based on the sampling frequency of the audio output means, pause data that produces silence for a time corresponding to the pause length of the pause symbol portion excluded from the generation of the phoneme parameters and prosody parameters, and adding the pause data to the speech waveform data generated by the speech synthesis means,
wherein the audio output means outputs the corresponding audio from the speech waveform data to which the pause data has been added by the pause generation means.