JP3712227B2

JP3712227B2 - Speech synthesis apparatus, data creation method in speech synthesis method, and speech synthesis method

Info

Publication number: JP3712227B2
Application number: JP2000005380A
Authority: JP
Inventors: 達哉京光; 康一小島
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2000-01-14
Filing date: 2000-01-14
Publication date: 2005-11-02
Anticipated expiration: 2020-01-14
Also published as: JP2001195080A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis method that gives the receiver a concrete image of the sender of an e-mail or the like, while allowing the electronic data to be sent and received at a reduced data size. SOLUTION: In this speech synthesis method, a sender creates text data TX to be synthesized and speech model data MD serving as the model for the synthesis, both as computer-readable electronic data. A receiver, who receives the text data TX and the speech model data MD together or separately, performs speech synthesis based on them and has the text data TX read aloud. The speech model data MD comprises speech database designation data, which designates specific speech data Dxn from among the many speech data stored in a speech database held in common by the sender and the receiver, and phoneme conversion data, which converts the phonemes of the designated speech data Dxn into those intended by the sender.

Description

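[0001]
[Technical field of the invention]
The present invention relates to a speech synthesis method, and in particular to a speech synthesis method in which text data sent as electronic data by e-mail or the like is synthesized and read aloud in the voice of, for example, the creator of that text data.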
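[0002]
[Prior art]
With the spread of the Internet in recent years, documents are frequently exchanged by e-mail. E-mail is normally text-based, and the recipient checks its contents visually on the screen of a personal computer or a PDA (Personal Data Assistant). There is, however, a demand to hear the contents of an e-mail in the sender's voice, or to have the recipient hear them in one's own voice. This is because a voice varies not only with gender but also with age, build, appearance, and personality, and so conveys a person's characteristics well. For the recipient, hearing an e-mail read aloud can call to mind a concrete image of a sender he or she has never met, and can convey emotions of the sender that cannot be expressed between the lines of the written message. The same holds for the sender, who can, for example, convey to the recipient nuances that the text alone cannot express.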
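[0003]
It is also possible to digitally record one's own voice as an audio file and send it to the other party by e-mail, for example by recording the voice in wav format and sending it as an e-mail attachment. The wav format is commonly used on personal computers running Windows (a registered trademark of Microsoft Corporation) as the OS (Operating System), and is characterized by light processing and very good sound quality. However, recording just a few seconds (2 to 3 seconds) of speech in wav format already produces a file of 100 kB (kilobytes) or more. Although it depends on how the comparison is made, storing the same content as a wav audio file requires several thousand times the data size of a txt document file (text data), and in some cases the difference exceeds a factor of ten thousand. Given the capacity of hard disks and floppy disks and the time needed to send and receive e-mail, attaching wav audio files to e-mail is therefore impractical and is rarely done. The same applies to audio files on operating systems other than Windows (for example, the AIFF format).
[0004]
Recently, the audio compression standard MP3 has come into use as a means of compressing audio data, and many MP3-related products are now on sale. MP3 stands for "MPEG-1 Audio Layer-3". By cutting audio data in frequency bands that humans cannot hear (that is, unnecessary audio data), among other techniques, it can compress a file to about 1/10 of the original size while maintaining CD-level sound quality. Even so, about 1/10 is the best that MP3 can do, and the data is still considerably larger than text data. This places a heavy burden on the hardware of the sender's and the recipient's personal computers, and sending and receiving the e-mail takes time.
In addition, audio files in general cannot be edited as easily as text files. When creating an audio file, that is, when recording the voice, mistakes such as hesitating or misspeaking therefore cannot be tolerated.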
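[0005]
Meanwhile, research on speech synthesis that reads text data aloud mechanically has been active in recent years. Among these approaches, corpus-based speech synthesis, which uses a speech database, takes the following steps to synthesize text data with a pronunciation closer to that of a human.
[0006]
[Procedure 1] First, morphological analysis is performed on the text data to be read aloud. This means parsing the text to determine which words and which particles, auxiliary verbs, and inflectional endings it consists of. Specifically, it means correctly parsing the text "僕は日本人です。" ("I am Japanese.") as ／僕／は／日本人／です。／. The word "日本人" (Japanese person) could also be parsed as "日本／人", that is, as "日本" (Japan) and "人" (person). Procedure 1 is the process of recognizing it, without error, as the single word "日本人".
[0007]
[Procedure 2] Based on the words and syntactic information obtained by morphological analysis, the phonemes to use, the accent position of each word, the intonation, the pauses, and so on are determined. In this process, for example, the phonemes for synthesizing the word "日本人" are selected, and the pitch and accent of the selected phonemes are then determined. Take the sound "Ni" in "日本人": written out, as in "にほんじん" (nihonjin) or "ニックネーム" (nickname), the text is the same "に" (ni) even though the words differ, but the phoneme itself differs depending on the word, on the sound that follows, and on the accent of the word. The length, pitch, and intensity of the phoneme differ as well.
In other words, Procedure 2 uses the word and syntactic information to determine the optimum length, pitch, and intensity of each phoneme so that the synthesized speech sounds natural, and further determines the optimum intonation, pauses, and so on for the phoneme sequence.
[0008]
[Procedure 3] The data obtained in Procedure 2 is indexed, and the indexed data is passed to a speech synthesizer, which performs speech synthesis and reads the text data aloud.
[0009]
In this way, text data can be synthesized and read aloud (output as speech) in a specific predetermined voice, such as a male or a female voice. Unlike sending and receiving audio data in wav or MP3 format, the data size can therefore be kept small. Editing is also easy.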
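[0010]
[Problems to be solved by the invention]
However, human voices are so varied that they are used in criminal investigations. With a binary choice between a male and a female voice, no concrete image of the sender of the e-mail comes to mind, and what lies between the lines of the text data cannot be read. Nor can the sender evoke his or her own image in the recipient, or have the recipient read between the lines what the text data cannot express. A binary choice of voice is, moreover, uninteresting. The same applies when text-based electronic data is exchanged on a storage medium such as a floppy disk.
[0011]
What is desired, therefore, is a measure that evokes a concrete image of the sender of an e-mail or the like (the creator of the text data), allows what text data cannot express to be read between the lines, lets the recipient enjoy a variety of voices, and still allows speech to be synthesized from electronic data that is sent and received at a small data size.
The main object of the present invention is to solve the above problems.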
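[0012]
[Means for solving the problems]
In the speech synthesis apparatus of the present invention (claim 1), the transmitting-side device transmits designation data, which designates, from the speech storage means (speech database) of the receiving-side device, the speech data having the maximum similarity to the sender's voice, and feature amount correction data, which is used to correct the speech data designated by the designation data. On receiving these data, the receiving-side device for speech synthesis reads the corresponding speech data from the speech storage means based on the received designation data, corrects the speech data thus read with the received feature amount correction data to create corrected speech data, and performs speech synthesis.
In this configuration, the speech data stored in the speech storage means is read using designation data that designates predetermined speech data whose characteristics resemble the sender's voice. The speech data thus read is then corrected so that it more closely approximates that voice. This makes it possible to synthesize speech, and read data aloud, in a voice closer to the sender's voice (the input speech).
Incidentally, the "speech database DB" of the embodiments corresponds to the "speech storage means" of the claims. The "speech normative data creation unit 11a" of the embodiments corresponds to the "processing means" of the claims that selects the predetermined data having the maximum similarity and creates the feature amount correction data. The "speech synthesis unit 21a" of the embodiments corresponds to the "processing means" of the claims that creates the corrected speech data and performs speech synthesis. The "DB designation data" of the embodiments corresponds to the "designation data" of the claims that designates speech data. The "text data" and "speech index data" of the embodiments correspond to the "data" in "reads data aloud based on the corrected speech data..." in the claims.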
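[0013]
The present invention also provides a data creation method for use in a speech synthesis method in which designation data, which designates from the speech storage means the speech data to be used on the speech synthesis side, and feature amount correction data, which corrects the speech data designated by the designation data, are transmitted to the speech synthesis side, and the speech synthesis side, which has speech storage means with the same contents as the transmitting side, reads aloud the data to be synthesized (the text data and speech index data of the embodiments) based on the received data (the designation data and the feature amount correction data). The data creation method is characterized by determining the predetermined speech data having the maximum similarity to the input speech, determining the designation data that designates this predetermined speech data, and creating feature amount correction data that corrects the determined speech data so that it more closely approximates the input speech (claim 4).
The present invention further provides a speech synthesis method that uses a device on the speech synthesis side having speech storage means with the same contents as the speech storage means on the transmitting side, and reads aloud the data to be synthesized based on the received designation data and feature amount correction data (claim 5).
In this configuration (claims 4 and 5), as with the speech synthesis apparatus described above (claims 1 to 3), the speech data having the maximum similarity to the input speech can be designated so that it is read on the speech synthesis side. In addition, that speech data can be corrected with the feature amount correction data so that it more closely approximates the input speech. Speech can therefore be synthesized in a voice closer to the input voice (the input speech, the sender's voice).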
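[0014]
[Embodiments of the invention]
Embodiments of the speech synthesis apparatus, the data creation method in a speech synthesis method, and the speech synthesis method of the present invention are described in detail below with reference to the drawings. Two embodiments are described: (1) a first embodiment in which ① the designation data (hereinafter "DB designation data") and the phoneme feature amount correction data (hereinafter "feature amount correction data") are held separately from ② the speech index data, and (2) a second embodiment in which ① the DB designation data and the feature amount correction data are held together with ② the speech index data.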
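[0015]
◎ First embodiment;
The first embodiment is described first. In this embodiment, e-mail is sent and received as the electronic data.
FIG. 1 shows the processing performed on the sender side and the receiver side in the speech synthesis method, FIG. 2 shows the creation of the speech normative data, FIG. 3 shows the processing performed on the receiver side, and FIG. 4 shows the hardware configuration.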
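[0016]
≪Hardware configuration≫
As shown in FIG. 4, the speech synthesis method of this embodiment is carried out with at least a sender-side device (transmitting-side device, transmitting side) 10, a receiver-side device (receiving-side device, receiving side, speech synthesis side) 20, and electronic data transfer means 30.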
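[0017]
The sender-side device 10, used by the sender of the e-mail, has a configuration in which a central processing unit 11, an external storage device 12, an input/output device 13, and a DSU (Digital Service Unit) 14 are connected to a bus 15. The central processing unit 11 performs overall control of the sender-side device 10 and includes a speech normative data creation unit 11a (speech normative data creation engine), which creates from input speech the speech normative data MD that serves as the norm for speech synthesis, and a text data creation unit 11b. The external storage device 12 stores various data and programs and holds a speech database DB that stores a large number of speech data Dn used to create the speech normative data MD. Each speech data Dn is a set of phoneme data, as shown in FIG. 2 and elsewhere. A keyboard for entering the text data TX, a sound system for voice input, and the like are connected to the input/output device 13 via I/O devices (not shown). The DSU 14 is an electronic data transmitting and receiving device installed on the user side of a digital line such as ISDN. The sender-side device 10 may be portable or installable in a vehicle.
[0018]
The receiver-side device 20, used by the recipient of the e-mail, has a configuration in which a central processing unit 21, an external storage device 22, an input/output device 23, and a DSU 24 are connected to a bus 25. The central processing unit 21 performs overall control of the receiver-side device 20 and includes a speech synthesis unit 21a (speech synthesis engine) that synthesizes speech from the text data TX based on the speech normative data MD sent from the sender-side device 10. The external storage device 22 stores various data and programs and holds a speech database DB that stores a large number of speech data Dn (sets of phoneme data) used for speech synthesis based on the speech normative data MD. A keyboard, a sound system for outputting the synthesized speech, and the like are connected to the input/output device 23 via I/O devices (not shown). The DSU 24 is an electronic data transmitting and receiving device installed on the user side of a digital line such as ISDN. The speech database DB of the receiver-side device 20 and that of the sender-side device 10 have the same contents. Like the sender-side device 10, the receiver-side device 20 may be portable or installable in a vehicle.
[0019]
The electronic data transfer means 30 is here a digital communication line such as ISDN. It may also be an analog public telephone line or the like, and may be wired or wireless. It may also be a means of transferring electronic data in which an electronic data storage medium such as a floppy disk is delivered by courier or by mail. The electronic data transfer means 30 further includes media-based forms, such as electronic data distributed (downloaded) from a site on the Internet or distributed with a magazine. All of these forms correspond to "transmission" and "reception" in the present invention.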
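[0020]
≪Speech synthesis method≫
As shown in FIG. 1, the speech synthesis method of this embodiment is divided into processing performed on the sender side of the e-mail and processing performed on the receiver side.
On the sender side, the speech normative data is created (S1), the text data is created (S2), and the created speech normative data and text data are transmitted (S3, transmission of the e-mail). The creation of the speech normative data in step S1 and the creation of the text data in step S2 may be performed in the reverse order. On the receiver side, the speech normative data and text data are received (S5, reception of the e-mail), the speech data is read and corrected (S6), the speech index data is created (S7), and speech is synthesized (S8). The reading and correction of the speech data in step S6 and the creation of the speech index data in step S7 may be performed in the reverse order. On the receiver side, the speech normative data MD is also stored as necessary.
The processing on the sender side and the processing on the receiver side are described in detail below.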
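[0021]
[Sender-side processing]
In the speech synthesis method of the first embodiment, the sender of the e-mail uses the sender-side device 10 illustrated in FIG. 4 to create the speech normative data MD and the text data TX, and sends them to the recipient by e-mail. In this embodiment, the sender is assumed to create speech normative data MD that causes speech to be synthesized in a voice resembling the sender's own.
The speech normative data MD consists of DB designation data xn and feature amount correction data FD (see FIG. 3). The feature amount correction data FD corresponds to the phoneme conversion data in the claims.
[0022]
(1) Creation of the speech normative data; First, the sender inputs his or her own voice into the sender-side device 10 through the input/output device 13 (S11). What is spoken is either a document (fixed sentences) with no bias in its phoneme sequences or the 50 sounds of the Japanese syllabary. The speech normative data creation unit 11a performs feature extraction on the input speech and extracts a feature amount i for each phoneme of the input speech (S12). In this embodiment, the feature amount i consists of the fundamental frequency and the spectral envelope, where the spectral envelope is the overall shape of the speech spectrum. The features shown in the following table can also be used as feature amounts.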
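[0023]
[Table 1]

[0024]
Next, the speech normative data creation unit 11a calculates the similarity between the feature amount i of each phoneme of the input speech and the feature amount of each phoneme of the speech data Dn (n = 1, 2, ..., n) stored in the speech database DB, and compares them overall (S13). It then determines, from the speech database DB, the specific speech data Dxn having the maximum similarity to the input speech (S14). This speech data Dxn is hereinafter referred to as the standard speech data Ds. The data (an address or the like) for identifying the standard speech data Ds in the speech database DB is the DB designation data xn. The speech databases DB of the receiver-side device 20 and the sender-side device 10 are identical.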
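Steps S13 and S14 amount to a nearest-match search over the speech database. The sketch below is a minimal illustration under stated assumptions: the text does not specify the similarity measure, so a negative mean Euclidean distance over shared phonemes is assumed, and the database keys, feature values, and function names are hypothetical.

```python
import math

def similarity(input_feats, entry_feats):
    """Assumed measure: negative mean Euclidean distance over shared phonemes."""
    shared = input_feats.keys() & entry_feats.keys()
    if not shared:
        return float("-inf")
    total = sum(math.dist(input_feats[p], entry_feats[p]) for p in shared)
    return -total / len(shared)

def choose_standard_voice(input_feats, voice_db):
    """Return the key (standing in for DB designation data xn) of the best match."""
    return max(voice_db, key=lambda xn: similarity(input_feats, voice_db[xn]))

# Hypothetical database: entry -> phoneme -> (fundamental frequency in Hz, envelope value)
voice_db = {
    "D1": {"a": (110.0, 0.30), "k": (105.0, 0.10)},
    "D2": {"a": (210.0, 0.55), "k": (200.0, 0.20)},
    "D3": {"a": (130.0, 0.35), "k": (125.0, 0.15)},
}
speaker = {"a": (128.0, 0.33), "k": (121.0, 0.12)}  # feature amounts i of the input speech

xn = choose_standard_voice(speaker, voice_db)
print(xn)  # D3
```

Only the key `xn` needs to be transmitted, because both sides hold the same database. In practice the features would probably be normalized before distances are taken, since here the frequency term dominates.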
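[0025]
The speech normative data creation unit 11a then creates feature amount correction data FD for each phoneme of the standard speech data Ds, based on the per-phoneme feature amounts i obtained earlier (S15). The feature amount correction data FD and the DB designation data xn together constitute the speech normative data MD, which is the information on the sender's voice.
[0026]
The method of creating the feature amount correction data FD is now described further.
① Correction of the fundamental frequency; A value α is found such that the difference between the fundamental frequency fk, a feature amount of phoneme k, and the fundamental frequency f of the input speech is minimized within the range |fk − f| < a, and fk − α replaces fk as the fundamental frequency. That is, α is calculated so that fk approximates or equals f within a certain range (range a); α is the difference between the fundamental frequency fk of phoneme k and the fundamental frequency f of the input speech. The range a also acts as a threshold that keeps |fk − f| from becoming too large. Because of this threshold, the corrected fk (= fk − α) does not depart greatly from the uncorrected fk and retains its characteristics. For example, if the uncorrected fk is a fundamental frequency at which the phoneme is heard as "あ" (a), then a is the limit (threshold) of the range within which speech using the corrected fk (= fk − α) is still heard as "あ".
[0027]
② Correction of the spectral envelope; The feature amount representing the spectral envelope of phoneme k is made to approximate the feature amount of the input speech. The amount of correction applied to the spectral envelope of phoneme k is, however, kept within a predetermined threshold range.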
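The two corrections above can be sketched as a clamped difference. This is one possible reading, not the patent's stated implementation: it assumes that the correction α equals fk − f when that difference is within the threshold and is otherwise limited to ±a, and that the spectral envelope is a short list of coefficients corrected the same way. All names and values are hypothetical.

```python
def clamp(x, limit):
    """Limit x to the interval [-limit, +limit]."""
    return max(-limit, min(limit, x))

def make_correction(standard, target, f0_limit, env_limit):
    """Feature amount correction for one phoneme: (standard - target), clamped."""
    return {
        "f0": clamp(standard["f0"] - target["f0"], f0_limit),  # alpha
        "env": [clamp(s - t, env_limit)
                for s, t in zip(standard["env"], target["env"])],
    }

def apply_correction(standard, corr):
    """Corrected phoneme features: fk - alpha, and likewise per envelope coefficient."""
    return {
        "f0": standard["f0"] - corr["f0"],
        "env": [s - c for s, c in zip(standard["env"], corr["env"])],
    }

standard = {"f0": 130.0, "env": [0.5, 0.25]}   # phoneme k of Ds
target = {"f0": 128.0, "env": [0.25, 1.0]}     # same phoneme in the input speech

fd_k = make_correction(standard, target, f0_limit=10.0, env_limit=0.5)
corrected = apply_correction(standard, fd_k)
print(fd_k)       # {'f0': 2.0, 'env': [0.25, -0.5]}
print(corrected)  # {'f0': 128.0, 'env': [0.25, 0.75]}
```

The second envelope coefficient stops at 0.75 rather than reaching 1.0, which shows the threshold keeping the corrected phoneme close to the original.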
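[0028]
(2) Creation of the text data; The text data TX to be synthesized is created by the text data creation unit 11b from characters keyed in through the input/output device 13 (S2 in FIG. 1(a)). Here, the content of the text data TX is "こんにちは…" ("Hello...").
[0029]
(3) Transmission of the speech normative data and text data; The created speech normative data MD and text data TX are sent to the recipient by e-mail (S3 in FIG. 1(a)). The speech normative data MD is sent as an e-mail attachment. The speech normative data MD and text data TX that are sent consist of characters and numbers, and their data size is far smaller than that of audio data in wav or MP3 format. The burden on the hardware of both the sender and the recipient is therefore small, and the time required for sending and receiving can be shortened.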
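[0030]
[Receiver-side processing]
In the speech synthesis method of the first embodiment, the recipient of the e-mail uses the receiver-side device 20 illustrated in FIG. 4 to synthesize speech as follows (see FIG. 3).
[0031]
(1) Reading and correction of the standard speech data; The DSU 24 of the receiver-side device 20 receives the e-mail via the electronic data transfer means 30 (S51). Next, the speech synthesis unit 21a of the receiver-side device 20 searches the speech database DB based on the DB designation data xn in the speech normative data MD attached to the e-mail, and reads the corresponding speech data Dxn (that is, the standard speech data Ds). It then corrects the standard speech data Ds based on the feature amount correction data FD in the speech normative data MD to create corrected speech data Ds' (S52). The corrected speech data Ds' reflects the characteristics of the sender's voice and consists of speech (phonemes) approximating the sender's voice.
[0032]
(2) Creation of the speech index data; For the text data TX in the received e-mail, whose content is "こんにちは…", the speech synthesis unit 21a performs text analysis and prosody prediction and creates speech index data ID (S53). The speech index data ID has the structure shown in the callout in FIG. 3. Text analysis analyzes the text data to be synthesized and output, splits the unsegmented text into words (that is, into morphemes), and determines accent positions, pause lengths, and so on. Prosody prediction uses the results of the text analysis to predict, phoneme by phoneme, the prosodic parameters of the phoneme sequence to be synthesized and output.
The phonemes "k,o,n,n,i,ch,i,w,a…" written in the speech index data ID are the phoneme designation data.
[0033]
(3) Speech synthesis; Speech is synthesized based on the corrected speech data Ds' created in (1) and the speech index data ID created in (2) (S54). Specifically, using the corrected phonemes of the corrected speech data Ds', a phoneme sequence is assembled in the order given by the phoneme designation data of the speech index data ID (k,o,n,n,i,ch,i,w,a…), and speech is synthesized from that sequence to read the text data TX aloud. In this way, the recipient can hear the contents of the e-mail in the sender's voice (a voice equal to or approximating the sender's voice).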
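The receiver-side steps (1) to (3) can be sketched as a lookup, a per-phoneme correction, and an ordered assembly. The sketch below is illustrative only: each phoneme is reduced to a single fundamental-frequency value, the database contents and field names are hypothetical, and actual waveform generation is left out.

```python
# Hypothetical shared database: entry -> phoneme -> fundamental frequency (Hz)
VOICE_DB = {
    "D3": {"k": 125.0, "o": 130.0, "n": 128.0, "i": 135.0,
           "ch": 126.0, "w": 127.0, "a": 130.0},
}

def build_corrected_voice(md, voice_db):
    """Read Ds by the designation data xn, then subtract each correction to get Ds'."""
    standard = voice_db[md["xn"]]
    return {ph: f0 - md["fd"].get(ph, 0.0) for ph, f0 in standard.items()}

def assemble(phoneme_order, corrected):
    """Line up corrected phonemes in the order given by the phoneme designation data."""
    return [(ph, corrected[ph]) for ph in phoneme_order]

md = {"xn": "D3", "fd": {"k": 4.0, "a": 2.0}}   # received speech normative data MD
phoneme_order = ["k", "o", "n", "n", "i", "ch", "i", "w", "a"]  # from the index data ID

sequence = assemble(phoneme_order, build_corrected_voice(md, VOICE_DB))
print(sequence[0], sequence[-1])  # ('k', 121.0) ('a', 128.0)
```

A real synthesizer would turn `sequence` into audio; the point here is only that the designation data selects Ds, the correction data turns it into Ds', and the index data fixes the order.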
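[0034]
The speech normative data MD is specific to an individual. If it is stored in the external storage device 22, then the next time an e-mail is received from the same sender, its contents can be synthesized and read aloud in that sender's voice even if no speech normative data MD is attached. It is advisable to store the speech normative data MD under a unique identifier such as the sender's e-mail address or telephone number, so that for e-mail from the same sender the speech normative data MD for that sender can be retrieved immediately and the contents of the e-mail synthesized and read aloud.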
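Storing the speech normative data under the sender's address is essentially a keyed cache. A minimal sketch, assuming an in-memory dictionary stands in for the external storage device 22; the class and method names are hypothetical.

```python
class VoiceModelStore:
    """Keeps the latest speech normative data MD per sender identifier."""

    def __init__(self):
        self._by_sender = {}

    def resolve(self, sender, attached_md=None):
        """Store MD if one is attached, then return the MD known for this sender."""
        if attached_md is not None:
            self._by_sender[sender] = attached_md
        return self._by_sender.get(sender)  # None if the sender is unknown

store = VoiceModelStore()
first = store.resolve("taro@example.com", {"xn": "D3", "fd": {"a": 2.0}})
later = store.resolve("taro@example.com")      # no attachment this time
unknown = store.resolve("hanako@example.com")  # never sent an MD
```

When `resolve` returns `None`, the receiver would have to fall back to some default voice; the text does not say what happens in that case.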
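[0035]
By synthesizing speech in this way, the recipient can, for example, read between the lines what text data cannot express, and the size of the electronic data sent and received is far smaller than that of an audio file. Moreover, because the speech normative data is stored in the receiver-side device, the same sender need only send the text data (speech index data) when sending e-mail again, so the size of the data sent and received can be reduced further. Furthermore, when creating the speech normative data, the sender does not need to read the content of the text data itself aloud; reading the fixed sentences or the 50 sounds is enough. Creating the speech normative data in an office therefore does not disturb those nearby, unlike reading the content of the text data itself aloud. In addition, speech normative data, once created, can be used repeatedly, so there is no need to read the fixed sentences or the 50 sounds aloud each time an e-mail is sent. Still less is there any need, unlike when sending an audio file by e-mail, to read the content of the text data aloud each time that content changes. Also unlike an audio file, a text file is extremely easy to correct and edit. Finally, the recipient can check the contents of an e-mail while doing something else, such as driving a car.
[0036]
In the embodiment above, the speech index data is created on the receiver side, but it may instead be created on the sender side and attached to the e-mail. Creating the speech index data on the sender side lets the sender freely edit the length, pitch, intensity, and so on of the phonemes, so that the speech (accent, intonation, and so on) is synthesized more as the sender intends. When the speech index data is created on the sender side and attached to the e-mail in this way, the e-mail need not contain the text data to be synthesized, because speech can be synthesized by assembling the phoneme sequence in the order given by the phoneme designation data of the speech index data. In this case, the speech index data is derived from the text data, and it is this speech index data that corresponds to the text data recited in the claims.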
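[0037]
◎ Second embodiment;
Next, the speech synthesis method of the second embodiment is described. Members and elements identical to those of the first embodiment are described with reference to the drawings used for the first embodiment and given the same reference signs, and their description is omitted. FIG. 5 shows the processing performed on the sender side and the receiver side in the speech synthesis method, FIG. 6 shows the relationship between the data used in the speech synthesis method, and FIG. 7 shows the processing performed on the receiver side.
In the second embodiment, the speech index data is created on the sender side, and merged speech normative data, in a format that writes the speech index data and the speech normative data together, is sent to the receiver side by e-mail.
[0038]
≪Hardware configuration≫
The speech synthesis method of the second embodiment can use the hardware of the first embodiment as it is, so the description of the hardware configuration is omitted. In the second embodiment, however, the speech index data ID is created by the sender-side device 10. The speech normative data creation unit 11a of the sender-side device 10 is therefore capable of creating the speech index data ID.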
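[0039]
≪Speech synthesis method≫
Like that of the first embodiment, the speech synthesis method of the second embodiment is divided into processing performed on the sender side of the e-mail and processing performed on the receiver side.
On the sender side, the speech normative data is created (S101), the text data is created (S102), the speech index data is created from the text data (S103), the speech normative data and the speech index data are merged, that is, the merged speech normative data is created (S104), and the merged speech normative data is transmitted (S105, transmission of the e-mail). On the receiver side, the merged speech normative data is received (S106, reception of the e-mail), the speech data is read and corrected (S107), and speech is synthesized (S8).
The processing on the sender side and the processing on the receiver side are described in detail below.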
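[0040]
[Sender-side processing]
(1) Creation of the speech normative data, (2) creation of the text data, and (3) creation of the speech index data are each performed. The procedures are the same as those described for the first embodiment, so their description is omitted. As noted above, the speech index data ID is created on the sender side, which differs from the first embodiment.
[0041]
(4) Creation of the merged speech normative data; The speech normative data MD and the speech index data ID are merged to create merged speech normative data CMD. As shown in FIG. 6(c), the merged speech normative data CMD contains the DB designation data xn, the entire contents of the speech index data ID, and the feature amount correction data FD corresponding to the phoneme designation data of the speech index data ID.
[0042]
(5) Transmission of the merged speech normative data; The created merged speech normative data CMD is sent to the recipient by e-mail (S105 in FIG. 5(a)). The merged speech normative data CMD consists of characters and numbers, and its data size is far smaller than that of audio data in wav or MP3 format. The burden on the hardware of both the sender and the recipient is therefore small, and the time required for sending and receiving can be shortened. The e-mail does not need to contain the text data TX, because the phoneme sequence can be assembled and speech synthesized from the phoneme designation data in the merged speech normative data CMD. The text data TX may nevertheless be included in the e-mail so that the recipient can check its contents visually.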
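The merge in step (4) can be pictured as combining the two records and keeping only the corrections for phonemes that actually occur in the index data. The layout below is an assumption for illustration; the actual format is the one in FIG. 6(c), which is not reproduced here, and the field names and JSON serialization are hypothetical.

```python
import json

def merge(md, index_data):
    """Build CMD: xn, the whole index data, and only the corrections it needs."""
    used = {rec["phoneme"] for rec in index_data}
    return {
        "xn": md["xn"],
        "index": index_data,
        "fd": {ph: c for ph, c in md["fd"].items() if ph in used},
    }

md = {"xn": "D3", "fd": {"k": 4.0, "o": 1.0, "a": 2.0, "s": 3.0, "e": -1.5}}
index_data = [{"phoneme": p, "duration_ms": 80}
              for p in ["k", "o", "n", "n", "i", "ch", "i", "w", "a"]]

cmd = merge(md, index_data)
payload = json.dumps(cmd, ensure_ascii=False).encode("utf-8")
print(sorted(cmd["fd"]))        # ['a', 'k', 'o']
print(len(payload), "bytes")    # a few hundred bytes of plain text
```

Dropping the corrections for "s" and "e", which this message never uses, is what lets the merged data be smaller than sending the full correction set.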
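[0043]
[Receiver-side processing]
In the speech synthesis method of the second embodiment, the recipient of the e-mail uses the receiver-side device 20 shown in FIG. 4 to synthesize speech as follows (see FIG. 7).
[0044]
(1) Reading and correction of the standard speech data; The DSU 24 of the receiver-side device 20 receives the e-mail via the electronic data transfer means 30 (S151). Next, the speech synthesis unit 21a of the receiver-side device 20 searches the speech database DB based on the DB designation data xn in the merged speech normative data CMD attached to the e-mail, and reads the corresponding speech data Dxn (that is, the standard speech data Ds). It then corrects the standard speech data Ds based on the feature amount correction data FD in the merged speech normative data CMD to create corrected speech data Ds' (S152). The corrected speech data Ds' reflects the characteristics of the sender's voice and consists of speech (phonemes) approximating the sender's voice.
[0045]
(2) Speech synthesis; Speech is synthesized based on the corrected speech data Ds' and the portion of the merged speech normative data CMD that derives from the speech index data ID (S153). Specifically, using the corrected phonemes of the corrected speech data Ds', a phoneme sequence is assembled in the order given by the phoneme designation data of the merged speech normative data CMD (k,o,n,n,i,ch,i,w,a…), and speech is synthesized from that sequence to read aloud the text data TX that the sender created in step S102. In this way, the recipient can hear the contents of the e-mail in the sender's voice (a voice equal to or approximating the sender's voice).
[0046]
By synthesizing speech in this way, as in the first embodiment, what text data cannot express can be read between the lines, and the size of the electronic data sent and received is smaller than when audio files are sent and received. Moreover, because feature amount correction data need not be prepared for every phoneme, the data size can be made even smaller than in the first embodiment. In this embodiment, the portion of the merged speech normative data sent to the recipient that derives from the speech index data corresponds to the text data in the claims.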
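[0047]
The present invention described above is not limited to the embodiments and can be modified and implemented in many ways.
For example, the voice to be imitated need not be the sender's own; it may be that of a family member, a friend, or someone else. Even if the voice is one the recipient has never heard, the recipient can listen to the contents of the electronic data (text data) in a variety of voices and enjoy the novelty.
[0048]
[Effects of the invention]
According to the present invention (the speech synthesis apparatus, the data creation method in a speech synthesis method, and the speech synthesis method), the contents of an e-mail can be heard in the voice of the sender or another person, so the recipient can form a concrete image of that person and read between the lines what text data cannot express. The recipient can also enjoy listening to the contents of the text data in a variety of voices. Moreover, unlike sending and receiving audio files, the data size can be kept small. Furthermore, unlike sending and receiving audio files, the sender does not need to read the content of the text data itself aloud. In addition, the designation data and the feature amount correction data (the speech normative data) are electronic data and can therefore be reused. Because they do not depend on the content of the text data, they need not be created again to match each new piece of text data.
[0049]
In particular, according to the invention of claim 2, the speech index data is created by the receiving-side device, so the load on the transmitting-side device can be reduced. According to the invention of claim 3, the speech index data is created by the transmitting-side device, so the load on the receiving-side device can be reduced. Further, according to the present invention, the size of the data sent and received can be reliably reduced, and the receiver side can reliably be made to synthesize speech in the voice the sender intends, so that the contents of the text data are heard in that voice. According to the invention of claim 4, the designation data and the feature amount correction data used on the speech synthesis side can be determined and created appropriately. According to the invention of claim 5, speech can be synthesized so as to approximate the input speech more closely, based on the designation data and the feature amount correction data, and the data to be read aloud can be read aloud.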
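[Brief description of the drawings]
[FIG. 1] A flowchart showing the processing of the speech synthesis method of the first embodiment, where (a) shows the sender-side processing and (b) shows the receiver-side processing.
[FIG. 2] A flowchart showing, together with the data structures, the creation of the speech normative data performed on the sender side in the speech synthesis method of the first embodiment.
[FIG. 3] A flowchart showing, together with the data structures, the processing performed on the receiver side in the speech synthesis method of the first embodiment.
[FIG. 4] A block diagram showing the hardware configuration to which the speech synthesis method of the first embodiment is applied.
[FIG. 5] A flowchart showing the processing of the speech synthesis method of the second embodiment, where (a) shows the sender-side processing and (b) shows the receiver-side processing.
[FIG. 6] A diagram explaining the relationship between the data in the speech synthesis method of the second embodiment, where (a) shows the structure of the speech index data, (b) shows the structure of the speech normative data, and (c) shows the structure of the merged speech normative data.
[FIG. 7] A diagram showing, together with the data structures, the processing performed on the receiver side in the speech synthesis method of the second embodiment.
[Explanation of reference signs]
10 … sender-side device (sender)
20 … receiver-side device (receiver)
MD … speech normative data
TX … text data
DB … speech database
Dn … speech data
Ds, Dxn … standard speech data (specific speech data)
xn … DB designation data (speech database designation data)
FD … feature amount conversion data (phoneme conversion data)
[0001]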
BACKGROUND OF THE INVENTION
The present invention relates to a speech synthesis method, and more particularly to a speech synthesis method in which text data sent as electronic data by e-mail or the like is synthesized and read aloud in the voice of, for example, the creator of the text data.
[0002]
[Prior art]
In recent years, the spread of the Internet has led to frequent exchange of documents by e-mail. E-mail is normally text-based, and the recipient checks its contents visually on the screen of a personal computer or a PDA (Personal Data Assistant). However, there is a demand to hear the contents of an e-mail in the sender's voice, or to have the recipient hear them in one's own voice. This is because a voice differs not only by gender but also by age, physique, appearance, personality, and so on, and expresses a person's characteristics well. For the recipient, hearing the e-mail read aloud can call to mind a concrete image of a sender he or she has never met, and can convey emotions of the sender that cannot be expressed between the lines of the written message. The same applies to the sender, who can, for example, convey to the recipient nuances that the text alone cannot express.
[0003]
It is also possible to digitally record one's own voice as an audio file and send it to the other party by e-mail, for example by recording the voice in wav format and sending it as an e-mail attachment. The wav format is commonly used on personal computers running Windows (a registered trademark of Microsoft Corporation) as the OS (Operating System), and features light processing and very good sound quality. However, recording only a few seconds (2 to 3 seconds) of speech in wav format already produces a file of 100 kB (kilobytes) or more. Although it depends on how the comparison is made, storing the same content as a wav audio file requires several thousand times the data size of a txt document file (text data), and in some cases the difference exceeds a factor of ten thousand. Given the capacity of hard disks and floppy disks and the time required to send and receive e-mail, attaching a wav audio file to an e-mail is therefore impractical and is rarely done. The same applies to audio files on operating systems other than Windows (for example, the AIFF format).
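The size gap described above can be checked with simple arithmetic. The sketch below is only an illustration: the recording parameters (22.05 kHz, 16-bit, mono), the sample sentence, and the UTF-8 encoding are assumptions, not values given in the text.

```python
def pcm_bytes(seconds, sample_rate_hz=22_050, bits_per_sample=16, channels=1):
    """Bytes of raw PCM audio in an uncompressed recording (file header ignored)."""
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)

# Assumed example: a 3-second greeting, stored as audio and as text.
wav_size = pcm_bytes(3)                 # 3 * 22050 * 2 = 132300 bytes
mp3_size = wav_size // 10               # "about 1/10" after MP3 compression
text_size = len("こんにちは、お元気ですか。".encode("utf-8"))  # 13 characters, 3 bytes each

print(f"wav : {wav_size} bytes ({wav_size / text_size:.0f}x the text)")
print(f"mp3 : {mp3_size} bytes ({mp3_size / text_size:.0f}x the text)")
print(f"text: {text_size} bytes")
```

Under these assumptions the uncompressed audio is roughly 3,400 times the size of the text, and even at one tenth of that size the compressed audio remains several hundred times larger.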
[0004]
Recently, the audio compression standard MP3 has come into use as a means of compressing audio data, and many MP3-related products are now on sale. MP3 is an abbreviation of "MPEG-1 Audio Layer-3". By cutting audio data in frequency bands that humans cannot hear (that is, unnecessary audio data), among other techniques, it can compress a file to about 1/10 of the original size while maintaining sound quality equivalent to a CD. Even so, about 1/10 is the best that MP3 can do, and the data is still considerably larger than text data. This places a heavy burden on the hardware of the sender's and the recipient's personal computers, and sending and receiving the e-mail takes time.
In addition, audio files in general cannot be edited as easily as text files. When creating an audio file, that is, when recording the voice, mistakes such as hesitating or misspeaking therefore cannot be tolerated.
[0005]
Meanwhile, research on speech synthesis that reads text data aloud mechanically has been active in recent years. Among these approaches, corpus-based speech synthesis, which uses a speech database, takes the following steps to synthesize text data with a pronunciation closer to that of a human.
[0006]
[Procedure 1] First, morphological analysis is performed on the text data to be read aloud. This means parsing the text to determine which words and which particles, auxiliary verbs, and inflectional endings it consists of. Specifically, it means correctly parsing the text "僕は日本人です。" ("I am Japanese.") as ／僕／は／日本人／です。／. The word "日本人" (Japanese person) could also be parsed as "日本／人", that is, as "日本" (Japan) and "人" (person). Procedure 1 is the process of recognizing it, without error, as the single word "日本人".
[0007]
[Procedure 2] Based on the words and syntactic information obtained by morphological analysis, the phonemes to use, the accent position of each word, the intonation, the pauses, and so on are determined. In this process, for example, the phonemes for synthesizing the word "日本人" are selected, and the pitch and accent of the selected phonemes are then determined. Take the sound "Ni" in "日本人": written out, as in "にほんじん" (nihonjin) or "ニックネーム" (nickname), the text is the same "に" (ni) even though the words differ, but the phoneme itself differs depending on the word, on the sound that follows, and on the accent of the word. The length, pitch, and intensity of the phoneme differ as well.
In other words, Procedure 2 uses the word and syntactic information to determine the optimum length, pitch, and intensity of each phoneme so that the synthesized speech sounds natural, and further determines the optimum intonation, pauses, and so on for the phoneme sequence.
[0008]
[Procedure 3] The data obtained in Procedure 2 is indexed, and the indexed data is passed to a speech synthesizer, which performs speech synthesis and reads the text data aloud.
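Procedures 1 to 3 can be sketched as a three-stage pipeline. The sketch below is a toy illustration under loud assumptions: greedy longest-match segmentation stands in for real morphological analysis, the prosody values are fixed placeholders rather than predictions, and the lexicon, phoneme spellings, and function names are hypothetical.

```python
LEXICON = {  # hypothetical toy lexicon: surface form -> phonemes
    "僕": ["b", "o", "k", "u"],
    "は": ["w", "a"],
    "日本": ["n", "i", "h", "o", "N"],
    "人": ["h", "i", "t", "o"],
    "日本人": ["n", "i", "h", "o", "N", "j", "i", "N"],
    "です": ["d", "e", "s", "u"],
}

def segment(text):
    """Procedure 1 (toy): greedy longest-match segmentation into known words."""
    words, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in LEXICON:
                words.append(text[pos:end])
                pos = end
                break
        else:
            pos += 1  # skip punctuation and unknown characters
    return words

def assign_prosody(words):
    """Procedure 2 (toy): attach placeholder length, pitch, and intensity."""
    return [{"phoneme": p, "duration_ms": 80, "pitch_hz": 120.0, "power": 1.0}
            for w in words for p in LEXICON[w]]

def build_index(prosody):
    """Procedure 3 (toy): number the records in the order a synthesizer would use."""
    return [dict(order=i, **rec) for i, rec in enumerate(prosody)]

words = segment("僕は日本人です。")
index = build_index(assign_prosody(words))
print(words)  # ['僕', 'は', '日本人', 'です']
```

Longest match happens to pick "日本人" over "日本" plus "人" here, which is the choice Procedure 1 describes; production analyzers typically rely on dictionary costs or statistical models rather than this heuristic.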
[0009]
In this way, text data can be synthesized and read aloud (output as speech) in a specific predetermined voice, such as a male or a female voice. Unlike sending and receiving audio data in wav or MP3 format, the data size can therefore be kept small. Editing is also easy.
[0010]
[Problems to be solved by the invention]
However, human voices are so varied that they are used in criminal investigations. With a binary choice between a male and a female voice, no concrete image of the sender of the e-mail comes to mind, and what lies between the lines of the text data cannot be read. Nor can the sender evoke his or her own image in the recipient, or have the recipient read between the lines what the text data cannot express. A binary choice of voice is, moreover, uninteresting. The same applies when text-based electronic data is exchanged on a storage medium such as a floppy disk.
[0011]
What is desired, therefore, is a measure that evokes a concrete image of the sender of an e-mail or the like (the creator of the text data), allows what text data cannot express to be read between the lines, lets the recipient enjoy a variety of voices, and still allows speech to be synthesized from electronic data that is sent and received at a small data size.
The main object of the present invention is to solve the above problems.
[0012]
[Means for Solving the Problems]
In the speech synthesis apparatus of the present invention (claim 1), the transmitting-side device transmits designation data, which designates, from the speech storage means (speech database) of the receiving-side device, the speech data having the maximum similarity to the sender's voice, and feature amount correction data, which is used to correct the speech data designated by the designation data. On receiving these data, the receiving-side device for speech synthesis reads the corresponding speech data from the speech storage means based on the received designation data, corrects the speech data thus read with the received feature amount correction data to create corrected speech data, and performs speech synthesis.
In this configuration, the speech data stored in the speech storage means is read using designation data that designates predetermined speech data whose characteristics resemble the sender's voice. The speech data thus read is then corrected so that it more closely approximates that voice. This makes it possible to synthesize speech, and read data aloud, in a voice closer to the sender's voice (the input speech).
Incidentally, the "speech database DB" of the embodiments corresponds to the "speech storage means" of the claims. The "speech normative data creation unit 11a" of the embodiments corresponds to the "processing means" of the claims that selects the predetermined data having the maximum similarity and creates the feature amount correction data. The "speech synthesis unit 21a" of the embodiments corresponds to the "processing means" of the claims that creates the corrected speech data and performs speech synthesis. The "DB designation data" of the embodiments corresponds to the "designation data" of the claims that designates speech data. The "text data" and "speech index data" of the embodiments correspond to the "data" in "reads data aloud based on the corrected speech data..." in the claims.
[0013]
The present invention also provides a data creation method for use in a speech synthesis method in which designation data, which designates from the speech storage means the speech data to be used on the speech synthesis side, and feature amount correction data, which corrects the speech data designated by the designation data, are transmitted to the speech synthesis side, and the speech synthesis side, which has speech storage means with the same contents as the transmitting side, reads aloud the data to be synthesized (the text data and speech index data of the embodiments) based on the received data (the designation data and the feature amount correction data). The data creation method determines the predetermined speech data having the maximum similarity to the input speech, determines the designation data that designates this predetermined speech data, and creates feature amount correction data that corrects the determined speech data so that it more closely approximates the input speech (claim 4).
The present invention further provides a speech synthesis method that uses a device on the speech synthesis side having speech storage means with the same contents as the speech storage means on the transmitting side, and reads aloud the data to be synthesized based on the received designation data and feature amount correction data (claim 5).
In this configuration (claims 4 and 5), as with the speech synthesis apparatus described above (claims 1 to 3), the speech data having the maximum similarity to the input speech can be designated so that it is read on the speech synthesis side. In addition, that speech data can be corrected with the feature amount correction data so that it more closely approximates the input speech. Speech can therefore be synthesized in a voice closer to the input voice (the input speech, the sender's voice).
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the speech synthesis apparatus, the data creation method in a speech synthesis method, and the speech synthesis method of the present invention are described in detail below with reference to the drawings. Two embodiments are described: (1) a first embodiment in which ① the designation data (hereinafter "DB designation data") and the phoneme feature amount correction data (hereinafter "feature amount correction data") are held separately from ② the speech index data, and (2) a second embodiment in which ① the DB designation data and the feature amount correction data are held together with ② the speech index data.
[0015]
◎ First embodiment;
The first embodiment is described first. In this embodiment, e-mail is sent and received as the electronic data.
FIG. 1 shows the processing performed on the sender side and the receiver side in the speech synthesis method, FIG. 2 shows the creation of the speech normative data, FIG. 3 shows the processing performed on the receiver side, and FIG. 4 shows the hardware configuration.
[0016]
≪Hardware configuration≫
As shown in FIG. 4, the speech synthesis method of this embodiment is carried out with at least a sender-side device (transmitting-side device, transmitting side) 10, a receiver-side device (receiving-side device, receiving side, speech synthesis side) 20, and electronic data transfer means 30.
[0017]
The sender-side device 10, used by the sender of the e-mail, has a configuration in which a central processing unit 11, an external storage device 12, an input/output device 13, and a DSU (Digital Service Unit) 14 are connected to a bus 15. The central processing unit 11 performs overall control of the sender-side device 10 and includes a speech normative data creation unit 11a (speech normative data creation engine), which creates from input speech the speech normative data MD that serves as the norm for speech synthesis, and a text data creation unit 11b. The external storage device 12 stores various data and programs and holds a speech database DB that stores a large number of speech data Dn used to create the speech normative data MD. Each speech data Dn is a set of phoneme data, as shown in FIG. 2 and elsewhere. A keyboard for entering the text data TX, a sound system for voice input, and the like are connected to the input/output device 13 via I/O devices (not shown). The DSU 14 is an electronic data transmitting and receiving device installed on the user side of a digital line such as ISDN. The sender-side device 10 may be portable or installable in a vehicle.
[0018]
The receiver-side device 20, used by the recipient of the e-mail, has a configuration in which a central processing unit 21, an external storage device 22, an input/output device 23, and a DSU 24 are connected to a bus 25. The central processing unit 21 performs overall control of the receiver-side device 20 and includes a speech synthesis unit 21a (speech synthesis engine) that synthesizes speech from the text data TX based on the speech normative data MD sent from the sender-side device 10. The external storage device 22 stores various data and programs and holds a speech database DB that stores a large number of speech data Dn (sets of phoneme data) used for speech synthesis based on the speech normative data MD. A keyboard, a sound system for outputting the synthesized speech, and the like are connected to the input/output device 23 via I/O devices (not shown). The DSU 24 is an electronic data transmitting and receiving device installed on the user side of a digital line such as ISDN. The speech database DB of the receiver-side device 20 and that of the sender-side device 10 have the same contents. Like the sender-side device 10, the receiver-side device 20 may be portable or installable in a vehicle.
[0019]
Here, the electronic data transfer means 30 is a digital communication line such as ISDN. The electronic data transfer means 30 may be an analog public telephone line. It does not matter whether it is wired or wireless. Also, electronic data transfer means for delivering or mailing an electronic data storage medium such as a floppy disk may be used. Alternatively, the electronic data transfer means 30 includes a mode through media such as a mode in which electronic data is distributed (downloaded) from a site on the Internet and a mode in which it is distributed in a magazine or the like. Either aspect corresponds to “transmission” and “reception” in the present invention.
[0020]
≪Speech synthesis method≫
As shown in FIG. 1, the speech synthesis method according to the present embodiment is divided into a process performed on the sender side of the e-mail and a process performed on the receiver side.
On the sender side, the speech normative data is created (S1), the text data is created (S2), and the created speech normative data and text data are transmitted (S3, transmission of the e-mail). The creation of the speech normative data in step S1 and the creation of the text data in step S2 may be performed in the reverse order. On the receiver side, the speech normative data and text data are received (S5, reception of the e-mail), the speech data is read and corrected (S6), the speech index data is created (S7), and speech is synthesized (S8). The reading and correction of the speech data in step S6 and the creation of the speech index data in step S7 may be performed in the reverse order. On the receiver side, the speech normative data MD is also stored as necessary.
Hereinafter, the processing on the sender side and the processing on the receiver side will be described in detail.
[0021]
[Sender processing]
In the speech synthesis method according to the first embodiment, the sender of the e-mail creates the speech normative data MD and the text data TX using the sender-side device 10 illustrated in FIG. 4 and sends them to the recipient by e-mail. In the present embodiment, it is assumed that the sender creates speech normative data MD with which speech is synthesized in a voice resembling the sender's own.
Incidentally, the speech normative data MD includes DB designation data xn and feature amount correction data FD (see FIG. 3). The feature amount correction data FD corresponds to phoneme conversion data in claims.
[0022]
(1) Creation of speech normative data; First, the sender inputs his or her voice to the sender-side device 10 through the input/output device 13 (S11). The speech to be input is a document (fixed sentence) with no bias in its phoneme sequence, or the 50 sounds (the Japanese syllabary). The speech normative data creation unit 11a performs feature extraction on the input speech and extracts a feature amount i for each phoneme of the input speech (S12). The feature amount i of the present embodiment consists of a fundamental frequency and a spectrum envelope, where the spectrum envelope is the outline of the speech spectrum. The items shown in the following table can also be used as feature amounts.
[0023]
[Table 1]

[0024]
Next, the speech normative data creation unit 11a compares the feature amount i for each phoneme of the extracted input speech with the feature amount for each phoneme of the speech data Dn (n = 1, 2, ..., N) stored in the speech database DB, both phoneme by phoneme and as a whole (S13). Then, the specific speech data Dxn having the maximum similarity to the input speech is determined from the speech database DB (S14). Hereinafter, this speech data Dxn is referred to as standard speech data Ds. The data (an address or the like) for specifying the standard speech data Ds in the speech database DB becomes the DB designation data xn. The speech database DB of the recipient-side device 20 and that of the sender-side device 10 are the same.
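The comparison and selection in steps S13 and S14 can be sketched as a nearest-match search. The following is only an illustration under assumptions not stated in the specification: features are plain numeric vectors, similarity is taken as the negative mean Euclidean distance per phoneme, and the names `phoneme_distance`, `select_standard_voice`, and the keys `D1` and `D2` are hypothetical.

```python
from math import sqrt

# Hypothetical layout: each voice maps a phoneme label to a feature vector
# (for example, fundamental frequency followed by envelope coefficients).
Features = dict[str, list[float]]


def phoneme_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_standard_voice(input_features: Features,
                          database: dict[str, Features]) -> str:
    """Return the key (DB designation data) of the most similar entry.

    Assumption: similarity is the negative mean per-phoneme distance over
    the phonemes present in both the input and the database entry.
    """
    best_key, best_score = None, float("-inf")
    for key, voice in database.items():
        shared = [p for p in input_features if p in voice]
        if not shared:
            continue  # nothing to compare against for this entry
        mean_dist = sum(phoneme_distance(input_features[p], voice[p])
                        for p in shared) / len(shared)
        if -mean_dist > best_score:
            best_key, best_score = key, -mean_dist
    if best_key is None:
        raise ValueError("no database entry shares a phoneme with the input")
    return best_key
```

The returned key plays the role of the DB designation data xn; because both sides hold the same database, the key alone identifies the standard speech data Ds on the receiver side.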
[0025]
Subsequently, the speech normative data creation unit 11a creates feature amount correction data FD for each phoneme included in the standard speech data Ds, based on the feature amount i for each phoneme obtained earlier (S15). The feature amount correction data FD and the DB designation data xn constitute the speech normative data MD, which is information about the voice of the sender.
[0026]
Here, a method of creating the feature amount correction data FD will be further described.
① Correction of the fundamental frequency; A correction amount α is obtained such that the difference between the fundamental frequency fk, which is a feature amount of phoneme k, and the fundamental frequency f of the input speech stays within the range |fk − f| < a, and the fundamental frequency fk is replaced with fk − α. That is, α is calculated in order to bring fk close to, or equal to, f within a certain range (range a); α is the difference between the fundamental frequency fk of phoneme k and the fundamental frequency f of the input speech. Here, the range a also serves as a threshold that keeps the value of |fk − f| from becoming too large. Because of this threshold, the corrected fk (= fk − α) does not differ greatly from fk before correction, and the characteristics of fk before correction are maintained. For example, if fk before correction is the fundamental frequency at which the phoneme is heard as "a", then a is the limit (threshold) within which the sound using the corrected fk (= fk − α) can still be heard as "a".
[0027]
② Correction of the spectrum envelope; The feature amount representing the spectrum envelope of phoneme k is brought close to the feature amount of the input speech. However, the correction amount of the spectrum envelope of phoneme k is kept within a predetermined threshold range.
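One possible reading of the fundamental frequency correction is a clamped difference: α equals fk − f whenever that difference is within the threshold a, and is otherwise limited to ±a. The sketch below follows that reading only; the function names are hypothetical, and treating the bound as inclusive is an assumption (the text states a strict inequality).

```python
def pitch_correction(fk: float, f: float, a: float) -> float:
    """Return alpha, the shift to subtract from the stored pitch fk.

    Assumption: alpha is the difference fk - f limited to [-a, a], so the
    corrected pitch fk - alpha never moves more than a away from fk.
    """
    if a < 0:
        raise ValueError("threshold a must be non-negative")
    return max(-a, min(a, fk - f))


def corrected_pitch(fk: float, alpha: float) -> float:
    """Apply the correction: the stored fk is replaced by fk - alpha."""
    return fk - alpha
```

With fk = 200 Hz, f = 190 Hz and a = 30 Hz, the correction reaches the input pitch exactly; with f = 120 Hz the shift stops at the threshold, so the corrected pitch stays 30 Hz from the stored one. The same clamping idea would apply to each spectrum envelope feature.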
[0028]
(2) Creation of text data; The text data TX to be subjected to speech synthesis is created by the text data creation unit 11b from characters input through the input/output device 13 (S2 in FIG. 1(a)). In this example, the content of the text data TX is "Hello ...".
[0029]
(3) Transmission of voice normative data and text data; The created voice normative data MD and text data TX are sent to the recipient by e-mail (S3 in FIG. 1 (a)). Among these, the voice normative data MD is transmitted as an attached file of an electronic mail. Note that the transmitted voice normative data MD and text data TX are data of characters and numbers, and the data size is much smaller than the sound data of wave format or MP3 format. Therefore, the burden on the hardware on the sender side and the receiver side is small, and the time required for transmission and reception can be reduced.
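To illustrate why speech normative data made of characters and numbers is small, the sketch below serializes a hypothetical MD as compact JSON and compares it with the size of uncompressed PCM audio. The JSON layout, the field names `xn` and `fd`, and the feature names are assumptions for illustration; the specification does not define a wire format.

```python
import json


def encode_md(xn: str, fd: dict[str, dict[str, float]]) -> bytes:
    """Serialize DB designation data and correction data as compact JSON."""
    payload = {"xn": xn, "fd": fd}
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def decode_md(blob: bytes) -> tuple[str, dict[str, dict[str, float]]]:
    """Inverse of encode_md: recover (xn, fd) from the attachment bytes."""
    payload = json.loads(blob.decode("utf-8"))
    return payload["xn"], payload["fd"]


def pcm_bytes(seconds: float, sample_rate: int = 44100,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM audio of the given duration."""
    return int(seconds * sample_rate) * bytes_per_sample * channels
```

One second of 16-bit mono PCM at 44.1 kHz already takes 88,200 bytes, whereas a per-phoneme table of a few numbers stays far below that and does not grow with the length of the message.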
[0030]
[Recipient side processing]
In the speech synthesis method according to the first embodiment, the recipient of the e-mail performs speech synthesis as follows using the recipient-side device 20 illustrated in FIG. 4 (see FIG. 3).
[0031]
(1) Reading and correcting the standard speech data; The DSU 24 of the recipient-side device 20 receives an e-mail via the electronic data transfer means 30 (S51). Next, the speech synthesis unit 21a of the recipient-side device 20 searches the speech database DB based on the DB designation data xn in the speech normative data MD attached to the e-mail, and reads the corresponding speech data Dxn (that is, the standard speech data Ds). Subsequently, the read standard speech data Ds is corrected based on the feature amount correction data FD in the speech normative data MD to generate modified speech data Ds' (S52). The modified speech data Ds' is speech (phonemes) that approximates the voice of the sender, reflecting the characteristics of the sender's voice.
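The lookup and correction on the receiver side can be sketched as follows. This is a minimal illustration that assumes each phoneme is stored as a mapping of named features and that each correction is subtracted, mirroring the fk − α form used for the fundamental frequency; the function name and the feature names `f0` and `env` are hypothetical.

```python
def build_modified_voice(database: dict[str, dict[str, dict[str, float]]],
                         xn: str,
                         fd: dict[str, dict[str, float]]
                         ) -> dict[str, dict[str, float]]:
    """Read standard speech data Ds by key xn and apply corrections FD.

    Returns a new mapping (Ds') and leaves the shared database untouched,
    so the same entry can serve other senders.
    """
    standard = database[xn]  # Ds: phoneme -> {feature name: value}
    modified = {}
    for phoneme, feats in standard.items():
        corr = fd.get(phoneme, {})  # phonemes without FD stay unchanged
        modified[phoneme] = {name: value - corr.get(name, 0.0)
                             for name, value in feats.items()}
    return modified
```

Keeping the database read-only matters here because the sender and the receiver rely on holding identical copies of it.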
[0032]
(2) Creation of speech index data; For the text data TX with the content "Hello ..." in the received e-mail, the speech synthesis unit 21a performs text analysis and prosody prediction to create speech index data ID (S53). The speech index data ID has the structure shown in the balloon in the drawing. In the text analysis, the text data to be output as synthesized speech is analyzed, and the unsegmented running text is divided into words (that is, morphemes) to determine accent positions, pause lengths, and the like. In the prosody prediction, the prosodic parameters of the phoneme sequence to be synthesized and output are predicted for each phoneme based on the text analysis data.
Incidentally, the phoneme “k, o, n, n, i, ch, i, w, a...” Described in the speech index data ID is phoneme designation data.
[0033]
(3) Speech synthesis; Speech synthesis is performed based on the modified speech data Ds' created in (1) and the speech index data ID created in (2) (S54). Specifically, using the modified phonemes of the modified speech data Ds', a phoneme sequence is assembled in the order of the phoneme designation data of the speech index data ID (k, o, n, n, i, ch, i, w, a, ...), speech synthesis is performed based on this sequence, and the text data TX is read out. In this way, the receiver can listen to the contents of the e-mail in the voice of the sender (a voice equivalent to or close to the voice of the sender).
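The assembly of the phoneme sequence can be pictured as an ordered lookup. The sketch below covers only that ordering step, not waveform generation, and assumes the modified speech data is a mapping from phoneme label to feature values; `assemble_sequence` is a hypothetical name.

```python
def assemble_sequence(index_phonemes: list[str],
                      modified_voice: dict[str, dict[str, float]]
                      ) -> list[tuple[str, dict[str, float]]]:
    """Arrange modified phonemes in the order given by the index data.

    Raises KeyError when the index refers to a phoneme that the modified
    speech data does not contain.
    """
    missing = sorted({p for p in index_phonemes if p not in modified_voice})
    if missing:
        raise KeyError(f"phonemes not in modified speech data: {missing}")
    return [(p, modified_voice[p]) for p in index_phonemes]
```

A repeated phoneme such as "n" or "i" in "k, o, n, n, i, ch, i, w, a" simply reuses the same modified entry each time it appears.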
[0034]
The speech normative data MD is unique to an individual. If the speech normative data MD is stored in the external storage device 22, then the next time an e-mail is received from the same sender, the content of the e-mail can be read out by speech synthesis in the sender's voice even if the speech normative data MD is not attached. In that case, it is preferable to store the speech normative data MD using a unique identifier, such as the sender's e-mail address or telephone number, as its address. This makes it possible to call up the speech normative data MD for that sender immediately when an e-mail arrives from the same sender, and to synthesize and read out the contents of the e-mail.
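Storing the speech normative data under the sender's address amounts to a keyed cache. The sketch below is an in-memory illustration only; the class name `VoiceModelStore`, the `resolve` method, and the normalization of the address (trimming and lower-casing) are assumptions, and a real device would persist the data in the external storage device 22.

```python
from typing import Optional


class VoiceModelStore:
    """Keeps speech normative data (MD) keyed by a sender identifier."""

    def __init__(self) -> None:
        self._models: dict[str, dict] = {}

    def resolve(self, sender: str,
                attached_md: Optional[dict] = None) -> Optional[dict]:
        """Return the MD to use for this e-mail.

        If MD is attached, it is stored (replacing any older copy) and
        returned; otherwise the stored MD for the sender is returned, or
        None when this sender has never supplied one.
        """
        key = sender.strip().lower()
        if attached_md is not None:
            self._models[key] = attached_md
        return self._models.get(key)
```

A result of None signals that the e-mail cannot be read in the sender's voice yet, which a device could handle by falling back to an unmodified database voice.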
[0035]
By performing speech synthesis in this way, the receiver can pick up nuances between the lines that cannot be expressed by text data, and the data size of the electronic data to be transmitted and received can be made much smaller than that of an audio file. Also, if the speech normative data is stored in the recipient-side device, only the text data (or speech index data) needs to be sent when the same sender sends an e-mail again, so the data size can be reduced further. Furthermore, when creating the speech normative data, the sender does not need to read the content of the text data aloud, but only a fixed sentence or the 50 sounds. Therefore, even if the speech normative data is created in an office, people nearby are not disturbed, unlike when the text data itself is read aloud. In addition, since speech normative data can be used repeatedly once it has been created, there is no need to read the fixed sentence or the 50 sounds each time an e-mail is sent. Unlike sending an audio file by e-mail, there is also no need to read the text aloud each time its content changes. Furthermore, unlike an audio file, a text file is extremely easy to correct and edit. The recipient can also check the contents of the e-mail in parallel with other work such as driving a car.
[0036]
In the above embodiment, the speech index data is created on the receiver side. However, the speech index data may instead be created on the sender side and attached to the e-mail. When the speech index data is created on the sender side, the sender can freely edit the length, pitch, strength, and so on of each phoneme, which makes it possible to have the speech synthesized as the sender intended (accent, intonation, and the like). When the speech index data is created on the sender side and transmitted as an attachment to an e-mail in this way, the e-mail does not need to include the text data to be synthesized, because speech can be synthesized by assembling a phoneme sequence in the order of the phoneme designation data of the speech index data. In this case, the speech index data is derived from the text data, and this speech index data corresponds to the text data in the claims.
[0037]
Second embodiment;
Next, the speech synthesis method according to the second embodiment will be described. Note that members and elements that are the same as those in the first embodiment are given the same reference numerals with reference to the drawings used in the first embodiment, and descriptions thereof are omitted. Here, FIG. 5 shows the processing performed on the sender side and the receiver side of the speech synthesis method, FIG. 6 shows the relationship of each data in the speech synthesis method, and FIG. 7 shows the processing performed on the receiver side.
In the second embodiment, voice index data is created on the sender side, and merged voice normative data in a format in which the voice index data and the voice norm data are written is transmitted to the receiver side by e-mail.
[0038]
≪Hardware configuration≫
Since the speech synthesis method of the second embodiment can be applied using the hardware of the first embodiment as it is, description of the hardware configuration is omitted. However, in the second embodiment, the speech index data ID is created by the sender-side device 10. Therefore, the speech normative data creation unit 11a of the sender-side device 10 is also able to create the speech index data ID.
[0039]
≪Speech synthesis method≫
Similar to the speech synthesis method of the first embodiment, the speech synthesis method of the second embodiment can be divided into a process performed on the sender side of the e-mail and a process performed on the receiver side.
On the sender side, the speech normative data is created (S101), the text data is created (S102), the speech index data is created based on the text data (S103), the speech normative data and the speech index data are merged, that is, the merged speech normative data is created (S104), and the merged speech normative data is transmitted (S105, transmission of e-mail). On the receiver side, the merged speech normative data is received (S106, reception of e-mail), the speech data is read and corrected (S107), and speech synthesis is performed (S8).
Hereinafter, the processing on the sender side and the processing on the receiver side will be described in detail.
[0040]
[Sender processing]
(1) Creation of speech normative data, (2) creation of text data, and (3) creation of speech index data; Each creation procedure is the same as that described in the first embodiment, so a description is omitted. Note that, unlike the first embodiment, the speech index data ID is created on the sender side.
[0041]
(4) Creation of merged speech normative data; The speech normative data MD and the speech index data ID are merged to create merged speech normative data CMD. As shown in FIG. 6C, the merged speech normative data CMD contains the DB designation data xn, the entire contents of the speech index data ID, and the feature amount correction data FD corresponding to the phoneme designation data of the speech index data ID.
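The merge can be sketched as a filter that keeps only the correction data for phonemes the message actually uses. The dictionary layout and the names `merge_voice_data`, `phonemes`, and `index` are assumptions for illustration; the structure shown in FIG. 6 is not reproduced here.

```python
def merge_voice_data(xn: str,
                     fd: dict[str, dict[str, float]],
                     index_data: dict) -> dict:
    """Build merged data (CMD) from designation data, FD and index data.

    Assumption: index_data["phonemes"] lists the phoneme designation data
    in reading order. Only FD entries for those phonemes are kept.
    """
    used = set(index_data["phonemes"])
    return {
        "xn": xn,                 # DB designation data
        "index": index_data,      # entire speech index data
        "fd": {p: c for p, c in fd.items() if p in used},
    }
```

Dropping the unused entries is what allows the merged data to be smaller than speech normative data that carries corrections for every phoneme.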
[0042]
(5) Transmission of merged speech normative data; The created merged speech normative data CMD is transmitted to the recipient by electronic mail (S105 in FIG. 5 (a)). The merged speech normative data CMD to be transmitted is character or numeric data, and the data size is much smaller than the sound data in the wave format or MP3 format. Therefore, the burden on the hardware on the sender side and the receiver side is small, and the time required for transmission and reception can be reduced. Note that text data TX is not necessary for e-mail. This is because a phoneme sequence can be assembled based on phoneme designation data included in the merged speech reference data CMD, and speech synthesis can be performed. However, the text data TX may be described in the e-mail so that the recipient can visually confirm the contents of the e-mail.
[0043]
[Recipient side processing]
In the speech synthesis method according to the second embodiment, the recipient of the e-mail performs speech synthesis as follows using the recipient-side device 20 shown in FIG. 4 (see FIG. 7).
[0044]
(1) Reading and correcting the standard speech data; The DSU 24 of the recipient-side device 20 receives an e-mail via the electronic data transfer means 30 (S151). Next, the speech synthesis unit 21a of the recipient-side device 20 searches the speech database DB based on the DB designation data xn in the merged speech normative data CMD attached to the e-mail, and reads the corresponding speech data Dxn (that is, the standard speech data Ds). Subsequently, the read standard speech data Ds is corrected based on the feature amount correction data FD in the merged speech normative data CMD to create modified speech data Ds' (S152). The modified speech data Ds' is speech (phonemes) that approximates the voice of the sender, reflecting the characteristics of the sender's voice.
[0045]
(2) Speech synthesis; Speech synthesis is performed based on the created modified speech data Ds' and the part of the merged speech normative data CMD that is derived from the speech index data ID (S153). Specifically, using the modified phonemes of the modified speech data Ds', a phoneme sequence is assembled in the order of the phoneme designation data of the merged speech normative data CMD (k, o, n, n, i, ch, i, w, a, ...), speech synthesis is performed based on this sequence, and the text data TX created by the sender in step S102 is read out. In this way, the receiver can listen to the contents of the e-mail in the voice of the sender (a voice equivalent to or close to the voice of the sender).
[0046]
By synthesizing speech in this way, the receiver can, as in the first embodiment, pick up nuances between the lines that cannot be expressed by text data, and the data size of the electronic data to be transmitted and received can be made smaller than when an audio file is transmitted and received. Further, since it is not necessary to prepare feature amount correction data for all phonemes, the data size can be made smaller than in the first embodiment. In the present embodiment, the portion of the merged speech normative data sent to the receiver that is derived from the speech index data corresponds to the text data in the claims.
[0047]
The present invention described above can be widely modified without being limited to the above-described embodiments.
For example, the voice to be imitated need not be that of the sender; it may be that of a family member, a friend, or another person. Even if the voice is one the receiver has never heard, the receiver can enjoy listening to the contents (text data) of the electronic data in various voices.
[0048]
[Effects of the invention]
According to the present invention (speech synthesis apparatus, data creation method in a speech synthesis method, and speech synthesis method), the contents of an e-mail can be heard in the voice of the sender or the like, so the receiver can form a concrete image of the sender or the like and can pick up nuances between the lines that cannot be expressed by text data. In addition, the receiver can enjoy listening to the contents of the text data in various voices. Unlike the case of transmitting and receiving audio files, the data size can be reduced, and the sender does not need to read the text data aloud. Furthermore, since the designation data and the feature amount correction data (speech normative data) are configured as electronic data, they can be reused. And because the designation data and the feature amount correction data are not affected by the contents of the text data, they need not be created again even when different text data is created.
[0049]
In particular, according to the second aspect of the invention, since the speech index data is created by the reception-side device, the load on the transmission-side device can be reduced. According to the third aspect of the invention, since the speech index data is created by the transmission-side device, the load on the reception-side device can be reduced. Moreover, according to the present invention, the size of the data to be transmitted and received can be reliably reduced, and the receiver side can reliably synthesize speech in the voice intended by the sender and listen to the content of the text data. According to the invention of claim 4, the designation data and the feature amount correction data used on the speech synthesis side can be appropriately determined and created. According to the fifth aspect of the invention, speech that approximates the input speech can be synthesized based on the designation data and the feature amount correction data, and the target data can be read out.
[Brief description of the drawings]
FIG. 1 is a flowchart showing processing of a speech synthesis method according to a first embodiment, where (a) shows processing on the sender side and (b) shows processing on the receiver side.
FIG. 2 is a flowchart showing the creation of speech normative data performed on the sender side in the speech synthesis method according to the first embodiment together with the data configuration.
FIG. 3 is a flowchart showing processing performed on the receiver side of the speech synthesis method according to the first embodiment together with a data configuration;
FIG. 4 is a block diagram illustrating a hardware configuration to which the speech synthesis method according to the first embodiment is applied.
FIG. 5 is a flowchart showing processing of the speech synthesis method according to the second embodiment, where (a) shows processing on the sender side and (b) shows processing on the receiver side.
FIG. 6 is a diagram explaining the relationship between the data in the speech synthesis method according to the second embodiment, where (a) shows the configuration of the speech index data and (c) shows the structure of the merged speech normative data.
FIG. 7 is a diagram illustrating a process performed on the receiver side of the speech synthesis method according to the second embodiment together with a data configuration.
[Explanation of symbols]
10 ... Sender side device (sender)
20 ... Receiver side device (receiver)
MD ... Voice normative data
TX ... Text data
DB ... Voice database
Dn ... Voice data
Ds / Dxn ... Standard speech data (specific speech data)
xn ... DB designation data (voice database designation data)
FD ... Feature amount correction data (phoneme conversion data)

Claims

In a speech synthesizer composed of a transmitting-side device that transmits data created by a sender and a receiving-side device that performs speech synthesis on the data received from the transmitting-side device and reads it out,
the transmitting-side device is provided with voice storage means for storing a plurality of voice data as aggregated phoneme data;
processing means for extracting feature amounts of the sender's voice, comparing them phoneme by phoneme with each voice data in the voice storage means to determine predetermined voice data having the maximum similarity, and creating feature amount correction data for correcting the determined predetermined voice data so as to approximate the voice of the sender; and
transmission means for transmitting designation data for designating the predetermined voice data and the feature amount correction data;
and the receiving-side device is provided with receiving means for receiving the data transmitted from the transmitting-side device;
voice storage means having the same content as that of the transmitting-side device; and
processing means for reading the corresponding voice data from the voice storage means based on the received designation data, creating corrected voice data by correcting the read voice data based on the received feature amount correction data, and performing speech synthesis based on the corrected voice data to read out the data,
A speech synthesizer characterized by the above.

The transmitting device further includes input means for inputting text data, and transmitting means for transmitting the text data,
The processing means of the receiving-side device creates speech index data in which prosodic parameters including an accent position or a pause length are predicted based on the received text data, and performs speech synthesis based on the corrected voice data and the speech index data. The speech synthesizer according to claim 1.

The transmitting device is further provided with input means for inputting text data,
The processing means of the transmitting-side device creates speech index data in which prosodic parameters including an accent position or a pause length are predicted based on the text data,
The transmission means also transmits the audio index data,
Processing means of the receiving side device performs speech synthesis based on the modified speech data and the received speech index data;
The speech synthesizer according to claim 1.

Designation data for designating speech data to be used on the speech synthesis side from speech storage means for storing a plurality of speech data as phoneme aggregate data, and feature amount correction data for correcting the speech data designated by the designation data, are sent to the speech synthesis side,
A data creation method in a speech synthesis method in which speech synthesis is performed, on the speech synthesis side having speech storage means with the same content as the above speech storage means, based on the transmitted designation data and feature amount correction data, wherein
a feature amount for each phoneme is extracted from the input speech,
the feature amounts are compared phoneme by phoneme with each speech data in the speech storage means to determine predetermined speech data having the maximum similarity to the input speech, and designation data for designating the predetermined speech data is determined,
and feature amount correction data is created for correcting the determined predetermined speech data so that it approximates the input speech,
A data creation method in a speech synthesis method characterized by the above.

A speech synthesis method in which speech data to be used for speech synthesis is read from speech storage means that stores a plurality of speech data as phoneme aggregate data, the read speech data is corrected, and speech synthesis is performed based on the corrected speech data to read out the target data, wherein
designation data and feature amount correction data are received, the designation data designating predetermined speech data that has the maximum similarity to an input speech, as determined by extracting a feature amount for each phoneme from the input speech and comparing the feature amounts with each speech data of transmission-side speech storage means having the same content as the speech storage means, and the feature amount correction data being for making the speech data designated by the designation data approximate the input speech,
the speech data for speech synthesis is read from the speech storage means using the received designation data,
the read speech data is corrected with the received feature amount correction data so that it approximates the input speech,
and the data to be subjected to speech synthesis is read out based on the speech data corrected in this way,
A speech synthesis method characterized by the above.