JP3999078B2

JP3999078B2 - Voice data distribution device and client terminal

Info

Publication number: JP3999078B2
Application number: JP2002257570A
Authority: JP
Inventors: 聡渡辺; 慎司早川; 真弓原田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2002-09-03
Filing date: 2002-09-03
Publication date: 2007-10-31
Anticipated expiration: 2022-09-03
Also published as: JP2004094085A

Description

【０００１】
【発明の属する技術分野】
本発明は、インターネット等のデータネットワークを介して利用者端末に音声データを配信する音声データ配信装置及び配信依頼者端末に関する。
【０００２】
【従来の技術】
電子メールやＷｅｂページの長文テキストを画面上で読む場合、目への負担を軽減するため、テキストを音声化して提示したいというニーズがある。このニーズに対し、音声データ配信装置を利用した音声提示を実現する方法が知られている。かかる音声データ配信装置においては、テキスト形式等の文書データを音声データに変換するには、音声合成エンジンと呼ばれる特殊のソフトウェアを必要とすることから、かかる音声データへの変換を音声データの配信に先立って利用者にサービスする音声データ配信システムも知られている。この点、特開２００１−２８２２６８公報は、天気予報等のテキストデータを依頼者から受け取り、音声合成データを作成して配信元であるＷｅｂサーバ或いは電話サーバに配信することにより、ホームページ閲覧者や電話利用者が天気予報等の情報を音声にて聴取できるようにする音声データ配信システムを開示している。かかる構成により、音声データの配信を望む配信依頼者は、自己の装置に音声合成機能を有しなくても、テキスト形式等の文書データを該データ配信システムに送信するだけで、かかる音声データを配信先の利用者に提供できるとしている。
【０００３】
ところで、日本語を対象言語とする前提では、音声データの合成は、通常、漢字かな混じり文であるテキストデータから、単語辞書データ、韻律規則データ及び音声素片データを用いて音声波形の音声データが生成される。
【０００４】
【発明が解決しようとする課題】
このように生成される音声データには、常に、読みの誤りの発生、或いは未知語の発生、即ち、単語辞書に登録されていないために対応する音声が確定できず読みが未確定のまま配信される危険の発生が予想されている。特に、日本語では、当該テキストの内容の属する分野、例えば、経済、文化、政治、娯楽等の分野の違いにより、妥当する読みが異なる或いは専門用語のため未知語が多く発生する等の問題が予想される。そのため結果的に、誤った読みを含んだ音声データを配信し、これを音声にて聴取する利用者をして情報の理解を誤らせしめる危険が存在する。かかる誤りのある音声データの提供は、音声データ配信サービスの信頼性を損ね、その普及を妨げる要因となっていた。
【０００５】
本発明は、以上の問題に鑑みてなされたものであり、その目的は、信頼性のある音声データ配信装置を提供することである。
【０００６】
【課題を解決するための手段】
本発明による音声データ配信装置は、データネットワークを介して音声データを配信する音声データ配信装置であり、少なくとも１つの依頼者端末から少なくとも１連のテキストデータを受信し、音声合成により前記テキストデータを１連の音声データに変換して配信のために蓄積する音声データ蓄積手段と、前記依頼者端末からの修正リクエストに応じて、前記修正リクエストに対応する音声データについて、前記修正リクエストの内容に応じて音声合成を再度行って修正済音声データを得る音声データ修正手段と、得られた修正済音声データを前記依頼者端末に送信する修正済音声データ送信手段と、前記依頼者端末からの配信承認メッセージに応じて、前記配信承認メッセージに対応する修正済音声データに、前記音声データ蓄積手段により蓄積されていた修正前の音声データを更新する音声データ更新手段と、を含むことを特徴とする。
【０００７】
本発明による配信依頼端末は、請求項１記載の音声データ配信装置に少なくとも１連の音声データの配信を依頼する依頼者端末であり、前記音声データに変換されるべきテキストデータの入力を促す音声データ配信依頼情報を表示する配信依頼情報表示手段と、前記修正リクエストの内容入力を促すと共に、前記修正済音声データに対する再生指令または承認指令入力を促す音声データ修正依頼情報を表示する修正依頼情報表示手段と、を含むことを特徴とする。
【０００８】
【発明の実施の形態】
本発明の実施例について添付の図面を参照して詳細に説明する。
＜第１実施例＞
図１は、本発明の第１の実施例であり、音声データ配信装置を含むシステム全体構成を示している。音声データ配信装置（音声データ配信システムとも称する）は、音声データ配信サーバ１０及び音声データ合成サーバ２０から構成される。
【０００９】
図１の右側に示される音声データ合成サーバ２０を参照すると、音声データ合成サーバ２０は、インターネット３０に接続されて、依頼者端末４０及び／又は利用者端末１００との通信を可能とする。音声データ合成サーバ２０のハードウェア構成は、通常のサーバコンピュータとして構成される。音声データ合成サーバ２０のソフトウェア構成は、図示されるように、制御プログラム２２と、ＴＣＰ／ＩＰ部２３と、ＨＴＴＰ／ＣＧＩ部２４と、音声合成処理部２５と、音声データ蓄積依頼部２６と、を含む。制御プログラム２２は、音声データ合成サーバ２０の全体の基本制御を司るオペレーティングシステムである。ＴＣＰ／ＩＰ部２３は、制御プログラム２２の制御の下にＴＣＰ／ＩＰプロトコルの手順を実行し、インターネット３０を介して依頼者端末４０及び／又は利用者端末１００とのデータ通信を実現する。ＨＴＴＰ／ＣＧＩ部２４は、ＴＣＰ／ＩＰ部２３を介して、音声データ配信依頼受け付けのためのＷｅｂページをインターネット３０に配信すると共に、そのＣＧＩ機能により、依頼者端末４０からのテキストデータを受信する機能を有する。ＨＴＴＰ／ＣＧＩ部２４は、また、音声合成された音声データを依頼者端末４０に返却して、その修正指示を受け付ける機能を有する。ＨＴＴＰ／ＣＧＩ部２４は、ＨＴＴＰ（HyperText Transfer Protocol；IETF：RFC2616参照）によるデータ通信を実現する。
【００１０】
音声合成処理部２５は、ＨＴＴＰ／ＣＧＩ部２４から入力されるテキストデータを音声データに変換する機能を有する。入力されるテキストデータの形式は、プレーンテキスト、即ち、ＪＩＳ漢字文字コード列から構成されるのが通常であるが、他の文書形式でも良い。出力される音声データは、音声波形に対応する音素データから構成され、通常１つのファイルにまとめられる。このファイルは、利用者端末１００及び又は依頼者端末４０により再生可能な音声波形形式のファイルであれば良く、ＷＡＶファイル、ＭＰ３ファイル等の多様なファイル形式が想定される。音声合成処理部２５には、例えば１０万語程度の表記、読み、アクセント及び品詞の情報が各単語に対応付けて保存される単語辞書データファイル２５１と、基本周波数や音圧の制御規則データが格納される韻律規則データファイル２５２と、音素ごとの波形データを蓄えた音声素片データファイル２５３と、が接続される。音声合成処理部２５の音声合成の処理手順は、先ず、漢字かな混じり文であるテキストデータを単語辞書データファイル２５１の単語辞書データを用いてアクセント記号付カナ文字列である中間言語を生成する。次いで、生成された中間言語から、韻律規則データファイル２５２の韻律規則データを用いて音声素片番号、ピッチパターン情報及び音韻継続時間情報等からなる合成パラメータを生成する。そして、生成された合成パラメータに従って、音声素片データ２５３の音声素片データを読み出してこれを順次繋げることで音声波形データを生成する。これにより、音声データの音声波形情報は、例えば、標本化周波数８ｋＨｚ、量子化１６ｂｉｔのＰＣＭデータとして生成され、音声データ６３として音声データ合成サーバ２０の適切な一次記憶装置に一時的に保存される。
【００１１】
音声合成処理部２５は、更に、文字列として未知語を返す機能を有する。ここで、未知語とは、単語辞書データファイルに登録されておらず、読み誤る可能性がある単語である。音声合成処理部２５は、未知語であると判定された単語は、所定の適当な読みで音声に変換と共に、未知語と判定された単語のリストを未知語リスト６４として出力する。未知語リスト６４は、先の音声データ６３と共に音声データ合成サーバ２０の適切な一次記憶装置に一時的に保存される。
【００１２】
音声データ蓄積依頼部２６は、依頼者端末４０からの音声データに対する配信承認のＯＫメッセージに応じて、一時的に記憶されていた音声データ６３を音声データ配信サーバ１０に送信して、これに蓄積せしめてインターネット３０に向け各利用者端末１００に配信するように依頼する機能を有する。
図１の左側に示される音声データ配信サーバ１０について参照すると、音声データ配信サーバ１０がインターネット３０に接続されて、利用者端末１００及び／又は依頼者端末４０との間でデータ通信を可能とする。音声データ配信サーバ１０のハードウェア構成は、通常のサーバコンピュータとして構成される。音声データ配信サーバ１０のソフトウェア構成は、図示されるように、制御プログラム１２と、ＴＣＰ／ＩＰ部１３と、ＨＴＴＰ部１４と、音声データ蓄積部１５と、配信のための音声データ・データベース６０（以下、音声データＤＢと称する）と、から構成される。
【００１３】
制御プログラム１２は、音声データ配信サーバ装置１０の全体の基本制御を司るオペレーティングシステムである。ＴＣＰ／ＩＰ部１３は、制御プログラム１２の制御の下にＴＣＰ／ＩＰプロトコルの手順を実行し、インターネット３０を介して依頼者端末４０及び／又は利用者端末１００とのデータ通信を実現する。ＨＴＴＰ部１４は、ＴＣＰ／ＩＰ部１３を介して、音声データＤＢ６０に格納されている音声データの配信を要求する利用者端末１００又は依頼者端末４０に向けて所望の音声データを配信する機能を有する。音声データＤＢ６０は、複数の識別子からなる識別子群と、該識別子の各々に対応付けられた複数の音声データからなる音声データ群からなる。音声データＤＢ６０の構成は、識別子をファイル名とする通常のファイルシステムによって実現され得る。
【００１４】
図２は、図１に示される依頼者端末４０の内部構成を示している。ここで、依頼者端末４０は、インターネット３０に接続されデータ通信を可能としている。依頼者端末４０のハードウェア構成としては、パーソナルコンピュータ等の通常のネットワーククライアント端末であり、通常のディスプレイ５１及びキーボード５４が接続されている。依頼者端末４０のハードウェア構成は、更に、音声データを再生して検証するためのスピーカ５２と操作者の音声入力により直接音声データ作成するためのマイク５３とを含む。依頼者端末４０のソフトウェア構成としては、依頼者端末４０の全体の基本制御を司る制御プログラム４１と、インターネット３０を介してデータ通信を実現するためのＴＣＰ／ＩＰ部４２と、キーボード５４を含むＩ／Ｏ機器の入出力を制御プログラム４１と協働して実現するＩ／Ｏ制御部４３と、マイク５３から入力される音声をデジタル処理して取り込む音声入力部４６と、ＴＣＰ／ＩＰ部４２を介して受信されるＷｅｂページを表示し、これに対する操作者の入力を再びインターネットに向けて送信するＷｅｂブラウザ部４４と、Ｗｅｂブラウザ部４４に表示されたＷｅｂページを介して取り込まれる音声データをスピーカ５２を通して再生するための音声データ再生部４５と、を含む。
【００１５】
尚、依頼者端末４０には、マイク５３及び音声入力部４６を用いてテキストデータの修正に対して直接操作者の音声をＷｅｂブラウザ部４４に表示されるテキスト入力欄と連動させて音声データを直接生成することも可能である。
図３は、依頼者端末４０における音声データ配信依頼画面の例を示している。音声データ配信依頼画面８０は、テキストデータ入力欄８１と、音声データ化指示釦８２と、音声合成処理条件設定欄８３とを含む。テキストデータ入力欄８１は、依頼者端末４０の操作者が実際に音声データとして配信を所望する文書をテキストデータとして文字入力する欄である。音声データ化指示釦８２がテキストデータ入力の後に指示されることより、当該テキストデータが音声データ合成サーバ２０に送信される。音声合成処理条件設定欄８３は、音声データ合成サーバ２０が音声合成処理を実行する際の詳細な処理条件を設定することを可能とする。本図の例では、音声データの声質設定と、速さ設定と、未知語表示の設定とを可能とする例を示されている。音声合成処理条件設定欄８３に設定可能とする処理条件としては、他に生成される音声データの符号化速度（１６ｋｂｐｓ、３２ｋｂｐｓ等）や単語辞書の分野を設定する等の多様な設定が考えられる。
【００１６】
図４は、依頼者端末４０における音声データ修正依頼画面の例を示している。音声データ修正依頼画面８４は、スピーカ５２を用いた音声データの再生を指示する音声データ再生釦８５と、テキストデータに対する修正指示としての文字入力を促す修正入力欄８６と、未知語リストを表示する未知語リスト欄８７と、該未知語に対する修正指示としてその単語情報の入力を促す単語情報入力欄８８と、音声データに対する配信を承認する又は修正指示をなすＯＫメッセージ釦８９とを含む。単語情報入力欄８８は、更に、未知語リスト８７上で選択された単語に対する修正指示として、そのフリガナ、即ち「読み」の入力を促す読み入力欄８８１と、その品詞を指定する品詞指定欄８８２と、読み入力欄にて指定されたフリガナに従い、そのアクセント情報の指定を促すアクセント指定欄８８３とを含む。尚、配信承認のための釦と修正指示のための釦とを別異に設ける形態でも良い。
【００１７】
図５は、音声データ合成サーバ２０及び依頼者端末４０の処理手順を示している。本図に示されるシーケンスについて、前述の図１及び図２に示される構成要素を適宜参照して説明する。
先ず、音声データ合成サーバ２０は、ＨＴＴＰ／ＣＧＩ部２４により、依頼者端末４０に音声データ配信依頼画面を送信する（ステップＳ１１）。これは、依頼者端末４０が音声データ合成サーバ２０のアドレスを指定して音声データ配信依頼画面にアクセスすることにより送信される。この送信に応じて、依頼者端末４０は、そのＷｅｂブラウザ部４４により、音声データ配信依頼画面を受信して表示する（ステップＳ１２）。次いで、依頼者端末４０の操作者からテキストデータの入力を受け付け、音声データ合成サーバに送信する（ステップＳ１３）。この送信に応じて、音声データ合成サーバ２０は、ＨＴＴＰ／ＣＧＩ部２４により、テキストデータを受信する（ステップＳ１４）。
【００１８】
次に、音声データ合成サーバ２０は、音声合成処理部２５により、受信されたテキストデータに対して音声合成処理を実行する（ステップＳ１５）。この合成処理においては、音声データ配信依頼画面において指定された詳細な音声合成処理条件に従って合成処理がなされても良い。音声合成処理部２５は、合成処理の結果として音声データ６３及び未知語リスト６４を作成する。次いで、音声データ合成サーバ２０は、ＨＴＴＰ／ＣＧＩ部２４により、得られた音声データ６３及び未知語リスト６４を含む音声データ修正依頼画面を依頼者端末４０に送信する（ステップＳ１６）。この送信に応じて、依頼者端末４０は、そのＷｅｂブラウザ部４４により、音声データ修正依頼画面を受信して表示する（ステップＳ１７）。音声データ修正依頼画面の表示に応じて、依頼者端末４０の操作者は、適宜、当該合成された音声データが適正か否かの判断を行い、必要に応じてテキストデータの編集を行う。
【００１９】
この編集の例について説明すると、「こちら側ではなしている言葉はきこえない。」というテキストデータは、文脈上「こちら側で、はなしている言葉は聞こえない。」という切れ目が妥当なのに対し、「こちら側では、なしてる言葉は聞こえない。」と聞こえる音声が生成された場合、依頼者端末４０の操作者は、元のテキストデータに句点を加え、「こちら側で、話している言葉は聞こえない。」と修正指示をすることで、再び音声合成処理を依頼し、妥当な音声を生成することが出来る。又、「彼は大和魂がある」というテキストデータは、文脈上「彼はヤマトダマシイがある」という読み方が妥当なのに対し、「彼はダイワダマシイがある」と聞こえる音声が生成された場合、操作者は、元のテキストデータの「大和魂」の部分を「ヤマトダマシイ」とカナに変更し、修正指示することで妥当な音声を生成することが出来る。この場合、操作者は、図４に示す単語情報入力欄８８から、「大和魂」の単語情報を入力することで修正指示をしても、妥当な音声を生成することが出来る。ここで入力された単語情報は、音声合成処理部２５が、再び音声合成処理を行うにあたり、単語辞書データ２５１と併せて用いられる。
【００２０】
以上の編集操作の後に、依頼者端末４０は、Ｗｅｂブラウザ部４４を介して、音声データ修正指示（修正リクエスト）又はＯＫメッセージを音声データ合成サーバに送信する（ステップＳ１８）。修正指示の無いＯＫメッセージは、依頼者端末４０の操作者が当該音声データについて修正指示が不要として実際の配信を承認することを意味する。
【００２１】
この送信に応じて、音声データ合成サーバ２０は、ＨＴＴＰ／ＣＧＩ部２４により、修正指示（修正リクエスト）又はＯＫメッセージを受信する（ステップＳ１９）。次いで、修正指示／ＯＫか否かの判定を行う（ステップＳ２０）。もし修正指示がある場合には、ステップＳ１５の音声合成処理に戻る。一方、ＯＫである場合には、音声データ合成サーバ２０は、音声データ蓄積依頼部２６を介して音声データ配信サーバ１０に当該音声データを蓄積せしめる（ステップＳ２１）。音声データ蓄積依頼部２６は、インターネット３０を介して音声データ配信サーバ１０の音声データ蓄積部１５に音声データの蓄積を依頼し、音声データ蓄積部１５はこれを音声データＤＢ６０に蓄積する。これにより、蓄積された音声データは、音声データ配信サーバ１０のＨＴＴＰ部１４によりインターネット３０を介して利用者端末１００又は依頼者端末４０に配信可能となる。
【００２２】
次に、音声データ合成サーバ２０は、ＨＴＴＰ／ＣＧＩ部２４により、登録完了メッセージを依頼者端末に通知する（ステップＳ２２）。この送信に応じて、依頼者端末４０は、Ｗｅｂブラウザ部４４により、登録完了メッセージを受信する（ステップＳ２３）。この登録完了メッセージは、依頼者端末４０に操作者に表示される。
【００２３】
以上のように、本第１の実施例においては、音声データ配信システムに音声合成サーバを設けたことで、音声データ配信の依頼者は、その依頼者端末に音声合成機能を備える作業が必要なく、合成された音声データに対して適切な修正追加をすることができる。
尚、音声データの配信形態として、ＨＴＴＰサーバによる配信について説明したが、かかる形態に限られず、ＦＴＰ（File Transfer Protocol）によるファイル転送やＵＤＰ（User Datagram Protocol）による配信形態等の他の配信形態でも良い。
＜第２の実施例＞
図６は、本発明の第２の実施例であり、音声データ合成サーバ２０の他の構成を示している。ここで、音声データ合成サーバ２０のハードウェア構成は第１の実施例の場合と同様であり、そのソフトウェア構成は一部に機能の追加がなされる。かかる機能追加の部分についてのみ説明する。図６を参照すると、識別子決定部２７と、音声データ情報計算部２８とが備えられている。
【００２４】
識別子決定部２７は、合成され且つ依頼者端末からの配信承認のＯＫメッセージに基づいて配信される音声データを音声データファイルとしてインターネット３０上でアクセス可能とする一意の識別子を自動的に決定する機能を有する。識別子の書式としては、ＵＲＩ（Uniform Resource Identifier；IETF：RFC2396参照）を特定する情報として、ＩＰヘッダの要求送信元ＩＰアドレス（Internet Protocol；Source-address；IETF：RFC791）を利用することができる。
【００２５】
識別子決定部２７の識別子の自動生成方法の１例としては、カウンタメモリを備える方法がある。この例では、識別子決定部２７の内部にカウンタメモリを備える。識別子決定部２７は自動生成を開始すると直ちに、カウンタメモリから値を読み出す。そして、この値の文字列として含んだ識別子を生成すると共に、カウンタメモリの値をインクリメントする。カウンタメモリの値は、次の識別子生成まで保存される。具体的には、カウンタメモリから読み出した値が１２３だった場合、例えばhttp://aaa.bbb.cccc.dddd/123.wavと識別子を生成し、直ちにカウンタをインクリメントすることで、値１２４で次の識別子の生成に備える。自動生成方法の他の１例としては、時刻情報を用いる方法がある。この例では、識別子決定部２７の内部に時刻情報を提供する時計を備える。識別子決定部２７は、自動生成を開始すると直ちに時計から時刻情報を取得し、この値の文字列として含んだ識別子を生成する。具体的には、時計から得た時刻情報が２００１年１２月３日１０時２９分１５秒３３ｍｓだった場合には、例えば次のようになる。
【００２６】
http://aaaa.bbbb.cccc.dddd/2001120310291533.wav
と識別子を生成する。尚、識別子決定部２７による識別子生成機能は、対象となるテキストデータの内容から自動的に識別子を生成するように実現されても良い。この場合の方法としては、テキストデータの書式を予め規定しておき所定のフィールドの文字例から自動的に識別子を予定する文字例を抽出するようにすることが考えられる。これにより、依頼者は、所望の識別子名を自ら指定できるようになる。これは、依頼者にとってインターネット上での音声データの識別を自ら指定した識別子名によって管理できるので、使いやすいシステムを提供できる。音声データ合成サーバ２０の音声データ蓄積依頼部２６は、識別子決定部２７により決定された識別子を音声データ６３に付与して音声データ配信サーバ１０に蓄積依頼する。
【００２７】
音声データ情報計算部２８は、音声合成処理部２５により合成された音声データに基づいて音声データ情報を計算する。ここで音声データ情報とは、音声データのデータサイズおよび音声データの再生時間長である。音声データ情報計算部２８は、入力された音声データのサイズ及び予め取り決められた音声データの形式から音声データの再生時間を計算する。具体的には、該音声データサイズがＮバイト、該形式が標本化周波数８ｋＨｚ、及び量子化１６ｂｉｔであると仮定すると、再生時間は、Ｎ／（１６）［ｍｓ］と計算される。
【００２８】
音声データ合成サーバ２０は、音声データ情報計算部２８により計算された再生時間を、識別子決定部２７から出力される識別子と共に依頼者端末４０にＨＴＴＰ／ＣＧＩ部２４を用いて送信する。
図７は、依頼者端末４０に表示される音声データ登録確認画面の例を示している。音声データ登録確認画面９０は、識別子表示欄９１と再生時間表示欄９２とを含んでいる。識別子表示欄９１は、音声データ合成サーバ２０の識別子決定部２７により決定された識別子が表示される。再生時間表示欄９２は、音声データ合成サーバ２０の音声データ情報計算部２８により計算される当該音声データの予測再生時間が表示される。
【００２９】
以上の第２実施例のおいては、第１の実施例と同様の処理手順の動作の結果として、依頼者端末の操作者は、音声データの識別子を認識することができる。例えば、Ｗｅｂページ管理者は、この識別子をその管理するＷｅｂページにハイパーリンクとして埋め込むことで更新し、Ｗｅｂサーバに登録する。Ｗｅｂページの利用者は、該更新されたＷｅｂページをブラウザ等で表示し、音声出力したい場合には、該追加されたハイパーリンクを利用することで、音声出力を得ることができる。また、本第２の実施例においては、音声データの再生時間情報を認識することができる。これにより、音声配信がなされた場合の聴取時間、通信データ量を知ることができ、音声データの利用者における具体的な利用イメージを想定することが可能となり、より洗練された音声データの配信が可能となる。
【００３０】
尚、以上の第１及び第２の実施例において、説明の容易性から単一の依頼者端末或いは利用者端末について説明したが、本発明による音声データ配信システムは、多数の単一の依頼者端末或いは利用者端末を想定している。従って、音声データ配信システムが扱い得る音声データは、図示される数に限られず多数の音声データを収容し得る。
【００３１】
又、以上の第１及び第２の実施例における音声データ配信装置は、音声データ配信サーバと音声データ合成サーバとの２つのサーバにより構成されるものとしたが、これは、既存の音声データ配信サーバのみが存在する運用形態に新たに音声合成サーバを追加する場合によりシステムの実現が容易であることによる。しかし、この２つのサーバを単一ハードウェアとしてのサーバ装置に集約する構成も当然に可能である。
【００３２】
【発明の効果】
以上のように本発明による音声データ配信装置及び配信者端末によれば、配信対象となるべき音声データを再生確認して適切な修正をなすことが可能となり誤った読みによる発声を回避することができる。これにより、信頼性のある音声データ配信装置が提供される。
【図面の簡単な説明】
【図１】本発明の第１の実施例であり、音声データ配信システムの全体構成を示しているブロック図である。
【図２】図１に示される依頼者端末の構成を示しているブロック図である。
【図３】依頼者端末４０に表示される音声データ配信依頼画面の例を示している図である。
【図４】依頼者端末４０に表示される音声データ修正依頼画面の例を示している図である。
【図５】音声データ合成サーバ２０及び依頼者端末４０の処理手順を示しているシーケンス図である。
【図６】本発明の第２の実施例であり、音声データ合成サーバ２０の他の構成を示しているブロック図である。
【図７】依頼者端末４０に表示される音声データ登録確認画面の例を示している図である。
【符号の説明】
１０音声データ配信サーバ
１２、２２、４１制御プログラム
１３、２３、４２ＴＣＰ／ＩＰ部
１４ＨＴＴＰ部
１５音声データ蓄積部
２０音声データ合成サーバ
２４ＨＴＴＰ／ＣＧＩ部
２５音声合成処理部
２６音声データ蓄積依頼部
２７識別子決定部
２８音声データ情報計算部
３０インターネット
４０依頼者端末
４３Ｉ／Ｏ制御部
４４Ｗｅｂブラウザ部
４５音声データ再生部
４６音声入力部
５１ディスプレイ
５２スピーカ
５３マイク
５４キーボード
６０音声データ・データベース
６３音声データ
６４未知語リスト
８０音声データ配信依頼画面
８４音声データ修正依頼画面
９０音声データ登録確認画面
１００利用者端末
[0001]
[Technical Field of the Invention]
The present invention relates to a voice data distribution apparatus that distributes voice data to user terminals via a data network such as the Internet, and to a distribution requester terminal.
[0002]
[Prior art]
When reading long text such as e-mail or Web pages on a screen, there is a need to present the text as speech in order to reduce eye strain. Methods that meet this need by using a voice data distribution apparatus are known. Because converting document data such as plain text into voice data requires special software called a speech synthesis engine, voice data distribution systems are also known that perform this conversion as a service before the voice data is distributed. In this regard, Japanese Patent Laid-Open No. 2001-282268 discloses a voice data distribution system that receives text data such as a weather forecast from a requester, creates synthesized speech data, and delivers it to a Web server or telephone server acting as the distribution source, so that homepage viewers and telephone users can listen to information such as weather forecasts as speech. With this configuration, a distribution requester who wants to distribute voice data is said to be able to provide it to destination users simply by sending document data such as plain text to the distribution system, even if the requester's own device has no speech synthesis function.
[0003]
Assuming Japanese as the target language, voice data is usually synthesized by generating a speech waveform from text data written in mixed kanji and kana, using word dictionary data, prosodic rule data, and speech segment data.
[0004]
[Problems to be solved by the invention]
Voice data generated in this way always carries the risk of reading errors or of unknown words, that is, words that are not registered in the word dictionary, so that the corresponding speech cannot be determined and the data is distributed with the reading unresolved. In Japanese in particular, the appropriate reading can differ depending on the field to which the text belongs (for example economics, culture, politics, or entertainment), and technical terms can produce many unknown words. As a result, there is a danger that voice data containing incorrect readings is distributed and that users who listen to it misunderstand the information. Providing such erroneous voice data has undermined the reliability of voice data distribution services and hindered their adoption.
[0005]
The present invention has been made in view of the above problems, and an object thereof is to provide a reliable audio data distribution apparatus.
[0006]
[Means for Solving the Problems]
A voice data distribution apparatus according to the present invention distributes voice data via a data network and is characterized by including: voice data storage means for receiving at least one series of text data from at least one requester terminal, converting the text data into a series of voice data by speech synthesis, and storing it for distribution; voice data correction means for, in response to a correction request from the requester terminal, performing speech synthesis again on the voice data corresponding to the correction request according to the content of the request to obtain corrected voice data; corrected voice data transmission means for transmitting the obtained corrected voice data to the requester terminal; and voice data update means for, in response to a distribution approval message from the requester terminal, updating the pre-correction voice data stored by the voice data storage means to the corrected voice data corresponding to the distribution approval message.
[0007]
A distribution request terminal according to the present invention is a requester terminal that requests the voice data distribution apparatus of claim 1 to distribute at least one series of voice data, and is characterized by including: distribution request information display means for displaying voice data distribution request information that prompts input of the text data to be converted into the voice data; and correction request information display means for displaying voice data correction request information that prompts input of the content of the correction request and prompts input of a playback command or an approval command for the corrected voice data.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described in detail with reference to the accompanying drawings.
<First embodiment>
FIG. 1 shows a first embodiment of the present invention, namely the overall configuration of a system that includes the voice data distribution apparatus. The voice data distribution apparatus (also referred to as the voice data distribution system) consists of a voice data distribution server 10 and a voice data synthesis server 20.
[0009]
Referring to the voice data synthesis server 20 shown on the right side of FIG. 1, the voice data synthesis server 20 is connected to the Internet 30 and can communicate with the requester terminal 40 and/or the user terminal 100. Its hardware is an ordinary server computer. As illustrated, its software includes a control program 22, a TCP/IP unit 23, an HTTP/CGI unit 24, a speech synthesis processing unit 25, and a voice data storage request unit 26. The control program 22 is an operating system that governs the basic control of the entire voice data synthesis server 20. The TCP/IP unit 23 executes the TCP/IP protocol procedures under the control of the control program 22 and carries out data communication with the requester terminal 40 and/or the user terminal 100 via the Internet 30. The HTTP/CGI unit 24 serves, via the TCP/IP unit 23, a Web page for accepting voice data distribution requests, and uses its CGI function to receive text data from the requester terminal 40. The HTTP/CGI unit 24 also returns the synthesized voice data to the requester terminal 40 and accepts correction instructions for it. The HTTP/CGI unit 24 carries out data communication using HTTP (Hypertext Transfer Protocol; see IETF RFC 2616).
[0010]
The speech synthesis processing unit 25 converts the text data supplied by the HTTP/CGI unit 24 into voice data. The input text is normally plain text, that is, a string of JIS kanji character codes, but other document formats may be used. The output voice data consists of phoneme data corresponding to a speech waveform and is normally collected into a single file. This file may be in any speech waveform format that the user terminal 100 and/or the requester terminal 40 can play back; various formats such as WAV and MP3 are envisaged. Three data files are connected to the speech synthesis processing unit 25: a word dictionary data file 251, which stores the notation, reading, accent, and part of speech for each of, for example, about 100,000 words; a prosodic rule data file 252, which stores control rules for fundamental frequency and sound pressure; and a speech segment data file 253, which stores waveform data for each phoneme. Speech synthesis proceeds as follows. First, the text data, written in mixed kanji and kana, is converted with the word dictionary data of file 251 into an intermediate language, which is a kana string with accent marks. Next, the prosodic rule data of file 252 is applied to the intermediate language to generate synthesis parameters consisting of speech segment numbers, pitch pattern information, phoneme duration information, and the like. Finally, speech segment data is read from the speech segment data file 253 according to the synthesis parameters and concatenated in order to generate speech waveform data.
The speech waveform is thereby generated as, for example, PCM data with a sampling frequency of 8 kHz and 16-bit quantization, and is temporarily held as voice data 63 in suitable primary storage of the voice data synthesis server 20.
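The three-stage procedure above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two-entry dictionary, the flat "prosody rule", and the sine bursts that stand in for real speech segments are not taken from the patent, and only the 8 kHz, 16-bit PCM format follows the text.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # 8 kHz sampling, as in the text
SAMPLE_WIDTH = 2     # 16-bit quantization = 2 bytes per sample

# Hypothetical stand-in for the word dictionary data file 251 (surface form -> kana reading).
WORD_DICT = {"彼": "カレ", "大和魂": "ヤマトダマシイ"}

def to_intermediate(words):
    """Stage 1: replace each word by its kana reading (accent marks omitted in this toy)."""
    return [WORD_DICT.get(w, w) for w in words]

def to_parameters(readings, pitch_hz=120.0, ms_per_kana=100):
    """Stage 2: a flat 'prosody rule' giving every kana the same pitch and duration."""
    return [(pitch_hz, ms_per_kana) for reading in readings for _ in reading]

def to_pcm(params):
    """Stage 3: stand-in for segment concatenation; one sine burst per parameter."""
    frames = bytearray()
    for pitch_hz, dur_ms in params:
        for i in range(SAMPLE_RATE * dur_ms // 1000):
            sample = int(8000 * math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE))
            frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    return bytes(frames)

def to_wav(pcm):
    """Wrap raw PCM in a mono WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return buf.getvalue()

pcm = to_pcm(to_parameters(to_intermediate(["彼", "大和魂"])))
wav_bytes = to_wav(pcm)
```

The nine kana at 100 ms each give 7,200 samples, or 14,400 bytes of PCM, which matches the 16 bytes per millisecond implied by the 8 kHz, 16-bit format.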
[0011]
The speech synthesis processing unit 25 also returns unknown words as character strings. Here, an unknown word is a word that is not registered in the word dictionary data file and may therefore be misread. The speech synthesis processing unit 25 converts each word judged to be unknown into speech using a predetermined provisional reading, and outputs the list of such words as an unknown word list 64. The unknown word list 64 is temporarily held, together with the voice data 63, in suitable primary storage of the voice data synthesis server 20.
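A minimal sketch of how an unknown word list could be collected during dictionary lookup is given below. The pre-segmented word list, the tiny dictionary, and the `"?"` placeholder reading are assumptions for illustration; the patent does not specify how the provisional reading is chosen.

```python
# Hypothetical miniature word dictionary (surface form -> kana reading).
WORD_DICT = {"彼": "カレ", "は": "ワ", "が": "ガ", "ある": "アル"}

def read_words(words, dictionary, fallback="?"):
    """Return (readings, unknown_words) for an already segmented sentence.

    Words missing from the dictionary receive the provisional `fallback`
    reading and are reported once each, in order of first appearance.
    """
    readings, unknown = [], []
    for word in words:
        if word in dictionary:
            readings.append(dictionary[word])
        else:
            readings.append(fallback)
            if word not in unknown:
                unknown.append(word)
    return readings, unknown

readings, unknown_list = read_words(["彼", "は", "大和魂", "が", "ある"], WORD_DICT)
```

Here `unknown_list` plays the role of the unknown word list 64 that is returned to the requester alongside the synthesized voice data.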
[0012]
In response to an OK message from the requester terminal 40 approving distribution of the voice data, the voice data storage request unit 26 transmits the temporarily held voice data 63 to the voice data distribution server 10 and requests that server to store it and make it available for distribution over the Internet 30 to each user terminal 100.
Referring to the voice data distribution server 10 shown on the left side of FIG. 1, the voice data distribution server 10 is connected to the Internet 30 and can exchange data with the user terminal 100 and/or the requester terminal 40. Its hardware is an ordinary server computer. As illustrated, the software of the voice data distribution server 10 consists of a control program 12, a TCP/IP unit 13, an HTTP unit 14, a voice data storage unit 15, and a voice data database 60 for distribution (hereinafter, the voice data DB).
[0013]
The control program 12 is an operating system that governs the basic control of the entire voice data distribution server 10. The TCP/IP unit 13 executes the TCP/IP protocol procedures under the control of the control program 12 and carries out data communication with the requester terminal 40 and/or the user terminal 100 via the Internet 30. The HTTP unit 14 delivers, via the TCP/IP unit 13, the requested voice data to a user terminal 100 or requester terminal 40 that requests distribution of voice data stored in the voice data DB 60. The voice data DB 60 consists of an identifier group made up of a plurality of identifiers and a voice data group made up of the voice data associated with each identifier. The voice data DB 60 can be implemented as an ordinary file system in which each identifier is used as a file name.
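As a rough illustration of the file-system realization mentioned above, the following sketch stores each voice data item in a file named after its identifier. The class name `VoiceDataDB`, its methods, and the rejection of identifiers containing path components are assumptions added for the example, not details from the patent.

```python
import tempfile
from pathlib import Path

class VoiceDataDB:
    """Toy voice data DB: one file per identifier inside a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, identifier):
        # Accept bare file names only, so an identifier cannot escape the root.
        name = Path(identifier).name
        if name != identifier or name in ("", ".", ".."):
            raise ValueError(f"invalid identifier: {identifier!r}")
        return self.root / name

    def store(self, identifier, data):
        self._path(identifier).write_bytes(data)

    def fetch(self, identifier):
        return self._path(identifier).read_bytes()

with tempfile.TemporaryDirectory() as root:
    db = VoiceDataDB(root)
    db.store("123.wav", b"example voice data")
    assert db.fetch("123.wav") == b"example voice data"
```

With this layout an ordinary HTTP server can serve the root directory directly, so the identifier doubles as the last path segment of the URL.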
[0014]
FIG. 2 shows the internal configuration of the requester terminal 40 shown in FIG. 1. The requester terminal 40 is connected to the Internet 30 for data communication. Its hardware is an ordinary network client terminal such as a personal computer, with an ordinary display 51 and keyboard 54 attached. The hardware further includes a speaker 52 for playing back and checking voice data, and a microphone 53 for creating voice data directly from the operator's voice. The software of the requester terminal 40 includes: a control program 41 that governs the basic control of the entire terminal; a TCP/IP unit 42 for data communication via the Internet 30; an I/O control unit 43 that, in cooperation with the control program 41, handles input and output for I/O devices including the keyboard 54; a voice input unit 46 that digitizes and captures speech from the microphone 53; a Web browser unit 44 that displays Web pages received via the TCP/IP unit 42 and sends the operator's input back to the Internet; and a voice data playback unit 45 that plays, through the speaker 52, voice data obtained via the Web page displayed in the Web browser unit 44.
[0015]
The requester terminal 40 can also generate voice data directly: when correcting text data, the operator's own voice, captured with the microphone 53 and the voice input unit 46, is linked to the text input field displayed in the Web browser unit 44.
FIG. 3 shows an example of the voice data distribution request screen on the requester terminal 40. The voice data distribution request screen 80 includes a text data input field 81, a voice conversion button 82, and a speech synthesis condition setting field 83. In the text data input field 81, the operator of the requester terminal 40 types, as text data, the document that is to be distributed as voice data. When the voice conversion button 82 is pressed after the text has been entered, the text data is sent to the voice data synthesis server 20. The speech synthesis condition setting field 83 allows detailed processing conditions to be set for the speech synthesis performed by the voice data synthesis server 20. The example in the figure allows the voice quality, the speaking speed, and the display of unknown words to be set. Various other conditions could also be made settable, such as the encoding rate of the generated voice data (16 kbps, 32 kbps, and so on) or the subject field of the word dictionary.
[0016]
FIG. 4 shows an example of the voice data correction request screen on the requester terminal 40. The voice data correction request screen 84 includes a voice data playback button 85 for playing the voice data through the speaker 52, a correction input field 86 that prompts text entry as a correction instruction for the text data, an unknown word list field 87 that displays the unknown word list, a word information input field 88 that prompts entry of word information as a correction instruction for an unknown word, and an OK message button 89 for approving distribution of the voice data or issuing a correction instruction. The word information input field 88 in turn includes, as correction instructions for the word selected in the unknown word list field 87, a reading input field 881 that prompts entry of the furigana, that is, the reading; a part-of-speech field 882 for specifying the part of speech; and an accent field 883 that prompts specification of accent information for the reading entered in the reading input field. Separate buttons may instead be provided for distribution approval and for correction instructions.
[0017]
FIG. 5 shows a processing procedure of the voice data synthesis server 20 and the requester terminal 40. The sequence shown in this figure will be described with reference to the components shown in FIGS. 1 and 2 as appropriate.
First, the voice data synthesis server 20 sends the voice data distribution request screen to the requester terminal 40 through the HTTP/CGI unit 24 (step S11). The screen is sent when the requester terminal 40 accesses it by specifying the address of the voice data synthesis server 20. The requester terminal 40 receives and displays the voice data distribution request screen in its Web browser unit 44 (step S12). It then accepts text data entered by the operator and sends it to the voice data synthesis server 20 (step S13). The voice data synthesis server 20 receives the text data through the HTTP/CGI unit 24 (step S14).
[0018]
Next, the voice data synthesis server 20 performs speech synthesis on the received text data in the speech synthesis processing unit 25 (step S15). The synthesis may follow the detailed speech synthesis conditions specified on the voice data distribution request screen. As the result of synthesis, the speech synthesis processing unit 25 creates the voice data 63 and the unknown word list 64. The voice data synthesis server 20 then sends a voice data correction request screen containing the voice data 63 and the unknown word list 64 to the requester terminal 40 through the HTTP/CGI unit 24 (step S16). The requester terminal 40 receives and displays this screen in its Web browser unit 44 (step S17). With the screen displayed, the operator of the requester terminal 40 judges whether the synthesized voice data is appropriate and edits the text data as necessary.
[0019]
As an example of such editing, consider the text 「こちら側ではなしている言葉はきこえない。」 ("The words being spoken on this side cannot be heard."). In context the correct phrase break is 「こちら側で、はなしている言葉は聞こえない。」 (kochiragawa de, hanashite iru ...), but suppose the generated speech sounds like 「こちら側では、なしてる言葉は聞こえない。」 (kochiragawa dewa, nashiteru ...). The operator of the requester terminal 40 can add a punctuation mark to the original text, giving 「こちら側で、話している言葉は聞こえない。」, issue a correction instruction to request speech synthesis again, and obtain appropriate speech. Similarly, for the text 「彼は大和魂がある」 ("He has the Yamato spirit"), the correct reading in context is 「彼はヤマトダマシイがある」 (yamatodamashii), but suppose the generated speech sounds like 「彼はダイワダマシイがある」 (daiwadamashii). The operator can change the 「大和魂」 part of the original text to the kana 「ヤマトダマシイ」 and issue a correction instruction to obtain appropriate speech. Alternatively, the operator can obtain appropriate speech by entering the word information for 「大和魂」 in the word information input field 88 shown in FIG. 4 and issuing a correction instruction. The word information entered here is used together with the word dictionary data 251 when the speech synthesis processing unit 25 performs speech synthesis again.
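One way to picture word information being "used together with" the word dictionary is a layered lookup in which requester-supplied entries take precedence over the base dictionary without modifying it. The sketch below assumes this layering; the `WordInfo` fields mirror the reading, part-of-speech, and accent inputs of field 88, while the concrete values (including the accent numbers) are invented for illustration.

```python
from collections import ChainMap
from dataclasses import dataclass

@dataclass(frozen=True)
class WordInfo:
    reading: str          # furigana entered in field 881
    part_of_speech: str   # chosen in field 882
    accent: int           # chosen in field 883 (illustrative value only)

# Hypothetical base dictionary with no entry for the compound 大和魂.
BASE_DICT = {"大和": WordInfo("ダイワ", "noun", 1)}

def lookup_reading(dictionary, word):
    """Return the reading for `word`, or None when it is not registered."""
    info = dictionary.get(word)
    return info.reading if info is not None else None

# Word information supplied with the correction request.
user_entries = {"大和魂": WordInfo("ヤマトダマシイ", "noun", 4)}

# Requester entries are consulted first; the base dictionary is left untouched.
merged = ChainMap(user_entries, BASE_DICT)
```

Looking up 「大和魂」 in `BASE_DICT` alone returns `None`, whereas the merged view returns 「ヤマトダマシイ」 for the repeated synthesis pass.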
[0020]
After the editing described above, the requester terminal 40 sends a voice data correction instruction (correction request) or an OK message to the voice data synthesis server 20 via the Web browser unit 44 (step S18). An OK message without a correction instruction means that the operator of the requester terminal 40 considers no correction necessary and approves the actual distribution of the voice data.
[0021]
In response to this transmission, the voice data synthesis server 20 receives the correction instruction (correction request) or OK message through the HTTP/CGI unit 24 (step S19). It then determines whether the message is a correction instruction or an OK (step S20). If it is a correction instruction, processing returns to the speech synthesis of step S15. If it is an OK, the voice data synthesis server 20 has the voice data stored in the voice data distribution server 10 via the voice data storage request unit 26 (step S21): the voice data storage request unit 26 asks the voice data storage unit 15 of the voice data distribution server 10, via the Internet 30, to store the voice data, and the voice data storage unit 15 stores it in the voice data DB 60. The stored voice data can then be distributed by the HTTP unit 14 of the voice data distribution server 10 to the user terminal 100 or the requester terminal 40 via the Internet 30.
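The loop formed by steps S15 and S19 to S21 can be sketched as follows. The message format (a dict with a `"type"` key), the function name `review_loop`, and the callables passed in are assumptions for illustration; in the described system the messages arrive over HTTP/CGI rather than from an in-memory list.

```python
def review_loop(text, synthesize, messages, store):
    """Re-synthesize until the requester approves, then store the result.

    `messages` yields requester replies in order: either
    {"type": "correct", "text": <edited text>} or {"type": "ok"}.
    Returns the number of synthesis passes, or None if never approved.
    """
    voice = synthesize(text)              # step S15
    passes = 1
    for message in messages:              # step S19: receive the reply
        if message["type"] == "ok":       # step S20: approval?
            store(voice)                  # step S21: hand over for distribution
            return passes
        text = message["text"]            # correction instruction
        voice = synthesize(text)          # back to step S15
        passes += 1
    return None

# Toy usage: "synthesis" just encodes the text, and storage appends to a list.
stored = []
passes = review_loop(
    "彼は大和魂がある",
    synthesize=lambda t: t.encode("utf-8"),
    messages=[{"type": "correct", "text": "彼はヤマトダマシイがある"}, {"type": "ok"}],
    store=stored.append,
)
```

Only the version that was current when the OK message arrived reaches `store`, which mirrors the point that unapproved voice data is never handed to the distribution server.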
[0022]
Next, the voice data synthesis server 20 notifies the requester terminal 40 of a registration completion message through the HTTP/CGI unit 24 (step S22). The requester terminal 40 receives the registration completion message in its Web browser unit 44 (step S23) and displays it to the operator.
[0023]
As described above, in the first embodiment, because the voice data distribution system is provided with the voice data synthesis server, the requester of voice data distribution does not need to equip the client terminal with a voice synthesis function, and can nevertheless make appropriate corrections and additions to the synthesized voice data.
Although distribution by an HTTP server has been described as the distribution form of the voice data, the invention is not limited to this form; other distribution forms, such as file transfer by FTP (File Transfer Protocol) or distribution by UDP (User Datagram Protocol), may also be used.
<Second embodiment>
FIG. 6 shows another configuration of the voice data synthesis server 20 according to the second embodiment of the present invention. The hardware configuration of the voice data synthesis server 20 is the same as that of the first embodiment, and the software configuration has some functions added. Only the added functions are described here. Referring to FIG. 6, an identifier determination unit 27 and a voice data information calculation unit 28 are provided.
[0024]
The identifier determination unit 27 has a function of automatically determining, based on a distribution approval (OK) message from the client terminal, a unique identifier by which the voice data file of the synthesized voice data to be distributed can be accessed on the Internet 30. As the identifier format, a URI (Uniform Resource Identifier; see IETF RFC 2396) can be used, and the source IP address of the requester in the IP header (Internet Protocol Source Address; see IETF RFC 791) can be used as information for specifying the URI.
[0025]
One example of the automatic identifier generation method of the identifier determination unit 27 uses a counter memory. In this example, the identifier determination unit 27 includes a counter memory. When automatic generation starts, the identifier determination unit 27 reads a value from the counter memory, generates an identifier that contains this value as a character string, and increments the value of the counter memory. The value in the counter memory is retained until the next identifier generation. Specifically, when the value read from the counter memory is 123, for example, the identifier http://aaa.bbb.cccc.dddd/123.wav is generated, and the counter is incremented immediately to 124 in preparation for the next identifier generation. Another example of the automatic generation method uses time information. In this example, the identifier determination unit 27 is provided with a clock that supplies time information. When automatic generation starts, the identifier determination unit 27 acquires time information from the clock and generates an identifier that contains this value as a character string. Specifically, when the time information obtained from the clock is December 3, 2001, 10:29:15 and 33 ms, for example, the following identifier is generated.
[0026]
http://aaaa.bbbb.cccc.dddd/2001120310291533.wav
The identifier generation function of the identifier determination unit 27 may also be realized so as to generate an identifier automatically from the contents of the target text data. One conceivable method is to define the format of the text data in advance and automatically extract the character string to be used as the identifier from a predetermined field. This allows the client to specify the desired identifier name. Because the client can then manage the voice data on the Internet under an identifier name of the client's own choosing, an easy-to-use system can be provided. The voice data storage request unit 26 of the voice data synthesis server 20 attaches the identifier determined by the identifier determination unit 27 to the voice data 63 and requests the voice data distribution server 10 to store it.
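The counter-based and time-based generation methods can be sketched as follows. This is a sketch under stated assumptions: `BASE_URL` reuses the placeholder host from the time-based example, the class and function names are hypothetical, the counter lives only in memory (the description says the value is kept until the next generation but does not say how), and the sub-second part is zero-padded to three digits, which differs from the way the example in the description writes "33".

```python
import itertools
from datetime import datetime

# Placeholder host, as in the time-based example in the description.
BASE_URL = "http://aaaa.bbbb.cccc.dddd"


class CounterIdentifier:
    """Counter-based identifier generation (in-memory counter, an assumption)."""

    def __init__(self, start=0):
        self._counter = itertools.count(start)

    def next_identifier(self):
        # Read the current value; the counter then advances for the next call.
        value = next(self._counter)
        return f"{BASE_URL}/{value}.wav"


def time_identifier(now):
    """Time-based identifier built from a datetime, down to milliseconds."""
    millis = now.microsecond // 1000
    # Assumption: milliseconds are zero-padded to three digits.
    stamp = now.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"
    return f"{BASE_URL}/{stamp}.wav"
```

A counter guarantees uniqueness as long as its value survives restarts, whereas a timestamp needs no stored state but can collide if two requests arrive within the same millisecond.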
[0027]
The voice data information calculation unit 28 calculates voice data information based on the voice data synthesized by the speech synthesis processing unit 25. Here, the voice data information is the data size of the voice data and the playback time length of the voice data. The voice data information calculation unit 28 calculates the playback time of the voice data from the size of the input voice data and the predetermined voice data format. Specifically, assuming that the voice data size is N bytes and the format is a sampling frequency of 8 kHz with 16-bit quantization, each second of voice occupies 8000 samples x 2 bytes = 16000 bytes, so the playback time is calculated as N / 16 [ms].
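The playback time calculation can be written out as a small function. This is a minimal sketch: the function name and parameters are hypothetical, and it assumes headerless, uncompressed PCM data; the `channels` parameter is an added assumption (the description does not mention channels, so the default is one).

```python
def playback_time_ms(size_bytes, sample_rate_hz=8000, bits_per_sample=16, channels=1):
    """Estimate playback time in milliseconds for raw PCM voice data.

    Assumes the data carries no file header and is not compressed.
    """
    # Bytes consumed per second of audio: 8000 * 2 * 1 = 16000 by default.
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return size_bytes * 1000 / bytes_per_second
```

With the default format, this reduces to N / 16 milliseconds for N bytes, matching the expression in the description.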
[0028]
The voice data synthesis server 20 transmits the playback time calculated by the voice data information calculation unit 28 to the client terminal 40 together with the identifier output from the identifier determination unit 27 using the HTTP / CGI unit 24.
FIG. 7 shows an example of a voice data registration confirmation screen displayed on the client terminal 40. The voice data registration confirmation screen 90 includes an identifier display field 91 and a playback time display field 92. The identifier display field 91 displays the identifier determined by the identifier determination unit 27 of the voice data synthesis server 20. The playback time display field 92 displays the predicted playback time of the voice data calculated by the voice data information calculation unit 28 of the voice data synthesis server 20.
[0029]
In the second embodiment described above, the operator of the client terminal can learn the identifier of the voice data by following a processing procedure similar to that of the first embodiment. For example, a Web page manager embeds the identifier as a hyperlink in the Web page being managed and registers the updated page on the Web server. When a user of the Web page displays the updated page in a browser or the like and wants to hear the audio, the user can obtain the audio output by using the added hyperlink. In the second embodiment, the playback time information of the voice data can also be obtained. This makes it possible to know the listening time and the amount of communication data involved in distributing the audio, and to form a concrete picture of how users will use the voice data, so that more sophisticated voice data distribution becomes possible.
[0030]
In the first and second embodiments described above, a single client terminal and a single user terminal have been described for ease of explanation. However, the voice data distribution system according to the present invention assumes a large number of client terminals and user terminals. Likewise, the voice data handled by the voice data distribution system is not limited to the number shown in the figures, and the system can accommodate a large amount of voice data.
[0031]
In addition, the voice data distribution apparatus in the first and second embodiments described above is composed of two servers, a voice data distribution server and a voice data synthesis server. This is because the system can then be realized easily by adding a new voice data synthesis server to an existing operation in which only the voice data distribution server is present. However, a configuration in which these two servers are integrated into a single hardware server device is naturally also possible.
[0032]
[Effect of the invention]
As described above, according to the voice data distribution apparatus and the distribution requester terminal of the present invention, the voice data to be distributed can be played back for confirmation and corrected as appropriate, so that utterances based on erroneous readings can be avoided. A reliable voice data distribution apparatus is thereby provided.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of an audio data distribution system according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration of a client terminal shown in FIG.
FIG. 3 is a diagram showing an example of a voice data distribution request screen displayed on the client terminal 40.
FIG. 4 is a diagram showing an example of a voice data correction request screen displayed on the client terminal 40.
FIG. 5 is a sequence diagram showing the processing procedures of the voice data synthesis server 20 and the client terminal 40.
FIG. 6 is a block diagram showing another configuration of the voice data synthesis server 20 according to the second embodiment of the present invention.
FIG. 7 is a diagram showing an example of a voice data registration confirmation screen displayed on the client terminal 40.
[Explanation of symbols]
10 Voice data distribution server
12, 22, 41 Control program
13, 23, 42 TCP / IP unit
14 HTTP unit
15 Voice data storage unit
20 Voice data synthesis server
24 HTTP / CGI unit
25 Speech synthesis processing unit
26 Voice data storage request unit
27 Identifier determination unit
28 Voice data information calculation unit
30 Internet
40 Client terminal
43 I / O control unit
44 Web browser unit
45 Voice data playback unit
46 Voice input unit
51 Display
52 Speaker
53 Microphone
54 Keyboard
60 Voice data DB
63 Voice data
64 Unknown word list
80 Voice data distribution request screen
84 Voice data correction request screen
90 Voice data registration confirmation screen
100 User terminal

Claims

1. An audio data distribution apparatus for distributing audio data via a data network, comprising:
voice data storage means for receiving at least one series of text data from at least one requester terminal, converting the text data into a series of voice data by voice synthesis, and storing the voice data for distribution;
voice data correcting means for, in response to a correction request from the requester terminal, performing voice synthesis again on the voice data corresponding to the correction request according to the content of the correction request to obtain corrected voice data;
corrected voice data transmitting means for transmitting the obtained corrected voice data to the requester terminal; and
voice data update means for, in response to a distribution approval message from the requester terminal, updating the uncorrected voice data stored by the voice data storage means to the corrected voice data corresponding to the distribution approval message.

2. The audio data distribution apparatus according to claim 1, wherein the voice data correcting means corrects the voice data based on the correction request, which contains correction information for the voice data to be corrected and/or word information for an unknown word.

3. The audio data distribution apparatus according to claim 1 or 2, further comprising identifier determining means for determining an identifier of the voice data, and identifier transmitting means for transmitting the determined identifier to the corresponding requester terminal.

4. The audio data distribution apparatus according to claim 3, wherein the identifier determining means determines the identifier based on a counter that is updated for each item of voice data, the current time, or information extracted from the text data.

5. The audio data distribution apparatus according to claim 1, further comprising means for calculating a predicted playback time of the voice data and transmitting the predicted playback time to the corresponding requester terminal.

6. A distribution requester terminal that requests the audio data distribution apparatus according to claim 1 to distribute at least one series of audio data, the terminal comprising:
distribution request information display means for displaying audio data distribution request information that prompts input of text data to be converted into the audio data; and
correction request information display means for displaying voice data correction request information that prompts input of the content of the correction request and prompts input of a playback command or an approval command for the corrected voice data.

7. The distribution requester terminal according to claim 6, wherein the correction request information display means prompts input of the content of a correction request including correction information for the voice data to be corrected and/or word information for an unknown word.