JP4082249B2 - Content distribution system

Info

Publication number: JP4082249B2
Application number: JP2003070717A
Authority: JP
Inventors: 淳野口; 広羽金; 栄子山田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-03-14
Filing date: 2003-03-14
Publication date: 2008-04-30
Anticipated expiration: 2023-03-14
Also published as: JP2004282392A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を出力させるコンテンツを端末装置に配信する音声コンテンツ配信システムと、この音声コンテンツ配信システムに適用されるコンテンツ配信装置、音声情報変換装置、端末装置、コンテンツ配信プログラム、音声情報変換プログラムおよびコンテンツ出力プログラムに関する。
【０００２】
【従来の技術】
端末装置（以下、端末と記す。）にコンテンツを配信して、端末の使用者にコンテンツを閲覧させるサービスが普及している。コンテンツとしては、画像や文字等を端末に表示させるコンテンツが多い。しかし、音声を出力させるコンテンツを配信し、端末において音声を出力するシステムも種々提案されている。
【０００３】
例えば、特許文献１では、表音文字列を要求する記述を含むハイパーテキストをクライアントに送信するシステムが提案されている。特許文献１に記載のシステムにおいて、クライアントは、受信したハイパーテキスト内に表音文字列を要求する記述が含まれている場合、サーバに表音文字列を要求する。そして、クライアントは、サーバから表音文字列の情報を受信し、その表音文字列に従って音声を出力する。
【０００４】
また、特許文献２では、テキストデータ等の音声合成目的データを端末がダウンロードし、ダウンロードしたデータに基づいて、端末が音声を読み上げるシステムが提案されている。特許文献２には、端末の使用者が希望するキャラクタ音声の音素データを端末がダウンロードし、端末がそのキャラクタ音声でテキストデータ等を読み上げることについても記載されている。また、端末が音声合成処理プログラムや画像データをダウンロードする場合についても記載されている。
【０００５】
また、音声処理を行うためのマークアップ言語であるＶｏｉｃｅＸＭＬを利用したシステムも提案されている（例えば、特許文献３）。特許文献３に記載のシステムでは、ＶｏｉｃｅＸＭＬを利用した記述に基づいて、音声対話サーバが音声を生成し、その音声を電話回線を介して端末に送信する。また、音声対話サーバが音声と同期させて、画面表示データを端末に送信する。
【０００６】
以下の説明において、端末が出力音声を特定するためのデータを発音記号列と記す。
【０００７】
【特許文献１】
特開２００１−４３０６４号公報（段落００２４−０１３２、第１−２１図）
【０００８】
【特許文献２】
特開２００２−３２８６９４号公報（段落００３８−０１３２、第１−７４図）
【０００９】
【特許文献３】
特開２００２−３１８１３２号公報（段落００１６−００５１、第１−９図）
【００１０】
【発明が解決しようとする課題】
特許文献１に記載のシステムでは、クライアントは、サーバからハイパーテキストを受信し、そのハイパーテキスト内に表音文字列を要求する記述が含まれている場合、サーバに表音文字列を要求する。従って、クライアントは、コンテンツとなるハイパーテキストを受信しただけでは音声を出力できない。音声を出力する場合には、ハイパーテキストを受信した後、更にサーバに表音文字列を要求して表音文字列を受信する処理を行わなければならない。この結果、コンテンツとなるハイパーテキストを要求してから、音声の出力完了までに時間がかかってしまう。
【００１１】
また、端末の種類によっては、発音記号列（例えば、特許文献１における表音文字列）の仕様が異なる場合もある。例えば、携帯電話機を端末として用いる場合、通信会社Ａの携帯電話機と通信会社Ｂの携帯電話機とでは、発音記号列の仕様が異なる場合ある。コンテンツ提供者は、発音記号列の仕様が異なる複数の端末にコンテンツを提供しようとする場合、個々の仕様毎に発音記号列を作成しなければならない。このため、コンテンツ提供者の負担が大きくなってしまう。
【００１２】
例えば、通信会社Ａの携帯電話機の仕様では、「こんにちは」という音声を出力するためには「KONNICHIWA」という発音記号列を作成しなければならないとする。また、通信会社Ｂの携帯電話機の仕様では、「こんにちは」という音声を出力するためには「KON-NITIWA」という発音記号列を作成しなければならないとする。コンテンツ提供者が、「こんにちは」という音声を出力させるようなコンテンツを通信会社Ａ，Ｂそれぞれの携帯電話機に提供する場合、「KONNICHIWA」および「KON-NITIWA」という複数の発音記号列を作成しておかなければならず、発音記号列を作成する負担が大きかった。
【００１３】
特許文献１や特許文献２に記載のシステムでは、このようなコンテンツ提供者の負担は考慮されていない。そのため、発音記号列の仕様が異なる複数種類の端末にコンテンツを提供する場合、コンテンツ提供者の負担を軽減させることはできない。
【００１４】
また、特許文献２に記載の端末は、音声に合致する画像を表示する場合、画像データを音声のデータとは別個にダウンロードしなければならない。また、特許文献３に記載のＶｏｉｃｅＸＭＬを利用したシステムでは、音声対話サーバが音声に同期させて画面データを端末に送信しなければならない。
【００１５】
そこで、本発明は、コンテンツを受信した端末が音声を出力するまでの時間を短縮することを目的とする。また、発音記号列の仕様が異なる複数種類の端末にコンテンツを提供する場合に、コンテンツ提供者の負担を軽減させることを目的とする。
【００１６】
【課題を解決するための手段】
本発明によるコンテンツ配信システムは、コンテンツデータを配信するコンテンツ配信装置と、コンテンツ配信装置から受信したコンテンツデータに基づいてコンテンツを出力する端末装置とを備えたコンテンツ配信システムであって、コンテンツ配信装置は、音声として読み上げられるべき文字列である読み上げ文字列が、読み上げ文字列を示すタグとともに記述されたコンテンツデータの入力を受け付けるコンテンツ入力手段と、コンテンツデータ内の読み上げ文字列を、出力音声を特定するためのデータである発音記号列に置換するコンテンツ置換手段とを備え、端末装置は、コンテンツ配信装置から、発音記号列が記述されたコンテンツデータを受信するコンテンツ受信手段と、コンテンツデータから発音記号列を抽出する発音記号列抽出手段と、発音記号列に基づいて音声を出力する出力手段とを備え、前記コンテンツ置換手段は、前記コンテンツ入力手段に入力されたコンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する読み上げ文字列抽出手段と、前記読み上げ文字列を発音記号列に変換する変換手段と、前記コンテンツデータ内の読み上げ文字列を前記発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する置換手段とを備え、前記コンテンツ受信手段は、発音記号列を示すタグとともに発音記号列が記述されたコンテンツデータをコンテンツ配信装置から受信し、発音記号列抽出手段は、発音記号列を示すタグとともに記述された文字列を発音記号列として抽出することを特徴とする。
【００１８】
コンテンツ入力手段が、発音記号列の仕様を示す仕様情報が読み上げ文字列とともに記述されたコンテンツデータの入力を受け付け、読み上げ文字列抽出手段が、コンテンツデータから読み上げ文字列と仕様情報とを抽出し、変換手段が、読み上げ文字列を、仕様情報が示す仕様に応じた発音記号列に変換することが好ましい。そのような構成によれば、コンテンツ提供者は、各仕様に応じた発音記号列を記述しなくてよいので、コンテンツ提供者の負担が軽減される。
【００１９】
また、本発明によるコンテンツ配信システムは、コンテンツデータを配信するコンテンツ配信装置と、前記コンテンツ配信装置から受信したコンテンツデータに基づいてコンテンツを出力する端末装置と、音声として読み上げられるべき文字列である読み上げ文字列を発音記号列に変換する音声情報変換装置とを備えたコンテンツ配信システムであって、前記コンテンツ配信装置は、読み上げ文字列を示すタグとともに読み上げ文字列が記述されたコンテンツデータの入力を受け付けるコンテンツ入力手段と、前記コンテンツデータ内の読み上げ文字列を、出力音声を特定するためのデータである発音記号列に置換するコンテンツ置換手段とを備え、前記端末装置は、前記コンテンツ配信装置から、発音記号列が記述されたコンテンツデータを受信するコンテンツ受信手段と、前記コンテンツデータから発音記号列を抽出する発音記号列抽出手段と、前記発音記号列に基づいて音声を出力する出力手段とを備え、前記コンテンツ置換手段は、前記コンテンツ入力手段に入力されたコンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する読み上げ文字列抽出手段と、前記読み上げ文字列を前記音声情報変換装置に送信する読み上げ文字列送信手段と、前記音声情報変換装置から発音記号列を受信し、前記コンテンツデータ内の読み上げ文字列を前記発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する置換手段とを備え、前記音声情報変換装置は、前記コンテンツ配信装置から読み上げ文字列を受信し、前記読み上げ文字列を発音記号列に変換する変換手段と、前記発音記号列を前記コンテンツ配信装置に送信する発音記号列送信手段とを備え、前記コンテンツ受信手段は、発音記号列を示すタグとともに発音記号列が記述されたコンテンツデータをコンテンツ配信装置から受信し、発音記号列抽出手段は、発音記号列を示すタグとともに記述された文字列を発音記号列として抽出することを特徴とする。そのような構成によれば、音声情報変換装置に処理が分散され、コンテンツ配信サーバの処理負荷を軽減させることができる。
【００２０】
コンテンツ入力手段が、発音記号列の仕様を示す仕様情報が読み上げ文字列とともに記述されたコンテンツデータの入力を受け付け、読み上げ文字列抽出手段が、コンテンツデータから読み上げ文字列と仕様情報とを抽出し、読み上げ文字列送信手段が、読み上げ文字列と仕様情報を音声情報変換装置に送信し、変換手段が、読み上げ文字列を、仕様情報が示す仕様に応じた発音記号列に変換することが好ましい。そのような構成によれば、コンテンツ提供者は、各仕様に応じた発音記号列を記述しなくてよいので、コンテンツ提供者の負担が軽減される。
【００２９】
また、本発明によるコンテンツ配信装置は、端末装置にコンテンツデータを配信するコンテンツ配信装置であって、音声として読み上げられるべき文字列である読み上げ文字列が、読み上げ文字列を示すタグとともに記述されたコンテンツデータの入力を受け付けるコンテンツ入力手段と、コンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する読み上げ文字列抽出手段と、読み上げ文字列を、出力音声を特定するためのデータである発音記号列に変換する変換手段と、コンテンツデータ内の読み上げ文字列を発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する置換手段とを備えたことを特徴とする。
【００３０】
コンテンツ入力手段は、発音記号列の仕様を示す仕様情報が読み上げ文字列とともに記述されたコンテンツデータの入力を受け付け、読み上げ文字列抽出手段は、コンテンツデータから読み上げ文字列と仕様情報とを抽出し、変換手段は、読み上げ文字列を、仕様情報が示す仕様に応じた発音記号列に変換することが好ましい。そのような構成によれば、コンテンツ提供者は、各仕様に応じた発音記号列を記述しなくてよいので、コンテンツ提供者の負担が軽減される。
【００３１】
また、本発明によるコンテンツ配信装置は、音声として読み上げられるべき文字列である読み上げ文字列を、出力音声を特定するためのデータである発音記号列に変換する音声情報変換装置に接続され、端末装置にコンテンツデータを配信するコンテンツ配信装置であって、読み上げ文字列を示すタグとともに読み上げ文字列が記述されたコンテンツデータの入力を受け付けるコンテンツ入力手段と、コンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する読み上げ文字列抽出手段と、読み上げ文字列抽出手段が抽出した読み上げ文字列を音声情報変換装置に送信する読み上げ文字列送信手段と、読み上げ文字列から変換された発音記号列を音声情報変換装置から受信し、コンテンツデータ内の読み上げ文字列を発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する置換手段とを備えたことを特徴とする。
【００３２】
コンテンツ入力手段は、発音記号列の仕様を示す仕様情報が読み上げ文字列とともに記述されたコンテンツデータの入力を受け付け、読み上げ文字列抽出手段は、コンテンツデータから読み上げ文字列と仕様情報とを抽出し、読み上げ文字列送信手段は、読み上げ文字列と仕様情報を音声情報変換装置に送信することが好ましい。そのような構成によれば、コンテンツ提供者は、各仕様に応じた発音記号列を記述しなくてよいので、コンテンツ提供者の負担が軽減される。
【００３９】
本発明によるコンテンツ配信プログラムは、端末装置にコンテンツデータを配信するコンテンツ配信装置に搭載されるコンテンツ配信プログラムであって、コンピュータに、音声として読み上げられるべき文字列である読み上げ文字列が、読み上げ文字列を示すタグとともに記述されたコンテンツデータの入力を受け付ける処理、コンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する処理、読み上げ文字列を、出力音声を特定するためのデータである発音記号列に変換する処理、およびコンテンツデータ内の読み上げ文字列を発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する処理を実行させることを特徴とする。
【００４０】
また、本発明によるコンテンツ配信プログラムは、音声として読み上げられるべき文字列である読み上げ文字列を、出力音声を特定するためのデータである発音記号列に変換する音声情報変換装置に接続され、端末装置にコンテンツデータを配信するコンテンツ配信装置に搭載されるコンテンツ配信プログラムであって、コンピュータに、読み上げ文字列を示すタグとともに読み上げ文字列が記述されたコンテンツデータの入力を受け付ける処理、コンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する処理、読み上げ文字列を音声情報変換装置に送信する処理、および読み上げ文字列から変換された発音記号列を音声情報変換装置から受信し、コンテンツデータ内の読み上げ文字列を発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する処理を実行させることを特徴とする。
【００４４】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して説明する。
【００４５】
実施の形態１．
図１は、本発明による音声コンテンツ配信システムの第１の実施の形態を示すブロック図である。図１に示す音声コンテンツ配信システムは、コンテンツサーバ（コンテンツ配信装置）１と端末１１とを備える。コンテンツサーバ１と端末１１とは、通信ネットワーク２１を介して接続される。以下、通信ネットワーク２１がインターネットである場合を例に説明するが、通信ネットワーク２１はインターネットに限定されない。例えば、通信ネットワーク２１は、ＬＡＮ、ＷＡＮ等であってもよい。
【００４６】
コンテンツサーバ１は、発音記号列を含むコンテンツのデータを端末１１に送信する情報処理装置である。コンテンツは、マークアップ言語によって記述される。以下、マークアップ言語によって記述されたコンテンツのデータをコンテンツデータと記す。端末１１は、コンテンツサーバ１からコンテンツデータを受信し、コンテンツデータに含まれる発音記号列に基づいて音声を出力する。また、端末１１は、コンテンツデータの記述に従って画像を表示してもよい。
【００４７】
なお、コンテンツデータに記述される発音記号列は、音素文字の文字列であっても、音節文字の文字列であってもよい。
【００４８】
図１では一台の端末１１を示したが、コンテンツサーバ１に複数の端末１１が接続されてもよい。さらに、発音記号列の仕様が異なる複数種類の端末がコンテンツサーバ１に接続されてもよい。
【００４９】
コンテンツサーバ１に入力されるコンテンツデータは、端末１１において音声として読み上げられるべき文字列（以下、読み上げ文字列と記す。）を含む。読み上げ文字列は、コンテンツが配信される端末１１の発音記号列の仕様に従って記述される必要はない。例えば、端末１１が発音記号列の仕様として「ローマ字綴りであること」を要求している場合であっても、読み上げ文字列は漢字や仮名等で記述されていてよい。また、複数種類の端末がそれぞれ異なる発音記号列の仕様を要求している場合であっても、読み上げ文字列は各仕様に従っていなくてよい。コンテンツサーバ１は、読み上げ文字列が記述されたコンテンツデータの入力を受け付けると、コンテンツデータ内の読み上げ文字列を、指定された仕様の発音記号列に置換する。その後、コンテンツサーバ１は、端末１１からの要求に応じてコンテンツデータを送信する。
【００５０】
なお、コンテンツデータにおいて、読み上げ文字列として記述される文字列は、その文字列が読み上げ文字列であることを示すタグとともに記述される。さらに、その読み上げ文字列をどのような仕様の発音記号列に変換すべきかを示す識別情報も、読み上げ文字列およびタグとともに記述される。
【００５１】
また、読み上げ文字列が発音記号列に置換された場合、発音記号列として記述される文字列は、その文字列が発音記号列であることを示すタグとともに記述される。
【００５２】
図１に示すコンテンツ入力部２は、コンテンツ提供者から、コンテンツデータの入力を受け付ける。読み上げ文字列抽出置換部３は、入力されるコンテンツデータの中から読み上げ文字列および識別情報を抽出する処理や、コンテンツデータ内の読み上げ文字列を発音記号列に置換する処理を行う。変換部４は、読み上げ文字列に含まれる個々の文字や単語と発音記号列との対応関係を示す辞書データを記憶する記憶装置（図１において図示せず。）を含む。この記憶装置は、発音記号列の各仕様毎に辞書データを記憶する。変換部４は、この辞書データを用いて、読み上げ文字列から発音記号列への変換処理を行う。変換部４は、どの仕様に従う発音記号列に変換すべきかを識別情報に基づいて判定する。読み上げ文字列抽出置換部３は、変換部４によって変換された発音記号列を用いて、コンテンツデータ内の読み上げ文字列を発音記号列に置換する。コンテンツ送信部５は、端末１１からの要求に応じて、置換後のコンテンツデータを端末１１に送信する。
【００５３】
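In the terminal 11, the content receiving unit 12 receives content data from the content server 1. The phonetic symbol string extraction unit 13 extracts the phonetic symbol string from the content data received by the content receiving unit 12. The speech generation unit 14 generates a speech signal based on the phonetic symbol string extracted by the phonetic symbol string extraction unit 13. The audio output unit 15 outputs audio based on the speech signal generated by the speech generation unit 14. The terminal 11 may also include a display unit (display device) that displays content.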
【００５４】
図２は、本発明の第１の実施の形態の具体的な構成例を示すブロック図である。図２において、コンテンツサーバ１の制御部６は、記憶装置７が記憶するコンテンツ配信プログラムに従って処理を実行する。具体的には、制御部６は、コンテンツデータの入力受け付け処理、コンテンツデータの中から読み上げ文字列および識別情報を抽出する処理、その読み上げ文字列から発音記号列への変換処理、コンテンツデータ内の読み上げ文字列を発音記号列に置換する処理、端末１１へのコンテンツデータ送信処理を実行する。ネットワークインタフェース部８は、インターネット２１を介してコンテンツデータの送受信を行う。記憶装置７は、コンテンツ配信プログラムのほかに、辞書データを記憶する。また、一時記憶装置９は、コンテンツデータから抽出される読み上げ文字列、識別情報や、変換処理によって得られる発音記号列を一時的に記憶する記憶装置である。
【００５５】
なお、制御部６は、例えば、インターネット２１およびネットワークインタフェース部８を介して、コンテンツ提供者の端末（図示せず。）からコンテンツデータを受信することにより、コンテンツデータの入力を受け付ける。
【００５６】
記憶装置７は、コンピュータに、音声として読み上げられるべき文字列である読み上げ文字列が、読み上げ文字列を示すタグとともに記述されたコンテンツデータの入力を受け付ける処理、コンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する処理、読み上げ文字列を、出力音声を特定するためのデータである発音記号列に変換する処理、およびコンテンツデータ内の読み上げ文字列を発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する処理を実行させるためのコンテンツ配信プログラムを記憶する。
【００５７】
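The control unit 16 of the terminal 11 executes processing in accordance with the content output program stored in the storage device 17. Specifically, the control unit 16 executes processing for receiving content data from the content server 1, processing for extracting the phonetic symbol string from the content data, processing for generating a speech signal based on the phonetic symbol string, and processing for outputting audio. The network interface unit 18 transmits requests for content data and receives content data via the Internet 21. The audio output device 19 is an audio output device such as a speaker and outputs audio. The temporary storage device 20 is a storage device that temporarily stores the received content data and the phonetic symbol string extracted from the content data.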
【００５８】
記憶装置１７は、コンピュータに、出力音声を特定するためのデータである発音記号列が記述されたコンテンツデータをコンテンツ配信装置から受信する処理、コンテンツデータから発音記号列を抽出する処理、および発音記号列に基づいて音声を出力する処理を実行させるためのコンテンツ出力プログラムを記憶する。
【００５９】
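When the audio content distribution system is configured as shown in FIG. 2, the read-out character string extraction and replacement unit 3 is realized by the control unit 6 of the content server 1. The conversion unit 4 is realized by the storage device 7 and the control unit 6. The content input unit 2 and the content transmission unit 5 are realized by the control unit 6 and the network interface unit 8. The phonetic symbol string extraction unit 13 and the speech generation unit 14 are realized by the control unit 16 of the terminal 11. The content receiving unit 12 is realized by the control unit 16 and the network interface unit 18. The audio output unit 15 is realized by the audio output device 19.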
【００６０】
次に、動作について説明する。
図３は、コンテンツサーバ１の動作の例を示す流れ図である。コンテンツサーバ１のコンテンツ入力部２は、コンテンツ提供者からコンテンツデータの入力を受け付ける（ステップＳ１０１）。例えば、インターネット２１を介して、コンテンツ提供者の端末（図示せず。）からコンテンツデータを受信する。
【００６１】
コンテンツサーバ１に入力されるコンテンツデータには、読み上げ文字列となる文字列とともに、その文字列が読み上げ文字列であることを示すタグと、読み上げ文字列をどのような仕様の発音記号列に変換すべきかを示す識別情報とが記述される。図４は、コンテンツサーバ１に入力されるコンテンツデータの例を示す説明図である。図４に示す例において、「TTP 」は、次に記述される識別情報の後から「/ 」まで続く文字列が読み上げ文字列であることを示すタグである。「TTP 」の次に記述される「phoneme="Type1" 」は、識別情報が「Type1 」であることを示す記述である。従って図４に示す例では、「phoneme="Type1" 」と「/>」との間に記述された「ご訪問ありがとうございます」という文字列が読み上げ文字列になる。なお、コンテンツデータは、読み上げ文字列以外の記述を含んでいてもよい。例えば、ＨＴＭＬ（Hypertext Markup Language ）と同様のタグとともに、画像や文字列の表示を指定する記述を含んでいてもよい。図４では、ＨＴＭＬと同様の言語で記載される場合を示したが、画像や文字列の表示を指定する記述は、ＨＴＭＬ以外のマークアップ言語で記述されてもよい。
【００６２】
ステップＳ１０１において、コンテンツ入力部２は、入力されたコンテンツデータを、記憶装置７（図１において図示せず。）に記憶させる。
【００６３】
続いて、読み上げ文字列抽出置換部３は、コンテンツ入力部２に入力されて記憶装置７に記憶されたコンテンツデータから読み上げ文字列および識別情報を抽出する（ステップＳ１０２）。このとき、読み上げ文字列抽出置換部３は、読み上げ文字列を示すタグ（例えば、図４に示す「TTP 」）とともに記述されている識別情報を抽出し、また、そのタグとともに記述されている文字列を読み上げ文字列として抽出する。ステップＳ１０２において、読み上げ文字列抽出置換部３は、抽出した識別情報および読み上げ文字列を一時記憶装置９（図１において図示せず。）に記憶させる。
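By way of illustration only, the extraction of step S102 might be sketched as follows. FIG. 4 is not reproduced in this text, so the exact tag layout `<TTP phoneme="Type1" .../>` is an assumption inferred from the description of that figure, and the names `TTP_PATTERN` and `extract_read_out_strings` are hypothetical.

```python
import re

# Assumed layout of a read-out tag, inferred from the description of FIG. 4:
#   <TTP phoneme="Type1" ご訪問ありがとうございます/>
TTP_PATTERN = re.compile(
    r'<TTP\s+phoneme="(?P<spec>[^"]*)"\s*(?P<text>.*?)\s*/>', re.DOTALL
)


def extract_read_out_strings(content_data):
    """Return (identification information, read-out string) pairs in document order."""
    return [(m.group("spec"), m.group("text")) for m in TTP_PATTERN.finditer(content_data)]


sample = '<HTML><BODY><TTP phoneme="Type1" ご訪問ありがとうございます/></BODY></HTML>'
extracted = extract_read_out_strings(sample)
```

The content data itself is left untouched here; only the pairs that would be placed in the temporary storage device 9 are returned.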
【００６４】
続いて、変換部４は、読み上げ文字列抽出置換部３によって抽出されて一時記憶装置９に記憶された読み上げ文字列を発音記号列に変換する（ステップＳ１０３）。このとき、変換部４は、一時記憶装置９に記憶された識別情報に基づいて、どの辞書データを用いて変換すればよいのかを判定する。そして、変換部４は、識別情報に応じた辞書データを用いて、読み上げ文字列に含まれる各文字や単語を、対応する発音記号列に変換すればよい。例えば、図４に例示する「Type1 」という識別情報に応じた辞書データでは、「ご」と「GO」、「訪問」と「HOUMON」等の対応関係が示されているとする。この場合、変換部４は、「Type1 」に応じた辞書データを用いて、「ご訪問ありがとうございます」という読み上げ文字列を「GOHOUMON ARIGATOU GOZAIMASU 」という発音記号列に変換する。以下、読み上げ文字列から発音記号列への変換処理をＴＴＰ（Text-to-Phoneme）処理と記す。ステップＳ１０３において、変換部４は、読み上げ文字列から変換した発音記号列を一時記憶装置９に記憶させる。
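The dictionary-based conversion of step S103 could look roughly like the sketch below. The description only states that the dictionary data associates characters and words with phonetic symbols (for example 「ご」 with "GO" and 「訪問」 with "HOUMON"); the greedy longest-match lookup, the remaining dictionary entries, and the names `DICTIONARIES` and `text_to_phoneme` are assumptions made for illustration. Because the description does not say how the symbols are joined or spaced, the sketch returns them as a list.

```python
# Hypothetical dictionary data, one table per specification (identification information).
DICTIONARIES = {
    "Type1": {
        "ご": "GO",
        "訪問": "HOUMON",
        "ありがとう": "ARIGATOU",
        "ございます": "GOZAIMASU",
    },
}


def text_to_phoneme(read_out_string, spec):
    """Convert a read-out string into phonetic symbols by greedy longest match."""
    dictionary = DICTIONARIES[spec]  # the identification information selects the dictionary
    longest = max(len(key) for key in dictionary)
    symbols = []
    pos = 0
    while pos < len(read_out_string):
        # Try the longest candidate first so that "ございます" wins over "ご".
        for length in range(min(longest, len(read_out_string) - pos), 0, -1):
            chunk = read_out_string[pos:pos + length]
            if chunk in dictionary:
                symbols.append(dictionary[chunk])
                pos += length
                break
        else:
            raise KeyError(f"no dictionary entry at position {pos}")
    return symbols


symbols = text_to_phoneme("ご訪問ありがとうございます", "Type1")
```

A different specification would be supported by adding another table under its own identification information, without changing the content data supplied by the content provider.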
【００６５】
なお、ステップＳ１０３におけるＴＴＰ処理は、ステップＳ１０２で抽出した読み上げ文字列を発音記号列に変換する処理である。従って、コンテンツデータ自体は、ステップＳ１０３では変更されない。
【００６６】
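Next, the read-out character string extraction and replacement unit 3 replaces the read-out character string in the content data stored in the storage device 7 with the phonetic symbol string obtained by the TTP processing (the phonetic symbol string stored in the temporary storage device 9) (step S104). In step S104, the read-out character string extraction and replacement unit 3 need only replace the character string described together with the tag indicating a read-out character string with the phonetic symbol string. At this time, the read-out character string extraction and replacement unit 3 also deletes the description indicating the identification information (for example, phoneme="Type1" shown in FIG. 4) and replaces the tag indicating a read-out character string with the tag indicating a phonetic symbol string.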
【００６７】
図５は、置換後のコンテンツデータの例を示す説明図である。図５に示す例において、「PTS 」は、「/ 」まで続く文字列が発音記号列であることを示すタグである。図４に示す読み上げ文字列「ご訪問ありがとうございます」は、発音記号列「GOHOUMON ARIGATOU GOZAIMASU 」に置換される。また、識別情報を示す記述「phoneme="Type1" 」は削除され、読み上げ文字列を示すタグ「TTP 」は、発音記号列を示すタグ「PTS」に置換される。この結果、図５に例示するコンテンツデータが得られる。
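The replacement of step S104 can be sketched as a single substitution pass, shown below as a minimal example. The tag layouts `<TTP phoneme="..." .../>` and `<PTS .../>` are assumptions inferred from the descriptions of FIG. 4 and FIG. 5, and `replace_read_out_strings` and `stub_convert` are hypothetical names; the stub stands in for the TTP processing of step S103.

```python
import re

# Assumed input layout: <TTP phoneme="Type1" read-out string/>
TTP_PATTERN = re.compile(
    r'<TTP\s+phoneme="(?P<spec>[^"]*)"\s*(?P<text>.*?)\s*/>', re.DOTALL
)


def replace_read_out_strings(content_data, convert):
    """Replace every TTP tag with a PTS tag that carries the converted string.

    convert(text, spec) stands in for the TTP processing of step S103.
    """
    def substitute(match):
        phonetic = convert(match.group("text"), match.group("spec"))
        # The identification information is dropped and the tag name becomes PTS.
        return f"<PTS {phonetic}/>"

    return TTP_PATTERN.sub(substitute, content_data)


def stub_convert(text, spec):
    # Stand-in for the dictionary lookup; holds only the example from the description.
    table = {("ご訪問ありがとうございます", "Type1"): "GOHOUMON ARIGATOU GOZAIMASU"}
    return table[(text, spec)]


before = '<BODY><TTP phoneme="Type1" ご訪問ありがとうございます/></BODY>'
after = replace_read_out_strings(before, stub_convert)
```

Markup outside the TTP tag, such as descriptions that specify images or displayed text, passes through unchanged.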
【００６８】
読み上げ文字列抽出置換部３は、置換後のコンテンツデータを記憶装置７に記憶させておく。その後、コンテンツ送信部５は、端末１１からコンテンツデータの要求を受け付けた場合、インターネット２１を介して、置換後のコンテンツデータを端末１１に送信する（ステップＳ１０５）。
【００６９】
図６は、端末１１の動作の例を示す流れ図である。端末１１の使用者は、端末１１の仕様に従う発音記号列を含むコンテンツデータを要求するように端末１１を操作する。端末１１のコンテンツ受信部１２は、この操作に応じて、コンテンツサーバ１にコンテンツデータを要求する。そして、コンテンツ受信部１２は、要求したコンテンツデータをコンテンツサーバ１から受信する（ステップＳ１１１）。このコンテンツデータは、例えば、図５に示すような発音記号列を含む。ステップＳ１１１において、コンテンツ受信部１２は、受信したコンテンツデータを一時記憶装置２０（図１において図示せず。）に記憶させる。
【００７０】
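The phonetic symbol string extraction unit 13 extracts the phonetic symbol string from the content data stored in the temporary storage device 20 (step S112). At this time, the phonetic symbol string extraction unit 13 need only extract, as the phonetic symbol string, the character string described together with the tag indicating a phonetic symbol string (for example, "PTS" shown in FIG. 5). In step S112, the phonetic symbol string extraction unit 13 stores the extracted phonetic symbol string in the temporary storage device 20.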
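On the terminal side, the extraction of step S112 might be sketched as follows. The `<PTS .../>` layout is an assumption inferred from the description of FIG. 5, and `PTS_PATTERN` and `extract_phonetic_strings` are hypothetical names. Speech signal generation (PTS processing) is outside the scope of this sketch.

```python
import re

# Assumed layout of a phonetic symbol tag, inferred from the description of FIG. 5:
#   <PTS GOHOUMON ARIGATOU GOZAIMASU/>
PTS_PATTERN = re.compile(r"<PTS\s+(?P<symbols>.*?)\s*/>", re.DOTALL)


def extract_phonetic_strings(content_data):
    """Return every phonetic symbol string in the received content data, in order."""
    return [m.group("symbols") for m in PTS_PATTERN.finditer(content_data)]


received = "<BODY><PTS GOHOUMON ARIGATOU GOZAIMASU/></BODY>"
phonetic_strings = extract_phonetic_strings(received)
```

Because the string is already in the form required by the terminal, no further request to a server is needed before speech generation can start.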
【００７１】
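The speech generation unit 14 generates a speech signal for the audio to be output, based on the phonetic symbol string stored in the temporary storage device 20 (step S113). Hereinafter, this speech signal generation processing is referred to as PTS (Phoneme-to-Speech) processing. The speech generation unit 14 outputs the generated speech signal to the audio output unit 15 and causes the audio output unit 15 to output audio (step S114). For example, when the phonetic symbol string "GOHOUMON ARIGATOU GOZAIMASU" is extracted in step S112, the speech generation unit 14 generates, based on this symbol string, a speech signal corresponding to the utterance 「ご訪問ありがとうございます」 ("Thank you for visiting"). The speech generation unit 14 then outputs that speech signal to the audio output unit 15 and causes the audio output unit 15 to output the utterance 「ご訪問ありがとうございます」.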
【００７２】
本実施の形態に示す音声コンテンツ配信システムによれば、コンテンツサーバは、読み上げ文字列を含むコンテンツデータが入力されると、コンテンツデータ内の読み上げ文字列を発音記号列に置換する。そして、置換後のコンテンツデータを端末１１に送信する。従って、端末１１は、コンテンツデータを受信したときに、すぐに音声を出力することができる。よって、コンテンツデータを受信した端末が音声を出力するまでの時間を短縮することができる。
【００７３】
また、コンテンツサーバ１は、読み上げ文字列を、識別情報によって指定される発音記号列に変換する。従って、発音記号列の仕様が異なる複数の端末にコンテンツを提供しようとする場合であっても、コンテンツ提供者は、各仕様毎に発音記号列を記述する必要はなく、各仕様を指定する識別情報を読み上げ文字列とともに記述すればよい。例えば、ステップＳ１０１〜Ｓ１０５の説明では、「ご訪問ありがとうございます」という読み上げ文字列を「GOHOUMON ARIGATOU GOZAIMASU 」という発音記号列に変換する場合を示した。図４に示す「Type1 」の代わりに他の識別情報（例えば「Type2 」）が記述されたコンテンツデータが入力された場合、コンテンツサーバ１は、「ご訪問ありがとうございます」を他の仕様に従った発音記号列に置換する。このように、コンテンツ提供者は、「GOHOUMON ARIGATOU GOZAIMASU 」等の個々の仕様に沿う発音記号列を記述する必要はなく、「Type1 」等の識別情報を記述すればよい。従って、コンテンツ提供者の負担を軽減させることができる。
【００７４】
また、ＴＴＰ処理は、コンテンツサーバ１が行う。従って、端末１１が読み上げ文字列から発音記号列への変換処理を行う必要はないので、端末１１の処理を簡易化し、端末１１の生産コストを低くすることができる。また、発音記号列を含むコンテンツデータ内に画像等の表示を指定する記述が含まれていれば、端末１１は、その記述に従って画像等を表示すればよい。従って、発音記号列の情報と、画像データの表示に関する情報とを、別々にダウンロードする必要がない。また、コンテンツサーバが、音声と画像とを同期させて送信する必要もない。
【００７５】
本実施の形態において、各端末１１の発音記号列の仕様が共通である場合、コンテンツ入力部２に、識別情報が記述されていないコンテンツデータが入力されてもよい。この場合、変換部４は、読み上げ文字列を所定の仕様に従う発音記号列に変換してよい。
【００７６】
また、発音記号列の仕様の指定を、識別情報ではなく、タグによって行ってもよい。例えば、読み上げ文字列を示すタグとして「TTP 」、「TTPX」等の複数種類のタグを用い、所望の仕様毎に読み上げ文字列を示すタグを変えてもよい。この場合、読み上げ文字列抽出置換部３は、ステップＳ１０２において、読み上げ文字列を示すタグと読み上げ文字列とを抽出し、一時記憶装置に記憶させればよい。そして、変換部４は、そのタグの種類に応じた仕様に従って、読み上げ文字列を発音記号文字列に変換すればよい。
【００７７】
本実施の形態において、コンテンツ入力手段は、コンテンツ入力部２に相当する。コンテンツ置換手段は、読み上げ文字列抽出置換部３および変換部４に相当する。そして、読み上げ文字列抽出手段および置換手段は、読み上げ文字列抽出置換部３に相当し、変換手段は、変換部４に相当する。
【００７８】
また、コンテンツ受信手段は、コンテンツ受信部１２に相当する。発音記号列抽出手段は、発音記号列抽出部１３に相当する。出力手段は、音声生成部１４および音声出力部１５に相当する。
【００７９】
実施の形態２．
本実施の形態では、コンテンツサーバとは別に設けられる変換サーバがＴＴＰ処理（読み上げ文字列から発音記号列への変換処理）を行う。図７は、本発明による音声コンテンツ配信システムの第２の実施の形態を示すブロック図である。第１の実施の形態と同様の構成部は、図１と同一の符合を付し、説明を省略する。図７に示す音声コンテンツ配信システムは、コンテンツサーバ（コンテンツ配信装置）３１と変換サーバ（音声情報変換装置）４１と端末１１とを備える。コンテンツサーバ３１と変換サーバ４１と端末１１とは、通信ネットワーク２１を介して接続される。以下、通信ネットワーク２１がインターネットである場合を例に説明するが、第１の実施の形態と同様、通信ネットワーク２１はインターネットに限定されない。また、コンテンツサーバ３１と端末１１とを接続する通信ネットワークと、コンテンツサーバ３１と変換サーバ４１とを接続する通信ネットワークとが異なっていてもよい。
【００８０】
第１の実施の形態と同様、コンテンツサーバ３１に複数の端末１１が接続されてもよい。さらに、発音記号列の仕様が異なる複数種類の端末がコンテンツサーバ３１に接続されてもよい。
【００８１】
コンテンツサーバ３１には、図１に示すコンテンツサーバ１と同様に、読み上げ文字列を示すタグ、識別情報、および読み上げ文字列を含むコンテンツデータが入力される。そして、コンテンツサーバ３１は、読み上げ文字列が発音記号列に置換されたコンテンツデータを端末１１に送信する。ただし、コンテンツサーバ３１は、ＴＴＰ処理を行わず、変換サーバ４１がＴＴＰ処理を実行する。
【００８２】
コンテンツサーバ３１において、読み上げ文字列抽出置換部３２は、コンテンツ入力部２に入力されたコンテンツデータの中から読み上げ文字列や識別情報を抽出して変換サーバ４１に送信する。また、読み上げ文字列抽出置換部３２は、変換サーバ４１から発音記号列を受信し、コンテンツデータ内の読み上げ文字列を発音記号列に置換する。
【００８３】
変換サーバ４１の読み上げ文字列受信部４２は、読み上げ文字列抽出置換部３２から読み上げ文字列および識別情報を受信する。変換部４は、第１の実施の形態で示した変換部４と同様に、辞書データを記憶する記憶装置（図７において図示せず。）を含む。この記憶装置は、発音記号列の各仕様毎に辞書データを記憶する。そして、変換部４は、どの仕様に従う発音記号列に変換すべきかを識別情報に基づいて判定し、ＴＴＰ処理を行う。発音記号列送信部４３は、変換部４によって変換された発音記号列を読み上げ文字列抽出置換部３２に送信する。
【００８４】
端末１１の構成は、第一の実施の形態の端末の構成と同様である。
【００８５】
図８は、本発明の第２の実施の形態の具体的な構成例を示すブロック図である。図８において、コンテンツサーバ３１の制御部３６は、記憶装置３７が記憶するコンテンツ配信プログラムに従って処理を実行する。具体的には、制御部３６は、コンテンツデータの入力受け付け処理、コンテンツデータの中から読み上げ文字列や識別情報を抽出して変換サーバ４１に送信する処理、変換サーバ４１から発音記号列を受信し、コンテンツデータ内の読み上げ文字列を発音記号列に置換する処理、端末１１へのコンテンツデータ送信処理を実行する。ネットワークインタフェース部３８は、インターネット２１を介してコンテンツデータの送受信を行う。また、一時記憶装置３９は、コンテンツデータから抽出される読み上げ文字列、識別情報や、変換サーバから受信する発音記号列を一時的に記憶する記憶装置である。なお、制御部３６は、例えば、インターネット２１およびネットワークインタフェース部３８を介して、コンテンツ提供者の端末（図示せず。）からコンテンツデータを受信することにより、コンテンツデータの入力を受け付ける。
【００８６】
記憶装置３７は、コンピュータに、読み上げ文字列を示すタグとともに読み上げ文字列が記述されたコンテンツデータの入力を受け付ける処理、コンテンツデータから、読み上げ文字列を示すタグとともに記述された文字列を読み上げ文字列として抽出する処理、読み上げ文字列を音声情報変換装置に送信する処理、および読み上げ文字列から変換された発音記号列を音声情報変換装置から受信し、コンテンツデータ内の読み上げ文字列を発音記号列に置換し、読み上げ文字列を示すタグを、発音記号列を示すタグに置換する処理を実行させるためのコンテンツ配信プログラムを記憶する。
【００８７】
また、変換サーバ４１の制御部４６は、記憶装置４７が記憶する音声情報変換プログラムに従って処理を実行する。具体的には、制御部４６は、読み上げ文字列および識別情報の受信処理、ＴＴＰ処理、ＴＴＰ処理によって得た発音記号列の送信処理を実行する。ネットワークインタフェース部４８は、インターネット２１を介して、データ（例えば、読み上げ文字列、識別情報、発音記号列）を送受信する。記憶装置４７は、音声情報変換プログラムのほかに、辞書データを記憶する。一時記憶装置４９は、コンテンツサーバから受信する読み上げ文字列、識別情報や、ＴＴＰ処理によって得られる発音記号列を一時的に記憶する記憶装置である。
【００８８】
記憶装置４７は、コンピュータに、通信ネットワークを介して接続される情報処理装置から、音声として読み上げられるべき文字列である読み上げ文字列を受信する処理、読み上げ文字列を、出力音声を特定するためのデータである発音記号列に変換する処理、および発音記号列を情報処理装置に送信する処理を実行させるための音声情報変換プログラムを記憶する。
【００８９】
図８に示す端末１１の構成の例は、図２に示す場合と同様である。
【００９０】
音声コンテンツ配信システムを図８に示すような構成とした場合、読み上げ文字列抽出置換部３２、コンテンツ入力部２およびコンテンツ送信部５は、コンテンツサーバ３１の制御部３６およびネットワークインタフェース部３８によって実現される。また、読み上げ文字列受信部４２および発音記号列送信部４３は、変換サーバ４１の制御部４６およびネットワークインタフェース部４８によって実現される。変換部４は、制御部４６および記憶装置４７によって実現される。
【００９１】
次に、動作について説明する。
図９は、コンテンツサーバ３１および変換サーバ４１の動作の例を示す流れ図である。コンテンツサーバ３１のコンテンツ入力部２は、コンテンツ提供者からコンテンツデータの入力を受け付け、そのコンテンツデータを記憶装置３７（図７において図示せず。）に記憶させる。（ステップＳ１２１）。この処理は、ステップＳ１０１と同様の処理である。第一の実施の形態と同様、読み上げ文字列となる文字列とともに、その文字列が読み上げ文字列であることを示すタグと、読み上げ文字列をどのような仕様の発音記号列に変換すべきかを示す識別情報とが記述されたコンテンツデータが入力される。
【００９２】
読み上げ文字列抽出置換部３２は、コンテンツ入力部２に入力されて記憶装置３７に記憶されたコンテンツデータから、読み上げ文字列および識別情報を抽出し、一時記憶装置３９（図７において図示せず。）に記憶させる。このとき読み上げ文字列抽出置換部３２は、読み上げ文字列を示すタグとともに記述されている識別情報を抽出し、また、そのタグとともに記述されている文字列を読み上げ文字列として抽出する。読み上げ文字列抽出置換部３２は、一時記憶装置３９に記憶させた読み上げ文字列および識別情報を、インターネット２１を介して変換サーバ４１に送信する（ステップＳ１２２）。
【００９３】
なお、読み上げ文字列抽出置換部３２は、予め変換サーバ４１のアドレス情報を記憶装置に記憶しておけばよい。そして、ステップＳ１２２では、そのアドレス情報を用いて、読み上げ文字列および識別情報を変換サーバ４１に送信すればよい。あるいは、コンテンツ提供者が、読み上げ文字列を示すタグ、識別情報、読み上げ文字列とともに、変換サーバ４１のアドレス情報をコンテンツデータ中に記述しておいてもよい。この場合、読み上げ文字列抽出置換部３２は、コンテンツデータからアドレス情報を抽出し、そのアドレス情報を用いて、読み上げ文字列および識別情報を変換サーバ４１に送信すればよい。
【００９４】
変換サーバ４１の読み上げ文字列受信部４２は、読み上げ文字列および識別情報を受信すると、一時記憶装置４９（図８において図示せず。）に記憶させる。
【００９５】
変換部４は、一時記憶装置４９に記憶された識別情報に基づいて、使用すべき辞書データを判定する。そして、その辞書データを用いて、読み上げ文字列受信部４２から送られる読み上げ文字列を発音記号列に変換する（ステップＳ１２３）。このＴＴＰ処理は、ステップＳ１０３と同様の処理である。変換部４は、変換した発音記号列を一時記憶装置４９に記憶させる。発音記号列送信部４３は、その発音記号列を、インターネット２１を介してコンテンツサーバ３１に送信する（ステップＳ１２４）。
【００９６】
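The read-out character string extraction and replacement unit 32 of the content server 31 receives the phonetic symbol string transmitted by the conversion server 41 and stores it in the temporary storage device 39. Next, using that phonetic symbol string, the read-out character string extraction and replacement unit 32 replaces the read-out character string in the content data stored in the storage device 37 with the phonetic symbol string (step S125). The read-out character string extraction and replacement unit 32 need only replace the character string described together with the tag indicating a read-out character string with the phonetic symbol string. At this time, the read-out character string extraction and replacement unit 32 also deletes the description indicating the identification information (for example, phoneme="Type1" shown in FIG. 4) and replaces the tag indicating a read-out character string with the tag indicating a phonetic symbol string.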
【００９７】
読み上げ文字列抽出置換部３２は、置換後のコンテンツデータを記憶装置３７に記憶させておく。その後、コンテンツ送信部５は、端末１１からの要求に応じて、置換後のコンテンツデータを端末１１に送信する（ステップＳ１２６）。ステップＳ１２６の処理は、ステップＳ１０５の処理と同様の処理である。
【００９８】
端末１１がコンテンツサーバ３１からコンテンツデータを受信して、音声を出力する際の動作は、第１の実施の形態と同様である。
【００９９】
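In the present embodiment as well, the content server 31 replaces the read-out character string in the input content data with a phonetic symbol string and transmits the content data after replacement to the terminal 11. Accordingly, the time until a terminal that has received the content data outputs audio can be shortened. In addition, the read-out character string is converted into a phonetic symbol string of the specification designated by the identification information, so the burden on the content provider can be reduced. Since the terminal 11 does not need to perform the TTP processing of step S123, the processing of the terminal 11 can be simplified and the production cost of the terminal 11 can be lowered. The terminal 11 also does not need to download the phonetic symbol string information and the information on displaying image data separately, and the content server does not need to transmit audio and images in synchronization.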
【０１００】
さらに、本実施の形態によれば、コンテンツデータの置換や配信を行うコンテンツサーバ３１と、ステップＳ１２３のＴＴＰ処理を実行する変換サーバ４１とを別々に設けたので、処理の分散化を図れる。特に、ＴＴＰ処理の処理負荷は大きいので、コンテンツサーバ３１における処理負荷を軽減させることができる。
【０１０１】
第１の実施の形態と同様に、各端末１１の発音記号列の仕様が共通である場合、コンテンツ入力部２に、識別情報が記述されていないコンテンツデータが入力されてもよい。また、発音記号列の仕様の指定を、識別情報ではなく、タグによって行ってもよい。この場合、読み上げ文字列抽出置換部３２は、ステップＳ１２２において、読み上げ文字列を示すタグと読み上げ文字列とを抽出し、変換サーバ４１に送信すればよい。そして、変換サーバ４１の変換部４は、そのタグの種類に応じた仕様に従って、読み上げ文字列を発音記号文字列に変換すればよい。
【０１０２】
本実施の形態において、音声記号列の仕様毎に変換サーバが設けられ、個々の変換サーバがそれぞれ特定の仕様に従ってＴＴＰ処理を行うように構成されていてもよい。この場合、コンテンツサーバ３１は、識別情報等によって指定された仕様に対応する変換サーバに読み上げ文字列を送信すればよい。
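A configuration with one conversion server per specification could be sketched as a simple lookup from the identification information to the server that handles it. In the sketch below each conversion server is modelled as a local function rather than a networked device, and the registry, the function names, and the pairing of "Type1" and "Type2" with the strings "KONNICHIWA" and "KON-NITIWA" (taken from the communication company A and B example) are all assumptions made for illustration.

```python
def make_conversion_server(dictionary):
    """Build a stand-in conversion server that handles exactly one specification."""
    def convert(read_out_string):
        return dictionary[read_out_string]
    return convert


# Hypothetical registry: identification information -> conversion server for that specification.
conversion_servers = {
    "Type1": make_conversion_server({"こんにちは": "KONNICHIWA"}),
    "Type2": make_conversion_server({"こんにちは": "KON-NITIWA"}),
}


def request_conversion(read_out_string, spec):
    """Send the read-out string to the conversion server that handles the given specification."""
    try:
        server = conversion_servers[spec]
    except KeyError:
        raise ValueError(f"no conversion server registered for {spec!r}") from None
    return server(read_out_string)


type1_result = request_conversion("こんにちは", "Type1")
type2_result = request_conversion("こんにちは", "Type2")
```

With this arrangement the identification information is used only to choose the destination, so each conversion server needs dictionary data for its own specification alone.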
【０１０３】
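In the present embodiment, the content input means corresponds to the content input unit 2. The content replacement means corresponds to the read-out character string extraction and replacement unit 32. The read-out character string extraction means, the read-out character string transmission means, and the replacement means also correspond to the read-out character string extraction and replacement unit 32.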
【０１０４】
また、変換手段は、読み上げ文字列受信部４２および変換部４に相当する。発音記号列送信手段は、発音記号列送信部４３に相当する。
【０１０５】
また、コンテンツ受信手段は、コンテンツ受信部１２に相当する。発音記号列抽出手段は、発音記号列抽出部１３に相当する。出力手段は、音声生成部１４および音声出力部１５に相当する。
【０１０６】
実施の形態３．
本実施の形態では、端末は、読み上げ文字列が記述されたコンテンツデータを受信する。ただし、端末は、読み上げ文字列が記述されたコンテンツデータの他に、発音記号列が記述されたコンテンツデータも受信してよい。図１０は、本発明による音声コンテンツ配信システムの第３の実施の形態を示すブロック図である。第１または第２の実施の形態と同様の構成部は、図１または図７と同一の符合を付し、説明を省略する。図１０に示す音声コンテンツ配信システムは、コンテンツサーバ（コンテンツ配信装置）５１と変換サーバ（音声情報変換装置）４１と端末６１とを備える。コンテンツサーバ５１と変換サーバ４１と端末６１とは、通信ネットワーク２１を介して接続される。以下、通信ネットワーク２１がインターネットである場合を例に説明するが、第１の実施の形態と同様、通信ネットワーク２１はインターネットに限定されない。また、コンテンツサーバ５１と端末６１とを接続する通信ネットワークと、端末６１と変換サーバ４１とを接続する通信ネットワークとが異なっていてもよい。
【０１０７】
第１の実施の形態と同様、コンテンツサーバ５１に複数の端末６１が接続されてもよい。さらに、発音記号列の仕様が異なる複数種類の端末がコンテンツサーバ５１に接続されてもよい。
【０１０８】
コンテンツサーバ５１には、読み上げ文字列を示すタグ、識別情報、および読み上げ文字列を含むコンテンツデータ（例えば、図４に例示するコンテンツデータ）が入力される。コンテンツサーバ５１は、そのコンテンツデータを、読み上げ文字列を含む状態のまま端末６１に送信する。端末６１は、このコンテンツデータを受信すると、変換サーバ４１にＴＴＰ処理を実行させ、その結果得られる発音記号列に基づいて音声を出力する。
【０１０９】
図１０では図示していないが、第１の実施の形態あるいは第２の実施の形態で示したコンテンツサーバが通信ネットワーク（本例ではインターネット２１）に接続されていてもよい。そして、端末６１は、このコンテンツサーバから、発音記号列を含むコンテンツデータを受信して、音声を出力してもよい。
【０１１０】
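In the content server 51, the content input unit 2 receives input of content data from the content provider, as in the first embodiment. The content transmission unit 5 transmits the content data to the terminal 61 in response to a request from the terminal 61. However, the content transmission unit 5 transmits the content data to the terminal 61 with the read-out character string still included.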
【０１１１】
変換サーバ４１の構成は、第２の実施の形態における変換サーバの構成と同様である。ただし、本実施の形態では、変換サーバ４１は、端末６１から読み上げ文字列および識別情報を受信し、その端末６１に発音記号列を送信する。
【０１１２】
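In the terminal 61, the data extraction unit 63 extracts the read-out character string or the phonetic symbol string from the content data received by the content receiving unit 12. The timing control unit 62 controls the timing at which the data extraction unit 63 extracts the read-out character string or the phonetic symbol string from the content data. The conversion request unit 64 transmits the read-out character string extracted by the data extraction unit 63 to the conversion server 41 and requests conversion from the read-out character string to a phonetic symbol string (TTP processing). The conversion request unit 64 then receives the phonetic symbol string obtained by the TTP processing from the conversion server 41. The speech generation unit 65 generates a speech signal based on the phonetic symbol string received by the conversion request unit 64 or the phonetic symbol string extracted by the data extraction unit 63. The terminal 61 may also include a display unit (display device) that displays content.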
【０１１３】
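FIG. 11 is a block diagram showing a specific configuration example of the third embodiment of the present invention. In FIG. 11, the control unit 56 of the content server 51 executes processing for accepting input of content data and processing for transmitting content data to the terminal 61, in accordance with the content distribution program stored in the storage device 57. The network interface unit 58 transmits and receives content data via the Internet 21.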
【０１１４】
また、端末６１の制御部６６は、記憶装置６７が記憶するコンテンツ出力プログラムに従って処理を実行する。具体的には、制御部６６は、コンテンツサーバ５１からのコンテンツデータ受信処理、コンテンツデータから読み上げ文字列または発音記号列を抽出する処理、抽出した読み上げ文字列を変換サーバ４１に送信して変換サーバ４１から発音記号列を受信する処理、音声記号列に基づく音声信号生成処理、音声を出力させる処理を実行する。ネットワークインタフェース部６８は、インターネット２１を介して、コンテンツデータの要求の送信や、コンテンツデータの受信を行う。音声出力装置６９は、スピーカ等の音声出力装置であり、音声を出力する。一時記憶装置７０は、受信したコンテンツデータや、発音記号列を一時的に記憶する記憶装置である。
【０１１５】
記憶装置６７は、コンピュータに、コンテンツ配信装置から、読み上げ文字列が記述されたコンテンツデータを受信する処理、コンテンツデータから読み上げ文字列を抽出する処理、読み上げ文字列を音声情報変換装置に送信する処理、読み上げ文字列から変換された発音記号列を音声情報変換装置から受信する処理、および発音記号列に基づいて音声を出力する処理を実行させるためのコンテンツ出力プログラムを記憶する。
【０１１６】
図１１に示す変換サーバ４１の構成の例は、図８に示す場合と同様である。
【０１１７】
音声コンテンツ配信システムを図１１に示すような構成とした場合、コンテンツ入力部２およびコンテンツ送信部５は、コンテンツサーバ５１の制御部５６およびネットワークインタフェース部５８によって実現される。また、コンテンツ受信部１２および変換要求部６４は、制御部６６およびネットワークインタフェース部６８によって実現される。また、タイミング制御部６２、データ抽出部６３および音声生成部６５は、制御部６６によって実現される。音声出力部１５は、音声出力装置６９によって実現される。
【０１１８】
次に、動作について説明する。
コンテンツサーバ５１のコンテンツ入力部２は、コンテンツ提供者からコンテンツデータの入力を受け付け、そのコンテンツデータを記憶装置５７（図１０において図示せず。）に記憶させる。この処理は、ステップＳ１０１と同様の処理である。第一の実施の形態と同様、読み上げ文字列となる文字列とともに、その文字列が読み上げ文字列であることを示すタグと、読み上げ文字列をどのような仕様の発音記号列に変換すべきかを示す識別情報とが記述されたコンテンツデータが入力される。
【０１１９】
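When the content transmission unit 5 receives a request for content data from the terminal 61, it transmits the content data stored in the storage device 57 to the terminal 61. The input content data has not undergone replacement or any other processing. Accordingly, the content transmission unit 5 transmits content data that includes the tag indicating a read-out character string, the identification information, and the read-out character string.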
【０１２０】
図１２は、端末６１の動作の例を示す流れ図である。端末６１のコンテンツ受信部１２は、使用者の操作に応じて、コンテンツサーバ５１にコンテンツデータを要求する。あるいは、使用者の操作に応じて、第１の実施の形態あるいは第２の実施の形態と同様のコンテンツサーバ（図１０において図示せず。）に、発音記号列を含むコンテンツデータを要求してもよい。そして、コンテンツ受信部１２は、要求したコンテンツデータをコンテンツサーバから受信し、一時記憶装置７０（図１０において図示せず。）に記憶させる（ステップＳ１３１）。
【０１２１】
データ抽出部６３は、タイミング制御部６２の制御に従い、一時記憶装置７０に記憶されるコンテンツデータから読み上げ文字列または発音記号列を抽出する。タイミング制御部６２は、例えば、コンテンツ受信部１２がコンテンツデータを受信した直後（ステップＳ１３１の直後）に、データ抽出部６３に読み上げ文字列等の抽出を開始させてもよい。あるいは、コンテンツデータを受信後に、使用者によって音声出力を指示する操作が行われたときに、データ抽出部６３に読み上げ文字列等の抽出を開始させてもよい。また、コンテンツデータに基づいて表示部に画像等を表示させてから所定期間後に、データ抽出部６３に読み上げ文字列等の抽出を開始させてもよい。
【０１２２】
データ抽出部６３は、コンテンツデータから読み上げ文字列または発音記号列を抽出する際、まず、コンテンツデータに読み上げ文字列が含まれているのか、発音記号列が含まれているのかを判定する（ステップＳ１３２）。データ抽出部６３は、コンテンツデータに、読み上げ文字列を示すタグが記述されていれば、読み上げ文字列が含まれていると判定する。また、発音記号列を示すタグが記述されていれば、発音記号列が含まれていると判定する。
【０１２３】
データ抽出部６３は、読み上げ文字列が含まれていると判定した場合、コンテンツデータから読み上げ文字列および識別情報を抽出し、一時記憶装置７０に記憶させる。このときデータ抽出部６３は、読み上げ文字列を示すタグとともに記述されている識別情報を抽出し、また、そのタグとともに記述されている文字列を読み上げ文字列として抽出する。
【０１２４】
変換要求部６４は、一時記憶装置７０に記憶された読み上げ文字列および識別情報を、インターネット２１を介して変換サーバ４１に送信する（ステップＳ１３３）。
【０１２５】
なお、変換要求部６４は、予め変換サーバ４１のアドレス情報を記憶装置に記憶しておけばよい。そして、ステップＳ１３３では、そのアドレス情報を用いて、読み上げ文字列および識別情報を変換サーバ４１に送信すればよい。あるいは、コンテンツ提供者が、読み上げ文字列を示すタグ、識別情報、読み上げ文字列とともに、変換サーバ４１のアドレス情報も記述したデータをコンテンツサーバ５１に入力しておいてもよい。この場合、データ抽出部６３がコンテンツデータからアドレス情報を抽出し、変換要求部６４は、そのアドレス情報を用いて、読み上げ文字列および識別情報を変換サーバ４１に送信すればよい。
【０１２６】
変換サーバ４１の読み上げ文字列受信部４２は、端末６１から読み上げ文字列および識別情報を受信する。そして、変換部４は、ＴＴＰ処理を実行し、受信した読み上げ文字列を発音記号列に変換する（ステップＳ１３４）。そして、その発音記号列を端末６１に送信する（ステップＳ１３５）。変換サーバ４１が読み上げ文字列および識別情報を受信してから、発音記号列を送信するまでの動作は、第２の実施の形態における変換サーバ４１の動作と同様である。
【０１２７】
ステップＳ１３５において、端末６１の変換要求部６４は、変換サーバ４１が送信した発音記号列を受信し、一時記憶装置７０に記憶させる。
【０１２８】
また、データ抽出部６３は、コンテンツデータに発音記号列が含まれていると判定した場合（ステップＳ１３２）、コンテンツデータから発音記号列を抽出し、一時記憶装置７０に記憶させる（ステップＳ１３６）。
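The branch of steps S132 to S136 might be sketched as follows: the terminal inspects each tag, forwards read-out character strings for conversion, and takes phonetic symbol strings as they are. The tag layouts are assumptions inferred from the descriptions of FIG. 4 and FIG. 5, the conversion server is modelled as a callable passed in by the caller, and all names are hypothetical.

```python
import re

# Assumed layouts: <TTP phoneme="Type1" read-out string/> and <PTS phonetic symbol string/>
TAG_PATTERN = re.compile(r"<(?P<tag>TTP|PTS)\s+(?P<body>.*?)\s*/>", re.DOTALL)
SPEC_PATTERN = re.compile(r'phoneme="(?P<spec>[^"]*)"\s*(?P<text>.*)', re.DOTALL)


def collect_phonetic_strings(content_data, request_conversion):
    """Return the phonetic symbol strings to be spoken, in document order.

    request_conversion(text, spec) stands in for the exchange with the
    conversion server (steps S133 to S135).
    """
    phonetic_strings = []
    for match in TAG_PATTERN.finditer(content_data):
        body = match.group("body")
        if match.group("tag") == "PTS":
            # Step S136: the string is already a phonetic symbol string.
            phonetic_strings.append(body)
            continue
        parsed = SPEC_PATTERN.match(body)
        if parsed is None:
            # No identification information: pass None and let the server use its default.
            phonetic_strings.append(request_conversion(body, None))
        else:
            phonetic_strings.append(
                request_conversion(parsed.group("text"), parsed.group("spec"))
            )
    return phonetic_strings


def stub_server(text, spec):
    # Stand-in conversion server holding only the example from the description.
    return {"ご訪問ありがとうございます": "GOHOUMON ARIGATOU GOZAIMASU"}[text]


mixed = '<PTS KONNICHIWA/><TTP phoneme="Type1" ご訪問ありがとうございます/>'
spoken = collect_phonetic_strings(mixed, stub_server)
```

Handling both tags in one pass also covers content data in which read-out character strings and phonetic symbol strings appear together.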
【０１２９】
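The speech generation unit 65 executes PTS processing (speech signal generation processing) based on the phonetic symbol string stored in the temporary storage device 70 in step S135 or step S136 (step S137). The speech generation unit 65 outputs the generated speech signal to the audio output unit 15 and causes the audio output unit 15 to output audio (step S138).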
【０１３０】
本実施の形態によれば、端末６１は、発音記号列を含むコンテンツデータを受信した場合だけでなく、読み上げ文字列を含むコンテンツを受信した場合にも、音声を出力することができる。
【０１３１】
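When the terminal 61 receives content data including a phonetic symbol string, it outputs audio based on that phonetic symbol string, so the time until audio is output can be shortened. The content provider does not need to describe a phonetic symbol string for each specification, so the burden on the content provider can be reduced. The terminal 61 also does not need to download the phonetic symbol string information and the information on displaying image data separately, and the content server does not need to transmit audio and images in synchronization.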
【０１３２】
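In the third embodiment, content data in which both a read-out character string and a phonetic symbol string are described may be input to the content server 51. In this content data, the read-out character string is described together with the tag indicating a read-out character string and the identification information, and the phonetic symbol string is described together with the tag indicating a phonetic symbol string. When the terminal 61 receives content data in which both are described, the data extraction unit 63 extracts, as the phonetic symbol string, the character string described together with the tag indicating a phonetic symbol string. The data extraction unit 63 also extracts, as the read-out character string, the character string described together with the tag indicating a read-out character string, and extracts the identification information as well. The conversion request unit 64 transmits the read-out character string and the identification information to the conversion server 41 and receives a phonetic symbol string from the conversion server 41. The speech generation unit 65 generates a speech signal from the phonetic symbol string extracted from the content data and the phonetic symbol string received from the conversion server 41.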
【０１３３】
なお、第１の実施の形態と同様に、各端末１１の発音記号列の仕様が共通である場合、コンテンツ入力部２に、識別情報が記述されていないコンテンツデータが入力されてもよい。また、発音記号列の仕様の指定を、識別情報ではなく、タグによって行ってもよい。この場合、ステップＳ１３３において、データ抽出部６３は、読み上げ文字列を示すタグと読み上げ文字列とを抽出し、変換要求部６４は、このタグと読み上げ文字列を変換サーバ４１に送信すればよい。そして、変換サーバ４１の変換部４は、そのタグの種類に応じた仕様に従って、読み上げ文字列を発音記号文字列に変換すればよい。
【０１３４】
また、本実施の形態において、音声記号列の仕様毎に変換サーバ４１が設けられ、個々の変換サーバ４１がそれぞれ特定の仕様に従ってＴＴＰ処理を行うように構成されていてもよい。この場合、端末６１は、識別情報等によって指定された仕様に対応する変換サーバに読み上げ文字列を送信すればよい。
【０１３５】
本実施の形態において、コンテンツ受信手段は、コンテンツ受信部１２に相当する。コンテンツ出力手段は、データ抽出部６３、変換要求部６４、音声生成部６５および音声出力部１５に相当する。そして、データ抽出手段はデータ抽出部６３に相当し、送受信手段は変換要求部６４に相当し、出力手段は音声生成部６５および音声出力部１５に相当する。タイミング制御手段は、タイミング制御部６２に相当する。
【０１３６】
また、変換手段は、読み上げ文字列受信部４２および変換部４に相当する。発音記号列送信手段は、発音記号列送信部４３に相当する。
【０１３７】
【発明の効果】
本発明によれば、コンテンツ配信装置が、音声として読み上げられるべき文字列である読み上げ文字列が記述されたコンテンツデータの入力を受け付けるコンテンツ入力手段と、コンテンツデータ内の読み上げ文字列を、出力音声を特定するためのデータである発音記号列に置換するコンテンツ置換手段とを備え、端末装置が、コンテンツ配信装置から、発音記号列が記述されたコンテンツデータを受信するコンテンツ受信手段と、コンテンツデータから発音記号列を抽出する発音記号列抽出手段と、発音記号列に基づいて音声を出力する出力手段とを備えているので、コンテンツを受信した端末が音声を出力するまでの時間を短縮できる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態を示すブロック図である。
【図２】第１の実施の形態の具体的な構成例を示すブロック図である。
【図３】コンテンツサーバの動作の例を示す流れ図である。
【図４】コンテンツサーバに入力されるコンテンツデータの例を示す説明図である。
【図５】置換後のコンテンツデータの例を示す説明図である。
【図６】端末の動作の例を示す流れ図である。
【図７】本発明の第２の実施の形態を示すブロック図である。
【図８】第２の実施の形態の具体的な構成例を示すブロック図である。
【図９】コンテンツサーバおよび変換サーバの動作の例を示す流れ図である。
【図１０】本発明の第３の実施の形態を示すブロック図である。
【図１１】第３の実施の形態の具体的な構成例を示すブロック図である。
【図１２】端末および変換サーバの動作の例を示す流れ図である。
【符号の説明】
１コンテンツサーバ
２コンテンツ入力部
３読み上げ文字列抽出置換部
４変換部
５コンテンツ送信部
１１端末装置
１２コンテンツ受信部
１３発音記号列抽出部
１４音声生成部
１５音声出力部
２１通信ネットワーク

[0001]
[Technical Field of the Invention]
The present invention relates to an audio content distribution system that distributes content for outputting audio to a terminal device, and to a content distribution device, an audio information conversion device, a terminal device, a content distribution program, an audio information conversion program, and a content output program that are applied to the audio content distribution system.
[0002]
[Prior art]
Services that distribute content to a terminal device (hereinafter referred to as a terminal) so that the user of the terminal can browse the content are now widespread. Most such content causes the terminal to display images, text, and the like. However, various systems have also been proposed that distribute content for outputting audio and output that audio at the terminal.
[0003]
For example, Patent Document 1 proposes a system that transmits hypertext including a description requesting a phonetic character string to a client. In the system described in Patent Literature 1, when a description requesting a phonetic character string is included in the received hypertext, the client requests a phonetic character string from the server. Then, the client receives information on the phonetic character string from the server, and outputs a voice according to the phonetic character string.
[0004]
Patent Document 2 proposes a system in which a terminal downloads data to be speech-synthesized, such as text data, and reads that data aloud. Patent Document 2 also describes the terminal downloading phoneme data of a character voice desired by the user of the terminal and reading out text data and the like in that character voice. It further describes the case where the terminal downloads a speech synthesis processing program or image data.
[0005]
A system using VoiceXML, a markup language for voice processing, has also been proposed (for example, Patent Document 3). In the system described in Patent Document 3, a voice dialogue server generates speech based on a description written in VoiceXML and transmits the speech to a terminal via a telephone line. The voice dialogue server also transmits screen display data to the terminal in synchronization with the speech.
[0006]
In the following description, data for the terminal to specify the output voice is referred to as a phonetic symbol string.
[0007]
[Patent Document 1]
JP 2001-43064 A (paragraphs 0024-0132, FIGS. 1-21)
[0008]
[Patent Document 2]
JP 2002-328694 A (paragraphs 0038-0132, FIGS. 1-74)
[0009]
[Patent Document 3]
JP 2002-318132 A (paragraphs 0016-0051, FIGS. 1-9)
[0010]
[Problems to be solved by the invention]
In the system described in Patent Document 1, when a client receives hypertext from a server and the hypertext includes a description requesting a phonetic character string, the client requests the phonetic character string from the server. Therefore, the client cannot output sound only by receiving hypertext as content. In the case of outputting voice, after receiving the hypertext, it is necessary to further request the server for a phonetic character string and receive the phonetic character string. As a result, it takes time from requesting the hypertext as the content until the completion of the audio output.
[0011]
Further, depending on the type of terminal, the specification of a phonetic symbol string (for example, a phonetic character string in Patent Document 1) may be different. For example, when a mobile phone is used as a terminal, the phonetic symbol specification may be different between the mobile phone of communication company A and the mobile phone of communication company B. When a content provider intends to provide content to a plurality of terminals having different phonetic symbol string specifications, it must create a phonetic symbol string for each individual specification. This increases the burden on the content provider.
[0012]
For example, suppose that under the specification of communication company A's mobile phones, the phonetic symbol string "KONNICHIWA" must be created in order to output the greeting 「こんにちは」 ("hello"). Suppose also that under the specification of communication company B's mobile phones, the phonetic symbol string "KON-NITIWA" must be created in order to output the same greeting. A content provider who provides content that outputs 「こんにちは」 to the mobile phones of both companies A and B has to prepare both phonetic symbol strings, "KONNICHIWA" and "KON-NITIWA", so the burden of creating phonetic symbol strings is heavy.
[0013]
In the systems described in Patent Document 1 and Patent Document 2, such a burden on the content provider is not considered. Therefore, when content is provided to a plurality of types of terminals having different phonetic symbol string specifications, the burden on the content provider cannot be reduced.
[0014]
In addition, the terminal described in Patent Document 2 must download image data separately from audio data when displaying an image that matches the audio. Further, in a system using VoiceXML described in Patent Document 3, the voice dialogue server must transmit screen data to the terminal in synchronization with the voice.
[0015]
Therefore, an object of the present invention is to shorten the time until a terminal that receives content outputs audio. Another object of the present invention is to reduce the burden on the content provider when providing content to a plurality of types of terminals having different phonetic symbol string specifications.
[0016]
[Means for Solving the Problems]
A content distribution system according to the present invention is a content distribution system including a content distribution device that distributes content data and a terminal device that outputs content based on the content data received from the content distribution device. The content distribution device includes content input means for accepting input of content data in which a read-out character string, which is a character string to be read out as speech, is described together with a tag indicating the read-out character string, and content replacement means for replacing the read-out character string in the content data with a phonetic symbol string, which is data for specifying output speech. The terminal device includes content receiving means for receiving, from the content distribution device, content data in which the phonetic symbol string is described, phonetic symbol string extracting means for extracting the phonetic symbol string from the content data, and output means for outputting speech based on the phonetic symbol string. The content replacement means includes read-out character string extraction means for extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data input to the content input means, conversion means for converting the read-out character string into a phonetic symbol string, and replacement means for replacing the read-out character string in the content data with the phonetic symbol string and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string. The content receiving means receives, from the content distribution device, content data in which the phonetic symbol string is described together with the tag indicating the phonetic symbol string, and the phonetic symbol string extracting means extracts, as the phonetic symbol string, the character string described together with the tag indicating the phonetic symbol string.
[0018]
Preferably, the content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the read-out character string, the read-out character string extraction means extracts the read-out character string and the specification information from the content data, and the conversion means converts the read-out character string into a phonetic symbol string conforming to the specification indicated by the specification information. With such a configuration, the content provider does not have to describe a phonetic symbol string for each specification, so the burden on the content provider is reduced.
[0019]
In addition, a content distribution system according to the present invention is a content distribution system including a content distribution device that distributes content data, a terminal device that outputs content based on the content data received from the content distribution device, and an audio information conversion device that converts a read-out character string, which is a character string to be read out as speech, into a phonetic symbol string. The content distribution device includes content input means for accepting input of content data in which the read-out character string is described together with a tag indicating the read-out character string, and content replacement means for replacing the read-out character string in the content data with a phonetic symbol string, which is data for specifying output speech. The terminal device includes content receiving means for receiving, from the content distribution device, content data in which the phonetic symbol string is described, phonetic symbol string extracting means for extracting the phonetic symbol string from the content data, and output means for outputting speech based on the phonetic symbol string. The content replacement means includes read-out character string extraction means for extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data input to the content input means, read-out character string transmitting means for transmitting the read-out character string to the audio information conversion device, and replacement means for receiving the phonetic symbol string from the audio information conversion device, replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string. The audio information conversion device includes read-out character string receiving means for receiving the read-out character string from the content distribution device, conversion means for converting the read-out character string into a phonetic symbol string, and phonetic symbol string transmission means for transmitting the phonetic symbol string to the content distribution device. The content receiving means receives, from the content distribution device, content data in which the phonetic symbol string is described together with the tag indicating the phonetic symbol string, and the phonetic symbol string extracting means extracts, as the phonetic symbol string, the character string described together with the tag indicating the phonetic symbol string. With such a configuration, the conversion processing is offloaded to the audio information conversion device, so the processing load on the content distribution device can be reduced.
[0020]
Preferably, the content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the read-out character string, the read-out character string extraction means extracts the read-out character string and the specification information from the content data, the read-out character string transmitting means transmits the read-out character string and the specification information to the audio information conversion device, and the conversion means converts the read-out character string into a phonetic symbol string conforming to the specification indicated by the specification information. With such a configuration, the content provider does not have to describe a phonetic symbol string for each specification, so the burden on the content provider is reduced.
[0029]
A content distribution apparatus according to the present invention is a content distribution apparatus that distributes content data to a terminal device, and includes: content input means for accepting input of content data in which a read-out character string, which is a character string to be read out as speech, is described together with a tag indicating the read-out character string; read-out character string extraction means for extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data; conversion means for converting the read-out character string into a phonetic symbol string, which is data for specifying output speech; and replacement means for replacing the read-out character string in the content data with the phonetic symbol string and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.
[0030]
Preferably, the content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the read-out character string, the read-out character string extraction means extracts the read-out character string and the specification information from the content data, and the conversion means converts the read-out character string into a phonetic symbol string conforming to the specification indicated by the specification information. With such a configuration, the content provider does not have to describe a phonetic symbol string for each specification, so the burden on the content provider is reduced.
[0031]
Also, a content distribution apparatus according to the present invention is a content distribution apparatus that is connected to an audio information conversion device, which converts a read-out character string that is a character string to be read out as speech into a phonetic symbol string that is data for specifying output speech, and that distributes content data to a terminal device. The apparatus includes: content input means for accepting input of content data in which the read-out character string is described together with a tag indicating the read-out character string; read-out character string extraction means for extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data; read-out character string transmitting means for transmitting the read-out character string extracted by the read-out character string extraction means to the audio information conversion device; and replacement means for receiving, from the audio information conversion device, the phonetic symbol string converted from the read-out character string, replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.
[0032]
Preferably, the content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the read-out character string, the read-out character string extraction means extracts the read-out character string and the specification information from the content data, and the read-out character string transmitting means transmits the read-out character string and the specification information to the audio information conversion device. With such a configuration, the content provider does not have to describe a phonetic symbol string for each specification, so the burden on the content provider is reduced.
[0039]
A content distribution program according to the present invention is a content distribution program installed in a content distribution device that distributes content data to a terminal device. The program causes a computer to execute: a process of accepting input of content data in which a read-out character string, which is a character string to be read out as speech, is described together with a tag indicating the read-out character string; a process of extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data; a process of converting the read-out character string into a phonetic symbol string, which is data for specifying output speech; and a process of replacing the read-out character string in the content data with the phonetic symbol string and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.
[0040]
Also, a content distribution program according to the present invention is a content distribution program installed in a content distribution device that is connected to an audio information conversion device, which converts a read-out character string that is a character string to be read out as speech into a phonetic symbol string that is data for specifying output speech, and that distributes content data to a terminal device. The program causes a computer to execute: a process of accepting input of content data in which the read-out character string is described together with a tag indicating the read-out character string; a process of extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data; a process of transmitting the read-out character string to the audio information conversion device; and a process of receiving, from the audio information conversion device, the phonetic symbol string converted from the read-out character string, replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.
[0044]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0045]
Embodiment 1.
FIG. 1 is a block diagram showing a first embodiment of an audio content distribution system according to the present invention. The audio content distribution system shown in FIG. 1 includes a content server (content distribution device) 1 and a terminal 11. The content server 1 and the terminal 11 are connected via a communication network 21. Hereinafter, a case where the communication network 21 is the Internet will be described as an example, but the communication network 21 is not limited to the Internet. For example, the communication network 21 may be a LAN, a WAN, or the like.
[0046]
The content server 1 is an information processing apparatus that transmits content data including a phonetic symbol string to the terminal 11. Content is described in a markup language. Hereinafter, the data of content described in a markup language is referred to as content data. The terminal 11 receives the content data from the content server 1 and outputs speech based on the phonetic symbol string included in the content data. Further, the terminal 11 may display an image according to the description of the content data.
[0047]
The phonetic symbol string described in the content data may be a phoneme character string or a syllable character string.
[0048]
Although one terminal 11 is shown in FIG. 1, a plurality of terminals 11 may be connected to the content server 1. Furthermore, a plurality of types of terminals having different phonetic symbol string specifications may be connected to the content server 1.
[0049]
The content data input to the content server 1 includes a character string (hereinafter referred to as a read-out character string) that should be read out as speech in the terminal 11. The reading character string need not be described according to the specification of the phonetic symbol string of the terminal 11 to which the content is distributed. For example, even when the terminal 11 requests “phonetic spelling” as the phonetic symbol string specification, the read-out character string may be described in kanji or kana. Even when a plurality of types of terminals request different phonetic symbol string specifications, the read-out character string does not have to comply with each specification. When the content server 1 receives the input of the content data in which the read character string is described, the content server 1 replaces the read character string in the content data with a phonetic symbol string of the specified specification. Thereafter, the content server 1 transmits content data in response to a request from the terminal 11.
[0050]
In the content data, a character string described as a read-out character string is described together with a tag indicating that the character string is a read-out character string. Further, identification information indicating which specification of phonetic symbol string the read-out character string should be converted into is also described together with the read-out character string and the tag.
[0051]
When the reading character string is replaced with the phonetic symbol string, the character string described as the phonetic symbol string is described with a tag indicating that the character string is a phonetic symbol string.
[0052]
The content input unit 2 shown in FIG. 1 accepts input of content data from a content provider. The read-out character string extraction / replacement unit 3 performs a process of extracting a read-out character string and identification information from the input content data, and a process of replacing the read-out character string in the content data with a phonetic symbol string. The conversion unit 4 includes a storage device (not shown in FIG. 1) that stores dictionary data indicating the correspondence between the individual characters or words included in a read-out character string and phonetic symbol strings. This storage device stores dictionary data for each specification of the phonetic symbol string. Using the dictionary data, the conversion unit 4 converts a read-out character string into a phonetic symbol string. Based on the identification information, the conversion unit 4 determines which specification of phonetic symbol string the read-out character string should be converted into. The read-out character string extraction / replacement unit 3 uses the phonetic symbol string produced by the conversion unit 4 to replace the read-out character string in the content data. The content transmission unit 5 transmits the replaced content data to the terminal 11 in response to a request from the terminal 11.
[0053]
The content receiving unit 12 in the terminal 11 receives content data from the content server 1. The phonetic symbol string extraction unit 13 extracts a phonetic symbol string from the content data received by the content receiving unit 12. The voice generation unit 14 generates a voice signal based on the phonetic symbol string extracted by the phonetic symbol string extraction unit 13. The audio output unit 15 outputs speech based on the voice signal generated by the voice generation unit 14. The terminal 11 may include a display unit (display device) that displays content.
[0054]
FIG. 2 is a block diagram showing a specific configuration example of the first embodiment of the present invention. In FIG. 2, the control unit 6 of the content server 1 executes processing according to the content distribution program stored in the storage device 7. Specifically, the control unit 6 executes a process of accepting input of content data, a process of extracting a read-out character string and identification information from the content data, a process of converting the read-out character string into a phonetic symbol string, a process of replacing the read-out character string with the phonetic symbol string, and a process of transmitting the content data to the terminal 11. The network interface unit 8 transmits and receives content data via the Internet 21. The storage device 7 stores dictionary data in addition to the content distribution program. The temporary storage device 9 is a storage device that temporarily stores the read-out character strings and identification information extracted from content data and the phonetic symbol strings obtained by the conversion process.
[0055]
For example, the control unit 6 receives content data from a content provider's terminal (not shown) via the Internet 21 and the network interface unit 8 to accept input of content data.
[0056]
The storage device 7 stores a content distribution program for causing a computer to execute: a process of accepting input of content data in which a read-out character string, which is a character string to be read out as speech, is described together with a tag indicating the read-out character string; a process of extracting, as the read-out character string, the character string described together with the tag indicating the read-out character string from the content data; a process of converting the read-out character string into a phonetic symbol string, which is data for specifying output speech; and a process of replacing the read-out character string in the content data with the phonetic symbol string and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.
[0057]
Further, the control unit 16 of the terminal 11 executes processing according to the content output program stored in the storage device 17. Specifically, the control unit 16 executes a process of receiving content data from the content server 1, a process of extracting a phonetic symbol string from the content data, a process of generating a voice signal based on the phonetic symbol string, and a process of outputting speech. The network interface unit 18 transmits content data requests and receives content data via the Internet 21. The sound output device 19 is a sound output device such as a speaker, and outputs sound. The temporary storage device 20 is a storage device that temporarily stores received content data and phonetic symbol strings extracted from the content data.
[0058]
The storage device 17 stores a content output program for causing a computer to execute: a process of receiving, from the content distribution device, content data in which a phonetic symbol string, which is data for specifying output speech, is described; a process of extracting the phonetic symbol string from the content data; and a process of outputting speech based on the phonetic symbol string.
[0059]
When the audio content distribution system is configured as shown in FIG. 2, the read-out character string extraction / replacement unit 3 is realized by the control unit 6 of the content server 1. The conversion unit 4 is realized by the storage device 7 and the control unit 6. The content input unit 2 and the content transmission unit 5 are realized by the control unit 6 and the network interface unit 8. Further, the phonetic symbol string extraction unit 13 and the voice generation unit 14 are realized by the control unit 16 of the terminal 11. The content receiving unit 12 is realized by the control unit 16 and the network interface unit 18. The audio output unit 15 is realized by the sound output device 19.
[0060]
Next, the operation will be described.
FIG. 3 is a flowchart showing an example of the operation of the content server 1. The content input unit 2 of the content server 1 accepts input of content data from the content provider (step S101). For example, content data is received from a content provider terminal (not shown) via the Internet 21.
[0061]
The content data input to the content server 1 includes a character string that is a read-out character string, a tag indicating that the character string is a read-out character string, and identification information indicating which specification of phonetic symbol string the read-out character string should be converted into. FIG. 4 is an explanatory diagram illustrating an example of content data input to the content server 1. In the example shown in FIG. 4, “TTP” is a tag indicating that the character string that follows the identification information described next to the tag, and continues up to “/”, is a read-out character string. The description “phoneme="Type1"” following “TTP” indicates that the identification information is “Type1”. Therefore, in the example shown in FIG. 4, the character string “Thank you for visiting” written between “phoneme="Type1"” and “/>” is the read-out character string. Note that the content data may include descriptions other than the read-out character string. For example, tags similar to those of HTML (Hypertext Markup Language) may be included to specify display of an image or a character string. Although FIG. 4 shows a case where such descriptions are written in a language similar to HTML, a description designating display of an image or a character string may be written in a markup language other than HTML.
[0062]
In step S101, the content input unit 2 stores the input content data in the storage device 7 (not shown in FIG. 1).
[0063]
Subsequently, the read-out character string extraction / replacement unit 3 extracts the read-out character string and the identification information from the content data input to the content input unit 2 and stored in the storage device 7 (step S102). At this time, the read-out character string extraction / replacement unit 3 extracts the identification information described together with the tag indicating the read-out character string (for example, “TTP” shown in FIG. 4), and extracts the character string described together with the tag as the read-out character string. In step S102, the read-out character string extraction / replacement unit 3 stores the extracted identification information and read-out character string in the temporary storage device 9 (not shown in FIG. 1).
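The extraction in step S102 can be sketched as follows. This is only an illustration: FIG. 4 is not reproduced here, so the exact tag syntax `<TTP phoneme="Type1" ... />` is an assumption inferred from the description, and the names `TTP_PATTERN` and `extract_read_out_strings` are hypothetical.

```python
import re

# Assumed input syntax, inferred from the description of FIG. 4:
#   <TTP phoneme="Type1" read-out character string />
TTP_PATTERN = re.compile(
    r'<TTP\s+phoneme="(?P<spec>[^"]*)"\s+(?P<text>.*?)\s*/>',
    re.DOTALL,
)


def extract_read_out_strings(content_data: str) -> list[tuple[str, str]]:
    """Return (identification information, read-out character string) pairs."""
    return [
        (match.group("spec"), match.group("text"))
        for match in TTP_PATTERN.finditer(content_data)
    ]


# Descriptions other than the read-out character string are left alone.
sample = '<BODY><TTP phoneme="Type1" Thank you for visiting /></BODY>'
pairs = extract_read_out_strings(sample)
```

In this sketch the returned list plays the role of the temporary storage device 9, holding each piece of identification information next to the read-out character string it applies to.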
[0064]
Subsequently, the conversion unit 4 converts the read-out character string extracted by the read-out character string extraction / replacement unit 3 and stored in the temporary storage device 9 into a phonetic symbol string (step S103). At this time, the conversion unit 4 determines which dictionary data to use for the conversion based on the identification information stored in the temporary storage device 9. The conversion unit 4 then converts each character and word contained in the read-out character string into the corresponding phonetic symbol string using the dictionary data selected according to the identification information. For example, assume that the dictionary data corresponding to the identification information “Type1” illustrated in FIG. 4 indicates correspondences such as “ご” (the honorific prefix) and “GO”, and “訪問” (“visit”) and “HOUMON”. In this case, the conversion unit 4 converts the read-out character string “Thank you for visiting” (“ご訪問ありがとうございます”) into the phonetic symbol string “GOHOUMON ARIGATOU GOZAIMASU” using the dictionary data corresponding to “Type1”. Hereinafter, the conversion process from a read-out character string to a phonetic symbol string is referred to as the TTP (Text-to-Phoneme) process. In step S103, the conversion unit 4 stores the phonetic symbol string converted from the read-out character string in the temporary storage device 9.
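A minimal sketch of the TTP process in step S103 is shown below, assuming the read-out character string is the Japanese phrase ご訪問ありがとうございます, which the romanized output GOHOUMON ARIGATOU GOZAIMASU suggests. The dictionary contents, the longest-match lookup, and the trailing-space convention for word boundaries are all illustrative assumptions; the description only states that dictionary data is held per specification and selected by the identification information.

```python
# Hypothetical dictionary data, one table per phonetic symbol string specification.
# A trailing space marks an entry that ends a word (an illustrative convention).
DICTIONARIES = {
    "Type1": {
        "ご": "GO",
        "訪問": "HOUMON ",
        "ありがとう": "ARIGATOU ",
        "ございます": "GOZAIMASU",
    },
}


def text_to_phoneme(read_out: str, spec: str) -> str:
    """Convert a read-out character string using the dictionary chosen by spec."""
    table = DICTIONARIES[spec]  # dictionary selected by the identification information
    keys = sorted(table, key=len, reverse=True)  # try longer entries first
    symbols = []
    position = 0
    while position < len(read_out):
        for key in keys:
            if read_out.startswith(key, position):
                symbols.append(table[key])
                position += len(key)
                break
        else:
            raise ValueError(f"no dictionary entry for {read_out[position]!r}")
    return "".join(symbols).strip()


phonetic = text_to_phoneme("ご訪問ありがとうございます", "Type1")
```

A real converter would need far richer dictionary data and reading disambiguation; the point here is only that the identification information selects the table and the table drives the conversion.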
[0065]
The TTP process in step S103 is a process for converting the reading character string extracted in step S102 into a phonetic symbol string. Accordingly, the content data itself is not changed in step S103.
[0066]
Next, the read-out character string extraction / replacement unit 3 replaces the read-out character string in the content data stored in the storage device 7 with the phonetic symbol string obtained by the TTP process (the phonetic symbol string stored in the temporary storage device 9) (step S104). In step S104, the read-out character string extraction / replacement unit 3 need only replace the character string described together with the tag indicating the read-out character string with the phonetic symbol string. At this time, the read-out character string extraction / replacement unit 3 also deletes the description indicating the identification information (for example, “phoneme="Type1"” shown in FIG. 4) and replaces the tag indicating the read-out character string with the tag indicating the phonetic symbol string.
[0067]
FIG. 5 is an explanatory diagram showing an example of content data after replacement. In the example shown in FIG. 5, “PTS” is a tag indicating that the character string that continues up to “/” is a phonetic symbol string. The read-out character string “Thank you for visiting” shown in FIG. 4 is replaced with the phonetic symbol string “GOHOUMON ARIGATOU GOZAIMASU”. Further, the description “phoneme="Type1"” indicating the identification information is deleted, and the tag “TTP” indicating the read-out character string is replaced with the tag “PTS” indicating the phonetic symbol string. As a result, the content data illustrated in FIG. 5 is obtained.
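The replacement in step S104 can be sketched as a single substitution pass. The tag syntax is assumed from the descriptions of FIG. 4 and FIG. 5, and `stub_convert` is a hypothetical stand-in for the conversion unit 4 that simply looks up the one example given in the text.

```python
import re
from typing import Callable

# Assumed input syntax: <TTP phoneme="Type1" read-out character string />
TTP_PATTERN = re.compile(
    r'<TTP\s+phoneme="(?P<spec>[^"]*)"\s+(?P<text>.*?)\s*/>',
    re.DOTALL,
)


def replace_read_out_strings(
    content_data: str, convert: Callable[[str, str], str]
) -> str:
    """Swap each TTP tag for a PTS tag carrying the phonetic symbol string."""

    def to_pts(match: re.Match) -> str:
        phonetic = convert(match.group("text"), match.group("spec"))
        # The identification information is dropped and the tag name changes.
        return f"<PTS {phonetic} />"

    return TTP_PATTERN.sub(to_pts, content_data)


def stub_convert(read_out: str, spec: str) -> str:
    """Hypothetical stand-in for the TTP process of the conversion unit 4."""
    known = {("Type1", "Thank you for visiting"): "GOHOUMON ARIGATOU GOZAIMASU"}
    return known[(spec, read_out)]


before = '<BODY><TTP phoneme="Type1" Thank you for visiting /></BODY>'
after = replace_read_out_strings(before, stub_convert)
```

Everything outside the TTP tag passes through unchanged, which matches the description that only the read-out character string, the identification information, and the tag are rewritten.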
[0068]
The reading character string extraction / replacement unit 3 stores the content data after replacement in the storage device 7. Thereafter, when the content transmission unit 5 receives a request for content data from the terminal 11, the content transmission unit 5 transmits the replaced content data to the terminal 11 via the Internet 21 (step S105).
[0069]
FIG. 6 is a flowchart showing an example of the operation of the terminal 11. A user of the terminal 11 operates the terminal 11 so as to request content data including a phonetic symbol string according to the specifications of the terminal 11. In response to this operation, the content receiving unit 12 of the terminal 11 requests content data from the content server 1. Then, the content receiving unit 12 receives the requested content data from the content server 1 (step S111). This content data includes, for example, a phonetic symbol string as shown in FIG. In step S111, the content receiving unit 12 stores the received content data in the temporary storage device 20 (not shown in FIG. 1).
[0070]
The phonetic symbol string extraction unit 13 extracts a phonetic symbol string from the content data stored in the temporary storage device 20 (step S112). At this time, the phonetic symbol string extraction unit 13 need only extract the character string described together with the tag indicating the phonetic symbol string (for example, “PTS” shown in FIG. 5) as the phonetic symbol string. In step S112, the phonetic symbol string extraction unit 13 stores the extracted phonetic symbol string in the temporary storage device 20.
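On the terminal side, step S112 might look like the sketch below. The `<PTS ... />` syntax is assumed from the description of FIG. 5, and `extract_phonetic_strings` is a hypothetical name.

```python
import re

# Assumed syntax of distributed content: <PTS phonetic symbol string />
PTS_PATTERN = re.compile(r"<PTS\s+(?P<phonetic>.*?)\s*/>", re.DOTALL)


def extract_phonetic_strings(content_data: str) -> list[str]:
    """Return every phonetic symbol string described with a PTS tag."""
    return [match.group("phonetic") for match in PTS_PATTERN.finditer(content_data)]


received = "<BODY><PTS GOHOUMON ARIGATOU GOZAIMASU /></BODY>"
phonetic_strings = extract_phonetic_strings(received)
```

Because the terminal only has to locate a tag and read out its contents, no dictionary data or conversion logic is needed on the terminal.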
[0071]
The voice generation unit 14 generates a voice signal of the speech to be output based on the phonetic symbol string stored in the temporary storage device 20 (step S113). Hereinafter, this voice signal generation process is referred to as the PTS (Phoneme-to-Speech) process. The voice generation unit 14 outputs the generated voice signal to the audio output unit 15 and causes the audio output unit 15 to output speech (step S114). For example, when the phonetic symbol string “GOHOUMON ARIGATOU GOZAIMASU” is extracted in step S112, the voice generation unit 14 generates a voice signal corresponding to the utterance “Thank you for visiting” based on this phonetic symbol string. The voice signal is then output to the audio output unit 15, and the audio output unit 15 outputs the speech “Thank you for visiting”.
[0072]
According to the audio content distribution system shown in the present embodiment, when content data including a read-out character string is input, the content server 1 replaces the read-out character string in the content data with a phonetic symbol string and transmits the replaced content data to the terminal 11. The terminal 11 can therefore output speech as soon as it receives the content data, without making a further request to the server. As a result, the time until the terminal that receives the content data outputs speech can be shortened.
[0073]
In addition, the content server 1 converts the read-out character string into a phonetic symbol string of the specification designated by the identification information. Therefore, even when content is to be provided to a plurality of terminals with different phonetic symbol string specifications, the content provider does not have to write a phonetic symbol string for each specification. The content provider need only describe the identification information together with the read-out character string. For example, the description of steps S101 to S105 shows a case where the read-out character string “Thank you for visiting” is converted into the phonetic symbol string “GOHOUMON ARIGATOU GOZAIMASU”. When content data in which other identification information (for example, “Type2”) is described instead of “Type1” shown in FIG. 4 is input, the content server 1 replaces “Thank you for visiting” with a phonetic symbol string conforming to the other specification. In this way, the content provider need not describe a phonetic symbol string conforming to an individual specification, such as “GOHOUMON ARIGATOU GOZAIMASU”, and need only describe identification information such as “Type1”. Therefore, the burden on the content provider can be reduced.
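The effect can be illustrated with the “KONNICHIWA” and “KON-NITIWA” example from paragraph [0012]. In the sketch below the provider supplies one read-out character string and only varies the identification information. The dictionary contents, the assignment of “Type1” and “Type2” to companies A and B, the Japanese spelling こんにちは for “Hello”, and the `<PTS ... />` output syntax are all assumptions made for illustration.

```python
# Hypothetical dictionary data; the output strings are the two examples
# given for communication companies A and B in paragraph [0012].
DICTIONARIES = {
    "Type1": {"こんにちは": "KONNICHIWA"},  # assumed to be company A's specification
    "Type2": {"こんにちは": "KON-NITIWA"},  # assumed to be company B's specification
}


def to_distributable(read_out: str, spec: str) -> str:
    """Return the tag the server would distribute for one read-out string."""
    return f"<PTS {DICTIONARIES[spec][read_out]} />"


# The provider writes the same text once and only changes the identifier.
results = {spec: to_distributable("こんにちは", spec) for spec in ("Type1", "Type2")}
```

Supporting a further specification then means adding dictionary data on the server, not rewriting the provider's content.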
[0074]
Further, the content server 1 performs the TTP process. Therefore, since it is not necessary for the terminal 11 to perform the conversion process from the reading character string to the phonetic symbol string, the process of the terminal 11 can be simplified and the production cost of the terminal 11 can be reduced. If the content data including the phonetic symbol string includes a description that designates display of an image or the like, the terminal 11 may display the image or the like according to the description. Therefore, there is no need to separately download phonetic symbol string information and image data display information. Further, it is not necessary for the content server to transmit the sound and the image in synchronization.
[0075]
In the present embodiment, if the specifications of the phonetic symbol strings of the terminals 11 are common, content data in which no identification information is described may be input to the content input unit 2. In this case, the conversion unit 4 may convert the reading character string into a phonetic symbol string according to a predetermined specification.
[0076]
The specification of the phonetic symbol string may also be designated by a tag instead of the identification information. For example, a plurality of types of tags such as “TTP” and “TTPX” may be used as tags indicating the reading character string, with a different tag used for each desired specification. In this case, the reading character string extraction / replacement unit 3 may extract the tag indicating the reading character string together with the reading character string in step S102 and store them in the temporary storage device. The conversion unit 4 then need only convert the reading character string into a phonetic symbol string conforming to the specification that corresponds to the type of the tag.
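The tag-based variation can be sketched in the same manner. The tag names `TTP` and `TTPX` are taken from this description; the mapping from tag to specification, the output tag `PHONEME`, and the dictionary contents are assumptions made purely for illustration.

```python
import re

# Hypothetical mapping from the type of tag to a phonetic symbol specification.
TAG_TO_SPEC = {"TTP": "Type1", "TTPX": "Type2"}

# Hypothetical dictionary data for each specification.
DICTIONARIES = {
    "Type1": {"こんにちは": "KONNICHIWA"},
    "Type2": {"こんにちは": "KON-NITIWA"},
}

# Matches <TTP>...</TTP> or <TTPX>...</TTPX>; \1 forces the closing tag
# to have the same name as the opening tag.
TAG_PATTERN = re.compile(r"<(TTPX?)>(.*?)</\1>", re.DOTALL)


def replace_by_tag(content: str) -> str:
    """Select the specification from the tag type and replace the string."""

    def _replace(match: re.Match) -> str:
        tag, reading = match.group(1), match.group(2)
        spec = TAG_TO_SPEC[tag]
        return f"<PHONEME>{DICTIONARIES[spec][reading]}</PHONEME>"

    return TAG_PATTERN.sub(_replace, content)
```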
[0077]
In the present embodiment, the content input means corresponds to the content input unit 2. The content replacement means corresponds to the reading character string extraction / replacement unit 3 and the conversion unit 4. The reading character string extraction means and the replacement means correspond to the reading character string extraction / replacement unit 3, and the conversion means corresponds to the conversion unit 4.
[0078]
The content receiving means corresponds to the content receiving unit 12. The phonetic symbol string extraction means corresponds to the phonetic symbol string extraction unit 13. The output means corresponds to the voice generation unit 14 and the audio output unit 15.
[0079]
Embodiment 2.
In the present embodiment, a conversion server provided separately from the content server performs TTP processing (conversion processing from a read-out character string to a phonetic symbol string). FIG. 7 is a block diagram showing a second embodiment of the audio content distribution system according to the present invention. The same components as those in the first embodiment are given the same reference numerals as those in FIG. The audio content distribution system shown in FIG. 7 includes a content server (content distribution device) 31, a conversion server (audio information conversion device) 41, and a terminal 11. The content server 31, the conversion server 41, and the terminal 11 are connected via the communication network 21. Hereinafter, a case where the communication network 21 is the Internet will be described as an example. However, as in the first embodiment, the communication network 21 is not limited to the Internet. The communication network connecting the content server 31 and the terminal 11 may be different from the communication network connecting the content server 31 and the conversion server 41.
[0080]
Similar to the first embodiment, a plurality of terminals 11 may be connected to the content server 31. Furthermore, a plurality of types of terminals having different phonetic symbol string specifications may be connected to the content server 31.
[0081]
Similar to the content server 1 shown in FIG. 1, the content server 31 receives content data that includes a tag indicating a reading character string, identification information, and the reading character string. The content server 31 then transmits to the terminal 11 the content data in which the reading character string has been replaced with the phonetic symbol string. However, the content server 31 does not itself perform the TTP process; the conversion server 41 executes it.
[0082]
In the content server 31, the reading character string extraction / replacement unit 32 extracts a reading character string and identification information from the content data input to the content input unit 2 and transmits the extracted character string and identification information to the conversion server 41. The reading character string extraction / replacement unit 32 receives the phonetic symbol string from the conversion server 41 and replaces the reading character string in the content data with the phonetic symbol string.
[0083]
The reading character string receiving unit 42 of the conversion server 41 receives the reading character string and the identification information from the reading character string extraction / replacement unit 32. The conversion unit 4 includes a storage device (not shown in FIG. 7) that stores dictionary data, similarly to the conversion unit 4 described in the first embodiment. This storage device stores dictionary data for each specification of the phonetic symbol string. The conversion unit 4 determines, based on the identification information, the specification of the phonetic symbol string into which the reading character string is to be converted, and performs TTP processing. The phonetic symbol string transmission unit 43 transmits the phonetic symbol string converted by the conversion unit 4 to the reading character string extraction / replacement unit 32.
[0084]
The configuration of the terminal 11 is the same as the configuration of the terminal according to the first embodiment.
[0085]
FIG. 8 is a block diagram showing a specific configuration example of the second embodiment of the present invention. In FIG. 8, the control unit 36 of the content server 31 executes processing according to the content distribution program stored in the storage device 37. Specifically, the control unit 36 executes a process of accepting input of content data, a process of extracting the reading character string and the identification information from the content data and transmitting them to the conversion server 41, a process of receiving the phonetic symbol string from the conversion server 41, a process of replacing the reading character string in the content data with the phonetic symbol string, and a process of transmitting the content data to the terminal 11. The network interface unit 38 transmits and receives content data via the Internet 21. The temporary storage device 39 is a storage device that temporarily stores the reading character string and identification information extracted from the content data and the phonetic symbol string received from the conversion server 41. The control unit 36 accepts input of content data by, for example, receiving the content data from a content provider terminal (not shown) via the Internet 21 and the network interface unit 38.
[0086]
The storage device 37 stores a content distribution program for causing a computer to execute: a process of accepting input of content data in which a reading character string is described together with a tag indicating the reading character string; a process of extracting, from the content data, the character string described together with the tag indicating the reading character string as the reading character string; a process of transmitting the reading character string to the audio information conversion device; a process of receiving, from the audio information conversion device, the phonetic symbol string converted from the reading character string; and a process of replacing the reading character string in the content data with the phonetic symbol string and replacing the tag indicating the reading character string with a tag indicating the phonetic symbol string.
[0087]
In addition, the control unit 46 of the conversion server 41 executes processing according to the audio information conversion program stored in the storage device 47. Specifically, the control unit 46 performs a process of receiving a reading character string and identification information, a TTP process, and a process of transmitting a phonetic symbol string obtained by the TTP process. The network interface unit 48 transmits and receives data (for example, a read character string, identification information, and a phonetic symbol string) via the Internet 21. The storage device 47 stores dictionary data in addition to the voice information conversion program. The temporary storage device 49 is a storage device that temporarily stores read-out character strings, identification information received from the content server, and phonetic symbol strings obtained by TTP processing.
[0088]
The storage device 47 stores an audio information conversion program for causing a computer to execute: a process of receiving a reading character string, which is a character string to be read aloud as voice, from an information processing device connected via a communication network; a process of converting the reading character string into a phonetic symbol string, which is data for specifying the voice to be output; and a process of transmitting the phonetic symbol string to the information processing device.
[0089]
An example of the configuration of the terminal 11 shown in FIG. 8 is the same as that shown in FIG.
[0090]
When the audio content distribution system is configured as shown in FIG. 8, the reading character string extraction / replacement unit 32, the content input unit 2, and the content transmission unit 5 are realized by the control unit 36 and the network interface unit 38 of the content server 31. Further, the reading character string receiving unit 42 and the phonetic symbol string transmission unit 43 are realized by the control unit 46 and the network interface unit 48 of the conversion server 41. The conversion unit 4 is realized by the control unit 46 and the storage device 47.
[0091]
Next, the operation will be described.
FIG. 9 is a flowchart illustrating an example of operations of the content server 31 and the conversion server 41. The content input unit 2 of the content server 31 accepts input of content data from the content provider and stores the content data in the storage device 37 (not shown in FIG. 7) (step S121). This process is the same as step S101. As in the first embodiment, the input content data describes, together with the character string to be used as the reading character string, a tag indicating that the character string is a reading character string and identification information indicating the specification of the phonetic symbol string into which it should be converted.
[0092]
The reading character string extraction / replacement unit 32 extracts the reading character string and the identification information from the content data that was input to the content input unit 2 and stored in the storage device 37, and stores them in the temporary storage device 39 (not shown in FIG. 7). At this time, the reading character string extraction / replacement unit 32 extracts the identification information described together with the tag indicating the reading character string, and extracts the character string described together with the tag as the reading character string. The reading character string extraction / replacement unit 32 transmits the reading character string and the identification information stored in the temporary storage device 39 to the conversion server 41 via the Internet 21 (step S122).
[0093]
The reading character string extraction / replacement unit 32 may store the address information of the conversion server 41 in the storage device in advance. In step S122, the read character string and the identification information may be transmitted to the conversion server 41 using the address information. Alternatively, the content provider may describe the address information of the conversion server 41 in the content data together with the tag indicating the read character string, the identification information, and the read character string. In this case, the reading character string extraction / replacement unit 32 may extract the address information from the content data, and transmit the reading character string and the identification information to the conversion server 41 using the address information.
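The two ways of obtaining the address of the conversion server 41 might be sketched as follows. The description does not state how address information is written in the content data, so the attribute name `server`, the example URL, and the function name are all hypothetical.

```python
import re

# Hypothetical: the address is written as a server="..." attribute on the
# tag indicating the reading character string.
ADDRESS_ATTR = re.compile(r'<TTP\b[^>]*\bserver="([^"]+)"')


def resolve_conversion_server(content: str, stored_address: str) -> str:
    """Prefer an address described in the content data; otherwise fall back
    to the address stored in advance."""
    match = ADDRESS_ATTR.search(content)
    return match.group(1) if match else stored_address
```

For example, `resolve_conversion_server('<TTP phoneme="Type1">...</TTP>', "http://conversion.example/ttp")` falls back to the stored address because the content data describes none.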
[0094]
When the reading character string receiving unit 42 of the conversion server 41 receives the reading character string and the identification information, it stores them in the temporary storage device 49 (not shown in FIG. 7).
[0095]
The conversion unit 4 determines dictionary data to be used based on the identification information stored in the temporary storage device 49. Then, using the dictionary data, the reading character string sent from the reading character string receiving unit 42 is converted into a phonetic symbol string (step S123). This TTP process is the same process as step S103. The conversion unit 4 stores the converted phonetic symbol string in the temporary storage device 49. The phonetic symbol string transmission unit 43 transmits the phonetic symbol string to the content server 31 via the Internet 21 (step S124).
[0096]
The reading character string extraction / replacement unit 32 of the content server 31 receives the phonetic symbol string transmitted by the conversion server 41 and stores it in the temporary storage device 39. Subsequently, the reading character string extraction / replacement unit 32 replaces the reading character string in the content data stored in the storage device 37 with the phonetic symbol string (step S125). To do so, the reading character string extraction / replacement unit 32 may replace the character string described together with the tag indicating the reading character string with the phonetic symbol string. At this time, the reading character string extraction / replacement unit 32 deletes the description indicating the identification information (for example, “phoneme="Type1"” shown in FIG. 4) and replaces the tag indicating the reading character string with a tag indicating the phonetic symbol string.
[0097]
The reading character string extraction / replacement unit 32 stores the replaced content data in the storage device 37. Thereafter, the content transmission unit 5 transmits the replaced content data to the terminal 11 in response to a request from the terminal 11 (step S126). The process in step S126 is the same as the process in step S105.
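The division of work in steps S121 to S126 can be sketched as follows. This is a simplified model, not the embodiment itself: an in-process method call stands in for the exchange over the Internet 21, and the class names, the `PHONEME` tag, and the dictionary contents are assumptions. The phrase pair used in the usage note is the example given in this description.

```python
import re
from dataclasses import dataclass, field

TTP_PATTERN = re.compile(r'<TTP\s+phoneme="([^"]+)">(.*?)</TTP>', re.DOTALL)


class ConversionServer:
    """Stands in for the conversion server 41; convert() models S123-S124."""

    def __init__(self, dictionaries: dict):
        self._dictionaries = dictionaries  # dictionary data per specification

    def convert(self, reading: str, spec: str) -> str:
        # Select dictionary data by identification information, then convert.
        return self._dictionaries[spec][reading]


@dataclass
class ContentServer:
    """Stands in for the content server 31."""

    converter: ConversionServer
    stored: dict = field(default_factory=dict)

    def accept(self, name: str, content: str) -> None:
        # S121: accept input. S122-S125: extract, delegate conversion, replace.
        def _replace(match: re.Match) -> str:
            spec, reading = match.group(1), match.group(2)
            phonetic = self.converter.convert(reading, spec)  # S122-S124
            return f"<PHONEME>{phonetic}</PHONEME>"           # S125

        self.stored[name] = TTP_PATTERN.sub(_replace, content)

    def serve(self, name: str) -> str:
        # S126: transmit the replaced content data on request.
        return self.stored[name]
```

A usage example: after `server = ContentServer(ConversionServer({"Type1": {"Thank you for your visit": "GOHOUMON ARIGATOU GOZAIMASU"}}))` and `server.accept("welcome", '<TTP phoneme="Type1">Thank you for your visit</TTP>')`, a later `server.serve("welcome")` returns content data that already contains the phonetic symbol string, so the terminal makes no further request.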
[0098]
The operation when the terminal 11 receives content data from the content server 31 and outputs sound is the same as that in the first embodiment.
[0099]
Also in the present embodiment, the content server 31 replaces the reading character string in the input content data with the phonetic symbol string and transmits the replaced content data to the terminal 11. Therefore, the time until the terminal that receives the content data outputs the voice can be shortened. Further, the reading character string is converted into a phonetic symbol string of the specification designated by the identification information, so the burden on the content provider can be reduced. Further, since the terminal 11 does not need to perform the TTP process of step S123, the processing of the terminal 11 can be simplified and the production cost of the terminal 11 can be reduced. In addition, the terminal 11 does not need to separately download the phonetic symbol string information and the image data display information, and the content server does not need to transmit the voice and the image in synchronization.
[0100]
Furthermore, according to the present embodiment, since the content server 31 that performs replacement and distribution of content data and the conversion server 41 that executes the TTP processing in step S123 are provided separately, the processing can be distributed. In particular, since the processing load of TTP processing is large, the processing load on the content server 31 can be reduced.
[0101]
Similarly to the first embodiment, when the specifications of the phonetic symbol strings of the terminals 11 are common, content data in which no identification information is described may be input to the content input unit 2. The specification of the phonetic symbol string may also be designated by a tag instead of the identification information. In this case, the reading character string extraction / replacement unit 32 may extract the tag indicating the reading character string together with the reading character string and transmit them to the conversion server 41 in step S122. The conversion unit 4 of the conversion server 41 then need only convert the reading character string into a phonetic symbol string conforming to the specification that corresponds to the type of the tag.
[0102]
In the present embodiment, a conversion server may be provided for each specification of the phonetic symbol string, and each conversion server may be configured to perform TTP processing according to a specific specification. In this case, the content server 31 may transmit the read character string to the conversion server corresponding to the specification specified by the identification information or the like.
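Routing to a conversion server provided for each specification might look like the following sketch. Each conversion server is modelled as a plain callable, and the function names and the error handling for an unknown specification are assumptions not found in this description.

```python
from typing import Callable, Dict

# Each per-specification conversion server is modelled as a callable that
# takes a reading character string and returns a phonetic symbol string.
Converter = Callable[[str], str]


def make_router(servers: Dict[str, Converter]) -> Callable[[str, str], str]:
    """Build a function that sends a reading string to the conversion
    server matching the designated specification."""

    def route(reading: str, spec: str) -> str:
        try:
            converter = servers[spec]
        except KeyError:
            # Hypothetical handling: the description does not say what
            # happens when no server exists for a specification.
            raise ValueError(f"no conversion server for specification {spec!r}") from None
        return converter(reading)

    return route
```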
[0103]
In the present embodiment, the content input means corresponds to the content input unit 2. The content replacement means corresponds to the reading character string extraction / replacement unit 32. The reading character string extraction means, the reading character string transmission means, and the conversion means also correspond to the reading character string extraction / replacement unit 32.
[0104]
The conversion means corresponds to the read character string receiving unit 42 and the conversion unit 4. The phonetic symbol string transmission means corresponds to the phonetic symbol string transmitter 43.
[0105]
The content receiving means corresponds to the content receiving unit 12. The phonetic symbol string extraction means corresponds to the phonetic symbol string extraction unit 13. The output means corresponds to the voice generation unit 14 and the audio output unit 15.
[0106]
Embodiment 3.
In the present embodiment, the terminal receives content data in which a reading character string is described. However, the terminal may receive the content data in which the phonetic symbol string is described in addition to the content data in which the reading character string is described. FIG. 10 is a block diagram showing a third embodiment of an audio content distribution system according to the present invention. The same components as those in the first or second embodiment are denoted by the same reference numerals as those in FIG. 1 or FIG. The audio content distribution system shown in FIG. 10 includes a content server (content distribution device) 51, a conversion server (audio information conversion device) 41, and a terminal 61. The content server 51, the conversion server 41, and the terminal 61 are connected via the communication network 21. Hereinafter, a case where the communication network 21 is the Internet will be described as an example. However, as in the first embodiment, the communication network 21 is not limited to the Internet. Further, the communication network that connects the content server 51 and the terminal 61 may be different from the communication network that connects the terminal 61 and the conversion server 41.
[0107]
Similar to the first embodiment, a plurality of terminals 61 may be connected to the content server 51. Furthermore, a plurality of types of terminals having different phonetic symbol string specifications may be connected to the content server 51.
[0108]
The content server 51 receives content data that includes a tag indicating the reading character string, identification information, and the reading character string (for example, the content data illustrated in FIG. 4). The content server 51 transmits the content data to the terminal 61 with the reading character string still included. Upon receiving this content data, the terminal 61 causes the conversion server 41 to execute TTP processing and outputs the voice based on the phonetic symbol string obtained as a result.
[0109]
Although not shown in FIG. 10, the content server shown in the first embodiment or the second embodiment may be connected to a communication network (the Internet 21 in this example). Then, the terminal 61 may receive the content data including the phonetic symbol string from the content server and output the sound.
[0110]
In the content server 51, the content input unit 2 accepts input of content data from the content provider, as in the first embodiment. Further, the content transmission unit 5 transmits content data to the terminal 61 in response to a request from the terminal 61. However, unlike in the first embodiment, the content transmission unit 5 transmits the content data to the terminal 61 with the reading character string still included.
[0111]
The configuration of the conversion server 41 is the same as the configuration of the conversion server in the second embodiment. However, in the present embodiment, conversion server 41 receives a read-out character string and identification information from terminal 61 and transmits a phonetic symbol string to terminal 61.
[0112]
In the terminal 61, the data extraction unit 63 extracts a reading character string or a phonetic symbol string from the content data received by the content receiving unit 12. The timing control unit 62 controls the timing at which the data extraction unit 63 extracts the reading character string or phonetic symbol string from the content data. The conversion request unit 64 transmits the reading character string extracted by the data extraction unit 63 to the conversion server 41 and requests conversion (TTP processing) from the reading character string to a phonetic symbol string. The conversion request unit 64 then receives the phonetic symbol string obtained by the TTP processing from the conversion server 41. The voice generation unit 65 generates a voice signal based on the phonetic symbol string received by the conversion request unit 64 or the phonetic symbol string extracted by the data extraction unit 63. The terminal 61 may include a display unit (display device) that displays content.
[0113]
FIG. 11 is a block diagram illustrating a specific configuration example of the third embodiment of the present invention. In FIG. 11, the control unit 56 of the content server 51 executes a process of accepting input of content data and a process of transmitting content data to the terminal 61 in accordance with the content distribution program stored in the storage device 57. The network interface unit 58 transmits and receives content data via the Internet 21.
[0114]
In addition, the control unit 66 of the terminal 61 executes processing according to the content output program stored in the storage device 67. Specifically, the control unit 66 executes a process of receiving content data from the content server 51, a process of extracting the reading character string or phonetic symbol string from the content data, a process of transmitting the extracted reading character string to the conversion server 41 and receiving the phonetic symbol string from the conversion server 41, a voice signal generation process based on the phonetic symbol string, and a process of outputting the voice. The network interface unit 68 transmits requests for content data and receives content data via the Internet 21. The audio output device 69 is a device such as a speaker and outputs the voice. The temporary storage device 70 is a storage device that temporarily stores received content data and phonetic symbol strings.
[0115]
The storage device 67 stores a content output program for causing a computer to execute: a process of receiving, from the content distribution device, content data in which a reading character string is described; a process of extracting the reading character string from the content data; a process of transmitting the reading character string to the audio information conversion device; a process of receiving, from the audio information conversion device, the phonetic symbol string converted from the reading character string; and a process of outputting a voice based on the phonetic symbol string.
[0116]
An example of the configuration of the conversion server 41 shown in FIG. 11 is the same as that shown in FIG.
[0117]
When the audio content distribution system is configured as shown in FIG. 11, the content input unit 2 and the content transmission unit 5 are realized by the control unit 56 and the network interface unit 58 of the content server 51. Further, the content receiving unit 12 and the conversion request unit 64 are realized by the control unit 66 and the network interface unit 68. Further, the timing control unit 62, the data extraction unit 63, and the voice generation unit 65 are realized by the control unit 66. The audio output unit 15 is realized by the audio output device 69.
[0118]
Next, the operation will be described.
The content input unit 2 of the content server 51 accepts input of content data from the content provider and stores the content data in the storage device 57 (not shown in FIG. 10). This process is the same as step S101. As in the first embodiment, the input content data describes, together with the character string to be used as the reading character string, a tag indicating that the character string is a reading character string and identification information indicating the specification of the phonetic symbol string into which it should be converted.
[0119]
When the content transmission unit 5 receives a request for content data from the terminal 61, it transmits the content data stored in the storage device 57 to the terminal 61. The input content data is not subjected to processing such as replacement. Therefore, the content transmission unit 5 transmits content data that includes the tag indicating the reading character string, the identification information, and the reading character string.
[0120]
FIG. 12 is a flowchart showing an example of the operation of the terminal 61. The content receiving unit 12 of the terminal 61 requests content data from the content server 51 in accordance with a user operation. Alternatively, in accordance with a user operation, the content receiving unit 12 may request content data including a phonetic symbol string from a content server (not shown in FIG. 10) similar to that of the first or second embodiment. The content receiving unit 12 then receives the requested content data from the content server and stores it in the temporary storage device 70 (not shown in FIG. 10) (step S131).
[0121]
The data extraction unit 63 extracts a read-out character string or phonetic symbol string from the content data stored in the temporary storage device 70 under the control of the timing control unit 62. For example, the timing control unit 62 may cause the data extraction unit 63 to start extracting a read character string or the like immediately after the content reception unit 12 receives the content data (immediately after step S131). Alternatively, after the content data is received, when the user performs an operation for instructing voice output, the data extraction unit 63 may start extraction of a read character string or the like. Alternatively, the data extraction unit 63 may start extraction of a read-out character string or the like after a predetermined period after displaying an image or the like on the display unit based on the content data.
[0122]
When extracting the reading character string or the phonetic symbol string from the content data, the data extraction unit 63 first determines whether the content data includes the reading character string or the phonetic symbol string (step S132). The data extraction unit 63 determines that the read character string is included if the tag indicating the read character string is described in the content data. If a tag indicating a phonetic symbol string is described, it is determined that a phonetic symbol string is included.
[0123]
If the data extraction unit 63 determines that the read character string is included, the data extraction unit 63 extracts the read character string and the identification information from the content data, and stores them in the temporary storage device 70. At this time, the data extraction unit 63 extracts the identification information described together with the tag indicating the reading character string, and extracts the character string described together with the tag as the reading character string.
[0124]
The conversion request unit 64 transmits the read-out character string and identification information stored in the temporary storage device 70 to the conversion server 41 via the Internet 21 (step S133).
[0125]
The conversion request unit 64 may store the address information of the conversion server 41 in the storage device in advance. In step S133, the read character string and the identification information may be transmitted to the conversion server 41 using the address information. Alternatively, the content provider may input data describing the address information of the conversion server 41 together with the tag indicating the read character string, the identification information, and the read character string into the content server 51. In this case, the data extraction unit 63 extracts the address information from the content data, and the conversion request unit 64 may transmit the reading character string and the identification information to the conversion server 41 using the address information.
[0126]
The reading character string receiving unit 42 of the conversion server 41 receives the reading character string and the identification information from the terminal 61. Then, the conversion unit 4 executes TTP processing and converts the received reading character string into a phonetic symbol string (step S134). Then, the phonetic symbol string is transmitted to the terminal 61 (step S135). The operation from when the conversion server 41 receives the read-out character string and the identification information until the phonetic symbol string is transmitted is the same as the operation of the conversion server 41 in the second embodiment.
[0127]
The conversion request unit 64 of the terminal 61 receives the phonetic symbol string transmitted by the conversion server 41 in step S135 and stores it in the temporary storage device 70.
[0128]
If the data extraction unit 63 determines that the phonetic symbol string is included in the content data (step S132), the data extraction unit 63 extracts the phonetic symbol string from the content data and stores it in the temporary storage device 70 (step S136).
[0129]
The voice generation unit 65 performs PTS processing (voice signal generation processing) based on the phonetic symbol string stored in the temporary storage device 70 in step S135 or step S136 (step S137). The voice generation unit 65 outputs the generated voice signal to the audio output unit 15 and causes the audio output unit 15 to output the voice (step S138).
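The branch taken by the terminal 61 in steps S132 to S136 can be sketched as follows. The `TTP` tag and `phoneme` attribute follow this description; the `PHONEME` tag, the function name, and the `request_conversion` callable (which stands in for the exchange with the conversion server 41) are assumptions. PTS processing and voice output (steps S137 and S138) are outside the sketch.

```python
import re
from typing import Callable, Optional

READING = re.compile(r'<TTP\s+phoneme="([^"]+)">(.*?)</TTP>', re.DOTALL)
PHONETIC = re.compile(r"<PHONEME>(.*?)</PHONEME>", re.DOTALL)


def obtain_phonetic_string(
    content: str,
    request_conversion: Callable[[str, str], str],
) -> Optional[str]:
    """Return the phonetic symbol string to hand to PTS processing,
    or None when the content data contains neither kind of tag."""
    reading = READING.search(content)            # S132: reading string present?
    if reading:
        spec, text = reading.group(1), reading.group(2)
        return request_conversion(text, spec)    # S133-S135: ask the server
    phonetic = PHONETIC.search(content)          # S132: phonetic string present?
    if phonetic:
        return phonetic.group(1)                 # S136: use it as-is
    return None
```

Because content data that already carries a phonetic symbol string never reaches `request_conversion`, that path involves no extra round trip to the conversion server.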
[0130]
According to the present embodiment, the terminal 61 can output the voice not only when it receives content data including a phonetic symbol string but also when it receives content data including a reading character string.
[0131]
Further, when the terminal 61 receives content data including a phonetic symbol string, it outputs the voice based on that phonetic symbol string, so that the time until the voice is output can be shortened. Further, since the content provider does not have to describe a phonetic symbol string for each specification, the burden on the content provider can be reduced. In addition, the terminal 61 does not need to separately download the phonetic symbol string information and the image data display information, and the content server does not need to transmit the voice and the image in synchronization.
[0132]
In the third embodiment, content data in which both a reading character string and a phonetic symbol string are described may be input to the content server 51. In this content data, the reading character string is described together with a tag indicating the reading character string and with identification information, and the phonetic symbol string is described together with a tag indicating the phonetic symbol string. When the terminal 61 receives such content data, the data extraction unit 63 extracts the character string described together with the tag indicating the phonetic symbol string as a phonetic symbol string. It also extracts the character string described together with the tag indicating the reading character string as a reading character string, along with the identification information. The conversion request unit 64 transmits the reading character string and the identification information to the conversion server 41 and receives the resulting phonetic symbol string from the conversion server 41. The voice generation unit 65 generates a voice signal from the phonetic symbol string extracted from the content data and the phonetic symbol string received from the conversion server 41.
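The handling of content data that mixes both kinds of tagged strings might look like the following Python sketch. The markup, the function name, and the decision to keep the strings in document order are assumptions made for illustration; the specification only states that both kinds of strings are extracted and used for voice generation.

```python
import re
from typing import Callable, List

# Hypothetical markup: one pattern that matches either kind of tagged span.
TAGGED_RE = re.compile(
    r"<phonetic>(?P<phon>.*?)</phonetic>"
    r'|<reading spec="(?P<spec>.*?)">(?P<read>.*?)</reading>',
    re.S,
)


def collect_phonetic_strings(
    content_data: str,
    request_conversion: Callable[[str, str], str],
) -> List[str]:
    """Gather phonetic symbol strings in document order (an assumption)."""
    result: List[str] = []
    for match in TAGGED_RE.finditer(content_data):
        if match.group("phon") is not None:
            # Already a phonetic symbol string: use it as extracted.
            result.append(match.group("phon"))
        else:
            # Reading character string: ask the conversion server for the
            # phonetic symbol string matching the identification information.
            result.append(
                request_conversion(match.group("read"), match.group("spec"))
            )
    return result
```

Only the reading character strings trigger a request to the conversion server; strings that already arrive as phonetic symbol strings need no round trip.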
[0133]
As in the first embodiment, when the specifications of the phonetic symbol strings of the terminals 11 are common, content data that does not describe identification information may be input to the content input unit 2. The specification of the phonetic symbol string may also be designated by a tag instead of by the identification information. In this case, in step S133, the data extraction unit 63 extracts the tag indicating the reading character string together with the reading character string, and the conversion request unit 64 transmits the tag and the reading character string to the conversion server 41. The conversion unit 4 of the conversion server 41 then converts the reading character string into a phonetic symbol string according to the specification corresponding to the type of the tag.
[0134]
In the present embodiment, a conversion server 41 may be provided for each specification of the phonetic symbol string, and each conversion server 41 may be configured to perform TTP processing according to a specific specification. In this case, the terminal 61 may transmit the reading character string to the conversion server corresponding to the specification designated by the identification information or the like.
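As a rough illustration of providing one conversion server per specification, the sketch below routes a reading character string to the converter registered for the designated specification. The names `make_router`, `server_a`, and `server_b`, and the specification keys "A" and "B", are hypothetical; the two converters are local stand-ins for separate conversion servers.

```python
from typing import Callable, Dict

Converter = Callable[[str], str]


def make_router(servers: Dict[str, Converter]) -> Callable[[str, str], str]:
    """Build a function that sends a reading string to the matching server."""

    def route(reading: str, spec: str) -> str:
        try:
            convert = servers[spec]
        except KeyError:
            raise ValueError(
                f"no conversion server for specification {spec!r}"
            ) from None
        return convert(reading)

    return route


def server_a(reading: str) -> str:
    # Stand-in for a server that performs TTP processing for specification A.
    return {"こんにちは": "KONNICHIWA"}[reading]


def server_b(reading: str) -> str:
    # Stand-in for a server that performs TTP processing for specification B.
    return {"こんにちは": "KON-NITIWA"}[reading]


route = make_router({"A": server_a, "B": server_b})
```

With this arrangement each server implements only one specification, and the terminal side needs only the mapping from specification to server.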
[0135]
In the present embodiment, the content receiving means corresponds to the content receiving unit 12. The content output means corresponds to the data extraction unit 63, the conversion request unit 64, the voice generation unit 65, and the voice output unit 15. The data extraction means corresponds to the data extraction unit 63, the transmission/reception means corresponds to the conversion request unit 64, and the output means corresponds to the voice generation unit 65 and the voice output unit 15. The timing control means corresponds to the timing control unit 62.
[0136]
The conversion means corresponds to the reading character string receiving unit 42 and the conversion unit 4. The phonetic symbol string transmission means corresponds to the phonetic symbol string transmission unit 43.
[0137]
[Effect of the invention]
According to the present invention, the content distribution device includes content input means for accepting input of content data in which a read-out character string, which is a character string to be read out as sound, is described, and content replacement means for replacing the read-out character string in the content data with a phonetic symbol string, which is data for specifying the output sound. The terminal device includes content receiving means for receiving, from the content distribution device, content data in which the phonetic symbol string is described, phonetic symbol string extraction means for extracting the phonetic symbol string from the content data, and output means for outputting sound based on the phonetic symbol string. The time until the terminal receiving the content outputs the sound can therefore be shortened.
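The replacement performed on the content distribution device side can be pictured with the short Python sketch below. The tag names and the `convert` callback are hypothetical placeholders; the sketch only shows a read-out character string and its tag being swapped for a phonetic symbol string and its tag before distribution.

```python
import re
from typing import Callable

# Hypothetical tag syntax for a read-out character string with its
# specification information (an assumption made for illustration).
READING_RE = re.compile(r'<reading spec="(.*?)">(.*?)</reading>', re.S)


def replace_readings(
    content_data: str,
    convert: Callable[[str, str], str],
) -> str:
    """Replace each tagged read-out string with a tagged phonetic string."""

    def _substitute(match: "re.Match[str]") -> str:
        spec, reading = match.group(1), match.group(2)
        # Both the string and its tag are replaced, so the terminal can tell
        # that the content already carries a phonetic symbol string.
        return f"<phonetic>{convert(reading, spec)}</phonetic>"

    return READING_RE.sub(_substitute, content_data)
```

Because the conversion happens before distribution, the terminal in this sketch would need no further request to obtain the phonetic symbol string.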
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of the present invention.
FIG. 2 is a block diagram illustrating a specific configuration example of the first embodiment.
FIG. 3 is a flowchart showing an example of the operation of the content server.
FIG. 4 is an explanatory diagram illustrating an example of content data input to a content server.
FIG. 5 is an explanatory diagram showing an example of content data after replacement.
FIG. 6 is a flowchart showing an example of operation of a terminal.
FIG. 7 is a block diagram showing a second embodiment of the present invention.
FIG. 8 is a block diagram illustrating a specific configuration example of the second embodiment.
FIG. 9 is a flowchart showing an example of operations of a content server and a conversion server.
FIG. 10 is a block diagram showing a third embodiment of the present invention.
FIG. 11 is a block diagram illustrating a specific configuration example of the third embodiment.
FIG. 12 is a flowchart showing an example of operations of a terminal and a conversion server.
[Explanation of symbols]
1 Content server
2 Content input unit
3 Reading character string extraction and replacement unit
4 Conversion unit
5 Content transmission unit
11 Terminal device
12 Content receiving unit
13 Phonetic symbol string extraction unit
14 Voice generation unit
15 Voice output unit
21 Communication network

Claims

A content distribution system comprising: a content distribution device that distributes content data; and a terminal device that outputs content based on content data received from the content distribution device,
The content distribution device includes:
Content input means for receiving input of content data in which a read-out character string, which is a character string to be read out as sound, is described together with a tag indicating the read-out character string;
Content replacement means for replacing the read-out character string in the content data with a phonetic symbol string that is data for specifying the output sound;
The terminal device includes:
Content receiving means for receiving content data in which a phonetic symbol string is described from the content distribution device;
Phonetic symbol string extraction means for extracting a phonetic symbol string from the content data;
Output means for outputting sound based on the phonetic symbol string,
The content replacement means includes:
Read-out character string extraction means for extracting a character string described together with a tag indicating a read-out character string from the content data input to the content input means;
Conversion means for converting the read-out character string into a phonetic symbol string;
Replacement means for replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string,
The content receiving means receives content data in which a phonetic symbol string is described together with a tag indicating a phonetic symbol string from the content distribution device,
A content distribution system, wherein the phonetic symbol string extraction means extracts a character string described together with a tag indicating the phonetic symbol string as a phonetic symbol string.

The content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the reading character string,
The reading character string extraction means extracts the reading character string and the specification information from the content data,
The content distribution system according to claim 1, wherein the conversion means converts the reading character string into a phonetic symbol string corresponding to the specification indicated by the specification information.

A content distribution system comprising: a content distribution device that distributes content data; a terminal device that outputs content based on the content data received from the content distribution device; and a voice information conversion device that converts a read-out character string, which is a character string to be read out as sound, into a phonetic symbol string,
The content distribution device includes:
Content input means for accepting input of content data in which a read-out character string is described together with a tag indicating the read-out character string;
Content replacement means for replacing the read-out character string in the content data with a phonetic symbol string that is data for specifying the output sound;
The terminal device includes:
Content receiving means for receiving content data in which a phonetic symbol string is described from the content distribution device;
Phonetic symbol string extraction means for extracting a phonetic symbol string from the content data;
Output means for outputting sound based on the phonetic symbol string,
The content replacement means includes:
Read-out character string extraction means for extracting a character string described together with a tag indicating a read-out character string from the content data input to the content input means;
Read-out character string transmitting means for transmitting the read-out character string to the voice information conversion device;
Replacement means for receiving a phonetic symbol string from the voice information conversion device, replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string,
The voice information conversion device includes:
Conversion means for receiving a read-out character string from the content distribution device and converting the read-out character string into a phonetic symbol string;
Phonetic symbol string transmitting means for transmitting the phonetic symbol string to the content distribution device;
The content receiving means receives content data in which a phonetic symbol string is described together with a tag indicating a phonetic symbol string from the content distribution device,
A content distribution system, wherein the phonetic symbol string extraction means extracts a character string described together with a tag indicating the phonetic symbol string as a phonetic symbol string.

The content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the reading character string,
The reading character string extraction means extracts the reading character string and the specification information from the content data,
The reading character string transmitting means transmits the reading character string and the specification information to the voice information conversion device,
The content distribution system according to claim 3, wherein the conversion means converts the reading character string into a phonetic symbol string corresponding to the specification indicated by the specification information.

A content distribution device for distributing content data to a terminal device, comprising:
Content input means for receiving input of content data in which a read-out character string, which is a character string to be read out as sound, is described together with a tag indicating the read-out character string;
Read-out character string extraction means for extracting a character string described together with a tag indicating a read-out character string from the content data as a read-out character string;
Conversion means for converting the read-out character string into a phonetic symbol string that is data for specifying output sound; and
Replacement means for replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.

The content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the reading character string,
The reading character string extraction means extracts the reading character string and the specification information from the content data,
The content distribution device according to claim 5, wherein the conversion means converts the reading character string into a phonetic symbol string corresponding to the specification indicated by the specification information.

A content distribution device that is connected to a voice information conversion device, which converts a read-out character string that is a character string to be read out as sound into a phonetic symbol string that is data for specifying output sound, and that distributes content data to a terminal device, comprising:
Content input means for accepting input of content data in which a read-out character string is described together with a tag indicating the read-out character string;
Read-out character string extraction means for extracting a character string described together with a tag indicating a read-out character string from the content data as a read-out character string;
Read-out character string transmitting means for transmitting the read-out character string extracted by the read-out character string extraction means to the voice information conversion device; and
Replacement means for receiving, from the voice information conversion device, the phonetic symbol string converted from the read-out character string, replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.

The content input means accepts input of content data in which specification information indicating the specification of the phonetic symbol string is described together with the reading character string,
The reading character string extraction means extracts the reading character string and the specification information from the content data,
The content distribution device according to claim 7, wherein the reading character string transmitting means transmits the reading character string and the specification information to the voice information conversion device.

A content distribution program installed in a content distribution device for distributing content data to a terminal device, the program causing a computer to execute:
A process of accepting input of content data in which a read-out character string, which is a character string to be read out as sound, is described together with a tag indicating the read-out character string;
A process of extracting a character string described together with a tag indicating a read-out character string from the content data as a read-out character string;
A process of converting the read-out character string into a phonetic symbol string that is data for specifying output sound; and
A process of replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.

A content distribution program installed in a content distribution device that is connected to a voice information conversion device, which converts a read-out character string that is a character string to be read out as sound into a phonetic symbol string that is data for specifying output sound, and that distributes content data to a terminal device, the program causing a computer to execute:
A process of accepting input of content data in which a read-out character string is described together with a tag indicating the read-out character string;
A process of extracting a character string described together with a tag indicating a read-out character string from the content data as a read-out character string;
A process of transmitting the read-out character string to the voice information conversion device; and
A process of receiving, from the voice information conversion device, the phonetic symbol string converted from the read-out character string, replacing the read-out character string in the content data with the phonetic symbol string, and replacing the tag indicating the read-out character string with a tag indicating the phonetic symbol string.