JP2004140583A

JP2004140583A - Information providing apparatus

Info

Publication number: JP2004140583A
Application number: JP2002303193A
Authority: JP
Inventors: Hirofumi Nishimura; 西村　洋文
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-10-17
Filing date: 2002-10-17
Publication date: 2004-05-13

Abstract

PROBLEM TO BE SOLVED: To provide an information providing apparatus with which a user can easily follow the video and subtitles included in content.

SOLUTION: The information providing apparatus includes: content input means 110 for receiving content including video information, audio information, and character information; video presenting means 120 for presenting the video information included in the content received by the content input means 110; voice synthesis means 160 for synthesizing speech from the character information included in the content received by the content input means 110 at a prescribed speech speed; and voice output means 170 for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesis means 160 and for outputting a voice in accordance with the audio information included in the content received by the content input means 110.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、文字多重放送などの番組コンテンツの情報や記録媒体に記録されたコンテンツに含まれる情報を再生して利用者に提示する情報提示装置に関する。
【０００２】
【従来の技術】
従来の情報提示装置は、番組コンテンツを表示させるための大型の画面表示部と小型の画面表示部とを有しており、大型の画面表示部に番組コンテンツに含まれる映像を表示させ、字幕の表示を容易にするために利用者の近傍に設置される小型の画面表示部に番組コンテンツに含まれる字幕を表示させていた（例えば、特許文献１参照。）。
【０００３】
【特許文献１】
特開平１１−１９６３４５号公報（５４段落目−５６段落目、図１）
【０００４】
【発明が解決しようとする課題】
しかしながら、上述の従来の情報提示装置では、大型の画面表示部と小型の画面表示部の双方の画面を見なければならず、映像や字幕の視聴が煩雑になってしまうという問題があった。
本発明は、このような従来の問題を解決するためになされたもので、コンテンツに含まれる映像や字幕を容易に視聴することができる情報提示装置を提供するものである。
【０００５】
【課題を解決するための手段】
本発明の情報提示装置は、映像情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツから文字情報を抽出する文字情報抽出手段と、前記文字情報抽出手段によって抽出された文字情報を前記映像情報と同期した音声に変換して出力する音声出力手段とを備えた構成を有している。
この構成により、字幕を表す文字情報を音声として聴覚的に出力するため、映像や字幕を容易に視聴することができる。
【０００６】
また、本発明の情報提示装置は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段と、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段とを備えた構成を有している。
この構成により、字幕を表す文字情報を合成音声として聴覚的に出力するため、映像や字幕を容易に視聴することができる。また、狭い画面を有する携帯端末を使用する場合でも、映像情報、音声情報および文字情報を含むコンテンツを容易に視聴することができる。
【０００７】
また、本発明の情報提示装置は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段と、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段とを備え、さらに、前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報から無音部分を検出する無音検出手段と、前記無音検出手段によって検出された無音部分の期間を測定する無音期間測定手段と、前記無音期間測定手段によって測定された無音期間に基づいて前記合成音声の話速を算出する無音話速算出手段とを備え、前記音声合成手段は、前記無音話速算出手段によって算出された合成音声の話速に基づいて前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を音声合成し、前記音声出力手段は、前記音声合成手段によって音声合成された合成音声情報に応じて前記無音期間内に合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する構成を有している。
この構成により、コンテンツに含まれる音声と合成音声とが重複しないように合成音声を出力するため、合成音声の聞き取りを容易にすることができる。
【０００８】
また、本発明の情報提示装置は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段と、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段とを備え、さらに、前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報から無声部分を検出する無声検出手段と、前記無声検出手段によって検出された無声部分の期間を測定する無声期間測定手段と、前記無声期間測定手段によって測定された無声期間に基づいて前記合成音声の話速を算出する無声話速算出手段とを備え、前記音声合成手段は、前記無声話速算出手段によって算出された合成音声の話速に基づいて前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を音声合成し、前記音声出力手段は、前記音声合成手段によって音声合成された合成音声情報に応じて前記無声期間内に合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する構成を有している。
この構成により、コンテンツに含まれる音声の無声部分と合成音声とが重複しないように合成音声を出力するため、合成音声の聞き取りをさらに容易にすることができる。
【０００９】
また、本発明の情報提示装置は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段と、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段とを備え、さらに、前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に基づいて発声部分を検出する発声検出手段と、前記発声検出手段によって検出された発声部分の期間を測定する発声期間測定手段と、前記発声期間測定手段によって測定された発声期間だけ前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声を消音する音声消音手段とを備え、前記音声合成手段は、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成し、前記音声出力手段は、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する構成を有している。
この構成により、翻訳された字幕を表示する洋画などのコンテンツを視聴するときに発声部分を消音するため、出演者の台詞などを自動的に吹き替えすることができる。
【００１０】
また、本発明の情報提示装置は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段と、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に基づいて発声部分を検出する発声検出手段と、前記発声検出手段によって検出された発声部分の期間を測定する発声期間測定手段と、前記発声期間測定手段によって測定された発声期間だけ前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声を消音する音声消音手段とを備え、さらに、前記発声期間測定手段によって測定された発声期間内の音声情報を解析する発声解析手段と、前記発声解析手段によって解析された音声情報から得られる周波数に基づいて前記音声情報を各々の声質に分類する声質分類手段とを備え、前記音声合成手段は、前記声質分類手段によって分類された各々の声質に応じて前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を音声合成し、前記音声出力手段は、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する構成を有している。
この構成により、翻訳された字幕を表示する洋画などのコンテンツを視聴するときに出演者などの声質に応じて合成音声を出力するため、コンテンツの理解性を向上させることができる。
【００１１】
また、本発明の情報提示装置は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段と、前記音声合成手段によって音声合成された合成音声情報に応じて合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段と、前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に基づいて発声部分を検出する発声検出手段と、前記発声検出手段によって検出された発声部分の期間を測定する発声期間測定手段と、前記発声期間測定手段によって測定された発声期間だけ前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声を消音する音声消音手段とを備え、さらに、前記発声期間測定手段によって測定された発声期間に基づいて前記合成音声の話速を算出する発声話速算出手段を備え、前記音声合成手段は、前記発声話速算出手段によって算出された合成音声の話速に基づいて前記コンテンツ入力手段によって入力されたコンテンツに含まれる文字情報を音声合成し、前記音声出力手段は、前記音声合成手段によって音声合成された合成音声情報に応じて前記発声期間内に合成音声を出力すると共に前記コンテンツ入力手段によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する構成を有している。
この構成により、翻訳された字幕を表示する洋画などのコンテンツを視聴するときに出演者などが発声する速度に応じて合成音声を出力するため、リアルに出演者の台詞などを吹き替えすることができる。
【００１２】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を参照して説明する。
（第１の実施の形態）
図１は、本発明の第１の実施の形態に係る情報提示装置のブロック構成図である。本発明の第１の実施の形態に係る情報提示装置１００は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段１１０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段１２０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声情報から無音部分を検出する無音検出手段１３０、無音検出手段１３０によって検出された無音部分の期間を測定する無音期間測定手段１４０、無音期間測定手段１４０によって測定された無音期間に基づいて合成音声の話速を算出する無音話速算出手段１５０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる文字情報を所定の話速に基づいて音声合成する音声合成手段１６０、および音声合成手段１６０によって音声合成された合成音声情報に応じて合成音声を出力すると共にコンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段１７０を含むように構成される。なお、本発明の不図示の文字情報抽出手段は、音声合成手段１６０を構成するようにしてもよい。
【００１３】
コンテンツ入力手段１１０は、文字多重放送などの番組コンテンツや記録媒体に記録されたコンテンツであって、映像情報、音声情報および文字情報を含むコンテンツを入力し、入力されたコンテンツに含まれる映像情報を映像提示手段１２０に出力し、入力されたコンテンツに含まれる音声情報を無音検出手段１３０および音声出力手段１７０に出力し、入力されたコンテンツに含まれる字幕などの文字情報を無音話速算出手段１５０および音声合成手段１６０に出力するようになっている。
なお、本発明によれば、コンテンツは、翻訳された字幕を表示する洋画などであり、コンテンツに含まれる映像情報、音声情報および文字情報は、例えば、画面に提示されている字幕などが切り替わるタイミングで区切られたフレームによって構成されている。
【００１４】
映像提示手段１２０には、コンテンツ入力手段１１０によって出力されたコンテンツに含まれる映像情報が入力され、映像提示手段１２０は、入力された映像情報を再生して提示するようになっている。なお、本発明によれば、映像提示手段１２０は、液晶ディスプレイなどを含むように構成されるが、映像情報を提示する手段であれば如何なるもので構成されてもよい。
【００１５】
無音検出手段１３０には、コンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、無音検出手段１３０は、入力されたコンテンツに含まれる音声情報から無音部分を検出するようになっている。無音部分とは、人の耳に聞こえない音からなる音声である。なお、本発明によれば、無音部分を検出する方法に関しては、無音検出手段１３０が音声情報を構成するフレームを所定期間、例えば５ｍｓ程度の期間を構成するサブフレームに分割し、分割された各サブフレームから音圧値の平均を算出し、算出された音圧平均値が所定の閾値を超えないサブフレームを無音部分として検出するようにしてもよい。無音検出手段１３０は、検出された無音部分に関する情報を無音期間測定手段１４０に出力するようになっている。
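The silence detection outlined in paragraph [0015] (split a frame into sub-frames of about 5 ms, average the sound pressure of each, and treat sub-frames whose average does not exceed a threshold as silent) could be sketched as follows. This is only an illustration: the sample rate, the threshold value, the use of mean absolute amplitude as the "sound pressure", and all function names are assumptions, not details given in the source.

```python
from typing import List, Sequence

SUBFRAME_MS = 5  # example sub-frame length mentioned in the text


def split_subframes(samples: Sequence[float], sample_rate: int,
                    subframe_ms: int = SUBFRAME_MS) -> List[Sequence[float]]:
    """Split one audio frame into consecutive sub-frames of subframe_ms."""
    size = sample_rate * subframe_ms // 1000
    if size <= 0:
        raise ValueError("a sub-frame must contain at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def is_silent(subframe: Sequence[float], threshold: float) -> bool:
    """A sub-frame is silent when its mean level does not exceed the threshold."""
    if not subframe:
        return True
    # Assumption: mean absolute amplitude stands in for "average sound pressure".
    mean_level = sum(abs(s) for s in subframe) / len(subframe)
    return mean_level <= threshold


def detect_silence(samples: Sequence[float], sample_rate: int,
                   threshold: float = 0.01) -> List[bool]:
    """Return one flag per sub-frame; True marks a silent sub-frame."""
    return [is_silent(sf, threshold)
            for sf in split_subframes(samples, sample_rate)]
```

At 8 kHz each sub-frame holds 40 samples, so a 20 ms frame yields four flags.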
【００１６】
無音期間測定手段１４０には、無音検出手段１３０によって出力された無音部分に関する情報が入力され、無音期間測定手段１４０は、入力された無音部分に関する情報に基づいて無音部分の期間を測定するようになっている。なお、本発明によれば、無音期間測定手段１４０は、無音部分に関する情報に含まれる無音部分が連続して検出された連続回数と所定期間とを掛け合わせて無音部分の期間を測定するようにしてもよい。
無音期間測定手段１４０は、測定された無音部分の期間を表す無音期間情報を無音話速算出手段１５０に出力するようになっている。
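Paragraph [0016] suggests measuring a silent period as the number of consecutively detected silent sub-frames multiplied by the sub-frame period. A minimal sketch of that calculation follows; the function name, the 5 ms default, and the `(start_ms, duration_ms)` output format are assumptions made for illustration.

```python
from typing import List, Tuple


def measure_periods(flags: List[bool],
                    subframe_ms: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start_ms, duration_ms) for each run of consecutive True flags.

    duration = number of consecutive flagged sub-frames * sub-frame period
    """
    periods = []
    run_start = None
    for i, flagged in enumerate(flags):
        if flagged and run_start is None:
            run_start = i  # a new run begins
        elif not flagged and run_start is not None:
            periods.append((run_start * subframe_ms,
                            (i - run_start) * subframe_ms))
            run_start = None
    if run_start is not None:  # a run that lasts until the end of the frame
        periods.append((run_start * subframe_ms,
                        (len(flags) - run_start) * subframe_ms))
    return periods
```

The same helper would apply unchanged to the unvoiced and utterance periods of the later embodiments, since they are measured in the same way.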
【００１７】
無音話速算出手段１５０には、無音期間測定手段１４０によって出力された無音期間情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる文字情報が入力され、無音話速算出手段１５０は、入力された無音期間情報に基づいて合成音声の話速を算出するようになっている。なお、本発明によれば、無音話速算出手段１５０は、１文字あたりの平均的な合成音声が発せられる所定の話速と文字情報を構成するフレームに含まれる文字数とを掛け合わせて、合成音声が発せられる標準時間を算出し、算出された標準時間で無音期間を割って話速変更率を算出するようにしてもよい。
無音話速算出手段１５０は、算出された話速に関する話速情報を音声合成手段１６０に出力するようになっている。
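The speech-speed calculation in paragraph [0017] multiplies a standard per-character speaking time by the number of characters in the subtitle frame to obtain a standard time, then divides the silent period by that standard time to obtain a speech speed change ratio. The sketch below follows that arithmetic; the default of 0.15 s per character and the function name are assumptions, and how a synthesizer would apply the ratio is not specified in the source.

```python
def speech_rate_change_ratio(text: str, period_s: float,
                             seconds_per_char: float = 0.15) -> float:
    """Return available period / standard time.

    standard time = seconds_per_char * number of characters in the text
    """
    n_chars = len(text)
    if n_chars == 0:
        raise ValueError("text must contain at least one character")
    standard_time = seconds_per_char * n_chars
    return period_s / standard_time
```

Under these assumptions a ratio below 1 means the subtitle must be spoken in less than its standard time, and a ratio above 1 means there is time to spare.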
【００１８】
音声合成手段１６０には、入力されたコンテンツに含まれる文字情報が入力され、音声合成手段１６０は、入力された文字情報を所定の話速に基づいて音声合成するようになっている。また、音声合成手段１６０に無音話速算出手段１５０によって話速情報が入力されたときには、音声合成手段１６０は、話速情報すなわち話速変更率に基づいて話速を変更し、変更された話速に基づいて文字情報を音声合成するようになっている。音声合成手段１６０は、素片データベース１６１を含むように構成され、文字情報に含まれるテキストデータから素片データベース１６１を用いて音声素片情報に変換し、変換したそれぞれの音声素片情報を合成するようになっている。
音声合成手段１６０は、音声合成された合成音声情報を音声出力手段１７０に出力するようになっている。なお、本発明によれば、文字情報抽出手段は、コンテンツ入力手段１１０によって出力されたコンテンツから文字情報を抽出し、抽出された文字情報を音声出力手段１７０に出力するようにしてもよい。
【００１９】
音声出力手段１７０には、音声合成手段１６０によって出力された合成音声情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、音声出力手段１７０は、入力された合成音声情報に応じて合成音声を再生して出力すると共に、入力されたコンテンツに含まれる音声情報に応じて音声を再生して出力するようになっている。なお、本発明によれば、音声出力手段１７０は、文字情報抽出手段によって出力された文字情報を映像情報と同期した音声に変換して出力するようにしてもよい。
【００２０】
図２は、本発明の第１の実施の形態に係る合成音声出力および音声出力のタイミングチャートである。なお、本発明によれば、図２に示すように音声情報を構成するフレームを再生して出力する期間であるフレーム出力期間２０内に、無音期間２２と無音期間２２以外からなる有音期間２１が複数存在する場合がある。音声出力手段１７０は、無音期間測定手段１４０によって測定された無音期間２２内に合成音声を出力すると共に、有音期間２１内に音声を再生して出力するようになっている。例えば、無音期間２２内に合成音声を出力するとき、音声出力手段１７０は、句点や読点を示すタイミングで合成音声の出力を一時停止し、次の無音期間２２内で合成音声の出力を再開するようになっている。
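Paragraph [0020] describes pausing the synthesized voice at punctuation marks and resuming it in the next silent period. One way this scheduling might look is sketched below. The greedy placement policy, the fixed time per character, and the choice to leave out chunks that fit in no silent period are all assumptions made for illustration; the source does not describe the scheduling algorithm.

```python
import re
from typing import List, Tuple


def split_at_punctuation(text: str) -> List[str]:
    """Split a subtitle after each Japanese comma (、) or full stop (。)."""
    return re.findall(r"[^、。]+[、。]?", text)


def schedule_chunks(text: str, silent_periods: List[Tuple[float, float]],
                    seconds_per_char: float) -> List[Tuple[float, str]]:
    """Greedily place punctuation-delimited chunks into silent periods.

    silent_periods holds (start_s, duration_s) pairs in playback order.
    Returns (start_s, chunk) pairs; chunks that fit nowhere are left out.
    """
    chunks = split_at_punctuation(text)
    schedule = []
    idx = 0
    for start, duration in silent_periods:
        cursor = start
        while idx < len(chunks):
            need = len(chunks[idx]) * seconds_per_char
            if cursor + need > start + duration + 1e-9:
                break  # pause here and resume in the next silent period
            schedule.append((cursor, chunks[idx]))
            cursor += need
            idx += 1
    return schedule
```

For example, with 0.1 s per character the subtitle 「今日は、晴れです。」 splits into a 0.4 s chunk and a 0.5 s chunk, which would be placed in two successive silent periods if the first period is too short for both.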
【００２１】
以下、本発明の第１の実施の形態に係る情報提示装置１００の動作について、図１を参照して説明する。
まず、映像情報、音声情報および文字情報を含むコンテンツは、コンテンツ入力手段１１０によって文字多重放送や記録媒体から入力され、映像提示手段１２０、無音検出手段１３０、無音話速算出手段１５０および音声出力手段１７０に出力される。
次に、コンテンツに含まれる映像情報は、映像提示手段１２０によって再生されてディスプレイなどに提示される。
【００２２】
一方、無音部分は、無音検出手段１３０によってコンテンツに含まれる音声情報から検出され、検出された無音部分に関する情報が無音期間測定手段１４０に出力されたとき、無音期間情報は、無音期間測定手段１４０によって無音部分に関する情報に基づいて無音部分の期間が測定されて生成され、無音話速算出手段１５０に出力される。
話速情報は、無音話速算出手段１５０によって無音期間情報およびコンテンツ入力手段１１０によって出力された文字情報に基づいて算出されて生成され、音声合成手段１６０に出力される。
【００２３】
次に、文字情報に含まれるテキストデータから素片データベース１６１を用いて変換されたそれぞれの音声素片情報は、音声合成手段１６０によって話速情報に基づいて音声合成され、音声合成された合成音声情報は、音声出力手段１７０に出力される。
音声出力手段１７０によって再生される合成音声は、音声合成手段１６０から出力された合成音声情報に応じて無音期間内に出力されると共に、音声出力手段１７０によって再生される音声は、コンテンツ入力手段１１０によって出力された音声情報に応じて出力される。
【００２４】
以上説明したように、本発明の第１の実施の形態に係る情報提示装置は、字幕を表す文字情報を合成音声として聴覚的に出力するため、映像や字幕を容易に視聴することができ、狭い画面を有する携帯端末を使用する場合でも、映像情報、音声情報および文字情報を含むコンテンツを容易に視聴することができる。
また、コンテンツに含まれる音声と合成音声とが重複しないように合成音声を出力するため、合成音声の聞き取りを容易にすることができる。
【００２５】
（第２の実施の形態）
図３は、本発明の第２の実施の形態に係る情報提示装置のブロック構成図である。本発明の第２の実施の形態に係る情報提示装置２００は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段１１０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段１２０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声情報から無声部分を検出する無声検出手段２３０、無声検出手段２３０によって検出された無声部分の期間を測定する無声期間測定手段２４０、無声期間測定手段２４０によって測定された無声期間に基づいて前記合成音声の話速を算出する無声話速算出手段２５０、無声話速算出手段２５０によって算出された合成音声の話速に基づいてコンテンツ入力手段１１０によって入力されたコンテンツに含まれる文字情報を音声合成する音声合成手段２６０、および音声合成手段２６０によって音声合成された合成音声情報に応じて合成音声を出力すると共にコンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段２７０を含むように構成される。
なお、本発明の第２の実施の形態に係る情報提示装置２００を構成する手段のうち、本発明の第１の実施の形態に係る情報提示装置１００を構成する手段と同一の手段には同一の符号を付し、それぞれの説明を省略する。
【００２６】
無声検出手段２３０には、コンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、無声検出手段２３０は、入力されたコンテンツに含まれる音声情報から無声部分を検出するようになっている。無声部分とは、無音部分に加えて雑音や効果音などからなる音声である。なお、本発明によれば、無声部分を検出する方法に関しては、無声検出手段２３０が音声情報を構成するフレームを所定期間、例えば５ｍｓ程度の期間を構成するサブフレームに分割し、分割された各サブフレームから周波数分析して人の声とされる周波数範囲以内でないサブフレームと、周波数範囲以内の成分の音圧値が所定の閾値以下となるサブフレームとを無声部分として検出するようにしてもよい。人の声とされる周波数範囲は、例えば、約５ｋＨｚ以下となるような範囲である。
無声検出手段２３０は、検出された無声部分に関する情報を無声期間測定手段２４０に出力するようになっている。
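The unvoiced detection in paragraph [0026] relies on a frequency analysis of each sub-frame: a sub-frame counts as unvoiced when it lies outside the frequency range regarded as human voice (about 5 kHz or less) or when its in-range components stay at or below a threshold. The sketch below uses a naive discrete Fourier transform from the standard library. The threshold value, the comparison of peak magnitudes to decide whether a sub-frame is "outside the range", and the function names are assumptions, not the method stated in the source.

```python
import cmath
import math
from typing import Sequence, Tuple

VOICE_BAND_MAX_HZ = 5000.0  # "about 5 kHz or less" in the text


def band_peaks(subframe: Sequence[float], sample_rate: int,
               band_max_hz: float = VOICE_BAND_MAX_HZ) -> Tuple[float, float]:
    """Return the peak DFT magnitude inside and outside the voice band."""
    n = len(subframe)
    in_peak = out_peak = 0.0
    for k in range(1, n // 2 + 1):  # skip DC, keep positive frequencies
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(subframe))
        magnitude = 2.0 * abs(coeff) / n
        if k * sample_rate / n <= band_max_hz:
            in_peak = max(in_peak, magnitude)
        else:
            out_peak = max(out_peak, magnitude)
    return in_peak, out_peak


def is_unvoiced(subframe: Sequence[float], sample_rate: int,
                threshold: float = 0.05) -> bool:
    """Unvoiced: weak in-band content, or dominant content outside the band."""
    in_peak, out_peak = band_peaks(subframe, sample_rate)
    return in_peak <= threshold or out_peak > in_peak
```

A 5 ms sub-frame at 16 kHz holds 80 samples, which gives a frequency resolution of 200 Hz; a practical implementation would probably use a longer analysis window or an FFT library.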
【００２７】
無声期間測定手段２４０には、無声検出手段２３０によって出力された無声部分に関する情報が入力され、無声期間測定手段２４０は、入力された無声部分に関する情報に基づいて無声部分の期間を測定するようになっている。なお、本発明によれば、無声期間測定手段２４０は、無声部分に関する情報に含まれる無声部分が連続して検出された連続回数と所定期間とを掛け合わせて無声部分の期間を測定するようにしてもよい。
無声期間測定手段２４０は、測定された無声部分の期間を表す無声期間情報を無声話速算出手段２５０に出力するようになっている。
【００２８】
無声話速算出手段２５０には、無声期間測定手段２４０によって出力された無声期間情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる文字情報が入力され、無声話速算出手段２５０は、入力された無声期間情報に基づいて合成音声の話速を算出するようになっている。なお、本発明によれば、無声話速算出手段２５０は、１文字あたりの平均的な合成音声が発せられる所定の話速と文字情報を構成するフレームに含まれる文字数とを掛け合わせて、合成音声が発せられる標準時間を算出し、算出された標準時間で無声期間を割って話速変更率を算出するようにしてもよい。
無声話速算出手段２５０は、算出された話速に関する話速情報を音声合成手段２６０に出力するようになっている。
【００２９】
音声合成手段２６０には、入力されたコンテンツに含まれる文字情報が入力され、音声合成手段２６０は、入力された文字情報を所定の話速に基づいて音声合成するようになっている。また、音声合成手段２６０に無声話速算出手段２５０によって話速情報が入力されたときには、音声合成手段２６０は、話速情報すなわち話速変更率に基づいて話速を変更し、変更された話速に基づいて文字情報を音声合成するようになっている。音声合成手段２６０は、素片データベース１６１を含むように構成され、文字情報に含まれるテキストデータから素片データベース１６１を用いて音声素片情報に変換し、変換したそれぞれの音声素片情報を合成するようになっている。
音声合成手段２６０は、音声合成された合成音声情報を音声出力手段２７０に出力するようになっている。
【００３０】
音声出力手段２７０には、音声合成手段２６０によって出力された合成音声情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、音声出力手段２７０は、入力された合成音声情報に応じて合成音声を再生して出力すると共に、入力されたコンテンツに含まれる音声情報に応じて音声を再生して出力するようになっている。
【００３１】
図４は、本発明の第２の実施の形態に係る合成音声出力および音声出力のタイミングチャートである。なお、本発明によれば、図４に示すように音声情報を構成するフレームを再生して出力する期間であるフレーム出力期間２０内に、無声期間２４と無声期間２４以外からなる発声期間２３が複数存在する場合がある。音声出力手段２７０は、無声期間測定手段２４０によって測定された無声期間２４内に合成音声を出力すると共に、発声期間２３内に音声を再生して出力するようになっている。例えば、無声期間２４内に合成音声を出力するとき、音声出力手段２７０は、句点や読点を示すタイミングで合成音声の出力を一時停止し、次の無声期間２４内で合成音声の出力を再開するようになっている。
【００３２】
以下、本発明の第２の実施の形態に係る情報提示装置２００の動作について、図３を参照して説明する。
まず、映像情報、音声情報および文字情報を含むコンテンツは、コンテンツ入力手段１１０によって文字多重放送や記録媒体から入力され、映像提示手段１２０、無声検出手段２３０、無声話速算出手段２５０および音声出力手段２７０に出力される。
次に、コンテンツに含まれる映像情報は、映像提示手段１２０によって再生されてディスプレイなどに提示される。
【００３３】
一方、無声部分は、無声検出手段２３０によってコンテンツに含まれる音声情報から検出され、検出された無声部分に関する情報が無声期間測定手段２４０に出力されたとき、無声期間情報は、無声期間測定手段２４０によって無声部分に関する情報に基づいて無声部分の期間が測定されて生成され、無声話速算出手段２５０に出力される。
話速情報は、無声話速算出手段２５０によって無声期間情報およびコンテンツ入力手段１１０によって出力された文字情報に基づいて算出されて生成され、音声合成手段２６０に出力される。
【００３４】
次に、文字情報に含まれるテキストデータから素片データベース１６１を用いて変換されたそれぞれの音声素片情報は、音声合成手段２６０によって話速情報に基づいて音声合成され、音声合成された合成音声情報は、音声出力手段２７０に出力される。
音声出力手段２７０によって再生される合成音声は、音声合成手段２６０から出力された合成音声情報に応じて無声期間内に出力されると共に、音声出力手段２７０によって再生される音声は、コンテンツ入力手段１１０によって出力された音声情報に応じて出力される。
【００３５】
以上説明したように、本発明の第２の実施の形態に係る情報提示装置は、コンテンツに含まれる音声の無声部分と合成音声とが重複しないように合成音声を出力するため、合成音声の聞き取りをさらに容易にすることができる。
【００３６】
（第３の実施の形態）
図５は、本発明の第３の実施の形態に係る情報提示装置のブロック構成図である。本発明の第３の実施の形態に係る情報提示装置３００は、映像情報、音声情報および文字情報を含むコンテンツを入力するコンテンツ入力手段１１０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる映像情報を提示する映像提示手段１２０、コンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声情報に基づいて発声部分を検出する発声検出手段３３０、発声検出手段３３０によって検出された発声部分の期間を測定する発声期間測定手段３４０、発声期間測定手段３４０によって測定された発声期間に基づいて合成音声の話速を算出する発声話速算出手段３５０、発声期間測定手段３４０によって測定された発声期間だけコンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声を消音する音声消音手段３８０、発声期間測定手段３４０によって測定された発声期間内の音声情報を解析する発声解析手段３９１、発声解析手段３９１によって解析された音声情報から得られる周波数に基づいて前記音声情報を各々の声質に分類する声質分類手段３９２、発声話速算出手段３５０によって算出された合成音声の話速に基づいてコンテンツ入力手段１１０によって入力されたコンテンツに含まれる文字情報を音声合成する音声合成手段３６０、および音声合成手段３６０によって音声合成された合成音声情報に応じて合成音声を出力すると共にコンテンツ入力手段１１０によって入力されたコンテンツに含まれる音声情報に応じて音声を出力する音声出力手段３７０を含むように構成される。
なお、本発明の第３の実施の形態に係る情報提示装置３００を構成する手段のうち、本発明の第１の実施の形態に係る情報提示装置１００を構成する手段と同一の手段には同一の符号を付し、それぞれの説明を省略する。
【００３７】
発声検出手段３３０には、コンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、発声検出手段３３０は、入力されたコンテンツに含まれる音声情報から発声部分を検出するようになっている。発声部分とは、人が発する声からなる音声である。なお、本発明によれば、発声部分を検出する方法に関しては、発声検出手段３３０が音声情報を構成するフレームを所定期間、例えば５ｍｓ程度の期間を構成するサブフレームに分割し、分割された各サブフレームから周波数分析して人の声とされる周波数範囲以内の成分の音圧値が所定の閾値を越えるサブフレームを発声部分として検出するようにしてもよい。人の声とされる周波数範囲は、例えば、約５ｋＨｚ以下となるような範囲である。
発声検出手段３３０は、検出された発声部分に関する情報を発声期間測定手段３４０に出力するようになっている。
【００３８】
発声期間測定手段３４０には、発声検出手段３３０によって出力された発声部分に関する情報が入力され、発声期間測定手段３４０は、入力された発声部分に関する情報に基づいて発声部分の期間を測定するようになっている。なお、本発明によれば、発声期間測定手段３４０は、発声部分に関する情報に含まれる発声部分が連続して検出された連続回数と所定期間とを掛け合わせて発声部分の期間を測定するようにしてもよい。
発声期間測定手段３４０は、測定された発声部分の期間を表す発声期間情報を発声話速算出手段３５０、音声消音手段３８０、および発声解析手段３９１に出力するようになっている。
【００３９】
発声話速算出手段３５０には、発声期間測定手段３４０によって出力された発声期間情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる文字情報が入力され、発声話速算出手段３５０は、入力された発声期間情報に基づいて合成音声の話速を算出するようになっている。なお、本発明によれば、発声話速算出手段３５０は、１文字あたりの平均的な合成音声が発せられる所定の話速と文字情報を構成するフレームに含まれる文字数とを掛け合わせて、合成音声が発せられる標準時間を算出し、算出された標準時間で発声期間を割って話速変更率を算出するようにしてもよい。
発声話速算出手段３５０は、算出された話速に関する話速情報を音声合成手段３６０に出力するようになっている。
【００４０】
音声消音手段３８０には、発声期間測定手段３４０によって出力された発声期間情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、音声消音手段３８０は、入力されたコンテンツに含まれる音声を発声期間だけ消音するようになっており、例えば、音声消音手段３８０は、発声期間に含まれない音声情報を音声出力手段３７０に出力するようになっている。または、音声消音手段３８０は、発声期間だけ消音するように利得を調節するようにしてもよい。
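Paragraph [0040] mentions adjusting the gain so that the content audio is muted only during the utterance periods. A minimal sketch of that idea follows; the `(start_ms, duration_ms)` period format, the function name, and the default gain of zero are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mute_periods(samples: Sequence[float], sample_rate: int,
                 periods_ms: List[Tuple[float, float]],
                 gain: float = 0.0) -> List[float]:
    """Return a copy of samples with gain applied inside each period.

    periods_ms holds (start_ms, duration_ms) pairs, e.g. utterance periods.
    """
    out = list(samples)
    for start_ms, duration_ms in periods_ms:
        first = int(round(start_ms * sample_rate / 1000))
        last = int(round((start_ms + duration_ms) * sample_rate / 1000))
        # Clamp to the buffer so periods running past the end are harmless.
        for i in range(max(first, 0), min(last, len(out))):
            out[i] *= gain
    return out
```

A small non-zero gain could be passed instead of zero if the original voice should remain faintly audible under the synthesized voice; the source itself only describes muting.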
【００４１】
発声解析手段３９１には、発声期間測定手段３４０によって出力された発声期間情報およびコンテンツ入力手段１１０によって出力されたコンテンツに含まれる音声情報が入力され、発声解析手段３９１は、発声期間内の音声情報を解析するようになっている。例えば、発声解析手段３９１は、発声期間内の音声情報を周波数分析して周波数の平均値などを算出し、算出した音声属性情報および音声情報を声質分類手段３９２に出力するようになっている。
【００４２】
声質分類手段３９２には、発声解析手段３９１によって出力された音声属性情報および音声情報が入力され、声質分類手段３９２は、入力された音声属性情報に基づいて音声情報を各々の声質に分類するようになっている。なお、本発明によれば、声質分類手段３９２は、音声属性情報に含まれる周波数平均値が所定の閾値より高い場合、女声に分類し、周波数平均値が所定の閾値より低い場合、男声に分類するようにしてもよい。男声の周波数は、一般的に１００Ｈｚから２００Ｈｚの範囲であるため１００Ｈｚとし、女声の周波数は、一般的に１５０Ｈｚから４００Ｈｚの範囲であるため３００Ｈｚとし、男声周波数１００Ｈｚと女声の周波数３００Ｈｚの中間値をとれば、所定の閾値は、２００Ｈｚ程度になる。また、声質分類手段３９２は、声質を男声、女声の２種類に限定せず、多数の周波数範囲を設けて声質を多種類に分類するようにしてもよい。
声質分類手段３９２は、分類された声質を表す声質情報を音声合成手段３６０に出力するようになっている。
【００４３】
音声合成手段３６０には、入力されたコンテンツに含まれる文字情報が入力され、音声合成手段３６０は、入力された文字情報を所定の話速に基づいて音声合成するようになっている。また、音声合成手段３６０は、男声の音声素片情報を有する素片データベース１６１および女声の音声素片情報を有する素片データベース１６２を含むように構成され、音声合成手段３６０に声質情報が入力されたときには、音声合成手段３６０は、声質情報に応じて文字情報に含まれるテキストデータから素片データベースを用いて音声素片情報に変換し、変換したそれぞれの音声素片情報を合成するようになっており、例えば、声質情報が男声を表していた場合には、素片データベース１６１を用いて音声素片情報に変換し、声質情報が女声を表していた場合には、素片データベース１６２を用いて音声素片情報に変換するようになっている。
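Paragraphs [0042] and [0043] describe classifying a voice by comparing its mean frequency with a threshold of about 200 Hz (the midpoint of the representative values 100 Hz and 300 Hz) and then choosing segment database 161 for a male voice or 162 for a female voice. The sketch below illustrates that decision. The function names and the rule that a mean frequency at or below the threshold counts as male are assumptions made for illustration.

```python
MALE_TYPICAL_HZ = 100.0    # representative male value used in the text
FEMALE_TYPICAL_HZ = 300.0  # representative female value used in the text
THRESHOLD_HZ = (MALE_TYPICAL_HZ + FEMALE_TYPICAL_HZ) / 2  # about 200 Hz

# Reference numerals of the segment databases in the third embodiment.
SEGMENT_DATABASE = {"male": 161, "female": 162}


def classify_voice(mean_frequency_hz: float,
                   threshold_hz: float = THRESHOLD_HZ) -> str:
    """Return 'female' above the threshold, otherwise 'male'."""
    return "female" if mean_frequency_hz > threshold_hz else "male"


def select_segment_database(mean_frequency_hz: float) -> int:
    """Return the reference numeral of the database matching the voice."""
    return SEGMENT_DATABASE[classify_voice(mean_frequency_hz)]
```

Classifying into more than two voice qualities, which the text also allows, would replace the single threshold with a list of frequency ranges.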
【００４４】
また、音声合成手段３６０に発声話速算出手段３５０によって話速情報が入力されたときには、音声合成手段３６０は、話速情報すなわち話速変更率に基づいて話速を変更し、変更された話速に基づいて文字情報を音声合成するようになっている。
音声合成手段３６０は、音声合成された合成音声情報を音声出力手段３７０に出力するようになっている。
【００４５】
音声出力手段３７０には、音声合成手段３６０によって出力された合成音声情報および音声消音手段３８０によって出力された音声情報が入力され、音声出力手段３７０は、入力された合成音声情報に応じて合成音声を再生して出力すると共に、入力されたコンテンツに含まれる音声情報に応じて音声を再生して出力するようになっている。
【００４６】
図６は、本発明の第３の実施の形態に係る合成音声出力および音声出力のタイミングチャートである。なお、本発明によれば、図６に示すように音声情報を構成するフレームを再生して出力する期間であるフレーム出力期間２０内に、無声期間２５と発声期間２６が複数存在する場合がある。音声出力手段３７０は、発声期間測定手段３４０によって測定された発声期間２６内に合成音声を出力すると共に、無声期間２５内に音声を再生して出力するようになっている。
【００４７】
以下、本発明の第３の実施の形態に係る情報提示装置３００の動作について、図５を参照して説明する。
まず、映像情報、音声情報および文字情報を含むコンテンツは、コンテンツ入力手段１１０によって文字多重放送や記録媒体から入力され、映像提示手段１２０、発声検出手段３３０、発声話速算出手段３５０、発声解析手段３９１および音声消音手段３８０などに出力される。
次に、コンテンツに含まれる映像情報は、映像提示手段１２０によって再生されてディスプレイなどに提示される。
【００４８】
一方、発声部分は、発声検出手段３３０によってコンテンツに含まれる音声情報から検出され、検出された発声部分に関する情報が発声期間測定手段３４０に出力されたとき、発声期間情報は、発声期間測定手段３４０によって発声部分に関する情報に基づいて発声部分の期間が測定されて生成され、発声話速算出手段３５０、音声消音手段３８０、および発声解析手段３９１に出力される。
話速情報は、発声話速算出手段３５０によって発声期間情報およびコンテンツ入力手段１１０によって出力された文字情報に基づいて算出されて生成され、音声合成手段３６０に出力される。
【００４９】
次に、音声情報は、発声期間測定手段３４０によって測定された発声期間だけ音声消音手段３８０によって消音され、音声出力手段３７０に出力される。
また、発声期間測定手段３４０によって測定された発声期間内の音声情報は、発声解析手段３９１によって解析され、解析された音声情報などの音声属性情報が声質分類手段３９２に出力され、音声属性情報に基づいて音声情報は、声質分類手段３９２によって各々の声質に分類され、声質情報は、声質分類手段３９２によって生成され、音声合成手段３６０に出力される。
【００５０】
音声合成手段３６０に声質情報が入力されたとき、音声素片情報は、声質情報に応じて文字情報に含まれるテキストデータから素片データベース１６１または素片データベース１６２を用いて音声合成手段３６０によって変換され、変換されたそれぞれの音声素片情報は、話速情報に基づいて音声合成され、音声合成された合成音声情報は、音声出力手段３７０に出力される。
音声出力手段３７０によって再生される合成音声は、音声合成手段３６０から出力された合成音声情報に応じて発声期間内に出力されると共に、音声出力手段３７０によって再生される音声は、コンテンツ入力手段１１０によって出力された音声情報に応じて出力される。
【００５１】
以上説明したように、本発明の第３の実施の形態に係る情報提示装置は、翻訳された字幕を表示する洋画などのコンテンツを視聴するときに発声部分を消音するため、出演者の台詞などを自動的に吹き替えすることができる。
また、翻訳された字幕を表示する洋画などのコンテンツを視聴するときに出演者などの声質に応じて合成音声を出力するため、コンテンツの理解性を向上させることができ、翻訳された字幕を表示する洋画などのコンテンツを視聴するときに出演者などが発声する速度に応じて合成音声を出力するため、リアルに出演者の台詞などを吹き替えすることができる。
【００５２】
【発明の効果】
以上説明したように、本発明は、字幕を表す文字情報を合成音声として聴覚的に出力するため、映像や字幕を容易に視聴することができる情報提示装置を提供するものである。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係る情報提示装置のブロック構成図
【図２】本発明の第１の実施の形態に係る合成音声出力および音声出力のタイミングチャート
【図３】本発明の第２の実施の形態に係る情報提示装置のブロック構成図
【図４】本発明の第２の実施の形態に係る合成音声出力および音声出力のタイミングチャート
【図５】本発明の第３の実施の形態に係る情報提示装置のブロック構成図
【図６】本発明の第３の実施の形態に係る合成音声出力および音声出力のタイミングチャート
【符号の説明】
２０　フレーム出力期間
２１　有音期間
２２　無音期間
２３、２６　発声期間
２４、２５　無声期間
１００、２００、３００　情報提示装置
１１０　コンテンツ入力手段
１２０　映像提示手段
１３０　無音検出手段
１４０　無音期間測定手段
１５０　無音話速算出手段
１６０、２６０、３６０　音声合成手段
１６１、１６２　素片データベース
１７０、２７０、３７０　音声出力手段
２３０　無声検出手段
２４０　無声期間測定手段
２５０　無声話速算出手段
３３０　発声検出手段
３４０　発声期間測定手段
３５０　発声話速算出手段
３８０　音声消音手段
３９１　発声解析手段
３９２　声質分類手段
[0001]
[Technical field of the invention]
The present invention relates to an information presentation apparatus that reproduces information contained in program content, such as text multiplex broadcasts, or in content recorded on a recording medium, and presents that information to a user.
[0002]
[Prior art]
A conventional information presentation apparatus has a large screen display unit and a small screen display unit for displaying program content. The video included in the program content is displayed on the large screen display unit, and, to make the subtitles easier to read, the subtitles included in the program content are displayed on the small screen display unit, which is installed near the user (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP-A-11-196345 (paragraphs 54 to 56, FIG. 1)
[0004]
[Problems to be solved by the invention]
However, the conventional information presentation apparatus described above requires the user to watch both the large screen display unit and the small screen display unit, which makes viewing the video and subtitles cumbersome.
The present invention has been made to solve this problem, and provides an information presentation apparatus with which the video and subtitles included in content can be viewed easily.
[0005]
[Means for Solving the Problems]
An information presentation apparatus according to the present invention comprises: content input means for inputting content including video information and character information; character information extracting means for extracting the character information from the content; and audio output means for converting the character information extracted by the character information extracting means into audio synchronized with the video information and outputting that audio.
With this configuration, the character information representing the subtitles is output audibly as speech, so the video and subtitles can be followed easily.
[0006]
Further, an information presentation apparatus according to the present invention comprises: content input means for inputting content including video information, audio information, and character information; video presenting means for presenting the video information included in the content input by the content input means; voice synthesizing means for synthesizing speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed; and audio output means for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means and for outputting audio in accordance with the audio information included in the content input by the content input means.
With this configuration, the character information representing the subtitles is output audibly as a synthesized voice, so the video and subtitles can be followed easily. Moreover, even on a mobile terminal with a small screen, content including video information, audio information, and character information can be viewed easily.
[0007]
Further, an information presentation apparatus according to the present invention comprises: content input means for inputting content including video information, audio information, and character information; video presenting means for presenting the video information included in the content input by the content input means; voice synthesizing means for synthesizing speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed; and audio output means for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means and for outputting audio in accordance with the audio information included in the content input by the content input means. The apparatus further comprises: silence detecting means for detecting a silent portion in the audio information included in the content input by the content input means; silent period measuring means for measuring the period of the silent portion detected by the silence detecting means; and silent speech speed calculating means for calculating the speech speed of the synthesized voice on the basis of the silent period measured by the silent period measuring means. The voice synthesizing means synthesizes speech from the character information included in the content input by the content input means on the basis of the speech speed calculated by the silent speech speed calculating means, and the audio output means outputs the synthesized voice within the silent period in accordance with the synthesized voice information produced by the voice synthesizing means while outputting audio in accordance with the audio information included in the content input by the content input means.
With this configuration, the synthesized voice is output so that it does not overlap the audio included in the content, which makes the synthesized voice easier to hear.
[0008]
Further, an information presentation apparatus according to the present invention comprises: content input means for inputting content including video information, audio information, and character information; video presenting means for presenting the video information included in the content input by the content input means; voice synthesizing means for synthesizing speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed; and audio output means for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means and for outputting audio in accordance with the audio information included in the content input by the content input means. The apparatus further comprises: unvoiced detecting means for detecting an unvoiced portion in the audio information included in the content input by the content input means; unvoiced period measuring means for measuring the period of the unvoiced portion detected by the unvoiced detecting means; and unvoiced speech speed calculating means for calculating the speech speed of the synthesized voice on the basis of the unvoiced period measured by the unvoiced period measuring means. The voice synthesizing means synthesizes speech from the character information included in the content input by the content input means on the basis of the speech speed calculated by the unvoiced speech speed calculating means, and the audio output means outputs the synthesized voice within the unvoiced period in accordance with the synthesized voice information produced by the voice synthesizing means while outputting audio in accordance with the audio information included in the content input by the content input means.
With this configuration, the synthesized voice is output within the unvoiced portions of the audio included in the content, so that it does not overlap the spoken portions, which makes the synthesized voice even easier to hear.
[0009]
Further, an information presentation apparatus according to the present invention comprises: content input means for inputting content including video information, audio information, and character information; video presenting means for presenting the video information included in the content input by the content input means; voice synthesizing means for synthesizing speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed; and audio output means for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means and for outputting audio in accordance with the audio information included in the content input by the content input means. The apparatus further comprises: utterance detecting means for detecting an uttered portion on the basis of the audio information included in the content input by the content input means; utterance period measuring means for measuring the period of the uttered portion detected by the utterance detecting means; and audio muting means for muting the audio included in the content input by the content input means only during the utterance period measured by the utterance period measuring means. The voice synthesizing means synthesizes speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed, and the audio output means outputs the synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means while outputting audio in accordance with the audio information included in the content input by the content input means.
With this configuration, the uttered portions are muted when content such as a foreign film with translated subtitles is viewed, so the performers' lines can be dubbed automatically.
[0010]
Further, an information presentation apparatus according to the present invention comprises: content input means for inputting content including video information, audio information, and character information; video presenting means for presenting the video information included in the content input by the content input means; voice synthesizing means for synthesizing speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed; audio output means for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means and for outputting audio in accordance with the audio information included in the content input by the content input means; utterance detecting means for detecting an uttered portion on the basis of the audio information included in the content input by the content input means; utterance period measuring means for measuring the period of the uttered portion detected by the utterance detecting means; and audio muting means for muting the audio included in the content input by the content input means only during the utterance period measured by the utterance period measuring means. The apparatus further comprises: utterance analyzing means for analyzing the audio information within the utterance period measured by the utterance period measuring means; and voice quality classifying means for classifying the audio information into voice qualities on the basis of a frequency obtained from the audio information analyzed by the utterance analyzing means. The voice synthesizing means synthesizes speech from the character information included in the content input by the content input means in accordance with each voice quality classified by the voice quality classifying means, and the audio output means outputs the synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means while outputting audio in accordance with the audio information included in the content input by the content input means.
With this configuration, when content such as a foreign film with translated subtitles is viewed, the synthesized voice is output in accordance with the voice quality of each performer, which makes the content easier to understand.
[0011]
Further, an information presentation apparatus according to the present invention comprises: content input means for inputting content including video information, audio information, and character information; video presenting means for presenting the video information included in the content input by the content input means; voice synthesizing means for synthesizing speech from the character information included in the content input by the content input means on the basis of a predetermined speech speed; audio output means for outputting a synthesized voice in accordance with the synthesized voice information produced by the voice synthesizing means and for outputting audio in accordance with the audio information included in the content input by the content input means; utterance detecting means for detecting an uttered portion on the basis of the audio information included in the content input by the content input means; utterance period measuring means for measuring the period of the uttered portion detected by the utterance detecting means; and audio muting means for muting the audio included in the content input by the content input means only during the utterance period measured by the utterance period measuring means. The apparatus further comprises utterance speech speed calculating means for calculating the speech speed of the synthesized voice on the basis of the utterance period measured by the utterance period measuring means. The voice synthesizing means synthesizes speech from the character information included in the content input by the content input means on the basis of the speech speed calculated by the utterance speech speed calculating means, and the audio output means outputs the synthesized voice within the utterance period in accordance with the synthesized voice information produced by the voice synthesizing means while outputting audio in accordance with the audio information included in the content input by the content input means.
With this configuration, when content such as a foreign film with translated subtitles is viewed, the synthesized voice is output at a speed that matches the speed at which the performers speak, so the performers' lines can be dubbed realistically.
[0012]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(First Embodiment)
FIG. 1 is a block diagram of the information presentation device according to the first embodiment of the present invention. The information presentation device 100 according to the first embodiment includes: content input means 110 for inputting content including video information, audio information, and character information; video presenting means 120 for presenting the video information included in the content input by the content input means 110; silence detection means 130 for detecting a silent portion from the audio information included in the content input by the content input means 110; silent period measuring means 140 for measuring the period of the silent portion detected by the silence detection means 130; silent speech speed calculating means 150 for calculating the speech speed of the synthesized speech based on the silent period measured by the silent period measuring means 140; speech synthesis means 160 for speech-synthesizing the character information included in the content input by the content input means 110 based on a predetermined speech speed; and audio output means 170 for outputting synthesized speech in accordance with the synthesized speech information synthesized by the speech synthesis means 160 and for outputting audio in accordance with the audio information included in the content input by the content input means 110. The character information extracting means of the present invention (not shown) may be implemented by the speech synthesis means 160.
[0013]
The content input means 110 inputs content that includes video information, audio information, and character information, such as program content of a text multiplex broadcast or content recorded on a recording medium. It outputs the video information included in the input content to the video presenting means 120, outputs the audio information to the silence detection means 130 and the audio output means 170, and outputs the character information, such as subtitles, to the silent speech speed calculating means 150 and the speech synthesis means 160.
In the present invention, the content is, for example, a foreign film with translated subtitles, and the video information, audio information, and character information included in the content consist of frames delimited by, for example, the timing at which the subtitles presented on the screen switch.
[0014]
Video information included in the content output by the content input means 110 is input to the video presenting means 120, and the video presenting means 120 reproduces and presents the input video information. According to the present invention, the video presenting means 120 is configured to include a liquid crystal display or the like, but may be configured by any means for presenting video information.
[0015]
The audio information included in the content output by the content input means 110 is input to the silence detection means 130, which detects silent portions from that audio information. A silent portion is audio consisting of sound that is inaudible to the human ear. In the present invention, as a method of detecting silent portions, the silence detection means 130 may divide each frame constituting the audio information into subframes of a predetermined period, for example about 5 ms, calculate the average sound pressure value of each subframe, and detect as a silent portion any subframe whose average sound pressure value does not exceed a predetermined threshold. The silence detection means 130 outputs information on the detected silent portions to the silent period measuring means 140.
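The subframe-based detection described above can be sketched as follows. This is only an illustrative sketch, not the device's actual implementation: the function name, the use of mean absolute amplitude as a stand-in for the "average sound pressure value", and the threshold value are all assumptions.

```python
def detect_silent_subframes(samples, sample_rate, subframe_ms=5.0, threshold=0.01):
    """Return one boolean per complete subframe: True where the subframe is silent.

    Hypothetical helper: `threshold` and the amplitude measure are assumptions.
    """
    # Number of samples in one subframe (about 5 ms in the description).
    subframe_len = max(1, int(sample_rate * subframe_ms / 1000.0))
    flags = []
    # Walk over complete subframes only; a trailing partial subframe is ignored.
    for start in range(0, len(samples) - subframe_len + 1, subframe_len):
        subframe = samples[start:start + subframe_len]
        # Mean absolute amplitude stands in for the average sound pressure value.
        mean_level = sum(abs(s) for s in subframe) / subframe_len
        # A subframe is silent when its average does not exceed the threshold.
        flags.append(mean_level <= threshold)
    return flags
```

At a sampling rate of 8 kHz, for instance, each 5 ms subframe holds 40 samples, so a frame of 120 samples yields three flags.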
[0016]
The information on the silent portions output by the silence detection means 130 is input to the silent period measuring means 140, which measures the period of each silent portion based on that information. In the present invention, the silent period measuring means 140 may measure the period of a silent portion by multiplying the number of consecutive detections of the silent portion, included in the information on the silent portion, by the predetermined period.
The silence period measuring means 140 outputs silence period information indicating the period of the measured silence portion to the silent voice speed calculating means 150.
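As a rough illustration of the period measurement, the sketch below multiplies the number of consecutive silent subframes by the subframe period. The function name and the list-of-booleans input format are assumptions made for the example.

```python
def measure_silent_periods(silent_flags, subframe_ms=5.0):
    """Return (start_ms, duration_ms) for each run of consecutive silent subframes.

    Hypothetical helper: `silent_flags` holds one boolean per subframe.
    """
    periods = []
    run_start = None  # index of the first subframe of the current silent run
    for index, is_silent in enumerate(silent_flags):
        if is_silent and run_start is None:
            run_start = index
        elif not is_silent and run_start is not None:
            # Duration = number of consecutive detections x subframe period.
            periods.append((run_start * subframe_ms, (index - run_start) * subframe_ms))
            run_start = None
    if run_start is not None:
        # Close a run that extends to the end of the frame.
        count = len(silent_flags) - run_start
        periods.append((run_start * subframe_ms, count * subframe_ms))
    return periods
```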
[0017]
The silent period information output by the silent period measuring means 140 and the character information included in the content output by the content input means 110 are input to the silent speech speed calculating means 150, which calculates the speech speed of the synthesized speech based on the input silent period information. In the present invention, the silent speech speed calculating means 150 may multiply a predetermined speech speed, expressed as the average time taken to utter one character of synthesized speech, by the number of characters included in the frame constituting the character information to obtain the standard time taken to utter the synthesized speech, and may then divide the silent period by this standard time to obtain a speech speed change rate.
The silent voice speed calculating means 150 outputs voice speed information relating to the calculated voice speed to the voice synthesizing means 160.
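The calculation in this paragraph amounts to two arithmetic steps, sketched below. The function name and the per-character time of 0.25 s are assumptions. The sketch also assumes that the resulting ratio scales the standard utterance duration, so a value below 1 would mean the synthesized speech has to be spoken faster to fit the silent period.

```python
def speech_speed_change_rate(silent_period_s, num_chars, seconds_per_char=0.25):
    """Return silent period / standard time, as described in paragraph [0017].

    Hypothetical helper: `seconds_per_char` is an assumed average time per character.
    """
    # Standard time = average time per character x number of characters.
    standard_time_s = seconds_per_char * num_chars
    if standard_time_s <= 0:
        raise ValueError("character information must contain at least one character")
    # Speech speed change rate = silent period / standard time.
    return silent_period_s / standard_time_s
```

For example, eight characters at 0.25 s per character give a standard time of 2.0 s, so a 3.0 s silent period gives a change rate of 1.5.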
[0018]
The character information included in the input content is input to the speech synthesis means 160, which speech-synthesizes the input character information based on a predetermined speech speed. When speech speed information is input from the silent speech speed calculating means 150, the speech synthesis means 160 changes the speech speed based on that information, that is, the speech speed change rate, and speech-synthesizes the character information at the changed speech speed. The speech synthesis means 160 includes a speech unit database 161; it converts the text data included in the character information into speech unit information using the speech unit database 161 and synthesizes the converted speech unit information.
The voice synthesizing unit 160 outputs synthesized voice information obtained by voice synthesis to the voice output unit 170. According to the present invention, the character information extracting unit may extract the character information from the content output by the content input unit 110 and output the extracted character information to the audio output unit 170.
[0019]
The synthesized speech information output by the speech synthesis means 160 and the audio information included in the content output by the content input means 110 are input to the audio output means 170. The audio output means 170 reproduces and outputs synthesized speech in accordance with the input synthesized speech information, and reproduces and outputs audio in accordance with the audio information included in the input content. In the present invention, the audio output means 170 may convert the character information output by the character information extracting means into audio synchronized with the video information and output that audio.
[0020]
FIG. 2 is a timing chart of the synthesized speech output and the audio output according to the first embodiment of the present invention. In the present invention, as shown in FIG. 2, a plurality of silent periods 22 and a plurality of sound periods 21, which are the periods other than the silent periods 22, may exist within a frame output period 20, which is the period during which a frame constituting the audio information is reproduced and output. The audio output means 170 outputs the synthesized speech during the silent periods 22 measured by the silent period measuring means 140, and reproduces and outputs the audio during the sound periods 21. For example, when outputting synthesized speech during a silent period 22, the audio output means 170 suspends the output at a timing corresponding to a period or comma and resumes it within the next silent period 22.
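The suspend-and-resume behavior can be pictured with the following sketch, which splits the subtitle text at periods and commas and assigns each chunk to the first silent period with enough time left. The function names, the greedy assignment, and the fixed per-character time are assumptions for illustration; the description itself also adjusts the speech speed, which this sketch leaves out.

```python
import re


def split_at_punctuation(text):
    """Split subtitle text into chunks that each end at a period or comma."""
    chunks = re.findall(r"[^,.、。]+[,.、。]?", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def schedule_chunks(chunks, silent_periods_s, seconds_per_char=0.25):
    """Greedily assign chunks to silent periods; returns (period_index, chunk) pairs.

    Hypothetical helper: output pauses at a chunk boundary and resumes in the
    next silent period when the current one has too little time left.
    """
    schedule = []
    period_index = 0
    remaining = silent_periods_s[0] if silent_periods_s else 0.0
    for chunk in chunks:
        needed = len(chunk) * seconds_per_char
        # Move to the next silent period while the chunk does not fit.
        while period_index < len(silent_periods_s) and needed > remaining:
            period_index += 1
            if period_index < len(silent_periods_s):
                remaining = silent_periods_s[period_index]
        if period_index >= len(silent_periods_s):
            break  # no silent period left within this frame output period
        schedule.append((period_index, chunk))
        remaining -= needed
    return schedule
```

With silent periods of 2.0 s and 3.0 s, the text "Hello, world. Bye." would be paused after "Hello," and resumed in the second silent period.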
[0021]
Hereinafter, the operation of the information presentation apparatus 100 according to the first embodiment of the present invention will be described with reference to FIG.
First, content including video information, audio information, and character information is input from a text multiplex broadcast or a recording medium by the content input means 110 and is output to the video presenting means 120, the silence detection means 130, the silent speech speed calculating means 150, and the audio output means 170.
Next, the video information included in the content is reproduced by the video presenting means 120 and presented on a display or the like.
[0022]
Meanwhile, the silence detection means 130 detects silent portions from the audio information included in the content and outputs information on the detected silent portions to the silent period measuring means 140. The silent period measuring means 140 measures the period of each silent portion based on that information, generates silent period information, and outputs it to the silent speech speed calculating means 150.
The silent speech speed calculating means 150 calculates and generates speech speed information based on the silent period information and the character information output by the content input means 110, and outputs it to the speech synthesis means 160.
[0023]
Next, the speech synthesis means 160 converts the text data included in the character information into speech unit information using the speech unit database 161, speech-synthesizes the converted speech unit information based on the speech speed information, and outputs the resulting synthesized speech information to the audio output means 170.
The audio output means 170 reproduces the synthesized speech in accordance with the synthesized speech information output from the speech synthesis means 160 and outputs it within the silent periods, and reproduces and outputs the audio in accordance with the audio information output by the content input means 110.
[0024]
As described above, the information presentation apparatus according to the first embodiment of the present invention outputs textual information representing subtitles as a synthetic sound, so that the user can easily view the video and subtitles. Even when a mobile terminal having a narrow screen is used, it is possible to easily view contents including video information, audio information, and text information.
Further, since the synthesized voice is output so that the voice included in the content does not overlap with the synthesized voice, it is possible to easily hear the synthesized voice.
[0025]
(Second embodiment)
FIG. 3 is a block diagram of the information presentation device according to the second embodiment of the present invention. The information presentation device 200 according to the second embodiment includes: content input means 110 for inputting content including video information, audio information, and character information; video presenting means 120 for presenting the video information included in the content input by the content input means 110; unvoiced detection means 230 for detecting an unvoiced portion from the audio information included in the content input by the content input means 110; unvoiced period measuring means 240 for measuring the period of the unvoiced portion detected by the unvoiced detection means 230; unvoiced speech speed calculating means 250 for calculating the speech speed of the synthesized speech based on the unvoiced period measured by the unvoiced period measuring means 240; speech synthesis means 260 for speech-synthesizing the character information included in the content input by the content input means 110 based on the speech speed calculated by the unvoiced speech speed calculating means 250; and audio output means 270 for outputting synthesized speech in accordance with the synthesized speech information synthesized by the speech synthesis means 260 and for outputting audio in accordance with the audio information included in the content input by the content input means 110.
Of the means constituting the information presentation device 200 according to the second embodiment, those identical to the means constituting the information presentation device 100 according to the first embodiment are denoted by the same reference numerals, and their description is omitted.
[0026]
The audio information included in the content output by the content input means 110 is input to the unvoiced detection means 230, which detects unvoiced portions from that audio information. An unvoiced portion is audio that consists of noise or sound effects, in addition to silent portions. In the present invention, as a method of detecting unvoiced portions, the unvoiced detection means 230 may divide each frame constituting the audio information into subframes of a predetermined period, for example about 5 ms, perform frequency analysis on each subframe, and detect as an unvoiced portion any subframe that has no component within the frequency range regarded as human voice and any subframe in which the sound pressure value of the component within that frequency range is at or below a predetermined threshold. The frequency range of the human voice is, for example, about 5 kHz or less.
The unvoiced detecting means 230 outputs information on the detected unvoiced portion to the unvoiced period measuring means 240.
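A minimal sketch of the frequency-based test follows. It computes a plain discrete Fourier transform of one subframe and sums the magnitudes of the components at or below about 5 kHz; a subframe with no in-band component gives a level near zero, so both conditions described above reduce to one comparison. The function names, the level measure, and the threshold are assumptions, and a practical implementation would more likely use an FFT library.

```python
import cmath
import math


def voice_band_level(subframe, sample_rate, max_voice_hz=5000.0):
    """Sum of normalized DFT magnitudes at or below max_voice_hz (DC excluded)."""
    n = len(subframe)
    level = 0.0
    for k in range(1, n // 2 + 1):
        if k * sample_rate / n > max_voice_hz:
            break  # remaining bins lie above the assumed voice range
        # k-th DFT coefficient of the subframe.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(subframe))
        level += abs(coeff) / n
    return level


def is_unvoiced(subframe, sample_rate, threshold=0.01):
    """True when the in-band level is at or below the (assumed) threshold."""
    return voice_band_level(subframe, sample_rate) <= threshold
```

With 16 kHz sampling and 80-sample (5 ms) subframes, a 1 kHz tone falls inside the voice band and is not unvoiced, while a 7 kHz tone or digital silence is treated as unvoiced.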
[0027]
The information on the unvoiced portions output by the unvoiced detection means 230 is input to the unvoiced period measuring means 240, which measures the period of each unvoiced portion based on that information. In the present invention, the unvoiced period measuring means 240 may measure the period of an unvoiced portion by multiplying the number of consecutive detections of the unvoiced portion, included in the information on the unvoiced portion, by the predetermined period.
The unvoiced period measuring means 240 outputs unvoiced period information indicating the measured period of the unvoiced portion to the unvoiced speech speed calculating means 250.
[0028]
The unvoiced period information output by the unvoiced period measuring means 240 and the character information included in the content output by the content input means 110 are input to the unvoiced speech speed calculating means 250, which calculates the speech speed of the synthesized speech based on the input unvoiced period information. In the present invention, the unvoiced speech speed calculating means 250 may multiply a predetermined speech speed, expressed as the average time taken to utter one character of synthesized speech, by the number of characters included in the frame constituting the character information to obtain the standard time taken to utter the synthesized speech, and may then divide the unvoiced period by this standard time to obtain a speech speed change rate.
The unvoiced speech speed calculation means 250 outputs speech speed information on the calculated speech speed to the speech synthesis means 260.
[0029]
The character information included in the input content is input to the speech synthesis means 260, which speech-synthesizes the input character information based on a predetermined speech speed. When speech speed information is input from the unvoiced speech speed calculating means 250, the speech synthesis means 260 changes the speech speed based on that information, that is, the speech speed change rate, and speech-synthesizes the character information at the changed speech speed. The speech synthesis means 260 includes the speech unit database 161; it converts the text data included in the character information into speech unit information using the speech unit database 161 and synthesizes the converted speech unit information.
The voice synthesizer 260 outputs the synthesized voice information obtained by voice synthesis to the voice output unit 270.
[0030]
The synthesized speech information output by the speech synthesis means 260 and the audio information included in the content output by the content input means 110 are input to the audio output means 270. The audio output means 270 reproduces and outputs synthesized speech in accordance with the input synthesized speech information, and reproduces and outputs audio in accordance with the audio information included in the input content.
[0031]
FIG. 4 is a timing chart of the synthesized speech output and the audio output according to the second embodiment of the present invention. In the present invention, as shown in FIG. 4, a plurality of unvoiced periods 24 and a plurality of utterance periods 23, which are the periods other than the unvoiced periods 24, may exist within a frame output period 20, which is the period during which a frame constituting the audio information is reproduced and output. The audio output means 270 outputs the synthesized speech during the unvoiced periods 24 measured by the unvoiced period measuring means 240, and reproduces and outputs the audio during the utterance periods 23. For example, when outputting synthesized speech during an unvoiced period 24, the audio output means 270 suspends the output at a timing corresponding to a period or comma and resumes it within the next unvoiced period 24.
[0032]
Hereinafter, the operation of the information presentation device 200 according to the second embodiment of the present invention will be described with reference to FIG.
First, content including video information, audio information, and character information is input from a text multiplex broadcast or a recording medium by the content input means 110 and is output to the video presenting means 120, the unvoiced detection means 230, the unvoiced speech speed calculating means 250, and the audio output means 270.
Next, the video information included in the content is reproduced by the video presenting means 120 and presented on a display or the like.
[0033]
Meanwhile, the unvoiced detection means 230 detects unvoiced portions from the audio information included in the content and outputs information on the detected unvoiced portions to the unvoiced period measuring means 240. The unvoiced period measuring means 240 measures the period of each unvoiced portion based on that information, generates unvoiced period information, and outputs it to the unvoiced speech speed calculating means 250.
The unvoiced speech speed calculating means 250 calculates and generates speech speed information based on the unvoiced period information and the character information output by the content input means 110, and outputs it to the speech synthesis means 260.
[0034]
Next, the speech synthesis means 260 converts the text data included in the character information into speech unit information using the speech unit database 161, speech-synthesizes the converted speech unit information based on the speech speed information, and outputs the resulting synthesized speech information to the audio output means 270.
The audio output means 270 reproduces the synthesized speech in accordance with the synthesized speech information output from the speech synthesis means 260 and outputs it within the unvoiced periods, and reproduces and outputs the audio in accordance with the audio information output by the content input means 110.
[0035]
As described above, the information presentation device according to the second embodiment of the present invention outputs the synthesized speech so that it does not overlap the portions of the audio included in the content other than the unvoiced portions, which makes the synthesized speech even easier to hear.
[0036]
(Third embodiment)
FIG. 5 is a block diagram of the information presentation device according to the third embodiment of the present invention. The information presentation device 300 according to the third embodiment includes: content input means 110 for inputting content including video information, audio information, and character information; video presenting means 120 for presenting the video information included in the content input by the content input means 110; utterance detection means 330 for detecting an utterance portion based on the audio information included in the content input by the content input means 110; utterance period measuring means 340 for measuring the period of the utterance portion detected by the utterance detection means 330; utterance speech speed calculating means 350 for calculating the speech speed of the synthesized speech based on the utterance period measured by the utterance period measuring means 340; sound muting means 380 for muting the audio included in the content input by the content input means 110 for only the utterance period measured by the utterance period measuring means 340; utterance analysis means 391 for analyzing the audio information within the utterance period measured by the utterance period measuring means 340; voice quality classification means 392 for classifying the audio information into a voice quality based on the frequency obtained from the audio information analyzed by the utterance analysis means 391; speech synthesis means 360 for speech-synthesizing the character information included in the content input by the content input means 110 based on the speech speed calculated by the utterance speech speed calculating means 350; and audio output means 370 for outputting synthesized speech in accordance with the synthesized speech information synthesized by the speech synthesis means 360 and for outputting audio in accordance with the audio information included in the content input by the content input means 110.
Of the means constituting the information presentation device 300 according to the third embodiment, those identical to the means constituting the information presentation device 100 according to the first embodiment are denoted by the same reference numerals, and their description is omitted.
[0037]
The audio information included in the content output by the content input means 110 is input to the utterance detection means 330, which detects utterance portions from that audio information. An utterance portion is audio consisting of a human voice. In the present invention, as a method of detecting utterance portions, the utterance detection means 330 may divide each frame constituting the audio information into subframes of a predetermined period, for example about 5 ms, and detect as an utterance portion any subframe in which the sound pressure value of the component within the frequency range regarded as human voice exceeds a predetermined threshold. The frequency range of the human voice is, for example, about 5 kHz or less.
The utterance detection means 330 outputs information on the detected utterance part to the utterance period measurement means 340.
[0038]
The information on the utterance portions output by the utterance detection means 330 is input to the utterance period measuring means 340, which measures the period of each utterance portion based on that information. In the present invention, the utterance period measuring means 340 may measure the period of an utterance portion by multiplying the number of consecutive detections of the utterance portion, included in the information on the utterance portion, by the predetermined period.
The utterance period measuring unit 340 outputs utterance period information indicating the measured period of the utterance part to the utterance speed calculation unit 350, the sound mute unit 380, and the utterance analysis unit 391.
[0039]
The utterance period information output by the utterance period measuring means 340 and the character information included in the content output by the content input means 110 are input to the utterance speech speed calculating means 350, which calculates the speech speed of the synthesized speech based on the input utterance period information. In the present invention, the utterance speech speed calculating means 350 may multiply a predetermined speech speed, expressed as the average time taken to utter one character of synthesized speech, by the number of characters included in the frame constituting the character information to obtain the standard time taken to utter the synthesized speech, and may then divide the utterance period by this standard time to obtain a speech speed change rate.
The utterance voice speed calculating means 350 outputs voice speed information relating to the calculated voice speed to the voice synthesizing means 360.
[0040]
The utterance period information output by the utterance period measuring means 340 and the audio information included in the content output by the content input means 110 are input to the sound muting means 380, which mutes the audio included in the input content for only the utterance period. For example, the sound muting means 380 outputs to the audio output means 370 only the audio information that falls outside the utterance period. Alternatively, the sound muting means 380 may adjust the gain so that the audio is muted only during the utterance period.
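The gain-based variant of muting can be illustrated as below, where the gain is set to zero for samples that fall inside an utterance period and left unchanged elsewhere. The function name and the representation of utterance periods as (start, end) pairs in seconds are assumptions for this sketch.

```python
def mute_utterance_periods(samples, sample_rate, utterance_periods_s):
    """Return a copy of `samples` silenced inside each (start_s, end_s) period.

    Hypothetical helper: samples outside every utterance period keep their value.
    """
    muted = list(samples)
    for start_s, end_s in utterance_periods_s:
        # Convert the period boundaries to sample indices, clamped to the buffer.
        first = max(0, int(start_s * sample_rate))
        last = min(len(muted), int(end_s * sample_rate))
        for i in range(first, last):
            muted[i] = 0.0  # gain of zero only during the utterance period
    return muted
```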
[0041]
The utterance period information output by the utterance period measuring means 340 and the audio information included in the content output by the content input means 110 are input to the utterance analysis means 391, which analyzes the audio information within the utterance period. For example, the utterance analysis means 391 outputs to the voice quality classification means 392 voice attribute information, such as the average frequency calculated by performing frequency analysis on the audio information within the utterance period, together with the audio information.
[0042]
The voice attribute information and the audio information output by the utterance analysis means 391 are input to the voice quality classification means 392, which classifies the audio information into a voice quality based on the input voice attribute information. In the present invention, the voice quality classification means 392 may classify the audio as a female voice when the average frequency included in the voice attribute information is higher than a predetermined threshold, and as a male voice when the average frequency is lower than the predetermined threshold. The frequency of a male voice is generally 100 Hz to 200 Hz and is taken here as 100 Hz; the frequency of a female voice is generally 150 Hz to 400 Hz and is taken here as 300 Hz. If the midpoint between the male voice frequency of 100 Hz and the female voice frequency of 300 Hz is used, the predetermined threshold is about 200 Hz. The voice quality classification means 392 is not limited to the two voice qualities of male and female; it may provide many frequency ranges and classify voice quality into many types.
The voice quality classification means 392 outputs voice quality information indicating the classified voice quality to the voice synthesis means 360.
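The threshold-based classification can be sketched as follows. The function names are assumptions, and because the description does not say how an average exactly equal to the threshold is handled, the sketch arbitrarily treats that case as a male voice.

```python
def midpoint_threshold(male_hz=100.0, female_hz=300.0):
    """Midpoint between the representative male and female voice frequencies."""
    return (male_hz + female_hz) / 2.0


def classify_voice_quality(mean_frequency_hz, threshold_hz=200.0):
    """Return "female" above the threshold, otherwise "male".

    Hypothetical helper: the equal-to-threshold case is an assumption.
    """
    return "female" if mean_frequency_hz > threshold_hz else "male"
```

An average frequency of 120 Hz would thus be classified as a male voice and 260 Hz as a female voice. Finer classification would replace the single threshold with several frequency ranges.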
[0043]
The character information included in the input content is input to the speech synthesis means 360, which speech-synthesizes the input character information based on a predetermined speech speed. The speech synthesis means 360 includes a speech unit database 161 holding speech unit information for a male voice and a speech unit database 162 holding speech unit information for a female voice. The speech synthesis means 360 converts the text data included in the character information into speech unit information using the speech unit database that corresponds to the voice quality information, and synthesizes the converted speech unit information. For example, if the voice quality information indicates a male voice, the text data is converted into speech unit information using the speech unit database 161, and if the voice quality information indicates a female voice, the text data is converted into speech unit information using the speech unit database 162.
[0044]
When speech speed information is input to the speech synthesis means 360 from the utterance speech speed calculating means 350, the speech synthesis means 360 changes the speech speed based on that information, that is, the speech speed change rate, and speech-synthesizes the character information at the changed speech speed.
The voice synthesizing unit 360 outputs synthesized voice information obtained by voice synthesis to the voice output unit 370.
[0045]
The synthesized speech information output by the speech synthesis means 360 and the audio information output by the sound muting means 380 are input to the audio output means 370. The audio output means 370 reproduces and outputs synthesized speech in accordance with the input synthesized speech information, and reproduces and outputs audio in accordance with the input audio information.
[0046]
FIG. 6 is a timing chart of the synthesized speech output and the audio output according to the third embodiment of the present invention. In the present invention, as shown in FIG. 6, a plurality of unvoiced periods 25 and a plurality of utterance periods 26 may exist within a frame output period 20, which is the period during which a frame constituting the audio information is reproduced and output. The audio output means 370 outputs the synthesized speech during the utterance periods 26 measured by the utterance period measuring means 340, and reproduces and outputs the audio during the unvoiced periods 25.
[0047]
Hereinafter, the operation of the information presentation device 300 according to the third embodiment of the present invention will be described with reference to FIG.
First, content including video information, audio information, and character information is input from a text multiplex broadcast or a recording medium by the content input means 110 and is output to the video presenting means 120, the utterance detection means 330, the utterance speech speed calculating means 350, the utterance analysis means 391, and the sound muting means 380.
Next, the video information included in the content is reproduced by the video presenting means 120 and presented on a display or the like.
[0048]
Meanwhile, the utterance detection means 330 detects the utterance portion from the voice information included in the content and outputs information on the detected utterance portion to the utterance period measurement means 340. The utterance period measurement means 340 measures the period of the utterance portion based on that information, generates utterance period information, and outputs it to the utterance speed calculation means 350, the sound muffling means 380, and the utterance analysis means 391.
The utterance speed calculation means 350 calculates and generates the speech speed information based on the utterance period information and the character information output by the content input means 110, and outputs it to the voice synthesis means 360.
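The measurement of utterance periods and the derivation of a speech speed change rate can be sketched as follows. The description does not state how the utterance portion is detected or which formula is used, so this sketch makes assumptions: the detector is assumed to deliver one voiced/unvoiced flag per fixed-length analysis frame, the text length is counted in characters, and the synthesizer is assumed to have a default rate in characters per second. The names `measure_utterance_periods`, `speech_speed_change_rate`, and `default_chars_per_sec` are hypothetical.

```python
from typing import List, Tuple


def measure_utterance_periods(
    is_voiced: List[bool], frame_sec: float
) -> List[Tuple[float, float]]:
    """Turn per-frame voiced flags into (start, end) periods in seconds."""
    periods = []
    run_start = None
    for index, voiced in enumerate(is_voiced):
        if voiced and run_start is None:
            run_start = index  # an utterance portion begins
        elif not voiced and run_start is not None:
            periods.append((run_start * frame_sec, index * frame_sec))
            run_start = None
    if run_start is not None:
        # The utterance portion runs to the end of the input.
        periods.append((run_start * frame_sec, len(is_voiced) * frame_sec))
    return periods


def speech_speed_change_rate(
    text: str, period_sec: float, default_chars_per_sec: float
) -> float:
    """Rate by which the default speech speed must be scaled so that
    the text fits exactly into the utterance period (assumed formula)."""
    if period_sec <= 0 or default_chars_per_sec <= 0:
        raise ValueError("period and default rate must be positive")
    required_chars_per_sec = len(text) / period_sec
    return required_chars_per_sec / default_chars_per_sec
```

Under these assumptions, a change rate above 1 means the caption must be spoken faster than the default to fit the performer's utterance period, and a rate below 1 means it can be spoken more slowly.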
[0049]
Next, the sound muffling means 380 mutes the voice information only during the utterance period measured by the utterance period measuring means 340, and outputs the result to the voice output means 370.
The utterance analysis means 391 analyzes the voice information within the utterance period measured by the utterance period measuring means 340, and outputs voice attribute information obtained by the analysis to the voice quality classification means 392. Based on this information, the voice quality classification means 392 classifies the voice information into respective voice qualities, generates voice quality information, and outputs it to the voice synthesis means 360.
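The classification into voice qualities can be illustrated with a frequency-based sketch. The claims state only that the classification is based on a frequency obtained from the analyzed voice information; everything else here is an assumption made for illustration: the frequency is estimated by counting positive-going zero crossings, two classes are separated by a fixed threshold, and the mapping of each class to speech unit database 161 or 162 is hypothetical.

```python
from typing import List

# Assumed boundary between the two voice-quality classes, in Hz.
PITCH_THRESHOLD_HZ = 165.0

# Hypothetical mapping of voice quality to speech unit database numbers.
UNIT_DATABASES = {"low": 161, "high": 162}


def estimate_frequency(samples: List[float], sample_rate: int) -> float:
    """Rough frequency estimate from positive-going zero crossings."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if prev < 0.0 <= cur
    )
    duration_sec = len(samples) / sample_rate
    return crossings / duration_sec


def classify_voice_quality(samples: List[float], sample_rate: int) -> str:
    """Classify an utterance as 'low' or 'high' pitched (assumed classes)."""
    frequency = estimate_frequency(samples, sample_rate)
    return "low" if frequency < PITCH_THRESHOLD_HZ else "high"


def select_unit_database(voice_quality: str) -> int:
    """Pick the speech unit database used for synthesis."""
    return UNIT_DATABASES[voice_quality]
```

Zero-crossing counting is chosen only because it needs no external library; real speech would call for a more robust pitch estimator, and the number of voice-quality classes is not limited to two by this sketch.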
[0050]
When voice quality information is input to the voice synthesis means 360, the voice synthesis means 360 converts the text data included in the character information into speech unit information, using the speech unit database 161 or the speech unit database 162 according to the voice quality information. The converted speech unit information is voice-synthesized based on the speech speed information, and the resulting synthesized voice information is output to the voice output means 370.
The voice output means 370 reproduces and outputs the synthesized voice within the utterance period in accordance with the synthesized voice information output from the voice synthesis means 360, and reproduces and outputs audio in accordance with the audio information output by the content input means 110.
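The combined effect of muting and voice output at the sample level can be sketched as below. This is an illustrative simplification, not the embodiment itself: audio is modeled as a list of float samples, the synthesized voice for each utterance period is assumed to be already available as samples at the same sample rate, and the name `render_output` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def render_output(
    content_samples: List[float],
    sample_rate: int,
    dubbed: List[Tuple[Interval, List[float]]],
) -> List[float]:
    """Replace the content audio with synthesized voice inside each
    utterance period and keep the content audio everywhere else.

    `dubbed` pairs each utterance period (start_sec, end_sec) with the
    synthesized samples to play in it.
    """
    output = list(content_samples)  # leave the input untouched
    for (start_sec, end_sec), synthesized in dubbed:
        first = max(0, int(round(start_sec * sample_rate)))
        last = min(len(output), int(round(end_sec * sample_rate)))
        for offset, index in enumerate(range(first, last)):
            # The original voice is muted; if the synthesized voice is
            # shorter than the period, the remainder stays silent.
            output[index] = synthesized[offset] if offset < len(synthesized) else 0.0
    return output
```

Synthesized samples that would run past the end of the utterance period are simply dropped in this sketch, which is one reason the speech speed is adjusted to the utterance period beforehand.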
[0051]
As described above, the information presenting apparatus according to the third embodiment of the present invention silences the utterance portions when content that displays translated subtitles, such as a foreign film, is viewed, so that the content can be dubbed automatically.
In addition, when such content is viewed, the synthesized voice is output according to the voice quality of each performer, which improves comprehension of the content. Furthermore, because the synthesized voice is output at a speed matched to the speed at which the performer speaks, the performer's lines can be dubbed realistically.
[0052]
[Effect of the invention]
As described above, the present invention provides an information presentation device with which video and captions can be followed easily, because the text information representing the captions is output audibly as synthesized voice.
[Brief description of the drawings]
FIG. 1 is a block diagram of an information presentation device according to a first embodiment of the present invention.
FIG. 2 is a timing chart of a synthesized speech output and a speech output according to the first embodiment of the present invention.
FIG. 3 is a block diagram of an information presentation device according to a second embodiment of the present invention.
FIG. 4 is a timing chart of a synthesized voice output and a voice output according to a second embodiment of the present invention.
FIG. 5 is a block diagram of an information presentation device according to a third embodiment of the present invention.
FIG. 6 is a timing chart of a synthesized voice output and a voice output according to a third embodiment of the present invention.
[Explanation of symbols]
20 Frame output period
21 Sound period
22 Silence period
23, 26 Utterance period
24, 25 Unvoiced period
100, 200, 300 Information presentation device
110 Content input means
120 Video presentation means
130 Silence detection means
140 Silence period measuring means
150 Silence speech speed calculation means
160, 260, 360 Voice synthesis means
161, 162 Speech unit database
170, 270, 370 Voice output means
230 Unvoiced detection means
240 Unvoiced period measuring means
250 Unvoiced speech speed calculation means
330 Utterance detection means
340 Utterance period measuring means
350 Utterance speed calculation means
380 Sound muffling means
391 Utterance analysis means
392 Voice quality classification means

Claims

An information presentation device comprising: content input means for inputting content including video information and text information; text information extraction means for extracting the text information from the content; and voice output means for converting the text information extracted by the text information extraction means into audio synchronized with the video information and outputting the audio.

The information presentation device according to claim 1, further comprising voice synthesis means for voice-synthesizing, based on a predetermined speech speed, the text information included in the content input by the content input means, wherein the content input means inputs content that includes voice information in addition to the video information and the text information, and the voice output means outputs a synthesized voice according to the synthesized voice information synthesized by the voice synthesis means and outputs a voice according to the voice information included in the content input by the content input means.

The information presentation device according to claim 2, further comprising: silence detection means for detecting a silent portion from the voice information included in the content input by the content input means; silence period measuring means for measuring the period of the silent portion detected by the silence detection means; and silence speech speed calculation means for calculating the speech speed of the synthesized voice based on the silence period measured by the silence period measuring means, wherein the voice synthesis means voice-synthesizes the text information included in the content input by the content input means based on the speech speed of the synthesized voice calculated by the silence speech speed calculation means, and the voice output means outputs the synthesized voice within the silence period according to the synthesized voice information synthesized by the voice synthesis means and outputs a voice according to the voice information included in the content input by the content input means.

The information presentation device according to claim 3, further comprising: unvoiced detection means for detecting an unvoiced portion from the voice information included in the content input by the content input means; unvoiced period measuring means for measuring the period of the unvoiced portion detected by the unvoiced detection means; and unvoiced speech speed calculation means for calculating the speech speed of the synthesized voice based on the unvoiced period measured by the unvoiced period measuring means, wherein the voice synthesis means voice-synthesizes the text information included in the content input by the content input means based on the speech speed of the synthesized voice calculated by the unvoiced speech speed calculation means, and the voice output means outputs the synthesized voice within the unvoiced period according to the synthesized voice information synthesized by the voice synthesis means and outputs a voice according to the voice information included in the content input by the content input means.

The information presentation device according to claim 2, further comprising: utterance detection means for detecting an utterance portion from the voice information included in the content input by the content input means; utterance period measuring means for measuring the period of the utterance portion detected by the utterance detection means; and sound muffling means for muting the sound included in the content input by the content input means only during the utterance period measured by the utterance period measuring means, wherein the voice synthesis means voice-synthesizes the text information included in the content input by the content input means based on a predetermined speech speed, and the voice output means outputs a synthesized voice according to the synthesized voice information synthesized by the voice synthesis means and outputs a voice according to the voice information included in the content input by the content input means.

The information presentation device according to claim 5, further comprising: utterance analysis means for analyzing the voice information within the utterance period measured by the utterance period measuring means; and voice quality classification means for classifying the voice information into respective voice qualities based on the frequency obtained from the voice information analyzed by the utterance analysis means, wherein the voice synthesis means voice-synthesizes the text information included in the content input by the content input means according to each voice quality classified by the voice quality classification means, and the voice output means outputs a synthesized voice according to the synthesized voice information synthesized by the voice synthesis means and outputs a voice according to the voice information included in the content input by the content input means.

The information presentation device according to claim 5, further comprising utterance speed calculation means for calculating the speech speed of the synthesized voice based on the utterance period measured by the utterance period measuring means, wherein the voice synthesis means voice-synthesizes the text information included in the content input by the content input means based on the speech speed of the synthesized voice calculated by the utterance speed calculation means, and the voice output means outputs the synthesized voice within the utterance period according to the synthesized voice information synthesized by the voice synthesis means and outputs a voice according to the voice information included in the content input by the content input means.