JP4080965B2

JP4080965B2 - Information presenting apparatus and information presenting method

Info

Publication number: JP4080965B2
Application number: JP2003197292A
Authority: JP
Inventors: 秀樹筒井; 俊彦真鍋
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-07-15
Filing date: 2003-07-15
Publication date: 2008-04-23
Anticipated expiration: 2023-07-15
Also published as: JP2005038014A

Description

【０００１】
【発明の属する技術分野】
本発明は、ユーザからの検索要求に応じて情報を音声で提示する機能を有する情報提示装置及び情報提示方法に関する。
【０００２】
【従来の技術】
近年、流通するデータのマルチメディア化と検索装置のマルチモーダル化に伴い、音声で情報を提示する手段が増えてきている。例えば、インターネット検索においても、音声合成技術により、検索されたＷｅｂページを音声で読み上げることが可能となっている（例えば、特許文献１参照）。また、音声認識技術により、文書データだけでなく音声／映像を検索することも可能となっている（例えば、特許文献２参照）。
【０００３】
しかし、従来の音声提示方法では、提示されたデータが、どのような根拠で選択されたかが解りにくかった。また、文字データによる提示と違って音声による提示では一覧性が低いため、検索されたデータを最初から最後まで全て聞かなければ、検索された根拠に気付かないことも多く、また検索間違いにも気付くのに時間がかかることも多い。
【０００４】
例えば、特許文献１では、Ｗｅｂ情報を音声読み上げるにあたって、リンクを他の部分と区別して読み上げているが、クエリに適合したヒットワードの部分がどこであるか聞いていてもわからないという問題がある。
【０００５】
また、例えば、特許文献２では、記録音声に対する検索結果を聞いても、どの部分が検索要求にヒットしたのかがわからないという問題がある。また、記録音声を音声認識技術で文字列に変換した後、検索要求によるテキスト検索技術で対応する部分を検索する場合に、そもそも音声認識結果が間違えていると、検索要求と違う単語であっても、検索されてしまうことになり、どの部分がヒットしたかがわからなければ、ユーザは最後まで聞かなければならないし、聞いても検索語が現れないこともある。
【０００６】
また、従来、検索結果を文字データで提示するにあたって、ヒットした部分の文字を反転させたり、文字の色を変えたりする技術もあるが、この技術は、情報の音声による提示には適用することができなかった。
【０００７】
【特許文献１】
特開２００２−３５８１９４号公報
【０００８】
【特許文献２】
特開２００２−３６６５５２号公報
【０００９】
【発明が解決しようとする課題】
上記のように、従来の音声による情報提示技術では、提示された情報の選択根拠が分かりにくい、検索間違いに気付きにくい、ユーザが目的のデータにたどり着くのに時間がかかる、などの問題点があった。
【００１０】
本発明は、上記事情を考慮してなされたもので、音声によるより効果的な情報提示を可能にした情報提示装置及び情報提示方法を提供することを目的とする。
【００１１】
【課題を解決するための手段】
本発明に係る情報提示装置は、検索文を入力するための入力手段と、前記検索文をもとに、文字列データ又は音声データを含むコンテンツを格納したデータベースを検索して、コンテンツを取得する取得手段と、この取得手段により取得された前記コンテンツに含まれる文字列データ又は音声データ中から、前記検索文に適合するヒットワードを検出する検出手段と、この検出手段により検出された前記ヒットワードに関する情報を音声により提示する提示手段とを備えたことを特徴とする。
【００１２】
また、装置または方法に係る本発明は、コンピュータに当該発明に相当する手順を実行させるための（あるいはコンピュータを当該発明に相当する手段として機能させるための、あるいはコンピュータに当該発明に相当する機能を実現させるための）プログラムとしても成立し、該プログラムを記録したコンピュータ読取り可能な記録媒体としても成立する。
【００１３】
本発明によれば、検出されたヒットワードに関する情報を音声により提示することによって、音声によるより効果的な情報提示が可能になる。例えば、音声によるデータ提示のときに、ヒットワードを強調して提示することにより、そのデータが選択された根拠を分かりやすく知ることができる。
【００１４】
【発明の実施の形態】
以下、図面を参照しながら発明の実施の形態を説明する。
【００１５】
（第１の実施形態）
本実施形態では、情報提示装置の一例として検索されたデータに含まれる文字列を音声出力するボイスブラウザの機能を有するものを例にとって説明する。
【００１６】
図１に、本発明の第１の実施形態に係る情報提示装置の構成例を示す。
【００１７】
ここでは、検索対象となるデータは、文字列を含むコンテンツであり、ヒットワード検出やヒットワード強調による音声出力は、検索されたデータに含まれる文字列部分を対象にする。
【００１８】
また、ヒットワードは、検索されたデータに含まれる文字列のうち、検索文に適合する文字列（あるいは、そのデータが検索された根拠となった文字列）である。特に、検索文中の検索語（文字列）に基づいて検索が行われる場合には、ヒットワードは、検索されたデータに含まれる文字列のうち、検索文中の検索語（文字列）に適合する文字列である。また、検索文中に検索語（文字列）が複数ある場合には、ヒットワードは、検索されたデータに含まれる文字列のうち、検索文中の検索語（文字列）のいずれかに適合する文字列である。この場合、適合するとは、例えば、一致することであるが、その他にも構成可能である。例えば、検索されたデータに含まれる文字列のうち、検索文中の検索語（文字列）の下位概念に属する文字列、検索文中の検索語（文字列）の上位概念に属する文字列、検索文中の検索語（文字列）に類似する文字列若しくは検索文中の検索語（文字列）に一致する文字列、又はそれらの任意の組合せとして、構成することも可能である。
【００１９】
また検索がＱＡ検索であった場合は、ヒットワードは検索文に対する回答でもいい。ＱＡ検索技術により検索文に対する回答を求める方法は公知技術（例えば、Question Answering Challenge (QAC-1): An Evaluation of Question Answering Task at NTCIR Workshop 3, Jun'ichi FUKUMOTO, Tsuneaki KATO, and Fumito MASUI（ＵＲＬ：http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/））を用いることができる。
【００２０】
なお、本実施形態では、ヒットワードは、検索されたデータに含まれる文字列のうち、検索文中の検索語（文字列）に一致する文字列である場合を例にとって説明する。
【００２１】
図１において、本情報提示装置１は、検索文を入力する入力部１０１、検索文に適合するデータをネットワーク７を介して接続されたデータサーバ８から検索する検索部１０３、検索されたデータからヒットワードに該当する部分を検出するヒットワード検出部１０４、検索されたデータ（の文字列部分）を音声に変換する音声合成部１０７、検出されたデータ中のヒットワードに該当する部分を強調するためのデータ加工を行うデータ加工部１０５、ヒットワードに該当する部分を強調されたデータを音声出力する出力部１０６、ネットワーク７を介したデータ通信を行う通信部１０８を備えている。
【００２２】
ネットワーク７は、例えば、インターネットであり、データは、例えば、Ｗｅｂページであるが、これに限定されるものではない（ネットワークは、ＬＡＮ等であってもよいし、データは、各種編集ソフトの文書データ等であってもよい）。
【００２３】
また、上記では、本情報提示装置１はネットワーク７を介してデータを蓄積したデータベースを持つデータサーバ８にアクセスするものとしたが、その代わりにまたはそれに加えて、本情報提示装置１が、データを蓄積しているデータベース１０９を内蔵し、検索部１０３が、検索文に適合するデータをデータベースから検索するようにしてもよい。この場合、通信部１０８は必ずしも必要ではない。
【００２４】
なお、本情報提示装置１は、専用機として構成することも、汎用計算機を利用して構成することも可能である。また、例えば、各種機能を有する携帯電話端末により実現することも可能である。
【００２５】
また、本情報提示装置１は、検索されたデータや各種メッセージ等を表示画面に表示する機能を併せ持つものであってもよい。例えば、検索されたデータを、音声出力せずに画面表示するか、画面表示せずに音声出力するか、画面表示しながら音声出力するかを選択可能なものであってもよく、ここで説明する機能は、画面表示せずに音声出力するか、または画面表示しながら音声出力する場合の当該音声出力に適用することができる。
【００２６】
図２に、本情報提示装置１の全体的な処理手順の一例を示す。
【００２７】
まず、ステップＳ１において、ユーザは入力部１０１を介してテキストや音声などにより検索文を入力する。入力方法は、最終的に文字列の検索文が得られるものであればよく、例えば、テキスト入力（キーボード入力）のみ可能であっても、音声入力のみ可能であっても、これらの両方が可能であってもよい。
【００２８】
音声入力を行う場合、ユーザの音声を認識処理して文字列（検索文）に変換する音声認識部（図示せず）を入力部１０１が持てばよい。なお、音声認識技術には公知のものを用いて構わない。
【００２９】
なお、入力方法としていずれの方法をとるにしても、検索文は文字列として検索部１０３に与えられるので、以下では、説明をより簡潔にするために、検索文がテキストデータとして入力部１０１に入力される場合を中心に説明を行う。
【００３０】
次に、ステップＳ２において、検索部１０３は、通信部１０８を通して、検索文に適合するデータ（例えば、Ｗｅｂページ）を検索する。文字列（検索文）によるデータ検索については公知の文書検索技術あるいはインターネット検索技術等を利用して構わない。具体的には、例えば、文字列に変換された検索文中の単語（若しくは文字列）と同じものが含まれているデータを検索するとともに、該検索文中の単語（若しくは文字列）と同じものがより多く含まれているデータほど高いスコアを与えて、それらデータをスコア順に並べ替えるなどの処理を行う。又は、ＱＡ検索のように検索文に対する回答が含まれるデータを検索することもできる。最も確からしい回答を含むデータほど高いスコアを与えて、検索されたデータをスコア順に並べ替えるなどの処理を行う。
【００３１】
次に、ステップＳ３において、ヒットワード検出部１０４は、検索されたデータのうち第１順位のスコアを持つものを対象として、ヒットワードの抽出を行う。この際、ヒットワードがデータのどの部分に出現したかを検出し、この出現位置を示すヒットワード情報を記録する。
【００３２】
なお、検索データに対するヒットワード情報を、この検索データとは別のファイルとして持つようにしてもよいし、その検索データ中にヒットワード情報を記述するようにしてもよい。
【００３３】
図３に、前者の場合（Ｗｅｂページとは別のファイルとして持つ場合）のヒットワード情報の一例を示す。この例では、ヒットワードが当該データにおいて先頭から何回目の出現かを示す情報と、当該回目のヒットワードが当該データにおいて先頭から何文字目から開始し何文字目で終了するかを示す情報を記録している。また、ヒットワードが複数種類あり、ヒットワードごとに出現回数を区別すべき場合には、図４に示す通りヒットワード情報をヒットワードごとに設ければよいし、区別しない場合には、ヒットワードの内容にかかわらず図３に示す通りヒットワード情報を１つ設ければよい。
【００３４】
また、例えば、ヒットワードが複数種類ある場合に、図４のように、ヒットワードごとにカウントしたときの出現回数の他に、ヒットワードにかかわらずにカウントしたときの総出現回数を記録する形態も可能である。
【００３５】
なお、この他にも種々の形態が可能であり、例えば、終了位置について、その開始位置から何文字目かを示す情報を記述してもよい。また、例えば、出現回数が２回目以降では、開始位置について、その１回前の出現回数における開始位置又は終了位置から何文字目かを示す情報を記述してもよい。また、ヒットワードの文字数に従って、開始位置と終了位置の一方から他方を特定することができるので、開始位置と終了位置の一方のみ記録するようにしてもよい。
【００３６】
図５に、後者の場合（Ｗｅｂページ中に記述する場合）のヒットワード情報の一例を示す。この例では、ヒットワードを、開始タグ<hw id=1 t=1>と終了タグ</hw>で挟むことによって、ヒットワードを指示している（なお、タグ名のhwは一例であり、これに限定されるものではない）。また、開始タグ中のｉｄは、ヒットワードを区別するために割り当てられた（例えば、ヒットワード検出部１０４で割り当てられた）ＩＤ番号であるが、これを付加しない構成ももちろん可能である。また、開始タグ中のｔは、同一のヒットワード（ここでは、ｉｄ＝１を持つヒットワード）の出現回数であり、ｔ＝１は、当該データにおける１回目の出現であることが示されている。なお、出現回数についてヒットワードを区別しない場合には、ｔは全ヒットワードに関する出現回数を示す。もちろん、ｔを付加しない構成も可能である。
【００３７】
また、例えば、ヒットワードが複数種類ある場合に、図６に示す通り、ヒットワードごとにカウントしたときの出現回数ｔの他に、ヒットワードにかかわらずにカウントしたときの総出現回数ｓを記録する形態も可能である。
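The inline form of hit word information described above (FIG. 5 and FIG. 6) can be sketched as a small tagging routine. This is only an illustration under stated assumptions: the function name `tag_hit_words` is hypothetical, the tag name `hw` and the attributes `id`, `t`, and `s` follow the text, and the exact serialization used in the figures is not reproduced here.

```python
import re

def tag_hit_words(text, hit_words):
    """Wrap each hit word in <hw id=.. t=.. s=..> ... </hw> (assumed serialization)."""
    if not hit_words:
        return text
    ids = {word: i for i, word in enumerate(hit_words, 1)}  # id: one number per hit word
    per_word = {word: 0 for word in hit_words}              # t: occurrences of this hit word
    total = 0                                               # s: occurrences of any hit word
    # Try longer hit words first so a longer word wins over its own prefix.
    pattern = "|".join(re.escape(w) for w in sorted(hit_words, key=len, reverse=True))

    def wrap(match):
        nonlocal total
        word = match.group(0)
        per_word[word] += 1
        total += 1
        return f"<hw id={ids[word]} t={per_word[word]} s={total}>{word}</hw>"

    return re.sub(pattern, wrap, text)

tagged = tag_hit_words("東京の天気と大阪の天気", ["天気", "東京"])
```

Here `t` restarts for each hit word while `s` keeps counting across all of them, which mirrors the distinction the text draws between the per-word count and the total count.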
【００３８】
次に、ステップＳ４において、ヒットワードを強調して、データを音声出力する。以下、この処理について、詳しく説明する。
【００３９】
まず、音声合成部１０７では、検索されたデータ（例えば、Ｗｅｂページ）の文字列を音声に変換する。文字列を音声に変換する音声合成技術には公知のものを利用して構わない。
【００４０】
このとき、データ加工部１０５では、上記のヒットワード情報を参照して、検索されたデータ中のヒットワードに該当する部分を強調しながら音声変換する。
【００４１】
そして、ヒットワードが強調された検索データは、出力部１０６からユーザに音声で提示される。
【００４２】
ここで、強調の方法としては、種々の形態が可能である。以下に、その具体例を列挙する。
・ヒットワードに該当する部分については、それ以外の部分に比較して、ボリュームを大きくする。
・ヒットワードに該当する部分の再生にあたって、ヒットワード再生の手前で「ポーン」等の通知音を鳴らす。
・ヒットワードに該当する部分については、予め定められた回数だけ繰り返して発声する。
・ヒットワードに該当する部分を発声する際の声調（例えば、性別など）を、それ以外の部分を発声する際の声調から変化させる。
・ヒットワードに該当する部分の再生スピードを、それ以外の部分の再生スピードと変える。
・ヒットワードに該当する部分の発声の直前に、「ヒットワード」あるいは「ヒットワードです」などの音声を挟み込む（なお、この音声は、検索データを発声する際の声調とは異なる声調で発声するようにしてもよい）。
・表示画面を併用する場合に、ヒットワードに該当する部分の発声時に、例えば表示画面を赤く光らせるなど、視覚的に認識できるように差を付ける。
などがある。
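Some of the emphasis methods listed above (louder volume, a different playback speed, an inserted announcement) could be expressed as speech synthesis markup. The sketch below assumes a synthesizer that accepts W3C SSML, which the source does not mention; `emphasize_ssml` and its parameters are hypothetical names, and the output is a fragment (a conforming SSML document also needs version and namespace attributes on `<speak>`).

```python
from xml.sax.saxutils import escape

def emphasize_ssml(text, spans, announce=None):
    """Build SSML in which each (start, end) span is spoken louder and slower.

    Spans are 0-based character offsets with an exclusive end (an assumption).
    If `announce` is given, it is spoken just before each hit word.
    """
    parts, cursor = [], 0
    for start, end in sorted(spans):
        parts.append(escape(text[cursor:start]))
        if announce:
            parts.append(escape(announce) + '<break time="200ms"/>')
        parts.append(
            f'<prosody volume="loud" rate="slow">{escape(text[start:end])}</prosody>'
        )
        cursor = end
    parts.append(escape(text[cursor:]))
    return "<speak>" + "".join(parts) + "</speak>"

ssml = emphasize_ssml("明日の天気は晴れです", [(3, 5)])
```

Per-hit-word variation, as the text goes on to describe, would amount to choosing different attribute values or a different announcement for each hit word.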
【００４３】
また、複数種類の強調の方法を併せて用いるようにしてもよい。例えば、ボリュームを大きくし、かつ、通知音を鳴らす方法や、声調と再生スピードの両方を変化させる方法など、種々の方法がある。
【００４４】
また、ヒットワードが複数種類ある場合に、ヒットワードごとに強調の内容を変えるようにしてもよい。
【００４５】
例えば、強調形態は同じにし、その際のパラメータをヒットワードに応じて変化させる方法がある。例えば、ヒットワードに該当する部分を発声する際の声調をそれ以外の部分を発声する際の声調から変化させるにあたって、ヒットワードごとに声調を変えるようにしてもよい。また、例えば、ヒットワードに該当する部分の再生にあたって鳴らす通知音の回数を、ヒットワードごとに変えるようにしてもよい。
【００４６】
また、例えば、ヒットワードごとに強調形態を変えるようにしてもよい。例えば、第１のヒットワードに該当する部分については、ボリュームを大きくし、第２のヒットワードに該当する部分については、通知音を鳴らし、第３のヒットワードに該当する部分については、声調を変化させるようにしてもよい。
【００４７】
また、例えば、第１のヒットワードに該当する部分については、ボリュームを大きくし、第２のヒットワードに該当する部分については、通知音を１回鳴らし、第３のヒットワードに該当する部分については、通知音を２回鳴らすなどの方法も可能である。
【００４８】
他方、ヒットワードの強調とともに又はこれに代えて、当該データを音声出力する前に、検出されたヒットワード（例えば、検出された全ヒットワード、あるいは最初に検出されたヒットワード、あるいはヒットワードごとに最初に検出されたもの、など）を音声出力するようにしてもよい。これによって、ユーザは、検出されたヒットワードを確認してから、当該データを聞くことができる。特に、音声入力により検索文の入力を行った場合には、音声認識技術は必ずしも音声を完全に間違いなく文字列に変換できるわけではなく、間違いを含んでいることもあるので、検索データの音声出力を聞く前に検出されたヒットワードを聞いて確認することによって、入力音声認識が正しく行われたか否かを確認することができる（例えば、誤認識によって意図するものとは異なるヒットワードが発声された場合には、データ検索結果が所望のものであるとは期待できないので、検索データの音声出力を聴かずに処理を中止して検索文の音声入力からやり直すなどができる）。
【００４９】
さらに、ヒットワードを強調するにあたって、当該データにおけるヒットワードの出現回数をユーザに提示するようにしてもよい。例えば、ヒットワードたる文字列の発声にあたって、「ｎ回目のヒットワード（です）」あるいは「ｎ回目（です）」などの音声を挟み込むようにしてもよい。ヒットワードが複数種類ある場合に、ヒットワードごとの出現回数を提示する場合も、ヒットワードの内容にかかわらない出現回数を提示する場合も同様である。また、ヒットワードが複数種類あり、ヒットワードごとの出現回数を提示する場合において、例えば、ヒットワードごとに「３回目の○○○（です）」のように具体的なヒットワード○○○を含む音声を挟み込むようにしてもよい。また、その他の方法として、ヒットワードの再生にあたって鳴らす通知音を、出現回数と同じ回数だけ鳴らす方法なども可能である（例えば、「３回目です」の代わりに、「ポーンポーンポーン」と３回鳴らす）。
【００５０】
他方、上記のように、ヒットワードを強調するにあたっての出現回数の提示に代えてあるいはそれとともに、当該データを音声出力する前または音声出力した後に、全ヒットワードの総出現回数、及び又は各ヒットワードごとの総出現回数等の情報を音声出力するようにしてもよい。
【００５１】
ここで、上記の出現回数の提示を聞いたユーザが、当該データについて「ｎ回目のヒットワード」から聞き直したい場合に、ユーザが所望の出現回数ｎを指定して再生命令を入力すれば、これに応じて、当該データについて「ｎ回目のヒットワード」に該当する部分（あるいはこの一定文字数前若しくは一定時間前）から再生するような機能を本情報提示装置が備えてもよい。この場合、すでにヒットワード情報は作成されているので、これを参照して、（加工データが残っていなければ、音声合成及びデータ加工した後に、）該当する部分から音声出力を行えばよい。なお、ヒットワードを区別する場合には、ユーザが所望のヒットワード○○○と所望の出現回数ｎの対を指定して再生命令を入力すれば、これに応じて、当該データについて「ｎ回目のヒットワード○○○」に該当する部分（あるいは、その一定文字数前若しくは一定時間前）から再生するようにすればよい。
【００５２】
なお、上記した手順例では、検索されたデータのうち第１順位のスコアを持つもののみ、ヒットワードを強調して、データを音声出力したが、例えば、第１順位から予め定められた順位までのデータを順次ヒットワードを強調して音声出力するようにしてもよい。
【００５３】
図７に、本情報提示装置１の全体的な処理手順の他の例を示す。
【００５４】
なお、ここでは、図７の手順例が図２の手順例と相違する点を中心に説明する。
【００５５】
まず、ステップＳ１１において、ユーザは入力部１０１を介してテキストや音声などにより検索文を入力する。なお、ここでも、図２を参照しながら行った説明と同様の趣旨で、検索文がテキストデータとして入力部１０１に入力される場合を中心に説明を行う。
【００５６】
次に、ステップＳ１２において、検索部１０３は、通信部１０８を通して、検索文に適合するデータを検索する。
【００５７】
次に、ステップＳ１３において、検索部１０３は、検索されたデータに関する情報の一覧を、出力部１０６の表示画面に提示する。この一覧では、例えば、検索時に得られたスコアの順に、検索データに関する情報が並べられる。
【００５８】
次に、ステップＳ１４において、入力部１０１は、ユーザからの指示を受け付ける。
【００５９】
次に、ステップＳ１５において、ステップＳ１４でユーザが入力部１０１を介して入力した指示内容が、ステップＳ１３で提示された一覧において所望のデータを選択指示するものであれば、ステップＳ１６に進む。もし、指示内容が、終了を指示するものであれば、図７の処理を終了する。
【００６０】
次に、ステップＳ１６において、ヒットワード検出部１０４は、ステップＳ１４でユーザが選択指示したデータを対象として、ヒットワードの抽出を行う。
【００６１】
そして、ステップＳ１７において、図２の手順例と同様にして、ヒットワードを強調して、データを音声出力し、ステップＳ１４に戻る。
【００６２】
なお、本情報提示装置１の全体的な処理手順は図２や図７の手順例に限定されるものではなく、種々変形して実施することができる。
【００６３】
例えば、図７の手順例の代わりに、図７のステップＳ１２に続いて検索された全てのデータについてヒットワード検出処理を行っておき、該予め定められた順位までのデータに関する情報をスコア順にソートして提示してユーザ選択可能にし、ユーザが選択したデータについて、音声出力処理を行うようにしてもよい。あるいは、例えば、図７の手順例の代わりに、図７のステップＳ１２に続いて第１順位から予め定められた順位までのデータについてヒットワード検出処理を行っておき、検索されたデータに関する情報をスコア順にソートして提示してユーザ選択可能にし、ユーザが選択したデータについて、（ユーザが選択したデータがヒットワード検出処理されていなければ、ヒットワード検出処理を行ってから）音声出力処理を行うようにしてもよい。
【００６４】
以上説明してきたように本実施形態によれば、ユーザは、強調されたヒットワードを聞くことで、本情報提示装置がそのデータを提示した根拠を把握・確認することができる。また、もし検索文とは異なる文字列が強調されている場合、音声入力した検索文が誤って認識されたことを知ることができ、すぐに検索をやり直すことができる。
【００６５】
なお、本実施形態では、文字列を含むデータを対象としたが、その他のデータにも適用することができる。例えば、キーワード情報を添付された音声データを該キーワード情報に基づいて検索し、検索した音声データを文字列データに変換し、該文字列データを、これまで説明した検索されたデータと同様に扱うようにすれば、音声データを対象とすることもできる。
【００６６】
（第２の実施形態）
本実施形態では、情報提示装置の一例として音声データを検索する音声検索機能を有するものを例にとって説明する。
【００６７】
図８に、本発明の第２の実施形態に係る情報提示装置の構成例を示す。
【００６８】
ここでは、検索対象となるデータは、文字列を発声した音声部分を含む音声コンテンツであり、ヒットワード検出やヒットワード強調による音声出力は、検索された音声データに含まれる文字列に該当する部分（音声認識処理可能部分）を対象とする。もちろん、このような音声データを伴う映像データを含むコンテンツにおける当該音声データの部分についても適用可能である。
【００６９】
なお、ヒットワードは、第１の実施形態で説明した通りである。また、本実施形態では、ヒットワードは、音声データのうち、検索文中の検索語（文字列）に一致する文字列に該当する部分として説明する。
【００７０】
図８において、本情報提示装置２は、検索文を入力する入力部２０１、音声データを文字列データに変換する音声認識部２０２、検索文に適合する音声データをネットワーク２７を介して接続された音声データサーバ２８から検索する音声検索部２０３、検索された音声データからヒットワードに該当する部分を検出するヒットワード検出部２０４、検索された音声データ中のヒットワードに該当する部分を強調するように加工する再生処理部２０５、ヒットワードに該当する部分を強調された音声データを音声出力する出力部２０６、音声データを文字列データに変換し、この変換した文字列データの索引（文書インデックス）を蓄積する文書インデックスデータベース２０８、ネットワーク２７を介したデータ通信を行う通信部２０９を備えている。
【００７１】
ネットワーク２７は、例えば、ＬＡＮであるが、これに限定されるものではなく、インターネット等であってもよい。音声データは、どのようなフォーマットに基づくものであっても構わない。
【００７２】
また、上記では、本情報提示装置２はネットワーク２７を介して音声データを蓄積したデータベースを持つ音声データサーバ２８にアクセスするものとしたが、その代わりにまたはそれに加えて、本情報提示装置２が、音声データを蓄積している音声データベース２０７を備え、音声検索部２０３が、検索文に適合する音声データを音声データベース２０７から検索するようにしてもよい。
【００７３】
また、音声データサーバ２８が、文書インデックスを提供する機能を有する場合には、音声データサーバ２８が提供する文書インデックスを、音声データ検索時に取得して参照するようにしてもよいし、あるいは予め文書インデックスデータベース２０８にダウンロードするようにしてもよい。
【００７４】
なお、本情報提示装置２は、専用機として構成することも、汎用計算機を利用して構成することも可能である。また、例えば、各種機能を有する携帯電話端末により実現することも可能である。
【００７５】
図９に、本情報提示装置２の全体的な処理手順の一例を示す。
【００７６】
ここでは、本情報提示装置の一例として音声検索システムを例にして説明している。音声検索システムとは、例えば、ボイスレコーダなどに録音したデータを検索できるようにしたシステムのことである。ボイスメールシステムでのボイスメールの検索や、会議を録音した音声データを検索する場合や、ラジオ放送を録音した音声データを検索する場合や、テレビ番組を録画したデータの音声トラックを検索する場合に利用することができる。このような音声データが音声データサーバ２８あるいは音声データベース２０７に蓄積されているものとする。
【００７７】
まず、必要に応じて、音声認識部２０２により、（音声データサーバ２８あるいは音声データベース２０７に蓄積されている全部又は一部の音声データについて、）音声データを、文字列の集合（すなわち、文書）に変換する（ステップＳ３０１）。この音声認識技術には公知のものを用いて音声データを文字列データに変換する。
【００７８】
そして、音声データを変換して得られる変換文字列にはタイムスタンプ（例えば、開始時刻と終了時刻）を付し、変換文字列の各単語がもとの音声データのどの部分に対応するかがわかるようにした文書インデックスを作成する。
【００７９】
なお、音声データを文字列データに変換するにあたって、音声データ中で発声がある一定時間以上とぎれる位置で音声データを区切ることによって、音声データを文書毎に区切るようにしてもよい（ただし、ボイスメールなどのようにもともと音声データが内容毎に区切られている場合には、この処理は必ずしも必要ない）。
【００８０】
そして、各音声データについて、該音声データから作成した文書すなわち文字列集合の各文字列と、その各文字列に対応する該音声データ中の位置を指し示すタイムスタンプとを対応付けた文書インデックスを、文書インデックスデータベース２０８に蓄積しておく。
【００８１】
なお、音声認識技術は必ずしも音声を完全に間違いなく文字列に変換できるわけではなく、間違いを含んでいることもある。従って、誤認識があった場合には、間違いを含んだまま文書インデックスデータベース２０８に蓄積されることになる。
【００８２】
次に、音声データの検索をしようとするユーザが、入力部２０１を介してテキストや音声などにより検索文を入力する（ステップＳ３０２）。この入力方法は、最終的に文字列の検索文が得られるものであればよく、例えば、テキスト入力（キーボード入力）のみ可能であっても、音声入力のみ可能であっても、これらの両方が可能であってもよい。
【００８３】
音声入力を行う場合、音声認識部２０２により、ユーザの音声を認識処理して文字列（検索文）に変換する。
【００８４】
なお、入力方法としていずれの方法をとるにしても、検索文は文字列として音声検索部２０３に与えられるので、以下では、説明を簡単にするために、検索文がテキストデータとして入力される場合を中心に説明を行う。
【００８５】
次に、音声検索部２０３は、文書インデックスデータベース２０８を用いて、ユーザによって入力された検索文に適合する文書（に対応する音声データ）を検索する（ステップＳ３０３）。
【００８６】
なお、音声検索手法には公知のものを用いて構わない。例えば、検索文に基づいて文書インデックスデータベース２０８を全文検索することにより、適合する文書を検索し、その文書に対応する音声データを、タイムスタンプをもとに探し出すことができる。また、例えば、ＴＦ・ＩＤＦのような検索技術で文書集合から適合する文書を検索することもできる。ＱＡ検索で検索文に対する回答を含む文書を検索することもできる。なお、検索結果は、例えば、スコア順に並べられた文書リストになる（第１の実施形態参照）。
【００８７】
次に、ヒットワード検出部２０４は、文書インデックスデータベース２０８を用いて、検索された文書中に存在するヒットワードを検出する。検出結果として、検出されたヒットワードの文字列と、その文字列がもとの音声データに出現する位置を示す情報を求める（ステップＳ３０４）。出現位置は、例えば、音声認識部２０２で音声データを文書に変換する際に得られる音声データでの位置を指し示すヒットワードの開始時刻と終了時刻である。音声データ中にヒットワードに該当する部分が複数ある場合には、それら部分音声を全て検出対象とする。
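The timestamped document index and the hit word detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `IndexedWord` class, the `find_hit_spans` function, and the use of seconds as the time unit are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class IndexedWord:
    text: str     # recognized word (may contain recognition errors)
    start: float  # start time within the audio data, in seconds (assumed unit)
    end: float    # end time within the audio data, in seconds (assumed unit)

def find_hit_spans(index, search_terms):
    """Return (word, start, end) for every indexed word identical to a search term."""
    terms = set(search_terms)
    return [(w.text, w.start, w.end) for w in index if w.text in terms]

# Hypothetical index for one recognized utterance.
index = [
    IndexedWord("明日", 0.0, 0.4),
    IndexedWord("の", 0.4, 0.5),
    IndexedWord("会議", 0.5, 1.0),
    IndexedWord("は", 1.0, 1.1),
    IndexedWord("中止", 1.1, 1.6),
]
spans = find_hit_spans(index, ["会議"])
```

Each returned span gives the start and end times of one hit word, so a player can seek to just that portion of the audio.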
【００８８】
次に、再生処理部２０５は、まず、第１順位のスコアを持つ検索された文書に対応する音声データについて、そのヒットワードの開始時刻から終了時刻までの音声だけを再生する（ステップＳ３０４）。なお、ヒットワードが複数ある場合には、例えば、連続してヒットワードに該当する部分の音声を再生していく。
【００８９】
ユーザは、ヒットワードの音声を聞き、検索文中の文字列が含まれている場合には正しく検索できていると判断し、含まれていない場合には正しく検索できてないと判断する（ステップＳ３０５）。ユーザは、正しく検索できている場合、例えば「正しい」もしくは「再生」等と音声により指示し、あるいは「ＯＫ」ボタンもしくは「再生」ボタン等を押すなどして、正しく検索できている旨の指示をし、正しく検索できていない場合、例えば「違う」もしくは「次」等と音声により指示あるいは「次」ボタン等を押すなどして正しく検索できていない旨の指示をする。
【００９０】
そして、ステップＳ３０５において、ユーザから正しく検索できている旨の指示が入力された場合には、続いてこの文書に対応する音声データを（例えば先頭から）再生する（ステップＳ３０７）。
【００９１】
他方、ステップＳ３０５において、ユーザから正しく検索できていなかった（ユーザがヒットワードを聞いたとき、検索文中の文字列でなかった）場合にはこの音声データを全て聞かなくても誤りであることがわかるので、この音声データの再生をスキップして（ステップＳ３０６）、ステップＳ３０４に戻る。そして、次の順位のスコアを持つ検索された文書に対応する音声データについて、同様の処理を繰り返す。
【００９２】
またここではヒットワードは検索文中の検索語に一致する文字列に該当する部分として説明しているが、第１の実施形態で説明したように、検索文中の検索語の下位概念に属する文字列，検索文中の検索語の上位概念に属する文字列，検索文中の検索語に類似する文字列，又はそれらの任意の組合わせとして構成された文字列，又はＱＡ検索の場合の検索文に対する回答部分であった場合も、ユーザがヒットワードを聞き検索文と適合しない語が再生された場合は、容易に検索結果が誤りであることを判断できる。
【００９３】
このように、ユーザは、音声データの全てを聞かなくても、検索誤りをいち早く検知して、次の音声データを再生するように指示することができる。
【００９４】
なお、上記した手順例の代わりに、例えば、検索された音声データのうち第１順位から予め定められた順位までの音声データについて、ヒットワードの開始時刻から終了時刻までの音声だけを順次再生し、ユーザが選択した音声データについて先頭から再生する手順など、種々の手順が可能である。
【００９５】
また、ここで、第１の実施形態と同様に、検索された文書に対応する音声データの再生中に、ヒットワードが出現すると、このヒットワードを強調して再生するようにしてもよい。ヒットワードの強調については、第１の実施形態で説明したものと同様の構成例や各種バリエーションを用いることができる。
【００９６】
また、上記では、文字列による検索文と、音声データを文書化した文書インデックスデータベース２０８とを用いて（すなわち、音声データではなく文字列の比較などによって）処理を行ったが、例えば、ユーザが検索文を発声して音声入力したもの（あるいは、キーボード入力した検索文を音声合成したもの）と、検索対象となる音声データとを、音声波形、音素、音節、発音記号等の中間言語などによって比較等して、検索処理やヒットワード検出処理等を行う構成も可能である。
【００９７】
以上説明してきたように本実施形態によれば、ユーザは、ヒットワードを聞くことで、本情報提示装置がそのデータを提示した根拠を把握・確認することができる。また、検索文とは異なる文字列がヒットワードとして強調されている場合、音声データから文書インデックスを作成する際の音声認識に誤認識があったか、あるいは音声入力した検索文が誤って認識されたかなどを知ることができ、すぐに次のスコア順位の音声データの再生に移り、あるいはすぐに検索をやり直すことができる。
【００９８】
（第３の実施形態）
本実施形態では、情報提示装置の一例として映像データを検索する映像検索機能を有するものを例にとって説明する。
【００９９】
図１０に、本発明の第３の実施形態に係る情報提示装置の構成例を示す。
【０１００】
ここでは、検索対象となる映像データは、文字列を発声した音声部分を含む音声データを伴うものであり、ヒットワード検出やヒットワード強調による音声出力は、検索された映像データに伴われる音声データに含まれる文字列に該当する部分（音声認識処理可能部分）を対象とする。もちろん、このような音声データのみのコンテンツ（映像のないもの）についても適用可能である。
【０１０１】
なお、ヒットワードは、第１の実施形態で説明した通りである。また、本実施形態では、ヒットワードは、音声データのうち、検索文中の検索語（文字列）に一致する文字列に該当する部分として説明する。
【０１０２】
図１０において、本情報提示装置３は、検索文や各種指示等の入力や検索結果等の出力のための情報入出力部３０１、検索文に適合する映像データをネットワーク３７に接続されている映像データサーバ３８から検索するデータ検索部３０３、検索された映像データからヒットワードに該当する部分を検出するヒットワード検出部３０４、検索された映像データ中のヒットワードに該当する部分を強調するように加工する再生処理部３０５、ネットワーク３７を介したデータ通信を行う通信部３０９を備えている。
【０１０３】
情報入出力部３０１は、ＧＵＩ画面を用いたユーザ・インタフェースである。
【０１０４】
ネットワーク３７は、例えば、ＬＡＮであるが、これに限定されるものではなく、インターネット等であってもよい。映像データは、どのようなフォーマットに基づくものであっても構わない。
【０１０５】
また、上記では、本情報提示装置３はネットワーク３７を介して映像データを蓄積したデータベースを持つ映像データサーバ３８にアクセスするものとしたが、この代わりにまたはこれに加えて、本情報提示装置３が、映像データを蓄積している映像データベース（映像ＤＢ）３０７を内蔵し、データ検索部３０３が、検索文に適合する映像データを映像データベース３０７から検索するようにしてもよい。この場合、通信部３０９は必ずしも必要ではない。
【０１０６】
なお、本情報提示装置３は、専用機として構成することも、汎用計算機を利用して構成することも可能である。また、例えば、各種機能を有する携帯電話端末により実現することも可能である。
【０１０７】
図１１に、本情報提示装置３の全体的な処理手順の一例を示す。
【０１０８】
以下、本情報提示装置３でテレビ番組情報を検索する場合を具体例にして説明する。
【０１０９】
本具体例のテレビ番組情報検索は、テレビのニュースを記事毎に検索し、ユーザの見たいニュースを検索できるような場合を想定したものである。
【０１１０】
映像ＤＢ３０７は、例えば、ハードディスク・レコーダなどの映像録画装置のデータベースのことであり、映像と音声からなる。例えば、ニュースが録画されており、ニュースは記事毎に分割して保存されている。本具体例では、本情報提示装置３が映像ＤＢ３０７からユーザの検索文に適合する映像をユーザに提示する場合を想定している。
【０１１１】
まず、ステップＳ２１において、ユーザは情報入出力部３０１を介してテキストや音声などにより検索文を入力する。第１や第２の実施形態と同様、入力方法は、最終的に文字列の検索文が得られるものであればよく、例えば、テキスト入力（キーボード入力）のみ可能であっても、音声入力のみ可能であっても、これらの両方が可能であってもよい。音声入力を行う場合、ユーザの音声を認識処理して文字列（検索文）に変換する音声認識部（図示せず）を情報入出力部３０１が持てばよい。
【０１１２】
なお、入力方法としていずれの方法をとるにしても、検索文は文字列としてデータ検索部３０３に与えられるので、以下では、説明をより簡潔にするために、検索文がテキストデータとして入力される場合を中心に説明を行う。
【０１１３】
次に、データ検索部３０３は、映像ＤＢ３０７から検索文に適合する映像データを検索する（ステップＳ２２）。この検索には、例えば、映像データの音声トラックを使って音声検索する公知技術を用いることができる（例えば、伊藤克亘,藤井敦,石川徹也,「音声文書検索を用いたオンデマンド講義システム」,情報処理学会研究報告, 2001-SLP-39, pp.165-170, Dec. 2001参照）。
【０１１４】
次に、ヒットワード検出部３０４は、検索された映像データから、ヒットワードの検出を行う（ステップＳ２３）。ヒットワードの検出には、例えば、第２の実施形態で説明したような方法を用いればよい。すなわち、例えば、文書インデックスを用いてもよいし、音声波形、音素、音節、発音記号等の中間言語などで直接検出してもよい。
【０１１５】
次に、検索結果画面を表示する（ステップＳ２４）。
【０１１６】
図１２に、検索結果画面の一例を示す。
【０１１７】
図１２の例では、検索されたデータがサムネイル（図中、符号４０３）で表示され、ランキング（図中、符号４０１）の順に並んでいる。なお、図１２の例では、ランキング第３位のデータは映像のない音声コンテンツなので、サムネイルが空欄になっている様子を示している。なお、図１２の画面において、各コンテンツに関する他の種々の情報を提示してもよい。なお、各コンテンツに関する情報としては、例えば、当該コンテンツが放送されたチャンネルあるいは放送局を示す情報、地上波か衛星放送かを示す情報、テレビ放送かラジオ放送かを示す情報、放送された日付及び時間を示す情報、そのデータが格納されているアドレスなど、種々の情報が考えられる。
【０１１８】
次に、ステップＳ２５において、情報入出力部３０１は、ユーザからの指示を受け付ける。
【０１１９】
次に、ステップＳ２６において、ステップＳ２５でユーザが情報入出力部３０１を介して入力した指示内容が、ステップＳ２４で表示された検索結果画面において、所望のデータのヒットワード再生を指示するものであれば、ステップＳ２７に進み、所望のデータのデータ再生を指示するものであれば、ステップＳ２８に進む。もし、指示内容が、終了を指示するものであれば、図１１の処理を終了する。
【０１２０】
ここで、ステップＳ２７の処理に関して詳しく説明する。
【０１２１】
先のステップＳ２４で表示される図１２の検索結果画面では、各データとともに［ヒットワード］のボタン（図中、符号４０２）がある。ステップＳ２５でユーザが所望のデータに対するヒットワード・ボタンをマウス等でクリック指示することにより、ステップＳ２６からＳ２７に遷移して、図１３に示すヒットワード再生画面が現れ、ユーザが指示したデータにおけるヒットワードの部分が再生される。
【０１２２】
図１３のヒットワード再生画面は、ヒットワードの音声に対応する映像を表示する領域５０１、検索文５０２、ヒットワードに対応するタイムライン５０３、現在の再生位置を表示するポインタ５０４がある。なお、複数のヒットワードがある場合には、このタイムライン５０３は、複数のヒットワードが繋ぎ合わされたものに対応するものとしている。
【０１２３】
なお、図１４に示す通り、ヒットワードの種類ごとにボタンを設けるようにしてもよい。この場合には、ユーザにクリックされた種類のヒットワードに対応する画面が表示される。なお、当該種類のヒットワードが複数存在する場合には、タイムライン５０３は、複数のヒットワードが繋ぎ合わされたものに対応する。
【０１２４】
ユーザは、このような画面でヒットワードを聞くことで、検索結果の正誤を容易に判断することができる。もしヒットワードとして検索文とは違う語が再生された場合は、この音声データを全て聞かなくても誤りであることがわかる。
【０１２５】
またここではヒットワードは検索文中の検索語に一致する文字列に該当する部分として説明しているが、第１の実施形態で説明したように、検索文中の検索語の下位概念に属する文字列，検索文中の検索語の上位概念に属する文字列，検索文中の検索語に類似する文字列，又はそれらの任意の組合わせとして構成された文字列，又はＱＡ検索の場合の検索文に対する回答部分であった場合も、ユーザがヒットワードを聞き検索文と適合しない語が再生された場合は、容易に検索結果が誤りであることを判断できる。
【０１２６】
また、図１３の例では、再生するタイムバーの下に［ＯＫ］ボタン５０５と［ＮＧ］ボタン５０６がある。
【０１２７】
もしユーザがヒットワードの再生を聞いて検索文と同じ言葉が再生された場合には、［ＯＫ］のボタン５０５を押す。他方、検索文と違う言葉がヒットワードとして再生された場合には［ＮＧ］５０６のボタンを押す。
【０１２８】
［ＯＫ］のボタン５０５が押された場合には、この音声データに、このヒットワードに対応するクエリの文字列を正の加重（＋ａ）とともに付加する。
【０１２９】
［ＮＧ］のボタン５０６が押された場合には、この音声データに、クエリの文字列を負の加重（−ｂ）とともに付加する。
【０１３０】
なお、付加する正の加重は（＋ａ）を限度とし、付加する負の加重は（−ｂ）を限度としてもよい。
【０１３１】
以降に行われる検索では、この付加文字列が利用される。
【０１３２】
例えば、「○○」という検索文字列で検索された音声データで、このヒットワードを再生したときに、正しく「○○」と再生されてユーザが［ＯＫ］のボタンを押した場合、この音声データのこの部分には「○○」という音声が含まれていることがわかる。このためその音声データには「○○」という文字列が正の加重とともに付加される。他方、このヒットワードを再生したときに、「○○」ではなく「○Ｘ」と再生されてユーザが［ＮＧ］のボタンを押した場合、この音声データには「○○」という文字列が負の加重とともに付加される。次回から検索文に「○○」という文字列がある場合には、付加文字列に正の加重の「○○」を持つ音声データが優先的に検索される（あるいは、スコアが高くなる）ようになる。他方、付加文字列に負の加重の「○○」を持つ音声データは、検索されない（あるいは、スコアが低くなる）ようになる。
【０１３３】
付加文字列を考慮したスコアは、例えば、次のようになる。
【０１３４】
今回の検索文（検索文字列）と、音声データに付加された付加文字列との類似度スコアをＳａ（例えば、上記の＋ａまたは−ｂ）とし、新しい検索文と音声データとの類似度スコアをＳｄとして、最終的なスコアを、
Ｓ＝αＳｄ＋（１−α）Ｓａ
とする。ここで、αは０から１の間の適当な数である。
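The combined score S = αSd + (1 − α)Sa can be illustrated with a short calculation. The sketch below is an assumption-laden example rather than the patented implementation: the function name `final_score`, the `feedback` table keyed by (audio id, query string), and the concrete weights +1.0 and −1.0 standing in for +a and −b are all hypothetical.

```python
def final_score(s_d: float, s_a: float, alpha: float = 0.5) -> float:
    """Combine the two similarity scores as S = alpha * Sd + (1 - alpha) * Sa."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * s_d + (1.0 - alpha) * s_a

# Hypothetical feedback store: the weight attached to (audio id, query string)
# after the user pressed [OK] (+a) or [NG] (-b). Here a = b = 1.0.
feedback = {("news-1", "○○"): +1.0, ("news-2", "○○"): -1.0}

def rescore(audio_id: str, query: str, s_d: float, alpha: float = 0.5) -> float:
    """Sa is the stored weight for this audio and query, or 0 if there is none."""
    s_a = feedback.get((audio_id, query), 0.0)
    return final_score(s_d, s_a, alpha)

confirmed = rescore("news-1", "○○", s_d=0.5)   # 0.5*0.5 + 0.5*1.0
rejected = rescore("news-2", "○○", s_d=0.5)    # 0.5*0.5 - 0.5*1.0
unrated = rescore("news-3", "○○", s_d=0.5)     # 0.5*0.5 + 0
```

With the same Sd, the item confirmed by [OK] ranks above an unrated item, and the item rejected by [NG] ranks below it, which is the behavior the text describes.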
【０１３５】
なお、新しい検索文と音声データとの類似度Ｓｄについては、新しい検索文を示す検索文字列と音声データを文字列に変換した文字列文書とを用いて類似度Ｓｄを求めることも可能であり、また、新しい検索文と、対象となる音声データとの類似度を、音声波形、音素、音節、発音記号等の中間言語などを用いて類似度Ｓｄを求めることも可能である。
【０１３６】
付加文字列に検索文字列がヒットした場合には、例えば、図１２に例示するように、タイトルの先頭にマーク（本具体例では、星マーク）４０４が付けられ、付加文字列にヒットしたことが分かるようにするようにしてもよい。
【０１３７】
このステップＳ２７の処理が済めば、ステップＳ２５に戻る。
【０１３８】
次に、ステップＳ２８の処理に関して詳しく説明する。
【０１３９】
先のステップＳ２４で表示される図１２の検索結果画面では、各データとともにサムネイル（図中、符号４０３）がある。ステップＳ２５でユーザが所望のデータに対するサムネイルをマウス等でクリック指示することにより、ステップＳ２６からＳ２８に遷移して、図１５に示すデータ再生画面が現れ、ユーザが指定したデータ（映像データあるいは音声データ）が再生される。
【０１４０】
図１５のデータ再生画面では、映像を再生する領域６０１、検索文６０２、コンテンツ全体に対応するタイムライン６０３、現在の再生位置を表示するポインタ６０４がある。また、図１５では、タイムライン６０３には、ヒットワードが出現する位置にマーク６０７が付けられる。これにより、どの位置にヒットワードが出現するのかがわかる。
【０１４１】
なお、図１６に示す通り、各マーク６０７の近傍に、当該ヒットワードとその前後の文章を文字列にして表示するようにしてもよい（図中の６０８において、○○が当該ヒットワードを表し、…がその前後の文章を表している）。
【０１４２】
このステップＳ２８の処理が済めば、ステップＳ２５に戻る。
【０１４３】
なお、本実施形態においても第１の実施形態と同様に、検索された文書に対応する音声データの再生中に、ヒットワードが出現すると、その部分を強調して再生するようにしてもよい。ヒットワードの強調については、第１の実施形態で説明したものと同様の構成例や各種バリエーションを用いることができる。
【０１４４】
これらにより、これまでシステムの音声認識が誤った場合でも、検索されたデータを全て再生してからやっと検索文が存在しないことがわかり、誤った検索結果を全て再生しなければそれが誤りであることがわからなかった。本システムによるヒットワード強調機能により、データを再生しなくてもあらかじめシステムの誤りに気付くことができるため、目的のデータを探す時間が大幅に短縮される。
【０１４５】
なお、以上の各機能は、ソフトウェアとして記述し適当な機構をもったコンピュータに処理させても実現可能である。
また、本実施形態は、コンピュータに所定の手段を実行させるための、あるいはコンピュータを所定の手段として機能させるための、あるいはコンピュータに所定の機能を実現させるためのプログラムとして実施することもできる。加えて該プログラムを記録したコンピュータ読取り可能な記録媒体として実施することもできる。
【０１４６】
なお、本発明は上記実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。
【０１４７】
【発明の効果】
本発明によれば、音声によるより効果的な情報提示が可能になる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る情報提示装置の構成例を示す図
【図２】同実施形態に係る情報提示装置の処理手順の一例を示すフローチャート
【図３】ヒットワード情報の一例を示す図
【図４】ヒットワード情報の他の例を示す図
【図５】ヒットワード情報のさらに他の例を示す図
【図６】ヒットワード情報のさらに他の例を示す図
【図７】同実施形態に係る情報提示装置の処理手順の他の例を示すフローチャート
【図８】本発明の第２の実施形態に係る情報提示装置の構成例を示す図
【図９】同実施形態に係る情報提示装置の処理手順の一例を示すフローチャート
【図１０】本発明の第３の実施形態に係る情報提示装置の構成例を示す図
【図１１】同実施形態に係る情報提示装置の処理手順の一例を示すフローチャート
【図１２】検索結果画面の一例を示す図
【図１３】ヒットワード再生画面の一例を示す図
【図１４】検索結果画面の他の例を示す図
【図１５】データ再生画面の一例を示す図
【図１６】データ再生画面の他の例を示す図
【符号の説明】
１，２，３…情報提示装置、８…データサーバ、７，２７，３７…ネットワーク、２８…音声データサーバ、３８…映像データサーバ、１０１，２０１…入力部、１０３…検索部、１０４，２０４，３０４…ヒットワード検出部、１０５…データ加工部、１０６，２０６…出力部、１０７…音声合成部、１０８，２０９，３０９…通信部、１０９…データベース、２０２…音声認識部、２０３…音声検索部、２０５，３０５…再生処理部、２０７…音声データベース、２０８…文書インデックス、３０１…情報入出力部、３０３…データ検索部、３０７…映像データベース
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an information presentation apparatus and an information presentation method having a function of presenting information by voice in response to a search request from a user.
[0002]
[Prior art]
In recent years, as distributed data has become multimedia and search devices have become multimodal, the number of ways to present information by voice has grown. For example, in Internet search, speech synthesis makes it possible to read a retrieved Web page aloud (see, for example, Patent Document 1). Likewise, speech recognition makes it possible to search not only document data but also audio and video (see, for example, Patent Document 2).
[0003]
However, with conventional voice presentation methods, it is difficult to tell on what basis the presented data was selected. Moreover, unlike presentation as text, voice presentation cannot be scanned at a glance. Unless the user listens to the retrieved data from beginning to end, the basis for the retrieval often goes unnoticed, and it also often takes time to notice a retrieval error.
[0004]
For example, in Patent Document 1, links are read aloud so as to be distinguished from other parts when Web information is read out, but there is a problem that the listener cannot tell where the hit word matching the query is.
[0005]
Further, for example, in Patent Document 2, there is a problem that even when the user listens to a search result for recorded speech, the user cannot tell which part hit the search request. In addition, when recorded speech is converted into a character string by speech recognition and the corresponding part is then found by text search based on the search request, an erroneous recognition result causes data to be retrieved even though it contains a word different from the search request. If the user does not know which part was hit, the user must listen to the end, and even then the search term may never appear.
[0006]
Conventionally, when search results are presented as text, there are techniques that display the hit portion in reverse video or in a different color, but these techniques could not be applied to presenting information by voice.
[0007]
[Patent Document 1]
JP 2002-358194 A
[0008]
[Patent Document 2]
JP 2002-366552 A
[0009]
[Problems to be solved by the invention]
As described above, conventional voice-based information presentation techniques have problems such as the following: the basis for selecting the presented information is hard to understand, retrieval errors are hard to notice, and it takes time for the user to reach the target data.
[0010]
The present invention has been made in view of the above circumstances, and an object thereof is to provide an information presentation apparatus and an information presentation method that enable more effective information presentation by voice.
[0011]
[Means for Solving the Problems]
An information presentation apparatus according to the present invention comprises: input means for inputting a search sentence; acquisition means for acquiring content by searching, based on the search sentence, a database storing content that includes character string data or audio data; detection means for detecting a hit word that matches the search sentence from the character string data or audio data included in the content acquired by the acquisition means; and presentation means for presenting, by voice, information related to the hit word detected by the detection means.
[0012]
Further, the present invention relating to an apparatus or a method can also be realized as a program for causing a computer to execute procedures corresponding to the invention (or for causing a computer to function as means corresponding to the invention, or for causing a computer to realize functions corresponding to the invention), and as a computer-readable recording medium on which the program is recorded.
[0013]
According to the present invention, presenting information related to the detected hit word by voice enables more effective voice-based information presentation. For example, emphasizing the hit word when data is presented by voice makes it easy to understand the basis on which the data was selected.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the invention will be described with reference to the drawings.
[0015]
(First embodiment)
In the present embodiment, an information presentation apparatus having a voice browser function, which outputs by voice the character strings included in retrieved data, is described as an example.
[0016]
FIG. 1 shows a configuration example of an information presentation apparatus according to the first embodiment of the present invention.
[0017]
Here, the data to be searched is content including character strings, and hit word detection and voice output with hit word emphasis are applied to the character string portions of the retrieved data.
[0018]
A hit word is a character string, among the character strings included in the retrieved data, that matches the search sentence (or that served as the basis for retrieving the data). In particular, when the search is performed based on a search term (character string) in the search sentence, the hit word is a character string in the retrieved data that matches that search term. When the search sentence contains a plurality of search terms, the hit word is a character string in the retrieved data that matches any one of them. Here, "matches" means, for example, being identical, but other configurations are possible. For example, among the character strings in the retrieved data, a hit word may be a character string belonging to a subordinate concept of a search term, a character string belonging to a superordinate concept of a search term, a character string similar to a search term, a character string identical to a search term, or any combination thereof.
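The matching criteria above can be pictured as a term-expansion step that decides which strings count as hit words. The following is a minimal sketch, not the patented implementation: the `RELATED` table and the function `candidate_hit_words` are hypothetical names, and the tiny hand-written thesaurus stands in for whatever concept dictionary a real system would use.

```python
# Hypothetical thesaurus: each search term maps to related strings by relation type.
RELATED = {
    "fruit": {
        "subordinate": ["apple", "orange"],  # narrower concepts
        "superordinate": ["food"],           # broader concepts
        "similar": ["fruits"],               # near-equivalent strings
    },
}

def candidate_hit_words(search_terms, relations=("identical",)):
    """Return the set of strings that count as hit words for the given terms."""
    candidates = set()
    for term in search_terms:
        if "identical" in relations:
            candidates.add(term)
        for relation in relations:
            candidates.update(RELATED.get(term, {}).get(relation, []))
    return candidates

exact_only = candidate_hit_words(["fruit"])
expanded = candidate_hit_words(
    ["fruit"], relations=("identical", "subordinate", "similar")
)
```

Passing only `"identical"` corresponds to the exact-match case used in the rest of this embodiment; adding other relation names corresponds to the "any combination thereof" wording.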
[0019]
If the search is a QA search, the hit word may be the answer to the search sentence. A known technique can be used to obtain the answer to a search sentence by QA search (for example, "Question Answering Challenge (QAC-1): An Evaluation of Question Answering Task at NTCIR Workshop 3", Jun'ichi FUKUMOTO, Tsuneaki KATO, and Fumito MASUI; URL: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/).
[0020]
In the present embodiment, the description takes as an example the case where a hit word is a character string in the retrieved data that is identical to a search term (character string) in the search sentence.
[0021]
In FIG. 1, the information presentation apparatus 1 includes: an input unit 101 for inputting a search sentence; a search unit 103 that searches a data server 8, connected via a network 7, for data matching the search sentence; a hit word detection unit 104 that detects the portions of the retrieved data corresponding to hit words; a speech synthesis unit 107 that converts the retrieved data (its character string portion) into speech; a data processing unit 105 that processes the data so as to emphasize the portions corresponding to hit words; an output unit 106 that outputs, by voice, the data in which the hit word portions are emphasized; and a communication unit 108 that performs data communication via the network 7.
[0022]
The network 7 is, for example, the Internet, and the data is, for example, a Web page, but neither is limited to these (the network may be a LAN or the like, and the data may be document data of various editing software or the like).
[0023]
In the above description, the information presentation apparatus 1 accesses, via the network 7, the data server 8 having a database in which data is stored. Instead of or in addition to this, the information presentation apparatus 1 may incorporate a database 109 in which data is stored, and the search unit 103 may search the database 109 for data matching the search sentence. In that case, the communication unit 108 is not necessarily required.
[0024]
The information presentation apparatus 1 can be configured as a dedicated machine or by using a general-purpose computer. It can also be realized by, for example, a mobile phone terminal having various functions.
[0025]
The information presentation apparatus 1 may also have a function of displaying retrieved data, various messages, and the like on a display screen. For example, it may allow the user to select whether retrieved data is displayed on the screen without voice output, output by voice without screen display, or output by voice while being displayed on the screen. The functions described here apply to the voice output in the latter two cases.
[0026]
FIG. 2 shows an example of the overall processing procedure of the information presentation apparatus 1.
[0027]
First, in step S1, the user inputs a search sentence by text, voice, or the like via the input unit 101. Any input method may be used as long as a search sentence is ultimately obtained as a character string; for example, only text input (keyboard input) may be supported, only voice input may be supported, or both may be supported.
[0028]
When performing voice input, the input unit 101 only needs to have a voice recognition unit (not shown) that recognizes the user's voice and converts it into a character string (search sentence). A known voice recognition technique may be used.
[0029]
Regardless of which input method is used, the search sentence is given to the search unit 103 as a character string. Therefore, for simplicity, the following description focuses on the case where the search sentence is input to the input unit 101 as text data.
[0030]
Next, in step S2, the search unit 103 searches, through the communication unit 108, for data (for example, a Web page) that matches the search sentence. For data search using a character string (search sentence), a known document search technique or Internet search technique may be used. Specifically, for example, data containing the same words (or character strings) as those in the search sentence is retrieved, data containing more of those words (or character strings) is given a higher score, and the data is sorted in order of score. Alternatively, as a QA search, data containing an answer to the search sentence may be retrieved. In that case, data containing the most probable answer is given a higher score, and the retrieved data is sorted in order of score.
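The word-match scoring described for step S2 can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the search technique the apparatus actually uses: the function names `score` and `search` are hypothetical, text is split on whitespace, and the score is simply the number of occurrences of the search words.

```python
from collections import Counter


def score(query_words, text):
    """Count how many times the query words occur in the text (assumed scoring rule)."""
    counts = Counter(text.lower().split())
    return sum(counts[word.lower()] for word in query_words)


def search(query, documents):
    """Return (score, document) pairs for matching documents, highest score first."""
    words = query.split()
    scored = [(score(words, doc), doc) for doc in documents]
    hits = [pair for pair in scored if pair[0] > 0]  # drop data with no match
    return sorted(hits, key=lambda pair: pair[0], reverse=True)
```

A real document or Internet search engine would use tokenization and ranking suited to the language of the data. The sketch only shows the "more matches, higher score, sort by score" idea.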
[0031]
Next, in step S 3, the hit word detection unit 104 extracts hit words for the retrieved data having the first rank score. At this time, it is detected in which part of the data the hit word appears, and hit word information indicating the appearance position is recorded.
[0032]
The hit word information for the search data may be stored as a separate file from the search data, or the hit word information may be described in the search data.
[0033]
FIG. 3 shows an example of hit word information in the former case (held as a file separate from the Web page). In this example, each entry records which appearance of the hit word it is, counted from the beginning of the data, and at which character positions, counted from the beginning of the data, that appearance starts and ends. If there are multiple types of hit words and the number of appearances should be distinguished for each hit word, hit word information may be provided for each hit word as shown in FIG. If the appearances are counted regardless of the content of the hit word, a single set of hit word information may be provided as shown in FIG.
[0034]
Further, for example, when there are a plurality of types of hit words, it is also possible to record, in addition to the number of appearances counted for each hit word, the total number of appearances counted regardless of the hit word, as shown in FIG.
[0035]
Various other forms are possible. For example, the end position may be described as the number of characters from the start position. In addition, for the second and subsequent appearances, the start position may be described as the number of characters from the start position or the end position of the immediately preceding appearance. Further, since either of the start position and the end position can be derived from the other using the number of characters in the hit word, only one of them may be recorded.
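As an illustration of hit word information held separately from the data, the following Python sketch records, for each appearance, the hit word, its per-word appearance number, its overall appearance number, and its start and end positions. The record layout and the name `detect_hit_words` are assumptions for illustration. Positions here are 0-based character offsets with an exclusive end, which may differ from the convention in the figures.

```python
def detect_hit_words(text, hit_words):
    """Return one record per appearance of a hit word, ordered by position."""
    found = []
    for word in hit_words:
        start = text.find(word)
        while start != -1:
            found.append((start, start + len(word), word))
            start = text.find(word, start + len(word))
    found.sort()  # order appearances from the beginning of the data

    per_word = {}  # appearance count kept separately for each hit word
    records = []
    for total, (start, end, word) in enumerate(found, start=1):
        per_word[word] = per_word.get(word, 0) + 1
        records.append({
            "word": word,
            "t": per_word[word],  # appearance number for this hit word
            "s": total,           # appearance number regardless of hit word
            "start": start,
            "end": end,
        })
    return records
```

Because the end position equals the start position plus the length of the hit word, either field could be dropped from the record, as the text notes.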
[0036]
FIG. 5 shows an example of hit word information in the latter case (described within the Web page). In this example, a hit word is marked by enclosing it between the start tag <hw id=1 t=1> and the end tag </hw> (note that hw in the tag name is an example, and the tag name is not limited to this). The id in the start tag is an ID number assigned to distinguish hit words (for example, assigned by the hit word detection unit 104), although a configuration that omits the ID number is of course also possible. In addition, t in the start tag is the number of appearances of the same hit word (here, the hit word having id=1), and t=1 indicates the first appearance in the data. When hit words are not distinguished in counting appearances, t represents the number of appearances counted over all hit words. Of course, a configuration without t is also possible.
[0037]
Further, for example, when there are a plurality of types of hit words, it is also possible, as shown in FIG. 6, to record the total number of appearances s counted regardless of the hit word in addition to the number of appearances t counted for each hit word.
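The tag-based form can be sketched in Python as follows. This is a hedged illustration only: the function name `mark_hit_words` is hypothetical, the ID numbers are simply assigned in the order the hit words are given, and appearances that overlap an already tagged appearance are skipped.

```python
def mark_hit_words(text, hit_words):
    """Wrap each appearance of a hit word in <hw id=.. t=.. s=..> ... </hw> tags."""
    ids = {word: i for i, word in enumerate(hit_words, start=1)}
    spans = []
    for word in hit_words:
        pos = text.find(word)
        while pos != -1:
            spans.append((pos, word))
            pos = text.find(word, pos + len(word))
    spans.sort()

    out, cursor, total, per_word = [], 0, 0, {}
    for pos, word in spans:
        if pos < cursor:
            continue  # skip an appearance overlapping one already tagged
        total += 1
        per_word[word] = per_word.get(word, 0) + 1
        out.append(text[cursor:pos])
        out.append(f"<hw id={ids[word]} t={per_word[word]} s={total}>{word}</hw>")
        cursor = pos + len(word)
    out.append(text[cursor:])
    return "".join(out)
```

Here t counts appearances of the same hit word and s counts appearances over all hit words, matching the two counters described above.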
[0038]
Next, in step S4, the hit word is emphasized and the data is output as voice. Hereinafter, this process will be described in detail.
[0039]
First, the speech synthesis unit 107 converts the character string of the retrieved data (for example, a Web page) into speech. A known speech synthesis technique for converting a character string into speech may be used.
[0040]
At this time, the data processing unit 105 refers to the hit word information and performs voice conversion while emphasizing a portion corresponding to the hit word in the retrieved data.
[0041]
The search data in which the hit word is emphasized is presented to the user by voice from the output unit 106.
[0042]
Here, various forms are possible as the emphasis method. Specific examples are listed below.
・ For the portion corresponding to the hit word, increase the volume compared with the other portions.
・ When playing the portion corresponding to the hit word, sound a notification sound such as “Pawn” before playing the hit word.
・ For the portion corresponding to the hit word, repeat the voice a predetermined number of times.
・ Change the tone (for example, the gender of the voice) used when uttering the portion corresponding to the hit word from the tone used when uttering the other portions.
・ Make the playback speed of the portion corresponding to the hit word different from the playback speed of the other portions.
・ Insert a voice such as “hit word” or “it is a hit word” immediately before the utterance of the portion corresponding to the hit word (this voice may be uttered in a tone different from that used when uttering the retrieved data).
・ When a display screen is used, provide a visually recognizable difference, for example by making the display screen glow red while the portion corresponding to the hit word is being uttered.
[0043]
A plurality of types of emphasis methods may be used together. For example, there are various methods such as a method of increasing the volume and sounding a notification sound and a method of changing both the tone and the reproduction speed.
[0044]
Further, when there are a plurality of types of hit words, the emphasis content may be changed for each hit word.
[0045]
For example, there is a method in which the emphasis forms are the same and the parameters at that time are changed according to the hit word. For example, when changing the tone when uttering a portion corresponding to a hit word from the tone when uttering other portions, the tone may be changed for each hit word. In addition, for example, the number of notification sounds that are generated when reproducing a portion corresponding to a hit word may be changed for each hit word.
[0046]
Further, for example, the emphasis form itself may be changed for each hit word. For example, the volume may be increased for the portion corresponding to the first hit word, a notification sound may be sounded for the portion corresponding to the second hit word, and the tone may be changed for the portion corresponding to the third hit word.
[0047]
Also, for example, a method is possible in which the volume is increased for the portion corresponding to the first hit word, a notification sound is sounded once for the portion corresponding to the second hit word, and a notification sound is sounded twice for the portion corresponding to the third hit word.
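One way to picture per-hit-word emphasis is as a playback plan: the text is cut into segments, and each hit word segment carries the emphasis assigned to its ID. The Python sketch below is an assumption-laden illustration, not the data processing unit 105 itself. The table `EMPHASIS_BY_ID`, the attribute names `volume` and `chimes`, and the function `plan_playback` are all hypothetical, and a speech synthesizer would still be needed to render the plan.

```python
# Hypothetical emphasis table: louder voice for the first hit word,
# one notification sound for the second, two for the third.
EMPHASIS_BY_ID = {
    1: {"volume": "loud"},
    2: {"chimes": 1},
    3: {"chimes": 2},
}
DEFAULT_EMPHASIS = {"volume": "loud"}


def plan_playback(text, spans):
    """Split text into segments; spans are sorted, non-overlapping (start, end, id)."""
    segments, cursor = [], 0
    for start, end, hit_word_id in spans:
        if start > cursor:
            # Ordinary text between hit words is spoken without emphasis.
            segments.append({"text": text[cursor:start], "emphasis": None})
        emphasis = EMPHASIS_BY_ID.get(hit_word_id, DEFAULT_EMPHASIS)
        segments.append({"text": text[start:end], "emphasis": emphasis})
        cursor = end
    if cursor < len(text):
        segments.append({"text": text[cursor:], "emphasis": None})
    return segments
```

Keeping the emphasis in a table makes it easy to switch between changing only a parameter per hit word (for example, the number of notification sounds) and changing the emphasis form itself.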
[0048]
On the other hand, together with or instead of emphasizing hit words, the detected hit words (for example, all detected hit words, the first detected hit word, or the first detected appearance of each hit word) may be output as speech before the data itself is output as speech. The user can then listen to the data after confirming the detected hit words. In particular, when the search sentence is input by voice, the speech recognition technology does not always convert the voice into a character string correctly, and the result may contain errors. By listening to the detected hit words before listening to the data, the user can confirm whether the input speech was recognized correctly. For example, if a hit word different from the intended one is uttered because of misrecognition, the search result cannot be expected to be the desired one, so the user can stop the processing without listening to the voice output of the retrieved data and input the search sentence again by voice.
[0049]
Furthermore, when emphasizing hit words, the number of appearances of the hit word in the data may be presented to the user. For example, when a character string that is a hit word is uttered, a voice such as “nth hit word (is)” or “nth (is)” may be inserted. When there are a plurality of types of hit words, the number of appearances may be presented either for each hit word or irrespective of the content of the hit word. In addition, when there are a plurality of types of hit words and the number of appearances is presented for each hit word, a voice containing the specific hit word XXX, such as “the third XXX”, may be inserted for each hit word. As another method, the notification sound for hit word playback may be sounded as many times as the number of appearances (for example, sounding “Pawn pawn pawn” instead of saying “third time”).
[0050]
On the other hand, instead of or together with presenting the number of appearances when emphasizing hit words as described above, information such as the total number of appearances of all hit words and/or the total number of appearances of each hit word may be output by voice before or after the data is output by voice.
[0051]
Here, if the user who has heard the presentation of the number of appearances wants to hear the data again from the “nth hit word”, the user designates the desired number of appearances n and inputs a reproduction command. The information presentation apparatus may be provided with a function that, in response, reproduces the data from the portion corresponding to the “nth hit word” (or from a certain number of characters or a certain time before it). In this case, since the hit word information has already been created, the speech output may be started from the corresponding portion by referring to it (after performing speech synthesis and data processing again if the processed data no longer remains). When hit words are distinguished, the user may designate a pair of a desired hit word XXX and a desired number of appearances n and input a reproduction command, and the data may be played back from the portion corresponding to the “nth hit word XXX” (or from a certain number of characters or a certain time before it).
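Looking up where playback should resume from the hit word information can be sketched as below. This is a minimal illustration under assumptions: the record fields `word` and `start`, the function name `playback_start`, and the `lead_in` parameter (the "certain number of characters before" in the text) are all hypothetical.

```python
def playback_start(records, n, word=None, lead_in=0):
    """Return the character offset from which to replay the data.

    records: hit word records ordered by position, each with 'word' and 'start'.
    n:       desired appearance number (1-based).
    word:    if given, count only appearances of this hit word.
    lead_in: number of characters to back up before the hit word.
    """
    candidates = [r for r in records if word is None or r["word"] == word]
    if not 1 <= n <= len(candidates):
        raise ValueError("no such appearance")
    return max(0, candidates[n - 1]["start"] - lead_in)
```

For example, with hit words at offsets 0, 10, and 15, asking for the second appearance overall gives offset 10, while asking for the second appearance of one particular hit word may give a different offset.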
[0052]
In the above-described procedure example, only the data having the first-rank score among the retrieved data is output by voice with its hit words emphasized. However, for example, the data from the first rank down to a predetermined rank may be output by voice in sequence, each with its hit words emphasized.
[0053]
FIG. 7 shows another example of the overall processing procedure of the information presentation apparatus 1.
[0054]
Here, the description will focus on the point that the example of the procedure in FIG. 7 differs from the example of the procedure in FIG.
[0055]
First, in step S 11, the user inputs a search sentence by text or voice via the input unit 101. Here, for the same purpose as described with reference to FIG. 2, the description will be focused on the case where the search sentence is input to the input unit 101 as text data.
[0056]
Next, in step S 12, the search unit 103 searches for data that matches the search sentence through the communication unit 108.
[0057]
Next, in step S 13, the search unit 103 presents a list of information related to the searched data on the display screen of the output unit 106. In this list, for example, information about search data is arranged in the order of scores obtained at the time of search.
[0058]
Next, in step S14, the input unit 101 receives an instruction from the user.
[0059]
Next, in step S15, if the instruction content input by the user via the input unit 101 in step S14 is an instruction to select desired data in the list presented in step S13, the process proceeds to step S16. If the instruction content indicates an end, the process of FIG. 7 ends.
[0060]
Next, in step S 16, the hit word detection unit 104 extracts hit words for the data selected by the user in step S 14.
[0061]
In step S17, the hit word is emphasized in the same way as in the procedure example of FIG. 2, and the data is output as voice, and the process returns to step S14.
[0062]
Note that the overall processing procedure of the information presenting apparatus 1 is not limited to the procedure examples of FIGS. 2 and 7 and can be implemented with various modifications.
[0063]
For example, instead of the procedure example of FIG. 7, hit word detection processing may be performed on all of the retrieved data following step S12 of FIG. 7, information on the data down to a predetermined rank may be presented in order of score so that the user can make a selection, and voice output processing may be performed on the data selected by the user. Alternatively, for example, hit word detection processing may be performed on the data from the first rank down to a predetermined rank following step S12 of FIG. 7, information about the retrieved data may be presented in order of score so that the user can make a selection, and voice output processing may be performed on the data selected by the user (after performing hit word detection processing if the selected data has not yet undergone it).
[0064]
As described above, according to the present embodiment, the user can grasp and confirm the reason why the information presenting apparatus presented the data by listening to the emphasized hit word. In addition, if a character string different from the search sentence is emphasized, it can be known that the search sentence input by voice is recognized erroneously, and the search can be performed again immediately.
[0065]
In this embodiment, data including a character string is targeted, but the present invention can also be applied to other data. For example, voice data to which keyword information is attached is searched based on the keyword information, the searched voice data is converted into character string data, and the character string data is handled in the same manner as the searched data described so far. By doing so, it is also possible to target audio data.
[0066]
(Second Embodiment)
In the present embodiment, an example of an information presentation apparatus having a voice search function for searching voice data will be described.
[0067]
FIG. 8 shows a configuration example of an information presentation apparatus according to the second embodiment of the present invention.
[0068]
Here, the data to be searched is audio content including an audio portion in which a character string is uttered, and hit word detection and voice output with hit word emphasis target the portions of the retrieved audio data that correspond to character strings (portions on which voice recognition processing can be performed). Of course, the present invention can also be applied to the audio data portion of content that includes video data accompanied by such audio data.
[0069]
The hit word is as described in the first embodiment. In the present embodiment, the hit word is described as a portion corresponding to a character string that matches the search word (character string) in the search sentence in the voice data.
[0070]
In FIG. 8, the information presentation apparatus 2 includes: an input unit 201 for inputting a search sentence; a voice recognition unit 202 that converts voice data into character string data; a voice search unit 203 that searches a voice data server 28, connected via a network 27, for voice data matching the search sentence; a hit word detection unit 204 that detects the portions corresponding to hit words in the retrieved voice data; a reproduction processing unit 205 that processes the retrieved voice data so that the portions corresponding to hit words are emphasized; an output unit 206 that outputs the voice data with the hit word portions emphasized; a document index database 208 that stores an index (document index) of the character string data obtained by converting the voice data; and a communication unit 209 that performs data communication via the network 27.
[0071]
The network 27 is, for example, a LAN, but is not limited to this, and may be the Internet or the like. The audio data may be based on any format.
[0072]
In the above description, the information presentation apparatus 2 accesses the voice data server 28, which has the database storing the voice data, via the network 27. Instead of or in addition to this, the information presentation apparatus 2 may be provided with a voice database 207 storing voice data, and the voice search unit 203 may search the voice database 207 for voice data that matches the search sentence.
[0073]
Further, when the voice data server 28 has a function of providing a document index, the document index provided by the voice data server 28 may be acquired and referenced at the time of voice data search, or the document index may be downloaded in advance to the document index database 208.
[0074]
In addition, this information presentation apparatus 2 can be configured as a dedicated machine or can be configured using a general-purpose computer. Further, for example, it can be realized by a mobile phone terminal having various functions.
[0075]
In FIG. 9, an example of the whole process sequence of this information presentation apparatus 2 is shown.
[0076]
Here, a voice search system is described as an example of the information presentation apparatus. The voice search system is a system that can search data recorded by, for example, a voice recorder. It can be used, for example, to search voicemail in a voicemail system, to search audio data recorded at a conference, to search audio data recorded from a radio broadcast, or to search the audio track of data recorded from a TV program. It is assumed that such voice data is stored in the voice data server 28 or the voice database 207.
[0077]
First, as necessary, the voice recognition unit 202 converts voice data (for all or part of voice data stored in the voice data server 28 or the voice database 207) into a set of character strings (that is, a document). (Step S301). The voice data is converted into character string data using a known voice recognition technique.
[0078]
A time stamp (for example, a start time and an end time) is attached to the character strings obtained by converting the voice data, and a document index is created that shows which part of the original voice data each word of the converted character string corresponds to.
[0079]
When converting voice data into character string data, the voice data may be divided into separate documents at positions where the utterance is interrupted for a certain period of time or longer (however, this process is not always necessary when the voice data is already divided by content, as with voicemail).
[0080]
Then, for each piece of audio data, a document index is accumulated in the document index database 208. The document index associates each character string of the documents (character string sets) created from the audio data with a time stamp indicating the position in the audio data that corresponds to that character string.
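The document index described above can be sketched in Python as a list of time-stamped words, split into documents wherever the utterance pauses for a long time. This is an illustrative sketch under assumptions: the recognizer output format, the type `IndexedWord`, the function `build_document_index`, and the 2-second pause threshold are hypothetical and not taken from the source.

```python
from typing import List, NamedTuple


class IndexedWord(NamedTuple):
    text: str         # recognized character string (may contain recognition errors)
    start_sec: float  # position in the original voice data where the word starts
    end_sec: float    # position in the original voice data where the word ends


def build_document_index(recognized: List[IndexedWord],
                         max_gap_sec: float = 2.0) -> List[List[IndexedWord]]:
    """Group time-stamped words into documents, splitting at long pauses."""
    documents: List[List[IndexedWord]] = []
    current: List[IndexedWord] = []
    for word in recognized:
        # A pause of max_gap_sec or longer starts a new document.
        if current and word.start_sec - current[-1].end_sec >= max_gap_sec:
            documents.append(current)
            current = []
        current.append(word)
    if current:
        documents.append(current)
    return documents
```

Because every word keeps its start and end time, a text match in a document can be mapped straight back to a position in the voice data.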
[0081]
Note that the speech recognition technology does not necessarily convert the speech into a character string completely without error, and may include errors. Therefore, if there is a misrecognition, it is stored in the document index database 208 with the error included.
[0082]
Next, a user who wants to search for voice data inputs a search sentence by text, voice, or the like via the input unit 201 (step S302). The input method is not limited as long as a search sentence in text form is finally obtained. For example, text input (keyboard input), voice input, or both may be supported.
[0083]
When performing voice input, the voice recognition unit 202 recognizes the user's voice and converts it into a character string (search sentence).
[0084]
Note that, regardless of which input method is used, the search sentence is given to the voice search unit 203 as a character string. Therefore, for ease of explanation, the following description focuses on the case where the search sentence is input as text data.
[0085]
Next, the voice search unit 203 uses the document index database 208 to search for a document (corresponding voice data) that matches the search sentence input by the user (step S303).
[0086]
A known method may be used as the voice search method. For example, a full-text search is performed on the document index database 208 based on the search sentence, whereby a suitable document can be searched, and voice data corresponding to the document can be found based on the time stamp. Further, for example, a matching document can be searched from a document set by a search technique such as TF / IDF. It is also possible to search for a document including an answer to a search sentence by QA search. The search result is, for example, a document list arranged in the order of score (see the first embodiment).
[0087]
Next, the hit word detection unit 204 uses the document index database 208 to detect hit words existing in the searched document. As a detection result, information indicating the character string of the detected hit word and the position where the character string appears in the original voice data is obtained (step S304). The appearance position is, for example, the start time and end time of a hit word indicating the position in the voice data obtained when the voice recognition unit 202 converts the voice data into a document. When there are a plurality of portions corresponding to the hit word in the voice data, all the partial voices are detected.
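Detecting hit words against a time-stamped document index can be sketched as follows. This is a hedged illustration, assuming each index entry is a `(text, start_sec, end_sec)` tuple and that a hit is an exact, case-insensitive match with a search word. The name `find_hit_times` is hypothetical.

```python
def find_hit_times(indexed_words, search_words):
    """Return (text, start_sec, end_sec) for every indexed word matching a search word."""
    targets = {word.lower() for word in search_words}
    return [
        (text, start_sec, end_sec)
        for text, start_sec, end_sec in indexed_words
        if text.lower() in targets
    ]
```

All matching portions are returned, so the reproduction side can play each hit word clip from its start time to its end time in sequence.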
[0088]
Next, the reproduction processing unit 205 first reproduces only the audio from the start time to the end time of the hit word for the audio data corresponding to the retrieved document having the first rank score (step S304). When there are a plurality of hit words, for example, the sound of the portion corresponding to the hit word is reproduced continuously.
[0089]
The user listens to the voice of the hit word and judges that the search was performed correctly if it contains the character string in the search sentence, and that the search was not performed correctly if it does not (step S305). If the search was performed correctly, the user gives an instruction to that effect, for example by saying “correct” or “play” or by pressing an “OK” button or a “play” button. If the search was not performed correctly, the user gives an instruction to that effect, for example by saying “different” or “next” or by pressing a “next” button.
[0090]
In step S305, when an instruction indicating that the search has been correctly performed is input from the user, the audio data corresponding to this document is reproduced (for example, from the top) (step S307).
[0091]
On the other hand, if in step S305 the user indicates that the search was not performed correctly (that is, the hit word the user heard is not a character string in the search sentence), the result is known to be wrong without listening to all of the voice data, so reproduction of the voice data is skipped (step S306) and the process returns to step S304. The same processing is then repeated for the voice data corresponding to the retrieved document having the next-ranked score.
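The confirm-then-play loop over ranked results can be sketched in Python as below. This is a minimal illustration under assumptions: `play_clip`, `play_full`, and `user_accepts` are hypothetical callbacks standing in for the reproduction processing and the user's instruction, and each result is a pair of an audio identifier and its hit word time ranges.

```python
def confirm_and_play(ranked_results, user_accepts, play_clip, play_full):
    """Play hit word clips in rank order until the user accepts one result.

    ranked_results: list of (audio_id, [(start_sec, end_sec), ...]), best score first.
    Returns the accepted audio_id, or None if every result was rejected.
    """
    for audio_id, hit_times in ranked_results:
        # Reproduce only the hit word portions first.
        for start_sec, end_sec in hit_times:
            play_clip(audio_id, start_sec, end_sec)
        if user_accepts(audio_id):
            play_full(audio_id)  # search judged correct: play the whole data
            return audio_id
        # Otherwise skip full reproduction and move on to the next rank.
    return None
```

The point of the loop is that a rejected result costs the user only the length of its hit word clips, not the length of the whole recording.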
[0092]
Here, the hit word has been described as a portion corresponding to a character string that matches a search word in the search sentence. However, as described in the first embodiment, the hit word may also be a character string belonging to a subordinate concept of a search word in the search sentence, a character string belonging to a superordinate concept of a search word, a character string similar to a search word, a character string formed as any combination of these, or, in the case of a QA search, the answer portion to the search sentence. Even in these cases, when the user listens to the hit word and a word that does not fit the search sentence is reproduced, the user can easily determine that the search result is incorrect.
[0093]
Thus, the user can promptly detect a search error and instruct to reproduce the next audio data without listening to all of the audio data.
[0094]
Instead of the above-described procedure example, various other procedures are possible. For example, for the voice data from the first rank down to a predetermined rank among the retrieved voice data, only the voice from the start time to the end time of each hit word may be reproduced in sequence, and the voice data then selected by the user may be reproduced from the beginning.
[0095]
Here, as in the first embodiment, if a hit word appears during the reproduction of the audio data corresponding to the searched document, the hit word may be emphasized and reproduced. For highlighting hit words, the same configuration examples and various variations as those described in the first embodiment can be used.
[0096]
Further, in the above description, processing is performed using a search sentence given as a character string and the document index database 208, in which the voice data has been converted into documents (that is, by comparing character strings rather than voice data). However, a configuration is also possible in which the search process, the hit word detection process, and the like are performed by comparing the voice-input search sentence (or speech synthesized from a search sentence input from the keyboard) with the voice data to be searched, either as speech waveforms or in an intermediate language such as phonemes, syllables, or phonetic symbols.
[0097]
As described above, according to the present embodiment, the user can grasp and confirm the reason why the information presentation apparatus presented the data by listening to the hit word. In addition, when a character string different from the search sentence is emphasized as a hit word, the user can tell that a misrecognition occurred in the voice recognition used to create the document index from the voice data, or that the search sentence input by voice was recognized erroneously, and can immediately move on to reproduction of the voice data with the next-ranked score or immediately redo the search.
[0098]
(Third embodiment)
In the present embodiment, an example of an information presentation device having a video search function for searching video data will be described.
[0099]
In FIG. 10, the structural example of the information presentation apparatus which concerns on the 3rd Embodiment of this invention is shown.
[0100]
Here, the video data to be searched is accompanied by audio data including an audio portion in which a character string is uttered, and hit word detection and voice output with hit word emphasis target the portions of the audio data accompanying the retrieved video data that correspond to character strings (portions on which voice recognition processing can be performed). Of course, the present invention can also be applied to content consisting only of such audio data (without video).
[0101]
The hit word is as described in the first embodiment. In the present embodiment, the hit word is described as a portion corresponding to a character string that matches the search word (character string) in the search sentence in the voice data.
[0102]
In FIG. 10, the information presentation apparatus 3 includes: an information input/output unit 301 for inputting a search sentence and various instructions and for outputting search results; a data search unit 303 that searches a video data server 38, connected via a network 37, for video data matching the search sentence; a hit word detection unit 304 that detects the portions corresponding to hit words in the retrieved video data; a reproduction processing unit 305 that processes the retrieved video data so that the portions corresponding to hit words are emphasized; and a communication unit 309 that performs data communication via the network 37.
[0103]
The information input / output unit 301 is a user interface using a GUI screen.
[0104]
The network 37 is a LAN, for example, but is not limited to this, and may be the Internet or the like. The video data may be based on any format.
[0105]
In the above description, the information presentation apparatus 3 accesses the video data server 38, which has the database storing the video data, via the network 37. Instead of or in addition to this, however, a video database (video DB) 307 that stores video data may be built into the information presentation apparatus 3, and the data search unit 303 may search the video database 307 for video data that matches the search sentence. In this case, the communication unit 309 is not always necessary.
[0106]
In addition, this information presentation apparatus 3 can be comprised as a special purpose machine, or can be comprised using a general purpose computer. Further, for example, it can be realized by a mobile phone terminal having various functions.
[0107]
In FIG. 11, an example of the whole process sequence of this information presentation apparatus 3 is shown.
[0108]
Hereinafter, a case where the TV program information is searched by the information presentation device 3 will be described as a specific example.
[0109]
The TV program information search of this specific example assumes a case where TV news is searched for each article, and the news that the user wants to see can be searched.
[0110]
The video DB 307 is a database of a video recording device such as a hard disk recorder, and includes video and audio. For example, news is recorded, and the news is divided and stored for each article. In this specific example, it is assumed that the information presentation apparatus 3 presents a video that matches the user's search text from the video DB 307 to the user.
[0111]
First, in step S21, the user inputs a search sentence by text, voice, or the like via the information input/output unit 301. As in the first and second embodiments, the input method is not limited as long as a search sentence in text form is finally obtained. For example, only text input (keyboard input) may be possible, only voice input may be possible, or both may be possible. When voice input is performed, the information input/output unit 301 may have a voice recognition unit (not shown) that recognizes the user's voice and converts it into a character string (search sentence).
[0112]
Whichever input method is used, the search sentence is given to the data search unit 303 as a character string. Therefore, to simplify the description, the following explanation focuses on the case where the search sentence is input as text data.
[0113]
Next, the data search unit 303 searches the video DB 307 for video data that matches the search text (step S22). For this search, for example, a well-known technique for performing a voice search using an audio track of video data can be used (for example, Katsunobu Ito, Satoshi Fujii, Tetsuya Ishikawa, “On-demand lecture system using voice document search”, IPSJ Research Report, 2001-SLP-39, pp.165-170, Dec. 2001).
[0114]
Next, the hit word detection unit 304 detects a hit word from the searched video data (step S23). For example, the method described in the second embodiment may be used to detect the hit word. That is, for example, a document index may be used, or it may be directly detected by an intermediate language such as a speech waveform, phoneme, syllable, or phonetic symbol.
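As a rough illustration of the document-index variant of step S23, the following Python sketch assumes that speech recognition has already produced a time-aligned transcript, that is, a list of (word, start, end) entries. The transcript format, the function name `detect_hit_words`, and the use of exact string matching are assumptions made for this example and are not specified in this description.

```python
def detect_hit_words(transcript, search_words):
    """Return (word, start_sec, end_sec) entries whose word matches a search word.

    transcript: list of (word, start_sec, end_sec) tuples (assumed format).
    search_words: iterable of search words taken from the search sentence.
    """
    targets = set(search_words)
    return [entry for entry in transcript if entry[0] in targets]


# Hypothetical transcript of one news article.
transcript = [
    ("typhoon", 3.2, 3.9),
    ("approaches", 3.9, 4.6),
    ("Tokyo", 4.6, 5.1),
    ("typhoon", 20.0, 20.7),
]
hits = detect_hit_words(transcript, ["typhoon"])
```

Each returned entry carries the time span of the matching audio portion, which is the information needed to play back only the hit word.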
[0115]
Next, a search result screen is displayed (step S24).
[0116]
FIG. 12 shows an example of the search result screen.
[0117]
In the example of FIG. 12, the retrieved data is displayed as thumbnails (reference numeral 403 in the figure) arranged in order of ranking (reference numeral 401 in the figure). In the example of FIG. 12, since the data ranked third is audio content without video, its thumbnail is blank. In addition, various other information regarding each content may be presented on the screen of FIG. 12. Examples of such information include information indicating the channel or broadcasting station on which the content was broadcast, information indicating whether it is a terrestrial or satellite broadcast, information indicating whether it is a television broadcast or a radio broadcast, information indicating the broadcast date and time, and the address where the data is stored.
[0118]
Next, in step S25, the information input / output unit 301 receives an instruction from the user.
[0119]
Next, in step S26, if the instruction input by the user via the information input / output unit 301 in step S25 is an instruction to reproduce the hit word of desired data on the search result screen displayed in step S24, the process proceeds to step S27; if it is an instruction to reproduce the desired data itself, the process proceeds to step S28. If the instruction indicates an end, the process of FIG. 7 ends.
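The branching of steps S25 and S26 can be pictured with a minimal Python sketch. The enum `Instruction`, the function names, and the string labels for the steps are hypothetical names introduced only for this illustration; the actual reproduction performed in steps S27 and S28 is omitted.

```python
from enum import Enum


class Instruction(Enum):
    PLAY_HIT_WORD = "hit_word"  # user clicked a [Hit word] button
    PLAY_DATA = "data"          # user clicked a thumbnail
    END = "end"                 # user asked to finish


def next_step(instruction):
    """Step S26: choose the next step from the user's instruction."""
    if instruction is Instruction.PLAY_HIT_WORD:
        return "S27"
    if instruction is Instruction.PLAY_DATA:
        return "S28"
    return "END"


def run_session(instructions):
    """Repeat S25 (receive instruction) and S26 (branch) until an end instruction."""
    visited = []
    for instruction in instructions:
        step = next_step(instruction)
        visited.append(step)
        if step == "END":
            break  # no further instructions are processed
    return visited
```

After S27 or S28 the loop simply takes the next instruction, which mirrors the return to step S25 in the flowchart.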
[0120]
Here, the processing in step S27 will be described in detail.
[0121]
In the search result screen of FIG. 12 displayed in the previous step S24, there is a [Hit word] button (reference numeral 402 in the figure) alongside each item of data. When the user clicks the hit word button for the desired data with a mouse or the like in step S25, the process moves from step S26 to S27, the hit word reproduction screen shown in FIG. 13 appears, and the hit word portion is played.
[0122]
The hit word playback screen of FIG. 13 includes an area 501 for displaying video corresponding to the sound of the hit word, a search statement 502, a timeline 503 corresponding to the hit word, and a pointer 504 for displaying the current playback position. When there are a plurality of hit words, the timeline 503 corresponds to a combination of a plurality of hit words.
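One way to read "a combination of a plurality of hit words" is that the hit word segments are laid end to end on the timeline 503. The Python sketch below follows that reading. The (start, end) segment format in seconds and the function names are assumptions for this example only.

```python
def build_hit_timeline(segments):
    """Lay hit word segments end to end.

    segments: list of (start_sec, end_sec) positions in the source content.
    Returns (timeline, total_length), where timeline is a list of
    (offset_on_timeline, start_sec, end_sec).
    """
    timeline = []
    offset = 0.0
    for start, end in sorted(segments):
        if end <= start:
            continue  # skip empty or inverted segments
        timeline.append((offset, start, end))
        offset += end - start
    return timeline, offset


def to_source_time(timeline, pointer):
    """Map a pointer position on the combined timeline to a time in the source content."""
    for offset, start, end in timeline:
        if offset <= pointer < offset + (end - start):
            return start + (pointer - offset)
    raise ValueError("pointer is outside the timeline")


timeline, total = build_hit_timeline([(10.0, 12.0), (40.0, 41.5)])
```

With these two hypothetical hit words, the combined timeline is 3.5 seconds long, and a pointer at 2.5 seconds falls inside the second hit word.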
[0123]
As shown in FIG. 14, a button may be provided for each type of hit word. In this case, a screen corresponding to the type of hit word clicked by the user is displayed. If there are a plurality of hit words of that type, the timeline 503 corresponds to a combination of those hit words.
[0124]
The user can easily determine whether the search result is correct or not by listening to the hit word on such a screen. If a word different from the search sentence is reproduced as a hit word, it is understood that it is not necessary to listen to all of the voice data.
[0125]
Here, the hit word has been described as the portion corresponding to a character string that matches a search word in the search sentence. However, as described in the first embodiment, the hit word may also be a character string belonging to a subordinate concept of a search word in the search sentence, a character string belonging to a superordinate concept of a search word in the search sentence, a character string similar to a search word in the search sentence, a character string formed as any combination of these, or the answer portion to the search sentence in the case of a QA search. In these cases as well, when the user listens to the hit word and a word that does not fit the search sentence is reproduced, the user can easily determine that the search result is incorrect.
[0126]
In the example of FIG. 13, there are an [OK] button 505 and an [NG] button 506 below the time bar to be reproduced.
[0127]
If the user listens to the playback of the hit word and the same word as in the search sentence is played, the user presses the [OK] button 505. On the other hand, if a word different from the search sentence is reproduced as the hit word, the user presses the [NG] button 506.
[0128]
When the [OK] button 505 is pressed, the character string of the query corresponding to this hit word is added to this voice data together with a positive weight (+ a).
[0129]
When the [NG] button 506 is pressed, a query character string is added to this voice data together with a negative weight (-b).
[0130]
Only the positive weight (+a) may be added, or only the negative weight (−b) may be added.
[0131]
This additional character string is used in subsequent searches.
[0132]
For example, suppose that the hit word is reproduced for voice data retrieved with the search character string "OO". If the hit word is reproduced correctly and the user presses the [OK] button, it is known that this part of the voice data contains the voice "OO". For this reason, the character string "OO" is added to the voice data with a positive weight. On the other hand, if "XX" is played instead of "OO" when the hit word is played back and the user presses the [NG] button, the character string "OO" is added to this voice data with a negative weight. From the next time on, if the search sentence contains the character string "OO", voice data having "OO" with a positive weight in its additional character string is retrieved preferentially (or its score is raised), whereas voice data having "OO" with a negative weight in its additional character string is not retrieved (or its score is lowered).
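The bookkeeping for the [OK] and [NG] buttons could be sketched in Python as follows. The class name `FeedbackStore`, the in-memory dictionary, the default values of a and b, and the choice that a later press overwrites an earlier one are all assumptions for illustration; the description only states that the query character string is added to the voice data with a weight of +a or −b.

```python
from collections import defaultdict


class FeedbackStore:
    """Keeps additional character strings with weights for each item of voice data."""

    def __init__(self, a=1.0, b=1.0):
        self.a = a  # positive weight (+a), value assumed
        self.b = b  # magnitude of the negative weight (-b), value assumed
        self._weights = defaultdict(dict)  # content_id -> {query: weight}

    def record(self, content_id, query, ok):
        """Store +a for an [OK] press and -b for an [NG] press."""
        self._weights[content_id][query] = self.a if ok else -self.b

    def weight(self, content_id, query):
        """Return the stored weight, or 0.0 when no feedback exists."""
        return self._weights.get(content_id, {}).get(query, 0.0)


store = FeedbackStore(a=1.0, b=0.5)
store.record("news-001", "OO", ok=True)   # hit word was reproduced correctly
store.record("news-002", "OO", ok=False)  # a different word was reproduced
```

A later search for "OO" can then look up `store.weight(content_id, "OO")` and raise or lower the ranking of each item accordingly.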
[0133]
The score considering the additional character string is, for example, as follows.
[0134]
Let Sa be the similarity (score) between the current search sentence (search character string) and the additional character string added to the voice data (for example, +a or −b described above), and let Sd be the similarity (score) between the new search sentence and the voice data. The final score S is then
S = αSd + (1−α)Sa
where α is an appropriate number between 0 and 1.
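As a small worked illustration of the formula S = αSd + (1−α)Sa, the Python sketch below computes the final score and ranks candidates by it. The function names, the (content_id, Sd, Sa) tuple layout, and the sample values are assumptions for this example.

```python
def final_score(sd, sa, alpha=0.5):
    """S = alpha * Sd + (1 - alpha) * Sa, with alpha between 0 and 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * sd + (1.0 - alpha) * sa


def rank(candidates, alpha=0.5):
    """Sort (content_id, sd, sa) tuples by final score, highest first."""
    return sorted(
        candidates,
        key=lambda c: final_score(c[1], c[2], alpha),
        reverse=True,
    )


# Hypothetical candidates: Sa is +1.0 after [OK], -1.0 after [NG], 0.0 without feedback.
candidates = [("a", 0.8, -1.0), ("b", 0.6, 1.0), ("c", 0.7, 0.0)]
ordered = [content_id for content_id, _, _ in rank(candidates, alpha=0.5)]
```

In this example, item "a" has the highest Sd but drops to last place because of its negative feedback, while item "b" rises to the top.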
[0135]
The similarity Sd between the new search sentence and the voice data can be obtained from the search character string representing the new search sentence and a character string document obtained by converting the voice data into a character string. Alternatively, Sd can be obtained by comparing the new search sentence with the target voice data through an intermediate representation such as a speech waveform, phonemes, syllables, or phonetic symbols.
[0136]
When the search character string hits the additional character string, a mark 404 (a star mark in this example) may be added to the head of the title, as illustrated in FIG. 12, so that the user can see that the additional character string was hit.
[0137]
When the process of step S27 is completed, the process returns to step S25.
[0138]
Next, the processing in step S28 will be described in detail.
[0139]
In the search result screen of FIG. 12 displayed in the previous step S24, there is a thumbnail (reference numeral 403 in the figure) alongside each item of data. When the user clicks the thumbnail for the desired data with a mouse or the like in step S25, the process proceeds from step S26 to S28, the data reproduction screen shown in FIG. 15 appears, and the data specified by the user (video data or audio data) is played.
[0140]
In the data reproduction screen of FIG. 15, there are an area 601 for reproducing a video, a search statement 602, a timeline 603 corresponding to the entire content, and a pointer 604 for displaying the current reproduction position. In FIG. 15, a mark 607 is added to the timeline 603 at the position where the hit word appears. As a result, it can be determined at which position the hit word appears.
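The placement of the marks 607 on the timeline 603 amounts to scaling each hit word time by the length of the content. A minimal Python sketch follows; the pixel-based timeline width, the function name, and the decision to drop times outside the content are assumptions for illustration.

```python
def mark_positions(hit_times, duration, width_px):
    """Return the horizontal pixel position of each hit word mark.

    hit_times: start times of the hit words in seconds.
    duration: total length of the content in seconds.
    width_px: width of the timeline in pixels (assumed UI parameter).
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [
        round(t / duration * width_px)
        for t in hit_times
        if 0 <= t <= duration  # ignore times outside the content
    ]


positions = mark_positions([0, 30, 60], duration=120, width_px=400)
```

The same scaling applies to the pointer 604, with the current playback time in place of a hit word time.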
[0141]
As shown in FIG. 16, the hit word and the text before and after it may be displayed as a character string in the vicinity of each mark 607 (in FIG. 16, XX represents the hit word and ... represents the text before and after it).
[0142]
When the process of step S28 is completed, the process returns to step S25.
[0143]
In this embodiment, as in the first embodiment, if a hit word appears during the reproduction of the audio data corresponding to the searched document, the portion may be reproduced with emphasis. For highlighting hit words, the same configuration examples and various variations as those described in the first embodiment can be used.
[0144]
Conventionally, when the speech recognition of the system was wrong, the user found out that the search term was absent only after all of the retrieved data had been reproduced; that is, an erroneous search result could not be recognized as erroneous unless it was reproduced in full. With the hit word emphasis function of this system, a system error can be noticed in advance without reproducing all of the data, so the time required to find the target data is greatly shortened.
[0145]
Each of the above functions can also be realized by describing it as software and having it processed by a computer with an appropriate mechanism.
The present embodiment can also be implemented as a program for causing a computer to execute predetermined means, causing a computer to function as predetermined means, or causing a computer to realize predetermined functions. In addition, the present invention can be implemented as a computer-readable recording medium on which the program is recorded.
[0146]
Note that the present invention is not limited to the above-described embodiment as it is, and can be embodied by modifying the constituent elements without departing from the scope of the invention in the implementation stage. In addition, various inventions can be formed by appropriately combining a plurality of components disclosed in the embodiment. For example, some components may be deleted from all the components shown in the embodiment. Furthermore, constituent elements over different embodiments may be appropriately combined.
[0147]
[Effect of the invention]
According to the present invention, it is possible to present information more effectively by voice.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of an information presentation apparatus according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of a processing procedure of the information presentation apparatus according to the embodiment.
FIG. 3 is a diagram showing an example of hit word information.
FIG. 4 is a diagram showing another example of hit word information.
FIG. 5 is a diagram showing still another example of hit word information.
FIG. 6 is a diagram showing still another example of hit word information.
FIG. 7 is a flowchart showing another example of the processing procedure of the information presentation apparatus according to the embodiment.
FIG. 8 is a diagram showing a configuration example of an information presentation device according to a second embodiment of the present invention.
FIG. 9 is a flowchart showing an example of a processing procedure of the information presentation apparatus according to the embodiment.
FIG. 10 is a diagram showing a configuration example of an information presentation apparatus according to a third embodiment of the present invention.
FIG. 11 is a flowchart showing an example of a processing procedure of the information presentation apparatus according to the embodiment.
FIG. 12 is a diagram showing an example of a search result screen.
FIG. 13 shows an example of a hit word playback screen.
FIG. 14 is a diagram showing another example of the search result screen.
FIG. 15 shows an example of a data playback screen.
FIG. 16 is a diagram showing another example of the data reproduction screen.
[Explanation of symbols]
1, 2, 3 ... Information presentation device, 8 ... Data server, 7, 27, 37 ... Network, 18 ... Audio data server, 28 ... Image data server, 101, 201 ... Input unit, 103 ... Search unit, 104, 204 , 304 ... hit word detection section, 105 ... data processing section, 106, 206 ... output section, 107 ... speech synthesis section, 108, 209, 309 ... communication section, 109 ... database, 202 ... voice recognition section, 203 ... voice search , 205, 305... Reproduction processing unit, 207... Audio database, 208... Document index, 301.

Claims

An input means for inputting a search sentence;
Based on the search sentence, an acquisition means for searching a database storing content including audio data or video data accompanied by audio data and acquiring the content;
Detecting means for detecting, as a hit word, an audio data portion in which a character string matching the search sentence is recognized, from the audio data included in the acquired content; and
User interface means for displaying, on a display screen, a first image for designating reproduction of the audio data, or the video data accompanied by audio data, included in the content, and a second image for designating reproduction of the audio data portion detected as the hit word from the audio data by the detecting means, for reproducing the audio data, or the video data accompanied by audio data, included in the content when the first image is selected by the user, and for reproducing the audio data portion detected as the hit word when the second image is selected by the user,
When the second image is selected, the user interface means displays on the display screen a third image for inputting a positive evaluation and a fourth image for inputting a negative evaluation of the reproduction result of the audio data portion detected as the hit word; when the third image is selected by the user, a positive evaluation is stored in association with the content, and when the fourth image is selected by the user, a negative evaluation is stored in association with the content,
The acquisition means sorts the acquired content according to a search score calculated for the content and, if a positive evaluation is stored in association with the content, increases the search score for the content or moves the content to the top of the sort order, and, if a negative evaluation is stored, reduces the search score for the content or does not acquire the content. An information presenting apparatus characterized by the above.

An information presentation method for an information presentation apparatus comprising search text input means, acquisition means, detection means, and user interface means,
A step in which the search text input means inputs a search sentence;
A step in which the acquisition means searches a database storing content including audio data or video data accompanied by audio data based on the search sentence, and acquires the content;
A step in which the detecting means detects, as a hit word, an audio data portion in which a character string matching the search sentence is recognized, from the audio data included in the acquired content; and
A step in which the user interface means displays, on a display screen, a first image for designating reproduction of the audio data, or the video data accompanied by audio data, included in the content, and a second image for designating reproduction of the audio data portion detected as the hit word from the audio data by the detecting means, reproduces the audio data, or the video data accompanied by audio data, included in the content when the first image is selected by the user, and reproduces the audio data portion detected as the hit word when the second image is selected by the user,
When the second image is selected, the user interface means displays on the display screen a third image for inputting a positive evaluation and a fourth image for inputting a negative evaluation of the reproduction result of the audio data portion detected as the hit word; when the third image is selected by the user, a positive evaluation is stored in association with the content, and when the fourth image is selected by the user, a negative evaluation is stored in association with the content,
The acquisition means sorts the acquired content according to a search score calculated for the content and, if a positive evaluation is stored in association with the content, increases the search score for the content or moves the content to the top of the sort order, and, if a negative evaluation is stored, reduces the search score for the content or does not acquire the content. An information presentation method characterized by the above.