JP4631251B2

JP4631251B2 - Media search device and media search program

Info

Publication number: JP4631251B2
Application number: JP2003127927A
Authority: JP
Inventors: 孝文越仲; 健一磯
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-05-06
Filing date: 2003-05-06
Publication date: 2011-02-16
Anticipated expiration: 2023-05-06
Also published as: JP2004333737A

Abstract

PROBLEM TO BE SOLVED: To realize high-precision retrieval even for media data on which speech recognition is difficult.

SOLUTION: A speech recognizing means 105 performs speech recognition on secondary media data created by editing primary media data that includes data of one or more kinds of media, and the correspondence between the speech recognition result character string and time in the secondary media data is stored in a recognition result storage means 101. A secondary media data retrieving means 107 collates a retrieval key character string, input through a retrieval key input means 108, with the speech recognition result character string to identify a section in the secondary media data that matches the retrieval key character string. A primary media data retrieving means 106 outputs the primary media data corresponding to the identified section by referring to the correspondence between the primary media data and the secondary media data stored in a link information storage means 103.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、メディア検索装置およびメディア検索プログラムに関し、特に、音声、画像、映像等のメディアのデータにより構成される多数のメディアデータから、所望のメディアデータを検索して提示するメディア検索装置およびメディア検索プログラムに関する。
【０００２】
【従来の技術】
従来、この種のメディア検索装置として、例えば、特許文献１にあるような放送番組記録技術が知られている。この技術は、放送番組の音声信号を音声認識し、音声認識された音声信号を文字データに変換して記録・蓄積させることで、記録した放送番組をユーザがキーワードで検索できるものである。すなわち、検索対象である全メディアデータに含まれる音声信号に対して音声認識を行い、各メディアデータの各時刻に発声があるかどうか、またある場合はどのような発声があるかを予め全て記録しておく。検索時には、音声認識の結果で得られる文字列から、ユーザが入力する検索キー文字列とマッチングする部分を探し出し、その部分に対応するメディアデータの対応する時刻を出力するものである。
【０００３】
以下、従来技術について、図面を参照して説明する。図８は、従来の技術に基づくメディア検索装置のブロック図である。図８において、検索対象となるメディアデータを格納するためのメディアデータ格納手段９０２と、メディアデータ格納手段９０２に格納された各メディアデータから音声信号を抽出して音声区間検出を行い、検出された各音声区間に対して音声認識を行う音声認識手段９０３と、音声認識手段９０３が出力する各音声区間の音声認識結果文字列、および音声認識結果文字列とメディアデータの時間的対応関係（例えば、音声認識結果文字列中の各単語の始終端時刻）を格納する認識結果格納手段９０１と、検索を行おうとする者がキーワード等の検索キーを入力する検索キー入力手段９０５と、検索キー入力手段９０５から入力された検索キーと認識結果格納手段９０１に格納された認識結果文字列とのマッチングを行い、検索キーと一致した箇所に対応するメディアデータをメディアデータ格納手段９０２から選択して出力するメディアデータ検索手段９０４を備える。
【０００４】
認識結果格納手段９０１に格納される音声認識結果は、例えば以下のような形式で示される。
【０００５】
メディアデータ１：｛Ｗ（１，１），Ｔ（１，１），Ｄ（１，１）｝，｛Ｗ（１，２），Ｔ（１，２），Ｄ（１，２）｝，…
メディアデータ２：｛Ｗ（２，１），Ｔ（２，１），Ｄ（２，１）｝，｛Ｗ（２，２），Ｔ（２，２），Ｄ（２，２）｝，…
：
：
メディアデータＮ：｛Ｗ（Ｎ，１），Ｔ（Ｎ，１），Ｄ（Ｎ，１）｝，｛Ｗ（Ｎ，２），Ｔ（Ｎ，２），Ｄ（Ｎ，２）｝，…
【０００６】
ここで、Ｗ（ｉ，１）、Ｗ（ｉ，２）、…は、メディアデータｉに対する音声認識の結果として得られる単語列である。また、Ｔ（ｉ，ｊ）、Ｄ（ｉ，ｊ）は、それぞれメディアデータｉ内での単語Ｗ（ｉ，ｊ）の始端時刻、継続時間長である。音声認識結果｛Ｗ（ｉ，ｊ），Ｔ（ｉ，ｊ），Ｄ（ｉ，ｊ）｝の作成は、原則として各メディアデータに対して一度だけ行われ、検索前に完了しているものとする。検索キー入力手段９０５より検索キーとして単語Ｖが入力されたとすると、メディアデータ検索手段９０４は、単語Ｗ（ｉ，ｊ）と単語Ｖとを比較し、Ｗ（ｉ，ｊ）とＶが一致するようなすべてのｉについて、メディアデータｉの中の区間Ｔ（ｉ，ｊ）〜Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）付近を検索結果として返すことによりメディアデータの検索を実現している。
【０００７】
【特許文献１】
特開２００１−３０９２８２号公報（第１図）
【特許文献２】
特開２０００−２７０２６３号公報
【０００８】
【発明が解決しようとする課題】
一般に、音声認識を用いると、静かな環境で書き言葉調により発話された音声は、比較的精度よく認識される。しかし、背景雑音が顕著な場合や、自由な話し言葉で話される場合の発話は、正確に認識されず、認識結果には誤りを多く含むものとなる。このような誤りを多く含む認識結果に対して検索を行っても、高精度な検索を実現できないというのは自明である。また、音声を全く含まないようなメディアデータ、例えば自然風景のみを撮影した映像の検索では、音声認識を行うことが原理的に不可能である。したがって、従来のメディア検索装置では、大きい背景雑音の重畳した音声、または自由な話し言葉調で発話された音声を含むメディアデータ、あるいは音声を含まないメディアデータを十分な精度で検索することができない。
【０００９】
本発明は、このような課題を解決するために、音声認識が困難なメディアデータに対しても、高精度な検索が実現できるような技術を提供することを目的とする。
【００１０】
前記目的を達成するために、本発明のメディア検索装置は、第１の視点によれば、１種類以上のメディアのデータを含む第１のメディアデータを格納する一次メディアデータ格納手段と、一次メディアデータ格納手段内の複数の第１のメディアデータの一部または全部の区間を含むと共に第１のメディアデータと比較して認識の容易な第２のメディアデータ上のある区間が第１のメディアデータ上のどの区間に対応して配置されているかを示すリンク情報を格納するリンク情報格納手段と、第２のメディアデータの各区間を検索対象となる文字に対応させて認識し、第２のメディアデータの所要区間を文字の組となる文字列で表して格納する認識結果格納手段と、認識結果格納手段に格納された文字列中の部分と検索のために入力された文字列とが一致する第２のメディアデータの区間を特定する二次メディアデータ検索手段と、リンク情報格納手段が持つリンク情報に従って特定された区間に対応する第１のメディアデータまたは第１のメディアデータ上の区間を出力する一次メディアデータ検索手段とを備える構成とされる。
【００１１】
［発明の概要］
本発明の原理・作用について説明する。まず、１種類以上のメディアのデータを含むメディアデータを２種類のクラスに分類する。その上で、音声認識が比較的容易な一方のメディアデータを検索した結果を利用して、音声認識が困難な他方のメディアデータを再度検索することにより、音声認識が困難なメディアデータを検索する。ここで２種類のクラスとは、音声認識が一般に困難な任意のメディアデータを含む第１のクラス、および、第１のクラスに含まれるメディアデータを加工したり、組み合わせたりする編集作業によって二次的に生成されたメディアデータである第２のクラスである。第１のクラスに含まれるメディアデータを一次メディアデータ（第１のメディアデータ）、第２のクラスに含まれるメディアデータを二次メディアデータ（第２のメディアデータ）と呼ぶこととする。
【００１２】
上述のようなメディアデータの２クラスへの分類は、常に可能というわけではないが、可能なケースが現実に多く存在する。例えば、放送業務における番組制作作業においては、取材によって得られた多数の映像データを編集して番組に埋め込み、ナレーターや番組出演者の発声を加えることにより、１本の番組データが作られる。この場合、取材で得られた映像データが一次メディアデータで、編集作業によって完成された番組が二次メディアデータに相当する。あるいは、一般の教育、研修等に使われるビデオ教材においても、種々の映像や音楽とナレーターの読上げ発声とを組み合わせて１本のビデオ教材が作成される。ここでも一次メディアデータと二次メディアデータが存在する。
【００１３】
一次メディアデータと二次メディアデータの関係は、以下の通りである。二次メディアデータは、通常複数個の一次メディアデータを内部に含む。すなわち、二次メディアデータと一次メディアデータとは一対多の関係にある。また、一次メディアデータは、一般に任意の映像、音声を含むために音声認識が困難という性質を持つのに対し、二次メディアデータは、訓練された話し手によるナレーション等が含まれているために音声認識が比較的容易である。さらに、訓練された話し手によるナレーション等は、二次メディアデータ内部に含まれる一次メディアデータに対する説明のような、一次メディアデータと関連のある内容を含んでいる場合が多い。
【００１４】
そこで、本発明におけるメディア検索装置およびメディア検索プログラムは、二次メディアデータと一次メディアデータの間の一対多関係、すなわち、一次メディアデータが二次メディアデータのどこに埋め込まれているかを表す対応関係を記憶しておく。また、検索に際しては、音声認識が比較的容易な二次メディアデータをまず検索し、二次メディアデータ上での所望のメディアデータの存在位置を特定する。さらに、二次メディアデータと一次メディアデータの対応関係をたどって、所望の一次メディアデータを見つけ出すことにより、音声認識が困難なメディアデータの検索を実現することができる。
【００１５】
【発明の実施の形態】
次に、本発明の実施の形態について、図面を参照して詳細に説明する。
【００１６】
［第１の実施形態］
図２は、本発明の第１の実施形態に係るメディア検索装置のブロック図である。図２において、メディア検索装置は、一次メディアデータ格納手段１０４と、二次メディアデータ格納手段１０２と、リンク情報格納手段１０３と、音声認識手段１０５と、認識結果格納手段１０１と、検索キー入力手段１０８と、二次メディアデータ検索手段１０７と、一次メディアデータ検索手段１０６とを備える。
【００１７】
一次メディアデータ格納手段１０４は、任意かつ多数のメディアデータ、すなわち一次メディアデータを格納する。二次メディアデータ格納手段１０２は、一次メディアデータを編集し組み合わせて作成された多数の二次的なメディアデータ、すなわち二次メディアデータを格納する。リンク情報格納手段１０３は、一次メディアデータが二次メディアデータのどの位置に使われているかを示すリンク情報を格納する。音声認識手段１０５は、二次メディアデータに対して音声認識を行い、二次メディアデータの各時刻における音声認識結果を文字列として出力する。認識結果格納手段１０１は、音声認識手段１０５が出力する音声認識結果文字列を二次メディアデータの時刻と対応付けて格納する。検索キー入力手段１０８は、検索のための検索キーの入力を受け付ける。二次メディアデータ検索手段１０７は、認識結果格納手段１０１に格納された認識結果文字列と検索キーとのマッチングを行い、検索キーを含む二次メディアデータおよび二次メディアデータ内で検索キーが現れる位置を特定する。一次メディアデータ検索手段１０６は、二次メディアデータ検索手段１０７が特定した二次メディアデータおよび二次メディアデータ内部の位置を入力として、リンク情報格納手段１０３が持つリンク情報に従って、入力に対応する一次メディアデータおよび一次メディアデータ内部の位置を算出し出力する。なお、各々の手段は、それぞれ計算機上に記憶されたプログラムとして動作させることによっても実現可能である。
【００１８】
次に、第１の実施形態に係るメディア検索装置の動作について、順を追って説明する。
【００１９】
一次メディアデータ格納手段１０４には、検索対象となる任意のメディアデータ、すなわち一次メディアデータが多数格納されている。一次メディアデータの形式は、音声、映像、音声を伴う映像、図面や写真等の静止画像等々、任意である。二次メディアデータ格納手段１０２には、一次メディアデータ格納手段１０４に格納された一次メディアデータのうちのいくつかを何らかの形で含んだメディアデータが多数格納されている。二次メディアデータの形式は、音声、あるいは音声を伴う映像である。
【００２０】
一次メディアデータが二次メディアデータの中にどのような形態で含まれるかについては、種々のバリエーションがあり得る。もっとも単純なケースは、ある一次メディアデータの全体もしくは一部分が、二次メディアデータの一部に埋め込まれた形で、単独で存在する場合である。単独で存在しないケースとは、一次メディアデータに重畳して二次メディアデータ固有のナレーションや字幕が加わる場合、あるいは、映像に背景音楽（ＢＧＭ）が重畳するというような、ある一次メディアデータに別の一次メディアデータが重畳する場合である。さらにはこれらの複合した形態もあり得る。
【００２１】
ただし、上述したいずれのケースでも、一次メディアデータが二次メディアデータのどの位置に使われているかという対応関係は、定量的なデータとして保持できる。リンク情報格納手段１０３は、一次メディアデータ格納手段１０４に格納された一次メディアデータと、二次メディアデータ格納手段１０２に格納された二次メディアデータとの対応関係を示すリンク情報を格納しておく。リンク情報の形式を次のように表すものとする。
【００２２】
［Ｍ１（ｉ），ＴＳ１（ｉ），ＴＥ１（ｉ）］←→［Ｍ２（ｉ），ＴＳ２（ｉ），ＴＥ２（ｉ）］（ｉ＝１，２，３，…）
【００２３】
ここに、Ｍ１およびＭ２は、それぞれ一次および二次メディアデータのうちの一つを特定するインデクス番号である。ＴＳ１およびＴＥ１は、Ｍ１で指定される一次メディアデータ上のある区間を指定する時刻パラメータで、それぞれ区間始端および区間終端の時刻である。同様に、ＴＳ２およびＴＥ２は、それぞれＭ２で指定される二次メディアデータ上のある区間の始端および終端の時刻である。上記は、一次メディアデータ上の区間［Ｍ１（ｉ），ＴＳ１（ｉ），ＴＥ１（ｉ）］が二次メディアデータ上の区間［Ｍ２（ｉ），ＴＳ２（ｉ），ＴＥ２（ｉ）］と対応していることを表している。ｉは１つの対応関係を特定するインデクスである。
【００２４】
なお、多くの場合、一次メディアデータ上の区間［Ｍ１，ＴＳ１，ＴＥ１］と二次メディアデータ上の区間［Ｍ２，ＴＳ２，ＴＥ２］とは長さが等しい。しかし、一次メディアが静止画であったり、一次メディアが二次メディア上に埋め込まれる際にスロー再生されたりしていれば、長さが異なるので、一般性を持たせて上記のような形式としている。
【００２５】
リンク情報格納手段１０３が持つリンク情報は、人手で作成することも可能である。また、一次メディアデータを使って二次メディアデータを作成する編集作業の際に、作業者が行った編集操作をすべて記録しておけば、その記録から自動的にリンク情報を生成することも可能である。さらに、編集操作の記録が残っていない場合は、一次メディアデータの映像や音声の部分パターンを二次メディアデータの部分パターンと照合するパターンマッチングを行うことによって、リンク情報を得ることができる。
【００２６】
音声認識手段１０５は、二次メディアデータ格納手段１０２に格納された二次メディアデータに対して音声認識を行い、音声認識の結果を出力する。出力された音声認識結果は、認識結果格納手段１０１に格納される。音声認識結果は、主要部分である認識結果文字列、および認識結果文字列と二次メディアデータとの時間的対応関係を規定する情報を備えていれば、特に形式は問わない。認識結果格納手段１０１に格納される音声認識結果の形式は、例えば、以下に示すような認識結果である単語と、二次メディアデータ上での位置のセット｛Ｗ（ｉ，ｊ），Ｔ（ｉ，ｊ），Ｄ（ｉ，ｊ）｝とする。
【００２７】
二次メディアデータ１：｛Ｗ（１，１），Ｔ（１，１），Ｄ（１，１）｝，｛Ｗ（１，２），Ｔ（１，２），Ｄ（１，２）｝，…
二次メディアデータ２：｛Ｗ（２，１），Ｔ（２，１），Ｄ（２，１）｝，｛Ｗ（２，２），Ｔ（２，２），Ｄ（２，２）｝，…
：
：
二次メディアデータＮ：｛Ｗ（Ｎ，１），Ｔ（Ｎ，１），Ｄ（Ｎ，１）｝，｛Ｗ（Ｎ，２），Ｔ（Ｎ，２），Ｄ（Ｎ，２）｝，…
【００２８】
ここで、Ｗ（ｉ，１），Ｗ（ｉ，２），…は、二次メディアデータｉに対する音声認識の結果として得られる単語列である。また、Ｔ（ｉ，ｊ）、Ｄ（ｉ，ｊ）は、それぞれ二次メディアデータｉ内での単語Ｗ（ｉ，ｊ）の始端時刻、継続時間長であり、単語Ｗ（ｉ，ｊ）の二次メディアデータｉ上での位置を規定する。
【００２９】
上述の音声認識結果の形式において、単語の始端時刻Ｔ（ｉ，ｊ）や継続時間長Ｄ（ｉ，ｊ）といった時刻情報を得るのは、音声認識手段としてよく知られた隠れマルコフモデルを用いる方法では容易である。すなわち、同じく音声認識分野でよく知られたヴィタビ（Ｖｉｔｅｒｂｉ）アルゴリズム等によって、各単語と音声信号との時間的な対応（アラインメント）を効率的に計算することができる。
【００３０】
上述の音声認識結果は、単語を単位としているが、音声認識結果の単位としては、音節や音素等、任意のものでよい。また、上述の音声認識結果は、各音声信号に対してもっとも確からしい認識結果を１つだけ持つような形式としているが、複数個の認識結果候補を持つように拡張することも可能である。拡張された場合は、一つの二次メディアデータに対して、単語列を１個でなく複数個持つような形式、あるいは、認識結果の候補を単語のネットワークで表現したワードグラフとして持つような形式となる。
【００３１】
上記音声認識結果｛Ｗ（ｉ，ｊ），Ｔ（ｉ，ｊ），Ｄ（ｉ，ｊ）｝の作成は、原則として各二次メディアデータに対して一度だけ行い、検索前に完了しているものとする。
【００３２】
検索キー入力手段１０８は、キーワードなど、検索に用いる検索キー入力を受け付け、二次メディアデータ検索手段１０７へ送る。
【００３３】
二次メディアデータ検索手段１０７は、検索キー入力手段１０８から受け取った検索キーと、認識結果格納手段１０１に格納された音声認識結果とのマッチングを行い、二次メディアデータ内で検索キーと一致する部分をすべて検出し、一次メディアデータ検索手段１０６に送る。例えば、検索キーを単語Ｖとすると、二次メディアデータ検索手段１０７は、Ｖ＝Ｗ（ｉ，ｊ）となる全ての二次メディアデータのインデクスｉと、区間Ｔ（ｉ，ｊ）〜Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）とを検出し、一次メディアデータ検索手段１０６に送る。
【００３４】
なお、二次メディアデータ検索手段１０７における検索の手続きは、文字列と文字列のマッチングに基づくものであり、この種の検索で一般的に使われる方法を使うことができる。例えば、文字列の部分的な不一致を許容したマッチングを行うことにより再現率を高める曖昧検索、複数のキーワードをＡＮＤやＯＲ等で組み合わせた論理式で検索して適合率を上げるような絞り込み検索などを使うことができる。
【００３５】
一次メディアデータ検索手段１０６は、二次メディアデータ検索手段１０７の出力、すなわち、検索キーとマッチする二次メディアデータのインデクスｉおよび区間Ｔ（ｉ，ｊ）〜Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）とを受け取り、この区間に対応する一次メディアデータのインデクスｋおよび一次メディアデータの内部の位置を、リンク情報格納手段１０３に格納された一次メディアデータと二次メディアデータとの対応関係から割り出し、出力する。
【００３６】
一次メディアデータ検索手段１０６が、二次メディアデータインデクスｉおよび区間Ｔ（ｉ，ｊ）〜Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）から、一次メディアデータインデクスｋおよび一次メディアデータの内部の位置を割り出す方法は、以下の通りである。
【００３７】
先に説明したように、リンク情報格納手段１０３に格納されているリンク情報、すなわち一次メディアデータと二次メディアデータとの対応関係は、次に示すようなものである。
【００３８】
［Ｍ１（ｌ），ＴＳ１（ｌ），ＴＥ１（ｌ）］←→［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］（ｌ＝１，２，３，…）
【００３９】
ここで、Ｍ１およびＭ２は、それぞれ一次および二次メディアデータインデクスである。ＴＳ１およびＴＥ１は、それぞれ一次メディアデータＭ１上のある区間の始端および終端時刻、ＴＳ２およびＴＥ２は、それぞれ二次メディアデータＭ２上のある区間の始端および終端の時刻である。ｌは、対応関係を特定するインデクスである。
【００４０】
一次メディアデータ検索手段１０６が、二次メディアデータｉ上の区間Ｔ（ｉ，ｊ）〜Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）を二次メディアデータ検索手段１０７から受け取ったとすると、一次メディアデータ検索手段１０６は、リンク情報格納手段１０３に格納されたリンク情報の右辺［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］から、［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］と重複を持つものをすべて検出し、これらに対応するリンク情報左辺［Ｍ１（ｌ），ＴＳ１（ｌ），ＴＥ１（ｌ）］に相当する一次メディアデータの部分を出力する。例えば、二次メディアデータの区間［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］の部分区間［Ｍ２（ｌ），ＴＳ２’，ＴＥ２’］（ただし、ＴＳ２（ｌ）≦ＴＳ２’かつＴＥ２’≦ＴＥ２（ｌ））が区間［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］と重複していたとすると、上記部分区間に対応する一次メディアデータの区間は、比例関係を仮定すれば、［Ｍ１（ｌ），ＴＳ１（ｌ）＋（ＴＥ１（ｌ）−ＴＳ１（ｌ））＊（ＴＳ２’−ＴＳ２（ｌ））／（ＴＥ２（ｌ）−ＴＳ２（ｌ）），ＴＥ１（ｌ）−（ＴＥ１（ｌ）−ＴＳ１（ｌ））＊（ＴＥ２（ｌ）−ＴＥ２’）／（ＴＥ２（ｌ）−ＴＳ２（ｌ））］となるから、一次メディアデータ検索手段１０６は、この区間に相当する一次メディアデータの部分を出力する。
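The proportional mapping in paragraph [0040] can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `Link` record and the function name `map_to_primary` are hypothetical, times are assumed to be in seconds, and each link is assumed to satisfy TE2 > TS2 so that the division is defined.

```python
from typing import NamedTuple, Optional, Tuple


class Link(NamedTuple):
    """One link record [M1, TS1, TE1] <-> [M2, TS2, TE2] (hypothetical layout)."""
    m1: int      # primary media data index M1
    ts1: float   # section start on the primary media data, TS1
    te1: float   # section end on the primary media data, TE1
    m2: int      # secondary media data index M2
    ts2: float   # section start on the secondary media data, TS2
    te2: float   # section end on the secondary media data, TE2


def map_to_primary(link: Link, m2: int, start: float,
                   end: float) -> Optional[Tuple[int, float, float]]:
    """Map a hit [m2, start, end] on the secondary media data onto the primary one.

    Returns None when the hit does not overlap the link's secondary section.
    """
    if link.m2 != m2:
        return None
    # Clip the hit to the linked section: this gives [TS2', TE2'].
    ts2p = max(start, link.ts2)
    te2p = min(end, link.te2)
    if ts2p > te2p:
        return None  # no overlap with this link
    span1 = link.te1 - link.ts1
    span2 = link.te2 - link.ts2  # assumed > 0
    # Proportional mapping, term by term as in paragraph [0040].
    p_start = link.ts1 + span1 * (ts2p - link.ts2) / span2
    p_end = link.te1 - span1 * (link.te2 - te2p) / span2
    return (link.m1, p_start, p_end)


# A 10 s primary section shown in slow motion over 20 s of secondary media data 1.
link = Link(m1=3, ts1=100.0, te1=110.0, m2=1, ts2=20.0, te2=40.0)
print(map_to_primary(link, 1, 30.0, 32.0))  # (3, 105.0, 106.0)
```

Because the mapping is linear, a link whose two sections have equal length reduces to a plain time offset, while a slow-motion link stretches or compresses the hit accordingly.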
【００４１】
一次メディアデータ検索手段１０６が出力する一次メディアデータの区間長は、適宜調整してもよい。例えば、上述の例では出力される区間は単語１個分の短いものとなるから、前後に数秒ずつ延長した区間の一次メディアデータを出力してもよい。また、区間［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］と重複を持つようなリンク情報右辺［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］がまったく存在しない場合は、［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］と時間的に近いもの（近くの区間）を選べばよい。ここで時間的な近さとは、例えば、一方の区間内の任意の時刻と他方の区間の任意の時刻との差の最小値などと定義しておけばよい。
【００４２】
また、区間［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］と重複を持つようなリンク情報右辺［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］がまったく存在しない場合への対処としては、時間的な近さにあるしきい値ΔＴを設けて、もとの区間を前後に広げた区間［ｉ，Ｔ（ｉ，ｊ）−ΔＴ，Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）＋ΔＴ］とリンク情報右辺［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］との重複を調べてもよい。この場合、区間［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］から極端に遠い位置にあるような［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］は、検索にかからないようにできるため、検索精度（適合率）が高まる。なお、区間を広げる際に、しきい値ΔＴを適宜設定することで出力される一次メディアデータの数量を調整してもよい。
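The two fallbacks in paragraphs [0041] and [0042] can be sketched together. The sketch assumes that "temporal closeness" is the gap between two closed intervals (zero when they overlap), which matches the minimum-difference definition suggested in [0041]. Widening the hit by ΔT on both sides and testing for overlap is then the same as accepting only links whose gap is at most ΔT. The tuple layout and the names `interval_gap`, `nearest_link`, and `max_gap` are illustrative assumptions.

```python
def interval_gap(a_start, a_end, b_start, b_end):
    """Smallest time difference between two closed intervals (0.0 if they overlap)."""
    return max(0.0, max(a_start, b_start) - min(a_end, b_end))


def nearest_link(links, m2, start, end, max_gap=None):
    """Pick the link whose secondary section is closest in time to the hit.

    links: iterable of (m2, ts2, te2, primary_ref) tuples (hypothetical layout).
    max_gap: optional threshold playing the role of delta-T in [0042];
             links farther away than this are ignored.
    """
    best, best_gap = None, None
    for link in links:
        if link[0] != m2:
            continue  # link belongs to a different secondary media data
        gap = interval_gap(start, end, link[1], link[2])
        if max_gap is not None and gap > max_gap:
            continue  # too far from the hit to be relevant
        if best_gap is None or gap < best_gap:
            best, best_gap = link, gap
    return best


links = [
    (1, 0.0, 10.0, "A1"),
    (1, 30.0, 40.0, "A3"),
    (2, 12.0, 14.0, "A4"),
]
print(nearest_link(links, 1, 12.0, 12.5))               # (1, 0.0, 10.0, 'A1')
print(nearest_link(links, 1, 12.0, 12.5, max_gap=1.0))  # None
```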
【００４３】
さらに、上述のΔＴを固定値ではなく、二次メディアデータの内容構造に応じて決定することも可能である。二次メディアデータの内容構造とは、話題、話者、映像シーン等の変化である。検索キーと一致する二次メディアデータの区間として［ｉ，Ｔ（ｉ，ｊ），Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）］が見つかったとき、その区間を［ｉ，Ｔ（ｉ，ｊ）−ΔＴｂ，Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）＋ΔＴｆ］と広げる（ただし、ΔＴｂ、ΔＴｆは正数）。広げる際、Ｔ（ｉ，ｊ）−ΔＴｂおよびＴ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）＋ΔＴｆが話題、話者、映像シーンの変化点となるようにΔＴｂ、ΔＴｆを決定する。これらの変化点は、音声認識手段１０５が二次メディアデータの音声認識結果を作成する際に同時に作成して、認識結果格納手段１０１に格納しておけばよい。
【００４４】
メディアデータの話題、話者、映像シーンの変化点は、人手で抽出してもよいし、自動的に求める方法も種々知られている。例えば、話題に関しては、認識結果の単語列の一定長さ区間内で単語ごとの出現頻度を求め、これを話題の特徴ベクトルとし、この特徴ベクトルが急激に変化する時刻を話題の変化点として求めることができる。また、話者に関しては、音声認識でよく知られたケプストラム特徴量を音声信号から計算し、音声信号の一定長さ区間での平均を話者の特徴ベクトルとし、話者の特徴ベクトルが急激に変化する時刻を話者の変化点として求めることができる。映像シーンの変化点は、各映像フレームの画素値ヒストグラム、すなわち画素値の頻度分布が大きく変動する時刻として検出することができる。
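The topic change detection outlined in paragraph [0044] can be illustrated with word-frequency vectors over fixed-length windows of the recognized word sequence. The sketch below is one possible reading. The patent does not specify the window length, the distance measure, or the threshold, so the cosine distance, the non-overlapping windows, and the names `cosine_distance` and `topic_change_points` are all assumptions.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def topic_change_points(words, window, threshold):
    """Return start times at which the word-frequency vector changes sharply.

    words: list of (word, start_time) pairs in time order.
    window: number of words per window (assumed fixed).
    threshold: distance above which a boundary counts as a topic change.
    """
    points = []
    for k in range(window, len(words) - window + 1, window):
        before = Counter(w for w, _ in words[k - window:k])
        after = Counter(w for w, _ in words[k:k + window])
        if cosine_distance(before, after) > threshold:
            points.append(words[k][1])
    return points


words = [("rain", 0.0), ("cloud", 1.0), ("rain", 2.0), ("cloud", 3.0),
         ("goal", 4.0), ("team", 5.0), ("goal", 6.0), ("team", 7.0)]
print(topic_change_points(words, window=2, threshold=0.5))  # [4.0]
```

Speaker and scene change points could be found in the same way by replacing the word-frequency vector with averaged cepstral features or per-frame pixel histograms, as the paragraph describes.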
【００４５】
なお、通常は、二次メディアデータ検索手段１０７が複数個の二次メディアデータインデクスと区間を出力するケースが多いと考えられる。この場合は、一次メディアデータ検索手段１０６はその各々について上述の手順を踏み、複数個の一次メディアデータインデクスまたは一次メディアデータの内部の位置を出力する。
【００４６】
次に、以上述べた一次メディアデータの検索手続きについて図を用いて説明する。図３は、本発明の第１の実施形態における一次メディアデータの検索手続きを模式的に示した図である。図３において、一次メディアデータＡ１の区間ａ１、一次メディアデータＡ３の区間ａ３、および一次メディアデータＡ４の区間ａ４が編集されて、二次メディアデータに含まれている。含まれる際の時間的対応関係は、リンク情報としてリンク情報格納手段１０３に保持されている。
【００４７】
このような状況で、音声認識手段１０５は、二次メディアデータに対して音声認識を行い、二次メディアデータとの時間的な対応が付いた音声認識結果「＊＊＊＊ＸＹＺ＊＊＊＊＊ＸＹＺ＊＊＊」を認識結果格納手段１０１に格納する。検索キー「ＸＹＺ」が入力された際に、二次メディアデータ検索手段１０７は、検索キーと一致する部分（ＸＹＺ）を音声認識結果から探し出し、探し出された部分と対応する二次メディアデータの区間Ｂ３、区間Ｂ４を特定する。一次メディアデータ検索手段１０６は、二次メディアデータの区間Ｂ３、区間Ｂ４とそれぞれ対応する一次メディアデータＡ３の区間ｂ３、および一次メディアデータＡ４の区間ｂ４を、リンク情報をたどることにより特定し、検索の最終結果として出力することができる。
【００４８】
［第２の実施形態］
次に、本発明の第２の実施形態について、図面を参照して説明する。図４は、本発明の第２の実施形態に係るメディア検索装置のブロック図である。図４において、メディア検索装置は、一次メディアデータ格納手段５０４と、二次メディアデータ格納手段５０２と、リンク情報格納手段５０３と、音声認識手段５０５と、認識結果格納手段５０１と、原稿時刻付与手段５１０と、時刻付き原稿格納手段５０９と、検索キー入力手段５０８と、二次メディアデータ検索手段５０７と、一次メディアデータ検索手段５０６とを備える。
【００４９】
一次メディアデータ格納手段５０４は、任意かつ多数のメディアデータ、すなわち一次メディアを格納する。二次メディアデータ格納手段５０２は、一次メディアデータを編集し組み合わせて作成された多数の二次的なメディアデータ、すなわち二次メディアデータを格納する。リンク情報格納手段５０３は、一次メディアデータが二次メディアデータのどの位置に使われているかを示すリンク情報を格納する。音声認識手段５０５は、二次メディアデータに対して音声認識を行い、二次メディアデータの各時刻における音声認識結果文字列を出力する。認識結果格納手段５０１は、音声認識手段５０５が出力する音声認識結果文字列を二次メディアデータの時刻と対応付けて格納する。原稿時刻付与手段５１０は、二次メディアデータ制作時に使用されたナレーション原稿や台本等のテキストデータがある場合に、このテキストデータと認識結果格納手段５０１に格納された音声認識結果とのマッチングを行い、原稿や台本等のテキストデータと二次メディアデータとの時間的対応関係を求める。時刻付き原稿格納手段５０９は、原稿時刻付与手段５１０の出力である、原稿や台本等のテキストデータと二次メディアデータとの時間的対応関係を、テキストデータとともに格納する。検索キー入力手段５０８は、検索のための検索キーの入力を受け付ける。二次メディアデータ検索手段５０７は、認識結果格納手段５０１に格納された認識結果文字列と検索キーとのマッチングを行い、検索キーを含む二次メディアデータおよび二次メディアデータ内で検索キーが現れる位置を特定する。一次メディアデータ検索手段５０６は、二次メディアデータ検索手段５０７が特定した二次メディアデータおよび二次メディアデータの内部の位置を入力として、リンク情報格納手段５０３が持つリンク情報に従って、入力に対応する一次メディアデータおよび一次メディアデータの内部の位置を算出し出力する。なお、各々の手段は、それぞれ計算機上に記憶されたプログラムとして動作させることによっても実現可能である。
【００５０】
次に、第２の実施形態に係るメディア検索装置の動作について、順を追って説明する。
【００５１】
なお、認識結果格納手段５０１、二次メディアデータ格納手段５０２、リンク情報格納手段５０３、一次メディアデータ格納手段５０４、音声認識手段５０５、一次メディアデータ検索手段５０６、検索キー入力手段５０８は、それぞれ本発明の第一の実施の形態における認識結果格納手段１０１、二次メディアデータ格納手段１０２、リンク情報格納手段１０３、一次メディアデータ格納手段１０４、音声認識手段１０５、一次メディアデータ検索手段１０６、検索キー入力手段１０８と同じものであって、本発明の第一の実施の形態で説明した動作と同じ動作をする。
【００５２】
原稿時刻付与手段５１０は、二次メディアデータ制作時に使用されたナレーション原稿や台本等のテキストデータと、認識結果格納手段５０１に格納された認識結果文字列とでマッチングを行い、同じく認識結果格納手段５０１に格納された認識結果文字列と二次メディアデータとの時間的対応関係を用いて、テキストデータと二次メディアデータの間の時間的対応関係を求める。所与の２つの文字列間の対応関係を求める方法については種々知られている。例えば、特開２０００−２７０２６３号公報（特許文献２参照）記載の自動字幕番組制作システムには、アナウンスの音声の進行と同期して、提示単位字幕文の作成、及びその始点／終点の各々に対応する高精度のタイミング情報付与の自動化について記載されている。本実施の形態の場合、２つの文字列間の対応関係さえ求まれば、そのうちの１つの文字列すなわち認識結果文字列と二次メディアデータとの時間的対応関係がわかっているので、テキストデータと二次メディアデータとの時間的対応関係も容易に求めることができる。
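The idea in paragraph [0052], carrying timestamps from the recognition result over to the script text, can be sketched with a generic sequence alignment. The example uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner. It is not the method of Patent Document 2, and the function name `timestamp_script` and the tuple layout are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def timestamp_script(script_words, recognized):
    """Attach recognition timestamps to the script words that align with them.

    script_words: list of words from the narration script.
    recognized: list of (word, start, duration) tuples from speech recognition.
    Returns (script_word, start, duration) for every aligned script word;
    script words with no matching recognized word are left out.
    """
    rec_words = [w for w, _, _ in recognized]
    matcher = SequenceMatcher(a=script_words, b=rec_words, autojunk=False)
    timed = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, duration = recognized[block.b + k]
            timed.append((script_words[block.a + k], start, duration))
    return timed


script = ["mount", "fuji", "in", "spring"]
# "fuji" was misrecognized as "fuzzy" in this made-up example.
recognized = [("mount", 0.0, 0.4), ("fuzzy", 0.4, 0.4),
              ("in", 0.8, 0.2), ("spring", 1.0, 0.5)]
print(timestamp_script(script, recognized))
```

In this sketch the misrecognized word gets no timestamp. A fuller version would likely interpolate its time from the neighboring aligned words, so that the correct script word "fuji" stays searchable even though the recognizer missed it.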
【００５３】
なお、原稿や台本等のテキストデータと二次メディアデータとの時間的対応関係を求めたい場合、上述のように音声認識結果を媒介として利用する方法の他に、前述のヴィタビ（Ｖｉｔｅｒｂｉ）アルゴリズムを用いて、原稿や台本等のテキストデータと二次メディアデータ音声信号との時間的対応関係を直接求めてしまう方法も可能である。
【００５４】
時刻付き原稿格納手段５０９は、原稿時刻付与手段５１０が出力した、テキストデータと二次メディアデータとの時間的対応関係を、テキストデータとともに受け取り、格納する。格納される情報の形式は、認識結果格納手段５０１と同様に、単語と単語の二次メディアデータ上での位置を示す時刻数値との多数の組からなる集合である。
【００５５】
二次メディアデータ検索手段５０７は、検索キー入力手段５０８から受け取った検索キーと、認識結果格納手段５０１に格納された音声認識結果および時刻付き原稿格納手段５０９に格納された原稿や台本等のテキストデータとのマッチングを行い、二次メディアデータ内で検索キーと一致する部分をすべて検出し、一次メディアデータ検索手段５０６に送る。例えば、検索キーを単語Ｖとすると、二次メディアデータ検索手段５０７は、Ｖ＝Ｗ（ｉ，ｊ）となるすべての二次メディアデータのインデクスｉおよび区間Ｔ（ｉ，ｊ）〜Ｔ（ｉ，ｊ）＋Ｄ（ｉ，ｊ）を、認識結果格納手段５０１および時刻付き原稿格納手段５０９から検出し、一次メディアデータ検索手段５０６に送る。
【００５６】
次に、以上述べた一次メディアデータの検索手続きについて図を用いて説明する。図５は、本発明の第２の実施形態における一次メディアデータの検索手続きを模式的に示した図である。一次メディアデータＡ６の区間ａ６、一次メディアデータＡ３の区間ａ３、および一次メディアデータＡ４の区間ａ４が、編集されて、二次メディアデータに含まれている。含まれる際の時間的対応関係は、リンク情報としてリンク情報格納手段５０３に保持されている。
【００５７】
このような状況で、音声認識手段５０５は、二次メディアデータに対して音声認識を行い、二次メディアデータとの時間的な対応が付いた音声認識結果「…＊＊＊＊ＸＹＺ＊＊＊＊＊ＸＹＺ＊＊＊…」を認識結果格納手段５０１に格納する。また、原稿時刻付与手段５１０は、二次メディアデータに対して原稿や台本等のテキストデータに基づいて二次メディアデータとの時間的な対応が付いた時刻付き原稿「＊＊＊＊＊ＸＹＺ＊＊＊＊＊」を時刻付き原稿格納手段５０９に格納する。検索キー「ＸＹＺ」が入力された際に、二次メディアデータ検索手段５０７は、検索キーと一致する部分（ＸＹＺ）を音声認識結果および時刻付き原稿から探し出し、探し出された部分と対応する二次メディアデータの区間Ｂ６、区間Ｂ３、および区間Ｂ４を特定する。一次メディアデータ検索手段５０６は、二次メディアデータの区間Ｂ６、区間Ｂ３、および区間Ｂ４とそれぞれ対応する一次メディアデータＡ６の区間ｂ６、一次メディアデータＡ３の区間ｂ３、および一次メディアデータＡ４の区間ｂ４を、リンク情報をたどることにより特定し、検索の最終結果として出力することができる。
【００５８】
第二の実施の形態では第一の実施の形態における検索手続きと比べ、検索キーに基づく二次メディアデータの検索範囲を原稿や台本等のテキストデータにまで広げ、原稿や台本等のテキストデータを検索に利用している点が異なる。
【００５９】
［第３の実施形態］
次に、本発明の第３の実施形態について、図面を参照して説明する。図６は、本発明の第３の実施形態に係るメディア検索装置のブロック図である。図６において、メディア検索装置は、一次メディアデータ格納手段７０４と、二次メディアデータ格納手段７０２と、音声認識手段７０５と、認識結果格納手段７０１と、検索キー入力手段７０８と、二次メディアデータ検索手段７０７と、一次メディアデータ検索手段７０６とを備える。
【００６０】
一次メディアデータ格納手段７０４は、任意かつ多数のメディアデータ、すなわち一次メディアを格納する。二次メディアデータ格納手段７０２は、一次メディアデータ格納手段７０４には全く同じものが必ずしも含まれてない不特定の一次メディアデータを編集し組み合わせて作成された多数の二次メディアデータを格納する。音声認識手段７０５は、二次メディアデータに対して音声認識を行い、二次メディアデータの各時刻における音声認識結果文字列を出力する。認識結果格納手段７０１は、音声認識手段７０５が出力する音声認識結果文字列を二次メディアデータの時刻と対応付けて格納する。検索キー入力手段７０８は、検索のための検索キーの入力を受け付ける。二次メディアデータ検索手段７０７は、認識結果格納手段７０１に格納された認識結果文字列と検索キーとのマッチングを行い、検索キーを含む二次メディアデータおよび二次メディアデータ内で検索キーが現れる位置を特定する。一次メディアデータ検索手段７０６は、二次メディアデータ検索手段７０７が特定した二次メディアデータ内の区間に含まれる映像や音声を新たな検索キーとして、映像や音声の類似性に基づいて一次メディアデータ格納手段７０４から一次メディアデータを検索し出力する。なお、各々の手段は、それぞれ計算機上に記憶されたプログラムとして動作させることによっても実現可能である。
【００６１】
以下、第３の実施形態に係るメディア検索装置の動作について、順を追って説明する。
【００６２】
なお、認識結果格納手段７０１、二次メディアデータ格納手段７０２、一次メディアデータ格納手段７０４、音声認識手段７０５、二次メディアデータ検索手段７０７、検索キー入力手段７０８は、それぞれ本発明の第一の実施の形態における認識結果格納手段１０１、二次メディアデータ格納手段１０２、一次メディアデータ格納手段１０４、音声認識手段１０５、二次メディアデータ検索手段１０７、検索キー入力手段１０８と同じもので、本発明の第一の実施の形態で述べた動作と同じ動作をする。
【００６３】
ただし、一次メディアデータ格納手段７０４に格納される一次メディアデータは、必ずしも二次メディアデータ格納手段７０２に格納された二次メディアデータ中で使用されていなくてもよいという点で、検索対象となるメディアデータに関する制約は、本発明の第一の実施の形態よりも緩い。さらに、一次メディアデータ格納手段７０４に格納される一次メディアデータは、不図示のネットワーク上に存在する不特定のあらゆるメディアデータであってもよい。
【００６４】
一次メディアデータ検索手段７０６は、検索キー入力手段７０８より入力された検索キーと一致する二次メディアデータの区間を二次メディアデータ検索手段７０７より受け取り、その区間の映像や音声を特徴量に変換する。ここで特徴量とは、もとの映像や音声の持つ性質を保ちつつ、より少数のデータで表現できるようなパラメータセットである。現在、映像や音声の検索、分類、認識の分野で広く使われている特徴量は、極めて多岐にわたっており、すべてを列挙することはできないが、ここでは広く知られた特徴量の中から、目的に応じて適宜選択すればよい。一例として、映像の特徴量には、映像フレームを縦横に各々数個の領域に分割し、各領域の色の分布ヒストグラムや物体境界（エッジ）の方向ヒストグラムを計算したものの時系列、あるいはある区間全体にわたる平均等を用いることができる。また、音声の特徴量には、スペクトルパワーやケプストラムの時系列、あるいはその区間全体にわたる平均等を用いることが考えられる。
【００６５】
一次メディアデータ検索手段７０６は、さらに、一次メディアデータ格納手段７０４に格納された各メディアデータに対しても、同じ手順によって特徴量を計算して、二次メディアデータの区間と一次メディアデータを特徴量レベルで比較し、類似度を計算する。ここで類似度とは、特徴量が静的なベクトルであれば、例えばユークリッド距離（その符号を反転したもの）として容易に計算できる。また、特徴量が時系列、すなわちベクトルの系列であるような場合でも、動的計画法に基づくマッチング、すなわちＤＰマッチングにより特徴量間の距離が計算できるので、その符号を反転したものを類似度と定義すればよい。
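The two similarity measures of paragraph [0065] can be written down compactly. For static feature vectors the similarity is the negated Euclidean distance. For time series of vectors the sketch uses a basic dynamic-programming (DP) matching recurrence, commonly known as dynamic time warping, and negates the resulting distance. The patent fixes neither the step pattern nor the local distance, so those choices, and all function names, are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def static_similarity(u, v):
    """Similarity of two static feature vectors: negated Euclidean distance."""
    return -euclidean(u, v)


def dp_matching_distance(xs, ys):
    """DP matching distance between two non-empty sequences of feature vectors."""
    inf = float("inf")
    n, m = len(xs), len(ys)
    # cost[i][j]: best accumulated distance aligning xs[:i] with ys[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(xs[i - 1], ys[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # xs advances alone
                                 cost[i][j - 1],      # ys advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def sequence_similarity(xs, ys):
    """Similarity of two feature time series: negated DP matching distance."""
    return -dp_matching_distance(xs, ys)


print(static_similarity([0.0, 0.0], [3.0, 4.0]))  # -5.0
# The second series lingers on the middle frame; DP matching absorbs the stretch.
print(dp_matching_distance([[0.0], [1.0], [2.0]],
                           [[0.0], [1.0], [1.0], [2.0]]))  # 0.0
```

Ranking the primary media data by `sequence_similarity` against the features of the matched secondary section then yields the most similar candidates, as described in paragraph [0067].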
【００６６】
なお、一次メディアデータに関する特徴量計算は、検索のたびに行う必要はなく、各一次メディアデータに対して一度だけ行っておけば、以後は計算の結果をくり返し使用することができる。
【００６７】
最終的に、一次メディアデータ検索手段７０６は、二次メディアデータの区間ともっとも類似度の高い１個あるいは複数個の一次メディアデータを、検索結果として出力する。
【００６８】
第３の実施形態に係るメディア検索装置は、以上の説明のように動作するので、一次メディアデータのある部分と二次メディアデータのある部分が完全に一致していなくてもデータを検索することができるようになる。例えば、春の富士山の映像と夏の富士山の映像とのように絵の構図が似ていればデータも似ていると判定できる。また、同じ人が昨日話しているシーンと今日話しているシーンとのように声の性質が似ていればデータが似ていると判定できることになる。すなわち、二次メディアデータに一次メディアデータのある部分が必ずしも含まれていない場合であっても、類似度を計算して類似度の高いものを選択することで検索することができる。
【００６９】
［第４の実施形態］
次に、本発明の第４の実施形態について、図面を参照して説明する。図７は、本発明の第４の実施形態に係るメディア検索装置のブロック図である。図７において、メディア検索装置は、一次メディアデータ格納手段８０４と、二次メディアデータ格納手段８０２と、リンク情報格納手段８０３と、背景雑音減算手段８０９と、音声認識手段８０５と、認識結果格納手段８０１と、検索キー入力手段８０８と、二次メディアデータ検索手段８０７と、一次メディアデータ検索手段８０６とを備える。
【００７０】
一次メディアデータ格納手段８０４は、任意かつ多数のメディアデータ、すなわち一次メディアを格納する。二次メディアデータ格納手段８０２は、一次メディアデータを編集し組み合わせて作成された多数の二次的なメディアデータ、すなわち二次メディアデータを格納する。リンク情報格納手段８０３は、一次メディアデータが二次メディアデータのどの位置に使われているかを示すリンク情報を格納する。背景雑音減算手段８０９は、リンク情報を利用して二次メディアデータ中に含まれる一次メディアデータの音声を二次メディアデータから減算する。音声認識手段８０５は、背景雑音減算手段８０９によって背景雑音が除去された二次メディアデータを受け取り、これに対して音声認識を行い、二次メディアデータの各時刻における音声認識結果文字列を出力する。認識結果格納手段８０１は、音声認識手段８０５が出力する音声認識結果文字列を二次メディアデータの時刻と対応付けて格納する。検索キー入力手段８０８は、検索のための検索キーの入力を受け付ける。二次メディアデータ検索手段８０７は、認識結果格納手段８０１に格納された認識結果文字列と検索キーとのマッチングを行い、検索キーを含む二次メディアデータおよび二次メディアデータ内で検索キーが現れる位置を特定する。一次メディアデータ検索手段８０６は、二次メディアデータ検索手段８０７が特定した二次メディアデータおよび二次メディアデータの内部の位置を入力として、リンク情報格納手段８０３が持つリンク情報に従って、入力に対応する一次メディアデータおよび一次メディアデータの内部の位置を算出し出力する。各々の手段は、それぞれ計算機上に記憶されたプログラムとして動作させることにより実現可能である。
【００７１】
以下、第４の実施形態に係るメディア検索装置の動作について、順を追って説明する。
【００７２】
まず、認識結果格納手段８０１、二次メディアデータ格納手段８０２、リンク情報格納手段８０３、一次メディアデータ格納手段８０４、一次メディアデータ検索手段８０６、二次メディアデータ検索手段８０７、検索キー入力手段８０８は、それぞれ本発明の第一の実施の形態における認識結果格納手段１０１、二次メディアデータ格納手段１０２、リンク情報格納手段１０３、一次メディアデータ格納手段１０４、一次メディアデータ検索手段１０６、二次メディアデータ検索手段１０７、検索キー入力手段１０８と同じもので、本発明の第一の実施の形態で述べたのと同じ動作をする。
【００７３】
二次メディアデータ格納手段８０２に格納された二次メディアデータの音声が２種類の音声信号、すなわち、一次メディアデータに元々含まれていた音声と、ナレーション音声のような二次メディアデータ固有の音声との重ね合わせであると仮定する。その上で背景雑音減算手段８０９は、一次メディアデータに元々含まれていた音声を背景雑音とした背景雑音除去を、二次メディアデータに対して行う。背景雑音除去の方法について次に説明する。
【００７４】
今、一次メディアデータと、二次メディアデータの対応する区間の音声信号をそれぞれＳ１（ｔ）、Ｓ２（ｔ）とする。一次メディアデータと二次メディアデータの対応する区間は、リンク情報格納手段８０３に格納されたリンク情報
［Ｍ１（ｌ），ＴＳ１（ｌ），ＴＥ１（ｌ）］←→［Ｍ２（ｌ），ＴＳ２（ｌ），ＴＥ２（ｌ）］（ｌ＝１，２，３，…）
から知ることができる。なお、ｔは、時刻インデクスであり、ｔ＝０、ｔ＝Ｔがそれぞれ一次メディアデータＭ１（ｌ）の時刻ＴＳ１（ｌ）、ＴＥ１（ｌ）、および二次メディアデータＭ２（ｌ）の時刻ＴＳ２（ｌ）、ＴＥ２（ｌ）に対応しているとする。一次および二次メディアデータの区間長ＴＥ１（ｌ）−ＴＳ１（ｌ）およびＴＥ２（ｌ）−ＴＳ２（ｌ）は等しいと仮定している。このとき、背景雑音除去によって得られる二次メディアデータ固有の音声信号Ｓ２’（ｔ）は、Ｓ２’（ｔ）＝Ｓ２（ｔ）−Ｓ１（ｔ）により算出することができる。ただし、ｔ∈［０，Ｔ］である。
【００７５】
なお、上述の背景雑音除去の方法では、一次メディアデータに元々含まれていた音声信号Ｓ１（ｔ）と二次メディアデータ固有の音声信号Ｓ２’（ｔ）とが１：１の比率で重ね合わせられて二次メディアデータの音声信号Ｓ２（ｔ）が生成されると仮定しているが、一般にはそうではないケースもあり得る。そのようなケース、例えばα：１（αは正定数）で重ね合わせられている場合、すなわち、一次メディアデータの音声信号が振幅をα倍に増幅して二次メディアデータに挿入されている場合は、Ｓ２’（ｔ）＝Ｓ２（ｔ）−α×Ｓ１（ｔ）によって二次メディアデータの背景雑音除去を行えばよい。
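The subtraction S2'(t) = S2(t) - α × S1(t) from paragraphs [0074] and [0075] is a per-sample operation once the two signals are time-aligned. The sketch below assumes both signals are already aligned via the link information, sampled at the same rate, and given as equal-length lists of floats. The function name `subtract_background` is illustrative.

```python
def subtract_background(s2, s1, alpha=1.0):
    """Return S2'(t) = S2(t) - alpha * S1(t) for time-aligned sample lists.

    s2: samples of the secondary media data section.
    s1: samples of the corresponding primary media data section.
    alpha: mixing ratio with which S1 was inserted (1.0 for a 1:1 mix).
    """
    if len(s1) != len(s2):
        raise ValueError("sections must be time-aligned and of equal length")
    return [b - alpha * a for a, b in zip(s1, s2)]


# Made-up samples: the primary audio was mixed in at half amplitude.
s1 = [1.0, -2.0, 0.5]
narration = [0.25, 0.5, -1.0]
s2 = [0.5 * a + n for a, n in zip(s1, narration)]
print(subtract_background(s2, s1, alpha=0.5))  # [0.25, 0.5, -1.0]
```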
【００７６】
重ね合わせ比率αの値が未知の場合は、αの値を自動的に決定する必要があるが、例えばＳ２’（ｔ）のＳＮ比（信号雑音比）が大きくなるように決めればよい。すなわち、音声信号Ｓ１（ｔ）、Ｓ２（ｔ）に対応する対数パワー（局所スペクトルの周波数領域での積分値）をそれぞれＰ１（ｔ）、Ｐ２（ｔ）とすると、ｍｉｎ_ｔ｛Ｐ２（ｔ）−α×Ｐ１（ｔ）｝＝εとなるようにαを決めればよい。ここに、εは十分小さい正の定数で、ｍｉｎ_ｔは、ｔに関する最小値を意味する。
【００７７】
また、αの値を自動的に決定する別の方法として、二次メディアデータ音声信号Ｓ２（ｔ）の一部の区間、例えば先頭のΔｔ秒の区間に二次メディアデータ固有の音声が存在しないことを想定して、この区間を使ってαを推定することが考えられる。この場合、Ｓ１（ｔ）およびＳ２（ｔ）の区間ｔ∈［０，Δｔ］にわたる積分値を計算し、それぞれの積分値の比をαとする。
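The ratio-based estimate of α in paragraph [0077] can be sketched as follows. The paragraph speaks only of "integrals" of S1(t) and S2(t) over the leading section. Since a raw audio waveform integrates to roughly zero, this sketch integrates the absolute amplitude instead, which is an assumption. The name `estimate_alpha` and the parameter `n_lead` (number of leading samples assumed to contain no narration) are also illustrative.

```python
def estimate_alpha(s1, s2, n_lead):
    """Estimate the mixing ratio alpha from a narration-free leading section.

    s1: samples of the primary media data section.
    s2: samples of the time-aligned secondary media data section.
    n_lead: number of leading samples in which S2 is assumed to contain
            only the (scaled) primary audio.
    """
    # Sum of absolute amplitudes stands in for the integral (assumption).
    num = sum(abs(x) for x in s2[:n_lead])
    den = sum(abs(x) for x in s1[:n_lead])
    if den == 0:
        raise ValueError("primary signal is silent in the leading section")
    return num / den


# Made-up samples: the first three samples of s2 are 0.5 * s1, then narration starts.
s1 = [1.0, -2.0, 1.0, 4.0]
s2 = [0.5, -1.0, 0.5, 3.0]
print(estimate_alpha(s1, s2, n_lead=3))  # 0.5
```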
【００７８】
音声認識手段８０５は、背景雑音除去が施された二次メディアデータを背景雑音減算手段８０９から受け取り、これらに対して音声認識を行って、認識結果を認識結果格納手段８０１に格納する。
【００７９】
次に、本発明に係るメディア検索プログラムについて、図面を参照して説明する。図１は、本発明に係るメディア検索装置の構成図である。図１において、メディア検索装置は、記憶部１０、データ処理部２０、入出力部３０を備える。記憶部１０は、メディア検索プログラムを記録した記録媒体１１、認識結果記録媒体１３、二次メディアデータ記録媒体１４、リンク情報記録媒体１５、一次メディアデータ記録媒体１６を備える。記録媒体１１は、ＣＤ−ＲＯＭ、磁気ディスク、半導体メモリその他の記録媒体であってよく、また、メディア検索プログラムは、不図示のネットワークを介して流通する場合も含む。
【００８０】
メディア検索プログラムは、記録媒体１１からデータ処理部２０に読み込まれ、メディア検索装置における各手段を機能させる。また、入出力部３０は、メディア検索装置におけるマンマシンインタフェースを司り、検索時の検索キーの入力などを行う。認識結果記録媒体１３、二次メディアデータ記録媒体１４、リンク情報記録媒体１５、一次メディアデータ記録媒体１６は、磁気ディスク、半導体メモリその他の記録媒体であってよく、メディア検索装置における各種データを記録する。
【００８１】
データ処理部２０は、メディア検索プログラムの制御により、第一の実施の形態における音声認識手段１０５、一次メディアデータ検索手段１０６、二次メディアデータ検索手段１０７、検索キー入力手段１０８による処理を実行する。また、処理を実行するにあたり、認識結果格納手段１０１、二次メディアデータ格納手段１０２、リンク情報格納手段１０３、一次メディアデータ格納手段１０４とそれぞれ同等の情報を有する認識結果記録媒体１３、二次メディアデータ記録媒体１４、リンク情報記録媒体１５、一次メディアデータ記録媒体１６を参照することでメディアデータの検索結果を出力する。
【００８２】
なお、一次メディアデータの編集、二次メディアデータの作成、および一次メディアデータの各区間と二次メディアデータの各区間との対応関係を表すリンク情報の作成は、データ処理部２０で実行されるようにしても良い。また、データ処理部２０とは異なる不図示の編集装置等において作成し、ネットワークを介して、あるいはオフラインによって、一次メディアデータ記録媒体１６、二次メディアデータ記録媒体１４、リンク情報記録媒体１５に記録しておいてもよい。さらに、二次メディアデータを文字列で表すための認識結果を編集装置等で得て、認識結果をネットワークを介して、あるいはオフラインによって、認識結果記録媒体１３に記録しておいてもよい。
【００８３】
以上の説明では第１の実施形態について説明したが、記憶部１０に他の実施の形態における記録媒体を備え、記録媒体１１に他の実施の形態におけるメディア検索プログラムを記録することで他の実施の形態における処理を同様の構成において実現できることは言うまでも無い。
【００８４】
【発明の効果】
以上説明したように、一般に、任意のメディアデータは背景雑音や自由な話し言葉を多く含んでいる、あるいは音声が一切含まれない、といった理由により、正確な音声認識が困難であり、したがって音声認識結果と検索キーとのマッチングに基づくメディア検索が困難であったが、このような任意のメディアデータを一次メディアデータとして用いて制作された二次メディアデータは、丁寧な発声で読み上げられたナレーション部分等、正確な音声認識が比較的容易な個所を多く含んでいる。本発明によれば、音声認識を利用した検索が困難な一次メディアデータに対して、検索が比較的容易な二次メディアデータを介して検索することができるため、高い検索精度を実現することができる。
【図面の簡単な説明】
【図１】本発明に係るメディア検索装置の構成図である。
【図２】本発明の第１の実施形態に係るメディア検索装置のブロック図である。
【図３】本発明の第１の実施形態における一次メディアデータの検索手続きを模式的に示した図である。
【図４】本発明の第２の実施形態に係るメディア検索装置のブロック図である。
【図５】本発明の第２の実施形態における一次メディアデータの検索手続きを模式的に示した図である。
【図６】本発明の第３の実施形態に係るメディア検索装置のブロック図である。
【図７】本発明の第４の実施形態に係るメディア検索装置のブロック図である。
【図８】従来の技術に基づくメディア検索装置のブロック図である。
【符号の説明】
１０記憶部
１１記録媒体
１３認識結果記録媒体
１４二次メディアデータ記録媒体
１５リンク情報記録媒体
１６一次メディアデータ記録媒体
２０データ処理部
３０入出力部
１０１、５０１、７０１、８０１認識結果格納手段
１０２、５０２、７０２、８０２二次メディアデータ格納手段
１０３、５０３、８０３リンク情報格納手段
１０４、５０４、７０４、８０４一次メディアデータ格納手段
１０５、５０５、７０５、８０５音声認識手段
１０６、５０６、７０６、８０６一次メディアデータ検索手段
１０７、５０７、７０７、８０７二次メディアデータ検索手段
１０８、５０８、７０８、８０８検索キー入力手段
５０９時刻付き原稿格納手段
５１０原稿時刻付与手段
８０９背景雑音減算手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a media search device and a media search program, and in particular to a media search device and a media search program that search a large body of media data, composed of data of media such as audio, images, and video, for desired media data and present it.
[0002]
[Prior art]
Conventionally, a broadcast program recording technique such as that disclosed in Patent Document 1 is known as a media search device of this type. In this technique, speech recognition is applied to the audio signal of a broadcast program, and the recognized speech is converted into character data and recorded and stored, so that a user can search the recorded broadcast programs by keyword. That is, speech recognition is performed on the audio signals contained in all media data to be searched, and whether there is an utterance at each time in each piece of media data, and if so what was uttered, is recorded in advance. At search time, the portions that match the search key character string entered by the user are located in the character strings obtained by speech recognition, and the corresponding times in the corresponding media data are output.
[0003]
The prior art is described below with reference to the drawings. FIG. 8 is a block diagram of a media search device based on the prior art. The device in FIG. 8 comprises: media data storage means 902 for storing the media data to be searched; speech recognition means 903, which extracts the audio signal from each piece of media data stored in the media data storage means 902, detects speech sections, and performs speech recognition on each detected speech section; recognition result storage means 901 for storing the speech recognition result character string of each speech section output by the speech recognition means 903, together with the temporal correspondence between the speech recognition result character string and the media data (for example, the start and end times of each word in the speech recognition result character string); search key input means 905, through which a person performing a search enters a search key such as a keyword; and media data search means 904, which matches the search key entered through the search key input means 905 against the recognition result character strings stored in the recognition result storage means 901, and selects from the media data storage means 902 and outputs the media data corresponding to the locations that match the search key.
[0004]
The speech recognition results stored in the recognition result storage means 901 take, for example, the following form.
[0005]
Media data 1: {W(1,1), T(1,1), D(1,1)}, {W(1,2), T(1,2), D(1,2)}, ...
Media data 2: {W(2,1), T(2,1), D(2,1)}, {W(2,2), T(2,2), D(2,2)}, ...
:
:
Media data N: {W(N,1), T(N,1), D(N,1)}, {W(N,2), T(N,2), D(N,2)}, ...
[0006]
Here, W(i,1), W(i,2), ... is the word sequence obtained as the result of speech recognition on media data i, and T(i,j) and D(i,j) are, respectively, the start time and duration of the word W(i,j) within media data i. The speech recognition results {W(i,j), T(i,j), D(i,j)} are, in principle, created only once for each piece of media data, and are assumed to be complete before any search. When a word V is entered as the search key through the search key input means 905, the media data search means 904 compares each word W(i,j) with V and, for every i for which W(i,j) matches V, returns the neighborhood of the section from T(i,j) to T(i,j)+D(i,j) in media data i as a search result, thereby realizing the media data search.
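The keyword lookup over the stored triples {W, T, D} can be sketched in a few lines of Python. This is only an illustration of the matching step under simple assumptions: exact word equality, times in seconds, and an in-memory dictionary keyed by media data index. The names `RecognizedWord` and `search_keyword` are hypothetical.

```python
from typing import NamedTuple


class RecognizedWord(NamedTuple):
    word: str        # W(i, j)
    start: float     # T(i, j), start time within media data i
    duration: float  # D(i, j)


def search_keyword(results, key):
    """Return (i, start, end) for every recognized word equal to the search key.

    results: dict mapping media data index i to its list of RecognizedWord.
    """
    hits = []
    for media_id, words in results.items():
        for w in words:
            if w.word == key:
                hits.append((media_id, w.start, w.start + w.duration))
    return hits


# Made-up recognition results for two pieces of media data.
results = {
    1: [RecognizedWord("fuji", 2.0, 0.5), RecognizedWord("spring", 2.5, 0.5)],
    2: [RecognizedWord("summer", 0.0, 0.5), RecognizedWord("fuji", 10.0, 0.5)],
}
print(search_keyword(results, "fuji"))  # [(1, 2.0, 2.5), (2, 10.0, 10.5)]
```

A linear scan is enough to show the idea. A practical system would more likely build an inverted index from words to (i, T, D) entries once, since the recognition results are created before any search takes place.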
[0007]
[Patent Document 1]
JP 2001-309282 A (FIG. 1)
[Patent Document 2]
JP 2000-270263 A
[0008]
[Problems to be solved by the invention]
In general, speech recognition recognizes with relatively high accuracy speech that is uttered in a written-language style in a quiet environment. However, utterances with pronounced background noise, or utterances spoken in a free conversational style, are not recognized accurately, and the recognition results contain many errors. It is self-evident that searching over such error-laden recognition results cannot achieve high-precision search. Moreover, for media data that contains no speech at all, for example video of natural scenery only, speech recognition is impossible in principle. Consequently, conventional media search devices cannot search with sufficient accuracy media data that contains speech with heavy background noise superimposed or speech uttered in a free conversational style, or media data that contains no speech.
[0009]
In order to solve such a problem, an object of the present invention is to provide a technique capable of realizing a highly accurate search even for media data in which speech recognition is difficult.
[0010]
To achieve the above object, the media search device of the present invention, according to a first aspect, comprises: primary media data storage means for storing first media data that includes data of one or more types of media; link information storage means for storing link information indicating which section of the first media data a given section of second media data corresponds to, the second media data including part or all of the sections of a plurality of pieces of the first media data in the primary media data storage means and being easier to recognize than the first media data; recognition result storage means for recognizing each section of the second media data in correspondence with characters to be searched, and storing required sections of the second media data expressed as character strings, that is, sets of characters; secondary media data search means for identifying the section of the second media data in which a portion of the character string stored in the recognition result storage means matches a character string input for the search; and primary media data search means for outputting, in accordance with the link information held by the link information storage means, the first media data, or the section of the first media data, that corresponds to the identified section.
[0011]
[Summary of Invention]
The principle and operation of the present invention are as follows. First, media data that includes data of one or more types of media is classified into two classes. Then, the result of searching the media data of one class, for which speech recognition is relatively easy, is used to search in turn the media data of the other class, for which speech recognition is difficult; in this way, media data on which speech recognition is difficult can be searched. The two classes are a first class, which contains arbitrary media data on which speech recognition is generally difficult, and a second class, which consists of media data generated secondarily by editing work that processes or combines media data in the first class. Media data in the first class is called primary media data (first media data), and media data in the second class is called secondary media data (second media data).
[0012]
Classifying media data into two classes in this way is not always possible, but there are many real cases in which it is. For example, in program production for broadcasting, a single program is created by editing a large amount of video data gathered in the field, embedding it in the program, and adding the speech of a narrator or of the program's performers. Here, the video data gathered in the field corresponds to primary media data, and the program completed by the editing work corresponds to secondary media data. Likewise, a video teaching material used for general education, training, and the like is created by combining various video and music with a narrator's read-aloud speech. Here too, primary media data and secondary media data exist.
[0013]
The relationship between primary media data and secondary media data is as follows. Secondary media data usually contains multiple pieces of primary media data. That is, secondary media data and primary media data are in a one-to-many relationship. In addition, primary media data generally contains arbitrary video and audio and is therefore difficult to recognize, whereas secondary media data contains narration and the like by trained speakers and is therefore relatively easy to recognize. Furthermore, narration by trained speakers often includes content related to the primary media data, such as a description of the primary media data contained within the secondary media data.
[0014]
The media search device and media search program of the present invention therefore store the one-to-many relationship between secondary media data and primary media data, that is, the correspondence indicating where in the secondary media data each piece of primary media data is embedded. At search time, the secondary media data, on which speech recognition is relatively easy, is searched first, and the location of the desired media data within the secondary media data is identified. Then, by following the correspondence between the secondary media data and the primary media data to find the desired primary media data, search of media data on which speech recognition is difficult can be realized.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described in detail with reference to the drawings.
[0016]
[First Embodiment]
FIG. 2 is a block diagram of the media search device according to the first embodiment of the present invention. In FIG. 2, the media search device includes a primary media data storage unit 104, a secondary media data storage unit 102, a link information storage unit 103, a voice recognition unit 105, a recognition result storage unit 101, a search key input unit 108, secondary media data search means 107, and primary media data search means 106.
[0017]
The primary media data storage means 104 stores a large number of arbitrary media data, that is, primary media data. The secondary media data storage means 102 stores a large number of media data created by editing and combining primary media data, that is, secondary media data. The link information storage means 103 stores link information indicating where in the secondary media data each piece of primary media data is used. The voice recognition unit 105 performs voice recognition on the secondary media data and outputs the voice recognition result at each time of the secondary media data as a character string. The recognition result storage unit 101 stores the voice recognition result character string output from the voice recognition unit 105 in association with the time of the secondary media data. The search key input means 108 accepts input of a search key. The secondary media data search unit 107 matches the recognition result character string stored in the recognition result storage unit 101 against the search key, and identifies the secondary media data that contains the search key and the position within that secondary media data at which the search key appears. The primary media data search means 106 receives as input the secondary media data identified by the secondary media data search means 107 and the position within it, and, in accordance with the link information held by the link information storage means 103, determines and outputs the corresponding primary media data and the position within that primary media data. Each means can also be realized as a program stored and run on a computer.
[0018]
Next, the operation of the media search device according to the first embodiment will be described in order.
[0019]
The primary media data storage means 104 stores a large number of arbitrary media data to be searched, that is, primary media data. The format of the primary media data is arbitrary: audio, video, video with audio, or still images such as drawings and photographs. The secondary media data storage means 102 stores a large number of media data that include, in some form, part of the primary media data stored in the primary media data storage means 104. The format of the secondary media data is audio or video with audio.
[0020]
The primary media data can be included in the secondary media data in various ways. The simplest case is one in which all or part of a piece of primary media data is embedded, on its own, in a part of the secondary media data. Cases in which it does not appear on its own include those where narration or subtitles unique to the secondary media data are added to the primary media data, or where background music (BGM) is superimposed on the video; in these cases the primary media data is overlaid with other material. Combinations of these forms are also possible.
[0021]
In any of the cases described above, however, the correspondence indicating which part of the primary media data is used at which part of the secondary media data can be held as quantitative data. The link information storage unit 103 stores link information indicating the correspondence between the primary media data stored in the primary media data storage unit 104 and the secondary media data stored in the secondary media data storage unit 102. The link information is expressed in the following format:
[0022]
[M1(i), TS1(i), TE1(i)] ←→ [M2(i), TS2(i), TE2(i)]  (i = 1, 2, 3, ...)
[0023]
Here, M1 and M2 are index numbers that each specify one piece of primary and secondary media data, respectively. TS1 and TE1 are time parameters designating a section on the primary media data specified by M1, and are the start and end times of that section, respectively. Similarly, TS2 and TE2 are the start and end times of a section on the secondary media data specified by M2. The expression above indicates that the section [M1(i), TS1(i), TE1(i)] on the primary media data corresponds to the section [M2(i), TS2(i), TE2(i)] on the secondary media data. i is an index specifying one correspondence.
[0024]
In many cases, the section [M1, TS1, TE1] on the primary media data and the section [M2, TS2, TE2] on the secondary media data have the same length. However, the lengths differ if the primary media data is a still image, or if the primary media data is played in slow motion when it is embedded in the secondary media data.
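As an illustration only, the link information format above could be held in memory roughly as follows. This is a minimal sketch: the class names `Section` and `Link`, the use of seconds as the time unit, and the sample values are all assumptions made for the example and do not appear in the original description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    media: int    # index M of the media data (M1 or M2)
    start: float  # section start time TS (seconds assumed)
    end: float    # section end time TE (seconds assumed)

    @property
    def length(self) -> float:
        return self.end - self.start


@dataclass(frozen=True)
class Link:
    primary: Section    # [M1(i), TS1(i), TE1(i)]
    secondary: Section  # [M2(i), TS2(i), TE2(i)]


# Hypothetical link information for one piece of secondary media data (index 1).
links = [
    Link(Section(3, 10.0, 25.0), Section(1, 40.0, 55.0)),  # embedded as-is
    Link(Section(7, 0.0, 0.0), Section(1, 55.0, 60.0)),    # still image shown for 5 s
    Link(Section(4, 5.0, 10.0), Section(1, 60.0, 70.0)),   # slow playback
]

# The two sections have equal length only in the first case.
same_length = [l.primary.length == l.secondary.length for l in links]
```

The second and third entries correspond to the still-image and slow-playback cases, where the two section lengths differ.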
[0025]
The link information held by the link information storage means 103 can be created manually. Alternatively, if all the operations performed by the operator are recorded during the editing work that creates secondary media data from primary media data, the link information can be generated automatically from that record. Further, when there is no record of the editing operations, link information can be obtained by pattern matching, in which partial patterns of the primary media data are compared with partial patterns of the secondary media data.
[0026]
The voice recognition unit 105 performs voice recognition on the secondary media data stored in the secondary media data storage unit 102 and outputs the result. The output speech recognition result is stored in the recognition result storage unit 101. The speech recognition result is not particularly limited as long as it contains the recognition result character string, which is its main part, and information defining the temporal correspondence between the recognition result character string and the secondary media data. The format of the speech recognition result stored in the recognition result storage means 101 is, for example, a set of triples {W(i, j), T(i, j), D(i, j)}, each consisting of a recognized word and its position on the secondary media data, as shown below.
[0027]
Secondary media data 1: {W(1, 1), T(1, 1), D(1, 1)}, {W(1, 2), T(1, 2), D(1, 2)}, ...
Secondary media data 2: {W(2, 1), T(2, 1), D(2, 1)}, {W(2, 2), T(2, 2), D(2, 2)}, ...
:
:
Secondary media data N: {W(N, 1), T(N, 1), D(N, 1)}, {W(N, 2), T(N, 2), D(N, 2)}, ...
[0028]
Here, W(i, 1), W(i, 2), ... is the word string obtained as a result of speech recognition for the secondary media data i. T(i, j) and D(i, j) are the start time and duration of the word W(i, j) in the secondary media data i, respectively, and together they define the position of the word on the secondary media data i.
[0029]
With the speech recognition result format described above, time information such as the word start time T(i, j) and duration D(i, j) is easy to obtain when a hidden Markov model, which is well known as a means of speech recognition, is used. That is, the temporal correspondence (alignment) between each word and the voice signal can be calculated efficiently by the Viterbi algorithm or the like, which is well known in the field of speech recognition.
[0030]
The above speech recognition results are in units of words, but the unit may be arbitrary, such as syllables or phonemes. Moreover, although the above format holds only the single most likely recognition result for each speech signal, it can be extended to hold a plurality of recognition result candidates. In that case, the format holds multiple word strings instead of one for each piece of secondary media data, or holds the recognition result candidates as a word graph, that is, a network of words.
[0031]
In principle, the speech recognition result {W(i, j), T(i, j), D(i, j)} is created only once for each piece of secondary media data, and its creation is assumed to be completed before the search.
[0032]
The search key input means 108 accepts a search key input used for search such as a keyword and sends it to the secondary media data search means 107.
[0033]
The secondary media data search means 107 matches the search key received from the search key input means 108 against the speech recognition result stored in the recognition result storage means 101, detects all portions of the secondary media data that match the search key, and sends them to the primary media data search means 106. For example, if the search key is the word V, the secondary media data search means 107 detects the index i and the section T(i, j) to T(i, j) + D(i, j) of every piece of secondary media data satisfying V = W(i, j), and sends them to the primary media data search means 106.
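The exact-match case V = W(i, j) described above can be sketched as follows. This is only an illustrative sketch: the names `Word` and `search_secondary`, the dictionary layout keyed by the secondary media data index i, and the sample words and times are assumptions for the example.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str        # W(i, j)
    start: float     # T(i, j)
    duration: float  # D(i, j)


def search_secondary(results, key):
    """Return (i, start, end) for every recognized word equal to the search key."""
    hits = []
    for i, words in results.items():
        for w in words:
            if w.text == key:
                hits.append((i, w.start, w.start + w.duration))
    return hits


# Hypothetical recognition results for secondary media data 1 and 2.
results = {
    1: [Word("mount", 12.0, 0.5), Word("fuji", 12.5, 0.5)],
    2: [Word("fuji", 3.0, 0.5), Word("summer", 3.5, 0.75)],
}

hits = search_secondary(results, "fuji")
```

Each hit carries the index i and the section T(i, j) to T(i, j) + D(i, j), which is the information passed on to the primary media data search.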
[0034]
Note that the search procedure in the secondary media data search means 107 is based on matching one character string against another, so methods generally used for this kind of search can be applied. For example, a fuzzy search that raises recall by allowing partial mismatches between character strings, or a narrowing search that raises precision by using a logical expression combining multiple keywords with AND, OR, and the like, can be used.
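The two search variations mentioned above might look like the following sketch. The description does not specify any particular fuzzy-matching method; here Python's standard `difflib.SequenceMatcher` similarity ratio is swapped in as one possible choice, and the function names, the threshold 0.7, and the sample words are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_hits(words, key, min_ratio=0.7):
    """Fuzzy search: accept words whose similarity to the key is high enough."""
    return [w for w in words
            if SequenceMatcher(None, w, key).ratio() >= min_ratio]


def contains_all(words, keys):
    """Narrowing search: AND condition over several keywords."""
    return all(k in words for k in keys)


# Hypothetical recognition result in which "fuji" was misrecognized as "fugi".
recognized = ["mount", "fugi", "in", "summer"]

fuzzy = fuzzy_hits(recognized, "fuji")
both = contains_all(recognized, ["mount", "summer"])
```

In this example the exact keyword "fuji" would be missed, while the fuzzy match still finds the misrecognized word, which illustrates the recall gain at the cost of possible false matches.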
[0035]
The primary media data search means 106 receives the output of the secondary media data search means 107, that is, the index i of the secondary media data matching the search key and the section T(i, j) to T(i, j) + D(i, j). From the correspondence between the primary media data and the secondary media data stored in the link information storage means 103, it determines and outputs the index k of the primary media data corresponding to this section and the position within that primary media data.
[0036]
The primary media data search means 106 determines the index k of the primary media data and the position within it from the index i of the secondary media data and the section T(i, j) to T(i, j) + D(i, j) as follows.
[0037]
As described above, the link information stored in the link information storage unit 103, that is, the correspondence between the primary media data and the secondary media data is as follows.
[0038]
[M1(l), TS1(l), TE1(l)] ←→ [M2(l), TS2(l), TE2(l)]  (l = 1, 2, 3, ...)
[0039]
Here, M1 and M2 are primary and secondary media data indexes, respectively. TS1 and TE1 are the start and end times of a section on the primary media data M1, respectively, and TS2 and TE2 are the start and end times of a section on the secondary media data M2, respectively. l is an index for specifying the correspondence.
[0040]
Suppose that the primary media data search means 106 receives the section T(i, j) to T(i, j) + D(i, j) on the secondary media data i from the secondary media data search means 107. The primary media data search means 106 detects every entry of the link information stored in the link information storage unit 103 whose right side [M2(l), TS2(l), TE2(l)] overlaps the section [i, T(i, j), T(i, j) + D(i, j)], and outputs the portion of the primary media data given by the corresponding left side [M1(l), TS1(l), TE1(l)]. For example, if a partial section [M2(l), TS2′, TE2′] of the section [M2(l), TS2(l), TE2(l)] of the secondary media data (where TS2(l) ≦ TS2′ and TE2′ ≦ TE2(l)) overlaps the section [i, T(i, j), T(i, j) + D(i, j)], then, assuming a proportional relationship, the section of the primary media data corresponding to the partial section is [M1(l), TS1(l) + (TE1(l) − TS1(l)) * (TS2′ − TS2(l)) / (TE2(l) − TS2(l)), TE1(l) − (TE1(l) − TS1(l)) * (TE2(l) − TE2′) / (TE2(l) − TS2(l))]. The primary media data search means 106 outputs the portion of the primary media data corresponding to this section.
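The proportional mapping above can be checked with a short sketch. The function name `map_to_primary`, the tuple layout of a link, and the sample times are assumptions for illustration; the arithmetic follows the expression given in the text, with TS2′ and TE2′ taken as the bounds of the overlapping part.

```python
def map_to_primary(link, i, q_start, q_end):
    """Map the query section [i, q_start, q_end] on secondary media data
    to the corresponding section on primary media data, or None if the
    link does not overlap the query."""
    (m1, ts1, te1), (m2, ts2, te2) = link
    if m2 != i:
        return None
    ts2p = max(ts2, q_start)  # TS2': start of the overlapping partial section
    te2p = min(te2, q_end)    # TE2': end of the overlapping partial section
    if ts2p >= te2p:
        return None           # no overlap
    scale = (te1 - ts1) / (te2 - ts2)
    return (m1,
            ts1 + scale * (ts2p - ts2),
            te1 - scale * (te2 - te2p))


# Hypothetical link: 10 s of primary media data 3 shown over 20 s
# (slow playback) of secondary media data 1.
link = ((3, 10.0, 20.0), (1, 40.0, 60.0))

mapped = map_to_primary(link, 1, 44.0, 48.0)
```

Here the 4-second query section on the secondary media data maps to a 2-second section on the primary media data, because the link stretches the primary material to twice its length.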
[0041]
The length of the primary media data section output from the primary media data search means 106 may be adjusted as appropriate. For example, since the output section in the above example is only as long as one word, primary media data extended by several seconds before and after it may be output. Also, if no right side [M2(l), TS2(l), TE2(l)] of the link information overlaps the section [i, T(i, j), T(i, j) + D(i, j)] at all, the one closest in time to [i, T(i, j), T(i, j) + D(i, j)] (a neighboring section) may be selected. Here, closeness in time may be defined, for example, as the minimum difference between any time in one section and any time in the other section.
[0042]
Alternatively, if no right side [M2(l), TS2(l), TE2(l)] of the link information overlaps the section [i, T(i, j), T(i, j) + D(i, j)] at all, a threshold value ΔT for closeness in time may be set, and the overlap between the expanded section [i, T(i, j) − ΔT, T(i, j) + D(i, j) + ΔT] and the right side [M2(l), TS2(l), TE2(l)] of the link information may be examined. This prevents a section [M2(l), TS2(l), TE2(l)] that is extremely far from the section [i, T(i, j), T(i, j) + D(i, j)] from being retrieved, so the search accuracy (precision) increases. Note that when the section is expanded in this way, the amount of primary media data output can be adjusted by setting the threshold value ΔT appropriately.
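The two fallback strategies above (choosing the nearest section, or expanding the query by ΔT) can be sketched as follows. The function names, the tuple layout of a link, and the sample values are assumptions. The sketch uses the fact that, for closed intervals, the expanded section overlaps a link section exactly when the time gap between the original sections is at most ΔT.

```python
def gap(a_start, a_end, b_start, b_end):
    """Minimum time difference between two sections (0.0 if they overlap)."""
    return max(0.0, b_start - a_end, a_start - b_end)


def nearest_link(links, i, q_start, q_end):
    """Link whose secondary section is closest in time to the query section."""
    same_media = [l for l in links if l[1][0] == i]
    if not same_media:
        return None
    return min(same_media,
               key=lambda l: gap(q_start, q_end, l[1][1], l[1][2]))


def links_within(links, i, q_start, q_end, delta_t):
    """Links whose secondary section overlaps the query expanded by delta_t."""
    return [l for l in links
            if l[1][0] == i
            and gap(q_start, q_end, l[1][1], l[1][2]) <= delta_t]


# Hypothetical link information: (primary section, secondary section).
links = [
    ((3, 10.0, 20.0), (1, 40.0, 60.0)),
    ((4, 0.0, 5.0), (1, 100.0, 105.0)),
]

closest = nearest_link(links, 1, 70.0, 71.0)
near = links_within(links, 1, 70.0, 71.0, delta_t=10.0)
```

With a small ΔT no link is returned at all, which is the behavior that keeps distant, probably unrelated sections out of the result.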
[0043]
Furthermore, ΔT may be determined according to the content structure of the secondary media data instead of being a fixed value. The content structure of the secondary media data refers to changes of topic, speaker, video scene, and the like. When [i, T(i, j), T(i, j) + D(i, j)] is found as a section of secondary media data that matches the search key, the section is expanded to [i, T(i, j) − ΔTb, T(i, j) + D(i, j) + ΔTf] (where ΔTb and ΔTf are positive numbers). When expanding, ΔTb and ΔTf are determined so that T(i, j) − ΔTb and T(i, j) + D(i, j) + ΔTf fall on change points of the topic, the speaker, or the video scene. These change points may be created at the same time as the voice recognition unit 105 creates the voice recognition result of the secondary media data, and stored in the recognition result storage unit 101.
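Assuming the change points are already available as a sorted list of times, the expansion above might be sketched as follows. The function name `expand_to_change_points` and the sample change points are assumptions; when no change point exists on one side, this sketch simply leaves that boundary unchanged, which is a choice made for the example rather than something stated in the description.

```python
from bisect import bisect_left, bisect_right


def expand_to_change_points(change_points, start, end):
    """Expand [start, end] outward to the nearest change point strictly
    before `start` and strictly after `end` (so that both offsets are positive)."""
    lo = bisect_left(change_points, start) - 1   # last change point < start
    hi = bisect_right(change_points, end)        # first change point > end
    new_start = change_points[lo] if lo >= 0 else start
    new_end = change_points[hi] if hi < len(change_points) else end
    return new_start, new_end


# Hypothetical topic/speaker/scene change points on one secondary media data.
change_points = [0.0, 30.0, 75.0, 120.0]

expanded = expand_to_change_points(change_points, 40.0, 40.5)
```

In this example ΔTb is 10.0 and ΔTf is 34.5, so the one-word hit grows to cover the whole segment between two change points.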
[0044]
The change points of topic, speaker, and video scene in media data may be extracted manually, and various methods for obtaining them automatically are also known. For example, for topics, the appearance frequency of each word is obtained within a section of fixed length in the word string of the recognition result and used as the feature vector of the topic, and the time at which this feature vector changes abruptly can be obtained as a topic change point. For speakers, cepstrum features, which are well known in speech recognition, are calculated from the speech signal, their average over a fixed length of the speech signal is used as the speaker feature vector, and the time at which this vector changes abruptly can be obtained as a speaker change point. A change point of the video scene can be detected as the time at which the pixel value histogram of each video frame, that is, the frequency distribution of the pixel values, fluctuates greatly.
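The topic change detection outlined above can be sketched with word-frequency vectors over adjacent windows. The description does not say how an abrupt change is measured; cosine distance between the two windows and the threshold 0.8 are assumptions, as are the function names and the sample word sequence.

```python
from collections import Counter
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def topic_change_points(words, window=4, threshold=0.8):
    """words: list of (text, start_time). Return the start times at which the
    word frequencies of the preceding and following windows differ strongly."""
    points = []
    for j in range(window, len(words) - window + 1):
        before = Counter(t for t, _ in words[j - window:j])
        after = Counter(t for t, _ in words[j:j + window])
        if cosine_distance(before, after) >= threshold:
            points.append(words[j][1])
    return points


# Hypothetical recognition result: four weather words, then four sports words.
texts = ["rain", "cloud", "rain", "wind", "goal", "team", "goal", "match"]
words = [(t, float(k)) for k, t in enumerate(texts)]

points = topic_change_points(words)
```

On longer inputs several neighboring positions may exceed the threshold, so a practical version would likely merge adjacent detections or keep only local maxima.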
[0045]
In general, the secondary media data search means 107 is expected to output a plurality of secondary media data indexes and sections in many cases. In that case, the primary media data search means 106 performs the above-described procedure for each of them, and outputs a plurality of primary media data indexes and positions within the primary media data.
[0046]
Next, the primary media data retrieval procedure described above will be described with reference to the drawings. FIG. 3 is a diagram schematically showing the search procedure for primary media data in the first embodiment of the present invention. In FIG. 3, the section a1 of the primary media data A1, the section a3 of the primary media data A3, and the section a4 of the primary media data A4 have been edited and included in the secondary media data. The temporal correspondence at the time of inclusion is held in the link information storage means 103 as link information.
[0047]
In this situation, the voice recognition unit 105 performs voice recognition on the secondary media data, and the voice recognition result "...***XYZ*****XYZ***...", together with its temporal correspondence to the secondary media data, is stored in the recognition result storage means 101. When the search key "XYZ" is input, the secondary media data search means 107 searches the speech recognition result for the portions (XYZ) that match the search key, and identifies the sections B3 and B4 of the secondary media data corresponding to the portions found. By following the link information, the primary media data search means 106 identifies the section b3 of the primary media data A3 and the section b4 of the primary media data A4, which correspond to the sections B3 and B4 of the secondary media data, respectively, and outputs them as the final search result.
[0048]
[Second Embodiment]
Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 4 is a block diagram of a media search device according to the second embodiment of the present invention. In FIG. 4, the media search apparatus includes a primary media data storage unit 504, a secondary media data storage unit 502, a link information storage unit 503, a voice recognition unit 505, a recognition result storage unit 501, a document time giving unit 510, a timed document storage unit 509, a search key input unit 508, a secondary media data search unit 507, and a primary media data search unit 506.
[0049]
The primary media data storage means 504 stores a large number of arbitrary media data, that is, primary media data. The secondary media data storage unit 502 stores a large number of media data created by editing and combining primary media data, that is, secondary media data. The link information storage unit 503 stores link information indicating where in the secondary media data each piece of primary media data is used. The voice recognition unit 505 performs voice recognition on the secondary media data and outputs a voice recognition result character string at each time of the secondary media data. The recognition result storage unit 501 stores the voice recognition result character string output from the voice recognition unit 505 in association with the time of the secondary media data. When text data such as a narration manuscript or a script used in producing the secondary media data exists, the document time giving means 510 matches the text data against the voice recognition result stored in the recognition result storage means 501 and obtains the temporal correspondence between the text data and the secondary media data. The timed document storage unit 509 stores this temporal correspondence, which is the output of the document time giving means 510, together with the text data. The search key input means 508 accepts input of a search key. The secondary media data search means 507 matches the recognition result character string stored in the recognition result storage means 501 against the search key, and identifies the secondary media data that contains the search key and the position within that secondary media data at which the search key appears. The primary media data search means 506 receives as input the secondary media data identified by the secondary media data search means 507 and the position within it, and, in accordance with the link information held by the link information storage means 503, determines and outputs the corresponding primary media data and the position within that primary media data. Each means can also be realized as a program stored and run on a computer.
[0050]
Next, the operation of the media search device according to the second embodiment will be described step by step.
[0051]
Note that the recognition result storage means 501, secondary media data storage means 502, link information storage means 503, primary media data storage means 504, voice recognition means 505, primary media data search means 506, and search key input means 508 are the same as the recognition result storage means 101, secondary media data storage means 102, link information storage means 103, primary media data storage means 104, voice recognition means 105, primary media data search means 106, and search key input means 108 in the first embodiment of the present invention, respectively, and operate in the same way as described in the first embodiment.
[0052]
The document time giving means 510 matches text data, such as a narration manuscript or script used in producing the secondary media data, against the recognition result character string stored in the recognition result storage means 501, and obtains the temporal correspondence between the text data and the secondary media data by using the temporal correspondence between the recognition result character string stored in the recognition result storage means 501 and the secondary media data. Various methods for obtaining the correspondence between two given character strings are known. For example, the automatic caption program production system described in Japanese Patent Application Laid-Open No. 2000-270263 (see Patent Document 2) describes automating the creation of presentation-unit caption sentences in synchronization with the progress of the announcement audio and the assignment of high-precision timing information to their start and end points. In the present embodiment, once the correspondence between the two character strings is obtained, the temporal correspondence between the text data and the secondary media data can be obtained easily, because the temporal correspondence between one of the character strings, namely the recognition result character string, and the secondary media data is already known.
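One simple way to transfer times from the recognition result to manuscript text is sketched below. It is not the method of Patent Document 2; it swaps in Python's standard `difflib.SequenceMatcher` to align the two word sequences, and the function name `assign_times`, the tuple layout, and the sample data are assumptions. Manuscript words that align with no recognized word are left without times in this sketch.

```python
from difflib import SequenceMatcher


def assign_times(manuscript, recognized):
    """manuscript: list of words; recognized: list of (word, start, duration).
    Copy start time and duration onto manuscript words that align exactly
    with a recognized word; unaligned words get None."""
    rec_words = [w for w, _, _ in recognized]
    timed = [(w, None, None) for w in manuscript]
    matcher = SequenceMatcher(None, manuscript, rec_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, duration = recognized[block.b + k]
            timed[block.a + k] = (manuscript[block.a + k], start, duration)
    return timed


# Hypothetical data: the recognizer misheard "fuji" as "fugi".
manuscript = ["mount", "fuji", "in", "summer"]
recognized = [("mount", 0.0, 0.5), ("fugi", 0.5, 0.5),
              ("in", 1.0, 0.25), ("summer", 1.25, 0.5)]

timed = assign_times(manuscript, recognized)
```

A fuller version would probably interpolate a time for the unaligned word "fuji" from its neighbors, so that the correctly spelled manuscript word becomes searchable at roughly the right position.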
[0053]
To obtain the temporal correspondence between text data such as a manuscript or script and the secondary media data, besides the method that goes through the voice recognition result as described above, it is also possible to use the Viterbi algorithm mentioned earlier to obtain directly the temporal correspondence between the text data and the audio signal of the secondary media data.
[0054]
The timed document storage unit 509 receives the temporal correspondence between the text data and the secondary media data output from the document time giving unit 510, and stores it together with the text data. As in the recognition result storage unit 501, the stored information takes the form of a collection of many sets, each consisting of a word and time values indicating the position of that word on the secondary media data.
[0055]
The secondary media data search means 507 matches the search key received from the search key input means 508 against both the voice recognition result stored in the recognition result storage means 501 and the text data, such as a manuscript or script, stored in the timed document storage means 509, detects all portions of the secondary media data that match the search key, and sends them to the primary media data search means 506. For example, if the search key is the word V, the secondary media data search means 507 detects, from the recognition result storage unit 501 and the timed document storage unit 509, the index i and the section T(i, j) to T(i, j) + D(i, j) of every piece of secondary media data satisfying V = W(i, j), and sends them to the primary media data search unit 506.
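Searching both stores might be sketched as follows. The description does not say how hits from the two sources are combined; taking the union and dropping exact duplicates is an assumption, as are the function names and the sample data.

```python
def search_store(store, key):
    """Set of (i, start, end) for entries (word, start, duration) equal to key."""
    return {(i, t, t + d)
            for i, words in store.items()
            for (w, t, d) in words
            if w == key}


def search_secondary(recognition_results, timed_document, key):
    """Search the recognition results and the timed document; merge the hits."""
    return sorted(search_store(recognition_results, key)
                  | search_store(timed_document, key))


# Hypothetical stores for secondary media data 1; the recognizer misheard "fuji".
recognition_results = {1: [("fugi", 0.5, 0.5), ("summer", 1.25, 0.5)]}
timed_document = {1: [("fuji", 0.5, 0.5), ("summer", 1.25, 0.5)]}

fuji_hits = search_secondary(recognition_results, timed_document, "fuji")
summer_hits = search_secondary(recognition_results, timed_document, "summer")
```

The keyword "fuji" is found only through the timed document, while "summer" is found in both stores and reported once.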
[0056]
Next, the primary media data retrieval procedure described above will be described with reference to the drawings. FIG. 5 is a diagram schematically showing the search procedure for primary media data in the second embodiment of the present invention. In FIG. 5, the section a6 of the primary media data A6, the section a3 of the primary media data A3, and the section a4 of the primary media data A4 have been edited and included in the secondary media data. The temporal correspondence at the time of inclusion is held in the link information storage unit 503 as link information.
[0057]
In this situation, the voice recognition unit 505 performs voice recognition on the secondary media data, and the voice recognition result "...***XYZ*****XYZ***...", together with its temporal correspondence to the secondary media data, is stored in the recognition result storage unit 501. In addition, the document time giving means 510 stores in the timed document storage unit 509 the timed document "******XYZ*****", in which text data such as a manuscript or script has been temporally associated with the secondary media data. When the search key "XYZ" is input, the secondary media data search means 507 searches the speech recognition result and the timed document for the portions (XYZ) that match the search key, and identifies the sections B6, B3, and B4 of the secondary media data corresponding to the portions found. By following the link information, the primary media data search means 506 identifies the section b6 of the primary media data A6, the section b3 of the primary media data A3, and the section b4 of the primary media data A4, which correspond to the sections B6, B3, and B4 of the secondary media data, respectively, and outputs them as the final search result.
[0058]
The second embodiment differs from the search procedure of the first embodiment in that the range of secondary media data searched with the search key is extended to text data such as a manuscript or script, so that this text data is also used for the search.
[0059]
[Third Embodiment]
Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 6 is a block diagram of a media search device according to the third embodiment of the present invention. In FIG. 6, the media search apparatus includes a primary media data storage unit 704, a secondary media data storage unit 702, a voice recognition unit 705, a recognition result storage unit 701, a search key input unit 708, a secondary media data search unit 707, and a primary media data search unit 706.
[0060]
The primary media data storage means 704 stores a large number of arbitrary media data, that is, primary media data. The secondary media data storage means 702 stores a large number of secondary media data created by editing and combining unspecified primary media data that is not necessarily included in the primary media data storage means 704. The voice recognition unit 705 performs voice recognition on the secondary media data and outputs a voice recognition result character string at each time of the secondary media data. The recognition result storage unit 701 stores the voice recognition result character string output from the voice recognition unit 705 in association with the time of the secondary media data. The search key input means 708 accepts input of a search key. The secondary media data search unit 707 matches the recognition result character string stored in the recognition result storage unit 701 against the search key, and identifies the secondary media data that contains the search key and the position within that secondary media data at which the search key appears. The primary media data search means 706 uses the video and audio contained in the section of the secondary media data identified by the secondary media data search means 707 as a new search key, and retrieves and outputs primary media data from the primary media data storage means 704 on the basis of video and audio similarity. Each means can also be realized as a program stored and run on a computer.
[0061]
Hereinafter, the operation of the media search device according to the third embodiment will be described in order.
[0062]
The recognition result storage means 701, secondary media data storage means 702, primary media data storage means 704, voice recognition means 705, secondary media data search means 707, and search key input means 708 are the same as the recognition result storage means 101, secondary media data storage means 102, primary media data storage means 104, voice recognition means 105, secondary media data search means 107, and search key input means 108 in the first embodiment of the present invention, respectively, and operate in the same way as described in the first embodiment.
[0063]
However, the primary media data stored in the primary media data storage unit 704 is not necessarily used in the secondary media data stored in the secondary media data storage unit 702. In this respect, the restrictions on the media data to be searched are looser than in the first embodiment of the present invention. Further, the primary media data stored in the primary media data storage unit 704 may be any unspecified media data existing on a network (not shown).
[0064]
The primary media data search means 706 receives from the secondary media data search means 707 the section of secondary media data that matches the search key input from the search key input means 708, and converts the video and audio in that section into feature quantities. Here, a feature quantity is a set of parameters that represents the original video or audio with a smaller amount of data while preserving its properties. The feature quantities currently in wide use in video and audio search, classification, and recognition are extremely diverse and cannot all be enumerated here; an appropriate one may be selected according to the situation. As an example, the feature quantity of a video can be obtained by dividing each video frame into several regions vertically and horizontally and calculating, for each region, a color distribution histogram and a direction histogram of object boundaries (edges); the time series of these values, or their average over a whole section, can be used. As the feature quantity of audio, the spectral power or the cepstrum, either as a time series or averaged over the whole section, can be used.
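A very small sketch of a region-wise histogram feature is given below. For brevity it works on a single grayscale frame held as a nested list, whereas the description refers to color histograms and edge direction histograms; the function name, grid size, number of bins, and sample frame are all assumptions.

```python
def region_histograms(frame, rows=2, cols=2, bins=4, max_value=256):
    """Split a grayscale frame (list of rows of pixel values) into a
    rows x cols grid and concatenate the normalized histogram of each region."""
    height, width = len(frame), len(frame[0])
    feature = []
    for r in range(rows):
        for c in range(cols):
            hist = [0] * bins
            for y in range(r * height // rows, (r + 1) * height // rows):
                for x in range(c * width // cols, (c + 1) * width // cols):
                    hist[frame[y][x] * bins // max_value] += 1
            total = sum(hist) or 1  # avoid division by zero for empty regions
            feature.extend(v / total for v in hist)
    return feature


# Hypothetical 2 x 2 frame: one pixel per region.
frame = [[0, 255],
         [64, 128]]

feature = region_histograms(frame)
```

The resulting vector has rows x cols x bins elements; averaging such vectors over the frames of a section would give one static feature vector per section.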
[0065]
The primary media data search means 706 further calculates feature quantities by the same procedure for each piece of media data stored in the primary media data storage means 704, compares the section of the secondary media data with the primary media data at the feature quantity level, and calculates their similarity. Here, if the feature quantity is a static vector, the similarity can be calculated easily as, for example, the Euclidean distance with its sign inverted. Even if the feature quantity is a time series, that is, a sequence of vectors, the distance between feature quantities can be calculated by matching based on dynamic programming, that is, DP matching, and the similarity can be defined in the same way.
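Both similarity definitions can be illustrated with a short sketch. The function names and sample vectors are assumptions, and the DP matching shown is the basic dynamic time warping recurrence without slope constraints or normalization, which is one common form rather than necessarily the one intended by the description.

```python
from math import dist, inf


def static_similarity(u, v):
    """Similarity of two static feature vectors: sign-inverted Euclidean distance."""
    return -dist(u, v)


def dp_matching_distance(xs, ys):
    """Distance between two vector sequences by dynamic programming:
    each step may advance in xs, in ys, or in both."""
    n, m = len(xs), len(ys)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            local = dist(xs[a - 1], ys[b - 1])
            cost[a][b] = local + min(cost[a - 1][b],
                                     cost[a][b - 1],
                                     cost[a - 1][b - 1])
    return cost[n][m]


# Hypothetical one-dimensional feature sequences; ys is xs with one frame repeated.
xs = [(0.0,), (1.0,), (2.0,)]
ys = [(0.0,), (1.0,), (1.0,), (2.0,)]

warped = dp_matching_distance(xs, ys)
sim = static_similarity((0.0, 0.0), (3.0, 4.0))
```

Because DP matching absorbs the repeated frame, the two sequences are judged identical even though their lengths differ, which is useful when the same material is played at different speeds.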
[0066]
Note that the feature amount calculation for the primary media data does not need to be performed for each search, and if the calculation is performed only once for each primary media data, the calculation results can be used repeatedly thereafter.
[0067]
Finally, the primary media data search means 706 outputs one or a plurality of primary media data having the highest similarity with the section of the secondary media data as a search result.
[0068]
Since the media search device according to the third embodiment operates as described above, data can be retrieved even if a portion of the primary media data and a portion of the secondary media data do not match completely. For example, two videos with a similar composition, such as a video of Mt. Fuji in spring and a video of Mt. Fuji in summer, can be judged to be similar as data. Likewise, two scenes with similar voice characteristics, such as a scene in which a person spoke yesterday and a scene in which the same person speaks today, can be judged to be similar as data. In other words, even if the secondary media data does not necessarily include a portion of the primary media data, primary media data can be retrieved by calculating the similarity and selecting data with a high similarity.
[0069]
[Fourth Embodiment]
Next, a fourth embodiment of the present invention will be described with reference to the drawings. FIG. 7 is a block diagram of a media search device according to the fourth embodiment of the present invention. In FIG. 7, the media search device includes primary media data storage means 804, secondary media data storage means 802, link information storage means 803, background noise subtraction means 809, voice recognition means 805, recognition result storage means 801, search key input means 808, secondary media data search means 807, and primary media data search means 806.
[0070]
The primary media data storage means 804 stores a large number of arbitrary media data items, that is, primary media data. The secondary media data storage means 802 stores a large number of media data items created by editing and combining primary media data, that is, secondary media data. The link information storage means 803 stores link information indicating where in the secondary media data each item of primary media data is used. The background noise subtraction means 809 uses the link information to subtract, from the secondary media data, the audio of the primary media data contained in it. The voice recognition means 805 receives the secondary media data from which the background noise has been removed by the background noise subtraction means 809, performs voice recognition on it, and outputs a voice recognition result character string for each time in the secondary media data. The recognition result storage means 801 stores the voice recognition result character string output from the voice recognition means 805 in association with the time in the secondary media data. The search key input means 808 accepts input of a search key. The secondary media data search means 807 matches the recognition result character strings stored in the recognition result storage means 801 against the search key, and identifies the secondary media data containing the search key and the position within that data at which the search key appears. The primary media data search means 806 takes as input the secondary media data identified by the secondary media data search means 807 and the position within it, and, according to the link information held by the link information storage means 803, determines and outputs the corresponding primary media data and the position within that primary media data. Each of these means can be realized as a program stored on and executed by a computer.
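The two search steps described above, matching the search key against the timed recognition result and then following the link information to the primary media data, can be sketched as follows. The data layout (`TimedWord`, `Link`), the function names, and the word-level exact matching are assumptions made for illustration; the sketch also assumes that linked sections have equal length, so that a time in the secondary media data maps to the primary media data by a constant offset.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TimedWord:
    text: str
    start: float   # seconds in the secondary media data
    end: float


@dataclass(frozen=True)
class Link:
    m1: str        # primary media identifier
    ts1: float     # section start and end in the primary media data
    te1: float
    m2: str        # secondary media identifier
    ts2: float     # section start and end in the secondary media data
    te2: float


def find_key_sections(words: Sequence[TimedWord],
                      key: Sequence[str]) -> List[Tuple[float, float]]:
    """Return (start, end) of each run of recognized words equal to key."""
    n = len(key)
    if n == 0:
        return []
    hits = []
    for i in range(len(words) - n + 1):
        if [w.text for w in words[i:i + n]] == list(key):
            hits.append((words[i].start, words[i + n - 1].end))
    return hits


def map_to_primary(m2: str, start: float, end: float,
                   links: Sequence[Link]) -> List[Tuple[str, float, float]]:
    """Map a section of secondary media m2 to sections of primary media."""
    results = []
    for link in links:
        if link.m2 != m2:
            continue
        lo, hi = max(start, link.ts2), min(end, link.te2)
        if lo < hi:                          # the sections overlap
            offset = link.ts1 - link.ts2     # constant shift between timelines
            results.append((link.m1, lo + offset, hi + offset))
    return results
```

For instance, a hit at 10.0 to 11.0 seconds in a secondary item whose section 5.0 to 35.0 is linked to section 100.0 to 130.0 of a primary item maps to 105.0 to 106.0 in that primary item.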
[0071]
Hereinafter, operations of the media search device according to the fourth embodiment will be described in order.
[0072]
First, the recognition result storage means 801, secondary media data storage means 802, link information storage means 803, primary media data storage means 804, primary media data search means 806, secondary media data search means 807, and search key input means 808 are the same as the recognition result storage means 101, secondary media data storage means 102, link information storage means 103, primary media data storage means 104, primary media data search means 106, secondary media data search means 107, and search key input means 108, respectively, of the first embodiment of the present invention, and operate in the same way as described for the first embodiment.
[0073]
The audio of the secondary media data stored in the secondary media data storage means 802 is a superposition of two types of audio signals: the audio originally included in the primary media data, and audio specific to the secondary media data, such as narration. The background noise subtraction means 809 therefore removes background noise from the secondary media data, treating the audio originally included in the primary media data as the background noise. The background noise removal method is described next.
[0074]
Let S1(t) and S2(t) be the audio signals in corresponding sections of the primary media data and the secondary media data, respectively. The corresponding sections of the primary media data and the secondary media data can be determined from the link information stored in the link information storage means 803:
[M1(l), TS1(l), TE1(l)] ←→ [M2(l), TS2(l), TE2(l)] (l = 1, 2, 3, ...)
Here t is a time index, and t = 0 and t = T are assumed to correspond to times TS1(l) and TE1(l) of the primary media data M1(l) and to times TS2(l) and TE2(l) of the secondary media data M2(l), respectively. The section lengths TE1(l) − TS1(l) and TE2(l) − TS2(l) of the primary and secondary media data are assumed to be equal. The audio signal S2′(t) specific to the secondary media data, obtained by background noise removal, can then be calculated as S2′(t) = S2(t) − S1(t), where t ∈ [0, T].
[0075]
The above background noise removal method assumes that the audio signal S2(t) of the secondary media data is generated by superimposing the audio signal S1(t) originally included in the primary media data and the audio signal S2′(t) specific to the secondary media data at a ratio of 1:1, but in general this may not be the case. If, for example, they are superimposed at a ratio of α:1 (α is a positive constant), that is, if the audio signal of the primary media data is amplified by a factor of α before being inserted into the secondary media data, background noise removal may be performed as S2′(t) = S2(t) − α × S1(t).
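The subtraction S2′(t) = S2(t) − α × S1(t) can be sketched in a few lines. The sketch assumes both signals are sampled at the same rate and held as lists of samples, and that the corresponding sections have already been cut out using the link information; the function names and the `sample_rate` parameter are hypothetical.

```python
from typing import List, Sequence


def extract_section(signal: Sequence[float], start: float, end: float,
                    sample_rate: int) -> List[float]:
    """Cut the samples between start and end (in seconds) out of a signal."""
    return list(signal[int(round(start * sample_rate)):
                       int(round(end * sample_rate))])


def subtract_background(s2: Sequence[float], s1: Sequence[float],
                        alpha: float = 1.0) -> List[float]:
    """Return S2'(t) = S2(t) - alpha * S1(t) for equal-length sections.

    alpha = 1.0 corresponds to the 1:1 superposition case.
    """
    if len(s1) != len(s2):
        raise ValueError("corresponding sections must have equal length")
    return [b - alpha * a for a, b in zip(s1, s2)]
```

Sample-accurate alignment of the two sections is taken for granted here; in practice any offset between them would have to be resolved before subtracting.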
[0076]
When the value of the superposition ratio α is unknown, it must be determined automatically. It may be determined, for example, so that the SN ratio (signal-to-noise ratio) of S2′(t) becomes large. That is, letting P1(t) and P2(t) be the logarithmic powers (integrals of the local spectrum over the frequency domain) corresponding to the audio signals S1(t) and S2(t), respectively, α may be determined so that min_t {P2(t) − α × P1(t)} = ε. Here ε is a sufficiently small positive constant, and min_t denotes the minimum value with respect to t.
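Under the additional assumption that P1(t) > 0 at every t (an assumption of this sketch, not something stated in the description), the condition min_t {P2(t) − α × P1(t)} = ε has a closed-form solution: P2(t) − α × P1(t) ≥ ε for all t is equivalent to α ≤ (P2(t) − ε) / P1(t) for all t, with equality at some t, so α = min_t (P2(t) − ε) / P1(t). The function name below is hypothetical.

```python
from typing import Sequence


def estimate_alpha_min_power(p1: Sequence[float], p2: Sequence[float],
                             epsilon: float = 1e-3) -> float:
    """Return alpha such that min over t of (P2(t) - alpha * P1(t)) == epsilon.

    Assumes every P1(t) is strictly positive, so the condition can be
    rearranged per frame as alpha <= (P2(t) - epsilon) / P1(t).
    """
    if len(p1) != len(p2) or not p1:
        raise ValueError("p1 and p2 must be non-empty and of equal length")
    if any(value <= 0.0 for value in p1):
        raise ValueError("this sketch requires P1(t) > 0 for all t")
    return min((b - epsilon) / a for a, b in zip(p1, p2))
```

With P1 = [1.0, 2.0, 4.0], P2 = [3.0, 2.5, 9.0], and ε = 0.5, the per-frame bounds are 2.5, 1.0, and 2.125, so α = 1.0 and the residuals P2(t) − α × P1(t) are 2.0, 0.5, and 5.0, whose minimum equals ε.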
[0077]
As another method for automatically determining α, when a part of the secondary media data audio signal S2(t), for example the first ΔT seconds, is known to contain no audio specific to the secondary media data, α can be estimated from that interval. In this case, the integrals of S1(t) and S2(t) over the interval t ∈ [0, ΔT] are calculated, and α is taken as the ratio of the integral of S2(t) to that of S1(t), since S2(t) = α × S1(t) within that interval.
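The interval-based estimate can be sketched as below. The description only speaks of "integral values"; this sketch integrates the absolute amplitude, which is an assumption chosen so that an oscillating audio signal does not integrate to roughly zero. The function name and the `sample_rate` parameter are likewise hypothetical.

```python
from typing import Sequence


def estimate_alpha_leading_interval(s1: Sequence[float], s2: Sequence[float],
                                    sample_rate: int, delta_t: float) -> float:
    """Estimate alpha from the first delta_t seconds of both signals.

    Within that interval S2(t) is assumed to equal alpha * S1(t), so the
    ratio of summed absolute amplitudes gives alpha for positive alpha.
    """
    n = int(delta_t * sample_rate)
    level1 = sum(abs(x) for x in s1[:n])
    level2 = sum(abs(x) for x in s2[:n])
    if n <= 0 or level1 == 0.0:
        raise ValueError("interval is empty or the primary audio is silent")
    return level2 / level1
```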
[0078]
The voice recognition unit 805 receives the secondary media data from which background noise has been removed from the background noise subtraction unit 809, performs voice recognition on these, and stores the recognition result in the recognition result storage unit 801.
[0079]
Next, a media search program according to the present invention will be described with reference to the drawings. FIG. 1 is a configuration diagram of a media search apparatus according to the present invention. In FIG. 1, the media search device includes a storage unit 10, a data processing unit 20, and an input / output unit 30. The storage unit 10 includes a recording medium 11 on which a media search program is recorded, a recognition result recording medium 13, a secondary media data recording medium 14, a link information recording medium 15, and a primary media data recording medium 16. The recording medium 11 may be a CD-ROM, a magnetic disk, a semiconductor memory, or other recording medium, and the media search program may be distributed via a network (not shown).
[0080]
The media search program is read from the recording medium 11 into the data processing unit 20 and causes each unit in the media search device to function. The input / output unit 30 controls the man-machine interface of the media search device and receives the search key at search time. The recognition result recording medium 13, the secondary media data recording medium 14, the link information recording medium 15, and the primary media data recording medium 16 may be magnetic disks, semiconductor memories, or other recording media, and record the various data used in the media search device.
[0081]
The data processing unit 20, under the control of the media search program, executes the processing of the voice recognition means 105, the primary media data search means 106, the secondary media data search means 107, and the search key input means 108 of the first embodiment. In executing this processing, it refers to the recognition result recording medium 13, the secondary media data recording medium 14, the link information recording medium 15, and the primary media data recording medium 16, which hold the same information as the recognition result storage means 101, the secondary media data storage means 102, the link information storage means 103, and the primary media data storage means 104, respectively, and outputs the media data search result.
[0082]
The data processing unit 20 may itself edit the primary media data, create the secondary media data, and create the link information indicating the correspondence between each section of the primary media data and each section of the secondary media data. Alternatively, these may be created by an editing device (not shown) separate from the data processing unit 20 and recorded in advance on the primary media data recording medium 16, the secondary media data recording medium 14, and the link information recording medium 15, via a network or offline. Furthermore, the recognition result that represents the secondary media data as a character string may be obtained by an editing device or the like and recorded on the recognition result recording medium 13 via a network or offline.
[0083]
The above description concerns the first embodiment. It goes without saying, however, that if the storage unit 10 includes the recording media required by another embodiment and the recording medium 11 stores the media search program for that embodiment, the processing of that embodiment can be realized with the same configuration.
[0084]
[Effects of the Invention]
As described above, arbitrary media data in general contains heavy background noise or free, unscripted speech, or contains no speech at all, so accurate speech recognition is difficult, and media search based on matching the speech recognition result against a search key has therefore been difficult. However, secondary media data produced using such arbitrary media data as primary media data contains many portions, such as narration read aloud with careful enunciation, for which accurate speech recognition is relatively easy. According to the present invention, primary media data that is difficult to search using speech recognition can be searched via secondary media data that is relatively easy to search, so high search accuracy can be achieved.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a media search device according to the present invention.
FIG. 2 is a block diagram of the media search device according to the first embodiment of the present invention.
FIG. 3 is a diagram schematically showing a search procedure for primary media data in the first embodiment of the present invention.
FIG. 4 is a block diagram of a media search device according to a second embodiment of the present invention.
FIG. 5 is a diagram schematically showing a search procedure for primary media data in the second embodiment of the present invention.
FIG. 6 is a block diagram of a media search device according to a third embodiment of the present invention.
FIG. 7 is a block diagram of a media search device according to a fourth embodiment of the present invention.
FIG. 8 is a block diagram of a media search device based on a conventional technique.
[Explanation of symbols]
10 Storage unit
11 Recording medium
13 Recognition result recording medium
14 Secondary media data recording medium
15 Link information recording medium
16 Primary media data recording medium
20 Data processing unit
30 Input / output unit
101, 501, 701, 801 Recognition result storage means
102, 502, 702, 802 Secondary media data storage means
103, 503, 803 Link information storage means
104, 504, 704, 804 Primary media data storage means
105, 505, 705, 805 Voice recognition means
106, 506, 706, 806 Primary media data search means
107, 507, 707, 807 Secondary media data search means
108, 508, 708, 808 Search key input means
509 Timed document storage means
510 Document time giving means
809 Background noise subtraction means

Claims

Primary media data storage means for storing first media data including data of one or more types of media;
Link information storage means for storing link information indicating which section on the first media data corresponds to each section on second media data, the second media data including part or all of a plurality of the first media data in the primary media data storage means and being easier to recognize than the first media data;
Recognition result storage means for storing, in association with each section of the second media data, characters to be searched that are obtained by recognizing that section, so that a required section of the second media data is represented as a character string, that is, a set of characters;
Secondary media data search means for specifying a section of the second media data in which a portion in the character string stored in the recognition result storage means matches a character string input for search;
Primary media data search means for outputting first media data corresponding to the specified section or a section on the first media data according to link information possessed by the link information storage means;
A media search device characterized by comprising the above means.

Primary media data storage means for storing first media data including data of one or more types of media;
Secondary media data storage means for storing second media data that is created by editing a plurality of the first media data in the primary media data storage means and that is easier to recognize by voice than the first media data;
Link information storage means for storing link information indicating a section on the first media data corresponding to a section on the second media data;
Voice recognition means for performing voice recognition on the second media data and outputting a recognition result as a voice recognition result character string corresponding to a section of the second media data;
A recognition result storage means for storing the speech recognition result character string in association with a section of the second media data;
A search key input means for inputting a search key character string for search;
Secondary media data search means for matching the speech recognition result character string stored in the recognition result storage means against the search key character string and specifying a section on the second media data where the search key character string exists;
Primary media data search means for taking as input the section on the second media data specified by the secondary media data search means and outputting, according to the link information held by the link information storage means, the first media data corresponding to that section or the section on the first media data;
A media search device characterized by comprising the above means.

The media search device according to claim 2, wherein, when the primary media data search means searches for the first media data from a section on the second media data according to the link information of the link information storage means, and the section on the second media data is T to T + D, the first media data is searched for using the section T − ΔTb to T + D + ΔTf (ΔTb and ΔTf are positive numbers and may be equal).

The media search device according to claim 3, wherein, for the primary media data search means to search for the first media data, both ends of the section T to T + D on the second media data are set to predetermined change points in the second media data.

The media search device according to claim 4, wherein the predetermined change point is a topic change point.

The media search device according to claim 4, wherein the predetermined change point is a point where a speaker is changed.

The media search device according to claim 4, wherein the predetermined change point is a change point of a video scene.

The media search device according to claim 2, further comprising background noise subtraction means for, when the voice recognition means performs voice recognition on the second media data, using the link information and the audio data of the first media data to remove, as background noise, the audio data of the first media data superimposed on the second media data.

The media search device according to claim 2, further comprising:
Document time giving means for associating text data that is a narration document or script used when creating the second media data with a section on the second media data; and
Timed document storage means for storing the association information output by the document time giving means together with the text data,
wherein the secondary media data search means matches the speech recognition result character string stored in the recognition result storage means and the text data stored in the timed document storage means against the search key character string, and specifies a section on the second media data in which the search key character string exists.

The media search device according to claim 9, wherein the document time giving means obtains the correspondence between the text data and the speech recognition result character string stored in the recognition result storage means and, based on that correspondence, obtains the temporal correspondence between the text data and the second media data.

Primary media data storage means for storing first media data including data of one or more types of media;
Secondary media data storage means for storing second media data that is created by editing a plurality of media data including data of one or more types of media and that is easier to recognize by voice than the first media data;
Voice recognition means for performing voice recognition on the second media data and outputting a recognition result as a voice recognition result character string corresponding to a section of the second media data;
A recognition result storage means for storing the speech recognition result character string in association with a section of the second media data;
A search key input means for inputting a search key character string for search;
Secondary media data search means for matching the speech recognition result character string stored in the recognition result storage means against the search key character string and specifying a section on the second media data where the search key character string exists;
Primary media data search means for comparing the video or audio features of the section on the second media data specified by the secondary media data search means with the video or audio features of the first media data stored in the primary media data storage means, and for searching for and outputting, from the first media data stored in the primary media data storage means, video or audio similar to the section on the second media data;
A media search device characterized by comprising the above means.

A program that causes a computer constituting a media search device to execute:
A process of performing voice recognition on second media data that includes part or all of a plurality of first media data, each including data of one or more types of media, and that is easier to recognize by voice than the first media data, and storing the correspondence between the voice recognition result character string and the time on the second media data;
A process of receiving a search key character string input from outside;
A process of matching the search key character string against the voice recognition result character string and identifying a section on the second media data corresponding to the portion of the voice recognition result character string that matches the search key character string; and
A process of outputting the first media data, or the section on the first media data, corresponding to the section on the second media data, with reference to link information prepared in advance that represents the correspondence between the first media data and the second media data.

A program that causes a computer constituting a media search device to execute:
A process of storing first media data including data of one or more types of media;
A process of storing second media data that is created by editing a plurality of the stored first media data and that is easier to recognize by voice than the first media data;
A process of storing link information indicating a section on the first media data corresponding to a section on the second media data;
A process of performing speech recognition on the second media data and outputting the recognition result as a speech recognition result character string corresponding to a section of the second media data;
A process of storing the speech recognition result character string in association with a section of the second media data;
A process of receiving input of a search key character string for search;
A process of matching the stored speech recognition result character string against the search key character string and specifying a section on the second media data in which the search key character string exists; and
A process of taking the specified section on the second media data as input and outputting, according to the link information, the first media data corresponding to that section or the section on the first media data.