JP4033049B2

JP4033049B2 - Method and apparatus for matching video / audio and scenario text, and storage medium and computer software recording the method

Info

Publication number: JP4033049B2
Application number: JP2003169456A
Authority: JP
Inventors: 秀信長田; 行信谷口
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-06-13
Filing date: 2003-06-13
Publication date: 2008-01-16
Anticipated expiration: 2023-06-13
Also published as: JP2005006181A

Description

【０００１】
【発明の属する技術分野】
本発明は、ビデオ，ＴＶ放送番組等の映像音声に対し、メタデータと呼ばれる索引情報を生成する技術に関し、より詳細には、映像音声とシナリオテキストとの整合方法と装置、並びに前記方法を記録した記憶媒体とコンピュータソフトウェアに関する。
【０００２】
【従来の技術】
映像の内容に基づいて特定のシーンを検索したいという要求がある。例えば、ドラマ映像を短編に編集する際、長時間（数時間から十数時間）に及ぶドラマ映像の中から、特定の内容のシーンや、特定の人物の話すシーンを素早く探したいという要求がある。ドラマ，映画およびニュースといった放送番組は、放映以前の制作段階において、番組のシナリオをまず作成し、このシナリオに基づいて映像が制作される。シナリオには、場面の情報や、人物の会話，話題の進行などが含まれるため、シナリオを完成した映像音声と対応付けることができれば、先に述べた内容検索が実現できる。
【０００３】
これに対し、非特許文献１〜６では、次のような情報に着目してシナリオテキストと番組との対応付け（整合）を行ってきた。
すなわち、番組（映像音声）からは、
▲１▼：有音・無音による会話の有無および長さ
▲２▼：平均ピッチの違いによる女性発話区間
▲３▼：音声認識によって抽出された単語
を抽出し、シナリオテキストからは、
▲４▼：テキストに含まれるキーワード
▲５▼：台詞の括弧中の文字列から推定した発話長さ
を抽出し、これらを対応付けの処理に用いている。
【０００４】
【非特許文献１】
柳沼、坂内「ＤＰマッチングを用いたドラマ映像・音声・シナリオ文書の対応付け手法の一提案」
電子情報通信学会論文誌 D-II,Vol.J79-D-II,No.5,pp.747-755,1996
【非特許文献２】
柳沼、和泉、坂内「同期されたシナリオ文書を用いた映像編集方式の一提案」電子情報通信学会論文誌 D-II,Vol.J79-D-II,No.4,pp.547-558,1996
【非特許文献３】
谷村、中川「音声認識を用いたドラマのシナリオへの時刻情報付与」
（同じ題名で2件発表）
言語処理学会第5回年次大会講演論文集,pp.513‐516,1999
電子情報通信学会総合大会講演論文集,pp.377-378,1999
【０００５】
【非特許文献４】
谷村,中川「テレビドラマのシナリオと、音声トラックの自動対応付け」
情報処理学会自然言語処理音声言語情報処理合同研究会 pp.23-29
【非特許文献５】
谷村、中川「テレビドラマにおけるシナリオの台詞と音声トラックの同期システム」
1999年度第13回人工知能学会全国大会講演論文集,pp.205-208,1999
【非特許文献６】
谷村,中川「ドラマのビデオ音声トラックとシナリオのセリフの時刻同期法」情報処理学会知識と複雑系研究会 pp.25-31,1999
【０００６】
【発明が解決しようとする課題】
しかしながら、従来の技術では次のような問題があった。
（１）番組に女性が出現しない場合があるため、▲２▼の情報は全ての番組で用いることができるとは限らない。
（２）ドラマ，討論および映画といった番組では、出演者は「話し言葉」で話すため、認識による単語の正解率が５０％程度と低く、▲３▼の情報は不正確である。
（３）ドラマなどでは、背景雑音，ＢＧＭの影響により、単語認識誤りを生じやすいため、同様に▲３▼の情報は不正確である。
（４）音声認識によって得られた単語と、台詞中の単語の情報とを整合に用いるため、整合を行う要素数が多く、整合に時間がかかる。
【０００７】
このように、従来技術では、不正確な情報を元に対応付けの処理を行うため、対応の精度が低く、また、単語を整合に用いるため、整合に時間がかかるという問題があった。
【０００８】
本発明は上記事情に鑑みてなされたものであり、その目的とするところは、従来の技術における上述のような問題を解消し、映像音声とシナリオテキストを高い精度で対応付けることを可能とする、映像音声とシナリオテキストとの整合方法および装置、並びに前記方法を記録した記憶媒体とコンピュータソフトウェアを提供することにある。
【０００９】
【課題を解決するための手段】
上記目的を達成するため、本発明に係る映像音声とシナリオテキストとの整合方法は、人物データベース準備ステップと、映像音声信号入力ステップと、音声特徴量抽出ステップと、発話区間情報抽出ステップと、シナリオテキスト入力ステップと、台詞情報抽出ステップと、整合ステップと、インデクス情報生成ステップとを有することを特徴とする。
【００１０】
人物データベース準備ステップは、人物のデータベースを準備する。人物のデータベースは、各話者の基準となる音声から抽出された特徴ベクトルと、各話者のシナリオテキスト中における呼称とが関連付けられたものであり、既存のデータベースを用いる他に、本ステップで生成することもできる。
【００１１】
また、映像音声信号入力ステップは、ビデオ等の映像音声ファイルをプログラムの入力に指定する。音声特徴量抽出ステップは、映像音声からスペクトル，ピッチなどの特徴量を抽出する。発話区間情報抽出ステップは、話者毎の発話区間情報を抽出し、パタン化する。シナリオテキスト入力ステップは、シナリオテキストをプログラムの入力に指定する。台詞情報抽出ステップは、シナリオテキストから、各話者の台詞情報を抽出し、パタン化する。
【００１２】
整合ステップは、上述の発話区間情報抽出ステップによって得られたパタンと台詞情報抽出ステップによって得られたパタンとを、上記人物データベース生成ステップにより生成された人物データベースを用いて、同じ人物に関して共通のパタン要素へと変換し、変換したパタンを用いてＤＰマッチングに基づく整合処理を行う。
インデクス生成ステップは、上記各ステップによって得られる結果に基づいて、シナリオテキスト内に、映像音声の再生時刻を関連付けたインデクス情報を生成する。
【００１３】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記人物データベース準備ステップにおいて、発話区間情報のパタンに含まれる人名と、台詞情報のパタンに含まれる人名とを、共通のＩＤにマッピングすることが可能な、人物ＩＤテーブルを生成することを特徴とする。
【００１４】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記整合ステップにおいて、データベース中の特徴ベクトルに基づいて多次元空間インデクスを生成することを特徴とする。
【００１５】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記発話区間情報抽出ステップにより話者毎の発話区間のパワーの値をパタン化し、また、前記台詞情報抽出ステップにより台詞に含まれる強調符の位置をパタン化し、両者を用いて整合を行うことを特徴とする。
【００１６】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記発話区間情報抽出ステップにより話者毎の発話区間をピッチの値をパタン化し、また、前記台詞情報抽出ステップにより台詞に含まれる疑問符の位置をパタン化し、両者を用いて整合を行うことを特徴とする。
【００１７】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記整合ステップにおいて、前記映像音声およびシナリオテキストからそれぞれ抽出されたパタンを用いて、非線形時間伸縮によるマッチングを行うことを特徴とする。
【００１８】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記整合ステップにおいて、マッチングを行う際、マッチングスコア（距離）に対して、前記映像音声およびシナリオテキストから抽出された特徴に任意の重みをつけることを特徴とする。
【００１９】
また、本発明に係る映像音声とシナリオテキストとの整合方法では、前記発話区間情報抽出ステップにより得られる発話パタンと、前記台詞情報抽出ステップにより得られる台詞パタンとを用いて、人物ＩＤテーブルを生成する際、前記映像音声およびシナリオテキストを複数の区域に分割し、その各々について話者の発話時間の比率を計算し、人物ＩＤテーブルを生成することを特徴とする。
【００２０】
なお、本発明に係る映像音声とシナリオテキストとの整合方法は、これを、コンピュータのプログラム制御により実行することが可能であり、本発明の技術範囲は、このためのコンピュータソフトウェア、さらには、このコンピュータソフトウェアを記録した、コンピュータにより読み取り可能な記録媒体にも及ぶことはいうまでもない。
【００２１】
一方、本発明は、映像音声とシナリオテキストとの整合装置としても、特徴を有するものである。
すなわち、本発明に係る映像音声とシナリオテキストとの整合装置は、映像音声中に登場する人物とシナリオテキスト中に登場する人物との対応情報を準備する人物データベース準備処理手段と、映像音声を入力する映像音声信号入力処理手段と、この処理手段で入力された映像音声信号から、音声特徴量を抽出する音声特徴量抽出処理手段と、この処理手段で抽出された音声特徴量に基づいて、各話者の発話区間情報のパタンを抽出する発話区間情報抽出処理手段と、シナリオテキストを入力するシナリオテキスト入力処理手段と、この処理手段で入力されたシナリオテキストから、人物に関する台詞情報のパタンを抽出する台詞情報抽出処理手段と、上記処理手段で得られる発話区間情報のパタンと台詞情報のパタンとを、パタン間の距離が最小となるようにパタンに含まれる要素同士を対応させて、要素の対応付け情報を求める整合処理手段と、この処理手段で得られるパタン間の要素の対応付け情報から、シナリオテキストに映像音声の再生時刻情報を関連付けたインデクス情報を生成するインデクス情報生成処理手段とを有することを特徴とする。
【００２２】
【作用】
本発明によれば、
▲１▼：シナリオテキスト中の人物と、映像音声中の人物とは、必ずしも一致しないことに鑑みて、同じ人物を共通のＩＤで表現するためのテーブルを準備するようにしたこと
▲２▼：話者識別技術を用いて、各話者の発話区間を求め、発話区間情報，パワーの値およびピッチの値を等しい粒度でパタン化するようにしたこと
さらには、
▲３▼：シナリオテキストから台詞箇所を抽出し、それを前後に出現した人名と関連付け、モーラ数に基づいて台詞情報を求め、台詞情報，疑問符の位置情報，強調符の位置情報を等しい粒度でパタン化するようにしたこと
▲４▼：発話区間情報を台詞情報へ、シナリオ中の疑問符をピッチへ、シナリオ中の強調符をパワーへと対応させて、共通ＩＤに変換して、整合を行うようにしたこと
等の工夫により、従来技術よりも高い精度で、しかも高速に、映像音声とシナリオテキストを対応付けることが可能となる。
【００２３】
【発明の実施の形態】
以下、本発明の実施の形態を、図面に示す好適実施例に基づいて、詳細に説明する。
【００２４】
〔実施例１〕
図１を用いて、本発明の実施例１の実施形態を説明する。本実施例は、人物データベース準備ステップ１１，映像音声処理ステップ１２，シナリオテキスト処理ステップ１３，整合ステップ１４に分けることができる。
本実施例では、整合に用いるパタンとして、次の特徴を映像音声およびシナリオテキストから抽出する。映像音声からは各話者の発話区間，発話区間中のパワーの推移および発話区間中のピッチの推移を抽出する。また、シナリオテキストからは、人物名，台詞部分の文字に基づくモーラ数，疑問符の箇所および強調符の箇所を抽出する。
【００２５】
次に、人物データベース準備ステップ１１について説明する。
図２は、人物データベース準備ステップ１１の動作を説明する詳細フローチャートである。本ステップでは、人物データベースを生成する場合について説明する。
ステップ２０１では、各話者の基準となる発話音声（以後、これを基準音声と呼ぶ）を準備する。各話者の基準音声は、１０〜３０秒の、１人の話者の発話音声だけが含まれる音声のことである。これは、汎用のソフト（例えばWindows（登録商標） Media Playerなど）または非特許文献７に記載された方法によって、人手または自動処理により、映像音声から容易に取り出すことができる。
【００２６】
例えば、下記の文献が参考になる。
西田、秋田、河原「討論を対象とした話者モデル選択による話者インデキシングと自動書き起こし」
電子情報通信学会技術研究報告、SP2002-157,NLC2002-80(SLP-44-37),2002
討論音声を対象とした教師なし話者インデキシングとそれを用いた音声認識技術に関するものである。討論音声では、話者の交替が頻繁に発生し、継続時間が短い発話が多く、発話時間のばらつきも大きいため、画一的なモデルで話者インデキシングが難しい。
【００２７】
そこで、ここでは、発話時間の短い音声に対しては、ＶＱモデル、長い音声に対してはＧＭＭモデルが選択される枠組みを実現している。ここでは、これに加えて、討論音声認識のための音響・言語モデルについても検討し、良好な話者インデキシング結果を得ることに成功している。この技術は、最終的に音声認識精度の向上を狙ったものであるが、その過程での処理が、単独の話者の発話する区間の決定に応用できる。すなわち、上述のステップ２０１の処理に適用することができる。
【００２８】
ステップ２０２では、各話者の基準音声を入力して非特許文献８に記載される処理を行い、基準音声のスペクトルの包絡を表わす係数（線形予測係数，ＬＰＣケプストラム，ＭＦＣＣ等）を出力する。本ステップで得られた各話者の特徴量は、図３の例に示すような数値の列で表現される。なお、ここでは、短時間のスペクトル特徴から、逐次ベクトルデータが生成される。
【００２９】
例えば、下記の文献が参考になる。
F.K.Soong,A.E.Rosenberg,L.R.Rabiner and B.H.Juang,
"A Vector Quantization Approach to Speaker Recognition"
AT&T Technical Journal,Vol.66,pp.14-26,Mar/Apr 1987
ここでは、話者の識別方法として、識別対象となる人物の音声から、短時間スペクトル包絡の特徴を抽出し、ベクトルとして格納しておき、識別時には、入力される音声から短時間スペクトル包絡の情報を同様にベクトルとして抽出し、格納されているベクトルと照合する。ベクトルの距離に基づいて入力された音声の話者が格納されている人物の音声か否かを判断する。この音声特徴量の抽出方法は、ステップ２０２に適用できる。具体的には、数ms〜数十msの短時間ずつに入力音声を取り出し、その各々からケプストラム等の係数を抽出するという処理である。
【００３０】
ステップ２０３で、話者毎に特徴ベクトル個数を正規化する場合には、ステップ２０４で、ステップ２０３の出力で得られた基準音声のスペクトルの包絡を表わす係数（線形予測係数，ＬＰＣケプストラム，ＭＦＣＣ等）を入力し、非特許文献９に記載される処理を行い、話者毎に一定の個数のベクトルを出力する。本ステップの出力により、例えば図４に示すようなデータ構造で、映像音声に含まれる話者名（４１）、およびシナリオテキスト中の人物の呼称（４２）、人物の特徴を表わすベクトルの番号（４３）、ベクトルの要素の値（４４）を各カラムとするテーブル４０１を作ることが可能である。
【００３１】
例えば、下記の文献が参考になる。
Y. Linde, A. Buzo, and R. M. Gray,
"An Algorithm for Vector Quantizer Design," IEEE Transactions on Communications, Vol. COM-28, No. 1, 1980.
ＬＰＣ等を用いてパタン認識を行う場合、予め登録信号を学習し、それらの特徴をベクトルとして格納する。このとき、ベクトル空間内のベクトルの分布を、エラーベクトルに影響されずに忠実にベクトルの分布を表現できるような、代表ベクトルの集合を作成することが重要である（この作業を量子化という）。元のベクトルでなく代表ベクトルを用いる理由は、ベクトルの数を削減できることにあり、ベクトルの数が削減できれば、識別時間の短縮につながるためである。本発明では、必要に応じて、ステップ２０４に適用する。
【００３２】
ステップ２０５では、前述のテーブルを、磁気ディスク等の記録媒体へと保存する。ここでは、各話者のベクトルデータを一定個数に圧縮し、これに番号を付け、話者名およびシナリオテキスト中における人物の呼称と対応付けるテーブルを生成する。図４の例では、各話者について８つの特徴ベクトルが保存されている。
【００３３】
ステップ２０６で、例えば図５に示すようなテーブル（人物ＩＤテーブル）５０１に基づいて、話者名およびシナリオテキスト中の人物の呼称がそれぞれ共通のＩＤへと対応付けがなされる。ここで用いるテーブルは、映像音声中の人物名およびシナリオテキスト中の人物の呼称の対応表であり、これに基づいて、各人物に番号（ＩＤ）が与えられる。なお、図５の例ではテーブルは３つのカラムよりなっているが、２つのカラムからなる２つの表を用いてもよい。
【００３４】
ステップ２０７で、各人物に帰属するベクトルを包含する集合として、話者クラスを定義する（以後、クラスと呼ぶ）。
ステップ２０８で、必要な話者についての特徴ベクトルの登録が完了したかどうかを判断する。
【００３５】
ステップ２０９では、記録媒体等へ保存されたデータから、その特徴量の値に基づいて、非特許文献１０による処理を行い、多次元空間インデクスを生成し、インデクスのデータをメモリ等の主記憶装置へと格納する。図６に示すように、多次元空間インデクス６０１は、各ノードに類似するベクトルのＩＤ（６０２）が関連付けられており、キーベクトルに類似するベクトルを探索する際に、データベース中のベクトル全部に対して探索を行う必要がなく、探索時間を短縮することができる。
【００３６】
例えば、下記の文献が参考になる。
K.Curtis,N.Taniguchi,J.Nakagawa,and M.Yamamuro.
A comprehensive image similarity retrieval system that utilizes multiple feature vectors in high dimensional space, Proceedings of International Conference on Information, Communication and Signal Processing,pp. 180-184, 1997.
この技術は、画像の類似検索のために、特徴量を多次元のベクトルに変換し、それらを階層構造に管理することにより、高速な探索を可能とするものである。２次元の場合を例にとると、ベクトルの空間を矩形で区切り、それぞれの部分に存在するベクトルを、階層構造に管理するインデクスを生成する。このインデクスを用いれば、新たにベクトルが入力された場合、それがどのベクトルの近傍であるかを、すべてのベクトルを探索することなしに知ることができる。この方法を、ステップ２０９に適用している。
【００３７】
図１のステップ１１では、上記のように人物データベースを作成する他に、予め準備された、既存のデータベースを用いることも可能である。
【００３８】
次に、図１の映像音声処理ステップ１２について説明する。図７は、映像音声処理ステップ１２の動作を説明する詳細フローチャートである。
【００３９】
ステップ７１では、プログラムの入力に映像音声を指定し、映像音声を入力する。
ステップ７２では、入力された映像音声から、音声トラックを分離する。
ステップ７３では、ステップ７２で分離された音声トラックから、前記人物データベース準備ステップ１１のステップ２０２と同じ方法で特徴量を抽出し、これをキーとする。
【００４０】
ステップ７４では、得られたキーを用いて、前記人物データベース準備ステップ１１によって作成された人物データベース中から、キーに類似する特徴ベクトルの探索を行う。特徴ベクトルとキーとの類似尺度として、ベクトル間の様々な距離尺度（ユークリッド距離，市街地距離等）を用いることが可能である。また、探索は、多次元空間インデクスを参照しながら探索する。
【００４１】
ステップ７５では、ステップ７４の探索結果に基づいて、一定粒度で話者名を判定することにより、各話者の発話区間を求めていく。
今、キーベクトルのｋ最近傍空間に含まれるデータベース中のベクトルをｘ_ｉ（ｉ=１,２,…ｋ）、ｘ_ｉのクラス判別関数をＣ（ｘ_ｉ）、話者クラスｐを含む話者クラスのセットをＰとすると、話者名ｐ_ａｎｓは次の式（１）で求められる。
【数１】

【００４２】
但し、１［Ｆ］とは、条件文Ｆが成立する場合に１をとり、他では０をとる関数と定義する。Ｊは話者名の判別粒度をＲ_ｓ、キーの生成頻度をｆとした場合、定数ＪはＪ＝Ｒ_ｓ/ｆで得られる数である。
ステップ７６では、話者名の判定が音声トラックの終端まで達したか否かを判断する。
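The speaker judgement of step 75 can be sketched as a majority vote. Equation (1) itself is not reproduced in this text, so the sketch below assumes that it selects the speaker class receiving the most votes among the k nearest database vectors of the J keys that fall within one judgement unit. The function names, the Euclidean distance, and the sample data are illustrative assumptions, and a linear scan stands in for the multidimensional index.

```python
from collections import Counter
import math

def k_nearest_classes(key, database, k):
    """database: list of (vector, speaker_class). Return the classes of the k nearest vectors."""
    ranked = sorted(database, key=lambda item: math.dist(key, item[0]))
    return [cls for _, cls in ranked[:k]]

def judge_speaker(keys, database, k):
    """Majority vote over the k nearest neighbours of every key in one judgement unit."""
    votes = Counter()
    for key in keys:
        votes.update(k_nearest_classes(key, database, k))
    return votes.most_common(1)[0][0]

def utterance_pattern(all_keys, database, k, J):
    """Judge one speaker per group of J consecutive keys, giving the pattern N."""
    return [judge_speaker(all_keys[i:i + J], database, k)
            for i in range(0, len(all_keys), J)]

# Hypothetical two-dimensional feature vectors for two registered speakers.
database = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
keys = [(0.1, 0.1), (0.2, 0.0), (5.1, 5.0), (5.0, 5.2)]
pattern_n = utterance_pattern(keys, database, k=3, J=2)
```

Voting over J keys rather than judging each key separately smooths out individual misclassified frames, which is one plausible reason for tying J to the judgement granularity.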
【００４３】
ステップ７７では、ステップ７５で得られた話者名に基づいて、図８に示すように、ｐ_ａｎｓの値を各要素とするような、話者の発話区間を表わすパタン、Ｎ＝｛ｎ１,ｎ２,...ｎｍ｝を生成する（８０１）。但し、ｍは、映像音声の長さをＴとし、ｍ＝Ｔ/Ｒで得られる数である。
【００４４】
ステップ７８では、パワーの値を表わすパタンを生成する。キーを生成するための入力音声の振幅の値の積分値から、話者名の判定粒度と等しい粒度で、パワーの平均値を求める。これにより、パワーの値を表わすパタンＰ＝｛ｐ１,ｐ２,...ｐｍ｝を生成する（８０２）。
ステップ７９では、ピッチの値を表わすパタンを生成する。キーを生成するために入力音声に対し、自己相関法，変形相関法などの一般的な方法により、ピッチを求める。話者名の判定粒度と等しい粒度で、ピッチの平均値を求める。これにより、ピッチの値を表わすパタンＦ＝｛ｆ１,ｆ２,...ｆｍ｝を生成する（８０３）。
【００４５】
次に、図１のシナリオテキスト処理ステップ１３について説明する。図９は、シナリオテキスト処理ステップ１３の動作を説明する詳細フローチャートである。
ステップ９１では、プログラムの入力にシナリオテキストを指定する。
ステップ９２では、シナリオテキストファイルから台詞の部分を検索する。
シナリオの記述方法は様々なフォーマットが考えられ、台詞の検索には「…」，‘…’，“…”などの、記号で囲み表記された部分を用いることができる。また、シナリオテキストがＸＭＬ等の構造化文章である場合には、そのタグ情報を用いることが可能である。
【００４６】
ステップ９３では、各台詞の直前または直後に存在する人名を検索し、得られた人名から、台詞と人名とを関連付ける。
ステップ９４では、人名と関連付けられた各台詞の中から、疑問符“？”および強調符“！”を検索する。
ステップ９５では、台詞，疑問符および強調符のパタンを生成する。まず、シナリオテキストに含まれる全台詞から、モーラ数の総数を計算する。モーラ数とは、母音またはｎが含まれる音の数で、例えば「朝ごはんです」という台詞に対しては、「a/sa/go/ha/n/de/su」となり、モーラ数は７ということになる。
【００４７】
ステップ９５におけるパタン生成処理を、図１０に示す詳細フローチャートを用いて説明する。
ステップ１０１では、シナリオテキスト中に含まれる全台詞のモーラ数の総数（Ｔ_ｍ）を計算する。
ステップ１０２で、Ｔ_ｍ個のモーラに対し、ステップ９３の結果を用い、各モーラが含まれている台詞に関連付けられた人物名を各要素とする、Ｔ_ｍ個のデータを得る。
【００４８】
ステップ１０３では、ステップ１０２で得られたＴ_ｍ個の人物名のデータに対し、先頭からＲ_ｔ個毎にデータを集計し、各集計について最も多く含まれている人物名をパタンの１要素として出力する（図１１に例を示す）。こうして、台詞の人物を表わすパタンＳ｛ｓ１,ｓ２,...ｓｌ｝を生成する（１１１）。但し、パタンの要素数Ｌは、Ｌ＝Ｔ_ｍ/Ｒ_ｔにより得られる数である。
【００４９】
ステップ１０４では、疑問符の直前のモーラが含まれるパタンＳの要素ｓｌ*について、ｌ=ｌ*となる要素のみが１で他が０となるような疑問符の位置を表わすパタンＱ｛ｑ１,ｑ２,...ｑｌ｝を生成する（１１２）。
ステップ１０５で、強調符の直前のモーラが含まれるパタンＳの要素ｓｌ*について、ｌ=ｌ*となる要素のみが１で他が０となるような強調符の位置を表わすパタンＥ｛ｅ１,ｅ２,...ｅｌ｝を生成する（１１３）。
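Steps 101 to 105 can be sketched as follows. This is a minimal illustration that assumes each line of dialogue has already been reduced to a hypothetical triple of person name, mora count, and trailing mark; counting morae from Japanese text is outside the sketch, and a mark is assumed to occur only at the end of a line. Unlike L = T_m / R_t in the text, the sketch keeps a final partial block when T_m is not a multiple of R_t.

```python
from collections import Counter

def dialogue_patterns(lines, r_t):
    """lines: list of (person_name, mora_count, mark), mark in {"", "?", "!"}.
    Return (S, Q, E), each with one element per block of r_t morae."""
    names = []                 # one entry per mora: the person speaking it
    q_pos, e_pos = [], []      # index of the mora just before each mark
    for name, mora_count, mark in lines:
        names.extend([name] * mora_count)
        if mark == "?":
            q_pos.append(len(names) - 1)
        elif mark == "!":
            e_pos.append(len(names) - 1)
    blocks = [names[i:i + r_t] for i in range(0, len(names), r_t)]
    # S: the most frequent person name in each block.
    S = [Counter(block).most_common(1)[0][0] for block in blocks]
    Q = [0] * len(S)
    E = [0] * len(S)
    for p in q_pos:
        Q[p // r_t] = 1        # block containing the mora before a question mark
    for p in e_pos:
        E[p // r_t] = 1        # block containing the mora before an emphasis mark
    return S, Q, E

# Hypothetical scenario: the first line has 7 morae, like "a/sa/go/ha/n/de/su".
lines = [("Taro", 7, ""), ("Hanako", 5, "?"), ("Taro", 4, "!")]
S, Q, E = dialogue_patterns(lines, r_t=4)
```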
【００５０】
次に、図１の整合ステップ１４について説明する。図１２は、整合ステップ１４の動作を説明する詳細フローチャートである。
本実施例のこのフェーズでは、前記映像音声処理ステップ１２およびシナリオテキスト処理ステップ１３でそれぞれ得られたパタンを非線形に伸縮することにより、整合させる。この際、パタン同士の距離が最小になるように、各要素の対応付けを行う。
【００５１】
ステップ１２１では、パタンの対を生成する。本発明では、パタンの対は３対（ＮとＳ，ＰとＥおよびＦとＱ）生成される。パタンの対は、各パタンの由来に基づいて生成される。図１３に示すように、映像音声中における話者毎の発話区間を表わすパタンは、シナリオテキスト中の台詞に関連付けられた人物名と対応するので、パタンＮとＳとが対になる。
また、疑問文においては一般に語尾が尻上がりとなることから、映像音声中のピッチを表わすパタンＦは、シナリオテキスト中の疑問符の位置を表わすパタンＱと対になる。同様に、映像音声中のパワーを表わすパタンＰは、シナリオテキスト中の強調符の位置を表わすパタンＥと対になる（１３１〜１３３）。
【００５２】
ステップ１２２では、図１４に示すように、対応させるパタンをプログラムへ入力し（１４１，１４２）、人物ＩＤテーブル５０１を参照し、人物に関する要素を共通なＩＤ（１４３、１４４）へと変換する。本ステップは、前記映像音声処理ステップ１２とシナリオテキスト処理ステップ１３とから得られる人名が異なる場合に必要である。
例えば、シナリオテキストにはドラマの役者名が記載され、人物データベースには俳優名が記載される場合があるからである。ここでは、人物ＩＤテーブル５０１を図５，図６のような３カラムからなるテーブルとして説明したが、話者名を共通ＩＤに変換することが可能であれば、どのようなデータ構造でもよい。
【００５３】
ステップ１２３では、パタン間の距離が最小になるように、各パタンに含まれる要素同士を対応させる。例えば、パタンＮ｛ｎ１,ｎ２,...ｎｍ｝およびパタンＳ｛ｓ１,ｓ２,...ｓｌ｝を整合させる場合、２つのパタンの距離ｄ（Ｎ、Ｓ）は一般に次の式（２）で表される。
【数２】

【００５４】
但し、ｋ＝１,２,...Ｋは各パタン間の要素の組み合わせ数、ｍ（ｋ）は非負のパス重み係数、Ｍφは正規化係数、ｃｋは各パタンに含まれる要素の対、ｄ（ｃｋ）は２要素間の距離をそれぞれ表わす。ここで、パタン要素間の距離はユークリッド距離、City-Block距離等、様々な距離を用いることができる。
例えば、図１５に示すように、パス傾斜を０.５‐２.０に制限し、パタン要素間の距離としてCity-Block距離を用いた場合には、
Ｍφ＝ｍ＋１
ｄ（ｃｋ）＝｜ｎ_ｉｋ‐ｓ_ｊｋ｜
となる。
【００５５】
ここで、ＩはパタンＮに含まれる要素数、ＪはパタンＳに含まれる要素数をそれぞれ表わす。ｋ＝１,２,...Ｋの各ステップに至るまでの正規化した距離和をδ、各ステップにおけるｎｋとｓｋの組み合わせのセットをｃｋとすると、図１５に示すように、パス重みを１.０〜２.０とした場合、δおよびｃｋは、下記の漸化式（３）、（４）により得られる。
【００５６】
【数３】

【数４】

ここで、上の漸化式において、ｋ＝Ｋの時、δ（ｎＫ，ｓＫ）≡ｄ（Ｎ，Ｓ）である。整合フェーズでは、このｄ（Ｎ，Ｓ）を最小にするようなｎｋとｓｋの組み合わせの、Ｋ個のセットＣ_Ｋを求める。
【００５７】
本発明では、パタンの対は３対（ＮとＳ、ＰとＥおよびＦとＱ）あるので、始めにパタンの対に対して生成されるパタン間距離に任意の重みｗ（ｇ）（ｇ＝１,２,３）を設定し、パタン間距離を求め、これを最小にするような要素の対ｃ_ｋを求めることになる。
すなわち、下記の式（５）のように表わせる。
【数５】

但し、ｇはパタンの対の個数であり、本実施例ではＧ＝３である。
【００５８】
ステップ１２４で、漸化式（３），（４）を用い、ｃ_ｋに対するδが最小かどうか判断し、最小値を与えるＣ_Ｋが更新される。
ステップ１２５では、最終的にｋ＝Ｋにおいて、最終的に最小δを与えるＣ_Ｋが出力される。
ステップ１２６では、Ｃ_Ｋに基づいて、シナリオテキストと映像音声の箇所を対応させ、インデクス情報を生成する。インデクス情報は、図１６の例に示すように、シナリオテキスト１６１内に、映像音声の再生時刻１６２を関連付けるものである。
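The matching of steps 123 to 126 can be illustrated with a plain dynamic-programming alignment. Recurrences (3) and (4) are not reproduced in this text, so the sketch below substitutes the textbook symmetric step pattern without the slope limit, path weights, or normalization described above; it is a stand-in, not the patented recurrence. The function names, the 0/1 element distance on common IDs, and the two-second time unit are assumptions.

```python
def dp_match(a, b, dist):
    """Align sequences a and b by dynamic programming; return (total cost, path)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + best
    # Backtrack from the end to recover the matched element pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

def start_times(path, unit_seconds):
    """Earliest matched audio time in seconds for each scenario pattern element."""
    times = {}
    for audio_idx, text_idx in path:
        times.setdefault(text_idx, audio_idx * unit_seconds)
    return times

# Hypothetical patterns after conversion to common IDs.
pattern_n = [1, 1, 1, 2, 2, 1]      # audio side, one element per time unit
pattern_s = [1, 2, 1]               # scenario side, one element per mora block
total, path = dp_match(pattern_n, pattern_s, lambda x, y: 0 if x == y else 1)
index = start_times(path, unit_seconds=2.0)
```

The `index` dictionary plays the role of the index information: it ties each scenario element to a playback time in the video and audio.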
【００５９】
本実施例では、パタンＮとＳ，ＰとＥおよびＦとＱの３対のパタンを用いたが、少なくとも１対のパタンがあれば、映像音声とシナリオテキストとの整合を行うことは可能である。また、本実施例では、ＤＰマッチングの技法を用いたが、総当り法で対応付けを行うことも可能である。
【００６０】
〔実施例２〕
図１７を用いて、本発明の実施例２の実施形態を説明する。本実施例では、実施例１において人が手作業で作成していた人物ＩＤテーブルを自動的に作成するための、人物ＩＤテーブル生成ステップ１７１が追加されている。また、人物データベース準備ステップ１１では、図４におけるシナリオテキストにおける人物の呼称を表わすカラム４２が空白となっている。
【００６１】
人物ＩＤテーブル生成ステップについて説明する。図１８は、実施例２の本ステップの動作を説明する詳細フローチャートである。
ステップ１８１では、映像音声処理ステップ１２において得られたパタンに基づいて、図１９に示すように、人物毎に、発話の総時間の比率を算出する（１９１）。
ステップ１８２では、シナリオテキスト処理ステップ１３において得られたパタンに基づいて、図１９に示すように、人物毎に、台詞長さの総時間の比率を算出する（１９２）。
【００６２】
ステップ１８３では、ステップ１８１およびステップ１８２で得られた比率に基づいて上位から順に人名を対応付けることにより、人物ＩＤテーブル１９３を生成する。また、ステップ１８３で、人名の対応付けができない場合には、映像全体を複数の区域に分割し、分割された区域のそれぞれに対して、ステップ１８１〜ステップ１８２の処理を行い、人物ＩＤテーブルを生成することも可能である。
なお、その他のフェーズの動作は全て実施例１と同様である。
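The automatic table generation of steps 181 to 183 can be sketched as ranking both sides by their share of the pattern and pairing equal ranks. The function names and sample names are hypothetical, and the fallback of splitting the material into several areas when ranks cannot be paired is not shown.

```python
from collections import Counter

def share_ranking(pattern):
    """Names ordered by their share of the pattern, largest share first."""
    counts = Counter(pattern)
    total = len(pattern)
    return [(name, count / total) for name, count in counts.most_common()]

def build_person_id_table(utterance_pattern, dialogue_pattern):
    """Pair the i-th most talkative speaker with the i-th most talkative character."""
    speakers = share_ranking(utterance_pattern)
    characters = share_ranking(dialogue_pattern)
    return [(pid, s[0], c[0])
            for pid, (s, c) in enumerate(zip(speakers, characters), start=1)]

# Hypothetical patterns: speaker names from the audio, character names from the scenario.
pattern_n = ["Actor A"] * 5 + ["Actor B"] * 3 + ["Actor C"]
pattern_s = ["Witness"] * 2 + ["Detective"] * 4 + ["Clerk"]
table = build_person_id_table(pattern_n, pattern_s)
```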
【００６３】
その他の実施例
本発明は、上記の実施例の他に、次のような実施形態で行うことができる。
（１）大局的な発話パタンを抽出し、マッチングの際のパスを制限する／重みをつけることができる。
（２）従来技術で用いられてきた特徴と、本発明で用いている特徴とを任意に組み合わせて整合を行うことができる。
【００６４】
本実施例によっても、映像音声とシナリオテキストとを高い精度で対応付けることを可能とする、映像音声とシナリオテキストとの整合方法を実現することが可能である。
【００６５】
なお、上記各実施例はいずれも本発明の一例を示すものであり、本発明はこれらに限定されるべきものではなく、本発明の技術的範囲を逸脱しない範囲内で適宜の変更・改良を行ってもよいことはいうまでもない。
【００６６】
例えば、前述のように、本発明に係る映像音声とシナリオテキストとの整合方法は、これを、コンピュータのプログラム制御により実行することが可能であり、本発明の技術的範囲は、このためのコンピュータソフトウェア、さらには、このコンピュータソフトウェアを記録した、コンピュータにより読み取り可能な記録媒体にも及ぶことはいうまでもない。
【００６７】
また、本発明に係る映像音声とシナリオテキストとの整合方法を具体化した映像音声とシナリオテキストとの整合装置も、有用なものである。
【００６８】
【発明の効果】
以上、詳細に説明したように、本発明によれば、各話者の発話情報に基づいてシナリオテキストと映像音声の整合を行うことができ、従来技術よりも精度のよい整合が可能になる。また、単語の情報を用いた場合に比べて整合箇所が少なく、従来技術よりも精度のよい整合が可能になるという顕著な効果を奏するものである。
【図面の簡単な説明】
【図１】本発明の実施例１の実施形態を示す図である。
【図２】図１中の人物データベース準備ステップ１１の動作を説明する詳細フローチャートである。
【図３】図２中のステップ２０２で得られる特徴ベクトルのデータの一例を示す図である。
【図４】図２中のステップ２０５で得られる特徴ベクトルテーブルの一例を示す図である。
【図５】図２中のステップ２０６で用いる人物ＩＤテーブルの一例を示す図である。
【図６】図２中のステップ２０９における非特許文献１０による処理で得られるデータの一例を示す図である。
【図７】図１中のステップ１２の処理を示す詳細フローチャートである。
【図８】図７中のステップ７７，７８，７９で得られるデータの一例を示す図である。
【図９】図１中のステップ１３の処理を示す詳細フローチャート
【図１０】図９中のステップ９５における処理を示す詳細フローチャートである。
【図１１】図１０中のステップ１０３，１０４，１０５で得られるデータの一例を示す図である。
【図１２】図１中のステップ１４の処理を示す詳細フローチャートである。
【図１３】図１２中のステップ１２１におけるパタン対の生成方法を示す図である。
【図１４】図１２中のステップ１２２におけるパタン要素の変換方法を示す図である。
【図１５】パタン同士の非線形伸縮によるマッチング（図１２のステップ１２３および１２４の処理で用いる計算式）を示す図である。
【図１６】図１２中のステップ１２６で出力されるインデクス情報の一例を示す図である。
【図１７】本発明の実施例２の実施形態を示す図である。
【図１８】実施例２の人物ＩＤテーブル生成ステップの動作を説明する詳細フローチャートである。
【図１９】人物ＩＤテーブルの自動生成を説明する図である。
【符号の説明】
１１人物データベース準備ステップ
１２映像音声処理ステップ
１３シナリオテキスト処理ステップ
１４整合ステップ
１７１人物ＩＤテーブル生成ステップ
２０１〜２０９人物データベース準備ステップの詳細
４０１特徴ベクトルテーブル
５０１人物ＩＤテーブル
６０１多次元空間インデクス
[0001]
[Technical Field of the Invention]
The present invention relates to a technique for generating index information, called metadata, for video and audio such as videos and TV broadcast programs. More specifically, it relates to a method and apparatus for matching video and audio with scenario text, and to a storage medium and computer software in which the method is recorded.
[0002]
[Prior art]
There is a demand to search for specific scenes based on the content of a video. For example, when editing a drama into a shortened version, an editor wants to quickly find scenes with specific content, or scenes in which a specific person speaks, within drama footage that runs for a long time (several hours to more than ten hours). For broadcast programs such as dramas, movies, and news, a scenario is created first, at the production stage before airing, and the video is produced from that scenario. Because the scenario contains scene information, the characters' conversations, the progression of topics, and so on, the content search described above becomes possible if the scenario can be associated with the completed video and audio.
[0003]
In response, Non-Patent Documents 1 to 6 associate (match) scenario text with programs by focusing on the following information.
That is, from the program (video and audio),
(1): presence or absence and length of conversation, based on voiced and silent intervals
(2): female utterance sections, based on differences in average pitch
(3): words extracted by speech recognition
are extracted; from the scenario text,
(4): keywords contained in the text
(5): utterance length estimated from the character string inside the brackets of each line of dialogue
are extracted; and these are used in the association processing.
[0004]
[Non-Patent Document 1]
Yaginuma, Sakauchi "A proposal of drama video / sound / scenario document matching method using DP matching"
IEICE Transactions D-II, Vol.J79-D-II, No.5, pp.747-755, 1996
[Non-Patent Document 2]
Yaginuma, Izumi, Sakauchi "A Proposal of Video Editing Method Using Synchronized Scenario Documents" IEICE Transactions D-II, Vol. J79-D-II, No. 4, pp. 547-558, 1996
[Non-Patent Document 3]
Tanimura, Nakagawa “Adding time information to drama scenarios using voice recognition”
(Two presentations with the same title)
Proceedings of the 5th Annual Meeting of the Association for Natural Language Processing, pp. 513-516, 1999
Proceedings of the IEICE General Conference, pp. 377-378, 1999
[0005]
[Non-Patent Document 4]
Tanimura, Nakagawa “Automatic correspondence between TV drama scenarios and audio tracks”
IPSJ Natural Language Processing Spoken Language Information Processing Joint Research Group pp.23-29
[Non-Patent Document 5]
Tanimura, Nakagawa “Synchronization system of dialogue lines and audio tracks in TV dramas”
1999 13th Annual Conference of the Japanese Society for Artificial Intelligence, Proceedings, pp.205-208, 1999
[Non-Patent Document 6]
Tanimura, Nakagawa “Time Synchronization Method of Drama Video and Audio Tracks and Scenarios” IPSJ Knowledge and Complex Systems Study Group pp.25-31,1999
[0006]
[Problems to be solved by the invention]
However, the conventional technique has the following problems.
(1) Because some programs contain no female speakers, information (2) cannot necessarily be used for every program.
(2) In programs such as dramas, discussions, and movies, the performers speak in colloquial language, so the word accuracy of speech recognition is as low as about 50%, and information (3) is inaccurate.
(3) In dramas and the like, word recognition errors are likely to occur because of background noise and BGM, so information (3) is inaccurate for this reason as well.
(4) Because the words obtained by speech recognition and the words in the dialogue are used for matching, the number of elements to be matched is large and matching takes time.
[0007]
As described above, the prior art performs the association on the basis of inaccurate information, so the accuracy of the association is low; and because it uses words for matching, the matching takes a long time.
[0008]
The present invention has been made in view of the above circumstances. Its object is to solve the above problems of the prior art and to provide a method and apparatus for matching video and audio with scenario text that can associate the two with high accuracy, together with a storage medium and computer software in which the method is recorded.
[0009]
[Means for Solving the Problems]
In order to achieve the above object, the method for matching video and audio with scenario text according to the present invention is characterized by comprising a person database preparation step, a video/audio signal input step, an audio feature extraction step, an utterance section information extraction step, a scenario text input step, a dialogue information extraction step, a matching step, and an index information generation step.
[0010]
In the person database preparation step, a person database is prepared. The person database associates the feature vectors extracted from each speaker's reference speech with that speaker's name as used in the scenario text. An existing database may be used, or the database may be generated in this step.
[0011]
In the video / audio signal input step, a video / audio file such as a video is designated as an input of the program. In the audio feature quantity extraction step, feature quantities such as spectrum and pitch are extracted from the video and audio. In the utterance section information extraction step, utterance section information for each speaker is extracted and patterned. In the scenario text input step, the scenario text is designated as the program input. In the dialogue information extraction step, dialogue information of each speaker is extracted from the scenario text and patterned.
[0012]
The matching step converts the pattern obtained in the utterance section information extraction step and the pattern obtained in the dialogue information extraction step into common pattern elements for the same person, using the person database generated in the person database preparation step, and performs matching based on DP matching using the converted patterns.
In the index generation step, index information in which the reproduction time of the video and audio is associated with the scenario text is generated based on the results obtained in the above steps.
[0013]
The method for matching video and audio with scenario text according to the present invention is further characterized in that, in the person database preparation step, a person ID table is generated that can map the person names contained in the utterance section information pattern and the person names contained in the dialogue information pattern to common IDs.
[0014]
The video / audio and scenario text matching method according to the present invention is characterized in that, in the matching step, a multidimensional spatial index is generated based on a feature vector in a database.
[0015]
The method for matching video and audio with scenario text according to the present invention is further characterized in that the power values of each speaker's utterance sections are patterned in the utterance section information extraction step, the positions of emphasis marks contained in the dialogue are patterned in the dialogue information extraction step, and matching is performed using both.
[0016]
The method for matching video and audio with scenario text according to the present invention is further characterized in that the pitch values of each speaker's utterance sections are patterned in the utterance section information extraction step, the positions of question marks contained in the dialogue are patterned in the dialogue information extraction step, and matching is performed using both.
[0017]
The method for matching video and audio with scenario text according to the present invention is further characterized in that, in the matching step, matching by nonlinear time expansion and contraction is performed using the patterns extracted from the video and audio and from the scenario text, respectively.
[0018]
The method for matching video and audio with scenario text according to the present invention is further characterized in that, when matching is performed in the matching step, arbitrary weights are applied, with respect to the matching score (distance), to the features extracted from the video and audio and from the scenario text.
[0019]
The method for matching video and audio with scenario text according to the present invention is further characterized in that, when the person ID table is generated from the utterance pattern obtained in the utterance section information extraction step and the dialogue pattern obtained in the dialogue information extraction step, the video and audio and the scenario text are divided into a plurality of areas, the ratio of each speaker's utterance time is calculated for each area, and the person ID table is generated from these ratios.
[0020]
The method for matching video and audio with scenario text according to the present invention can be executed under the program control of a computer. Needless to say, the technical scope of the present invention therefore extends to computer software for this purpose and to a computer-readable recording medium on which that software is recorded.
[0021]
On the other hand, the present invention also has a feature as a matching device between video and audio and scenario text.
That is, the apparatus for matching video and audio with scenario text according to the present invention is characterized by comprising: person database preparation processing means for preparing correspondence information between the persons appearing in the video and audio and the persons appearing in the scenario text; video/audio signal input processing means for inputting the video and audio; audio feature extraction processing means for extracting audio features from the video/audio signal input by that means; utterance section information extraction processing means for extracting a pattern of each speaker's utterance section information on the basis of the extracted audio features; scenario text input processing means for inputting the scenario text; dialogue information extraction processing means for extracting a pattern of dialogue information about persons from the input scenario text; matching processing means for associating the elements of the utterance section information pattern and the dialogue information pattern with each other so that the distance between the patterns is minimized, thereby obtaining element correspondence information; and index information generation processing means for generating, from the element correspondence information, index information in which reproduction time information of the video and audio is associated with the scenario text.
[0022]
[Action]
According to the present invention,
(1): Because the persons in the scenario text do not necessarily match the persons in the video and audio, a table is prepared for representing the same person by a common ID.
(2): Speaker identification technology is used to obtain each speaker's utterance sections, and the utterance section information, power values, and pitch values are patterned at equal granularity.
Furthermore,
(3): Dialogue passages are extracted from the scenario text and associated with the person names appearing immediately before or after them, dialogue information is obtained on the basis of the number of morae, and the dialogue information, question mark positions, and emphasis mark positions are patterned at equal granularity.
(4): Utterance section information is matched to dialogue information, question marks in the scenario to pitch, and emphasis marks in the scenario to power, after conversion to common IDs.
Through these measures, video and audio can be associated with scenario text more accurately and faster than with the prior art.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail based on preferred examples shown in the drawings.
[0024]
[Example 1]
An embodiment of Example 1 of the present invention will be described with reference to FIG. This embodiment can be divided into a person database preparation step 11, a video / audio processing step 12, a scenario text processing step 13, and a matching step 14.
In this embodiment, the following features are extracted from the video and audio and scenario text as patterns used for matching. From the video and audio, the utterance section of each speaker, the power transition in the utterance section and the pitch transition in the utterance section are extracted. Further, from the scenario text, the name of the person, the number of mora based on the characters in the dialogue part, the question mark part and the emphasis mark part are extracted.
[0025]
Next, the person database preparation step 11 will be described.
FIG. 2 is a detailed flowchart for explaining the operation of the person database preparation step 11. In this step, a case where a person database is generated will be described.
In step 201, speech that serves as a reference for each speaker (hereinafter, reference speech) is prepared. A speaker's reference speech is 10 to 30 seconds of audio containing only that speaker's utterances. It can easily be extracted from the video and audio, manually or automatically, using general-purpose software (for example, Windows (registered trademark) Media Player) or the method described in Non-Patent Document 7.
[0026]
For example, the following documents are helpful.
Nishida, Akita, Kawahara "Speaker Indexing and Automatic Transcription by Speaker Model Selection for Discussion"
IEICE technical report, SP2002-157, NLC2002-80 (SLP-44-37), 2002
This work concerns unsupervised speaker indexing of discussion speech and speech recognition that uses it. In discussion speech, speakers change frequently, many utterances are short, and utterance durations vary widely, so speaker indexing with a single uniform model is difficult.
[0027]
Therefore, that work realizes a framework in which a VQ model is selected for speech with a short utterance time and a GMM is selected for long speech. It additionally examines acoustic and language models for discussion speech recognition and obtains good speaker indexing results. Although the technique ultimately aims to improve speech recognition accuracy, its intermediate processing can be applied to determining the sections in which a single speaker speaks; that is, it can be applied to the processing of step 201 described above.
[0028]
In step 202, the reference speech of each speaker is input and the processing described in Non-Patent Document 8 is performed, and coefficients (linear prediction coefficients, LPC cepstrum, MFCC, etc.) representing the spectrum envelope of the reference speech are output. The feature amount of each speaker obtained in this step is expressed by a numerical string as shown in the example of FIG. Here, sequential vector data is generated from short-time spectral features.
[0029]
For example, the following documents are helpful.
F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang,
"A Vector Quantization Approach to Speaker Recognition"
AT&T Technical Journal, Vol. 66, pp. 14-26, Mar/Apr 1987
In this speaker identification method, short-time spectral envelope features are extracted from the speech of each person to be identified and stored as vectors. At identification time, short-time spectral envelope information is extracted from the input speech as vectors in the same way and compared with the stored vectors. Whether the speaker of the input speech is a stored person is decided on the basis of the distance between the vectors. This feature extraction method can be applied to step 202. Specifically, the input speech is taken out in short frames of several milliseconds to several tens of milliseconds, and coefficients such as the cepstrum are extracted from each frame.
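The distance-based decision described above can be sketched as follows. This is a minimal illustration that assumes the per-frame feature vectors have already been extracted; the function names, the Euclidean distance, the threshold, and the sample vectors are all assumptions rather than details taken from the cited paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_distortion(frames, codebook):
    """Mean distance from each input frame vector to its nearest stored vector."""
    return sum(min(euclidean(f, c) for c in codebook) for f in frames) / len(frames)

def identify_speaker(frames, codebooks, threshold):
    """Return the closest registered speaker, or None if nobody is close enough."""
    best_name, best_score = None, float("inf")
    for name, codebook in codebooks.items():
        score = average_distortion(frames, codebook)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else None

# Hypothetical stored vectors for two registered speakers.
codebooks = {
    "speaker_a": [(0.0, 0.0), (1.0, 0.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 5.0)],
}
input_frames = [(0.1, 0.0), (0.9, 0.1)]
result = identify_speaker(input_frames, codebooks, threshold=1.0)
```

Averaging over many short frames makes the decision less sensitive to any single noisy frame than comparing one vector at a time.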
[0030]
When the number of feature vectors is to be normalized for each speaker (step 203), step 204 takes as input the coefficients representing the spectral envelope of the reference speech (linear prediction coefficients, LPC cepstrum, MFCC, etc.) obtained from the output of step 203, performs the processing described in Non-Patent Document 9, and outputs a fixed number of vectors for each speaker. From the output of this step, a table 401 can be created with, for example, the data structure shown in FIG. 4, whose columns are the speaker name contained in the video and audio (41), the person's name in the scenario text (42), the number of the vector representing the person's characteristics (43), and the values of the vector elements (44).
[0031]
For example, the following document is helpful.
Y. Linde, A. Buzo, and Robert M. Gray,
"An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. COM-28, No. 1, 1980.
When pattern recognition is performed using LPC or the like, the registered signals are learned in advance and their features are stored as vectors. At this time, it is important to create a set of representative vectors that faithfully represents the distribution of the vectors in the vector space without being affected by erroneous vectors (this operation is called quantization). Representative vectors are used instead of the original vectors because the number of vectors can be reduced, which in turn shortens the identification time. This technique is applied to step 204 as necessary.
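The reduction to a fixed number of representative vectors can be sketched as follows. This is a simplified stand-in that runs plain Lloyd (k-means style) iterations from an evenly spaced initial codebook; it is not the full splitting procedure of Non-Patent Document 9, and the function names and iteration count are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def build_codebook(vectors, size, iterations=20):
    """Reduce `vectors` to `size` representative vectors.

    Requires len(vectors) >= size.
    """
    # Assumption: evenly spaced input vectors serve as the initial codebook.
    step = len(vectors) / size
    codebook = [list(vectors[int(i * step)]) for i in range(size)]
    for _ in range(iterations):
        cells = [[] for _ in range(size)]
        for v in vectors:
            nearest = min(range(size),
                          key=lambda c: squared_distance(v, codebook[c]))
            cells[nearest].append(v)
        # An empty cell keeps its previous representative vector.
        codebook = [mean_vector(cell) if cell else codebook[c]
                    for c, cell in enumerate(cells)]
    return codebook
```

Calling `build_codebook(speaker_vectors, 8)` would yield eight representative vectors per speaker, matching the number stored in the example of FIG. 4.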
[0032]
In step 205, the above-described table is stored in a recording medium such as a magnetic disk. Here, the vector data of each speaker is compressed into a certain number, numbered, and a table corresponding to the speaker name and the name of the person in the scenario text is generated. In the example of FIG. 4, eight feature vectors are stored for each speaker.
[0033]
In step 206, for example, based on a table (person ID table) 501 as shown in FIG. 5, the names of speakers and names of persons in the scenario text are associated with common IDs. The table used here is a correspondence table of person names in the video and audio and names of persons in the scenario text, and based on this, a number (ID) is given to each person. In the example of FIG. 5, the table includes three columns, but two tables including two columns may be used.
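As a minimal sketch of the person ID table, the three-column form and the equivalent pair of two-column lookups might look like this. The row contents are hypothetical placeholders, not data from FIG. 5.

```python
# Hypothetical rows: (speaker name in video/audio, name in scenario text, ID)
PERSON_ID_TABLE = [
    ("Actor A", "Character X", 1),
    ("Actor B", "Character Y", 2),
]


def build_lookups(table):
    """Split the three-column table into two two-column lookups."""
    speaker_to_id = {speaker: pid for speaker, _, pid in table}
    scenario_to_id = {name: pid for _, name, pid in table}
    return speaker_to_id, scenario_to_id


def to_common_ids(pattern, lookup):
    """Replace every name in a pattern by its common ID."""
    return [lookup[name] for name in pattern]
```

With the two lookups, a pattern of speaker names and a pattern of scenario names can both be rewritten in terms of the same IDs before they are compared.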
[0034]
In step 207, a speaker class is defined as a set including vectors belonging to each person (hereinafter referred to as a class).
In step 208, it is determined whether or not the registration of the feature vector for the necessary speaker has been completed.
[0035]
In step 209, the processing according to Non-Patent Document 10 is performed on the data stored in the recording medium or the like, based on the feature values, to generate a multidimensional space index, and the index data is stored in a main storage device such as a memory. As shown in FIG. 6, each node of the multidimensional space index 601 is associated with the IDs (602) of mutually similar vectors, so that when searching for a vector similar to a key vector it is not necessary to search all the vectors in the database. Thus, the search time can be shortened.
[0036]
For example, the following documents are helpful.
K. Curtis, N. Taniguchi, J. Nakagawa, and M. Yamamuro.
A comprehensive image similarity retrieval system that utilizes multiple feature vectors in high dimensional space, Proceedings of International Conference on Information, Communication and Signal Processing, pp. 180-184, 1997.
This technique enables high-speed image similarity search by converting feature amounts into multidimensional vectors and managing them in a hierarchical structure. Taking the two-dimensional case as an example, the vector space is divided into rectangles, and an index that manages the vectors existing in each rectangle in a hierarchical structure is generated. By using this index, when a vector is newly input, it is possible to know which vectors are nearby without searching all the vectors. This method is applied to step 209.
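The idea of dividing the vector space into rectangles can be sketched with a single-level grid. This is a deliberate simplification: the cited method manages the cells hierarchically, whereas the hypothetical `GridIndex` class below uses one flat level and a fixed cell size.

```python
from collections import defaultdict
from itertools import product
import math


class GridIndex:
    """Flat rectangular grid over a vector space (illustrative only)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # cell coordinates -> [(id, vector)]

    def _cell(self, vector):
        return tuple(math.floor(x / self.cell_size) for x in vector)

    def add(self, vector_id, vector):
        self.cells[self._cell(vector)].append((vector_id, vector))

    def candidates(self, key):
        """IDs of vectors in the key's cell and in the directly adjacent cells."""
        centre = self._cell(key)
        found = []
        for offset in product((-1, 0, 1), repeat=len(centre)):
            cell = tuple(c + o for c, o in zip(centre, offset))
            found.extend(vid for vid, _ in self.cells.get(cell, []))
        return found
```

Only the candidates returned by the index need an exact distance computation. Note that the number of adjacent cells grows as 3 to the power of the dimension, which is one reason a hierarchical structure is preferable for high-dimensional feature vectors.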
[0037]
In step 11 of FIG. 1, in addition to creating a person database as described above, an existing database prepared in advance can be used.
[0038]
Next, the video / audio processing step 12 in FIG. 1 will be described. FIG. 7 is a detailed flowchart for explaining the operation of the video / audio processing step 12.
[0039]
In step 71, video / audio is designated for program input, and video / audio is input.
In step 72, the audio track is separated from the input video and audio.
In step 73, feature amounts are extracted from the audio track separated in step 72 by the same method as in step 202 of the person database preparation step 11, and this is used as a key.
[0040]
In step 74, feature vectors similar to the key are searched for in the person database created in the person database preparation step 11, using the obtained key. As the similarity measure between a feature vector and a key, various distance measures between vectors (Euclidean distance, city-block distance, etc.) can be used. The search is performed with reference to the multidimensional space index.
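The two distance measures named above, together with a brute-force nearest-neighbor search, can be written as the following sketch. The database layout (a list of `(speaker_id, vector)` pairs) is an assumption, and the exhaustive sort stands in for the index-assisted search of step 74.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def k_nearest(key, database, k, distance=euclidean):
    """database: list of (speaker_id, vector). Returns the k closest entries."""
    return sorted(database, key=lambda entry: distance(key, entry[1]))[:k]
```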
[0041]
In step 75, the utterance section of each speaker is obtained by determining the speaker name with a constant granularity based on the search result of step 74.
Now, let x_i (i = 1, 2, ..., k) be the vectors in the database contained in the k-nearest-neighbor space of the key vector, let C(x_i) be the class discriminant function of x_i, and let P be the set of speaker classes containing each speaker class p. The speaker name p_ans is then obtained by the following equation (1).
[Expression 1]

[0042]
Here, 1[F] is defined as a function that takes the value 1 when the conditional statement F is satisfied and 0 otherwise. J is a constant: when the granularity of speaker name discrimination is R_s and the key generation frequency is f, J is the number obtained by J = R_s / f.
In step 76, it is determined whether or not the speaker name has reached the end of the audio track.
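Since equation (1) itself is not reproduced in this text, the following is only one plausible reading of the definitions above: the speaker name for one discrimination interval is the class that collects the most votes 1[C(x_i) = p] over the k nearest neighbors of each of the J keys in the interval. The function name and the input layout are assumptions.

```python
from collections import Counter


def decide_speaker(neighbour_classes_per_key):
    """neighbour_classes_per_key: J lists, each holding the classes C(x_i)
    of the k nearest database vectors found for one key.

    Returns the class p with the largest vote count.
    """
    votes = Counter()
    for classes in neighbour_classes_per_key:
        votes.update(classes)
    return votes.most_common(1)[0][0]
```

Repeating this decision for every interval of length R_s yields one speaker name per interval, which is the input to the pattern generation of step 77.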
[0043]
In step 77, based on the speaker names p_ans obtained in step 75, a pattern N = {n1, n2, ..., nm} representing the utterance section of each speaker is generated, as shown in FIG. 8. Here, m is the number obtained by m = T / R_s, where T is the length of the video and audio.
[0044]
In step 78, a pattern representing the power value is generated. From the integrated value of the amplitude value of the input speech for generating the key, an average power value is obtained with a granularity equal to the determination granularity of the speaker name. As a result, a pattern P = {p1, p2,... Pm} representing the power value is generated (802).
In step 79, a pattern representing the pitch value is generated. The pitch is obtained from the speech input for key generation by a general method such as the autocorrelation method or the modified correlation method, and an average pitch value is obtained with a granularity equal to the determination granularity of the speaker name. As a result, a pattern F = {f1, f2, ..., fm} representing the pitch value is generated (803).
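Steps 78 and 79 can be sketched as below. The mean squared amplitude is used here as the average power, and the pitch is taken from the lag with the largest raw autocorrelation inside an assumed search range of 50 to 400 Hz; both choices, and all names, are illustrative assumptions rather than the exact procedure of this document.

```python
def average_power(samples):
    """Mean squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def pitch_autocorrelation(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the pitch (Hz) as the lag with the largest autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)

    def correlation(lag):
        return sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=correlation)
    return sample_rate / best_lag


def block_pattern(samples, block_len, measure):
    """One value of `measure` per consecutive block of `block_len` samples."""
    return [measure(samples[s:s + block_len])
            for s in range(0, len(samples) - block_len + 1, block_len)]
```

Choosing `block_len` equal to R_s multiplied by the sampling rate gives patterns P and F with the same number of elements m as the speaker pattern N.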
[0045]
Next, the scenario text processing step 13 in FIG. 1 will be described. FIG. 9 is a detailed flowchart for explaining the operation of the scenario text processing step 13.
In step 91, a scenario text is designated for program input.
In step 92, the dialogue part is searched from the scenario text file.
Scenarios can be written in a variety of formats. Dialogue lines can be found by looking for parts enclosed by paired symbols such as quotation marks (“...”) or brackets. When the scenario text is structured text such as XML, the tag information can be used.
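A minimal sketch of the dialogue search follows. It assumes a hypothetical scenario format in which each dialogue line starts with a person name followed by the quoted line; real scenarios will need a pattern adapted to their own layout, and the names in the usage example are placeholders.

```python
import re

# Assumed line format: a person name immediately followed by a quoted line.
LINE_PATTERN = re.compile(
    r'^\s*(?P<name>[^\s"“]+)\s*[“"](?P<line>[^”"]*)[”"]')


def parse_scenario(text):
    """Return (person name, dialogue line) pairs in order of appearance."""
    result = []
    for raw in text.splitlines():
        match = LINE_PATTERN.match(raw)
        if match:
            result.append((match.group("name"), match.group("line")))
    return result
```

Lines that do not fit the assumed format, such as scene headings or stage directions, are simply skipped.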
[0046]
In step 93, a person name existing immediately before or after each line is searched, and the line and person name are associated with each other from the obtained person name.
In step 94, a question mark “?” And an emphasis mark “!” Are searched from each dialogue associated with the name of the person.
In step 95, patterns of dialogue, question marks, and emphasis marks are generated. First, the total number of morae is calculated from all the lines included in the scenario text. The number of morae is the number of sounds that contain a vowel or the syllabic n. For example, the line “asagohan desu” (“It is breakfast”) is divided as “a / sa / go / ha / n / de / su”, so its number of morae is 7.
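Mora counting can be sketched for text that is already written in kana. The sketch assumes that kanji have been converted to their readings beforehand, that small kana such as "ゃ" merge with the preceding character, and that the geminate "っ" and the long-vowel mark "ー" each count as one mora; the last point follows common practice and is not stated in this document.

```python
# Small kana that combine with the preceding character into a single mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def is_kana(ch):
    return "ぁ" <= ch <= "ゖ" or "ァ" <= ch <= "ヺ" or ch == "ー"


def count_morae(kana_text):
    """Number of morae in a kana string; other characters are ignored."""
    return sum(1 for ch in kana_text
               if is_kana(ch) and ch not in SMALL_KANA)
```

For the example in the text, `count_morae("あさごはんです")` counts the seven units a / sa / go / ha / n / de / su.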
[0047]
The pattern generation process in step 95 will be described using the detailed flowchart shown in FIG.
In step 101, the total number of morae (T_m) of all the lines included in the scenario text is calculated.
In step 102, using the result of step 93, T_m data items are obtained, one for each of the T_m morae, each element being the name of the person associated with the dialogue line that contains that mora.
[0048]
In step 103, the T_m person-name data items obtained in step 102 are taken from the beginning in groups of R_t items, the person names in each group are counted, and the person name that appears most often in the group is output as one element of the pattern (an example is shown in FIG. 11). Thus, the pattern S = {s1, s2, ..., sL} representing the person in the dialogue is generated (111). Here, the number of pattern elements L is obtained by L = T_m / R_t.
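Steps 101 to 103 can be sketched as follows. The mora counter is passed in as a parameter so that the sketch stays self-contained; the function names are assumptions, and a final group shorter than R_t is kept as its own element, which this document does not specify.

```python
from collections import Counter


def mora_level_names(lines, count_morae):
    """lines: (person name, dialogue) pairs. One name per mora (step 102)."""
    names = []
    for name, dialogue in lines:
        names.extend([name] * count_morae(dialogue))
    return names


def person_pattern(names_per_mora, r_t):
    """Most frequent name in each consecutive group of r_t morae (step 103)."""
    return [Counter(names_per_mora[s:s + r_t]).most_common(1)[0][0]
            for s in range(0, len(names_per_mora), r_t)]
```

For example, with nine morae spoken by A, A, A, A, B, B, A, A, A and R_t = 3, the groups are (A, A, A), (A, B, B) and (A, A, A), so the pattern S is A, B, A.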
[0049]
In step 104, for the element sl* of the pattern S that contains the mora immediately before a question mark, a pattern Q = {q1, q2, ..., qL} representing the position of the question mark is generated, in which only the element where l = l* is 1 and the others are 0 (112).
In step 105, for the element sl* of the pattern S that contains the mora immediately before an emphasis mark, a pattern E = {e1, e2, ..., eL} representing the position of the emphasis mark is generated, in which only the element where l = l* is 1 and the others are 0 (113).
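The generation of the mark patterns Q and E can be sketched with one function that is called once per mark type. It assumes that the mora counter returns 0 for the mark characters themselves, and it rounds the pattern length up when the total is not a multiple of R_t; the names are hypothetical.

```python
def mark_pattern(lines, count_morae, r_t, mark_chars):
    """0/1 pattern with a 1 at each element that contains the mora
    immediately before one of `mark_chars`.

    lines: (person name, dialogue) pairs.
    """
    total = sum(count_morae(dialogue) for _, dialogue in lines)
    pattern = [0] * ((total + r_t - 1) // r_t)
    seen = 0  # morae counted so far across all lines
    for _, dialogue in lines:
        for ch in dialogue:
            if ch in mark_chars:
                if seen > 0:
                    pattern[(seen - 1) // r_t] = 1
            else:
                seen += count_morae(ch)
    return pattern
```

Calling it with the question mark characters gives Q, and with the emphasis mark characters gives E; both have the same length as the person pattern S when built with the same R_t.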
[0050]
Next, the matching step 14 in FIG. 1 will be described. FIG. 12 is a detailed flowchart for explaining the operation of the matching step 14.
In this phase of the present embodiment, the patterns obtained in the video / audio processing step 12 and the scenario text processing step 13 are matched by non-linear expansion / contraction. At this time, the elements are associated with each other so that the distance between the patterns is minimized.
[0051]
In step 121, a pattern pair is generated. In the present invention, three pairs of patterns (N and S, P and E, and F and Q) are generated. Pattern pairs are generated based on the origin of each pattern. As shown in FIG. 13, the pattern representing the utterance section for each speaker in the video and audio corresponds to the person name associated with the dialogue in the scenario text, so the patterns N and S are paired.
In addition, since the ending of the question sentence generally rises, the pattern F representing the pitch in the video and audio is paired with the pattern Q representing the position of the question mark in the scenario text. Similarly, the pattern P representing the power in the video / audio is paired with the pattern E representing the position of the emphasis mark in the scenario text (131 to 133).
[0052]
In step 122, as shown in FIG. 14, the corresponding pattern is input to the program (141, 142), the person ID table 501 is referred to, and elements relating to the person are converted into common IDs (143, 144). This step is necessary when the names obtained from the video / audio processing step 12 and the scenario text processing step 13 are different.
This is because, for example, the role name in the drama is described in the scenario text, whereas the actor name is described in the person database. Here, the person ID table 501 has been described as a table having three columns as shown in FIGS. 5 and 6, but any data structure may be used as long as the speaker names can be converted into common IDs.
[0053]
In step 123, the elements included in each pattern are associated with each other so that the distance between the patterns is minimized. For example, when the pattern N = {n1, n2, ..., nm} and the pattern S = {s1, s2, ..., sL} are matched, the distance d(N, S) between the two patterns is generally expressed by the following equation (2).
[Expression 2]

[0054]
Here, k = 1, 2, ..., K, where K is the number of combinations of elements between the patterns, m(k) is a non-negative path weighting coefficient, Mφ is a normalization coefficient, ck is a pair of elements, one from each pattern, and d(ck) is the distance between the two elements. Various distances, such as the Euclidean distance and the City-Block distance, can be used as the distance between pattern elements.
For example, as shown in FIG. 15, when the path slope is limited to the range 0.5 to 2.0 and the City-Block distance is used as the distance between pattern elements, these become
Mφ = I + J
d(ck) = |n_{i_k} - s_{j_k}|
[0055]
Here, I represents the number of elements included in the pattern N, and J represents the number of elements included in the pattern S. Let δ be the normalized distance sum up to each step k = 1, 2, ..., K, and let ck be the combination of nk and sk at each step. Then, as shown in FIG. 15, in the case of 1.0 to 2.0, δ and ck are obtained by the following recurrence formulas (3) and (4).
[0056]
[Equation 3]

[Expression 4]

Here, in the above recurrence formulas, δ(nK, sK) ≡ d(N, S) when k = K. In the matching phase, the set C_K of the K combinations of nk and sk that minimizes d(N, S) is obtained.
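Because recurrence formulas (3) and (4) are not reproduced in this text, the following sketch uses a common textbook form of symmetric DP matching instead: diagonal steps are weighted 2, horizontal and vertical steps are weighted 1, and the accumulated distance is normalized by the sum of the two pattern lengths. The slope limit of FIG. 15 is omitted, so this is an illustration of the technique rather than the exact procedure of the embodiment.

```python
def dp_match(n, s, local=lambda a, b: abs(a - b)):
    """Return (normalised distance, path) between patterns n and s.

    path is a list of (index into n, index into s) pairs from start to end.
    """
    inf = float("inf")
    rows, cols = len(n), len(s)
    g = [[inf] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    g[0][0] = 2 * local(n[0], s[0])
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            d = local(n[i], s[j])
            options = []
            if i > 0:
                options.append((g[i - 1][j] + d, (i - 1, j)))
            if j > 0:
                options.append((g[i][j - 1] + d, (i, j - 1)))
            if i > 0 and j > 0:
                options.append((g[i - 1][j - 1] + 2 * d, (i - 1, j - 1)))
            g[i][j], back[i][j] = min(options)
    # Trace the minimising path back from the end point.
    path, cell = [], (rows - 1, cols - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return g[rows - 1][cols - 1] / (rows + cols), path
```

For N = [1, 1, 1, 2, 2, 2] and S = [1, 2], the first three elements of N are paired with the first element of S and the last three with the second, at a distance of zero. The default City-Block distance on common IDs follows the text above; a 0/1 mismatch cost may be more natural for person IDs, which is a design choice left open here.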
[0057]
In the present invention, there are three pattern pairs (N and S, P and E, and F and Q). Therefore, an arbitrary weight w(g) (g = 1, 2, 3) is applied to each pair when the distance between the patterns is obtained, and the set of element pairs ck that minimizes this distance is found.
That is, it can be expressed as the following formula (5).
[Equation 5]

Here, G is the number of pattern pairs, and G = 3 in this embodiment.
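One plausible reading of the weighting, given that formula (5) is not reproduced in this text, is a weighted sum of the element distances of the G pattern pairs evaluated at the same pair of positions. The sketch below builds such a combined distance; the function name and the assumption that all patterns have been scaled to comparable ranges are illustrative.

```python
def combined_local_distance(pairs, weights):
    """pairs: [(video-side pattern, scenario-side pattern), ...], for
    example [(N, S), (P, E), (F, Q)]; weights: one w(g) per pair.

    Returns a function d(i, j) = sum over g of w(g) * |a_g[i] - b_g[j]|.
    """
    def distance(i, j):
        return sum(w * abs(a[i] - b[j])
                   for w, (a, b) in zip(weights, pairs))
    return distance
```

Because all video-side patterns share the granularity R_s and all scenario-side patterns share R_t, a single path through the (i, j) grid can be evaluated with this combined distance.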
[0058]
In step 124, using the recurrence formulas (3) and (4), it is determined whether δ for ck is the minimum, and the C_K that gives the minimum is updated.
In step 125, the C_K that finally gives the minimum δ at k = K is output.
In step 126, based on C_K, index information is generated by associating the scenario text with the corresponding portion of the video and audio. As shown in the example of FIG. 16, the index information associates a video/audio playback time 162 with the scenario text 161.
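Index generation from the matching result can be sketched as follows. It assumes the result is available as a list of (video index, scenario index) pairs and that each video-side element covers R_s seconds; taking the first matched video element as the start time is an illustrative choice, not one stated in this document.

```python
def build_index(path, r_s, scenario_elements):
    """path: (video index i, scenario index j) pairs from the matching step.

    Returns [(start time in seconds, scenario element)], using the first
    video index matched to each scenario element.
    """
    first_video_index = {}
    for i, j in path:
        first_video_index.setdefault(j, i)
    return [(first_video_index[j] * r_s, scenario_elements[j])
            for j in sorted(first_video_index)]
```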
[0059]
In the present embodiment, three pairs of patterns (N and S, P and E, and F and Q) are used. However, as long as there is at least one pair of patterns, the video and audio can be matched with the scenario text. In this embodiment the DP matching technique is used, but the association can also be performed by an exhaustive (round-robin) search.
[0060]
[Example 2]
Embodiment 2 of the present invention is described with reference to FIG. 17. In the present embodiment, a person ID table generation step 171 is added that automatically creates the person ID table, which was created manually in the first embodiment. In the person database preparation step 11, the column 42 representing the name of the person in the scenario text in FIG. 4 is left blank.
[0061]
The person ID table generation step will be described. FIG. 18 is a detailed flowchart for explaining the operation of this step according to the second embodiment.
In step 181, the ratio of the total utterance time is calculated for each person based on the pattern obtained in the video / audio processing step 12 as shown in FIG. 19 (191).
In step 182, based on the pattern obtained in the scenario text processing step 13, each person's share of the total dialogue length is calculated, as shown in FIG. 19 (192).
[0062]
In step 183, the person ID table 193 is generated by associating the person names in order from the top, based on the ratios obtained in steps 181 and 182. If the names cannot be associated in step 183, it is also possible to divide the entire video into a plurality of areas, perform the processes of steps 181 and 182 for each of the divided areas, and generate the person ID table from the results.
The other phases are the same as those in the first embodiment.
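The automatic generation of the person ID table can be sketched by ranking the persons on each side by their share of pattern elements and pairing equal ranks. The function names are assumptions, and returning `None` when the two sides contain different numbers of persons is a simple stand-in for the case where association is not possible.

```python
from collections import Counter


def rank_by_share(pattern):
    """Names in `pattern`, ordered from the largest share of elements down."""
    return [name for name, _ in Counter(pattern).most_common()]


def generate_person_id_table(speaker_pattern, scenario_pattern):
    """Pair the i-th most frequent speaker with the i-th most frequent
    scenario name and assign the ID i + 1."""
    speakers = rank_by_share(speaker_pattern)
    names = rank_by_share(scenario_pattern)
    if len(speakers) != len(names):
        return None
    return [(sp, nm, i + 1)
            for i, (sp, nm) in enumerate(zip(speakers, names))]
```

When two persons have nearly equal shares the ranking becomes ambiguous, which is the situation in which dividing the video into several areas and repeating the calculation for each area can help.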
[0063]
Other examples
The present invention can be carried out in the following embodiments in addition to the above embodiments.
(1) A global utterance pattern can be extracted to limit / weight a path during matching.
(2) Matching can be performed by arbitrarily combining the features used in the prior art and the features used in the present invention.
[0064]
Also according to the present embodiment, it is possible to realize a method for matching video and audio and scenario text, which makes it possible to associate video and audio with scenario text with high accuracy.
[0065]
Each of the above-described embodiments shows an example of the present invention, and the present invention should not be limited to these. It goes without saying that appropriate changes and improvements can be made without departing from the technical scope of the present invention.
[0066]
For example, as described above, the method for matching video and audio with scenario text according to the present invention can be executed under computer program control, and it goes without saying that the technical scope of the present invention extends to the computer software for this purpose and to a computer-readable recording medium on which the computer software is recorded.
[0067]
A video / audio / scenario text matching apparatus that embodies the video / audio / scenario text matching method according to the present invention is also useful.
[0068]
[Effects of the invention]
As described above in detail, according to the present invention, scenario text and video / audio can be matched based on the utterance information of each speaker, and matching can be performed with higher accuracy than in the prior art. In addition, the number of matching portions is smaller than when word information is used, and a remarkable effect is achieved in that matching with higher accuracy than in the prior art becomes possible.
[Brief description of the drawings]
FIG. 1 is a diagram showing an embodiment of Example 1 of the present invention.
FIG. 2 is a detailed flowchart for explaining the operation of a person database preparation step 11 in FIG. 1;
FIG. 3 is a diagram showing an example of feature vector data obtained in step 202 in FIG. 2;
FIG. 4 is a diagram showing an example of a feature vector table obtained in step 205 in FIG. 2.
FIG. 5 is a diagram showing an example of a person ID table used in step 206 in FIG. 2.
FIG. 6 is a diagram showing an example of data obtained by the processing according to Non-Patent Document 10 in step 209 in FIG. 2.
FIG. 7 is a detailed flowchart showing the process of step 12 in FIG. 1.
FIG. 8 is a diagram showing an example of data obtained in steps 77, 78, and 79 in FIG. 7.
FIG. 9 is a detailed flowchart showing the process of step 13 in FIG. 1.
FIG. 10 is a detailed flowchart showing the processing in step 95 in FIG. 9.
FIG. 11 is a diagram showing an example of data obtained in steps 103, 104, and 105 in FIG. 10.
FIG. 12 is a detailed flowchart showing the process of step 14 in FIG. 1.
FIG. 13 is a diagram showing a pattern pair generation method in step 121 in FIG. 12.
FIG. 14 is a diagram showing a pattern element conversion method in step 122 in FIG. 12.
FIG. 15 is a diagram showing matching by non-linear expansion/contraction between patterns (the calculation formula used in the processing of steps 123 and 124 in FIG. 12).
FIG. 16 is a diagram showing an example of the index information output at step 126 in FIG. 12.
FIG. 17 is a diagram showing an embodiment of Example 2 of the present invention.
FIG. 18 is a detailed flowchart illustrating the operation of a person ID table generation step according to the second embodiment.
FIG. 19 is a diagram illustrating automatic generation of a person ID table.
[Explanation of symbols]
11 Person database preparation steps
12 Video and audio processing steps
13 Scenario text processing steps
14 Alignment steps
171 Person ID table generation step
201-209 Person Database Preparation Step Details
401 Feature vector table
501 Person ID table
601 Multidimensional spatial index

Claims

A person database preparation step for preparing correspondence information between a person appearing in the video and audio and a person appearing in the scenario text;
A video and audio signal input step for inputting video and audio;
An audio feature amount extraction step for extracting an audio feature amount from the video and audio signal input in this step;
Based on the speech feature amount extracted in this step, an utterance section information extraction step for extracting the pattern of the utterance section information of each speaker;
A scenario text entry step for entering the scenario text;
Dialogue information extraction step for extracting the dialogue information pattern about the person from the scenario text input in this step,
A matching step for obtaining element correspondence information by associating the pattern of speech section information obtained in the above step and the pattern of dialogue information with elements included in the pattern so that the distance between the patterns is minimized,
An index information generation step for generating, from the association information of the elements between the patterns obtained in this step, index information in which the playback time information of the video and audio is associated with the scenario text; a method for matching video and audio with scenario text, characterized by comprising the above steps.

In the person database preparation step, a table that associates the names of persons included in the video and audio with those in the scenario text is input, or a table prepared in advance that has equivalent information is input,
The audio / video signal is input by the audio / video signal input step,
In the audio feature amount extraction step, an audio feature amount is extracted from an audio track of the input video / audio signal,
By the utterance section information extraction step, utterance section information for each speaker in the video and audio is extracted,
The scenario text is input by the scenario text input step,
Through the dialogue information extraction step, information about a person is extracted as dialogue information from the input scenario text,
By the matching step, the previously obtained utterance section information and dialogue information are matched,
Based on the results of this alignment,
2. The method for matching video and audio with scenario text according to claim 1, wherein the index information generating step generates index information in which playback time information of video and audio is associated with a plurality of locations in the scenario text.

In the person database preparation step,
3. The method of matching video and audio with scenario text according to claim 1 or 2, wherein a database having vectors representing audio characteristics associated with each person is generated.

In the person database preparation step,
3. The described method for matching video and audio with scenario text, wherein a person ID table capable of mapping a person name included in the pattern of utterance section information and a person name included in the dialogue information pattern to a common ID is generated.

In the matching step,
5. The method for matching video and audio and scenario text according to claim 1, wherein a multidimensional spatial index is generated based on a feature vector in a database.

The utterance interval information extraction step patterns the power value of the utterance interval for each speaker, and the dialogue information extraction step patterns the emphasis mark position included in the dialogue,
6. The matching method of video and audio and scenario text according to claim 1, wherein both are used for matching.

The pitch value of the utterance interval for each speaker is patterned by the utterance interval information extraction step, and the position of the question mark included in the dialogue is patterned by the dialogue information extraction step,
6. The audio / video and scenario text matching method according to claim 1, wherein both are used for matching.

In the matching step,
8. The method for matching video and audio with scenario text according to claim 1, wherein matching is performed by nonlinear time expansion/contraction using the patterns extracted from the video and audio and from the scenario text, respectively.

In performing the matching in the matching step,
The method for matching video and audio with scenario text according to claim 1, wherein, with respect to the matching score (distance), an arbitrary weight is assigned to each feature extracted from the video and audio and from the scenario text.

Using the utterance pattern obtained by the utterance section information extraction step and the dialogue pattern obtained by the dialogue information extraction step,
The method for matching video and audio and scenario text according to any one of claims 1 to 9, wherein a person ID table is generated based on a ratio of speaking time of a speaker.

Using the utterance pattern obtained by the utterance section information extraction step and the dialogue pattern obtained by the dialogue information extraction step,
When generating the person ID table,
11. The method for matching video and audio with scenario text according to claim 10, wherein the video and audio and the scenario text are divided into a plurality of areas, the ratio of each speaker's utterance time is calculated for each area, and the person ID table is generated.

A person database preparation processing means for preparing correspondence information between a person appearing in the video and audio and a person appearing in the scenario text;
Video / audio signal input processing means for inputting video / audio;
An audio feature quantity extraction processing means for extracting an audio feature quantity from the video / audio signal input by the processing means;
Utterance section information extraction processing means for extracting the pattern of the utterance section information of each speaker based on the voice feature amount extracted by the processing means;
Scenario text input processing means for inputting scenario text,
Dialogue information extraction processing means for extracting a dialogue information pattern related to a person from the scenario text input by the processing means;
Matching processing means for obtaining element correspondence information by associating the elements included in the patterns of the utterance section information and the dialogue information obtained by the processing means so that the distance between the patterns is minimized; and
An apparatus for matching video and audio with scenario text, characterized by having index information generation processing means for generating, from the association information of the elements between the patterns obtained by the processing means, index information in which the playback time information of the video and audio is associated with the scenario text.

Computer software for causing the video / audio and scenario text matching method according to any one of claims 1 to 11 to be executed by computer program control.

A computer-readable recording medium on which the computer software according to claim 13 is recorded.