JP3781715B2

JP3781715B2 - Metadata production device and search device

Info

Publication number: JP3781715B2
Application number: JP2002319756A
Authority: JP
Inventors: 雅文下田代; 裕康桑野; 啓行酒井; 正明小林; 謙二松井
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2002-11-01
Filing date: 2002-11-01
Publication date: 2006-05-31
Anticipated expiration: 2022-11-01
Also published as: JP2004153764A

Description

【０００１】
【発明の属する技術分野】
本発明は、コンテンツ制作におけるメタデータ制作装置及び検索装置に関するものである。
【０００２】
【従来の技術】
近年、映像・音声コンテンツの制作において、これらコンテンツに関連したメタデータの付与することがおこなわれている。
【０００３】
しかしながら、上記メタデータの付与は、制作された映像・音声コンテンツのシナリオあるいはナレーション原稿をもとに、制作された映像・音声コンテンツを再生しながらメタデータとすべき情報を確認し、手作業でコンピュータ入力することにより制作する方法が一般的であり、相当な労力の必要な方法であった。
【０００４】
また、カメラ撮影時に音声認識を用いタグ付けをするシステムは存在するが、撮影と同時に使用されるものに過ぎなかった。（特許文献１参照）
【０００５】
【特許文献１】
特開平０９−１３０７３６号公報
【０００６】
【発明が解決しようとする課題】
本願発明は、上記従来の問題点に係る課題を解決することを目的とするものであって、制作された映像・音声コンテンツを再生することによりメタデータとすべき情報を確認し、音声入力でコンンピュータ等に入力することにより前記メタデータを制作し、検索するシステムを提供することを目的とする。
【０００７】
【課題を解決するための手段】
上記課題を解決するために本願発明は、制作されたコンテンツに合わせて制作されたシナリオ、或いは前記コンテンツの内容から抽出されたコンテンツ管理用キーワードの音声信号が入力され、前記音声信号をデータ化する入力手段と、前記入力手段でデータ化された音声信号データから、キーワードを認識する音声認識手段と、前記音声認識手段から出力されたキーワードに、前記キーワードに関連する一般属性キーワードを付記するとともに、前記コンテンツから供給されるタイムコードを用いて、前記キーワードと前記コンテンツに含まれる画像信号との時間位置関係を示す情報を付記してメタデータファイルに記憶するファイル処理手段とを備えたものである。
これにより、従来キーボードで入力し、制作していたメタデータを、音声認識を用いて音声入力し、自動的にタイムコード付きのメタデータを制作することが可能となる。
特に、数秒単位の間隔でメタデータを付与する場合は、キー入力では困難であるが、本構成によれば、数秒単位間隔であっても効率よく、メタデータを付与できる。
【０００８】
【発明の実施の形態】
本発明の請求項１から３に係る発明は、制作されたコンテンツに合わせて制作されたシナリオ、或いは前記コンテンツの内容から抽出されたコンテンツ管理用キーワードの音声信号が入力され、前記音声信号をデータ化する入力手段と、前記入力手段でデータ化された音声信号データから、キーワードを認識する音声認識手段と、前記音声認識手段から出力されたキーワードに、前記キーワードに関連する一般属性キーワードを付記するとともに、前記コンテンツから供給されるタイムコードを用いて、前記キーワードと前記コンテンツに含まれる画像信号との時間位置関係を示す情報を付記してメタデータファイルに記憶するファイル処理手段とを具備したことを特徴とするメタデータ制作装置である。
【０００９】
本発明の請求項４から６に係る発明は、制作されたコンテンツに合わせて制作されたシナリオ、或いは前記コンテンツの内容から抽出されたコンテンツ管理用キーワードの音声信号が入力され、前記音声信号をデータ化する入力手段と、前記入力手段でデータ化された音声信号データから、キーワードを認識する音声認識手段と、前記音声認識手段から出力されたキーワードに、前記キーワードに関連する一般属性キーワードを付記するとともに、前記コンテンツから供給されるタイムコードを用いて、前記キーワードと前記コンテンツに含まれる画像信号との時間位置関係を示す情報を付記してメタデータファイルに記憶するファイル処理手段と、前記キーワードまたは前記一般属性キーワードに対応する前記コンテンツに含まれる前記画像信号を検索する検索手段とを具備し、検索時に前記キーワードで検索可能であるとともに、前記一般属性キーワードでも検索可能であることを特徴とするメタデータ検索装置である。
【００１０】
本発明の請求項７に係る発明は、制作されたコンテンツに合わせて制作されたシナリオ、或いは前記コンテンツの内容から抽出されたコンテンツ管理用キーワードの音声信号を入力し、前記音声信号をデータ化する入力手段と、ジャンル別辞書を複数用意し、コンテンツに適合したジャンルの辞書を選択し、前記入力手段でデータ化された音声信号データから、キーワードを認識する音声認識手段と、前記音声認識手段から出力されたキーワードを、コンテンツに含まれる画像信号との時間位置を示すタイムコードと共にメタデータファイルに記憶するファイル処理手段と、前記コンテンツファイルの記録位置と前記メタデータファイルの関係を管理する制御ファイルを発生させるコンテンツ情報ファイル処理手段と、前記コンテンツファイルと、前記メタデータファイルと、前記制御ファイルとを一緒に記録する記録手段と、検索したいコンテンツの分野に適合した、前記音声認識手段で用いた共通辞書からキーワードを選定し、前記選定したキーワードが記録されている前記メタデータファイルを特定し、前記記録手段に記録されているコンテンツの中から、検索したいコンテンツを検索し、前記制御ファイルから検索したいシーンの記録位置を検索する検索手段とを具備し、前記メタデータから前記記録手段に記録されたコンテンツファイルの記録位置とを特定することを特徴とするメタデータ検索装置である。
【００１１】
以下、本発明の実施の形態について図面を用いて説明する。
（実施の形態１）
図１は、本発明の実施の形態１によるメタデータ制作装置の構成を示すブロック図である。
【００１２】
図１において、１はコンテンツデータベース(ＤＢ)、２は入力手段、３は音声認識手段、４は辞書データベース（ＤＢ）、５はファイル処理手段、１１は映像モニタである。
コンテンツＤＢ１は、例えばＶＴＲ（あるいはハードディスクで構成された映像・音声信号再生手段、あるいは半導体メモリなどのメモリ手段を記録媒体とする映像・音声信号再生手段、あるいは光学記録式または磁気記録式などの回転型ディスクで構成された映像・音声信号再生手段、更には伝送されてきたあるいは放送されてきた映像・音声信号を１次記録し、再生する映像・音声再生手段などの、コンテンツに合わせたタイムコードを発生しながら再生する手段を備えたコンテンツ記録手段）である。
コンテンツＤＢ１から再生されたタイムコード付き、映像信号は映像モニタ１１に出力され、前記映像モニタ１１で映出される。
次に、前記映像モニタ１１に映出されたコンテンツに合わせて、ナレータ１２がマイクロホーンを用いてナレーションの音声信号を入力する。この際、ナレーターは映像モニタ１１に映しだされたコンテツ、あるいは、タイムコードを確認し、シナリオ、或いはナレーション原稿、或いはコンテンツの内容などを基に抽出されたコンテンツ管理用キーワードを発声し、マイクロホーンを用いてナレーションとして音声信号を入力する。
従って,前記したように入力される音声信号を前もってシナリオ等から限定されたキーワードを使用することによって、後段の音声認識手段３での認識率を改善させることができる。
次に、入力手段２ではマイクロホーンから入力された音声信号を、コンテンツＤＢ１から出力されている垂直同期信号に同期したクロックで、前記音声信号をデータ化する。
次に、入力手段２でデータ化された音声信号データは、音声認識手段３に入力される。また、同時に、音声認識に必要な辞書が辞書ＤＢ４から供給される。
ここで、使用する音声認識用辞書を端子１０２から辞書ＤＢ４に設定する。
例えば、図２に示すように各分野別に辞書ＤＢ４が構成されていたとすると、使用する分野を端子１０２（例えば、キー入力できるキーボード端子）から設定する。
料理番組の場合は、料理―日本料理―料理法―野菜炒め等を端子１０２から辞書ＤＢ４を設定する。
前記のように辞書ＤＢ４を設定することで使用する単語、および、音声認識すべき単語を制限し、音声認識手段３の認識率を改善する。
また、更に、図１にもどり、端子１０２からシナリオ、あるいは、シナリオ原稿、あるいはコンテンツの内容から抽出されたキーワードを入力する。
例えば、料理番組の場合は、図３に示すレシピを端子１０２から入力する。
従って、レシピに記入されている単語が音声信号として入力されてくる可能性が高いので、辞書ＤＢ４では端子１０２から入力されたレシピ単語の認識優先度を明示し、優先して音声認識を行うようにする。
例えば、「柿」と「貝のカキ」が辞書中にあった場合、端子１０２から入力されたレシピ単語が「貝のカキ」のみの場合は、「貝のカキ」に優先順位１がつけられる。
音声認識手段３では、「かき」という音声を認識した場合、辞書ＤＢ４に設定された単語の優先順位１が明記されている「貝のカキ」と認識する。
従って、辞書ＤＢ４では、端子１０２から入力される分野で単語を限定し、更に、シナリオを端子１０２から入力して単語の優先度を明示することで、音声認識手段３での認識率を改善させることができる。
図１にもどり、音声認識手段３では、辞書ＤＢ４から供給された辞書に従って、入力手段２から入力された音声信号データを認識し、メタデータを生成する。
次に、音声認識手段３から出力されたメタデータは、ファイル処理手段５に入力される。
ここで、前述したように入力手段２では、コンテンツＤＢ１から再生された垂直同期信号に同期して、音声信号をデータ化している。
従って、ファイル処理手段５では、入力手段２からの同期情報と、コンテンツＤＢ１から供給されるタイムコード値とを用いて、音声認識手段３から出力されたメタデータに、file開始からの１秒ごとの基準時間(TM＿ENT (秒))と、基準時間からのフレームオフセット数を示す（TM＿OFFSET）と、タイムコードを付記した形式でfile化処理する。
例えば、前述した料理番組の場合は、図４に示したようなＴＥＸＴ形式のメタデータファイルが、ファイル処理手段５から出力される。
次に、記録手段７ではファイル処理手段５から出力されたメタデータファイルとコンテンツＤＢ１から出力されたコンテンツを記録する。
ここで、記録手段７は、ＨＤＤ，メモリ、光ディスク等から構成されており、コンテンツＤＢ１から出力されたコンテンツもファイル形式で記録する。
【００１３】
（実施の形態２）
次に、実施の形態２について説明する。
実施の形態２は、図５に示すように、実施の形態１に対して、コンテンツ情報ファイル処理手段６が付加されている。前記コンテンツ情報ファイル処理手段６では、記録手段７に記録されたコンテンツの記録位置関係を示す制御ファイルを発生し、記録手段７に記録する。
即ち、前記コンテンツ情報ファイル処理手段６では、コンテンツＤＢ１から出力されたコンテンツと、記録手段７から出力されるコンテンツの記録位置情報をもとに、前記コンテンツが保有している時間軸情報と、記録手段７に記録したコンテンツのアドレス関係を示す情報を発生し、データ化して制御ファイルとして出力する。
例えば、図６に示すように、前記コンテンツの記録位置を示す記録メディアアドレスに対し、前記コンテンツの時間軸基準を示す、TM＿ENT #jを等時間軸間隔にポイントする。例えば、TM＿ENT #jを１秒（NTSC信号の場合、３０フレーム）毎に記録メディアアドレスをポイントする。
前記のようにマッピングすることで、コンテンツが１秒単位毎に分散記録されても、TM＿ENT #jから記録手段７の記録アドレスを一義的に求めることができる。さらに、図４で前述したようにメタデータファイルには、ファイル開始からの１秒ごとの基準時間(TM＿ENT (秒))と、基準時間からのフレームオフセット数を示す（TM＿OFFSET）と、タイムコードと、メタデータとがTEXT形式で記録されている。
従って、前記メタデータファイルの中でメタデータ１を指定すれば、タイムコード、基準時間、及び、フレームオフセット値がわかるので、図６に示す制御ファイルから記録手段７での記録位置が即座にわかることになる。
なお、ここでは前記TM＿ENTｊの等時間軸間隔は例えば、１秒おきにポイントとした例について説明したが、ＭＰＥＧ２圧縮等で用いられているＧＯＰ単位等に合わせて記述することもできる。
さらに、テレビビジョン信号のＮＴＳＣでは垂直同期信号が60/1.001Hzであるため、絶対時間にあわせるためにドロップフレームモードに合わせたタイムコードと、前記垂直同期信号（60/1.001 Hz）にあわせたノンドロップタイムコードの２種類をしようする。この場合、ノンドロップタイムコードをTM＿ENT #jであらわし、TC＿ENT #jをドロップフレーム対応タイムコードであらわして使用することもできる。
さらに、制御ファイルのデータ化は、SMIL２等の既存言語を用いてデータ化することも可能であり、さらに、SMIL2の機能をもちいれば、関連したコンテンツ、及び、メタデータファイルのファイル名も合わせてデータ化して、制御ファイルに格納することができる。
さらに、図６では記録手段の記録アドレスを直接表示する構成をしめしたが、記録アドレスの代わりに、コンテンツファイルの頭からタイムコードまでのデータ容量を表示し、前記データ容量とファイルシステムの記録アドレスから記録手段でのタイムコードの記録アドレスを計算し、検出してもよい。
また、本実施例では、TM＿ENTｊとタイムコードの対応テーブルをメタデータファイルに格納する形式で説明したが、前記TM＿ENTｊとタイムコードの対応テーブルは制御ファイル中に格納しても同様の効果がえられる。
（実施の形態３）
次に、実施の形態３について説明する。
【００１４】
実施の形態３は、図７に示すように、実施の形態２に対して、検索手段８が付加されている。前記検索手段８では検索したいシーンのキーワードを音声認識してメタデータを検出するのに使用した同一辞書ＤＢ４から選択し、設定する。
次に、検索手段８では前記メタデータファイルのメタデータ項目をサーチしてキーワードと一致するタイトル名とコンテンツシーンの位置（タイムコード）の一覧を表示する。
【００１５】
次に、一覧表示の中から、ひとつの特定シーンが設定された場合は、メタデータファイルの前記基準時間(TM＿ENT (秒))と、フレームオフセット数（TM＿OFFSET）から制御ファイル中の記録メディアアドレスとを自動的に検出して記録手段７に設定し、前記記録手段７から記録メディアアドレスに記録されたコンテンツシーンをモニタ11に再生する。上記のように構成することで、メタデータを検出して即座に、見たいシーンを検出できる装置を提供できる。
なお、コンテンツにリンクしたサムネイルファイルを準備しておけば、前述したキーワードに一致したコンテンツ名の一覧を表示する際、コンテンツの代表的サムネイル画を再生して表示することも可能である。
（実施の形態４）
次に、他の実施の形態について説明する。
前述の実施形態１〜３は、あらかじめ記録されているコンテンツにメタデータを付与するシステムについて述べたが、本発明をカメラ等、撮影時にメタデータを付与するシステム、特に、コンテンツ内容が前もって限定される風景撮り、或いは、撮影位置をメタデータとして付与するシステムに対して拡張できる。
このシステムを実施の形態４として、図６にその構成を示す。
カメラ５１で撮像し、コンテンツＤＢ５４に映像コンテンツが記録されると同時に、カメラが撮影している場所をＧＰＳ５２によって検出し、前記ＧＰＳ５２から出力された位置情報（経緯度数値）を音声合成手段５３で音声信号化した位置情報も別音声チャンネルに記録する。この場合、記録手段付きカメラ５０として、カメラ５１、ＧＰＳ５２、音声合成５３、コンテンツＤＢ５４を一体構成してもよい。
次に、コンテンツＤＢ５４では前記音声チャンネルに記録されている音声信号の位置情報を音声認識手段５６に入力する。
ここで、端子１０５から、キー入力ボード等によって、辞書ＤＢ５５の地域名、ランドマーク等を選択、制限し、前記音声認識手段５６に出力する。
音声認識手段５６では認識された経緯数値と辞書ＤＢ５５のデータを用いて地域名、ランドマークを検出し、ファイル処理手段５７に出力する。
次に、ファイル処理手段５７では、コンテンツＤＢ５４から出力されたタイムコードと音声認識手段５６から出力された地域名、ランドマークをメタデータとしてＴＥＸＴ化してメタデータファイルを発生させる。
次に、記録手段５８ではファイル処理手段５７から出力されたメタデータファイルとコンテンツＤＢ５４から出力されたコンテンツデータを記録する。
このように構成することで、撮影したシーン毎に、自動的に地域名、ランドマークのメタデータを付加することができる。
【００１６】
なお、上記の各実施形態において、一般的には、音声認識には何らかの誤認識が生じる可能性がある。誤認識が生じた場合、制作されたメタデータ、タグをコンピュータ手段などの情報処理手段を用いて修正することも可能である。
【００１７】
また、本発明に係る音声認識手段は単語単位で音声認識する単語認識方式とし、音声入力の単語数、及び、使用する認識辞書の単語数を制限することで、特に、音声認識率を改善することができる。
【００１８】
また、本発明では音声認識手段により認識したキーワードをタイムコード共に、メタデータファイルでファイル化する構成を記述したが、音声認識手段により認識したキーワードに加え、関連したキーワードを追加してファイル化してもよい。
【００１９】
例えば、音声で淀川を認識した場合は、地形、川等の一般属性キーワードも付加してファイル化する。こうすることで検索時、付加された地形、川等のキーワードも使用することができるので検索性を向上することができる。
【００２０】
【発明の効果】
以上説明したように本発明は、コンテンツに関連したメタデータの作成あるいはタグ付けを行うに当たり、制作されたコンテンツのシナリオ等から事前に抽出したキーワードを音声信号として入力し、また、前記シナリオに基づいて辞書分野の設定、及び,キーワードの優先順位つけをおこなっているため、効率よく、正確に音声認識手段からメタデータを発生することができる。
特に、数秒単位の間隔でメタデータを付与する場合は、キー入力では困難であるが、本構成のような音声入力、音声認識を用いれば、数秒単位間隔であっても効率よく、メタデータを付与できる。
また、前記コンテンツの記録位置を示す制御ファイルとメタデータ、及び、タイムコード等を示す前記メタデータファイルとを使用することによって、メタキーワードから一義的に必要なシーンを検索し、前記記録手段から再生することができる。
【図面の簡単な説明】
【図１】本発明の実施の形態１に係るメタデータ制作装置の構成を示すブロック図
【図２】本発明に係る辞書ＤＢの一例を示す構成図
【図３】本発明に係るシナリオの一例を示すレシピ図
【図４】本発明に係るメタデータファイルの一例を示すTEXT形式のデータ図
【図５】本発明の実施の形態２に係るメタデータ検索装置の構成を示すブロック図
【図６】本発明の情報ファイルの一例を示す構成図
【図７】本発明の実施の形態３に係るメタデータ検索装置の構成を示すブロック図
【図８】本発明の実施形態４に係るメタデータ制作装置の構成を示すブロック図
【符号の説明】
１コンテンツＤＢ
２入力手段
３音声認識手段
４辞書ＤＢ４
５ファイル処理手段
６コンテンツ情報ファイル処理手段
７記録手段
１１映像モニタ
５０記録装置付きカメラ
５１カメラ
５２ＧＰＳ
５３音声合成手段
５４コンテンツＤＢ
５５辞書ＤＢ
５６音声認識手段
５７ファイル処理手段
５８記録手段
１０１音声入力端子
１０２辞書分野選択入力端子
１０５辞書地名選択入力端子
[0001]
[Technical Field of the Invention]
The present invention relates to a metadata production device and a search device in content production.
[0002]
[Prior art]
In recent years, it has become common in the production of video and audio content to attach metadata related to that content.
[0003]
However, such metadata is generally produced by playing back the finished video/audio content, checking the information that should become metadata against the content's scenario or narration script, and typing it into a computer by hand. This method requires considerable labor.
[0004]
Systems that tag content by voice recognition during camera shooting do exist, but they can only be used at the time of shooting (see Patent Document 1).
[0005]
[Patent Document 1]
Japanese Patent Application Laid-Open No. 09-130736
[0006]
[Problems to be solved by the invention]
The present invention aims to solve the above problems of the prior art. Its object is to provide a system in which the information to be used as metadata is confirmed by playing back the produced video/audio content, the metadata is produced by entering that information into a computer or the like by voice, and the metadata can then be searched.
[0007]
[Means for Solving the Problems]
To solve the above problems, the present invention comprises: input means that receives an audio signal of content management keywords extracted from a scenario produced for the produced content, or from the substance of the content, and converts the audio signal into data; speech recognition means that recognizes keywords from the audio signal data produced by the input means; and file processing means that appends to each keyword output from the speech recognition means a general attribute keyword related to that keyword, uses the time code supplied from the content to append information indicating the temporal position of the keyword relative to the image signal contained in the content, and stores the result in a metadata file.
As a result, metadata that was conventionally typed in on a keyboard can be entered by voice using speech recognition, and time-coded metadata can be produced automatically.
In particular, attaching metadata at intervals of a few seconds is difficult with key input, whereas this configuration allows metadata to be attached efficiently even at such intervals.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
The invention according to claims 1 to 3 is a metadata production device comprising: input means that receives an audio signal of content management keywords extracted from a scenario produced for the produced content, or from the substance of the content, and converts the audio signal into data; speech recognition means that recognizes keywords from the audio signal data produced by the input means; and file processing means that appends to each keyword output from the speech recognition means a general attribute keyword related to that keyword, uses the time code supplied from the content to append information indicating the temporal position of the keyword relative to the image signal contained in the content, and stores the result in a metadata file.
[0009]
The invention according to claims 4 to 6 is a metadata search device comprising the same input means, speech recognition means, and file processing means as above, together with search means that searches for the image signal contained in the content that corresponds to the keyword or to the general attribute keyword. At search time, a search can be made with the keyword and also with the general attribute keyword.
[0010]
The invention according to claim 7 is a metadata search device comprising: input means that receives an audio signal of content management keywords extracted from a scenario produced for the produced content, or from the substance of the content, and converts the audio signal into data; speech recognition means that holds a plurality of genre-specific dictionaries, selects the dictionary of the genre that fits the content, and recognizes keywords from the audio signal data produced by the input means; file processing means that stores each keyword output from the speech recognition means in a metadata file together with a time code indicating its temporal position relative to the image signal contained in the content; content information file processing means that generates a control file managing the relationship between the recording position of the content file and the metadata file; recording means that records the content file, the metadata file, and the control file together; and search means that selects a keyword, suited to the field of the content to be found, from the common dictionary used by the speech recognition means, identifies the metadata file in which the selected keyword is recorded, finds the desired content among the content recorded in the recording means, and looks up the recording position of the desired scene in the control file. The recording position of the content file recorded in the recording means is thus identified from the metadata.
[0011]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(Embodiment 1)
FIG. 1 is a block diagram showing a configuration of a metadata production apparatus according to Embodiment 1 of the present invention.
[0012]
In FIG. 1, 1 is a content database (DB), 2 is input means, 3 is speech recognition means, 4 is a dictionary database (DB), 5 is file processing means, and 11 is a video monitor.
The content DB 1 is content recording means equipped with means for playing back content while generating a time code matched to the content. It may be, for example, a VTR; video/audio signal playback means built on a hard disk; video/audio signal playback means that uses memory means such as a semiconductor memory as its recording medium; video/audio signal playback means built on a rotating disc of the optical or magnetic recording type; or video/audio playback means that temporarily records and then plays back video/audio signals that have been transmitted or broadcast.
A video signal with a time code reproduced from the content DB 1 is output to the video monitor 11 and displayed on the video monitor 11.
Next, a narrator 12 uses a microphone to input a narration audio signal in step with the content shown on the video monitor 11. In doing so, the narrator checks the content displayed on the video monitor 11, or the time code, and speaks the content management keywords extracted from the scenario, the narration script, or the substance of the content, inputting them through the microphone as narration.
Because the input audio signal is thus confined to keywords selected in advance from the scenario or the like, the recognition rate of the speech recognition means 3 in the following stage can be improved.
Next, the input means 2 converts the audio signal input from the microphone into data using a clock synchronized with the vertical synchronization signal output from the content DB 1.
Next, the voice signal data converted into data by the input means 2 is input to the voice recognition means 3. At the same time, a dictionary necessary for speech recognition is supplied from the dictionary DB 4.
Here, the voice recognition dictionary to be used is set in the dictionary DB 4 from the terminal 102.
For example, if the dictionary DB 4 is configured for each field as shown in FIG. 2, the field to be used is set from the terminal 102 (for example, a keyboard terminal capable of key input).
In the case of a cooking program, for example, the dictionary DB 4 is set from the terminal 102 by specifying a path such as Cooking - Japanese cuisine - Cooking method - Stir-fried vegetables.
By setting the dictionary DB 4 as described above, the words used and the words that should be recognized are restricted, and the recognition rate of the speech recognition means 3 is improved.
Returning to FIG. 1, the scenario, the scenario script, or keywords extracted from the substance of the content are also input from the terminal 102.
For example, in the case of a cooking program, the recipe shown in FIG. 3 is input from the terminal 102.
Since the words written in the recipe are likely to arrive as audio input, the dictionary DB 4 explicitly marks the recognition priority of the recipe words input from the terminal 102 so that they are recognized preferentially.
For example, suppose the dictionary contains both "kaki" meaning persimmon (柿) and "kaki" meaning oyster (貝のカキ). If the only such word among the recipe words input from the terminal 102 is the oyster, the oyster entry is given priority 1.
When the speech recognition means 3 then hears the sound "kaki", it recognizes it as the oyster, the entry marked with priority 1 in the dictionary DB 4.
In this way, the dictionary DB 4 restricts the vocabulary to the field input from the terminal 102 and marks word priorities according to the scenario input from the terminal 102, which improves the recognition rate of the speech recognition means 3.
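The two-step narrowing described above (restrict the dictionary by field, then give priority to words that appear in the scenario) can be illustrated with a small sketch. This is not the recognizer of the patent: the `Entry` structure, the field tuples, the romanized readings, and the exact-match lookup are all assumptions made for illustration, and a real recognizer would score acoustic hypotheses rather than compare strings.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    word: str          # written form that would be stored as metadata
    reading: str       # pronunciation the recognizer matches (hypothetical romanization)
    field: tuple       # hypothetical category path, e.g. ("cooking", "japanese")
    priority: int = 0  # 0 = no priority, 1 = word appears in the scenario

def build_active_dictionary(entries, field, scenario_words):
    """Keep only entries under the selected field and flag scenario words."""
    active = []
    for e in entries:
        if e.field[:len(field)] != tuple(field):
            continue  # outside the selected field: excluded from recognition
        prio = 1 if e.word in scenario_words else 0
        active.append(Entry(e.word, e.reading, e.field, prio))
    return active

def recognize(reading, active):
    """Return the preferred word for a reading, or None if nothing matches."""
    candidates = [e for e in active if e.reading == reading]
    if not candidates:
        return None
    # Entries flagged with priority 1 win over unflagged homophones.
    return max(candidates, key=lambda e: e.priority).word

entries = [
    Entry("persimmon", "kaki", ("cooking", "japanese")),
    Entry("oyster", "kaki", ("cooking", "japanese")),
    Entry("carburetor", "kyabu", ("automotive",)),
]
active = build_active_dictionary(entries, ("cooking",), {"oyster"})
print(recognize("kaki", active))   # oyster
print(recognize("kyabu", active))  # None
```

Restricting by field shrinks the candidate set before any priority is consulted, so the priority flag only has to break ties among homophones that remain in the chosen field.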
Returning to FIG. 1, the speech recognition means 3 recognizes the audio signal data input from the input means 2 according to the dictionary supplied from the dictionary DB 4 and generates metadata.
Next, the metadata output from the speech recognition means 3 is input to the file processing means 5.
As described above, the input means 2 converts the audio signal into data in synchronization with the vertical synchronization signal reproduced from the content DB 1.
Accordingly, the file processing means 5 uses the synchronization information from the input means 2 and the time code value supplied from the content DB 1 to write the metadata output from the speech recognition means 3 to a file in which each entry is annotated with a reference time counted in one-second steps from the start of the file (TM_ENT, in seconds), the number of frames offset from that reference time (TM_OFFSET), and the time code.
For example, in the case of the cooking program described above, a TEXT-format metadata file such as that shown in FIG. 4 is output from the file processing means 5.
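As a rough illustration of the three fields attached to each keyword, the following sketch derives TM_ENT, TM_OFFSET, and a time code from a frame index. The layout of FIG. 4 is not reproduced in the text, so the tab-separated line format, the nominal 30 frames per second, and the assumption that the time code starts at 00:00:00:00 at the head of the file are all illustrative choices rather than the patent's format.

```python
FPS = 30  # assumed nominal NTSC frame count per one-second entry

def split_frame_index(frame_index, fps=FPS):
    """Split a frame index into (TM_ENT seconds, TM_OFFSET frames)."""
    return frame_index // fps, frame_index % fps

def non_drop_timecode(frame_index, fps=FPS):
    """Format a frame index as HH:MM:SS:FF, counting every frame."""
    frames = frame_index % fps
    total_seconds = frame_index // fps
    s = total_seconds % 60
    m = (total_seconds // 60) % 60
    h = total_seconds // 3600
    return f"{h:02d}:{m:02d}:{s:02d}:{frames:02d}"

def metadata_line(frame_index, keyword):
    """One hypothetical TEXT record: TM_ENT, TM_OFFSET, time code, keyword."""
    tm_ent, tm_offset = split_frame_index(frame_index)
    return f"{tm_ent}\t{tm_offset}\t{non_drop_timecode(frame_index)}\t{keyword}"

print(metadata_line(95, "onion"))  # 3, 5, 00:00:03:05, onion (tab-separated)
```

Keeping the whole-second part and the frame remainder as separate fields lets a reader jump to the one-second entry first and then step forward by the offset.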
Next, the recording unit 7 records the metadata file output from the file processing unit 5 and the content output from the content DB 1.
Here, the recording means 7 is composed of an HDD, a memory, an optical disk, etc., and also records the content output from the content DB 1 in a file format.
[0013]
(Embodiment 2)
Next, a second embodiment will be described.
In the second embodiment, as shown in FIG. 5, content information file processing means 6 is added to the first embodiment. The content information file processing means 6 generates a control file indicating the recording position relationship of the content recorded in the recording means 7 and records it in the recording means 7.
That is, based on the content output from the content DB 1 and the recording position information for that content output from the recording means 7, the content information file processing means 6 generates information that relates the time-axis information held by the content to the addresses of the content recorded in the recording means 7, converts it into data, and outputs it as a control file.
For example, as shown in FIG. 6, entries TM_ENT #j, which mark the time-axis reference of the content, point at equal time intervals to the recording media addresses that indicate where the content is recorded. For instance, one TM_ENT #j points to a recording media address for every second of content (every 30 frames for an NTSC signal).
With this mapping, the recording address in the recording means 7 can be determined uniquely from TM_ENT #j even if the content is recorded in scattered one-second units. In addition, as described above with reference to FIG. 4, the metadata file records, in TEXT format, the reference time in one-second steps from the start of the file (TM_ENT, in seconds), the number of frames offset from the reference time (TM_OFFSET), the time code, and the metadata.
Therefore, once metadata 1 is specified in the metadata file, its time code, reference time, and frame offset value are known, and the recording position in the recording means 7 can be found immediately from the control file shown in FIG. 6.
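The lookup from a metadata entry to a recording position can be sketched as a simple table keyed by TM_ENT. The addresses below are invented for illustration and the dictionary representation is an assumption; the text only states that each one-second entry points to a recording media address.

```python
# Hypothetical control table: TM_ENT #j -> recording media address at which
# second j of the content begins. Addresses need not be contiguous or ordered,
# which models content recorded in scattered one-second units.
control_table = {0: 0x1000, 1: 0x8400, 2: 0x2C00, 3: 0xF000}

def locate_scene(control_table, tm_ent, tm_offset):
    """Return (start address of the one-second unit, frames to skip inside it)."""
    start_address = control_table[tm_ent]
    return start_address, tm_offset

address, skip = locate_scene(control_table, tm_ent=2, tm_offset=12)
print(hex(address), skip)  # 0x2c00 12
```

Because the table is indexed directly by the reference time, no scan of the recorded content is needed to reach a scene.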
Although the example here places the TM_ENT #j entries at equal intervals of one second, they can also be written to match, for example, the GOP units used in MPEG-2 compression.
Furthermore, in NTSC television signals the vertical synchronization frequency is 60/1.001 Hz, so two kinds of time code are used: a drop-frame time code, which keeps pace with absolute time, and a non-drop time code, which follows the vertical synchronization signal (60/1.001 Hz). In this case, the non-drop time code can be represented by TM_ENT #j and the drop-frame time code by TC_ENT #j.
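The difference between the two time codes can be made concrete with the widely used drop-frame conversion, in which frame numbers 00 and 01 are skipped at the start of every minute except each tenth minute. This sketch shows that general convention only; how the patent's files would actually encode TM_ENT and TC_ENT is not specified in the text.

```python
def non_drop_timecode(frame_number):
    """HH:MM:SS:FF counting 30 frames per second with no skipped numbers."""
    ff = frame_number % 30
    ss = (frame_number // 30) % 60
    mm = (frame_number // 1800) % 60
    hh = frame_number // 108000
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

def drop_frame_timecode(frame_number):
    """HH:MM:SS;FF with two frame numbers dropped per minute, except every 10th."""
    drop = 2
    frames_per_10min = 17982  # 10 * 60 * 30 - 9 * 2
    frames_per_min = 1798     # 60 * 30 - 2
    d, m = divmod(frame_number, frames_per_10min)
    if m > drop:
        frame_number += drop * 9 * d + drop * ((m - drop) // frames_per_min)
    else:
        frame_number += drop * 9 * d
    ff = frame_number % 30
    ss = (frame_number // 30) % 60
    mm = (frame_number // 1800) % 60
    hh = frame_number // 108000
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"

# 107892 frames is about one hour of real time at 30/1.001 frames per second.
print(non_drop_timecode(107892))    # 00:59:56:12
print(drop_frame_timecode(107892))  # 01:00:00;00
```

After roughly an hour the non-drop label lags real time by about 3.6 seconds, which is why a second, drop-frame label is useful when entries must line up with clock time.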
The control file can also be written in an existing language such as SMIL 2. Using the features of SMIL 2, the file names of the related content and of the metadata file can also be encoded and stored in the control file.
Further, although FIG. 6 shows a configuration in which the recording addresses of the recording means are indicated directly, the control file may instead indicate the amount of data from the head of the content file to each time code. The recording address of that time code in the recording means can then be calculated from this data amount and the recording addresses held by the file system.
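One way to picture this alternative is to treat the file system's record of the content file as an ordered list of extents and walk it with the byte offset. The extent representation and the numbers below are assumptions for illustration; the text does not say how the file system exposes recording addresses.

```python
def address_from_offset(byte_offset, extents):
    """Map a byte offset within a file to a media address.

    extents: list of (start_address, length_in_bytes) in file order,
    a hypothetical stand-in for the file system's allocation records.
    """
    remaining = byte_offset
    for start, length in extents:
        if remaining < length:
            return start + remaining
        remaining -= length
    raise ValueError("offset lies beyond the end of the file")

extents = [(1000, 500), (4000, 300)]  # illustrative values only
print(address_from_offset(499, extents))  # 1499
print(address_from_offset(620, extents))  # 4120
```

Storing offsets rather than absolute addresses keeps the control file valid if the content file is copied or relocated, at the cost of one extra lookup through the file system.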
In this embodiment the table relating TM_ENT #j to the time code is described as being stored in the metadata file, but the same effect is obtained if this table is stored in the control file instead.
(Embodiment 3)
Next, Embodiment 3 will be described.
[0014]
In the third embodiment, as shown in FIG. 7, search means 8 is added to the second embodiment. In the search means 8, the keyword for the scene to be found is selected and set from the same dictionary DB 4 that was used when the metadata was detected by speech recognition.
Next, the search means 8 searches the metadata items of the metadata file and displays a list of title names and content scene positions (time codes) that match the keywords.
[0015]
Next, when one particular scene is chosen from the displayed list, the corresponding recording media address in the control file is found automatically from the reference time (TM_ENT, in seconds) and the frame offset (TM_OFFSET) in the metadata file and is set in the recording means 7, and the content scene recorded at that address is played back from the recording means 7 on the monitor 11. With this configuration, a device can be provided that locates the desired scene as soon as the matching metadata is found.
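The search flow of this embodiment (match a keyword in the metadata file, list the hits, then resolve the chosen hit through the control file) can be sketched as follows. The tab-separated record layout, the sample records, and the control table values are assumptions for illustration and are not taken from FIG. 4 or FIG. 6.

```python
def parse_metadata(text):
    """Parse hypothetical TEXT records: TM_ENT, TM_OFFSET, time code, keyword."""
    records = []
    for line in text.strip().splitlines():
        tm_ent, tm_offset, timecode, keyword = line.split("\t")
        records.append({
            "tm_ent": int(tm_ent),
            "tm_offset": int(tm_offset),
            "timecode": timecode,
            "keyword": keyword,
        })
    return records

def search(records, keyword):
    """Return every record whose metadata keyword matches exactly."""
    return [r for r in records if r["keyword"] == keyword]

def playback_position(hit, control_table):
    """Resolve a hit to (recording media address, frames to skip)."""
    return control_table[hit["tm_ent"]], hit["tm_offset"]

metadata_text = (
    "3\t5\t00:00:03:05\tonion\n"
    "12\t0\t00:00:12:00\tcarrot\n"
    "40\t17\t00:00:40:17\tonion\n"
)
control_table = {3: 0x3000, 12: 0xA000, 40: 0x1F000}  # illustrative addresses

records = parse_metadata(metadata_text)
hits = search(records, "onion")
print([h["timecode"] for h in hits])             # ['00:00:03:05', '00:00:40:17']
print(playback_position(hits[1], control_table))  # (126976, 17)
```

The list of time codes corresponds to the displayed list of scene positions, and choosing one entry yields the address handed to the recording means.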
If a thumbnail file linked to the content is prepared, a representative thumbnail image of the content can be reproduced and displayed when displaying the list of content names that match the keyword.
(Embodiment 4)
Next, another embodiment will be described.
Embodiments 1 to 3 above describe systems that attach metadata to content that has already been recorded. The present invention can also be extended to systems that attach metadata at the time of shooting, for example with a camera, and in particular to landscape shooting, where the subject matter is limited in advance, or to systems that attach the shooting position as metadata.
This system is described as Embodiment 4, and its configuration is shown in FIG. 8.
An image is captured by the camera 51 and the video content is recorded in the content DB 54. At the same time, the place where the camera is shooting is detected by the GPS 52, and the position information (latitude and longitude values) output from the GPS 52 is converted into an audio signal by the speech synthesis means 53 and recorded on a separate audio channel. In this case, the camera 51, the GPS 52, the speech synthesis means 53, and the content DB 54 may be built as a single unit, a camera 50 with recording means.
Next, the position information recorded as an audio signal on that audio channel is input from the content DB 54 to the speech recognition means 56.
Here, the area names, landmarks, and so on in the dictionary DB 55 are selected and restricted from the terminal 105, for example with a keyboard, and output to the speech recognition means 56.
The speech recognition means 56 detects the area name and landmark from the recognized latitude and longitude values and the data in the dictionary DB 55, and outputs them to the file processing means 57.
Next, the file processing means 57 converts the time code output from the content DB 54 and the area name and landmark output from the speech recognition means 56 into text as metadata and generates a metadata file.
Next, the recording unit 58 records the metadata file output from the file processing unit 57 and the content data output from the content DB 54.
With this configuration, it is possible to automatically add area name and landmark metadata for each photographed scene.
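The last step of this chain, turning recognized latitude and longitude values into an area name or landmark from a restricted dictionary, might look like the sketch below. The text does not say how the match is made, so the nearest-neighbour rule, the 5 km cutoff, and the landmark names and coordinates are all assumptions for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on a sphere."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_landmark(lat, lon, landmarks, max_km=5.0):
    """Return the closest landmark name within max_km, or None."""
    if not landmarks:
        return None
    best = min(landmarks, key=lambda lm: haversine_km(lat, lon, lm[1], lm[2]))
    if haversine_km(lat, lon, best[1], best[2]) <= max_km:
        return best[0]
    return None

# Hypothetical dictionary entries after restriction from the terminal:
# (name, latitude, longitude). Names and coordinates are placeholders.
landmarks = [("Landmark A", 34.70, 135.50), ("Landmark B", 35.00, 135.75)]

print(nearest_landmark(34.701, 135.501, landmarks))  # Landmark A
print(nearest_landmark(36.0, 137.0, landmarks))      # None
```

Restricting the landmark list beforehand, as the terminal 105 does for the dictionary DB 55, keeps the candidate set small and avoids matching a distant place with a similar position.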
[0016]
In each of the above embodiments, speech recognition can in general produce some misrecognition. When this happens, the produced metadata or tags can be corrected using information processing means such as a computer.
[0017]
The speech recognition means of the present invention uses a word recognition method that recognizes speech word by word. Limiting the number of words in the speech input and the number of words in the recognition dictionary in use particularly improves the recognition rate.
[0018]
Further, although the configuration described stores the keyword recognized by the speech recognition means in the metadata file together with the time code, related keywords may be added to the recognized keyword before it is filed.
[0019]
For example, when the Yodo River is recognized by voice, general attribute keywords such as terrain and river are also added to the file. By doing so, keywords such as added terrain and river can be used at the time of searching, so that the searchability can be improved.
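A minimal sketch of this expansion, assuming a simple lookup table from a recognized keyword to its general attribute keywords, is shown below. Only the Yodo River example comes from the text; the table structure, the second sample keyword, and the time codes are hypothetical.

```python
# Mapping from a recognized keyword to general attribute keywords.
# The single entry mirrors the example in the text; the structure is assumed.
GENERAL_ATTRIBUTES = {"Yodo River": ["terrain", "river"]}

def expand_keyword(keyword, attributes=GENERAL_ATTRIBUTES):
    """Return the keyword followed by any general attribute keywords."""
    return [keyword] + attributes.get(keyword, [])

def build_index(recognized):
    """recognized: list of (timecode, keyword). Returns term -> list of timecodes."""
    index = {}
    for timecode, keyword in recognized:
        for term in expand_keyword(keyword):
            index.setdefault(term, []).append(timecode)
    return index

recognized = [("00:01:10:00", "Yodo River"), ("00:02:30:15", "bridge")]
index = build_index(recognized)
print(index["river"])       # ['00:01:10:00']
print(index["Yodo River"])  # ['00:01:10:00']
```

With the attributes filed alongside the keyword, a search for the general term reaches the same scene as a search for the specific name.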
[0020]
[Effects of the Invention]
As described above, when creating metadata for content or tagging it, the present invention inputs keywords extracted in advance from the scenario or the like of the produced content as an audio signal, and sets the dictionary field and assigns keyword priorities on the basis of that scenario, so that metadata can be generated by the speech recognition means efficiently and accurately.
In particular, attaching metadata at intervals of a few seconds is difficult with key input, but with voice input and speech recognition as in this configuration, metadata can be attached efficiently even at intervals of a few seconds.
Further, by using the control file, which indicates the recording position of the content, together with the metadata file, which holds the metadata, the time codes, and so on, the required scene can be uniquely located from a metadata keyword and played back from the recording means.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a metadata production apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a configuration diagram showing an example of a dictionary DB according to the present invention.
FIG. 3 is a recipe diagram showing an example of a scenario according to the present invention.
FIG. 4 is a data diagram in TEXT format showing an example of a metadata file according to the present invention.
FIG. 5 is a block diagram showing the configuration of a metadata search apparatus according to Embodiment 2 of the present invention.
FIG. 6 is a configuration diagram showing an example of an information file of the present invention.
FIG. 7 is a block diagram showing the configuration of a metadata search apparatus according to Embodiment 3 of the present invention.
FIG. 8 is a block diagram showing the configuration of a metadata production apparatus according to Embodiment 4 of the present invention.
[Explanation of symbols]
1 Content DB
2 Input means
3 Speech recognition means
4 Dictionary DB
5 File processing means
6 Content information file processing means
7 Recording means
11 Video monitor
50 Camera with recording device
51 Camera
52 GPS
53 Speech synthesis means
54 Content DB
55 Dictionary DB
56 Speech recognition means
57 File processing means
58 Recording means
101 Voice input terminal
102 Dictionary field selection input terminal
105 Dictionary place name selection input terminal

Claims

A metadata production device for producing metadata related to content,
Input means that receives an audio signal of content management keywords extracted from a scenario produced for the produced content, or from the substance of the content, and converts the audio signal into data;
Voice recognition means for recognizing a keyword from voice signal data converted into data by the input means;
A metadata production apparatus comprising: file processing means that appends to the keyword output from the voice recognition means a general attribute keyword related to the keyword, uses the time code supplied from the content to append information indicating the temporal positional relationship between the keyword and the image signal included in the content, and stores the result in a metadata file.

The metadata production device according to claim 1, wherein the voice recognition means has a plurality of genre-specific dictionaries and selects the dictionary of the genre that fits the content.

The metadata production device according to claim 1 or 2, wherein the voice recognition means has a plurality of genre-specific dictionaries, selects the dictionary of the genre that fits the content, and further preferentially recognizes keywords extracted from the scenario or from the substance of the content.

A metadata search device for searching metadata related to content,
Input means that receives an audio signal of content management keywords extracted from a scenario produced for the produced content, or from the substance of the content, and converts the audio signal into data;
Voice recognition means for recognizing a keyword from voice signal data converted into data by the input means;
File processing means that appends to the keyword output from the voice recognition means a general attribute keyword related to the keyword, uses the time code supplied from the content to append information indicating the temporal positional relationship between the keyword and the image signal included in the content, and stores the result in a metadata file; and
Search means for searching for the image signal included in the content corresponding to the keyword or the general attribute keyword,
A metadata search apparatus in which a search can be made with the keyword and also with the general attribute keyword.

The metadata search device according to claim 4, further comprising content information file processing means for generating a control file that manages the relationship between the recording position of the content file and the metadata file,
Wherein the control file explicitly contains a table that associates the recording positions of the content file in the recording means with the recording time of the content, so that the recording position of the content file can be found from the time code.

The metadata search device according to claim 4 or 5, wherein the voice recognition means has a plurality of genre-specific dictionaries, selects the dictionary of the genre that fits the content, and further preferentially recognizes keywords extracted from the scenario or from the substance of the content.

A search device for metadata related to content,
An input means for inputting a scenario produced according to the produced content or an audio signal of a content management keyword extracted from the content, and converting the audio signal into data;
A plurality of genre dictionaries are prepared, a genre dictionary suitable for the content is selected, and voice recognition means for recognizing a keyword from voice signal data converted into data by the input means;
File processing means for storing a keyword output from the voice recognition means together with a time code indicating a time position with an image signal included in the content in a metadata file;
Content information file processing means for generating a control file for managing the relationship between the recording position of the content file and the metadata file;
Recording means for recording the content file, the metadata file, and the control file together;
Select a keyword from the common dictionary used by the voice recognition unit, specify the metadata file in which the selected keyword is recorded, and search for the content to be searched from the content recorded in the recording unit And a search means for searching for a recording position of a scene to be searched from the control file,
A metadata search apparatus, wherein content management metadata is automatically generated, and a recording position of a content file recorded in the recording means is specified from the metadata.