JP2007052626A

JP2007052626A - Metadata input device and content processor

Info

Publication number: JP2007052626A
Application number: JP2005237154A
Authority: JP
Inventors: Yoshihiro Morioka; 芳宏森岡; Masazumi Yamada; 山田　　正純; Kenji Matsuura; 賢司松浦; Masaaki Kobayashi; 正明小林
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2005-08-18
Filing date: 2005-08-18
Publication date: 2007-03-01

Abstract

PROBLEM TO BE SOLVED: For a medium on which content such as video or audio is recorded, quickly reaching a desired content section for viewing or editing requires access along the time axis of the recording medium, and this access and reading take time and labor.

SOLUTION: When relevant metadata is input for image data, the complicated entry of text or the like from a keyboard is performed during editing work so that an image can be retrieved easily.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a method of generating and inputting metadata from content (video, audio, data) that can be acquired during camera shooting, and is well suited to an editing system with faster search and cueing.

Conventionally, editing camera-captured content means selecting and combining original content (video, audio, data) recorded on a master medium (tape, disk, etc.) according to the content creator's intent, and it requires a great deal of labor and time. The amount of work and time needed varies greatly with the field and nature of the content, such as broadcast, professional, or home use. When editing news and sports programs for broadcast, extracting video content a few seconds long from the source tapes takes considerable effort, and it remains uncertain whether the extracted content is really the most suitable scene. Furthermore, when following a specific person at a sporting event or athletic meet, the subject can move out of the viewfinder frame if the subject moves quickly or the photographer cannot concentrate on shooting. In addition, no method has been established for automatically adding keywords related to a shooting scene as tags, or for adding tags with a simple operation, so it is difficult to access a desired scene in the recorded content immediately or to cue it up instantly for viewing. For editing as well as viewing, grasping the whole of the recorded content takes a great deal of work, and as a result editing has been limited to a few fields, such as broadcasting, that can afford to spend money on content production.
A conventional metadata input method is described in Patent Document 1. To provide additional program data for each scene of a recorded video program, information identifying the corresponding scene is entered through an input form.
A conventional editing support system is described in Patent Document 2. It uses metadata that indicates various information about the recording, namely a serial number, a cassette number, and comments, to obtain character information about the recorded video. Character strings are then searched to find the video of a desired scene, which makes editing work more efficient.
A further metadata input method and editing system is described in Patent Document 3. At recording time, characters contained in the recorded image are detected, character recognition is applied to them, and the resulting character data is attached to the recorded image data as metadata. Because the metadata generated automatically at recording time is used in editing, no labor or time is needed to input metadata. In addition, because characters written on a clapperboard or a memo can also be recognized, information directly related to editing is easily provided as metadata, so the metadata can be used in editing as it is.
JP 2002-152694 A; JP 2001-292407 A; JP 2005-39534 A

However, in the conventional methods described above, metadata must be entered by hand through an input form while watching the video after recording, which takes labor and time. In addition, when the metadata attached to an image is a character string generated from a serial number, cassette number, or the like, it is difficult to pinpoint directly the information needed for editing. There is also the problem that metadata cannot be generated directly from information other than characters, such as audio, people, or objects.

At recording time, people and objects contained in the video of the recorded content (video, audio, data) are identified by image recognition, and speech such as keywords contained in the audio is identified by speech recognition. Each recognition result is converted into character data to generate metadata, and this metadata is associated with the recorded content. Using metadata generated automatically at recording time, or generated by a simple operation, for cueing and editing greatly improves work efficiency.

With the above invention, people and objects in the video of the recorded content (video, audio, data) are recognized by image recognition and speech such as keywords in the audio is recognized by speech recognition at recording time. Each recognition result is converted into character data to generate metadata, which is associated with the recorded content. Metadata generated automatically at recording time, or by a simple operation, can then be used for cueing and editing, greatly improving work efficiency.

Furthermore, when the image recognition means recognizes a person, the clothing the person is wearing and items such as a bag the person is carrying are registered in the image database and associated with that person, so that searches in response to queries during viewing can be performed easily.

In addition, by providing means for automatically recognizing items subject to copyright, such as music, people, and people's movements, the copyright-related items of each scene making up the complete package edited from the content material can be retrieved and displayed, and metadata such as the copyright manager, contact information, conditions of use, and royalty fees can be added to each item. If recorded content requires copyright clearance, a list of the clearances needed for the editing material or the complete package can therefore be created easily, which promotes reuse of the content.

In addition, the user looks at the displayed thumbnail images, selects the video clips to edit, selects the scenes to use in each clip, and rearranges them to generate a playlist, which enables digest playback and the like.

In addition, providing playlist output means makes it possible to output the playlist externally and have an external device output only the AV content that follows the playlist. Digest playback driven by a playlist from a remote location thus becomes possible.

Furthermore, adding metadata time correction means makes it possible to remove frame or field errors between the playlist and the AV content during playlist playback. Users such as the camera operator or director of photography can then synchronize metadata and video exactly as intended, which improves the efficiency of AV signal editing and allows more precise and sophisticated video expression.

(Embodiment 1)
FIG. 1 is an explanatory diagram of the present invention and shows a model of a system in which a camera creates video data, audio data, and metadata on a recording medium (or buffer memory). Reference numeral 101 denotes a camera, 102 the camera's lens unit, 103 the camera's microphone, and 104 the subjects being shot (landscapes, people, animals such as pets, and objects such as cars and buildings). Reference numeral 105 denotes the data shot by the camera, which consists of video data 106, audio data 107, and metadata 108. Reference numeral 109 denotes a data sequence shot by the camera, in which video, audio, and metadata are arranged on the time axis. The metadata is handled as character data in text format, but may instead be data in binary format.

Here, the data sequence 109 contains the extracted scenes #1 to #5. Reference numeral 111 denotes a data sequence in which scenes #1 to #5 have been joined by editing. By remote control using the remote controller 110, the user can display the scenes as a list on the TV 112 in the order of the edited data sequence.

Reference numeral 113 denotes the metadata input buttons, which consist of three buttons. Pressing a metadata input button at an important moment while shooting marks that important scene (marking function). This mark is itself metadata, and after shooting the marked scenes can be recalled by a mark search. As an example assignment, the first button registers an important scene, the second switches modes (enabling button operation or switching to character input mode), and the third cancels a registration. The camera can also be switched to a mode in which the period during which the first button is held down is registered as an important scene, or to a mode in which a window around the moment the first button is pressed is registered: either 5 seconds before and after, or 5 seconds before and 10 seconds after for a total of 15 seconds. With three buttons, many functions can be provided by combining which button is pressed, when, and for how long.
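The two registration modes described above can be sketched as follows. This is only an illustrative sketch: the names `SceneMark`, `mark_while_pressed`, and `mark_around_press` are hypothetical, times are assumed to be seconds from the start of recording, and clamping the window at zero is an assumption not stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneMark:
    """An important-scene mark stored as metadata (hypothetical structure)."""
    start: float  # seconds from the start of recording
    end: float
    label: str = "important"


def mark_while_pressed(press_time: float, release_time: float) -> SceneMark:
    """Mode A: the interval during which the first button is held down."""
    return SceneMark(press_time, release_time)


def mark_around_press(press_time: float, before: float = 5.0,
                      after: float = 5.0) -> SceneMark:
    """Mode B: a fixed window around the press, clamped at t = 0 (assumption)."""
    return SceneMark(max(0.0, press_time - before), press_time + after)


symmetric = mark_around_press(12.0)               # 5 s before, 5 s after
asymmetric = mark_around_press(12.0, 5.0, 10.0)   # 5 s before, 10 s after
held = mark_while_pressed(30.0, 42.5)
```

A mark search after shooting would then only need to scan the stored `SceneMark` records rather than the video itself.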

Here, the duration of each of scenes #1 to #5 is arbitrary. From the data sequence of material shot by the camera, the user can select the start position (time) and end position (time), or the length, of each scene and rearrange the scenes. When a scene is displayed on a TV monitor or the like, the first frame (or field) of the scene, or any later frame up to the last, can be used as the image representing the scene.
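As a rough illustration of selecting scenes by start and end time and rearranging them, the sketch below builds a playlist from a list of scenes. The `Scene` structure and the helper names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    name: str
    start: float  # start position in the source data sequence, in seconds
    end: float    # end position in the source data sequence, in seconds


def build_playlist(scenes, order):
    """Return the selected scenes in the user-chosen order."""
    by_name = {scene.name: scene for scene in scenes}
    return [by_name[name] for name in order]


def total_duration(playlist):
    """Sum of the scene lengths, i.e. the running time of the digest."""
    return sum(scene.end - scene.start for scene in playlist)


source = [Scene("#1", 0.0, 4.0), Scene("#2", 10.0, 12.5), Scene("#3", 20.0, 21.0)]
digest = build_playlist(source, ["#3", "#1"])  # scene #2 is left out
```

Because the playlist only references positions in the source sequence, the recorded material itself stays untouched.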

FIG. 2 illustrates how the camera 101 handles video signals, audio signals, and metadata. Reference numeral 201 denotes input means for the video and audio signals from the camera, 202 voice and image recognition unit means, 203 voice recognition means, 204 image recognition means, 205 input means for the user's voice and for importance information set by the user, 206 camera sensor information input means, and 207 metadata generation/synchronization/management means, which generates metadata, synchronizes it with the video and audio, and manages it.

Reference numeral 208 denotes AV signal compression/recording control means, which applies MPEG-2 or H.264 compression, converts the result into the recording format, and records it on the recording medium. Reference numeral 209 denotes a recording medium that also operates as a buffer memory. It holds an AV data file directory 210 containing AV data files, a title list/playlist/navigation data file directory 211 containing title list, playlist, and navigation data files, and a metadata file directory 212 containing metadata files.

Reference numeral 213 denotes a dictionary group that contains multiple field-specific dictionaries (dictionary A, dictionary B, dictionary C). Reference numeral 214 denotes addition/deletion management means for dictionary registration data, 215 image recognition means that identifies people and objects present in images in the video signal, 216 an image recognition database (describing the features of people, animals, and objects), 217 AV signal reproduction management means, 218 a clock, 219 management control means, and 220 AV signal output means.

The operation of FIG. 2 is as follows. The video and audio signals shot by the camera 101 are input to the AV signal input means 201. There, each signal is split into multiple paths and buffered (held temporarily), and then output to both the voice and image recognition unit means 202 and the AV signal compression/recording control means 208.

The voice and image recognition unit means 202 contains the voice recognition means 203 and the image recognition means 204, which respectively recognize the input audio and detect images contained in the input video.

Here, the voice recognition means 203 selects one dictionary from dictionaries A, B, C, and so on in the dictionary group 213, for example in response to a user button input, and performs speech recognition using the word data registered in the selected dictionary. Dictionaries A, B, and C are set up, for example, per sport (baseball, soccer, basketball) or per event (athletic meet, birthday party, wedding), with the vocabulary and number of registered words narrowed to what suits each field. Selecting the field before recognition starts reduces misrecognition and improves the recognition rate. Through the addition/deletion management means 214, field-specific dictionaries themselves can be added and deleted, and registered words can be added to or deleted from each dictionary. For example, parents can add the names of their own children and of acquaintances' children to the athletic meet dictionary. A child's name is then converted to text by speech recognition and recorded as metadata associated with (linked to) the video. At playback, specifying the child's name (for example, by selecting it from the list of registered metadata shown on the TV screen) gives quick access to the video in which the child appears. Performing speech recognition in these two stages, selecting a field and then registering keywords in a dictionary narrowed to that field, improves both recognition speed and accuracy.
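The two-stage idea (choose a field, then match only against that field's registered words) can be illustrated with the sketch below. It is not a speech recognizer: it assumes some recognizer has already produced candidate words and merely filters them against the selected dictionary. The class `DictionaryGroup`, its method names, and the sample words are all hypothetical.

```python
class DictionaryGroup:
    """Field-specific word lists, loosely modelled on dictionary group 213."""

    def __init__(self):
        self._fields = {}  # field name -> set of registered words

    def add_word(self, field, word):
        # Creates the field-specific dictionary on first use.
        self._fields.setdefault(field, set()).add(word)

    def remove_word(self, field, word):
        self._fields.get(field, set()).discard(word)

    def remove_field(self, field):
        self._fields.pop(field, None)

    def keywords_in(self, field, recognized_words):
        """Keep only the candidate words registered for the selected field."""
        vocabulary = self._fields.get(field, set())
        return [word for word in recognized_words if word in vocabulary]


dictionaries = DictionaryGroup()
for word in ("relay", "goal", "Taro"):  # "Taro" stands in for a child's name
    dictionaries.add_word("athletic_meet", word)
dictionaries.add_word("baseball", "home run")

hits = dictionaries.keywords_in("athletic_meet", ["Taro", "home run", "relay"])
```

With the athletic meet dictionary selected, `hits` contains only "Taro" and "relay"; the baseball term is ignored, which is the narrowing effect the text describes.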

The image recognition means 204 detects and recognizes meaningful images within a video frame or video field (hereinafter simply "video"), that is, within the unit of the video signal that makes up one picture. In this embodiment, an "image" means a meaningful object image within such a picture. The meaningful objects handled by the image recognition means 204 include people, human faces, animals (pets such as dogs and cats), vehicles such as cars, trains, and airplanes, structures such as houses and buildings, road scenes including signs, tourist attractions, and rural, mountain, and town scenery. Information about these objects is supplied from the image recognition database 216, which describes the features of people, animals, and objects. For example, to recognize a human face, the face is recognized in the video (video frame or video field) and the area in which it appears is identified as, for example, a rectangular or circular area.

The area in which a recognized face appears, such as a rectangular or circular area, is given metadata such as "non-specific person number 1" or "non-specific person number 123". Assigning the same number to the face areas of a person recognized as the same person across consecutive video reduces the count of non-specific person numbers. The count can also be reduced by assigning a number only when a face is detected for at least a set time, such as 1 second or 3 seconds; this excludes people who appeared only briefly, contrary to the photographer's intent. Metadata may also be generated at the moment the user presses the metadata creation button. The count can be reduced further by assigning a number only when the face is at least a certain size, with the size depending on its position on screen. For example, with a VGA picture (640 pixels wide, 480 pixels high), a face in the middle region of the screen (320 pixels wide, 240 pixels high) is detected only if the face area is at least 60 pixels in both height and width, while a face in the edge region is detected if it is at least 40 pixels in both height and width. This improves the detection (computation) speed and accuracy for people recognized in the middle of the screen.
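The size and duration filters from the VGA example can be sketched as below. The thresholds (60 and 40 pixels, a 320 x 240 middle region, a 1 second minimum) come from the text; treating the middle region as centered in the frame and classifying a face by the position of its center point are assumptions made for this sketch, and the function names are hypothetical.

```python
FRAME_W, FRAME_H = 640, 480    # VGA picture, as in the text
CENTER_W, CENTER_H = 320, 240  # middle region; assumed to be centered
MIN_CENTER, MIN_EDGE = 60, 40  # minimum face width and height in pixels


def in_center(cx, cy):
    """True if the point (cx, cy) lies inside the assumed middle region."""
    x0 = (FRAME_W - CENTER_W) / 2
    y0 = (FRAME_H - CENTER_H) / 2
    return x0 <= cx < x0 + CENTER_W and y0 <= cy < y0 + CENTER_H


def face_is_large_enough(x, y, w, h):
    """Apply the position-dependent size threshold to a face box (x, y, w, h)."""
    cx, cy = x + w / 2, y + h / 2
    minimum = MIN_CENTER if in_center(cx, cy) else MIN_EDGE
    return w >= minimum and h >= minimum


def seen_long_enough(first_seen, last_seen, min_seconds=1.0):
    """Assign a person number only after the face persists for min_seconds."""
    return last_seen - first_seen >= min_seconds
```

A detector would call both checks before issuing a new non-specific person number, so that brief or small detections do not inflate the count.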

As described above, the voice and image recognition unit means 202 passes the text information obtained by speech and image recognition to the metadata generation/synchronization/management means 207 as metadata. The metadata generation/synchronization/management means 207 receives time information from the clock 218 and, working with the AV signal compression/recording control means 208, performs time management (time synchronization) of the video, audio, and metadata.

The information input to the metadata generation/synchronization/management means 207 includes not only metadata from the voice and image recognition unit means 202 but also metadata from the user information input means 205, which accepts button input from controls such as the camera's important-scene setting button and still-image capture button, and from the sensor information input means 206, which accepts parameters representing the camera's operating state. Examples of such parameters are the camera's location obtained from GPS, a mobile phone position sensor, or an acceleration sensor; the camera's orientation and tilt (elevation angle); the type of lens 102 in use on the camera 101; and exposure information such as zoom magnification and aperture.

In the AV signal compression/recording control means 208, the input video signal, audio signal, and metadata are linked so that related video, audio, and metadata are associated with one another. The linking is based on the time information (time code) at which the video, audio, and metadata were generated; video-frame or video-field accuracy is sufficient. Alternatively, the linking (association) may use data position information within the stream or file.
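One way to picture time-code-based linking is to quantize each metadata timestamp to a frame index and keep the entries sorted by frame. The sketch below does this under the assumption of a fixed 30 frames per second; the frame rate, the `MetadataTrack` class, and its methods are hypothetical and are not specified in the text.

```python
from bisect import bisect_right

FPS = 30  # assumed frame rate; the text only requires frame or field accuracy


def to_frame(seconds, fps=FPS):
    """Quantize a timestamp in seconds to a frame index."""
    return int(seconds * fps)


class MetadataTrack:
    """Metadata entries keyed by frame index and kept in time order."""

    def __init__(self):
        self._frames = []  # sorted frame indices
        self._items = []   # metadata text, parallel to _frames

    def add(self, seconds, text):
        frame = to_frame(seconds)
        position = bisect_right(self._frames, frame)
        self._frames.insert(position, frame)
        self._items.insert(position, text)

    def find(self, text):
        """Frame indices at which the given metadata text occurs."""
        return [f for f, t in zip(self._frames, self._items) if t == text]


track = MetadataTrack()
track.add(2.5, "Taro")
track.add(0.1, "important")
track.add(10.0, "Taro")
```

Cueing to a keyword then reduces to looking up its frame indices and seeking the video to the first of them.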

The AV signal compression/recording control means 208 incorporates MPEG-2 (or ITU-T H.262) and H.264/AVC compression engines for video, and MPEG-2 AAC (Advanced Audio Coding) and MPEG-1 Layer 3 (MP3) compression engines for audio. The compression engines are not limited to these; DV (IEC 61834), DVCPRO (SMPTE 314M), DivX Video (www.divx.com), XviD, WMV9 (Windows (registered trademark) Media Video 9) (www.microsoft.com), or other formats can also be selected. In this embodiment, H.264/AVC (hereinafter AVC) is selected for video compression and MPEG-2 AAC (hereinafter AAC) for audio compression.

The AVC and AAC methods selected in this embodiment are described next. FIG. 3 shows in more detail the configuration of the video and audio compression engines and their peripheral processing means within the AV signal compression/recording control means 208 of FIG. 2. The main components in FIG. 3 are a video encoding unit 301, a VCL (Video Coding Layer) NAL (Network Abstraction Layer) unit buffer 302, an AAC audio encoding unit 303, a PS (Parameter Set) buffer 304, a VUI (Video Usability Information) buffer 305, an SEI (Supplemental Enhancement Information) buffer 306, a non-VCL NAL unit buffer 307, and MPEG-TS mapping processing means 308. As FIG. 3 shows, the input video signal is converted into data in VCL NAL unit format. The audio signal, metadata, externally input PS data, externally input VUI data, and externally input SEI data are converted into data in non-VCL NAL unit format. The VCL NAL unit data and non-VCL NAL unit data are then converted into MPEG-2 TS format and output. For an explanation of H.264/AVC see, for example, "H.264/AVC Textbook", supervised by Sakae Okubo, published by Impress Corporation. The MPEG-TS (Moving Picture Experts Group, Transport Stream) signal is specified in IEC 61883-4. An MPEG-TS is a sequence of MPEG transport packets (TS packets). A TS packet is a fixed-length packet of 188 bytes. This length was chosen for compatibility with the ATM cell length (an ATM cell is 53 bytes, of which 47 bytes are payload, so 188 = 4 × 47) and for suitability for error correction coding such as Reed-Solomon codes.
A TS packet consists of a fixed-length 4-byte packet header followed by a variable-length adaptation field (adaptation_field) and payload. The packet header defines the PID (packet identifier) and various flags; the PID identifies the type of TS packet. A packet may contain only an adaptation_field, only a payload, or both, and which of these applies is indicated by the adaptation_field_control flag in the packet header. The adaptation_field carries information such as the PCR (Program_Clock_Reference) and provides stuffing within the packet to pad it to the fixed 188-byte length. In MPEG-2, the PCR is a 27 MHz time stamp, and the decoder refers to the PCR value to reproduce the encoder's reference time in its STC (System Time Clock). The clock for the time stamp added to each TS packet is, for example, equal to the MPEG system clock frequency. The packet transmitting apparatus further includes clock recovery means that receives TS packets and, using the time stamps added to the received packets, removes the transmission jitter that network transmission of the MPEG-TS adds to the Program Clock Reference (PCR), thereby recovering the MPEG system clock.
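The header fields mentioned above can be read from the first four bytes of a packet. The sketch below follows the standard MPEG-2 Systems bit layout (sync byte 0x47, 13-bit PID, 2-bit adaptation_field_control, 4-bit continuity counter); the sync byte value and the exact bit positions come from that standard rather than from this document, and the function name and sample packet are made up for illustration.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47  # from the MPEG-2 Systems standard, not stated in this text


def parse_ts_header(packet: bytes) -> dict:
    """Decode the 4-byte header of a single transport stream packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a TS packet is exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing sync byte 0x47")
    adaptation_field_control = (packet[3] >> 4) & 0x3
    return {
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "has_adaptation_field": bool(adaptation_field_control & 0x2),
        "has_payload": bool(adaptation_field_control & 0x1),
        "continuity_counter": packet[3] & 0x0F,
    }


# Hand-built packet: PID 0x0100, payload only, continuity counter 7.
example = bytes([0x47, 0x41, 0x00, 0x17]) + bytes(184)
header = parse_ts_header(example)
```

The two `has_*` flags correspond to the cases in the text where only the adaptation field, only the payload, or both are present.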
In MPEG-2 TS, the decoder's STC is synchronized to the PCR by a PLL. To stabilize this PLL synchronization, the MPEG standard requires the PCR transmission interval to be 100 msec or less. An MPEG PES packet containing an individual stream such as video or audio is divided among the payloads of multiple TS packets that share the same PID and is then transmitted. The head of a PES packet is arranged to start at the head of a TS packet.
Because a transport stream can carry multiple multiplexed programs, it uses table information that describes the relationship between each program in the stream and the program elements, such as video and audio streams, that make up that program. This table information is called PSI (Program Specific Information) and uses tables such as the PAT (Program Association Table) and the PMT (Program Map Table). PSI such as the PAT and PMT is placed in TS packet payloads in units called sections and transmitted.
The PAT specifies, for each program number, information such as the PID of the corresponding PMT, and the PMT lists the PIDs of the video, audio, additional data, and PCR that belong to the corresponding program. By referring to the PAT and PMT, the TS packets that make up a target program can therefore be extracted from the stream. TS is explained in, for example, CQ Publishing, TECH I Vol. 4, "All About Image and Audio Compression Technology (Essential Technology for the Age of Internet/Digital Television and Mobile Communication)", supervised by Hiroshi Fujiwara, Chapter 6, "The MPEG System for Multiplexing Images and Audio".
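As a rough illustration of the PAT lookup described above, the following Python sketch maps program numbers to PMT PIDs from a single PAT section. It assumes the section has already been reassembled from TS packet payloads, and it does not verify the CRC_32; the function name `parse_pat` is hypothetical.

```python
def parse_pat(section: bytes) -> dict:
    """Map program_number -> PMT PID from one PAT section (CRC_32 is not checked)."""
    if section[0] != 0x00:
        raise ValueError("table_id 0x00 expected for a PAT")
    # section_length counts the bytes that follow the length field, CRC_32 included.
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    loop_end = 3 + section_length - 4  # stop before the 4-byte CRC_32
    programs = {}
    for i in range(8, loop_end, 4):  # the program loop starts after 8 header bytes
        program_number = (section[i] << 8) | section[i + 1]
        pid = ((section[i + 2] & 0x1F) << 8) | section[i + 3]
        if program_number != 0:  # program_number 0 points at the network PID instead
            programs[program_number] = pid
    return programs


# Usage: one program (number 1) whose PMT is carried on PID 0x1000.
pat = bytes([0x00, 0xB0, 0x0D, 0x00, 0x01, 0xC1, 0x00, 0x00,
             0x00, 0x01, 0xF0, 0x00,
             0x00, 0x00, 0x00, 0x00])  # placeholder CRC_32
print(parse_pat(pat))
```

The PMT found on the returned PID would then be parsed in a similar way to obtain the video, audio, and PCR PIDs of the program.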

The logical hierarchical structure of PSI and SI, examples of processing procedures, and examples of channel selection processing are explained in "Channel Selection Technology in Digital Broadcast Receivers", Miyake et al., Sanyo Electric Technical Report, Vol. 36, June 2004, No. 74, pp. 31-44.

The metadata is input to the SEI buffer 306, where it is stored in the User Data Unregistered SEI message. Besides the metadata described above, the types of metadata include metadata produced by converting general data into metadata, metadata obtained from the SI (Service Information; program arrangement information) of received digital broadcasts, metadata such as EPG information obtained from EPG providers, metadata such as EPG obtained from the Internet, and metadata associated with AV content (still images, audio, and video such as clips) that an individual has shot with a movie camera. As for metadata formats, the UPnP and UPnP-AV standard specifications, for example, define properties and attributes; the specifications are published at http://upnp.org, and the metadata can be expressed in description languages such as XML (Extensible Markup Language) and BML (Broadcast Markup Language). The specifications published at http://upnp.org include, for example, "Device Architecture V1.0", "ContentDirectory:1 Service Template Version 1.01", and, in relation to "MediaServer V1.0 and MediaRenderer V1.0", the documents "MediaServer V1.0", "MediaRenderer V1.0", "ConnectionManager V1.0", "ContentDirectory V1.0", "RenderingControl V1.0", "AVTransport V1.0", and "UPnP(TM) AV Architecture V.83". As for metadata standards, there are metadata formats defined by EBU P/Meta, the SMPTE KLV scheme, TV Anytime, MPEG-7, and others, which are explained in, for example, "Journal of the Institute of Image Information and Television Engineers, Vol. 55, No. 3, Standardization Trends of Metadata for Information Retrieval".
In addition, so that the person shooting with a movie camera or the like, the content creator, or the content copyright holder can assign a value to each piece of metadata and collect usage fees according to how, and how often, users use the content, each piece of metadata can be associated with further metadata that gives it a value. This value-giving metadata may be given as an attribute of the metadata concerned or as an independent property. For example, when information on the recording device and recording conditions (that is, the device ID of the movie camera), or metadata created and registered by the photographer, the content creator, or the content copyright holder, is considered valuable enough to require a license, the present invention can incorporate a configuration that runs an authentication-based licensing process before the metadata is used.
For example, a user creates an encrypted file of video content that the user has shot and uploads the encrypted file to a server on the Internet. A description of the encrypted file and some of its images can then be published so that people who like the content can purchase it. Likewise, when a valuable news source has been recorded, it can be put up for auction among the news departments of multiple broadcasting stations.
Using this metadata enables efficient use of content: searching for desired content among many AV contents, classifying content into a library, extending recording time, performing automatic display, selling content, and so on. Recording time can be extended by, for example, lowering the resolution of low-value video content, keeping only audio and still images (for example, extracted MPEG I pictures or H.264 IDR pictures), or keeping only still images.

Returning now to FIG. 2, the MPEG-TS signal generated by the AV signal compression/recording control means 208 is recorded (or temporarily stored) in the AV data file directory 210 in the recording medium (or buffer memory) 209. Using a semiconductor memory, an optical disc (DVD-RAM, DVD-R, BD, etc.), or an HDD (hard disk drive) as the recording medium (or buffer memory) 209 enables quick access and makes it easy to modify or add part of the data, for example metadata. The AV signal compression/recording control means 208 records the title of this MPEG-TS signal in the title list/playlist/navigation data file directory 211 in the recording medium 209, and records the metadata of this MPEG-TS signal in the metadata directory 212 in the recording medium 209.

Next, a method is described for identifying who is shown in an image detected by the image recognition means 204, for example a person's face, within the AV data files recorded on the recording medium 209. The metadata file 212 records meta information indicating in which AV data file, in which picture (video frame or video field), and at which position the image detected by the image recognition means 204 is located. The AV signal reproduction control means 217 receives from the metadata file the information on the data position at which a person is to be identified, and reads the corresponding image data from the AV data file. The image data is then input to the image recognition means 215, which identifies the person. The image recognition means 215 uses the image recognition database 216 (a database describing the features of people, animals, and objects) to determine who the queried person is, and adds the determination result to the metadata file.
If the original metadata is the "non-specific person number 123" mentioned above, a person determination result, for example "Jiro Tanaka", is added. If the user later finds that the person's name does not match the person's face, the user can correct it to the right name, "Ichiro Tanaka", by button input through the management control means 219. Button input can also provide character entry, as is common on recent mobile phones. The number of persons in one picture (video frame or video field) is not limited to one; any number of persons can be detected as long as each is at least the minimum size of the detection area. That is, multiple person names such as "Ichiro Tanaka", "Ayuko Suzuki", and "Natsuko Kato" can coexist in one picture. Depending on the setting of the metadata generation/synchronization/management means 207, the number of persons per picture (video frame or video field) can be limited to a specific number, for example five, for a specific file or for a scene within a file. This reduces both the data volume of the person data and the processing load.
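The per-picture person metadata described above, including the later name correction and the cap on persons per picture, could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, and the patent does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field


@dataclass
class PersonTag:
    file_name: str    # AV data file containing the picture
    frame_index: int  # video frame (or field) number
    box: tuple        # (x, y, width, height) of the detected face
    name: str         # e.g. "non-specific person number 123" until identified


@dataclass
class PictureMetadata:
    max_persons: int = 5  # limit set by the management means (assumed default)
    persons: list = field(default_factory=list)

    def add(self, tag: PersonTag) -> bool:
        """Add a tag unless the per-picture limit is already reached."""
        if len(self.persons) >= self.max_persons:
            return False
        self.persons.append(tag)
        return True

    def rename(self, old_name: str, new_name: str) -> int:
        """Correct a name (e.g. after user confirmation); return how many tags changed."""
        changed = 0
        for tag in self.persons:
            if tag.name == old_name:
                tag.name = new_name
                changed += 1
        return changed


# Usage: recognition proposes "Jiro Tanaka"; the user corrects it to "Ichiro Tanaka".
picture = PictureMetadata()
picture.add(PersonTag("clip001.m2ts", 1200, (40, 60, 96, 96), "Jiro Tanaka"))
picture.rename("Jiro Tanaka", "Ichiro Tanaka")
print([p.name for p in picture.persons])
```

Rejecting tags beyond the limit is one possible policy; a real device might instead keep the largest or most confident detections.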

When the object to be recognized by the image recognition means 215 is not a human face but text, an animal, a car, or the like, the image is recognized and identified by referring to the corresponding database file in the image recognition database 216 (a database describing the features of people, animals, and objects), and the result is added to the metadata file.

For example, when a person is recognized, the clothes, tie, glasses, hat, watch, and shoes the person is wearing, and the bag or briefcase the person is carrying, can be registered in the image database and associated with that person, so that searches answering queries can be run easily. In this case, the person can be defined as a UPnP property, and the clothes, tie, glasses, hat, watch, shoes, and bag or briefcase can be defined as attributes of the person property.

Furthermore, by providing means for automatically recognizing things subject to copyright, such as music, people, and people's movements, the copyright-related items of each scene making up a complete package (completed packet, that is, the finished content edited from the content material) can be retrieved and displayed, and metadata such as the copyright manager, contact information, usage conditions, and copyright fees can be added to each copyright-related item. Thus, if the shot content requires copyright clearance, a list of the clearances needed for the editing material and the complete package can be created easily, which promotes the reuse of content.

When playing back video from the recording medium 209, the user accesses the AV signal reproduction control means 217 through the management control means 219 and selects a file to play from the recorded file titles. If the user cannot tell which of the multiple AV files should be played, the user enters a search keyword through the user interface (button input) of the management control means 219 and queries the AV signal reproduction control means 217. The AV signal reproduction control means 217 searches the title list/playlist/navigation data file directory 211 and the metadata file directory 212 for metadata that fully or partially matches the entered keyword, and superimposes the result, for example text information and a thumbnail of the corresponding video, on the output video of the AV signal output means 220. The user can thus check the search results on the TV screen as pairs of text and thumbnails. The search results may also be presented on an ordinary display such as the viewfinder attached to the camera.
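The keyword search over titles and metadata can be sketched in a few lines of Python. The sketch interprets "fully or partially matches" as a case-insensitive substring test, which is an assumption; the record layout and the names `MetadataEntry` and `search_metadata` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MetadataEntry:
    file_name: str  # AV file the entry belongs to
    timecode: str   # position of the scene, e.g. "00:01:23:10"
    text: str       # title or metadata text


def search_metadata(entries, keyword: str):
    """Return the entries whose text contains the keyword (case-insensitive)."""
    key = keyword.casefold()
    return [entry for entry in entries if key in entry.text.casefold()]


# Usage: each hit would be shown as a text and thumbnail pair on the output video.
entries = [
    MetadataEntry("clip001.m2ts", "00:00:10:00", "Ichiro Tanaka, sports day, relay"),
    MetadataEntry("clip002.m2ts", "00:03:05:12", "Ayuko Suzuki, piano recital"),
]
for hit in search_metadata(entries, "tanaka"):
    print(hit.file_name, hit.timecode, hit.text)
```

The returned file name and timecode are what the playback side would need in order to fetch a thumbnail and cue the scene.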

After identifying the AV data to be played from the title list or metadata search results described above, the user calls up and plays the target AV file. To jump straight to a particular scene in the AV file, the user can, through the management control means 219, use the metadata linked to the AV signal to access and play the desired scene immediately.
(Embodiment 2)
Next, a second embodiment of the present invention is described. Parts identical to Embodiment 1 are not described again; only the differences are described. FIG. 4 is an explanatory diagram of the second embodiment. In FIG. 4, a playlist generation/management means 401 is newly added.

The playlist generation/management means 401 selects arbitrary AV signal portions from multiple AV files and freely combines (edits) them to generate a new AV file. To do so, for the files that the user specifies by button input at the management control means 219, the scenes specified by metadata and the scenes that are not specified by metadata but that the user considers important are displayed as thumbnails along the time axis on the output signal of the AV signal output means 220 (see 112 in FIG. 1). Each thumbnail is the first picture (or a representative picture) of a video clip of a specific length, such as a length specified by the user. Looking at the displayed thumbnails, the user chooses the video clips to edit, selects the scenes to use from each clip, and rearranges them to generate a new video file. What this operation actually generates is a so-called playlist, a combination of clip cut-out position information for the files, and this playlist is registered in the title list/playlist/navigation data file directory 211. Using a playlist in this way makes it possible to create, virtually, a compact file with no superfluous AV signal.
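Because the playlist is only a combination of clip cut-out positions, it can be represented without touching the AV data at all. The following Python sketch shows one possible representation; the names `ClipRef`, `playlist_frames`, and `reorder`, and the use of frame numbers as cut-out positions, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipRef:
    file_name: str  # source AV file
    in_frame: int   # first frame of the cut-out (inclusive)
    out_frame: int  # end of the cut-out (exclusive)


def playlist_frames(playlist) -> int:
    """Total number of frames that playing the playlist would present."""
    return sum(clip.out_frame - clip.in_frame for clip in playlist)


def reorder(playlist, order):
    """Return a new playlist with the clips arranged in the given index order."""
    return [playlist[i] for i in order]


# Usage: two scenes from different files, then swapped by the user.
playlist = [
    ClipRef("clip001.m2ts", 300, 750),
    ClipRef("clip002.m2ts", 0, 150),
]
edited = reorder(playlist, [1, 0])
print(playlist_frames(edited), [clip.file_name for clip in edited])
```

Only these small records need to be stored or exchanged; the source files stay unchanged, which is what makes the resulting "file" virtual.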

When playing back video from the recording medium 209, the user accesses the AV signal reproduction control means 217 through the management control means 219 and selects a file to play from the recorded file titles and playlists. If the user cannot tell which title or playlist among the multiple AV files should be played, the user enters a search keyword through the user interface (button input) of the management control means 219 and queries the AV signal reproduction control means 217. The AV signal reproduction control means 217 searches the title list/playlist/navigation data file directory 211 and the metadata file directory 212 for metadata that fully or partially matches the entered keyword, and superimposes the result, for example text information and a thumbnail of the corresponding video, on the output video of the AV signal output means 220. The user can thus check the search results on the TV screen as pairs of text and thumbnails. The search results may also be presented on an ordinary display such as the viewfinder attached to the camera.

In addition to the title list and metadata search results described above, the AV data to be played can now also be identified from a playlist; once it is identified, the target AV file is called up and played. To jump straight to a particular scene in the AV file, the user can, through the management control means 219, use the metadata associated with the AV signal to access and play the desired scene immediately.
(Embodiment 3)
Next, a third embodiment of the present invention is described. Parts identical to Embodiment 2 are not described again; only the differences are described. FIG. 5 is an explanatory diagram of the third embodiment. In FIG. 5, a playlist output means 501 is newly added. The playlist output means 501 outputs the playlist that was generated in the second embodiment and registered in the title list/playlist/navigation data file directory 211.

When playing back video from the recording medium 209, the user accesses the AV signal reproduction control means 217 through the management control means 219 and selects a file to play from the recorded file titles and playlists. If the user cannot tell which title or playlist among the multiple AV files should be played, the user enters a search keyword through the user interface (button input) of the management control means 219 and queries the AV signal reproduction control means 217. The AV signal reproduction control means 217 searches the title list/playlist/navigation data file directory 211 and the metadata file directory 212 for metadata that fully or partially matches the entered keyword, and superimposes the result, for example text information and a thumbnail of the corresponding video, on the output video of the AV signal output means 220. The user can thus check the search results on the TV screen as pairs of text and thumbnails. The search results may also be presented on an ordinary display such as the viewfinder attached to the camera.

After identifying the AV data to be played from the title list, playlists, metadata search results, and so on described above, the user outputs the target playlist from the playlist output means.

The benefit of outputting a playlist in this way is explained with reference to FIG. 6. In FIG. 6, user 1 and user 2 are assumed to be connected via a network. The network may be of any type: an IP-based home network or the Internet.

Consider the case where user 2 uses the remote control 609 of the local TV 608 to access the remote movie camera 101 over the network. When user 2 remotely views a recorded file in the movie camera 101, if the AV signal can be viewed according to a playlist in the movie camera 101, the amount of data transferred is smaller because the playlist-defined file contains none of the surplus pre-edit data, and the load on the devices and the network is reduced. In other words, an AV signal edited by removing unneeded portions from the raw footage shot with the movie camera can be viewed more efficiently.

User 1 can also upload AV files, the metadata for those AV files, and playlists to the server 606 in the network 605, to the AV recorder 602 local to user 1, or to the AV recorder 607 local to user 2. Another user (for example, user 3) can then access the server 606, the AV recorder 602, or the AV recorder 607 and efficiently send and receive AV signals that follow a playlist.

If user 2 or another user (for example, user 3) comes up with a playlist different from the one viewed, that user generates the new playlist and uploads it to the server 606, the AV recorder 602, or the AV recorder 607. In this way, many edited titles (playlists) can be generated from a limited number of AV signals (contents), and the AV signals (contents) can be enjoyed from various points of view. So-called network-based video editing and production can also be carried out collaboratively over the network.

Using playlists also gives rise to other applications. For example, the AV signal reproduction control means 217 of the movie camera 101 converts the AV signal to a low-resolution AV signal and outputs it to the mobile phone 601 together with the metadata; video editing is performed on the mobile phone 601 using the metadata, and the resulting EDL (or playlist) is sent to the movie camera 101. The TV 603 can then access the movie camera 101, select the playlist, and view a cleanly edited AV signal from which unneeded portions of the raw footage have been removed.

Also, by uploading (or downloading) the AV signal that follows a playlist to the AV recorder 602, the server 606, or the AV recorder 607, users connected to the network can efficiently view, over the network, AV content that has been edited to a higher degree of polish.

The present invention can also be extended to viewing AV content using playlists. For example, if the network 605 in FIG. 7 is the Internet, user 1 can publish a blog-style site (blog is another name for Weblog) on the server 606 on the Internet and notify the multiple users who have accessed and registered with that blog site of additions and updates to the AV content in RSS (RDF Site Summary) format. Here, user 1 publishes AV content together with multiple playlists for that content. Each playlist carries a description, for example digest version, simplified version, complete version, or raw pre-edit content, and users viewing the AV content can choose the playlist they prefer. This can be regarded as a system that extends EPG (Electronic Program Guide) distribution in digital broadcasting to broadcast distribution media on the Internet. The digest version suits paid or free content distribution over One-Seg broadcasting or to mobile phones, while the simplified version, complete version, and pre-edit content suit paid or free content distribution on the Web.
Because user 1 can reach many users through the Internet, even a small business or an individual can open an Internet-based audio or video broadcasting station. Users can also view AV content on the Internet by using so-called podcasting, a mechanism that automatically collects AV content on the Internet by means of RSS feeds.

Furthermore, the movie camera 101 can be given the ability to connect to the Internet and act as a server (a movie camera 101 with an IP network connection function). In this case, user 1 can distribute the content being shot live over the Internet together with its metadata. That is, user 1 quickly turns the audio and video of the live shoot into metadata by voice recognition, image recognition, or button input and publishes it on the Internet via RSS as an XML document, thereby delivering a live broadcast with metadata commentary to the whole world.

(Embodiment 4)
Next, a fourth embodiment of the present invention is described. Parts identical to Embodiment 3 are not described again; only the differences are described. FIG. 7 is an explanatory diagram of the fourth embodiment. In FIG. 7, a metadata time correction means 701 is newly added. The playlist output means 501 outputs the playlist that was generated in the second embodiment and registered in the title list/playlist/navigation data file directory 211.

When playing back video from the recording medium 209, the user accesses the AV signal reproduction control means 217 through the management control means 219 and selects a file to play from the recorded file titles and playlists. If the user cannot tell which title or playlist among the multiple AV files should be played, the user enters a search keyword through the user interface (button input) of the management control means 219 and queries the AV signal reproduction control means 217. The AV signal reproduction control means 217 searches the title list/playlist/navigation data file directory 211 and the metadata file directory 212 for metadata that fully or partially matches the entered keyword, and superimposes the result, for example text information and a thumbnail of the corresponding video, on the output video of the AV signal output means 220. The user can thus check the search results on the TV screen as pairs of text and thumbnails. However, if there is a time offset between the metadata and the thumbnail that the person who shot the video did not intend, the offset must be corrected before editing or viewing.
The user therefore corrects (trims) the time offset between the metadata and the thumbnail in units of video frames or fields, by button input at the management control means 219 while looking at the thumbnail for the specified metadata. The metadata time correction means 701 then adjusts the time information (time code or data position information) of the video signal associated with the metadata by the amount the user specified, and associates the metadata with the video signal using the new time information. Users such as the camera operator or the director of photography can thus synchronize metadata and video exactly as intended. This improves the efficiency of AV signal editing and allows more precise and sophisticated video expression.
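The frame-level time correction described above amounts to shifting the time information attached to a metadata item by a user-specified number of frames. A minimal Python sketch follows; it assumes that positions are stored as frame numbers and that the video runs at 30 frames per second without drop-frame time code, and the names `MetadataTag`, `correct_tag_time`, and `to_timecode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataTag:
    text: str   # the metadata itself, e.g. a keyword
    frame: int  # frame number the metadata is associated with


def correct_tag_time(tag: MetadataTag, offset_frames: int) -> MetadataTag:
    """Re-associate the tag with a frame shifted by offset_frames (never below 0)."""
    return MetadataTag(tag.text, max(0, tag.frame + offset_frames))


def to_timecode(frame: int, fps: int = 30) -> str:
    """Format a frame number as HH:MM:SS:FF (non-drop-frame, assumed 30 fps)."""
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


# Usage: the tag was entered 10 frames late, so the user trims it back by 10 frames.
tag = MetadataTag("goal", 100)
fixed = correct_tag_time(tag, -10)
print(to_timecode(tag.frame), "->", to_timecode(fixed.frame))
```

Returning a new tag instead of modifying the old one keeps the original association available, for example to undo a correction.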

Further, the present invention enables playlist distribution for broadcast programs as a new way of using playlists. For example, when the movie (camcorder) 101 has a built-in TV tuner and records a TV broadcast, user 1 edits the recorded TV program as described above and generates a playlist for it. In this case, user 1 publishes only the title of the recorded program and the playlist to a server on the Internet. If another general user (for example, user 123) has recorded the same program as user 1, user 123 can download the playlist generated by user 1 and watch his or her own recording of the TV program arranged into a storyline, previously unknown to user 123, that follows that playlist. For example, playlists can be generated for digest playback of sports programs, for headline playback of news, or for collecting only the commercials (CM). The issue here is time synchronization, but with current technology the clocks of the movie, the server, or an AV recorder can be aligned to within one video frame. For example, in Japanese digital broadcasting, common time information can be generated from the TOT (Time Offset Table) signal defined in the ARIB standard. For analog broadcasting, the time can be determined from a standard time radio signal or from characteristics of the received video frames and audio.
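How a shared playlist might be applied to a different user's recording can be sketched as follows. The sketch assumes that playlist segments are expressed in common broadcast time (seconds, for example derived from the TOT) and that each recorder knows the broadcast time of its first and last recorded frames; the function name `to_local_frames`, the 30 frames-per-second rate, and the sample values are illustrative and do not come from the description.

```python
FPS = 30  # assumed frame rate


def to_local_frames(playlist, recording_start, recording_end, fps=FPS):
    """Map (start, end) broadcast times in seconds to frame ranges
    within one local recording.

    Segments are clipped to the recorded interval, and segments that
    fall entirely outside it are dropped.
    """
    ranges = []
    for start, end in playlist:
        s = max(start, recording_start)
        e = min(end, recording_end)
        if s < e:
            ranges.append((round((s - recording_start) * fps),
                           round((e - recording_start) * fps)))
    return ranges


# Playlist published by user 1, in common broadcast time (seconds).
shared = [(1000.0, 1010.0), (1200.0, 1215.5), (5000.0, 5010.0)]

# User 123 recorded the same program from t=990.0 s to t=4590.0 s.
local = to_local_frames(shared, 990.0, 4590.0)
print(local)
```

Because the playlist carries only time ranges and no video, it can be published and downloaded independently of the recorded content, which is the property the playlist distribution above relies on.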

Further, the present invention can be applied not only to TV recording but also to the creation of films, video content on the Internet, and content for mobile terminals.

The metadata can be attached to the content as text data, as binary data, or as a watermark.

The metadata can also be encoded as a watermark embedded in the image data, then recorded and played back or transmitted and received, and decoded for use afterwards. Although the above description takes recording and storage on a single medium as an example, the metadata and the video data may be stored separately on two or more media that are associated with each other. On any such associated medium, it is permissible to store only the metadata, only the video data, or both the metadata and the video data.
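The description does not specify a watermarking scheme. As one simple stand-in, the following sketch embeds metadata text in the least significant bits of 8-bit image samples and decodes it again. The functions `embed` and `extract` and the 16-bit length prefix are assumptions for illustration; a least-significant-bit mark of this kind would not survive lossy compression, so a practical system would need a more robust method.

```python
def embed(pixels: bytes, metadata: str) -> bytes:
    """Hide metadata in the lowest bit of each 8-bit sample.

    The payload is a 16-bit big-endian length followed by UTF-8 text.
    """
    payload = metadata.encode("utf-8")
    data = len(payload).to_bytes(2, "big") + payload
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for this metadata")
    out = bytearray(pixels)
    for idx, bit in enumerate(bits):
        out[idx] = (out[idx] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(out)


def extract(pixels: bytes) -> str:
    """Recover metadata text written by embed()."""

    def read_bytes(bit_offset: int, count: int) -> bytes:
        result = bytearray()
        for b in range(count):
            value = 0
            for i in range(8):
                value = (value << 1) | (pixels[bit_offset + b * 8 + i] & 1)
            result.append(value)
        return bytes(result)

    length = int.from_bytes(read_bytes(0, 2), "big")
    return read_bytes(16, length).decode("utf-8")


frame = bytes(range(256)) * 4        # stand-in for 1024 grayscale samples
marked = embed(frame, "scene:goal")  # encode before recording or transmission
print(extract(marked))               # decode after playback or reception
```

Each sample changes by at most one level, so the mark is not visible, and the metadata travels inside the image data without a separate file.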

Model diagram of the system of the present invention for camera shooting and for editing shot data using the shot data and metadata
Explanatory diagram of the first embodiment of the present invention
Explanatory diagram of the handling of metadata in H.264 compression
Explanatory diagram of the second embodiment of the present invention
Explanatory diagram of the third embodiment of the present invention
Model diagram showing an example in which the editing system of the present invention is applied to a network
Explanatory diagram of the fourth embodiment of the present invention

Explanation of symbols

101 Camera
102 Camera lens unit
103 Camera microphone
104 Subject being shot by the camera
105 Data shot by the camera
106 Video data
107 Audio data
108 Metadata
109 Data sequence shot by the camera
110 Remote control
111 Data sequence in which scene #1 through scene #5 are joined by editing
112 Television (TV)
113 Metadata input button

Claims

A metadata input device for a recording means of content containing any of video, audio, or data, comprising:
a field-specific dictionary selection means for selecting and adding a field-specific tag for classifying the recorded content;
a dictionary for each field selected by the field-specific dictionary selection means;
at least one of: a speech recognition means for converting speech contained in the content into character data by speech recognition referring to the dictionary; a speech recognition means for selecting externally input speech and converting it into character data by speech recognition referring to the dictionary; or an image recognition means for recognizing a person or object in the video contained in the content and converting it into character data; and
means for associating the character data with the content as metadata.

The metadata input device according to claim 1, wherein the field-specific dictionary selection means specifies the field by speech recognition referring to a dictionary that includes the fields as words, or by button input, keyboard input, or external interface input.

The metadata input device according to claim 1, wherein words to be registered in the dictionary are added, replaced, or deleted by button input, keyboard input, voice recognition of voice input from a microphone, or external interface input.

The metadata input device according to claim 1, wherein:
the content recording means includes a voice recognition operation button;
the voice recognition means includes means for attaching an identification flag to the output voice of the voice selection means when the voice recognition operation button is pressed; and
the voice recognition operation is performed on voice in a separately designated time range that includes the position bearing the identification flag.

The metadata input device according to claim 4, wherein the voice recognition operation on the voice in the designated time range including the position bearing the identification flag is executed asynchronously after recording, on the voice at the flagged position, according to the CPU computing capacity that the content recording means can allocate to the voice recognition means.

A metadata input device wherein the metadata recognized by the asynchronous execution after recording is added, replaced, or deleted by button input, keyboard input, voice recognition of voice input from a microphone, or external interface input.

The metadata input device according to claim 1, further comprising: detection means for detecting a human face, an animal, or an object contained in the video signal of the content; feature recognition means for recognizing features of the human face detected by the detection means; a dictionary of human face feature data; means for identifying a person from the facial features extracted by the feature recognition means, with reference to the dictionary of human face feature data, and converting the result into character data; and means for associating the character data with the content as metadata.

The metadata input device according to claim 8, wherein the human face detection means adds an individual identification flag to each detected human face.

The metadata input device according to claim 8, wherein the human face detection means tracks each detected human face and assigns to the same person a single identification flag together with information on the length of time for which the person was identified.

The metadata input device according to claim 7, wherein, if the person cannot be identified by the means for identifying a person and converting the result into character data, the metadata is stored as an unidentified person, and is registered as an identified person after person data is added by button input, keyboard input, voice recognition of voice input from a microphone, or external interface input.

The metadata input device according to claim 7, wherein, when new person data is entered into the dictionary of human face feature data, person recognition is performed on the unidentified persons or on persons having features similar to those of the new person data, and, if a person is recognized, the person information is set to the newly recognized person information.

The metadata input device according to claim 7, wherein:
the content recording means includes a face recognition operation button;
the human face detection means includes means for attaching an identification flag to the video when the face recognition operation button is pressed; and
the face recognition operation is performed on faces contained in the video in a separately designated time range that includes the position bearing the identification flag.

The metadata input device according to claim 12, wherein the face recognition operation on the video in the designated time range including the position bearing the identification flag is executed asynchronously after recording, on the faces contained in the video at the flagged position, according to the CPU computing capacity that the content recording means can allocate to the face recognition means.

A content processing apparatus comprising: means for calling up metadata generated by the metadata input device according to claim 1 by button input, keyboard input, voice recognition of voice input from a microphone, or external interface input; and means for calling up the video (frame or field) associated with the metadata and newly associating the metadata with another video (frame or field).

A content processing apparatus comprising means for assigning priorities to the metadata generated by the metadata input device according to claim 1 or claim 7, extracting videos, each of a separately specified length, in descending order of metadata priority, and generating content of a length specified by button input, keyboard input, voice recognition of voice input from a microphone, or external interface input.

A content processing apparatus comprising: means for generating an edit list using metadata generated by the metadata input device according to claim 1; and means for editing content in accordance with the edit list.

The content processing apparatus according to claim 14, wherein the length of the video associated with the metadata is specified by button input, keyboard input, voice recognition of voice input from a microphone, or external interface input.

The content processing apparatus according to claim 15, claim 16, or claim 17, further comprising means for outputting the generated content in a file format.

The content processing apparatus according to claim 19, further comprising: means for generating a content title, table of contents information, or information on the included metadata from the edit list used to generate the file-format content; and means for disclosing the content title, the table of contents information, the metadata included in the content, or the playlist of the content to other users via a network.

The content processing apparatus according to claim 19, wherein at least the playlist is published on a server on the Internet, and information related to the playback of AV content is notified to users accessing the server in RSS (RDF Site Summary) format.