JP2008060622A

JP2008060622A - Video editing system, video processing apparatus, video editing device, video processing method, video editing method, program, and data structure

Info

Publication number: JP2008060622A
Application number: JP2006231429A
Authority: JP
Inventors: Satoshi Tsujii; 訓辻井
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2006-08-29
Filing date: 2006-08-29
Publication date: 2008-03-13

Abstract

PROBLEM TO BE SOLVED: To generate metadata that holds, in synchronization with video data, a feature quantity of that video data or of other video data synchronized with it, or to edit video data on the basis of that metadata.

SOLUTION: In the video editing system, a video processing apparatus 110 generates metadata 103 from second video data 102 synchronized with first video data 101. The metadata 103 holds the feature quantity of the second video data 102 in synchronization with the first video data 101 and manages it on the time axis. A position search unit 122 of a video editing device 120 searches the metadata 103 for positions that meet an extraction condition and generates an edit list 104. A video extraction unit 123 extracts video from the first video data 101 in accordance with the edit list 104 to generate edited video data 105.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a video editing system. In particular, it relates to a video editing system and a video processing apparatus that generate metadata indicating feature quantities in video data, to a video editing device that edits video data on the basis of that metadata, to the processing methods used in these, to a program that causes a computer to execute those methods, and to a data structure used by them.

The QuickTime file format is known as one recording format used in recording/reproducing apparatuses. It is a file format for handling multimedia data: the actual video data (video and audio) is held in a media data atom (also called movie data), and its management information is held in a movie atom (also called a movie resource). This allows video data to be edited "non-destructively", without directly modifying the actual data. File formats based on the QuickTime file format include the ISO Base Media file format and its application formats, such as the MPEG4 (MP4) file format, the MJ2 (Motion JPEG2000) file format, and the AVC (Advanced Video Coding: MPEG4-part10) file format.
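To illustrate the split between the media data atom and the movie atom, the following is a minimal sketch of walking the top-level atoms of a QuickTime-style byte stream. It relies only on the general atom layout (a 32-bit big-endian size followed by a four-character type, with `mdat` and `moov` as the type codes of the media data atom and the movie atom). The function names `iter_atoms` and `make_atom` and the sample bytes are hypothetical and are not taken from this publication.

```python
import struct
from typing import Iterator, Tuple


def iter_atoms(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_offset, payload_size) for each top-level atom."""
    pos = 0
    while pos + 8 <= len(data):
        size, kind = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit extended size follows the type field.
            if pos + 16 > len(data):
                break
            (size,) = struct.unpack(">Q", data[pos + 8:pos + 16])
            header = 16
        elif size == 0:
            # A size of 0 means the atom extends to the end of the data.
            size = len(data) - pos
        if size < header or pos + size > len(data):
            break  # malformed or truncated atom: stop rather than guess
        yield kind.decode("latin-1"), pos + header, size - header
        pos += size


def make_atom(kind: bytes, payload: bytes) -> bytes:
    """Build a simple atom with a 32-bit size field (illustration only)."""
    return struct.pack(">I4s", 8 + len(payload), kind) + payload


# Hypothetical sample: actual data in 'mdat', management information in 'moov'.
sample = make_atom(b"mdat", b"\x00" * 10) + make_atom(b"moov", b"\x01" * 4)
atoms = list(iter_atoms(sample))
```

Because editing information lives in the movie atom, an editor following this layout can rewrite `moov` while leaving the `mdat` payload untouched.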

In a file format of this kind, which is divided into an actual-data storage part and a management-information storage part, a known editing method uses a data structure called an edit atom, which refers to the original data from outside and manages the time axis for playback. For example, a video recording apparatus has been proposed that uses such an edit atom to attach marks non-destructively to video being recorded (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Laying-Open No. 2005-303943 (FIG. 5)

However, a movie file built from edit atoms stores only a single final editing result, so editing under different conditions requires repeating the processing from the beginning. For example, when some feature quantity in the video data, or in other video data synchronized with it, is analyzed and the video is edited on the basis of the analysis result, the feature analysis must be redone every time the editing condition changes, which is inefficient.

Accordingly, an object of the present invention is to generate metadata that holds, in synchronization with video data, a feature quantity of that video data or of other video data synchronized with it, or to edit video data on the basis of such metadata.

The present invention has been made to solve the above problem. A first aspect of the invention is a video processing apparatus comprising: video acquisition means for acquiring second video data synchronized with first video data that is managed in time series; video analysis means for analyzing a feature quantity in the second video data; and metadata generation means for generating metadata that holds the feature quantity in synchronization with the first video data. As a result, the feature quantity of the second video data, which is synchronized with the first video data, is held in metadata synchronized with the first video data.

In this first aspect, the feature quantity may be a facial expression contained in the second video data. Facial expressions can be classified into types such as laughing, surprised, angry, and sleepy.

In this first aspect, the apparatus may further comprise first imaging means for capturing the first video data, and the video acquisition means may include second imaging means for capturing the second video data at the same time as the first video data is captured. As a result, video relating to the first video data being captured is captured as the second video data.

In this first aspect, the apparatus may further comprise reproduction means for reproducing the first video data, and the video acquisition means may include imaging means for capturing the second video data at the same time as the reproduction means reproduces the first video data. As a result, video relating to the first video data being reproduced is captured as the second video data.

In this first aspect, the apparatus may further comprise reproduction means for reproducing the first video data, and the video acquisition means may include video input means for inputting the first video data reproduced by the reproduction means as the second video data. As a result, the first video data being reproduced is input, as it is, as the second video data.

In this first aspect, the first video data and the metadata may be recorded in the form of media data atoms in the QuickTime format.

A second aspect of the invention is a video editing device comprising: position search means for acquiring metadata that holds a feature quantity in second video data synchronized with first video data managed in time series, searching the metadata for time-series positions that meet a predetermined condition, and generating the result as search information; and video extraction means for extracting, from the first video data and on the basis of the search information, the portions corresponding to the matching time-series positions. As a result, for the time-series positions at which the feature quantity in the metadata meets the predetermined condition, the corresponding portions of the first video data are extracted and the video is edited non-destructively.

In this second aspect, the position search means may acquire management information that manages the metadata and, when the management information indicates that the metadata holds no feature quantity, may refrain from acquiring the metadata. This avoids pointless accesses to the metadata.
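The management-information check can be sketched as follows. This is a minimal illustration only: the `ManagementInfo` class, its `has_features` flag, and the `read_metadata` callback are hypothetical names, since the publication does not specify how the management information encodes the absence of feature quantities.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sample = Tuple[float, str]  # (time in seconds, expression label), assumed layout


@dataclass(frozen=True)
class ManagementInfo:
    # Hypothetical flag: False means the metadata holds no feature quantity.
    has_features: bool


def load_metadata_if_useful(info: ManagementInfo,
                            read_metadata: Callable[[], List[Sample]]) -> List[Sample]:
    """Read the metadata only when the management information says it is worthwhile."""
    if not info.has_features:
        return []  # skip the (possibly expensive) metadata access entirely
    return read_metadata()


calls: List[int] = []


def read_metadata() -> List[Sample]:
    """Stand-in for a real metadata read; records each access."""
    calls.append(1)
    return [(0.0, "smile")]


skipped = load_metadata_if_useful(ManagementInfo(has_features=False), read_metadata)
loaded = load_metadata_if_useful(ManagementInfo(has_features=True), read_metadata)
```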

A third aspect of the invention is a video editing system comprising: video acquisition means for acquiring second video data synchronized with first video data managed in time series; video analysis means for analyzing a feature quantity in the second video data; metadata generation means for generating metadata that holds the feature quantity in synchronization with the first video data; position search means for searching the metadata for time-series positions that meet a predetermined condition and generating the result as search information; and video extraction means for extracting, from the first video data and on the basis of the search information, the portions corresponding to the matching time-series positions. As a result, the first video data is edited non-destructively, with the metadata holding the feature quantity of the second video data serving as an intermediate state.

A fourth aspect of the invention is a data structure comprising first video data managed in time series and metadata that holds, in synchronization with the first video data, a feature quantity in second video data synchronized with the first video data, the data structure being characterized in that a computer acquires the metadata, searches the metadata for time-series positions that meet a predetermined condition, generates the result as search information, and extracts from the first video data, on the basis of the search information, the portions corresponding to the matching time-series positions. As a result, the metadata is held as an intermediate state for editing the first video data.

The present invention provides the excellent effect that metadata holding, in synchronization with video data, a feature quantity of that video data or of other video data synchronized with it can be generated, or that video data can be edited on the basis of that metadata.

Embodiments of the present invention will now be described in detail with reference to the drawings.

FIG. 1 shows a configuration example of a video editing system 100 according to an embodiment of the present invention. The video editing system 100 comprises a video processing apparatus 110, which analyzes the feature quantity of second video data 102 and outputs metadata 103, and a video editing device 120, which outputs edited video data 105 obtained by editing first video data 101 on the basis of the metadata 103.

The first video data 101 and the second video data 102 are video data managed in time series, and may contain audio data in addition to moving image data. The first video data 101 is the video data to be edited by the video editing device 120. The second video data 102 is synchronized with the first video data 101 and may be captured at the same time as the first video data 101 is captured or reproduced. The first video data 101 and the second video data 102 may also have the same content.

The metadata 103 is synchronized with the first video data 101 and holds the feature quantity of the second video data 102 while managing it on the time axis. As described later, the feature quantity is assumed to be a facial expression contained in the second video data 102.

The video processing apparatus 110 comprises a video acquisition unit 111, a video analysis unit 112, and a metadata generation unit 113. The video editing device 120 comprises an extraction condition reception unit 121, a position search unit 122, and a video extraction unit 123.

The video acquisition unit 111 acquires the second video data 102. It may be a video camera that captures video through an optical lens, or it may be an input terminal or the like that receives an electronic signal.

The video analysis unit 112 analyzes the feature quantity in the second video data 102 acquired by the video acquisition unit 111. For example, the video analysis unit 112 extracts a face image contained in the second video data 102 and determines the expression of that face as the feature quantity. Known techniques can be used to determine the facial expression. Proposed techniques include one that judges a smile from the ratio between the line segment connecting the outer eye corner points and the line segment connecting the mouth corner points in a face image (for example, Japanese Patent Laying-Open No. 2005-266984), and one that determines the subject's expression by providing a reference image and evaluation points for each facial component and calculating an average value (for example, Japanese Patent Laying-Open No. 2004-46591).
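As a rough illustration of the ratio-based idea mentioned above, the sketch below compares the mouth-corner distance with the outer-eye-corner distance. It is not the method of the cited publication, whose actual rule is not given here: the function names, the direction of the comparison, and the threshold of 0.6 are all assumptions chosen only to make the example concrete.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) landmark position in pixels


def mouth_to_eye_ratio(left_eye: Point, right_eye: Point,
                       left_mouth: Point, right_mouth: Point) -> float:
    """Ratio of the mouth-corner segment length to the outer-eye-corner segment length."""
    eye_span = math.dist(left_eye, right_eye)
    if eye_span == 0:
        raise ValueError("eye corner points must be distinct")
    return math.dist(left_mouth, right_mouth) / eye_span


def looks_like_smile(ratio: float, threshold: float = 0.6) -> bool:
    """Assumed rule: a wider mouth relative to the eyes suggests a smile."""
    return ratio >= threshold


# Hypothetical landmark coordinates for one face image.
ratio = mouth_to_eye_ratio((0.0, 0.0), (10.0, 0.0), (2.0, 8.0), (9.0, 8.0))
```

Normalizing by the eye-corner distance makes the measure independent of how large the face appears in the frame, which is presumably why a ratio rather than an absolute length is used.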

The metadata generation unit 113 generates the metadata 103, which holds the feature quantity analyzed by the video analysis unit 112. The generated metadata 103 holds the feature quantity of the second video data 102 for each time of the first video data 101. Thus, for example, which facial expression the second video data 102 contains at each time of the first video data 101 can be held as an intermediate state for editing.
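One way to picture such metadata is as a list of time-stamped feature samples on the first video's time axis. The sketch below assumes one analysis result per frame and a fixed frame rate; the `FeatureSample` type, the `generate_metadata` function, and the expression labels are hypothetical and do not reflect the recorded format.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class FeatureSample:
    time: float                # seconds on the first video's time axis
    expression: Optional[str]  # hypothetical label; None if no face was found


def generate_metadata(labels: Sequence[Optional[str]],
                      frame_rate: float) -> List[FeatureSample]:
    """Attach one analyzed label per frame to the first video's timeline."""
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    return [FeatureSample(time=index / frame_rate, expression=label)
            for index, label in enumerate(labels)]


# Labels as they might come from analyzing the second video, frame by frame.
metadata = generate_metadata(["smile", None, "surprise", "smile"], frame_rate=2.0)
```

Because the samples are stored rather than recomputed, a different extraction condition can be applied later without repeating the analysis.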

The extraction condition reception unit 121 receives the input of extraction conditions. For example, when facial expression is the condition, it accepts a condition such as whether to extract smiling faces or surprised faces. Such conditions can be combined by logical product (AND), logical sum (OR), and the like.
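Combining conditions by AND and OR can be sketched with small predicate combinators. The example assumes that the feature quantity at a given time is the set of expression labels detected at that time (for instance, when several faces are visible); that representation and the names `has`, `all_of`, and `any_of` are assumptions, not part of the publication.

```python
from typing import Callable, FrozenSet

# A condition decides whether the expressions found at one time position match.
Condition = Callable[[FrozenSet[str]], bool]


def has(expression: str) -> Condition:
    """Match when the given expression label is present."""
    return lambda found: expression in found


def all_of(*conditions: Condition) -> Condition:
    """Logical product (AND) of several conditions."""
    return lambda found: all(cond(found) for cond in conditions)


def any_of(*conditions: Condition) -> Condition:
    """Logical sum (OR) of several conditions."""
    return lambda found: any(cond(found) for cond in conditions)


smile_or_surprise = any_of(has("smile"), has("surprise"))
smile_and_surprise = all_of(has("smile"), has("surprise"))
```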

The position search unit 122 searches the metadata 103 using the extraction condition received by the extraction condition reception unit 121. This yields the time-series positions that match the extraction condition, which are held as an edit list 104.
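The search step can be sketched as a scan that merges consecutive matching samples into segments. The sketch assumes the metadata is a time-ordered list of evenly spaced `(time, label)` samples and that the edit list is a list of `(start, end)` pairs in seconds; neither layout nor the name `search_positions` comes from the publication.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[float, Optional[str]]  # (time in seconds, expression label)
Segment = Tuple[float, float]         # (start, end) on the first video's time axis


def search_positions(metadata: Sequence[Sample],
                     matches: Callable[[Optional[str]], bool],
                     sample_period: float) -> List[Segment]:
    """Merge runs of consecutive matching samples into (start, end) segments."""
    edit_list: List[Segment] = []
    start: Optional[float] = None
    end = 0.0
    for time, label in metadata:
        if matches(label):
            if start is None:
                start = time              # a new matching run begins here
            end = time + sample_period    # extend the run to cover this sample
        elif start is not None:
            edit_list.append((start, end))
            start = None
    if start is not None:                 # close a run that reaches the last sample
        edit_list.append((start, end))
    return edit_list


metadata = [(0.0, None), (0.5, "smile"), (1.0, "smile"), (1.5, None), (2.0, "smile")]
edit_list = search_positions(metadata, lambda label: label == "smile", 0.5)
```

Here the two adjacent smile samples collapse into one segment from 0.5 s to 1.5 s, and the final sample forms a second segment from 2.0 s to 2.5 s.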

The video extraction unit 123 extracts from the first video data 101, on the basis of the edit list 104, the portions corresponding to the time-series positions that match the extraction condition, and outputs them as the edited video data 105.
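The extraction step can be sketched as copying the frames covered by each edit-list segment into a new sequence while leaving the source untouched. The frame list, the fixed frame rate, and the name `extract_segments` are assumptions for illustration; a real implementation would operate on the recorded file rather than on an in-memory list.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def extract_segments(frames: Sequence[Frame], frame_rate: float,
                     edit_list: Sequence[Tuple[float, float]]) -> List[Frame]:
    """Collect the frames covered by each (start, end) segment, in order."""
    edited: List[Frame] = []
    for start, end in edit_list:
        first = max(0, round(start * frame_rate))
        last = min(len(frames), round(end * frame_rate))
        edited.extend(frames[first:last])  # copies; the source frames are not modified
    return edited


frames = list(range(10))  # stand-in for ten frames of the first video data
edited = extract_segments(frames, 2.0, [(0.5, 1.5), (2.0, 2.5)])
```

At 2 frames per second, the segment from 0.5 s to 1.5 s selects frames 1 and 2, and the segment from 2.0 s to 2.5 s selects frame 4.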

FIG. 2 shows a first configuration example of the video processing apparatus 110 according to the embodiment of the present invention. In this first configuration example, a recording unit 210 comprises an imaging unit 211, a video processing unit 212, a video compression unit 213, a file generation unit 214, a writing unit 215, an imaging unit 216, a video analysis unit 217, a recording control unit 218, and a recording medium 219.

The imaging unit 211 captures a subject as the first video data 101. The video processing unit 212 applies processing such as effects to the video captured by the imaging unit 211. The video compression unit 213 compresses the video processed by the video processing unit 212.

The imaging unit 216 captures, as the second video data 102, the face of the photographer who is shooting the subject (or of a viewer who is watching the subject). The video analysis unit 217 analyzes the video captured by the imaging unit 216; that is, it analyzes the facial expression of the photographer contained in the video. The analysis result becomes the metadata 103.

The file generation unit 214 generates files, each in a predetermined file format, containing the video compressed by the video compression unit 213 (the first video data 101) and the analysis result of the photographer's facial expression produced by the video analysis unit 217 (the metadata 103).

The writing unit 215 writes the files generated by the file generation unit 214 to the recording medium 219. The recording medium 219 may be a disk-shaped recording medium such as a hard disk, or a semiconductor recording medium such as a Memory Stick.

The recording control unit 218 controls the operation of recording to the recording medium 219 in the recording unit 210.

In this first configuration example, while the imaging unit 211 captures the subject as the first video data 101, the imaging unit 216 simultaneously captures the photographer's facial expression as the second video data 102, and that expression is analyzed to generate the metadata 103.

FIG. 3 shows usage modes of the first configuration example of the video processing apparatus 110 according to the embodiment of the present invention. In FIG. 3(a), a camera 521 on the front of a video camera device 520 captures a subject 501 while, at the same time, a camera 522 on the operation side of the video camera device 520 captures the face of a photographer 502. Metadata can thus be generated by analyzing the facial expression of the photographer 502 inside the video camera device 520.

The video camera device 520 may also be connected to a network 510. Through the network 510, the video of the subject 501 captured by the camera 521 on the front of the video camera device 520 can be distributed to a television device 530 in FIG. 3(b) or a computer device 540 in FIG. 3(c).

A video camera device 531 is connected to the television device 530 in FIG. 3(b) and captures the face of a viewer 503. Metadata can thus be generated by analyzing the facial expression of the viewer 503 inside the television device 530 or the video camera device 531.

A camera 541 is provided on the front of the computer device 540 in FIG. 3(c) and captures the face of a viewer 504. Metadata can thus be generated by analyzing the facial expression of the viewer 504 inside the computer device 540.

In this way, in the first configuration example of the video processing apparatus 110, the second video data 102 is captured at the same time as the first video data 101 is captured, and the metadata 103 is generated from the second video data 102. The facial expressions contained in the second video data 102 are considered to reflect some reaction to the video of the first video data 101, so their feature quantity is stored in the metadata 103 as an intermediate state and used for editing the first video data 101.

FIG. 4 shows a second configuration example of the video processing apparatus 110 according to the embodiment of the present invention. In this second configuration example, the recording unit 210 comprises the imaging unit 211, the file generation unit 214, the writing unit 215, the video analysis unit 217, the recording control unit 218, and the recording medium 219. A reproduction unit 220 comprises a display unit 221, a video processing unit 222, a video decompression unit 223, a file decoding unit 224, a reading unit 225, and a recording medium 229.

The recording medium 229 stores the first video data 101 in a predetermined file format. The reading unit 225 reads the file containing the first video data 101 from the recording medium 229. The file decoding unit 224 decodes the file read by the reading unit 225. The video decompression unit 223 decompresses the compressed video in the file decoded by the file decoding unit 224. The video processing unit 222 applies processing such as effects to the video decompressed by the video decompression unit 223.

The display unit 221 displays the video output from the video processing unit 222. The first video data 101 recorded on the recording medium 229 is thereby reproduced and displayed on the display unit 221.

The imaging unit 211 captures a subject as the second video data 102. The video analysis unit 217 analyzes the video captured by the imaging unit 211; that is, it analyzes the facial expression of the viewer contained in the video. The analysis result becomes the metadata 103.

The file generation unit 214 generates, in a predetermined file format, a file containing the analysis result of the viewer's facial expression produced by the video analysis unit 217 (the metadata 103).

The writing unit 215 writes the file generated by the file generation unit 214 to the recording medium 219. The recording control unit 218 controls the operation of recording to the recording medium 219 in the recording unit 210.

The recording media 219 and 229 may each be a disk-shaped recording medium such as a hard disk or a semiconductor recording medium such as a Memory Stick, and the two may be recording media of different types.

In this second configuration example, while the display unit 221 reproduces and displays the first video data 101, the imaging unit 211 simultaneously captures the viewer's facial expression as the second video data 102, and that expression is analyzed to generate the metadata 103.

FIG. 5 shows usage modes of the second configuration example of the video processing apparatus 110 according to the embodiment of the present invention. In FIG. 5(a), a playback screen is displayed on the operation side of the video camera device 520, and the camera 522 on that side captures the face of a viewer 505. Metadata can thus be generated by analyzing the facial expression of the viewer 505 inside the video camera device 520.

The video camera device 531 is connected to the television device 530 in FIG. 5(b) and captures the face of the viewer 503. Metadata can thus be generated by analyzing the facial expression of the viewer 503 inside the television device 530 or the video camera device 531.

The camera 541 is provided on the front of the computer device 540 in FIG. 5(c) and captures the face of the viewer 504. Metadata can thus be generated by analyzing the facial expression of the viewer 504 inside the computer device 540.

In this way, in the second configuration example of the video processing apparatus 110, the second video data 102 is captured at the same time as the first video data 101 is reproduced, and the metadata 103 is generated from the second video data 102. The facial expressions contained in the second video data 102 are considered to reflect some reaction to the video of the first video data 101, so their feature quantity is stored in the metadata 103 as an intermediate state and used for editing the first video data 101.

FIG. 6 shows a third configuration example of the video processing apparatus 110 according to the embodiment of the present invention. In this third configuration example, the recording unit 210 comprises a video input unit 206, the file generation unit 214, the writing unit 215, the video analysis unit 217, the recording control unit 218, and the recording medium 219. The reproduction unit 220 comprises the video processing unit 222, the video decompression unit 223, the file decoding unit 224, the reading unit 225, and the recording medium 229.

The configuration of the reproduction unit 220 is the same as in the second configuration example, except that the display unit 221 is omitted and the output of the video processing unit 222 is supplied directly to the recording unit 210.

The configuration of the recording unit 210 is also the same as in the second configuration example, except that, in place of the imaging unit 211, it has a video input unit 206 that receives the video from the reproduction unit 220.

That is, in this third configuration example, the video reproduced by the reproduction unit 220 (the first video data 101) is supplied as it is as the input video of the recording unit 210 (the second video data 102), and the facial expressions contained in it are analyzed to generate the metadata 103.

FIG. 7 shows a configuration example of a camera-integrated imaging apparatus that is one example of the video processing apparatus 110 according to the embodiment of the present invention. The imaging apparatus comprises an imaging unit 301, a video processing unit 330, a video compression unit 341, a compression control unit 342, a recording medium access unit 351, a drive control unit 352, an operation reception unit 360, a display unit 370, and a system control unit 390.

The imaging unit 301 captures a subject and outputs it as video data. The video processing unit 330 applies effect processing to the video data output from the imaging unit 301. The video compression unit 341 compresses the video data processed by the video processing unit 330. The compression control unit 342 controls the compression processing in the video compression unit 341.

記録媒体アクセス部３５１は、記録媒体３０９に対する書込みや読出しを行うものである。ドライブ制御部３５２は、記録媒体アクセス部３５１による書込みや読出しを制御するものである。 The recording medium access unit 351 performs writing and reading with respect to the recording medium 309. The drive control unit 352 controls writing and reading by the recording medium access unit 351.

The operation reception unit 360 receives operation input from the user; various buttons, a GUI (Graphical User Interface), and the like are assumed. The display unit 370 displays the video being captured, reproduced video, various messages to the user, and the like.

システム制御部３９０は、撮像装置の全体を制御するものであり、例えば、マイクロプロセッサなどにより実現され得る。このシステム制御部３９０は、操作受付部３６０によって受け付けられた操作入力によって映像の録画の開始、停止や、録画の経過時間情報などを制御するとともに、ユーザに対する表示部３７０における表示を制御する。また、システム制御部３９０は、カメラ制御部３２９や圧縮制御部３４２との間で情報をやり取りして、ドライブ制御部３５２を介して記録媒体３０９に対する書込み制御を行う。 The system control unit 390 controls the entire imaging apparatus and can be realized by, for example, a microprocessor. The system control unit 390 controls the start and stop of video recording, the elapsed time information of recording, and the like according to the operation input received by the operation receiving unit 360, and controls the display on the display unit 370 for the user. Further, the system control unit 390 exchanges information with the camera control unit 329 and the compression control unit 342, and performs writing control on the recording medium 309 via the drive control unit 352.

The imaging unit 301 includes a zoom lens 311, an iris (aperture) 312, a focus lens 313, a filter 314, an image sensor 321, an A/D converter 322, a camera signal processing circuit 323, a detection unit 324, a zoom control unit 325, an angular velocity sensor 326, and a camera control unit 329.

ズームレンズ３１１は、ズーム（拡大）処理を行うためのレンズである。アイリス３１２は、被写体からの光量を調整するための絞りである。フォーカスレンズ３１３は、被写体に焦点を合わせるためのレンズである。フィルタ３１４は、赤外線を除去するためのフィルタである。 The zoom lens 311 is a lens for performing zoom (enlargement) processing. The iris 312 is an aperture for adjusting the amount of light from the subject. The focus lens 313 is a lens for focusing on a subject. The filter 314 is a filter for removing infrared rays.

撮像素子３２１は、光学レンズ群から供給された光を電気信号に変換する光電変換素子であり、例えば、ＣＣＤ（Charge Coupled Devices）などにより実現され得る。この撮像素子３２１により、被写体の画像が、例えばＲＧＢ（赤、緑、青）の３原色に相当する３つの映像信号として取り出される。 The imaging element 321 is a photoelectric conversion element that converts light supplied from the optical lens group into an electrical signal, and can be realized by, for example, a CCD (Charge Coupled Devices). By this image sensor 321, the image of the subject is extracted as three video signals corresponding to, for example, three primary colors of RGB (red, green, blue).

Ａ／Ｄ変換器３２２は、撮像素子３２１から供給されたアナログの電気信号をデジタル信号に変換するものである。カメラ信号処理回路３２３は、Ａ／Ｄ変換器３２２により変換されたデジタル信号に対して、白色の基準を定めるホワイトバランスなどの信号処理を施すものである。 The A / D converter 322 converts an analog electric signal supplied from the image sensor 321 into a digital signal. The camera signal processing circuit 323 performs signal processing such as white balance that defines a white reference for the digital signal converted by the A / D converter 322.

検波部３２４は、カメラ信号処理回路３２３によって信号処理の施された映像信号のフィードバックを受けて、各種の検波処理を行うものである。例えば、自動的に被写体に焦点を合わせるためのオートフォーカス（ＡＦ：Auto Focus）検波、自動的に露光を行うためのオートエクスポージャ（ＡＥ：Auto Exposure）検波、自動的にホワイトバランスを行うためのオートホワイトバランス（ＡＷＢ：Auto White Balance）検波などを行うものである。 The detection unit 324 performs various types of detection processing upon receiving feedback of the video signal subjected to signal processing by the camera signal processing circuit 323. For example, auto focus (AF) detection for automatically focusing on a subject, auto exposure (AE) detection for automatically performing exposure, and for automatically performing white balance Auto white balance (AWB) detection is performed.

ズーム制御部３２５は、ユーザからの操作入力などに従ってズームレンズ３１１を移動させてズーム処理を制御するものである。角速度センサ３２６は、撮像装置の角速度を検出するものであり、例えば、ジャイロスコープなどにより手ぶれの度合いを検出するものである。 The zoom control unit 325 controls the zoom process by moving the zoom lens 311 in accordance with an operation input from the user. The angular velocity sensor 326 detects the angular velocity of the imaging apparatus, and detects the degree of camera shake using, for example, a gyroscope.

カメラ制御部３２９は、撮像部３０１の制御を行うものである。例えば、カメラ制御部３２９は、角速度センサ３２６において検知された手ぶれに対して手ぶれ補正を行って画質の劣化を低減するように制御を行う。また、カメラ制御部３２９は、撮像素子３２１からの映像入力の制御、検波部３２４における処理の制御、ズーム制御部３２５における処理の制御などを行う。 The camera control unit 329 controls the imaging unit 301. For example, the camera control unit 329 performs control so as to reduce the deterioration in image quality by performing camera shake correction on the camera shake detected by the angular velocity sensor 326. The camera control unit 329 performs control of video input from the image sensor 321, control of processing in the detection unit 324, control of processing in the zoom control unit 325, and the like.

図８は、ＱｕｉｃｋＴｉｍｅファイルフォーマットをベースとしたファイル形式（以下、ＱｕｉｃｋＴｉｍｅベースファイル形式という。）の構造例を示す図である。このファイル形式では、ファイルの内容が実データ格納部と、その実データを参照するために必要な場所情報などを格納する管理情報格納部とに分かれている。ＱｕｉｃｋＴｉｍｅファイルフォーマットでは、実データ格納部はメディアデータアトム（media data atom、タイプ名：'ｍｄａｔ'）と呼ばれ、管理情報格納部はムービーアトム（movie atom、タイプ名：'ｍｏｏｖ'）と呼ばれる。なお、「アトム（atom）」は「ボックス（box）」と表現されることもある。また、ムービーアトムはムービーリソースと表現されることがあり、メディアデータアトムは単にメディアデータまたはムービーデータと表現されることがある。 FIG. 8 is a diagram showing a structure example of a file format based on the QuickTime file format (hereinafter referred to as QuickTime base file format). In this file format, the content of the file is divided into an actual data storage unit and a management information storage unit that stores location information necessary for referring to the actual data. In the QuickTime file format, the actual data storage unit is called a media data atom (type name: 'mdat'), and the management information storage unit is called a movie atom (movie atom, type name: 'moov'). “Atom” may also be expressed as “box”. A movie atom may be expressed as a movie resource, and a media data atom may be expressed simply as media data or movie data.

これらメディアデータアトムおよびムービーアトムは、同一のファイルに含まれていてもよく、別ファイルに分かれていてもよい。例えば、図８のように、動画像（Ｖ１等）や音声（Ａ１等）のメディアデータを含むメディアデータアトム６１２と、それを参照するムービーアトム６１１とを同一のファイル６１０に格納するようにしてもよく、また、メディアデータアトム６１２を参照するムービーアトム６２１を別のファイル６２０に格納するようにしてもよい。前者の形式を有するファイルは自己内包型ファイルと呼ばれ、後者の形式を有するファイルは外部参照型ファイルと呼ばれる。そのため、ムービーアトムは、外部参照するメディアデータアトムが含まれる外部ファイルの相対パスまたは絶対パスを示す管理情報を格納できるようになっている。 These media data atom and movie atom may be included in the same file or may be separated into different files. For example, as shown in FIG. 8, a media data atom 612 including media data of a moving image (V1 or the like) or sound (A1 or the like) and a movie atom 611 that refers to the media data atom 612 are stored in the same file 610. Alternatively, a movie atom 621 that refers to the media data atom 612 may be stored in another file 620. A file having the former format is called a self-contained file, and a file having the latter format is called an external reference file. Therefore, the movie atom can store management information indicating a relative path or an absolute path of an external file including a media data atom to be externally referenced.
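The split between a management part ('moov') and a data part ('mdat') can be illustrated with a minimal sketch. It assumes the basic atom header layout of the QuickTime file format, a 4-byte big-endian size followed by a 4-byte type name, and it deliberately does not handle the extended-size or size-zero forms. The helper names `iter_atoms` and `make_atom` and the toy payloads are hypothetical and are not taken from the embodiment.

```python
import struct


def iter_atoms(data, start=0, end=None):
    """Yield (type_name, payload_offset, payload_size) for each atom in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Basic header: 4-byte big-endian size (header included), 4-byte type.
        size, kind = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > end:
            # size == 1 (extended size) and size == 0 (runs to end of file)
            # are not handled in this sketch.
            raise ValueError(f"unsupported or corrupt atom at offset {pos}")
        yield kind.decode("latin-1"), pos + 8, size - 8
        pos += size


def make_atom(kind, payload=b""):
    """Build a toy atom: size and type header followed by the payload."""
    return struct.pack(">I4s", 8 + len(payload), kind) + payload


# A toy self-contained file: one movie atom followed by one media data atom.
moov = make_atom(b"moov", make_atom(b"mvhd", b"\x00" * 4) + make_atom(b"trak"))
mdat = make_atom(b"mdat", b"V1A1")
file_bytes = moov + mdat

top_level = [kind for kind, _, _ in iter_atoms(file_bytes)]
_, moov_off, moov_size = next(iter_atoms(file_bytes))
children = [kind for kind, _, _ in iter_atoms(file_bytes, moov_off, moov_off + moov_size)]
```

In an external-reference file, only the movie atom would appear at the top level, and the media data atom would be looked up in the other file through the path held in the management information.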

The media data atom 612 stores, for example, audio data encoded by a compression encoding method based on MPEG1 audio (MPEG1 Audio Layer2) and image data encoded by a compression encoding method conforming to the MPEG2 video (MPEG2 Video) standard. The encoding methods are not limited to these. For example, video data may use Motion JPEG, MJ2 (Motion JPEG2000), MPEG4 (MP4), or AVC (Advanced Video Coding: MPEG4-part10), and audio data may use Dolby AC3 or ATRAC (Adaptive TRansform Acoustic Coding). Linear data that has not been compression-encoded can also be stored.

図９は、ＱｕｉｃｋＴｉｍｅファイルフォーマットにおける階層構造を示す図である。メディアデータアトム（'ｍｄａｔ'）における実データはサンプル（sample）と呼ばれる最小管理単位に分かれており、このサンプルを任意の個数分集めたものがチャンク（chunk）と呼ばれる。メディアデータアトム（'ｍｄａｔ'）の管理情報であるムービーアトム（'ｍｏｏｖ'）では、サンプルのサイズや、チャンクの先頭格納場所、各サンプルの表示時間等が格納される。 FIG. 9 is a diagram showing a hierarchical structure in the QuickTime file format. Actual data in the media data atom ('mdat') is divided into minimum management units called samples, and a collection of an arbitrary number of samples is called a chunk. The movie atom ('moov'), which is management information of the media data atom ('mdat'), stores the sample size, the chunk storage location, the display time of each sample, and the like.

ムービーアトム（'ｍｏｏｖ'）は、ムービーヘッダアトム（'ｍｖｈｄ'）と、トラックアトム（'ｔｒａｋ'）等から構成される。 The movie atom ('moov') is composed of a movie header atom ('mvhd'), a track atom ('trak'), and the like.

ムービーヘッダアトム（'ｍｖｈｄ'）は、ムービーアトムのヘッダ情報を保持する部分であり、ムービー全体の特徴を示すものである。例えば、ムービー全体の期間や時間スケール、作成日等を項目として含む。 The movie header atom ('mvhd') is a part that holds movie atom header information, and indicates the characteristics of the entire movie. For example, the period, time scale, creation date, etc. of the entire movie are included as items.

The track atom ('trak') stores different types of data, such as sound, video, and text, in separate tracks. In this figure, the video track atom includes a track header atom ('tkhd'), an edit atom ('edts'), a user data atom ('udta'), and a media atom ('mdia'). The audio track atom is omitted from the figure, but it has the same configuration as the video track atom.

トラックヘッダアトム（'ｔｋｈｄ'）は、トラックアトムのヘッダ情報を保持する部分であり、そのトラックの特徴を示すものである。例えば、ビデオのピクセル数やサウンドの音量、作成日等を項目として含む。 The track header atom ('tkhd') is a part that holds the track atom header information, and indicates the characteristics of the track. For example, the number of video pixels, sound volume, creation date, and the like are included as items.

The edit atom ('edts') holds the editing information of the track as an edit list atom ('elst'). The edit atom is described in detail with reference to FIG. 27.

ユーザデータアトム（'ｕｄｔａ'）は、必要に応じてユーザにより定義された任意の情報を含むものである。例えば、ムービーのウィンドウ位置や再生方法、作成情報等を保持することができる。このユーザデータアトムは、ムービーユーザデータをリスト形式により保持する。 The user data atom ('udta') includes arbitrary information defined by the user as necessary. For example, it is possible to hold a movie window position, a playback method, creation information, and the like. This user data atom holds movie user data in a list format.

メディアアトム（'ｍｄｉａ'）は、そのトラックで実際に用いられる実データに関する情報を格納するものである。すなわち、メディアアトムは、メディア全体に関する情報、メディアデータの取扱いに関する情報、メディアの構成に関する情報等を格納する。実データはサンプル（sample）と呼ばれる最小管理単位に分かれており、このサンプルを任意の個数分集めたものがチャンク（chunk）と呼ばれる。メディアアトムでは、サンプルのサイズや、チャンクの先頭格納場所、各サンプルの表示時間等が格納される。 The media atom ('mdia') stores information regarding actual data actually used in the track. That is, the media atom stores information on the entire medium, information on handling of media data, information on the configuration of the media, and the like. Actual data is divided into minimum management units called samples, and a collection of an arbitrary number of samples is called a chunk. In the media atom, the sample size, the chunk storage location, the display time of each sample, and the like are stored.

このメディアアトムは、メディアヘッダアトム（'ｍｄｈｄ'）と、メディアハンドラアトム（'ｈｄｌｒ'）と、メディア情報アトム（'ｍｉｎｆ'）等から構成される。 The media atom includes a media header atom ('mdhd'), a media handler atom ('hdlr'), a media information atom ('minf'), and the like.

メディアヘッダアトム（'ｍｄｈｄ'）は、メディアアトムのヘッダ情報を保持する部分であり、メディア全体としての特徴を示すものである。 The media header atom ('mdhd') is a part that holds the header information of the media atom, and indicates the characteristics of the entire medium.

メディアハンドラアトム（'ｈｄｌｒ'）は、メディア毎の取り扱いに関する情報を保持するものである。 The media handler atom ('hdlr') holds information regarding handling for each medium.

The media information atom ('minf') holds information expressed by the media type. This media information atom is composed of a video media information header atom ('vmhd'), a data handler atom ('hdlr'), a data information atom ('dinf'), a sample table atom ('stbl'), and the like.

ビデオメディア情報ヘッダアトム（'ｖｍｈｄ'）は、ビデオトラックにおいて、ビデオメディアに関するヘッダ情報を保持するものである。なお、オーディオトラックの場合、サウンドメディアに関するヘッダ情報を保持するサウンドメディア情報ヘッダアトム（'ｓｍｈｄ'）が、ビデオメディアヘッダアトム（'ｖｍｈｄ'）の代わりに含まれる。 The video media information header atom ('vmhd') holds header information related to video media in the video track. In the case of an audio track, a sound media information header atom ('smhd') that holds header information related to sound media is included instead of the video media header atom ('vmhd').

データハンドラアトム（'ｈｄｌｒ'）は、ビデオメディアの取り扱いに関する情報を保持するものである。 The data handler atom ('hdlr') holds information relating to handling of video media.

データ情報アトム（'ｄｉｎｆ'）は、実際に参照する実データの格納先に関する情報を保持するものである。このデータ情報アトムには、参照する実データの格納方法、格納場所、ファイル名に関する情報を保持するデータリファレンスアトム（'ｄｒｅｆ'）が含まれる。 The data information atom ('dinf') holds information related to the storage location of actual data that is actually referred to. The data information atom includes a data reference atom ('dref') that holds information regarding the storage method, storage location, and file name of the actual data to be referenced.

The sample table atom ('stbl') holds information about samples, which are the minimum management unit of the actual data of the media. This sample table atom is composed of a sample description atom ('stsd'), a time-to-sample atom ('stts'), a sample size atom ('stsz'), a sample-to-chunk atom ('stsc'), a chunk offset atom ('stco'), and the like.

The sample description atom ('stsd') holds information about the compression method and characteristics of each sample. The time-to-sample atom ('stts') holds the relationship between each sample and time. The sample size atom ('stsz') holds the data amount of each sample. The sample-to-chunk atom ('stsc') holds the relationship between a chunk and the samples that make up that chunk. The chunk offset atom ('stco') holds the offset from the head of the file to the head position of each chunk.
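How these tables combine to locate a sample can be sketched as follows. This is a simplified illustration: it assumes the sample-to-chunk information has already been expanded into a plain per-chunk sample count and that the time information is a plain list of per-sample durations, whereas the actual tables are stored in a more compact, run-length style. The function names and the numeric values are hypothetical.

```python
from itertools import accumulate


def sample_offsets(chunk_offsets, samples_per_chunk, sample_sizes):
    """Return the file offset of every sample.

    chunk_offsets     -- offset of each chunk from the head of the file
    samples_per_chunk -- number of samples in each chunk (expanded form, an assumption)
    sample_sizes      -- data amount of each sample, in sample order
    """
    offsets = []
    index = 0
    for chunk_offset, count in zip(chunk_offsets, samples_per_chunk):
        pos = chunk_offset
        for _ in range(count):
            offsets.append(pos)
            pos += sample_sizes[index]  # samples are contiguous inside a chunk
            index += 1
    if index != len(sample_sizes):
        raise ValueError("sample counts and sample sizes do not agree")
    return offsets


def sample_start_times(durations):
    """Return the start time of every sample from its per-sample duration."""
    return [0] + list(accumulate(durations))[:-1] if durations else []


# Two chunks: the first holds two samples, the second holds one.
offsets = sample_offsets([100, 400], [2, 1], [10, 20, 30])
starts = sample_start_times([3, 3, 4])
```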

本発明の実施の形態では、第１のビデオデータ１０１をＱｕｉｃｋＴｉｍｅベースファイル形式により保持するのみならず、メタデータ１０３もこのＱｕｉｃｋＴｉｍｅベースファイル形式により保持する。これにより、第２のビデオデータ１０２の特徴量を時間軸で管理しながらメタデータ１０３に保持することができる。 In the embodiment of the present invention, not only the first video data 101 is held in the QuickTime base file format, but also the metadata 103 is held in the QuickTime base file format. Thereby, the feature amount of the second video data 102 can be held in the metadata 103 while being managed on the time axis.

図１０は、本発明の実施の形態におけるファイルの保存形式の一例を示す図である。 FIG. 10 is a diagram showing an example of a file storage format according to the embodiment of the present invention.

図１０（ａ）は、第１のビデオデータ１０１を含むビデオファイル６３０の構成例を示す図である。このビデオファイル６３０は、第１のビデオデータ１０１を有するメディアデータ６３２と、メディアデータ６３２を管理するムービーリソース６３１とを備えている。メディアデータ６３２は、第１のビデオデータ１０１の各サンプル６３３を含んでおり、これらはムービーリソース６３１によって管理される。 FIG. 10A is a diagram illustrating a configuration example of a video file 630 including the first video data 101. The video file 630 includes media data 632 having the first video data 101 and a movie resource 631 for managing the media data 632. The media data 632 includes each sample 633 of the first video data 101, and these are managed by the movie resource 631.

FIG. 10B is a diagram illustrating a configuration example of the metafile 640 including the metadata 103. The metafile 640 includes media data 642 having the metadata 103 and a movie resource 641 for managing the media data 642. The media data 642 includes each sample 643 of the metadata 103, and these are managed by the movie resource 641. The movie resource 641 also externally references the media data 632 and likewise manages it on a single time axis.

FIG. 10C is a diagram illustrating the flow on the time axis of the video track 650 and the meta track 660 realized by the video file 630 and the metafile 640. Here, for simplicity, of the video signal and audio signal of the video file 630, only the video track 650 is shown.

ビデオトラック６５０では、各サンプル６５３が時間軸上に並んでいる。また、メタトラック６６０では、ビデオトラック６５０のサンプル６５３と同期して、各サンプル６６３が時間軸上に並んでいる。例えば、第２のビデオデータ１０２の時刻ｔ１からｔ２の区間において笑顔が特徴量として抽出された場合、その旨がメタトラック６６０の時刻ｔ１からｔ２の区間において記録される。同様に、第２のビデオデータ１０２の時刻ｔ３からｔ４の区間において驚いた顔が特徴量として抽出された場合、その旨がメタトラック６６０の時刻ｔ３からｔ４の区間において記録される。 In the video track 650, the samples 653 are arranged on the time axis. In the meta track 660, the samples 663 are arranged on the time axis in synchronization with the sample 653 of the video track 650. For example, when a smile is extracted as a feature amount in a section from time t1 to t2 of the second video data 102, that fact is recorded in a section from time t1 to t2 of the meta track 660. Similarly, when a surprised face is extracted as a feature amount in a section from time t3 to t4 of the second video data 102, the fact is recorded in a section from time t3 to t4 of the meta track 660.

すなわち、メタデータ１０３を示すメタトラック６６０は、第２のビデオデータ１０２を介して、第１のビデオデータ１０１を示すビデオトラック６５０と同期していることになる。 That is, the meta track 660 indicating the metadata 103 is synchronized with the video track 650 indicating the first video data 101 via the second video data 102.

Although omitted from the figure, in sections where the facial expression is classified as expressionless, information indicating that fact is recorded in the meta track 660.
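The time-axis management of the meta track can be pictured with a minimal sketch in which each meta sample is reduced to a start time and an expression label, and the expression at any time is found by searching the start times. The concrete times standing in for t1 to t4, the label strings, and the function name `expression_at` are hypothetical.

```python
from bisect import bisect_right

# (start time, expression) pairs on the shared time axis. Expressionless
# sections are recorded explicitly, as in the meta track described above.
meta_track = [
    (0, "neutral"),
    (10, "smile"),     # t1
    (20, "neutral"),   # t2
    (30, "surprise"),  # t3
    (40, "neutral"),   # t4
]
start_times = [start for start, _ in meta_track]


def expression_at(t):
    """Return the expression recorded for time t on the meta track."""
    i = bisect_right(start_times, t) - 1
    if i < 0:
        raise ValueError("time precedes the first meta sample")
    return meta_track[i][1]
```

Because the video track shares the same time axis, the time passed to `expression_at` can be the display time of any video sample.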

FIG. 11 is a diagram showing an example of the hierarchical structure of the meta track in the embodiment of the present invention. This meta track has basically the same configuration as the video track described with reference to FIG. 9. However, it differs from the video track in that it has a track reference atom ('tref') under the track atom and a track input map atom ('imap') under the media atom.

トラックリファレンスアトム（'ｔｒｅｆ'）は、ソーストラック（第１のビデオデータ１０１）との参照関係を指定するための情報を保持するものである。そのため、トラックリファレンスアトムは、指定対象となるトラックのトラックヘッダアトム（'ｔｋｈｄ'）に格納されているトラック固有のトラックＩＤを指定するトラックリファレンスタイプアトム（'ｓｓｒｃ'）を含む。このトラックリファレンスタイプアトムに含まれるトラックＩＤの数は、ソーストラックの数と一致する。 The track reference atom ('tref') holds information for designating a reference relationship with the source track (first video data 101). Therefore, the track reference atom includes a track reference type atom ('ssrc') that designates a track ID unique to the track stored in the track header atom ('tkhd') of the track to be designated. The number of track IDs included in this track reference type atom matches the number of source tracks.

トラックインプットマップアトム（'ｉｍａｐ'）は、ソーストラックに関する情報を保持するものであり、ＱｕｉｃｋＴｉｍｅにおけるＱＴアトム構造と呼ばれるデータ構造により構成される。このトラックインプットマップアトムには、ＱＴアトムコンテナ（'ｓｅａｎ'）を最上位アトムとするコンテナによってパッキングされたトラックインプットＱＴアトム（' ｉｎ'）が１つ以上含まれる。このトラックインプットＱＴアトムの数は、ソーストラックの数と一致する。 The track input map atom ('imap') holds information relating to the source track, and is configured by a data structure called a QT atom structure in QuickTime. The track input map atom includes one or more track input QT atoms ('in') packed by a container having a QT atom container ('sean') as the highest level atom. The number of track input QT atoms matches the number of source tracks.

トラックインプットＱＴアトム（' ｉｎ'）は、インプットタイプＱＴアトム（' ｔｙ'）およびデータソースタイプＱＴアトム（'ｄｔｓｔ'）を保持する。インプットタイプＱＴアトム（' ｔｙ'）は、ソーストラックがビデオメディアであることを指定するものである。また、データソースタイプＱＴアトム（'ｄｔｓｔ'）は、ソーストラックに対して固有の名称を与えるものである。 The track input QT atom ('in') holds an input type QT atom ('ty') and a data source type QT atom ('dtst'). The input type QT atom ('ty') specifies that the source track is video media. The data source type QT atom ('dtst') gives a unique name to the source track.

図１２は、本発明の実施の形態におけるメタトラックを含むムービーアトムの記載例を示す図である。この記載例では、ソーストラック（第１のビデオデータ１０１）としてビデオトラックアトム（ｖｉｄｅｏ）が１つだけ含まれている。そのため、メタトラックアトムにおけるトラックインプットＱＴアトム（' ｉｎ'）も１つだけ含まれている。 FIG. 12 is a diagram showing a description example of a movie atom including a meta track in the embodiment of the present invention. In this example, only one video track atom (video) is included as a source track (first video data 101). Therefore, only one track input QT atom ('in') in the meta track atom is included.

FIG. 13 is a diagram showing another description example of a movie atom including a meta track in the embodiment of the present invention. FIG. 14 is a diagram showing a description example of the meta track atom in the movie atom of FIG. 13. In this description example, two video track atoms (video1 and video2) are included as source tracks (first video data 101). Accordingly, as shown in FIG. 14, the meta track atom also includes two track input QT atoms.

FIG. 15 is a diagram showing a description example of the sample description atom ('stsd') of the meta track in the embodiment of the present invention. In this description example, M meta sample description entries are included (M is an integer of 1 or more). The number of meta sample description entries matches the number of feature types. For example, if the features are classified into two types, an expressionless face and a smile, there are two meta sample description entries.

なお、同図において、かっこ内の数字は各フィールドのバイト数を表す。 In the figure, the numbers in parentheses indicate the number of bytes in each field.

メタサンプルディスクリプションエントリは、ＱｕｉｃｋＴｉｍｅにおけるサンプルディスクリプションエントリに対してストリームディスクリプターアトムを拡張追加した構造になっている。サンプルディスクリプションエントリにおけるデータフォーマット（Data Format）フィールドは、本来、エフェクト効果を指定するためのものである。本発明の実施の形態では、このフィールドを拡張のために用いている。これにより、通常のＱｕｉｃｋＴｉｍｅファイルフォーマットとの間で互換性を維持しながら、拡張を施すことができる。 The meta sample description entry has a structure in which a stream descriptor atom is extended and added to the sample description entry in QuickTime. The data format field in the sample description entry is originally for specifying the effect. In the embodiment of the present invention, this field is used for expansion. As a result, the extension can be performed while maintaining compatibility with the normal QuickTime file format.

図１６は、本発明の実施の形態におけるデータフォーマットフィールドの一例を示す図である。この図に示すように、データフォーマットフィールドは、本来、エフェクト効果を指定するためのものである。同図において、アルファベット小文字で示している種別は、ＱｕｉｃｋＴｉｍｅにおいて定義済のエフェクト種別である。例えば、タイプ名'ｂｒｃｏ'は、明るさを示すブライトネス（brightness）と画像における黒色および白色の幅を示すコントラスト（contrast）とを変化させる効果を指定するものである。 FIG. 16 is a diagram showing an example of the data format field in the embodiment of the present invention. As shown in this figure, the data format field is originally for designating an effect. In the figure, the types indicated by lowercase letters are the effect types defined in QuickTime. For example, the type name “brco” designates the effect of changing the brightness indicating brightness and the contrast indicating the width of black and white in the image.

一方、アルファベット大文字で示している種別はＱｕｉｃｋＴｉｍｅにおいて定義されていないエフェクト種別である。本発明の実施の形態では、同図最下欄にあるユーザ定義のメタデータであることを示すタイプ名'ＵＤＥＦ'をこのデータフォーマットフィールドで指定することによって、メタデータとして独自拡張された意味を有することを示している。 On the other hand, the type indicated by the capital letter is an effect type not defined in QuickTime. In the embodiment of the present invention, by specifying the type name “UDEF” indicating user-defined metadata in the bottom column of the figure in this data format field, the meaning uniquely expanded as metadata is provided. It shows that it has.

図１７は、本発明の実施の形態におけるストリームディスクリプターアトム（'ｓｔｒｄ'）の記載例を示す図である。このストリームディスクリプターアトムは、ＱｕｉｃｋＴｉｍｅにおける他のアトム構造と同様に、サイズ（Size）、タイプ（Type）、バージョン（Version）およびフラグ群（Flags）の各フィールドを保持している。 FIG. 17 is a diagram illustrating a description example of the stream descriptor atom ('strd') according to the embodiment of the present invention. This stream descriptor atom holds fields of size (Size), type (Type), version (Version), and flags (Flags), as with other atom structures in QuickTime.

サイズフィールドは、このサイズフィールドを含むストリームディスクリプターアトム全体の大きさを保持するものである。タイプフィールドは、ストリームディスクリプターアトムのタイプ名として'ｓｔｒｄ'を保持するものである。バージョンフィールドおよびフラグ群フィールドは、将来の拡張用に確保されているものであり、ここでは全てゼロが設定されるものとする。 The size field holds the size of the entire stream descriptor atom including this size field. The type field holds “strd” as the type name of the stream descriptor atom. The version field and the flag group field are reserved for future expansion, and all zeros are set here.

As described below, the stream descriptor atom ('strd') further holds three fields: data format (Data Format), user defined meta type (User Defined Meta Type), and parameter flag (Parameter Flag).

データフォーマットフィールドは、図１６により説明したメタサンプルディスクリプションエントリのデータフォーマットフィールドと形式上同じものを保持するフィールドであり、本発明の実施の形態ではタイプ名'ＵＤＥＦ'を示すことになる。 The data format field is a field that holds the same format as the data format field of the metasample description entry described with reference to FIG. 16, and indicates the type name “UDEF” in the embodiment of the present invention.

ユーザデファインドメタタイプフィールドは、図１８に示すように、２バイトのオーナーＩＤ（Owner ID）と２バイトのメタＩＤ（Meta ID）とを保持している。オーナーＩＤは、メーカー毎に割り当てられたＩＤであり、これにより、各メーカーは、メタＩＤによって独自の拡張定義を用いることができるようになる。これらオーナーＩＤおよびメタＩＤは、データフォーマットフィールドがタイプ名'ＵＤＥＦ'を示す場合にのみ有効になるものである。例えば、メタＩＤとして、図１９に示すように、顔の表情として、笑い（smile）、驚き（surprise）、怒り（angry）、眠い（sleepy）といった種別を表現することができる。 As shown in FIG. 18, the user-defined metatype field holds a 2-byte owner ID (Owner ID) and a 2-byte meta ID (Meta ID). The owner ID is an ID assigned to each manufacturer, and thus each manufacturer can use a unique extension definition by using a meta ID. These owner IDs and meta IDs are valid only when the data format field indicates the type name “UDEF”. For example, as shown in FIG. 19, as a meta ID, types such as smile, surprise, angry, and sleepy can be expressed as facial expressions.

The detailed metadata type is defined in two separate fields, the owner ID and the meta ID, so that each manufacturer can manage its own metadata types internally without collisions. Without this separation, names could collide between manufacturers wishing to define new types, or administration such as tracking the order of applications could become cumbersome. Therefore, the data format field on the meta sample description entry side specifies only 'UDEF', indicating the broad category of uniquely defined metadata, and the detailed type is indicated by the combination of the owner ID and meta ID fields, which together identify which manufacturer defined the type and what kind of metadata it is.

The parameter flag field indicates whether or not the effect of the metadata is valid. For example, as shown in FIG. 20, one of its 16 bits is used to indicate whether the effect is "valid" or "invalid". Accordingly, when the parameter flag field in the movie resource ('moov') indicates "invalid", it can be determined that the metadata has no effect without accessing the media data ('mdat') at all, which reduces the processing load.
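The fields of the stream descriptor atom described above can be put together in a minimal sketch. Several points are assumptions rather than statements of the embodiment: the byte widths of the version (1 byte) and flags (3 bytes) fields follow the usual QuickTime convention, the fields are assumed to appear in the order in which they are described, the least significant bit of the parameter flag is assumed to be the "valid" bit, and the owner ID and meta ID values are made up. The names `pack_strd` and `unpack_strd` are hypothetical.

```python
import struct

# size(4) type(4) version(1) flags(3) data format(4) owner ID(2) meta ID(2) parameter flag(2)
STRD_LAYOUT = ">I4sB3s4sHHH"
PARAM_VALID_BIT = 0x0001  # assumption: which of the 16 bits is used is not stated here


def pack_strd(owner_id, meta_id, effect_valid):
    """Build a stream descriptor atom announcing user-defined metadata."""
    size = struct.calcsize(STRD_LAYOUT)  # the size field covers the whole atom
    flag = PARAM_VALID_BIT if effect_valid else 0
    return struct.pack(STRD_LAYOUT, size, b"strd", 0, b"\x00" * 3,
                       b"UDEF", owner_id, meta_id, flag)


def unpack_strd(buf):
    """Decode a stream descriptor atom into a dictionary."""
    size, kind, version, flags, data_format, owner_id, meta_id, param = \
        struct.unpack(STRD_LAYOUT, buf)
    if kind != b"strd":
        raise ValueError("not a stream descriptor atom")
    user_defined = data_format == b"UDEF"
    return {
        "size": size,
        "data_format": data_format.decode("ascii"),
        # Owner ID and meta ID are meaningful only for 'UDEF'.
        "owner_id": owner_id if user_defined else None,
        "meta_id": meta_id if user_defined else None,
        "effect_valid": bool(param & PARAM_VALID_BIT),
    }


atom = pack_strd(owner_id=0x0010, meta_id=0x0001, effect_valid=True)
info = unpack_strd(atom)
```

A reader that finds `effect_valid` false in this atom can skip the meta track's media data entirely, which is the load reduction described above.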

図２１は、本発明の実施の形態におけるメタトラックのメディアデータ（メタデータ１０３）のサンプルの記載例を示す図である。メタトラックのサンプルは、ビデオトラックのサンプルと同様に、サンプル毎にメディアデータアトムに格納される。 FIG. 21 is a diagram illustrating a sample description example of media data (metadata 103) of a meta track according to the embodiment of the present invention. The meta track samples are stored in the media data atom for each sample in the same manner as the video track samples.

ここでは、第２のビデオデータ１０２の対応するサンプルにおいて抽出された顔の数（face_number）と、それぞれの顔についてその表情の度合いを示すことができるようになっている。例えば、笑顔度合い、驚き度合い、怒り度合い、眠さ度合いをそれぞれ示すことができる。 Here, it is possible to indicate the number of faces (face_number) extracted in the corresponding sample of the second video data 102 and the degree of facial expression of each face. For example, the degree of smile, the degree of surprise, the degree of anger, and the degree of sleepiness can be shown.
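One possible shape for such a meta sample is sketched below. The byte layout is purely an assumption made for illustration (a 1-byte face count followed by four 1-byte degrees per face); the actual layout is the one given in FIG. 21. The class and function names are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass
class FaceExpression:
    smile: int       # degree of smile, 0 to 255 in this sketch
    surprise: int    # degree of surprise
    anger: int       # degree of anger
    sleepiness: int  # degree of sleepiness


def encode_sample(faces):
    """Serialize one meta sample: face_number, then the degrees of each face."""
    out = struct.pack(">B", len(faces))
    for face in faces:
        out += struct.pack(">4B", face.smile, face.surprise, face.anger, face.sleepiness)
    return out


def decode_sample(buf):
    """Parse one meta sample back into a list of FaceExpression records."""
    (face_number,) = struct.unpack_from(">B", buf, 0)
    return [FaceExpression(*struct.unpack_from(">4B", buf, 1 + 4 * i))
            for i in range(face_number)]


faces = [FaceExpression(200, 10, 0, 5), FaceExpression(30, 180, 0, 0)]
sample = encode_sample(faces)
```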

図２２は、本発明の実施の形態におけるメタデータを含むファイルの階層構造の一例を示す図である。この例では、ソーストラック（第１のビデオデータ１０１）が１つ（ソーストラック１）だけであることが想定されている。このソーストラック１のムービーリソース（'ｍｏｏｖ'）のトラックヘッダアトム（'ｔｋｈｄ'）には、そのトラックＩＤである「＃１」が保持されている。また、メタトラック（メタデータ１０３）のムービーリソースのトラックヘッダアトムには、そのトラックＩＤである「＃２」が保持されている。 FIG. 22 is a diagram showing an example of a hierarchical structure of a file including metadata in the embodiment of the present invention. In this example, it is assumed that there is only one source track (first video data 101) (source track 1). The track header atom ('tkhd') of the movie resource ('moov') of the source track 1 holds the track ID “# 1”. The track header atom of the movie resource of the meta track (meta data 103) holds “# 2” that is the track ID.

メタトラックのムービーリソースでは、トラックリファレンスアトム（'ｔｒｅｆ'）のトラックリファレンスタイプアトム（'ｓｓｒｃ'）に、ソーストラック１のトラックＩＤ「＃１」が保持されている。 In the meta-track movie resource, the track ID “# 1” of the source track 1 is held in the track reference type atom (“ssrc”) of the track reference atom (“tref”).

In the track input map atom ('imap') of the media atom ('mdia'), 'vide', which represents video media, is set as the input type QT atom, and the name 'srcA' of the source track 1 is set as the data source type QT atom ('dtst').

また、メタトラックのムービーリソースにおいて、サンプルディスクリプションアトム（'ｓｔｓｄ'）のメタＩＤにより、２つの種別「ｍｅｔａ＿ｔｙｐｅ１」および「ｍｅｔａ＿ｔｙｐｅ２」が定義されている。 Also, in the metatrack movie resource, two types “meta_type1” and “meta_type2” are defined by the meta ID of the sample description atom ('stsd').

この例では、メタトラックのメディアデータ（'ｍｄａｔ'）において、メタサンプルが４つ設けられている。ソーストラックは１つだけであり、全て同じソーストラック１を示している。また、メタサンプル＃１および＃３が「ｍｅｔａ＿ｔｙｐｅ１」を示し、メタサンプル＃２および＃４が「ｍｅｔａ＿ｔｙｐｅ２」を示している。 In this example, four meta samples are provided in the media data ('mdat') of the meta track. There is only one source track, and all indicate the same source track 1. Further, meta samples # 1 and # 3 indicate “meta_type1”, and meta samples # 2 and # 4 indicate “meta_type2”.

FIG. 23 is a diagram showing the relationship between the source track (first video data 101) and the meta track (metadata 103) in the example of FIG. 22. In this example there is only one source track, and every meta sample targets source track 1 ('srcA').

When the sections marked "meta_type2" are extracted as specific sections, only those sections are extracted from the source track, and all other sections are treated as unnecessary.

FIG. 24 is a diagram showing another example of the hierarchical structure of a file that includes metadata in the embodiment of the present invention. This example assumes two source tracks (first video data 101), source track 1 and source track 2. The track header atom ('tkhd') in the movie resource ('moov') of source track 1 holds its track ID, "#1". The track header atom ('tkhd') in the movie resource of source track 2 holds its track ID, "#2". The track header atom ('tkhd') in the movie resource of the meta track (metadata 103) holds its track ID, "#3".

In the movie resource of the meta track, the track reference type atom ('ssrc') of the track reference atom ('tref') holds both the track ID "#1" of source track 1 and the track ID "#2" of source track 2.

The track input map atom ('imap') of the media atom ('mdia') contains two track input QT atoms (' in'). In the first, the input type QT atom (' ty') is set to 'vide', which denotes video media, and the data source type QT atom ('dtst') is set to 'srcA', the name of source track 1. In the second, the input type QT atom is likewise set to 'vide', and the data source type QT atom is set to 'srcB', the name of source track 2.

In this example, the media data ('mdat') of the meta track contains four meta samples. There are two source tracks: meta samples #1 and #2 refer to 'srcA', and meta samples #3 and #4 refer to 'srcB'. Meta samples #1 and #3 indicate "meta_type1", and meta samples #2 and #4 indicate "meta_type2".

FIG. 25 is a diagram showing the relationship between the source tracks (first video data 101) and the meta track (metadata 103) in the example of FIG. 24. In this example there are two source tracks: meta samples #1 and #2 refer to 'srcA', and meta samples #3 and #4 refer to 'srcB'.
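The references in the two-source example can be modelled with a small data structure. The following Python sketch is illustrative only: the `MetaSample` class, the `SOURCE_TRACK_IDS` dictionary, and the two helper functions are hypothetical names and are not part of the QuickTime format or of the embodiment. Only the track IDs, source names, and meta types are taken from the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaSample:
    number: int       # meta sample number within the meta track's 'mdat'
    source_name: str  # data source name set in the 'imap' atom, e.g. 'srcA'
    meta_type: str    # type defined through the meta ID in 'stsd'


# Track IDs from the example: source tracks #1 and #2, meta track #3.
SOURCE_TRACK_IDS = {"srcA": 1, "srcB": 2}
META_TRACK_ID = 3

META_SAMPLES = [
    MetaSample(1, "srcA", "meta_type1"),
    MetaSample(2, "srcA", "meta_type2"),
    MetaSample(3, "srcB", "meta_type1"),
    MetaSample(4, "srcB", "meta_type2"),
]


def samples_by_source(samples):
    """Group meta sample numbers by the track ID of the source they refer to."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[SOURCE_TRACK_IDS[sample.source_name]].append(sample.number)
    return dict(grouped)


def samples_of_type(samples, meta_type):
    """Return the numbers of the meta samples that carry the given type."""
    return [s.number for s in samples if s.meta_type == meta_type]
```

Keeping the source reference and the meta type on each sample is one way to let a single meta track describe several source tracks, as the example does.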

FIG. 26 is a diagram showing an example of the relationship between the source track (first video data 101), the meta track (metadata 103), and the edit track (edited video data 105) in the embodiment of the present invention. In this example, the source track starts at time 0 and ends at time ts6. The meta track is synchronized with the source track and indicates that a smile was detected from time ts1 to time ts2, that a surprised face was detected from time ts3 to time ts4, and that a smile was detected from time ts5 to time ts6.

In a movie file that has not been edited, the time axis of the source track time usually corresponds one-to-one with the time axis of the media time. It is therefore assumed here that ts1 = tm1, ts2 = tm2, ts3 = tm3, ts4 = tm4, ts5 = tm5, and ts6 = tm6.

Suppose the source track is edited with the extraction condition that the meta track indicates a smile or a surprised face. The output is the edit track shown in the figure. The part of the source track from ts1 to ts2 in source track time becomes the section from 0 to te1 in edit track time, the part from ts3 to ts4 becomes the section from te1 to te2, and the part from ts5 to ts6 becomes the section from te2 to te3.

FIG. 27 is a diagram showing a description example of the edit atom ('edts') in the QuickTime file format. Like the stream descriptor atom ('strd') described with reference to FIG. 17, the edit atom holds size, type, version, and flags fields. Its type field holds 'edts'.

The edit atom also holds an edit list atom ('elst'). Like the edit atom, the edit list atom holds size, type, version, and flags fields, and its type field holds 'elst'. The edit list atom further contains N edit list entries (Edit List Entry), where N is an integer of 1 or more, together with their count (Number of Entries).

Each edit list entry has a segment duration (Segment duration), a media time (Media time), and a media rate (Media rate).

FIG. 28 schematically shows the edit atom ('edts'). As shown in FIG. 28(a), the edit atom 680 consists of a size 681 indicating the size of the edit atom 680, a type 682 indicating that it is an edit atom, a version 683 of the edit atom, an unused flag group 694, and an edit list atom 690.

The edit list atom 690 consists of a size 691 indicating the size of the edit list atom 690, a type 692 indicating that it is an edit list atom, a version 693 of the edit list atom, an unused flag group 694, an edit list table 696, and an entry count 695 giving the number of entries in the edit list table 696.

The edit list table 696 consists of the number of entries given by the entry count 695. As shown in FIG. 28(b), each entry consists of a segment duration 697, a media time 698, and a media rate 699. The segment duration 697 is the length of the corresponding edit unit. The media time 698 is the start time of that edit unit within the media data atom; a value of "-1" means that the edit unit does not exist in the media data atom. The media rate 699 is the time ratio applied during playback, and it is "1.0" when the playback time is the same on the time axis of the media data atom and on the time axis after editing.
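The edit list atom layout described above can be sketched as a pair of pack and unpack functions. This is a minimal sketch under assumptions the text does not state: big-endian byte order, a 32-bit size, a 1-byte version, a 3-byte flag group, a 32-bit entry count, and three 32-bit fields per entry with the media rate stored as 16.16 fixed point. The function names `pack_elst` and `unpack_elst` are hypothetical.

```python
import struct

NO_MEDIA = -1  # media time of -1: the edit unit is not in the media data atom


def pack_elst(entries):
    """Serialize (segment_duration, media_time, media_rate) tuples as 'elst'."""
    # Version (1 byte), unused flag group (3 bytes), number of entries (4 bytes).
    body = struct.pack(">B3sI", 0, b"\x00\x00\x00", len(entries))
    for duration, media_time, rate in entries:
        # Assumed 16.16 fixed point, so a media rate of 1.0 becomes 0x00010000.
        body += struct.pack(">iii", duration, media_time, int(round(rate * 65536)))
    size = 8 + len(body)  # the size field also counts the size and type fields
    return struct.pack(">I4s", size, b"elst") + body


def unpack_elst(data):
    """Parse an 'elst' atom back into a list of entry tuples."""
    size, atom_type, _version, _flags, count = struct.unpack_from(">I4sB3sI", data, 0)
    if atom_type != b"elst" or size != len(data):
        raise ValueError("not a well-formed 'elst' atom")
    entries = []
    offset = 16  # 4 (size) + 4 (type) + 1 (version) + 3 (flags) + 4 (count)
    for _ in range(count):
        duration, media_time, raw_rate = struct.unpack_from(">iii", data, offset)
        entries.append((duration, media_time, raw_rate / 65536))
        offset += 12
    return entries
```

Under these assumed widths an atom with one entry occupies 28 bytes: 16 bytes of header fields plus 12 bytes for the entry.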

FIG. 29 is a diagram showing an example of the contents of the edit list atom ('elst') in the example of FIG. 26. FIG. 29(a) shows the contents of the edit list atom before editing. Before editing there is only one entry. Its segment duration is ts6, the full duration; its media time is 0, the start time; and its media rate 699 is "1.0".

FIG. 29(b) shows the contents of the edit list atom after editing. After editing, three entries have been generated.

In the first entry, the segment duration is te1 (= tm2 - tm1), the length of the first smile. The media time is tm1, the start time of the first smile. The media rate 699 is "1.0".

In the second entry, the segment duration is te2 - te1 (= tm4 - tm3), the length of the surprised face. The media time is tm3, the start time of the surprised face. The media rate 699 is "1.0".

In the third entry, the segment duration is te3 - te2 (= tm6 - tm5), the length of the second smile. The media time is tm5, the start time of the second smile. The media rate 699 is "1.0".

FIG. 30 is a diagram showing another example of the relationship between the source track (first video data 101), the meta track (metadata 103), and the edit track (edited video data 105) in the embodiment of the present invention. In this example, the source track and the meta track have the same relationship as in the example of FIG. 26.

Suppose the source track is edited with the extraction condition that the meta track indicates a smile. The output is the edit track shown in the figure. The part of the source track from ts1 to ts2 in source track time becomes the section from 0 to te1 in edit track time, and the part from ts5 to ts6 becomes the section from te1 to te2.

FIG. 31 is a diagram showing an example of the contents of the edit list atom ('elst') in the example of FIG. 30. The figure shows the contents of the edit list atom after editing. After editing, two entries have been generated.

In the second entry, the segment duration is te2 - te1 (= tm6 - tm5), the length of the second smile. The media time is tm5, the start time of the second smile. The media rate 699 is "1.0".
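The arithmetic behind these edit list entries can be sketched as follows. The code is an illustration under assumptions: the numeric times standing in for tm1 to tm6 are invented, and `Section`, `EditEntry`, `build_edit_list`, and `edit_track_boundaries` are hypothetical names rather than anything defined by the embodiment.

```python
from typing import NamedTuple


class Section(NamedTuple):
    start: int    # media time at which the feature begins
    end: int      # media time at which the feature ends
    feature: str  # feature recorded in the meta track


class EditEntry(NamedTuple):
    segment_duration: int
    media_time: int
    media_rate: float


# Invented values standing in for tm1..tm6 = 10, 25, 40, 50, 70, 90.
SECTIONS = [
    Section(10, 25, "smile"),
    Section(40, 50, "surprise"),
    Section(70, 90, "smile"),
]


def build_edit_list(sections, wanted):
    """One entry per matching section: duration = end - start, rate 1.0."""
    return [
        EditEntry(s.end - s.start, s.start, 1.0)
        for s in sorted(sections)
        if s.feature in wanted
    ]


def edit_track_boundaries(entries):
    """Cumulative section end times (te1, te2, ...) on the edit track."""
    boundaries, elapsed = [], 0
    for entry in entries:
        elapsed += entry.segment_duration
        boundaries.append(elapsed)
    return boundaries
```

With these invented times, the condition "smile or surprised face" yields three entries and the condition "smile" yields two. In both cases the edit track boundaries are the running sums of the segment durations, which is why each duration equals the difference between consecutive te values.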

The edit list atom (edit list 104) generated in this way is supplied to the video extraction unit 123, which generates the edit track (edited video data 105) from the source track (first video data 101). This realizes non-destructive editing, which leaves the original source track intact.

Next, the operation of the video editing system in the embodiment of the present invention is described with reference to the drawings.

FIG. 32 is a diagram showing an example of the processing procedure performed by the video processing apparatus 110 in the embodiment of the present invention. First, the video acquisition unit 111 acquires the video of the second video data 102 (step S911). The second video data 102 is synchronized with the first video data 101. It may be captured at the same time as the first video data 101 is captured, as in the example of FIG. 2; captured at the same time as the first video data 101 is played back, as in the example of FIG. 4; or identical in content to the first video data 101, as in the example of FIG. 6.

Once the video has been acquired in step S911, the video analysis unit 112 analyzes the feature amount in the acquired second video data 102 (step S912). For example, a face image contained in the second video data 102 is extracted as the feature amount, and the facial expression is determined.

The metadata generation unit 113 then generates the metadata 103, which manages the feature amount analyzed in step S912 along the time axis (step S913). The generated metadata 103 holds the feature amount of the second video data 102 in correspondence with each time of the first video data 101.

Steps S911 to S913 are repeated until all of the video of the second video data 102 has been processed (step S914).
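The loop of steps S911 to S914 can be sketched as below. This is a simplified illustration, not the apparatus itself: the `analyze` callback stands in for the video analysis unit and is assumed to return a feature label or `None`, and the `MetaEntry` class, the frame representation, and the merging of consecutive frames into one section are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class MetaEntry:
    start: int    # start time on the time axis of the first video data
    end: int      # end time on the same time axis
    feature: str  # feature found in the second video data


def generate_metadata(
    frames: Iterable[Tuple[int, object]],
    analyze: Callable[[object], Optional[str]],
    frame_duration: int,
) -> List[MetaEntry]:
    """Build time-managed metadata from (timestamp, frame) pairs."""
    metadata: List[MetaEntry] = []
    for timestamp, frame in frames:          # S911: acquire the next frame
        feature = analyze(frame)             # S912: analyze the feature amount
        if feature is None:
            continue
        last = metadata[-1] if metadata else None
        if last and last.feature == feature and last.end == timestamp:
            # Same feature in the adjacent frame: extend the running section.
            last.end = timestamp + frame_duration
        else:
            # S913: record a new section for this feature.
            metadata.append(MetaEntry(timestamp, timestamp + frame_duration, feature))
    return metadata                          # S914: all frames processed
```

Because the timestamps come from the time axis shared with the first video data, the resulting sections can later be applied to the first video data directly.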

FIG. 33 is a diagram showing an example of the processing procedure performed by the video editing apparatus 120 in the embodiment of the present invention. First, the extraction condition receiving unit 121 receives the input of an extraction condition (step S921). The metadata 103 is then searched for positions that satisfy the extraction condition (step S930). The positions on the time series that match the extraction condition are obtained and held as the edit list 104.

Based on the edit list 104, the video extraction unit 123 extracts from the first video data 101 the portions corresponding to the positions on the time series that match the extraction condition, and outputs them as the edited video data 105 (step S923).
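The extraction step can be illustrated with a short sketch. It assumes, purely for illustration, that the source is a list with one frame per unit of media time and that every media rate is 1.0; the function name `extract` is hypothetical, and skipping entries whose media time is -1 is a simplification.

```python
def extract(source_frames, edit_list):
    """Assemble edited video from (segment_duration, media_time, media_rate) entries."""
    edited = []
    for segment_duration, media_time, media_rate in edit_list:
        if media_time == -1:
            continue  # the edit unit has no data in the media data atom
        if media_rate != 1.0:
            raise NotImplementedError("this sketch handles only a media rate of 1.0")
        # Slicing copies references; the source list itself is never modified.
        edited.extend(source_frames[media_time:media_time + segment_duration])
    return edited
```

The source list is only read, never changed, which mirrors the non-destructive character of editing through an edit list.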

FIG. 34 is a diagram showing an example of the processing procedure in the position search process (step S930) of FIG. 33. First, the movie resource of the metadata (the meta track of FIG. 11) is acquired (step S931). Then the parameter flag (see FIG. 20) contained in the sample description atom ('stsd') of this movie resource is used to determine whether the effect is valid, that is, whether the recorded feature amount is valid (step S932).

If the flag is judged "valid" in step S932, the corresponding media data sample of the metadata is acquired (step S933). If the feature amount indicated by that sample matches the extraction condition (step S934), the corresponding section (segment) is registered as an entry of the edit list atom (edit list 104) (step S935).

If the flag is judged "invalid" in step S932, the media data of the metadata is not acquired, and processing for that sample ends.

Steps S931 to S935 are repeated until all samples of the metadata 103 have been processed (step S936).
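The position search of steps S931 to S936 can be sketched as a loop with an early validity check. This is a simplified model: the `valid` attribute stands in for the parameter flag read from the sample description atom and is attached to each sample here for brevity, and `MetaSample` and `search_positions` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class MetaSample:
    start: int
    end: int
    feature: str
    valid: bool = True  # stands in for the parameter flag in 'stsd'


def search_positions(samples: Iterable[MetaSample], condition: Set[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) sections whose feature satisfies the condition."""
    sections = []
    for sample in samples:                  # S931, repeated until S936 is satisfied
        if not sample.valid:                # S932: recorded feature amount is invalid
            continue                        # the sample payload is never examined
        if sample.feature in condition:     # S933 and S934: read and compare
            sections.append((sample.start, sample.end))  # S935: register the section
    return sections
```

Checking the flag before reading the sample means that invalid metadata costs nothing beyond the management information, which is the point of the "invalid" branch.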

As described above, according to the embodiment of the present invention, the video processing apparatus 110 generates the metadata 103 from the second video data 102, which is synchronized with the first video data 101. The metadata 103 is synchronized with the first video data 101 and holds the feature amount of the second video data 102 while managing it along the time axis. Further, the position search unit 122 of the video editing apparatus 120 searches the metadata 103 for positions that match the extraction condition and generates the edit list 104. The video extraction unit 123 extracts video from the first video data 101 in accordance with the edit list 104 and generates the edited video data 105. In other words, the first video data 101 can be edited non-destructively with the metadata 103 as an intermediate state.

The embodiment of the present invention is one example of embodying the present invention and, as shown below, corresponds to the invention-specifying matters in the claims. The present invention is not limited to this embodiment, however, and various modifications can be made without departing from the gist of the present invention.

That is, in claim 1, the video acquisition means corresponds to, for example, the video acquisition unit 111, the imaging units 211 and 216, or the video input unit 206. The video analysis means corresponds to, for example, the video analysis unit 112 or 217. The metadata generation means corresponds to, for example, the metadata generation unit 113 or the file generation unit 214.

In claim 3, the first imaging means corresponds to, for example, the imaging unit 211. The second imaging means corresponds to, for example, the imaging unit 216.

In claim 4, the reproduction means corresponds to, for example, the reproduction unit 220. The imaging means corresponds to, for example, the imaging unit 211.

In claim 5, the reproduction means corresponds to, for example, the reproduction unit 220. The video input means corresponds to, for example, the video input unit 206.

In claim 7, the position search means corresponds to, for example, the position search unit 122. The video extraction means corresponds to, for example, the video extraction unit 123.

In claim 9, the video acquisition means corresponds to, for example, the video acquisition unit 111, the imaging units 211 and 216, or the video input unit 206. The video analysis means corresponds to, for example, the video analysis unit 112 or 217. The metadata generation means corresponds to, for example, the metadata generation unit 113 or the file generation unit 214. The position search means corresponds to, for example, the position search unit 122. The video extraction means corresponds to, for example, the video extraction unit 123.

In claims 10 and 11, the video acquisition procedure corresponds to, for example, step S911. The video analysis procedure corresponds to, for example, step S912. The metadata generation procedure corresponds to, for example, step S913.

In claims 12 and 13, the metadata acquisition procedure corresponds to, for example, step S933. The position search procedure corresponds to, for example, step S934. The video extraction procedure corresponds to, for example, step S923.

In claim 14, the first video data corresponds to, for example, the first video data 101. The second video data corresponds to, for example, the second video data 102. The metadata corresponds to, for example, the metadata 103.

The processing procedures described in the embodiment of the present invention may be regarded as a method comprising the series of procedures, as a program for causing a computer to execute the series of procedures, or as a recording medium storing that program.

FIG. 1 is a diagram showing a configuration example of the video editing system 100 in the embodiment of the present invention.
FIG. 2 is a diagram showing a first configuration example of the video processing apparatus 110 in the embodiment of the present invention.
FIG. 3 is a diagram showing a usage mode of the first configuration example of the video processing apparatus 110 in the embodiment of the present invention.
FIG. 4 is a diagram showing a second configuration example of the video processing apparatus 110 in the embodiment of the present invention.
FIG. 5 is a diagram showing a usage mode of the second configuration example of the video processing apparatus 110 in the embodiment of the present invention.
FIG. 6 is a diagram showing a third configuration example of the video processing apparatus 110 in the embodiment of the present invention.
FIG. 7 is a diagram showing a configuration example of a camera-integrated imaging apparatus that is one example of the video processing apparatus 110 in the embodiment of the present invention.
FIG. 8 is a diagram showing a structure example of the QuickTime base file format.
FIG. 9 is a diagram showing the hierarchical structure in the QuickTime file format.
FIG. 10 is a diagram showing an example of the storage format of a file in the embodiment of the present invention.
FIG. 11 is a diagram showing an example of the hierarchical structure of the meta track in the embodiment of the present invention.
FIG. 12 is a diagram showing a description example of a movie atom including a meta track in the embodiment of the present invention.
FIG. 13 is a diagram showing another description example of a movie atom including a meta track in the embodiment of the present invention.
FIG. 14 is a diagram showing a description example of the meta track atom in the movie atom of FIG. 13.
FIG. 15 is a diagram showing a description example of the sample description atom ('stsd') of the meta track in the embodiment of the present invention.
FIG. 16 is a diagram showing an example of the data format field in the embodiment of the present invention.
FIG. 17 is a diagram showing a description example of the stream descriptor atom ('strd') in the embodiment of the present invention.
FIG. 18 is a diagram showing a description example of the user-defined meta type field in the embodiment of the present invention.
FIG. 19 is a diagram showing a bit field configuration example of the meta ID in the embodiment of the present invention.
FIG. 20 is a diagram showing a bit field configuration example of the parameter flag field in the embodiment of the present invention.
FIG. 21 is a diagram showing a description example of a media data sample of the meta track in the embodiment of the present invention.
FIG. 22 is a diagram showing an example of the hierarchical structure of a file including metadata in the embodiment of the present invention.
FIG. 23 is a diagram showing the relationship between the source track and the meta track in the example of FIG. 22.
FIG. 24 is a diagram showing another example of the hierarchical structure of a file including metadata in the embodiment of the present invention.
FIG. 25 is a diagram showing the relationship between the source tracks and the meta track in the example of FIG. 24.
FIG. 26 is a diagram showing an example of the relationship between the source track, the meta track, and the edit track in the embodiment of the present invention.
FIG. 27 is a diagram showing a description example of the edit atom ('edts') in the QuickTime file format.
FIG. 28 is a diagram schematically showing the edit atom in the QuickTime file format.
FIG. 29 is a diagram showing an example of the contents of the edit list atom ('elst') in the example of FIG. 26.
FIG. 30 is a diagram showing another example of the relationship between the source track, the meta track, and the edit track in the embodiment of the present invention.
FIG. 31 is a diagram showing an example of the contents of the edit list atom in the example of FIG. 30.
FIG. 32 is a diagram showing an example of the processing procedure performed by the video processing apparatus 110 in the embodiment of the present invention.
FIG. 33 is a diagram showing an example of the processing procedure performed by the video editing apparatus 120 in the embodiment of the present invention.
FIG. 34 is a diagram showing an example of the processing procedure in the position search process (step S930) of FIG. 33.

Explanation of symbols

100 Video editing system
101 First video data
102 Second video data
103 Metadata
104 Edit list
105 Edited video data
110 Video processing apparatus
111 Video acquisition unit
112 Video analysis unit
113 Metadata generation unit
120 Video editing apparatus
121 Extraction condition receiving unit
122 Position search unit
123 Video extraction unit
206 Video input unit
210 Recording unit
211 Imaging unit
212 Video processing unit
213 Video compression unit
214 File generation unit
215 Writing unit
216 Imaging unit
217 Video analysis unit
218 Recording control unit
219 Recording medium
220 Reproduction unit
221 Display unit
222 Video processing unit
223 Video decompression unit
224 File decoding unit
225 Reading unit
229 Recording medium
301 Imaging unit
309 Recording medium
311 Zoom lens
312 Iris
313 Focus lens
314 Filter
321 Image sensor
322 A/D converter
323 Camera signal processing circuit
324 Detection unit
325 Zoom control unit
326 Angular velocity sensor
329 Camera control unit
330 Video processing unit
341 Video compression unit
342 Compression control unit
351 Recording medium access unit
352 Drive control unit
360 Operation receiving unit
370 Display unit
390 System control unit
501 Subject
502 Photographer
503 to 505 Viewers
510 Network
520, 531 Video camera devices
521, 522, 541 Cameras
530 Television device
540 Computer device

Claims

A video processing apparatus comprising:
video acquisition means for acquiring second video data synchronized with first video data managed in time series;
video analysis means for analyzing a feature amount in the second video data; and
metadata generation means for generating metadata that holds the feature amount in synchronization with the first video data.

The video processing apparatus according to claim 1, wherein the feature amount is a facial expression included in the second video data.

The video processing apparatus according to claim 1, further comprising first imaging means for capturing the first video data,
wherein the video acquisition means includes second imaging means for capturing the second video data simultaneously with the capture of the first video data.

The video processing apparatus according to claim 1, further comprising reproduction means for reproducing the first video data,
wherein the video acquisition means includes imaging means for capturing the second video data simultaneously with the reproduction of the first video data by the reproduction means.

The video processing apparatus according to claim 1, further comprising reproduction means for reproducing the first video data,
wherein the video acquisition means includes video input means for inputting, as the second video data, the first video data reproduced by the reproduction means.

The video processing apparatus according to claim 1, wherein the first video data and the metadata are recorded in the form of media data atoms in the QuickTime file format.
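
As a non-limiting illustration of the atom form referred to above, the following Python sketch builds and parses a basic QuickTime atom, which consists of a 32-bit big-endian size that includes the 8-byte header, a 4-byte type code, and the payload. The type code "mdat" denotes a media data atom. The helper names make_atom and parse_atom are hypothetical, and the sketch makes no assumption about how the apparatus lays out video samples or metadata samples inside the payload.

```python
import struct
from typing import Tuple


def make_atom(atom_type: bytes, payload: bytes) -> bytes:
    """Build a basic atom: 32-bit big-endian size (header included), 4-byte type, payload."""
    if len(atom_type) != 4:
        raise ValueError("atom type must be exactly 4 bytes")
    return struct.pack(">I4s", 8 + len(payload), atom_type) + payload


def parse_atom(data: bytes) -> Tuple[bytes, bytes]:
    """Return (type, payload) of the atom that starts at the beginning of data."""
    size, atom_type = struct.unpack(">I4s", data[:8])
    return atom_type, data[8:size]


# Placeholder payload bytes; real media data would go here.
mdat = make_atom(b"mdat", b"\x01\x02\x03")
```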

A video editing apparatus comprising:
position search means for acquiring metadata that holds a feature amount in second video data synchronized with first video data managed in time series, searching the metadata for a position on the time series that matches a predetermined condition, and generating the result as search information; and
video extraction means for extracting, from the first video data, a portion corresponding to the matching position on the time series based on the search information.

The video editing apparatus according to claim 7, wherein the position search means acquires management information for managing the metadata, and does not acquire the metadata when the management information indicates that the feature amount is not held in the metadata.
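
As a non-limiting illustration of the position search recited above, the following Python sketch first consults the management information and reads the metadata only when a feature amount is indicated as held. It then merges consecutive matching samples into intervals on the time base. The names search_positions, has_feature, load_metadata, and sample_period are hypothetical, as is the representation of the search information as a list of (start, end) pairs.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[float, float]    # (time in seconds, feature amount)
Interval = Tuple[float, float]  # (start, end) on the time base


def search_positions(
    management_info: Dict[str, bool],
    load_metadata: Callable[[], List[Sample]],
    condition: Callable[[float], bool],
    sample_period: float,
) -> List[Interval]:
    # Do not acquire the metadata when no feature amount is held in it.
    if not management_info.get("has_feature", False):
        return []
    intervals: List[Interval] = []
    start: Optional[float] = None
    end = 0.0
    for t, feature in load_metadata():
        if condition(feature):
            if start is None:
                start = t            # a matching run begins here
            end = t + sample_period  # extend the run to cover this sample
        elif start is not None:
            intervals.append((start, end))
            start = None
    if start is not None:
        intervals.append((start, end))  # close a run that reaches the last sample
    return intervals


calls = []  # records each metadata read, to show when the read is skipped


def load() -> List[Sample]:
    calls.append(1)
    return [(0.0, 0.1), (0.5, 0.8), (1.0, 0.9), (1.5, 0.2), (2.0, 0.7)]


edit_list = search_positions({"has_feature": True}, load, lambda f: f >= 0.6, 0.5)
```

Checking the management information first avoids the cost of reading metadata that cannot contain a match.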

A video editing system comprising:
video acquisition means for acquiring second video data synchronized with first video data managed in time series;
video analysis means for analyzing a feature amount in the second video data;
metadata generation means for generating metadata that holds the feature amount in synchronization with the first video data;
position search means for searching the metadata for a position on the time series that matches a predetermined condition and generating the result as search information; and
video extraction means for extracting, from the first video data, a portion corresponding to the matching position on the time series based on the search information.

A video processing method comprising:
a video acquisition procedure for acquiring second video data synchronized with first video data managed in time series;
a video analysis procedure for analyzing a feature amount in the second video data; and
a metadata generation procedure for generating metadata that holds the feature amount in synchronization with the first video data.

A program that causes a computer to execute:
a video acquisition procedure for acquiring second video data synchronized with first video data managed in time series;
a video analysis procedure for analyzing a feature amount in the second video data; and
a metadata generation procedure for generating metadata that holds the feature amount in synchronization with the first video data.

A video editing method comprising:
a metadata acquisition procedure for acquiring metadata that holds a feature amount in second video data synchronized with first video data managed in time series;
a position search procedure for searching the metadata for a position on the time series that matches a predetermined condition and generating the result as search information; and
a video extraction procedure for extracting, from the first video data, a portion corresponding to the matching position on the time series based on the search information.
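
As a non-limiting illustration of the video extraction procedure recited above, the following Python sketch keeps only those frames of the first video data whose time falls within an interval of the search information. The names extract_portions, Frame, and Interval are hypothetical, the first video data is assumed to be a list of (time, payload) pairs, and each interval is treated as half-open, which is a choice made for this sketch only.

```python
from typing import List, Tuple

Frame = Tuple[float, str]       # (time in seconds, frame payload)
Interval = Tuple[float, float]  # half-open [start, end) on the time base


def extract_portions(first_video: List[Frame], edit_list: List[Interval]) -> List[Frame]:
    """Keep frames of the first video whose time falls in any edit-list interval."""
    return [
        (t, payload)
        for t, payload in first_video
        if any(start <= t < end for start, end in edit_list)
    ]


# Six placeholder frames at 0.0, 0.5, ..., 2.5 seconds.
first_video = [(i * 0.5, f"frame{i}") for i in range(6)]
edited = extract_portions(first_video, [(0.5, 1.5), (2.0, 2.5)])
```

The first video data is only read, never modified, so the extraction leaves the original recording intact.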

A program that causes a computer to execute:
a metadata acquisition procedure for acquiring metadata that holds a feature amount in second video data synchronized with first video data managed in time series;
a position search procedure for searching the metadata for a position on the time series that matches a predetermined condition and generating the result as search information; and
a video extraction procedure for extracting, from the first video data, a portion corresponding to the matching position on the time series based on the search information.

A data structure comprising:
first video data managed in time series; and
metadata that holds, in synchronization with the first video data, a feature amount in second video data synchronized with the first video data,
wherein a computer acquires the metadata, searches the metadata for a position on the time series that matches a predetermined condition, generates the result as search information, and extracts, from the first video data, a portion corresponding to the matching position on the time series based on the search information.