JP2007072789A - Image structuring method, device, and program

Info

Publication number: JP2007072789A
Application number: JP2005259457A
Authority: JP
Inventors: Toshikazu Karitsuka; 俊和狩塚; Satoshi Shimada; 聡嶌田; Masashi Morimoto; 正志森本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-09-07
Filing date: 2005-09-07
Publication date: 2007-03-22
Anticipated expiration: 2025-09-07
Also published as: JP4606278B2

Abstract

PROBLEM TO BE SOLVED: To provide a video structuring method and the like capable of dividing personal video that is highly compressed, of low quality, and affected by camera shake into six types of segments, defined by the combination of camera work and the presence or absence of a moving object in the real environment, at a low calculation cost.

SOLUTION: In this video structuring method, a video to be analyzed is read, motion vectors between frames of the input video are calculated, a feature quantity for determining the segment type is calculated from the motion vectors, and the segment type of each frame is identified using the feature quantity. The input video is divided based on the per-frame segment types, and the divided video and the segment information are stored in a storage means.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to video structuring and indexing technology for improving the manageability of video content, and in particular to a video structuring method, apparatus, and program that use motion information to divide unedited video material content shot by a photographer with a camera-equipped mobile phone, digital camera, digital video camera, or the like (hereinafter, video shot by an individual is called "personal video").

Conventional techniques for dividing video into segments fall broadly into two groups: methods that divide using the video production process, that is, editing information, and methods that group scenes with similar video content, such as color or motion, into the same segment.

A representative technique for dividing video according to the video editor's intention detects, in the video content, cuts in the edited video, sections in which telops (captions) are displayed, sections in which camera work occurs, sections containing music, and sections containing speech, and divides the video so that each section in which such an event occurs becomes a segment. Because it relies on editing events, this method works well particularly for edited video content (see, for example, Patent Document 1).

A representative conventional technique for classifying scenes with similar video content into the same segment detects camera work using MPEG motion compensation vectors and divides the video accordingly (see, for example, Non-Patent Document 1).

There is also a method that focuses on the amount of change in the color information of the video and divides the video into steady states and transitional states (see, for example, Non-Patent Document 2).

Patent Document 1: JP-A-11-224266
Non-Patent Document 1: Kentaro Dobashi, Ryoyuki Kodate, Ryuji Nishitoba, Hideyoshi Tominaga, "Examination of Camerawork Detection Using MPEG2 Motion Vector", IEICE, Image Coding Symposium PCSJ2000 (2000)
Non-Patent Document 2: Fujio Tsutsumi, "Video Segmentation Method for Wearable Visual Memory Aid", WISS2001, pp. 155-160 (2001)

However, with the above conventional techniques, when unedited video material content shot by a photographer with a camera-equipped mobile phone, digital camera, digital video camera, or the like is to be divided into segments, there are no cut points or telops in the first place, so the video cannot be divided into segments using editing information.

Methods that use camera work or color information to group scenes with similar video content into the same segment can divide the video by motion such as camera work, but they do not consider the motion of the subject, which is the most important information in the video, and so cannot classify the video based on subject motion.

Moreover, attending to subject motion in the video requires extracting the subject from the background, which generally incurs a high computational cost. It is also difficult to separate the background and the subject in video that has camera shake or unstable camera work, such as video shot without a tripod.

The present invention has been made in view of the above points. Its object is to provide a video structuring method, apparatus, and program that can divide personal video that is highly compressed, of low quality, and affected by camera shake into six types of segments, defined by the combination of camera work and the presence or absence of a moving object in the real environment, while keeping the calculation cost low.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a video structuring method that defines segment types in advance based on camera work and the presence or absence of a moving object in the video, divides video content into segments, and labels each segment with its segment type, the method performing:
a video reading step (step 100) of reading the video to be analyzed and temporarily storing it in storage means;
a motion vector calculation step (step 101) of reading the input video from the storage means and calculating motion vectors between frames of the input video;
a feature quantity calculation step (step 102) of calculating, from the calculated motion vectors, a feature quantity for determining the segment type;
a segment type identification step (step 103) of identifying the segment type of each frame using the feature quantity; and
a video division step of dividing the input video based on the per-frame segment types (step 104) and storing the divided video and segment information in the storage means (step 105).

The present invention (claim 2) is the video structuring method of claim 1, wherein,
in the feature quantity calculation step (step 102),
the frame image of the input video is divided into predetermined regions, and the feature quantity is calculated for each divided region.

The present invention (claim 3) is the video structuring method of claim 2, wherein,
in the feature quantity calculation step (step 102),
one or more of the mean, the variance (standard deviation), the change in the mean, and the change in the variance (standard deviation) of the magnitude and direction of the motion vectors are used in combination as the feature quantity for identification.

FIG. 2 is a diagram of the principle configuration of the present invention.

The present invention (claim 4) is a video structuring apparatus that defines segment types in advance based on camera work and the presence or absence of a moving object in the video, divides video content into segments, and labels each segment with its segment type, the apparatus comprising:
video temporary storage means 2 for temporarily holding the read video data;
video/segment information storage means 7 for storing the divided video and segment information;
video reading means 1 for reading the video to be analyzed and temporarily storing it in the video temporary storage means 2;
motion vector calculation means 3 for reading the input video from the video temporary storage means 2 and calculating motion vectors between frames of the input video;
feature quantity calculation means 4 for calculating, from the calculated motion vectors, a feature quantity for determining the segment type;
segment type identification means 5 for identifying the segment type of each frame using the feature quantity; and
video division means 6 for dividing the input video based on the per-frame segment types and storing the divided video and segment information in the video/segment information storage means 7.

The present invention (claim 5) is the video structuring apparatus of claim 4, wherein the feature quantity calculation means 4
includes means for dividing the frame image of the input video into predetermined regions and calculating the feature quantity for each divided region.

The present invention (claim 6) is the video structuring apparatus of claim 5, wherein the feature quantity calculation means 4
includes means for using, in combination as the feature quantity for identification, one or more of the mean, the variance (standard deviation), the change in the mean, and the change in the variance (standard deviation) of the magnitude and direction of the motion vectors.

The present invention (claim 7) is a program that causes a computer
to function as the video structuring apparatus according to any one of claims 4 to 6.

As described above, according to the present invention, personal video, which is generally cumbersome to manage, is divided according to camera work and the presence or absence of a moving object, which makes it possible to index the video by motion information, the most important information about a video. The practical benefits are considerable: selection of representative images, more efficient attachment of metadata within the video, better at-a-glance browsing of video, and so on.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of the video structuring apparatus in one embodiment of the present invention.

The video structuring apparatus shown in the figure comprises a video reading unit 1, a video temporary storage unit 2, a motion vector calculation unit 3, a feature quantity calculation unit 4, a sub-shot type identification unit 5, a video division unit 6, a video/segment information storage unit 7, and a learning data storage unit 8.

The video reading unit 1 reads the one-shot personal video to be analyzed and stores it in the video temporary storage unit 2. The personal video to be analyzed is assumed to be already managed shot by shot and stored in storage means such as a disk device. A shot is the continuous sequence of video recorded from the moment the camera's record button is switched on until it is switched off. Video is input to the video structuring apparatus one shot at a time, and such video is called "one-shot video". A segment of the divided video is, depending on the system, specifically defined as a "sub-shot", and the type of a sub-shot as a "sub-shot type"; the description below uses these terms.

The motion vector calculation unit 3 reads the video stored in the video temporary storage unit 2 frame by frame in sequence and calculates the motion vectors of each frame.

The feature quantity calculation unit 4 calculates, from the motion vectors calculated by the motion vector calculation unit 3, a feature quantity for determining the sub-shot type. To reflect both local motion and global motion in the feature quantity, the frame image is divided into predetermined regions and the feature quantity is calculated for each region. Because the feature quantity calculation unit 4 works on predetermined regions instead of extracting the moving object from the background, the calculation cost is kept low.

The sub-shot type identification unit 5 refers to the learning data stored in the learning data storage unit 8 and identifies the segment type of each frame using the feature quantity calculated by the feature quantity calculation unit 4.

The learning data storage unit 8 holds learning data covering a wide variety of subject types and camera shake. This makes robust identification possible across that variety of subjects and camera shake.

The video division unit 6 divides the video using the per-frame segment types and stores the result in the video/segment information storage unit 7.

FIG. 4 shows an example in which a one-shot video is divided into sub-shots in one embodiment of the present invention.

The example in the figure is a video in which a stationary subject starts to move, the camera follows it, and recording continues until the subject stops. This video contains four sub-shots.

With the above configuration, when the video structuring apparatus receives a one-shot video as input, it automatically divides the video content into sub-shots typed according to the photographer's camera work and the presence or absence of a moving object in the video.

Next, the operation of the above configuration is described in detail.

First, the sub-shots of personal video are defined. FIG. 5 is a diagram showing the sub-shot types in one embodiment of the present invention.

In this embodiment, as shown in FIG. 5, sub-shots have six types, ω0 to ω5. The vertical axis is the type of camera work, in three classes: none, present, and zoom. "Present" covers pan, tilt, dolly, boom, and track. The other axis has two classes: no moving object in the real world, and a moving object present. Each category is described below with an example.

ω0: no camera work, no moving object
(Example) The object of interest is simply being filmed, or the photographer is waiting for it to start moving.
ω1: no camera work, moving object present
(Example) The object of interest is moving within the frame.
ω2: camera work present, no moving object
(Example) The camera pans or tilts to capture scenery, buildings, or the atmosphere of the place.
ω3: camera work present, moving object present (followed)
(Example) The camera is moved to follow the movement of the object of interest.
ω4: camera work present, moving object present (not followed)
(Example) The camera pans or tilts to capture scenery, buildings, or the atmosphere of the place, and a moving object that is not being followed appears in the frame.
ω5: zoom section
(Example) There is an object of interest and the camera zooms in on it, or the camera zooms out to fit the whole scene in the frame.
Zoom is not subdivided by the presence or absence of a moving object, because the act of zooming itself strongly reflects the photographer's intention.
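
The six categories can be read as a small lookup from (camera work, moving object, followed) to a label. The following is a minimal sketch of that reading, not part of the original disclosure; the names `CameraWork` and `subshot_type` and the labels `"w0"` to `"w5"` are hypothetical, and ω3 is taken to mean "camera follows a moving object", as its example suggests.

```python
from enum import Enum


class CameraWork(Enum):
    NONE = "none"
    MOVING = "moving"  # pan, tilt, dolly, boom, track
    ZOOM = "zoom"


def subshot_type(camera, moving_object, following=False):
    """Return the sub-shot label ('w0'..'w5') for one combination."""
    if camera is CameraWork.ZOOM:
        return "w5"  # zoom is not subdivided by moving objects
    if camera is CameraWork.NONE:
        return "w1" if moving_object else "w0"
    if not moving_object:
        return "w2"
    # Camera work with a moving object: followed (w3) or not followed (w4).
    return "w3" if following else "w4"
```

For instance, `subshot_type(CameraWork.MOVING, True, following=True)` gives `"w3"`, the follow-shot case.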

FIG. 6 is a flowchart of the operation in one embodiment of the present invention.

Step 201) The video reading unit 1 reads the input one-shot video to be analyzed and stores it in the video temporary storage unit 2.

Step 202) The motion vector calculation unit 3 reads the video from the video temporary storage unit 2 frame by frame in sequence and calculates the motion vectors of each frame. In this embodiment, the motion vectors are calculated by successively finding the correspondence of feature points between frames. Block matching is another possible method.

A specific method for associating feature points is as follows. Let N be the total number of frames in the video, Img(i) the i-th frame image, and Img(i+1) the (i+1)-th frame image, where i = 1, 2, ..., N-1. From Img(i), M feature points P_ij = (px_ij, py_ij) (i = 1, 2, ..., N-1, j = 1, 2, ..., M) are detected. Feature point detection uses a conventional method that extracts points where the image intensity distribution changes sharply, such as corners and object contours; for example, the well-known Harris operator may be used. The corresponding point Q_ij = (qx_ij, qy_ij) (i = 1, 2, ..., N-1, j = 1, 2, ..., M) in Img(i+1) can be found by referring to the intensity distribution in the neighborhood of P_ij in Img(i) and locating the position in Img(i+1) whose intensity distribution correlates most highly with it.
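
The correspondence search described above can be sketched as a neighborhood correlation search. This is an illustrative sketch only: the text does not specify the correlation measure, window size, or search range, so normalized cross-correlation, the window radius `r`, and the search radius `search` are all assumptions, as are the function names. Images are plain nested lists indexed as `img[y][x]`, and the feature point is assumed to lie at least `r` pixels from the border.

```python
def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db) if da > 0 and db > 0 else 0.0


def patch(img, x, y, r):
    """Flatten the (2r+1) x (2r+1) window centred on (x, y)."""
    return [img[yy][xx]
            for yy in range(y - r, y + r + 1)
            for xx in range(x - r, x + r + 1)]


def find_match(img0, img1, p, r=2, search=3):
    """Return the point Q in img1 whose neighbourhood best matches P's in img0."""
    px, py = p
    h, w = len(img1), len(img1[0])
    ref = patch(img0, px, py, r)
    best_score, best_q = -2.0, p
    # Scan candidate positions within the search window, staying inside the image.
    for qy in range(max(r, py - search), min(h - r, py + search + 1)):
        for qx in range(max(r, px - search), min(w - r, px + search + 1)):
            score = ncc(ref, patch(img1, qx, qy, r))
            if score > best_score:
                best_score, best_q = score, (qx, qy)
    return best_q
```

An exhaustive search like this is quadratic in the search radius; it is shown only to make the "highest correlation in the neighborhood" step concrete.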

From the resulting pairs of natural feature points, the motion vector V_ij (i = 1, 2, ..., N-1, j = 1, 2, ..., M) between Img(i) and Img(i+1) is calculated by formula (1) below.

V_ij = (vx_ij, vy_ij) = (qx_ij - px_ij, qy_ij - py_ij) ... formula (1)
(where i = 1, 2, ..., N-1, j = 1, 2, ..., M)
The calculated motion vector V_ij and its start point P_ij are output, with the indices i and j corresponding between V_ij and P_ij. Finally, V_ij and P_ij are written to the memory in the motion vector calculation unit 3.
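
Formula (1) is a per-point subtraction of start point from corresponding point. A minimal sketch for one frame pair follows; the function name `motion_vectors` and the list-of-tuples layout are assumptions made for illustration.

```python
def motion_vectors(points_p, points_q):
    """Apply formula (1): V = Q - P for each matched pair of one frame pair."""
    return [(qx - px, qy - py)
            for (px, py), (qx, qy) in zip(points_p, points_q)]
```

Keeping `points_p` alongside the returned list preserves the correspondence between each V_ij and its start point P_ij, which the later region test relies on.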

Step 203) The feature quantity calculation unit 4 takes as input V_ij and P_ij (i = 1, 2, ..., N-1, j = 1, 2, ..., M) read from the memory of the motion vector calculation unit 3. From the motion vectors it calculates the feature quantities used to identify the segment type of each frame.

In this embodiment, the following four are used as feature quantities:
- the mean of the motion vector magnitudes,
- the standard deviation of the motion vector magnitudes,
- the standard deviation of the motion vector directions,
- the change in the mean of the motion vector directions.

The feature quantity calculation exploits a tendency in how people shoot: the subject, being the object of interest, is usually framed near the center. Specifically, as shown in FIG. 7, the frame is split into an inner part and an outer part and the feature quantities are calculated for each, so that both local motion and global motion are reflected in the feature quantity. Below, the inner part of the frame is called the "inner region" and the outer part the "outer region". The regions can be defined arbitrarily; in this embodiment a rectangle is placed that is similar in shape to the frame image, 50% of its size, and centered on the center of the frame image, with the inside of the rectangle defined as the "inner region" and the outside as the "outer region". Alternatively, sub-shot identification can be run experimentally on the learning data with rectangles of different sizes, and the ratio that gives the highest identification rate adopted.

Whether a motion vector belongs to the inner region or the outer region of the frame is decided by whether its start point P_ij lies in the inner region or the outer region.
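
The region test can be sketched as follows. The text gives the rectangle's size as 50% of the frame without saying whether that refers to side length or area; this sketch assumes side length and exposes it as the `scale` parameter. The function names are hypothetical.

```python
def is_inner(point, frame_w, frame_h, scale=0.5):
    """True if `point` lies in the centred rectangle whose sides are
    `scale` times the frame's sides (assumed meaning of "50%")."""
    x, y = point
    half_w, half_h = frame_w * scale / 2, frame_h * scale / 2
    cx, cy = frame_w / 2, frame_h / 2
    return abs(x - cx) <= half_w and abs(y - cy) <= half_h


def split_by_region(points, vectors, frame_w, frame_h, scale=0.5):
    """Assign each motion vector to the region containing its start point."""
    inner, outer = [], []
    for p, v in zip(points, vectors):
        (inner if is_inner(p, frame_w, frame_h, scale) else outer).append(v)
    return inner, outer
```

Making `scale` a parameter fits the remark that the ratio may instead be tuned experimentally on the learning data.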

Here, the feature quantities to be calculated are defined. Let K (k < M) be the number of feature points contained in the region being handled, and let the motion vector of the k-th such feature point in the i-th frame image be
V_ik = (vx_ik, vy_ik) (i = 1, 2, ..., N-1, k = 1, 2, ..., K).
The four feature quantities are then given by formulas (2) to (5).
[Formulas (2) to (5): equation images not reproduced in this text.]
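
The images for formulas (2) to (5) do not survive in this text. As a hedged reconstruction only, assuming the conventional sample mean and standard deviation, and writing |V_ik| and θ_ik (symbols introduced here, not in the original) for the magnitude and direction of V_ik, the four quantities listed earlier could take a form such as the following. The exact original expressions, including how the direction angle is wrapped, may differ.

```latex
\begin{aligned}
|V_{ik}| &= \sqrt{vx_{ik}^{2} + vy_{ik}^{2}}, \qquad
\theta_{ik} = \operatorname{atan2}(vy_{ik},\, vx_{ik}) \\
\mu^{\mathrm{mag}}_{i} &= \frac{1}{K}\sum_{k=1}^{K} |V_{ik}|
  && \text{(mean of magnitudes)} \\
\sigma^{\mathrm{mag}}_{i} &= \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(|V_{ik}| - \mu^{\mathrm{mag}}_{i}\bigr)^{2}}
  && \text{(standard deviation of magnitudes)} \\
\sigma^{\mathrm{dir}}_{i} &= \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(\theta_{ik} - \mu^{\mathrm{dir}}_{i}\bigr)^{2}},
  \qquad \mu^{\mathrm{dir}}_{i} = \frac{1}{K}\sum_{k=1}^{K}\theta_{ik}
  && \text{(standard deviation of directions)} \\
\Delta\mu^{\mathrm{dir}}_{i} &= \bigl|\mu^{\mathrm{dir}}_{i} - \mu^{\mathrm{dir}}_{i-1}\bigr|
  && \text{(change in mean direction)}
\end{aligned}
```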

These are the four feature quantities. Let K_inner be the number of feature points in the inner region of the frame and K_outer the number in the outer region; the feature quantities are the values obtained by substituting these for K in formulas (2) to (5). As a result, an 8-dimensional feature quantity is obtained for each frame: 4 dimensions for the inner region and 4 for the outer region. The feature quantities are calculated for every frame of the shot. Then, to bring them to a common scale, each feature quantity is normalized so that it is centered on its mean and all have equal variance, and the normalized values are output as the feature quantities actually used. The output feature quantity is denoted T_i (i = 1, 2, ..., N-1) and is written to a memory (not shown) in the feature quantity calculation unit 4.
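
A minimal sketch of the per-region calculation and the shot-wide normalization follows. It assumes population standard deviations, plain (unwrapped) `atan2` angles, an absolute difference for the change in mean direction, and zero-mean unit-variance scaling as the reading of "equal variance"; none of these details is confirmed by the text, and the function names are hypothetical. The 8-dimensional vector for a frame would be the inner-region list concatenated with the outer-region list.

```python
import math


def region_features(vectors, prev_mean_dir=None):
    """Four quantities for one region of one frame, plus its mean direction."""
    if not vectors:
        return [0.0, 0.0, 0.0, 0.0], prev_mean_dir
    k = len(vectors)
    mags = [math.hypot(vx, vy) for vx, vy in vectors]
    dirs = [math.atan2(vy, vx) for vx, vy in vectors]
    mean_mag = sum(mags) / k
    std_mag = math.sqrt(sum((m - mean_mag) ** 2 for m in mags) / k)
    mean_dir = sum(dirs) / k
    std_dir = math.sqrt(sum((d - mean_dir) ** 2 for d in dirs) / k)
    # Change relative to the previous frame's mean direction (0 for the first frame).
    delta = 0.0 if prev_mean_dir is None else abs(mean_dir - prev_mean_dir)
    return [mean_mag, std_mag, std_dir, delta], mean_dir


def normalize(rows):
    """Scale every column to zero mean and unit variance across the shot."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n)
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]
```

Constant columns are mapped to 0.0 to avoid division by zero; that handling is a choice of this sketch.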

Step 204) Let N be the total number of frames.

Step 205) The loop counter i is initialized (i = 1).

In the following steps, the sub-shot type identification unit 5 takes the feature quantity T_i calculated by the feature quantity calculation unit 4 as input and determines the sub-shot type of each frame. Thresholding is one possible method, but home video contains unpredictable camera shake and subjects, so identification with fixed thresholds is difficult. This embodiment therefore uses a method based on learning data, and performs identification with the simplest such method, the nearest neighbor rule. Discriminant analysis and the like are possible alternatives.

The learning data is prepared by manually collecting videos classified into each sub-shot type and applying to them, in advance, the same processing by the motion vector calculation unit 3 and the feature quantity calculation unit 4 described in steps 202 and 203 to obtain their feature quantities. This learning data is stored beforehand in the learning data storage unit 8.

Step 206) The sub-shot type identification unit 5 reads the learning data from the learning data storage unit 8 and maps the learning data set into the 8-dimensional identification space.

Step 207) The sub-shot type identification unit 5 reads the feature quantity of the i-th frame from the memory (not shown) in the feature quantity calculation unit 4.

Step 208) The sub-shot type identification unit 5 finds the learning data point with the smallest Euclidean distance to the identification target mapped into the identification space.

Step 209) The sub-shot type identification unit 5 assigns to the identification target the sub-shot type of the learning data point with the smallest Euclidean distance. To guard against misclassification, it is also possible to extract the a nearest learning data points by Euclidean distance (a being fixed in advance) and decide the type by majority vote. In addition, when the Euclidean distance exceeds a predetermined threshold ε, the result can be judged unreliable, so rejecting it is also conceivable.
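
Steps 208 and 209, including the optional a-nearest vote and the ε rejection, can be sketched as below. The function name `classify`, the `(feature_vector, label)` layout of the learning data, and the use of `None` to signal rejection are assumptions for illustration; the learning data is assumed to be non-empty.

```python
import math
from collections import Counter


def classify(feature, training, a=1, eps=None):
    """Nearest-neighbour rule over `training`, a list of (vector, label).

    With a > 1 the a nearest points vote; with `eps` set, a target whose
    nearest point is farther than eps is rejected (returns None).
    """
    ranked = sorted(training, key=lambda item: math.dist(feature, item[0]))
    nearest = ranked[:a]
    if eps is not None and math.dist(feature, nearest[0][0]) > eps:
        return None
    votes = Counter(label for _, label in nearest)
    # On a tied vote, Counter keeps first-seen order, so the closer label wins.
    return votes.most_common(1)[0][0]
```

Sorting the whole learning set is O(n log n) per frame; it is used here for brevity, not as a claim about the original implementation.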

Step 210) It is determined whether all frames have been identified. If frames remain to be identified, the process moves to step 211; if all frames have been identified, it moves to step 212. At that point, the sub-shot type Ωi (i = 1, 2, ..., N) of each frame is output, where Ωi is one of ω0, ω1, ω2, ω3, ω4, and ω5.

Step 211) i is set to i + 1, and the process returns to step 207.

Step 212) The video division unit 6 takes the sub-shot type Ωi (i = 1, 2, ..., N) of each frame as input and divides the input one-shot video into sub-shots. FIG. 8 shows an example of sub-shot division by the video division unit in one embodiment of the present invention. At the frame level, as FIG. 8 shows, noise appears here and there, which could produce a large number of sub-shots. The frames are therefore partitioned in time order into groups of a predetermined number of frames b. Each such group is called a sub-shot minimum section; FIG. 8 shows an example with b = 8. The sub-shot type of each minimum section is decided by majority vote over the b per-frame sub-shot types it contains. Noise here means, for example, a frame identified as ω0 inside a minimum section that the majority vote assigns to ω1, or likewise a frame identified as ω3 inside a minimum section assigned to ω2.

For cases that the majority vote cannot decide, a rule can be set in advance: adopt the identification result of the preceding or following sub-shot minimum section, assign priorities to the sub-shot types beforehand, merge with the neighboring minimum sections and vote again, and so on. In this embodiment, when the majority vote is undecided, the identification result of the immediately preceding sub-shot minimum section is adopted. If the very first sub-shot minimum section of the one-shot video is undecided, its sub-shot type is set to ω0. The same processing is applied to every sub-shot minimum section, and the result Ω'j is written to a memory (not shown) in the video division unit 6. Here, with U the number of sub-shot minimum sections, j = 1, 2, ..., U.
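
The per-section vote with this embodiment's tie rule can be sketched as follows. The function name `smooth_sections` and the string labels are hypothetical, and letting the final section be shorter than b when the frame count is not a multiple of b is an assumption the text does not address.

```python
from collections import Counter


def smooth_sections(frame_types, b=8, default="w0"):
    """Majority-vote each run of b frames into one sub-shot type.

    A tied vote takes the previous section's type, or `default` (w0 in the
    embodiment) when the tie occurs in the first section.
    """
    result = []
    for start in range(0, len(frame_types), b):
        counts = Counter(frame_types[start:start + b]).most_common()
        tied = len(counts) > 1 and counts[0][1] == counts[1][1]
        if not tied:
            result.append(counts[0][0])
        elif result:
            result.append(result[-1])
        else:
            result.append(default)
    return result
```

With b = 8, a section of seven `"w1"` frames and one stray `"w0"` frame comes out as `"w1"`, which is the noise suppression the text describes.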

Step 213) The video dividing unit 6 obtains the sub-shot types Ω'j of the sub-shot minimum intervals from the memory. Adjacent intervals of the same type are merged; where the type changes, the point of change is taken as a sub-shot division point, and the frame numbers of the in point and out point of each sub-shot are recorded. The in-point and out-point information of the sub-shots is stored, together with the video, in the video/segment information storage unit 7. FIG. 9 shows an example of video segment information in an embodiment of the present invention.
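
The merging in Step 213 amounts to run-length grouping of the interval labels. The following sketch shows one way to derive in and out points; the function name `merge_intervals`, the tuple layout, and the use of 1-based inclusive frame numbers are assumptions made for illustration, and the actual record format of FIG. 9 may differ.

```python
def merge_intervals(interval_types, b, num_frames):
    """Merge consecutive minimum intervals of the same type into sub-shots.

    Returns a list of (type, in_frame, out_frame) tuples. Frame numbers
    are 1-based and inclusive (a convention assumed for this sketch).
    num_frames clips the out point of a shorter final interval.
    """
    segments = []
    for j, sub_type in enumerate(interval_types):
        in_frame = j * b + 1
        out_frame = min((j + 1) * b, num_frames)
        if segments and segments[-1][0] == sub_type:
            # Same type as the previous interval: extend the current sub-shot.
            segments[-1] = (sub_type, segments[-1][1], out_frame)
        else:
            # Type changed: this interval starts a new sub-shot.
            segments.append((sub_type, in_frame, out_frame))
    return segments

print(merge_intervals([1, 1, 2, 2, 2, 0], b=8, num_frames=44))
# prints [(1, 1, 16), (2, 17, 40), (0, 41, 44)]
```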

With the processing described above, personal video that is highly compressed, low in quality, and affected by camera shake can be roughly divided, at low computational cost, into six types of segments defined by the combination of camerawork and the presence or absence of moving objects in the real environment.

The operations described above can also be implemented as a program that is installed and executed on a computer used as the video structuring apparatus, or that is distributed over a network.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to video structuring and indexing technologies.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the video structuring apparatus in an embodiment of the present invention.
FIG. 4 is an example in which a one-shot video is divided into sub-shots in an embodiment of the present invention.
FIG. 5 is a diagram showing the sub-shot types in an embodiment of the present invention.
FIG. 6 is a flowchart of the operation in an embodiment of the present invention.
FIG. 7 is an example of the internal region and the external region in an embodiment of the present invention.
FIG. 8 is an example of sub-shot division in an embodiment of the present invention.
FIG. 9 is an example of video segment information in an embodiment of the present invention.

Explanation of symbols

1 Video reading means, video reading unit
2 Video temporary storage means, video temporary storage unit
3 Motion vector calculation means, motion vector calculation unit
4 Feature amount calculation means, feature amount calculation unit
5 Segment type identification means, sub-shot type identification unit
6 Video dividing means, video dividing unit
7 Video/segment information storage means, video/segment information storage unit
8 Learning data storage unit

Claims

1. A video structuring method that defines segment types based on camerawork and the presence or absence of moving objects in a video, divides video content into segments, and labels each segment with a segment type, the method comprising:
a video reading step of reading the video to be analyzed and temporarily storing it in storage means;
a motion vector calculating step of reading the input video from the storage means and calculating motion vectors between frames of the input video;
a feature amount calculating step of calculating, from the calculated motion vectors, a feature amount for determining a segment type;
a segment type identification step of identifying a segment type for each frame using the feature amount; and
a video dividing step of dividing the input video based on the per-frame segment types and storing the divided video and segment information in storage means.

2. The video structuring method according to claim 1, wherein, in the feature amount calculating step,
each frame image of the input video is divided into predetermined regions and a feature amount is calculated for each of the divided regions.

3. The video structuring method according to claim 2, wherein, in the feature amount calculating step,
the mean, the variance (standard deviation), the change in the mean, and the change in the variance (standard deviation) of the magnitude and direction of the motion vectors are used in combination as the feature amount for identification.

4. A video structuring apparatus that defines segment types based on camerawork and the presence or absence of moving objects in a video, divides video content into segments, and labels each segment with a segment type, the apparatus comprising:
video temporary storage means for temporarily storing the read video data;
video/segment information storage means for storing the divided video and segment information;
video reading means for reading the video to be analyzed and temporarily storing it in the video temporary storage means;
motion vector calculating means for reading the input video from the video temporary storage means and calculating motion vectors between frames of the input video;
feature amount calculating means for calculating, from the calculated motion vectors, a feature amount for determining a segment type;
segment type identifying means for identifying a segment type for each frame using the feature amount; and
video dividing means for dividing the input video based on the per-frame segment types and storing the divided video and segment information in the video/segment information storage means.

5. The video structuring apparatus according to claim 4, wherein the feature amount calculating means includes
means for dividing each frame image of the input video into predetermined regions and calculating a feature amount for each of the divided regions.

6. The video structuring apparatus according to claim 5, wherein the feature amount calculating means includes
means for using, as the feature amount for identification, one or more of the mean, the variance (standard deviation), the change in the mean, and the change in the variance (standard deviation) of the magnitude and direction of the motion vectors.

7. A video structuring program that causes a computer to function as the video structuring apparatus according to claim 4.