JP5370170B2 - Summary video generation apparatus and summary video generation method

Info

Publication number: JP5370170B2
Application number: JP2010006670A
Authority: JP
Inventors: つきみ若林; 渉猪羽; 慎中手
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2009-01-15
Filing date: 2010-01-15
Publication date: 2013-12-18
Anticipated expiration: 2030-01-15
Also published as: JP2010187374A

Abstract

PROBLEM TO BE SOLVED: To generate, from photographed data, a summary video that centers on a person and has variation.

SOLUTION: A summary video-generating apparatus 1 includes a person characteristic section extractor 9, a summary reproduction section selector 13 and a generator 14. The person characteristic section extractor divides a video into a first video section, in which person characteristic information showing a video characteristic of a person region extracted from the video is equal to or greater than a prescribed threshold, and a second video section, in which the person characteristic information is less than the threshold. Based on the person characteristic information, it obtains characteristic values showing the person characteristics of each of the first video section and the second video section. The summary reproduction section selector selects a selective video section based on the characteristic values and extracts, from the selective video section, a first reproduction section to be used for a summary video. The generator generates the summary video by using the video of the video section selected by the summary reproduction section selector 13.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a summary video generation device, and more particularly to a summary video generation device and a summary video generation method for generating a summary video of video content.

In recent years, with the spread of home video cameras, anyone can easily record and save familiar events and scenery as video. However, such so-called shoot-and-leave footage is enjoyable to watch immediately after shooting but is seldom viewed or used as video content later on. Moreover, video shot by ordinary users contains many failures and unnecessary scenes and is highly redundant; for example, similar scenes appear many times. Such footage is therefore suitable for the people involved to look back on an event right after shooting, but turning it into video content worth watching later requires editing work: organizing the shot video as material and joining the selected material according to the production intent. Doing this editing separately on a personal computer or the like is troublesome.

Against this background, various techniques have been proposed for automatically editing shooting data and audio data to generate a summary video.

For example, Patent Document 1 discloses a technique that detects scene changes from image changes between shooting screens, selects important scenes based on the length of each moving image scene and the degree of image change within the scene, and plays them back as a summary. Patent Document 2 discloses a technique that divides moving image data into a plurality of scenes and selects playback scenes according to a plurality of conditions. Patent Document 3 discloses a technique that extracts person scenes using face detection. Further, Patent Document 4 discloses a technique that selects, based on feature amounts of a data stream, a representative section corresponding to a characteristic scene of the shooting data, also selects a connecting section that serves as an introduction to the representative section, and generates a summary video using the representative section and the connecting section.

Patent Document 1: JP-A-6-149902
Patent Document 2: JP 2002-142189 A
Patent Document 3: JP 2007-281858 A
Patent Document 4: JP 2008-178090 A

However, with a setting that preferentially selects scenes with motion, as in the technique disclosed in Patent Document 1, scenes with intense motion appear one after another, resulting in a dizzying summary video.

In the technique disclosed in Patent Document 2, a plurality of conditions such as screen brightness and high-frequency components are set for scene evaluation, but the criterion for selecting scenes from the evaluation result of each condition is fixed for each setting mode. In a given setting mode, scenes whose conditions are the opposite of those judged important are not selected, so scenes of the same kind are again gathered together.

The technique disclosed in Patent Document 3 uses the face as a feature amount, which is often the central subject and of great interest to ordinary home users, but it has the problem that the user must designate one or more specific persons.

The technique disclosed in Patent Document 4 selects, based on feature amounts of a data stream, a representative section corresponding to a characteristic scene of the shooting data, selects a connecting section that serves as an introduction to the representative section, and generates a summary video using both. However, it does not go so far as to disclose extracting a person's face as a feature amount.

Furthermore, shooting data of movies and television programs has already been scene-edited by professionals, so a summary with some degree of story can be produced by appropriately combining scenes adjacent to action sections. This approach is not effective, however, for unedited material video in which an ordinary user has shot events or scenery like snapshots.

Thus, the conventional techniques described above present important scenes selected by a certain criterion one after another in order of occurrence, so scenes of the same kind are likely to follow each other, and the resulting summary video is not necessarily easy to watch or free of tedium for the user.

Accordingly, an object of the present invention is to provide a summary video generation device and a summary video generation method capable of generating, from shooting data, a summary video that centers on persons and has variation.

In order to solve the above-described problems, the present invention provides the following summary video generation devices (a) to (d) and summary video generation methods (e) to (h).
(a) A summary video generation device (1, 600) comprising: a person feature section extraction unit (9, 616) that divides a video into a plurality of video sections based on person feature information indicating a video feature of a person region extracted from the video; a summary playback section selection unit (13, 62) that selects a desired video section from the plurality of video sections; and a summary video generation unit (14, 63) that generates a summary video using the video of the video section selected by the summary playback section selection unit. The person feature section extraction unit divides the video into a person feature section in which the person feature information is equal to or greater than a predetermined threshold and a non-person feature section in which the person feature information is smaller than the threshold, and obtains, based on the person feature information, a feature value indicating the person feature of each of the person feature section and the non-person feature section. The summary playback section selection unit obtains an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value, selects, based on the evaluation value, a video section from which the summary video is to be extracted from among the person feature section and the non-person feature section, and extracts a summary playback section from that video section. The summary video generation unit generates the summary video using the video of the summary playback section.
(b) The summary video generation device according to (a), wherein the summary playback section selection unit obtains an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value, selects a representative section candidate from the person feature section based on the evaluation value, extracts from the representative section candidate a representative section to be used for the summary video, and extracts, based on the representative section, a connecting section to be used for the summary video from the person feature section and the non-person feature section, and wherein the summary video generation unit generates the summary video using the video of the representative section and the connecting section.
(c) The summary video generation device according to (a), wherein the summary playback section selection unit sets a plurality of weightings for the feature value in association with a plurality of scene types set in the summary generation mode, obtains, in association with the scene types, a plurality of evaluation values for evaluating the person feature section and the non-person feature section based on the summary generation mode indicating the weightings for the feature value, selects, based on the plurality of evaluation values, a video section from which the summary video is to be extracted from among the person feature section and the non-person feature section, and extracts a summary playback section from that video section, and wherein the summary video generation unit generates the summary video using the video of the summary playback section.
(d) The summary video generation device according to any one of (a) to (c), wherein the summary video generation unit generates the summary video as a playback list using the video of the summary playback section.
(e) A summary video generation method comprising: dividing a video into a person feature section in which person feature information indicating a video feature of a person region extracted from the video is equal to or greater than a predetermined threshold and a non-person feature section in which the person feature information is smaller than the threshold; obtaining, based on the person feature information, a feature value indicating the person feature of each of the person feature section and the non-person feature section; obtaining an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value; selecting, based on the evaluation value, a video section from which a summary video is to be extracted from among the person feature section and the non-person feature section; extracting a summary playback section from that video section; and generating the summary video using the video of the summary playback section.
(f) The summary video generation method according to (e), comprising: obtaining an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value; selecting a representative section candidate from the person feature section based on the evaluation value; extracting from the representative section candidate a representative section to be used for the summary video; extracting, based on the representative section, a connecting section to be used for the summary video; and generating the summary video using the video of the representative section and the connecting section.
(g) The summary video generation method according to (e), comprising: setting a plurality of weightings for the feature value in association with a plurality of scene types set in the summary generation mode; obtaining, in association with the scene types, a plurality of evaluation values for evaluating the person feature section and the non-person feature section based on the summary generation mode indicating the weightings for the feature value; selecting, based on the plurality of evaluation values, a video section from which the summary video is to be extracted from among the person feature section and the non-person feature section; extracting a summary playback section from that video section; and generating the summary video using the video of the summary playback section.
(h) The summary video generation method according to any one of (e) to (g), wherein the summary video is generated as a playback list using the video of the summary playback section.

According to the summary video generation device and the summary video generation method of the present invention, an evaluation value is calculated based on feature amounts of a person's face on the shooting screen, obtained from the shooting data in the stream data; a summary playback section is selected from the shooting data based on the calculated evaluation value; and a summary video is generated from the shooting data based on the selected summary playback section. A summary video that centers on persons and has variation can therefore be generated from the shooting data.

FIG. 1 is a diagram showing a preferred usage form of the summary video generation device according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a configuration example of the summary video generation device according to Embodiment 1 of the present invention.
FIG. 3 is a conceptual diagram of the extraction of person feature sections and non-person feature sections.
FIG. 4 is a conceptual diagram of playback list generation.
FIG. 5 is a conceptual diagram showing an example of the switching timing from a connecting section to a feature section.
FIG. 6 is a block diagram showing a configuration example of the summary video generation device according to Embodiment 2 of the present invention.
FIG. 7 is a diagram showing a preferred usage form of the summary video generation device according to Embodiment 2 of the present invention.
FIG. 8 is a diagram showing the shooting information of shots taken.
FIG. 9 is a diagram showing the summary playback section selection process.
FIG. 10 is a diagram showing examples of video that is highly evaluated for each scene type.
FIG. 11 is a diagram showing example scores of the person feature amounts obtained from the video shown in FIG. 10.
FIG. 12 is a diagram showing example parameters used to evaluate each scene type.
FIG. 13 is a diagram showing example evaluation values.
FIG. 14 is a diagram showing a face information list.
FIG. 15 is a diagram explaining coordinates within the screen.
FIG. 16 is a diagram showing the relationship between time and face information.

The best mode for carrying out the summary video generation device and the summary video generation method of the present invention will be described below with reference to the drawings.
FIG. 1 is a diagram showing a connection example of the summary video generation device according to Embodiment 1 of the present invention.

(Embodiment 1)
FIG. 1 shows an example of the connection between a device equipped with the summary video generation device of the present invention and other devices. The summary video generation device is built into a content storage device 21 that stores various kinds of content, such as an HDD recorder, a DVD recorder, or a BD recorder. As shown in FIG. 1, the content storage device 21 is connected to a video camera 23 and also to a display device 22 such as a television or a monitor.
Data stored in the video camera 23 is stored in the content storage device 21, and the stored data is viewed on the display device 22. The summary video generation device may instead be built into the display device 22 or the video camera 23.

FIG. 2 is a block diagram of the summary video generation device 1 according to Embodiment 1 of the present invention.
The summary video generation device 1 is connected to a photographing device 2 (the video camera 23 in FIG. 1), such as a video camera, a digital camera with a video shooting function, or a mobile phone, and to a display device 3 (the display device 22 in FIG. 1). It generates a summary video of shooting data, which includes video data and audio data shot by the photographing device 2, and outputs the generated summary video to the display device 3. The summary video generation device 1 may of course output the summary video so that it is displayed on a display unit (display) of the photographing device 2.
The summary video generation device 1 may also be provided inside the photographing device 2. In addition, as long as the shooting data includes video data, it need not include audio data.

As shown in FIG. 2, the summary video generation device 1 includes a recording control unit 4 and a summary generation/playback unit 6. The summary video generation device 1 is connected to the photographing device 2, the display device 3, and a parameter setting unit 5 that sets various parameters for the summary video generation device 1.
The recording control unit 4 includes a stream data input unit 7, a person feature amount extraction unit 8, a person feature section extraction unit 9, and a data recording unit 10.

The stream data input unit 7 acquires stream data that includes video data shot and recorded by the photographing device 2, audio data synchronized with the video data, and shooting information such as the shooting date and time and the image quality of the shooting data, and separates the individual data from the acquired stream data. Various formats can be used for the stream data.
In Embodiment 1 and the other embodiments described later, the unit of shooting data from the start of recording to the stop of recording is called one shot. For each shot, a data file containing the shooting data and its shooting information is created and saved, and one or more shots together form the stream data. Title information that can be set by the user is also saved in the data file as metadata.

The shooting information includes various kinds of information: shooting time information; GPS shooting position information, when the photographing device 2 is equipped with a GPS receiver; shooting mode information indicating the mode used at the time of shooting, such as a landscape shooting mode or a sports shooting mode; wide/normal mode information indicating whether the shooting angle of view is 16:9 (wide) or 4:3 (normal); and camera shake mode information indicating whether camera shake correction is on or off.
Each shot can be classified using this shooting information. A collection of shots that have similar shooting information is called a shooting scene.
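The grouping of shots into shooting scenes can be pictured with a short sketch. The text does not define what counts as "similar" shooting information, so the rule used here (same calendar date and same shooting mode) is purely an assumption for illustration, as are the `Shot` fields and the `group_into_scenes` name.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass(frozen=True)
class Shot:
    """One shot: recording start to recording stop, plus its shooting information."""
    name: str
    start: datetime
    shooting_mode: str  # e.g. "landscape" or "sports"
    wide: bool          # True for 16:9, False for 4:3


def group_into_scenes(shots):
    """Group shots whose date and shooting mode match (a hypothetical similarity rule)."""
    def scene_key(shot):
        return (shot.start.date(), shot.shooting_mode)

    # groupby only merges adjacent items, so sort by the scene key first.
    ordered = sorted(shots, key=lambda s: (scene_key(s), s.start))
    return [list(group) for _, group in groupby(ordered, key=scene_key)]


shots = [
    Shot("a", datetime(2010, 1, 1, 9, 0), "landscape", True),
    Shot("b", datetime(2010, 1, 1, 9, 30), "sports", True),
    Shot("c", datetime(2010, 1, 1, 10, 0), "landscape", True),
    Shot("d", datetime(2010, 1, 2, 9, 0), "landscape", True),
]
scenes = group_into_scenes(shots)
```

A real implementation would likely use a tolerance on time or GPS position rather than exact equality; the exact-match key is only the simplest choice.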

The person feature amount extraction unit 8 extracts person feature amounts from the shooting data of each shot. In Embodiment 1, a person feature amount is, for example, information indicating the presence or absence of a face image in the shot image, the size of the face image, the position of the face image on the screen, or the orientation of the face (the tilt of the face image), or personal identification information. The unit also obtains a score based on the extracted person feature amounts.
The score may be the person feature amount itself, or a value obtained by normalizing the extracted person feature amount. Normalization makes it easy to weight and add a plurality of person feature amounts.
Alternatively, the value of each person feature amount may be determined relative to the distribution of that feature amount across one piece of shooting data, and the resulting value used as the score. With this approach, a person feature amount that is distributed evenly over the entire shooting data receives a low score, and one that is distributed sparsely receives a high score.
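The two scoring options above can be sketched as follows. The text does not give formulas, so both functions are assumptions: `normalize` uses min-max scaling, and `rarity_scores` uses one minus the fraction of frames in which a feature is present. The function names and sample values are hypothetical.

```python
def normalize(values):
    """Min-max normalize raw feature amounts to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rarity_scores(present):
    """Score a feature by how sparsely it occurs across the shooting data.

    `present` holds one boolean per frame. A feature present in a fraction p
    of the frames scores 1 - p where it occurs and 0 elsewhere (an assumed
    formula, not one given in the text).
    """
    p = sum(present) / len(present)
    return [(1.0 - p) if flag else 0.0 for flag in present]


face_sizes = [0, 40, 80, 120, 160]  # hypothetical face heights in pixels
size_scores = normalize(face_sizes)

common = rarity_scores([True, True, True, False])    # seen in most frames
sparse = rarity_scores([False, True, False, False])  # seen rarely
```

With these inputs the sparsely occurring feature scores higher than the common one, which matches the behaviour the paragraph describes.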

Based on the person feature amounts extracted by the person feature amount extraction unit 8, the person feature section extraction unit 9 extracts, as person feature sections, video sections of the shooting data in which a person is shot in a characteristic way, for example where the person appears larger than a predetermined reference value. It also generates and outputs person feature section information, which indicates items such as the position on the time axis of each extracted person feature section, and non-person feature section information, which indicates items such as the position on the time axis of each non-person feature section, that is, each video section other than a person feature section. In Embodiment 1, as described later, the non-person feature section is divided into a plurality of sections.
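A minimal sketch of this threshold-based division is shown below, assuming one score per frame and a single threshold. The `split_sections` name and the `(start, end, is_person)` tuple layout are hypothetical, and the further subdivision of non-person feature sections mentioned for Embodiment 1 is not modelled.

```python
def split_sections(scores, threshold):
    """Split per-frame scores into runs of (start, end, is_person), end exclusive.

    A frame belongs to a person feature section when its score is equal to or
    greater than the threshold, and to a non-person feature section otherwise.
    """
    sections = []
    start = 0
    for i in range(1, len(scores) + 1):
        at_end = i == len(scores)
        if at_end or (scores[i] >= threshold) != (scores[start] >= threshold):
            sections.append((start, i, scores[start] >= threshold))
            start = i
    return sections


frame_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6]
sections = split_sections(frame_scores, threshold=0.5)
person_sections = [s for s in sections if s[2]]
```

In practice very short runs would probably be merged or smoothed before use; that step is left out to keep the sketch focused on the threshold comparison.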

The data recording unit 10 receives the stream data output from the stream data input unit 7, the person feature amounts extracted by the person feature amount extraction unit 8, and the person feature section information and non-person feature section information extracted by the person feature section extraction unit 9, and associates each piece of stream data with the information derived from it. Here, the person feature amounts, the person feature section information, and the non-person feature section information are associated with the stream data. The association in the data recording unit 10 may also be performed based on a user instruction.
The data recording unit 10 stores the associated stream data and information.
The stream data and information stored in the data recording unit 10 can also be recorded on a recording medium 11. The means used for recording is not shown in FIG. 2, and any known means may be used. The recording medium 11 is, for example, an HDD, a memory, a DVD, or a BD, and may be either a fixed type provided in advance in the summary video generation device 1 or a removable type that can be attached to and detached from the summary video generation device 1.

The parameter setting unit 5 passes to the summary generation/playback unit 6, as parameters, the stream data information, summary generation mode, summary playback time, and other items selected by the user when playing back a summary video. The user selects the stream data from which the summary video is to be generated and the summary generation mode with which it is to be generated, and sets in the parameter setting unit 5 the parameters indicating the selected stream data, the summary generation mode, the playback time of the summary video, and so on. The setting may be made by any known method. The stream data may be selected, for example, by selecting its title.
The parameter setting unit 5 may also be provided in the summary video generation device 1.

Here, the summary playback time is the playback time of the summary video generated by the summary video generation device 1. The summary generation mode is information indicating how the summary video is to be generated from the recorded video.
For example, suppose that on a family trip various subjects were shot, such as close-ups of excited children, scenery at tourist spots, and parents posing in front of monuments. The footage could simply be viewed in chronological order, but the result would be a miscellaneous summary video.
The user selects a summary generation mode according to the viewpoint from which he or she wants the recorded video to be summarized. Changing the summary generation mode changes the scene composition of the summary video that is played back.

For example, suppose there are summary generation modes such as "Kids", "Travel Memorial", and "Landscape". "Kids" is built mainly from close-up scenes of persons (especially children), with scenes in which the person's position or orientation changes added. "Travel Memorial" is built mainly from scenes in which a person faces the front and is not shown too large, so that the background can be enjoyed together with the person, with background-only scenes and scenes in which the person's position or orientation changes added. "Landscape" is built mainly from scenes in which no person appears, with occasional scenes added in which persons appear only to the extent that their faces are not conspicuous.
By selecting a summary generation mode, the user can play back, from the same shooting scene, summary videos with a different flavor for each mode.

The summary generation / playback unit 6 includes a data reading unit 12, a summary playback section selection unit 13, a summary video generation unit 14, a playback processing unit 15, a decoding unit 16, and a data output unit 17. The summary playback section selection unit 13 includes a representative section selection unit 131 and a connection section selection unit 132.

Based on the parameters set by the parameter setting unit 5, the data reading unit 12 reads from the data recording unit 10 the stream data designated by the user, the person feature amounts associated with that stream data, and the person feature section information and non-person feature section information. The data reading unit 12 outputs the read stream data and associated information to the summary playback section selection unit 13.
The data reading unit 12 may instead read the stream data and the associated information from the recording medium 11.

The summary playback section selection unit 13 selects the video sections to be used for the summary video (summary playback sections) based on the stream data supplied from the data reading unit 12 and the parameters supplied from the parameter setting unit 5, here in particular the parameter indicating the summary generation mode. The video of representative sections selected from the person feature sections is adopted as summary playback sections. In addition, it is preferable to adopt, as connecting sections, video sections that serve as an introduction to the video of a representative section or that link a plurality of representative sections. Providing connecting sections in addition to the representative sections adds variation to the content of the summary video.
The representative section selection unit 131 evaluates the person feature amounts of each person feature section according to the summary generation mode and selects representative sections from the person feature sections. The following describes a method that calculates an evaluation value for each person feature section and selects the person feature sections with high evaluation values as representative sections.

In the first embodiment, the summary generation mode indicates a weighting of the person feature amounts. Specifically, the summary generation mode indicates the degree of weighting applied to the scores obtained from the person feature amounts extracted by the person feature amount extraction unit 8. The evaluation value is calculated by weighting the scores according to the summary generation mode.
For example, in the "kids" mode described above, the evaluation value is calculated with a large weight on the feature amount corresponding to face size, so that close-up scenes of a person's face receive a high evaluation value. In the "travel memorial" mode, the evaluation value is calculated with a large weight on the feature amount corresponding to face orientation, so that scenes in which the person's face is facing the front receive a high evaluation value. In this way, the feature amount that is weighted is changed for each summary generation mode. By changing the amount of weighting and so on, different types of scenes can be made to receive high evaluation values depending on the summary generation mode, so that representative sections matching the viewing preference can be selected.

The connection section selection unit 132 selects connecting sections, which serve as the introduction to a representative section, from the sections other than the representative sections, that is, from the non-person feature sections and the person feature sections that were not chosen as representative sections. The connecting sections may of course be selected only from the non-person feature sections, or only from the person feature sections that were not chosen as representative sections. The method of selecting a connecting section differs according to the summary generation mode set in the parameter setting unit 5.

The summary video generation unit 14 generates a summary video or a playlist from the representative sections selected by the representative section selection unit 131 and the connecting sections selected by the connection section selection unit 132. For example, the summary video is generated by arranging the representative sections in order of shooting time and placing the connecting sections as appropriate, and the playlist is generated as a list that designates the video sections of the stream data to be used for the summary video.
The following description assumes that the summary video generation unit 14 of the first embodiment generates the summary video separately from the stream data. The method of generating a playlist is described for the summary video generation unit 63 (see FIG. 6) of the second embodiment below, and the summary video generation unit 14 can use the same method.

The playback processing unit 15 performs playback processing of the summary video generated by the summary video generation unit 14.
The decoding unit 16 decodes the summary video output from the playback processing unit 15 and outputs it to the data output unit 17.
The data output unit 17 outputs the decoded summary video data to the display device 3.

Next, the operation of the summary video generation device 1 according to the first embodiment will be described.
First, when the stream data input unit 7 acquires stream data containing shooting data shot and recorded by the shooting device 2 together with its shooting information, it separates the acquired stream data into the shooting data, the shooting information, and the other data. The shooting data is output to the person feature amount extraction unit 8, and the data other than the shooting data is output to the data recording unit 10. The shooting data output from the stream data input unit 7 may consist of video data only.

The person feature amount extraction unit 8 extracts person feature amounts from the shooting data and outputs them to the person feature section extraction unit 9 and the data recording unit 10. In the first embodiment, the person feature amount extraction unit 8 obtains four types of person feature amounts based on four types of features of a person's face. The four features are the size, position, orientation, and tilt of the person's face on the shooting screen, and a person feature amount is obtained for each.
Although the first embodiment is described using these four person feature amounts, it is not necessary to use all of them. Two person feature amounts based on any two features, such as face size and face position, or three person feature amounts based on any three features may of course be used.
The person feature section extraction unit 9 extracts person feature sections based on the four types of person feature amounts.

Next, as an example, the method of obtaining the person feature amount based on face size (hereinafter, the face size feature amount) will be described. The person feature amounts based on face position, face orientation, and face tilt may be obtained in the same way.
The person feature amount extraction unit 8 detects images of persons contained in the shooting data at predetermined intervals. In the first embodiment, the region of a face image is detected as the person region by a known face recognition method using luminance detection, skin color detection, or the like, but the person region is not limited to this as long as it is a region that shows the features of a person. Based on the detected face image, the person feature amount extraction unit 8 detects information representing the features of the face image (hereinafter, face information). When the feature of the face image is expressed by face size, a value such as the area, length, width, or sum of length and width of the face relative to the shooting screen size is obtained and taken as the face information Lt of the image at time t.
The person feature amount extraction unit 8 also creates a list (face information list) indicating the shooting time of each image frame in which a face was detected and the region in which the face image was detected. Part of the face information list is shown in FIG. 14. Here, face information based on face size (hereinafter, face size L1t) and face information based on face position (hereinafter, face position L2t) are used as the face information.

FIG. 15 illustrates the coordinates that indicate a person region in which a face image has been detected. With the upper left of the screen at (x, y) = (0, 0) and the lower right at (x, y) = (w, h), the person region in which the face image was detected is expressed as (x1, y1)-(x2, y2).
The face information list shown in FIG. 14 further contains the face size feature amount (score) P1t based on the face size L1t, and the IDs of the person feature sections and non-person feature sections into which the video is divided based on the face size L1t and the face position L2t. Index information indicating the position of the image frame within the shooting data may be used instead of the shooting time.
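As a minimal sketch of how face information could be derived from the coordinates of FIG. 15, the following Python functions compute a face size as the fraction of the screen area covered by the person region and a face position as the normalized offset of the region's center from the screen center. The function names and the choice of area ratio and center offset are assumptions for illustration; the description only states that face size is a value such as area, length, or width relative to the screen size.

```python
def face_size_ratio(x1, y1, x2, y2, w, h):
    """Face area as a fraction of the screen area (one possible face size L1t).

    (x1, y1)-(x2, y2) is the detected person region; (w, h) is the screen size,
    with the origin at the upper left as in FIG. 15.
    """
    if w <= 0 or h <= 0:
        raise ValueError("screen size must be positive")
    return abs(x2 - x1) * abs(y2 - y1) / (w * h)


def face_center_offset(x1, y1, x2, y2, w, h):
    """Normalized distance of the face center from the screen center.

    Returns 0.0 when the face is exactly centered (a hypothetical basis for
    the face position L2t).
    """
    if w <= 0 or h <= 0:
        raise ValueError("screen size must be positive")
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    dx = (cx - w / 2) / (w / 2)
    dy = (cy - h / 2) / (h / 2)
    return (dx * dx + dy * dy) ** 0.5
```

A region covering the upper-left quarter of a 1920 x 1080 screen gives a size ratio of 0.25, and a region centered on the screen gives an offset of 0.0.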

Next, the person feature amount extraction unit 8 obtains the average value Lm of the face information from the face information Lt of all target shots. The person feature amount extraction unit 8 then evaluates the face information Lt of the image at time t against a first threshold Th_1 using Equation (1) below. The first threshold is Th_1 = Lm + k_1 σ, where k_1 is a coefficient and σ is the standard deviation of the face information Lt.
Further, the person feature amount Pt of the image at time t is obtained by Equation (2) below. When the face size L1t is used as the face information Lt, the average value Lm of the face information is the average face size L1m, the standard deviation σ of the face information is the standard deviation σ1 of the face size, and the person feature amount Pt is the face size feature amount P1t. The same applies when the face information Lt is based on face position, face orientation, face tilt, or the like.

Lt > Th_1 ... Equation (1)
Pt = 10 (Lt − Lm) / σ + 50 ... Equation (2)
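Equations (1) and (2) can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the population standard deviation is assumed for σ (the description does not say which form is used), and returning 50 when σ is zero is an added guard rather than something stated in the description.

```python
from statistics import mean, pstdev


def first_threshold(face_info, k1):
    """Th_1 = Lm + k_1 * sigma over the face information Lt of all target shots."""
    return mean(face_info) + k1 * pstdev(face_info)


def person_feature_amounts(face_info):
    """Equation (2): Pt = 10 * (Lt - Lm) / sigma + 50 for every sample Lt."""
    lm = mean(face_info)
    sigma = pstdev(face_info)  # assumption: population standard deviation
    if sigma == 0:
        # Assumption: with no variation, every sample gets the midpoint score.
        return [50.0 for _ in face_info]
    return [10 * (lt - lm) / sigma + 50 for lt in face_info]


def exceeds_first_threshold(lt, th1):
    """Equation (1): Lt > Th_1."""
    return lt > th1
```

With this form, a sample equal to the average Lm scores 50, and each standard deviation above or below the average adds or subtracts 10 points.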

When the person feature amount extraction unit 8 obtains the person feature amount Pt, it may apply weighting according to gender or age, such as different weights for men and women or for children and adults, or it may apply weighting for a specific person.
The person feature amount extraction unit 8 outputs the obtained face information Lt and person feature amount Pt to the person feature section extraction unit 9 and the data recording unit 10.

The person feature section extraction unit 9 extracts the person feature sections and non-person feature sections in the shooting data based on the face information Lt and the person feature amount Pt extracted by the person feature amount extraction unit 8.
The person feature section extraction unit 9 first extracts, based on the face information Lt extracted by the person feature amount extraction unit 8, the video sections that contain images whose face information Lt is greater than the first threshold Th_1. It then takes each video section in which images with face information Lt greater than the first threshold Th_1 appear consecutively as a person feature section.

For example, when the face information Lt is the face size L1t, a video section of consecutive captured images in which a face size L1t greater than the first threshold Th_11, obtained from the face size L1t, is detected becomes a person feature section. FIG. 16 shows an example of extracting person feature sections and non-person feature sections based on the face size L1t. FIG. 16 plots the variation of the face information Lt (face size L1t) over time t. In FIG. 16, the person feature sections are the section from time t1 to time t2 and the section from time t5 to time t6.
When the images containing a person's face are temporarily interrupted, the continuity of the images before and after the interruption, that is, whether they belong to one person feature section, is evaluated based on the length of the interruption and on the face position and face size in the images before and after it. A video in which a person (subject) turns sideways and then faces the front again could otherwise be split into several person feature sections; with this evaluation it can be handled as one continuous video section, which is preferable.

Next, the person feature section extraction unit 9 calculates a score from the person feature amount Pt extracted by the person feature amount extraction unit 8. The score is obtained from the person feature amounts Pt calculated within one section. For example, the score for the section from time t1 to time t2 is obtained from the plurality of person feature amounts Pt between times t1 and t2.
When the person feature amount Pt varies between times t1 and t2, the score for that interval is obtained, for example, from the average of the person feature amounts Pt. When the variation of the person feature amount Pt between times t1 and t2 is large, the person feature section from t1 to t2 may be further divided into a plurality of sections according to the person feature amount Pt, and a score may be obtained from the average of the person feature amount Pt in each divided section. When a plurality of person feature amounts based on a plurality of types of face information are used, the person feature section may be divided into a plurality of sections based on the person feature amount that varies most within that section.
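The averaging described above can be sketched as a small Python function. The sample layout, a list of (time, Pt) pairs, and the inclusive section boundaries are assumptions made for this illustration.

```python
def section_score(samples, start, end):
    """Average person feature amount Pt over the section [start, end].

    samples: list of (t, pt) pairs; both boundaries are treated as inclusive
    (an assumption). Returns None when the section contains no sample.
    """
    inside = [pt for t, pt in samples if start <= t <= end]
    if not inside:
        return None
    return sum(inside) / len(inside)
```

For samples (1, 60.0) and (2, 80.0) inside a section from time 1 to time 2, the score is 70.0.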

The person feature section extraction unit 9 may detect miss-shot sections, such as those affected by camera shake or defocus, based on the person feature amounts or the shooting information, and exclude them from extraction as person feature sections and non-person feature sections.

The person feature section extraction unit 9 also extracts the sections other than the person feature sections as non-person feature sections, based on the face information Lt extracted by the person feature amount extraction unit 8.
Each continuous section other than a person feature section could be extracted as a single non-person feature section, but the person feature section extraction unit 9 of the first embodiment divides the non-person feature sections into first non-person feature sections and second non-person feature sections. This allows the video sections to be classified more finely.
A first non-person feature section is a run of consecutive images in which a person is captured smaller than a predetermined reference value or is not captured at all. A second non-person feature section is a section whose person features are weaker than those of a person feature section and stronger than those of a first non-person feature section. An example is a video section in which a person is captured but the face appears only for a very short time.

The person feature section extraction unit 9 of the first embodiment extracts the sections of consecutive images at time t that satisfy Equation (3) below. The second threshold is Th_2 = Lm + k_2 σ, where k_2 is a coefficient and k_1 > k_2. The second threshold Th_2 is smaller than the first threshold Th_1.
An image whose face information Lt is smaller than the second threshold Th_2 is an image in which the face appears very small, so that facial features are difficult to detect. The person feature section extraction unit 9 of the first embodiment extracts the sections in which such images appear consecutively, together with the sections in which no face appears at all, as first non-person feature sections.

Lt < Th_2 ... Equation (3)

The person feature section extraction unit 9 then extracts the sections other than the person feature sections and first non-person feature sections extracted as described above as second non-person feature sections.
When the person feature amount P1t is calculated from the face size L1t, a second non-person feature section is a video section of consecutive images in which the face is smaller than in a person feature section but, compared with a first non-person feature section, large enough for features to be detected from it. It also includes continuous video sections in which a face image appears only for a moment.

Referring again to FIG. 16, an example of extracting first and second non-person feature sections based on the face size L1t is shown. In FIG. 16, the first non-person feature section is the section from time t3 to time t4, in which the face size L1t falls below the second threshold Th_12, and the second non-person feature sections are the sections from time t0 to time t1, from time t2 to time t3, and from time t4 to time t5.
Here, the person feature section extraction unit 9 extracts the above sections using the person feature amount Pt based on the face information Lt, but it may instead extract them using the amount of change in luminance information or a feature amount based on features other than the face detected from the image.
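The three-way division by Equations (1) and (3) can be sketched in Python as below. This is a simplified illustration under stated assumptions: the input is a time-ordered list of (time, Lt) samples with Lt set to None when no face is detected, each section is reported by its first and last sample times, and the continuity evaluation for briefly interrupted faces is not modeled. All names are hypothetical.

```python
from itertools import groupby

PERSON = "person"
NON_PERSON_1 = "non_person_1"
NON_PERSON_2 = "non_person_2"


def classify(lt, th1, th2):
    """Label one sample by its face information Lt."""
    if lt is not None and lt > th1:   # Equation (1): Lt > Th_1
        return PERSON
    if lt is None or lt < th2:        # Equation (3): Lt < Th_2, or no face at all
        return NON_PERSON_1
    return NON_PERSON_2               # everything in between


def split_sections(samples, th1, th2):
    """Group consecutive samples with the same label into sections.

    samples: list of (t, lt) sorted by t.
    Returns a list of (label, first_t, last_t).
    """
    sections = []
    for kind, run in groupby(samples, key=lambda s: classify(s[1], th1, th2)):
        run = list(run)
        sections.append((kind, run[0][0], run[-1][0]))
    return sections
```

Applied to a curve like the one in FIG. 16, the runs above Th_1 come out as person feature sections, the runs below Th_2 as first non-person feature sections, and the remaining runs as second non-person feature sections.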

FIG. 3 is a conceptual diagram of the extraction of person feature sections and non-person feature sections in the first embodiment.
FIG. 3(a) shows an example of the stream data received by the person feature amount extraction unit 8. FIGS. 3(b) to 3(i) show the feature sections (person feature sections and non-person feature sections) into which the person feature section extraction unit 9 divides the video based on the person feature amounts Pt that the person feature amount extraction unit 8 obtains from the stream data of FIG. 3(a), and the scores of the divided feature sections based on the person feature amounts.
FIG. 3(b) shows an example of dividing the video sections by the face size feature amount P1t based on the face size L1t, FIG. 3(c) by the face position feature amount P2t based on the face position L2t, FIG. 3(d) by the face orientation feature amount P3t based on the face orientation L3t, and FIG. 3(e) by the face tilt feature amount P4t based on the face tilt L4t. The face position L2t indicates the position of the face image on the screen, the face orientation L3t indicates which direction (left or right) the face image is facing on the screen, and the face tilt L4t indicates how far the face image is tilted relative to the vertical direction of the screen.

FIG. 3(f) shows, as an example, the score of each video section calculated by the person feature section extraction unit 9 from the face size feature amount P1t; FIG. 3(g) shows the score calculated from the face position feature amount P2t; FIG. 3(h) shows the score calculated from the face orientation feature amount P3t; and FIG. 3(i) shows the score calculated from the face tilt feature amount P4t.

The face position feature amount P2t takes its largest value when the face position L2t is at the center; the face orientation feature amount P3t is larger when the face orientation L3t is toward the center than when it is toward the left or right; and the face tilt feature amount P4t is larger when the face tilt L4t (the angle in the up-down direction) is vertical.
In the score examples shown in FIGS. 3(f) to 3(i), a high score is given when the face size feature amount P1t is large, when the face is at the center of the screen for the face position feature amount P2t, when the face is facing the center for the face orientation feature amount P3t, and when the up-down tilt (angle) of the face is vertical. For position, orientation, and tilt, the feature amounts may be evaluated in more detail for each of the left, right, up, and down directions.

The person feature section extraction unit 9 of the first embodiment divides the video into person feature sections, first non-person feature sections, and second non-person feature sections based on the face information Lt, and calculates the score of each feature section based on the person feature amount Pt. A person feature amount Pt is obtained for each type of face information Lt, such as the face size of the person detected from the video, the face position on the screen, the face orientation, and the face tilt.
The person feature section extraction unit 9 then generates and outputs person feature section information, which contains the time information, scores, and person identification information of the person feature sections, and non-person feature section information, which contains the time information and scores of the non-person feature sections.

Next, the data recording unit 10 stores on the recording medium 11, in association with one another, the stream data input from the stream data input unit 7, the person feature amounts Pt extracted by the person feature amount extraction unit 8, and the person feature section information and non-person feature section information extracted by the person feature section extraction unit 9.

This completes the operation at the time of recording stream data. Next, the playback operation of the summary video will be described.

(Summary video playback operation)
When the summary video is played back, the parameter setting unit 5 outputs the stream data selected by the user, the summary generation mode, the summary playback time, and so on to the summary generation / playback unit 6 as parameters.

The data reading unit 12 reads from the data recording unit 10 the stream data indicated by the information set in the parameter setting unit 5, the person feature amounts corresponding to that stream data, and the person feature section information and non-person feature section information.

Next, the representative section selection unit 131 weights the scores of each person feature section according to the summary generation mode set by the parameter setting unit 5, sums the weighted scores of each person feature section to obtain its evaluation value, and selects representative sections according to the magnitude of the evaluation value. When a person feature section is evaluated using n types of person feature amounts, let Pit be the score, for the i-th person feature amount, of the person feature section that contains time t. The evaluation value Vmt of the person feature section in the set summary generation mode m is obtained by Equation (4) below.

Here, Cmi is the weighting coefficient of the i-th person feature amount in summary generation mode m, and Ami is an initial setting value. The initial setting value Ami may be a predetermined value such as 0.
For example, for a summary video centered on a child's cute expressions, the evaluation value Vmt is calculated with the face size score Pit weighted by the weighting coefficient Cmi for that summary generation mode m, so that close-up scenes of the child are given priority. For a summary video consisting mainly of footage of people shot together with scenery, as on a trip, the evaluation value Vmt is calculated with the face orientation weighted by the weighting coefficient Cmi for that summary generation mode m, so that scenes in which the face is oriented close to the front, as in commemorative photos at tourist spots, are given priority.
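Equation (4) itself is not reproduced in this text. The following Python sketch therefore assumes one form that is consistent with the definitions given for Pit, Cmi, and Ami, namely Vmt = Σ (Cmi × Pit + Ami) summed over the n person feature amounts; the actual equation may differ. The mode names used as keys and the weight values are hypothetical and chosen only to show how a mode can favor face size or face orientation.

```python
# Hypothetical weighting coefficients Cmi per mode, ordered as
# [face size, face position, face orientation, face tilt].
MODE_WEIGHTS = {
    "kids": [3.0, 1.0, 1.0, 1.0],             # emphasize face size (close-ups)
    "travel_memorial": [1.0, 1.0, 3.0, 1.0],  # emphasize face orientation (frontal)
}


def evaluation_value(scores, weights, offsets=None):
    """Assumed form of Equation (4): Vmt = sum_i (Cmi * Pit + Ami).

    scores:  Pit, the section's score for each person feature amount
    weights: Cmi, the weighting coefficients of the selected mode
    offsets: Ami, the initial setting values (default 0 for every feature)
    """
    if offsets is None:
        offsets = [0.0] * len(scores)
    if not (len(scores) == len(weights) == len(offsets)):
        raise ValueError("scores, weights and offsets must have the same length")
    return sum(c * p + a for p, c, a in zip(scores, weights, offsets))
```

For a section with scores [60, 50, 40, 50], the "kids" weights give 320 and the "travel_memorial" weights give 280, so the same section ranks differently depending on the mode.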

Then, the representative section selection unit 131 arranges the person feature sections in descending order of the evaluation value Vmt as representative section candidates, and selects representative sections from the top of the candidates within a range that does not exceed the set summary playback time. The length of each representative section may be adjusted based on the feature amounts so that it falls within a range bounded by the shortest length that does not impair viewability and the longest length determined by the set summary playback time.
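The selection step above can be sketched as a greedy pick in descending order of evaluation value. This is a minimal illustration, not the patented implementation: the `Section` type and `select_representatives` function are hypothetical names, and skipping a candidate that does not fit (rather than stopping at it) is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: float   # seconds from the start of the stream (assumed unit)
    end: float
    value: float   # evaluation value Vmt of this person feature section

    @property
    def duration(self) -> float:
        return self.end - self.start


def select_representatives(sections, summary_time):
    """Pick candidates in descending order of value without exceeding summary_time."""
    chosen, used = [], 0.0
    for s in sorted(sections, key=lambda s: s.value, reverse=True):
        # Assumption: a candidate that does not fit is skipped, and
        # lower-ranked, shorter candidates may still be taken.
        if used + s.duration <= summary_time:
            chosen.append(s)
            used += s.duration
    return chosen


candidates = [Section(0, 10, 0.9), Section(20, 28, 0.5), Section(40, 44, 0.7)]
picked = select_representatives(candidates, summary_time=15)
```

With a 15-second budget, the 10-second and 4-second sections are taken and the 8-second section is left out.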

Next, the connecting section selection unit 132 selects, as a short connecting section, a calm video section that has a high background affinity with the representative section selected by the representative section selection unit 131 and little large motion in the center of the frame. Of course, a connecting section need not be selected, in which case the connecting section selection unit 132 can be omitted. The connecting section may also be inserted after the representative section rather than before it, or both before and after it.

Here, in Embodiment 1, a video section in the same shot as the representative section is used as the connecting section, and the connecting section is played back as an introduction to the representative section. That is, a summary video is generated in which the representative section is played back immediately after the connecting section.

As a first example of selecting a connecting section, a video section of a predetermined duration (for example, 2 to 3 seconds) is selected from the first non-person feature section in the same shot as the representative section, choosing a section with little motion or a section whose time information is close to that of the representative section.
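As a rough sketch of the first example, the function below slides a fixed-length window over a non-person feature section and keeps the window with the least motion. The per-second motion values, the function name `pick_connecting_section`, and the tie-break by closeness to the representative section are all assumptions made for illustration. The text offers low motion and temporal closeness as alternatives and does not say how to combine them.

```python
def pick_connecting_section(motion, section_start, rep_start, length=2):
    """Return (start, end) in seconds of the calmest window, or None.

    motion        : motion activity per second inside the non-person section
    section_start : start time of that section in seconds
    rep_start     : start time of the representative section in seconds
    length        : window length in seconds (the text suggests 2 to 3)
    """
    best_key, best_window = None, None
    for k in range(len(motion) - length + 1):
        start = section_start + k
        end = start + length
        cost = sum(motion[k:k + length])   # total motion in the window
        distance = abs(rep_start - end)    # gap to the representative section
        key = (cost, distance)             # least motion first, then nearest
        if best_key is None or key < best_key:
            best_key, best_window = key, (start, end)
    return best_window


window = pick_connecting_section([5, 1, 1, 4, 1, 1], section_start=10, rep_start=16)
```

Two windows tie on motion here, and the one ending right at the representative section is chosen.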

As described above, the first non-person feature section is a video section in which the person's face appears smaller than a predetermined reference value or does not appear at all, so that the face is not sufficiently conspicuous. Therefore, its difference in video features from the representative section, which is selected from the person feature sections, is larger than the difference between the second non-person feature section and the representative section. Accordingly, when the connecting section is selected from the first non-person feature section, the large difference in video features from the representative section played back next creates a contrast between the two sections, which adds variety to the summary video and reduces viewer boredom.

As a second example, a video section of a predetermined duration (2 to 3 seconds) is selected as the connecting section from the second non-person feature section in the same shot as the representative section, choosing a section with little motion or a section temporally close to the representative section.
As described above, the second non-person feature section is ambiguous as to the presence of a person: a person is captured, but, for example, the face appears only momentarily. Therefore, although its difference in video features from the representative section is smaller than that of the first non-person feature section, the difference is still large compared with other video within the person feature sections. Accordingly, when the connecting section is selected from the second non-person feature section, the connecting section contrasts with the representative section, which adds variety to the summary video and reduces viewer boredom.

Furthermore, as a third example, the connecting section is not selected from a non-person feature section. Instead, a video section of a predetermined duration (2 to 3 seconds) from the start of the shot is selected, regardless of whether it belongs to a person feature section or a non-person feature section. This is because general users commonly begin a shot by filming the scenery around the person being shot. Accordingly, a video section of about 2 to 3 seconds from the start of the shot is simply selected as the connecting section.
As described above, the video section of about 2 to 3 seconds from the start of a shot is generally a calm video section, so the connecting section differs greatly from the representative section and contrasts with it, which adds variety to the summary video and reduces viewer boredom.

Furthermore, as a fourth example, instead of the stream data shot by the user, stream data containing shooting data and audio data stored in advance on the recording medium 11 is used, and the connecting section is selected from shots whose background feature amounts have a high affinity with those of the representative section. Alternatively, connecting videos may be stored in a database in advance together with their feature amounts, and a connecting section with a high affinity to the feature amounts of the representative section may be selected. Here, the feature amounts include the shooting date and time, title, shooting location, background color distribution, edge distribution, motion activity, volume, and type of sound of the video. The feature amounts of the representative section are compared with those of other video sections, and a video section whose feature amounts have values similar to those of the representative section is used as the connecting section.

By selecting the connecting section from data other than the stream data shot by the user in this way, a connecting section that differs even more from the representative section can be selected.
When a plurality of connecting sections are provided before and after the representative section, the first to fourth examples above may of course be combined in any way.

Next, the summary video generation unit 14 arranges the representative sections selected from the stream data, each with its corresponding connecting section, in order of the shooting times of the representative sections, and generates a playlist or a summary video.

FIG. 4 is a diagram illustrating an example of the summary video generation method of the summary video generation unit 14.
As shown in FIG. 4, the summary video generation unit 14 arranges the representative section 40A selected from the stream data 40 with its corresponding connecting section 40B, and the representative section 41A selected from the stream data 41 with its corresponding connecting section 41B, in order of the shooting times of the representative sections, and generates a summary video of the stream data 40 and 41. The summary video may or may not include audio data.

As described above, the summary video generation unit 14 extracts parts of the stream data 40 and 41 to generate a summary video. Alternatively, without generating a summary video, it may specify those parts of the stream data 40 and 41 in a playlist. Here, a playlist specifies, by time information or the like, the ranges of the stream data to be played back as the summary video, without generating the summary video itself.
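A playlist of this kind can be sketched as an ordered list of time ranges into the original streams. The `Clip` type and `build_playlist` function below are hypothetical names, and placing each connecting section directly before its representative section follows the Embodiment 1 arrangement. Treat it as an illustrative sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    stream_id: str   # which stream data the range refers to
    start: float     # range start in seconds within that stream
    end: float       # range end in seconds within that stream


def build_playlist(pairs):
    """Flatten (representative, connecting, shooting_time) tuples into a playlist.

    Entries are ordered by the shooting time of the representative section,
    and each connecting section (if any) precedes its representative section.
    """
    playlist = []
    for rep, conn, _ in sorted(pairs, key=lambda p: p[2]):
        if conn is not None:
            playlist.append(conn)
        playlist.append(rep)
    return playlist


pairs = [
    (Clip("41", 30.0, 40.0), Clip("41", 0.0, 2.0), 200.0),
    (Clip("40", 12.0, 20.0), Clip("40", 5.0, 7.0), 100.0),
]
playlist = build_playlist(pairs)
```

Because only time ranges are stored, no video is re-encoded; a player would seek to each range in turn.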

The playback processing unit 15 performs playback processing on the summary video that the summary video generation unit 14 generated from the stream data and outputs it to the decoding unit 16. The decoding unit 16 decodes the summary video from the playback processing unit 15 and outputs it to the data output unit 17.
The data output unit 17 outputs the summary video decoded by the decoding unit 16 to the display device 3 for display. At that time, the data output unit 17 may apply a dissolve or a fade-in/fade-out effect at the junction between a connecting section and a representative section so that the video of each section plays back smoothly.

FIG. 5 is a diagram illustrating another example of the summary video generation method of the summary video generation unit 14.
Unlike the generation method shown in FIG. 4, the generation method shown in FIG. 5 generates the summary video with the video data (shooting data) and the audio data shifted independently of each other.

In the generation method shown in FIG. 5, the summary video is generated by offsetting two timings from each other: the timing at which the video data selected from the stream data 50 switches from the connecting section 51B to the representative section 51A, and the timing at which the audio data switches from the connecting section 52B, which corresponds to the connecting section 51B, to the representative section 52A, which corresponds to the representative section 51A.
In Embodiment 1, the audio data switches from the connecting section 52B to the representative section 52A earlier than the video data switches from the connecting section 51B to the representative section 51A. This lets the user recognize the transition from the connecting section to the representative section in advance from the sound, so the video change feels less abrupt and the content is easier to grasp in a short time.
Similarly, for the stream data 51, the summary video is generated so that the audio data switches from the connecting section 54B to the representative section 54A earlier than the video data switches from the connecting section 53B to the representative section 53A.
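The offset between the audio and video switch points can be illustrated with a small calculation on the summary timeline. The function name `switch_times` and the 0.5-second default lead are assumptions; the text states only that the audio switches earlier than the video, not by how much or how the source audio is taken.

```python
def switch_times(connecting_len, audio_lead=0.5):
    """Return (video_switch, audio_switch) in seconds.

    Both times are measured from the start of the connecting section on the
    summary timeline. The video switches when the connecting section ends;
    the audio switches audio_lead seconds before that.
    """
    if not 0.0 <= audio_lead <= connecting_len:
        raise ValueError("audio_lead must lie within the connecting section")
    video_switch = connecting_len
    audio_switch = connecting_len - audio_lead
    return video_switch, audio_switch


video_switch, audio_switch = switch_times(connecting_len=2.0, audio_lead=0.5)
```

For a 2-second connecting section, the audio of the representative section would begin at 1.5 seconds while the picture changes at 2.0 seconds.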

As described above, according to Embodiment 1, characteristic scenes in which a person is present in the shooting data are selected as person feature sections based on the person feature amounts of the stream data, representative sections are selected from those person feature sections, and the representative sections are arranged in a predetermined order to generate a playlist or a summary video. This makes it possible to generate a summary video from which the user can grasp the video content and see the central subject.

In particular, in Embodiment 1, when the summary video is generated, not only are representative sections selected from the person feature sections, but connecting sections that serve as introductions to the representative sections are also selected from the non-person feature sections, and the selected representative sections and connecting sections are arranged to generate the summary video (or playlist). Therefore, in addition to letting the user grasp the video content and see the central subject, the summary video is easy to watch and not tedious. That is, combining person-centered representative sections with connecting sections in which no person appears provides a summary video with clear contrast.

As described in Embodiment 1, home video cameras in wide use today generally generate and record an index for each shot, from the start of recording to the stop, so there is often no need to detect shot changes, that is, shot boundaries. However, when processing shooting data without an index shot with older equipment, shot boundaries may be detected in advance using a known technique, such as the correlation between frames, and the processing of Embodiment 1 may then be applied.
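One simple instance of detecting shot boundaries from the correlation between frames is sketched below. It is an illustrative assumption, not the method the patent prescribes: frames are represented as flat lists of grayscale values, and the Pearson correlation, the function names, and the threshold of 0.5 are all chosen for this example.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equally sized grayscale frames."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Flat frames: treat identical frames as fully correlated.
        return 1.0 if a == b else 0.0
    return cov / math.sqrt(var_a * var_b)


def shot_boundaries(frames, threshold=0.5):
    """Return indices i where frame i starts a new shot."""
    return [
        i for i in range(1, len(frames))
        if correlation(frames[i - 1], frames[i]) < threshold
    ]


frames = [
    [0, 10, 20, 30],
    [1, 11, 19, 31],   # nearly the same picture as the previous frame
    [30, 20, 10, 0],   # very different picture: a cut
    [29, 21, 11, 1],
]
boundaries = shot_boundaries(frames)
```

Real footage would need a more robust measure (for example, histogram-based), but the thresholding idea is the same.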

(Embodiment 2)
Next, a summary video generation device and a summary video generation method according to Embodiment 2 of the present invention will be described. The summary video generation device of Embodiment 2 differs from the summary video generation device 1 described in Embodiment 1 in that it does not include a person feature amount extraction unit.
In Embodiment 2, the imaging device that generates the stream data includes the person feature amount extraction unit, which extracts person feature amounts from the stream data. The summary video generation device receives and processes the stream data output by the imaging device, to which the person feature amounts have been added.

FIG. 6 is a block diagram of a summary video generation device 600 according to Embodiment 2 of the present invention.
As shown in FIG. 6, the summary video generation device 600 includes a recording control unit 602 and a summary generation/playback unit 603. The summary video generation device 600 receives, from the imaging device 601, stream data containing shooting data and shooting information that indicates the shooting conditions of the shooting data, together with person feature amounts extracted from the shooting data based on the features of a person's face, and generates a summary video.

The imaging device 601 includes an image input unit 605, an audio input unit 606, an image encoding unit 607, an audio encoding unit 608, a multiplexing processing unit 609, a data storage unit 610, a person feature amount extraction unit 611, and a shooting information output unit 612. The shooting information output unit 612 includes a clock unit 613 and a sensor unit 614.

The recording control unit 602 includes a stream data input unit 615, a person feature section extraction unit 616, a data recording unit 617, and a playback control processing information generation unit 618.

The summary generation/playback unit 603 includes a data reading unit 621, a playback control processing execution unit 622, an operation input unit 623, a decoding unit 624, and a data output unit 625. The playback control processing execution unit 622 includes a parameter setting processing unit 61, a summary playback section selection unit 62, a summary video generation unit 63, and a playback processing unit 64.
The summary generation/playback unit 603 of Embodiment 2 is described as having a summary video generation unit (playlist generation unit) 63 that generates a playlist specifying the video in the stream data to be used as the summary video. Of course, as in Embodiment 1 above, it may instead have the summary video generation unit 14 (see FIG. 2), which generates a separate summary video from the stream data.

FIG. 7 is a diagram showing an example in which the summary video generation device 600 shown in FIG. 6 is applied to actual products.
The video camera 71 shown in FIG. 7 corresponds to the imaging device 601 of FIG. 6, the recorder 72 shown in FIG. 7 corresponds to the recording control unit 602 of FIG. 6, and the disc 73, such as a BD, shown in FIG. 7 corresponds to the recording medium 626 of FIG. 6. Stream data is recorded on the disc 73.

The monitor 74 shown in FIG. 7 together with the media player 75 connected to it (a DVD or BD player, HDD player, memory player, or the like), or the computer 76 with a built-in media player, corresponds to the summary generation/playback unit 603 of FIG. 6 and plays back the stream data and other information recorded on the disc 73.

Next, returning to FIG. 6, the operation of the summary video generation device 600 of Embodiment 2 and its summary video generation method will be described.

The image input unit 605 converts an image captured by an imaging unit such as a camera (not shown) of the imaging device 601 into a digital image signal and sends it to the image encoding unit 607, which encodes the digital image signal and outputs it to the multiplexing processing unit 609. The digital image signal is also sent from the image input unit 605 to the person feature amount extraction unit 611, which extracts person feature amounts from the digital image signal in the same manner as the person feature amount extraction unit 8 of Embodiment 1. As in Embodiment 1, the person feature amounts include, for example, information indicating the presence or absence of a face image in the captured image, the size of the face image, the position of the face image on the screen, and the orientation of the face, as well as personal identification information.

The audio input unit 606 converts the audio signal that the imaging device 601 recorded together with the captured image into a digital audio signal and outputs it to the audio encoding unit 608. The audio encoding unit 608 encodes the digital audio signal and outputs it to the multiplexing processing unit 609.
The multiplexing processing unit 609 multiplexes the encoded digital image signal and digital audio signal and outputs them to the data storage unit 610, such as a memory, HDD, or BD.

The shooting information output unit 612 includes the clock unit 613 and the sensor unit 614. The clock unit 613 outputs shooting time information indicating, for example, the time at which shooting started and the time elapsed from the start to the end of shooting. The sensor unit 614 outputs information indicating the shooting environment, such as information on camera movement during shooting and on control operations performed on the camera (zoom operations, image quality settings, and the like), and information on the shooting position and shooting direction detected by sensors such as a GPS receiver or a gyro sensor (not shown). The shooting information output unit 612 outputs these pieces of information to the data storage unit 610 as shooting information.

The data storage unit 610 associates the image signal and audio signal multiplexed by the multiplexing processing unit 609 (hereinafter simply "shooting data") with the shooting information that the shooting information output unit 612 acquired when the shooting data was shot, and stores them as stream data. It also stores the person feature amounts that the person feature amount extraction unit 611 extracted from the image signal underlying the stream data, in association with the stream data. This completes the operation of the imaging device 601.

Next, the operation of the recording control unit 602 will be described.
First, the stream data input unit 615 acquires, from the data storage unit 610 of the imaging device 601, the stream data, which contains the shooting data and the shooting information associated with it, together with the person feature amounts added to the stream data. In Embodiment 2, as in Embodiment 1, the stream data is described as a data file created for each shot, from the start of recording to the stop, that contains shooting data consisting of video data (image signal) and audio data (audio signal) and the shooting information associated with the shooting data. Alternatively, the shooting data, shooting information, person feature amounts, and so on may be stored in separate files that are treated as corresponding to one another.
As in Embodiment 1, shooting time information, shooting position information, and the like can be used as the shooting information. Furthermore, shots can be classified and organized by shooting scene using the shooting information.

The person feature section extraction unit 616 is the same as the person feature section extraction unit 9 of Embodiment 1. It extracts the person feature sections within a shooting scene based on the person feature amounts, generates person feature section information indicating, for example, the positions of the person feature sections on the time axis, and non-person feature section information indicating, for example, the positions on the time axis of the sections other than the person feature sections, and outputs both to the data recording unit 617. As in Embodiment 1, the non-person feature sections may be divided into a plurality of kinds of non-person feature sections.
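The division into person feature sections and non-person feature sections can be sketched as thresholding a per-frame person feature amount and collecting contiguous runs. The function name `split_sections`, the single scalar feature per frame, and treating values equal to the threshold as "person" are assumptions made for this illustration.

```python
def split_sections(values, threshold):
    """Split a per-frame feature sequence into contiguous index ranges.

    Returns (person, non_person), each a list of (start, end) pairs with
    end exclusive. A frame counts as "person" when its value is at or
    above the threshold (assumed boundary handling).
    """
    person, non_person = [], []
    start = 0
    for i in range(1, len(values) + 1):
        at_end = i == len(values)
        # Close the current run at the end or when the label changes.
        if at_end or (values[i] >= threshold) != (values[start] >= threshold):
            target = person if values[start] >= threshold else non_person
            target.append((start, i))
            start = i
    return person, non_person


face_size = [0.1, 0.2, 0.8, 0.9, 0.7, 0.0]
person, non_person = split_sections(face_size, threshold=0.5)
```

The resulting index ranges play the role of the section information that records positions on the time axis.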

The playback control processing information generation unit 618 generates playback control processing information, which describes the playback control processing method for the stream data to be recorded, and outputs it to the data recording unit 617. Details of the playback control processing are described later.

The data recording unit 617 receives the stream data and person feature amounts from the stream data input unit 615, the person feature section information and non-person feature section information extracted by the person feature section extraction unit 616, and the playback control processing information from the playback control processing information generation unit 618, and associates with each piece of stream data the information derived from it. In Embodiment 2, the person feature amounts, shooting information, person feature section information, non-person feature section information, and playback control processing information are associated with the stream data. If the non-person feature section information is not used when generating the summary video, its association with the stream data may be omitted. Hereinafter, the stream data together with the information associated with it is simply called a stream data group.
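The stream data group described above can be pictured as a single record that bundles the stream with its associated information. The field names and types below are hypothetical, since the text does not define a storage layout. The optional field reflects the statement that non-person feature section information may be omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StreamDataGroup:
    stream_path: str                  # location of the stream data (assumed)
    shooting_info: dict               # e.g. shooting start time, position
    person_features: list             # per-frame person feature amounts
    person_sections: list             # (start, end) ranges on the time axis
    non_person_sections: Optional[list] = None   # may be omitted
    playback_control: dict = field(default_factory=dict)


group = StreamDataGroup(
    stream_path="shot_0001.bin",      # placeholder name for illustration
    shooting_info={"start_time": 0.0},
    person_features=[0.1, 0.8, 0.9],
    person_sections=[(1, 3)],
)
```

Keeping everything in one record mirrors the association the data recording unit 617 performs. Separate files linked by an identifier would serve equally well.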

The data recording unit 617 stores the associated stream data groups.
The stream data groups stored in the data recording unit 617 can also be recorded on the recording medium 626. The means used for this recording is not shown in FIG. 6, and any known means may be used. The recording medium 626 may be a DVD, BD, HDD, semiconductor memory, or the like, and may be fixed or removable.
This completes the recording processing of the recording control unit 602.

(Summary generation processing)
The summary generation/playback unit 603 generates and plays back a summary video based on the stream data groups recorded in the data recording unit 617. Alternatively, it may generate and play back a summary video based on the stream data groups recorded on the recording medium 626. Any known means may be used for the summary generation/playback unit 603 to acquire the information recorded on the recording medium 626.
The following describes the case in which the summary generation/playback unit 603 acquires the stream data groups from the data recording unit 617.

The data reading unit 621 reads a stream data group from the data recording unit 617 and supplies it to the playback control processing execution unit 622.
The data reading unit 621 may be controlled so that it reads the stream data group from the data recording unit 617 only when the user instructs generation and playback of a summary video via the operation input unit 623.

Next, the playback control processing execution unit 622 plays back a summary video of the stream data in accordance with the playback control processing information contained in the stream data group output from the data reading unit 621.
Within the playback control processing execution unit 622, the parameter setting processing unit 61 first receives the parameters set by the user from the operation input unit 623. Through the operation input unit 623, the user sets parameters indicating the stream data for which a summary video is to be generated, the summary generation mode, the playback time of the summary video, and so on.
The parameter setting processing unit 61 reads, from the data reading unit 621, the stream data group containing the stream data indicated by the received parameters.
The parameter setting processing unit 61 supplies the read stream data group and the parameters to the summary playback section selection unit 62. The summary playback section selection unit 62 is the same as the summary playback section selection unit 13 of Embodiment 1.

Based on the parameters received from the parameter setting processing unit 61, the summary playback section selection unit 62 classifies the shots of the stream data into shooting scenes using the shooting information associated with the stream data, such as the shooting start time and shooting end time of each shot. It then evaluates the scores of the person feature sections and non-person feature sections according to the summary generation mode set by the user, and selects summary playback sections for each shooting scene. If the non-person feature section information is not used when generating the summary video, it need not be read even if the stream data group contains it.
The detailed operation of the summary playback section selection unit 62 of Embodiment 2 is described below.

The summary playback section selection unit 62 receives the parameters and the stream data group from the parameter setting processing unit 61. As described above, in the second embodiment the stream data group includes the stream data together with the person feature amounts, shooting information, person feature section information, and non-person feature section information associated with the stream data.

The summary playback section selection unit 62 classifies the shots constituting the stream data into shooting scenes based on the shooting time information. For example, assuming a shooting pattern in which the user shoots at one scene, moves, and then shoots at the next scene, the unit uses the shooting time information to determine scene boundaries from, for instance, the length of the time interval between shots, and classifies the shots accordingly. Alternatively, shots may be classified based on changes in shooting style, using zoom operations, camera movement, or sensor information such as GPS.

Next, the summary playback section selection unit 62 evaluates the score of each person feature section and non-person feature section based on the set summary generation mode. Hereinafter, in the second embodiment, person feature sections and non-person feature sections are collectively treated as feature sections.
In the second embodiment, a plurality of scene types are set for a summary generation mode. In the first embodiment, one summary generation mode indicated the weighting applied to one type of person feature amount (score). In the second embodiment, a scene type indicates which person feature amounts are weighted and to what degree. Accordingly, weightings for a plurality of person feature amounts are set in a single summary generation mode. Scene types are described in detail later.
Here, when a feature section is evaluated using n types of person feature amounts Pt, let Pit denote the score, for the i-th person feature amount Pt, of the feature section containing time t. The evaluation value Vmst of the feature section for the set summary generation mode m and scene type s is obtained by equation (5) below.

Here, Cmsi is the weighting coefficient for the i-th person feature amount Pit in summary generation mode m and scene type s, and Amsi is an initial setting value. The initial setting value Amsi may be a predetermined value such as 0.
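The image of equation (5) is not reproduced in this text. One form consistent with the definitions given here (a weighting coefficient Cmsi and an initial setting value Amsi for each of the n person feature amounts) is a weighted sum over the feature scores. The following is an assumed reconstruction for illustration only, not the published formula:

```latex
V_{mst} = \sum_{i=1}^{n} \left( C_{msi}\, P_{it} + A_{msi} \right)
```

Under this assumed form, a negative coefficient combined with an initial setting value of 100 rewards a low score for that feature amount.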

As in the first embodiment, the evaluation value Vmst is calculated according to the setting of the summary generation mode m, for example a child-centered mode that prioritizes close-up scenes of children, or a commemorative-photo mode that prioritizes shooting scenes in which both scenery and people appear and the people face the camera. A summary video suited to each summary generation mode m can thus be generated.

Equation (5) used in the second embodiment calculates the evaluation value Vmst of a feature section taking the scene type s into account. In the first embodiment, by contrast, the evaluation value Vmt was obtained using equation (4), which does not take the scene type s into account.
That is, in the first embodiment, an evaluation value for selecting representative sections from a plurality of video sections is calculated by equation (4) for each summary generation mode m. The summary generation mode m of the first embodiment indicates the weighting for one type of person feature amount. Suitable sections other than the representative sections are then selected as connecting sections to compose the summary video.

In the second embodiment, by contrast, a plurality of scene types s are set for the summary generation mode m. The evaluation value Vmst is calculated by equation (5) for each set scene type s, and the summary video is composed by combining video from the video sections that have high evaluation values Vmst under each scene type s.

For example, for a child-centered summary generation mode m, the following three scene types s can be assumed.
Scene type 1. A person appears large in the center of the screen, facing the camera.
Scene type 2. A person appears large, but toward the edge of the screen and with varying orientation.
Scene type 3. No person appears large.

That is, in the second embodiment, arbitrary scene types s can be set for each summary generation mode m as described above, and a varied summary video can be composed by combining video extracted from video sections evaluated under a plurality of scene types s for each summary generation mode.

As described above, the first embodiment has fewer kinds of evaluation values than the second embodiment, so its processing is simpler. The second embodiment, on the other hand, sets a plurality of scene types s for a summary generation mode m, so feature sections are evaluated by a plurality of evaluation methods and the video extracted from them is combined to generate the summary video. Changes in the video can therefore be controlled finely in line with the summary generation mode m selected by the user.

The summary playback section selection unit 62 is the same as the summary playback section selection unit 13 of the first embodiment, and determines the number of cuts to use in the summary video based on the shooting time of each shooting scene. It then sets evaluation value calculation parameters for each of the plurality of scene types s, calculates the evaluation value Vmst of each video section, and selects video sections to use in the summary video based on Vmst until the predetermined number of cuts is reached. The video sections are preferably selected by cycling through the evaluations for the respective scene types s.

The summary playback section selection unit 62 then extracts, from each selected video section, a video section of a predetermined duration as a summary playback section.

The procedure for selecting summary playback sections in the second embodiment is described below using example shots.

FIG. 8 shows an example of shooting information for eight shots.
In the example shown in FIG. 8, there are eight shots with shot numbers S1 to S8. Shooting information such as the shooting start date and time, the shooting duration, and the interval since the previous shot is attached to each shot and stored in the summary playback section selection unit 62.

FIG. 9 shows the summary playback section selection processing of the summary playback section selection unit 62. It is described using the example shots shown in FIG. 8.
The summary playback section selection unit 62 divides the plurality of (here, eight) shots S1 to S8 into shooting scenes based on, for example, the shooting interval between shots. For example, a predetermined threshold such as 30 minutes, 1 hour, or 3 hours is set, and shots whose shooting interval is smaller than the threshold are grouped into one shooting scene. When the time-series shots S1 to S8 shown in FIG. 8 are grouped with a threshold of 30 minutes, the shooting interval between shot S1 and shot S2 is 1 minute 29 seconds, so they belong to the same shooting scene. The shooting interval between shot S2 and shot S3 is 1 hour 25 seconds, so shot S2 and shot S3 belong to different shooting scenes. Shots S4 to S8 are judged in the same way.
Accordingly, in the example shown in FIG. 8, the shots are divided into three shooting scenes 1 to 3: shooting scene 1 consisting of shots S1 and S2, shooting scene 2 consisting of shots S3 to S5, and shooting scene 3 consisting of shots S6 to S8. Shooting scenes 1 to 3 are shown in FIGS. 9(a) and 9(b). The shots need not be divided by comparison with a predetermined threshold; they may instead be divided between shots with relatively large shooting intervals. If position information is available, the shots may also be divided according to clusters of shooting positions.
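The threshold-based grouping described above can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: the function name `group_shots_into_scenes` is hypothetical, and only the first two intervals (1 minute 29 seconds and 1 hour 25 seconds) come from the text; the remaining intervals are invented so that the result matches the three scenes described for FIG. 8.

```python
from datetime import timedelta


def group_shots_into_scenes(shot_ids, gaps, threshold=timedelta(minutes=30)):
    """Group consecutive shots into shooting scenes.

    gaps[k] is the interval between shot k and shot k+1. A new scene starts
    whenever that interval is not smaller than the threshold.
    """
    if not shot_ids:
        return []
    scenes = [[shot_ids[0]]]
    for shot_id, gap in zip(shot_ids[1:], gaps):
        if gap < threshold:
            scenes[-1].append(shot_id)   # same shooting scene
        else:
            scenes.append([shot_id])     # large gap: start a new scene
    return scenes


shots = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
gaps = [
    timedelta(minutes=1, seconds=29),  # S1 -> S2 (from the text)
    timedelta(hours=1, seconds=25),    # S2 -> S3 (from the text)
    timedelta(minutes=2),              # hypothetical
    timedelta(minutes=5),              # hypothetical
    timedelta(hours=2),                # hypothetical
    timedelta(minutes=3),              # hypothetical
    timedelta(minutes=10),             # hypothetical
]
scenes = group_shots_into_scenes(shots, gaps)
```

Splitting at relatively large gaps instead of at a fixed threshold, as the text also allows, would only change the comparison inside the loop.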

Next, the summary playback section selection unit 62 determines the number of cuts to extract for summary playback based on the shooting time of each of shooting scenes 1 to 3. FIG. 9(c) shows the shooting times of shooting scenes 1 to 3, derived from the shooting times of shots S1 to S8 shown in FIG. 8. For example, shooting scene 1 consists of shot S1 with a shooting time of 1 minute 31 seconds and shot S2 with a shooting time of 1 minute 35 seconds, so the shooting time of shooting scene 1 is 3 minutes 06 seconds (3:06). Similarly, the shooting time of shooting scene 2 is 2 minutes 16 seconds (2:16), and that of shooting scene 3 is 1 minute 45 seconds (1:45).
The summary playback section selection unit 62 then determines the number of cuts to extract according to the shooting time of each shooting scene. As shown in FIG. 9(c), shooting scene 1 has the longest shooting time, followed by shooting scene 2 and then shooting scene 3. Here, as shown in FIG. 9(d), the number of cuts is set in accordance with the length of the shooting time: 4 cuts from shooting scene 1, 3 cuts from shooting scene 2, and 2 cuts from shooting scene 3.
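The text does not state the exact rule that maps shooting time to a number of cuts. As one possible sketch, a proportional allocation with largest-remainder rounding reproduces the 4, 3, and 2 cuts of FIG. 9(d), assuming a total of 9 cuts. The total and the function name `allocate_cuts` are assumptions, not part of the original.

```python
def allocate_cuts(scene_seconds, total_cuts):
    """Split total_cuts across scenes in proportion to their shooting time,
    using largest-remainder rounding so the counts sum to total_cuts."""
    total = sum(scene_seconds)
    quotas = [total_cuts * s / total for s in scene_seconds]
    cuts = [int(q) for q in quotas]            # floor of each quota
    leftover = total_cuts - sum(cuts)
    # Hand the remaining cuts to the scenes with the largest fractional parts.
    order = sorted(range(len(quotas)),
                   key=lambda k: quotas[k] - cuts[k], reverse=True)
    for k in order[:leftover]:
        cuts[k] += 1
    return cuts


# Shooting times from the text: 3:06, 2:16 and 1:45, in seconds.
durations = [3 * 60 + 6, 2 * 60 + 16, 1 * 60 + 45]
cuts_per_scene = allocate_cuts(durations, total_cuts=9)  # total of 9 is assumed
```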

Next, the summary playback section selection unit 62 selects, for each of shooting scenes 1 to 3, the summary playback sections (cuts), which are the sections to be extracted as the summary.
In the second embodiment, a plurality of scene types are set, and the video sections constituting a shot S are evaluated under each scene type to obtain evaluation values. A scene type specifies which person feature amounts are weighted, and by how much, when video is evaluated; the person feature amounts used and their weights differ from one scene type to another.
The summary playback section selection unit 62 selects the video sections to extract based on the evaluation values of the video sections under each scene type. For example, as shown in FIGS. 9(e) to 9(g), video sections with high evaluation values under each scene type are selected until their number equals the previously determined number of cuts to extract. The summary playback section selection unit 62 extracts a cut of a predetermined duration (for example, 5 to 10 seconds) from each selected video section.
The summary playback section selection unit 62 may, of course, generate summary playback sections or a summary video for each shooting scene without using scene types.

The method of selecting summary playback sections in the second embodiment is described with reference to FIG. 9.
FIGS. 9(e) to 9(g) show the evaluation values obtained when the summary playback section selection unit 62 evaluates shots S1 and S2 under scene types 1 to 3, respectively. Since the number of cuts to extract for shooting scene 1 was set to 4, the summary playback section selection unit 62 extracts a total of four cuts from among shots S1 and S2 as evaluated under scene type 1 (ST1) shown in FIG. 9(e), under scene type 2 (ST2) shown in FIG. 9(f), and under scene type 3 (ST3) shown in FIG. 9(g).
The summary playback section selection unit 62 first selects the highest-scoring video section under each of ST1 to ST3, and then selects the second-highest-scoring video section under ST1, thereby selecting the four video sections from which cuts are extracted. That is, under ST1 shown in FIG. 9(e), the video section with the highest score of 70 and the video section with the second-highest score of 50 are selected; under ST2 shown in FIG. 9(f), the video section with the highest score of 60 is selected; and under ST3 shown in FIG. 9(g), the video section with the highest score of 0 is selected. The selected video sections are marked with numbers and arrows in FIGS. 9(e) to 9(g). The numbers indicate the order in which the summary playback section selection unit 62 selected the video sections, but the order is not limited to this. In the second embodiment, video sections are selected starting from ST1, but this is not a limitation either.

Finally, the summary playback section selection unit 62 extracts, from each of the four selected video sections, a cut of a predetermined duration, shown hatched in FIG. 9(h), and designates these cuts as summary playback sections 1 to 4. The method of extracting the cut from each video section is not particularly limited; the cut may span a predetermined duration from the start time of the video section, or a predetermined duration around its center. In the second embodiment, video of a predetermined duration that includes the temporal center of each video section is extracted as the selected cut.
The extracted summary playback sections 1 to 4 are output to the summary video generation unit 63. The summary playback section selection unit 62 may also extract connecting sections based on the extracted summary playback sections, using any of the extraction examples described in the first embodiment. When connecting sections are extracted, both the summary playback sections and the connecting sections are output to the summary video generation unit 63.
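Extracting a cut of a predetermined duration around the temporal center of a selected video section can be sketched as below. The function name `center_cut`, the use of seconds as the unit, and the rule of returning the whole section when it is shorter than the cut are assumptions made for illustration.

```python
def center_cut(start, end, cut_len):
    """Return (cut_start, cut_end) of a cut of length cut_len centered on the
    midpoint of the section [start, end]. Times are in seconds.

    If the section is not longer than cut_len, the whole section is returned
    (an assumed fallback; the text does not cover this case).
    """
    length = end - start
    if length <= cut_len:
        return (start, end)
    mid = start + length / 2
    return (mid - cut_len / 2, mid + cut_len / 2)


# A hypothetical 30-second section starting at t = 10 s, with an 8-second cut.
cut = center_cut(10.0, 40.0, 8.0)
```

Taking the cut from the start of the section instead, which the text also permits, would return `(start, start + cut_len)`.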

For shooting scene 2, the number of cuts to extract was set to 3, so the highest-scoring video section under each of ST1 to ST3 was selected, as shown in FIGS. 9(e) to 9(g): the section with a score of 50 under ST1, the section with a score of 50 under ST2, and the section with a score of 0 under ST3. The cuts of a predetermined duration extracted from the three selected video sections, shown hatched in FIG. 9(h), are designated summary playback sections 1 to 3.

For shooting scene 3, the number of cuts to extract is 2, so the highest-scoring video sections under ST1 and ST2 were selected: the section with a score of 30 under ST1 and the section with a score of 30 under ST2. Selecting one video section each under ST1 and ST2 yields as many video sections as the number of cuts to extract (2), so no video section is selected from the ST3 evaluation. The cuts of a predetermined duration extracted from the two selected video sections, shown hatched in FIG. 9(h), are designated summary playback sections 1 and 2.
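The selection walked through for shooting scenes 1 to 3 amounts to cycling through the scene types in order and taking the best remaining section under each, until the required number of cuts is reached. A minimal sketch follows. The function name `select_sections` and the section identifiers are hypothetical, the scores are those quoted in the text, and skipping a section already chosen under another scene type is an added assumption.

```python
def select_sections(ranked_by_type, num_cuts):
    """Round-robin selection across scene types.

    ranked_by_type: one list per scene type (ST1 first), each holding
    (section_id, score) pairs sorted by descending score.
    Returns (scene_type_number, section_id, score) tuples in selection order.
    """
    selected, chosen_ids = [], set()
    cursors = [0] * len(ranked_by_type)
    progressed = True
    while len(selected) < num_cuts and progressed:
        progressed = False
        for t, ranked in enumerate(ranked_by_type):
            if len(selected) >= num_cuts:
                break
            # Skip sections already chosen under another scene type.
            while cursors[t] < len(ranked) and ranked[cursors[t]][0] in chosen_ids:
                cursors[t] += 1
            if cursors[t] < len(ranked):
                section_id, score = ranked[cursors[t]]
                selected.append((t + 1, section_id, score))
                chosen_ids.add(section_id)
                cursors[t] += 1
                progressed = True
    return selected


# Shooting scene 1: four cuts; scores 70 and 50 (ST1), 60 (ST2), 0 (ST3).
scene1 = select_sections([[("a", 70), ("b", 50)], [("c", 60)], [("d", 0)]], 4)
# Shooting scene 3: two cuts; the ST3 entry is hypothetical and never reached.
scene3 = select_sections([[("p", 30)], [("q", 30)], [("r", 0)]], 2)
```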

FIGS. 10(a) to 10(c) show examples of video that are rated highly under scene types 1 to 3, respectively.
In the video shown in FIG. 10(a), the person appears large in the center of the frame with the face turned toward the camera. Such video is common in scenes such as portrait shooting. In the video shown in FIG. 10(b), the person appears large in the frame, but is positioned away from the center and the face is tilted. Such video is common in scenes such as shooting children or sports. In the video shown in FIG. 10(c), the person appears small in the frame and is also away from the center. Such video is common in scenes where the emphasis is on scenery.

FIG. 11 shows example scores for each person feature amount Pt in video having the features of FIGS. 10(a) to 10(c). In the second embodiment, scores are calculated from three types of person feature amounts Pt based on three types of face information Lt: the size, position, and tilt of the person's face in the frame.
For video having the features of FIG. 10(a), the scores for size, position, and tilt are all 80. For video having the features of FIG. 10(b), the size score is 80, the position score is 20, and the tilt score is 50. For video having the features of FIG. 10(c), the size score is 10, the position score is 20, and the tilt score is 80.

FIG. 12 shows an example of the weighting, for each of scene types ST1 to ST3, of the parameters Csi and Asi (s = 1 to n) used in the evaluation value formula given as equation (6) below.

The evaluation value formula of equation (6) calculates an evaluation value Vs indicating the degree of conformity to a scene type ST.

In the evaluation value formula of equation (6), the parameters Csi and Asi are the weighting parameters for each person feature amount shown in FIG. 12, and Pi is the score of each person feature amount Pt shown in FIG. 11.
Of the n types of person feature amounts Pt, Pi denotes the score of the i-th person feature amount Pt. Likewise, Csi denotes the weighting coefficient and Asi the initial setting value for the i-th person feature amount Pt.

According to FIG. 12, the parameters (C1i, A1i) for scene type 1 are (1, 0) for face size, (1, 0) for face position, and (1, 0) for face tilt. The parameters (C2i, A2i) for scene type 2 are (1, 0) for size, (-1, 100) for position, and (-1, 100) for tilt. The parameters (C3i, A3i) for scene type 3 are (-1, 100) for size, (-1, 100) for position, and (0, 0) for tilt.

FIG. 13 shows an example of the evaluation values Vs of the video shown in FIGS. 10(a) to 10(c), obtained for each of scene types 1 to 3 using the parameters Csi and Asi shown in FIG. 12.
When the evaluation value Vs is calculated by equation (6) for video such as FIG. 10(a), in which the person's face is located at the center of the frame and appears large, the evaluation value Vs for scene type 1 is the highest. For video such as FIG. 10(b), in which the person's face appears at the edge of the frame and is tilted, for example video of a moving person (a child, an athlete, and so on), the evaluation value Vs for scene type 2 is the highest. For video such as FIG. 10(c), in which the person's face is small and the background occupies a large proportion of the frame, the evaluation value Vs for scene type 3 is the highest.
The summary playback section selection unit 62 can evaluate video under a plurality of scene types by changing the person feature amounts Pt and the parameters Csi and Asi that are used.
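The relationship between the FIG. 11 scores and the FIG. 12 parameters can be checked with a short sketch. Equation (6) itself is not reproduced in this text, so the code assumes it is the sum over the feature amounts of Csi × Pi + Asi; that form, the dictionary keys, and the function name `evaluate` are assumptions. Under this assumption the best scene type for each example matches the description above, although the absolute values may differ from those in FIG. 13.

```python
# Scores per person feature amount, as quoted from FIG. 11.
SCORES = {
    "fig10a": {"size": 80, "position": 80, "tilt": 80},
    "fig10b": {"size": 80, "position": 20, "tilt": 50},
    "fig10c": {"size": 10, "position": 20, "tilt": 80},
}

# (C, A) per scene type and feature amount, as quoted from FIG. 12.
PARAMS = {
    1: {"size": (1, 0), "position": (1, 0), "tilt": (1, 0)},
    2: {"size": (1, 0), "position": (-1, 100), "tilt": (-1, 100)},
    3: {"size": (-1, 100), "position": (-1, 100), "tilt": (0, 0)},
}


def evaluate(scores, params):
    """Assumed form of equation (6): sum of C * P + A over the feature amounts."""
    return sum(c * scores[name] + a for name, (c, a) in params.items())


# Evaluation value of each example video under each scene type.
values = {
    video: {st: evaluate(scores, params) for st, params in PARAMS.items()}
    for video, scores in SCORES.items()
}
# Scene type with the highest evaluation value for each example video.
best = {video: max(v, key=v.get) for video, v in values.items()}
```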

The summary video generation unit 63 creates a playlist using the summary playback sections shown in FIG. 9(h), which were selected and extracted by the summary playback section selection unit 62 as described above. The playlist indicates the playback order of the summary playback sections based on the shooting time information. When connecting sections are received from the summary playback section selection unit 62, the playlist also indicates the playback order of the connecting sections. The summary video generation unit 63 outputs the playlist to the playback processing unit 64.
When the summary video generation unit 63 generates a summary video, its operation is the same as that of the summary video generation unit 14 of the first embodiment.
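A playlist ordered by shooting time, optionally including connecting sections, can be sketched as follows. The `Section` record, its fields, and the function name `build_playlist` are hypothetical; the only behavior taken from the text is that playback order follows the shooting time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    label: str
    start: float           # shooting time of the first frame, in seconds
    duration: float        # length of the cut, in seconds
    kind: str = "summary"  # "summary" or "connecting"


def build_playlist(summary_sections, connecting_sections=()):
    """Return all sections ordered by shooting time."""
    return sorted([*summary_sections, *connecting_sections],
                  key=lambda s: s.start)


# Hypothetical sections, listed in extraction order rather than shooting order.
summary = [
    Section("cut1", start=120.0, duration=8.0),
    Section("cut2", start=15.0, duration=8.0),
    Section("cut3", start=60.0, duration=8.0),
]
connecting = [Section("join1", start=40.0, duration=3.0, kind="connecting")]
playlist = build_playlist(summary, connecting)
```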

Based on the playlist supplied from the summary video generation unit 63, the playback processing unit 64 extracts the summary playback sections from the captured data and outputs them to the decoding unit 624.
The decoding unit 624 decodes the captured data of the summary playback sections and outputs it to the data output unit 625, and the data output unit 625 outputs the captured data to the display device 604.

As with the data output unit 17 of the first embodiment, the data output unit 625 may apply dissolve or fade-in/fade-out effects at the joins between summary playback sections so that the sections play back smoothly. In addition, as described with reference to FIG. 5, the switching times may be adjusted so that the switch from connecting section 52B to representative section 52A in the audio data occurs earlier than the switch from connecting section 51B to representative section 51A in the video data. This lets the user recognize the transition from a connecting section to a representative section from the audio in advance, so the video switch can be viewed without a sense of incongruity and the content is easier to grasp in a short time.

As described above, according to the second embodiment, a plurality of shots are classified into shooting scenes based on shooting information such as the shooting intervals of the stream data, and summary playback sections are selected for each shooting scene using evaluation values calculated under a plurality of scene types to generate a playlist (or summary video). Summary playback sections of various types are therefore distributed in a balanced manner across the shooting scenes.

Accordingly, by viewing such a summary video, the user can grasp the overall video content and see the main subjects, and the summary video is also easy to watch and not tedious. That is, the summary video is generated by combining cuts extracted from video sections that are rated highly under each of a plurality of scene types, such as person-centered video sections (scene type 1), video sections in which the placement of the person varies (scene type 2), and video sections in which no person is present or the person is not centered (scene type 3). In addition to letting the user grasp the content and see the main subjects, this varies the scenes played back and provides a well-paced summary video that is easy to watch and not tedious.

In the second embodiment, summary playback sections are selected without distinguishing between representative sections and their connecting sections. However, representative sections may of course be selected from the person feature sections, together with connecting sections for them, as in the first embodiment.
Conversely, the first embodiment was described as selecting representative sections from the person feature sections and connecting sections from the non-person feature sections. However, summary playback sections may of course be selected from the person feature sections and non-person feature sections based on their evaluation values, without distinguishing between representative sections and connecting sections, as in the second embodiment.

In the second embodiment, the extracted summary playback sections are arranged in order of shooting time to form the playlist, but this is only an example; the summary playback sections may instead be played back in the order in which they were extracted for each scene type. Within a single shooting scene the content is consistent to some extent, and extracting summary playback sections by using the scene types in turn prevents similar cuts from appearing in succession, so the resulting summary video is both easy to follow and varied.

In the first and second embodiments, the summary video generation apparatus according to the present invention was described as a hardware configuration using block diagrams, as shown in FIGS. 2 and 6, but the present invention is not limited to this. The summary video generation apparatus may of course be implemented in software, by a CPU and a program that causes the CPU to execute the summary video generation method according to the present invention.
The summary video generation apparatus 600 of the second embodiment does not include the person feature amount extraction unit of the present invention. If the summary video generation apparatus 600 is to include a person feature amount extraction unit, it may be provided within the recording control unit 602, as with the recording control unit 4 of the first embodiment.

The summary video generation apparatus according to the present invention need not include means for displaying the summary video; it is sufficient that it includes the means up to and including generation of the summary video.

DESCRIPTION OF SYMBOLS
1, 600 Summary video generation apparatus
8, 611 Person feature amount extraction unit
9, 616 Person feature section extraction unit
13, 62 Summary playback section selection unit
14, 63 Summary video generation unit

Claims

A person feature section extraction unit that divides the video into a plurality of video sections based on the person feature information indicating the video features of the person area extracted from the video;
A summary playback section selection unit for selecting a desired video section from the plurality of video sections;
A summary video generation unit that generates a summary video using the video of the video section selected by the summary playback section selection unit,
The person feature section extraction unit
divides the video into a person feature section in which the person feature information is greater than or equal to a predetermined threshold and a non-person feature section in which the person feature information is smaller than the threshold, and
obtains, based on the person feature information, a feature value indicating a person feature of each of the person feature section and the non-person feature section,
The summary playback section selection unit
Obtaining an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value;
Based on the evaluation value, select a video section for extracting a summary video from the person feature section and the non-person feature section,
A summary video generation apparatus, wherein a summary playback section is extracted from the video section from which the summary video is extracted, and the summary video generation unit generates the summary video using the video of the summary playback section.

The summary playback section selection unit
Obtaining an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value;
A representative section candidate is selected from the person feature section based on the evaluation value,
Extract a representative section to be used for the summary video from the representative section candidates,
Based on the representative section, extract a connecting section to be used for the summary video from the person feature section and the non-person feature section;
The summary video generation apparatus according to claim 1, wherein the summary video generation unit generates the summary video using the video of the representative section and the connection section.

The summary playback section selection unit
A plurality of weights for the feature values are set in association with the scene types set in the summary generation mode,
Obtaining a plurality of evaluation values for evaluating the person feature section and the non-person feature section based on the summary generation mode indicating weighting for the feature value in association with the scene type;
Selecting a video section from which the summary video is extracted from the person feature section and the non-person feature section based on the plurality of evaluation values;
Extracting the summary playback section from the video section from which the summary video is extracted,
The summary video generation apparatus according to claim 1, wherein the summary video generation unit generates the summary video using a video of the summary playback section.

The summary video generation device according to any one of claims 1 to 3, wherein the summary video generation unit generates the summary video as a playback list using videos of the summary playback section.

The video is divided into a person feature section in which the person feature information indicating the video feature of the person region extracted from the video is equal to or greater than a predetermined threshold, and a non-person feature section in which the person feature information is smaller than the threshold,
Based on the person feature information, a feature value indicating a person feature of each of the person feature section and the non-person feature section is obtained,
Obtaining an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value;
Based on the evaluation value, select a video section for extracting a summary video from the person feature section and the non-person feature section,
Extracting the summary playback section from the video section from which the summary video is extracted,
A summary video generation method, wherein the summary video is generated using a video of the summary playback section.

Obtaining an evaluation value for evaluating the person feature section and the non-person feature section based on a summary generation mode indicating weighting for the feature value;
A representative section candidate is selected from the person feature section based on the evaluation value,
Extract a representative section to be used for the summary video from the representative section candidates,
Based on the representative section, extract a connecting section to be used for the summary video,
The summary video generation method according to claim 5, wherein the summary video is generated using videos of the representative section and the connection section.

A plurality of weights for the feature values are set in association with the scene types set in the summary generation mode.
A plurality of evaluation values for evaluating the person feature section and the non-person feature section are obtained in association with the scene type based on a summary generation mode indicating a weighting for the feature value;
Selecting a video section from which the summary video is extracted from the person feature section and the non-person feature section based on the plurality of evaluation values;
Extracting the summary playback section from the video section from which the summary video is extracted,
The summary video is generated using the video of the summary playback section.
The summary video generation method according to claim 5, wherein:

The summary video generation method according to claim 5, wherein the summary video is generated as a playback list using the video of the summary playback section.
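
As an informal illustration of the dividing step recited in the claims, the sketch below splits a sequence of person feature information values into person feature sections (values at or above the threshold) and non-person feature sections (values below it). The per-frame list `person_info`, the threshold of 0.5, and the tuple layout are assumptions made for the example and do not come from the specification.

```python
from itertools import groupby


def split_by_threshold(values, threshold):
    """Return runs as (is_person, start_index, end_index_exclusive)."""
    sections = []
    index = 0
    # Consecutive values on the same side of the threshold form one section.
    for is_person, run in groupby(values, key=lambda v: v >= threshold):
        length = len(list(run))
        sections.append((is_person, index, index + length))
        index += length
    return sections


# Hypothetical person feature information, one value per frame.
person_info = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.5]
result = split_by_threshold(person_info, 0.5)
print(result)
```

A value exactly equal to the threshold is treated as belonging to a person feature section, matching the "equal to or greater than" wording of the claims.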