JP5600040B2

JP5600040B2 - Video summarization apparatus, video summarization method, and video summarization program

Info

Publication number: JP5600040B2
Application number: JP2010154406A
Authority: JP
Inventors: 豪入江; 隆佐藤; 明小島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-07-07
Filing date: 2010-07-07
Publication date: 2014-10-01
Anticipated expiration: 2030-07-07
Also published as: JP2012019305A

Description

The present invention relates to a video summarization apparatus, a video summarization method, and a video summarization program that generate one short summary video from a video group containing one or more videos by evaluating the semantic coverage and visual diversity of the videos.

The distribution and sharing of video over communication networks has become widespread, and the number of videos circulating online is enormous. One video sharing site reports that 24 hours of new video are currently published every minute. The average video on video sharing sites is said to be about 2 minutes 30 seconds long. If 24 hours of video are published every minute, this means that more than 500 videos are being published per minute.

Viewers need to find videos that interest them among a very large number of videos. The process of finding a video is roughly as follows.
(1) Search for videos by some method. Usually a keyword search is performed.
(2) Play and preview the items in the search results that seem likely to match the viewer's interests.
(3) If a video matches the viewer's interests, continue watching; otherwise return to (1) or (2).

This process is very time consuming, and steps (2) and (3) in particular take a long time. Because video is a medium with a time axis, its content cannot be understood without actually playing it and spending time watching it. As a result, even judging whether a video is really of interest and worth watching takes considerable time.

To reduce the time needed to check and preview video content, video summarization techniques have been proposed. For example, Patent Document 1 discloses a technique that determines the emphasis state of speech in a video and includes video segments containing more strongly emphasized utterances in the summary. Because this lets the viewer preview the liveliest segments of the video, previewing becomes more efficient.

Patent Document 2 discloses a video summarization technique that detects characteristic changes in the images and sound of a video on the assumption that they are important events, and plays back (skips through) sections in which only a few events were detected at higher speed, thereby skipping unimportant sections.

Patent Document 1: JP 2003-259311 A
Patent Document 2: JP 2010-39877 A

The conventional video summarization techniques described above aim to generate one summary video from one video. That is, summarizing 10 videos produces 10 summary videos. When only a few videos need to be summarized, these techniques can be an efficient and effective solution, because the content can be checked in a short time. However, when there are a very large number of target videos, such techniques alone cannot provide a summary that is useful in practice.

As described above, as many as 500 videos per minute are newly published on current video sharing sites. Suppose we want to build a system that summarizes all of these new videos so that users can check newly arrived videos through summaries. Because the techniques described above generate one summary video per video, the number of summary videos generated is 500 per minute.

Moreover, if a 10-second summary is generated for each video, the total length of the summaries generated is 500 videos/minute × 10 seconds = 5,000 seconds per minute; that is, about 1 hour 20 minutes of summary video is generated every minute. Far more than one minute of summary video is produced per minute. Clearly, techniques that generate summaries one video at a time cannot, in principle, provide a fundamental solution.
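The arithmetic behind these figures can be checked with a short sketch. The constants are the ones quoted in the text; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the text.
UPLOAD_HOURS_PER_MINUTE = 24     # hours of new video published per minute
AVERAGE_VIDEO_MINUTES = 2.5      # average length of one video
SUMMARY_SECONDS_PER_VIDEO = 10   # length of one per-video summary

# Number of new videos per minute implied by the upload rate.
videos_per_minute = UPLOAD_HOURS_PER_MINUTE * 60 / AVERAGE_VIDEO_MINUTES

# Using the rounded figure of 500 videos per minute from the text.
summary_seconds_per_minute = 500 * SUMMARY_SECONDS_PER_VIDEO
summary_minutes_per_minute = summary_seconds_per_minute / 60

print(videos_per_minute)                     # 576.0
print(summary_seconds_per_minute)            # 5000
print(round(summary_minutes_per_minute, 1))  # 83.3
```

The result of about 83 minutes of summary per minute of wall-clock time corresponds to the roughly 1 hour 20 minutes stated in the text.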

To solve this problem, an object of the present invention is to provide a video summarization technique that generates one summary video from a plurality of videos, rather than one summary video from one video.

The key observation behind the present invention is as follows. It is said that a large collection of videos contains many videos that, although separate, are identical or very similar in semantic content and appearance. For example, on one video sharing site, when a search is performed with a specific keyword, about 14% of the videos listed in the search results on average, and more than 70% in some cases, are said to have sections that overlap completely or partially.

One of the basic ideas of the present invention is to exploit the fact that multiple videos contain video segments with similar semantic content and appearance, and to summarize those videos together. That is, the summary video is generated so that similar segments from different video files are duplicated as little as possible.

Consider an example with two videos, each 60 seconds long for simplicity. Suppose that the two video files each contain a similar 10-second video segment, and that this segment is important enough to be included in a summary. With the conventional method of summarizing each video individually, generating a 10-second summary per video yields 2 × 10 seconds, or 20 seconds of summary video in total. This 20-second summary consists of 10 seconds from each of the two videos, that is, 20 seconds of mutually similar video segments.

In contrast, the method of the present invention, which summarizes multiple videos together, generates the summary so that similar segments shared by the two videos are included as little as possible. It therefore outputs a summary totaling 10 seconds that contains only one of the two similar segments, as a single summary of both videos.

Based on the above, to solve the problem, a first invention is a video summarization apparatus that, given a video group consisting of a plurality of videos and metadata attached to each video, generates a summary video for the video group, the apparatus comprising: video segment dividing means for dividing each video in the video group into video segments; video segment feature extraction means for extracting a feature value for each video segment; storage means for storing the metadata attached to the videos, the division information of the video segments, and the feature value of each video segment; video segment group selection means for accepting a summary generation request and, by referring to the storage means, selecting one or more video segments corresponding to the accepted request; evaluation function optimization means for using the metadata of the videos to which the video segments in the selected video segment group belong, the extracted feature values, or both, to find the ordered video segment set, that is, a selection of one or more video segments together with their temporal order, that maximizes an evaluation value computed by a predetermined evaluation function over ordered video segment sets; and summary video generation means for concatenating the ordered video segment set thus found to generate a summary video, wherein the evaluation function includes a term that measures how easy the ordered video segments are to watch, the term evaluating, for the ordered set of video segments, the similarity of the metadata attached to adjacent video segments, the similarity of the feature values of adjacent video segments, or both.
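As an illustration of the adjacency term named in the first invention, the sketch below scores an ordered list of segments by summing the similarity of the feature vectors of neighboring segments. The function names and the use of cosine similarity are assumptions made for illustration; this passage does not specify the similarity measure or the exact form of the term.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def adjacency_score(ordered_features):
    """Sum of similarities between each pair of neighboring segments."""
    return sum(
        cosine_similarity(prev, nxt)
        for prev, nxt in zip(ordered_features, ordered_features[1:])
    )

# Hypothetical two-dimensional feature vectors for three segments.
smooth = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # similar segments are adjacent
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # similar segments are separated

print(adjacency_score(smooth))  # 1.0
print(adjacency_score(jumpy))   # 0.0
```

Under this assumed measure, an ordering that places similar segments next to each other receives a higher score than one that alternates between dissimilar segments.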

According to the first invention, one summary video covering a video group can be generated automatically from the video segments contained in a plurality of videos.

A second invention is the first invention, wherein the feature value is a vector formed by extracting, from image frames at predetermined positions in the video segment, at least one of a brightness feature, a color feature, a motion feature, a texture feature, a concept feature, and a landscape feature.

A third invention is the first or second invention, wherein the feature value is a vector formed by extracting, from the sound signal at predetermined positions in the video segment, at least one of a pitch feature, a sound pressure feature, a spectral feature, a rhythm feature, a speech feature, a music feature, and a sound event feature.

According to the second and third inventions, similarity in appearance can be measured accurately on the basis of the image and sound features of the video segments, so a more effective summary video can be generated.

A fourth invention is the first, second, or third invention, wherein the video segment group selection means accepts a summary generation request in the form of a keyword from a user, determines the relevance between the keyword and each video by judging whether the keyword matches the metadata of the videos stored in the storage means, and selects the video segments obtained by dividing the relevant videos.

According to the fourth invention, a user who wants a summary video can obtain one that matches the request simply by supplying a keyword.

A fifth invention is any one of the first to fourth inventions, wherein the evaluation function further includes at least one of: a term that measures how well the ordered video segment set fits the request; a term that measures the coverage of the semantic content of the video segments in the ordered video segment set; a term that measures the coverage of the features of the video segments in the ordered video segment set; and a term that measures the time length of the ordered video segments.

According to the fifth invention, the evaluation function appropriately evaluates the properties that a summary video should satisfy, such as semantic coverage and visual diversity, so a summary video that is easy to use and highly effective can be generated.

The video summarization method executed by the video summarization apparatus described above is also a feature of the present invention.

The video summarization method according to the present invention can also be realized as a computer program. The computer program may be provided recorded on a suitable computer-readable recording medium or provided via a network, and realizes the present invention by being installed and run on control means such as a CPU when the invention is carried out.

As described above, the present invention can provide a video summarization apparatus, a video summarization method, and a video summarization program that combine a plurality of videos into one summary video that preserves semantic coverage and visual diversity.

FIG. 1 is a diagram showing a configuration example of the video summarization apparatus according to the present invention.
FIG. 2 is a flowchart of the video summarization processing.
FIG. 3 is a diagram showing an example of the video segment information stored in the storage unit.
FIG. 4 is a diagram showing an example of the evaluation function optimization processing.

The present invention is described in detail below by way of embodiments.

FIG. 1 shows an example of the configuration of the video summarization apparatus according to the present invention. As shown in the figure, the video summarization apparatus 1 includes a video input unit 100, a storage unit 101, a video segment dividing unit 102, a video segment feature extraction unit 103, a video segment group selection unit 104, an evaluation function optimization unit 105, and a summary video generation unit 106.

As one example of this embodiment, the following describes a case in which everything from video input through video segment division and video segment feature extraction is performed in advance as preprocessing, and only the summarization processing that generates a summary video is performed later in response to a user request.

The video input unit 100 receives a video group 2 consisting of a plurality of videos and stores it in the storage unit 101. Metadata such as the video titles is stored along with the videos.

The video segment dividing unit 102 divides each video stored in the storage unit 101 into a plurality of video segments. In this example, the time information (start point and end point) of each video segment is stored in the storage unit 101.

The video segment feature extraction unit 103 analyzes the image and sound information of the video segments, and extracts and outputs the feature values of each video segment. In this example, the feature values extracted for each video segment are output to the storage unit 101 and stored there.

As a result of the processing so far, the storage unit 101 holds, for each video input in advance, the video's metadata, the time information (start point and end point) of its video segments, and the feature values of each video segment, all associated with one another. Where no confusion arises, this stored information is referred to as video segment information.

The video segment group selection unit 104 selects the video segments of the videos to be summarized and reads the associated video segment information from the storage unit 101. The evaluation function optimization unit 105 selects and outputs, from the video segment information that was read, the group of video segments that maximizes a predesigned evaluation function. Finally, the summary video generation unit 106 concatenates the video segments output by the evaluation function optimization unit 105 to generate and output the summary video 3.

In this way, the video summarization apparatus of the present invention summarizes a plurality of videos into one summary video 3 and outputs it without human intervention.

FIG. 2 shows a flowchart of the video summarization processing executed by the video summarization apparatus 1 configured as described above. The video summarization processing executed in this example of the embodiment is described in detail below with reference to the flowchart.

The video summarization processing is broadly divided into the preprocessing shown in FIG. 2(A) and the summarization processing shown in FIG. 2(B). In the preprocessing, video segment division is applied to the input video group 2, and the necessary video segment information is extracted and registered in the storage unit 101. In the summarization processing, in response to a user request, the segment information to be summarized is read from the storage unit 101 and a summary is generated. The preprocessing and the summarization processing are described in detail below.

[Preprocessing]
As shown in the flowchart of FIG. 2(A), the preprocessing consists of three steps, S201 to S203.

First, in step S201, the video input unit 100 receives the video group 2 to be processed. The videos in the video group 2 may all be input at the same time, or may be input one by one at different times. Each input video is stored in the storage unit 101 together with its metadata. Metadata is data describing information related to a video and includes, for example, the video's title, synopsis, duration, format, and keywords describing its content. This example describes the case in which the metadata includes at least one of keywords describing the video content, a title, and a synopsis. A relational database system is a suitable choice for the storage unit 101, because it can store such metadata in association with the videos.

Next, in step S202, the video segment dividing unit 102 applies video segment division to each video stored in the storage unit 101. The video may be divided at a fixed, predetermined interval, or at cut points, that is, points where the video changes discontinuously, using for example the method described in Reference 1 below.
[Reference 1]: Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, "Structured Video Computing", IEEE Multimedia, pp. 34-43, 1994.

The latter method is preferable. Video segment division yields the start point (start time) and end point (end time) of each segment, and this time information is also stored in the storage unit 101 as video segment information.
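The two division strategies can be sketched as follows. This is a minimal illustration, not the method of Reference 1: the cut detector simply thresholds the absolute difference between consecutive per-frame histograms, and the function names, the `threshold` parameter, and the sample data are all assumptions.

```python
def split_fixed(duration, interval):
    """Split [0, duration) into segments of at most `interval` seconds."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + interval, duration)
        segments.append((start, end))
        start = end
    return segments

def split_at_cuts(frame_histograms, frame_period, threshold):
    """Split where consecutive frame histograms differ by more than `threshold`."""
    boundaries = [0]
    for i in range(1, len(frame_histograms)):
        prev, cur = frame_histograms[i - 1], frame_histograms[i]
        diff = sum(abs(p - c) for p, c in zip(prev, cur))
        if diff > threshold:
            boundaries.append(i)  # frame i starts a new segment
    boundaries.append(len(frame_histograms))
    # Convert frame indices to (start time, end time) pairs in seconds.
    return [(a * frame_period, b * frame_period)
            for a, b in zip(boundaries, boundaries[1:])]

print(split_fixed(25.0, 10.0))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]

toy_histograms = [[1, 0], [1, 0], [0, 1], [0, 1]]  # abrupt change at frame 2
print(split_at_cuts(toy_histograms, frame_period=0.5, threshold=1.0))
# [(0.0, 1.0), (1.0, 2.0)]
```

Either function returns the start and end times that the text describes storing as video segment information.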

Next, in step S203, the video segment feature extraction unit 103 extracts video segment feature values from the image and sound information in each video segment. Some video segment feature values are extracted from images and others from sound. In both cases they are extracted from very short intervals (frames) of, for example, 50 ms.

Features extracted from images include, for example, brightness features, color features, motion features, texture features, concept features, and landscape features.

The brightness feature can be extracted as a histogram by counting the V values in the HSV color space.
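A brightness histogram of this kind can be sketched with the standard-library `colorsys` module. The input format (8-bit RGB tuples) and the choice of 16 bins are assumptions; the text does not specify a bin count for brightness.

```python
import colorsys

def brightness_histogram(rgb_pixels, bins=16):
    """Histogram of HSV V values for an iterable of 8-bit (r, g, b) pixels."""
    hist = [0] * bins
    for r, g, b in rgb_pixels:
        _, _, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # v lies in [0, 1]; clamp so that v == 1.0 falls in the last bin.
        hist[min(int(v * bins), bins - 1)] += 1
    return hist

pixels = [(0, 0, 0), (255, 255, 255), (128, 0, 0)]  # hypothetical pixels
hist = brightness_histogram(pixels)
print(hist[0], hist[8], hist[15])  # 1 1 1
```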

The color feature can be extracted as a histogram by counting the values along each axis (L*, a*, b*) of the L*a*b* color space. The number of histogram bins per axis may be, for example, 4 for L*, 14 for a*, and 14 for b*, in which case the total number of bins over the three axes is 4 × 14 × 14 = 784.
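The 784-bin joint histogram can be sketched as below. The sketch assumes the pixels have already been converted to L*a*b* and that L* lies in [0, 100] while a* and b* lie in [-128, 128]; these ranges and the helper names are assumptions for illustration.

```python
L_BINS, A_BINS, B_BINS = 4, 14, 14  # bin counts given in the text

def bin_index(value, low, high, bins):
    """Map a value in [low, high] to a bin index in [0, bins - 1]."""
    idx = int((value - low) / (high - low) * bins)
    return min(max(idx, 0), bins - 1)

def lab_histogram(lab_pixels):
    """Joint L*a*b* histogram flattened to L_BINS * A_BINS * B_BINS counts."""
    hist = [0] * (L_BINS * A_BINS * B_BINS)
    for l_val, a_val, b_val in lab_pixels:
        li = bin_index(l_val, 0.0, 100.0, L_BINS)
        ai = bin_index(a_val, -128.0, 128.0, A_BINS)
        bi = bin_index(b_val, -128.0, 128.0, B_BINS)
        hist[(li * A_BINS + ai) * B_BINS + bi] += 1
    return hist

lab_pixels = [(0.0, -128.0, -128.0), (100.0, 127.0, 127.0)]  # hypothetical
color_hist = lab_histogram(lab_pixels)
print(len(color_hist))                # 784
print(color_hist[0], color_hist[783])  # 1 1
```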

The motion feature can be extracted as a histogram by, for example, computing motion vectors between two image frames a fixed interval apart and counting the magnitudes and angles of those vectors. The number of histogram bins may be, for example, 25 for magnitude and 8 for angle, in which case the total number of bins over the two axes is 25 × 8 = 200.
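A joint magnitude and angle histogram can be sketched as follows, using 25 magnitude bins and 8 angle bins, which gives 25 × 8 = 200 joint bins. The motion vectors are assumed to be given as (dx, dy) pairs, and the `max_magnitude` cap is a hypothetical parameter; how the motion vectors themselves are estimated is outside the scope of this sketch.

```python
import math

MAG_BINS, ANGLE_BINS = 25, 8  # bin counts given in the text

def motion_histogram(motion_vectors, max_magnitude=50.0):
    """Joint histogram over motion-vector magnitude and direction."""
    hist = [0] * (MAG_BINS * ANGLE_BINS)
    for dx, dy in motion_vectors:
        magnitude = math.hypot(dx, dy)
        angle = math.atan2(dy, dx) % (2 * math.pi)  # direction in [0, 2*pi)
        mi = min(int(magnitude / max_magnitude * MAG_BINS), MAG_BINS - 1)
        ai = min(int(angle / (2 * math.pi) * ANGLE_BINS), ANGLE_BINS - 1)
        hist[mi * ANGLE_BINS + ai] += 1
    return hist

vectors = [(2.0, 1.0), (-1.0, -10.0)]  # hypothetical motion vectors
motion_hist = motion_histogram(vectors)
print(len(motion_hist))  # 200
```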

For the texture feature, statistics of the gray-level histogram (such as contrast), the power spectrum, or the like may be computed. Alternatively, local features are a good choice because, like color and motion, they can then be extracted in the form of a histogram. As local features, SIFT (Scale Invariant Feature Transform), described in Reference 2 below, or SURF (Speeded Up Robust Features), described in Reference 3 below, can be used, for example.
[Reference 2]: D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, pp. 91-110, 2004.
[Reference 3]: H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded Up Robust Features", Lecture Notes in Computer Science, vol. 3951, pp. 404-417, 2006.

The local features extracted by these methods are, for example, 128-dimensional real-valued vectors. Each vector is converted into a code by referring to a codebook learned in advance, and a histogram is generated by counting the occurrences of each code. In this case, the number of histogram bins equals the number of codes in the codebook. Any number of codes may be used, for example 512 or 1024.
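The quantize-and-count step can be sketched as below, where the learned set of codes is represented as a list of code vectors. Nearest-neighbor assignment by squared Euclidean distance is an assumption, as are the function names; the toy example uses 2-dimensional descriptors and a two-entry codebook instead of 128 dimensions and 512 or 1024 codes.

```python
def nearest_code(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor."""
    def squared_distance(code):
        return sum((d - c) ** 2 for d, c in zip(descriptor, code))
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k]))

def code_histogram(descriptors, codebook):
    """Count how many descriptors are assigned to each code."""
    hist = [0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_code(descriptor, codebook)] += 1
    return hist

codebook = [[0.0, 0.0], [10.0, 10.0]]               # hypothetical learned codes
descriptors = [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]]  # hypothetical local features
print(code_histogram(descriptors, codebook))  # [2, 1]
```

The length of the returned histogram equals the number of codes, which matches the statement in the text about the number of bins.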

Next, concept features are described. A concept is an object contained in an image or an event captured by the image. Any concepts may be used; examples include "sea", "mountain", and "ball". If the sea appears in an image, the image is said to belong to the "sea" concept. Whether an image belongs to each concept can be determined using concept classifiers. Usually one concept classifier is prepared per concept; it takes the image's feature values as input and outputs, as a membership level, whether the image belongs to that concept. Concept classifiers are obtained in advance by learning the relationship between predetermined image features, such as the local features described above, and ground-truth labels assigned manually in advance that indicate which concepts each image belongs to. A support vector machine, for example, may be used as the learner. The concept feature is obtained by collecting the membership levels for all concepts into a vector.

The landscape feature is a feature value that represents the scenery or scene of an image. For example, the GIST descriptor described in Reference 4 can be used.
[Reference 4]: A. Oliva and A. Torralba, "Building the gist of a scene: the role of global image features in recognition", Progress in Brain Research, 155, pp. 23-36, 2006.

Features extracted from sound information include pitch features, sound pressure features, spectral features, rhythm features, speech features, music features, and sound event features.

For the pitch feature, the pitch may be used, for example; it can be extracted using a method such as the one described in Reference 5 below.
[Reference 5]: Sadaoki Furui, "Digital Speech Processing, 4.9 Pitch Extraction", pp. 57-59, 1985.

For the sound pressure feature, the amplitude values of the audio waveform data may be used, or the short-time power spectrum may be computed and the average power in an arbitrary band used.
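The amplitude-based variant can be sketched as a per-frame root-mean-square (RMS) value. Summarizing amplitude by RMS is an assumption, as are the function name and the sample data; the default frame length of 50 ms follows the frame size given as an example in the description of step S203.

```python
import math

def frame_rms(samples, sample_rate, frame_seconds=0.05):
    """RMS amplitude of each consecutive frame of the waveform."""
    frame_len = max(1, round(sample_rate * frame_seconds))
    values = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        values.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return values

# Hypothetical waveform: one loud frame followed by one silent frame.
loud_then_quiet = [1.0, -1.0] * 4 + [0.0] * 8
print(frame_rms(loud_then_quiet, sample_rate=160))  # [1.0, 0.0]
```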

For the spectral feature, mel-frequency cepstral coefficients (MFCC) can be used, for example.

For the rhythm feature, the tempo may be extracted, for example, using a method such as the one described in Reference 6 below.
[Reference 6]: E. D. Scheirer, "Tempo and Beat Analysis of Acoustic Musical Signals", Journal of the Acoustical Society of America, Vol. 103, Issue 1, pp. 588-601, 1998.

The speech feature and the music feature represent the presence or absence of speech and of music, respectively. Sections containing speech or music can be found using, for example, the method described in Reference 7 below.
[Reference 7]: K. Minami, A. Akutsu, H. Hamada, and Y. Tonomura, "Video Handling with Music and Speech Detection", IEEE Multimedia, vol. 5, no. 3, pp. 17-25, 1998.

For the sound event feature, the occurrence of emotional vocalizations such as laughter or shouting, or of environmental sounds such as gunshots or explosions, may be used, for example. Such sound events can be detected using, for example, the method described in Reference 8 (a patent document) below.
[Reference 8]: International Publication WO/2008/032787
Features other than those listed here may also be used. The feature values extracted as described above are collected and stored in the storage unit 101.

FIG. 3 shows an example of the storage format of the video section information stored in the storage unit 101 by the preprocessing described above. In this example, each video is assigned a uniquely identifying video id, and a title and keywords are stored as its metadata. Each video section obtained by the video section division processing in step S202 is assigned a uniquely identifying video section id, together with its start time and end time. In addition, for each video section, the feature amount file column holds the path to the file storing the feature amounts extracted by the video section feature amount extraction processing in step S203.
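As a rough illustration of the storage format just described, one row of the video section information could be modeled as below. This is only a sketch: the class name, field names, ids, and file paths are hypothetical and are not taken from FIG. 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSection:
    """One row of the video section information (hypothetical field names)."""
    video_id: str       # unique id of the source video
    title: str          # metadata: title of the source video
    keywords: tuple     # metadata: keywords of the source video
    section_id: str     # unique id of the video section
    start: float        # start time in seconds
    end: float          # end time in seconds
    feature_file: str   # path to the file holding the extracted feature amounts

    @property
    def duration(self) -> float:
        return self.end - self.start


# Hypothetical example rows: two sections cut from the same video.
rows = [
    VideoSection("v001", "Kyoto trip", ("kyoto", "temple"),
                 "v001-s01", 0.0, 4.5, "feat/v001-s01.bin"),
    VideoSection("v001", "Kyoto trip", ("kyoto", "temple"),
                 "v001-s02", 4.5, 12.0, "feat/v001-s02.bin"),
]
```

Keeping the metadata on every row mirrors the description, in which each video section is evaluated through the metadata of the video it belongs to.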

[Summary processing]
Next, the summary processing is described. When a user requests a summary, the summary processing generates the summary video 3 based on the video metadata and feature amounts obtained by the preprocessing.

The summary processing consists of steps S204 to S206 shown in FIG. 2(B). Each step is described below.

First, in step S204, the video section group selection unit 104 selects, in accordance with the user's request, a video section group whose members are the candidate video sections for the summary. There are several ways to do this; this embodiment describes the case in which the user specifies a keyword when requesting a summary. Users often submit keyword queries to a search engine to find videos they want to watch. Likewise, this example applies when a keyword query is received and a summary of the related video group is generated and provided.

In this case, the video section group is selected by matching against the metadata included in the video section information. When the user specifies a keyword, the metadata of the video section information is consulted, and only the video sections of videos whose metadata matches the keyword are selected as the video section group. Any of the metadata stored in the storage unit 101 may be used; as in an ordinary keyword search, a title, a summary text, or keywords describing the video content may be used.
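The metadata matching described above can be sketched as follows. The matching rule (case-insensitive substring match against title and keywords), the dictionary layout, and the function name are assumptions made for illustration; the text does not prescribe a particular matching method.

```python
def select_sections(sections, query):
    """Return the sections whose video metadata matches the query keyword."""
    q = query.casefold()
    selected = []
    for sec in sections:
        # Match against the title and the keyword list (case-insensitive).
        haystack = [sec["title"]] + list(sec["keywords"])
        if any(q in text.casefold() for text in haystack):
            selected.append(sec)
    return selected


# Hypothetical video section information.
sections = [
    {"section_id": "v001-s01", "title": "Kyoto trip", "keywords": ["temple", "autumn"]},
    {"section_id": "v002-s01", "title": "Cooking ramen", "keywords": ["noodles"]},
    {"section_id": "v003-s01", "title": "Day out", "keywords": ["Kyoto", "garden"]},
]
candidates = select_sections(sections, "kyoto")
```

Here `candidates` plays the role of the video section group passed on to step S205.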

Next, in step S205, the evaluation function optimization unit 105 optimizes an evaluation function over the video section group selected in step S204 to determine which video sections are used in the summary video and the order in which they are joined into a single summary video.

Because a summary video is a sequence of video sections concatenated in time, it can be regarded as an ordered set of video sections. The evaluation function takes an ordered set of video sections as input and returns how good that ordered set is as a summary video.

The evaluation function is defined formally as follows. Let the set of video sections selected in step S204 be $S = \{s_1, s_2, \ldots, s_n\}$. An ordered set obtained by imposing an order on a subset of $S$ is written $S' = \{s_1 \to s_2 \to \cdots \to s_m\}$ $(m < n)$. The evaluation function $f$ is then a mapping $f: \Xi \to R$ from the space $\Xi$ of ordered sets $S'$ to the scalar space $R$.
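To make the space $\Xi$ concrete, the sketch below enumerates every ordered set that can be built from a small $S$ under the stated condition $m < n$. The function name and the `max_len` parameter are illustrative assumptions.

```python
from itertools import permutations


def ordered_subsets(S, max_len=None):
    """Yield every ordered set S' built from a non-empty subset of S."""
    if max_len is None:
        max_len = len(S) - 1  # the text requires m < n
    for m in range(1, max_len + 1):
        yield from permutations(S, m)


S = ["s1", "s2", "s3"]
space = list(ordered_subsets(S))  # 3 orderings of length 1 plus 6 of length 2
```

The number of ordered sets grows factorially with the number of video sections, which is why an exhaustive search over this space quickly becomes impractical.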

If this evaluation function f expresses the goodness of an ordered set S′, then choosing the ordered set S′ that maximizes f determines both the optimal video sections for the summary and their order. Hereinafter, for simplicity, the value of the evaluation function f is referred to as the goodness of the summary video.

What matters here is designing the evaluation function f so that it expresses the goodness of the ordered set S′. The design of f in this embodiment is described below. The evaluation function f evaluates the goodness of a summary video with respect to the following five criteria.

(1) Relevance to the user's request: In this embodiment, the user's request is given as a keyword query. A summary video containing more video sections from videos that match the query well is closer to what the user wants and can be considered a better summary video.

(2) Semantic comprehensiveness: A summary is expected to condense diverse content comprehensively into a short form. Criterion (1) evaluates relevance to the user's request, but a summary in which only sections from videos with similar content appear is not a good summary. The more diverse the semantic content of the videos appearing in the summary, the better the summary can be considered.

(3) Comprehensiveness of appearance: As with (2), a summary in which video sections with a similar "appearance" (image or sound), or duplicated video sections, appear is not a good summary in terms of comprehensiveness. A summary video can therefore be considered better the more it consists of diverse video sections whose "appearance" overlaps as little as possible.

(4) Ease of viewing: Because this technique generates one summary video from multiple videos, the generated summary contains videos with different content. Criterion (2) asks for semantic content that is as diverse as possible, but if videos with drastically different content appear one after another during playback, the result is not easy for the viewing user to understand. Similarly, when video sections whose "appearance" differs drastically are played in succession, for example when a dark section switches abruptly to a bright one, the strong visual stimulus makes the video uncomfortable to watch. A summary video can therefore be considered good if, for a user who watches the sections in sequence, sections with similar semantic content are placed as close together as possible and the "appearance" changes as smoothly as possible.

(5) Duration: As a basic property of a summary video, the shorter the better. The shorter the duration of the generated summary video, the better the summary can be considered.

The evaluation function is designed with the above criteria in mind. In this embodiment, the evaluation function f is designed as follows (referred to below as Equation [1]).

This evaluation function consists of five terms, which correspond to the five evaluation criteria (1) to (5) above. $\lambda_1$ to $\lambda_5$ are scaling parameters. Each term is described in detail below.
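The equation itself is not reproduced in this text. One form consistent with the five terms and scaling parameters just described is sketched below purely as an assumption: the signs, the summation ranges, and the placement of each $\lambda$ are not confirmed by the source. The second and third terms are subtracted here because they are described as similarities while diversity is what is rewarded, and the fifth term is added because $\mathrm{dur}(s_i)$ is described as a negative duration.

```latex
f(S') = \lambda_1 \sum_{s_i \in S'} \mathrm{rel}(s_i)
      - \lambda_2 \sum_{\substack{s_i, s_j \in S' \\ i < j}} \mathrm{sem}(s_i, s_j)
      - \lambda_3 \sum_{\substack{s_i, s_j \in S' \\ i < j}} \mathrm{app}(s_i, s_j)
      + \lambda_4 \, \mathrm{sm}(S')
      + \lambda_5 \sum_{s_i \in S'} \mathrm{dur}(s_i)
```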

First, consider the first term, $\mathrm{rel}(s_i)$, which corresponds to criterion (1). It evaluates how well the keyword query matches the video metadata. This requires measuring the semantic closeness between the query keyword and the metadata of each video section (that is, of the video it belongs to). For this, the Normalized Google Distance (NGD) described in Reference 9, for example, can be used.
[Reference 9]: R.L. Cilibrasi and P.M.B. Vitanyi, "The Google Similarity Distance", IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, pp. 370-383, 2007.

NGD computes the semantic closeness of two keywords $t_1$ and $t_2$ as a numerical value. Using it, $\mathrm{rel}(s_i)$ can be defined, for example, as follows.

Here, $T$ is the set of keywords appearing in the metadata of the video section (of its video), $t_k$ is an element of $T$, and $q$ is the keyword query given by the user. This defines the first term of Equation [1].
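The sketch below computes NGD from page hit counts using the standard formula from Reference 9 and then derives a relevance score from it. The relevance definition used here, the mean of $\exp(-\mathrm{NGD}(t_k, q))$ over $t_k \in T$, is an assumption chosen for illustration, since the defining equation is not reproduced in this text. The hit counts are made-up numbers.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    n: total number of indexed pages.
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def rel(keywords, query, counts, pair_counts, n):
    """Assumed relevance: mean of exp(-NGD(t_k, q)) over t_k in T."""
    total = 0.0
    for t in keywords:
        if t == query:
            total += 1.0  # identical terms: distance 0
            continue
        fxy = pair_counts.get(frozenset((t, query)), 0)
        total += math.exp(-ngd(counts[t], counts[query], fxy, n))
    return total / len(keywords)


# Made-up hit counts for illustration only.
counts = {"kyoto": 1000, "temple": 2000, "ramen": 1500}
pair_counts = {frozenset(("kyoto", "temple")): 500,
               frozenset(("kyoto", "ramen")): 10}
N = 1_000_000
score_temple = rel(["temple"], "kyoto", counts, pair_counts, N)
score_ramen = rel(["ramen"], "kyoto", counts, pair_counts, N)
```

With these counts, a section tagged "temple" scores higher against the query "kyoto" than one tagged "ramen", because the two terms co-occur far more often.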

Next, the second term of Equation [1], $\mathrm{sem}(s_i, s_j)$, is described. The second term evaluates the semantic diversity of the videos included in the summary, and may be designed to evaluate the semantic similarity between any two video sections in the summary (more precisely, between the source videos that contain them).

Following the same idea as for the first term, the semantic content of a video section is assumed to be readable from metadata such as the summary text. The similarity between the semantic content of two video sections can then be evaluated using NGD. Given two video sections $s_i$ and $s_j$, let $T_i$ and $T_j$ be the keyword sets in their respective metadata. The semantic similarity of $s_i$ and $s_j$ is then defined, for example, as follows.

Defined in this way, $\mathrm{sem}(s_i, s_j)$ takes a larger value the more similar the keyword sets of the two video sections $s_i$ and $s_j$ are, and a smaller value the less similar they are.
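The defining equation is not reproduced in this text. One form that has the stated behavior (larger for similar keyword sets, smaller for dissimilar ones) is given below as an assumption only; the averaging over keyword pairs and the use of $\exp(-\mathrm{NGD})$ to turn a distance into a similarity are illustrative choices, not the source's definition.

```latex
\mathrm{sem}(s_i, s_j) = \frac{1}{|T_i|\,|T_j|}
  \sum_{t_a \in T_i} \sum_{t_b \in T_j} \exp\bigl(-\mathrm{NGD}(t_a, t_b)\bigr)
```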

Next, the third term of Equation [1], $\mathrm{app}(s_i, s_j)$, is described. The third term evaluates whether the summary contains video sections that are diverse in "appearance", and may be designed to evaluate the similarity in "appearance" between any two video sections in the summary (or the source videos that contain them).

As described above, "appearance" covers not only visual (image) closeness but also auditory (sound) closeness. The "appearance" of a video section can be evaluated using its feature amounts, and the similarity in "appearance" between two video sections can be defined as the similarity of their feature amounts. Various kinds of feature amounts, for both image features and sound features, have been extracted for each video section in preprocessing step S203. The group of feature amounts of video section $s_i$ is written $V = \{v_1, v_2, \ldots, v_P\}$. Each $v_k$ is a feature amount (vector); for example, $v_1$ is color, $v_2$ is texture, $v_3$ is pitch, and so on.

A distance $d_{v_k}$ $(k = 1, 2, \ldots, P)$ is defined for each feature amount. For example, for color (an L*a*b* histogram), the chi-square distance between histograms or the Earth Mover's Distance (EMD) described in Reference 10 can be used.
[Reference 10]: S. Peleg, M. Werman, and H. Rom, "A unified approach to the change of resolution: Space and gray-level", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, pp. 739-742, 1989.

For vector-type feature amounts such as GIST, the Euclidean distance, for example, can be used.

Under the conditions above, consider two video sections $s_i$ and $s_j$ with feature amounts $V_i = \{v_1^i, v_2^i, \ldots, v_P^i\}$ and $V_j = \{v_1^j, v_2^j, \ldots, v_P^j\}$. The similarity in "appearance" between $s_i$ and $s_j$ is then defined, for example, as follows.

Here, $\theta_k$ is a scaling parameter.
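A minimal sketch of an "appearance" similarity built from per-feature distances follows. The chi-square and Euclidean distances are the ones named in the text; the way they are combined, $\exp(-\sum_k \theta_k d_k)$, is an assumption, since the defining equation is not reproduced here. The feature values are toy numbers.

```python
import math


def chi_square(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def app(V_i, V_j, distances, thetas):
    """Assumed similarity: exp(-sum_k theta_k * d_k(v_k^i, v_k^j))."""
    total = sum(th * d(a, b)
                for a, b, d, th in zip(V_i, V_j, distances, thetas))
    return math.exp(-total)


# Toy features: (color histogram, GIST-like vector) per video section.
distances = (chi_square, euclidean)
thetas = (1.0, 0.5)
V_a = ([0.5, 0.5], [1.0, 0.0])
V_b = ([0.5, 0.5], [1.0, 0.0])   # identical to V_a
V_c = ([0.9, 0.1], [0.0, 1.0])   # visually different
same = app(V_a, V_b, distances, thetas)
different = app(V_a, V_c, distances, thetas)
```

Each $\theta_k$ rescales one distance so that features with different numeric ranges contribute comparably.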

Next, the fourth term of Equation [1], $\mathrm{sm}(S')$, is described. The fourth term evaluates the ease of viewing of the summary as a whole. More specifically, for the ordered set of video sections, it evaluates the similarity of adjacent video sections in semantic content, in "appearance", or in both.

The semantic continuity can be computed in the same way as the second term. The semantic continuity between two adjacent video sections $s_i$ and $s_{i+1}$ is defined using, for example, Equation [3], as follows.

The continuity of "appearance" can likewise be computed in the same way as the third term. The continuity of "appearance" between two adjacent video sections $s_i$ and $s_{i+1}$ is defined using, for example, Equation [4], as follows.

Based on these definitions, the goodness of the semantic continuity and of the continuity of "appearance" over the entire summary video can be defined by the following equation.
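The idea of scoring only adjacent pairs can be sketched as below. The weighted sum, the weights `w_sem` and `w_app`, and the toy similarity functions are assumptions for illustration; the equation itself is not reproduced in this text.

```python
def sm(ordered, sem, app, w_sem=1.0, w_app=1.0):
    """Assumed smoothness: weighted similarity summed over adjacent sections."""
    return sum(
        w_sem * sem(a, b) + w_app * app(a, b)
        for a, b in zip(ordered, ordered[1:])
    )


# Toy sections and toy similarities (hypothetical stand-ins for sem and app).
def toy_sem(a, b):
    return 1.0 if a["topic"] == b["topic"] else 0.0


def toy_app(a, b):
    return 1.0 - abs(a["brightness"] - b["brightness"])


x = {"topic": "temple", "brightness": 0.2}
y = {"topic": "temple", "brightness": 0.3}
z = {"topic": "ramen", "brightness": 0.9}
smooth_order = sm([x, y, z], toy_sem, toy_app)
abrupt_order = sm([x, z, y], toy_sem, toy_app)
```

Unlike the second and third terms, this score depends on the order of the sections, so two summaries containing the same sections can receive different values.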

Finally, the fifth term of Equation [1], $\mathrm{dur}(s_i)$, is described. The fifth term evaluates the duration of the video included in the summary; $\mathrm{dur}(s_i)$ may be set to the negative of the duration of $s_i$.

Alternatively, another evaluation term may be introduced in place of $\mathrm{dur}(s_i)$. For example, when a desired summary duration is specified, the term may be designed so that a summary video whose duration is closer to the desired one receives a better evaluation value. Letting $\tau$ be the desired summary duration and $\mathrm{dur}(s_i)$ the summary duration of $s_i$, the term is defined, for example, by the following expression.

$-\lvert \tau - \mathrm{dur}(s_i) \rvert$
The above is one example of the evaluation function design. The definition of each term given here is only an example; any other form may be used as long as it satisfies the requirements described earlier.
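The two duration terms described above can be sketched as follows. The function names and the use of seconds as the unit are assumptions.

```python
def dur_negative(start, end):
    """Fifth term as described: the negative of the section's duration."""
    return -(end - start)


def dur_target(total_duration, tau):
    """Alternative term: -|tau - duration|, largest (zero) at the target."""
    return -abs(tau - total_duration)


short_section = dur_negative(4.5, 12.0)   # a 7.5-second section
on_target = dur_target(60, 60)            # summary exactly as long as desired
too_short = dur_target(45, 60)
too_long = dur_target(75, 60)
```

The first variant always favors shorter summaries, while the second penalizes deviation from the desired duration equally in both directions.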

Further, in step S205, the summary video that maximizes the designed evaluation function is searched for. This is a combinatorial optimization problem over ordered sets, for which no efficient exact solution method is known in general. In this embodiment, the solution is therefore obtained approximately. Approximate methods for combinatorial optimization include greedy search and genetic algorithms, and the optimal S′ may be sought by applying one of them.

As an example, FIG. 4 shows the flow of the optimization processing when greedy search is used.
(a) Enumerate all permutations consisting of two video sections.
(b) Compute the value of the evaluation function f for every permutation.
(c) Merge the permutation with the largest value of f into a single video section.
(d) Repeat from (a) until only one video section remains.
The above is one example of the detailed processing of step S205.
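Steps (a) to (d) above can be sketched as the following greedy merge. This follows the steps literally, so every candidate section ends up in the final order; the function names are hypothetical, and the evaluation function used in the example is a toy stand-in (negative total brightness jump between neighbors), not the five-term function of the embodiment.

```python
from itertools import permutations


def greedy_order(sections, f):
    """Greedily merge sections into one ordered sequence, following (a)-(d)."""
    if not sections:
        return ()
    groups = [(s,) for s in sections]  # each group is an ordered tuple
    while len(groups) > 1:
        # (a) enumerate all ordered pairs of groups and
        # (b) score each pair by f applied to the concatenated order
        i, j = max(permutations(range(len(groups)), 2),
                   key=lambda ij: f(groups[ij[0]] + groups[ij[1]]))
        merged = groups[i] + groups[j]  # (c) merge the best-scoring pair
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups[0]  # (d) stop once a single merged section remains


def toy_f(order):
    """Toy evaluation: penalize brightness jumps between adjacent sections."""
    return -sum(abs(a - b) for a, b in zip(order, order[1:]))


# Sections represented only by a brightness level (illustrative values).
result = greedy_order([9, 1, 5], toy_f)
```

Each round evaluates a number of pairs quadratic in the number of remaining groups, which is far cheaper than scoring every ordered subset, at the cost of finding only an approximate maximum.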

Finally, in step S206, the summary video (the ordered set of video sections) output in step S205 is joined together as a video and output as the summary video 3.

This concludes the description of the video summarization apparatus in this embodiment and of the video summarization method it carries out. Needless to say, the processing performed by this video summarization method can also be written as a computer-readable program.

The video summarization apparatus in this embodiment has been described in detail above. The present invention is not limited to the described embodiment, and various modifications can be made within the technical scope set forth in the claims.

DESCRIPTION OF SYMBOLS
1 Video summarization apparatus
2 Video group
3 Summary video
100 Video input unit
101 Storage unit
102 Video section division unit
103 Video section feature amount extraction unit
104 Video section group selection unit
105 Evaluation function optimization unit
106 Summary video generation unit

Claims

A video summarization apparatus that generates a summary video for a given video group composed of a plurality of videos and metadata attached to each video, comprising:
Video segment dividing means for dividing each video included in the video group into video segments;
Video segment feature quantity extraction means for extracting feature quantities for each video segment;
Storage means for storing metadata incidental to the video, division information of the video section, and a feature amount for each video section;
Video segment group selection means for receiving a summary generation request, referring to the storage means, and selecting one or more video segments corresponding to the received request;
Evaluation function optimizing means for selecting one or more video sections and determining their temporal order, using the metadata of the videos to which the video sections in the selected video section group belong, the extracted feature amounts, or both, so as to obtain the ordered video section set that maximizes an evaluation value calculated by a predetermined evaluation function for ordered video section sets;
A summary video generation means for generating a summary video by combining the obtained ordered video segment sets;
wherein the evaluation function includes, as a term for obtaining the ease of viewing of the ordered video sections, a term that evaluates, for the ordered set of video sections, the similarity of the metadata attached to adjacent video sections, the similarity of the feature amounts of adjacent video sections, or both.

The video summarization apparatus according to claim 1,
wherein the feature amount is a vector constructed by extracting at least one of a brightness feature, a color feature, a motion feature, a texture feature, a concept feature, and a landscape feature from an image frame at a predetermined position in the video section.

The video summarization apparatus according to claim 1 or 2,
wherein the feature amount is a vector constructed by extracting at least one of a pitch feature, a sound pressure feature, a spectrum feature, a rhythm feature, an utterance feature, a music feature, and a sound event feature from the sound signal at a predetermined position in the video section.

In the video summarization apparatus according to claim 1, claim 2 or claim 3,
wherein the video section group selecting means accepts a summary generation request given as a keyword from a user, determines the relevance between the keyword and each video by matching the keyword against the video metadata stored in the storage means, and selects the video sections divided from the relevant videos.

The video summarization device according to any one of claims 1 to 4,
The evaluation function further includes:
A term to determine suitability for the requirements of the ordered video segment set, a term to determine the completeness of the semantic content of the video segment included in the ordered video segment set, and the completeness of the features of the video segment included in the ordered video segment set A video summarization apparatus comprising at least one of a term to be obtained and a term to obtain a time length of an ordered video section.

A video summarization method executed by a video summarization apparatus that comprises storage means for storing a video group composed of a plurality of videos and metadata attached to each video and that generates a summary video for the video group,
A video segmentation process in which each video included in the video group is divided into video segments and stored in a storage means;
A video segment feature value extraction process for extracting a feature value for each video segment and storing it in the storage means;
A video section group selection process for receiving a summary generation request, referring to the storage means, and selecting one or more video sections corresponding to the received request;
An evaluation function optimization process for selecting one or more video sections and determining their temporal order, using the metadata of the videos to which the video sections in the selected video section group belong, the extracted feature amounts, or both, so as to obtain the ordered set of video sections that maximizes an evaluation value calculated by a predetermined evaluation function, the evaluation function including, as a term for obtaining the ease of viewing of the ordered video sections, a term that evaluates, for the ordered set of video sections, the similarity of the metadata attached to adjacent video sections, the similarity of the feature amounts of adjacent video sections, or both;
A video summarization method comprising: executing a summary video generation process for generating a summary video by combining the obtained ordered video segment sets.

The video summarization method according to claim 6,
wherein the feature amount is a vector constructed by extracting at least one of a brightness feature, a color feature, a motion feature, a texture feature, a concept feature, and a landscape feature from an image frame at a predetermined position in the video section, and by extracting at least one of a pitch feature, a sound pressure feature, a spectrum feature, a rhythm feature, a speech feature, a music feature, and a sound event feature from the sound signal at a predetermined position.

The video summarization method according to claim 6 or 7,
In the video section group selection process, a summary generation request based on a keyword from a user is received, and the relevance between the keyword and the video is determined by determining the matching between the keyword and video metadata stored in the storage means. The video summarization method is characterized in that the video section divided from the related video is selected.

The video summarization method according to claim 6, claim 7 or claim 8,
The evaluation function further includes:
A term to determine suitability for the requirements of the ordered video segment set, a term to determine the completeness of the semantic content of the video segment included in the ordered video segment set, and the completeness of the features of the video segment included in the ordered video segment set A video summarization method comprising at least one of a term to be obtained and a term to obtain a time length of an ordered video section.

A video summarization program for causing a computer to execute the video summarization method according to any one of claims 6 to 9.