JP2011124681A

JP2011124681A - Video editing device, video editing method, and video editing program

Info

Publication number: JP2011124681A
Application number: JP2009279253A
Authority: JP
Inventors: Takeshi Irie; 豪入江; Takashi Sato; 隆佐藤; Akira Kojima; 明小島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-12-09
Filing date: 2009-12-09
Publication date: 2011-06-23
Anticipated expiration: 2029-12-09
Also published as: JP5209593B2

Abstract

Problem: To automatically generate an edited video while taking the context of the generated video into account.
Solution: A video is divided into video segments (S202), and a likelihood for each semantic content category is assigned to each segment from its video segment features and the likelihood model of a semantic content category dictionary (S204). The video segments are clustered using these semantic likelihoods (S205). For the representative video segments selected from each cluster, the degree of semantic change is calculated from how the segments connect as video (S206); a group of representative video segments to use in the edited video is selected based on the degree of semantic change over the whole sequence of candidate representative video segments (S207); and the edited video is generated (S209).
Selected drawing: FIG. 2

Description

The present invention relates to an apparatus and a method that edit an input video in a short processing time while taking the context of its video segments into account, and to a video editing program used to implement the method.

In recent years, distributing and sharing video over communication networks has become increasingly active. Most of the video distributed and shared in this way is created by ordinary users, that is, by people who are not professionals, and is called consumer generated media (CGM) or user generated content (UGC). Hereinafter, this document uses the term CGM.

At present, much CGM is still of lower quality than professionally produced video. The main causes are the equipment that professional creators and ordinary CGM creators use, and the skill with which they use it. Taking imaging equipment (video cameras) as an example, professionals own high-performance professional cameras capable of capturing sharp images, whereas ordinary users use commercially available consumer models. Ordinary users are also not trained in shooting techniques such as camera work, so the quality of the captured video differs greatly.

Video editing causes a particularly large difference. Video is time-series data of images and sound. Captured video is rarely distributed as is; it is generally released after editing steps such as the following.

・Delete partial sections (select which scenes to keep)
・Rearrange the temporal order
・Insert another video (scene)

The purpose of video editing is to organize raw, unorganized footage into a form suitable for viewing. For example, scenes that feel redundant or wasteful may be deleted to make the main point clear, or the temporal order may be deliberately changed to help viewers understand the content.

The most important point in editing video is to produce something that viewers find easy to understand, well organized, or enjoyable. Such editing requires a degree of specialized knowledge and judgment, so at present it is not sufficiently effective unless done by trained professionals such as video creators and editors, and effective editing is difficult for ordinary users.

Given this situation, there is growing demand for technology that supports editing by ordinary users and can automatically perform high-level video editing.

As prior art related to the present invention, Patent Document 1 below discloses a video editing technique that analyzes video information, detects emotional sections, and presents them to the user in an easily understood form to support editing.

JP 2009-111938 A

The video editing technique disclosed in Patent Document 1 presents emotional sections, which are one kind of important video scene, in an easily understood form, and thereby provides an editing support tool that is easy for users to use.

However, this technique merely judges whether a scene is emotional and selects scenes on that basis; it ignores the context of the video produced by editing. Video is a medium with a time axis, and it carries meaning when played from the beginning and viewed in order along the timeline. In other words, video is a medium that has context along the time direction. Naturally, this should also be taken into account when editing video.

Consider a simple example. Suppose the original, unedited video has the following context: (1) man A is walking; (2) man B, walking from the opposite direction, notices a baseball flying toward man A; (3) man B shoves man A as they pass. Man B shoves man A urgently so that the ball does not hit him. If the edited video lacks scene (2) and runs straight from (1) man A is walking to (3) man B shoves man A as they pass, the content will seem to take on a completely different meaning.

Context matters even in cases that do not affect semantic content in this way. Suppose, for example, that there is a very enjoyable, funny scene. If the video is edited so that a scene which on its own is sad and moving comes immediately afterwards, viewers of that scene will rarely feel sad or moved. Likewise, if the video is edited so that enjoyable scenes continue throughout, viewers can be expected to grow bored gradually and to enjoy it less.

As described above, taking context into account is considered one of the most important problems in automatic video editing.

To solve this problem, an object of the present invention is to provide a video editing technique that automatically generates and outputs an edited video from the video to be processed while taking the context of the generated edited video into account.

To achieve this object, the present invention divides a video into short segments and assigns to each segment a likelihood for each semantic content. The segments are clustered based on these semantic likelihoods. For the representative segment of each cluster, the degree of semantic change in the video (the context similarity) is calculated according to a predetermined formula from how the segments connect as video. Averaging these values gives the degree of semantic change of the video as a whole. The cases in which one cluster (representative segment) is removed are compared (there are multiple candidates for removal), and the cluster to remove is decided according to preference, that is, whether semantic change is desired or not. The intended video editing is performed in this way.

The details are as follows. The video editing apparatus of the present invention comprises: a semantic content category dictionary that stores information on a likelihood model representing the probabilistic relationship between semantic content categories, each of which expresses the meaning of video content by a predetermined specific word, and video segment features consisting of image features, sound features, or both, of a video segment; a video segment dividing unit that divides an input video into video segments; a video segment feature extraction unit that extracts video segment features from each video segment; a semantic content likelihood calculation unit that, based on the extracted video segment features, refers to the semantic content category dictionary and outputs the likelihood of each semantic content category for the video segment; a clustering unit that clusters the video segments based on the likelihoods and selects one or more representative video segments from each generated video segment cluster; an editing target video segment selection unit that, for sequences of a plurality of candidate representative video segments obtained from combinations of the selected representative video segments, calculates, using at least the likelihoods, a context similarity indicating the degree of semantic change in the connection of video segments, and, based on the calculated context similarity, selects as the candidate representative video segment group to use in the edited video either a sequence whose overall degree of semantic change is large or a sequence whose overall degree of semantic change is small; and an edited video output unit that generates and outputs an edited video by joining the video segments of the selected candidate representative video segment group.

The video editing method of the present invention, realized by the operation of the processing means described above, can also be implemented as a computer program. This program may be provided on a suitable computer-readable recording medium or over a network; when the invention is carried out, it is installed and runs on a control means such as a CPU, thereby realizing the invention.

In a video editing apparatus configured in this way, when a video to be processed is input, video segment features are extracted for each video segment from the image information and/or sound information of that video, and the likelihoods of the semantic content categories are calculated. The features include, for example, brightness features, color features, motion features, texture features, cut features, object features, image event features, pitch features, volume features, spectral features, rhythm features, speech features, music features, and sound event features, and at least one of these is extracted.

In addition, for each semantic content category, the degree of relatedness to the other semantic content categories is calculated and stored for use in calculating the context similarity.

Next, the following processing stages [Process 1] to [Process 6] are repeated.
[Process 1]: Based on the likelihoods, cluster all video segments to generate video segment clusters, and select one or more representative video segments from each cluster.
[Process 2]: For each position in the sequence of representative video segments, calculate the predicted probability of belonging to each semantic content category, based on the video segment features at the positions strictly before it, the likelihoods, and the degrees of relatedness.
[Process 3]: For each position in the sequence of representative video segments, calculate the posterior probability of belonging to each semantic content category, based on the video segment features at the positions up to and including it, the likelihoods, and the degrees of relatedness.
[Process 4]: Calculate the similarity between the predicted probability and the posterior probability as the context similarity.
[Process 5]: Obtain the average of the context similarity over the whole sequence of representative video segments, and select the sequence of representative video segments for which this average is maximal or minimal.
[Process 6]: If the termination condition is not satisfied, treat the selected group of representative video segments as a new group of video segments and return to [Process 1]. If it is satisfied, join the resulting representative video segments and output them as the edited video.
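
As a rough illustration of [Process 2] to [Process 5], the sketch below computes a mean context similarity for one sequence of segments. This passage does not give the formulas, so several choices here are assumptions made only for illustration: the degrees of relatedness are treated as a row-stochastic matrix `RELATEDNESS`, the prediction is the previous posterior propagated through that matrix, the posterior reweights the prediction by the segment's own likelihoods, and the similarity is the Bhattacharyya coefficient. All function names are hypothetical.

```python
import math

def normalize(p):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(p)
    return [x / total for x in p]

def predict(prev_posterior, relatedness):
    """Predicted category probabilities from the previous posterior (assumed form)."""
    k = len(prev_posterior)
    return [sum(prev_posterior[i] * relatedness[i][j] for i in range(k))
            for j in range(k)]

def update(prediction, likelihood):
    """Posterior: the prediction reweighted by the segment's own likelihoods."""
    return normalize([p * l for p, l in zip(prediction, likelihood)])

def bhattacharyya(p, q):
    """Similarity of two distributions; equals 1 when they are identical."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def mean_context_similarity(likelihoods, relatedness):
    """Average prediction/posterior similarity over a sequence (needs >= 2 segments)."""
    k = len(relatedness)
    posterior = update([1.0 / k] * k, likelihoods[0])  # uniform prior at the start
    scores = []
    for likelihood in likelihoods[1:]:
        prediction = predict(posterior, relatedness)
        posterior = update(prediction, likelihood)
        scores.append(bhattacharyya(prediction, posterior))
    return sum(scores) / len(scores)

# Hypothetical two-category example in which categories tend to persist.
RELATEDNESS = [[0.9, 0.1], [0.1, 0.9]]
smooth = mean_context_similarity([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]], RELATEDNESS)
abrupt = mean_context_similarity([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]], RELATEDNESS)
```

Under these assumptions a sequence that stays in one category scores higher than one that switches abruptly, so maximizing the average favors continuity and minimizing it favors change.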

Through [Process 2] to [Process 4] above, the probabilities of belonging to each semantic content category that depend on the past context (the predicted probability and the posterior probability) and their similarity are calculated. This makes it possible to edit video so that it either matches the preceding context or departs from it, and thus enables automatic video editing that takes context into account.

According to the present invention, the user can have context-aware video editing performed automatically simply by inputting a video. High-quality edited video can thus be produced without the manual work of trained video creators or editors.

Furthermore, taking context into account in conventional video editing takes processing time, because segments must be selected by considering many combinations of video segments. More specifically, for N video segments the computational cost is expected to be of factorial order in N or higher, so the problem does not finish in polynomial time. The processing procedure of the present invention introduces two steps,
(1) narrowing down the video segments in stages, and
(2) clustering video segments with similar semantic content,
so that it finishes in polynomial time.
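
To give a feel for the difference in cost, the short sketch below compares the number of orderings of N segments with the number of candidates scored by a staged narrowing. It assumes, purely for illustration, that each round scores every single-segment removal and then drops one segment; the function names and the `target` parameter are hypothetical.

```python
import math

def exhaustive_orderings(n):
    """Number of possible orderings of n segments (n factorial)."""
    return math.factorial(n)

def staged_evaluations(n, target):
    """Candidates scored when one segment is removed per round down to target."""
    # A round that starts with m segments scores m removal candidates.
    return sum(range(target + 1, n + 1))

for n in (5, 10, 20):
    print(n, exhaustive_orderings(n), staged_evaluations(n, target=3))
```

The staged count grows roughly with the square of N, whereas the exhaustive count grows factorially.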

Therefore, a user of the present invention can automatically obtain context-aware edited video in a short time.

A diagram showing a configuration example of a video editing apparatus according to an embodiment of the present invention.
A flowchart of the video editing process executed by the video editing apparatus.
A diagram showing an example of clustering video segments and extracting representative video segments.
A diagram showing an example of sieving representative video segments.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 is a diagram showing an example of the configuration of a video editing apparatus according to an embodiment of the present invention. In FIG. 1, the video editing apparatus 1 is realized by computer hardware, comprising a CPU, memory, an external storage device, and so on, together with a software program; it takes a video 10 to be edited as input and outputs an edited video 19 as the editing result. It comprises a video input unit 11, a video storage unit 12, a video segment dividing unit 13, a video segment feature extraction unit 14, a semantic content likelihood calculation unit 15, a clustering unit 16, an editing target video segment selection unit 17, an edited video output unit 18, a semantic content category dictionary 20, a semantic content relatedness calculation unit 21, and a semantic content relatedness storage unit 22.

The editing target video segment selection unit 17 has a similarity calculation unit 170 and a representative video segment sieving unit 174. The similarity calculation unit 170 has a context prediction probability calculation unit 171, a context posterior probability calculation unit 172, and a context similarity calculation unit 173.

The video input unit 11 inputs the video 10 to be edited and stores it in the video storage unit 12. The video segment dividing unit 13 divides the video to be processed into a plurality of video segments. The video segment feature extraction unit 14 extracts and outputs the features of each segment based on the image information and sound information of the video segment being processed.

The semantic content likelihood calculation unit 15 calculates and outputs, based on the video segment features extracted by the video segment feature extraction unit 14, a likelihood expressing how strongly the video segment belongs to each of the one or more semantic content categories registered in advance in the semantic content category dictionary 20.
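
A minimal sketch of such a likelihood calculation follows. This passage says only that the dictionary holds a likelihood model, so the diagonal Gaussian form, the category names, the numbers in `CATEGORY_MODELS`, and the normalization to a sum of 1 are all assumptions made for illustration.

```python
import math

# Hypothetical dictionary: category -> (mean vector, variance vector).
CATEGORY_MODELS = {
    "joyful": ([0.8, 0.6], [0.05, 0.05]),
    "sad": ([0.2, 0.3], [0.05, 0.05]),
}

def log_gaussian(x, mean, var):
    """Log density of a diagonal Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def category_likelihoods(features, models=CATEGORY_MODELS):
    """Return likelihoods of each category for one segment, scaled to sum to 1."""
    logs = {c: log_gaussian(features, m, v) for c, (m, v) in models.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(value - top) for c, value in logs.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

scores = category_likelihoods([0.75, 0.55])
```

Here `scores` maps each hypothetical category to a value between 0 and 1, and a feature vector near the "joyful" mean receives the larger share.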

The semantic content relatedness calculation unit 21 receives as input any semantic content category registered in advance in the semantic content category dictionary 20, calculates the degree of relatedness between that category and the other semantic content categories, and stores it in the semantic content relatedness storage unit 22.

The semantic content category dictionary 20 and the semantic content relatedness storage unit 22 can be created in advance, by the video editing apparatus 1 or by a separate apparatus, by learning the semantic content categories from sample training videos or the like.

The clustering unit 16 clusters the video segments based on the likelihoods output by the semantic content likelihood calculation unit 15. It also outputs, from each cluster, one or more representative segments as representative video segments.
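
The following sketch shows one way this clustering and representative selection could look. The clustering algorithm is not named in this passage; plain k-means over the likelihood vectors, with the segment nearest each centroid taken as the representative, is an assumption chosen for brevity, and all names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two likelihood vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids start from the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def representatives(vectors, assignment, centroids):
    """For each cluster, pick the index of the segment nearest its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            reps[c] = min(members,
                          key=lambda i: squared_distance(vectors[i], centroid))
    return reps

# Hypothetical likelihood vectors for five segments over two categories.
likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.85, 0.15]]
assignment, centroids = kmeans(likelihoods, k=2)
reps = representatives(likelihoods, assignment, centroids)
```

Segments 0, 1, and 4 end up in one cluster and segments 2 and 3 in the other, and `reps` holds one segment index per cluster.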

The editing target video segment selection unit 17 selects the group of candidate representative video segments to use in the edited video. To this end, the context prediction probability calculation unit 171 in the similarity calculation unit 170 calculates and outputs the context prediction probability at each representative video segment position, when the representative video segments output by the clustering unit 16 are arranged in playback order, based on the video segment features at the representative video segment positions strictly before that position.

The context posterior probability calculation unit 172 calculates and outputs the context posterior probability at each representative video segment position, when the representative video segments output by the clustering unit 16 are arranged in playback order, based on the video segment features at the representative video segment positions up to and including that position.

The context similarity calculation unit 173 calculates and outputs the similarity between the context prediction probability output by the context prediction probability calculation unit 171 and the context posterior probability output by the context posterior probability calculation unit 172.

The representative video segment sieving unit 174 decides, based on the similarity output by the context similarity calculation unit 173, one or more representative video segments to remove, and outputs the remaining representative video segments.
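
The sieving step can be sketched as a greedy removal loop. In the sketch below the scoring function `adjacent_similarity` is only a stand-in for the context similarity (a mean dot product of neighbouring likelihood vectors), and `target_length` is a hypothetical termination condition; both are assumptions for illustration.

```python
def adjacent_similarity(sequence):
    """Stand-in score: mean dot product of neighbouring likelihood vectors."""
    pairs = list(zip(sequence, sequence[1:]))
    return sum(sum(a * b for a, b in zip(u, v)) for u, v in pairs) / len(pairs)

def sieve_once(sequence, score=adjacent_similarity, prefer_change=False):
    """Drop the one segment whose removal gives the best-scoring remainder."""
    candidates = [sequence[:i] + sequence[i + 1:] for i in range(len(sequence))]
    pick = min if prefer_change else max
    return pick(candidates, key=score)

def sieve(sequence, target_length, **options):
    """Repeat single removals until the target length (at least 2) is reached."""
    while len(sequence) > max(target_length, 2):
        sequence = sieve_once(sequence, **options)
    return sequence

# Hypothetical likelihood vectors of four representative segments.
segments = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.9, 0.1]]
kept = sieve(segments, target_length=3)
varied = sieve(segments, target_length=3, prefer_change=True)
```

With `prefer_change=False` the outlier segment is removed, and with `prefer_change=True` the segment whose removal leaves the most varied sequence is removed instead, mirroring the choice between small and large semantic change.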

When the termination condition is satisfied, the edited video output unit 18 joins the representative video segments output by the representative video segment sieving unit 174 and outputs them as the edited video 19.

In this way, the video editing apparatus 1 edits the video to be processed without the manual work of video creators or editors.

FIG. 2 shows a flowchart of the video editing process executed by the video editing apparatus 1 configured as described above. An example of the video editing process performed by the video editing apparatus 1 is described in detail using this flowchart.

The video editing process described below uses the semantic content category information stored in the semantic content category dictionary 20 created in advance, and the semantic content relatedness information calculated in advance by the semantic content relatedness calculation unit 21 and stored in the semantic content relatedness storage unit 22. Their details are explained together with the processing steps below.

First, in step S201, the video to be processed is input, and in step S202 the input video is divided into video segments. The video may be divided at a predetermined fixed interval, or at cut points, which are points where the video changes discontinuously, for example by the method described in Reference 1 below.
[Reference 1]: Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, "Structured Video Computing", IEEE Multimedia, pp.34-43, 1994.
Next, in step S203, video segment features are extracted from the image and sound information in each video segment. Some video segment features are extracted from images and others from sound. In either case, they are obtained by computing, within the segment, statistics of values extracted from short intervals (frames) of, for example, 50 ms.
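
The fixed-interval variant of step S202 and the per-segment statistics of step S203 can be sketched as follows. Times are in milliseconds; the choice of mean and variance as the statistics, the sample brightness values, and the function names are assumptions for illustration.

```python
from statistics import mean, pvariance

def split_fixed(duration_ms, interval_ms):
    """Split [0, duration_ms) into fixed-length segments as (start, end) pairs."""
    return [
        (start, min(start + interval_ms, duration_ms))
        for start in range(0, duration_ms, interval_ms)
    ]

def segment_statistics(frame_values, frame_ms, segment):
    """Mean and variance of the per-frame values that fall inside one segment."""
    begin, end = segment
    inside = [v for i, v in enumerate(frame_values)
              if begin <= i * frame_ms < end]
    return mean(inside), pvariance(inside)

segments = split_fixed(duration_ms=1000, interval_ms=500)
# Hypothetical per-frame brightness sampled every 50 ms (20 frames in 1 s).
brightness = [0.2] * 10 + [0.8] * 10
features = [segment_statistics(brightness, 50, s) for s in segments]
```

Each entry of `features` is a (mean, variance) pair for one segment; cut-point segmentation would replace `split_fixed` while leaving the statistics step unchanged.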

Features extracted from images include, for example, brightness features, color features, motion features, texture features, cut features, object features, and image event features.

Brightness, color, and motion features can be obtained by computing, for each pixel, the lightness, the RGB values, and the motion vector, respectively. As texture features, statistics of the gray-level histogram (contrast), the power spectrum, and the like may be computed. These may be summarized as statistics such as the mean and variance over a whole image, or a histogram may be taken for each small pixel region of, for example, 8×8 or 16×16 pixels and extracted as a vector.
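
A small sketch of the whole-image statistic and the per-block histogram follows. It assumes an image given as rows of 8-bit (R, G, B) tuples, uses the channel mean as a simple lightness value, and picks four histogram bins; these choices and the function names are illustrative assumptions.

```python
def brightness(pixel):
    """Mean of the R, G, B channels as a simple lightness value."""
    return sum(pixel) / 3.0

def mean_rgb(image):
    """Average R, G, B over every pixel of an image (rows of RGB tuples)."""
    pixels = [p for row in image for p in row]
    return tuple(sum(p[c] for p in pixels) / len(pixels) for c in range(3))

def block_histograms(image, block=8, bins=4):
    """Brightness histogram per block x block tile, concatenated into one vector."""
    vector = []
    for top in range(0, len(image), block):
        for left in range(0, len(image[0]), block):
            hist = [0] * bins
            for row in image[top:top + block]:
                for pixel in row[left:left + block]:
                    index = min(int(brightness(pixel) * bins / 256), bins - 1)
                    hist[index] += 1
            vector.extend(hist)
    return vector

# Hypothetical 8x16 test image: left half black, right half white.
image = [[(0, 0, 0)] * 8 + [(255, 255, 255)] * 8 for _ in range(8)]
color_feature = mean_rgb(image)
texture_vector = block_histograms(image)
```

The whole-image mean loses the left/right structure of the test image, while the block histogram vector keeps it, which is the trade-off between the two options described above.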

A cut feature represents the presence or absence, or the frequency, of scene changes (cuts). Strictly speaking, it cannot be extracted from a single image, so it is obtained using neighboring images.

An object feature concerns the objects captured in an image. In this embodiment, no object recognition is performed to identify what an object is; instead, the local features used for object recognition are used as object features. As local features, for example, SIFT (Scale Invariant Feature Transform) described in Reference 2 below or SURF (Speeded Up Robust Features) described in Reference 3 below can be used.
[Reference 2]: D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, pp.91-110, 2004.
[Reference 3]: H. Bay, T. Tuytelaars, and L.V. Gool, "SURF: Speeded Up Robust Features", Lecture Notes in Computer Science, vol. 3951, pp.404-417, 2006.
Alternatively, object features may be obtained by focusing on and detecting a specific object. A typical approach is to detect the appearance of a face and its expression. Faces can be detected, for example, by the method described in Reference 4 below; to recognize expressions as well, the method described in Reference 5 below may be used.
[Reference 4]: H.A. Rowley, S. Baluja, and T. Kanade, "Neural Network-based Face Detection", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.203-208, 1996.
[Reference 5]: I. Cohen, N. Sebe, A. Garg, L.S. Chen, and T.S. Huang, "Facial Expression Recognition from Video Sequences: Temporal and Static Modeling", Computer Vision and Image Understanding, vol.91, issues 1-2, pp.160-187, 2003.
An image event feature concerns events that occur in the video, such as abrupt camera work or the appearance of a telop (on-screen caption). Abrupt camera work can be detected, for example, by the method described in Reference 1 above, and telops by the method described in Reference 6 below.
[Reference 6]: 桑野秀豪, 倉掛正治, 小高和己, "Telop Character Extraction Method for Video Data Retrieval" (in Japanese), IEICE Technical Report, PRMU, 96(385), pp.39-46, 1996.
Features extracted from sound information, on the other hand, include pitch features, volume features, spectral features, rhythm features, speech features, music features, and sound event features.

The pitch feature may be, for example, the pitch itself, which can be extracted using a method such as the one described in Reference 7 below.
[Reference 7]: Sadaoki Furui, "Digital Speech Processing, 4.9 Pitch Extraction", pp. 57-59, 1985.
As the volume feature, the amplitude values of the audio waveform data may be used, or a short-time power spectrum may be computed and the average power in an arbitrary band calculated and used.
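As a minimal sketch of the two volume features just described, the following Python code computes the mean squared amplitude of one frame and the average power in a frequency band of its short-time power spectrum. The function names, the naive DFT, and the frame and band parameters are illustrative assumptions and do not come from the patent.

```python
import cmath
import math


def mean_amplitude_power(frame):
    """Mean squared amplitude of one frame of waveform samples."""
    return sum(s * s for s in frame) / len(frame)


def band_average_power(frame, sample_rate, f_lo, f_hi):
    """Average power of the DFT bins whose frequency lies in [f_lo, f_hi].

    A naive O(N^2) DFT keeps the sketch dependency-free; a practical
    system would use an FFT and usually a window function.
    """
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
            powers.append(abs(coeff) ** 2 / n)
    return sum(powers) / len(powers) if powers else 0.0


# Hypothetical frame: a 100 Hz sine sampled at 800 Hz (80 samples).
frame = [math.sin(2 * math.pi * 100 * t / 800) for t in range(80)]
low_band = band_average_power(frame, 800, 90, 110)    # contains the tone
high_band = band_average_power(frame, 800, 200, 300)  # contains no tone
```

In this hypothetical frame the band around 100 Hz carries almost all of the power, so `low_band` is far larger than `high_band`.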

As the spectral feature, for example, Mel-Frequency Cepstral Coefficients (MFCC) can be used.

As the rhythm feature, for example, the tempo may be extracted, using a method such as the one described in Reference 8 below.
[Reference 8]: E. D. Scheirer, "Tempo and Beat Analysis of Acoustic Musical Signals", Journal of the Acoustical Society of America, vol. 103, issue 1, pp. 588-601, 1998.
The speech feature and the music feature represent the presence or absence of speech and of music, respectively. To find sections in which speech or music is present, for example, the method described in Reference 9 below may be used.
[Reference 9]: K. Minami, A. Akutsu, H. Hamada, and Y. Tonomura, "Video Handling with Music and Speech Detection", IEEE Multimedia, vol. 5, no. 3, pp. 17-25, 1998.
As the sound event feature, for example, the occurrence of emotional vocal sounds such as laughter or shouting, or of environmental sounds such as gunshots or explosions, may be used. To detect such sound events, for example, the method described in Reference 10 (a patent document) below may be used.
[Reference 10]: WO/2008/032787
Subsequently, in step S204, the semantic content likelihood is calculated for every video section based on the video section feature values obtained in step S203.

First, the semantic content category is described. A semantic content category expresses the meaning of video content in a small number of words and represents an object, an event, a concept, an emotion, or the like.

For example, if a video section contains a soccer goal scene, suitable semantic content categories for that section would be "ball (object)", "goal (event)", "soccer (concept)", and "joy (emotion)". Likewise, for a video section showing hiking on a trip, "tree (object)", "mountain (object)", "walking (event)", and "hiking (concept)" would be appropriate.

These objects, events, concepts, emotions, and so on are set in advance as semantic content categories and registered in the semantic content category dictionary 20. The semantic content category dictionary 20 also stores information on a likelihood model, obtained by learning from a large number of sample videos, that indicates the probabilistic relationship between each semantic content category and the video section feature values.

The semantic content likelihood calculated in step S204 is, for a given video section i, a probabilistic estimate of how plausibly the section belongs to each semantic content category. It is calculated using a model (the likelihood model) that relates the video section feature values extracted in step S203 to each of the semantic content categories preset in the semantic content category dictionary 20. This likelihood model may be obtained, for example, by modeling the feature values of video sections belonging to a given semantic content category with a probability density function such as a Gaussian or Poisson distribution; when the video section features are discrete, a conditional probability table or a multinomial distribution may be used instead.

Since all of the above models are probabilistic, they are learned from semantic content category labels given in advance, for example by hand. Suppose that a video section i showing soccer has been manually labeled "soccer". The model is then learned so that the video section feature value x_i of that section becomes more likely to occur under the semantic content category "soccer". A known method such as maximum likelihood estimation may be used for this learning. When multiple labels are given, a model is learned for each label in the same way.

The resulting likelihood model is expressed as follows.

p(x_i | c_i)   ... (1)
Here, x_i is the video section feature value of a video section i and c_i is the semantic content category of that section; both are random variables. x_i takes the feature value calculated in step S203, and c_i indicates a semantic content category, for example c_i = "soccer" or c_i = "walking".

To calculate the semantic content likelihood, p(x_i | c_i) is evaluated for each semantic content category.

The procedure for calculating the semantic content likelihood of a video section i is now described. Suppose, for example, that the following five semantic content categories are set.

・Semantic content category 1: "soccer"
・Semantic content category 2: "sports"
・Semantic content category 3: "sea"
・Semantic content category 4: "hiking"
・Semantic content category 5: "swimming"
The video section feature value x_i of video section i has already been obtained in step S203. In step S204, the semantic content likelihood is calculated for each of the semantic content categories 1 to 5 from this feature value x_i and p(x_i | c_i) of equation (1). Suppose that video section i shows soccer. With the categories above, the semantic content likelihoods p(x_i | c_i = soccer) and p(x_i | c_i = sports) come out high. Conversely, soccer footage rarely contains the sea and also differs from hiking and swimming, so p(x_i | c_i = sea), p(x_i | c_i = hiking), and p(x_i | c_i = swimming) come out low. In this way, the likelihood of each semantic content category is calculated for a video section i.
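The learning and evaluation of the likelihood model of equation (1) could be sketched as follows, assuming one of the options the text mentions: a Gaussian per category fitted by maximum likelihood. The diagonal covariance, the variance floor, the two-dimensional feature values, and all function names are assumptions made for illustration.

```python
import math


def fit_gaussian(samples):
    """Maximum likelihood mean and variance per feature dimension."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    # A small floor keeps the variance positive (an assumption).
    var = [max(sum((s[d] - mean[d]) ** 2 for s in samples) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def likelihood(x, model):
    """Evaluate p(x | c) for a diagonal Gaussian model (mean, var)."""
    mean, var = model
    log_p = sum(-0.5 * math.log(2 * math.pi * v) - (xd - m) ** 2 / (2 * v)
                for xd, m, v in zip(x, mean, var))
    return math.exp(log_p)


# Hypothetical 2-D feature values of manually labeled training sections.
training = {
    "soccer": [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]],
    "sea": [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]],
}
models = {c: fit_gaussian(samples) for c, samples in training.items()}

x_i = [0.85, 0.15]  # feature value of the section to be scored
likelihoods = {c: likelihood(x_i, m) for c, m in models.items()}
```

For this made-up data the section scores a much higher likelihood under "soccer" than under "sea", mirroring the example in the text.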

Note also that, by Bayes' rule, the following holds for p(x_i | c_i):
p(c_i | x_i) = {p(x_i | c_i) p(c_i)} / p(x_i)   ... (2)
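A short sketch of equation (2) is given below. It expands p(x_i) as the sum of p(x_i | c_i) p(c_i) over all categories, which follows from the law of total probability. The likelihood values and the uniform prior p(c_i) are hypothetical; the text does not say how the prior is chosen.

```python
def posterior(likelihoods, prior):
    """Equation (2): p(c | x) = p(x | c) p(c) / p(x),
    with p(x) computed as the sum over c of p(x | c) p(c)."""
    joint = {c: likelihoods[c] * prior[c] for c in likelihoods}
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}


# Hypothetical likelihood values p(x_i | c_i) for one video section.
likelihoods = {"soccer": 4.0, "sports": 3.0, "sea": 1.0}
# Assumed uniform prior p(c_i).
prior = {c: 1.0 / len(likelihoods) for c in likelihoods}

post = posterior(likelihoods, prior)
```

With a uniform prior, the posterior is simply the normalized likelihood, here 0.5, 0.375, and 0.125 for the three categories.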

Subsequently, in steps S205 to S208, the editing target video section selection unit 17 selects the video sections used to generate the edited video 19. Before that, the degree of association between semantic content categories, which is used in calculating the context similarity, is described.

The semantic content relevance calculation unit 21 calculates the degree of association between the preset semantic content categories (one or more) and stores it in the semantic content relevance storage unit 22. The degree of association between semantic content categories is defined as follows.

Since video sections are consecutive, a video section i-1 exists immediately before a given video section i. Denoting the semantic content categories of these two video sections by c_i and c_{i-1}, respectively, the conditional probability table c_i | c_{i-1} is obtained and used as the degree of association between semantic content categories.

This conditional probability table may be learned from the same semantic content category labels that were used to learn the semantic content likelihood. Suppose, for example, that video section i is labeled "soccer" and video section i-1 is labeled "sports". This can be counted as one occurrence of "c_i is 'soccer' given that c_{i-1} was 'sports'".

When multiple labels are given, independence between semantic content categories is assumed to keep the computation tractable, and occurrences are counted for each pair. Suppose, in the example above, that video section i is additionally labeled "sports" and video section i-1 is additionally labeled "ball". Then "c_i is 'soccer' given that c_{i-1} was 'ball'", "c_i is 'sports' given that c_{i-1} was 'ball'", and "c_i is 'sports' given that c_{i-1} was 'sports'" are each also counted as one occurrence.

This counting is carried out over all labeled videos (sections). The resulting conditional probability table c_i | c_{i-1} has as many rows and as many columns as there are semantic content categories.

The resulting conditional probability table is expressed as follows.

p(c_i | c_{i-1})   ... (3)
With this conditional probability table, given a semantic content category c_{i-1}, the degree of association (probability) with every semantic content category c_i can be obtained. For example, let c_{i-1} = "soccer". The probabilities that other categories such as c_i = "ball" or c_i = "sports" then appear are obtained by looking up p(c_i = ball | c_{i-1} = soccer) and p(c_i = sports | c_{i-1} = soccer).
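The pairwise counting that yields the table of equation (3) might look like the sketch below. The function name, the row normalization of counts into probabilities, and the optional additive smoothing are assumptions; the label sets reproduce the "sports/ball" to "soccer/sports" example from the text.

```python
from collections import defaultdict
from itertools import product


def learn_transition_table(label_sequence, categories, smoothing=0.0):
    """Estimate p(c_i | c_{i-1}) by counting label pairs of adjacent sections.

    label_sequence: list of label sets, one per section, in time order.
    smoothing: optional additive constant (an assumption, not in the source).
    """
    counts = defaultdict(float)
    for prev_labels, cur_labels in zip(label_sequence, label_sequence[1:]):
        # Each (previous label, current label) pair counts once.
        for prev, cur in product(prev_labels, cur_labels):
            counts[(prev, cur)] += 1.0
    table = {}
    for prev in categories:
        row = {cur: counts[(prev, cur)] + smoothing for cur in categories}
        total = sum(row.values())
        table[prev] = {cur: (v / total if total else 0.0)
                       for cur, v in row.items()}
    return table


categories = ["soccer", "sports", "ball"]
# Section i-1 is labeled {sports, ball}; section i is labeled {soccer, sports}.
labels = [{"sports", "ball"}, {"soccer", "sports"}]
table = learn_transition_table(labels, categories)
# table["ball"]["soccer"] holds p(c_i = soccer | c_{i-1} = ball).
```

Rows for categories that never appear as a preceding label stay all zero here; a nonzero `smoothing` value would turn them into uniform rows instead.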

The conditional probability table information described above is calculated by the semantic content relevance calculation unit 21 and stored in the semantic content relevance storage unit 22.

Subsequently, in step S205, the video sections are clustered based on the semantic content likelihoods calculated in step S204, and one or more representative video sections are output from each cluster.

The benefit of this clustering is that, when the video sections are later sifted, the calculation need not be performed for every video section; it suffices to perform it only for the representative video sections that stand for a small number of clusters. Automatic editing can therefore be carried out in a short time.

FIG. 3 shows an example of clustering and representative video section extraction when the number of video sections is 15 and the number of clusters is 4.

Various known clustering methods exist, such as K-means, mean shift, hierarchical clustering, and affinity propagation. What is needed in order to use any of them is a definition of the similarity between two arbitrary video sections. The definition of similarity based on the semantic content likelihood and the semantic content relevance is therefore described here.

From the viewer's standpoint, watching only video sections with similar semantic content quickly becomes tedious, so it is preferable to judge similarity in terms of semantic content and cluster accordingly. The probability of the semantic content of each video section is therefore calculated, and the similarity is defined in terms of it.

For example, the probability p(c_i | x_i) of the semantic content category c_i in a video section i can be calculated by equation (2). Likewise, the probability p(c_{i-1} | x_{i-1}) of the semantic content category c_{i-1} in video section i-1 can be obtained by equation (2). The similarity between these two probability distributions can then be defined as the similarity between the video sections.

Typical measures of similarity between probability distributions are the negative Kullback-Leibler divergence and the reciprocal of the Kullback-Leibler divergence. The reciprocal is expressed, for example, as follows.

s_{i,i-1} = 1 / KL[p(c_i | x_i) ‖ p(c_{i-1} | x_{i-1})]   ... (4)
where
KL[p(x) ‖ q(x)] = Σ_x p(x) log{p(x) / q(x)}   ... (5)
(Σ_x denotes the sum over x). The Kullback-Leibler divergence is not symmetric: swapping the two probability distributions p(x) and q(x) in equation (5) changes the resulting value. Because this lack of symmetry can be inconvenient for a similarity used in clustering, it is preferable to use the negative Jensen-Shannon divergence or the reciprocal of the Jensen-Shannon divergence.

s_{i,i-1} = 1 / JS[p(c_i | x_i) ‖ p(c_{i-1} | x_{i-1})]   ... (6)
where
JS[p(x) ‖ q(x)] = λ KL[p(x) ‖ q(x)] + (1 - λ) KL[q(x) ‖ p(x)]   ... (7)
Symmetry holds when λ = 0.5.
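The divergences of equations (5) and (7) and the reciprocal similarity of equation (6) can be sketched as below. The code follows equation (7) exactly as written in the text, that is, a weighted sum of the two KL directions; the mixture-based quantity more commonly called the Jensen-Shannon divergence is a different formula and is not what is implemented here. The small constant `eps` and the example distributions are assumptions.

```python
import math


def kl(p, q, eps=1e-12):
    """Equation (5): KL[p || q] over a shared, ordered list of categories.

    eps guards against log(0) and division by zero (an assumption).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def js(p, q, lam=0.5):
    """Equation (7): weighted sum of the two KL directions."""
    return lam * kl(p, q) + (1.0 - lam) * kl(q, p)


def similarity(p, q, eps=1e-12):
    """Equation (6): reciprocal of the divergence of equation (7)."""
    return 1.0 / (js(p, q) + eps)


p_i = [0.7, 0.2, 0.1]     # hypothetical p(c_i | x_i)
p_prev = [0.6, 0.3, 0.1]  # hypothetical p(c_{i-1} | x_{i-1})
p_far = [0.1, 0.1, 0.8]   # a section with very different content
```

With these values, `kl(p_i, p_far)` and `kl(p_far, p_i)` differ, while `js` gives the same result in both argument orders, and `similarity(p_i, p_prev)` is much larger than `similarity(p_i, p_far)`.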

Although this example describes the similarity calculation for two adjacent video sections i and i-1, equations (4) and (6) can also be evaluated for video sections that are not adjacent.

Based on the similarity s_{i,i-1} obtained by these equations, clustering may be performed with one of the known clustering methods mentioned above, for example affinity propagation as described in Reference 11 below. Affinity propagation has three advantages, particularly over K-means.
(1) K-means requires the number of clusters to be set before clustering. Affinity propagation does not need the number of clusters in advance.
(2) With K-means, the center of a generated cluster does not necessarily coincide with an actual video section. With affinity propagation, it always does, so the representative video section can be taken to be the cluster center.
(3) With K-means, the clustering result depends strongly on the initial values, which are normally chosen at random, so measures such as running several trials and keeping the best result are needed. With affinity propagation, the clustering result does not depend on initial values, so a single run suffices.
[Reference 11]: B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points", Science, vol. 315, pp. 972-976, 2007.
Alternatively, Time-Constrained Clustering as described in Reference 12 may be applied.
[Reference 12]: M. M. Yeung and B.-L. Yeo, "Time-Constrained Clustering for Segmentation of Video into Story Units", International Conference on Pattern Recognition, vol. 3, pp. 375-380, 1996.
Subsequently, in step S206, the context similarity is calculated based on the context prediction probability and the context posterior probability, and in step S207, one or more representative video sections to be removed are determined based on this context similarity and the remaining representative video sections are output.

Let S be the original group of representative video sections. FIG. 4 shows an example of sifting the representative video sections when the original number of representative video sections is four. In this process, one cluster is removed: the cluster containing the representative video section whose removal makes the average context similarity f highest, or lowest. Since each cluster holds zero or more video sections whose semantic content is similar to that of its representative video section, these video sections are removed from the edited video as well.

First, the representative video sections of the clusters are arranged in order of playback time. If the number of representative video sections is M, they are denoted K_1, K_2, K_3, ..., K_M in time order. In the example of FIG. 4, the number of representative video sections is four.

Next, candidate groups of representative video sections are generated, each by removing exactly one representative video section from the group S. In the example of FIG. 4, there are four representative video sections K_1 to K_4, so four candidate groups are generated, from S/K_1 (S without K_1) to S/K_4 (S without K_4).

Next, the average context similarity f is calculated for each candidate group. Here, of the four candidate groups shown in FIG. 4, the calculation of the average context similarity f(S/K_2) for S/K_2, from which K_2 has been removed, is shown as an example. The other candidate groups can of course be handled in the same way.

To calculate the average context similarity f(S/K_2), the context similarities t_1, t_3, t_4 of the representative sections K_1, K_3, K_4 must be calculated. Since K_2 is absent, for convenience K_1, K_3, and K_4 are relabeled K'_1, K'_2, and K'_3, respectively. The method for calculating each context similarity t_j is described first.

The context prediction probability and the context posterior probability are calculated for the representative video section K'_j. This requires the context prediction probabilities and context posterior probabilities already calculated for the preceding representative video sections K'_1, ..., K'_{j-1}.

The video section feature value of the representative video section K'_j is denoted x_j and its semantic content category c_j. First, the context prediction probability is calculated by equation (8) below.

p(c_j | x_1, x_2, ..., x_{j-1}) = Σ_{c_{j-1}} p(c_j | c_{j-1}) p(c_{j-1} | x_1, x_2, ..., x_{j-1})   ... (8)
(where the sum Σ is taken over c_{j-1})
Here, p(c_{j-1} | x_1, x_2, ..., x_{j-1}) on the right-hand side is the context posterior probability of K'_{j-1}.

Subsequently, the context posterior probability is calculated by the following equation.

p(c_j | x_1, x_2, ..., x_j) = {p(x_j | c_j) p(c_j | x_1, x_2, ..., x_{j-1})} / Σ_{c_j} {p(x_j | c_j) p(c_j | x_1, x_2, ..., x_{j-1})}   ... (9)
(where the sum Σ is taken over c_j)
Here, p(c_j | x_1, x_2, ..., x_{j-1}) is the context prediction probability obtained by equation (8).
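One step of the recursion formed by equations (8) and (9) can be sketched as follows; it has the same shape as the forward (filtering) recursion of a hidden Markov model. The function names, the two-category setup, and all numeric values are hypothetical.

```python
def predict(prev_posterior, transition):
    """Equation (8): context prediction probability p(c_j | x_1..x_{j-1}).

    transition[prev][cur] holds p(c_j = cur | c_{j-1} = prev).
    """
    cats = list(prev_posterior)
    return {c: sum(transition[prev][c] * prev_posterior[prev] for prev in cats)
            for c in cats}


def update(prediction, likelihood):
    """Equation (9): context posterior probability p(c_j | x_1..x_j)."""
    joint = {c: likelihood[c] * prediction[c] for c in prediction}
    norm = sum(joint.values())
    return {c: v / norm for c, v in joint.items()}


# Hypothetical two-category example.
transition = {
    "soccer": {"soccer": 0.8, "sea": 0.2},
    "sea": {"soccer": 0.3, "sea": 0.7},
}
prev_posterior = {"soccer": 0.9, "sea": 0.1}  # posterior of K'_{j-1}
likelihood = {"soccer": 0.5, "sea": 2.0}      # p(x_j | c_j) for K'_j

prediction = predict(prev_posterior, transition)
post = update(prediction, likelihood)
```

Here the prediction favors "soccer" (0.75 against 0.25), but the observed feature value of K'_j fits "sea" better, so the posterior shifts toward "sea".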

By carrying out the above in order from K'_1 to K'_3, the context prediction probabilities and context posterior probabilities of all representative video sections can be calculated. Note that the context prediction probability of K'_1 cannot normally be calculated, because no earlier context posterior probability exists. An arbitrary probability distribution, for example a uniform distribution, may therefore be assigned as the context prediction probability of K'_1.

Subsequently, the context similarity t_j of each representative video section is obtained. The context similarity t_j of the representative video section K'_j is defined as the similarity between its context prediction probability and its context posterior probability. Since both are probability distributions, the similarity is obtained in the manner of equations (4) and (6), for example as follows.

t_j = Σ_{c_j} p(c_j | x_1, x_2, ..., x_j) log{p(c_j | x_1, x_2, ..., x_j) / p(c_j | x_1, x_2, ..., x_{j-1})}   ... (10)
(where the sum Σ is taken over c_j)
Equation (10) has the form of the Kullback-Leibler divergence of equation (5), with the context posterior probability as p and the context prediction probability as q. This t_j is the context similarity of the representative video section K'_j; by calculating in order from K'_1, the context similarities of the other representative video sections are obtained in the same way.

Finally, the average context similarity f(S/K_2) is calculated for the candidate group S/K_2. This may simply be the arithmetic mean of the context similarities t_j of the representative video sections, or a weighted mean.

In this way, the average context similarity f is calculated for every candidate group of representative video sections.

Once the average context similarity has been calculated for all candidate groups, the candidate group for which it is maximal (or minimal) is selected and is adopted and output as the new group of representative video sections S'.

Whether to maximize or minimize depends on what kind of edited video is desired. To create an edited video with a single, consistent context, the average should be made low; to create an edited video with a variety of contexts, the average should be made high.
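The sifting step described above (try removing each representative section once, compute the average f of the t_j of equation (10) for each candidate group, and keep the best group) could be sketched as follows. The uniform initial prediction follows the option given in the text; the arithmetic mean, the function names, and all numeric values are assumptions, and strictly positive likelihoods and transition probabilities are assumed so that no logarithm of zero occurs.

```python
import math


def average_context_similarity(likelihoods, transition, cats):
    """Arithmetic mean of t_j over a time-ordered list of sections.

    likelihoods: one dict per section giving p(x_j | c) for each category.
    """
    prediction = {c: 1.0 / len(cats) for c in cats}  # uniform start for K'_1
    total = 0.0
    for lik in likelihoods:
        joint = {c: lik[c] * prediction[c] for c in cats}
        norm = sum(joint.values())
        post = {c: joint[c] / norm for c in cats}             # equation (9)
        total += sum(post[c] * math.log(post[c] / prediction[c])
                     for c in cats if post[c] > 0.0)          # equation (10)
        prediction = {c: sum(transition[p][c] * post[p] for p in cats)
                      for c in cats}                          # equation (8)
    return total / len(likelihoods)


def select_group(likelihoods, transition, cats, maximize=False):
    """Remove each section in turn; return (index to remove, resulting f)."""
    scores = []
    for k in range(len(likelihoods)):
        candidate = likelihoods[:k] + likelihoods[k + 1:]
        f = average_context_similarity(candidate, transition, cats)
        scores.append((f, k))
    f, k = max(scores) if maximize else min(scores)
    return k, f


cats = ["soccer", "sea"]
transition = {
    "soccer": {"soccer": 0.8, "sea": 0.2},
    "sea": {"soccer": 0.3, "sea": 0.7},
}
soccer_like = {"soccer": 0.9, "sea": 0.1}
sea_like = {"soccer": 0.1, "sea": 0.9}
sections = [soccer_like, soccer_like, sea_like, soccer_like]

removed, f = select_group(sections, transition, cats)  # aim: consistent context
```

In this made-up sequence, minimizing f removes the single sea-like section (index 2), which leaves a consistent soccer context; calling `select_group` with `maximize=True` instead removes one of the leading soccer-like sections and keeps the change of context.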

以上説明した処理ステップのうち，ステップＳ２０５〜ステップＳ２０７を，終了条件が満たされるまで繰り返し実行する（ステップＳ２０８）。終了条件は，例えば編集映像の時間長がある映像区間以下となったときとしてもよいし，コンテキスト類似度の平均値の値が一定値以上または一定値以下となったときなどとしてもよい。 Of the processing steps described above, steps S205 to S207 are repeatedly executed until the end condition is satisfied (step S208). The end condition may be, for example, when the time length of the edited video is equal to or less than a certain video section, or may be when the average value of the context similarity is greater than or equal to a certain value.

この繰り返し処理の効果は，次のように言及できる。通常，コンテキストを考慮した編集映像を生成する場合，全てのあり得る映像区間の組合せの中から編集映像を生成する必要があり，いわゆる組合せ爆発の問題が起こる。より具体的には，映像区間数をＮとおいたとき，計算オーダとしてＯ（Ｎ！）を超える計算量が必要となり，現実的な時間で計算を終了することができない。 The effect of this iterative process can be mentioned as follows. Usually, when generating an edited video in consideration of context, it is necessary to generate an edited video from all possible combinations of video sections, which causes a problem of so-called combination explosion. More specifically, when the number of video sections is N, a calculation amount exceeding O (N!) Is required as a calculation order, and the calculation cannot be completed in a realistic time.

これに対し，本実施形態による繰り返し処理では，段階的に削減された，より見込みのある代表映像区間群の中から編集映像を生成できるようになるため，多項式時間という劇的に短い時間で，コンテキストを考慮した編集映像を生成することができるようになるのである。 On the other hand, in the iterative processing according to the present embodiment, the edited video can be generated from the more promising representative video section groups that are reduced in stages, so that the polynomial time is dramatically shorter, The edited video can be generated in consideration of the context.

ステップＳ２０５のクラスタリング処理は，通常，一度のみ実施すればよい場合もあり，本実施形態の例においても，一度のみとしても処理の整合性を欠くことはない。しかしながら，繰り返しに含めることによる効果もあり，これは以下のように述べることができる。 The clustering process in step S205 may normally be performed only once, and even in the example of this embodiment, the process consistency is not lost. However, there is an effect by including it repeatedly, and this can be described as follows.

クラスタリングのプロセスにおいては，対象となるデータの数に応じて，生成するクラスタ数は適応的に決定する（ａｆｆｉｎｉｔｙｐｒｏｐａｇａｔｉｏｎなどを利用した場合には自動決定することができる）のが通常である。簡単な例を挙げれば，データ数が１０個しかないときに，９つのクラスタを生成してもあまり意味はなく，例えば，２つや３つなどと設定されるべきであろう。また，データ数が１００万あるときに，２つや３つのクラスタ数では，あまりに大雑把なクラスタが生成されるため，もう少し多数のクラスタを設定する必要があるであろう。 In the clustering process, the number of clusters to be generated is usually determined adaptively according to the number of target data (can be automatically determined when using affinity propagation). To give a simple example, if there are only 10 data, it would not make much sense to generate 9 clusters, for example, 2 or 3 should be set. Also, when the number of data is 1 million, an excessively rough cluster is generated with two or three clusters, so it will be necessary to set a few more clusters.

本実施形態では，クラスタリング対象となる映像区間を段階的に削減するが，このクラスタリング処理を繰り返しの中に含めることによって，各段階における映像区間数に応じた，意味のあるクラスタ数の決定が可能になる。 In this embodiment, the video segments to be clustered are reduced step by step. By including this clustering process in the iteration, it is possible to determine the number of meaningful clusters according to the number of video segments in each step. become.

また，上記の処理の例では，候補代表映像区間群を生成する際，元の代表映像区間群から一つ削除した場合についてだけ扱った。しかしながら，処理の方法としては，図４で説明したような映像区間の削除だけでなく，映像区間を加える，あるいは，時間順序を入れ替えることによって，候補代表映像区間群を生成してもよい。本実施形態を用いることによって，映像区間を削除するという編集だけでなく，より一般的な編集行為を支援できることも特筆すべき効果の一つである。 In the above processing example, only the case where one candidate representative video section group is deleted from the original representative video section group is handled. However, as a processing method, the candidate representative video segment group may be generated not only by deleting the video segment as described in FIG. 4 but also by adding the video segment or changing the time order. One of the notable effects is that the use of this embodiment can support not only the editing of deleting a video section but also a more general editing action.

Finally, in step S209, when the end condition is satisfied, the representative video segments output by step S207 are joined together and output as the edited video 19.
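As a rough illustration of the joining step, the following Python sketch lays the selected segments end to end on an output timeline. Representing a segment as a `(start, end)` pair of source times in seconds, and the dictionary keys used in the result, are assumptions for this example; the source does not describe the data format of the edited video.

```python
def build_edit_list(segments):
    """Place segments back to back and return (edit list, total duration).

    `segments` is a list of (start, end) source times in seconds, already in
    the order in which they should appear in the edited video.
    """
    edit_list = []
    out_pos = 0.0
    for start, end in segments:
        if end <= start:
            raise ValueError("segment end must be later than its start")
        edit_list.append({"src_start": start, "src_end": end, "out_start": out_pos})
        out_pos += end - start  # the next segment begins where this one ends
    return edit_list, out_pos
```

An actual implementation would pass such an edit list to a video encoder or player; that part is outside the scope of this sketch.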

The above is a description of the video editing method in an example of an embodiment of the present invention. Needless to say, the processing performed by this video editing apparatus can also be written as a computer-readable program.

The video editing apparatus in an example of an embodiment of the present invention has been described in detail above. The present invention is not limited to the example embodiment described, and various modifications can be made within the technical scope set forth in the claims.

For example, the present invention can be used in various video distribution and communication services such as IPTV, digital signage, and VOD (Video on Demand). Specifically, it can realize application services such as arrangements that enhance the effectiveness of video advertising and the automatic generation of video playlists.

Description of Symbols

1 Video editing apparatus
10 Video
11 Video input unit
12 Video storage unit
13 Video segmentation unit
14 Video segment feature extraction unit
15 Semantic content likelihood calculation unit
16 Clustering unit
17 Editing target video segment selection unit
170 Similarity calculation unit
171 Context prediction probability calculation unit
172 Context posterior probability calculation unit
173 Context similarity calculation unit
174 Representative video segment sieving unit
18 Edited video output unit
19 Edited video
20 Semantic content category dictionary
21 Semantic content relevance calculation unit
22 Semantic content relevance storage unit

Claims

A video editing apparatus that automatically generates and outputs an edited video from an input video, comprising:
a semantic content category dictionary that stores information on likelihood models representing the probabilistic relationship between semantic content categories, each of which expresses the meaning of video content by a specific word, and video segment features consisting of image features and/or sound features of a video segment;
a video segmentation unit that divides the input video into video segments;
a video segment feature extraction unit that extracts video segment features from each of the video segments;
a semantic content likelihood calculation unit that refers to the semantic content category dictionary on the basis of the extracted video segment features and outputs, for each video segment, the likelihood of each semantic content category;
a clustering unit that clusters the video segments on the basis of the likelihoods and selects one or more representative video segments from each generated video segment cluster;
an editing target video segment selection unit that, for a plurality of arrangements of candidate representative video segments obtained from combinations of the selected representative video segments, calculates, using at least the likelihoods, a context similarity indicating the degree of semantic change in the connection of the video segments, and selects, on the basis of the context similarity, either an arrangement of candidate representative video segments whose overall degree of semantic change is large or an arrangement of candidate representative video segments whose overall degree of semantic change is small as the candidate representative video segment group to be used for editing; and
an edited video output unit that generates and outputs an edited video by joining the video segments of the selected candidate representative video segment group.

The video editing apparatus according to claim 1, further comprising:
a semantic content relevance storage unit that stores, for each semantic content category stored in the semantic content category dictionary, its degree of relevance to other semantic content categories obtained by learning from sample videos,
wherein the editing target video segment selection unit comprises:
a context prediction probability calculation unit that, for each video segment position in an arrangement of the candidate representative video segments, takes as input the video segment features at video segment positions earlier than that position, the likelihoods, and the degrees of relevance stored in the semantic content relevance storage unit, and calculates a context prediction probability using an expression for the prediction probability that the video segment at that position belongs to each semantic content category;
a context posterior probability calculation unit that, for each video segment position in the arrangement of the candidate representative video segments, takes as input the video segment features at video segment positions up to that position, the likelihoods, and the degrees of relevance stored in the semantic content relevance storage unit, and calculates a context posterior probability using an expression for the posterior probability that the video segment at that position belongs to each semantic content category;
a context similarity calculation unit that calculates the similarity between the context prediction probability and the context posterior probability as the context similarity indicating the degree of semantic change; and
a representative video segment sieving unit that, for each arrangement of the candidate representative video segments, obtains the average of the similarities calculated by the context similarity calculation unit and selects the arrangement of candidate representative video segments for which the average is the maximum or the minimum.

The video editing apparatus according to claim 1 or 2,
wherein the video segments included in the candidate representative video segment group selected by the editing target video segment selection unit are set as the video segments to be clustered by the clustering unit, and the processing by the clustering unit and the processing by the editing target video segment selection unit are repeated until a predetermined end condition is satisfied.

A video editing method in which a video editing apparatus automatically generates and outputs an edited video from an input video,
the method using a semantic content category dictionary that stores information on likelihood models representing the probabilistic relationship between semantic content categories, each of which expresses the meaning of video content by a specific word, and video segment features consisting of image features and/or sound features of a video segment, and comprising executing:
video segmentation processing that divides the input video into video segments;
video segment feature extraction processing that extracts video segment features from each of the video segments;
semantic content likelihood calculation processing that refers to the semantic content category dictionary on the basis of the extracted video segment features and outputs, for each video segment, the likelihood of each semantic content category;
clustering processing that clusters the video segments on the basis of the likelihoods and selects one or more representative video segments from each generated video segment cluster;
editing target video segment selection processing that, for a plurality of arrangements of candidate representative video segments obtained from combinations of the selected representative video segments, calculates, using at least the likelihoods, a context similarity indicating the degree of semantic change in the connection of the video segments, and selects, on the basis of the context similarity, either an arrangement of candidate representative video segments whose overall degree of semantic change is large or an arrangement of candidate representative video segments whose overall degree of semantic change is small as the candidate representative video segment group to be used for editing; and
edited video output processing that generates and outputs an edited video by joining the video segments of the selected candidate representative video segment group.

The video editing method according to claim 4,
wherein, for each semantic content category stored in the semantic content category dictionary, its degree of relevance to other semantic content categories obtained by learning from sample videos is stored in a semantic content relevance storage unit, and
wherein the editing target video segment selection processing executes:
context prediction probability calculation processing that, for each video segment position in an arrangement of the candidate representative video segments, takes as input the video segment features at video segment positions earlier than that position, the likelihoods, and the degrees of relevance stored in the semantic content relevance storage unit, and calculates a context prediction probability using an expression for the prediction probability that the video segment at that position belongs to each semantic content category;
context posterior probability calculation processing that, for each video segment position in the arrangement of the candidate representative video segments, takes as input the video segment features at video segment positions up to that position, the likelihoods, and the degrees of relevance stored in the semantic content relevance storage unit, and calculates a context posterior probability using an expression for the posterior probability that the video segment at that position belongs to each semantic content category;
context similarity calculation processing that calculates the similarity between the context prediction probability and the context posterior probability as the context similarity indicating the degree of semantic change; and
representative video segment sieving processing that, for each arrangement of the candidate representative video segments, obtains the average of the similarities calculated by the context similarity calculation processing and selects the arrangement of candidate representative video segments for which the average is the maximum or the minimum.

The video editing method according to claim 4 or 5,
wherein the video segments included in the candidate representative video segment group selected by the editing target video segment selection processing are set as the video segments to be clustered by the clustering processing, and the clustering processing and the editing target video segment selection processing are repeated until a predetermined end condition is satisfied.

A video editing program for causing a computer to execute the video editing method according to claim 4, 5 or 6.