JP2007336106A

JP2007336106A - Video image editing assistant apparatus

Info

Publication number: JP2007336106A
Application number: JP2006164045A
Authority: JP
Inventors: Noboru Babaguchi; 登馬場口; Naoko Nitta; 直子新田; Tatsuto Akizuki; 達人秋月
Original assignee: Osaka University NUC
Current assignee: Osaka University NUC
Priority date: 2006-06-13
Filing date: 2006-06-13
Publication date: 2007-12-27

Abstract

PROBLEM TO BE SOLVED: To provide a video image editing assistant apparatus that allows an ordinary user without any skill in or knowledge of video image editing to edit video content effectively and efficiently.

SOLUTION: The video image editing assistant apparatus supports a user in editing when the user edits video content to create a user-edited video image. The apparatus comprises a learning portion 2, a material receiving portion 10, a candidate generating portion 13, and an outputting portion 4. The learning portion 2 learns edit information indicative of the alignment of a plurality of substance components contained in a sample video image, which is edited video content. The material receiving portion 10 receives input of a material video image, which is the material video content for the user-edited video image. The candidate generating portion 13 generates information indicative of a candidate for the alignment of substance components in the user-edited video image, according to the edit information learned by the learning portion 2. The substance components are those contained in the material video image received by the material receiving portion 10 or those added to the user-edited video image. The outputting portion 4 outputs the information indicative of a candidate generated by the candidate generating portion 13.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a video editing support apparatus that supports the editing of video content including moving images, sound, and the like.

In recent years, advances in information distribution technologies such as the Internet and digital television broadcasting have made it possible for users who are not video specialists, that is, general users, to obtain video content in large quantity and variety.

Furthermore, high-performance digital video cameras, personal computers (PCs), and the like have become available at low cost, and an environment in which general users can easily store and view video content is taking shape.

As handling video content has become commonplace in this way, software that enables video editing on a PC and that even general users can use casually has been developed and released one after another.

Here, video content is a type of multimedia content composed of multiple element media such as moving images, sound, and subtitles; a movie is one example.

Video editing is the work of producing a new video by joining together appropriate partial videos selected from the video content to be edited (hereinafter, "material video"), while adding sound and editing effects to the video.

A partial video is a video of a predetermined unit that makes up video content. The predetermined unit is, for example, a shot: the unit of video captured between pressing the camera's start-recording button and pressing its stop-recording button, that is, a unit of video recorded continuously without interruption.

The sound added during video editing includes music, sound effects, narration, and the like. An editing effect is a visual effect applied when the video switches, such as a cut, dissolve, fade-in, or fade-out.

In short, the partial videos contained in the material video, and the kinds of sound and editing effects that can be added during editing, are numerous. Consequently, the number of combinations of sound and video, and of chronological combinations of the same element media, is enormous.

The quality of the edited video content (hereinafter, "edited video") therefore depends on the skill and knowledge of the editor.

Research has therefore been conducted to enable general users without editing skill or knowledge to edit video effectively and efficiently.

For example, a technique has been disclosed that improves the efficiency of video editing by preparing a video editing template in advance and having the user fit partial videos into the template (see, for example, Non-Patent Document 1).
X-S. Hua, Z. Wang, and S. Li, "LazyCut - Content-Aware Template-Based Video Authoring," Proc. ACM Multimedia, 2005.

In the above conventional technique, however, creating the template that prescribes how partial videos are combined, and deciding the criteria for selecting partial videos, are left entirely to the user. The technique therefore cannot encourage general users to edit video on a daily basis.

Much research related to automatic video editing has also been carried out, but most of it focuses on the composition of the moving images. For example, automatic video editing based on video grammar, such as the rule that shots of greatly differing lengths cannot be joined, has been studied.

However, editing patterns based on such video grammar are countless and, including those that are set subjectively, have not been formalized, so building a system that takes all editing patterns into account is not easy.

In view of the above problems, an object of the present invention is to provide a video editing support apparatus that allows a general user without video editing skill or knowledge to edit video content effectively and efficiently.

To achieve the above object, the video editing support apparatus of the present invention is a video editing support apparatus that supports editing when a user produces a user-edited video by editing a material video, which is video content, and comprises: learning means for learning editing information, which is information on how a plurality of entity elements contained in a sample video, which is edited video content, are arranged; material receiving means for receiving input of the material video; candidate generating means for generating, in accordance with the editing information learned by the learning means, information indicating candidates for the arrangement, in the user-edited video, of entity elements contained in the material video received by the material receiving means or of entity elements to be added to the user-edited video; and candidate output means for outputting the information indicating the arrangement candidates generated by the candidate generating means.

This makes it possible to learn editing information from, for example, a sample video produced through editing by a video editing specialist. It also makes it possible to present to the user information indicating candidates for the arrangement of entity elements contained in the material video or of a plurality of entity elements to be added to the user-edited video.

That is, the user can be presented with information indicating candidates for: the moving image to be placed at a time position on the user-edited video selected by the user; the time position at which a moving image selected by the user should be placed; entity elements such as additional sound and editing effects to be added to the moving image; and the time positions at which such additional sound and editing effects should be placed.

In other words, the user can produce the user-edited video effectively and efficiently while drawing on the editing skill and knowledge of a video editing specialist.

The video editing support apparatus of the present invention may further comprise instruction receiving means for receiving an instruction from the user, wherein, when the instruction receiving means receives an instruction selecting an entity element of a predetermined unit contained in the material video, the candidate selecting means generates the information indicating the arrangement candidates by selecting, in accordance with the editing information and from among arbitrary time positions on the user-edited video, candidates for the time position on the user-edited video at which the designated entity element should be placed, and, when the instruction receiving means receives an instruction selecting a time position on the user-edited video, generates the information indicating the arrangement candidates by selecting, in accordance with the editing information and from among the plurality of entity elements contained in the material video, candidates for the entity element to be placed at the designated time position.

Thus, for example, when the user selects one shot that the user wants to include in the user-edited video, the video editing support apparatus can present to the user, in accordance with the editing information, information indicating candidate time positions suitable for placing that shot.

Conversely, when, for example, the user selects a time position at which to place a shot, the video editing support apparatus can present to the user information indicating candidate shots suitable for placement at that time position.
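The two directions of candidate generation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the learned editing information is a preference for neighbouring shots of similar length; the function names (`similarity`, `position_score`, `rank_positions`, `rank_shots`) and the scoring rule are hypothetical and are not taken from the patent.

```python
def similarity(a, b):
    """Length similarity in (0, 1]; 1.0 means equal. Lengths must be positive."""
    return min(a, b) / max(a, b)

def position_score(shot_len, timeline, index):
    """Score inserting a shot of length shot_len before timeline[index].

    timeline is the list of shot lengths already placed in the user-edited
    video. The score is the mean similarity to the immediate neighbours
    (an assumed stand-in for the learned editing information).
    """
    neighbours = []
    if index > 0:
        neighbours.append(timeline[index - 1])
    if index < len(timeline):
        neighbours.append(timeline[index])
    if not neighbours:
        return 0.0
    return sum(similarity(shot_len, n) for n in neighbours) / len(neighbours)

def rank_positions(shot_len, timeline):
    """User picked a shot: return insertion indices, best candidate first."""
    indices = range(len(timeline) + 1)
    return sorted(indices,
                  key=lambda i: position_score(shot_len, timeline, i),
                  reverse=True)

def rank_shots(index, timeline, material):
    """User picked a time position: return material shot indices, best first."""
    return sorted(range(len(material)),
                  key=lambda k: position_score(material[k], timeline, index),
                  reverse=True)

if __name__ == "__main__":
    timeline = [1.0, 1.0, 4.0, 4.0]               # lengths already placed (seconds)
    print(rank_positions(4.0, timeline))          # candidate positions for a 4 s shot
    print(rank_shots(1, timeline, [4.0, 1.0, 2.0]))  # candidate shots for position 1
```

A real system would score candidates against all of the learned time-series and synchronous information rather than shot length alone; the sketch only shows that the same scoring function can serve both queries.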

The video editing support apparatus of the present invention may further comprise material feature extracting means for extracting signal features of each of the moving images and sound that are entity elements contained in the material video received by the material receiving means, wherein the learning means includes: sample feature extracting means for extracting signal features of each of the moving images and sound that are entity elements contained in the sample video; and analyzing means for analyzing the signal features extracted by the sample feature extracting means to obtain information indicating the time-series configuration of each of the moving images and sound contained in the sample video and information indicating the synchronous configuration of the moving images and the sound. The learning means learns, as the editing information, the information indicating the time-series configuration and the information indicating the synchronous configuration obtained by the analyzing means, and the candidate generating means generates information indicating candidates for the arrangement, in the user-edited video, of moving images or sound of a predetermined unit contained in the material video by comparing the information indicating the time-series configuration or the information indicating the synchronous configuration obtained by the analyzing means against the signal features extracted by the material feature extracting means.

In this way, the video editing support apparatus of the present invention can learn, from the signal features of these entity elements, the time-series configuration of the moving images and sound of the sample video and the synchronous configuration of the moving images and sound.

Furthermore, using the learned information, it can present to the user information indicating candidates for the arrangement of moving images or sound in the user-edited video.

That is, the video editing support apparatus can learn, for example, how shots with what kinds of features are arranged in the sample video. It can also learn what kinds of sound are synchronized with shots having what kinds of features.

Furthermore, this information on how shots and sound are arranged in the sample video is compared against the signal features of the shots in the material video. This makes it possible to present to the user information indicating candidates for the arrangement of shots in the user-edited video that follow the arrangement of shots in the sample video.

The analyzing means may obtain, as the information indicating the synchronous configuration, information indicating the relationship between the boundaries between adjacent moving images divided into predetermined units and the boundaries between adjacent sounds divided into predetermined units, or information indicating the relationship between the features of a moving image and the features of the sound added to that moving image. The analyzing means may also obtain, as the information indicating the time-series configuration, information indicating the relationship between the signal features of adjacent moving images divided into predetermined units and the relationship between the signal features of adjacent sounds divided into predetermined units.

By learning the relationship between moving-image boundaries and sound boundaries in this way, it becomes possible, for example, to automatically synchronize a placed shot with the change points (segment boundaries) of the sound in the user-edited video. It is also learned, for example, that in the sample video a single piece of background music (BGM) is added across two consecutive shots.

Likewise, by learning the relationship between the features of a moving image and the features of the sound added to it, it is learned, for example, that in the sample video fast-beat music is added to shots with intense motion.

Further, by learning the relationship between adjacent element media of the same kind, features such as the tendency for consecutive shots to have similar shot lengths can be learned.

The video editing support apparatus of the present invention may further comprise sample receiving means for receiving the sample video and a source video, which is the video from which the sample video was edited, wherein the sample feature extracting means further extracts signal features of each of the moving images and sound that are entity elements contained in the source video, and the analyzing means further obtains information indicating what kinds of features the moving images extracted from the source video into the sample video have, by comparing the signal features of the moving images or sound contained in the sample video against the signal features of the moving images or sound contained in the source video.

This makes it possible to learn editing patterns such as which time positions in the source video the shots of the sample video were extracted from.

The video editing support apparatus of the present invention may further comprise: instruction receiving means for receiving an instruction from the user; editing executing means for obtaining the user-edited video by arranging two or more entity elements contained in the material video in accordance with an instruction received by the instruction receiving means after the candidate output means has output the information indicating the arrangement candidates; and video output means for visualizing and outputting the arrangement of the entity elements in the user-edited video obtained by the editing executing means, by arranging images whose appearance corresponds to the signal features extracted by the sample feature extracting means.

This allows the user to grasp effectively how shots and the like are arranged in the user-edited video, and thus further improves the efficiency of the user's editing work.

The apparatus may further comprise storage means for storing a plurality of signal features of additional sounds, which are entity elements to be added to the user-edited video, wherein the candidate generating means further generates information indicating candidates for the arrangement of additional sounds to be added to the user-edited video by comparing the editing information learned by the learning means against the plurality of signal features stored in the storage means, and the candidate output means further outputs the information indicating the candidates for the arrangement of the additional sounds generated by the candidate generating means. The apparatus may also further comprise storage means for storing a plurality of signal features of visual effects, which are entity elements to be added to the user-edited video, wherein the learning means learns, as editing information, information on the arrangement of visual effects by having the analyzing means further obtain, through analysis of the signal features of the moving images extracted by the extracting means, information on the arrangement of the visual effects added to the moving images, which are entity elements contained in the sample video; the candidate generating means generates information indicating candidates for the arrangement of visual effects to be added to the user-edited video by comparing the editing information learned by the learning means against the plurality of signal features stored in the storage means; and the candidate output means further outputs the information indicating the candidates for the arrangement of the visual effects selected by the candidate generating means.

This makes it possible to present to the user the kinds of sound or visual effect to be added to the user-edited video, together with candidates for their arrangement.

For example, the user can be presented with candidate time positions in the user-edited video at which music should be placed, or with candidate pieces of music that should be arranged so as to be spatially synchronized with a given shot.

The video editing support apparatus of the present invention may further comprise: instruction receiving means for receiving an instruction from the user; editing executing means for obtaining the user-edited video by arranging two or more entity elements contained in the material video in accordance with an instruction received by the instruction receiving means after the candidate output means has output the information indicating the arrangement candidates; and video output means for outputting the user-edited video obtained by the editing means.

This makes it possible to present to the user candidates for the arrangement of moving images and the like, generated in accordance with the editing information obtained from the sample video, and also to give the user an environment in which to carry out the actual editing.

In this way, the video editing support apparatus of the present invention can learn from a sample video, as editing information, the editing skill or knowledge of the editor who edited it. In accordance with that editing information, it can then present to the user candidates for the arrangement of entity elements such as the moving images of the material video.

Furthermore, the present invention can be realized as a method whose steps are the operations performed by the characteristic components of the video editing support apparatus of the present invention, as a program including those steps, as a storage medium such as a CD-ROM on which the program is stored, or as an integrated circuit. The program can also be distributed via a transmission medium such as a communication network.

As described above, the present invention can learn editing information from a sample video. Thus, for example, from a sample video produced by a video editing specialist, the specialist's editing skill or knowledge can be learned as editing information.

In accordance with that editing information, candidates for the arrangement of entity elements such as moving images on the user-edited video can be presented to the user. That is, the user's editing actions can be guided in a more appropriate direction on the basis of the learned editing skill of a video editing specialist.

In short, the present invention can provide a video editing support apparatus that allows a general user without video editing skill or knowledge to edit video content effectively and efficiently.

Embodiments of the present invention are described below with reference to the drawings.

First, the configuration of the video editing support apparatus 1 of the embodiment is described with reference to FIGS. 1 to 3.

FIG. 1 is a diagram schematically showing the configuration and functions of the video editing support apparatus 1 of the embodiment.

The video editing support apparatus 1 is an apparatus that, when a user edits video content, presents information for editing to the user so that the user can edit effectively and efficiently.

As shown in FIG. 1, the video editing support apparatus 1 includes a learning unit 2 and an editing unit 3.

The learning unit 2 is a component that learns, from video content produced through editing by a video editing specialist, editing information, which is information on how element media such as moving images are arranged in that video content.

The editing unit 3 is a component that guides the user's editing actions on an interface such as the display device 30. It also has a function of outputting the video content edited through the user's editing actions.

The edited video from which the learning unit 2 learns editing information is hereinafter called the "sample video." In the present embodiment, the learning unit 2 learns editing information from movie trailers as sample videos. The video content that the user is editing, and the video content whose editing is complete and which is output from the editing unit 3, are hereinafter called the "user-edited video."

The learning unit 2 separates the sample video into element media such as moving images, sound, and editing effects, and then extracts the low-level features (signal features) of each element that relate to the sensory structure. It then learns the sensory structure of the sample video from the extracted features.

Hereinafter, the word "element" used alone refers to element media such as moving images. Each of the element media contained in video content, such as moving images and sound, is an example of an entity element in the video editing support apparatus of the present invention.

The low-level features related to the sensory structure are, for example, for the moving images contained in the sample video: shot length, shot type (tight shot, middle shot, long shot, and so on), motion, luminance, and time position, that is, position in the time series.

For the sound contained in the sample video, they are volume, accent, and type (music, speech, sound effect, and so on). For the editing effects added to the sample video, they are type, position of appearance in the time series, duration, and so on.
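The low-level features listed above could be held in simple records such as the following. This is only a sketch: the class names, field names, and units are illustrative assumptions, since the text names the features but does not specify a data format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types; names and units are assumptions for illustration.

@dataclass
class ShotFeatures:
    start_s: float     # time position of the shot in the time series (seconds)
    length_s: float    # shot length (seconds)
    shot_type: str     # e.g. "tight", "middle", "long"
    motion: float      # amount of motion, arbitrary units
    luminance: float   # mean luminance, arbitrary units

@dataclass
class SoundFeatures:
    start_s: float
    length_s: float
    kind: str          # "music", "speech", or "effect"
    volume: float
    accent: float

@dataclass
class EffectFeatures:
    kind: str          # e.g. "cut", "dissolve", "fade-in", "fade-out"
    start_s: float     # position of appearance in the time series
    duration_s: float

@dataclass
class SampleAnalysis:
    shots: List[ShotFeatures] = field(default_factory=list)
    sounds: List[SoundFeatures] = field(default_factory=list)
    effects: List[EffectFeatures] = field(default_factory=list)

    def shot_boundaries(self):
        """End times of all shots except the last, i.e. the internal cut points."""
        ordered = sorted(self.shots, key=lambda s: s.start_s)
        return [s.start_s + s.length_s for s in ordered[:-1]]
```

Keeping one record per segment of each element medium makes it straightforward to compare neighbouring segments of the same medium, or segments of different media that overlap in time.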

FIG. 2 is a diagram schematically showing the sensory structure of the sample video learned by the learning unit 2.

The learning unit 2 takes the shot as the basic unit of the moving images and learns the time-series pattern of shots as the temporal structure of the shots, that is, their time-series configuration. For example, it learns information indicating the relationship between the signal features of adjacent shots as the time-series configuration.

As a result, it is learned as the time-series configuration of shots that, for example, shots of similar length often occur in succession and that the motion in the video is fast in the second half of the sample video.
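The two example regularities just mentioned can be measured with very simple statistics. The sketch below is one possible way to do so, not the method of the embodiment; the function names and the `min_ratio` threshold are assumptions.

```python
def adjacent_length_ratios(lengths):
    """Shorter/longer ratio for each pair of neighbouring shots (1.0 = equal).

    lengths: shot lengths in seconds, in time order; all must be positive.
    """
    return [min(a, b) / max(a, b) for a, b in zip(lengths, lengths[1:])]

def fraction_similar(lengths, min_ratio=0.7):
    """Fraction of neighbouring shot pairs whose lengths are 'similar'.

    min_ratio is an assumed threshold for what counts as similar.
    """
    ratios = adjacent_length_ratios(lengths)
    if not ratios:
        return 0.0
    return sum(r >= min_ratio for r in ratios) / len(ratios)

def mean_motion_by_half(motions):
    """Mean motion of the shots in the first and second half of the video."""
    half = len(motions) // 2
    first, second = motions[:half], motions[half:]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(first), mean(second)

if __name__ == "__main__":
    lengths = [2.0, 2.0, 4.0, 1.0]
    print(adjacent_length_ratios(lengths))
    print(fraction_similar(lengths))
    print(mean_motion_by_half([1.0, 3.0, 7.0, 9.0]))
```

A high `fraction_similar` value over many sample videos would support the first regularity, and a second-half mean that is consistently larger than the first-half mean would support the second.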

The editing unit 3 uses this time-series configuration of shots to generate, for example, information indicating candidates for the time position on the user-edited video at which a shot selected by the user should be placed.

Similarly, the temporal structure of the sound stream, that is, its time-series configuration, is learned. The editing unit 3 uses the time-series configuration of the sound stream in the sample video to select, for example, candidates for the additional sound to be added to a shot that the user has selected for placement in the user-edited video.

The learning unit 2 also divides the sound stream contained in the sample video into segments corresponding to the frame or shot units of the moving images. It then learns (1) the relationship between boundaries: for example, information indicating how the boundaries are related, such as whether the time positions of the moving-image boundaries and the sound boundaries coincide; and (2) the relationship between segments: for example, information such as what kind of sound is added to shots with what kind of features.

すなわち、学習部２は、上記２点について、動画と音との空間構造、言い換えると動画と音との同期構成を学習する。 That is, the learning unit 2 learns the spatial structure of the moving image and the sound, in other words, the synchronized configuration of the moving image and the sound, for the two points.

編集部３において、上記（１）は、例えば、ユーザ編集映像において、ユーザが付加音を配置した時間位置に、既にショットが配置されていた場合、ショットと音の変化点（セグメント境界）を自動的に同期付けるために用いられる。上記（２）は、例えば、ユーザ編集映像に配置済みのショットに対応して配置すべき付加音の候補を選択するために用いられる。 In the editing unit 3, (1) above is used, for example, to automatically synchronize the change points (segment boundaries) of a shot and a sound when, in the user-edited video, a shot has already been placed at the time position where the user places an additional sound. (2) above is used, for example, to select candidates for additional sounds to be placed in correspondence with shots already placed in the user-edited video.
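
The automatic synchronization mentioned for (1) could, for instance, snap the start of an additional sound to the nearest shot boundary. The sketch below assumes this nearest-boundary rule; `snap_to_boundary` and `max_shift` are hypothetical names, and the patent does not state how the synchronization is actually performed.

```python
def snap_to_boundary(sound_time, shot_bounds, max_shift=0.5):
    """Move `sound_time` to the nearest shot boundary if it is close enough.

    Times are in seconds. If no boundary lies within `max_shift`
    (an assumed limit), the user's original position is kept.
    """
    if not shot_bounds:
        return sound_time
    nearest = min(shot_bounds, key=lambda b: abs(b - sound_time))
    return nearest if abs(nearest - sound_time) <= max_shift else sound_time

snapped = snap_to_boundary(4.2, [0.0, 4.0, 9.5])   # close to the cut at 4.0
kept = snap_to_boundary(6.5, [0.0, 4.0, 9.5])      # no cut nearby
```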

なお、同期構成とは、時系列上のある時点または期間においてどのような要素メディアが組み合わせられているかについての情報である。つまり、時系列上の同一空間内に存在する要素メディアの組み合わせについての情報である。 The synchronous configuration is information about what element media are combined at a certain point in time or period. That is, it is information about a combination of element media existing in the same space in time series.

学習部２は、例えば、動画と音との同期構成として、ショット内の映像の動きの速さと、当該ショットに付加されている音楽のビートの速さとを抽出し、動きの激しいショットには、ビートの速いＢＧＭが付加されていることなどを学習する。 For example, as the synchronous configuration of the moving image and the sound, the learning unit 2 extracts the speed of motion in a shot and the beat speed of the music added to that shot, and learns, for instance, that BGM with a fast beat is added to shots with intense motion.

学習部２は、サンプル映像から学習したこれら動画等の要素メディアの時系列構成および同期構成を編集情報として編集部３に出力する。 The learning unit 2 outputs the time-series configuration and the synchronous configuration of the elemental media such as moving images learned from the sample video to the editing unit 3 as editing information.

編集部３は、ユーザから、ユーザが編集したい素材映像の入力を受け付ける。さらに、ユーザ編集映像に挿入するショット、ユーザ編集映像に付加する要素メディアである付加要素、および、ユーザ編集映像におけるショット等の時間位置等の選択の指示を受け付ける。つまり、ユーザが行う各種の編集行動を受け付ける。 The editing unit 3 receives an input of a material video that the user wants to edit from the user. Further, it receives an instruction to select a shot to be inserted into the user-edited video, an additional element that is an element medium to be added to the user-edited video, and a time position of the shot in the user-edited video. That is, various editing actions performed by the user are accepted.

なお、編集部３は、映像編集の際にユーザ編集映像に付加する付加要素として、複数種の音および編集エフェクトのデータを有している。 The editing unit 3 has a plurality of types of sound and editing effect data as additional elements to be added to the user edited video during video editing.

編集部３は、学習部２から得られるサンプル映像の編集情報と、ユーザの編集行動とに応じ、インタフェース上にてユーザの次の編集行動を誘導することができる。 The editing unit 3 can guide the user's next editing action on the interface according to the editing information of the sample video obtained from the learning unit 2 and the user's editing action.

例えば、ユーザが、ユーザ編集映像に配置するショットとして、あるショットが選択された場合、編集情報に従って、そのショットを配置すべきユーザ編集映像上の時間位置の候補である候補位置をユーザに提示することができる。また、そのショットに最適な付加音、例えばＢＧＭの候補をユーザに提示することができる。 For example, when the user selects a shot to be placed in the user-edited video, the editing unit 3 can present to the user, in accordance with the editing information, candidate positions, that is, candidates for the time position in the user-edited video at which the shot should be placed. It can also present to the user candidates for the additional sound best suited to the shot, for example BGM.

図３は、映像編集支援装置１の機能的な構成を示す機能ブロック図である。 FIG. 3 is a functional block diagram showing a functional configuration of the video editing support apparatus 1.

上述のように、映像編集支援装置１は、学習部２と編集部３とを備えており、さらに、出力部４を備えている。 As described above, the video editing support apparatus 1 includes the learning unit 2 and the editing unit 3, and further includes the output unit 4.

学習部２は、サンプル受付部２０と、サンプル特徴抽出部２１と、サンプル解析部２２と、モデル生成部２３とを有している。サンプル受付部２０は、映画予告映像等のサンプル映像の入力を受け付ける構成部である。 The learning unit 2 includes a sample reception unit 20, a sample feature extraction unit 21, a sample analysis unit 22, and a model generation unit 23. The sample receiving unit 20 is a component that receives an input of a sample video such as a movie preview video.

サンプル特徴抽出部２１は、サンプル映像に含まれる要素メディアである、動画および音のそれぞれの信号特徴を抽出する構成部である。 The sample feature extraction unit 21 is a component that extracts signal features of moving images and sounds, which are element media included in the sample video.

サンプル解析部２２は、サンプル特徴抽出部２１により抽出された信号特徴を解析することで、サンプル映像に含まれる動画および音のそれぞれの時系列構成を示す情報と、動画と音との同期構成を示す情報とを得る構成部である。 The sample analysis unit 22 is a component that analyzes the signal features extracted by the sample feature extraction unit 21 to obtain information indicating the time-series configuration of each of the moving image and the sound included in the sample video, and information indicating the synchronous configuration of the moving image and the sound.

モデル生成部２３は、サンプル解析部２２により得られた時系列構成を示す情報と、同期構成を示す情報とから、サンプル映像に含まれる要素メディアの時間・空間モデル（以下、「編集モデル」という。）を生成し、編集部３に出力する構成部である。つまり、編集情報である編集モデルが編集部３に出力される。 The model generation unit 23 is a component that generates a time/space model of the element media included in the sample video (hereinafter referred to as the "editing model") from the information indicating the time-series configuration and the information indicating the synchronous configuration obtained by the sample analysis unit 22, and outputs it to the editing unit 3. That is, the editing model, which is the editing information, is output to the editing unit 3.

編集部３は、素材受付部１０と、素材特徴抽出部１１と、ユーザ入力受付部１２と、候補生成部１３と、エフェクトデータ記憶部１４と、付加音記憶部１５、編集実行部１６とを有している。 The editing unit 3 includes a material reception unit 10, a material feature extraction unit 11, a user input reception unit 12, a candidate generation unit 13, an effect data storage unit 14, an additional sound storage unit 15, and an editing execution unit 16.

素材受付部１０は、ユーザから入力される素材映像を受け付ける構成部である。素材特徴抽出部１１は、素材映像に含まれる要素メディアである動画および音のそれぞれの信号特徴を抽出する構成部である。 The material reception unit 10 is a configuration unit that receives a material video input from a user. The material feature extraction unit 11 is a configuration unit that extracts signal features of moving images and sounds that are elemental media included in the material video.

また、素材受付部１０は、ユーザから入力される付加音、および、ユーザ編集映像に編集エフェクトを付加するためのデータであるエフェクトデータも受け付ける。これら付加音およびエフェクトデータは、後述する付加音記憶部１５およびエフェクトデータ記憶部１４にそれぞれ記憶される。 The material receiving unit 10 also receives additional sound input from the user and effect data that is data for adding an editing effect to the user edited video. These additional sound and effect data are respectively stored in an additional sound storage unit 15 and an effect data storage unit 14 which will be described later.

ユーザ入力受付部１２は、キーボード等の入力装置４０から入力される、ショットの選択等の指示の入力を受け付ける構成部である。 The user input receiving unit 12 is a component that receives an input of an instruction such as a shot selection input from the input device 40 such as a keyboard.

候補生成部１３は、学習部２により学習された編集情報に従って、素材映像に含まれる要素メディアのユーザ編集映像における並び方の候補を示す情報を生成する構成部である。 The candidate generation unit 13 is a configuration unit that generates information indicating candidates for arrangement in the user edited video of the elemental media included in the material video according to the editing information learned by the learning unit 2.

具体的には、候補生成部１３は、ショット、付加音、編集エフェクト等の要素メディアを配置すべきユーザ編集映像上の時間位置の候補、または、ユーザ編集映像上のある時間位置に配置すべきショット等の要素メディアの候補を、ユーザ編集映像における要素メディアの並び方の候補として生成する。つまり、ユーザ編集映像における要素メディアの並び方の候補を示す情報を生成する。 Specifically, the candidate generation unit 13 generates, as candidates for the arrangement of element media in the user-edited video, either candidates for the time position in the user-edited video at which element media such as shots, additional sounds, and editing effects should be placed, or candidates for element media such as shots that should be placed at a given time position in the user-edited video. That is, it generates information indicating candidates for the arrangement of element media in the user-edited video.

エフェクトデータ記憶部１４は、エフェクトデータを記憶する記憶装置である。エフェクトデータ記憶部１４には複数種のエフェクトデータが記憶されている。 The effect data storage unit 14 is a storage device that stores effect data. The effect data storage unit 14 stores a plurality of types of effect data.

エフェクトデータ記憶部１４には、具体的には、エフェクトデータとしてユーザ編集映像に付加するための複数種の視覚効果データを記憶しており、さらに、それぞれの視覚効果の特徴を示す情報を記憶している。視覚効果の特徴を示す情報とは、具体的には、視覚効果データの信号特徴から導かれた特徴量ベクトルである。この特徴量ベクトルにより、視覚効果の種類や持続期間等が示される。なお、編集エフェクトは本発明の映像編集支援装置における視覚効果の一例である。 Specifically, the effect data storage unit 14 stores, as effect data, a plurality of types of visual effect data to be added to the user-edited video, and further stores information indicating the features of each visual effect. The information indicating the features of a visual effect is, specifically, a feature vector derived from the signal features of the visual effect data. This feature vector indicates the type, duration, and the like of the visual effect. Note that the editing effect is an example of the visual effect in the video editing support apparatus of the present invention.

付加音記憶部１５は、ユーザ編集映像に付加するための複数種の付加音と、それぞれの特徴を示す情報とを記憶する記憶装置である。付加音の特徴を示す情報とは、具体的には、付加音の信号特徴から導かれた特徴量ベクトルである。この特徴量ベクトルにより、付加音の音量やビートの早さ等が示される。また、付加音記憶部１５は複数種の付加音として音楽や効果音等を記憶している。 The additional sound storage unit 15 is a storage device that stores a plurality of types of additional sounds to be added to the user-edited video and information indicating the respective characteristics. Specifically, the information indicating the characteristic of the additional sound is a feature amount vector derived from the signal characteristic of the additional sound. This feature vector indicates the volume of the additional sound, the speed of the beat, and the like. The additional sound storage unit 15 stores music, sound effects, and the like as a plurality of types of additional sounds.

候補生成部１３は、これらエフェクトデータ記憶部１４および付加音記憶部１５に記憶されている複数種のエフェクトデータ、および、複数種の付加音の中から、ユーザ編集映像に配置すべき、つまり、ユーザ編集映像に付加すべきエフェクトデータおよび付加音それぞれの候補を選択する。 From among the plurality of types of effect data and additional sounds stored in the effect data storage unit 14 and the additional sound storage unit 15, the candidate generation unit 13 selects candidates for the effect data and the additional sound that should be placed in, that is, added to, the user-edited video.

編集実行部１６は、ショット単位の動画等の要素メディアをユーザの指示に従って並べ、つなぎ合わせることで、ユーザ編集映像を得る構成部である。 The editing execution unit 16 is a configuration unit that obtains a user-edited video by arranging and connecting elemental media such as moving images in shot units in accordance with a user instruction.

出力部４は、ユーザと映像編集支援装置１とのインタフェースである表示装置３０に、編集画面やユーザ編集映像等を出力する処理部である。なお、出力部４からは、図示しないスピーカに音が出力されている。また、出力部４が出力し表示装置３０に表示される編集画面の内容については、図６を用いて後述する。 The output unit 4 is a processing unit that outputs the editing screen, the user-edited video, and the like to the display device 30, which is the interface between the user and the video editing support apparatus 1. Note that the output unit 4 also outputs sound to a speaker (not shown). The contents of the editing screen output by the output unit 4 and displayed on the display device 30 will be described later with reference to FIG. 6.

次に、図４〜図２０を用いて、本実施の形態の映像編集支援装置１の動作を説明する。 Next, the operation of the video editing support apparatus 1 according to the present embodiment will be described with reference to FIGS. 4 to 20.

図４は、映像編集支援装置１の編集支援に係る動作の流れの概要を示すフロー図である。図４を用いて、映像編集支援装置１の編集支援に係る動作の概要を説明する。 FIG. 4 is a flowchart showing an outline of the operation flow related to editing support of the video editing support apparatus 1. An outline of the operations related to editing support of the video editing support apparatus 1 will be described with reference to FIG. 4.

映像編集支援装置１に入力されたサンプル映像はサンプル受付部２０によって受け付けられる（Ｓ１）。受け付けられたサンプル映像は、サンプル特徴抽出部２１と、サンプル解析部２２とにより、映像解析される（Ｓ２）。映像解析の動作内容については、図５を用いて後述する。モデル生成部２３により、サンプル映像から得られる編集情報である、サンプル映像の編集モデルが生成される（Ｓ３）。 The sample video input to the video editing support apparatus 1 is received by the sample receiving unit 20 (S1). The received sample video undergoes video analysis by the sample feature extraction unit 21 and the sample analysis unit 22 (S2). The details of the video analysis will be described later with reference to FIG. 5. The model generation unit 23 then generates an editing model of the sample video, which is the editing information obtained from the sample video (S3).

また、映像編集支援装置１にユーザによって入力された素材映像は素材受付部１０によって受け付けられる（Ｓ４）。受け付けられた素材映像は、素材特徴抽出部１１により、映像解析される（Ｓ５）。 The material video input by the user to the video editing support device 1 is received by the material receiving unit 10 (S4). The accepted material image is analyzed by the material feature extraction unit 11 (S5).

その後、モデル生成部２３により生成された編集モデルに従い、候補生成部１３等により、素材映像から効率的にユーザ編集映像を得るための編集支援が行われる（Ｓ６）。編集支援の動作内容については、図７を用いて後述する。ユーザは、編集支援を受けつつユーザ編集映像を制作し、ユーザ編集映像は出力部４から出力される（Ｓ７）。 Thereafter, in accordance with the editing model generated by the model generation unit 23, the candidate generation unit 13 and other units provide editing support for efficiently obtaining a user-edited video from the material video (S6). The details of the editing support operation will be described later with reference to FIG. 7. The user produces the user-edited video while receiving editing support, and the user-edited video is output from the output unit 4 (S7).

図５は、映像編集支援装置１の映像解析（図４のＳ２およびＳ５）に係る動作の流れの概要を示すフロー図である。図５を用いて、映像編集支援装置１の映像解析に係る動作の概要を説明する。また、映像編集支援装置１が行う映像解析についての詳細な説明は、図１４〜図１８を用いて後述する。 FIG. 5 is a flowchart showing an outline of the operation flow related to video analysis (S2 and S5 in FIG. 4) of the video editing support apparatus 1. An outline of the operations related to video analysis of the video editing support apparatus 1 will be described with reference to FIG. 5. A detailed description of the video analysis performed by the video editing support apparatus 1 will be given later with reference to FIGS. 14 to 18.

なお、サンプル映像に対する映像解析（図４のＳ２）、および、素材映像に対する映像解析（図４のＳ５）は、主要な動作内容としては同じであるため、サンプル映像に対する映像解析を例にとって説明する。 Note that the video analysis of the sample video (S2 in FIG. 4) and the video analysis of the material video (S5 in FIG. 4) are the same in their main operations, so the video analysis of the sample video is described as an example.

サンプル映像が入力されると、サンプル特徴抽出部２１は、サンプル映像に含まれる動画中のカットを検出する（Ｓ１０）。これにより、サンプル映像に含まれる動画をショット列に分割することができる（Ｓ１１）。これは、カットとカットとの間が１つのショットを構成するからである。 When the sample video is input, the sample feature extraction unit 21 detects a cut in the moving image included in the sample video (S10). Thereby, the moving image included in the sample video can be divided into shot sequences (S11). This is because one shot is formed between the cuts.
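
The step from detected cuts (S10) to a shot sequence (S11) can be sketched as follows. The sketch assumes that each cut is given as the index of the first frame of the next shot; the function name `split_into_shots` and this frame-index convention are illustrative assumptions, and cut detection itself is not shown.

```python
def split_into_shots(cut_frames, total_frames):
    """Return (start, end) frame ranges, end exclusive, delimited by cuts.

    `cut_frames` holds frame indices at which a new shot starts
    (assumed convention); cuts outside the video are ignored.
    """
    inner = sorted(c for c in set(cut_frames) if 0 < c < total_frames)
    bounds = [0] + inner + [total_frames]
    return list(zip(bounds, bounds[1:]))

# A hypothetical 450-frame video with cuts at frames 120 and 300.
shots = split_into_shots([120, 300], 450)
```

Each resulting range can then be divided into its individual frames, as in S12.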

また、動画を分割することにより得られた各ショットは、フレーム列に分割される（Ｓ１２）。フレームとは、ショットを構成する１つの単位である。 Each shot obtained by dividing the moving image is divided into frame sequences (S12). A frame is a unit constituting a shot.

サンプル特徴抽出部２１は、各ショットおよびそれらショットを構成する各フレームから、それぞれのショットの信号特徴を示す特徴量を抽出する（Ｓ１３）。 The sample feature extraction unit 21 extracts a feature amount indicating a signal feature of each shot from each shot and each frame constituting the shot (S13).

さらに、それら特徴量を多次元のベクトルで表した特徴量ベクトルを算出する（Ｓ１４）。例えば、特徴量ベクトルとして、ショット長、ショット内の映像の動き、ショットの平均輝度、ショットに付加された音の音量の４次元の特徴量ベクトルが算出される。 Further, a feature quantity vector representing these feature quantities as a multidimensional vector is calculated (S14). For example, as the feature amount vector, a four-dimensional feature amount vector of the shot length, the motion of the video in the shot, the average brightness of the shot, and the volume of the sound added to the shot is calculated.

サンプル解析部２２は、算出された各特徴量ベクトルを解析する。具体的には、各特徴量ベクトルを正規化することで、特徴量のばらつきによる影響を除去する。さらに、所定の類似度に基づいてクラスタリングする（Ｓ１５）。つまり、各ショットは、特徴の類似度に基づいてグループ分けされる。 The sample analysis unit 22 analyzes each calculated feature vector. Specifically, by normalizing each feature quantity vector, the influence due to feature quantity variation is removed. Further, clustering is performed based on a predetermined similarity (S15). That is, the shots are grouped based on the feature similarity.
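
A minimal sketch of S14 and S15 is given below, using four-dimensional vectors of (shot length, motion, average luminance, volume). The patent only says that the vectors are normalized and clustered by a predetermined similarity; the z-score normalization, the Euclidean distance, the greedy threshold rule, and all names and values here are assumptions chosen for illustration.

```python
import math

def normalize(vectors):
    """Z-score each dimension so that no single feature dominates the distance."""
    dims = list(zip(*vectors))
    means = [sum(d) / len(d) for d in dims]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in d) / len(d)) or 1.0
            for d, m in zip(dims, means)]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]

def cluster(vectors, threshold):
    """Greedy grouping: a vector joins the first group whose representative
    (the group's first member) lies within `threshold`, else starts a new group."""
    reps, labels = [], []
    for v in vectors:
        for i, r in enumerate(reps):
            if math.dist(v, r) <= threshold:
                labels.append(i)
                break
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels

# Hypothetical shots: (length in s, motion, average luminance, volume).
shots = [[2.0, 0.9, 0.8, 0.7], [2.2, 0.8, 0.9, 0.6],
         [8.0, 0.1, 0.3, 0.2], [7.5, 0.2, 0.2, 0.3]]
labels = cluster(normalize(shots), threshold=1.5)
```

With these values the two short, fast shots fall into one group and the two long, slow shots into another.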

また、各ショットの特徴量ベクトルには、音についての成分も含まれている。そのため、サンプル映像において、どのような音がどのように時系列上に並んでいるかをも解析可能である。なお、サンプル特徴抽出部２１およびサンプル解析部２２は、ショットと同様に、サンプル映像に含まれる音についても、音量またはビートの速さ等を示す信号特徴から、音のみについての１次元以上の特徴量ベクトルの算出を、別途行ってもよい。 In addition, the feature vector of each shot also includes components relating to sound. It is therefore also possible to analyze what sounds are arranged in what order along the time series in the sample video. Note that, as with shots, the sample feature extraction unit 21 and the sample analysis unit 22 may separately calculate, for the sound included in the sample video, a feature vector of one or more dimensions for the sound alone from signal features indicating the volume, the beat speed, and the like.

また、各ショットの特徴量ベクトルから、サンプル映像に付加されている編集エフェクトの種類等の特徴を抽出することが可能である。つまり、どのような編集エフェクトがどのようなショットに付加されているかについての解析も可能である。なお、編集エフェクトの特徴を明示する成分を、ショットの特徴量ベクトルの成分として含ませてもよい。さらに、編集エフェクトの種類、並びに、開始および終了時間位置等を示す信号特徴から、編集エフェクトのみについての１次元以上の特徴量ベクトルの算出を、別途行ってもよい。 Further, it is possible to extract features such as the type of editing effect added to the sample video from the feature quantity vector of each shot. That is, it is possible to analyze what editing effect is added to what shot. It should be noted that a component that clearly indicates the feature of the editing effect may be included as a component of the shot feature vector. Furthermore, a one-dimensional or more feature quantity vector for only the editing effect may be separately calculated from the signal characteristics indicating the type of editing effect and the start and end time positions.

このようにして、サンプル解析部２２は、サンプル映像において、どのような特徴を持ったショットおよび音が時系列上でどのように並んでいるか、また、どのような特徴を持ったショットが、どのような特徴の音および編集エフェクトと結び付けられているかについての情報を得ることができる。 In this way, the sample analysis unit 22 can obtain information on how shots and sounds with what features are arranged along the time series in the sample video, and on what kinds of sounds and editing effects are associated with shots having what features.

つまり、サンプル映像に含まれる動画等の複数の要素メディアそれぞれの時系列構成を示す情報と、異種の要素メディア間の同期構成を示す情報とを得ることができる。 That is, it is possible to obtain information indicating a time-series configuration of each of a plurality of elementary media such as a moving image included in the sample video and information indicating a synchronous configuration between different types of elementary media.

なお、以上の動作は、素材特徴抽出部１１においても素材映像を対象として行われ、素材映像に含まれる各ショットも、特徴の類似度で複数のグループに分けられる。 Note that the above operations are also performed by the material feature extraction unit 11 on the material video, so that each shot included in the material video is likewise divided into a plurality of groups according to feature similarity.

以降、各ショットの特徴量等を表すシンボル列が生成され（Ｓ１６）、モデル生成部２３に出力される。 Thereafter, a symbol string representing the feature amount and the like of each shot is generated (S16) and output to the model generation unit 23.
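
One simple way to picture the symbol string of S16 is to emit one letter per shot according to its group. The patent does not define the symbol alphabet or the exact content of the string, so the mapping below and the name `to_symbol_string` are hypothetical.

```python
import string

def to_symbol_string(labels):
    """Map each shot's group label (0, 1, 2, ...) to a letter (A, B, C, ...).

    This illustrative mapping supports at most 26 groups.
    """
    return "".join(string.ascii_uppercase[label] for label in labels)

# Hypothetical group labels of five consecutive shots.
symbols = to_symbol_string([0, 0, 1, 2, 1])
```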

モデル生成部２３は、１つのサンプル映像から得られた各ショットの特徴量およびそれら類似度等の情報をまとめることで、１つのサンプル映像についての編集情報である編集モデルを生成する。生成された編集モデルは候補生成部１３に出力される。 The model generation unit 23 generates an editing model that is editing information for one sample video by collecting information such as the feature amounts of each shot and their similarity obtained from one sample video. The generated editing model is output to the candidate generation unit 13.

図６は、図５に示す動作の流れの結果、表示装置３０に表示される編集画面の一例を示す図である。 FIG. 6 is a diagram showing an example of an editing screen displayed on the display device 30 as a result of the operation flow shown in FIG.

図６に示す編集画面３１は、ユーザ編集映像の時間・空間構造の可視化を行い、かつ、編集モデルに従ってユーザの編集行動を誘導する画面である。 The editing screen 31 shown in FIG. 6 is a screen for visualizing the time / space structure of the user edited video and guiding the user's editing behavior according to the editing model.

図６に示すように、編集画面３１には、時間入力欄３２と、素材映像タイムライン欄３３と、ショット選択欄３４と、付加要素選択欄３５と、再生表示欄３６と、ストーリーボード３７と、構造表示ボード３８とが表示される。 As shown in FIG. 6, the editing screen 31 includes a time input field 32, a material video timeline field 33, a shot selection field 34, an additional element selection field 35, a playback display field 36, and a story board 37. The structure display board 38 is displayed.

時間入力欄３２は、ユーザ編集映像の完成時の長さを示す時間の入力を受け付ける欄である。例えば、ユーザが、１時間の素材映像を編集し、５分のユーザ編集映像を制作することを所望した場合、時間入力欄３２には、ユーザにより“５分”と入力される。 The time input field 32 is a field for accepting an input of a time indicating the length when the user edited video is completed. For example, when the user desires to edit a one-hour material video and produce a five-minute user-edited video, “5 minutes” is input to the time input field 32 by the user.

素材映像タイムライン欄３３は、素材映像のショットをタイムライン表示する欄である。なお、素材映像タイムライン欄３３内の各矩形は、１つのショットを表現しており、例えば、色でショットの種類が表される。 The material video timeline column 33 is a column for displaying a shot of the material video in a timeline. Each rectangle in the material video timeline column 33 represents one shot. For example, the type of shot is represented by color.

図６において、各矩形に付されたドット等の模様は、画面上では色で表現されている。また、ショットの種類とは、言い換えるとショットの特徴である。 In FIG. 6, a pattern such as a dot attached to each rectangle is represented by a color on the screen. The type of shot is, in other words, a characteristic of the shot.

つまり、同じ模様が付された矩形は、互いに特徴が類似するショットを表している。上述のように、素材特徴抽出部１１により、各ショットは、特徴の類似度によりグループ分けされる。これにより、ショットを表す矩形に、グループごとに異なる色を付して表示することができる。 That is, rectangles with the same pattern represent shots having similar characteristics. As described above, the material feature extraction unit 11 groups the shots according to the feature similarity. As a result, the rectangle representing the shot can be displayed with a different color for each group.

例えば、図中のドットは画面上では赤であることを表し、赤が付された矩形は、ショット内の映像の動きが早く、輝度の高いショットのグループに属している。また、その他の模様についても、赤以外の色が付されていることを表し、所定の特徴と関連付けられたグループに属していることを表している。 For example, the dots in the figure represent red on the screen, and a rectangle colored red belongs to the group of shots with fast motion and high luminance. The other patterns likewise represent colors other than red and indicate that the shot belongs to a group associated with a predetermined feature.

そのためユーザは、素材映像において、どのような特徴のショットがどのように並んでいるかを視覚的に捉えることができる。 Therefore, the user can visually grasp how and what type of shots are arranged in the material video.

なお、矩形に付された模様の表現内容は、下記のショット選択欄３４、ストーリーボード３７、構造表示ボード３８においても同じである。また、ショット同様、付加音を表現する矩形にも、付加音の種類に応じて模様が付されている。つまり、画面上では一部または全部に色が付されている。 Note that the meaning of the patterns on the rectangles is the same in the shot selection column 34, the storyboard 37, and the structure display board 38 described below. As with shots, rectangles representing additional sounds are also given patterns according to the type of additional sound. That is, on the screen, part or all of each rectangle is colored.

さらに、これら模様の表現内容は、後述する図８〜図１２においても同じである。以下、これらの編集画面３１を表す図の説明において、ショットおよび付加音を表す矩形を、それぞれ、単にショットおよび付加音と記載する。 Further, the meaning of these patterns is the same in FIGS. 8 to 12 described later. Hereinafter, in the description of the figures showing the editing screen 31, the rectangles representing shots and additional sounds are referred to simply as shots and additional sounds, respectively.

ショット選択欄３４は、ユーザ編集映像に配置するショットをユーザに選択させるための欄である。各ショットはグループごとにまとめて表示されている。ユーザは、ユーザ編集映像に含めたいショットを選択し、下記のストーリーボード３７にドラッグすることにより、そのショットをユーザ編集映像に配置することができる。 The shot selection column 34 is a column for allowing the user to select a shot to be arranged in the user edited video. Each shot is displayed in groups. The user can select a shot to be included in the user-edited video and drag it onto the storyboard 37 described below to place the shot on the user-edited video.

なお、図６および後述する図８〜図１２において、本発明の特徴を明確にするために、ショット選択欄３４内の各矩形は、付与された色のみを表示しているように図示しているが、具体的には、矩形の周縁部等の一部をのぞき、それぞれのショットの最初のフレームの画像等が表示される。これにより、ユーザは、各矩形に対応するショットの内容を認識できる。このことは、後述するストーリーボード３７においても同じである。 In FIG. 6 and FIGS. 8 to 12 to be described later, in order to clarify the features of the present invention, each rectangle in the shot selection field 34 is illustrated as displaying only a given color. Specifically, however, an image of the first frame of each shot is displayed except for a part of the peripheral edge of the rectangle. Thereby, the user can recognize the content of the shot corresponding to each rectangle. The same applies to the story board 37 described later.

また、１つのグループにおいて、複数のショットは重ねられて表示されている。この状態で、いずれかのショットがユーザにクリックされると、そのグループに属する全てのショットが表示される。 In one group, a plurality of shots are displayed in an overlapping manner. In this state, when any shot is clicked by the user, all shots belonging to the group are displayed.

例えば、各ショットは、少なくとも内容をユーザに認識させうる程度まで各ショットの重なり部分を減らすように縦方向または横方向に移動する。 For example, each shot moves in the vertical direction or the horizontal direction so as to reduce the overlapping portion of the shots at least to the extent that the user can recognize the contents.

または、最前面のショット以外のショットは、表示されている部分がユーザにクリックされると、グループの最前面に移動する。 Alternatively, shots other than the foremost shot move to the forefront of the group when the displayed portion is clicked by the user.

つまり、いずれのグループに属するいずれのショットであっても、ユーザは各ショットの内容を認識することができる。 That is, the user can recognize the contents of each shot regardless of which shot belongs to which group.

付加要素選択欄３５は、ユーザ編集映像に付加する要素メディアである付加要素をユーザに選択させるための欄である。このような要素メディアとして、各種の編集エフェクトおよび付加音が用意される。 The additional element selection column 35 is a column for allowing the user to select an additional element that is an element medium to be added to the user edited video. Various editing effects and additional sounds are prepared as such elemental media.

ユーザは、ユーザ編集映像に付加したい付加要素を選択し、下記のストーリーボード３７にドラッグすることにより、その付加要素をユーザ編集映像に付加することができる。 The user can add an additional element to the user-edited video by selecting an additional element to be added to the user-edited video and dragging it to the storyboard 37 described below.

再生表示欄３６は、各ショット、並びに、編集中および編集完了後のユーザ編集映像を再生し、それぞれの内容をユーザに確認させるための欄である。 The reproduction display column 36 is a column for reproducing each shot and a user-edited video during editing and after completion of editing, and allowing the user to confirm the contents of each.

ストーリーボード３７は、ユーザがショットまたは付加音を選択した場合、その配置位置を中心に、ショットおよび付加音の時系列構成および同期構成をユーザに確認させ、かつ、ショットおよび付加音の移動を行わせるための欄である。 The storyboard 37 is a column that, when the user selects a shot or an additional sound, lets the user check the time-series configuration and the synchronous configuration of shots and additional sounds centered on the placement position, and lets the user move shots and additional sounds.

図に示すように、ストーリーボード３７は、映像ストーリー欄３７ａと、音ストーリー欄３７ｂとから構成されている。映像ストーリー欄３７ａには配置されたショットが、音ストーリー欄３７ｂには配置された付加音が表示される。 As shown in the figure, the storyboard 37 is composed of a video story column 37a and a sound story column 37b. The video story column 37a displays the placed shots, and the sound story column 37b displays the placed additional sounds.

これにより、ユーザは、ショットおよび付加音の時系列上の並び方だけではなく、どのショットと付加音との同期構成も確認することができる。 This allows the user to check not only how the shots and additional sounds are arranged along the time series, but also the synchronous configuration, that is, which shot is combined with which additional sound.

構造表示ボード３８は、ユーザ編集映像の全体構造をユーザに確認させるための欄である。図に示すように、構造表示ボード３８は、映像構造欄３８ａと、音構造欄３８ｂとから構成されている。映像構造欄３８ａには配置されたショットが、音構造欄３８ｂには配置された付加音が表示される。 The structure display board 38 is a column for letting the user check the overall structure of the user-edited video. As shown in the figure, the structure display board 38 is composed of a video structure column 38a and a sound structure column 38b. The video structure column 38a displays the placed shots, and the sound structure column 38b displays the placed additional sounds.

なお、映像構造欄３８ａおよび音構造欄３８ｂにおいて、同じ種類の要素メディアの縦軸方向の位置により、各要素メディア間の関係を可視化することもできる。 In the video structure field 38a and the sound structure field 38b, the relationship between the element media can be visualized by the position of the same type of element media in the vertical axis direction.

例えば、図に示す各ショットおよび各付加音の縦軸方向の位置は、ショットおよび付加音の属するグループにより決められている。つまり、同じ色が付されたショットは同じ高さで表示される。また、同じ色が付された付加音は同じ高さで表示される。 For example, the position of each shot and each additional sound shown in the figure in the vertical axis direction is determined by the group to which the shot and the additional sound belong. That is, shots with the same color are displayed at the same height. In addition, additional sounds with the same color are displayed at the same height.

ユーザは、構造表示ボード３８を参照することで、編集中および編集完了後のユーザ編集映像の全体構造を確認することができる。 By referring to the structure display board 38, the user can confirm the entire structure of the user edited video during editing and after editing is completed.

なお、本実施の形態においては、ショット等の要素メディアを色を付した矩形で表すことで、要素メディアの並び方を可視化している。しかしながら、他の手法で要素メディアの並び方を可視化してもよい。 In this embodiment, element media such as shots are represented by colored rectangles to visualize how element media are arranged. However, the arrangement of element media may be visualized by other methods.

例えば、信号特徴に応じてグループ分けされたショットを、グループごとに異なる形状の画像で表すことにより、ユーザ編集映像におけるショットの並び方を可視化して出力してもよい。つまり、ユーザ編集映像における要素メディアの並び方を、要素メディアの信号特徴に応じた態様の画像を並べることで可視化して出力すればよい。 For example, the shots grouped according to the signal characteristics may be represented by images having different shapes for each group, so that the shot arrangement in the user edited video may be visualized and output. In other words, the arrangement of the element media in the user-edited video may be visualized and output by arranging the images in a mode corresponding to the signal characteristics of the element media.

このように表示されている編集画面３１上で、ユーザが素材映像を編集する際の、映像編集支援装置１の動作について、図７のフロー図を用い、図８〜図１２の編集画面３１の表示例を参照しながら説明する。 The operation of the video editing support apparatus 1 when the user edits the material video on the editing screen 31 displayed in this way will be described using the flowchart of FIG. 7, with reference to the display examples of the editing screen 31 in FIGS. 8 to 12.

図７は、映像編集支援装置１の編集支援（図４のＳ６）に係る動作の流れを示すフロー図である。 FIG. 7 is a flowchart showing a flow of operations related to editing support (S6 in FIG. 4) of the video editing support apparatus 1.

ユーザ入力受付部１２が、ユーザの入力を受け付ける（Ｓ２０）。候補生成部１３は、その入力が何の選択を指示しているかを判断する。判断の結果、要素であれば（Ｓ２１で要素）、選択された要素と、編集モデルに示される要素との適合度の算出を行う（Ｓ２２）。 The user input receiving unit 12 receives a user input (S20). The candidate generation unit 13 determines what the input selects. If the result of the determination is an element ("element" in S21), the candidate generation unit 13 calculates the degree of matching between the selected element and the elements indicated in the editing model (S22).

候補生成部１３は、算出の結果に基づき、出力部４を介し、ユーザに選択された要素を配置すべき時間位置の候補、つまり、候補位置をユーザに提示する（Ｓ２３）。 Based on the calculation result, the candidate generation unit 13 presents to the user, through the output unit 4, a candidate for a time position where the element selected by the user should be arranged, that is, the candidate position (S23).

図８は、ユーザにより、１つのショットが選択された場合の編集画面３１の一例を示す図である。図８の、ショット選択欄３４内のショットに付されたＡ１〜Ｅ１はショットの識別子である。 FIG. 8 is a diagram illustrating an example of the edit screen 31 when one shot is selected by the user. In FIG. 8, A1 to E1 attached to shots in the shot selection column 34 are shot identifiers.

例えば、図８に示すように、（１）ユーザにより、ショットＣ２が選択された場合、候補生成部１３は、その情報を取得する。さらに、素材特徴抽出部１１から受け取ったショットＣ２の特徴量ベクトルと、モデル生成部２３から受け取った編集モデルに示される、サンプル映像におけるショットの時系列構成またはショットと音との同期構成とを照らし合わせることで、つまり対照することで、ショットＣ２のユーザ編集映像における並び方の候補を示す情報を生成する。 For example, as illustrated in FIG. 8, (1) when a shot C2 is selected by the user, the candidate generation unit 13 acquires the information. Further, the feature vector of the shot C2 received from the material feature extraction unit 11 and the time-series configuration of shots in the sample video or the synchronous configuration of shots and sounds shown in the editing model received from the model generation unit 23 are illuminated. By matching, that is, by contrasting, information indicating candidates for arrangement in the user edited video of the shot C2 is generated.

具体的には、ショットＣ２の特徴量ベクトルと、編集モデルに示される複数の特徴量ベクトルそれぞれとの距離を算出し、距離が短いほど、編集モデル内の特徴量ベクトルの適合度が高いと算出する。 Specifically, the distance between the feature vector of the shot C2 and each of the plurality of feature vectors shown in the editing model is calculated, and the shorter the distance, the higher the matching of the feature vector in the editing model is calculated. To do.
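
The distance-based calculation above can be sketched as follows. The patent states only that a shorter distance means a higher degree of matching; the Euclidean distance, the formula 1 / (1 + distance), the threshold `min_score`, and the function names are assumptions made for this illustration.

```python
import math

def matching_scores(selected, model_vectors):
    """Return (index, score) pairs; a shorter distance yields a higher score."""
    return [(i, 1.0 / (1.0 + math.dist(selected, v)))
            for i, v in enumerate(model_vectors)]

def best_matches(selected, model_vectors, min_score):
    """Keep the model vectors whose score meets the assumed criterion, best first."""
    scored = matching_scores(selected, model_vectors)
    return sorted((p for p in scored if p[1] >= min_score),
                  key=lambda p: p[1], reverse=True)

# Hypothetical two-dimensional feature vectors from an editing model.
matches = best_matches([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]], 0.4)
```

The indices in the result identify which shots of the sample video the selected shot resembles, and hence which time positions of the sample video are relevant.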

From the result, the time position of each feature vector whose degree of fit satisfies a predetermined criterion is identified from the editing model. The time position in the user-edited video that corresponds to the identified time position in the sample video is then presented to the user.

For example, suppose that the total length of the sample video is 6 minutes 30 seconds and that the time position of the feature vector satisfying the predetermined criterion is 1 minute 30 seconds. In this case, the time position of the feature vector corresponds to 20% of the total length. Therefore, if the total length of the user-edited video is 5 minutes, the corresponding time position in the user-edited video is 1 minute, which is 20% of 5 minutes.
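The proportional mapping between the two timelines can be written as a small helper. This is a minimal sketch assuming times are expressed in seconds; the function name is hypothetical.

```python
def map_time_position(sample_pos_s, sample_len_s, user_len_s):
    """Map a time position in the sample video onto the user-edited video.

    The position keeps the same fraction of the total length in both videos.
    """
    if sample_len_s <= 0 or user_len_s < 0:
        raise ValueError("lengths must be positive")
    if not 0 <= sample_pos_s <= sample_len_s:
        raise ValueError("position must lie inside the sample video")
    return sample_pos_s * user_len_s / sample_len_s
```

With this helper, a position one fifth of the way into the sample video maps to the 60-second mark of a 300-second (5-minute) user-edited video.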

Note that the feature vectors of the shots and other elements of the material video and the sample video are normalized and clustered. That is, elements such as shots are grouped according to their features.

Therefore, for example, a group in the sample video whose features are identical or similar to those of the group in the material video to which the selected shot belongs may be identified. The one or more shots belonging to the identified group are then treated directly as shots satisfying the predetermined criterion. The time positions in the user-edited video corresponding to the time positions of those shots in the edited video may then be identified and presented to the user.
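The group-based variant can be sketched as follows, assuming each cluster is summarized by a centroid vector. The centroid representation, the group ids, and the function names are all assumptions made for illustration; the text only states that the vectors are normalized and clustered.

```python
import math

def nearest_sample_group(material_centroid, sample_centroids):
    """Return the id of the sample-video group whose centroid is closest.

    sample_centroids maps a (hypothetical) group id to its centroid vector.
    """
    return min(
        sample_centroids,
        key=lambda gid: math.dist(material_centroid, sample_centroids[gid]),
    )

def group_time_positions(material_centroid, sample_centroids, positions_by_group):
    """Time positions (seconds) of every sample shot in the matching group."""
    gid = nearest_sample_group(material_centroid, sample_centroids)
    return sorted(positions_by_group[gid])
```

The returned sample-video positions would still need to be mapped onto the user-edited video before being shown as candidate positions.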

On the editing screen 31, as shown in FIG. 8, (2) the corresponding position is displayed in the video story column 37a as a candidate position (shown in the figure as a region enclosed by a dotted line).

In this way, when the user selects a shot, the video editing support apparatus 1 can present to the user candidate positions at which the shot should be placed. These candidate positions are selected by referring to an editing model created from sample video produced by a video editing expert.

In other words, the video editing support apparatus 1 can guide a user who is not a video editing expert to place a shot of the user's own choosing at a suitable time position, without unnecessary repeated trial and error.

Next, as shown in the flowchart of FIG. 7, the video editing support apparatus 1 detects movement of an element by the user (S26). Specifically, when the user moves shot C2 into the video story column 37a, the editing execution unit 16 detects the movement. The editing execution unit 16 also rewrites the display contents of the editing screen 31 and records information such as the position at which shot C2 is placed in the user-edited video (S27).

Thereafter, if there is no end instruction from the user (No in S28) and the user inputs an instruction such as a shot selection, the steps from receiving the user input (S20) through recording the information for actual editing (S27) are repeated. When the user gives an end instruction (Yes in S28), the editing execution unit 16 performs the actual editing process (S29) according to the recorded information. That is, according to the recorded information, it joins the data of the shots and other elements selected by the user to generate a single user-edited video file.

FIG. 9 shows an example of the editing screen 31 when the user has moved one shot to the storyboard 37.

As shown in FIG. 9, (3) when the user moves shot C2 to the candidate position in the video story column 37a, (4) the movement is reflected in the video structure column 38a.

This rewriting of the screen display is performed by the editing execution unit 16. The editing execution unit 16 also appends the data of shot C2 to the user-edited video data it holds internally, so that shot C2 is placed at that time position in the user-edited video.

Also, as shown in the flowchart of FIG. 7, when the user selects a time position on the storyboard 37 ("time position" in S21), the candidate generation unit 13 calculates the degree of fit between the element indicated in the editing model for that time position and the elements displayed on the editing screen 31 (S24).

Based on the result of the calculation, the candidate generation unit 13 presents to the user, via the output unit 4, candidates for the element to be placed at the selected time position, that is, candidate elements (S25).

FIG. 10 shows an example of the editing screen 31 when the user has selected a time position.

For example, as shown in FIG. 10, (1) when the user selects a time position in the video story column 37a (shown in the figure as a region enclosed by a dotted line), the candidate generation unit 13 acquires that information.

It then reads from the editing model the feature vector of the shot located at the time position in the editing model that corresponds to the selected time position. The read feature vector is compared with the feature vector of each shot displayed in the shot selection column 34 (hereinafter, "material shot vector"), and the degree of fit of each is calculated.

Specifically, the distance between the read feature vector and each material shot vector is calculated; the shorter the distance, the higher the degree of fit assigned to that material shot vector.

From the result, the shots whose degree of fit satisfies a predetermined criterion are identified. As shown in FIG. 10, (2) the identified shots are displayed on the screen as candidate elements, for example by blinking.
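Selecting candidate elements for a chosen time position can be sketched as below. The sketch assumes that the "predetermined criterion" is a maximum Euclidean distance; both that choice and the function name are illustrative assumptions.

```python
import math

def candidate_elements(model_vector, material_shot_vectors, max_distance):
    """Return ids of material shots within max_distance of the model vector.

    material_shot_vectors maps a shot id to its material shot vector.
    The result is ordered nearest first.
    """
    distances = {
        shot_id: math.dist(model_vector, vec)
        for shot_id, vec in material_shot_vectors.items()
    }
    hits = [shot_id for shot_id, d in distances.items() if d <= max_distance]
    return sorted(hits, key=lambda shot_id: distances[shot_id])
```

The shots returned by such a function would be the ones highlighted (for example, by blinking) in the shot selection column.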

In the display example of FIG. 10, shot B1 and shot C2 are displayed as candidate elements.

Thereafter, when the user selects one of the shots and moves it onto the video story column 37a, the video editing support apparatus 1 detects the movement of the element by the user (S26), as shown in the flowchart of FIG. 7. The editing execution unit 16 then rewrites the display contents of the editing screen 31 and records information such as the position at which shot C2 is placed in the user-edited video (S27).

Thereafter, if there is no end instruction from the user (No in S28) and the user inputs an instruction such as the selection of a time position, the steps from receiving the user input (S20) through recording the information for actual editing (S27) are repeated. When the user gives an end instruction (Yes in S28), the editing execution unit 16 performs the actual editing process (S29) according to the recorded information and generates a single user-edited video file.

In this way, the video editing support apparatus 1 can present candidates not only when the user selects an element such as a shot, but also when the user selects a time position in the user-edited video; in the latter case, it presents candidate elements to be placed at that time position.

In other words, a user who is not a video editing expert can be guided to place a shot suited to a self-selected time position at that position, without unnecessary repeated trial and error.

The video editing support apparatus 1 can also guide the user to place an additional element selected by the user at a suitable time position in the user-edited video.

FIG. 11 shows an example of the editing screen 31 when the user has selected one additional sound.

For example, as shown in FIG. 11, (1) when the user selects "Music 1", the candidate generation unit 13 acquires that information. It then compares the feature vector of "Music 1", read from the additional sound storage unit 15, with the feature vectors in the editing model received from the model generation unit 23, and calculates the degree of fit of each.

Specifically, the feature vector of "Music 1" is compared with the components representing sound features contained in the feature vectors in the editing model.

Through this comparison, the feature vectors in the editing model that are at a short distance from the feature vector of "Music 1", that is, whose degree of fit with it satisfies a predetermined criterion, are identified.

Furthermore, shots already placed in the user-edited video whose feature vectors lie within a predetermined distance of the identified feature vector are identified. That is, the shot to which "Music 1" should be added is identified, together with the time position of that shot.
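The two-step lookup for an additional sound can be sketched as follows. The data layout is an assumption made for illustration: each editing-model entry is taken to hold separate `visual` and `sound` components, and placed shots are compared on the visual component only. The two radii stand in for the "predetermined criterion" and "predetermined range", and the function name is hypothetical.

```python
import math

def shots_for_sound(sound_vec, model_entries, placed_shots, fit_radius, shot_radius):
    """Return (shot_id, start_s) pairs for placed shots that suit the sound.

    model_entries: list of dicts with 'visual' and 'sound' vectors (assumed layout).
    placed_shots:  list of dicts with 'id', 'start_s' and 'visual' keys.
    """
    # Step 1: model entries whose sound component is close to the selected sound.
    matched = [
        e for e in model_entries
        if math.dist(sound_vec, e["sound"]) <= fit_radius
    ]
    # Step 2: placed shots whose visual features are close to any matched entry.
    result = []
    for shot in placed_shots:
        if any(math.dist(shot["visual"], e["visual"]) <= shot_radius for e in matched):
            result.append((shot["id"], shot["start_s"]))
    return result
```

The same pattern would apply to an editing effect by swapping the sound component for an effect component.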

Alternatively, the time position of the feature vector in the editing model whose degree of fit with the feature vector of "Music 1" satisfies the predetermined criterion may be identified, and the corresponding time position in the user-edited video may then be identified. That is, the time position in the user-edited video at which "Music 1" should be placed may be identified without reference to any shot.

The time position in the user-edited video identified in this way is displayed, as shown in FIG. 11, (2) as a candidate position in the sound story column 37b.

Thereafter, when the user moves "Music 1" to the sound story column 37b, the editing execution unit 16 detects the movement and rewrites the display contents. It also appends the data of "Music 1" to the user-edited video data it holds internally, so that "Music 1" is placed at that time position in the user-edited video.

When the user selects an editing effect, the video editing support apparatus 1 can likewise present candidate positions to the user, in the same way as when an additional sound is selected.

FIG. 12 shows an example of the editing screen 31 when the user has selected an editing effect, which is one kind of additional element.

For example, as shown in FIG. 12, (1) when the user selects "Fade out", the candidate generation unit 13 acquires that information. It then compares the feature vector of "Fade out", read from the effect data storage unit 14, with the feature vectors in the editing model received from the model generation unit 23, and calculates the degree of fit of each.

Specifically, the feature vector of "Fade out" is compared with the components representing editing effect features contained in the feature vectors in the editing model.

Through this comparison, the feature vectors in the editing model that are at a short distance from the feature vector of "Fade out", that is, whose degree of fit with it satisfies a predetermined criterion, are identified.

Furthermore, shots already placed in the user-edited video whose feature vectors lie within a predetermined distance of the identified feature vector are identified. That is, the shot to which "Fade out" should be added is identified, together with the time position of that shot.

Using the time position in the user-edited video identified in this way, as shown in FIG. 12, (2) a marker such as a dotted vertical rectangle is displayed in the video story column 37a between the shot to which "Fade out" should be added and the next shot.

Thereafter, when the user moves "Fade out" onto the vertical rectangle in the video story column 37a, the editing execution unit 16 detects the movement and rewrites the display contents. It also appends the fade-out effect data to the user-edited video data it holds internally, so that the fade-out is added to the shot immediately before that time position in the user-edited video.

Note that when an editing effect that belongs at the beginning of a shot, such as "Fade in", is moved onto the vertical rectangle, the effect is added to the shot immediately after that rectangle.
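The rule for which neighbouring shot receives an effect dropped on a vertical rectangle can be sketched as follows. The two effect sets contain only the effects named in the text; the indexing convention and the function name are assumptions for illustration.

```python
# Which neighbour an effect attaches to; the two sets are illustrative assumptions.
START_EFFECTS = {"fade in"}
END_EFFECTS = {"fade out"}

def target_shot(effect, gap_index, shots):
    """Return the shot that an effect dropped on a gap applies to.

    gap_index i denotes the rectangle between shots[i] and shots[i + 1].
    """
    if not 0 <= gap_index < len(shots) - 1:
        raise IndexError("gap_index must lie between two shots")
    if effect in END_EFFECTS:
        return shots[gap_index]        # effect ends the preceding shot
    if effect in START_EFFECTS:
        return shots[gap_index + 1]    # effect opens the following shot
    raise ValueError(f"unknown effect: {effect!r}")
```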

Also, when the user selects a vertical rectangle placed between two shots in the video story column 37a of the editing screen 31, the video editing support apparatus 1 can present to the user candidate editing effects suitable for the shot immediately before or immediately after that rectangle.

In this case, for example, dotted vertical rectangles are displayed before and after each shot placed in the video story column 37a. The candidate generation unit 13 then identifies the editing effects to be added to the shots on either side of the selected rectangle from the feature vectors of those shots and the editing model.

As a result, if, for example, the shot immediately after the selected rectangle is dark and slow-moving, one or more shots with similar features in the sample video are identified. The one or more editing effects added at the beginning of those shots are then identified and presented to the user as candidate editing effects for the shot immediately after the selected rectangle.

Likewise, the editing effect to be added at the end of the shot immediately before the selected rectangle is identified from the features of that shot. The identified editing effect is presented to the user as a candidate editing effect for the shot immediately before the selected rectangle.

The user selects one of the editing effects presented in this way and moves it onto the vertical rectangle. Depending on the type of the editing effect, it is then added to the shot immediately before or immediately after that rectangle.

When an editing effect is added to a shot that already has a sound attached, the sound is also processed so that it is synchronized with the editing effect. For example, when "Fade out" is added to the shot as an editing effect, the sound is processed so that it also fades out.
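One way to make the attached sound follow a video fade-out is to apply a gain ramp to the final audio samples. The sketch below assumes a linear ramp over a list of float samples; the ramp shape and the function name are assumptions, since the text does not describe how the sound is processed.

```python
def fade_out(samples, fade_len):
    """Return a copy of samples with a linear fade-out over the last fade_len samples.

    The gain falls in equal steps and reaches 0.0 on the final sample.
    """
    n = len(samples)
    fade_len = max(0, min(fade_len, n))
    out = list(samples)
    for k in range(fade_len):
        gain = 1.0 - (k + 1) / fade_len
        out[n - fade_len + k] *= gain
    return out
```

In practice `fade_len` would be chosen to match the duration of the video fade so that both reach silence and black at the same moment.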

An editing effect can also be determined from the features of the sound added to a shot.

For example, suppose that in the user-edited video a sound has already been added to the shot immediately before the vertical rectangle selected by the user, and that the sound fades out at the end of that shot. In this case, the material feature extraction unit 11 can detect, as a signal feature, that the volume decreases toward the end of the shot.

Suppose further that the learning unit 2 has learned, as a feature of the shots in the sample video, that when the added sound fades out the moving image also fades out, that is, that "Fade out" has been added as an editing effect.

In this case, based on what the learning unit 2 has learned, the candidate generation unit 13 selects "Fade out" as a candidate effect to be added to that shot in the user-edited video and presents it to the user via the output unit 4.
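The learned association between a fading sound and a video fade-out could be approximated by a simple co-occurrence ratio over the sample shots. This is only a sketch: the dictionary layout, the 0.5 threshold, and the function names are assumptions, and the actual learning method of the learning unit 2 is not described at this level of detail.

```python
def learn_fade_ratio(sample_shots):
    """Fraction of sample shots with an audio fade-out that also have a video fade-out.

    Each shot is a dict with a boolean 'audio_fades_out' and a 'video_effect'
    that is a string or None (assumed layout).
    """
    with_audio_fade = [s for s in sample_shots if s["audio_fades_out"]]
    if not with_audio_fade:
        return 0.0
    both = sum(1 for s in with_audio_fade if s["video_effect"] == "fade out")
    return both / len(with_audio_fade)

def suggest_effect(shot_audio_fades_out, learned_ratio, threshold=0.5):
    """Suggest 'fade out' when the shot's sound fades and the learned ratio is high."""
    if shot_audio_fades_out and learned_ratio >= threshold:
        return "fade out"
    return None
```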

In video editing, the relationship between editing effects, which are elements of visual features, and sound features, which are auditory features, is important. The video editing support apparatus 1 therefore learns from the sample video information indicating the relationship between sound segment boundaries and moving-image segment boundaries, as information indicating the synchronization structure of these boundaries. It can then use what it has learned to guide the user's editing actions in a more appropriate direction.

Two vertical rectangles may also be displayed side by side between adjacent shots. For example, when two shots A and B are arranged in the order A, B on the timeline, a vertical rectangle immediately after A and a vertical rectangle immediately before B may be displayed side by side. This lets the user see explicitly whether a selected editing effect, or a selected vertical rectangle, corresponds to shot A or to shot B.

Similarly, when the user selects a time position in the sound story column 37b of the editing screen 31 and a shot is already placed at that time position in the video story column 37a, the video editing support apparatus 1 can present to the user candidate sounds suitable for that shot.

In this case, the candidate generation unit 13 identifies the sound to be added to the shot from the editing model and the feature vector of the shot in the video story column 37a that corresponds to the selected time position.

As a result, if, for example, the shot at the selected position is bright and contains vigorous motion, one or more shots with similar features in the sample video are identified.

The one or more types of sound added to those shots are then identified and compared with the sounds stored in the additional sound storage unit 15. Specifically, their feature vectors are compared.

Note that the sounds identified here include not only sound effects and music but also "silence", meaning a state in which there is substantially no sound.

As a result of this comparison, one or more types of sound similar to the identified sounds are identified among the sounds stored in the additional sound storage unit 15. These sounds are presented to the user as candidates for the sound to be placed at the time position selected by the user.

Even when the user selects a time position in the sound story column 37b of the editing screen 31 and no shot is placed at that time position in the video story column 37a, the video editing support apparatus 1 can present to the user candidate sounds suitable for that time position.

In this case, the candidate generation unit 13 identifies the sound to be placed at the selected time position from the editing model and the time position in the editing model that corresponds to the selected time position.

As a result, if, for example, the selected time position is immediately after the start of the user-edited video, the sound added to the video immediately after the start of the sample video is identified. For example, gentle music with signal features such as low volume is identified.

Then, based on the signal features of the identified music, one or more pieces of music similar to it are identified among the music stored in the additional sound storage unit 15. These pieces are presented to the user as candidates for the music to be placed at the time position selected by the user.

That is, some sounds added to video content depend strongly on time position. For example, a gentle piece may be added at the beginning of the video content and a rousing piece at the end.

For this reason, there are cases where it is useful to select candidates for the sound to be placed at a time position based solely on the time position selected by the user, without considering what the video shows, and to present them to the user.

Candidates for additional sounds and editing effects can also be presented to the user at the moment the user places a shot in the user-edited video.

For example, when the user selects one shot displayed in the shot selection column 34, a shot in the sample video with features close to those of the selected shot can be identified as described above. The features of the sound and editing effects added to that sample-video shot can therefore also be identified.

From those features, one or more additional sounds and editing effects with similar features can be presented as candidates for the additional sounds and editing effects to be added to the shot that the user selected and moved to the video story column 37a.

In this way, when the user selects an additional element, the video editing support apparatus 1 can likewise generate, according to the editing model, information indicating candidate time positions at which the additional element should be placed, and present it to the user. Similarly, when the user selects a time position, it can generate, according to the editing model, information indicating candidate additional elements to be placed at that time position, and present it to the user.

In other words, a user who is not a video editing expert can be guided to place a self-selected additional element at a time position suited to it, without unnecessary repeated trial and error. The user can also be guided to select an additional element suited to a self-selected time position.

As described above, when the user attempts to place either a shot or an additional sound in the user-edited video, the other may already be placed at the intended time position.

In this case, once the user performs the placement, the shot and the additional sound coexist at the same time position. The shot length and the length of the additional sound (the sound length) will often differ. To synchronize the shot with the additional sound, the shot length can be matched to the sound length of the additional sound attached to it.

When the shot length is shorter than the sound length, the video editing support apparatus 1 of this embodiment can synchronize the shot and the additional sound by cutting the additional sound to the shot length. When the shot length is longer than the sound length, it can synchronize them by cutting the shot to the sound length.
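The length-matching rule can be summarized in a few lines. This is a minimal sketch with lengths in seconds; the function name and the return format are hypothetical.

```python
def synchronize_lengths(shot_len_s, sound_len_s):
    """Return (shot_len, sound_len, trimmed) after cutting the longer element.

    trimmed is 'shot', 'sound', or None when both already have the same length.
    """
    target = min(shot_len_s, sound_len_s)
    if shot_len_s > sound_len_s:
        trimmed = "shot"
    elif sound_len_s > shot_len_s:
        trimmed = "sound"
    else:
        trimmed = None
    return target, target, trimmed
```

The function only decides how long each element should be; which portion of the longer element is kept is a separate choice.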

FIG. 13 is a flowchart outlining the operation of the video editing support apparatus 1 when it takes already placed shots or additional sounds into account.

When the user attempts to place, for example, one shot at a certain time position in the video story column 37a and no shot is yet placed at that time position (No in S30), the shot is moved to that time position as is (S32).

If a shot is already placed at that time position (Yes in S30), the shot selected by the user is moved so as to overwrite the already placed shot (S31).

Furthermore, if no additional sound is placed at that time position in the sound story column 37b (No in S33), the lengths of the shot and the additional sound need not be adjusted; the actual editing process is performed and the operation ends.

If, for example, BGM is already placed at that time position in the sound story column 37b (Yes in S33), the shot length is compared with the sound length of the BGM. This comparison is performed by the editing execution unit 16.

If the comparison shows that the shot is longer than the sound (Yes in S34), the boundary at the start of the shot is synchronized with the boundary at the start of the BGM, and the shot is cut to the sound length. That is, the boundary at the end of the shot is synchronized with the boundary at the end of the BGM (S35).

例えば、１分のショットを３０秒にする場合、そのショットの１５秒以降から４５秒以前までが残されるように、その他の期間に相当するショットのデータがカットされる。つまり、ショットの中央部を中心に、音長に応じた長さのショットが残される。 For example, when a one-minute shot is shortened to 30 seconds, the shot data for the other periods is cut so that the portion from 15 seconds to 45 seconds of the shot remains. That is, a shot of a length corresponding to the sound length is left, centered on the middle of the shot.

なお、例えば、ショットの開始から音長に応じた長さのショットを残してもよく、また例えば、任意の部分を取り出して合計で３０秒になるようにしてもよい。要するにショット長を音長と同一の長さになるように短縮すればよい。 For example, a shot having a length corresponding to the sound length may be left from the start of the shot, or for example, an arbitrary portion may be taken out to be a total of 30 seconds. In short, the shot length may be shortened to be the same length as the sound length.
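The shortening rule above can be sketched in a few lines. This is only an illustration: the function names `center_trim` and `head_trim` are hypothetical and do not appear in the specification, and times are assumed to be given in seconds.

```python
from typing import Tuple


def center_trim(shot_len: float, sound_len: float) -> Tuple[float, float]:
    """Return (start, end) of the part of the shot to keep, centered on its middle."""
    if sound_len <= 0:
        raise ValueError("sound_len must be positive")
    if shot_len <= sound_len:
        # The shot is not longer than the sound, so nothing is cut.
        return (0.0, float(shot_len))
    margin = (shot_len - sound_len) / 2.0
    return (margin, margin + sound_len)


def head_trim(shot_len: float, sound_len: float) -> Tuple[float, float]:
    """Alternative: keep the part of the shot that starts at its beginning."""
    return (0.0, float(min(shot_len, sound_len)))


print(center_trim(60, 30))  # (15.0, 45.0)
print(head_trim(60, 30))    # (0.0, 30.0)
```

With a one-minute shot and a 30-second BGM, `center_trim` keeps the span from 15 to 45 seconds, which matches the example in the text.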

また、ショット長が音長より長くない場合（Ｓ３４でＮｏ）、実編集処理を行った上で当該処理に係る動作を終了する。具体的には、ショット長と音長とが同一の場合、そのまま、ユーザ編集映像の当該時間位置にそのショットが配置されるように、編集実行部１６は、内部に保持するユーザ編集映像のデータにそのショットのデータを追記する。 If the shot length is not longer than the sound length (No in S34), the actual editing process is performed and the operation related to this processing ends. Specifically, when the shot length and the sound length are the same, the editing execution unit 16 appends the data of the shot to the user-edited video data that it holds internally, so that the shot is placed as it is at that time position of the user-edited video.

また、ショット長が音長より短い場合、例えば、音長が１分であり、そのＢＧＭと対応付けられた単独または複数のショットの全体の長さが３０秒である場合を想定する。この場合、その単独または複数のショットとそのＢＧＭとが対応付けられたことが確定すると、ＢＧＭのデータは、例えば、開始から３０秒までに相当する部分だけ残すようにカットされる。 In addition, when the shot length is shorter than the sound length, for example, it is assumed that the sound length is 1 minute and the total length of one or a plurality of shots associated with the BGM is 30 seconds. In this case, when it is determined that the single or plural shots and the BGM are associated with each other, the BGM data is cut so as to leave only a portion corresponding to 30 seconds from the start, for example.

なお、ＢＧＭが複数のショットと対応付けられ、当該複数のショットの合計長が、ＢＧＭの長さより長い場合、ＢＧＭのどの時間位置でショットの切り替えがあるべきか、という情報に基づき、ショットの合計長がＢＧＭの長さにおさまるように、すべてのショットに対しカットするかカットしないかの判断を行う。 When the BGM is associated with a plurality of shots and the total length of those shots is longer than the length of the BGM, whether or not to cut is determined for every shot, based on information about the time positions in the BGM at which shots should switch, so that the total length of the shots fits within the length of the BGM.

上記のＢＧＭのどの時間位置でショットの切り替えがあるべきかについての情報は、学習部２において、音のセグメント境界と動画のセグメント境界との関連性を示す情報としてサンプル映像から学習されるものである。 The information about the time positions in the BGM at which shots should switch is learned from the sample video by the learning unit 2, as information indicating the relationship between sound segment boundaries and moving-image segment boundaries.

さらに、ＢＧＭは、当該１以上のショットと同期付けられ、つまり、当該１以上のショットと同じ時間位置にＢＧＭが配置されるように、ユーザ編集映像のデータにそのＢＧＭのデータが編集実行部１６により追記される。 Further, the BGM is synchronized with the one or more shots; that is, the editing execution unit 16 appends the BGM data to the user-edited video data so that the BGM is placed at the same time position as the one or more shots.

なお、上記のＢＧＭの開始位置は、ＢＧＭの楽曲としての開始位置でなくてもよく、ユーザの明示の指示により、ＢＧＭの途中の位置を開始位置として扱ってもよい。ショットに音楽以外の効果音等を付加する場合であっても同様である。 Note that the start position of the BGM does not have to be the start position of the BGM music, and a position in the middle of the BGM may be handled as the start position according to a user's explicit instruction. The same applies to the case where sound effects other than music are added to the shot.

ここで、サンプル映像に含まれる１つのショットは、そのサンプル映像のソースであるソース映像の１つのショットから一部を抽出することにより、または、一部を省略することにより得られている場合がある。 Here, one shot included in the sample video may have been obtained by extracting a part of, or omitting a part of, one shot of the source video from which the sample video was made.

そこで、映像編集支援装置１が、サンプル映像の１つのショットが、そのショットに対応するソース映像のショットからどのように抽出または省略されたものであるかを学習し、学習された情報に従って、素材映像のショットの短縮のための支援を行ってもよい。 Therefore, the video editing support apparatus 1 may learn how one shot of the sample video was extracted or abridged from the corresponding shot of the source video, and may support the shortening of shots of the material video according to the learned information.

図１４は、サンプル映像の１つのショットと、そのショットのソースであるショットとの関係を模式的に示す図である。 FIG. 14 is a diagram schematically showing a relationship between one shot of a sample video and a shot that is a source of the shot.

例えば、サンプル映像の１つのショット（以下、「サンプルショット」という。）を映画予告映像の１つのショットとすると、ソースであるショット（以下、「ソースショット」という。）は映画本編の１ショットである。 For example, if one shot of the sample video (hereinafter referred to as a "sample shot") is one shot of a movie preview video, the shot that is its source (hereinafter referred to as a "source shot") is one shot of the main movie.

サンプルショットは、対応するソースショットそのままである場合もあるが、図１４に示すように、ソースショットから、必要な部分のみを抽出して制作される場合もある。 The sample shot may be the corresponding source shot as it is, but as shown in FIG. 14, it may be produced by extracting only the necessary part from the source shot.

このような場合、ソースショットの動画および音の信号特徴と、そのソースショットから得られたサンプルショットの動画および音の信号特徴とから、そのサンプルショットを制作した編集者が行った、動画の省略または抽出の仕方を学習することができる。 In such a case, the way in which the editor who produced the sample shot omitted or extracted parts of the moving image can be learned from the signal features of the moving image and sound of the source shot and those of the sample shot obtained from that source shot.

この場合、例えば、学習部２において、サンプル受付部２０が、サンプル映像およびソース映像の入力を受け付ける。サンプル特徴抽出部２１は、１つのサンプルショットおよびそのサンプルショットに対応するソースショットの、フレームごとの、フレームの平均輝度およびフレームに付加された音の音量等の特徴量ベクトルを算出する。 In this case, for example, in the learning unit 2, the sample receiving unit 20 receives input of the sample video and the source video. The sample feature extraction unit 21 calculates a feature vector such as the average luminance of the frame and the volume of sound added to the frame for each frame of one sample shot and the source shot corresponding to the sample shot.

なお、特徴量ベクトルの算出の対象となるサンプルショットは、例えば、ユーザが短縮しようとしているショットの特徴に似た特徴を持つサンプルショットが選択される。 For example, a sample shot having a feature similar to the feature of the shot that the user is trying to shorten is selected as the sample shot that is a target for calculating the feature vector.

さらに、サンプル解析部２２が、それら特徴量ベクトルの正規化を行い、ソースショットの各フレームの特徴量ベクトルと、サンプルショットの各フレームの特徴量ベクトルとを比較する。 Further, the sample analysis unit 22 normalizes the feature quantity vectors, and compares the feature quantity vectors of each frame of the source shot with the feature quantity vectors of each frame of the sample shot.

これにより、ソースショットのどのような信号特徴をもつフレームが、サンプルショットのフレームとして抽出されているかを学習することができる。 Thereby, it is possible to learn what kind of signal characteristic of the source shot is extracted as the frame of the sample shot.

例えば、ソースショットの一定の輝度以上の輝度を持つフレームが、サンプルショットのフレームとして抽出されていることが学習される。また、例えば、ソースショットの特定の時間位置のフレームが、サンプルショットのフレームとして抽出されていることが学習される。 For example, it is learned that a frame having a luminance higher than a certain luminance of the source shot is extracted as a sample shot frame. Further, for example, it is learned that a frame at a specific time position of the source shot is extracted as a frame of the sample shot.

モデル生成部２３は、サンプル解析部２２により学習された情報を、ショット制作用の編集モデルとして、候補生成部１３に出力する。 The model generation unit 23 outputs the information learned by the sample analysis unit 22 to the candidate generation unit 13 as an edit model for shot production.

編集部３では、素材特徴抽出部１１が、短縮すべき素材映像のショット（以下、「素材ショット」という。）のフレームごとの特徴量ベクトルを、サンプル特徴抽出部２１と同様に算出し正規化する。 In the editing unit 3, the material feature extraction unit 11 calculates and normalizes a feature vector for each frame of a shot of a material video to be shortened (hereinafter referred to as “material shot”) in the same manner as the sample feature extraction unit 21. To do.

候補生成部１３は、素材特徴抽出部１１から受け取った素材ショットのフレームごとの特徴量ベクトルと、モデル生成部２３から受け取った編集モデルとを対照する。 The candidate generation unit 13 compares the feature vector for each frame of the material shot received from the material feature extraction unit 11 with the edit model received from the model generation unit 23.

これにより、例えば、編集モデルが、ソースショットの一定の輝度以上の輝度を持つフレームがサンプルショットのフレームとして抽出されていることを示す場合、候補生成部１３は、素材ショットに含まれる、一定の輝度以上の輝度を持つフレームを抽出し、素材ショットとして残すべきフレームの候補として、ユーザに提示することができる。 Thereby, for example, when the editing model indicates that frames of the source shot whose luminance is at or above a certain level are extracted as frames of the sample shot, the candidate generation unit 13 can extract the frames of the material shot whose luminance is at or above that level and present them to the user as candidates for the frames to be kept in the material shot.

ユーザは、提示されたフレームの候補から、素材ショットとして残すべきフレームを選択することで、素材ショットを適切に短縮することができる。 The user can appropriately shorten the material shot by selecting a frame to be left as a material shot from the presented frame candidates.
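As a rough illustration of the luminance example, the sketch below assumes that the learned editing model reduces to a single threshold: the lowest normalized luminance among the source-shot frames that were kept in the sample shot. The function names and the toy data are hypothetical, and the specification does not state that the model takes this form.

```python
def learn_luminance_threshold(source_luma, kept_flags):
    """Smallest luminance among the source frames that appear in the sample shot."""
    kept = [luma for luma, keep in zip(source_luma, kept_flags) if keep]
    if not kept:
        raise ValueError("no source frame was kept in the sample shot")
    return min(kept)


def candidate_frames(material_luma, threshold):
    """Indices of material-shot frames whose luminance reaches the threshold."""
    return [i for i, luma in enumerate(material_luma) if luma >= threshold]


# Toy data (assumed): normalized mean luminance per frame of a source shot,
# and whether each frame was kept in the corresponding sample shot.
source_luma = [0.2, 0.7, 0.9, 0.4, 0.8]
kept_flags = [False, True, True, False, True]

threshold = learn_luminance_threshold(source_luma, kept_flags)
print(threshold)                                            # 0.7
print(candidate_frames([0.1, 0.75, 0.6, 0.95], threshold))  # [1, 3]
```

The returned indices correspond to the frame candidates that would be presented to the user for selection.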

また、上述のようなソース映像とサンプル映像との比較は、サンプル映像の編集パターンを学習する上でも有用である。 The comparison between the source video and the sample video as described above is also useful for learning the editing pattern of the sample video.

図１５は、サンプル映像と、そのサンプル映像のソースであるソース映像との関係を模式的に示す図である。 FIG. 15 is a diagram schematically illustrating a relationship between a sample video and a source video that is a source of the sample video.

図１５に示すように、サンプル映像を映画予告編映像とすると、ソース映像は映画本編である。また、サンプル映像は、ソース映像に含まれる複数のショットの中から選択された複数のショットで構成されている。 As shown in FIG. 15, if the sample video is a movie trailer video, the source video is the main movie. Further, the sample video is composed of a plurality of shots selected from a plurality of shots included in the source video.

そのため、例えば、サンプル映像に含まれる各ショットが、ソース映像においてどの時間位置に存在したか、という情報もショットの特徴量の１つとして捉えることができる。 Therefore, for example, information indicating at which time position each shot included in the sample video exists in the source video can also be regarded as one of the feature quantities of the shot.

この場合、上述のソースショットとサンプルショットとを比較する場合と同様に、サンプル解析部２２が、ソース映像とサンプル映像の各ショットの信号特徴を対照することにより、ソース映像のどのような特徴をもつ動画がサンプル映像の動画として抽出されているかを示す情報を得ることができる。例えば、どの時間位置のショットがサンプル映像のショットとして抽出されたかを学習することができる。 In this case, as in the comparison of a source shot with a sample shot described above, the sample analysis unit 22 compares the signal features of the shots of the source video and the sample video, and thereby obtains information indicating what kind of features the moving images extracted from the source video into the sample video have. For example, it can learn from which time positions shots were extracted as shots of the sample video.

さらに、モデル生成部２３は、サンプル解析部２２によって学習された情報を、１つの成分としてサンプル映像の各ショットの特徴量ベクトルに含ませる。さらに、それら特徴量ベクトルを含む編集モデルを候補生成部１３に出力する。 Further, the model generation unit 23 includes the information learned by the sample analysis unit 22 as one component in the feature vector of each shot of the sample video. Further, the editing model including these feature quantity vectors is output to the candidate generation unit 13.

候補生成部１３は、例えば、ある時間位置がユーザに選択された場合、その時間位置に対応する、サンプル映像内のショットの特徴量ベクトルを編集モデルから特定する。 For example, when a certain time position is selected by the user, the candidate generation unit 13 specifies a feature quantity vector of a shot in the sample video corresponding to the time position from the edit model.

さらに、その特定した特徴量から、そのショットがソース映像においてどの時間位置にあったかを編集モデルから読み出す。例えば、読み出した時間位置が、ソース映像の全長の５０％に相当する時間位置であったと想定する。 Further, the time position in the source video where the shot was located is read out from the editing model from the specified feature amount. For example, assume that the read time position is a time position corresponding to 50% of the total length of the source video.

この場合、候補生成部１３は、素材映像の全長の５０％に相当する時間位置にあるショットおよびその前後のショットの３つのショットを、ユーザ編集映像に配置すべきショットの候補として、ユーザに提示する。 In this case, the candidate generation unit 13 presents three shots to the user as candidates for the shot to be placed in the user-edited video: the shot at the time position corresponding to 50% of the total length of the material video, and the shots immediately before and after it.

または、その他の特徴量も考慮し、素材映像の全長の５０％に相当する時間位置にあるショットおよびそのショットの特徴量ベクトルと距離の近い特徴量ベクトルを持つ素材映像内の１以上のショットを、ユーザ編集映像に配置すべきショットの候補として、ユーザに提示する。 Alternatively, taking other features into account as well, it presents to the user, as candidates for the shot to be placed in the user-edited video, the shot at the time position corresponding to 50% of the total length of the material video together with one or more shots in the material video whose feature vectors are close to the feature vector of that shot.

ユーザは、候補として提示された１以上のショットの中から、自身が選択した時間位置に配置するショットを選択し配置する。 The user selects and arranges a shot to be arranged at a time position selected by the user from one or more shots presented as candidates.
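The position-based candidate selection can be sketched as follows. The helper names are hypothetical, and shot lengths are assumed to be given in seconds in playback order; the feature-vector-distance variant is not covered here.

```python
def shot_at_relative_position(shot_lengths, ratio):
    """Index of the shot containing the time position ratio * (total length)."""
    target = ratio * sum(shot_lengths)
    elapsed = 0.0
    for index, length in enumerate(shot_lengths):
        elapsed += length
        if target < elapsed:
            return index
    return len(shot_lengths) - 1  # ratio == 1.0 falls in the last shot


def neighbor_candidates(shot_lengths, ratio):
    """The shot at the relative position plus the shots just before and after it."""
    center = shot_at_relative_position(shot_lengths, ratio)
    last = len(shot_lengths) - 1
    return [i for i in (center - 1, center, center + 1) if 0 <= i <= last]


material = [10, 20, 30, 25, 15]            # five material shots, 100 s in total
print(neighbor_candidates(material, 0.5))  # [1, 2, 3]
```

At the edges of the material video fewer than three candidates are returned, since there is no preceding or following shot.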

このように、映像編集支援装置１は、サンプル映像の各ショットが、ソース映像のどの時間位置に存在していたかを示す情報を利用することによっても、ユーザの映像編集を支援することができる。 As described above, the video editing support apparatus 1 can support the user's video editing also by using information indicating at which time position of the source video each shot of the sample video exists.

また、映画編集支援装置１は、複数の種類のサンプル映像から複数の種類の編集情報を学習してもよい。例えば、アクション、コメディー等のジャンルの異なる複数の映画予告編映像および、サッカー等のスポーツの試合のダイジェスト映像等から、それぞれ編集情報を学習してもよい。 Further, the video editing support apparatus 1 may learn a plurality of types of editing information from a plurality of types of sample videos. For example, it may learn separate editing information from movie trailer videos of different genres, such as action and comedy, and from digest videos of sports matches such as soccer.

この場合、それら複数の編集情報を記憶し、かつ、ユーザ等の指示により必要な編集情報を読み出せる機能を映像編集支援装置１が有していればよい。 In this case, it is only necessary that the video editing support apparatus 1 has a function of storing the plurality of pieces of editing information and reading out necessary editing information according to an instruction from a user or the like.

これにより、例えば、素材映像の内容に合わせて、または、ユーザの好みに合わせて、ユーザ編集映像の制作により適した編集情報を複数の中から選択できる。 Thereby, for example, editing information more suitable for production of user-edited video can be selected from a plurality according to the content of the material video or according to the user's preference.

また、本実施の形態の映像編集支援装置１が備える学習部２、編集部３および出力部４のそれぞれの動作は、中央演算装置（ＣＰＵ）、記憶装置、および情報の入出力を行うインタフェース等を有するコンピュータ、並びに、コンピュータに所定の処理を行わせるプログラムにより実現することができる。 The operations of the learning unit 2, the editing unit 3, and the output unit 4 included in the video editing support apparatus 1 of the present embodiment can be realized by a computer having a central processing unit (CPU), a storage device, an interface for inputting and outputting information, and the like, together with a program that causes the computer to perform predetermined processing.

例えば、インタフェースを介して入力されるサンプル映像を所定のメモリに記憶する。また、ＣＰＵが所定のプログラムを実行することにより、サンプル映像のショット列への分割、各ショットの特徴量の抽出、編集モデルの生成等を行う。 For example, the sample video input through the interface is stored in a predetermined memory. Further, the CPU executes a predetermined program to divide the sample video into shot sequences, extract feature quantities of each shot, generate an edit model, and the like.

また、インタフェースを介して入力される素材映像を所定のメモリに記憶する。また、ＣＰＵが所定のプログラムを実行することにより、素材映像のショット列への分割、各ショットの特徴量の抽出等を行う。 Further, the material video input via the interface is stored in a predetermined memory. Further, the CPU executes a predetermined program to divide the material video into shot sequences, extract feature amounts of each shot, and the like.

さらに、インタフェースを介して入力されるユーザの指示内容と、編集モデル等の情報とを用いて、ＣＰＵが候補要素または候補位置を示す情報を生成する。また、それら候補要素等を含む編集画面が生成され、ＣＰＵの指示により、インタフェースを介して表示装置に出力される。 Further, the CPU generates information indicating candidate elements or candidate positions by using the user instruction contents input through the interface and information such as an editing model. In addition, an edit screen including these candidate elements is generated and output to the display device via the interface according to instructions from the CPU.

このようなコンピュータの動作によっても、本発明の映像編集支援装置は実現される。 The video editing support apparatus of the present invention is also realized by such a computer operation.

また、映像編集支援装置１は、上述のように、サンプル映像に対して、動画および音の解析である映像解析を行い編集情報を学習する。さらに、編集情報に従って、ユーザ編集映像に配置すべきショット等の要素メディアの選択、および要素メディアを配置すべき時間位置の選択を誘導している。 In addition, as described above, the video editing support apparatus 1 learns editing information by performing video analysis, which is analysis of moving images and sounds, on the sample video. Furthermore, selection of elemental media such as shots to be arranged in the user edited video and selection of time positions at which the elemental media should be arranged are guided according to the editing information.

つまり、映像編集支援装置１は、サンプル映像を映像解析することにより、サンプル映像を制作した映像編集の専門家の編集技術および編集知識を編集情報としてユーザに利用させることができる。 That is, the video editing support apparatus 1 can allow the user to use the editing technology and knowledge of the video editing specialist who created the sample video as editing information by analyzing the sample video.

そこで、図１６〜図２０を用い、映像編集支援装置１が行う映像解析について、一部、上記説明と重複する部分を含め、特に理論的側面を中心に実験結果等を参照しながら映像解析の一例として説明する。 Accordingly, an example of the video analysis performed by the video editing support apparatus 1 is described below with reference to FIGS. 16 to 20, focusing on its theoretical aspects and referring to experimental results. The description partly overlaps the explanation above.

図１６は、映像編集支援装置１の映像解析に係る動作の流れを示すフロー図である。 FIG. 16 is a flowchart showing the flow of operations related to video analysis of the video editing support apparatus 1.

映像編集支援装置１が映画予告映像を解析する場合を例にとって、映像編集支援装置１の映像解析に係る動作を説明する。 Taking the case where the video editing support apparatus 1 analyzes a movie preview video as an example, the operation of the video editing support apparatus 1 related to video analysis will be described.

映像編集支援装置１に映画予告映像が入力されると（Ｓ４０）映画予告映像のカットを検出し（Ｓ４１）、ショット列に分割する（Ｓ４２）。また、フレーム列に分割する（Ｓ４４）。 When a movie preview video is input to the video editing support device 1 (S40), a cut of the movie preview video is detected (S41) and divided into shot sequences (S42). Further, it is divided into frame sequences (S44).

次に、ショットから、ショット長に加え、動きに関する特徴量として、動きの強度・変動性・範囲を抽出し（Ｓ４３、Ｓ４５）、各ショットを４次元特徴ベクトルで表現する（Ｓ４６）。 Next, in addition to the shot length, the intensity, variability, and range of the motion are extracted from the shot as the feature quantity related to the motion (S43, S45), and each shot is represented by a four-dimensional feature vector (S46).

さらに、異なる映像におけるこれらの特徴量のばらつきの差による影響を除去するため、得られた特徴ベクトルを正規化する（Ｓ４７）。 Furthermore, the obtained feature vector is normalized in order to remove the influence due to the difference in variation of these feature amounts in different videos (S47).

このようにして得られた４次元ベクトルを類似度に基づいたクラスに分類し、各ショットにクラス番号を記号としてラベル付けする（Ｓ４８）。 The four-dimensional vectors obtained in this way are classified into classes based on similarity, and each shot is labeled with a class number as a symbol (S48).

最後に、すべてのショットを記号で表したショット時系列を観測系列とし（Ｓ４９）、隠れマルコフモデル（ＨＭＭ：ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）により映画予告映像に共通する時系列パターン学習（Ｓ５０）する。 Finally, a shot time series in which all shots are represented by symbols is used as an observation series (S49), and time series pattern learning common to movie preview images is performed using a hidden Markov model (HMM) (S50).

以下に、ショットからの特徴量抽出、ショットへのラベル付け、ＨＭＭによる学習方法の詳細について述べる。 Details of the feature extraction from shots, labeling of shots, and learning methods using HMM will be described below.

（ショットの特徴量抽出） (Shot feature extraction)
以下に述べる学習方法で着目する映像の感覚的側面である動画中のリズムは、映像中の動きやカメラの切り替わりの頻度など、動画中の変化により生成されると考えられる。そこで、ショットに対する特徴量としてショット長と動きを用いる。 The rhythm in a moving image, which is the sensory aspect of video that the learning method described below focuses on, is considered to be generated by changes in the moving image, such as motion in the video and the frequency of camera switching. Therefore, the shot length and motion are used as the features of a shot.

まず、ブロックマッチング（非特許文献２参照）によりフレームｆ中の画素（ｘ，ｙ）に対し求めた、下記（数１）に示す動きベクトルに基づき、動きベクトルの二次元ヒストグラムＨ（ｓ，ｔ）（ｓ，ｔ＝０，１，２．．．）を生成する。 First, a two-dimensional histogram H(s, t) (s, t = 0, 1, 2, ...) of motion vectors is generated, based on the motion vectors shown in (Equation 1) below, which are obtained for each pixel (x, y) in frame f by block matching (see Non-Patent Document 2).
G. Adiv, "Determining three-dimensional motion and structure from optical flow generated by several moving objects," Proc. IEEE Trans. on PAMI, vol. 7, no. 4, pp. 384-401, 1985.

ここで、ｓはベクトルの角度を、ｔはベクトルの長さを表す。ここで、ブロックマッチングは動きの速いフレームやカメラワークのあるフレームにおいて非常に弱く、実際の動きと一致しない動きベクトルが多く現れる。 Here, s represents the angle of the vector, and t represents the length of the vector. Here, block matching is very weak in a fast-moving frame or a frame with camerawork, and many motion vectors that do not match the actual motion appear.

このような動きベクトルの共通する特徴として、周りの動きベクトルと異なる角度と長さを持つ孤立したベクトルとして検出されることが挙げられる。 A common feature of such motion vectors is that they are detected as isolated vectors having different angles and lengths from surrounding motion vectors.

従って、このように誤検出された動きベクトルを考慮した特徴量として、動き強度ｖ_i（ｆ）、動き変動性ｖ_v（ｆ）、動き範囲ｖ_a（ｆ）を導入する。 Therefore, the motion intensity v_i(f), the motion variability v_v(f), and the motion range v_a(f) are introduced as features that take such erroneously detected motion vectors into account.

まず、動き強度ｖ_i（ｆ）を以下のように定義する。 First, the motion intensity v_i(f) is defined as follows.

この値は孤立ベクトルを除いた動きベクトルの総和となり、実際の動きの大きさをより忠実に表すと考えられる。また、動き変動性ｖ_v（ｆ）と動き範囲ｖ_a（ｆ）を次のように定義する。 This value is the sum of the motion vectors excluding isolated vectors, and is considered to represent the actual magnitude of motion more faithfully. The motion variability v_v(f) and the motion range v_a(f) are defined as follows.

ただし、Ｎ（ｆ）、Ｎ₀（ｆ）はフレームｆ中の動きベクトルの総数、及び大きさが１未満の動きベクトルの総数とする。動き変動性は、孤立ベクトルの検出された割合を表す。動きの速いフレーム間ではブロックマッチングによる誤検出が多く観測され、孤立した動きベクトルが多く現れる。 Here, N(f) and N₀(f) are the total number of motion vectors in frame f and the total number of motion vectors of magnitude less than 1, respectively. Motion variability represents the proportion of isolated vectors detected. Between fast-moving frames, many false detections by block matching are observed, and many isolated motion vectors appear.

従って、動き変動性が大きいほど対象フレーム中の動きが速いことが分かる。また、動き範囲は、動きがフレーム全体に占める割合を表す。背景が動いている場合は、フレーム全体に長さ１未満の動きベクトル、つまり静止点はほとんど存在せず、逆に静止背景においては、静止点が多く存在する。 Therefore, it can be seen that the greater the motion variability, the faster the motion in the target frame. The motion range represents the ratio of motion to the entire frame. When the background is moving, there are almost no motion vectors having a length of less than 1 in the entire frame, i.e., still points, and there are many still points in the still background.

図１７に、カメラワーク及びオブジェクトの動きの存在する、もしくは存在しないフレームから動き変動性及び動き範囲を抽出した例を示す。図に示すように、動き変動性はオブジェクトの動きの有無を、動き範囲はカメラワークの有無と関連すると考えられる。最後に、ショットｓに含まれるすべてのフレームの特徴量の平均を取ることで、ショットｓにおける動きの強度ｖ_i（ｓ）、変動性ｖ_v（ｓ）、範囲ｖ_a（ｓ）の三つの特徴量が求められる。 FIG. 17 shows an example in which motion variability and motion range are extracted from frames with and without camera work and object motion. As shown in the figure, motion variability is considered to be related to the presence or absence of object motion, and motion range to the presence or absence of camera work. Finally, by averaging the features of all frames included in shot s, the three features of shot s are obtained: the motion intensity v_i(s), the variability v_v(s), and the range v_a(s).
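The defining equations for motion variability and motion range are not reproduced in this text, so the following is only a sketch of one reading of the prose: variability is taken as the fraction of isolated vectors among the N(f) vectors, and range as the fraction of vectors that are not static, that is (N(f) - N₀(f)) / N(f). The isolated-vector flags are assumed to be supplied by some earlier detection step, and all names are hypothetical.

```python
import math


def frame_motion_features(vectors, isolated):
    """Return (variability, motion_range) for one frame.

    vectors  -- list of (dx, dy) motion vectors from block matching
    isolated -- parallel list of booleans marking isolated vectors (assumed given)
    """
    n = len(vectors)
    if n == 0:
        raise ValueError("frame has no motion vectors")
    n_static = sum(1 for dx, dy in vectors if math.hypot(dx, dy) < 1.0)
    variability = sum(1 for flag in isolated if flag) / n
    motion_range = (n - n_static) / n
    return variability, motion_range


def shot_average(frame_values):
    """Shot-level feature: the mean of a per-frame feature over the shot."""
    return sum(frame_values) / len(frame_values)


vectors = [(0, 0), (3, 4), (0.5, 0), (2, 0)]
isolated = [False, True, False, False]
print(frame_motion_features(vectors, isolated))  # (0.25, 0.5)
print(shot_average([0.25, 0.75]))                # 0.5
```

Both per-frame values lie between 0 and 1 by construction, which is consistent with the statement that these two features need no normalization.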

それらに、ショットの長さｌ（ｓ）を加え、ショットｓを４次元特徴ベクトル（ｌ（ｓ），ｖ_i（ｓ），ｖ_v（ｓ），ｖ_a（ｓ））で表現する。ここで、異なる映像におけるこれらの特徴量のばらつきの差による影響を除去するため、得られた特徴ベクトルを正規化する必要がある。動き変動性と動き範囲については、どちらも常に０〜１の値に収まるため、正規化の必要はない。そこで、動き強度とショット長に対して、同様に０〜１の値に収まるよう正規化する。動き強度に対しては、中間値が０．５となるように次式により正規化する。 The shot length l(s) is added to them, and shot s is represented by the four-dimensional feature vector (l(s), v_i(s), v_v(s), v_a(s)). Here, the obtained feature vectors need to be normalized in order to remove the influence of differences in the spread of these features between videos. Motion variability and motion range always fall within the range 0 to 1, so they need no normalization. Motion intensity and shot length are therefore normalized so that they likewise fall within 0 to 1. Motion intensity is normalized by the following equation so that the intermediate value becomes 0.5.

ただし、ｖ_i（ｓ_j）とｖ′_i（ｓ_j）はそれぞれ正規化前後のｊ番目のショットの動き強度を、ｖ_i（ｓ_m）は正規化前の全てのショットにおける動き強度の中間値を表す。但し、外れ値が存在する場合があるため、極端に大きいデータに対しては１とした。また、ショット長に対しては、各ショットの映像全体に対する割合に基づき次式で正規化する。 Here, v_i(s_j) and v′_i(s_j) are the motion intensities of the j-th shot before and after normalization, respectively, and v_i(s_m) is the intermediate value of the motion intensities of all shots before normalization. Because outliers may exist, extremely large values are set to 1. The shot length is normalized by the following equation, based on the ratio of each shot to the whole video.

ただし、ｌ（ｓ_j）とｌ′（ｓ_j）はそれぞれ正規化前後のｊ番目のショットのショット長を表す。 Here, l(s_j) and l′(s_j) are the shot lengths of the j-th shot before and after normalization, respectively.
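The two normalization equations are not reproduced in this text. The sketch below shows one mapping that is consistent with the description, under these assumptions: the "intermediate value" is read as the median, motion intensity is divided by twice that value and clipped at 1, and shot length is divided by the total length of the video. The function names are hypothetical.

```python
import statistics


def normalize_intensity(intensities):
    """Map the median intensity to 0.5 and clip extremely large values to 1."""
    mid = statistics.median(intensities)
    if mid <= 0:
        raise ValueError("median intensity must be positive")
    return [min(1.0, v / (2.0 * mid)) for v in intensities]


def normalize_lengths(lengths):
    """Express each shot length as its share of the whole video."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("total length must be positive")
    return [length / total for length in lengths]


intensity = normalize_intensity([2, 4, 6, 8, 40])
print(intensity[2], intensity[4])  # 0.5 1.0
print(normalize_lengths([2, 3, 5]))
```

In the example, the shot with intensity 40 is an outlier and is clipped to 1, while the median shot lands exactly on 0.5.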

（ショットへのラベル付け） (Labeling of shots)
特徴量抽出により、得られた４次元ベクトルに基づき、類似したショットに同じ記号をラベル付けする。ショットに記号を付加することにより、元のショット時系列データを記号列で表現することができる。 Based on the four-dimensional vectors obtained by feature extraction, similar shots are labeled with the same symbol. By attaching a symbol to each shot, the original shot time-series data can be expressed as a symbol string.

ラベリングする際、四つの特徴量ショット長、動き強度、動き変動性と動き範囲のそれぞれをＮ_i（１≦ｉ≦４）分割する。分割する際、それぞれのデータの累積ヒストグラムにより、１００／Ｎ_i％ごとに分割点を決める。つまり、ショット長を３分割するとすると、短、中、長の三つの段階に分けることができる。その結果、ショットは合計Π_iＮ_i個のクラスに分類される。 For labeling, each of the four features (shot length, motion intensity, motion variability, and motion range) is divided into N_i (1 ≤ i ≤ 4) levels. The division points are set at every 100/N_i % of the cumulative histogram of the respective data. For example, if the shot length is divided into three, it is split into three levels: short, medium, and long. As a result, the shots are classified into a total of Π_i N_i classes.
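The equal-frequency division and class numbering can be sketched as below. How ties and boundary values are handled is not stated in the text, so the rank-based rule used here is an assumption, as are the function names and the toy feature values.

```python
def quantile_bin(value, values, n_bins):
    """0-based level of value when values are cut every 100/n_bins percent."""
    rank = sum(1 for v in values if v < value)  # position in the cumulative histogram
    return min(n_bins - 1, rank * n_bins // len(values))


def label_shots(features, bins_per_feature):
    """Combine the per-feature levels of each shot into one class number."""
    columns = list(zip(*features))
    labels = []
    for shot in features:
        label = 0
        for value, column, n_bins in zip(shot, columns, bins_per_feature):
            label = label * n_bins + quantile_bin(value, column, n_bins)
        labels.append(label)
    return labels


# Toy feature vectors (length, intensity, variability, range), all assumed.
features = [
    (0.05, 0.1, 0.2, 0.9),
    (0.10, 0.3, 0.8, 0.1),
    (0.15, 0.5, 0.4, 0.7),
    (0.20, 0.7, 0.6, 0.3),
    (0.25, 0.9, 0.1, 0.5),
    (0.25, 0.2, 0.9, 0.6),
]
labels = label_shots(features, (3, 3, 2, 2))  # 3 * 3 * 2 * 2 = 36 classes
print(labels)
```

With levels (3, 3, 2, 2) every label falls in the range 0 to 35, so the resulting symbol string uses an alphabet of 36 symbols.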

（ＨＭＭによる学習） (Learning with HMM)
ＨＭＭは、複数の状態を持ち、それら相互の状態間に遷移確率が与えられる。更に各状態から確率的にシンボルが出力される。観測可能なのはこの出力シンボル列であり、モデルの状態を直接観測することは出来ない。図１８にＨＭＭの概念図を示す。 An HMM has a plurality of states, and transition probabilities are given between these states. In addition, a symbol is output stochastically from each state. Only this output symbol sequence is observable; the state of the model cannot be observed directly. FIG. 18 shows a conceptual diagram of the HMM.

一般にＨＭＭはλ（Ａ，Ｂ，π）のパラメータで表現され、以下のように定義される。 In general, an HMM is expressed by the parameters λ(A, B, π), which are defined as follows.

Ｖ＝｛Ｖ_m｝（ｍ＝１，・・・，Ｍ）：観測シンボルの集合。 V = {V_m} (m = 1, ..., M): the set of observation symbols.

Ｑ＝｛ｑ_ｎ｝（ｎ＝１，・・・，Ｎ）：状態の集合。 Q = {q_n} (n = 1, ..., N): the set of states.

Ｏ＝Ｏ₁，Ｏ₂，・・・，Ｏ_T：観測されたシンボル系列（長さＴ）。 O = O₁, O₂, ..., O_T: the observed symbol sequence (length T).

Ｓ＝Ｓ₁，Ｓ₂，・・・，Ｓ_T：状態系列（長さＴ）。 S = S₁, S₂, ..., S_T: the state sequence (length T).

Ｓ_t，Ｏ_t：時刻ｔでの状態と観測シンボル。 S_t, O_t: the state and the observation symbol at time t.

Ａ＝｛ａ_ij｜ａ_ij＝Ｐ（Ｓ_{t+1}＝ｑ_j｜Ｓ_t＝ｑ_i）｝：状態遷移確率分布。 A = {a_ij | a_ij = P(S_{t+1} = q_j | S_t = q_i)}: the state transition probability distribution.

ａ_ijは状態ｑ_iからｑ_jへ遷移する確率。 a_ij is the probability of a transition from state q_i to state q_j.

Ｂ＝｛ｂ_j（Ｏ_t）｜ｂ_j（Ｏ_t）＝Ｐ（Ｏ_t｜Ｓ_t＝ｑ_j）｝：シンボル出力確率分布。 B = {b_j(O_t) | b_j(O_t) = P(O_t | S_t = q_j)}: the symbol output probability distribution.

ｂ_j（Ｏ_t）は状態ｑ_jにおいてシンボルＯ_tを出力する確率。 b_j(O_t) is the probability of outputting symbol O_t in state q_j.

π＝｛π_i｜π_i＝Ｐ（Ｓ₁＝ｑ_i）｝：初期状態確率分布。π_iは遷移が状態ｑ_iから始まる確率。 π = {π_i | π_i = P(S₁ = q_i)}: the initial state probability distribution. π_i is the probability that the transitions start from state q_i.

本説明では、映像が「起承転結」のような流れで構成されていることを想定し、時間とともに連続的に特性を変えていく信号などを容易にモデル化することのできるＬｅｆｔ−ｔｏ−Ｒｉｇｈｔ型のＨＭＭを採用する。よって、初期状態確率分布が常に一定で、次のようになる。 In this description, the video is assumed to be composed with a flow like "kishotenketsu" (introduction, development, turn, and conclusion), and a Left-to-Right type HMM, which can easily model signals whose characteristics change continuously over time, is adopted. Therefore, the initial state probability distribution is always constant and is as follows.

また、Ｌｅｆｔ−ｔｏ−Ｒｉｇｈｔ型ＨＭＭでは自己及び右側の状態への遷移のみが可能なので、状態遷移確率分布Ａに対し、以下の拘束条件が成り立つ。 In addition, since the Left-to-Right type HMM can only make a transition to the self and right states, the following constraint conditions are satisfied for the state transition probability distribution A.
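The Left-to-Right structure can be illustrated with the usual convention that the model always starts in the first state and that each row of A is zero to the left of the diagonal. The equations themselves are not reproduced in this text, so this is a sketch under that convention; the parameter `max_jump`, which limits how far to the right a transition may go, and the function names are assumptions.

```python
import random


def left_to_right_initial(n_states):
    """Initial distribution that always starts in the first (leftmost) state."""
    return [1.0] + [0.0] * (n_states - 1)


def random_left_to_right_transitions(n_states, max_jump, rng):
    """Random row-stochastic A with a_ij = 0 for j < i and for j > i + max_jump."""
    rows = []
    for i in range(n_states):
        last = min(n_states - 1, i + max_jump)
        weights = [rng.random() + 1e-9 for _ in range(i, last + 1)]
        total = sum(weights)
        row = [0.0] * n_states
        for offset, weight in enumerate(weights):
            row[i + offset] = weight / total
        rows.append(row)
    return rows


pi = left_to_right_initial(4)
A = random_left_to_right_transitions(4, max_jump=2, rng=random.Random(0))
for row in A:
    print(["%.2f" % p for p in row])
```

Every printed row sums to 1, and the last state can only transition to itself.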

ＨＭＭを用いた理由は、特徴量の揺らぎに影響を受けずに時系列データを認識できるだけでなく、時系列データからのモデル表現への抽象化、及びモデルからの時系列データの生成を一つの数学モデルで記述できる点にある。 The reason for using HMM is not only to recognize time-series data without being affected by fluctuations in features, but also to abstract time-series data into model representation and generate time-series data from a model. It can be described with a mathematical model.

本説明において時系列データの生成は、理想的な編集映像のショット時系列の生成に相当し、素材映像から編集映像を制作する際の、部分映像の選択及び部分映像の並びの決定に必要となる。 In this description, generation of time-series data corresponds to generation of an ideal edited video shot time series, and is necessary for selection of partial video and determination of arrangement of partial video when producing edited video from material video. Become.

また、ＨＭＭは音声認識の分野で非常に有効な手法として確立した手法であるため、高速計算のための様々なアルゴリズムが開発されている。ここでは、前節で得られたショット時系列を観測系列｛Ｏ_t｝として、ＨＭＭ λ（Ａ，Ｂ，π）のパラメータをＢａｕｍ−Ｗｅｌｃｈアルゴリズムにより学習する。 In addition, since the HMM is established as a very effective technique in the field of speech recognition, various algorithms for fast computation have been developed. Here, the shot time series obtained in the previous section is used as the observation sequence {O_t}, and the parameters of the HMM λ(A, B, π) are learned by the Baum-Welch algorithm.

（実験と評価） (Experiment and evaluation)
比較的に映像中のリズムが顕著であると考えられるアクション映画の予告映像２０本を用い、初期パラメータ値と状態数の学習結果に与える影響について検証した。 Using 20 preview videos of action movies, in which the rhythm of the video is considered relatively pronounced, the influence of the initial parameter values and of the number of states on the learning result was verified.

ここでは、１本の映像をテスト映像、残りの映像を学習用映像とし、Ｆｏｒｗａｒｄアルゴリズムにより求められた、テスト映像から得られたショット時系列の学習前及び、学習後のＨＭＭにおける観測確率を比較することで、学習効果を評価する。２０通りの学習用映像・テスト映像の組み合わせを用いた交差検定法（Ｃｒｏｓｓ−Ｖａｌｉｄａｔｉｏｎ）を行った。 Here, one video is used as the test video and the remaining videos as training videos, and the learning effect is evaluated by comparing the observation probabilities, obtained by the Forward algorithm, of the shot time series of the test video under the HMM before learning and under the HMM after learning. Cross-validation was performed using the 20 combinations of training videos and test video.

また、獲得した最適初期パラメータ値と状態数を元に、同じソース映像から異なる編集方法により制作された映像を用いて学習後のＨＭＭを評価した。 In addition, based on the acquired optimum initial parameter value and the number of states, the HMM after learning was evaluated using videos produced from the same source video by different editing methods.

特徴量を抽出するには、まず映像をショットに分割する必要がある。なお、上述のように、映像内のカットを検出することで、ショットに分割することも可能であるが、本実験においては手動でショット分割を行った。ショットへのラベル付けの際、Ｎ₁＝Ｎ₂＝３，Ｎ₃＝Ｎ₄＝２とし、４次元特徴量空間を３６クラスに分けた。 To extract the features, the video must first be divided into shots. As described above, the video can be divided into shots by detecting cuts in it, but in this experiment the shots were divided manually. For labeling the shots, N₁ = N₂ = 3 and N₃ = N₄ = 2 were used, dividing the four-dimensional feature space into 36 classes.

学習前後のＨＭＭをそれぞれλ（Ａ，Ｂ，π）、λ′（Ａ′，Ｂ′，π）、観測系列｛Ｏ｝のλにおける観測確率をＰ（Ｏ｜λ）とする。観測確率はそれぞれの観測系列の長さによって指数的な違いが現れるため、異なる長さの映像間の比較ができない。よって、以下のように定義される適合度を導入する。 The HMMs before and after learning are λ (A, B, π) and λ ′ (A ′, B ′, π), respectively, and the observation probability at λ of the observation sequence {O} is P (O | λ). Since the observation probability varies exponentially depending on the length of each observation sequence, comparison between videos of different lengths is not possible. Therefore, the fitness defined as follows is introduced.

Here, T is the length of the observation sequence. This fitness represents the observation probability of an observation sequence of unit length, which makes it possible to compare videos of different lengths. Furthermore, the learning result is evaluated by the improvement rate defined as follows.

Here, Z′ denotes the fitness of the observation sequence to the model after learning, and the improvement rate corresponds to the change, from before learning to after learning, in the observation probability of the same observation sequence {O}.
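The defining equations for the fitness and the improvement rate are not reproduced in this text. The following sketch therefore rests on two assumptions read from the surrounding description: that the fitness is the length-normalized observation probability Z = P(O|λ)^(1/T), and that the improvement rate is the relative change (Z′ − Z)/Z expressed in percent. The observation probability is computed with a scaled Forward algorithm to avoid numerical underflow on long sequences; all function names are hypothetical.

```python
import math


def log_forward(obs, A, B, pi):
    """Return log P(O | lambda) using the Forward algorithm with scaling."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p


def fitness(obs, A, B, pi):
    """Assumed fitness: Z = P(O | lambda) ** (1 / T)."""
    return math.exp(log_forward(obs, A, B, pi) / len(obs))


def improvement_rate(z_before, z_after):
    """Assumed improvement rate: (Z' - Z) / Z, in percent."""
    return 100.0 * (z_after - z_before) / z_before
```

Under these assumptions, a fitness that rises from 0.2 before learning to 0.3 after learning corresponds to an improvement rate of 50%.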

(Influence of initial parameter values)
First, the effect of the initial parameter values on the experiment is verified. Since the HMM used here is of the Left-to-Right type, π among the initial values (A, B, π) of the HMM is fixed. Therefore, only A and B are varied. The number of states is set to 4 here.

First, B is fixed and A is varied. Here, the symbol output probability distribution B is set as in (Equation 12) so that each state outputs every observed symbol with the same probability.

For the state transition probability A, Δ_i = 2 is used, so that A takes the following form for the Left-to-Right HMM.

Table 1 shows the results for nine randomly set values of A (A₁ to A₉) and for A₁₀, which was set manually as follows.

Because a video generally consists of a sequence of scenes, A₁₀ assumes that each state corresponds to one scene and sets the transition probabilities to the scenes in the video such that a_ii > 0.9 and a_ii >> a_i(i+1) > a_i(i+2).
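For illustration, initial parameters of this kind could be constructed as in the following sketch. The uniform emission matrix follows the description of (Equation 12), and the transition matrix satisfies the stated conditions on a_ii for a Left-to-Right HMM with Δ_i = 2. The concrete values 0.92 and 0.25 are illustrative choices and are not taken from the experiment; the function names are hypothetical.

```python
def uniform_emissions(num_states, num_symbols):
    """Every state outputs every symbol with the same probability."""
    return [[1.0 / num_symbols] * num_symbols for _ in range(num_states)]


def left_to_right_transitions(num_states, max_jump=2, self_prob=0.92, decay=0.25):
    """Build A so that state i can only move to states i .. i + max_jump.

    self_prob and decay are assumed values chosen so that
    a_ii > 0.9 and a_ii >> a_i(i+1) > a_i(i+2).
    """
    A = []
    for i in range(num_states):
        row = [0.0] * num_states
        targets = list(range(i + 1, min(i + max_jump, num_states - 1) + 1))
        if not targets:
            row[i] = 1.0  # the last state has nowhere else to go
        else:
            weights = [decay ** k for k in range(len(targets))]
            total = sum(weights)
            row[i] = self_prob
            for j, w in zip(targets, weights):
                row[j] = (1.0 - self_prob) * w / total
        A.append(row)
    return A


A_init = left_to_right_transitions(4)   # 4 states, as in the experiment
B_init = uniform_emissions(4, 36)       # 36 shot classes
pi_init = [1.0, 0.0, 0.0, 0.0]          # Left-to-Right: always start in state 1
```

Transitions to earlier states have probability zero, which is what fixes π and gives the model its Left-to-Right structure.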

The improvement rate is about 45 to 50% in every case, which shows that changing the initial value of A has almost no effect on the result. A₁₀ yields a relatively high improvement rate, which suggests that each state of the HMM may correspond to a scene in the video.

Next, Table 2 shows the experimental results when A is fixed to A₁₀ from the above experiment and B is set to 10 randomly chosen values. B has a relatively large effect on the improvement rate, and how to set it needs to be examined in the future, taking into account the relationship between the class to which a shot belongs and the video content.

(Influence of the number of states)
Next, to verify the effect of the number of states on the learning result, experiments were performed with the number of states varied from 2 to 7. Δ_i in (Equation 7) is set to 1 when the number of states is 2 or 3, and to 3 when the number of states is 6 or 7. FIG. 19 shows the average improvement rate obtained by preparing five sets of initial parameter values for each number of states and applying each set to the 20 video combinations.

For the initial parameter values, five values set in the same manner as A₁₀ were used for A; for B, because of its strong effect on the improvement rate, the average value was used as in (Equation 12).

In FIG. 19, the horizontal axis represents the number of states and the vertical axis represents the improvement rate. An average improvement rate of 50.2% was observed; the improvement rate was relatively large when the number of states was 3 or 4, and the optimal number of states is considered to be 4. This is presumed to be because the video has a four-part structure such as "kishotenketsu" (introduction, development, turn, conclusion). FIG. 20 shows the relationship between the scene structure of a video based on this four-part structure and the optimal state sequence output by the HMM with 4 states after learning.

Since the HMM is of the Left-to-Right type, the state transitions from 1 to 4. As shown in the figure, the state changes approximately coincide with the scene changes. A relationship between each state of the HMM and the video content was thus confirmed. In the future, this relationship needs to be clarified for all videos.
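This section does not say how the optimal state sequence was obtained. One standard way to compute the most likely state sequence of an HMM is the Viterbi algorithm, sketched below in the log domain under that assumption; states are numbered from 0 here, and the function name is hypothetical.

```python
import math


def viterbi(obs, A, B, pi):
    """Return the most likely state sequence for obs (log-domain Viterbi)."""
    def log(x):
        return math.log(x) if x > 0.0 else float("-inf")

    n = len(pi)
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            pointers.append(best)
            delta.append(prev[best] + log(A[best][j]) + log(B[j][o]))
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

The positions where consecutive entries of the returned path differ are the state changes that can be compared with the scene boundaries of the video.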

(Evaluation of the learning effect using differently edited videos)
Finally, the learned HMM is verified using different videos produced by editing the same source video. For one movie preview video (Video1), short edited videos produced manually from the original movie, which serves as the source video, were prepared separately. Shots containing the partial videos that appear in the movie preview video were extracted from the movie, and two videos were produced: one in which these shots are arranged in the order in which they appear in the source video (Video2), and one in which they are arranged in the order in which they appear in the movie preview video (Video3).

For the initial parameters, A was set to A₁₀ and B to the average value, and the number of states was set to 4. Table 3 shows the fitness of each video to the models before and after learning, and the improvement rate obtained by learning.

Video1, which is an actual preview video, showed an improvement rate of about 50%, as in the preceding experiments, whereas the fitness hardly improved for Video2 and Video3.

Video2 contains shots with the same content as Video1 but in a different order, whereas in Video3 shots with the same content as Video1 appear in the same order. Video3 is therefore considered to have a slightly higher fitness.

However, each shot extracted from the source video for Video3 is longer than the corresponding shot in Video1 and contains redundant content, so the edited video does not resemble the actual video and the fitness did not improve much.

Thus, the HMM after learning is considered to represent the shot time-series patterns of actual videos well. By guiding a user who is editing a video to select and arrange shots so that the fitness to the HMM after learning becomes high, it is expected to become possible to support the production of edited videos that resemble actual videos.
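One way such guidance could be realized is to rank the candidate shots by the fitness of the sequence that results from appending each of them to the shots already placed. The sketch below illustrates this idea only; it is not the procedure of the embodiment. It assumes the length-normalized fitness Z = P(O|λ)^(1/T), computed with a scaled Forward algorithm, and shots represented by their class labels. The function names are hypothetical.

```python
import math


def fitness(obs, A, B, pi):
    """Assumed fitness Z = P(O | lambda) ** (1 / T), via the scaled Forward algorithm."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return 0.0  # the sequence is impossible under this model
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return math.exp(log_p / len(obs))


def rank_next_shots(placed, candidates, A, B, pi):
    """Sort candidate shot labels by the fitness of placed + [candidate], best first."""
    scored = [(fitness(placed + [c], A, B, pi), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

The top-ranked labels could then be presented to the user as candidates for the next position; the same scoring could equally be applied to candidate insertion positions for a selected shot.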

The present invention can be applied to a video editing support apparatus that supports editing of video content. In particular, it is useful as a video editing support apparatus, a video editing support program, a video editing support method, and the like that allow a general user without video editing skills or knowledge to edit video content effectively and efficiently.

FIG. 1 is a diagram schematically showing the configuration and functions of the video editing support apparatus of the embodiment.
FIG. 2 is a diagram schematically showing the sensory structure of a sample video learned by the learning unit.
FIG. 3 is a functional block diagram showing the functional configuration of the video editing support apparatus.
FIG. 4 is a flowchart showing an outline of the operation flow for editing support by the video editing support apparatus.
FIG. 5 is a flowchart showing an outline of the operation flow for video analysis by the video editing support apparatus.
FIG. 6 is a diagram showing an example of the editing screen displayed on the display device as a result of the operation flow shown in FIG. 5.
FIG. 7 is a flowchart showing the operation flow for editing support by the video editing support apparatus.
FIG. 8 is a diagram showing an example of the editing screen when one shot is selected by the user.
FIG. 9 is a diagram showing an example of the editing screen when one shot is moved to the storyboard by the user.
FIG. 10 is a diagram showing an example of the editing screen when a time position is selected by the user.
FIG. 11 is a diagram showing an example of the editing screen when an additional sound is selected by the user.
FIG. 12 is a diagram showing an example of the editing screen when an editing effect is selected by the user.
FIG. 13 is a flowchart showing an outline of the operation flow of the video editing support apparatus that takes into account shots or sounds already placed.
FIG. 14 is a diagram schematically showing the relationship between one shot of a sample video and the shot that is its source.
FIG. 15 is a diagram schematically showing the relationship between a sample video and the source video from which the sample video was produced.
FIG. 16 is a flowchart showing the operation flow for video analysis by the video editing support apparatus 1.
FIG. 17 is a diagram showing the relationship between motion variability and motion range on the one hand and camera work and object motion on the other.
FIG. 18 is a diagram for explaining a hidden Markov model.
FIG. 19 is a diagram showing the results of the verification experiment on the effect of the number of states on the learning result.
FIG. 20 is a diagram showing the relationship between the scene structure of a movie preview video and the HMM state sequence.

Explanation of symbols

1 Video editing support apparatus
2 Learning unit
3 Editing unit
4 Output unit
10 Material reception unit
11 Material feature extraction unit
12 User input reception unit
13 Candidate generation unit
14 Effect data storage unit
15 Additional sound storage unit
16 Editing execution unit
20 Sample reception unit
21 Sample feature extraction unit
22 Sample analysis unit
23 Model generation unit
30 Display device
31 Editing screen
32 Time input field
33 Material video timeline field
34 Shot selection field
35 Additional element selection field
36 Playback display field
37 Storyboard
37a Video story field
37b Sound story field
38 Structure display board
38a Video structure field
38b Sound structure field
40 Input device

Claims

A video editing support apparatus that supports editing when a user edits a material video, which is video content, to produce a user-edited video, comprising:
learning means for learning editing information, which is information on how a plurality of entity elements included in a sample video, which is edited video content, are arranged;
material accepting means for accepting input of the material video;
candidate generation means for generating, according to the editing information learned by the learning means, information indicating candidates for the arrangement, in the user-edited video, of entity elements included in the material video accepted by the material accepting means or of entity elements to be added to the user-edited video; and
candidate output means for outputting the information indicating the arrangement candidates generated by the candidate generation means.

The video editing support apparatus according to claim 1, further comprising instruction receiving means for receiving an instruction from the user,
wherein the candidate generation means:
when the instruction receiving means receives an instruction selecting an entity element of a predetermined unit included in the material video, generates the information indicating the arrangement candidates by selecting, according to the editing information, candidates for the time position at which the selected entity element should be placed from among the time positions on the user-edited video; and
when the instruction receiving means receives an instruction selecting a time position on the user-edited video, generates the information indicating the arrangement candidates by selecting, according to the editing information, candidates for the entity element to be placed at the selected time position from among the plurality of entity elements included in the material video.

The video editing support apparatus according to claim 1, further comprising material feature extraction means for extracting signal features of each of the moving images and sounds that are entity elements included in the material video accepted by the material accepting means,
wherein the learning means includes:
sample feature extraction means for extracting signal features of each of the moving images and sounds that are entity elements included in the sample video; and
analysis means for obtaining, by analyzing the signal features extracted by the sample feature extraction means, information indicating the time-series configuration of each of the moving images and sounds included in the sample video and information indicating the synchronous configuration of the moving images and sounds,
the learning means learns, as the editing information, the information indicating the time-series configuration and the information indicating the synchronous configuration obtained by the analysis means, and
the candidate generation means generates information indicating candidates for arranging a moving image or sound of a predetermined unit in the user-edited video by comparing the information indicating the time-series configuration or the information indicating the synchronous configuration obtained by the analysis means with the signal features extracted by the material feature extraction means.

The video editing support apparatus according to claim 3, wherein the analysis means obtains, as the information indicating the synchronous configuration, information indicating the relationship between boundaries between adjacent moving images divided into predetermined units and boundaries between adjacent sounds divided into predetermined units, and information indicating the relationship between the features of a moving image and the features of the sound added to that moving image.

The video editing support apparatus according to claim 3, wherein the analysis means obtains, as the information indicating the time-series configuration, information indicating characteristics relating to the signal features of each of adjacent moving images divided into predetermined units and the signal features of each of adjacent sounds divided into predetermined units.

The video editing support apparatus according to claim 3, further comprising sample receiving means for receiving the sample video and a source video, which is the video before being edited into the sample video,
wherein the sample feature extraction means further extracts signal features of each of the moving images and sounds that are entity elements included in the source video, and
the analysis means further obtains, by comparing the signal features of the moving images or sounds included in the sample video with the signal features of the moving images or sounds included in the source video, information indicating what features the moving images of the source video that were extracted as moving images of the sample video have.

The video editing support apparatus according to claim 3, further comprising:
instruction receiving means for receiving an instruction from the user;
editing execution means for obtaining the user-edited video by arranging, after the candidate output means outputs the information indicating the arrangement candidates, two or more entity elements included in the material video according to the instruction received by the instruction receiving means; and
video output means for visualizing and outputting the arrangement of the entity elements in the user-edited video obtained by the editing execution means, by arranging images in a manner corresponding to the signal features extracted by the sample feature extraction means.

The video editing support apparatus according to claim 3, further comprising storage means for storing a plurality of signal features of additional sounds, which are entity elements to be added to the user-edited video,
wherein the candidate generation means further generates information indicating candidates for arranging the additional sounds to be added to the user-edited video by comparing the editing information learned by the learning means with the plurality of signal features stored in the storage means, and
the candidate output means further outputs the information indicating the candidates for arranging the additional sounds generated by the candidate generation means.

The video editing support apparatus according to claim 3, further comprising storage means for storing a plurality of signal features of visual effects, which are entity elements to be added to the user-edited video,
wherein the learning means learns, as the editing information, information on how visual effects are arranged, the analysis means obtaining, by analyzing the signal features of the moving images extracted by the extraction means, information on how the visual effects added to the moving images, which are entity elements included in the sample video, are arranged,
the candidate generation means generates information indicating candidates for arranging the visual effects to be added to the user-edited video by comparing the editing information learned by the learning means with the plurality of signal features stored in the storage means, and
the candidate output means further outputs the information indicating the candidates for arranging the visual effects selected by the candidate generation means.

The video editing support apparatus according to claim 1, further comprising:
instruction receiving means for receiving an instruction from the user;
editing execution means for obtaining the user-edited video by arranging, after the candidate output means outputs the information indicating the arrangement candidates, two or more entity elements included in the material video according to the instruction received by the instruction receiving means; and
video output means for outputting the user-edited video obtained by the editing execution means.

A method for supporting editing when a user edits a material video, which is video content, to produce a user-edited video, comprising:
a learning step of learning editing information, which is information on how a plurality of entity elements included in a sample video, which is edited video content, are arranged;
a material receiving step of receiving input of the material video;
a candidate generation step of generating, according to the editing information learned in the learning step, information indicating candidates for the arrangement, in the user-edited video, of entity elements included in the material video received in the material receiving step or of entity elements to be added to the user-edited video; and
a candidate output step of outputting the information indicating the arrangement candidates generated in the candidate generation step.

A program for supporting editing when a user edits a material video, which is video content, to produce a user-edited video, the program causing a computer to execute:
a learning step of learning editing information, which is information on how a plurality of entity elements included in a sample video, which is edited video content, are arranged;
a material receiving step of receiving input of the material video;
a candidate generation step of generating, according to the editing information learned in the learning step, information indicating candidates for the arrangement, in the user-edited video, of entity elements included in the material video received in the material receiving step or of entity elements to be added to the user-edited video; and
a candidate output step of outputting the information indicating the arrangement candidates generated in the candidate generation step.