JP2006014084A - Video editing apparatus, video editing program, recording medium, and video editing method

Info

Publication number: JP2006014084A
Application number: JP2004190280A
Authority: JP
Inventors: Atsuo Yoshitaka (吉高 淳夫); Yoshinori Deguchi (出口 嘉紀)
Original assignee: Hiroshima University NUC
Current assignee: Hiroshima University NUC
Priority date: 2004-06-28
Filing date: 2004-06-28
Publication date: 2006-01-12
Anticipated expiration: 2024-06-28
Also published as: JP4032122B2

Abstract

PROBLEM TO BE SOLVED: To realize a summary video creation apparatus that creates a summary video from which a viewer can accurately and easily grasp the content of an entire video.

SOLUTION: A summary video creation apparatus 1 comprises: a shot analysis section 12 that recognizes, for each part of a video, features corresponding to the length of shot duration based on video data 51; a video analysis section 13 that recognizes, for each part of the video, features corresponding to the intensity of motion in the video based on the video data 51; a block extraction section 17 that identifies, based on the recognition results, the blocks of the video data 51 that are intensified blocks (action block, tense block, relaxed block); a subordination degree detection section 18 that detects a subordination degree 63 among the intensified blocks based on the recognition results; and a summary video generating section 19 that determines, based on the recognition results and the detection result, the portions of the intensified blocks to be adopted into the summary video.

COPYRIGHT: (C) 2006, JPO & NCIPI

Description

The present invention relates to a video editing apparatus for creating a summary video from a video that has a story, such as a movie or a television drama, and to a video editing program, a computer-readable recording medium on which the video editing program is recorded, and a video editing method.

With increasing communication speeds on the Internet, video distribution and digital broadcasting are becoming commonplace, and video recorders with built-in HDDs have become widespread. Users can therefore obtain many videos over the Internet, store them, and watch them, and as a result must choose the videos they want to watch from among many. Video summarization is one technique aimed at letting a user understand the content and atmosphere of a video in a short time.

Videos come in many kinds, including dramas, movies, sports, news, and music programs. Movies and dramas in particular are long, so a summary video that makes the content easy to understand in a short time is useful to the user. Summary videos are especially useful, for example, when browsing a stored collection of movies, or when a film critic writing an introduction or review of a movie seen in the past wants to recall its content. The following techniques are known for summarizing movies.

Non-Patent Document 1 detects special events such as close-ups of main characters, gunshots and explosions, and titles and captions, and joins them to create a summary video intended as a movie trailer. Non-Patent Document 2 focuses on sections of a drama that leave a strong psychological impression and creates a summary video by cutting out psychologically important parts, such as where music starts or ends and where cuts occur frequently. Non-Patent Document 3 adopts into the summary video those sections to which the viewer is considered to have paid attention, based on a User Attention Model built from the elements that draw a viewer's visual and auditory attention.

Non-Patent Document 4, on the other hand, clusters shots by visual similarity and adopts the longest shot from each cluster into the summary video.

Non-Patent Document 5 structures a movie into shots, story units, and scenes based on image and sound features, and detects dependencies at each of these levels to create a summary video that takes the context of the movie into account.

Patent Document 1 discloses a technique for extracting video using an evaluation value for each shot or scene, created from information assigned to that shot or scene.

Patent Document 1: WO00/40011 (international publication date: July 6, 2000)
Non-Patent Document 1: R. Lienhart, S. Pfeiffer, W. Effelsberg, “Video Abstracting”, Communications of the ACM, Vol. 40, No. 12, pp. 55-62, Dec. 1997.
Non-Patent Document 2: Tsuyoshi Moriyama, Masao Sakauchi, “Generating Summary Video Based on the Psychological Content of Drama Video”, IEICE Transactions, Vol. J84-D-II, No. 6, pp. 1122-1131, Jun. 2001.
Non-Patent Document 3: Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, Mingjing Li, “A User Attention Model for Video Summarization”, Proc. of ACM Multimedia, pp. 533-542, Dec. 2002.
Non-Patent Document 4: Yihong Gong, Xin Liu, “Summarizing Video by Minimizing Visual Content Redundancies”, IEEE International Conference on Multimedia and Exposition, pp. 788-791, 2001.
Non-Patent Document 5: Kazuya Kato, Atsuo Yoshitaka, Masahito Hirakawa, “Creating Movie Summaries That Take Context into Account”, IPSJ SIG Technical Report, Vol. 2002, No. 25, pp. 25-30, Mar. 2002.
Non-Patent Document 6: Daniel Arijon, trans. Kenji Iwamoto and Taketo Deguchi, “Eiga no Bunpo” (Grammar of the Film Language), Kinokuniya Shoten, 1980.
Non-Patent Document 7: Akihito Akutsu, Yoshinobu Tonomura, “Video Analysis Using Projection Methods and Its Application to Video Handling”, IEICE Transactions, Vol. J79-D-II, No. 5, pp. 675-686, May 1996.
Non-Patent Document 8: Tomohiro Kawasaki, Atsuo Yoshitaka, Masahito Hirakawa, Tadao Ichikawa, “Proposal of Methods for Extracting Music and Sound Effects in Movies and Evaluating Their Impression”, IEICE Technical Report, MVE97-96, pp. 23-29, 1998.

The techniques disclosed in Non-Patent Documents 1 to 3 merely join together the sections in which specific features were detected. A summary video created this way is fragmentary: it is difficult to learn enough about what events occur in the video, and the relationship between what comes before and after each event is hard to follow.

The technique disclosed in Non-Patent Document 4 merely removes visually redundant shots and does not select the shots that are important for conveying the content of the video. It adopts the longest shot from each cluster into the summary video, but the longest shot is not necessarily the most important one for conveying the content.

The technique disclosed in Non-Patent Document 5 takes context into account, but because it adopts every shot that is in a dependency relationship, the summary video is unbalanced and it is difficult to learn the story of the video as a whole.

The technique disclosed in Patent Document 1 gives no specific technical content on how the information used to create the evaluation values is assigned, other than that an evaluator makes a subjective evaluation.

As described above, it is difficult with conventional techniques to create a summary video from which the content of the video can be accurately grasped.

The present invention has been made in view of the above problems, and its object is to realize a video editing apparatus and a video editing method that create a summary video from which a viewer can easily and accurately grasp the content of the entire video.

A video editing apparatus according to the present invention creates a summary video from a video containing emphasized sections that can be identified based on the length of each shot constituting the video and the intensity of motion in the video. To solve the above problems, the apparatus comprises: shot recognition means for recognizing, based on video data, a feature of each part of the video that corresponds to the length of shot duration; video recognition means for recognizing, based on the video data, a feature of each part of the video that corresponds to the intensity of motion in the video; emphasized-section identifying means for identifying, based on the recognition results of the shot recognition means and the video recognition means, the sections of the video data that are emphasized sections; dependency detection means for detecting, based on the recognition results of the shot recognition means and the video recognition means, the degree of dependency between emphasized sections; and summary creation means for determining, based on the recognition results of the shot recognition means and the video recognition means and on the detection result of the dependency detection means, the portions of the emphasized sections to be adopted into the summary video.

A video editing method according to the present invention creates a summary video from a video containing emphasized sections that can be identified based on the length of each shot constituting the video and the intensity of motion in the video. To solve the above problems, the method includes: a shot recognition process of recognizing, based on video data, a feature of each part of the video that corresponds to the length of shot duration; a video recognition process of recognizing, based on the video data, a feature of each part of the video that corresponds to the intensity of motion in the video; an emphasized-section identifying process of identifying, based on the recognition results of the shot recognition process and the video recognition process, the sections of the video data that are emphasized sections; a dependency detection process of detecting, based on the recognition results of the shot recognition process and the video recognition process, the degree of dependency between emphasized sections; and a summary creation process of determining, based on the recognition results of the shot recognition process and the video recognition process and on the detection result of the dependency detection process, the portions of the emphasized sections to be adopted into the summary video.

In videos that have a story, such as movies and television dramas, a set of techniques known as the “grammar of film” is used during shooting and editing to emphasize particular meanings and intentions. In the grammar of film, action sections, tense sections, and calm sections are set as sections emphasized in editing so that the content is conveyed effectively to the viewer. An action section is a section in which short shots follow one another and the motion in the video tends to be intense. A tense section is a section in which shot length tends to shorten gradually. A calm section is a section in which long shots follow one another and the motion in the video tends to be gentle.
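The three section definitions above can be sketched as a simple rule-based check. This is a minimal illustration only: the function name `classify_run`, the input representation (per-shot durations in seconds and motion intensities in the range 0 to 1), and all threshold values are assumptions made for the example and are not taken from the patent.

```python
def classify_run(durations, motions, short_thr=2.0, long_thr=6.0, motion_thr=0.5):
    """Label a run of consecutive shots as 'action', 'tense', 'calm', or None.

    durations: shot lengths in seconds; motions: per-shot motion intensity in [0, 1].
    All thresholds are illustrative assumptions, not values from the patent.
    """
    if len(durations) < 2 or len(durations) != len(motions):
        return None
    mean_motion = sum(motions) / len(motions)
    # Action: short shots in succession with intense motion.
    if all(d < short_thr for d in durations) and mean_motion > motion_thr:
        return "action"
    # Calm: long shots in succession with gentle motion.
    if all(d > long_thr for d in durations) and mean_motion < motion_thr:
        return "calm"
    # Tense: every shot is shorter than the one before it.
    if all(b < a for a, b in zip(durations, durations[1:])):
        return "tense"
    return None
```

The order of the checks is a design choice of this sketch: a run of short, shrinking shots with intense motion is reported as an action section rather than a tense one.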

According to the grammar of film, a cause-and-effect relationship (a dependency relationship) may hold between these sections, and joining sections that are in a dependency relationship lets the content be conveyed clearly.

Therefore, to create a summary video that accurately summarizes the whole video, the above configuration and method identify the sections emphasized in editing as emphasized sections and determine the portions to be adopted into the summary video in consideration of the dependency relationships between emphasized sections.

That is, because the above configuration and method recognize, for each part of the video, a feature corresponding to the length of shot duration and a feature corresponding to the intensity of motion, action sections, tense sections, and calm sections can be identified as emphasized sections based on these features.

The degree of the dependency relationship between emphasized sections (the degree of dependency) can be regarded as the difference between the degrees of the characteristic property of each emphasized section (action degree, tension degree, calmness degree). Because the above configuration and method recognize, for each part of the video, a feature corresponding to the length of shot duration and a feature corresponding to the intensity of motion, they can recognize the action degree, tension degree, and calmness degree of each emphasized section from these features and detect the degree of dependency between emphasized sections.

As described, the above configuration and method can recognize the action degree, tension degree, and calmness degree of each emphasized section and can also detect the degree of dependency between emphasized sections, and on this basis they determine the portions of the emphasized sections to be adopted into the summary video.

The above configuration and method can thus create a summary video that follows the grammar of film, that is, one that reflects the sections emphasized in editing and the dependency relationships between them, and from which a viewer can easily and accurately grasp the entire content.

In the above video editing apparatus, the shot recognition means may generate, as its recognition result, a feature quantity indicating shot duration and a feature quantity indicating the degree of length of shot duration, and the video recognition means may generate, as its recognition result, a feature quantity indicating the degree of intensity of motion in the video.

In this configuration, a feature quantity indicating shot duration, a feature quantity indicating the degree of length of shot duration, and a feature quantity indicating the degree of intensity of motion are generated for each part of the video. Here, the degree of length of shot duration is the length of the shots in each part relative to the video as a whole, and the degree of intensity of motion is the intensity of motion in each part relative to the video as a whole.
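One way to picture a degree that is relative to the video as a whole is to rescale each raw measurement against the range observed over the entire video. The passage above does not give a formula, so the min-max scaling below is purely an assumption of this sketch, as is the function name `relative_degree`.

```python
def relative_degree(values):
    """Scale raw per-part values (e.g. shot durations or motion magnitudes)
    to [0, 1] relative to the whole video.

    Min-max scaling is an assumption of this sketch; the passage only says
    that the degree is relative to the video as a whole.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every part is identical, so no part stands out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With this scaling, the longest shot in the video receives a length degree of 1.0 and the shortest receives 0.0, and the same function can be applied to motion measurements to obtain a degree of intensity of motion.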

As stated above, an action section is a section in which short shots follow one another and the motion in the video tends to be intense, a tense section is a section in which shot length tends to shorten gradually, and a calm section is a section in which long shots follow one another and the motion in the video tends to be gentle. Using the above feature quantities therefore makes it possible to identify emphasized sections, detect degrees of dependency, and determine the video portions to be adopted into the summary video by relatively simple computation.

Preferably, the above video editing apparatus further comprises audio recognition means for recognizing, based on audio data attached to the video data, a feature of each part of the video that corresponds to the length of duration of the instrument sound components contained in the audio; the emphasized-section identifying means identifies the sections of the video data that are emphasized sections further based on the recognition result of the audio recognition means; the dependency detection means detects the degree of dependency between emphasized sections further based on the recognition result of the audio recognition means; and the summary creation means determines the portions of the emphasized sections to be adopted into the summary video further based on the recognition result of the audio recognition means.

Video is often accompanied by audio, and in that case the characteristic properties of action sections and calm sections also appear in the duration of the instrument sound components contained in the audio. Specifically, instrument sound components tend to have short durations in action sections and long durations in calm sections.

In the above configuration, therefore, a feature corresponding to the length of duration of instrument sound components is recognized for each part of the video in addition to the feature corresponding to the length of shot duration and the feature corresponding to the intensity of motion, and emphasized sections are identified, degrees of dependency are detected, and the video portions to be adopted into the summary video are determined based on all of these. This makes it possible to create a more accurate summary video.

In the above video editing apparatus, the audio recognition means may generate, as its recognition result, a feature quantity indicating the degree of length of duration of instrument sound components.

In this configuration, a feature quantity indicating the degree of length of duration of instrument sound components is generated for each part of the video. Here, the degree of length of duration of instrument sound components is the degree of length of the notes that make up the melody.

As stated above, instrument sound components tend to have short durations in action sections and long durations in calm sections. Using the above feature quantity therefore makes it possible to identify emphasized sections, detect degrees of dependency, and determine the video portions to be adopted into the summary video by relatively simple computation.

Preferably, the above video editing apparatus further comprises subject detection means for detecting, based on the video data, the presence of a video subject in each part of the video, and the summary creation means determines the portions of the emphasized sections to be adopted into the summary video further based on the detection result of the subject detection means.

A video subject is a character or object filmed so as to occupy a relatively large part of the frame; such a subject is often an object of at least a certain size, composed of hues within a certain range, and in high contrast with its surroundings. Parts of the video in which a video subject is present are important for conveying the content to the viewer, and a summary video that preferentially adopts those parts makes the content easier to understand than one that does not take them into account.

In the above configuration, therefore, the presence of a video subject is detected for each part of the video, and the portions of the emphasized sections to be adopted into the summary video are determined based on the detection result. This makes it possible to create a more accurate summary video.

The present invention can also be realized as a video editing program that operates the above video editing apparatus by causing a computer to function as each of the above means, and as a computer-readable recording medium on which that video editing program is recorded.

As described above, the video editing apparatus according to the present invention comprises: shot recognition means for recognizing, based on video data, a feature of each part of the video that corresponds to the length of shot duration; video recognition means for recognizing, based on the video data, a feature of each part of the video that corresponds to the intensity of motion in the video; emphasized-section identifying means for identifying, based on the recognition results of the shot recognition means and the video recognition means, the sections of the video data that are emphasized sections; dependency detection means for detecting, based on the recognition results of the shot recognition means and the video recognition means, the degree of dependency between emphasized sections; and summary creation means for determining, based on the recognition results of the shot recognition means and the video recognition means and on the detection result of the dependency detection means, the portions of the emphasized sections to be adopted into the summary video.

Likewise, as described above, the video editing method according to the present invention includes: a shot recognition process of recognizing, based on video data, a feature of each part of the video that corresponds to the length of shot duration; a video recognition process of recognizing, based on the video data, a feature of each part of the video that corresponds to the intensity of motion in the video; an emphasized-section identifying process of identifying, based on the recognition results of the shot recognition process and the video recognition process, the sections of the video data that are emphasized sections; a dependency detection process of detecting, based on the recognition results of the shot recognition process and the video recognition process, the degree of dependency between emphasized sections; and a summary creation process of determining, based on the recognition results of the shot recognition process and the video recognition process and on the detection result of the dependency detection process, the portions of the emphasized sections to be adopted into the summary video.

This has the effect of making it possible to create a summary video that follows the grammar of film, that is, one that reflects the sections emphasized in editing and the dependency relationships between them, and from which a viewer can easily and accurately grasp the entire content.

The present invention is based on the “grammar of film” that creators use when shooting and editing a movie to emphasize particular meanings and intentions. So that the content is conveyed effectively to the viewer, it extracts, as sections emphasized in editing, action sections (action scenes), tense sections (tense scenes), and calm sections (calm scenes), together with the sections that are in a dependency relationship with them. Shots within those sections are then adopted into the summary video in descending order of importance so that the time constraint is satisfied. The summary video can therefore include not only the emphasized sections but also the events leading up to them. This realizes a method of creating summary videos in which the content and context of a movie are easy to understand.
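The step of adopting shots in descending order of importance under a time constraint can be sketched as a greedy selection. This is a minimal sketch under stated assumptions: the tuple layout `(start, duration, importance)`, the function name `select_shots`, and the choice to skip a shot that does not fit and keep trying less important ones are all hypothetical and are not specified in this passage.

```python
def select_shots(shots, time_limit):
    """Pick shots by descending importance until the time budget is used.

    shots: list of (start_seconds, duration_seconds, importance) tuples.
    A shot that would exceed the budget is skipped and the next most
    important shot is tried (an assumption of this sketch).
    """
    chosen = []
    used = 0.0
    for shot in sorted(shots, key=lambda s: s[2], reverse=True):
        if used + shot[1] <= time_limit:
            chosen.append(shot)
            used += shot[1]
    # Restore chronological order so the summary plays in story order.
    return sorted(chosen, key=lambda s: s[0])
```

Returning the chosen shots in chronological order keeps the summary in the original story order, which matters when the events leading up to an emphasized section are meant to stay ahead of it.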

An embodiment of the present invention is described below with reference to FIGS. 1 to 15.

1. Processing Content
1.1 Grammar of Film
Movies employ techniques that creators use during shooting and editing to emphasize particular meanings and intentions. These are called the “grammar of film” (see Non-Patent Document 6: Daniel Arijon, trans. Kenji Iwamoto and Taketo Deguchi, “Eiga no Bunpo” (Grammar of the Film Language), Kinokuniya Shoten, 1980).

According to the grammar of film, action sections, tense sections, and calm sections, which are the sections emphasized in editing, have the following characteristics. An action section is a section in which short shots follow one another and the motion in the video is intense. A tense section is a section in which shot length shortens gradually. A calm section is a section in which long shots follow one another and the motion in the video is gentle. The grammar of film also states that joining sections that are in a cause-and-effect relationship is important for conveying content effectively.

1.2 Processing Flow
Based on the grammar of film, action sections, tense sections, and calm sections are extracted as sections emphasized in editing, so that the story is conveyed effectively to the viewer. To do this, action, tension, and calmness are defined as properties of each shot based on the length of the shot and on how intense or gentle the motion in the image is. A group of consecutive shots whose value for a property stays high is then treated as an action section, a tense section, or a calm section, respectively. By extracting these three kinds of section and treating them as candidates for the summary video in descending order of their property values, the sections of the movie that were emphasized in editing can be added to the summary video, which makes the content of the movie easy to follow.
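Finding groups of consecutive shots whose property value stays high can be sketched as a run-length scan over per-shot scores. The function name `extract_sections`, the fixed threshold, and the minimum run length are assumptions of this illustration; the passage above states only that the values are continuously high.

```python
def extract_sections(scores, threshold, min_len=2):
    """Return (first_shot, last_shot) index pairs for runs of consecutive
    shots whose score is at or above the threshold.

    scores: one property value per shot (e.g. an action degree).
    threshold and min_len are illustrative assumptions.
    """
    sections = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                sections.append((start, i - 1))
            start = None
    # Close a run that extends to the final shot.
    if start is not None and len(scores) - start >= min_len:
        sections.append((start, len(scores) - 1))
    return sections
```

Running the same scan once per property (action, tension, calmness) would yield the three kinds of candidate section, which could then be ranked by their mean score.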

ここで、ショットとは一台のカメラから撮影された連続するフレームの集合のことである。またカットとは、ショットの境界のことである。 Here, a shot is a set of continuous frames taken from one camera. A cut is a shot boundary.

なお、ショットの性質として、アクション性、緊迫性、落ち着き性を定義する際には、そのショットに同期して再現される楽曲のテンポも考慮することが望ましい。 It should be noted that when defining the action property, tightness, and calmness as the nature of the shot, it is desirable to consider the tempo of the music that is reproduced in synchronization with the shot.

また、抽出した区間を要約映像に加えるか否かを判断する際には、主体（映像主体）の存在を考慮することが望ましい。主体の存在するショットは、話の内容を視聴者に伝える上で重要なショットとなり、そのショットを中心に採用した要約映像は、それを考慮しないものに比べて、映画の内容を理解しやすくなる。画像の中で強調されているオブジェクトが主体である可能性が高いことから、ある一定以上の大きさで、同一色で輝度の変化が周囲と異なるオブジェクトが存在するショットを検出する。 In addition, when determining whether or not to add an extracted section to the summary video, it is desirable to consider the presence of a subject (video subject). A shot in which a subject is present is important for conveying the content of the story to the viewer, and a summary video built mainly from such shots makes the content of the movie easier to understand than one that does not take the subject into account. Since an object emphasized in the image is likely to be the subject, shots are detected that contain an object of at least a certain size, of uniform color, whose change in luminance differs from its surroundings.

さらにアクション区間、緊迫した区間、落ち着いた区間のいずれか２つの区間が隣接している場合、それらの区間には原因と結果を表す従属関係がある。そのため、それら２つの区間を含めた要約映像は、含めない映像に比べてより文脈を理解しやすいものとなる。抽出した区間内でアクション性度合、緊迫性度合、あるいは落ち着き性度合の平均値を求め、前後の区間においてその差を求めることにより、それらの区間での従属関係の度合を求める。ここで従属関係の度合を前後の区間の値の差としているのは、前後の性質の違いが大きいほど、視聴者に強い印象を与えて内容を効果的に伝えることができるからである。 Furthermore, when any two sections of an action section, a tight section, and a calm section are adjacent to each other, these sections have a dependency relationship that represents a cause and an effect. Therefore, the summary video including these two sections is easier to understand the context than the video not including. An average value of the action degree, the tightness degree, or the calmness degree is obtained in the extracted sections, and the difference is obtained in the preceding and following sections, thereby obtaining the degree of dependency in those sections. The reason why the degree of dependency is the difference between the values of the preceding and following sections is that the greater the difference in the properties before and after, the stronger the impression the viewer can have and the more effective the contents can be conveyed.
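The dependency computation described above can be sketched as follows. This is a minimal illustration, assuming each extracted section is given as a list of per-shot degree values. The function names `section_mean` and `dependency_degree` are hypothetical, and taking the absolute value of the difference is an assumption, since the text only speaks of a "difference".

```python
def section_mean(degrees):
    """Average degree (action, tightness, or calmness) over the shots of one section."""
    return sum(degrees) / len(degrees)


def dependency_degree(prev_section, next_section):
    """Dependency degree of two adjacent sections: difference of their mean degrees.

    Assumption: the absolute value is used so that the sign does not matter.
    """
    return abs(section_mean(prev_section) - section_mean(next_section))


action_section = [0.75, 0.5, 1.0]  # per-shot action degrees (illustrative values)
calm_section = [0.25, 0.25]        # per-shot calmness degrees of the next section
print(dependency_degree(action_section, calm_section))
```

A larger value means a stronger contrast between the two adjacent sections, which the text treats as a stronger dependency.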

最後に要約映像を作成する際、映画全体から満遍なく要約映像となる映像区間を選択し、話の内容を理解しやすくするため、映画をn(=20)等分する。そしてその分割された区間の中から、視聴者が指定した制約時間を満たすように、アクション性度合、緊迫性度合、落ち着き性度合のいずれかが高く、主体が存在するショットを優先して要約映像として採用し、それと強い従属関係のある区間内の主体の存在するショットも要約映像として採用することにより、映画の内容と文脈とをより理解しやすい要約映像を作成する。 Finally, when creating the summary video, the movie is divided into n (= 20) equal parts so that the video sections used for the summary are selected evenly from the entire movie and the content of the story is easy to follow. Then, from each of the divided parts, shots that have a high action degree, tightness degree, or calmness degree and in which a subject is present are preferentially adopted for the summary video so as to satisfy the time constraint specified by the viewer. Shots in which a subject is present within a section that has a strong dependency relationship with an adopted section are also adopted. This produces a summary video that makes both the content and the context of the movie easier to understand.
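The selection step can be illustrated with a simplified greedy sketch. It is not the procedure of the embodiment: the record layout (`start`, `length`, `degree`, `has_subject`), the function name `select_summary_shots`, and the even split of the viewer's time constraint across the n parts are all assumptions made for illustration, and the handling of dependent sections is omitted.

```python
def select_summary_shots(shots, movie_length, time_limit, n=20):
    """Pick shots for the summary from each of n equal parts of the movie.

    shots: list of dicts with keys 'start', 'length' (seconds),
           'degree' (highest of the three degrees), and 'has_subject'.
    """
    per_part_budget = time_limit / n   # assumption: the time constraint is split evenly
    part_length = movie_length / n
    chosen = []
    for part in range(n):
        lo, hi = part * part_length, (part + 1) * part_length
        # Only shots that start in this part and contain a subject are candidates.
        candidates = [s for s in shots if lo <= s["start"] < hi and s["has_subject"]]
        candidates.sort(key=lambda s: s["degree"], reverse=True)
        used = 0.0
        for shot in candidates:
            if used + shot["length"] <= per_part_budget:
                chosen.append(shot)
                used += shot["length"]
    return sorted(chosen, key=lambda s: s["start"])
```

Taking candidates part by part keeps the summary spread over the whole movie, which is the stated purpose of the division into n parts.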

２．ショットの性質の定義 2. Definition of Shot Properties
２．１アクション性 2.1 Action Property
２．１．１ショットの長さによるアクション性 2.1.1 Action Property Based on Shot Length
アクション区間では、短いショットが連続するという特徴があるため、それを以下の条件で抽出し、アクション性を表す値を求める。 In an action section, short shots continue. Such shots are therefore extracted under the following condition, and a value representing the action property is obtained.

k番目のショットs_kでのショットの長さをSL(s_k)[秒]とすると、s_kでのショットの長さによるアクション性を表す値SLV_A(s_k)を数式（１）のように定義する。これは、アクションを視聴者に効果的に伝えるためには、短いショットを用いることに基づき、あるショットの長さが短いと判定された場合、アクションを表しているショットとみなし、アクション性を1とする。ここで、ショットの長さによるアクション性を2値としているのは、ショットの長さが短ければ短いほど、アクション性が高くなることは映画の文法により示されていないためである。 Let SL(s_k) [seconds] be the length of the k-th shot s_k. The value SLV_A(s_k) representing the action property based on the length of shot s_k is defined as in Equation (1). Because short shots are used to convey action effectively to the viewer, a shot judged to be short is regarded as a shot representing action, and its action property is set to 1. The action property based on shot length is binary because the grammar of the movie does not indicate that the action property increases as the shot becomes shorter.

ただし、Th_shot[秒]はショットの長さが短いことを表す閾値で、SL_mean[秒]はある映画全体のショットの長さの平均値である。SL_mode[秒]は、ショットの長さの最頻値を表す。ただし最頻値は、0.5秒間隔でショットの累積頻度を求め、その度数が最大になる0.5秒間での中間値としている。 Here, Th_shot [seconds] is the threshold below which a shot is regarded as short, SL_mean [seconds] is the mean shot length over the whole movie, and SL_mode [seconds] is the mode of the shot length. The mode is obtained by counting shots in intervals of 0.5 seconds and taking the midpoint of the 0.5-second interval with the highest count.
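The statistics named above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the tie-breaking rule for the mode is an assumption, and Th_shot is passed in as a parameter because the way Equation (1) derives it from SL_mean and SL_mode is not reproduced in this text.

```python
from collections import Counter


def shot_length_mean(shot_lengths):
    """SL_mean: mean shot length over the whole movie, in seconds."""
    return sum(shot_lengths) / len(shot_lengths)


def shot_length_mode(shot_lengths, bin_width=0.5):
    """SL_mode: midpoint of the most populated bin of width bin_width seconds."""
    bins = Counter(int(length // bin_width) for length in shot_lengths)
    # Assumption: if several bins share the highest count, the shortest one is used.
    best_bin = max(bins, key=lambda b: (bins[b], -b))
    return (best_bin + 0.5) * bin_width


def slv_a(shot_length, th_shot):
    """SLV_A: 1 if the shot is shorter than the threshold Th_shot, otherwise 0."""
    return 1 if shot_length < th_shot else 0


lengths = [1.2, 1.4, 1.3, 3.0, 6.1]  # illustrative shot lengths in seconds
print(shot_length_mean(lengths), shot_length_mode(lengths))
```

Using the bin midpoint rather than a raw most-frequent value keeps the mode meaningful even though exact shot lengths rarely repeat.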

２．１．２画像内の変化によるアクション性 2.1.2 Action Property Based on Changes in the Image
図１に示す時空間投影画像（非特許文献７：阿久津明人, 外村佳伸, “投影法を用いた映像の解析手法と映像ハンドリングへの応用”, 電子情報通信学会論文誌, Vol. J79-D-II, No. 5, pp. 675-686, May 1996.参照）は、映像中のオブジェクトやカメラワークによって生じる動きを可視化した画像であるため、非特許文献７ではカメラワークを検出する際に用いられている。 The spatiotemporal projection image shown in FIG. 1 (see Non-Patent Document 7: Akihito Akutsu, Yoshinobu Tonomura, "Image analysis method using projection method and its application to image handling", IEICE Transactions, Vol. J79-D-II, No. 5, pp. 675-686, May 1996) is an image that visualizes the motion caused by objects and camera work in the video. In Non-Patent Document 7 it is therefore used to detect camera work.

本実施形態では、時空間投影画像中に、画像の動きの激しさに伴う特徴が現れることに着目し、その特徴を検出することによってアクション性を求める。なお、本実施形態では、水平方向の時空間投影画像を利用する。水平方向の時空間投影画像は、図１に示すように、フレームの並びを横方向（図１中ｆ方向、以下「時間軸方向」という）にとり、映像における水平方向のピクセルの並びを縦方向（図１中ｘ方向、以下「画像走査方向」という）にとったものである。 In the present embodiment, attention is paid to the appearance of a feature associated with the intensity of image movement in the spatiotemporal projection image, and the action property is obtained by detecting the feature. In the present embodiment, a spatiotemporal projection image in the horizontal direction is used. As shown in FIG. 1, in the horizontal spatiotemporal projection image, the arrangement of frames is taken in the horizontal direction (the f direction in FIG. 1, hereinafter referred to as “time axis direction”), and the arrangement of horizontal pixels in the video is taken in the vertical direction. (The x direction in FIG. 1, hereinafter referred to as “image scanning direction”).

映像の動きが激しい場合、図２（ａ）（ｂ）に示すように時空間投影画像上では画像走査方向のエッジが現れる。 When the motion of the image is intense, as shown in FIGS. 2A and 2B, an edge in the image scanning direction appears on the spatiotemporal projection image.

ショットs_kでの時空間投影画像における画像走査方向のエッジの数をE_v(s_k)とすると、時空間投影画像によるアクション性を表す値VTIV_A(s_k)を数式（２）のように定義する。数式（２）では、映像内の激しさを単位時間に現れるエッジの数として表している。これは、アクション区間で映像内の動きが激しいほど、時空間投影画像中に現れる画像走査方向のエッジの数が多くなることに基づいている。 Let E_v(s_k) be the number of edges in the image scanning direction in the spatiotemporal projection image for shot s_k. The value VTIV_A(s_k) representing the action property based on the spatiotemporal projection image is defined as in Equation (2). Equation (2) expresses the intensity of motion in the video as the number of edges that appear per unit time. This is based on the fact that the more intense the motion in the video during an action section, the more edges in the image scanning direction appear in the spatiotemporal projection image.

２．１．３音楽によるアクション性 2.1.3 Action Property Based on Music
図３に示すようにサウンドスペクトログラム上に現れる時間軸（横軸）に沿った周波数ピークを示す楽器音成分を検出することにより、ある時間間隔における楽器音成分の数により音楽が流れていることを判定することができる（非特許文献８：川崎智広, 吉高淳夫, 平川正人, 市川忠男, “映画における音楽、効果音の抽出及び印象評価手法の提案”, 信学技報, MVE97-96, pp. 23-29, 1998.参照）。 As shown in FIG. 3, instrument sound components appear on the sound spectrogram as frequency peaks running along the time axis (horizontal axis). By detecting them, it is possible to determine from the number of instrument sound components in a given time interval whether music is playing (see Non-Patent Document 8: Tomohiro Kawasaki, Atsuo Yoshitaka, Masato Hirakawa, Tadao Ichikawa, "Proposal of Extracting Music and Sound Effects and Impression Evaluation Techniques in Movies", IEICE Technical Report, MVE97-96, pp. 23-29, 1998).

本実施形態では、音楽の特徴がその楽器音成分の継続時間に表れることに着目し、その時間によって音楽の性質を検出する。実験により、アクション区間で流れている音楽は、楽器音成分の継続時間が短い傾向にあることを確認している。また、音楽の中でベースに分類される楽器は楽曲のテンポを知る指標になるため、ベースが担う周波数帯の楽器音成分に着目する。映画では、オーケストラで演奏された楽曲が流れることが多いため、オーケストラでベースを担う楽器の周波数帯(30-300Hz)の楽器音成分の継続時間を指標とする。 In the present embodiment, attention is paid to the fact that the feature of music appears in the duration of the instrument sound component, and the nature of the music is detected based on that time. Through experiments, it has been confirmed that the music flowing in the action section tends to have a short duration of instrument sound components. In addition, since musical instruments classified as bass in music serve as indices for knowing the tempo of music, attention is paid to musical instrument sound components in the frequency band that the bass plays. In movies, music played in an orchestra often flows, so the duration of instrument sound components in the frequency band (30-300 Hz) of the instrument that plays the bass in the orchestra is used as an index.

ショットs_kでの楽器音成分の長さをIL(s_k) [秒]とし、楽器音成分の継続時間が短いことを判定する閾値をTh_instA[秒]とすると、音楽により表現されるアクション性を表す値MV_A(s_k)を数式（３）のように定義する。ただし、Th_instAは実験により求めた値で1.24[秒]とした。 Let IL(s_k) [seconds] be the length of the instrument sound component in shot s_k, and let Th_instA [seconds] be the threshold for judging that the duration of the instrument sound component is short. The value MV_A(s_k) representing the action property expressed by music is defined as in Equation (3). Th_instA was determined experimentally and set to 1.24 [seconds].

２．１．４アクション性 2.1.4 Action Property
以上で求めた各特徴によるアクション性を表す値に基づき、ショットs_kでのアクション性度合Action(s_k)を数式（４）のように表す。以上で求めた3つの値に基づき、ショットs_kでのアクション性度合を求めるが、ある要素のみが必ずアクション区間に表れるのではなく、各要素が満たされる可能性があるため、各要素の平均を求めアクション性度合としている。 Based on the values representing the action property of each feature obtained above, the action degree Action(s_k) of shot s_k is expressed as in Equation (4). The action degree of shot s_k is obtained from the three values above. Since no single element is guaranteed to appear in an action section and any of the elements may be satisfied, the average of the elements is used as the action degree.
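Since Equation (4) itself is not reproduced in this text, the averaging it describes can be sketched as follows. The function name `action_degree` is hypothetical, and the sketch assumes that the three input values are already on comparable scales (for example, that VTIV_A has been normalized), which the text does not state.

```python
def action_degree(slv_a, vtiv_a, mv_a):
    """Action degree of one shot: the mean of the three per-feature action values.

    slv_a:  action value from shot length (0 or 1)
    vtiv_a: action value from the spatiotemporal projection image
    mv_a:   action value from music (0 or 1)
    """
    return (slv_a + vtiv_a + mv_a) / 3


# A short shot with moderate motion and no fast music (illustrative values).
print(action_degree(1, 0.5, 0))
```

Averaging lets a shot score reasonably high when only some of the three cues are present, which matches the stated reason for not requiring any single cue.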

２．２緊迫性 2.2 Tightness
緊迫した区間ではショットの長さが徐々に短くなるという特徴がある。その特徴に基づいて緊迫した区間を抽出する。また、緊迫した区間内でショットの平均時間が短いほど、緊迫性が高く感じられるため、それを緊迫性度合として、Tension(s_k)を数式（５）のように定義する。ただし、SL_Tensionは緊迫した区間内でのショットの長さの平均値、nは緊迫した区間内のショットの数、m_iはk番目のショットからの変位を表す。なお、緊迫性度合は、緊迫した区間、つまりショットの長さが徐々に短くなるという条件を満たす区間においてのみ定義する。 In a tight section, the shot length gradually becomes shorter. Tight sections are extracted on the basis of this feature. In addition, the shorter the average shot duration within a tight section, the stronger the sense of tightness. This is taken as the tightness degree, and Tension(s_k) is defined as in Equation (5). Here, SL_Tension is the average shot length within the tight section, n is the number of shots in the tight section, and m_i is the displacement from the k-th shot. The tightness degree is defined only in a tight section, that is, a section satisfying the condition that the shot length gradually becomes shorter.
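The extraction condition for tight sections can be sketched as a search for runs of shots whose length keeps decreasing. This is an illustrative sketch: the function names and the minimum run length `min_shots` are assumptions, and the tightness degree of Equation (5) is not computed because the equation is not reproduced in this text. Only the average shot length SL_Tension that it depends on is shown.

```python
def tight_sections(shot_lengths, min_shots=3):
    """Return (start, end) index pairs, end exclusive, of runs of strictly decreasing shot length."""
    sections = []
    start = 0
    for i in range(1, len(shot_lengths) + 1):
        # A run ends at the end of the list or when the length stops decreasing.
        if i == len(shot_lengths) or shot_lengths[i] >= shot_lengths[i - 1]:
            if i - start >= min_shots:
                sections.append((start, i))
            start = i
    return sections


def mean_shot_length(shot_lengths, section):
    """Average shot length inside one section (SL_Tension in the text)."""
    start, end = section
    return sum(shot_lengths[start:end]) / (end - start)


lengths = [5.0, 4.0, 3.0, 2.0, 6.0, 5.5, 7.0]  # illustrative shot lengths in seconds
for section in tight_sections(lengths):
    print(section, mean_shot_length(lengths, section))
```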

２．３落ち着き性 2.3 Calmness
２．３．１ショットの長さによる落ち着き性 2.3.1 Calmness Based on Shot Length
落ち着いた区間では、長いショットが連続するという特徴があるため、それを以下の条件で抽出し、落ち着き性を表す値を求める。 In a calm section, long shots continue. Such shots are therefore extracted under the following condition, and a value representing the calmness is obtained.

ショットs_kでのショットの長さによる落ち着き性を表す値SLV_C(s_k)を数式（６）のように定義する。これは、落ち着いた雰囲気を視聴者に効果的に伝えるためには、長いショットを用いるということに基づき、あるショットの長さが長いと判定された場合、落ち着いた感じを表しているショットとみなし、落ち着き性を1とする。ここで、ショットの長さによる落ち着き性を2値としているのは、ショットの長さが長ければ長いほど、落ち着き性が高くなることは映画の文法により示されていないためである。 The value SLV_C(s_k) representing the calmness based on the length of shot s_k is defined as in Equation (6). Because long shots are used to convey a calm atmosphere effectively to the viewer, a shot judged to be long is regarded as a shot expressing a calm feeling, and its calmness is set to 1. The calmness based on shot length is binary because the grammar of the movie does not indicate that the calmness increases as the shot becomes longer.

２．３．２画像内の動きによる落ち着き性 2.3.2 Calmness Based on Motion in the Image
落ち着いた区間では、映像内でオブジェクトやカメラワークによる動きがあまり見られないため、時空間投影画像上には時間軸方向に沿ってエッジが存在する。そのエッジの平らさを検出することによって落ち着き性を定義する。この場合、平らさの尺度が落ち着き性を表す値とする。 In a calm section, little motion due to objects or camera work is seen in the video, so edges run along the time axis direction on the spatiotemporal projection image. The calmness is defined by detecting how flat these edges are, and the measure of flatness is used as the value representing the calmness.

ショットs_kでの平らさの尺度を求めるには、時空間投影画像上でエッジとなる部分を追跡し、図４（ａ）に示す値を図４（ｂ）に示す追跡順序に従って加算していく。 To obtain the measure of flatness for shot s_k, the pixels forming an edge on the spatiotemporal projection image are tracked, and the values shown in FIG. 4(a) are added up following the tracking order shown in FIG. 4(b).

具体的には次のとおりである。まず、時空間投影画像に対して時間軸方向のエッジ強調を行い、エッジの有無に応じて二値化した画像（時間軸方向エッジ強調画像）を作成する。そして、この時間軸方向エッジ強調画像において、エッジに相当するピクセルを注目ピクセルとし、そのエッジを時間軸方向に追跡していく。エッジを追跡するためには、図４（ｂ）の追跡順序に従って最初にピクセルが検出される位置をエッジの移動先とする。そして、注目ピクセルに対する移動先のピクセルの位置に応じて図４（ａ）のように設定されている数値（スコア）を取得し、上記移動先のピクセルを新たな注目ピクセルとして上記追跡を繰り返す。このようにして追跡とともに取得していくスコアを順次加算し、この加算結果を追跡したピクセル数で除算することにより求めた値を平らさの尺度とする。 Specifically, the procedure is as follows. First, edge enhancement in the time axis direction is performed on the spatiotemporal projection image, and an image binarized according to the presence or absence of an edge (time axis direction edge enhanced image) is created. Then, in this time axis direction edge enhanced image, a pixel corresponding to an edge is set as the pixel of interest, and the edge is tracked in the time axis direction. To track the edge, the position at which an edge pixel is first found when searching in the tracking order of FIG. 4(b) is taken as the destination of the edge. Next, the numerical value (score) assigned as shown in FIG. 4(a) to the position of the destination pixel relative to the pixel of interest is acquired, and the tracking is repeated with the destination pixel as the new pixel of interest. The scores acquired during tracking are added up in sequence, and the value obtained by dividing the sum by the number of tracked pixels is used as the measure of flatness.

スコアの加算結果をSum(s_k)、追跡ピクセル数をN(s_k)とすると、ショットs_kでの時空間投影画像による落ち着き性を表す値VTIV_C(s_k)を数式（７）のように定義する。VTIV_C(s_k)は、エッジが時間軸方向の直線となる場合、最大値1をとり、図４（ｂ）の追跡順序において7、あるいは9の位置に繰り返しエッジとなる部分が存在する場合、最小値0をとる。 Let Sum(s_k) be the result of adding the scores and N(s_k) the number of tracked pixels. The value VTIV_C(s_k) representing the calmness based on the spatiotemporal projection image for shot s_k is defined as in Equation (7). VTIV_C(s_k) takes its maximum value 1 when the edge is a straight line in the time axis direction, and its minimum value 0 when the next edge pixel is repeatedly found at position 7 or 9 in the tracking order of FIG. 4(b).

２．３．３音楽による落ち着き性 2.3.3 Calmness Based on Music
楽器音成分の継続時間により、落ち着き性を判定する。実験により、落ち着いた区間で流れている音楽は、楽器音成分の継続時間が長い傾向があることを確認している。 Calmness is determined from the duration of the instrument sound components. Experiments have confirmed that music playing in a calm section tends to have instrument sound components of long duration.

ショットs_kで楽器音成分の継続時間が長いことを判定する閾値をTh_instC[秒]とすると、音楽による落ち着き性を表す値MV_C(s_k)を数式（８）のように定義する。ただし、Th_instCは実験により求めた値で1.40[秒]とした。 Let Th_instC [seconds] be the threshold for judging that the duration of the instrument sound component in shot s_k is long. The value MV_C(s_k) representing the calmness based on music is defined as in Equation (8). Th_instC was determined experimentally and set to 1.40 [seconds].

２．３．４落ち着き性 2.3.4 Calmness
以上で求めた各特徴による落ち着き性を表す値に基づき、ショットs_kでの落ち着き性度合Calm(s_k)を数式（９）のように定義する。以上で求めた3つの値に基づき、ショットs_kでの落ち着き性度合を求めるが、ある要素のみが必ず落ち着いた区間に表れるのではなく、各要素が満たされる可能性があるため、各要素の平均を求め落ち着き性度合としている。 Based on the values representing the calmness of each feature obtained above, the calmness degree Calm(s_k) of shot s_k is defined as in Equation (9). The calmness degree of shot s_k is obtained from the three values above. Since no single element is guaranteed to appear in a calm section and any of the elements may be satisfied, the average of the elements is used as the calmness degree.

３．装置構成および処理手順 3. Device Configuration and Processing Procedure
３．１装置構成 3.1 Device Configuration
図５のブロック図は、本実施形態における要約映像作成装置１の構成を示している。要約映像作成装置１は、制御部２、記憶部３、データ入力部４、操作部５、データ出力部６を備えて構成されている。 The block diagram of FIG. 5 shows the configuration of the summary video creation device 1 in this embodiment. The summary video creation device 1 includes a control unit 2, a storage unit 3, a data input unit 4, an operation unit 5, and a data output unit 6.

制御部２は、所定のプログラムの命令を実行するＣＰＵ（central processing unit）、プログラムを展開するＲＡＭ（random access memory）、プログラムやデータを格納したＲＯＭ（read only memory）などを備えたコンピュータによって構成されている。そして、制御部２は、映像編集プログラムを実行することにより、カット検出部１１、ショット分析部１２、映像分析部１３、音声分析部１４、主体検出部１５、指標生成部１６、区間抽出部１７、従属度検出部１８、要約映像生成部１９の各部として機能する。 The control unit 2 is configured by a computer including a CPU (central processing unit) that executes the instructions of a predetermined program, a RAM (random access memory) into which the program is loaded, a ROM (read only memory) that stores programs and data, and the like. By executing the video editing program, the control unit 2 functions as the cut detection unit 11, the shot analysis unit 12, the video analysis unit 13, the audio analysis unit 14, the subject detection unit 15, the index generation unit 16, the section extraction unit 17, the dependency level detection unit 18, and the summary video generation unit 19.

上記映像編集プログラムは、そのプログラムを記録した記録媒体から上記コンピュータに供給することができる。この映像編集プログラムを記録した記録媒体は、上記コンピュータと分離可能に構成してもよく、上記コンピュータに組み込むようになっていてもよい。この記録媒体は、記録したプログラムコードをコンピュータが直接読み取ることができるようにコンピュータに装着されるものであっても、外部記憶装置としてコンピュータに接続されたプログラム読み取り装置を介して読み取ることができるように装着されるものであってもよい。 The video editing program can be supplied to the computer from a recording medium on which the program is recorded. The recording medium on which the video editing program is recorded may be configured to be separable from the computer, or may be built into the computer. The recording medium may be mounted in the computer so that the computer can read the recorded program code directly, or it may be mounted so that the code is read via a program reading device connected to the computer as an external storage device.

上記記録媒体としては、例えば、磁気テープ、フレキシブルディスク、ハードディスク、ＣＤ−ＲＯＭ、ＭＯ、ＭＤ、ＤＶＤ、ＣＤ−Ｒ、ＩＣカード、各種ＲＯＭなどを用いることができる。 As the recording medium, for example, magnetic tape, flexible disk, hard disk, CD-ROM, MO, MD, DVD, CD-R, IC card, various ROMs and the like can be used.

なお、制御部２を通信ネットワークと接続可能に構成し、上記プログラムコードを通信ネットワークを介して供給してもよい。つまり、上記映像編集プログラムは、上記プログラムコードが電子的な伝送で具現化された搬送波あるいはデータ信号列の形態をとって供給されることもある。 The control unit 2 may be configured to be connectable to a communication network, and the program code may be supplied via the communication network. That is, the video editing program may be supplied in the form of a carrier wave or a data signal sequence in which the program code is embodied by electronic transmission.

なお、本実施形態では、コンピュータと映像編集プログラムとによって制御部２の上記各部を実現することを想定しているが、ハードウェアによって制御部２の上記各部を構成してもよい。 In the present embodiment, it is assumed that the respective units of the control unit 2 are realized by a computer and a video editing program. However, the respective units of the control unit 2 may be configured by hardware.

記憶部３は、ハードディスクによって構成され、外部から供給される映像データや、制御部２の実行する処理によって生成されたデータなどを記憶する。なお、記憶部３に記憶されるものとして図５に図示している各種データの一部は、記憶部３に記憶する代わりに、制御部２内部のＲＡＭ等に記憶するようにしてもよい。また、記憶部３は、ハードディスクに限らず、上記データを記憶することができる記憶装置であればよい。 The storage unit 3 is configured by a hard disk, and stores video data supplied from the outside, data generated by processing executed by the control unit 2, and the like. Note that some of the various data illustrated in FIG. 5 as stored in the storage unit 3 may be stored in a RAM or the like inside the control unit 2 instead of being stored in the storage unit 3. The storage unit 3 is not limited to a hard disk, but may be any storage device that can store the data.

データ入力部４は、外部から要約映像作成装置１に対して供給される映像データを要約映像作成装置１内部へ入力するためのものであり、データ出力部６は、要約映像作成装置１において作成した要約映像データを要約映像作成装置１の外部へ出力するためのものである。 The data input unit 4 is for inputting video data supplied from the outside to the summary video creation device 1 into the summary video creation device 1. The data output unit 6 is created by the summary video creation device 1. The summary video data is output to the outside of the summary video creation device 1.

操作部５は、要約映像作成装置１の操作者の操作入力を受け付け、その操作入力に応じた信号を制御部２に対して出力するものである。 The operation unit 5 receives an operation input from the operator of the summary video creating apparatus 1 and outputs a signal corresponding to the operation input to the control unit 2.

要約映像作成装置１の各部の機能や動作の詳細については、フローチャートに基づいて以下に説明する。 Details of the functions and operations of each unit of the summary video creation device 1 will be described below based on a flowchart.

３．２全体の流れ 3.2 Overall Flow
図６のフローチャートに基づいて、要約映像作成装置１における全体的な処理の流れについて説明する。 The overall processing flow in the summary video creation device 1 will be described based on the flowchart of FIG. 6.

まず、データ入力部４を介して映像データが入力されると、記憶部３に映像データ５１として記憶される（ステップＳ１）。そして、カット検出部１１により、映像データ５１に基づいて当該映像に含まれるカットを検出し、そのカット位置を記憶部３にカット位置５２として記憶させる（ステップＳ２）。カット位置５２は、例えば映像における先頭からの経過時間によって表すことができる。このカット位置５２に基づいて、ショット分析部１２により、各ショットの長さを検出する（ステップＳ３）。 First, when video data is input via the data input unit 4, it is stored as video data 51 in the storage unit 3 (step S1). Then, the cut detection unit 11 detects a cut included in the video based on the video data 51, and stores the cut position as the cut position 52 in the storage unit 3 (step S2). The cut position 52 can be represented by, for example, the elapsed time from the beginning in the video. Based on the cut position 52, the shot analysis unit 12 detects the length of each shot (step S3).

そして、映像分析部１３により、映像データ５１に基づいて当該映像の時空間投影画像５３（図２（ａ）参照）を作成して記憶部３に記憶させるとともに（ステップＳ４）、映像分析部１３により、時空間投影画像５３に基づいて映像の動きを検出する（ステップＳ５）。 Then, the video analysis unit 13 creates a spatiotemporal projection image 53 (see FIG. 2A) of the video based on the video data 51 and stores it in the storage unit 3 (step S4), and the video analysis unit 13 also detects the motion of the video based on the spatiotemporal projection image 53 (step S5).

また、音声分析部１４により、映像データ５１に含まれる音声データに基づいて当該映像に付加されている音声のサウンドスペクトログラム５４（図３参照）を作成して記憶部３に記憶させるとともに（ステップＳ６）、音声分析部１４により、サウンドスペクトログラム５４に基づいて映像に付加されている音楽の性質を検出する（ステップＳ７）。 Further, the audio analysis unit 14 creates a sound spectrogram 54 (see FIG. 3) of the audio attached to the video based on the audio data included in the video data 51 and stores it in the storage unit 3 (step S6), and the audio analysis unit 14 also detects the nature of the music attached to the video based on the sound spectrogram 54 (step S7).

また、主体検出部１５により、映像における主体の有無を検出する（ステップＳ８）。 Further, the subject detection unit 15 detects the presence or absence of a subject in the video (step S8).

そして、指標生成部１６により、ステップＳ３，Ｓ５，Ｓ７の検出結果に基づいて、アクション性度合、緊迫性度合、落ち着き性度合を生成するとともに、区間抽出部１７により、アクション区間、緊迫した区間、落ち着いた区間を抽出する（ステップＳ９）。また、従属度検出部１８により、各区間の従属関係を検出する（ステップＳ１０）。そして、ステップＳ９において抽出した区間やステップＳ１０において検出した各区間の従属関係に基づいて、要約映像生成部１９によりショットを採用することにより要約映像を作成する（ステップＳ１１）。 Then, based on the detection results of steps S3, S5, and S7, the index generation unit 16 generates the action degree, the tightness degree, and the calmness degree, and the section extraction unit 17 extracts the action sections, tight sections, and calm sections (step S9). The dependency level detection unit 18 then detects the dependency relationship between the sections (step S10). Finally, based on the sections extracted in step S9 and the dependency relationships detected in step S10, the summary video generation unit 19 creates the summary video by adopting shots (step S11).

以下では、上記各ステップＳについてより詳細に説明する。なお、上記ステップＳ２のカットの検出処理、およびステップＳ６のサウンドスペクトログラムの作成処理は周知の処理を利用することができるので、ここでは詳細な説明を省略する。 Below, each said step S is demonstrated in detail. The cut detection process in step S2 and the sound spectrogram creation process in step S6 can use well-known processes, and thus detailed description thereof is omitted here.

３．３ショット長さの検出 3.3 Shot Length Detection
図７のフローチャートに基づいて、ショット分析部１２によるショット長さの検出処理について説明する。 The shot length detection process performed by the shot analysis unit 12 will be described based on the flowchart of FIG. 7.

ショット分析部１２は、カット位置５２に基づくことにより、各ショットのショット長さSL(s_k)を計算する（ステップＳ００１）。 The shot analysis unit 12 calculates the shot length SL (s _k ) of each shot based on the cut position 52 (step S001).
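Step S001 can be sketched as follows, assuming the cut positions are stored as elapsed seconds from the start of the video, as suggested for the cut position 52. The function name `shot_lengths` and the explicit `video_duration` argument are hypothetical.

```python
def shot_lengths(cut_positions, video_duration):
    """Shot lengths SL(s_k) in seconds, from cut positions given as elapsed seconds."""
    # Shots are delimited by the start of the video, each cut, and the end of the video.
    boundaries = [0.0] + sorted(cut_positions) + [video_duration]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Two cuts in a 9-second clip give three shots.
print(shot_lengths([2.0, 5.5], 9.0))
```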

そして、ショット分析部１２は、計算したショット長さSL(s_k)が閾値Th_shotよりも大きい場合には（Ｓ００２）、落ち着き性が高いと判定してSLV_C(s_k)=1とし（ステップＳ００３、数式（６）参照）、計算したショット長さSL(s_k)が閾値Th_shotよりも小さい場合には（Ｓ００４）、アクション性が高いと判定してSLV_A(s_k)=1とする（ステップＳ００５、数式（１）参照）。 When the calculated shot length SL(s_k) is larger than the threshold Th_shot (S002), the shot analysis unit 12 determines that the calmness is high and sets SLV_C(s_k) = 1 (step S003, see Equation (6)). When the calculated shot length SL(s_k) is smaller than the threshold Th_shot (S004), it determines that the action property is high and sets SLV_A(s_k) = 1 (step S005, see Equation (1)).

このように、ショット分析部１２は、ショットの継続時間を示す特徴量（SL(s_k)）と、ショットの継続時間の長さ度合を示す特徴量（SLV_C(s_k)，SLV_A(s_k)）とを生成する。ショットの継続時間の長さ度合とは、映像全体に対する各部のショットの相対的な長さの度合である。なお、ショット分析部１２の生成するSL(s_k)、SLV_C(s_k)、SLV_A(s_k)は、図示はしていないが記憶部３に記憶され、後に指標生成部１６や区間抽出部１７による処理に用いられる。 In this way, the shot analysis unit 12 generates a feature quantity indicating the shot duration (SL(s_k)) and feature quantities indicating the degree of length of the shot duration (SLV_C(s_k), SLV_A(s_k)). The degree of length of the shot duration is the length of the shots in each part relative to the video as a whole. Although not illustrated, SL(s_k), SLV_C(s_k), and SLV_A(s_k) generated by the shot analysis unit 12 are stored in the storage unit 3 and are later used in processing by the index generation unit 16 and the section extraction unit 17.

３．４時空間投影画像の作成 3.4 Creation of the Spatiotemporal Projection Image
図８のフローチャートに基づいて、映像分析部１３による時空間投影画像の作成処理について説明する。 The process by which the video analysis unit 13 creates the spatiotemporal projection image will be described based on the flowchart of FIG. 8.

映像分析部１３は、まず、映像中の各フレーム（水平方向ｘ＝１６０ピクセル、垂直方向（ｙ）＝１２０ピクセル）において、ｙ＝３０，６０，９０の各水平ラインに注目し、各水平ラインにおけるピクセルの輝度を同一のｘ座標のピクセルごとに平均することにより、各フレームの平均輝度ラインを作成する。そして、この平均輝度ラインをフレームの時間順に並べて、図２（ａ）に示すような時空間投影画像を作成する（ステップＳ１０１）。 First, the video analysis unit 13 pays attention to each horizontal line of y = 30, 60, 90 in each frame (horizontal direction x = 160 pixels, vertical direction (y) = 120 pixels) in the video, and each horizontal line The average luminance line of each frame is created by averaging the luminance of the pixels at every pixel of the same x coordinate. Then, the average luminance lines are arranged in the time order of the frames to create a spatiotemporal projection image as shown in FIG. 2A (step S101).
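Step S101 can be sketched in plain Python as follows. The sketch assumes each frame is a list of 120 rows of 160 luminance values. The function names and the `image[x][f]` indexing (image scanning direction first, frame index second) are choices made for illustration.

```python
def average_luminance_line(frame, rows=(30, 60, 90)):
    """Average the luminance of the selected horizontal lines at each x coordinate."""
    width = len(frame[0])
    return [sum(frame[y][x] for y in rows) / len(rows) for x in range(width)]


def spatiotemporal_projection(frames, rows=(30, 60, 90)):
    """Build the projection image, indexed as image[x][f] (x: scan direction, f: frame)."""
    lines = [average_luminance_line(frame, rows) for frame in frames]
    # Transpose so that the frames run along the horizontal (time) axis.
    return [list(column) for column in zip(*lines)]


# Two synthetic 160 x 120 frames: a uniform one and one whose luminance equals y.
frame0 = [[10] * 160 for _ in range(120)]
frame1 = [[y] * 160 for y in range(120)]
image = spatiotemporal_projection([frame0, frame1])
print(len(image), len(image[0]))
```

Each frame contributes a single column, so a whole shot is reduced to a narrow strip that the later edge analysis can scan quickly.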

そして、映像分析部１３は、作成した時空間投影画像に基づいて、画像走査方向のエッジを強調した二値画像（画像走査方向エッジ強調画像）と、時間軸方向のエッジを強調した二値画像（時間軸方向エッジ強調画像）とを生成する（ステップＳ１０２，Ｓ１０３）。 Then, based on the created spatiotemporal projection image, the video analysis unit 13 generates a binary image in which edges in the image scanning direction are enhanced (image scanning direction edge enhanced image) and a binary image in which edges in the time axis direction are enhanced (time axis direction edge enhanced image) (steps S102 and S103).

３．５動きの検出 3.5 Motion Detection
図９のフローチャートに基づいて、映像分析部１３による映像の動きの検出処理について説明する。 The video motion detection process performed by the video analysis unit 13 will be described based on the flowchart of FIG. 9.

映像分析部１３は、図８のステップＳ１０２において作成した画像走査方向エッジ強調画像を用いて、この画像走査方向エッジ強調画像における各ショットに対応する部分をそれぞれ参照し、その部分に存在する１０ピクセル以上で構成されたエッジの本数を計算し、その結果を当該ショットのエッジの数E_v(s_k)（数式（２）参照）とする（ステップＳ２０１）。そして、数式（２）に基づいて、画像の動きに基づくアクション性を表す値VTIV_A(s_k)を計算する（ステップＳ２０２）。 The video analysis unit 13 uses the image scanning direction edge enhanced image created in step S102 of FIG. 8 to refer to the portion corresponding to each shot in the image scanning direction edge enhanced image, and 10 pixels existing in that portion. The number of edges configured as described above is calculated, and the result is set as the number of edges E _v (s _k ) of the shot (see formula (2)) (step S201). Then, based on the formula (2), a value VTIV _A (s _k ) representing the action property based on the motion of the image is calculated (step S202).
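Steps S201 and S202 can be sketched as follows. Two points are assumptions: an edge is treated here as a run of consecutive edge pixels along the image scanning direction within a single frame column, and Equation (2), which is not reproduced in this text, is taken to be the edge count divided by the shot length, following the description "number of edges that appear per unit time". The function names are hypothetical, and the binary image is indexed as `binary_image[x][f]`.

```python
def count_scan_direction_edges(binary_image, first_frame, last_frame, min_pixels=10):
    """Count runs of at least min_pixels edge pixels along x in frames [first_frame, last_frame)."""
    count = 0
    for f in range(first_frame, last_frame):
        run = 0
        for x in range(len(binary_image)):
            if binary_image[x][f]:
                run += 1
            else:
                if run >= min_pixels:
                    count += 1
                run = 0
        if run >= min_pixels:  # a run that reaches the last row
            count += 1
    return count


def vtiv_a(edge_count, shot_length):
    """Assumed form of Equation (2): edges per second of shot."""
    return edge_count / shot_length
```

For example, a shot with two qualifying edges over 4 seconds gives `vtiv_a(2, 4.0)`, that is, 0.5 edges per second.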

次に、映像分析部１３は、図８のステップＳ１０３において作成した時間軸方向エッジ強調画像を用いて、この時間軸方向エッジ強調画像における各ショットに対応する部分それぞれにおいて、時間軸方向にエッジを追跡しつつ、図４（ａ）（ｂ）に基づいてスコア加算を行い、その結果をSum(s_k)（数式（７）参照）とする（ステップＳ２０３）。そして、数式（７）に基づいて、画像の動きに基づく落ち着き性を表す値VTIV_C(s_k)を計算する（ステップＳ２０４）。 Next, the video analysis unit 13 uses the time-axis direction edge emphasized image created in step S103 of FIG. 8 to set an edge in the time axis direction in each portion corresponding to each shot in the time-axis direction edge emphasized image. While tracking, score addition is performed based on FIGS. 4A and 4B, and the result is Sum (s _k ) (see equation (7)) (step S203). Then, based on Expression (7), a value VTIV _C (s _k ) representing calmness based on the motion of the image is calculated (step S204).

このように、映像分析部１３は、映像の動きの激しさ度合を示す特徴量（VTIV_A(s_k)，VTIV_C(s_k)）を生成する。映像の動きの激しさ度合とは、映像全体に対する各部の動きの相対的な激しさの度合である。なお、映像分析部１３の生成するVTIV_A(s_k)、VTIV_C(s_k)は、図示はしていないが記憶部３に記憶され、後に指標生成部１６による処理に用いられる。 In this way, the video analysis unit 13 generates feature quantities (VTIV _A (s _k ), VTIV _C (s _k )) indicating the degree of intenseness of video motion. The intensity of motion of the image is the degree of relative intensity of the movement of each part with respect to the entire image. Although not shown, VTIV _A (s _k ) and VTIV _C (s _k ) generated by the video analysis unit 13 are stored in the storage unit 3 and are used later for processing by the index generation unit 16.

３．６音楽の性質の検出 3.6 Detection of the Nature of the Music
図１０のフローチャートに基づいて、音声分析部１４による音楽の性質の検出処理について説明する。 The process by which the audio analysis unit 14 detects the nature of the music will be described based on the flowchart of FIG. 10.

音声分析部１４は、サウンドスペクトログラム５４に基づくことにより、各ショットにおける楽器音成分の継続時間IL(s_k)の平均値を計算する（ステップＳ３０１）。平均値の計算は、当該ショットよりも前の５ショットと、後の４ショットとの合計１０ショット分における楽器音成分の継続時間の合計をショット数１０で除算することにより行う（数式（３）（８）参照）。 Based on the sound spectrogram 54, the audio analysis unit 14 calculates the average value of the duration IL(s_k) of the instrument sound components for each shot (step S301). The average is calculated by dividing the total duration of the instrument sound components over ten shots in total, made up of the five shots before the shot and the four shots after it, by the number of shots, 10 (see Equations (3) and (8)).

そして、音声分析部１４は、計算した平均値が閾値Th_instCよりも大きい場合には（Ｓ３０２）、緩やかな音楽が流れていると判定してMV_C(s_k)=1とし（ステップＳ３０３、数式（８）参照）、計算した平均値が閾値Th_instAよりも小さい場合には（Ｓ３０４）、激しい音楽が流れていると判定してMV_A(s_k)=1とする（ステップＳ３０５、数式（３）参照）。 When the calculated average value is larger than the threshold Th_instC (S302), the audio analysis unit 14 determines that gentle music is playing and sets MV_C(s_k) = 1 (step S303, see Equation (8)). When the calculated average value is smaller than the threshold Th_instA (S304), it determines that intense music is playing and sets MV_A(s_k) = 1 (step S305, see Equation (3)).
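Steps S301 to S305 can be sketched as a windowed average followed by two threshold tests. The thresholds 1.24 and 1.40 seconds come from the text. The function names are hypothetical, and two assumptions are made: the ten-shot window is read as the five shots before the current shot, the shot itself, and the four shots after it, and near the start or end of the video the average is taken over the shots that actually exist.

```python
TH_INST_A = 1.24  # seconds, threshold for short instrument sound components (from the text)
TH_INST_C = 1.40  # seconds, threshold for long instrument sound components (from the text)


def windowed_mean(durations, k, before=5, after=4):
    """Mean instrument sound duration over shots k-before .. k+after (clipped at the ends)."""
    lo = max(0, k - before)
    hi = min(len(durations), k + after + 1)
    window = durations[lo:hi]
    return sum(window) / len(window)


def music_values(durations, k):
    """Return (MV_A, MV_C) for shot k from the windowed mean duration."""
    mean = windowed_mean(durations, k)
    mv_a = 1 if mean < TH_INST_A else 0  # intense music
    mv_c = 1 if mean > TH_INST_C else 0  # gentle music
    return mv_a, mv_c


print(music_values([1.0] * 10, 5), music_values([2.0] * 10, 5))
```

Because the two thresholds differ, a shot whose mean duration falls between 1.24 and 1.40 seconds receives neither label.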

このように、音声分析部１４は、音楽の継続時間の長さ度合を示す特徴量（MV_C(s_k)，MV_A(s_k)）を生成する。楽器音成分の継続時間の長さ度合とは、サウンドスペクトログラム上でリズムを構成する楽器により線分として表れる成分の長さの度合、すなわち旋律を構成する音の長さの度合である。なお、音声分析部１４の生成するMV_C(s_k)、MV_A(s_k)は、図示はしていないが記憶部３に記憶され、後に指標生成部１６による処理に用いられる。 In this way, the audio analysis unit 14 generates feature quantities (MV_C(s_k), MV_A(s_k)) indicating the degree of length of the duration of the music. The degree of length of the duration of the instrument sound components is the degree of length of the components that appear as line segments on the sound spectrogram owing to the instruments making up the rhythm, that is, the degree of length of the sounds making up the melody. Although not illustrated, MV_C(s_k) and MV_A(s_k) generated by the audio analysis unit 14 are stored in the storage unit 3 and are later used in processing by the index generation unit 16.

3.7 Detecting the subject
When an image contains an emphasized object whose change in luminance differs from that of its surroundings, the shot is important because it has been emphasized in order to convey the content. The subject is therefore detected in each shot as follows.

図１１のフローチャートに基づいて、主体検出部１５による主体の検出処理について説明する。 Based on the flowchart of FIG. 11, the subject detection process by the subject detection unit 15 will be described.

Based on the video data 51 and the cut positions 52, the subject detection unit 15 performs the following processing on the first frame of each shot. First, the image of the first frame is converted to a 16-level grayscale representation (step S401). Because areas containing complex objects have a high edge density in the image, these edges are then detected (step S402).
The subject detection unit 15 also divides the 160 × 120 pixel first frame into blocks of 8 × 6 pixels (step S403), unifies the color of each block to the dominant color within the block (step S404), and performs region segmentation in the HSV color system (step S405).

そして、主体検出部１５は、エッジ密度が高いブロックの分布により主体の存在する可能性のある矩形領域を特定し（ステップＳ４０６）、矩形領域内の最大領域のブロック数が予め定めた閾値（例えば15%）以上であれば（ステップＳ４０７）、主体が存在すると判定して当該ショットについての主体の有無５９に主体「有り」を記録する（ステップＳ４０８）。 Then, the subject detection unit 15 identifies a rectangular region where the subject may exist based on the distribution of blocks having a high edge density (step S406), and the number of blocks in the maximum region within the rectangular region is a predetermined threshold (for example, 15%) or more (step S407), it is determined that the subject exists, and the subject “present” is recorded in the subject presence / absence 59 for the shot (step S408).
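
The final decision (steps S406 to S408) might look like the sketch below. It assumes that the earlier steps have already produced two 20 × 20 grids (160 × 120 pixels divided into 8 × 6 pixel blocks): `edge_dense`, marking blocks with high edge density, and `region`, holding the region label of each block after HSV segmentation. Both names, the bounding-box construction of the rectangle, and the reading of the 15% threshold as a share of all blocks in the frame are assumptions made for illustration; the text does not specify them.

```python
from collections import Counter

def has_subject(edge_dense, region, min_share=0.15):
    """Decide whether a frame contains a subject (S406 to S408, simplified).

    edge_dense -- 2-D list of bools, True where a block has high edge density
    region     -- 2-D list of region labels from the color segmentation
    """
    rows = [r for r, row in enumerate(edge_dense) if any(row)]
    if not rows:
        return False  # no edge-dense block, so no candidate rectangle
    cols = [c for c in range(len(edge_dense[0]))
            if any(row[c] for row in edge_dense)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]

    # Count blocks per region label inside the candidate rectangle (S406).
    sizes = Counter(region[r][c]
                    for r in range(top, bottom + 1)
                    for c in range(left, right + 1))
    largest = max(sizes.values())

    # Assumption: the threshold is a share of all blocks in the frame (S407).
    total_blocks = len(edge_dense) * len(edge_dense[0])
    return largest / total_blocks >= min_share
```

A shot whose first frame returns `True` would be recorded as subject "present" in the subject presence/absence 59.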

3.8 Extraction of emphasized sections
The process of extracting emphasized sections will be described based on the flowchart of FIG. 12.

First, the index generation unit 16 calculates the action degree and the calmness degree of each shot. Specifically, the index generation unit 16 calculates the action degree and the calmness degree based on formulas (4) and (9), respectively, and stores the calculated action degree Action(s_k) and calmness degree Calm(s_k) in the storage unit 3 as the action degree 56 and the calmness degree 58, respectively (step S501). The calculations of formulas (4) and (9) use SVL_A(s_k) and SVL_C(s_k) calculated by the shot analysis unit 12, VTIV_A(s_k) and VTIV_C(s_k) calculated by the video analysis unit 13, and MV_A(s_k) and MV_C(s_k) calculated by the voice analysis unit 14.

また、各ショットについて算出されたアクション性度合および落ち着き性度合を平滑化して記憶部３に記憶させる（ステップＳ５０２）。平滑化は、注目しているショットと、そのショットの前後２ショットずつの合計５ショットにおけるアクション性度合および落ち着き性度合の平均をとることにより行う。このように平滑化することにより、アクション性度合および落ち着き性度合の大まかな変動に基づいて区間の抽出を行うことができるため、より望ましい結果が得られる。そこで、区間の抽出処理においては、アクション性度合および落ち着き性度合として平滑化された値を用いる。 In addition, the action degree and calmness degree calculated for each shot are smoothed and stored in the storage unit 3 (step S502). Smoothing is performed by taking the average of the action degree and calmness degree of the shot of interest and a total of five shots, two shots before and after the shot. By smoothing in this way, it is possible to extract sections based on rough fluctuations in the degree of action and the degree of calmness, so a more desirable result can be obtained. Therefore, in the section extraction process, smoothed values are used as the action degree and the calmness degree.
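
The smoothing of step S502 is a five-shot moving average, which can be sketched as below. The function name `smooth` is hypothetical, and averaging over only the shots that exist near the start and end of the video is an assumption; the text does not say how the edges are handled.

```python
def smooth(values, radius=2):
    """Moving average over each shot and `radius` shots on either side."""
    out = []
    for k in range(len(values)):
        # Up to 2 shots before, the shot itself, up to 2 shots after.
        window = values[max(0, k - radius):k + radius + 1]
        out.append(sum(window) / len(window))
    return out
```

The same function would be applied once to the per-shot action degrees and once to the per-shot calmness degrees, so that a single outlying shot does not split a section.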

次に、区間抽出部１７によりアクション区間、緊迫した区間、落ち着いた区間を抽出する。そのために、区間抽出部１７は、各ショットに対して次の処理を行う。 Next, the section extraction unit 17 extracts action sections, tight sections, and calm sections. For this purpose, the section extraction unit 17 performs the following processing for each shot.

まず、注目しているショット（注目ショット）を含む前後のショットのショット長に基づき、ショットの長さが徐々に短くなる区間（数式（５）のｉｆ式を満たす区間）に注目ショットが含まれているか否かを判別する（ステップＳ５０３）。含まれている場合は、注目ショットを緊迫した区間６１として記憶部３に記憶させる（ステップＳ５０４）。なお、上記判別の際、１ショットのみが直前ショットよりも長くなり、他のショットが徐々に短くなっている区間についても、ショットの長さが徐々に短くなる区間とみなすようにしてもよい。 First, based on the shot lengths of the preceding and following shots including the shot of interest (focus shot), the shot of interest is included in a section in which the shot length gradually decreases (section that satisfies the if expression of Formula (5)). It is determined whether or not (step S503). If it is included, the shot of interest is stored in the storage unit 3 as a tight section 61 (step S504). In the above determination, a section in which only one shot is longer than the previous shot and other shots are gradually shortened may be regarded as a section in which the shot length is gradually shortened.

If the shot of interest is not included in a section in which the shot length gradually decreases, it is determined whether the following condition is satisfied: the action degree 56 of the shot of interest is equal to or greater than a predetermined threshold, and the action degree 56 of the shot of interest is greater than its calmness degree 58 (step S505). If the condition is satisfied, the group of shots from the shot of interest onward that consecutively satisfy the condition that the action degree 56 is greater than the calmness degree 58 is stored in the storage unit 3 as an action section 60 (steps S506 to S509).

If the condition of step S505 is not satisfied, it is determined whether the following condition is satisfied: the calmness degree 58 of the shot of interest is equal to or greater than a predetermined threshold, and the calmness degree 58 of the shot of interest is greater than its action degree 56 (step S510). If the condition is satisfied, the group of shots from the shot of interest onward that consecutively satisfy the condition that the calmness degree 58 is greater than the action degree 56 is stored in the storage unit 3 as a calm section 62 (steps S511 to S514).
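
The decision sequence of steps S503 to S514 can be sketched as a single pass over the shots. The sketch takes the result of the formula (5) test as a precomputed list `in_shortening_run`, because that formula is defined elsewhere in the document. The function name, the string labels, and the choice to stop extending an action or calm section when a shot belonging to a shortening run is reached are assumptions made for illustration.

```python
def label_shots(action, calm, in_shortening_run, th_action, th_calm):
    """Label each shot 'tense', 'action', 'calm' or None.

    action, calm      -- smoothed action and calmness degrees per shot
    in_shortening_run -- True where the shot lies in a run of shortening shots
    """
    n = len(action)
    labels = [None] * n
    k = 0
    while k < n:
        if in_shortening_run[k]:                                 # S503
            labels[k] = "tense"                                  # S504
            k += 1
        elif action[k] >= th_action and action[k] > calm[k]:     # S505
            # Extend while the action degree stays above the calmness degree.
            while k < n and not in_shortening_run[k] and action[k] > calm[k]:
                labels[k] = "action"                             # S506 to S509
                k += 1
        elif calm[k] >= th_calm and calm[k] > action[k]:         # S510
            while k < n and not in_shortening_run[k] and calm[k] > action[k]:
                labels[k] = "calm"                               # S511 to S514
                k += 1
        else:
            k += 1
    return labels
```

Note that only the first shot of an action or calm section has to reach the threshold; the following shots only have to keep the same ordering of the two degrees, as the text describes.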

3.9 Detecting dependency relationships between sections
When sections of different natures occur consecutively, they stand in a dependency relationship of cause and effect. Detecting these relationships therefore makes it possible to take the context of the story into account.

原因と結果とを表す映像区間には従属関係があるが、性質は異なっているため、それらの区間を同時に要約映像に採用することにより、印象を強めることができる。前後の区間の性質の差に着目し、アクション性度合、緊迫性度合、あるいは落ち着き性度合の平均値の差を求め、従属関係の度合（従属度）とする。従属度を求めることにより、編集上強調された区間と従属関係にある前後の区間のどちらから、要約映像に採用するかを決定する際の手がかりとする。これによって、より編集上強調された区間と従属関係が強い区間を要約映像として採用することが可能となる。 The video sections representing the cause and the result have a subordinate relationship, but since the properties are different, the impression can be strengthened by simultaneously adopting these sections in the summary video. Paying attention to the difference in the properties of the preceding and following sections, the difference in the average value of the degree of action, the degree of tension, or the degree of calmness is obtained and used as the degree of dependency (dependency). By obtaining the degree of dependency, it is a clue when determining which section to be adopted for the summary video from the section emphasized in editing and the preceding and following sections that are in a dependency relationship. As a result, it is possible to adopt a section having a strong dependency relationship with the section emphasized for editing as a summary video.

図１３のフローチャートに基づいて、区間の従属関係の検出処理について説明する。 Based on the flowchart of FIG. 13, the processing for detecting the dependency relationship between the sections will be described.

First, the index generation unit 16 calculates the tension degree of each shot in the tense sections. Specifically, the index generation unit 16 calculates the tension degree based on formula (5) and stores the calculated tension degree Tension(s_k) in the storage unit 3 as the tension degree 57 (step S601). The calculation of formula (5) uses SL(s_k) calculated by the shot analysis unit 12.

次に、従属度検出部１８により従属度を検出する。そのために、従属度検出部１８は、各区間に対して次の処理を行う。 Next, the dependency level is detected by the dependency level detection unit 18. Therefore, the dependency level detection unit 18 performs the following process for each section.

まず、注目している区間（注目区間）がアクション区間であるか否かを判別する（ステップＳ６０２）。 First, it is determined whether or not the section of interest (target section) is an action section (step S602).

アクション区間である場合には、さらに注目区間の後に緊迫した区間が続くか否かを判別し（ステップＳ６０３）、緊迫した区間が続く場合には、これら２つの区間に含まれるショットのアクション性度合５６の平均値の差を計算して、この計算結果を、注目区間と次に続く区間との従属度６３として記憶部３に記憶させる（ステップＳ６０４）。 If it is an action section, it is further determined whether or not a tense section continues after the section of interest (step S603). If the tense section continues, the action property level of shots included in these two sections is determined. The difference between the average values of 56 is calculated, and the calculation result is stored in the storage unit 3 as the dependency 63 between the target section and the next section (step S604).

注目区間がアクション区間ではない場合には、さらに注目区間の後に落ち着いた区間が続くか否かを判別し（ステップＳ６０５）、落ち着いた区間が続く場合には、これら２つの区間に含まれるショットのアクション性度合５６の平均値の差を計算して、この計算結果を、注目区間と次に続く区間との従属度６３として記憶部３に記憶させる（ステップＳ６０６）。 If the attention section is not an action section, it is further determined whether or not a calm section continues after the attention section (step S605). If the calm section continues, the shots included in these two sections are determined. The difference of the average value of the action degree 56 is calculated, and the calculation result is stored in the storage unit 3 as the dependency 63 between the attention interval and the next interval (step S606).

注目区間が緊迫した区間や落ち着いた区間である場合にも、上記アクション区間の場合と同様にして、それぞれ注目区間と次に続く区間との従属度６３を計算して記憶部３に記憶させる（ステップＳ６０７〜Ｓ６１１，Ｓ６１２〜Ｓ６１６）。 Even when the attention section is a tight section or a calm section, the degree of dependency 63 between the attention section and the next section is calculated and stored in the storage unit 3 in the same manner as in the action section. Steps S607 to S611, S612 to S616).
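
The dependency degree between two adjacent sections is a difference of per-section averages, as in the sketch below. The function name `dependency_degree` is hypothetical, and taking the absolute value is an assumption; the text only says "difference" and does not state a sign convention. Which degree is averaged (action, tension, or calmness) depends on the branch of the flowchart.

```python
def dependency_degree(degrees_a, degrees_b):
    """Difference between the mean degrees of two adjacent sections.

    degrees_a, degrees_b -- per-shot degrees (for example action degree 56)
                            of the shots in each section
    """
    mean_a = sum(degrees_a) / len(degrees_a)
    mean_b = sum(degrees_b) / len(degrees_b)
    # Assumption: only the size of the contrast matters, not its direction.
    return abs(mean_a - mean_b)
```

A large value indicates a strong contrast between the two sections, which the text treats as a strong cause-and-effect link.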

3.10 Generation of the summary video
The summary video generation process will be described based on the flowchart of FIG. 14.

まず、利用者が操作部５を操作することにより、利用者の指定した要約映像の制約時間が入力される（ステップＳ７０１）。制約時間は、例えば5, 10, 15, 20, 25, 30分のいずれかを指定することにより決定される。 First, when the user operates the operation unit 5, the restriction time of the summary video designated by the user is input (step S701). The constraint time is determined by designating any one of 5, 10, 15, 20, 25, and 30 minutes, for example.

次に、要約映像生成部１９により、映像データが時間軸に沿ってｎ（例えばｎ＝２０）等分される（ステップＳ７０２）。そして、このｎ等分された各期間について、要約映像生成部１９により次の処理が行われる。 Next, the summary video generation unit 19 equally divides the video data into n (for example, n = 20) along the time axis (step S702). Then, the summary video generation unit 19 performs the following process for each of the divided n periods.

まず、要約映像生成部１９は、注目している期間（注目期間）に含まれるアクション区間、緊迫した区間、落ち着いた区間それぞれが占めるショット数を計算し（ステップＳ７０３）、このショット数の割合に応じて、注目期間から要約映像に採用するアクション区間、緊迫した区間、落ち着いた区間の時間長（制約時間）を計算する（ステップＳ７０４）。 First, the summary video generation unit 19 calculates the number of shots occupied by the action section, the tight section, and the calm section included in the period of interest (attention period) (step S703), and the ratio of the number of shots is calculated. Correspondingly, the time length (constraint time) of the action section, the tight section, and the calm section that are adopted for the summary video from the attention period is calculated (step S704).
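
The allocation in steps S703 and S704 can be sketched as a proportional split. The function name `allocate` and the dictionary keys are hypothetical. The sketch assumes that each of the n periods receives an equal share of the user-specified total time; the text states only that the split among the three section types follows the ratio of shot counts.

```python
def allocate(total_seconds, n_periods, shot_counts):
    """Split one period's time budget among section types by shot count.

    total_seconds -- length of the whole summary chosen by the user
    n_periods     -- number of equal periods the video is divided into
    shot_counts   -- e.g. {"action": 6, "tense": 3, "calm": 1}
    """
    budget = total_seconds / n_periods  # assumption: equal budget per period
    total_shots = sum(shot_counts.values())
    if total_shots == 0:
        return {kind: 0.0 for kind in shot_counts}
    return {kind: budget * count / total_shots
            for kind, count in shot_counts.items()}
```

With a 5-minute summary (300 seconds), n = 20, and a period containing 6 action shots, 3 tense shots, and 1 calm shot, the period's 15 seconds are split into 9, 4.5, and 1.5 seconds.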

そして、要約映像生成部１９は、注目期間に含まれるアクション区間において、アクション区間の制約時間が満たされるまで、次のようにしてショットの採用を行う。すなわち、未採用のショットの中で、主体が存在し、かつ、アクション性度合の最も高いショットを採用し（ステップＳ７０５）、採用したショットを含むアクション区間に隣接する区間の中から従属度の高い区間を選択し（ステップＳ７０６）、選択した区間における未採用のショットの中で、上記採用したショットを含むアクション区間と時間的に最も近いショットを採用する（ステップＳ７０７）、という処理を、アクション区間の制約時間が満たされるまで繰り返す。 Then, the summary video generation unit 19 adopts shots in the following manner until the constraint time of the action section is satisfied in the action section included in the attention period. That is, among the shots that have not been adopted, the shot that has the subject and has the highest action degree is adopted (step S705), and the degree of dependency is high among the sections adjacent to the action section including the adopted shot. A process of selecting a section (step S706) and adopting a shot that is temporally closest to the action section including the adopted shot among the unadopted shots in the selected section (step S707) is performed. Repeat until the constraint time is satisfied.
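
The loop of steps S705 to S707 can be sketched for a single action section as follows. All names and data structures are hypothetical. The sketch measures "temporally closest" by the distance between shot indices, re-checks the time budget between the two adoption steps, and stops when no shot with a subject remains; these are assumptions, since the text does not fix these details.

```python
def select_shots(section, neighbors, shots, budget):
    """Greedy shot selection for one action section.

    section   -- shot indices belonging to the action section
    neighbors -- list of (dependency_degree, shot_indices) for adjacent sections
    shots     -- dict: index -> {"length", "action", "has_subject"}
    budget    -- time (seconds) allotted to action sections in this period
    """
    adopted, used = [], 0.0
    while used < budget:
        # S705: best unadopted shot in the section that contains a subject.
        candidates = [i for i in section
                      if i not in adopted and shots[i]["has_subject"]]
        if not candidates:
            break
        best = max(candidates, key=lambda i: shots[i]["action"])
        adopted.append(best)
        used += shots[best]["length"]

        # S706: adjacent section with the highest dependency degree.
        usable = [(dep, idx) for dep, idx in neighbors
                  if any(i not in adopted for i in idx)]
        if usable and used < budget:
            _, idx = max(usable, key=lambda pair: pair[0])
            free = [i for i in idx if i not in adopted]
            # S707: unadopted shot closest to the action section.
            nearest = min(free, key=lambda i: min(abs(i - j) for j in section))
            adopted.append(nearest)
            used += shots[nearest]["length"]
    return sorted(adopted)
```

Returning the indices in sorted order keeps the adopted shots in their original temporal order when they are joined into the summary.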

また、要約映像生成部１９は、注目期間に含まれる緊迫した期間および落ち着いた期間についても、上記アクション期間の場合と同様にしてショットの選択を行う（ステップＳ７０８〜Ｓ７１０，Ｓ７１１〜Ｓ７１３）。 In addition, the summary video generation unit 19 selects shots in the tension period and the calm period included in the attention period as in the case of the action period (steps S708 to S710, S711 to S713).

要約映像生成部１９は、以上のようにして採用したショットを、要約映像データ６４として記憶部３に記憶させる。なお、要約映像データ６４は、採用したショットに対応する部分を映像データ５１から抜き出してつなぎ合わせることにより作成したデータであってもよいが、採用したショットに対応する部分を映像データ５１において特定できる情報を示すデータであってもよい。 The summary video generation unit 19 stores the shot adopted as described above in the storage unit 3 as summary video data 64. The summary video data 64 may be data created by extracting and connecting portions corresponding to the adopted shots from the video data 51, but the portions corresponding to the adopted shots can be specified in the video data 51. Data indicating information may be used.

なお、ここでは、要約映像を生成するために、音声分析部１４による処理結果、および主体検出部１５による検出結果に基づくものとして説明しており、これらはより的確な要約映像を生成する上で有用であるものの、これらを省略したとしても的確な要約映像を生成することは可能である。 Here, in order to generate the summary video, the description is based on the processing result by the voice analysis unit 14 and the detection result by the subject detection unit 15, and these are for generating a more accurate summary video. Although useful, even if these are omitted, it is possible to generate an accurate summary video.

4. Summary of the summary video creation device
As described above, in the summary video creation device (video editing device) 1, the shot analysis unit (shot recognition means) 12 recognizes, based on the video data 51, features corresponding to the length of the shot duration for each part of the video. In addition, the video analysis unit (video recognition means) 13 recognizes, based on the video data 51, features corresponding to the intensity of motion in the video for each part of the video.

Then, the section extraction unit (emphasis section specifying means) 17 specifies the sections of the video data that correspond to emphasis sections (action sections, tense sections, and calm sections), based on the recognition results of the shot analysis unit 12 and the video analysis unit 13 (including the action degree 56, tension degree 57, and calmness degree 58 that the index generation unit 16 generates from those results). In addition, the dependency level detection unit (dependency level detection means) 18 detects the degree of dependency between the emphasis sections, based on the recognition results of the shot analysis unit 12 and the video analysis unit 13.

Then, the summary video generation unit (summary creation means) 19 determines the portions of the emphasis sections to be adopted for the summary video, based on the recognition results of the shot analysis unit 12 and the video analysis unit 13 and on the detection result of the dependency level detection unit 18.

これにより、要約映像作成装置１では、映画の文法に即した要約映像、つまり編集上強調された強調区間と、これら強調区間の間の従属関係を反映することにより、全体の内容を視聴者が的確に把握しやすい要約映像を作成することができる。 As a result, the summary video creation device 1 reflects the summary video in accordance with the grammar of the movie, that is, the emphasis section emphasized on editing, and the dependency between these emphasis sections, so that the viewer can view the entire contents. It is possible to create a summary video that is easy to grasp accurately.

また、要約映像作成装置１では、音声分析部（音声認識手段）１４により、映像データ５１に付加された音声データに基づき、映像の各部について音声に含まれる楽器音成分の継続時間の長さに応じた特徴を認識し、区間抽出部１７、従属度検出部１８、要約映像生成部１９における各処理に用いることが望ましい。 Also, in the summary video creation device 1, the duration of the instrument sound component included in the audio for each part of the video is determined based on the audio data added to the video data 51 by the audio analysis unit (voice recognition means) 14. It is desirable to recognize the corresponding feature and use it for each processing in the section extraction unit 17, the dependency level detection unit 18, and the summary video generation unit 19.

映像には音声が付加されている場合が多く、この場合、アクション区間、落ち着いた区間の特徴的性質は、上記音声に含まれる楽器音成分の継続時間の長さとしても現れる。したがって、ショットの継続時間の長さに応じた特徴と、映像の動きの激しさに応じた特徴とに加えて、楽器音成分の継続時間の長さに応じた特徴を認識し、これらに基づいて強調区間の特定、従属度合の検出、要約映像として採用すべき映像部分の決定を行うことにより、より的確な要約映像を作成することができる。 In many cases, audio is added to the video, and in this case, the characteristic properties of the action section and the calm section appear as the duration of the instrument sound component included in the audio. Therefore, in addition to the characteristics according to the length of the duration of the shot and the characteristics according to the intensity of the motion of the video, the characteristics according to the length of the duration of the instrument sound component are recognized and based on these. By specifying the emphasis section, detecting the degree of dependency, and determining the video portion to be adopted as the summary video, a more accurate summary video can be created.

また、要約映像作成装置１では、主体検出部（主体検出手段）１５により、映像データ５１に基づき、映像の各部について主体の存在を検出し、要約映像生成部１９における処理に用いることが望ましい。 In the summary video creating apparatus 1, it is desirable that the subject detection unit (subject detection means) 15 detects the presence of the subject for each part of the video based on the video data 51 and uses it for processing in the summary video generation unit 19.

主体の存在する部分は、映像の内容を視聴者に伝える上で重要な部分となり、その部分を優先的に採用した要約映像は、それを考慮しないものに比べて、映像の内容を理解しやすくなる。したがって、主体の存在を検出し、その検出結果に基づいて強調区間から要約映像に採用すべき部分を決定することにより、より的確な要約映像を作成することができる。 The part where the subject is present becomes an important part in conveying the content of the video to the viewer, and the summary video that preferentially adopts that part is easier to understand than the one that does not consider it Become. Therefore, a more accurate summary video can be created by detecting the presence of the subject and determining a portion to be adopted for the summary video from the emphasis section based on the detection result.

5. Experiment and evaluation
Six university students served as subjects. They compared summary videos created by the summary video creation device 1 (example) with summary videos created without considering either content or context (comparative example), and evaluated which summary made the content of the movie and the flow of the story easier to understand.

比較例として、以下のようなカットの頻度による要約映像を作成した。映画の先頭から5秒毎のフレームに対して、そこから10秒間に含まれるカットの数を求める。この10秒間に含まれるカット数が最も多いフレームから順にキーフレームとする。ここでキーフレームとは、要約映像を作成する際に着目するフレームのことである。キーフレームが含まれるショットを先頭ショットとして、先頭ショットから合計時間が10秒を越えるまでのショットを連結し、要約映像として採用する。要約映像の時間長が目的の時間に達するまでその処理を繰り返し、選択した区間を時間順に並べることで要約映像とした。この比較例の要約映像は、ショットの長さが短く、映像として印象の強い区間のみをつなぎ合わせた映像となる。 As a comparative example, a summary video with the following cut frequency was created. For a frame every 5 seconds from the beginning of the movie, find the number of cuts in 10 seconds from there. The key frames are set in order from the frame having the largest number of cuts included in the 10 seconds. Here, the key frame is a frame to be focused on when creating a summary video. Shots that include key frames are taken as the first shot, and shots from the first shot until the total time exceeds 10 seconds are connected and adopted as a summary video. The process was repeated until the time length of the summary video reached the target time, and the selected sections were arranged in time order to obtain a summary video. The summary video of this comparative example is a video in which only shots with short shot lengths and strong impressions are connected.
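
The key-frame ranking used for the comparative example can be sketched as follows. The function name `rank_key_times` and its parameters are hypothetical, and breaking ties in favor of the earlier position is an assumption. The sketch covers only the ranking; joining shots from each key frame until 10 seconds are exceeded is left out.

```python
def rank_key_times(cut_times, duration, step=5.0, window=10.0):
    """Rank candidate key-frame times by the number of cuts that follow.

    cut_times -- times (seconds) at which cuts occur
    duration  -- length of the movie in seconds
    """
    scored = []
    t = 0.0
    while t < duration:
        # Number of cuts in the 10 seconds starting at this frame.
        count = sum(1 for c in cut_times if t <= c < t + window)
        scored.append((count, t))
        t += step
    # Most cuts first; earlier positions first on ties (assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored]
```

Because the ranking depends only on cut frequency, it favors rapid-cut passages regardless of their role in the story, which is the property the comparative example is meant to exhibit.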

For two movies ("Speed 2", directed by Jan de Bont, 1997, action; "A.I.", directed by Steven Spielberg, 2001, SF/drama), the subjects watched the 5-minute and 10-minute summary videos created as the example and the 5-minute and 10-minute summary videos created as the comparative example, and rated them on a five-point scale from two viewpoints: ease of understanding the content of the story and ease of understanding the flow of the story. The scale was as follows: 5, the example is better; 4, the example is somewhat better; 3, cannot say either way; 2, the comparative example is somewhat better; 1, the comparative example is better.

The video data used had a frame size of 160 × 120 [pixels], a frame rate of 30 [frames/sec], and 24-bit color; the audio had a sampling frequency of 22.050 [kHz] and 8-bit quantization, and was monaural.

事象間の因果関係や話の展開が把握可能な要約になっているか否かを評価するために、本実験で用いた映画を観たことがない被験者に対しては、あらかじめ映画のあらすじを読んでもらうことによって、ある程度話の内容を理解してもらった上で実験を行った。 For subjects who have not seen the movie used in this experiment in order to evaluate whether the causal relationship between events and the story development are comprehensible summaries, read the movie synopsis in advance. The experiment was carried out with some understanding of the content of the story.

評価結果を図１５に示す。図１５では、6名の平均評価値をプロットしている。全体的に実施例の方が、話の内容、流れともに、理解のしやすい要約映像となっている。実施例では、編集上強調された区間としてアクション区間、緊迫した区間、落ち着いた区間を抽出し、それに従属する区間も求めて要約映像を作成しているため、比較例よりも話の内容、流れともに理解のしやすい要約映像が作成できたと考えられる。 The evaluation results are shown in FIG. In FIG. 15, the average evaluation values of six people are plotted. Overall, the embodiment is a summary video that is easy to understand, both in terms of content and flow. In the embodiment, an action section, a tight section, and a calm section are extracted as sections that are emphasized for editing, and a summary video is created by obtaining sections subordinate thereto. It is thought that a summary video that is easy to understand was created.

本実施形態では、映画の内容と文脈を考慮することにより、話の内容がより理解しやすい要約映像を作成する手法を提案した。映画の文法に基づき、アクション区間、緊迫した区間、落ち着いた区間を抽出することによって、内容が効果的に伝わるように編集上強調された区間を要約映像に含めることが可能となる。さらに、それらの区間との従属関係を求めることにより、前後の話のつながりもあまり失うことなく、要約映像を作成することが可能となる。 In the present embodiment, a method of creating a summary video that makes it easier to understand the content of the story by considering the content and context of the movie has been proposed. By extracting an action section, a tight section, and a calm section based on the grammar of the movie, it is possible to include a section that has been editorially emphasized so that the contents can be effectively transmitted to the summary video. Furthermore, by obtaining a dependency relationship with these sections, it is possible to create a summary video without losing the connection between the previous and next stories.

なお、映画の要約映像を作成する上では、効果音も重要な要素と考えられるため、効果音も考慮して要約映像を作成することが望ましい。 In creating a summary video of a movie, sound effects are considered to be an important factor, so it is desirable to create a summary video in consideration of sound effects.

本発明は、映画やテレビドラマなどストーリーを有する映像から要約映像を自動的に作成するために利用することができ、例えば、視聴者に提供される映像視聴用の装置に適用できるほか、映像の制作者に提供される宣伝用映像を作成するための装置にも適用できる。 The present invention can be used to automatically create a summary video from a video having a story, such as a movie or a TV drama, and can be applied to, for example, a video viewing device provided to a viewer. The present invention can also be applied to an apparatus for creating an advertisement video provided to a producer.

FIG. 1 is a drawing for explaining a spatiotemporal projection image.
FIG. 2(a) is a drawing showing a spatiotemporal projection image, and FIG. 2(b) is a drawing showing an edge image obtained by extracting edges from the spatiotemporal projection image of (a).
FIG. 3 is a drawing showing an example of a sound spectrogram.
FIG. 4(a) is a drawing showing the values used in the calculation of the measure of flatness of the video, and FIG. 4(b) is a drawing showing the order of the edge tracking performed to obtain the measure of flatness of the video.
FIG. 5 is a block diagram showing the configuration of a summary video creation device according to an embodiment of the present invention.
FIG. 6 is a flowchart showing the overall flow of the summary video creation process in the summary video creation device of FIG. 5.
FIG. 7 is a flowchart showing the specific content of the shot length detection process in FIG. 6.
FIG. 8 is a flowchart showing the specific content of the spatiotemporal projection image creation process in FIG. 6.
FIG. 9 is a flowchart showing the specific content of the motion detection process in FIG. 6.
FIG. 10 is a flowchart showing the specific content of the music property detection process in FIG. 6.
FIG. 11 is a flowchart showing the specific content of the subject detection process in FIG. 6.
FIG. 12 is a flowchart showing the specific content of the section extraction process in FIG. 6.
FIG. 13 is a flowchart showing the specific content of the dependency relationship detection process in FIG. 6.
FIG. 14 is a flowchart showing the specific content of the summary video generation process in FIG. 6.
FIG. 15 is a graph showing the evaluation results comparing the example of the present invention with the comparative example.

Explanation of symbols

1 Summary video creation device (video editing device)
2 Control unit
3 Storage unit
4 Data input unit
5 Operation unit
6 Data output unit
11 Cut detection unit
12 Shot analysis unit (shot recognition means)
13 Video analysis unit (video recognition means)
14 Voice analysis unit (voice recognition means)
15 Subject detection unit (subject detection means)
16 Index generation unit
17 Section extraction unit (emphasis section specifying means)
18 Dependency level detection unit (dependency level detection means)
19 Summary video generation unit (summary creation means)

Claims

A video editing device for creating a summary video from a video that includes emphasis sections identifiable based on the length of each shot constituting the video and the intensity of motion in the video, the video editing device comprising:
shot recognition means for recognizing, based on video data, features corresponding to the length of the shot duration for each part of the video;
video recognition means for recognizing, based on the video data, features corresponding to the intensity of motion in the video for each part of the video;
emphasis section specifying means for specifying sections of the video data that correspond to the emphasis sections, based on the recognition results of the shot recognition means and the video recognition means;
dependency level detection means for detecting the degree of dependency between the emphasis sections, based on the recognition results of the shot recognition means and the video recognition means; and
summary creation means for determining portions of the emphasis sections to be adopted for the summary video, based on the recognition results of the shot recognition means and the video recognition means and on the detection result of the dependency level detection means.

The shot recognizing means generates, as a recognition result, a feature amount indicating the duration of the shot and a feature amount indicating the degree of the duration of the shot,
The video editing apparatus according to claim 1, wherein the video recognizing unit generates a feature amount indicating a degree of motion intensity of the video as a recognition result.

Voice recognition means for recognizing a feature corresponding to the duration of the instrument sound component included in the voice for each part of the video based on the voice data added to the video data;
The enhancement section specifying means further specifies a section corresponding to the enhancement section in the video data based on the recognition result by the voice recognition means,
The dependency level detecting means further detects a dependency level between the emphasis sections based on a recognition result by the voice recognition means,
The video editing apparatus according to claim 1, wherein the summary creation unit further determines a portion to be adopted for the summary video from the emphasis section based on a recognition result by the voice recognition unit.

The video editing apparatus according to claim 3, wherein the voice recognition unit generates a feature amount indicating a degree of duration of the musical instrument sound component as a recognition result.

Based on video data, further comprising subject detection means for detecting the presence of a video subject for each part of the video,
The video editing apparatus according to claim 1, wherein the summary creation unit further determines a portion to be adopted for the summary video from the emphasis section based on a detection result by the subject detection unit.

6. A video editing program for operating the video editing apparatus according to claim 1, wherein the video editing program causes a computer to function as each means.

A computer-readable recording medium on which the video editing program according to claim 6 is recorded.

A video editing method for creating a summary video from a video that includes emphasis sections identifiable based on the length of each shot constituting the video and the intensity of motion in the video, the video editing method comprising:
a shot recognition process of recognizing, based on video data, features corresponding to the length of the shot duration for each part of the video;
a video recognition process of recognizing, based on the video data, features corresponding to the intensity of motion in the video for each part of the video;
an emphasis section specifying process of specifying sections of the video data that correspond to the emphasis sections, based on the recognition results of the shot recognition process and the video recognition process;
a dependency level detection process of detecting the degree of dependency between the emphasis sections, based on the recognition results of the shot recognition process and the video recognition process; and
a summary creation process of determining portions of the emphasis sections to be adopted for the summary video, based on the recognition results of the shot recognition process and the video recognition process and on the detection result of the dependency level detection process.