JP5129198B2

JP5129198B2 - Video preview generation device, video preview generation method, and video preview generation program

Info

Publication number: JP5129198B2
Application number: JP2009133593A
Authority: JP
Inventors: 浩太日高; 明小島; 豪入江
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-06-03
Filing date: 2009-06-03
Publication date: 2013-01-23
Anticipated expiration: 2029-06-03
Also published as: JP2010283478A

Description

The present invention relates to a video preview generation device and method for generating a preview of a video from video material, and to a video preview generation program used to implement the video preview generation device.

For movies, a preview of the movie is produced for use in promoting it.

Such video previews have conventionally been produced by hand. Because the workload is heavy, a technology that can generate video previews automatically is desired.

Against this background, Non-Patent Document 1 below discloses an invention that associates the introductory text of an electronic program guide with the closed captions of a video, extracts the video sections that correspond to the introductory text, and automatically generates a program introduction video from those sections.

Yoshihiko Kawai, Hideki Sumiyoshi, Nobuyuki Yagi, "Automatic Generation Method of Program Introduction Video Using Introductory Text in Electronic Program Guide", IEICE Transactions Vol.J91-D No.8 pp.2157-2165.

Indeed, Non-Patent Document 1 discloses an invention that associates the introductory text of an electronic program guide with the closed captions of a video, extracts the video sections that correspond to the introductory text, and automatically generates a program introduction video from those sections.

However, the invention disclosed in Non-Patent Document 1 requires closed captions for the video in order to generate the program introduction video automatically.

Consequently, with the invention disclosed in Non-Patent Document 1, a preview cannot be generated automatically for a video that has no closed captions.

For example, closed captions are generally not attached to unedited video. With the invention disclosed in Non-Patent Document 1, therefore, a preview cannot be generated automatically for unedited video.

Likewise, electronic program guides are generally provided for commercial video such as television programs, but in practice they are not provided for other kinds of video.

As this shows, the invention disclosed in Non-Patent Document 1 cannot be called a method for automatically generating video previews for video in general.

The present invention has been made in view of these circumstances. Its object is to provide a new video preview generation technology that can automatically generate a preview of a video regardless of the attributes of the video, such as commercial video or home video.

To achieve this object, the video preview generation device of the present invention, in order to automatically generate a preview of a video from video material, comprises the following means.
(1) Filter means that takes the video sections of the video material (each consisting of one or more frames) as processing targets, detects one or more of the following kinds of video section, and removes the detected video sections from the video material: video sections associated with audio fillers, video sections associated with excessive camera work, and video sections in which no captured video exists.
(2) Opening generation means that takes the video sections not removed by the filter means as processing targets, detects one or more of the following kinds of video section, and generates the opening part of the video preview using the detected video sections: video sections in which the degree of speech emphasis is greater than a predetermined threshold, video sections in which the subject's face size is greater than a predetermined threshold, and video sections in which the subject's facial expression is greater than a predetermined threshold.
(3) Main part generation means that takes as processing targets the video sections that were not removed by the filter means and were not used for the opening part, detects one or more of the following kinds of video section, and generates the main part of the video preview using the detected video sections: video sections in which the amount of temporal change in the fundamental frequency of the speech is greater than a predetermined threshold, and video sections in which the degree of speech emphasis increases over time and the amount of temporal change of that increase is greater than a predetermined threshold.
(4) Video preview generation means that generates the video preview by joining the opening part generated by the opening generation means and the main part generated by the main part generation means.

With this configuration, the opening generation means may add music to the video in the opening part in such a way that any one of the tone, tempo, or degree of emphasis of the music is synchronized with the timing at which the video switches in the opening part. Likewise, the main part generation means may add music to the video in the main part in such a way that any one of the tone, tempo, or degree of emphasis of the music is synchronized with the timing at which the video switches in the main part.

Each of the above processing means can also be implemented as a computer program. The program is provided on a suitable computer-readable recording medium or via a network, is installed when the present invention is carried out, and realizes the present invention by running on control means such as a CPU.

In the video preview generation device of the present invention configured in this way, first, as filter processing, the video sections of the video material for which a preview is to be generated are taken as processing targets, one or more of the following kinds of video section (a section may consist of a single frame) are detected, and the detected video sections are removed from the video material:
(a) video sections associated with audio fillers
(b) video sections associated with excessive camera work
(c) video sections in which no valid captured video exists

In other words, the filter processing first detects the low-quality video sections contained in the video material for which a preview is to be generated and removes them from the video material.

Next, the video sections of the video material that were not removed by the filter processing are taken as processing targets, one or more of the following kinds of video section (a section may consist of a single frame) are detected, and the detected video sections are used to generate the opening part of the video preview:
(a) video sections in which the degree of speech emphasis is greater than a predetermined threshold
(b) video sections in which the subject's face size is greater than a predetermined threshold
(c) video sections in which the subject's facial expression is greater than a predetermined threshold

That is, high-impact video sections are detected in the video material that remains after the filter processing, and those sections are used to generate an opening part of the video preview that consists of high-impact video sections.

Next, the video sections that were not removed by the filter processing and were not used for the opening part of the video preview are taken as processing targets, one or more of the following kinds of video section (a section may consist of a single frame) are detected, and the detected video sections are used to generate the main part of the video preview:
(a) video sections in which the amount of temporal change in the fundamental frequency of the speech is greater than a predetermined threshold
(b) video sections in which the degree of speech emphasis increases over time and the amount of temporal change of that increase is greater than a predetermined threshold

That is, among the video sections that were not removed by the filter processing and were not used for the opening part, video sections showing highlight scenes are detected, and those sections are used to generate a main part of the video preview that consists of video sections showing highlight scenes.

Finally, the video preview is generated by joining the generated opening part and the generated main part.

In this way, the present invention automatically generates, from the video material for which a preview is to be generated, a video preview having an opening part that consists of high-impact video sections and a main part that consists of video sections showing highlight scenes, without using the introductory text of an electronic program guide, the closed captions of the video, or the like.

Thus, according to the present invention, a preview of a video can be generated automatically regardless of the attributes of the video, such as commercial video or home video.

FIG. 1 is a device configuration diagram of the video preview generation device of the present invention.
FIG. 2 is a flowchart executed by the video preview generation device of the present invention.
FIG. 3 is a flowchart executed by the video preview generation device of the present invention.
FIG. 4 is a flowchart executed by the video preview generation device of the present invention.
FIG. 5 is a flowchart executed by the video preview generation device of the present invention.
FIG. 6 is an explanatory diagram of the camera work detection processing.
FIG. 7 is an explanatory diagram of an example of a junk shot.
FIG. 8 is an explanatory diagram of superimposing music on a video preview.

Hereinafter, the present invention is described in detail according to an embodiment. Note that a video section as described below may consist of a single frame.

FIG. 1 shows an example of the device configuration of a video preview generation device 1 embodying the present invention.

As shown in this figure, in order to automatically generate a preview of a video from video material without using the introductory text of an electronic program guide, the closed captions of the video, or the like, the video preview generation device 1 of the present invention includes a video material input unit 10, a video material storage unit 11, a filter unit 12, a preview-generation video material storage unit 13, an opening generation unit 14, a main part generation unit 15, a video preview generation unit 16, a video preview storage unit 17, and a video preview output unit 18.

Each of these processing units is described next.

The video material input unit 10 inputs the video material for which a preview is to be generated and stores it in the video material storage unit 11.

The filter unit 12 takes the video material stored in the video material storage unit 11 as its processing target, removes from it the low-quality video sections that cannot be used to generate a video preview, and stores the video material after removal in the preview-generation video material storage unit 13.

Video material often contains scenes that are redundant in terms of picture or sound. If the material is edited without removing these redundant scenes, the resulting video preview will contain unnecessary scenes. Such redundancy is thought to be related to features such as audio fillers (hesitations such as "ano-", a phenomenon peculiar to spoken language), excessive camera work, and junk shots (video sections in which no valid captured video exists).

To remove such redundancy, the filter unit 12 includes (a) an audio filler detection unit 120 that detects the positions at which audio fillers appear in the sound information of the video material and identifies the video sections associated with those positions, (b) an excessive camera work detection unit 121 that judges the camera work used when the video material was shot, thereby detects the positions at which excessive camera work appears, and identifies the video sections associated with those positions, and (c) a junk shot detection unit 122 that detects the positions at which junk shots appear in the video material and identifies the video sections associated with those positions. The filter unit 12 removes the video sections identified by these detection units from the input video material.

The opening generation unit 14 takes the video material stored in the preview-generation video material storage unit 13 as its processing target and automatically generates an opening part of the video preview that consists of high-impact video sections.

To carry out this automatic generation of the opening part, the opening generation unit 14 includes (a) a speech emphasis degree determination unit 140 that judges the degree of emphasis of the speech contained in the sound information of the video material, thereby detects the positions at which the degree of speech emphasis becomes greater than a predetermined threshold, and identifies the video sections associated with those positions, (b) a face size determination unit 141 that judges the face size of the subject contained in the video material, thereby detects the positions at which the face size becomes greater than a predetermined threshold, and identifies the video sections associated with those positions, (c) a facial expression determination unit 142 that judges the facial expression of the subject contained in the video material, thereby detects the positions at which the facial expression becomes greater than a predetermined threshold, and identifies the video sections associated with those positions, and (d) a music assignment unit 143 that assigns an order, according to a predetermined rule, to the video sections identified by the speech emphasis degree determination unit 140, the face size determination unit 141, and the facial expression determination unit 142, and determines the timing of switching between those video sections so that it is synchronized with music prepared in advance, thereby generating the opening part of the video preview.
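The selection performed by determination units 140 to 142 might be sketched in Python as follows. This is only an illustration under stated assumptions: the `Section` record, its field names, the default threshold values, and the choice to combine the three tests with a logical OR are all hypothetical and are not given in the specification. Ordering and music synchronization (unit 143) are left out.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical per-section measurements (field names are assumptions)."""
    start: float        # start time in seconds
    end: float          # end time in seconds
    emphasis: float     # degree of speech emphasis
    face_size: float    # face size measure, e.g. face area / frame area
    expression: float   # facial expression measure


def select_opening_sections(sections, emphasis_th=1.0, face_th=0.2, expr_th=0.5):
    """Keep sections in which at least one measure exceeds its threshold.

    The threshold defaults are placeholders, not values from the patent.
    """
    return [
        s for s in sections
        if s.emphasis > emphasis_th
        or s.face_size > face_th
        or s.expression > expr_th
    ]
```

A device that implements only one or two of the three determinations would simply drop the corresponding terms from the condition.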

The main part generation unit 15 takes as its processing target the video sections that are stored in the preview-generation video material storage unit 13 and were not used for the opening part of the video preview, and automatically generates a main part of the video preview that consists of video sections showing highlight scenes.

To carry out this automatic generation of the main part, the main part generation unit 15 includes (a) a fundamental frequency pattern determination unit 150 that judges the temporal change in the fundamental frequency of the speech contained in the sound information of the video material, thereby detects the positions at which the amount of temporal change in the fundamental frequency becomes greater than a predetermined threshold, and identifies the video sections associated with those positions, (b) a speech emphasis degree pattern determination unit 151 that judges the temporal change in the degree of emphasis of the speech contained in the sound information of the video material, thereby detects the positions at which the degree of speech emphasis increases over time and the amount of temporal change of that increase becomes greater than a predetermined threshold, and identifies the video sections associated with those positions, and (c) a music assignment unit 152 that assigns an order, according to a predetermined rule, to the video sections identified by the fundamental frequency pattern determination unit 150 and the speech emphasis degree pattern determination unit 151, and determines the timing of switching between those video sections so that it is synchronized with music prepared in advance, thereby generating the main part of the video preview.
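The two tests performed by determination units 150 and 151 might be sketched in Python as follows. The specification does not say how the "amount of temporal change" is measured, so this sketch assumes values sampled at a fixed interval `frame_sec`, a frame-to-frame difference for the fundamental frequency, and an average slope over a non-decreasing run for the emphasis degree. All function and parameter names are hypothetical.

```python
def max_f0_change_rate(f0, frame_sec):
    """Largest absolute frame-to-frame change of F0, in Hz per second."""
    return max((abs(b - a) / frame_sec for a, b in zip(f0, f0[1:])), default=0.0)


def emphasis_rise_rate(emphasis, frame_sec):
    """Average rate of increase of the emphasis degree over the section.

    Returns 0.0 if the emphasis degree ever decreases, since the text
    requires that it increase over time.
    """
    if len(emphasis) < 2 or any(b < a for a, b in zip(emphasis, emphasis[1:])):
        return 0.0
    return (emphasis[-1] - emphasis[0]) / ((len(emphasis) - 1) * frame_sec)


def is_highlight(f0, emphasis, frame_sec, f0_rate_th, rise_rate_th):
    """True if either test exceeds its (assumed) threshold."""
    return (
        max_f0_change_rate(f0, frame_sec) > f0_rate_th
        or emphasis_rise_rate(emphasis, frame_sec) > rise_rate_th
    )
```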

The video preview generation unit 16 generates a video preview by joining the opening part generated by the opening generation unit 14 and the main part generated by the main part generation unit 15, and stores the generated video preview in the video preview storage unit 17.

When output of a video is requested, the video preview output unit 18 reads the requested video preview from the video preview storage unit 17 and outputs it.

FIGS. 2 to 5 show the flowcharts executed by the video preview generation device 1 of the present invention configured in this way.

The processing executed by the video preview generation device 1 is described next with reference to these flowcharts.

[1] Overall processing
When generation of a video preview is requested, the video preview generation device 1 of the present invention first, in step S10, inputs the video material for which the preview is to be generated and stores it in the video material storage unit 11, as shown in the flowchart of FIG. 2.

Next, in step S20, in order to remove the low-quality video sections that cannot be used to generate a video preview, predetermined filter processing is applied to the video material stored in the video material storage unit 11, and the resulting video material is stored in the preview-generation video material storage unit 13.

Next, in step S30, the opening part of the video preview is generated using the video sections of the video material stored in the preview-generation video material storage unit 13.

Next, in step S40, the main part of the video preview is generated using the video sections that are stored in the preview-generation video material storage unit 13 and were not used for the opening part of the video preview.

Next, in step S50, the video preview is generated by joining the opening part generated in step S30 and the main part generated in step S40, and is stored in the video preview storage unit 17.

Next, in step S60, the video preview stored in the video preview storage unit 17 is output in response to a video preview output request.
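The flow of steps S20 to S50 might be sketched in Python as follows. This is a minimal illustration only: the function name and the three predicate arguments are hypothetical stand-ins for the filter unit 12, the opening generation unit 14, and the main part generation unit 15. Storage, ordering of sections, and music synchronization are omitted.

```python
def generate_preview(sections, is_low_quality, is_high_impact, is_highlight):
    """Return the sections of a preview: opening part followed by main part.

    `sections` is any list of section objects; the three predicates are
    assumed to be supplied by the caller.
    """
    # S20: drop low-quality sections (fillers, excessive camera work, junk shots).
    usable = [s for s in sections if not is_low_quality(s)]
    # S30: high-impact sections form the opening part.
    opening = [s for s in usable if is_high_impact(s)]
    # S40: highlight sections not used for the opening form the main part.
    main = [s for s in usable if not is_high_impact(s) and is_highlight(s)]
    # S50: join the opening part and the main part.
    return opening + main
```

Passing the detectors in as predicates keeps the sketch independent of how each detection is actually carried out.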

[2] Details of the processing of step S20
Next, the filter processing executed in step S20 of the flowchart of FIG. 2 is described in detail with reference to the flowchart of FIG. 3.

On entering the processing of step S20 of the flowchart of FIG. 2, the video preview generation device 1 of the present invention takes the video material stored in the video material storage unit 11 as its processing target. As shown in the flowchart of FIG. 3, it first, in step S200, detects the positions at which audio fillers appear and identifies the video sections associated with those positions, and then, in step S201, removes those video sections from the video material.

Audio fillers often appear in utterances that lack semantic coherence, such as scenes in which a person speaks while thinking, and the video sections associated with them are often unnecessary for generating a video preview. For this reason, the positions at which audio fillers appear are detected, the video sections associated with those positions are identified, and the identified video sections are removed from the video material.

As a method of detecting audio fillers, for example, the time history of the fundamental frequency is measured, and a filler is judged to be present when no large change is seen in the fundamental frequency value over a predetermined period such as one second. The fundamental frequency can be extracted using, for example, the method described in Reference 1 below.
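The flat-pitch test described above might be sketched in Python as follows. It assumes that a fundamental frequency track has already been extracted at a fixed frame interval, with 0 marking unvoiced frames. The function name, the 10 ms frame interval, and the 5 Hz tolerance used to decide what counts as "no large change" are assumptions made for this sketch and do not come from the specification.

```python
def find_filler_spans(f0, frame_sec=0.01, min_sec=1.0, tol_hz=5.0):
    """Return (start_sec, end_sec) spans where voiced F0 stays nearly flat.

    A span qualifies when max(F0) - min(F0) <= tol_hz for at least min_sec.
    """
    spans = []
    n = len(f0)
    start = 0
    while start < n:
        if f0[start] <= 0:          # skip unvoiced frames
            start += 1
            continue
        lo = hi = f0[start]
        end = start + 1
        # Extend the run while it stays voiced and within the tolerance band.
        while end < n and f0[end] > 0:
            new_lo, new_hi = min(lo, f0[end]), max(hi, f0[end])
            if new_hi - new_lo > tol_hz:
                break
            lo, hi = new_lo, new_hi
            end += 1
        if (end - start) * frame_sec >= min_sec:
            spans.append((start * frame_sec, end * frame_sec))
            start = end             # continue after the detected span
        else:
            start += 1
    return spans
```

The returned time spans would then be mapped to the video sections that are removed in step S201.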

Reference 1: Sadaoki Furui, Digital Signal Processing, Tokai University Press, pp.57-63.

After the processing of steps S200 and S201 has been executed in this way, in step S202 the positions at which excessive camera work appears are detected and the video sections associated with those positions are identified, and then, in step S203, those video sections are removed from the video material.

Video shot with excessive camera work is of poor quality, and the video sections associated with it are often unsuitable for generating a video preview. For this reason, the positions at which excessive camera work appears are detected, the video sections associated with those positions are identified, and the identified video sections are removed from the video material.

Camera work can be detected using, for example, the method described in Reference 2 below.

Reference 2: Japanese Patent No. 3408117, Yukinobu Taniguchi, Akihito Akutsu, Yoshinobu Tonomura, "Camera operation estimation method and recording medium on which a camera operation estimation program is recorded"

In step S202, for example, camera work is detected by measuring how the angle of view at the time of shooting changes over time, as shown in FIG. 6. FIG. 6 shows the transition of the lower-left vertex of the angle of view for the camera motion from time t0 to time t5.

In step S202, for example, A: a regression line, B: a quadratic regression curve, or the like is obtained for this transition. When the reliability coefficient is at or above a threshold R, the camera work is judged to be stable; when it is below the threshold R, the camera work is judged to be excessive. The threshold R may be set to 0.7, for example, and the judgment may also be made using a regression curve of third or higher order.
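The regression-based judgment might be sketched in Python as follows. Several points here are assumptions rather than statements of the specification: the "reliability coefficient" is interpreted as the coefficient of determination of the fit, each coordinate of the lower-left vertex is fitted separately against time, and the camera work is treated as stable when the best of the tried polynomial degrees reaches the threshold for both coordinates. The function names are hypothetical. The sketch assumes more samples than the polynomial degree and distinct time stamps.

```python
def polyfit_r2(t, v, degree):
    """Least-squares polynomial fit of v(t); returns the R^2 of the fit."""
    n = degree + 1
    # Normal equations A c = b for the polynomial coefficients c.
    A = [[sum(ti ** (i + j) for ti in t) for j in range(n)] for i in range(n)]
    b = [sum(vi * ti ** i for ti, vi in zip(t, v)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - tail) / A[i][i]
    mean = sum(v) / len(v)
    ss_tot = sum((vi - mean) ** 2 for vi in v)
    ss_res = sum(
        (vi - sum(c * ti ** k for k, c in enumerate(coef))) ** 2
        for ti, vi in zip(t, v)
    )
    # A coordinate that does not move at all is treated as perfectly stable.
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot


def is_excessive_camera_work(times, xs, ys, r_threshold=0.7, degrees=(1, 2)):
    """True if no tried regression explains the vertex trajectory well enough."""
    best = max(
        min(polyfit_r2(times, xs, d), polyfit_r2(times, ys, d)) for d in degrees
    )
    return best < r_threshold
```

A smooth pan yields a trajectory that a line or parabola fits closely, whereas hand shake produces a trajectory that neither fits, which is the distinction the threshold R is meant to capture.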

After the processing of steps S202 and S203 has been executed in this way, in step S204 the positions at which junk shots appear are detected and the video sections associated with those positions are identified, and then, in step S205, those video sections are removed from the video material.

The video portion of a junk shot is not valid video, and the video sections associated with it are unnecessary for generating a video preview. For this reason, the positions at which junk shots appear are detected, the video sections associated with those positions are identified, and the identified video sections are removed from the video material.

One kind of junk shot arises, for example, as shown in FIG. 7, when the timing at which the stop button is pressed after viewing recorded video (the playback stop position) differs from the time at which shooting ended. In addition, where there is no recorded video at all, a blue or black screen may be displayed; such junk shots can be extracted by detecting the sections in which nothing is recorded as data.

In video material for commercial use, a tool called a clapperboard may be used to signal the start of shooting, and a screen called a color bar may be inserted. These are also regarded as junk shots. For a clapperboard, if the pitch and frequency of the sound produced when a clapperboard is used are stored in advance, the use of the clapperboard can be detected acoustically, and the junk shot can be extracted by extracting the video portion that precedes the use of the clapperboard. For a color bar, a color histogram is measured and the screen is judged to be a color bar when it consists of only 1 to N colors; the insertion of a color bar screen can be detected in this way, and the junk shot can be extracted. N may be set in advance to, for example, 5.
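The color-count test for color bars might be sketched in Python as follows. The specification only says that a color histogram is measured and that a screen made up of 1 to N colors is judged to be a color bar. Quantizing each RGB channel into a few levels before counting, and the function names, are assumptions made for this sketch.

```python
from collections import Counter


def count_colors(pixels, levels=4):
    """Count distinct colors after quantizing each RGB channel into `levels` bins.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    """
    step = 256 // levels
    hist = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return len(hist)


def looks_like_color_bar(pixels, max_colors=5):
    """True if the frame consists of between 1 and max_colors colors."""
    pixels = list(pixels)
    if not pixels:
        return False
    return 1 <= count_colors(pixels) <= max_colors
```

Under this test a uniformly blue or black frame also counts as a single-color screen, so in practice it would flag those frames as well.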

After the processing of steps S204 and S205 has been executed in this way, in step S206 the video material from which the unnecessary video sections have been removed is stored in the video material storage unit 13 for video preview generation, and the processing of step S20 in the flowchart of FIG. 2 ends.

In this way, on entering step S20 of the flowchart of FIG. 2, the video preview generation device 1 of the present invention executes the flowchart of FIG. 3, thereby applying to the video material stored in the video material storage unit 11 a filter process that removes video sections unnecessary for generating the video preview, and storing the filtered video material in the video material storage unit 13 for video preview generation.

[3] Details of the processing in step S30
Next, the processing for generating the opening part of the video preview, executed in step S30 of the flowchart of FIG. 2, is described in detail with reference to the flowchart of FIG. 4.

On entering step S30 of the flowchart of FIG. 2, the video preview generation device 1 of the present invention takes the video material stored in the video material storage unit 13 for video preview generation as its processing target and, as shown in the flowchart of FIG. 4, first detects in step S300 the video sections in which the speech emphasis degree exceeds a predetermined threshold.

The speech emphasis degree may be obtained from the emphasis probability and the calm probability of the speech. Specifically, the probability ratio Ps(e)/Ps(n) between the emphasis probability Ps(e) and the calm probability Ps(n) of each speech sub-paragraph may be used as the speech emphasis degree. The emphasis probability and the calm probability can be obtained, for example, by the method described in Reference 3 below.
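As a minimal sketch of step S300, the ratio Ps(e)/Ps(n) can be computed per speech sub-paragraph and compared with a threshold. The probabilities themselves are assumed to be supplied by a separate estimator; the function names and the small `eps` guard against division by zero are illustrative assumptions, not part of the description.

```python
def emphasis_degree(p_emphasis, p_calm, eps=1e-12):
    """Speech emphasis degree as the probability ratio Ps(e) / Ps(n).
    `eps` is an assumed guard against a zero calm probability."""
    return p_emphasis / max(p_calm, eps)

def emphasized_sections(sections, threshold):
    """Return the ids of the sections whose emphasis degree exceeds `threshold`.

    sections : list of (section_id, Ps(e), Ps(n)) tuples (assumed format)
    """
    return [sid for sid, pe, pn in sections
            if emphasis_degree(pe, pn) > threshold]
```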

Reference 3: Japanese Patent No. 3803311, Kota Hidaka, Osamu Mizuno, Shinya Nakajima, "Speech processing method, apparatus using the method, and program therefor"
Next, in step S301, frames in which the subject's face size exceeds a predetermined threshold are detected, and thereby the video sections (which may consist of a single frame) in which the subject's face size exceeds the predetermined threshold are detected.

The subject's face size may be measured using a classifier based on Haar-like features trained by AdaBoost. For example, a large number of weak classifiers are arranged in a cascade, and the cascade classifier is applied while varying the size and position of the detection window to locate the face image region. This is described in, for example, Reference 4 below. Once the face image region has been found, the face size can be measured.

Reference 4: Paul Viola, Michael J. Jones, Robust Real-Time Face Detection. International Journal of Computer Vision, Vol. 57, No. 2, pp. 137-154 (2004).
Next, in step S302, frames in which the richness of the subject's facial expression exceeds a predetermined threshold (frames in which the face shows an expression) are detected, and thereby the video sections (which may consist of a single frame) in which the richness of the subject's facial expression exceeds the predetermined threshold are detected.

The subject's facial expression is evaluated by taking the frames judged to contain a face image as the processing target and detecting whether or not the face shows an expression. Facial expressions can be detected, for example, by the method described in Reference 5 below.

Reference 5: Kotsia I., Pitas I., Facial Expression Recognition in Image Sequences Using Geometric Deformation Features and Support Vector Machines. IEEE Transactions on Image Processing, Vol. 16, No. 1, pp. 172-187 (2007).
Next, in step S303, the video sections to be used for generating the opening part are selected from among the video sections detected in steps S300 to S302, an order is determined for the selected video sections according to a predetermined rule, and the opening part of the video preview is generated by arranging the video sections in that order.

For example, one or more frames for which (i) the speech emphasis degree is high, (ii) the face size is large, and (iii) the facial expression is rich are first extracted as candidate frames for the opening part. When candidate frames are consecutive, the scene made up of those candidate frames is extracted. Not all of conditions (i), (ii), and (iii) need be satisfied; for example, frames (scenes) that satisfy the conditions in the priority order (i), (ii), (iii) may be extracted as candidates for the opening part.

Here, for example, the speech emphasis degree is obtained for each scene (a video section such as a speech sub-paragraph or speech paragraph described in Reference 3), the scenes are sorted in descending order of speech emphasis degree, and a scene that falls within the top PE% of the total number of scenes (PE is set in advance to a value such as 10) is defined as having a high speech emphasis degree. A face size is defined as large when it is PS% or more of the whole (PS is set in advance to a value such as 20). A facial expression such as a smile, surprise, or sadness is defined as rich.
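The three definitions above can be sketched as simple predicates. This is an illustration under stated assumptions: "the whole" is interpreted here as the frame area, the top PE% is rounded up to a whole number of scenes, and the expression labels are hypothetical outputs of a separate expression recognizer. The face rectangle is assumed to come from a face detector such as the cascade classifier of Reference 4.

```python
import math

def top_percent_ids(scores, percent=10):
    """Return the ids of the scenes in the top `percent`% by emphasis degree.

    scores  : dict mapping scene id -> speech emphasis degree
    percent : the PE of the description; rounding up is an assumption
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # descending sort
    keep = math.ceil(len(ranked) * percent / 100)
    return set(ranked[:keep])

def face_is_large(face_w, face_h, frame_w, frame_h, ps_percent=20):
    """True if the face region covers at least PS% of the frame area
    (interpreting 'the whole' as the frame area is an assumption)."""
    return 100 * (face_w * face_h) / (frame_w * frame_h) >= ps_percent

# Hypothetical labels produced by an expression recognizer.
RICH_EXPRESSIONS = {"smile", "surprise", "sadness"}

def expression_is_rich(label):
    """True if the recognized expression counts as rich."""
    return label in RICH_EXPRESSIONS
```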

Next, the frames to be used for generating the opening part are selected from the candidate frames extracted in this way, for example by selecting a predetermined number of frames in descending order of speech emphasis degree. An order is then determined for those frames, for example in descending order of speech emphasis degree or in order of earliest shooting time, and the opening part of the video preview is generated by arranging the frames in that order.
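The selection and ordering just described can be sketched as below. The dictionary keys (`id`, `emphasis`, `shot_time`) and the function name `build_opening` are hypothetical and chosen only for illustration.

```python
def build_opening(candidates, count, order_by="emphasis"):
    """Select `count` candidates by descending emphasis degree, then order them.

    candidates : list of dicts with keys 'id', 'emphasis', 'shot_time'
                 (assumed format)
    order_by   : 'emphasis' keeps the descending-emphasis order,
                 'time' reorders the selection by earliest shooting time
    """
    chosen = sorted(candidates, key=lambda c: c["emphasis"], reverse=True)[:count]
    if order_by == "time":
        chosen = sorted(chosen, key=lambda c: c["shot_time"])
    return [c["id"] for c in chosen]
```

The same routine applies when the unit is a scene rather than a single frame.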

Although frames have been used as the concrete example in this description, the opening part of the video preview generated by the video preview generation device 1 of the present invention may be a scene (video section) made up of a plurality of frames. In that case, conditions (i), (ii), and (iii) above are applied in the same manner with the scene as the unit to generate the opening scene.

The candidate frames (scenes) selected under the above conditions may include sections that look similar to one another. If several such similar sections are included in the opening part of the video preview, the visual information becomes redundant and the size of the video preview increases, which is inefficient.

It is therefore preferable to delete similar sections in advance so that only one of each group of similar sections is included.

Such processing can be carried out, for example, using the image information of the candidate frames (scenes). For example, color information, texture information, shape information, and the like are extracted from the images and converted into vectors, and one of any two candidate frames (scenes) whose vectors are close to each other is excluded from the candidate frames (scenes). Which of the two is excluded may be decided according to the priority based on the rules described above, excluding the one with the lower priority; for example, the one with the lower speech emphasis degree or the one with the later shooting time may be excluded.
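One way to sketch this exclusion is a greedy pass in priority order that keeps a candidate only if its feature vector is far enough from every candidate already kept. The use of Euclidean distance, the `max_distance` parameter, and the dictionary keys are assumptions; the description does not fix a particular distance measure.

```python
import math

def remove_similar(candidates, max_distance):
    """Drop the lower-priority member of any pair of similar candidates.

    candidates   : list of dicts with keys 'id', 'feature' (sequence of
                   floats built from color/texture/shape information) and
                   'priority' (e.g. the speech emphasis degree); assumed format
    max_distance : assumed threshold; vectors within this Euclidean
                   distance are treated as similar
    """
    kept = []
    # Visit candidates from highest to lowest priority so that, of two
    # similar candidates, the lower-priority one is the one excluded.
    for cand in sorted(candidates, key=lambda c: c["priority"], reverse=True):
        if all(math.dist(cand["feature"], k["feature"]) > max_distance
               for k in kept):
            kept.append(cand)
    return [c["id"] for c in kept]
```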

In this way, on entering step S30 of the flowchart of FIG. 2, the video preview generation device 1 of the present invention executes the flowchart of FIG. 4, thereby taking the video sections of the video material stored in the video material storage unit 13 for video preview generation as the processing target, detecting the high-impact video sections contained in that video material, and using the detected video sections to generate an opening part of the video preview made up of high-impact video sections.

When generating the opening part of the video preview according to the flowchart of FIG. 4, the video preview generation device 1 of the present invention does not assume that a musical sound is superimposed on the opening part. However, a musical sound may be superimposed on the opening part so that the opening part is edited to convey a greater sense of anticipation.

In this case, the selected musical sound may simply be superimposed on the opening part, or it may be superimposed in synchronization with the tone or emphasis degree of the musical sound. For example, as shown in FIG. 8, the video sections making up the opening part may be inserted at the times when the emphasis degree rises sharply or when the tone changes.

That is, the frames or scenes making up the opening part are inserted at the intervals of the tone changes. Likewise, for the emphasis degree, the frames or scenes making up the opening part are inserted at the times at which the emphasis degree peaks. For the emphasis degree, a threshold is set on the difference from the i frames before and after and/or on the emphasis degree value itself, and the times at which the threshold is met or exceeded are used as the insertion times of the frames or scenes.
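The thresholding of the emphasis degree can be sketched as follows. The reading adopted here, that the value must exceed both the value i frames earlier and the value i frames later by at least the difference threshold, is an assumption about how the difference is applied; the function and parameter names are likewise illustrative.

```python
def insertion_times(emphasis, i=1, value_threshold=None, diff_threshold=None):
    """Return the time indices at which frames or scenes would be inserted.

    emphasis        : list of emphasis degree values of the musical sound,
                      one per time index (assumed format)
    i               : offset, in frames, used for the before/after difference
    value_threshold : optional minimum emphasis degree value
    diff_threshold  : optional minimum difference from the values i frames
                      before and i frames after (peak-like behaviour)
    """
    times = []
    for t in range(i, len(emphasis) - i):
        value = emphasis[t]
        if value_threshold is not None and value < value_threshold:
            continue
        if diff_threshold is not None:
            rise = value - emphasis[t - i]   # difference from i frames before
            fall = value - emphasis[t + i]   # difference from i frames after
            if rise < diff_threshold or fall < diff_threshold:
                continue
        times.append(t)
    return times
```

Either threshold may be used alone, or both together, matching the "and/or" of the description.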

In FIG. 8, tone changes are indicated by inverted triangles and the emphasis degree by a curve. The emphasis degree is inherently a discrete value, but in this figure the discrete values are shown connected by straight lines or curves (the discrete values may also be approximated by a spline curve).

A tone change may be determined by measuring the peak value of the amplitude per unit time T and, when the peak interval is approximately Tp, judging that the tone changes at the time interval Tp. A sound effect that conveys tension can be rendered onomatopoeically as "don, don, don, ...". In such a case, the interval between one "don" and the next is expected to be constant, appearing at an interval such as Tp = 1 second, so the tone may be judged to change at the time interval Tp. Alternatively, the frequency per unit time T may be measured, and if a specific frequency appears at intervals of Tp, this may be judged to be a tone change.
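The amplitude-based variant can be sketched as below: the signal is divided into windows of length T, the strongest sample in each window is taken as its peak, and peaks spaced approximately Tp apart are reported as tone change times. All quantities are in sample indices; `min_amplitude` and `tolerance` are assumed parameters that make "approximately Tp" concrete.

```python
def window_peaks(samples, window):
    """Return (index, |amplitude|) of the strongest sample in each
    consecutive window of `window` samples (the unit time T)."""
    peaks = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        offset = max(range(len(chunk)), key=lambda k: abs(chunk[k]))
        peaks.append((start + offset, abs(chunk[offset])))
    return peaks

def tone_change_times(samples, window, tp, min_amplitude, tolerance):
    """Return the sample indices of peaks that recur at intervals of about `tp`.

    min_amplitude : assumed floor below which a window peak is ignored
    tolerance     : assumed allowed deviation of the peak interval from `tp`
    """
    strong = [idx for idx, amp in window_peaks(samples, window)
              if amp >= min_amplitude]
    regular = set()
    for prev, cur in zip(strong, strong[1:]):
        if abs((cur - prev) - tp) <= tolerance:
            regular.update((prev, cur))
    return sorted(regular)
```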

When a musical sound is superimposed on the opening part of the video preview in this way, the scenes used to generate the video preview carry their original audio information, but this audio information is not necessarily required. For example, by inserting scenes consisting of the visual component only, a sense of tension or reassurance can be conveyed in time with the sound effect (musical sound).

The musical sound superimposed on the opening part of the video preview is not limited to sound effects and may be a piece of music. For example, a sharp rise in the emphasis degree of a popular song may correspond to what is commonly called the "sabi" (the chorus or hook), and inserting frames or scenes at the time of the "sabi" makes it possible to achieve emotionally expressive editing in the same way as described above.

[4] Details of the processing in step S40
Next, the processing for generating the main part of the video preview, executed in step S40 of the flowchart of FIG. 2, is described in detail with reference to the flowchart of FIG. 5.

On entering step S40 of the flowchart of FIG. 2, the video preview generation device 1 of the present invention takes as its processing target the video sections of the video material stored in the video material storage unit 13 for video preview generation that were not used for the opening part of the video preview and, as shown in the flowchart of FIG. 5, first detects in step S400 the video sections in which the temporal change in the fundamental frequency of the speech exceeds a predetermined threshold.

Next, in step S401, the video sections in which the speech emphasis degree increases over time and the temporal change of that increase exceeds a predetermined threshold are detected.
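Steps S400 and S401 can be sketched together as follows. The measures used here (the largest frame-to-frame change in fundamental frequency, and the average step of a strictly increasing emphasis track) are assumptions chosen for illustration; the description only requires that the respective temporal changes exceed predetermined thresholds.

```python
def max_f0_change(f0_track):
    """Largest absolute change between consecutive fundamental frequency
    values in a section (assumed measure of temporal change)."""
    return max((abs(b - a) for a, b in zip(f0_track, f0_track[1:])),
               default=0.0)

def emphasis_rise_rate(emphasis_track):
    """Average increase per step if the emphasis degree rises at every
    step, otherwise 0.0 (strict monotonic rise is an assumption)."""
    steps = [b - a for a, b in zip(emphasis_track, emphasis_track[1:])]
    if not steps or any(s <= 0 for s in steps):
        return 0.0
    return sum(steps) / len(steps)

def detect_main_sections(sections, f0_threshold, rise_threshold):
    """Return (ids detected in S400, ids detected in S401).

    sections : list of dicts with keys 'id', 'f0', 'emphasis' (assumed format)
    """
    by_f0 = [s["id"] for s in sections
             if max_f0_change(s["f0"]) > f0_threshold]
    by_rise = [s["id"] for s in sections
               if emphasis_rise_rate(s["emphasis"]) > rise_threshold]
    return by_f0, by_rise
```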

Next, in step S402, the video sections to be used for generating the main part are selected from among the video sections detected in steps S400 and S401, an order is determined for the selected video sections according to a predetermined rule, and the main part of the video preview is generated by arranging the video sections in that order.

For example, one or more video sections for which (i) the temporal change in the fundamental frequency of the speech is large and (ii) the speech emphasis degree increases over time and the temporal change of that increase is large are first extracted as candidate video sections for the main part. Conditions (i) and (ii) need not both be satisfied; for example, video sections that satisfy the conditions in the priority order (i), (ii) may be extracted as candidate video sections for the main part.

Next, the video sections to be used for generating the main part are selected from the candidate video sections extracted in this way, for example by selecting a predetermined number of video sections in descending order of the temporal change in the fundamental frequency of the speech. An order is then determined for those video sections, for example in descending order of the temporal change in the fundamental frequency of the speech or in order of earliest shooting time, and the main part of the video preview is generated by arranging the video sections in that order.

The candidate video sections selected under the above conditions may include sections that look similar to one another. If several such similar sections are included in the main part of the video preview, the visual information becomes redundant and the size of the video preview increases, which is inefficient.

As with the opening part, such similar sections can be removed, for example, using the image information of the candidate video sections. For example, color information, texture information, shape information, and the like are extracted from the images and converted into vectors, and one of any two candidate video sections whose vectors are close to each other is excluded from the candidate video sections. Which of the two is excluded may be decided according to the priority based on the rules described above, excluding the one with the lower priority; for example, the one with the smaller temporal change in the fundamental frequency or the one with the later shooting time may be excluded.

In this way, on entering step S40 of the flowchart of FIG. 2, the video preview generation device 1 of the present invention executes the flowchart of FIG. 5, thereby taking as the processing target the video sections of the video material stored in the video material storage unit 13 for video preview generation that were not used for the opening part of the video preview, detecting the video sections that represent highlight scenes in that video material, and using the detected video sections to generate a main part of the video preview made up of video sections representing highlight scenes.

When generating the main part of the video preview according to the flowchart of FIG. 5, the video preview generation device 1 of the present invention does not assume that a musical sound is superimposed on the main part. However, a musical sound may be superimposed on the main part so that the main part is edited to convey a greater sense of anticipation.

In this case, the selected musical sound may simply be superimposed on the main part, or it may be superimposed in synchronization with the tone or emphasis degree of the musical sound. For example, as shown in FIG. 8, the video sections making up the main part may be inserted at the times when the emphasis degree rises sharply or when the tone changes.

That is, the video sections making up the main part are inserted at the intervals of the tone changes. Likewise, for the emphasis degree, the video sections making up the main part are inserted at the times at which the emphasis degree peaks.

When a musical sound is superimposed on the main part of the video preview in this way, the scenes used to generate the video preview carry their original audio information, but this audio information is not necessarily required. For example, by inserting scenes consisting of the visual component only, a sense of tension or reassurance can be conveyed in time with the sound effect (musical sound).

The musical sound superimposed on the main part of the video preview is not limited to sound effects and may be a piece of music. For example, a sharp rise in the emphasis degree of a popular song may correspond to what is commonly called the "sabi" (the chorus or hook), and inserting frames or scenes at the time of the "sabi" makes it possible to achieve emotionally expressive editing in the same way as described above.

The present invention is applicable wherever a preview of a video needs to be generated. Applying the present invention makes it possible to generate a preview of a video automatically, regardless of the attributes of the video, such as whether it is a commercial video or a home video.

Description of Symbols
1 Video preview generation device
10 Video material input unit
11 Video material storage unit
12 Filter unit
13 Video material storage unit for video preview generation
14 Opening part generation unit
15 Main part generation unit
16 Video preview generation unit
17 Video preview storage unit
18 Video preview output unit
120 Speech filler detection unit
121 Excessive camera work detection unit
122 Junk shot detection unit
140 Speech emphasis degree determination unit
141 Face size determination unit
142 Facial expression determination unit
143 Musical sound adding unit
150 Fundamental frequency pattern determination unit
151 Speech emphasis degree pattern determination unit
152 Musical sound adding unit

Claims

A video preview generation device that generates a preview of a video from video material, characterized by comprising:
filter means for taking the video sections of the video material as a processing target, detecting one or more of a video section associated with a speech filler, a video section associated with excessive camera work, and a video section in which no captured video exists, and removing the detected video sections from the video material;
opening part generation means for taking the video sections of the video material not removed by the filter means as a processing target, detecting one or more of a video section in which the speech emphasis degree exceeds a predetermined threshold, a video section in which the subject's face size exceeds a predetermined threshold, and a video section in which the richness of the subject's facial expression exceeds a predetermined threshold, and generating the opening part of the video preview using the detected video sections;
main part generation means for taking as a processing target the video sections of the video material that were not removed by the filter means and were not used for the opening part, detecting one or more of a video section in which the temporal change in the fundamental frequency of the speech exceeds a predetermined threshold and a video section in which the speech emphasis degree increases over time and the temporal change of that increase exceeds a predetermined threshold, and generating the main part of the video preview using the detected video sections; and
video preview generation means for generating the video preview by combining the opening part generated by the opening part generation means and the main part generated by the main part generation means.

The video preview generation device according to claim 1, characterized in that
either or both of the opening part generation means and the main part generation means add a musical sound to the video such that one of the tone, tempo, and emphasis degree of the musical sound is synchronized with the timing at which the video switches.

A video preview generation method executed by a video preview generation device that generates a preview of a video from video material, characterized by comprising:
a step of taking the video sections of the video material as a processing target, detecting one or more of a video section associated with a speech filler, a video section associated with excessive camera work, and a video section in which no captured video exists, and removing the detected video sections from the video material;
a step of taking the video sections of the video material that have not been removed as a processing target, detecting one or more of a video section in which the speech emphasis degree exceeds a predetermined threshold, a video section in which the subject's face size exceeds a predetermined threshold, and a video section in which the richness of the subject's facial expression exceeds a predetermined threshold, and generating the opening part of the video preview using the detected video sections;
a step of taking as a processing target the video sections of the video material that have not been removed and were not used for the opening part, detecting one or more of a video section in which the temporal change in the fundamental frequency of the speech exceeds a predetermined threshold and a video section in which the speech emphasis degree increases over time and the temporal change of that increase exceeds a predetermined threshold, and generating the main part of the video preview using the detected video sections; and
a step of generating the video preview by combining the generated opening part and the generated main part.

The video preview generation method according to claim 3, characterized in that
either or both of the step of generating the opening part and the step of generating the main part add a musical sound to the video such that one of the tone, tempo, and emphasis degree of the musical sound is synchronized with the timing at which the video switches.

A video preview generation program for causing a computer to function as means for constituting the video preview generation device according to claim 1.