JP6318451B2

JP6318451B2 - Saliency image generating apparatus, method, and program

Info

Publication number: JP6318451B2
Application number: JP2014265444A
Authority: JP
Inventors: 昭悟木村; 邦夫柏野; 次郎中島; 晃宏杉本
Original assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2014-05-27
Filing date: 2014-12-26
Publication date: 2018-05-09
Anticipated expiration: 2034-12-26
Also published as: JP2016006478A

Description

The present invention relates to a saliency image generating apparatus, method, and program, and in particular to a saliency image generating apparatus, method, and program that generate a saliency image indicating the saliency at each position of the input image of the frame at each time.

Through a mechanism called visual attention, humans instantly pick out the information that appears important in the image projected on the retina, and thereby acquire information efficiently. Simulating these human perceptual characteristics on a computer is expected to enable an artificial visual system that, like humans, actively selects information according to its importance.

Methods based on visual saliency are the common way to simulate visual attention on a computer. Such a method calculates, for each part of a given image signal, the visual saliency, that is, the degree to which a human directs attention to that part, and predicts locations whose visual saliency is at or above a predetermined value as gaze locations.

The methods described in Non-Patent Documents 1 and 2 have been proposed as gaze prediction methods based on visual saliency. Both employ a probabilistic saliency model called Bayesian surprise. For a time series of input image signals, the Bayesian surprise model sequentially predicts, at each position in the image space, the visual stimulus that is likely to occur next, and assigns high visual saliency to locations where the visual stimulus caused by a newly input image signal deviates from the prediction by a certain amount or more.
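As a rough illustration of the idea behind Bayesian surprise, the sketch below keeps a Gaussian belief about the stimulus at one image position, updates it with each new observation, and scores the observation by the KL divergence between the updated belief and the previous one. The Gaussian model, the fixed observation variance `obs_var`, and the function names are assumptions made for this example only; Non-Patent Documents 1 and 2 define their own probabilistic model.

```python
import math

def gaussian_kl(m1, v1, m0, v0):
    """KL divergence KL(N(m1, v1) || N(m0, v0)) for scalar Gaussians."""
    return 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

def surprise_step(prior_mean, prior_var, observation, obs_var=1.0):
    """Update a Gaussian belief with one observation.

    Returns (posterior_mean, posterior_var, surprise).
    """
    # Conjugate update for a Gaussian mean with known observation variance
    # (an assumed model, chosen for brevity).
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + observation / obs_var)
    # Surprise: how far the observation moved the belief.
    surprise = gaussian_kl(post_mean, post_var, prior_mean, prior_var)
    return post_mean, post_var, surprise

# An observation that matches the prediction versus one that does not.
expected = surprise_step(0.0, 1.0, observation=0.0)[2]
unexpected = surprise_step(0.0, 1.0, observation=5.0)[2]
```

In this toy setting the observation far from the predicted value yields the larger surprise, which is the behavior the model uses to assign high saliency.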

(Non-Patent Document 1) L. Itti, P.F. Baldi, “A principled approach to detecting surprising events in videos,” Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2005), pp.631-637, 2005.

(Non-Patent Document 2) L. Itti, P.F. Baldi, “Bayesian surprise attracts human attention,” Vision Research, Vol.49, No.10, pp.1295-1306, 2009.

Many prior techniques, including Non-Patent Documents 1 and 2, have the problem that they can use only part of the signals that make up a video, namely the image signal. The acoustic signal, the other main component of a video, strongly influences human gaze behavior: people tend to look toward the direction of an attention-drawing sound, and toward objects that move in synchrony with changes in sound. It therefore needs to be incorporated appropriately into the calculation of visual saliency. However, there has been almost no discussion of visual saliency models that use both the image signal and the acoustic signal and focus on the interaction between them.

The present invention has been made in view of the above problem. Its object is to provide a saliency image generating apparatus, method, and program that can generate a saliency image indicating the saliency at each position of the input image of the frame at each time, using the input images of the frames at each time that constitute an input video and the acoustic signal that constitutes the input video.

To achieve the above object, a saliency image generating apparatus according to the present invention includes: an image basic saliency image extraction unit that, for the input image of the frame at each time constituting an input video, generates, for each of a plurality of feature types, a basic saliency image indicating the degree to which the input image has a salient characteristic, and forms a set of basic saliency images; an acoustic saliency signal calculation unit that, for the acoustic signal constituting the input video, generates an acoustic saliency signal indicating the degree to which the signal has a salient characteristic at each time; an image basic saliency selection unit that, for each of the plurality of feature types and for each time and each pixel, calculates the correlation between that pixel of the basic saliency image for the feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time, and generates a main image basic saliency component based on the correlations for each time and each pixel for each of the plurality of feature types; and an image saliency image calculation unit that generates a saliency image indicating the saliency at each position of the input image of the frame at each time, based on the set of basic saliency images for the frame at each time and the main image basic saliency component.

In a saliency image generating method according to the present invention, an image basic saliency image extraction unit, for the input image of the frame at each time constituting an input video, generates, for each of a plurality of feature types, a basic saliency image indicating the degree to which the input image has a salient characteristic, and forms a set of basic saliency images; an acoustic saliency signal calculation unit, for the acoustic signal constituting the input video, generates an acoustic saliency signal indicating the degree to which the signal has a salient characteristic at each time; an image basic saliency selection unit, for each of the plurality of feature types and for each time and each pixel, calculates the correlation between that pixel of the basic saliency image for the feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time, and generates a main image basic saliency component based on the correlations for each time and each pixel for each of the plurality of feature types; and an image saliency image calculation unit generates a saliency image indicating the saliency at each position of the input image of the frame at each time, based on the set of basic saliency images for the frame at each time and the main image basic saliency component.

According to the present invention, the image basic saliency image extraction unit, for the input image of the frame at each time constituting the input video, generates, for each of a plurality of feature types, a basic saliency image indicating the degree to which the input image has a salient characteristic, and forms a set of basic saliency images.

The acoustic saliency signal calculation unit, for the acoustic signal constituting the input video, generates an acoustic saliency signal indicating the degree to which the signal has a salient characteristic at each time.

The image basic saliency selection unit, for each of the plurality of feature types and for each time and each pixel, calculates the correlation between that pixel of the basic saliency image for the feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time, and generates a main image basic saliency component based on the correlations for each time and each pixel for each of the plurality of feature types.
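The correlation step can be pictured with a minimal sketch. It assumes that the correlation is a Pearson correlation taken over a window of frames, independently for each pixel of one feature type; this passage does not specify the correlation measure or the window, so the function name `pixelwise_correlation`, the array shapes, and the toy data are all illustrative assumptions.

```python
import numpy as np

def pixelwise_correlation(basic_saliency, audio_saliency, eps=1e-12):
    """Pearson correlation over time between each pixel of a basic
    saliency image sequence (T, H, W) and an acoustic saliency signal (T,).

    Assumed measure; the text only states that a correlation is calculated.
    """
    v = basic_saliency - basic_saliency.mean(axis=0, keepdims=True)
    a = audio_saliency - audio_saliency.mean()
    cov = np.einsum("t,thw->hw", a, v)                 # per-pixel covariance
    norm = np.sqrt((v ** 2).sum(axis=0) * (a ** 2).sum())
    return cov / (norm + eps)

# Toy data: one pixel follows the acoustic saliency, the others are noise.
rng = np.random.default_rng(0)
audio = rng.random(50)
video = rng.random((50, 4, 4))
video[:, 1, 2] = 2.0 * audio + 0.1
corr = pixelwise_correlation(video, audio)
```

In the toy data the pixel at row 1, column 2 tracks the acoustic saliency and therefore receives the highest correlation. A map of this kind is the sort of quantity from which a main image basic saliency component could be derived.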

The image saliency image calculation unit generates a saliency image indicating the saliency at each position of the input image of the frame at each time, based on the set of basic saliency images for the frame at each time and the main image basic saliency component.

In this way, a saliency image indicating the saliency at each position of the input image of the frame at each time can be generated using the input images of the frames at each time that constitute the input video and the acoustic signal that constitutes the input video.

A program according to the present invention is a program for causing a computer to function as each unit of the above saliency image generating apparatus.

As described above, the saliency image generating apparatus, method, and program of the present invention provide the effect that a saliency image indicating the saliency at each position of the input image of the frame at each time can be generated using the input images of the frames at each time that constitute the input video and the acoustic signal that constitutes the input video.

Schematic diagram showing the configuration of the saliency image generating apparatus according to the first embodiment of the present invention.
Diagram mainly showing the configuration of the image basic saliency image calculation unit 1.
Diagram mainly showing the configuration of the acoustic saliency signal calculation unit 2.
Diagram showing the flow of data from the image basic saliency image extraction unit 15 and the acoustic saliency signal extraction unit 22.
Flowchart showing the saliency image generation processing program according to the first embodiment of the present invention.
Flowchart showing the image basic saliency image calculation processing program of step 1S in FIG. 5.
Flowchart showing the image basic feature image calculation processing program of step 11S in FIG. 6.
Flowchart showing the acoustic saliency signal calculation processing program of step 2S in FIG. 5.
Schematic diagram showing the configuration of the gaze position estimation apparatus according to the second embodiment of the present invention.
Flowchart showing the gaze position estimation processing program according to the second embodiment of the present invention.
Diagram showing an overview of the evaluation results for video 1E.
Diagram showing the frame-by-frame evaluation results for video 1E.
Diagram showing an overview of the evaluation results for video 2E.
Diagram showing the frame-by-frame evaluation results for video 2E.
Diagram showing an overview of the evaluation results for video 3E.
Diagram showing the frame-by-frame evaluation results for video 3E.

Embodiments of the present invention are described in detail below with reference to the drawings.

[Overview]
The present invention has been made in view of the above situation, and realizes gaze position estimation that solves the above problem through the following two points.
1. Video segments in which the saliency calculated from the acoustic signal is high are detected, and the main image signal components in those segments are selected. This makes it possible to selectively extract image signal components that correlate strongly with salient acoustic signals.
2. When saliency is calculated from the image signal, the image signal components selected in point 1 are emphasized. This makes it possible to calculate the visual saliency attributable to the acoustic signal.

[First embodiment]

A saliency image generating apparatus according to the first embodiment of the present invention is described below with reference to the drawings. The apparatus is implemented on a computer comprising a CPU, a RAM, and a ROM that stores a program, and is functionally configured as follows. FIG. 1 shows an outline of its configuration. As shown in FIG. 1, the saliency image generating apparatus according to the first embodiment comprises an input unit 10, an image basic saliency image calculation unit 1, an acoustic saliency signal calculation unit 2, an image basic saliency selection unit 3, an image saliency image calculation unit 4, and a saliency video calculation unit 5. The apparatus receives an input video and outputs a saliency video, that is, a video showing the saliency at each position within the frames of the input video.

FIG. 2 mainly shows the configuration of the image basic saliency image calculation unit 1. As shown in FIG. 2, the image basic saliency image calculation unit 1 calculates several basic saliency images from an input image, which is one frame of the input video supplied by the input unit 10, and outputs the set of these basic saliency images. A basic saliency image is an image that expresses, for each pixel of the input image, the degree to which that pixel has a salient characteristic.

The method of calculating the basic saliency images is not particularly limited; this embodiment adopts the method described in Non-Patent Documents 1 and 2. As shown in FIG. 2, the image basic saliency image calculation unit 1 following this method comprises an image basic feature image extraction unit 11, an image multi-resolution image extraction unit 12, an image resolution difference image extraction unit 13, an image time difference image extraction unit 14, and an image basic saliency image extraction unit 15.

The image basic feature image extraction unit 11 uses a plurality of feature extraction methods to extract, from the input image, image basic feature images that each express a characteristic component of each pixel of the input image, and outputs the set consisting of the basic feature image for each feature extraction method to the image multi-resolution image extraction unit 12. The method of extracting the image basic feature images is not particularly limited; in this embodiment, as shown in FIG. 2, the unit comprises a luminance feature image extraction unit 111, a color feature image extraction unit 112, a direction feature image extraction unit 113, a blinking feature image extraction unit 114, and a motion feature image extraction unit 115. As described in detail later, the feature extraction methods of the image basic feature image extraction unit 11 are related to the method described in Patent Document 3 (Japanese Patent Laid-Open No. 2009-003615).

FIG. 3 mainly shows the configuration of the acoustic saliency signal calculation unit 2. The acoustic saliency signal calculation unit 2 calculates an acoustic saliency signal from the input acoustic signal, which is the acoustic signal constituting the input video supplied by the input unit 10, and outputs this acoustic saliency signal to the image basic saliency selection unit 3 and the image saliency image calculation unit 4. The acoustic saliency signal is a signal that expresses, for each time, the degree to which the input acoustic signal has a salient characteristic. The method of calculating the acoustic saliency signal is not particularly limited; this embodiment adopts the method described in Non-Patent Document 5, which applies the Bayesian surprise model to acoustic signals. As shown in FIG. 3, the acoustic saliency signal calculation unit 2 following this method comprises an acoustic basic feature extraction unit 21 and an acoustic saliency signal extraction unit 22.

(Non-Patent Document 5) Schauerte and Stiefelhagen, "Wow! Bayesian surprise for salient acoustic event detection," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2013), pp.6402-6406, 2013.

FIG. 4 shows the flow of data from the image basic saliency image extraction unit 15 and the acoustic saliency signal extraction unit 22. As shown in FIG. 4, the data from the image basic saliency image extraction unit 15 and the data from the acoustic saliency signal extraction unit 22 are each input to both the image basic saliency selection unit 3 and the image saliency image calculation unit 4.

Next, the operation of the saliency image generating apparatus according to the first embodiment of the present invention is described.

FIG. 5 is a flowchart of the saliency image generation processing program according to the first embodiment of the present invention. When the saliency image generation processing program starts, the image basic saliency image calculation unit 1 executes the image basic saliency image calculation processing in step 1S. FIG. 6 is a flowchart of the image basic saliency image calculation processing program of step 1S in FIG. 5. As shown in FIG. 6, in step 11S the image basic feature image extraction unit 11 executes the image basic feature image extraction processing. FIG. 7 is a flowchart of the image basic feature image calculation processing program of step 11S in FIG. 6.

As shown in FIG. 7, in step 111S the luminance feature image extraction unit 111 executes the luminance feature image extraction processing. Specifically, the luminance feature image extraction unit 111 outputs a luminance feature image expressing the luminance component of the t-th input image (the t-th frame of the input video) supplied by the input unit 10. It obtains the luminance feature image i(t) as the average of the red (R), green (G), and blue (B) components of the input image:

$$ i(t) = \frac{r(t) + g(t) + b(t)}{3} $$

Here, r(t), g(t), and b(t) are the red (R), green (G), and blue (B) component images of the t-th input image (the t-th frame of the input video), and each pixel value is expressed as a real number between 0 and 1 inclusive. In another embodiment, each pixel value i(t)_x of the luminance feature image can instead be extracted by one of the following expressions.

Here, r(t)_x is the pixel value of the image r(t) at pixel position x.
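The averaging that defines the luminance feature image i(t) can be sketched in a few lines. The frame layout (height, width, 3) with values in [0, 1] and the function name `luminance_feature` are assumptions of this example.

```python
import numpy as np

def luminance_feature(frame_rgb):
    """Luminance feature image i(t): per-pixel mean of the R, G and B
    components of an (H, W, 3) frame with values in [0, 1]."""
    return frame_rgb.mean(axis=2)

frame = np.zeros((2, 2, 3))
frame[0, 0] = [0.9, 0.6, 0.3]   # a single colored pixel
i_t = luminance_feature(frame)
```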

In step 112S of FIG. 7, the color feature image extraction unit 112 executes the color feature image extraction processing. The details are as follows.

The color feature image extraction unit 112 outputs color feature images expressing the color components of each pixel of the t-th input image supplied by the input unit 10. That is, it extracts the color feature images R(t), G(t), B(t), and Y(t), corresponding to red (R), green (G), blue (B), and yellow (Y) respectively, from the following pixel values R(t)_x, G(t)_x, B(t)_x, and Y(t)_x. Here, for example, R(t)_x is the pixel value of the image R(t) at position x.
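The pixel-value expressions themselves are not reproduced in this text. Purely as an illustration, the sketch below assumes the broadly tuned color channels commonly used in Itti-Koch-style saliency models, with negative values clipped to zero; these formulas and the function name `color_features` are assumptions of the example, not a statement of the expressions used in this embodiment.

```python
import numpy as np

def color_features(frame_rgb):
    """Assumed broadly tuned color channels R, G, B, Y for an (H, W, 3)
    frame with values in [0, 1]; negative responses are set to zero."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    R = np.maximum(r - (g + b) / 2.0, 0.0)
    G = np.maximum(g - (r + b) / 2.0, 0.0)
    B = np.maximum(b - (r + g) / 2.0, 0.0)
    Y = np.maximum((r + g) / 2.0 - np.abs(r - g) / 2.0 - b, 0.0)
    return R, G, B, Y

pixels = np.array([[[1.0, 0.0, 0.0],    # pure red
                    [1.0, 1.0, 0.0],    # pure yellow
                    [1.0, 1.0, 1.0]]])  # white
R, G, B, Y = color_features(pixels)
```

Under these assumed formulas a pure red pixel responds only in R, a pure yellow pixel responds most strongly in Y, and a white pixel produces no response in any channel.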

In step 113S of FIG. 7, the direction feature image extraction unit 113 executes the direction feature image extraction processing. The details are as follows.

The direction feature image extraction unit 113 outputs direction feature images expressing the direction components of each pixel of the t-th input image supplied by the input unit 10. The direction feature image O_φ(t) is obtained by applying a Gabor filter g_φ with rotation angle φ to the luminance feature image i(t) calculated from the current input image:

$$ O_\phi(t) = g_\phi * i(t) $$

Here, * is the operator denoting convolution. The direction feature image O_φ(t) is extracted for n_φ different rotation angles. The rotation angles φ are chosen, for example, so that they divide π = 180° evenly into n_φ parts.
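A minimal sketch of this step follows: build a Gabor kernel for each of n_φ angles that evenly divide π and convolve it with the luminance feature image. The kernel size, envelope width, wavelength, zero-mean correction, and all function names are assumptions chosen for the example; the text does not specify the Gabor parameters.

```python
import numpy as np

def gabor_kernel(phi, size=9, sigma=2.0, wavelength=4.0):
    """Real-valued Gabor kernel g_phi whose carrier varies along angle phi."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(phi) + y * np.sin(phi)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * x_rot / wavelength)
    return kernel - kernel.mean()   # no response on flat regions

def convolve2d_same(image, kernel):
    """Plain 2-D convolution with zero padding; output has the image size."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    flipped = kernel[::-1, ::-1]
    out = np.zeros_like(image, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + image.shape[0],
                                            dx:dx + image.shape[1]]
    return out

def direction_features(luminance, n_phi=4):
    """O_phi(t) = g_phi * i(t) for n_phi angles evenly dividing pi."""
    angles = [k * np.pi / n_phi for k in range(n_phi)]
    return {phi: convolve2d_same(luminance, gabor_kernel(phi)) for phi in angles}

# Vertical stripes with period 4 vary along x, so phi = 0 should respond most.
xs = np.arange(32)
stripes = np.tile(0.5 + 0.5 * np.cos(2.0 * np.pi * xs / 4.0), (32, 1))
responses = direction_features(stripes)
energy = {phi: float(np.abs(o).mean()) for phi, o in responses.items()}
```

Comparing the mean absolute response per angle in `energy` shows which direction dominates the test pattern.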

In step 114S of FIG. 7, the blinking feature image extraction unit 114 executes the blinking feature image extraction processing. The details are as follows. The blinking feature image extraction unit 114 outputs a blinking feature image expressing the blinking component of each pixel of the input image supplied by the input unit 10. The blinking feature image F(t) is obtained as follows from the luminance feature images i(t), ..., i(t−n_F) calculated from the current input image and several preceding input images.

Here, n_F is the number of past luminance feature images referred to when extracting the blinking feature image. When n_F = 1, the method coincides with the method described in Non-Patent Document 4.
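The expression for F(t) is not reproduced in this text. As an illustration of the n_F = 1 case only, the sketch below assumes that the blinking (flicker) feature is the absolute difference between the current and the previous luminance feature image; this form and the function name `blinking_feature` are assumptions of the example, and the general n_F case is not covered.

```python
import numpy as np

def blinking_feature(lum_now, lum_prev):
    """Blinking feature F(t) for n_F = 1, assumed to be |i(t) - i(t-1)|."""
    return np.abs(lum_now - lum_prev)

prev = np.zeros((3, 3))
now = np.zeros((3, 3))
now[1, 1] = 0.8          # one pixel lights up between frames
F = blinking_feature(now, prev)
```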

(Non-Patent Document 4) Itti, Dhavale and Pighin, "Realistic avatar eye and head animation using a neurobiological model of visual attention," Proc. SPIE International Symposium on Optical Science and Technology, pp.64-78, 2003.

In step 115S of FIG. 7, the motion feature image extraction unit 115 executes the motion feature image extraction processing. The details are as follows.

The motion feature image extraction unit 115 outputs motion feature images expressing the motion components of each pixel of the input image supplied by the input unit 10. The method of extracting the motion feature images is not particularly limited; in this embodiment, they are extracted by obtaining the optical flow at each point of the luminance feature images i(t) and i(t−1), which are calculated from the current input image and the input image one time step earlier (the time corresponding to the preceding frame). The method of extracting the optical flow is not particularly limited either; for example, the image-gradient-based method generally called the Lucas-Kanade method can be used. With this method, the motion feature images M_x(t) and M_y(t), corresponding to the horizontal and vertical components of the motion respectively, are extracted (see Patent Document 3 for the detailed extraction method).
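The gradient-based idea can be sketched as a dense, single-scale Lucas-Kanade estimate computed with NumPy. The window radius, the threshold `eps`, the use of frame-averaged spatial gradients, and the function names are assumptions of this example; Patent Document 3 describes the extraction method actually referred to.

```python
import numpy as np

def box_sum(a, radius):
    """Sum of `a` over a (2*radius+1)^2 window around each pixel (zero padded)."""
    padded = np.pad(a, radius)
    out = np.zeros_like(a, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def lucas_kanade(lum_prev, lum_now, radius=2, eps=1e-6):
    """Dense Lucas-Kanade flow; returns (M_x, M_y), the horizontal and
    vertical motion feature images."""
    avg = 0.5 * (lum_prev + lum_now)
    Iy, Ix = np.gradient(avg)            # derivatives along rows, columns
    It = lum_now - lum_prev              # temporal derivative
    Sxx, Syy = box_sum(Ix * Ix, radius), box_sum(Iy * Iy, radius)
    Sxy = box_sum(Ix * Iy, radius)
    Sxt, Syt = box_sum(Ix * It, radius), box_sum(Iy * It, radius)
    det = Sxx * Syy - Sxy ** 2
    ok = det > eps                       # skip ill-conditioned windows
    safe = np.where(ok, det, 1.0)
    # Solve [[Sxx, Sxy], [Sxy, Syy]] [u, v]^T = -[Sxt, Syt]^T per pixel.
    M_x = np.where(ok, (-Syy * Sxt + Sxy * Syt) / safe, 0.0)
    M_y = np.where(ok, (Sxy * Sxt - Sxx * Syt) / safe, 0.0)
    return M_x, M_y

# A smooth blob that moves half a pixel to the right between two frames.
yy, xx = np.mgrid[0:32, 0:32].astype(float)

def blob(cx):
    return np.exp(-((xx - cx) ** 2 + (yy - 16.0) ** 2) / (2 * 4.0 ** 2))

M_x, M_y = lucas_kanade(blob(15.75), blob(16.25))
```

Near the blob the estimated horizontal component is close to the true half-pixel shift and the vertical component is close to zero. Larger displacements would need a coarse-to-fine scheme, which this sketch omits.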

Another example is the method described in Non-Patent Document 4. Let S_φ(t) be the image obtained by shifting the direction feature image O_φ(t), calculated from the current input image, by one pixel in the direction perpendicular to the rotation angle φ. The motion feature image M_φ(t) is then calculated as follows, using the direction feature images O_φ(t) and O_φ(t−1) calculated from the current input image and the input image one time step earlier.

Here, the operator × denotes the pixel-wise product. In this embodiment, a motion feature image M_φ(t) is extracted for each of the n_φ rotation angles.

As shown in FIG. 2, the image basic feature image extraction unit 11 treats the luminance feature image, the color feature images, the direction feature images, the blinking feature image, and the motion feature images each as image basic feature images, and outputs the set of these image basic feature images to the image multi-resolution image extraction unit 12.

Once the set of image basic feature images has been output to the image multi-resolution image extraction unit 12, the motion feature image extraction processing of step 115S in FIG. 7 ends, and the processing proceeds to step 12S in FIG. 6.

In the above, the direction feature image extraction unit 113, the blinking feature image extraction unit 114, and the motion feature image extraction unit 115 receive the luminance feature image from the luminance feature image extraction unit 111. Alternatively, without receiving the luminance feature image from the luminance feature image extraction unit 111, each of these three units may itself execute the same processing as the luminance feature image extraction unit 111 to obtain the luminance feature image.

In step 12S of FIG. 6, the image multi-resolution image extraction unit 12 executes the image multi-resolution image extraction processing. The details are as follows.

For each image basic feature image in the set of image basic feature images input as described above, the image multi-resolution image extraction unit 12 extracts multi-resolution images, which are the multi-resolution representation of that feature image, and outputs the set of these multi-resolution images.

本実施形態において、いずれの基礎特徴画像についても同様の処理を行うため、以下、輝度特徴画像を例に取って、処理を説明し、他の特徴画像の説明を省略する。 In the present embodiment, since the same processing is performed for any basic feature image, the processing will be described below taking the luminance feature image as an example, and description of other feature images will be omitted.

輝度特徴画像についての多重解像度表現である輝度多重解像度画像は、輝度特徴画像にガウシアンフィルタを作用させながら縮小させる操作を、解像度レベル毎に繰り返し行うことで抽出される。 The luminance multi-resolution image, which is the multi-resolution representation of the luminance feature image, is extracted by repeating, for each resolution level, an operation that shrinks the luminance feature image while applying a Gaussian filter to it.

ただし、Ｇ_σは分散σを持つガウシアンフィルタ、ｄｏｗｎ()はダウンサンプリングを行う関数、ｉ(ｔ,ｌ)は輝度特徴画像ｉ(ｔ)から抽出した第ｌレベルの輝度多重解像度画像、ｎ_lは多重解像度画像のレベル数である。第０レベルの輝度多重解像度画像は輝度特徴画像そのもの、すなわち、ｉ(ｔ,０)＝ｉ(ｔ)とする。 Here, G_σ is a Gaussian filter with variance σ, down() is a function that performs downsampling, i(t,l) is the l-th level luminance multi-resolution image extracted from the luminance feature image i(t), and n_l is the number of levels of the multi-resolution image. The 0th-level luminance multi-resolution image is the luminance feature image itself, that is, i(t,0) = i(t).
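The recurrence itself is not reproduced in this text. As a minimal sketch, assuming it has the usual pyramid form i(t,l) = down(G_σ * i(t,l-1)), the repeated blur-and-shrink operation could look like the following. The 5-tap binomial kernel (a stand-in for G_σ), the downsampling factor of 2, and the function names `blur` and `pyramid` are illustrative assumptions, not values taken from the patent.

```python
def blur(img, kernel=(1/16, 4/16, 6/16, 4/16, 1/16)):
    """Separable blur with edge clamping.

    The binomial kernel is an assumed approximation of the Gaussian filter G_sigma.
    """
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    # Horizontal pass: convolve every row, clamping indices at the borders.
    tmp = [[sum(k * row[min(max(x + d - r, 0), w - 1)] for d, k in enumerate(kernel))
            for x in range(w)] for row in img]
    # Vertical pass over the horizontally blurred image.
    return [[sum(k * tmp[min(max(y + d - r, 0), h - 1)][x] for d, k in enumerate(kernel))
             for x in range(w)] for y in range(h)]


def down(img, factor=2):
    """Downsample by keeping every `factor`-th pixel (the factor is an assumption)."""
    return [row[::factor] for row in img[::factor]]


def pyramid(img, n_levels):
    """Return [i(t,0), ..., i(t,n_levels-1)]; level 0 is the feature image itself."""
    levels = [img]
    for _ in range(1, n_levels):
        levels.append(down(blur(levels[-1])))
    return levels
```

The same routine would be applied unchanged to each color, direction, blinking, and motion feature image.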

他の基礎特徴画像についても、同様の方法で多重解像度画像を抽出することができる。このとき、輝度多重解像度画像がｎ_l枚抽出されるのに対して、色多重解像度画像Ｒ(ｔ,ｌ)、Ｇ(ｔ,ｌ)、Ｂ(ｔ,ｌ)、Ｙ(ｔ,ｌ)は合計４ｎ_l枚、方向多重解像度画像Ｏ_φ（ｔ,ｌ)は合計ｎ_φｎ_l枚、点滅多重解像度画像Ｆ(ｔ,ｌ)はｎ_l枚、運動多重解像度画像Ｍ_x(ｔ,ｌ)、Ｍ_y(ｔ,ｌ)は合計２ｎ_l枚もしくはｎ_φｎ_l枚、それぞれ抽出される。 Multi-resolution images of the other basic feature images can be extracted in the same way. In this case, whereas n_l luminance multi-resolution images are extracted, a total of 4n_l color multi-resolution images R(t,l), G(t,l), B(t,l), Y(t,l), a total of n_φ n_l direction multi-resolution images O_φ(t,l), n_l blinking multi-resolution images F(t,l), and a total of 2n_l or n_φ n_l motion multi-resolution images M_x(t,l), M_y(t,l) are extracted.

上記の通り、画像多重解像度画像抽出部１２は、輝度多重解像度画像、色多重解像度画像、方向多重解像度画像、点滅多重解像度画像、運動多重解像度画像を、それぞれ多重解像度画像とし、これら多重解像度画像の集合を、画像解像度差分画像抽出部１３に出力する（図２参照）。 As described above, the image multi-resolution image extraction unit 12 sets the luminance multi-resolution image, the color multi-resolution image, the direction multi-resolution image, the blinking multi-resolution image, and the motion multi-resolution image as multi-resolution images, respectively. The set is output to the image resolution difference image extraction unit 13 (see FIG. 2).

図６のステップ１３Ｓで、画像解像度差分画像抽出部１３が、画像解像度差分画像抽出処理を実行する。詳細には次の通りである。 In step 13S of FIG. 6, the image resolution difference image extraction unit 13 executes image resolution difference image extraction processing. Details are as follows.

画像解像度差分画像抽出部１３は、上記のように入力された多重解像度画像の各種類（輝度・色など）について、解像度レベルの異なる画像の間の差分画像である解像度差分画像を抽出し、これら解像度差分画像の集合を出力する。 The image resolution difference image extraction unit 13 extracts a resolution difference image that is a difference image between images having different resolution levels for each type (luminance, color, etc.) of the multi-resolution image input as described above. A set of resolution difference images is output.

解像度差分画像の抽出方法は特に限定されるものではないが、本実施形態においては、以下のようにして各種類の解像度差分画像を抽出する。 The method of extracting the resolution difference image is not particularly limited, but in the present embodiment, each type of resolution difference image is extracted as follows.

ただし、up()はアップサンプリングを行う関数、Ｌ_c、Ｌ_sは解像度差分画像を抽出する際に考慮する解像度レベルの集合であり、それぞれ中心解像度レベル集合、周辺解像度レベル集合と呼ぶ。また、ＲＳ_I(t;ｌ_c,ｌ_s)は第ｌ_ｃレベルと第ｌ_ｓレベルの輝度多重解像度画像の差分から得られる輝度解像度差分画像であり、以降、(ｌ_c,ｌ_s)レベル輝度解像度差分画像と呼ぶことにする。同様にして、ＲS_RG(t;ｌ_c,ｌ_s)及びＲＳ_BY(t;ｌ_c,ｌ_s)を(ｌ_c,ｌ_s)レベル色解像度差分画像、ＲＳ_Ｏφ（t;ｌ_c,ｌ_s)を(ｌ_c,ｌ_s)レベル方向解像度差分画像、ＲＳ_F(t;ｌ_c,ｌ_s)を(ｌ_c,ｌ_s)レベル点滅解像度差分画像、ＲＳ_Mk(t;ｌ_c,ｌ_s)を(ｌ_c,ｌ_s)レベル運動解像度差分画像と、それぞれ呼ぶ。 Here, up() is a function that performs upsampling, and L_c and L_s are the sets of resolution levels considered when extracting resolution difference images, called the central resolution level set and the peripheral resolution level set, respectively. RS_I(t; l_c, l_s) is the luminance resolution difference image obtained from the difference between the l_c-th level and l_s-th level luminance multi-resolution images, and is hereinafter called the (l_c, l_s)-level luminance resolution difference image. Similarly, RS_RG(t; l_c, l_s) and RS_BY(t; l_c, l_s) are called the (l_c, l_s)-level color resolution difference images, RS_Oφ(t; l_c, l_s) the (l_c, l_s)-level direction resolution difference image, RS_F(t; l_c, l_s) the (l_c, l_s)-level blinking resolution difference image, and RS_Mk(t; l_c, l_s) the (l_c, l_s)-level motion resolution difference image.
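The defining equations are not reproduced in this text. A minimal sketch of a center-surround difference is given below, assuming the difference is the pixelwise absolute difference between the center-level image and the surround-level image upsampled to the same size. The absolute value, the nearest-neighbour interpolation, the restriction to l_s > l_c, and all function names are illustrative assumptions.

```python
def up(img, height, width):
    """Nearest-neighbour upsampling to (height, width); the interpolation is an assumption."""
    h, w = len(img), len(img[0])
    return [[img[y * h // height][x * w // width] for x in range(width)]
            for y in range(height)]


def resolution_difference(center, surround):
    """Assumed form of RS(t; l_c, l_s): |center - up(surround)| per pixel."""
    h, w = len(center), len(center[0])
    s = up(surround, h, w)
    return [[abs(center[y][x] - s[y][x]) for x in range(w)] for y in range(h)]


def all_differences(levels, L_c, L_s):
    """One difference image per (l_c, l_s) pair with a coarser surround level."""
    return {(lc, ls): resolution_difference(levels[lc], levels[ls])
            for lc in L_c for ls in L_s if ls > lc}
```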

上記の通り、画像解像度差分画像抽出部１３は、輝度解像度差分画像、色解像度差分画像、方向解像度差分画像、点滅解像度差分画像、及び運動解像度差分画像をそれぞれ解像度差分画像とし、これら解像度差分画像の集合を、画像時間差分画像抽出部１４に出力する（図２参照）。 As described above, the image resolution difference image extraction unit 13 sets the luminance resolution difference image, the color resolution difference image, the direction resolution difference image, the blinking resolution difference image, and the motion resolution difference image as the resolution difference images, respectively. The set is output to the image time difference image extraction unit 14 (see FIG. 2).

図６のステップ１４Ｓで、画像時間差分画像抽出部１４が、画像時間差分画像抽出処理を実行する。詳細には次の通りである。 In step 14S of FIG. 6, the image time difference image extraction unit 14 executes image time difference image extraction processing. Details are as follows.

画像時間差分画像抽出部１４は、入力された解像度差分画像の集合の各解像度差分画像について、当該解像度差分画像の時間的遷移を記録する時間差分画像を抽出し、これら時間差分画像の集合を出力する。 The image time difference image extraction unit 14 extracts, for each resolution difference image in the set of input resolution difference images, a time difference image that records a temporal transition of the resolution difference image, and outputs the set of time difference images. To do.

時間差分画像の抽出方法は特に限定されるものではないが、本実施形態においては、解像度差分画像の各画素値がポアソン分布に従うことを仮定した非特許文献１及び２の方法を用いる。 The extraction method of the time difference image is not particularly limited, but in the present embodiment, the methods of Non-Patent Documents 1 and 2 assuming that each pixel value of the resolution difference image follows a Poisson distribution are used.

本実施形態においては、いずれの解像度差分画像についても同様の処理を行うため、以下、輝度解像度差分画像を例に取って、処理を説明し、他の解像度差分画像に対する処理の説明を省略する。まず、輝度解像度差分画像ＲＳ_I(ｔ;ｌ_c,ｌ_s)の画素位置ｘの画素値λ_I(ｔ,ｘ)が以下のガンマ分布に従うと仮定する。 In the present embodiment, since the same processing is performed for any resolution difference image, the processing will be described below by taking the luminance resolution difference image as an example, and description of processing for other resolution difference images will be omitted. First, it is assumed that the pixel value λ _I (t, x) at the pixel position x of the luminance resolution difference image RS _I (t; l _c , l _s ) follows the following gamma distribution.

ただし、Γ()はガンマ関数、α、βはガンマ分布のパラメータである。また、解像度レベルを示すインデックスｌ_c,ｌ_sは簡単のため省略する。本実施形態では、ガンマ分布のパラメータα、βを画像の各画素位置ｘで保持し、これを輝度時間差分画像の各画素α_I(t,x)、β_I(t,x)とする。このとき、輝度時間差分画像の画素位置ｘの画素値α_I(t,x)、β_I(t,x)は、１時刻前の分布 Here, Γ() is the gamma function, and α and β are the parameters of the gamma distribution. The indices l_c and l_s indicating the resolution levels are omitted for simplicity. In this embodiment, the gamma distribution parameters α and β are held at each pixel position x of the image and serve as the pixel values α_I(t,x) and β_I(t,x) of the luminance time difference image. The pixel values α_I(t,x) and β_I(t,x) at pixel position x of the luminance time difference image are obtained by taking the distribution one time step earlier

を事前分布、現在の輝度解像度差分画像の画素位置ｘの画素値λ_Ｉ（ｔ，ｘ）を観測としたときの事後分布 as the prior distribution and the pixel value λ_I(t,x) at pixel position x of the current luminance resolution difference image as the observation, as the parameters of the resulting posterior distribution

のパラメータとして得ることができ、ベイズ則から以下のように求められる。 and are derived from Bayes' rule as follows.

また、時間スケールを考慮した別の実施形態も可能である。この実施形態では、輝度時間差分画像の画素位置ｘの画素値を以下のようにして求める。 In addition, another embodiment considering a time scale is possible. In this embodiment, the pixel value at the pixel position x of the luminance time difference image is obtained as follows.

ただし、ξは忘却係数、ｎ_dは時間差分画像のレベル数である。他の（色・方向・点滅・運動）時間差分画像についても同様にして抽出できる。 Where ξ is a forgetting factor and n _d is the number of levels of the time difference image. Other (color / direction / flashing / motion) time difference images can be extracted in the same manner.
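The update equations are not reproduced in this text. As a minimal sketch, assuming the standard conjugate update for a gamma prior on a Poisson rate (α becomes α + λ, β becomes β + 1) and assuming the forgetting factor ξ simply discounts the previous parameters, the per-pixel update could be written as follows. The function names and the way ξ enters are illustrative assumptions; the n_d time-scale levels are not modelled here.

```python
def gamma_update(alpha, beta, observation, xi=1.0):
    """Posterior gamma parameters after one observation.

    xi = 1.0 gives the plain conjugate Gamma-Poisson update; xi < 1.0 discounts
    the previous parameters (assumed role of the forgetting factor).
    """
    return xi * alpha + observation, xi * beta + 1.0


def update_image(alpha_img, beta_img, rs_img, xi=1.0):
    """Apply gamma_update at every pixel position x of a resolution difference image."""
    new_alpha, new_beta = [], []
    for a_row, b_row, o_row in zip(alpha_img, beta_img, rs_img):
        pairs = [gamma_update(a, b, o, xi) for a, b, o in zip(a_row, b_row, o_row)]
        new_alpha.append([p[0] for p in pairs])
        new_beta.append([p[1] for p in pairs])
    return new_alpha, new_beta
```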

上記の通り、画像時間差分画像抽出部１４は、輝度時間差分画像、色時間差分画像、方向時間差分画像、点滅時間差分画像、及び運動時間差分画像をそれぞれ時間差分画像として、これら時間差分画像の集合を、画像基礎顕著度画像抽出部１５に出力する（図２参照）。 As described above, the image time difference image extraction unit 14 uses the luminance time difference image, the color time difference image, the direction time difference image, the blinking time difference image, and the exercise time difference image as time difference images, respectively. The set is output to the image basic saliency image extraction unit 15 (see FIG. 2).

図６のステップ１５Ｓで、画像基礎顕著度画像抽出部１５が、画像基礎顕著度画像抽出処理を実行する。詳細には次の通りである。 In step 15S of FIG. 6, the image basic saliency image extraction unit 15 executes image basic saliency image extraction processing. Details are as follows.

画像基礎顕著度画像抽出部１５は、上記のように入力された時間差分画像の集合の各時間差分画像について、当該時間差分画像の時間的・空間的特異性に基づいて基礎顕著度画像を抽出し、これら基礎顕著度画像の集合を出力する。 The image basic saliency image extraction unit 15 extracts a basic saliency image for each time difference image of the set of time difference images input as described above based on the temporal and spatial specificity of the time difference image. Then, a set of these basic saliency images is output.

基礎顕著度画像の抽出方法は特に限定されるものではないが、本実施形態においては、非特許文献１及び２に記載のBayesian surpriseモデルに従う。このBayesian surpriseモデルでは、事前分布（１時点前の事後分布）と事後分布のKullback-Leibler divergenceで基礎顕著度を算出する。具体的には、以下のように計算される。 The method for extracting the basic saliency image is not particularly limited, but this embodiment follows the Bayesian surprise model described in Non-Patent Documents 1 and 2. In this Bayesian surprise model, the basic saliency is computed as the Kullback-Leibler divergence between the prior distribution (the posterior distribution at the previous time step) and the posterior distribution. Specifically, it is calculated as follows.

本実施形態では、いずれの時間差分画像に対しても同様の処理を行うので、以降では輝度時間差分画像を例に記載する。本実施形態においては、同じ画素位置に着目して事前分布と事後分布のdivergenceを計算する時間方向の輝度基礎顕著度画像と、周辺の画素位置にも着目してdivergenceを計算する空間方向の輝度基礎顕著度画像とを、個別に計算して、後で統合する。まず、時間方向の輝度基礎顕著度画像の画素位置ｘの画素値を以下のように計算する。 In the present embodiment, since the same processing is performed for any time difference image, the luminance time difference image will be described as an example hereinafter. In the present embodiment, the luminance-basis saliency image in the time direction for calculating the divergence of the prior distribution and the posterior distribution by paying attention to the same pixel position, and the luminance in the spatial direction for calculating the divergence by paying attention also to surrounding pixel positions. The basic saliency images are calculated separately and integrated later. First, the pixel value at the pixel position x of the luminance basic saliency image in the time direction is calculated as follows.

ただし、Ψ(・)はディガンマ関数である。次に、空間方向の輝度基礎顕著度画像の画素位置ｘの画素値を以下のように算出する。 However, Ψ (·) is a digamma function. Next, the pixel value of the pixel position x of the luminance basic saliency image in the spatial direction is calculated as follows.
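The temporal saliency equation is not reproduced in this text. The sketch below shows the closed-form Kullback-Leibler divergence between two gamma distributions in the shape/rate parameterization, which is where the digamma function enters. Which distribution plays the role of the first argument (prior or posterior) is left to the caller and is an assumption here, as are the function names; the digamma approximation is a standard recurrence plus asymptotic series written out so that the example needs only the standard library.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:              # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))


def kl_gamma(alpha_p, beta_p, alpha_q, beta_q):
    """KL( Gamma(alpha_p, beta_p) || Gamma(alpha_q, beta_q) ), beta being the rate."""
    return ((alpha_p - alpha_q) * digamma(alpha_p)
            - math.lgamma(alpha_p) + math.lgamma(alpha_q)
            + alpha_q * (math.log(beta_p) - math.log(beta_q))
            + alpha_p * (beta_q - beta_p) / beta_p)
```

Evaluating `kl_gamma` at every pixel with the parameters before and after the update gives one temporal surprise value per pixel.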

ただし、ＤｏＧ()はDifference-of-Gaussian処理の関数である。最後に、次のように、時間方向の輝度基礎顕著度画像と空間方向の輝度基礎顕著度画像とを組み合わせて、最終的な輝度基礎顕著度画像を構成する。組み合わせる方法は特に限定されるものではないが、本実施形態においては、非特許文献１に記載の組合せをそのまま採用し、以下の式で計算する。 However, DoG () is a function of Difference-of-Gaussian processing. Finally, the final luminance basic saliency image is configured by combining the luminance basic saliency image in the time direction and the luminance basic saliency image in the spatial direction as follows. The combination method is not particularly limited, but in the present embodiment, the combination described in Non-Patent Document 1 is adopted as it is, and calculation is performed using the following formula.

他の（色・方向・点滅・運動）基礎顕著度画像についても同様にして抽出できる。 The other (color, direction, blinking, motion) basic saliency images can be extracted in the same manner.
上記の通り、画像基礎顕著度画像抽出部１５は、輝度基礎顕著度画像、色基礎顕著度画像、方向基礎顕著度画像、点滅基礎顕著度画像、及び運動基礎顕著度画像をそれぞれ基礎顕著度画像として、これら基礎顕著度画像の集合を、画像基礎顕著度選択部３及び画像顕著度画像算出部４に出力する（図２及び図４参照）。これにより、図６のステップ１５Ｓの画像基礎顕著度画像抽出処理が終了する。 As described above, the image basic saliency image extraction unit 15 treats the luminance basic saliency image, color basic saliency image, direction basic saliency image, blinking basic saliency image, and motion basic saliency image each as a basic saliency image, and outputs the set of these basic saliency images to the image basic saliency selection unit 3 and the image saliency image calculation unit 4 (see FIGS. 2 and 4). This completes the image basic saliency image extraction process in step 15S of FIG. 6.

ステップ１５Ｓの画像基礎顕著度画像抽出処理が終了すると、処理は、図５のステップ２Ｓに進む。ステップ２Ｓで、音響顕著度信号算出部２は、音響顕著度信号算出処理を実行する。上記のように、音響顕著度信号算出部２は、入力映像を構成する音響信号である入力音響信号が顕著な特性を持つ度合いを各時刻で表示した信号である音響顕著度信号を算出し、この音響顕著度信号を出力する。本実施形態では、上記のように、音響顕著度信号の算出方法として、Bayesian surpriseモデルを音響信号に適用した非特許文献５に記載の方法を採用する。 When the image basic saliency image extraction process in step 15S ends, the process proceeds to step 2S in FIG. In step 2S, the acoustic saliency signal calculation unit 2 executes an acoustic saliency signal calculation process. As described above, the acoustic saliency signal calculation unit 2 calculates the acoustic saliency signal, which is a signal indicating the degree to which the input acoustic signal, which is the acoustic signal constituting the input video, has remarkable characteristics at each time, This acoustic saliency signal is output. In the present embodiment, as described above, the method described in Non-Patent Document 5 in which the Bayesian surprise model is applied to the acoustic signal is adopted as the acoustic saliency signal calculation method.

図８には、非特許文献５に記載の方法に従った、図５のステップ２Ｓの音響顕著度信号算出処理プログラムを示すフローチャートが示されている。 FIG. 8 is a flowchart showing the acoustic saliency signal calculation processing program in step 2S of FIG. 5 according to the method described in Non-Patent Document 5.

図８のステップ２１Ｓで、音響基礎特徴量抽出部２１が、音響基礎特徴量抽出処理を実行する。詳細には次の通りである。 In step 21S of FIG. 8, the acoustic basic feature amount extraction unit 21 executes the acoustic basic feature amount extraction process. Details are as follows.

音響基礎特徴量抽出部２１は、入力音響信号の特性を表現する特徴量である音響基礎特徴量を抽出し、この音響基礎特徴量を出力する。 The acoustic basic feature quantity extraction unit 21 extracts an acoustic basic feature quantity that is a characteristic quantity expressing the characteristics of the input acoustic signal, and outputs the acoustic basic feature quantity.

音響基礎特徴量の抽出方法は特に限定されるものではないが、本実施形態においては、音響信号から時間周波数特性を算出する方法を採用する。すなわち、時刻ｔを中心とする前後窓幅ｔｗの幅を持って切り出された音響信号a(t)から、時間周波数変換を利用して各周波数ωについてスペクトログラムＦ(ｔ,ω)を抽出する。このとき、時間周波数変換として、短時間フーリエ変換 (ＳＴＦＴ)、離散コサイン変換 (ＤＣＴ)、短時間コサイン変換 (ＳＴＣＴ) などを用いることができる。 The method of extracting the acoustic basic feature amount is not particularly limited, but in the present embodiment, a method of calculating the time frequency characteristic from the acoustic signal is adopted. That is, the spectrogram F (t, ω) is extracted for each frequency ω from the acoustic signal a (t) cut out with the width of the front and rear window width tw centered on the time t by using time-frequency conversion. At this time, short-time Fourier transform (STFT), discrete cosine transform (DCT), short-time cosine transform (STCT), or the like can be used as the time-frequency transform.
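As a minimal sketch of the STFT option, the following computes a magnitude spectrogram with a direct discrete Fourier transform per frame. The Hann window, the hop size, the use of magnitudes, and the function name `spectrogram` are illustrative assumptions; a practical implementation would normally use an FFT routine instead of the direct sum.

```python
import cmath
import math


def spectrogram(signal, window_len, hop):
    """Return frames[t][k] = |F(t, omega_k)| for k = 0 .. window_len // 2."""
    # Periodic Hann window (assumed window shape).
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_len) for n in range(window_len)]
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        seg = [signal[start + n] * hann[n] for n in range(window_len)]
        # Direct DFT of the windowed segment, keeping non-negative frequencies only.
        frames.append([abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / window_len)
                               for n in range(window_len)))
                       for k in range(window_len // 2 + 1)])
    return frames
```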

図８のステップ２２Ｓで、音響顕著度信号抽出部２２が、音響顕著度信号抽出処理を実行する。詳細には次の通りである。 In step 22S of FIG. 8, the acoustic saliency signal extraction unit 22 performs an acoustic saliency signal extraction process. Details are as follows.

音響顕著度信号抽出部２２は、音響基礎特徴量を入力し、音響信号の中で各時刻について顕著な特性を持つ度合いを示した音響顕著度信号を抽出し、この音響顕著度信号を出力する。 The acoustic saliency signal extraction unit 22 receives the acoustic basic feature amount, extracts an acoustic saliency signal indicating the degree of remarkable characteristics at each time in the acoustic signal, and outputs the acoustic saliency signal. .

音響顕著度信号の抽出方法は特に限定されるものではないが、本実施形態においては、各時間周波数におけるスペクトログラムがガウス分布もしくはガンマ分布に従って生成されていると仮定したBayesian surpriseモデルを採用する。 The method for extracting the acoustic saliency signal is not particularly limited, but in this embodiment, a Bayesian surprise model is used that assumes that the spectrogram at each time frequency is generated according to a Gaussian distribution or a gamma distribution.

スペクトログラムがガウス分布に従って生成されると仮定した場合、時刻ｔ、周波数ωにおけるスペクトログラムＦ(ｔ,ω)の事前分布は、同周波数のスペクトログラムの履歴Ｆ(ｔ−1,ω),・・・,Ｆ(ｔ−Ｎ,ω)を用いて、以下のように表現される。 Assuming that the spectrogram is generated according to a Gaussian distribution, the prior distribution of the spectrogram F (t, ω) at time t and frequency ω is the spectrogram history F (t−1, ω),. It is expressed as follows using F (t−N, ω).

同様にして、同スペクトログラムの事後分布は、以下のように表現される。 Similarly, the posterior distribution of the spectrogram is expressed as follows.

このとき、時刻ｔ、周波数ωの音響顕著度信号Ｓ_A(t,ω)は、事前分布と事後分布のKullback-Leibler divergenceとして、以下のように算出される。 At this time, the acoustic saliency signal S _A (t, ω) at time t and frequency ω is calculated as Kullback-Leibler divergence of the prior distribution and the posterior distribution as follows.
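The prior, posterior, and divergence equations for the Gaussian case are not reproduced in this text. The sketch below uses the closed-form Kullback-Leibler divergence between two univariate Gaussians and, purely as an assumption, fits the prior to the history window and the posterior to the history plus the current value. The fitting rule, the direction of the divergence, the variance floor, and the function names are all illustrative and may differ from Non-Patent Document 5.

```python
import math


def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def fit(values, floor=1e-6):
    """Mean and (floored) variance of a list of spectrogram values."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, max(v, floor)


def acoustic_surprise(history, current):
    """history = [F(t-N, w), ..., F(t-1, w)] at one frequency; current = F(t, w)."""
    prior = fit(history)                    # assumed: prior fitted to the history
    posterior = fit(history + [current])    # assumed: posterior also sees F(t, w)
    return kl_gauss(*posterior, *prior)
```

A value that fits the recent history yields a small surprise, while an outlier yields a large one; averaging the result over all frequencies gives a single value per time step.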

一方、スペクトログラムがガンマ分布に従って生成されると仮定した場合、時刻ｔ、周波数ωにおけるスペクトログラムＦ(ｔ,ω)の事前分布・事後分布は、それぞれ以下のように算出される。 On the other hand, assuming that the spectrogram is generated according to the gamma distribution, the prior distribution and posterior distribution of the spectrogram F (t, ω) at time t and frequency ω are calculated as follows.

このとき、時刻t、周波数ωの音響顕著度信号Ｓ_A(ｔ,ω)は、事前分布と事後分布のKullback-Leibler divergenceとして、以下のように算出される。 At this time, the acoustic saliency signal S _A (t, ω) at time t and frequency ω is calculated as Kullback-Leibler divergence of the prior distribution and the posterior distribution as follows.

最後に、時刻ｔの音響顕著度信号Ｓ_A(t)を、全周波数ωの音響顕著度信号Ｓ_Ａ（ｔ,ω）の平均として算出する。 Finally, the acoustic saliency signal S _A (t) at time t is calculated as the average of the acoustic saliency signals S _A (t, ω) at all frequencies ω.

上記の通り、音響顕著度信号抽出部２２は、音響顕著度信号を算出し、音響顕著度信号を、画像基礎顕著度選択部３及び画像顕著度画像算出部４に出力する（図４参照）。これにより、図５のステップ２２Ｓの音響顕著度信号抽出処理が終了する。 As described above, the acoustic saliency signal extraction unit 22 calculates the acoustic saliency signal, and outputs the acoustic saliency signal to the image basic saliency selection unit 3 and the image saliency image calculation unit 4 (see FIG. 4). . Thereby, the acoustic saliency signal extraction process in step 22S of FIG. 5 ends.

ステップ２２Ｓの音響顕著度信号抽出処理が終了すると、図５のステップ２Ｓが終了する。なお、ステップ１Ｓの画像基礎顕著度画像算出処理とステップ２Ｓの音響顕著度信号算出処理の順番はこれに限定されず、ステップ２Ｓの処理の後にステップ１Ｓの処理が実行されてもよく、同時に実行されてもよい。 When the acoustic saliency signal extraction process in step 22S ends, step 2S in FIG. 5 ends. Note that the order of the image basic saliency image calculation process of step 1S and the acoustic saliency signal calculation process of step 2S is not limited to this, and the process of step 1S may be executed after the process of step 2S, or executed simultaneously. May be.

上記例（図５）では、ステップ２Ｓが終了すると、処理は、図５のステップ３Ｓに進む。ステップ３Ｓで、画像基礎顕著度選択部３が、画像基礎顕著度選択処理を実行する。詳細には次の通りである。 In the above example (FIG. 5), when step 2S ends, the process proceeds to step 3S in FIG. In step 3S, the image basic saliency selection unit 3 executes an image basic saliency selection process. Details are as follows.

画像基礎顕著度選択部３は、上記のように入力された画像基礎顕著度画像の集合及び音響顕著度信号に基づいて、音響顕著度が大きな時間区間における主要な画像基礎顕著度成分を選択もしくは強調し、これを主要画像基礎顕著度成分として出力する。 The image basic saliency selection unit 3 selects a main image basic saliency component in a time interval with a large acoustic saliency based on the set of image basic saliency images and the acoustic saliency signal input as described above. Emphasize and output this as a main image basic saliency component.

画像基礎顕著度成分の選択方法は特に限定されるものではないが、本実施形態においては、音響顕著度信号と画像基礎顕著度画像の画素値との単純な相関に基づく方法を採用する。 The method for selecting the image basic saliency component is not particularly limited, but in the present embodiment, a method based on a simple correlation between the acoustic saliency signal and the pixel value of the image basic saliency image is adopted.

以降、表記を簡略化するために、時刻ｔにおける基礎顕著度画像各々にインデックスを割り当て、インデックスｊを用いて Hereinafter, to simplify the notation, an index is assigned to each basic saliency image at time t, and each image is written using the index j as

と表記する。すなわち、インデックスｊによって、基礎顕著度画像の種別（輝度・色など）や時間スケールの違いをまとめて表現する。 That is, the index j collectively represents the differences in basic saliency image type (luminance, color, etc.) and time scale.

まず、各時刻ｔについて、画素位置ｘごとに、音響顕著度信号Ｓ_A(t)と各画像基礎顕著度画像 First, for each time t and each pixel position x, the correlation between the acoustic saliency signal S_A(t) and each image basic saliency image

との相関を、以下のように計算する。 is calculated as follows.

ただし、ｈ(ｎ,ｔ)は幅Ｎ_w(t)を持つ時刻ｔの時間窓である。時間窓は、矩形窓、ハニング窓、ハミング窓など、任意の時間窓を利用できる。時間窓の幅は、全ての時刻ｔで共通の値を用いる方法、音響顕著度信号によって変動させる方法、などが考えられる。音響顕著度信号によって時間窓の幅を制御する方法として、以下のような方法が考えられる。音響顕著度信号が閾値θ_sを上回る連続時間区間をＴ_S,i(ｉ＝１,２,)とすると、時刻ｔにおける窓幅Ｎ_w(t)は以下のように決定する。 Here, h(n,t) is a time window at time t with width N_w(t). Any time window, such as a rectangular, Hanning, or Hamming window, can be used. The window width may be a common value for all times t, or may be varied according to the acoustic saliency signal. One way to control the window width by the acoustic saliency signal is as follows. Let T_{S,i} (i = 1, 2, ...) denote the continuous time intervals in which the acoustic saliency signal exceeds the threshold θ_s; the window width N_w(t) at time t is then determined as follows.

ただし、ｗ_a1＞０、ｗ_b1＞０は予め定められた整数であり、ｗ_b2はＮ_w(t)が奇数になるように１もしくは２に設定される。上記の定義により、音響顕著度信号Ｓ_A(t)が閾値θ_sを上回る時刻ｔにおいてのみ時間窓が設定され、その幅は音響顕著度信号が閾値を上回る連続時間区間の長さに比例して長くなる。各時刻ｔにおいて、相関値 Here, w_a1 > 0 and w_b1 > 0 are predetermined integers, and w_b2 is set to 1 or 2 so that N_w(t) is odd. Under this definition, a time window is set only at times t at which the acoustic saliency signal S_A(t) exceeds the threshold θ_s, and its width grows in proportion to the length of the continuous time interval in which the acoustic saliency signal exceeds the threshold. At each time t, among the correlation values

の値が上位ｐ_ｕ％から上位ｐ_ｌ％の間の値を取る画素位置ｘについて当該相関値の平均値を計算し、その値を時刻ｔにおける相関 the average is computed over the pixel positions x whose correlation values fall between the top p_u % and the top p_l %, and this average is taken as the correlation at time t

とする。

続いて、音響顕著度信号Ｓ_Ａ(ｔ)があらかじめ定められた閾値θ_Ｓを上回る各時刻Ｔ_Ｓ＝｛ｔ_s,1,ｔ_s,2・・・}において、相関 Next, at each time in T_S = {t_{s,1}, t_{s,2}, ...} at which the acoustic saliency signal S_A(t) exceeds the predetermined threshold θ_S, the indices j of the image basic saliency images whose correlation

があらかじめ定められた閾値θ_ｃを上回る画像基礎顕著度画像のインデックスｊを取り出し、全時刻でインデックスごとに数え上げる。この数え上げの結果は、Ｊ次元の整数ベクトルＨ=（ｈ₁,ｈ₂,・・・,ｈ_J）^Tとして表現できる。すなわち、このベクトルの要素ｈ_ｊは、音響顕著度信号が閾値θ_sを上回った時刻において、インデックスｊを持つ画像基礎顕著度画像が、音響顕著度信号との相関で閾値θ_ｃを上回った回数を示す。 exceeds the predetermined threshold θ_c are extracted and counted per index over all times. The result of this counting can be expressed as a J-dimensional integer vector H = (h_1, h_2, ..., h_J)^T. That is, the element h_j of this vector is the number of times, among the times at which the acoustic saliency signal exceeded the threshold θ_s, that the correlation between the image basic saliency image with index j and the acoustic saliency signal exceeded the threshold θ_c.

最後に、このベクトルの要素ｈ_jがあらかじめ定められた閾値θ_hよりも大きなインデックスを残し、このインデックスの集合Ｊ_s={ｊ_s,1, ｊ_s,2,・・・}を主要画像基礎顕著度成分として、画像顕著度画像算出部４に出力する。 Finally, the indices whose vector element h_j is larger than a predetermined threshold θ_h are retained, and the set of these indices J_s = {j_{s,1}, j_{s,2}, ...} is output to the image saliency image calculation unit 4 as the main image basic saliency component.
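The counting and thresholding steps described above can be sketched as follows. The sketch assumes the per-time correlations have already been reduced to one value per index j (stored as `corr[t][j]`); the data layout, 0-based indices, and the function name are illustrative assumptions.

```python
def select_main_components(corr, s_a, theta_s, theta_c, theta_h):
    """Count, per index j, how often the correlation exceeds theta_c at salient times.

    corr[t][j]: correlation of basic saliency image j with the acoustic signal at time t
    s_a[t]:     acoustic saliency signal S_A(t)
    Returns the count vector H and the retained index set J_s.
    """
    n_components = len(corr[0])
    H = [0] * n_components
    for t, s in enumerate(s_a):
        if s > theta_s:                        # only times in T_S are considered
            for j in range(n_components):
                if corr[t][j] > theta_c:
                    H[j] += 1
    J_s = [j for j, h in enumerate(H) if h > theta_h]
    return H, J_s
```

For example, with three components and four time steps of which one is acoustically quiet, only the components that correlate repeatedly during the loud steps survive the final threshold.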

図５のステップ４Ｓで、画像顕著度画像算出部４が、画像顕著度画像算出処理を実行する。詳細には次の通りである。 In step 4S of FIG. 5, the image saliency image calculation unit 4 executes image saliency image calculation processing. Details are as follows.

画像顕著度画像算出部４は、上記のように入力された画像基礎顕著度画像の集合、主要画像基礎顕著度成分及び必要であれば音響顕著度信号に基づいて、入力画像の各位置における顕著度を表示した画像である顕著度画像を出力する。 The image saliency image calculation unit 4 performs saliency at each position of the input image based on the set of image basic saliency images input as described above, the main image basic saliency component, and, if necessary, the acoustic saliency signal. A saliency image that is an image indicating the degree is output.

顕著度画像の算出方法は特に限定されるものではないが、本実施形態においては、主要画像基礎顕著度成分として選択された画像基礎顕著度画像を選択的に用いて顕著度画像を構成する方法を採用する。すなわち、時刻ｔの顕著度画像Ｓ(t)は以下のように算出される。 The calculation method of the saliency image is not particularly limited, but in the present embodiment, a method of constructing the saliency image by selectively using the image basic saliency image selected as the main image basic saliency component Is adopted. That is, the saliency image S (t) at time t is calculated as follows.

ただし、θ_s2はあらかじめ定められた閾値、 Here, θ_s2 is a predetermined threshold, and

は指示関数であり、括弧内の条件が満たされたときに１、それ以外の場合に０を返す関数である。すなわち、上式は、音響顕著度信号Ｓ_Ａ(ｔ)が閾値θ_ｓ2を上回る時刻では主要画像基礎顕著度成分として選択された画像基礎顕著度画像のみを用いて顕著度画像を構成し、それ以外の時刻ではすべての画像基礎顕著度画像を用いて顕著度画像を構成することを示している。閾値θ_s2を０に設定すると、すべての時刻において主要画像基礎顕著度成分として選択された画像基礎顕著度画像のみを用いて顕著度画像を構成することと等価となる。 is the indicator function, which returns 1 when the condition in parentheses is satisfied and 0 otherwise. That is, the above equation constructs the saliency image using only the image basic saliency images selected as the main image basic saliency component at times when the acoustic saliency signal S_A(t) exceeds the threshold θ_s2, and using all image basic saliency images at all other times. Setting the threshold θ_s2 to 0 is equivalent to constructing the saliency image using only the image basic saliency images selected as the main image basic saliency component at all times.
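The combining equation is not reproduced in this text. A minimal sketch of the switching behaviour it describes is shown below, assuming the selected basic saliency images are combined by plain pixelwise summation. The summation (rather than, say, a weighted or normalized average), the data layout, and the function name are illustrative assumptions.

```python
def saliency_image(basic_images, J_s, s_a_t, theta_s2):
    """Combine basic saliency images for one time step.

    basic_images[j] is a 2-D list for index j. When S_A(t) exceeds theta_s2 only
    the indices in J_s are used; otherwise all indices are used.
    """
    use = list(J_s) if s_a_t > theta_s2 else list(range(len(basic_images)))
    h, w = len(basic_images[0]), len(basic_images[0][0])
    # Pixelwise sum over the chosen indices (assumed combination rule).
    return [[sum(basic_images[j][y][x] for j in use) for x in range(w)]
            for y in range(h)]
```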

また、別の実施形態として、以下のような方法を実行してもよい。 As another embodiment, the following method may also be used.
まず、準備として、基礎顕著度画像 First, as preparation, the basic saliency image

を、特徴種別を表現するインデックスｆ、空間スケールを表現するインデックスσ、及び時間スケールを表現するインデックスｄを用いて、 is rewritten, using an index f representing the feature type, an index σ representing the spatial scale, and an index d representing the time scale, as

と書き直す。すなわち、 That is,

は、基礎顕著度画像 is a notation in which the index j of the basic saliency image

のインデックスｊを、画像基礎特徴種別ｆ、空間スケールσ、時間スケールｄの3つに分解した表記である。また、空間スケールσのインデックス集合をΣ、時間スケールｄのインデックス集合Ｄとし、主要画像基礎顕著度画像のインデックス集合Ｊ_sに含まれる空間スケールσのインデックス集合をΣ_s、時間スケールｄのインデックス集合Ｄ_Ｓとする。 is decomposed into three parts: the image basic feature type f, the spatial scale σ, and the time scale d. Further, let Σ be the index set of the spatial scale σ and D the index set of the time scale d, and let Σ_s and D_S be the index sets of the spatial scale σ and the time scale d, respectively, that are contained in the index set J_s of the main image basic saliency images.
以上の記号を用いて、時刻ｔの顕著度画像Ｓ(ｔ)は以下のように算出される。 Using the above symbols, the saliency image S(t) at time t is calculated as follows.

図５のステップ５Ｓで、顕著度映像算出部５が、顕著度映像算出処理を実行する。即ち、顕著度映像算出部５は、各時刻で算出された顕著度画像を連結した時系列画像である顕著度映像を算出し、この顕著度映像を出力する。 In step 5S of FIG. 5, the saliency video calculation unit 5 executes the saliency video calculation process. That is, the saliency video calculating unit 5 calculates a saliency video that is a time-series image obtained by connecting the saliency images calculated at each time, and outputs the saliency video.

なお、顕著度映像算出部５は、顕著度映像に、各時系列に対応する時刻に対応するように入力音響信号を含ませるようにしてもよい。 Note that the saliency video calculation unit 5 may include the input acoustic signal in the saliency video so as to correspond to the time corresponding to each time series.

以上説明したように、第１の実施の形態に係る顕著度画像生成装置によれば、入力映像を構成する各時刻のフレームの入力画像及び入力映像を構成する音響信号を用いて、各時刻のフレームの入力画像の各位置における顕著度を示す顕著度画像を生成することができる。 As described above, according to the saliency image generating device according to the first embodiment, the input image of each time frame constituting the input video and the acoustic signal constituting the input video are used to set the time of each time. A saliency image indicating the saliency at each position of the input image of the frame can be generated.

また、画像基礎顕著度画像抽出部１５と音響顕著度信号算出部２とは、同一のBayesian surpriseモデル（確率モデル）を用いているので、画像基礎顕著度選択部３は、異なる物理量の相関が物理的に意味をなすようにすることができる。 Further, since the image basic saliency image extraction unit 15 and the acoustic saliency signal calculation unit 2 use the same Bayesian surprise model (probability model), the image basic saliency selection unit 3 has a correlation between different physical quantities. It can be physically meaningful.

なお、上記の実施の形態において、画像基礎顕著度選択部３が、後述する第３の実施の形態で説明する方法を用いて、主要画像基礎特徴量成分を生成してもよい。 In the above embodiment, the image basic saliency selection unit 3 may generate the main image basic feature amount component using the method described in the third embodiment below.
また、画像顕著度画像算出部４が、後述する第３の実施の形態で説明する方法を用いて、顕著度画像を算出してもよい。 Likewise, the image saliency image calculation unit 4 may calculate the saliency image using the method described in the third embodiment below.

[第２の実施の形態] [Second Embodiment]
次に、第２の実施の形態に係る注視位置推定装置について説明する。なお、第１の実施の形態と同様の構成となる部分には、同一符号を付して説明を省略する。 Next, a gaze position estimation apparatus according to the second embodiment will be described. Parts with the same configuration as in the first embodiment are given the same reference numerals, and their description is omitted.

図９には、第２の実施の形態に係る注視位置推定装置の構成の概略が示されている。図９に示すように、本実施形態の注視位置推定装置は、第１の実施形態の顕著度画像生成装置における入力部１０、画像基礎顕著度画像算出部１、音響顕著度信号算出部２、画像基礎顕著度選択部３、画像顕著度画像算出部４、及び顕著度映像算出部５と、注視位置推定部６とで構成される。本実施形態の注視位置推定装置は、入力部１０により入力された、注視位置推定の対象となる入力映像のフレーム内の各位置における人間の注視位置を推定した結果である推定注視位置を出力する。 FIG. 9 shows an outline of the configuration of the gaze position estimation apparatus according to the second embodiment. As shown in FIG. 9, the gaze position estimation apparatus of this embodiment comprises the input unit 10, image basic saliency image calculation unit 1, acoustic saliency signal calculation unit 2, image basic saliency selection unit 3, image saliency image calculation unit 4, and saliency video calculation unit 5 of the saliency image generating apparatus of the first embodiment, together with a gaze position estimation unit 6. The apparatus outputs an estimated gaze position, which is the result of estimating the human gaze position at each position within the frames of the input video that is input by the input unit 10 and is the target of gaze position estimation.

次に、第２の実施形態の作用を説明する。第２の実施形態の作用は、第１の実施形態の作用と同様な部分があるので、異なる部分についてのみ説明する。 Next, the operation of the second embodiment will be described. Since the operation of the second embodiment has the same part as the operation of the first embodiment, only different parts will be described.

図１０には、第２の実施の形態に係る注視位置推定処理プログラムを示すフローチャートが示されている。 FIG. 10 is a flowchart showing a gaze position estimation processing program according to the second embodiment.

図１０に示されているように、ステップ５Ｓの顕著度映像算出処理が実行されると、ステップ６Ｓで、注視位置推定部６が、注視位置推定処理を実行する。詳細には次の通りである。 As shown in FIG. 10, when the saliency video calculation process in step 5S is executed, the gaze position estimation unit 6 executes the gaze position estimation process in step 6S. Details are as follows.

注視位置推定部６は、顕著度映像算出部５により入力された顕著度映像の各フレームである顕著度画像の各位置における人間の注視位置を推定した結果である推定注視位置を出力する。 The gaze position estimation unit 6 outputs an estimated gaze position that is a result of estimating a human gaze position at each position of the saliency image that is each frame of the saliency video input by the saliency video calculation unit 5.

注視位置の推定方法は特に限定されるものではないが、顕著度画像の画素値が最大となる位置を推定注視位置とする方法、特許文献６（特開2009-259035号公報）などに示される確率的モデルに基づいて注視位置を推定する方法を用いてもよい。 The method of estimating the gaze position is not particularly limited; for example, a method that takes the position at which the pixel value of the saliency image is maximal as the estimated gaze position, or a method that estimates the gaze position based on a probabilistic model such as that shown in Patent Document 6 (Japanese Patent Laid-Open No. 2009-259035), may be used.
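The first of these options, taking the position of the maximum pixel value, can be sketched as follows. The (x, y) return convention, the tie-breaking rule (first maximum in row-major order), and the function name are illustrative assumptions.

```python
def estimate_gaze(saliency):
    """Return (x, y) of the largest pixel value in a 2-D saliency image."""
    best = max(((v, y, x) for y, row in enumerate(saliency) for x, v in enumerate(row)),
               key=lambda item: item[0])   # compare by pixel value only
    return best[2], best[1]
```

Applying this to every frame of the saliency video yields one estimated gaze position per time step.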

As described above, the gaze position estimation apparatus according to the second embodiment can estimate the gaze position from the saliency image obtained using the input image of the frame at each time constituting the input video and the acoustic signal constituting the input video.

[Third embodiment]

Next, a gaze position estimation apparatus according to the third embodiment will be described. Its configuration is the same as that of the first embodiment, so the same reference numerals are used and the description is omitted.

In the gaze position estimation apparatus according to the third embodiment, the image basic saliency selection unit 3 selects, based on the input set of image basic saliency images and the acoustic saliency signal, the main image basic saliency components in time intervals where the acoustic saliency is large (that is, it emphasizes the above image signal components), and outputs them as the main image basic saliency component.

The method of selecting the image basic saliency component is not particularly limited. In this embodiment, the correlation coefficient between the acoustic saliency signal and the pixel values of the image basic saliency images, computed by exponential smoothing, is adopted.

In this embodiment, a technique called exponential smoothing is used, which predicts future values of a time-series signal from the values obtained up to the present.
In exponential smoothing, the prediction assumes that the two time-series signals are generated according to a joint normal distribution. Let the two time-series signals be

Then their respective means are computed as

and, similarly, their covariance is computed as

Here, α is a predetermined constant or the output of a function that decreases monotonically with time t. Using these statistics, the correlation coefficient of the two time-series signals

and their mutual information

are computed as follows.

Using this exponential smoothing, at each time t and pixel x, for the acoustic saliency signal S_A(t) and each image basic saliency image

the correlation coefficient

and the mutual information

can be computed. The square of this correlation coefficient, the mutual information, or a binarized version of either is regarded as expressing the importance of each feature type (index j), each time t, and each position x

and is output as the main image basic feature amount component.
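The per-pixel statistics described above can be sketched as follows. The equations of the specification are not reproduced in this text, so the recursive update used here (new value weighted by α, previous estimate weighted by 1 − α) is an assumed, commonly used form of exponential smoothing, and the function name `smoothed_importance` and the small constant `eps` are hypothetical. The mutual information line uses the standard result for a bivariate normal distribution, I = −(1/2) log(1 − ρ²).

```python
import numpy as np

def smoothed_importance(audio, visual, alpha=0.1, eps=1e-12):
    """audio: shape (T,); visual: shape (T, H, W) for one feature type.

    Returns (rho2, mi), each of shape (T, H, W): the squared correlation
    coefficient and the Gaussian mutual information at every time and pixel.
    """
    audio = np.asarray(audio, dtype=float)
    visual = np.asarray(visual, dtype=float)
    mean_a, mean_v = audio[0], visual[0].copy()
    var_a = 0.0
    var_v = np.zeros_like(mean_v)
    cov = np.zeros_like(mean_v)
    rho2 = np.zeros_like(visual)
    mi = np.zeros_like(visual)
    for t in range(1, audio.shape[0]):
        # Assumed update rule: exponentially weighted running means.
        mean_a = alpha * audio[t] + (1 - alpha) * mean_a
        mean_v = alpha * visual[t] + (1 - alpha) * mean_v
        da, dv = audio[t] - mean_a, visual[t] - mean_v
        # Exponentially weighted variances and covariance.
        var_a = alpha * da * da + (1 - alpha) * var_a
        var_v = alpha * dv * dv + (1 - alpha) * var_v
        cov = alpha * da * dv + (1 - alpha) * cov
        r2 = np.clip(cov ** 2 / (var_a * var_v + eps), 0.0, 1.0 - 1e-9)
        rho2[t] = r2
        mi[t] = -0.5 * np.log(1.0 - r2)  # bivariate normal mutual information
    return rho2, mi

# Hypothetical data: pixel (0, 0) follows the audio signal, pixel (0, 1) is constant.
T = 50
audio = np.sin(0.3 * np.arange(T))
visual = np.zeros((T, 1, 2))
visual[:, 0, 0] = 2.0 * audio + 1.0
visual[:, 0, 1] = 0.5
rho2, mi = smoothed_importance(audio, visual, alpha=0.1)
```

Either `rho2` or `mi` (or a thresholded version) could then serve as the importance; which one is preferable is left open in the text.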

In another embodiment, spatial filtering may be applied so that the importance values at adjacent pixel positions are close to each other.

First, the image basic saliency image

is binarized, for example by using the mean pixel value as the threshold. Next, the binarized image basic saliency image

is multiplied with the importance image

to obtain an importance image that is nonzero only at pixel positions where the binarized image basic saliency image is nonzero. A spatial smoothing filter such as a Gaussian filter is applied to it, and the result

is adopted as the final importance and output as the main image basic feature amount component.
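The three steps above (binarize, mask, smooth) can be sketched as follows. The function names, the kernel radius of three standard deviations, and the zero padding at the image border are assumptions chosen for illustration; the text only requires some spatial smoothing filter such as a Gaussian.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_importance(importance, basic_saliency, sigma=1.0):
    """Mask the importance image with the binarized saliency image, then blur it.

    Assumes both inputs are 2D arrays of the same shape whose sides are at
    least as long as the kernel (2 * radius + 1 pixels).
    """
    # Step 1: binarize using the mean pixel value as the threshold.
    mask = (basic_saliency > basic_saliency.mean()).astype(float)
    # Step 2: keep importance only where the binarized image is nonzero.
    masked = importance * mask
    # Step 3: separable Gaussian smoothing (rows, then columns), zero padded.
    k = gaussian_kernel_1d(sigma, max(1, int(3 * sigma)))
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, masked)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

# Hypothetical 9 x 9 example with one salient pixel in the center.
importance = np.ones((9, 9))
basic_saliency = np.zeros((9, 9))
basic_saliency[4, 4] = 1.0
final_importance = smooth_importance(importance, basic_saliency, sigma=1.0)
```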

In yet another embodiment, the importance may be adjusted according to the mean and variance of the pixel values of the importance image

Instead of this importance image, the spatially filtered importance image

may be used. Let the mean of the pixels of the importance image

be

and the standard deviation be

Then each pixel value of the transformed importance image

is computed as follows.

The image saliency image calculation unit 4 outputs a saliency image, which is an image representing the saliency at each position of the input image, based on the input set of image basic saliency images, the main image basic saliency component, and, if necessary, the acoustic saliency signal.

The method of calculating the saliency image is not particularly limited. In this embodiment, the saliency image is constructed by selectively using the image basic saliency images selected as the main image basic saliency component. Here, the main image basic feature amount component is given as an importance for each image basic saliency, each time, and each pixel position, and the saliency image S(t) at time t is calculated as follows.

Here, β is a predetermined constant. In the above equation, the first term dominates when the acoustic saliency signal is close to 0 and the second term dominates when it is large, so the magnitude of the acoustic saliency signal controls whether the result of the image basic saliency selection unit 3 is reflected. As special cases, when β = 0 all image basic feature amount components are used without regard to the selected main image basic feature amount component, and when β = ∞ only the selected main image basic feature amount component is used.
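The equation itself is not reproduced in this text, so the following is only a hypothetical blend that has the limiting behavior just described; it should not be read as the formula of the specification. The weight 1 − exp(−β·S_A), the averaging over feature types, and the name `blend_saliency` are all assumptions.

```python
import numpy as np

def blend_saliency(basic_saliency, importance, audio_saliency, beta):
    """Hypothetical two-term blend of all components and selected components.

    basic_saliency, importance: arrays of shape (J, H, W) for J feature types.
    audio_saliency: non-negative scalar S_A(t); beta: non-negative constant.
    """
    if np.isinf(beta):
        # Avoid inf * 0 = nan; treat beta = infinity as "selected only".
        weight = 1.0 if audio_saliency > 0 else 0.0
    else:
        # Assumed weight: 0 when S_A is 0 or beta is 0, approaching 1 as either grows.
        weight = 1.0 - np.exp(-beta * audio_saliency)
    all_components = basic_saliency.mean(axis=0)                 # first term
    selected = (importance * basic_saliency).mean(axis=0)        # second term
    return (1.0 - weight) * all_components + weight * selected

# Hypothetical inputs: two feature types on a 3 x 3 grid.
rng = np.random.default_rng(0)
basic = rng.random((2, 3, 3))
imp = rng.random((2, 3, 3))
s_beta_zero = blend_saliency(basic, imp, audio_saliency=0.8, beta=0.0)
s_beta_inf = blend_saliency(basic, imp, audio_saliency=0.8, beta=float("inf"))
```

With β = 0 the result ignores the importance entirely, and with β = ∞ it uses only the importance-weighted components, matching the two special cases named in the text.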

The other configuration and operation of the gaze position estimation apparatus according to the third embodiment are the same as in the first embodiment, and their description is omitted.

[Experimental results]

Next, experimental results for the first embodiment of the present invention are described. In this experiment, three input videos, video 1E to video 3E, each 3.5 to 8.0 seconds long, were prepared. Video 1E and video 2E are 1024 × 576 pixels, and video 3E is 1280 × 710 pixels. To confirm the effect of the first embodiment, we compared how well the saliency videos obtained by the method of the first embodiment and by a known method simulate human visual characteristics. As the statistic expressing human visual characteristics, we adopted the gaze positions of people actually viewing the input videos. The input videos were presented to 15 subjects, and each subject's gaze position in the input video was measured sequentially with an existing eye-tracking device. Each input video was presented to each subject once, in random order, which yielded one gaze position time series per subject and input video. By associating this time series with each frame of the input video (that is, each input image) while keeping the timing consistent, a gaze position was obtained for each subject and each input image.

As the evaluation measure for the simulation of human visual characteristics, we adopted normalized scan-path saliency (NSS). NSS is the expected value of the normalized saliency at the subjects' gaze positions; from this definition, NSS is larger the larger the saliency values at the gaze positions. NSS is calculated as follows. For the input image i_j(t) (t = 1, 2, ..., T_j) at time t of the j-th input video I_j (j = 1, 2, 3), let the saliency image to be evaluated be S(t; I_j) = {s(x, t; I_j)}_x. Let the gaze position series of subject n (n = 1, 2, ..., N = 15) for the input video I_j be V_n(I_j) = {v_n(t; I_j)}_t. Then the evaluation value NSS(t; I_j) of the saliency image S(t; I_j) at time t is calculated as follows.

Here, s(t; I_j) and σ_S(t; I_j) denote the mean and variance of the pixel values s(x, t; I_j) of the saliency image S(t; I_j) extracted from the input video I_j.

The evaluation value NSS(I_j) of the saliency video S(I_j) = {S(t; I_j)}_t is obtained by averaging the evaluation values of the saliency images S(t; I_j) over all times.
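The per-frame and per-video evaluation values can be sketched as follows. Because the equation is not reproduced in this text, the sketch assumes the commonly used NSS form: subtract the mean pixel value, divide by the standard deviation of the pixel values, and average over subjects. The function names and the (row, column) gaze format are hypothetical.

```python
import numpy as np

def nss_frame(saliency_image, gaze_positions):
    """NSS of one saliency image given one (row, col) gaze position per subject.

    Assumes the image is not constant, so the standard deviation is nonzero.
    """
    mean = saliency_image.mean()
    std = saliency_image.std()  # assumed normalizer (standard deviation)
    values = np.array([saliency_image[r, c] for r, c in gaze_positions], dtype=float)
    return float(np.mean((values - mean) / std))

def nss_video(saliency_video, gaze_series):
    """Average the per-frame NSS over all frames of a saliency video.

    gaze_series[t] is the list of gaze positions of all subjects at frame t.
    """
    return float(np.mean([nss_frame(s, g) for s, g in zip(saliency_video, gaze_series)]))

# Hypothetical example: one salient pixel, two frames, two subjects.
frame = np.zeros((4, 4))
frame[1, 1] = 1.0
on_peak = nss_frame(frame, [(1, 1)])
off_peak = nss_frame(frame, [(0, 0)])
video_score = nss_video([frame, frame], [[(1, 1), (1, 1)], [(1, 1), (0, 0)]])
```

A gaze position on the salient pixel gives a positive score and one on a non-salient pixel gives a negative score, which is the behavior the text relies on when comparing methods.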

An overview of the results with NSS as the evaluation measure is shown in FIGS. 11, 13, and 15, and the per-frame evaluation results are shown in FIGS. 12, 14, and 16.

FIGS. 11, 13, and 15 show overviews of the evaluation results for videos 1E to 3E, respectively, and FIGS. 12, 14, and 16 show the per-frame evaluation results for videos 1E to 3E, respectively.

As shown in FIGS. 11, 13, and 15, both when only the times at which the acoustic saliency is at or above the threshold θ_s are evaluated (upper row) and when all times are evaluated (lower row), the NSS value of the proposed method (first embodiment) is larger than that of the conventional method according to Non-Patent Document 1. As noted above, NSS is larger the larger the saliency values at the subjects' gaze positions. The results in FIGS. 11, 13, and 15 therefore show that the method of the first embodiment achieves better evaluation results than the conventional method.

Also, as shown in FIGS. 11, 13, and 15, the NSS values do not differ greatly between the case where the acoustic saliency threshold θ_s is the optimal value (optimal) and the case where it is 0. This indicates that the method of the first embodiment, which emphasizes the main image features in frames with high acoustic saliency, need not be restricted to frames with high acoustic saliency.

FIGS. 12, 14, and 16 show the NSS value (left vertical axis) and the acoustic saliency (surprise; right vertical axis) for each frame (Frame; horizontal axis). In these figures, frames shaded gray correspond to times at which the acoustic saliency is at or above the threshold. The acoustic saliency of each frame (Auditory surprise) is drawn as a solid line. The NSS value of the conventional method according to Non-Patent Document 1 (Conventional) is drawn as a dotted line. The NSS value of the proposed method (first embodiment) with the acoustic saliency threshold θ_s at the optimal value (Surprise frame) is drawn as a two-dot chain line, and that with θ_s = 0 (All frame) as a one-dot chain line. As shown in FIGS. 12, 14, and 16, the NSS value is at or above a certain level not only in frames corresponding to times at which the acoustic saliency is at or above the threshold but also in many frames at which it is below the threshold. These results indicate that the method of the present invention, which emphasizes the selected image features, is effective not only for frames with high acoustic saliency but also for many frames whose acoustic saliency is not necessarily high.

[Modification]

A program for executing the processes of the saliency image generating apparatus and the gaze position estimation apparatus may be recorded on a computer-readable recording medium, and the various processes described above may be carried out by having a computer system read and execute the program recorded on that medium. The "computer system" here may include an OS and hardware such as peripheral devices. If a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a writable nonvolatile memory such as a flexible disk, magneto-optical disk, ROM, or flash memory, a portable medium such as a CD-ROM, or a storage device such as a hard disk built into a computer system.

Furthermore, the "computer-readable recording medium" also includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line. The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the "transmission medium" that transmits the program is a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

Embodiments of the present invention have been described in detail above with reference to the drawings, but the specific configuration is not limited to these embodiments and includes designs and the like within a scope that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS
10 Input unit
1 Image basic saliency image calculation unit
2 Acoustic saliency signal calculation unit
3 Image basic saliency selection unit
4 Image saliency image calculation unit
5 Saliency video calculation unit
11 Image basic feature image extraction unit
12 Image multi-resolution image extraction unit
13 Image resolution difference image extraction unit
14 Image time difference image extraction unit
15 Image basic saliency image extraction unit
21 Acoustic basic feature amount extraction unit
22 Acoustic saliency signal extraction unit
111 Luminance feature image extraction unit
112 Color feature image extraction unit
113 Direction feature image extraction unit
114 Blinking feature image extraction unit
115 Motion feature image extraction unit

Claims

A saliency image generating apparatus comprising: an image basic saliency image extraction unit that, for the input image of the frame at each time constituting an input video, generates, for each of a plurality of feature types, a basic saliency image indicating the degree to which the input image has salient characteristics, and takes the results as a set of basic saliency images;
an acoustic saliency signal calculation unit that, for the acoustic signal constituting the input video, generates an acoustic saliency signal indicating the degree of having salient characteristics at each time;
an image basic saliency selection unit that, for each of the plurality of feature types and for each time and each pixel, calculates the correlation between the pixel of the basic saliency image for that feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time, and generates a main image basic saliency component based on the correlation at each time and each pixel for each of the plurality of feature types; and
an image saliency image calculation unit that generates a saliency image indicating the saliency at each position of the input image of the frame at each time, based on the set of basic saliency images for the frame at each time and the main image basic saliency component.

The saliency image generating apparatus according to claim 1, wherein the image basic saliency selection unit calculates, for each of the plurality of feature types and for each time and each pixel, a correlation value indicating the correlation between the pixel of the basic saliency image for that feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time; calculates, for each of the plurality of feature types, the number of times the correlation value exceeds a threshold; and generates a main image basic saliency component consisting of the feature types for which that number is large.

The saliency image generating apparatus according to claim 1, wherein the image basic saliency selection unit calculates, for each of the plurality of feature types and for each time and each pixel, a statistic indicating the correlation between the pixel of the basic saliency image for that feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time, and generates, based on the statistic at each time and each pixel for each of the plurality of feature types, a main image basic saliency component consisting of an importance at each time and each pixel for each of the plurality of feature types.

The saliency image generating apparatus according to any one of claims 1 to 3, wherein the image basic saliency image extraction unit generates, for the input image of the frame at each time constituting the input video and for each of the plurality of feature types, an image basic feature image indicating the feature amount of that feature type at each pixel of the input image, and takes the results as a set of image basic feature images;
generates, for each of the plurality of feature types, from the image basic feature image for that feature type included in the set of image basic feature images, a spatial-direction basic saliency image indicating the degree of having spatially salient characteristics and a temporal-direction basic saliency image indicating the degree of having temporally salient characteristics; generates the basic saliency image at predetermined time intervals based on the generated spatial-direction and temporal-direction basic saliency images; and takes the results as the set of basic saliency images; and
the acoustic saliency signal calculation unit extracts an acoustic basic feature amount at each time from the acoustic signal constituting the input video and, based on the extracted acoustic basic feature amount at each time, generates, at the predetermined time intervals, the acoustic saliency signal at each of the same times as those at which the image basic saliency image extraction unit generates the basic saliency images.

The saliency image generating apparatus according to claim 4, wherein the image basic saliency image extraction unit generates, for each of the plurality of feature types, from the image basic feature image for that feature type included in the set of image basic feature images, a spatial-direction basic saliency image indicating the degree of having spatially salient characteristics and a temporal-direction basic saliency image indicating the degree of having temporally salient characteristics; generates the basic saliency image based on the generated spatial-direction basic saliency image, the generated temporal-direction basic saliency image, and a predetermined probability model; and takes the results as the set of basic saliency images; and
the acoustic saliency signal calculation unit extracts an acoustic basic feature amount at each time from the acoustic signal constituting the input video, and generates the acoustic saliency signal at each time based on the extracted acoustic basic feature amount at each time and the predetermined probability model.

The saliency image generating apparatus according to claim 5, wherein the probability model is a gamma distribution.

A saliency image generating method in which: an image basic saliency image extraction unit generates, for the input image of the frame at each time constituting an input video and for each of a plurality of feature types, a basic saliency image indicating the degree to which the input image has salient characteristics, and takes the results as a set of basic saliency images;
an acoustic saliency signal calculation unit generates, for the acoustic signal constituting the input video, an acoustic saliency signal indicating the degree of having salient characteristics at each time;
an image basic saliency selection unit calculates, for each of the plurality of feature types and for each time and each pixel, the correlation between the pixel of the basic saliency image for that feature type, included in the set of basic saliency images for the frame at that time, and the acoustic saliency signal at that time, and generates a main image basic saliency component based on the correlation at each time and each pixel for each of the plurality of feature types; and
an image saliency image calculation unit generates a saliency image indicating the saliency at each position of the input image of the frame at each time, based on the set of basic saliency images for the frame at each time and the main image basic saliency component.

A program for causing a computer to function as each unit of the saliency image generating apparatus according to any one of claims 1 to 6.