JP7405570B2

JP7405570B2 - Visual load estimation device

Info

Publication number: JP7405570B2
Application number: JP2019205194A
Authority: JP
Inventors: 俊明井上
Original assignee: Pioneer Corp
Current assignee: Pioneer Corp
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2023-12-26
Anticipated expiration: 2039-11-13
Also published as: JP2021077248A; JP2024032022A

Description

The present invention relates to a visual load estimation device that estimates a visual load based on images of the outside captured from a moving body.

Patent Document 1 describes estimating, from an image captured in the traveling direction of the own vehicle, how readily the driving environment tires the eyes. A captured image in the traveling direction of the own vehicle is acquired; the position in the captured image at which the driver physiologically tends to gaze is estimated; the position in the captured image that the driver should look at when driving the own vehicle is estimated; and the visual load is estimated based on the positional relationship between the position the driver tends to gaze at and the position that should be looked at.

Patent Document 1: Japanese Patent No. 5482737

However, visual burden is not limited to the gap, described in Patent Document 1, between the position that should be looked at and the degree to which a position physiologically draws the gaze. For example, scenes in which the line of sight shifts greatly can also be visually burdensome, but Patent Document 1 does not take such scenes into account.

One example of the problem to be solved by the present invention is the automatic extraction of portions that impose a visual burden.

To solve the above problem, the invention according to claim 1 comprises: a generation unit that generates, for each image, visual saliency distribution information obtained by estimating the level of visual saliency based on images continuously captured of the outside from a moving body; a calculation unit that calculates the movement amount of an estimated gaze point based on the generated visual saliency distribution information; and an estimation unit that estimates a visual load based on the temporal transition of the calculated movement amount of the estimated gaze point.

The invention according to claim 6 is a visual load estimation method executed by a visual load estimation device that estimates a visual load based on images of the outside captured from a moving body. The method comprises: a generation step of generating, for each image, visual saliency distribution information obtained by estimating the level of visual saliency based on images continuously captured of the outside from the moving body; a calculation step of calculating the movement amount of an estimated gaze point based on the generated visual saliency distribution information; and an estimation step of estimating the visual load based on the temporal transition of the calculated movement amount of the estimated gaze point.

The invention according to claim 7 is characterized by causing a computer to execute the visual load estimation method according to claim 6.

The invention according to claim 8 is characterized by storing the visual load estimation program according to claim 7.

FIG. 1 is a functional configuration diagram of a visual load estimation device according to an example of the present invention.
FIG. 2 is a block diagram illustrating the configuration of the visual saliency extraction means shown in FIG. 1.
FIG. 3(a) is a diagram illustrating an image input to the determination device, and FIG. 3(b) is a diagram illustrating the visual saliency map estimated for (a).
FIG. 4 is a flowchart illustrating the processing method of the visual saliency extraction means shown in FIG. 1.
FIG. 5 is a diagram illustrating in detail the configuration of the nonlinear mapping unit.
FIG. 6 is a diagram illustrating the configuration of an intermediate layer.
FIGS. 7(a) and 7(b) are diagrams each showing an example of convolution processing performed by a filter.
FIG. 8(a) is a diagram for explaining the processing of the first pooling unit, FIG. 8(b) is a diagram for explaining the processing of the second pooling unit, and FIG. 8(c) is a diagram for explaining the processing of the unpooling unit.
FIG. 9 is a functional configuration diagram of the visual load estimation means shown in FIG. 1.
FIG. 10 is a waveform diagram showing the operation of each function in the visual load estimation means shown in FIG. 9.
FIG. 11 is a flowchart of the operation of the visual load estimation device shown in FIG. 1.

A visual load estimation device according to an embodiment of the present invention is described below. In this device, a generation unit generates, for each image, visual saliency distribution information obtained by estimating the level of visual saliency based on images continuously captured of the outside from a moving body, and a calculation unit calculates the movement amount of an estimated gaze point based on the generated visual saliency distribution information. An estimation unit then estimates the visual load based on the temporal transition of the calculated movement amount of the estimated gaze point. Because the estimate rests on the temporal transition of the movement amount of the estimated gaze point, scenes in which the line of sight shifts greatly can be identified. Portions that impose a visual burden can therefore be extracted automatically.

The calculation unit may calculate the movement amount by taking, as the estimated gaze point, the position on the image at which the visual saliency is at its maximum in the visual saliency distribution information. The movement amount can then be calculated based on the position estimated to be most likely looked at.
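The maximum-saliency rule above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, the saliency map is assumed to be a plain 2D list of numbers, ties are resolved in favor of the first maximum in row-major order, and the Euclidean distance between consecutive gaze points is an assumed definition of the "movement amount" (the text does not specify the metric).

```python
import math


def estimated_gaze_point(saliency_map):
    """Return the (row, col) position of the maximum value in a 2D saliency map."""
    best = (0, 0)
    best_value = saliency_map[0][0]
    for r, row in enumerate(saliency_map):
        for c, value in enumerate(row):
            if value > best_value:  # strict '>' keeps the first maximum on ties
                best_value = value
                best = (r, c)
    return best


def gaze_movement_amounts(saliency_maps):
    """Distance between the estimated gaze points of consecutive frames.

    Assumption: Euclidean distance in pixel units; the source does not fix the metric.
    """
    points = [estimated_gaze_point(m) for m in saliency_maps]
    return [math.dist(p, q) for p, q in zip(points, points[1:])]
```

For a sequence of N saliency maps this yields N-1 movement amounts, which is the kind of time series whose temporal transition the estimation unit would then analyze.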

The estimation unit may decompose the movement amount of the estimated gaze point into a plurality of basis components and calculate the visual load based on the decomposed basis components. This makes it possible to use an algorithm such as empirical mode decomposition (EMD).

The generation unit may include an input unit that converts an image into intermediate data that can be subjected to mapping processing, a nonlinear mapping unit that converts the intermediate data into mapping data, and an output unit that generates saliency estimation information indicating a saliency distribution based on the mapping data. The nonlinear mapping unit may include a feature extraction unit that extracts features from the intermediate data and an upsampling unit that upsamples the data generated by the feature extraction unit. This allows visual saliency to be estimated at low computational cost.

The device may also include a presentation unit that presents the estimation result of the estimation unit. This makes it possible, for example, to give notice that a point with a high visual load is approaching.

In a visual load estimation method according to an embodiment of the present invention, a generation step generates, for each image, visual saliency distribution information obtained by estimating the level of visual saliency based on images continuously captured of the outside from a moving body, and a calculation step calculates the movement amount of an estimated gaze point based on the generated visual saliency distribution information. An estimation step then estimates the visual load based on the temporal transition of the calculated movement amount of the estimated gaze point. Because the estimate rests on the temporal transition of the movement amount of the estimated gaze point, positions at which the line of sight shifts greatly can be identified. Portions that impose a visual burden can therefore be extracted automatically.

The visual load estimation method described above is executed by a computer. Using a computer, positions at which the line of sight shifts greatly can thus be identified from the temporal transition of the movement amount of the estimated gaze point. Portions that impose a visual burden can therefore be extracted automatically.

The visual load estimation program described above may be stored in a computer-readable storage medium. The program can then be distributed on its own, in addition to being built into a device, and version upgrades and the like become easy.

A visual load estimation device according to an example of the present invention is described with reference to FIGS. 1 to 11. The device is not limited to installation in a moving body such as an automobile; it may also be implemented as a server device or the like installed at a business site. In other words, the analysis need not be performed in real time and may be performed after driving, for example.

As shown in FIG. 1, the visual load estimation device 1 includes input means 2, visual saliency extraction means 3, visual load estimation means 4, and information presentation means 5.

The input means 2 receives images (a moving image) captured by, for example, a camera, and outputs them as image data. An input moving image is output as image data decomposed into a time series, for example frame by frame. A single still image may be input to the input means 2, but it is preferable to input a group of images consisting of a plurality of still images in chronological order.

An example of the images input to the input means 2 is images captured in the traveling direction of a vehicle, that is, images continuously captured of the outside from a moving body. The images may also cover directions other than the traveling direction, as in so-called panoramic images spanning 180° or 360° horizontally. The input to the input means 2 is not limited to images captured directly by a camera; it may also be images read from a recording medium such as a hard disk drive or a memory card.

The visual saliency extraction means 3 receives image data from the input means 2 and outputs a visual saliency map as the visual saliency estimation information described later. That is, the visual saliency extraction means 3 functions as a generation unit that generates a visual saliency map (visual saliency distribution information) obtained by estimating the level of visual saliency based on images of the outside captured from a moving body.

FIG. 2 is a block diagram illustrating the configuration of the visual saliency extraction means 3. The visual saliency extraction means 3 according to this example includes an input unit 310, a nonlinear mapping unit 320, an output unit 330, and a storage unit 390. The input unit 310 converts an image into intermediate data that can be subjected to mapping processing. The nonlinear mapping unit 320 converts the intermediate data into mapping data. The output unit 330 generates saliency estimation information indicating a saliency distribution based on the mapping data. The nonlinear mapping unit 320 includes a feature extraction unit 321 that extracts features from the intermediate data and an upsampling unit 322 that upsamples the data generated by the feature extraction unit 321. The storage unit 390 holds the image data input from the input means 2, the filter coefficients described later, and the like. Details follow.

FIG. 3(a) illustrates an image input to the visual saliency extraction means 3, and FIG. 3(b) illustrates an image showing the visual saliency distribution estimated for FIG. 3(a). The visual saliency extraction means 3 according to this example is a device that estimates the visual saliency of each part of an image. Visual saliency means, for example, how conspicuous a part is or how readily it attracts the gaze. Specifically, visual saliency is expressed as a probability or the like, where the magnitude of the probability corresponds, for example, to how likely the line of sight of a person viewing the image is to be directed to that position.

Positions in FIG. 3(a) and FIG. 3(b) correspond to each other. The higher the visual saliency of a position in FIG. 3(a), the higher the brightness with which it is displayed in FIG. 3(b). An image showing a visual saliency distribution such as FIG. 3(b) is one example of the visual saliency map output by the output unit 330. In this example, visual saliency is visualized as brightness values with 256 gradations. Examples of the visual saliency map output by the output unit 330 are described in detail later.
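One way to picture the 256-gradation visualization is the following minimal sketch. It assumes the saliency map is a 2D list of non-negative values and that brightness is scaled relative to the map's own maximum; the text states only that 256 brightness levels are used, so both the scaling rule and the function name `to_brightness` are assumptions made for illustration.

```python
def to_brightness(saliency_map):
    """Scale a 2D map of non-negative saliency values to integers in 0..255.

    Assumption: the largest value in the map is mapped to 255.
    """
    peak = max(max(row) for row in saliency_map)
    if peak <= 0:
        # A map with no positive saliency is rendered entirely black.
        return [[0 for _ in row] for row in saliency_map]
    return [[round(255 * value / peak) for value in row] for row in saliency_map]
```

With this convention the most salient position always appears at full brightness, which matches the description that higher saliency is displayed with higher brightness.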

FIG. 4 is a flowchart illustrating the operation of the visual saliency extraction means 3 according to this example. The flowchart shown in FIG. 4 is part of the visual load estimation method executed by a computer and includes an input step S110, a nonlinear mapping step S120, and an output step S130. In the input step S110, an image is converted into intermediate data that can be subjected to mapping processing. In the nonlinear mapping step S120, the intermediate data is converted into mapping data. In the output step S130, visual saliency estimation information indicating a saliency distribution is generated based on the mapping data. The nonlinear mapping step S120 includes a feature extraction step S121 that extracts features from the intermediate data and an upsampling step S122 that upsamples the data generated in the feature extraction step S121.

Returning to FIG. 2, each component of the visual saliency extraction means 3 is described. In the input step S110, the input unit 310 acquires image data from the input means 2 and converts the acquired image into intermediate data. The intermediate data is not particularly limited as long as the nonlinear mapping unit 320 can accept it; it is, for example, a high-dimensional tensor. The intermediate data may be, for example, data in which the brightness of the acquired image has been normalized, or data in which each pixel of the acquired image has been converted into a brightness gradient. In the input step S110, the input unit 310 may additionally perform noise removal, resolution conversion, and the like on the image.
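The two kinds of intermediate data mentioned above can be sketched as follows. This is a hedged illustration only: min-max scaling is an assumed form of "brightness normalization", a forward difference along each row is an assumed form of "brightness gradient", and both function names are hypothetical. A single-channel image is represented as a 2D list.

```python
def normalize_luminance(image):
    """Rescale brightness values linearly to the range [0, 1] (min-max scaling)."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    span = hi - lo
    if span == 0:
        # A uniform image carries no contrast; map everything to 0.0.
        return [[0.0 for _ in row] for row in image]
    return [[(value - lo) / span for value in row] for row in image]


def horizontal_gradient(image):
    """Forward difference of brightness along each row; the last column is set to 0."""
    return [
        [row[c + 1] - row[c] if c + 1 < len(row) else 0 for c in range(len(row))]
        for row in image
    ]
```

Either result has the same height and width as the input image, so it can be passed on as a tensor-shaped input to the nonlinear mapping stage.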

In the nonlinear mapping step S120, the nonlinear mapping unit 320 acquires the intermediate data from the input unit 310 and converts it into mapping data. The mapping data is, for example, a high-dimensional tensor. The mapping processing applied to the intermediate data by the nonlinear mapping unit 320 is, for example, mapping processing that can be controlled by parameters or the like, and is preferably processing by a function, a functional, or a neural network.

FIG. 5 illustrates the configuration of the nonlinear mapping unit 320 in detail, and FIG. 6 illustrates the configuration of an intermediate layer 323. As described above, the nonlinear mapping unit 320 includes the feature extraction unit 321 and the upsampling unit 322. The feature extraction step S121 is performed in the feature extraction unit 321, and the upsampling step S122 is performed in the upsampling unit 322. In the example of this figure, at least one of the feature extraction unit 321 and the upsampling unit 322 includes a neural network containing a plurality of intermediate layers 323. In the neural network, the plurality of intermediate layers 323 are connected to one another.

The neural network is preferably a convolutional neural network. Specifically, each of the intermediate layers 323 includes one or more convolutional layers 324. In each convolutional layer 324, the input data is convolved with a plurality of filters 325, and activation processing is applied to the outputs of the filters 325.

In the example of FIG. 5, the feature extraction unit 321 includes a neural network containing a plurality of intermediate layers 323, with a first pooling unit 326 between the intermediate layers 323. The upsampling unit 322 likewise includes a neural network containing a plurality of intermediate layers 323, with an unpooling unit 328 between the intermediate layers 323. The feature extraction unit 321 and the upsampling unit 322 are connected to each other via a second pooling unit 327 that performs overlapping pooling.

In the example of this figure, each intermediate layer 323 consists of two or more convolutional layers 324. However, at least some of the intermediate layers 323 may consist of a single convolutional layer 324. Adjacent intermediate layers 323 are separated by one of the first pooling unit 326, the second pooling unit 327, and the unpooling unit 328. When an intermediate layer 323 contains two or more convolutional layers 324, the number of filters 325 in those convolutional layers 324 is preferably the same.

In this figure, an intermediate layer 323 labeled "A×B" consists of B convolutional layers 324, each of which includes A convolution filters for each channel. Such an intermediate layer 323 is also referred to below as an "A×B intermediate layer". For example, a 64×2 intermediate layer 323 consists of two convolutional layers 324, each of which includes 64 convolution filters for each channel.

In the example of this figure, the feature extraction unit 321 includes a 64×2 intermediate layer 323, a 128×2 intermediate layer 323, a 256×3 intermediate layer 323, and a 512×3 intermediate layer 323, in this order. The upsampling unit 322 includes a 512×3 intermediate layer 323, a 256×3 intermediate layer 323, a 128×2 intermediate layer 323, and a 64×2 intermediate layer 323, in this order. The second pooling unit 327 connects the two 512×3 intermediate layers 323 to each other. The number of intermediate layers 323 constituting the nonlinear mapping unit 320 is not particularly limited and can be determined, for example, according to the number of pixels in the image data.

This figure shows only one example of the configuration of the nonlinear mapping unit 320, which may have other configurations. For example, a 64×1 intermediate layer 323 may be included in place of the 64×2 intermediate layer 323; reducing the number of convolutional layers 324 in an intermediate layer 323 may further reduce the computational cost. Similarly, a 32×2 intermediate layer 323 may be included in place of the 64×2 intermediate layer 323; reducing the number of channels in an intermediate layer 323 may also further reduce the computational cost. Both the number of convolutional layers 324 and the number of channels in an intermediate layer 323 may be reduced.

In the intermediate layers 323 of the feature extraction unit 321, the number of filters 325 preferably increases each time the data passes through a first pooling unit 326. Specifically, a first intermediate layer 323a and a second intermediate layer 323b are connected in sequence via a first pooling unit 326, with the second intermediate layer 323b located downstream of the first intermediate layer 323a. The first intermediate layer 323a consists of convolutional layers 324 in which the number of filters 325 for each channel is N1, and the second intermediate layer 323b consists of convolutional layers 324 in which the number of filters 325 for each channel is N2. In this case, N2 > N1 preferably holds, and N2 = N1 × 2 more preferably holds.

In the intermediate layers 323 of the upsampling unit 322, the number of filters 325 preferably decreases each time the data passes through an unpooling unit 328. Specifically, a third intermediate layer 323c and a fourth intermediate layer 323d are connected in sequence via an unpooling unit 328, with the fourth intermediate layer 323d located downstream of the third intermediate layer 323c. The third intermediate layer 323c consists of convolutional layers 324 in which the number of filters 325 for each channel is N3, and the fourth intermediate layer 323d consists of convolutional layers 324 in which the number of filters 325 for each channel is N4. In this case, N4 < N3 preferably holds, and N3 = N4 × 2 more preferably holds.
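The "A×B" layer listing for FIG. 5 and the preferred filter-count relations can be written down compactly. The sketch below is a plain description of the layer plan, not a trainable network; the names `ENCODER`, `DECODER`, `total_conv_layers`, and `doubles_each_stage` are hypothetical and introduced only for illustration.

```python
# Each entry is (A, B): B convolutional layers with A filters per channel.
ENCODER = [(64, 2), (128, 2), (256, 3), (512, 3)]   # feature extraction unit 321
DECODER = [(512, 3), (256, 3), (128, 2), (64, 2)]   # upsampling unit 322


def total_conv_layers(blocks):
    """Total number of convolutional layers across a list of A x B blocks."""
    return sum(count for _, count in blocks)


def doubles_each_stage(blocks):
    """True if the filter count doubles from each block to the next (N2 = N1 x 2)."""
    return all(nxt[0] == 2 * cur[0] for cur, nxt in zip(blocks, blocks[1:]))
```

Checking `doubles_each_stage(ENCODER)` confirms the more preferred relation N2 = N1 × 2 for the feature extraction side, and applying the same check to the reversed `DECODER` confirms N3 = N4 × 2 for the upsampling side.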

The feature extraction unit 321 extracts image features at multiple levels of abstraction, such as gradients and shapes, from the intermediate data acquired from the input unit 310, as channels of the intermediate layers 323. FIG. 6 illustrates the configuration of the 64×2 intermediate layer 323, and the processing in the intermediate layer 323 is described with reference to this figure. In this example, the intermediate layer 323 consists of a first convolutional layer 324a and a second convolutional layer 324b, each of which has 64 filters 325. In the first convolutional layer 324a, convolution with the filters 325 is applied to each channel of the data input to the intermediate layer 323. For example, when the image input to the input unit 310 is an RGB image, the processing is applied to each of the three channels h^0_i (i = 1..3). In this example, the filters 325 are 64 kinds of 3×3 filters per channel, that is, 64×3 kinds of filters in total. As a result of the convolution, 64 results h^0_{i,j} (i = 1..3, j = 1..64) are obtained for each channel i.

Next, the activation unit 329 applies activation processing to the outputs of the filters 325. Specifically, for each result index j, the corresponding results of all channels are summed element by element, and the activation processing is applied to that sum. This yields the 64-channel result h^1_i (i = 1..64), that is, the output of the first convolutional layer 324a, as image features. The activation processing is not particularly limited, but processing using at least one of a hyperbolic function, a sigmoid function, and a rectified linear function (ReLU) is preferable.

The output data of the first convolutional layer 324a is then used as the input data of the second convolutional layer 324b, which performs the same processing as the first convolutional layer 324a. This yields the 64-channel result h^2_i (i = 1..64), that is, the output of the second convolutional layer 324b, as image features. The output of the second convolutional layer 324b is the output data of this 64×2 intermediate layer 323.
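The per-channel filtering, element-wise summation over channels, and activation described for a convolutional layer 324 can be sketched in plain Python as follows. Several details are assumptions not stated in the text: zero padding at the borders (so the output keeps the input size), ReLU as the activation, no bias term, and the usual neural-network convention of applying the kernel without flipping it. All function names are hypothetical.

```python
def correlate3x3(channel, kernel):
    """Apply a 3x3 kernel to a 2D channel with zero padding (same output size)."""
    rows, cols = len(channel), len(channel[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[dr + 1][dc + 1] * channel[rr][cc]
            out[r][c] = acc
    return out


def relu(x):
    """Rectified linear activation (assumed choice among those the text allows)."""
    return x if x > 0 else 0.0


def conv_layer(channels, filters):
    """Compute one convolutional layer.

    filters[j][i] is the 3x3 kernel applied to input channel i for output channel j.
    For each j, the per-channel results are summed element-wise, then activated.
    """
    rows, cols = len(channels[0]), len(channels[0][0])
    outputs = []
    for per_input_kernels in filters:
        responses = [correlate3x3(ch, k) for ch, k in zip(channels, per_input_kernels)]
        outputs.append(
            [[relu(sum(resp[r][c] for resp in responses)) for c in range(cols)]
             for r in range(rows)]
        )
    return outputs
```

Calling `conv_layer` twice in succession, feeding the first result into the second call, mirrors how the 64×2 intermediate layer passes the output of the first convolutional layer 324a to the second convolutional layer 324b.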

The structure of the filters 325 is not particularly limited, but a 3×3 two-dimensional filter is preferable. The coefficients of each filter 325 can be set independently. In this example, the coefficients of each filter 325 are held in the storage unit 390, and the nonlinear mapping unit 320 can read them out and use them for processing. The coefficients of the filters 325 may be determined based on correction information generated and modified using machine learning. For example, the correction information includes the coefficients of the filters 325 as a plurality of correction parameters. The nonlinear mapping unit 320 can additionally use this correction information to convert the intermediate data into mapping data. The storage unit 390 may be provided in the visual saliency extraction means 3 or outside it. The nonlinear mapping unit 320 may also acquire the correction information from outside via a communication network.

FIGS. 7(a) and 7(b) each show an example of the convolution processing performed by a filter 325; both show 3×3 convolution. The example in FIG. 7(a) is convolution using the nearest neighboring elements. The example in FIG. 7(b) is convolution using neighboring elements at a distance of two or more. Convolution using neighboring elements at a distance of three or more is also possible. The filter 325 preferably performs convolution using neighboring elements at a distance of two or more, because features can then be extracted over a wider range and the accuracy of visual saliency estimation can be further improved.
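The two convolution patterns of FIGS. 7(a) and 7(b) can be sketched as follows. This is a minimal illustration only, assuming a single input channel, zero padding, and an output of the same size as the input; the function name `conv3x3` and the `dilation` parameter are hypothetical and do not appear in the embodiment.

```python
import numpy as np

def conv3x3(channel, kernel, dilation=1):
    """Apply a 3x3 filter to a 2-D array, sampling taps `dilation` elements apart.

    dilation=1 uses the nearest neighboring elements (as in FIG. 7(a));
    dilation=2 uses neighboring elements at distance two (as in FIG. 7(b)).
    Zero padding (an assumption) keeps the output the same size as the input.
    As is usual in convolutional networks, the kernel is not flipped.
    """
    h, w = channel.shape
    padded = np.pad(channel, dilation, mode="constant")
    out = np.zeros((h, w), dtype=float)
    for di in range(3):
        for dj in range(3):
            # Offset of this tap inside the padded array.
            oi, oj = di * dilation, dj * dilation
            out += kernel[di, dj] * padded[oi:oi + h, oj:oj + w]
    return out
```

With `dilation=2` the same nine coefficients cover a 5×5 neighborhood, which is how a wider range of features is reached without adding coefficients.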

The operation of the 64×2 intermediate layer 323 has been described above. The other intermediate layers 323 (the 128×2 intermediate layer 323, the 256×3 intermediate layer 323, the 512×3 intermediate layer 323, and so on) operate in the same way as the 64×2 intermediate layer 323, except for the number of convolutional layers 324 and the number of channels. The intermediate layers 323 in the feature extraction unit 321 and those in the upsampling unit 322 also operate as described above.

FIG. 8(a) is a diagram for explaining the processing of the first pooling unit 326, FIG. 8(b) is a diagram for explaining the processing of the second pooling unit 327, and FIG. 8(c) is a diagram for explaining the processing of the unpooling unit 328.

In the feature extraction unit 321, the data output from an intermediate layer 323 is subjected to pooling for each channel in the first pooling unit 326 and is then input to the next intermediate layer 323. The first pooling unit 326 performs, for example, non-overlapping pooling. FIG. 8(a) shows processing that associates each 2×2 group of four elements 30 with one element 30 for the element group contained in each channel. The first pooling unit 326 performs this association for all elements 30, and the 2×2 groups of four elements 30 are selected so that they do not overlap one another. In this example, the number of elements in each channel is reduced to one quarter. As long as the first pooling unit 326 reduces the number of elements, the numbers of elements 30 before and after the association are not particularly limited.

The data output from the feature extraction unit 321 is input to the upsampling unit 322 via the second pooling unit 327. The second pooling unit 327 applies overlapping pooling to the output data of the feature extraction unit 321. FIG. 8(b) shows processing that associates each 2×2 group of four elements 30 with one element 30 while letting some of the elements 30 overlap. That is, across successive associations, some of the four elements 30 of one 2×2 group are also included in the 2×2 group of the next association. The second pooling unit 327 shown in this figure does not reduce the number of elements. The numbers of elements 30 before and after the association in the second pooling unit 327 are not particularly limited.

The method of the processing performed by the first pooling unit 326 and the second pooling unit 327 is not particularly limited. Examples include an association that takes the maximum value of the four elements 30 as the one element 30 (max pooling) and an association that takes the average value of the four elements 30 as the one element 30 (average pooling).
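The two pooling variants can be sketched as follows. This is a minimal illustration under stated assumptions: a stride of 2 stands in for the non-overlapping pooling of the first pooling unit 326, a stride of 1 for the overlapping pooling of the second pooling unit 327, and the input is padded by edge replication in the overlapping case so that the element count is unchanged. The function name `pool2x2`, its parameters, and the padding choice are hypothetical.

```python
import numpy as np

def pool2x2(channel, stride, mode="max"):
    """2x2 pooling over one channel.

    stride=2: non-overlapping windows; the element count drops to one quarter.
    stride=1: overlapping windows; one replicated row and column are appended
              (an assumption) so the element count is not reduced.
    mode: "max" for max pooling, "average" for average pooling.
    """
    reducers = {"max": np.max, "average": np.mean}
    reduce_fn = reducers[mode]
    if stride == 1:
        channel = np.pad(channel, ((0, 1), (0, 1)), mode="edge")
    h, w = channel.shape
    out_h = (h - 2) // stride + 1
    out_w = (w - 2) // stride + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Associate the four elements of this 2x2 window with one element.
            out[i, j] = reduce_fn(channel[r:r + 2, c:c + 2])
    return out
```

For a 4×4 channel, `pool2x2(x, stride=2)` returns a 2×2 array, while `pool2x2(x, stride=1)` returns a 4×4 array.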

The data output from the second pooling unit 327 is input to an intermediate layer 323 in the upsampling unit 322. The output data of each intermediate layer 323 of the upsampling unit 322 is subjected to unpooling for each channel in the unpooling unit 328 and is then input to the next intermediate layer 323. FIG. 8(c) shows processing that expands one element 30 into a plurality of elements 30. The expansion method is not particularly limited; one example is to copy one element 30 into a 2×2 group of four elements 30.
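The copy-based expansion given as an example can be sketched as follows. This is an illustration of that one example only; the function name `unpool2x2` is hypothetical.

```python
import numpy as np

def unpool2x2(channel):
    """Copy each element into a 2x2 block, quadrupling the element count."""
    return np.repeat(np.repeat(channel, 2, axis=0), 2, axis=1)
```

Applied to a 2×2 channel `[[1, 2], [3, 4]]`, each value fills one 2×2 block of the resulting 4×4 channel.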

The output data of the last intermediate layer 323 of the upsampling unit 322 is output from the nonlinear mapping unit 320 as mapping data and input to the output unit 330. In output step S130, the output unit 330 generates and outputs a visual saliency map by applying, for example, normalization and resolution conversion to the data acquired from the nonlinear mapping unit 320. The visual saliency map is, for example, an image (image data) in which visual saliency is visualized as luminance values, as illustrated in FIG. 3(b). The visual saliency map may instead be an image color-coded according to visual saliency, like a heat map, or an image in which visually salient regions, whose visual saliency exceeds a predetermined threshold, are marked so as to be distinguishable from other positions. Furthermore, the visual saliency estimation information is not limited to map information presented as an image; it may be, for example, a table listing information that indicates the visually salient regions.
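Two of the output forms mentioned above, a luminance image and a table of salient positions, can be sketched as follows. This is a minimal illustration assuming min-max normalization and an arbitrary threshold of 0.8; neither choice is specified in the embodiment, resolution conversion is omitted, and the function name `to_saliency_map` is hypothetical.

```python
import numpy as np

def to_saliency_map(mapped, threshold=0.8):
    """Return (luminance image, table of salient positions) for mapping data.

    Min-max normalization and the default threshold are assumptions.
    """
    mapped = np.asarray(mapped, dtype=float)
    span = mapped.max() - mapped.min()
    if span > 0:
        norm = (mapped - mapped.min()) / span
    else:
        norm = np.zeros_like(mapped)
    # Image form: saliency visualized as 8-bit luminance values.
    image = np.round(norm * 255).astype(np.uint8)
    # Table form: (row, column) of every position above the threshold.
    rows, cols = np.nonzero(norm > threshold)
    table = list(zip(rows.tolist(), cols.tolist()))
    return image, table
```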

The visual load estimation means 4 estimates the visual load amount based on the visual saliency map output by the visual saliency extraction means 3. The visual load amount estimated by the visual load estimation means 4 may be, for example, a scalar or a vector quantity, and may be a single datum or a plurality of time-series data. As shown in FIG. 9, the visual load estimation means 4 includes a gaze point estimation means 41, a gaze point movement amount calculation means 42, a base component decomposition means 43, a base component selection means 44, a power calculation means 45, and a power synthesis means 46.

The gaze point estimation means 41 estimates gaze point information from the time-series visual saliency maps output by the visual saliency extraction means 3. The definition of the gaze point information is not particularly limited; it may be, for example, the position (coordinates) at which the saliency value is at its maximum. That is, the gaze point estimation means 41 estimates the estimated gaze point to be the position on the image at which the visual saliency is at its maximum in the visual saliency map (visual saliency distribution information).

The gaze point movement amount calculation means 42 calculates a time series of gaze point movement amounts from the time-series gaze point information estimated by the gaze point estimation means 41. The calculation method is not particularly limited; the movement amount may be, for example, the Euclidean distance between gaze point coordinates that are consecutive in the time series. That is, the gaze point movement amount calculation means 42 calculates the movement amount of the gaze point (estimated gaze point) based on the generated visual saliency maps (visual saliency distribution information).
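The example definitions given for the gaze point and its movement amount can be sketched as follows. This is a minimal illustration; the function names are hypothetical, coordinates are (row, column) pixel indices, and when several positions share the maximum value the first one in row-major order is used, which is an assumption.

```python
import numpy as np

def gaze_points(saliency_maps):
    """Return the (row, column) of the maximum saliency in each map."""
    return np.array(
        [np.unravel_index(np.argmax(m), m.shape) for m in saliency_maps]
    )

def gaze_movement(points):
    """Euclidean distance between the gaze points of consecutive frames."""
    diffs = np.diff(points.astype(float), axis=0)
    return np.linalg.norm(diffs, axis=1)
```

For N saliency maps, `gaze_movement(gaze_points(maps))` yields N-1 movement amounts, which form the time series passed on to the decomposition stage.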

The base component decomposition means 43 decomposes the time-series variation of the gaze point movement amount calculated by the gaze point movement amount calculation means 42 into a group of one or more base components. Each base component produced by the base component decomposition means 43 is likewise time-series data corresponding to the movement amount. The decomposition method is not particularly limited, but empirical mode decomposition (EMD), for example, is desirable; in that case the base components are intrinsic mode functions (IMFs). Alternatively, a member of the Fourier transform family (whose base components are sinusoids) may be used.
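As an illustration of the Fourier alternative mentioned above (not of empirical mode decomposition, which is the preferred method), the movement series can be split into frequency-band components that are each a time series. This is a minimal sketch; splitting the spectrum into equal-width bands, the function name `fourier_band_components`, and the parameter `n_bands` are all assumptions.

```python
import numpy as np

def fourier_band_components(signal, n_bands):
    """Split a 1-D series into n_bands time-domain components that sum to it.

    Each component keeps one contiguous band of the real FFT spectrum and
    zeroes the rest, so every component is a sum of sinusoids.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    edges = np.linspace(0, len(spectrum), n_bands + 1).astype(int)
    components = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spectrum)
        band[lo:hi] = spectrum[lo:hi]
        components.append(np.fft.irfft(band, n=len(signal)))
    return np.array(components)
```

Because the inverse transform is linear, the returned components add back up to the input series.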

The base component selection means 44 selects one or more base components from the base component group produced by the base component decomposition means 43. The selection method is not limited, and various methods can be used.

The power calculation means 45 calculates the power of each base component selected by the base component selection means 44. The power is likewise time-series data corresponding to the base component. The calculation method is not particularly limited, but when empirical mode decomposition is used as the base component decomposition means, it is desirable to apply the Hilbert transform to each base component (intrinsic mode function) to calculate its amplitude component and phase component, and to use the amplitude component as the power.

The power synthesis means 46 combines the plurality of power components calculated by the power calculation means 45 to calculate the visual load amount. The visual load amount is likewise time-series data corresponding to the power. The combining method is not particularly limited; it may be, for example, a simple sum of the plurality of power components. That is, the means from the base component decomposition means 43 through the power synthesis means 46 estimate the visual load amount based on the temporal transition of the calculated movement amount of the gaze point (estimated gaze point).
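The power calculation and the simple-sum synthesis can be sketched as follows. This is a minimal illustration, assuming the Hilbert transform is realized through the standard FFT construction of the analytic signal and that the selected base components are equal-length 1-D arrays; the function names `analytic_amplitude` and `visual_load` are hypothetical.

```python
import numpy as np

def analytic_amplitude(component):
    """Instantaneous amplitude of a series via the FFT-based Hilbert transform."""
    x = np.asarray(component, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero negative frequencies to form the analytic signal.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

def visual_load(selected_components):
    """Sum the amplitude (power) series of the selected components per sample."""
    return np.sum([analytic_amplitude(c) for c in selected_components], axis=0)
```

The result has one value per time sample, so peaks in it can be matched to the frames at which the estimated gaze point moved most.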

FIG. 10 shows example waveforms of the gaze point movement amount output by the gaze point movement amount calculation means 42, the base component group obtained by decomposing that movement amount with the base component decomposition means 43, and the visual load amount output by the power synthesis means 46, which combines the powers calculated from the base component group by the power calculation means 45. The top row of FIG. 10 is the gaze point movement amount, the second through sixteenth rows are the base component group, and the seventeenth (bottom) row is the visual load amount.

In FIG. 10, the portions where the amplitude of the visual load amount in the bottom row is large are estimated to be portions where the visual load is large.

The information presentation means 5 presents the visual load amount output by the visual load estimation means 4. Examples of the information presentation means include a display device such as a liquid crystal display and an audio output means such as a speaker.

Next, the operation of the visual load estimation device 1 configured as described above (the visual load estimation method) will be described with reference to the flowchart of FIG. 11. By configuring this flowchart as a program executed by a computer functioning as the visual load estimation device 1, a visual load estimation program is obtained. This visual load estimation program need not be stored in a memory or the like of the visual load estimation device 1; it may also be stored on a storage medium such as a memory card or an optical disc.

First, the input means 2 outputs the input images as image data to the visual saliency extraction means 3 (step S210). In this step, the image data input to the input means 2 is broken down into a time series, such as image frames, and input to the visual saliency extraction means 3. Image processing such as noise removal and geometric transformation may also be applied in this step.

Next, the visual saliency extraction means 3 extracts the visual saliency map (step S220). Using the method described above, the visual saliency extraction means 3 outputs visual saliency maps such as the one shown in FIG. 3(b) in time series.

Next, the visual load amount is estimated from the visual saliency map (step S230). The visual load estimation means 4 estimates the visual load amount by the method described above, based on the visual saliency map output by the visual saliency extraction means 3.

Next, the visual load amount is presented (step S240). In this embodiment, the graph of the transition of the visual load amount shown in FIG. 10 may be displayed, and the corresponding image (frame), traveling position, and the like may be displayed together with the graph.

If all frames have been read (step S250: Yes), the flowchart ends. If not all frames have been read (step S250: No), the process returns to step S220 and processes the next frame.

According to this embodiment, in the visual load estimation device 1, the visual saliency extraction means 3 generates, for each image, a visual saliency map obtained by estimating the level of visual saliency based on images captured continuously of the outside from a vehicle, and the visual load estimation means 4 calculates the movement amount of the gaze point based on the generated visual saliency maps. The visual load estimation means 4 then estimates the visual load amount based on the temporal transition of the calculated movement amount of the estimated gaze point. Because the estimate is based on the temporal transition of the gaze point movement amount, positions where the line of sight shifts greatly can be estimated. Portions that impose a visual load can therefore be extracted automatically.

The visual load estimation means 4 also decomposes the movement amount of the estimated gaze point into base components and calculates the visual load amount based on the decomposed base components. This makes it possible to use an algorithm suited to analyzing the gaze point movement amount, such as empirical mode decomposition.

The visual load estimation means 4 also calculates the movement amount by estimating the gaze point to be the position on the image at which the visual saliency is at its maximum in the visual saliency map. This allows the movement amount to be estimated based on the position that is estimated to be most likely to be looked at.

The visual saliency extraction means 3 includes an input unit 310 that converts an image into intermediate data that can be subjected to mapping processing, a nonlinear mapping unit 320 that converts the intermediate data into mapping data, and an output unit 330 that generates saliency estimation information indicating a saliency distribution based on the mapping data. The nonlinear mapping unit 320 includes a feature extraction unit 321 that extracts features from the intermediate data and an upsampling unit 322 that upsamples the data generated by the feature extraction unit 321. This allows visual saliency to be estimated at a low computational cost.

The device also includes the information presentation means 5, which presents the estimation result of the visual load estimation means 4. This makes it possible, for example, to give notice that a point with a large visual load is approaching. Specifically, places where the visual load is likely to increase can be extracted automatically from in-vehicle camera images or the like, and the driver can be informed in advance while driving that such a place is approaching. A reduction in driving errors and accidents caused by driver distraction can therefore be expected.

In addition, by identifying places where the visual load is likely to increase from actual scenery images corresponding to an arbitrary route on a map, for example, more efficient risk analysis of the traffic environment (road structure, sign placement, and the like) can be expected.

The present invention is not limited to the above embodiment. That is, those skilled in the art can implement various modifications in accordance with conventionally known knowledge without departing from the gist of the present invention. Such modifications naturally fall within the scope of the present invention as long as they still include the configuration of the visual load estimation device of the present invention.

1 Visual load estimation device
2 Input means
3 Visual saliency extraction means (generation unit)
4 Visual load estimation means (calculation unit, estimation unit)
5 Information presentation means (presentation unit)

Claims

1. A visual load estimation device comprising:
a generation unit that generates, for each image, visual saliency distribution information obtained by estimating the level of visual saliency based on images captured continuously of the outside from a moving body;
a calculation unit that calculates a movement amount of an estimated gaze point based on the generated visual saliency distribution information; and
an estimation unit that estimates a visual load amount based on a temporal transition of the calculated movement amount of the estimated gaze point.

2. The visual load estimation device according to claim 1, wherein the calculation unit calculates the movement amount by estimating the estimated gaze point to be the position on the image at which the visual saliency is at its maximum in the visual saliency distribution information.

3. The visual load estimation device, wherein the estimation unit decomposes the movement amount of the estimated gaze point into a plurality of base components and estimates the visual load amount based on the decomposed base components.

4. The visual load estimation device according to any one of claims 1 to 3, wherein the generation unit comprises:
an input unit that converts the image into intermediate data that can be subjected to mapping processing;
a nonlinear mapping unit that converts the intermediate data into mapping data; and
an output unit that generates saliency estimation information indicating a saliency distribution based on the mapping data,
and wherein the nonlinear mapping unit comprises a feature extraction unit that extracts features from the intermediate data and an upsampling unit that upsamples the data generated by the feature extraction unit.

5. The visual load estimation device according to any one of claims 1 to 4, further comprising a presentation unit that presents the estimation result of the estimation unit.

6. A visual load estimation method executed by a visual load estimation device that estimates a visual load amount based on images captured of the outside from a moving body, the method comprising:
a generation step of generating, for each image, visual saliency distribution information obtained by estimating the level of visual saliency based on images captured continuously of the outside from the moving body;
a calculation step of calculating a movement amount of an estimated gaze point based on the generated visual saliency distribution information; and
an estimation step of estimating a visual load amount based on a temporal transition of the calculated movement amount of the estimated gaze point.

7. A visual load estimation program that causes a computer to execute the visual load estimation method according to claim 6.

8. A computer-readable storage medium storing the visual load estimation program according to claim 7.