JP2020184146A

JP2020184146A - Saliency estimation device, saliency estimation method and program

Info

Publication number: JP2020184146A
Application number: JP2019087565A
Authority: JP
Inventors: Toshiaki Inoue (井上　俊明)
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2019-05-07
Filing date: 2019-05-07
Publication date: 2020-11-12
Also published as: JP2023153309A

Abstract

To provide a saliency estimation device, a saliency estimation method, and a program for accurately detecting, in an image, a region that a moving person would perceive as highly salient.
SOLUTION: A saliency estimation device 10 comprises: an input unit 110 that acquires moving image data and outputs each of the frame images constituting the acquired moving image data; a correction unit 120 that generates a corrected image by correcting a frame image output by the input unit 110; and a saliency estimation unit 130 that processes the corrected image to generate saliency estimation information indicating a saliency distribution within the corrected image or the frame image. The correction unit 120 generates brightness information for correcting the brightness of at least a part of the frame image, using speed information about the moving speed of a first viewpoint and the relative position of that part from a reference point in the frame image, and uses this brightness information to correct the brightness of that part.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a saliency estimation device, a saliency estimation method, and a program.

Techniques for automatically detecting salient regions in an image have been proposed. Meanwhile, when a person is moving, the person's effective field of view narrows as the moving speed increases. Non-Patent Document 1 describes automatically detecting salient regions while taking the effective field of view into account. Specifically, Non-Patent Document 1 describes lowering the resolution and saturation of the target image according to the distance from the gaze point and then estimating saliency.

Morimoto et al., "Saliency estimation model for moving images considering the response characteristics of area MST," Technical Report of the Institute of Image Information and Television Engineers, Vol. 38, No. 10, pp. 57-60, 2014.

The present inventor has investigated methods for detecting, with high accuracy, regions of an image that a moving person would perceive as highly salient. One example of the problem to be solved by the present invention is to detect, from an image and with high accuracy, regions that a moving person would perceive as highly salient.

The invention according to claim 1 is a saliency estimation device comprising:
a correction unit that generates a corrected image by correcting an image of a scene viewed from a first viewpoint; and
a saliency estimation unit that, by processing the corrected image, generates saliency estimation information indicating a saliency distribution in the corrected image or in the image,
wherein the correction unit
generates brightness information indicating a change in the brightness of at least a part of the image, using speed information regarding a moving speed of the first viewpoint and a relative position of the at least a part from a reference point in the image, and
corrects the brightness of the at least a part using the brightness information.

The invention according to claim 8 is a saliency estimation method in which a computer:
generates a corrected image by correcting an image of a scene viewed from a first viewpoint; and
generates, by processing the corrected image, saliency estimation information indicating a saliency distribution in the corrected image or in the image,
the computer further
generating brightness information indicating a change in the brightness of at least a part of the image, using speed information regarding a moving speed of the first viewpoint and a relative position of the at least a part from a reference point in the image, and
correcting the brightness of the at least a part using the brightness information.

The invention according to claim 9 is a program that causes a computer to have:
a correction function of generating a corrected image by correcting an image of a scene viewed from a first viewpoint; and
an estimation function of generating, by processing the corrected image, saliency estimation information indicating a saliency distribution in the corrected image or in the image,
and further causes the computer to have, as at least part of the correction function:
a function of generating brightness information indicating a change in the brightness of at least a part of the image, using speed information regarding a moving speed of the first viewpoint and a relative position of the at least a part from a reference point in the image; and
a function of correcting the brightness of the at least a part using the brightness information.

FIG. 1 is a diagram showing the functional configuration of the saliency estimation device according to the first embodiment.
FIG. 2 is a diagram for explaining how the viewing angle information is set.
FIG. 3 is a diagram for explaining an example of the viewing angle information.
FIG. 4 is a diagram for explaining the brightness information.
FIG. 5 is a diagram for explaining the resolution information.
FIG. 6 is a diagram for explaining the saturation information.
FIG. 7 is a diagram showing an example of the functional configuration of the correction processing unit.
FIG. 8 is a block diagram illustrating a configuration example of the saliency estimation unit.
FIG. 9(a) is a diagram illustrating an image input to the saliency estimation unit, and FIG. 9(b) is a diagram illustrating an image showing the saliency distribution estimated for the image in (a).
FIG. 10 is a flowchart illustrating the processing method according to the first configuration example.
FIG. 11 is a diagram illustrating the configuration of the nonlinear mapping unit in detail.
FIG. 12 is a diagram illustrating the configuration of an intermediate layer.
FIGS. 13(a) and 13(b) are diagrams each showing an example of the convolution processing performed by a filter.
FIG. 14(a) is a diagram for explaining the processing of the first pooling unit, FIG. 14(b) is a diagram for explaining the processing of the second pooling unit, and FIG. 14(c) is a diagram for explaining the processing of the unpooling unit.
FIG. 15 is a block diagram illustrating the hardware configuration of the saliency estimation device.
FIG. 16 is a diagram showing the functional configuration of the saliency estimation device according to the second embodiment.
FIG. 17 is a diagram showing the functional configuration of the saliency estimation device according to the third embodiment.
FIG. 18 is a diagram for explaining an operation example of the reference point setting unit.
FIG. 19 is a diagram showing the functional configuration of the saliency estimation device according to the fourth embodiment.
FIG. 20 is a diagram illustrating the configuration of the saliency estimation unit according to the fifth embodiment.
FIG. 21 is a flowchart illustrating the learning operation according to the fifth embodiment.
FIG. 22 is a diagram illustrating the configuration and usage environment of the arithmetic unit according to the sixth embodiment.
FIG. 23 is a diagram illustrating the configuration of the saliency estimation unit according to the seventh embodiment.
FIG. 24 is a diagram illustrating an image represented by the synthesis information generated by the synthesis unit.
FIG. 25 is a diagram illustrating the configuration of the saliency estimation unit according to the eighth embodiment.

Embodiments of the present invention are described below with reference to the drawings. In all drawings, similar components are denoted by the same reference numerals, and their description is omitted where appropriate.

(First Embodiment)
FIG. 1 is a diagram showing the functional configuration of a saliency estimation device 10 according to the first embodiment. The saliency estimation device 10 shown in this figure includes an input unit 110, a correction unit 120, and a saliency estimation unit 130. The input unit 110 acquires moving image data and outputs each of the frame images constituting the acquired moving image data to the correction unit 120. These frame images are images of the scene viewed from a first viewpoint. The first viewpoint is, for example, the position of the camera that generated the moving image data. The correction unit 120 generates a corrected image by correcting a frame image output by the input unit 110. The saliency estimation unit 130 generates saliency estimation information by processing the corrected image. The saliency estimation information indicates the saliency distribution in the corrected image or in the frame image. Here, the correction unit 120 generates brightness information indicating a change in the brightness of at least a part of the frame image, using speed information regarding the moving speed of the first viewpoint and the relative position of that part from a reference point in the frame image. The correction unit 120 then corrects the brightness of that part using this brightness information. The saliency estimation device 10 is described in detail below.

As described above, moving image data is input to the input unit 110. The input unit 110 outputs each of the frame images included in the moving image data to the correction unit 120.

The correction unit 120 corrects the frame images input from the input unit 110. Specifically, the correction unit 120 includes a field-of-view setting unit 122, a correction information generation unit 124, and a correction processing unit 126.

The field-of-view setting unit 122 receives speed information and reference point information from outside. The speed information indicates a moving speed. This moving speed is, for example, but not limited to, the speed of the camera at the time the moving image data acquired by the input unit 110 was captured. The reference point information specifies the position in the frame image acquired by the correction unit 120 that is to serve as the gaze point. Using the speed information, the field-of-view setting unit 122 generates viewing angle information indicating the viewing angle of a person moving at that speed. The field-of-view setting unit 122 then outputs the speed information, the reference point information, and the viewing angle information to the correction information generation unit 124.

FIG. 2 is a diagram for explaining how the viewing angle information is set. The viewing angle of a moving person narrows as the moving speed increases. The field-of-view setting unit 122 stores data indicating the relationship between speed and viewing angle, for example as shown in FIG. 2, uses this data to identify the viewing angle corresponding to the input moving speed, and generates viewing angle information indicating the identified viewing angle.
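The speed-to-angle lookup can be sketched as piecewise-linear interpolation over a stored table. This is a minimal sketch under stated assumptions: the table values are hypothetical placeholders (FIG. 2's actual data is not given in the text), and linear interpolation with clamping at the table ends is an assumed choice, not the document's stated method.

```python
from bisect import bisect_right

# Hypothetical (speed [km/h], viewing angle [deg]) pairs standing in for the
# FIG. 2 data; the real values are not stated in the text.
SPEED_TO_ANGLE = [(0.0, 120.0), (40.0, 100.0), (70.0, 65.0), (100.0, 40.0), (130.0, 30.0)]


def viewing_angle(speed_kmh, table=SPEED_TO_ANGLE):
    """Return the viewing angle for a speed by linear interpolation in `table`.

    Speeds outside the table are clamped to the first or last entry.
    """
    speeds = [s for s, _ in table]
    if speed_kmh <= speeds[0]:
        return table[0][1]
    if speed_kmh >= speeds[-1]:
        return table[-1][1]
    i = bisect_right(speeds, speed_kmh)  # speeds[i - 1] <= speed_kmh < speeds[i]
    (s0, a0), (s1, a1) = table[i - 1], table[i]
    t = (speed_kmh - s0) / (s1 - s0)
    return a0 + t * (a1 - a0)


print(viewing_angle(55.0))  # 82.5
```

Because the table is monotonically decreasing, the interpolated angle also never increases with speed, which matches the narrowing field of view described above.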

FIG. 3 is a diagram for explaining an example of the viewing angle information. In the example shown in this figure, the viewing angle information divides the image into a plurality of regions according to distance from the reference point. Specifically, the 0th region, which contains the reference point, is the region in which the image should remain clear. The correction information generation unit 124 and the correction processing unit 126 then correct the image so that it becomes progressively less clear and darker moving outward: first the first region, located outside the 0th region, then the second region, located outside the first region, and so on.

The field-of-view setting unit 122 determines at least the size of the 0th region using the speed information. For example, when the speed indicated by the speed information is low, the 0th region is made larger and the other regions are made narrower. In addition to the size of each region, the field-of-view setting unit 122 may also use the speed information to set the number of regions. In this case, the field-of-view setting unit 122 increases the number of regions as the speed increases.
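One way to turn a viewing angle and a speed into concrete region sizes is sketched below. Every rule here is a hypothetical assumption, not taken from the text: a pinhole camera with a known horizontal field of view (`camera_hfov_deg`) converts the viewing angle into a pixel half-width for the 0th region; the outer regions share the remaining half-width of the image equally (which assumes a roughly centered reference point); and the number of outer regions grows by one per `speed_step_kmh`.

```python
import math


def region_layout(view_angle_deg, speed_kmh, image_width, camera_hfov_deg=90.0,
                  speed_step_kmh=30.0, max_outer_regions=4):
    """Return (r0, band, n_outer) in pixels for a hypothetical region layout.

    r0:      half-width of the 0th region, from a pinhole projection of the
             viewing angle (assumes the camera's horizontal FOV is known).
    band:    width of each outer region; the outer regions share the remaining
             half-width equally, so a larger 0th region leaves narrower ones.
    n_outer: number of regions outside the 0th region; grows with speed.
    """
    # Focal length in pixels for a pinhole camera with the given horizontal FOV.
    f = (image_width / 2) / math.tan(math.radians(camera_hfov_deg) / 2)
    # A ray at half the viewing angle from the optical axis lands f*tan() px away.
    half_angle = math.radians(min(view_angle_deg, camera_hfov_deg)) / 2
    r0 = f * math.tan(half_angle)
    n_outer = min(max_outer_regions, 1 + int(speed_kmh // speed_step_kmh))
    band = max((image_width / 2 - r0) / n_outer, 0.0)
    return r0, band, n_outer


print(region_layout(40.0, 100.0, 640))
```

With these assumptions, a slow, wide-angle case yields a 0th region covering the whole frame, while a fast, narrow-angle case yields a small 0th region surrounded by four outer bands.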

In the example shown in FIG. 3, the outline of each region is rectangular. However, the outline may have another shape, for example a circle or an ellipse.
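Assigning each pixel to a region k can be sketched with a distance function whose iso-contours match the chosen outline. This is an illustrative sketch, not the document's implementation: a scaled Chebyshev distance gives rectangular contours, Euclidean distance gives circular ones, and the parameters `r0`, `band`, and `n_outer` (0th-region half-size, outer band width, number of outer regions) are hypothetical names.

```python
import math


def region_index(x, y, ref, r0, band, n_outer, shape="rect", aspect=1.0):
    """Return k for pixel (x, y): 0 inside the 0th region, up to n_outer outside.

    shape="rect" uses a scaled Chebyshev distance, so iso-distance contours are
    rectangles with width/height ratio `aspect`; shape="circle" uses Euclidean
    distance. Region k (k >= 1) covers r0 + (k-1)*band < d <= r0 + k*band.
    """
    dx, dy = x - ref[0], y - ref[1]
    if shape == "rect":
        d = max(abs(dx) / aspect, abs(dy))
    else:
        d = math.hypot(dx, dy)
    if d <= r0:
        return 0
    if band <= 0:
        return n_outer  # no room for bands: everything outside is outermost
    return min(math.ceil((d - r0) / band), n_outer)


def region_map(width, height, ref, r0, band, n_outer, shape="rect", aspect=1.0):
    """Region index for every pixel, indexed as map[y][x]; ref is (x, y)."""
    return [[region_index(x, y, ref, r0, band, n_outer, shape, aspect)
             for x in range(width)] for y in range(height)]


for row in region_map(7, 7, ref=(3, 3), r0=1, band=1, n_outer=2):
    print(row)
```

The example prints a 7×7 grid with a 3×3 block of 0s centered on (3, 3), a ring of 1s around it, and 2s along the border, i.e. the nested rectangular regions of FIG. 3 in miniature.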

Returning to FIG. 1, the correction information generation unit 124 generates correction information using the reference point information acquired from the field-of-view setting unit 122. The correction information specifies the amount by which the value of each pixel of the frame image is corrected. As described above, the viewing angle information on which the correction is based is generated using the speed information and the distance from the reference point. Consequently, the correction information generation unit 124 effectively generates the correction information using the relative position from the reference point and the speed information.

Specifically, the correction information includes brightness information, resolution information, and saturation information. The brightness information indicates a change in the brightness of at least a part of the image, the resolution information indicates a change in the resolution of at least a part of the image, and the saturation information indicates a change in the saturation of at least a part of the image. The correction information generation unit 124 then outputs the viewing angle information, the reference point information, and the correction information to the correction processing unit 126.

FIG. 4 is a diagram for explaining the brightness information, FIG. 5 is a diagram for explaining the resolution information, and FIG. 6 is a diagram for explaining the saturation information. As shown in these figures, the correction information generated by the correction information generation unit 124 lowers brightness, resolution, and saturation alike as the distance from the reference point increases. Specifically, for each of brightness, resolution, and saturation, a correction amount is set for each value of k of the "k-th region" shown in FIG. 3, and all three decrease as k increases.
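A per-region correction table of the kind described above can be sketched as follows. The falloff shapes and limits are hypothetical assumptions, since FIGS. 4 to 6 give no numbers: brightness and saturation multipliers fall linearly with k, and resolution loss is expressed as a pixel block size growing geometrically from 1 (unchanged) to `max_block`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionCorrection:
    brightness: float  # multiplier on brightness (1.0 = unchanged)
    saturation: float  # multiplier on saturation (1.0 = unchanged)
    block: int         # pixel block size for lowering resolution (1 = unchanged)


def make_correction_info(n_outer, min_brightness=0.4, min_saturation=0.2, max_block=8):
    """Per-region corrections for regions k = 0..n_outer.

    All three quantities degrade monotonically with k: brightness and saturation
    fall linearly, and the block size grows geometrically from 1 to max_block.
    """
    table = []
    for k in range(n_outer + 1):
        t = k / n_outer if n_outer else 0.0  # 0 at region 0, 1 at the outermost region
        table.append(RegionCorrection(
            brightness=1.0 - t * (1.0 - min_brightness),
            saturation=1.0 - t * (1.0 - min_saturation),
            block=max(1, round(max_block ** t)),
        ))
    return table


for k, c in enumerate(make_correction_info(3)):
    print(k, c)
```

Indexing the table by k keeps the correction purely a function of the region, which is how the text describes the correction amounts being set.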

Returning to FIG. 1, the correction processing unit 126 acquires a frame image from the input unit 110 and corrects it using the viewing angle information, the reference point information, and the correction information. Specifically, the correction processing unit 126 locates the reference point in the frame image using the reference point information. The correction processing unit 126 then divides the frame image into the regions shown in FIG. 3 using the reference point and the viewing angle information, and corrects each region according to the correction information.

FIG. 7 is a diagram showing an example of the functional configuration of the correction processing unit 126. The correction processing unit 126 includes a resolution correction unit 202, a saturation correction unit 204, and a brightness correction unit 206. The resolution correction unit 202 corrects the resolution of the frame image region by region using the resolution information included in the correction information. The saturation correction unit 204 corrects the saturation of the frame image region by region using the saturation information included in the correction information. The brightness correction unit 206 corrects the brightness of the frame image region by region using the brightness information included in the correction information. In the example shown in this figure, the resolution correction unit 202, the saturation correction unit 204, and the brightness correction unit 206 are arranged in series in this order, but their order is not limited to that shown in FIG. 7.
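The serial resolution, saturation, and brightness stages can be sketched on a small RGB image held as nested lists. The specific operations are assumptions, not the document's: resolution is lowered by block averaging (pixelation), and saturation and brightness are scaled as S and V in HSV space using the standard-library `colorsys` module. Each `table[k]` entry is a hypothetical `(brightness_factor, saturation_factor, block_size)` tuple.

```python
import colorsys


def correct_frame(img, regions, table):
    """Apply resolution, saturation, and brightness correction per region.

    img:     H x W list of (r, g, b) floats in [0, 1]
    regions: H x W list of region indices k
    table:   table[k] = (brightness_factor, saturation_factor, block_size)
    """
    h, w = len(img), len(img[0])
    # Resolution: replace each pixel by the mean of its aligned block x block tile.
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            b = table[regions[y][x]][2]
            if b > 1:
                y0, x0 = (y // b) * b, (x // b) * b
                tile = [img[yy][xx] for yy in range(y0, min(y0 + b, h))
                        for xx in range(x0, min(x0 + b, w))]
                row.append(tuple(sum(p[i] for p in tile) / len(tile) for i in range(3)))
            else:
                row.append(img[y][x])
        out.append(row)
    # Saturation and brightness: scale S and V in HSV space. The two scalings act
    # on independent channels, so one pass matches applying them in sequence.
    for y in range(h):
        for x in range(w):
            bright, sat, _ = table[regions[y][x]]
            hue, s, v = colorsys.rgb_to_hsv(*out[y][x])
            out[y][x] = colorsys.hsv_to_rgb(hue, s * sat, v * bright)
    return out


# A red pixel in region 0 stays red; the same pixel in region 1 turns mid-gray.
print(correct_frame([[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]], [[0, 1]],
                    [(1.0, 1.0, 1), (0.5, 0.0, 1)]))
```

A real implementation would more likely use an image library with Gaussian blur or resampling for the resolution stage; block averaging is used here only because it needs no dependencies.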

The correction unit 120 then outputs the corrected frame image (corrected image) to the saliency estimation unit 130.

<Configuration Example of Saliency Estimation Unit 130>
FIG. 8 is a block diagram illustrating a configuration example of the saliency estimation unit 130. The saliency estimation unit 130 generates saliency estimation information by inputting the corrected frame image into a model generated by machine learning. Specifically, the saliency estimation unit 130 includes an input unit 310, a nonlinear mapping unit 320, and an output unit 330. The input unit 310 converts the input frame image (hereinafter simply called the "image" in the description of the saliency estimation unit 130) into intermediate data on which mapping can be performed. The nonlinear mapping unit 320 converts the intermediate data into mapping data. The output unit 330 generates saliency estimation information based on the mapping data. The nonlinear mapping unit 320 includes a feature extraction unit 321 that extracts features from the intermediate data and an upsampling unit 322 that upsamples the data generated by the feature extraction unit 321. These are described in detail below.

FIG. 9(a) is a diagram illustrating an image input to the saliency estimation unit 130, and FIG. 9(b) is a diagram illustrating an image showing the saliency distribution estimated for FIG. 9(a). For ease of explanation, these figures show the frame image before correction by the correction unit 120. The saliency estimation unit 130 according to this configuration example estimates the saliency of each part of an image. Saliency means, for example, how conspicuous a part is or how readily it attracts gaze. Specifically, saliency is expressed as a probability or the like, where a higher probability corresponds, for example, to a higher likelihood that the gaze of a person viewing the image is directed to that position.

Positions in FIG. 9(a) and FIG. 9(b) correspond to each other, and positions with higher saliency in FIG. 9(a) are displayed with higher luminance in FIG. 9(b). An image showing a saliency distribution, such as FIG. 9(b), is one example of the saliency estimation information output by the output unit 330. In the example of this figure, saliency is visualized as luminance values with 256 gradations. Examples of the saliency estimation information output by the output unit 330 are described in detail later.
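The 256-gradation visualization can be sketched as a linear rescale of a non-negative saliency map so that its peak maps to 255. Scaling by the peak, rather than by a fixed maximum, is an assumption. The companion `to_probability` helper shows the alternative reading of saliency as a probability distribution over positions that sums to 1.

```python
def to_gray_levels(saliency):
    """Map a non-negative saliency map to 8-bit levels 0..255 (peak -> 255)."""
    peak = max(max(row) for row in saliency)
    if peak <= 0:
        return [[0 for _ in row] for row in saliency]
    return [[round(255 * v / peak) for v in row] for row in saliency]


def to_probability(saliency):
    """Normalize a non-negative saliency map so all entries sum to 1."""
    total = sum(sum(row) for row in saliency)
    return [[v / total for v in row] for row in saliency]


print(to_gray_levels([[0.0, 0.5], [0.25, 1.0]]))  # [[0, 128], [64, 255]]
```

Both views preserve the ordering of positions, so the brightest pixel in the gray-level image is also the most probable gaze target.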

The estimated saliency distribution can be used in various fields, for example predicting the gaze of traffic participants such as drivers and pedestrians, preventing traffic participants from overlooking things, evaluating how content such as advertising media looks, guiding gaze, capturing the know-how of athletes and skilled experts as data, and understanding visual perception in living organisms. Furthermore, the saliency estimation unit 130 and the processing method according to this configuration example can be applied to mobility fields such as autonomous driving, advanced driver assistance systems (ADAS), and road traffic systems; entertainment fields such as virtual reality (VR), augmented reality (AR), and games; content fields such as documents, video content, and signage; and medical fields such as diagnostic imaging, surgical support, and nursing care services.

FIG. 10 is a flowchart illustrating the processing method according to the first configuration example. The processing method according to this configuration example is executed by a computer and includes an input step S110, a nonlinear mapping step S120, and an output step S130. In the input step S110, the image is converted into intermediate data on which mapping can be performed. In the nonlinear mapping step S120, the intermediate data is converted into mapping data. In the output step S130, saliency estimation information indicating the saliency distribution is generated based on the mapping data. Here, the nonlinear mapping step S120 includes a feature extraction step S121 that extracts features from the intermediate data and an upsampling step S122 that upsamples the data generated in the feature extraction step S121. The processing method according to this configuration example is implemented by the saliency estimation unit 130 according to this configuration example.

Returning to FIG. 8, each component of the saliency estimation unit 130 is described. In the input step S110, the input unit 310 acquires the image from the correction unit 120 and converts it into intermediate data. The intermediate data is not particularly limited as long as the nonlinear mapping unit 320 can accept it; it is, for example, a high-dimensional tensor. The intermediate data is, for example, data obtained by normalizing the luminance of the acquired image, or data obtained by converting each pixel of the acquired image into a luminance gradient. In the input step S110, the input unit 310 may additionally perform noise removal, resolution conversion, and the like on the image.
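The two example forms of intermediate data can be sketched as follows. These are assumptions, since the text does not define them: luminance uses the ITU-R BT.601 luma weights, "normalized luminance" is taken as a per-image zero-mean, unit-variance (z-score) map, and the "luminance gradient" is the magnitude of forward differences.

```python
import math


def luminance(img):
    """ITU-R BT.601 luma from an H x W list of (r, g, b) floats."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in img]


def normalize(lum):
    """Per-image z-score; a constant image maps to all zeros."""
    vals = [v for row in lum for v in row]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
    return [[(v - mean) / std for v in row] for row in lum]


def gradient_magnitude(lum):
    """Forward-difference gradient magnitude; the last row/column use zero difference."""
    h, w = len(lum), len(lum[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = lum[y][min(x + 1, w - 1)] - lum[y][x]
            gy = lum[min(y + 1, h - 1)][x] - lum[y][x]
            out[y][x] = math.hypot(gx, gy)
    return out


print(normalize(luminance([[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]])))
```

Either map, stacked over channels and batched, would form the kind of high-dimensional tensor the nonlinear mapping unit 320 consumes.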

In the nonlinear mapping step S120, the nonlinear mapping unit 320 acquires the intermediate data from the input unit 310 and converts it into mapping data. The mapping data is, for example, a high-dimensional tensor. The mapping process applied to the intermediate data by the nonlinear mapping unit 320 is, for example, one that can be controlled by parameters or the like, and is preferably processing by a function, a functional, or a neural network.

FIG. 11 is a diagram illustrating the configuration of the nonlinear mapping unit 320 in detail, and FIG. 12 is a diagram illustrating the configuration of an intermediate layer 323. As described above, the nonlinear mapping unit 320 includes the feature extraction unit 321 and the upsampling unit 322. The feature extraction step S121 is performed by the feature extraction unit 321, and the upsampling step S122 is performed by the upsampling unit 322. In the example of this figure, at least one of the feature extraction unit 321 and the upsampling unit 322 includes a neural network having a plurality of intermediate layers 323, which are connected to one another.

The neural network is particularly preferably a convolutional neural network. Specifically, each of the intermediate layers 323 includes one or more convolutional layers 324. In each convolutional layer 324, the input data is convolved with a plurality of filters 325, and an activation process is applied to the outputs of the filters 325.
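A single convolutional layer of this kind can be sketched without dependencies: each filter spans all input channels, the layer uses zero padding so the output keeps the input size, and ReLU is assumed as the activation (the text does not name one). As in most CNN libraries, the "convolution" is computed as cross-correlation.

```python
def conv_layer(x, filters, biases):
    """One conv layer: N filters over a C-channel input, then ReLU.

    x:       C x H x W nested lists
    filters: N x C x K x K nested lists (K odd)
    biases:  N floats
    Returns N x H x W ("same" size, zero padding).
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0][0])
    pad = k // 2
    out = []
    for f, b in zip(filters, biases):
        fmap = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for xx in range(w):
                acc = b
                for c in range(c_in):
                    for i in range(k):
                        for j in range(k):
                            yy, xj = y + i - pad, xx + j - pad
                            if 0 <= yy < h and 0 <= xj < w:  # zero padding outside
                                acc += f[c][i][j] * x[c][yy][xj]
                fmap[y][xx] = max(acc, 0.0)  # ReLU activation
        out.append(fmap)
    return out


identity = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]
print(conv_layer([[[1.0, 2.0], [3.0, 4.0]]], [identity], [0.0]))
```

An identity kernel reproduces its input, which makes a convenient sanity check; a negated kernel would be zeroed entirely by the ReLU.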

In the example of FIG. 11, the feature extraction unit 321 includes a neural network having a plurality of intermediate layers 323, with a first pooling unit 326 between intermediate layers 323. The upsampling unit 322 likewise includes a neural network having a plurality of intermediate layers 323, with an unpooling unit 328 between intermediate layers 323. The feature extraction unit 321 and the upsampling unit 322 are connected to each other via a second pooling unit 327 that performs overlap pooling.
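The three operations can be sketched on a single 2-D feature map. The window and stride values are assumptions, since the text only says that the second pooling unit 327 overlaps. Max pooling with window `k` equal to stride `s` is ordinary pooling; `k > s` gives overlap pooling, where neighboring windows share pixels. For unpooling, the sketch uses index-based unpooling, one common choice: each pooled value is written back to the position of its maximum and every other position is zero.

```python
def max_pool(fmap, k, s):
    """Max pooling with window k and stride s over a 2-D map.

    k == s: ordinary (non-overlapping) pooling; k > s: overlap pooling.
    Returns the pooled map and the (y, x) position of each window's maximum.
    """
    h, w = len(fmap), len(fmap[0])
    pooled, where = [], []
    for y0 in range(0, h - k + 1, s):
        prow, wrow = [], []
        for x0 in range(0, w - k + 1, s):
            # (value, position) of the largest entry in the k x k window
            val, pos = max(((fmap[y][x], (y, x))
                            for y in range(y0, y0 + k)
                            for x in range(x0, x0 + k)),
                           key=lambda t: t[0])
            prow.append(val)
            wrow.append(pos)
        pooled.append(prow)
        where.append(wrow)
    return pooled, where


def unpool(pooled, where, h, w):
    """Index-based unpooling: put each value back at its max position, zeros elsewhere."""
    out = [[0.0] * w for _ in range(h)]
    for prow, wrow in zip(pooled, where):
        for val, (y, x) in zip(prow, wrow):
            out[y][x] = val  # overlapping windows that share a max write the same value
    return out


fm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(max_pool(fm, 2, 2)[0])  # [[6, 8], [14, 16]]
print(max_pool(fm, 3, 1)[0])  # [[11, 12], [15, 16]]
```

The second call shows overlap pooling: the 3×3 windows advance by one pixel, so adjacent outputs are drawn from shared inputs.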

In the example of this figure, each intermediate layer 323 consists of two or more convolutional layers 324. However, at least some of the intermediate layers 323 may consist of only one convolutional layer 324. Adjacent intermediate layers 323 are separated by one of the first pooling unit 326, the second pooling unit 327, and the unpooling unit 328. When an intermediate layer 323 includes two or more convolutional layers 324, those convolutional layers 324 preferably have the same number of filters 325.

In this figure, an intermediate layer 323 labeled "A×B" consists of B convolutional layers 324, each of which includes A convolution filters for each channel. Such an intermediate layer 323 is hereinafter also called an "A×B intermediate layer". For example, a 64×2 intermediate layer 323 consists of two convolutional layers 324, each of which includes 64 convolution filters for each channel.
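The A×B notation expands mechanically into a per-layer list of filter counts. The helper names below are hypothetical and exist only for illustration; both the "×" sign and a plain "x" are accepted.

```python
def parse_block(spec):
    """'64×2' or '64x2' -> [64, 64]: B convolutional layers with A filters each."""
    a, b = spec.replace("×", "x").split("x")
    return [int(a)] * int(b)


def expand(specs):
    """Flatten a sequence of A×B specs into per-layer filter counts."""
    return [n for spec in specs for n in parse_block(spec)]


print(expand(["64×2", "128×2"]))  # [64, 64, 128, 128]
```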

In the example of this figure, the feature extraction unit 321 includes, in this order, a 64×2 intermediate layer 323, a 128×2 intermediate layer 323, a 256×3 intermediate layer 323, and a 512×3 intermediate layer 323. The upsampling unit 322 includes, in this order, a 512×3 intermediate layer 323, a 256×3 intermediate layer 323, a 128×2 intermediate layer 323, and a 64×2 intermediate layer 323. The second pooling unit 327 connects the two 512×3 intermediate layers 323 to each other. The number of intermediate layers 323 in the nonlinear mapping unit 320 is not particularly limited and can be chosen, for example, according to the number of pixels in the image data.
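The layer sequence above can be traced as a small shape calculation. This is a sketch under stated assumptions, not the patent's implementation: it assumes 'same' padding in every convolution, 2×2/stride-2 first pooling, size-preserving overlap pooling, and 2× unpooling, and treats the output channel count of an "A×B" layer as A. `PLAN` and `trace_shapes` are hypothetical names.

```python
# Hypothetical encoder-decoder plan in the "A x B" notation of FIG. 11.
# ("conv", A, B): B convolutional layers with A filters each.
# "pool": first pooling unit 326 (assumed 2x2, stride 2, halves H and W).
# "overlap_pool": second pooling unit 327 (assumed size-preserving).
# "unpool": unpooling unit 328 (assumed 2x duplication, doubles H and W).
PLAN = [
    ("conv", 64, 2), "pool",
    ("conv", 128, 2), "pool",
    ("conv", 256, 3), "pool",
    ("conv", 512, 3),            # end of feature extraction unit 321
    "overlap_pool",              # second pooling unit 327
    ("conv", 512, 3), "unpool",  # start of upsampling unit 322
    ("conv", 256, 3), "unpool",
    ("conv", 128, 2), "unpool",
    ("conv", 64, 2),
]


def trace_shapes(height, width, in_channels=3, plan=PLAN):
    """Return (stage, height, width, channels) after every step of the plan."""
    h, w, c = height, width, in_channels
    trace = []
    for step in plan:
        if step == "pool":
            h, w = h // 2, w // 2
            name = "pool"
        elif step == "unpool":
            h, w = h * 2, w * 2
            name = "unpool"
        elif step == "overlap_pool":
            name = "overlap_pool"  # stride 1 with edge handling: size unchanged
        else:
            _, filters, layers = step
            c = filters  # 'same' padding keeps H and W; channel count becomes A
            name = f"{filters}x{layers}"
        trace.append((name, h, w, c))
    return trace


for row in trace_shapes(224, 320):
    print(row)
```

With three poolings and three unpoolings, an input whose sides are multiples of 8 comes back out at its original resolution, which is what lets the mapping data be read as a per-pixel saliency map.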

This figure shows one example configuration of the nonlinear mapping unit 320, which may have other configurations. For example, a 64×1 intermediate layer 323 may be used instead of the 64×2 intermediate layer 323; reducing the number of convolutional layers 324 in an intermediate layer 323 may further lower the computational cost. Likewise, a 32×2 intermediate layer 323 may be used instead of the 64×2 intermediate layer 323; reducing the number of channels in an intermediate layer 323 may also lower the computational cost. Both the number of convolutional layers 324 and the number of channels in an intermediate layer 323 may be reduced.

Among the intermediate layers 323 in the feature extraction unit 321, the number of filters 325 preferably increases each time a first pooling unit 326 is passed. Specifically, suppose a first intermediate layer 323a and a second intermediate layer 323b are consecutive via a first pooling unit 326, with the second intermediate layer 323b located after the first intermediate layer 323a. The first intermediate layer 323a consists of convolutional layers 324 with N1 filters 325 per channel, and the second intermediate layer 323b consists of convolutional layers 324 with N2 filters 325 per channel. In this case, N2 > N1 preferably holds, and N2 = N1 × 2 more preferably holds.

Among the intermediate layers 323 in the upsampling unit 322, the number of filters 325 preferably decreases each time an unpooling unit 328 is passed. Specifically, suppose a third intermediate layer 323c and a fourth intermediate layer 323d are consecutive via an unpooling unit 328, with the fourth intermediate layer 323d located after the third intermediate layer 323c. The third intermediate layer 323c consists of convolutional layers 324 with N3 filters 325 per channel, and the fourth intermediate layer 323d consists of convolutional layers 324 with N4 filters 325 per channel. In this case, N4 < N3 preferably holds, and N3 = N4 × 2 more preferably holds.
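The two filter-count preferences above can be checked mechanically against a configuration. A minimal sketch, assuming a configuration is given as the list of filter counts of consecutive intermediate layers (the lists and the function name are illustrative, not from the patent):

```python
# Filter counts of consecutive intermediate layers in the FIG. 11 example.
ENCODER = [64, 128, 256, 512]  # separated by first pooling units 326
DECODER = [512, 256, 128, 64]  # separated by unpooling units 328


def check_filter_progression(encoder, decoder):
    """Report the preferred conditions: N2 > N1 (ideally N2 = 2 * N1) across each
    first pooling, and N4 < N3 (ideally N3 = 2 * N4) across each unpooling."""
    enc_pairs = list(zip(encoder, encoder[1:]))
    dec_pairs = list(zip(decoder, decoder[1:]))
    return {
        "encoder_increasing": all(n2 > n1 for n1, n2 in enc_pairs),
        "encoder_doubling": all(n2 == 2 * n1 for n1, n2 in enc_pairs),
        "decoder_decreasing": all(n4 < n3 for n3, n4 in dec_pairs),
        "decoder_halving": all(n3 == 2 * n4 for n3, n4 in dec_pairs),
    }


print(check_filter_progression(ENCODER, DECODER))
```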

The feature extraction unit 321 extracts image features at multiple levels of abstraction, such as gradients and shapes, from the intermediate data acquired from the input unit 310, as the channels of the intermediate layers 323. FIG. 12 illustrates the configuration of a 64×2 intermediate layer 323; the processing in an intermediate layer 323 is described with reference to this figure. In this example, the intermediate layer 323 consists of a first convolutional layer 324a and a second convolutional layer 324b, each including 64 filters 325. In the first convolutional layer 324a, each channel of the data input to the intermediate layer 323 is convolved using the filters 325. For example, when the image input to the input unit 310 is an RGB image, each of the three channels h0_i (i = 1, ..., 3) is processed. In this example, the filters 325 are 64 kinds of 3×3 filters per channel, that is, 64 × 3 kinds in total. The convolution yields 64 results h0_{i,j} (i = 1, ..., 3; j = 1, ..., 64) for each channel i.

Next, the activation unit 329 applies activation processing to the outputs of the filters 325. Specifically, for each filter index j, the corresponding results of all channels are summed element by element, and the activation is applied to that sum. This yields 64 channels of results h1_i (i = 1, ..., 64), that is, the output of the first convolutional layer 324a, as image features. The activation is not particularly limited, but processing using at least one of a hyperbolic function (such as tanh), a sigmoid function, and a rectified linear function is preferable.
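The per-channel convolution, cross-channel sum, and activation described above can be written out directly. This is a plain-Python sketch for illustration only: it assumes zero padding ('same' output size), no bias term (none is mentioned in the text), and the cross-correlation form used by most CNN libraries; `conv_layer` and its argument layout are hypothetical.

```python
import math


def conv2d_same(channel, kernel):
    """Zero-padded 'same' 2-D convolution (cross-correlation form)."""
    h, w = len(channel), len(channel[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += channel[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = acc
    return out


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    # numerically stable logistic function
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


ACTIVATIONS = {"tanh": math.tanh, "sigmoid": sigmoid, "relu": relu}


def conv_layer(inputs, filters, activation="relu"):
    """One convolutional layer such as 324a.

    inputs:  list of C input channels, each an H x W list of lists
    filters: filters[j][i] is the kernel applied to input channel i for output j
    Output channel j = act(sum over i of conv(inputs[i], filters[j][i])).
    """
    act = ACTIVATIONS[activation]
    h, w = len(inputs[0]), len(inputs[0][0])
    outputs = []
    for kernels_j in filters:
        partial = [conv2d_same(ch, k) for ch, k in zip(inputs, kernels_j)]
        outputs.append([[act(sum(p[y][x] for p in partial)) for x in range(w)]
                        for y in range(h)])
    return outputs
```

Calling `conv_layer` twice in a row, with 64 kernel lists each time, corresponds to the 64×2 intermediate layer; the second call receives the 64 output channels of the first.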

The output of the first convolutional layer 324a is then used as the input to the second convolutional layer 324b, which performs the same processing, yielding 64 channels of results h2_i (i = 1, ..., 64), that is, the output of the second convolutional layer 324b, as image features. The output of the second convolutional layer 324b is the output data of this 64×2 intermediate layer 323.

The structure of the filters 325 is not particularly limited, but 3×3 two-dimensional filters are preferable. The coefficients of each filter 325 can be set independently. In this configuration example, the coefficients of each filter 325 are held in the storage unit 390, from which the nonlinear mapping unit 320 can read them for processing. The coefficients of the filters 325 may be determined based on correction information generated or corrected by machine learning. For example, the correction information contains the coefficients of the filters 325 as a plurality of correction parameters. The nonlinear mapping unit 320 can additionally use this correction information to convert the intermediate data into mapping data. The storage unit 390 may be provided in the saliency estimation unit 130 or outside it. The nonlinear mapping unit 320 may also acquire the correction information from outside via a communication network.

FIGS. 13(a) and 13(b) each show an example of the convolution performed by a filter 325; both show a 3×3 convolution. In FIG. 13(a), the convolution uses the nearest-neighbor elements. In FIG. 13(b), the convolution uses neighboring elements at a distance of two or more. Convolution using neighboring elements at a distance of three or more is also possible. The filters 325 preferably perform convolution using neighboring elements at a distance of two or more, because this extracts features over a wider area and further improves saliency estimation accuracy.
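Convolving with neighbors at a distance d reads like what is commonly called a dilated (atrous) convolution; that interpretation is an assumption, not something the text states. A minimal sketch: the 3×3 kernel taps are spaced `dilation` elements apart, so with `dilation=2` a 3×3 filter covers a 5×5 area at the same cost.

```python
def conv2d_dilated(channel, kernel, dilation=1):
    """Zero-padded 'same' convolution whose kernel taps are `dilation` elements apart.
    dilation=1 uses nearest neighbors (FIG. 13(a)); dilation=2 skips one element
    between taps (one reading of FIG. 13(b))."""
    h, w = len(channel), len(channel[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(-k, k + 1):
                for kx in range(-k, k + 1):
                    yy, xx = y + ky * dilation, x + kx * dilation
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += channel[yy][xx] * kernel[ky + k][kx + k]
            out[y][x] = acc
    return out


def receptive_field(kernel_size, dilation):
    """Side length of the area one dilated kernel covers."""
    return dilation * (kernel_size - 1) + 1


# A single impulse at the center, seen through the top-left tap of the kernel.
img = [[0.0] * 7 for _ in range(7)]
img[3][3] = 1.0
top_left = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
print(conv2d_dilated(img, top_left, dilation=2)[5][5])
```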

This completes the description of the operation of the 64×2 intermediate layer 323. The other intermediate layers 323 (the 128×2, 256×3, and 512×3 intermediate layers 323, etc.) operate in the same way, except for the number of convolutional layers 324 and the number of channels. The intermediate layers 323 in the feature extraction unit 321 and those in the upsampling unit 322 also operate in the same way.

FIG. 14(a) illustrates the processing of the first pooling unit 326, FIG. 14(b) the processing of the second pooling unit 327, and FIG. 14(c) the processing of the unpooling unit 328.

In the feature extraction unit 321, the data output from an intermediate layer 323 is pooled channel by channel in a first pooling unit 326 and then input to the next intermediate layer 323. The first pooling unit 326 performs, for example, non-overlapping pooling. FIG. 14(a) shows processing that maps each group of 2×2 = four elements 30 in a channel to one element 30. The first pooling unit 326 applies this mapping to all elements 30, choosing the 2×2 groups so that they do not overlap. In this example, the number of elements in each channel is reduced to one quarter. As long as the first pooling unit 326 reduces the number of elements, the numbers of elements 30 before and after the mapping are not particularly limited.
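Non-overlapping 2×2 pooling as in FIG. 14(a) can be sketched as follows, with both the max and average variants named later in the text. The choice to drop an odd trailing row or column is an assumption; the patent does not say how odd sizes are handled.

```python
def _reducer(mode):
    """Max pooling or average pooling."""
    if mode == "max":
        return max
    if mode == "average":
        return lambda vals: sum(vals) / len(vals)
    raise ValueError(f"unknown pooling mode: {mode!r}")


def pool_non_overlap(channel, mode="max"):
    """2x2 pooling with stride 2: each disjoint 2x2 block becomes one element,
    so the element count drops to one quarter. An odd trailing row or column
    is dropped (an assumption)."""
    reduce = _reducer(mode)
    h, w = len(channel) // 2 * 2, len(channel[0]) // 2 * 2
    return [[reduce([channel[y][x], channel[y][x + 1],
                     channel[y + 1][x], channel[y + 1][x + 1]])
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


M = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool_non_overlap(M))             # max pooling
print(pool_non_overlap(M, "average"))  # average pooling
```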

The data output from the feature extraction unit 321 is input to the upsampling unit 322 via the second pooling unit 327, which applies overlap pooling to it. FIG. 14(b) shows processing that maps four 2×2 elements 30 to one element 30 while letting some elements 30 overlap; that is, in successive mappings, some of the four elements 30 used in one mapping are also among the four used in the next. The second pooling unit 327 shown in this figure does not reduce the number of elements. The numbers of elements 30 before and after the mapping in the second pooling unit 327 are not particularly limited.
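Overlap pooling as in FIG. 14(b) can be sketched as 2×2 pooling with stride 1, so neighboring windows share a row or column. To keep the element count unchanged, this sketch reuses the last row and column at the bottom and right edges; that edge rule is an assumption, since the text only says the element count is not reduced.

```python
def _reducer(mode):
    """Max pooling or average pooling."""
    if mode == "max":
        return max
    if mode == "average":
        return lambda vals: sum(vals) / len(vals)
    raise ValueError(f"unknown pooling mode: {mode!r}")


def pool_overlap(channel, mode="max"):
    """2x2 pooling with stride 1. Windows that run past the bottom/right edge
    reuse the edge row/column, so the output has the same size as the input."""
    reduce = _reducer(mode)
    h, w = len(channel), len(channel[0])
    out = []
    for y in range(h):
        y2 = min(y + 1, h - 1)
        out.append([reduce([channel[y][x], channel[y][min(x + 1, w - 1)],
                            channel[y2][x], channel[y2][min(x + 1, w - 1)]])
                    for x in range(w)])
    return out


M = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool_overlap(M))
```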

The processing method used by the first pooling unit 326 and the second pooling unit 327 is not particularly limited; examples include taking the maximum of the four elements 30 as the single output element 30 (max pooling) and taking their average as the output element 30 (average pooling).

The data output from the second pooling unit 327 is input to an intermediate layer 323 in the upsampling unit 322. The output of each intermediate layer 323 in the upsampling unit 322 is unpooled channel by channel in an unpooling unit 328 and then input to the next intermediate layer 323. FIG. 14(c) shows processing that expands one element 30 into a plurality of elements 30. The expansion method is not particularly limited; one example is duplicating one element 30 into 2×2 = four elements 30.
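The duplication example of FIG. 14(c) is the simplest form of unpooling (nearest-neighbor upsampling). A minimal sketch, assuming each channel is a list of rows:

```python
def unpool2x2(channel):
    """Expand every element into a 2x2 block of copies (nearest-neighbor upsampling)."""
    out = []
    for row in channel:
        expanded = [v for v in row for _ in range(2)]  # duplicate horizontally
        out.append(expanded)
        out.append(list(expanded))                      # duplicate vertically
    return out


print(unpool2x2([[1, 2], [3, 4]]))
```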

The output of the last intermediate layer 323 in the upsampling unit 322 is output from the nonlinear mapping unit 320 as mapping data and input to the output unit 330. In output step S130, the output unit 330 generates and outputs saliency estimation information by, for example, normalizing the data acquired from the nonlinear mapping unit 320 or converting its resolution. The saliency estimation information is, for example, an image (image data) that visualizes saliency as luminance values, as illustrated in FIG. 9(b). Alternatively, it may be an image color-coded by saliency, such as a heat map, or an image in which salient regions whose saliency exceeds a predetermined reference are marked so as to be distinguishable from other positions. The saliency estimation information is not limited to images; it may, for example, be a table listing information that indicates salient regions.
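One way the output unit might normalize the mapping data into a luminance image and mark salient regions is sketched below. The min-max normalization to 0..255 and the threshold of 128 are assumptions chosen for illustration; the patent only says "normalization" and "a predetermined reference".

```python
def to_saliency_image(mapping):
    """Min-max normalize mapping data to 8-bit luminance values (0..255)."""
    flat = [v for row in mapping for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[round(255 * (v - lo) / span) for v in row] for row in mapping]


def salient_mask(image, threshold=128):
    """Mark pixels whose luminance meets a (hypothetical) reference threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in image]


def salient_table(image, threshold=128):
    """Table form: (row, column, value) for every salient pixel."""
    return [(y, x, v) for y, row in enumerate(image)
            for x, v in enumerate(row) if v >= threshold]


img = to_saliency_image([[0, 1], [3, 4]])
print(img, salient_mask(img), salient_table(img))
```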

The saliency estimation information output from the output unit 330 may be subjected to various computer vision processes, such as image segmentation, object recognition, and image classification, either inside or outside the saliency estimation unit 130.

<Hardware configuration example>
FIG. 15 is a block diagram illustrating the hardware configuration of the saliency estimation device 10 shown in FIG. 1. The saliency estimation device 10 includes a bus 1010, a processor 1020, a memory 1030, a storage device 1040, an input/output interface 1050, and a network interface 1060.

The bus 1010 is a data transmission path over which the processor 1020, the memory 1030, the storage device 1040, the input/output interface 1050, and the network interface 1060 exchange data with one another. However, the method of connecting the processor 1020 and the other components is not limited to a bus connection.

The processor 1020 is implemented by a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like.

The memory 1030 is a main storage device implemented by RAM (Random Access Memory) or the like.

The storage device 1040 is an auxiliary storage device implemented by an HDD (Hard Disk Drive), an SSD (Solid State Drive), a memory card, a ROM (Read Only Memory), or the like. The storage device 1040 stores program modules that implement the functions of the saliency estimation device 10. The processor 1020 loads each program module into the memory 1030 and executes it, thereby implementing the corresponding function.

The input/output interface 1050 is an interface for connecting the saliency estimation device 10 to various input/output devices.

The network interface 1060 is an interface for connecting the saliency estimation device 10 to a network, for example a LAN (Local Area Network) or a WAN (Wide Area Network). The network interface 1060 may connect to the network wirelessly or by wire.

As described above, according to this embodiment, the brightness of each frame image is modified using the speed information before saliency estimation. The saliency estimation unit 130 can therefore accurately detect, in each frame image of a moving image, the regions that a moving person would perceive as highly salient.

In addition, the saliency estimation unit 130 includes the feature extraction unit 321, which extracts features from the intermediate data, and the upsampling unit 322, which upsamples the data generated by the feature extraction unit 321. Saliency can therefore be estimated at low computational cost.

The saliency estimation device 10 can also process still images, with the same effects as described above.

(Second Embodiment)
FIG. 16 shows the functional configuration of the saliency estimation device 10 according to the second embodiment. The saliency estimation device 10 of this embodiment has the same configuration as that of the first embodiment, except that it includes a speed estimation unit 140.

The speed estimation unit 140 estimates the movement speed by processing the moving image data input to the input unit 110. An existing algorithm, for example one based on optical flow estimation, can be used to estimate the movement speed. The speed estimation unit 140 outputs the estimated speed to the visual field setting unit 122 as speed information.
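The text leaves the speed estimation algorithm open. One simple possibility, not taken from the patent, assumes a pinhole camera moving straight ahead over flat ground with known focal length `focal_px` (pixels), camera height `cam_height_m` (meters), and horizon row `horizon_row`. A ground point at depth Z then appears at row offset y - horizon_row = f·h/Z, so the vertical optical flow of tracked ground points fixes the distance traveled per frame exactly.

```python
from statistics import median


def ground_displacement(y0, y1, horizon_row, focal_px, cam_height_m):
    """Forward distance (m) traveled between two frames, from one ground point.

    Flat-ground pinhole model: a ground point at depth Z appears at
    y - horizon_row = focal_px * cam_height_m / Z (image rows increase downward).
    Eliminating Z from both frames gives Z0 - Z1 exactly.
    """
    a, b = y0 - horizon_row, y1 - horizon_row
    if a <= 0 or b <= 0:
        raise ValueError("ground point must lie below the horizon in both frames")
    return (y1 - y0) * focal_px * cam_height_m / (a * b)


def estimate_speed(tracks, horizon_row, focal_px, cam_height_m, fps):
    """tracks: (y0, y1) image rows of ground points matched between consecutive
    frames, e.g. taken from an optical-flow field. Returns speed in m/s."""
    steps = [ground_displacement(y0, y1, horizon_row, focal_px, cam_height_m)
             for y0, y1 in tracks]
    return median(steps) * fps  # the median tolerates a few bad flow vectors
```

Real footage would also need to handle rotation, slopes, and non-ground points; the sketch only shows how flow magnitude relates to speed under the stated assumptions.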

This embodiment also provides the same effects as the first embodiment. In addition, because the speed estimation unit 140 generates the speed information, the speed information does not need to be input from outside.

(Third Embodiment)
FIG. 17 shows the functional configuration of the saliency estimation device 10 according to the third embodiment. The saliency estimation device 10 of this embodiment has the same configuration as that of the second embodiment, except that it includes a reference point setting unit 150.

The reference point setting unit 150 generates reference point information by processing at least one frame image of the moving image data input to the input unit 110.

For example, when a frame image contains a predetermined object that readily attracts attention, such as a specific traffic sign, the reference point setting unit 150 sets the position of that object as the reference point. The object is detected, for example, by feature matching, using features stored in advance in the saliency estimation device 10.

When a frame image contains a road and the straight portion of that road is at least a reference length, the reference point setting unit 150 sets the vanishing point as the reference point. The vanishing point can be detected with an existing algorithm, for example one using the Hough transform.
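After a Hough transform has found straight road edges or lane markings, a vanishing point can be taken as the point closest, in the least-squares sense, to all the detected lines. A sketch, assuming each line is given in the normal form x·cos θ + y·sin θ = ρ (the (rho, theta) convention that OpenCV's `HoughLines` returns); the Hough step itself is not shown.

```python
import math


def vanishing_point(lines):
    """lines: (rho, theta) pairs in normal form x*cos(theta) + y*sin(theta) = rho.
    Returns the least-squares intersection (x, y) by solving the 2x2 normal equations."""
    sxx = sxy = syy = bx = by = 0.0
    for rho, theta in lines:
        c, s = math.cos(theta), math.sin(theta)
        sxx += c * c
        sxy += c * s
        syy += s * s
        bx += c * rho
        by += s * rho
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel; no unique vanishing point")
    x = (syy * bx - sxy * by) / det
    y = (sxx * by - sxy * bx) / det
    return x, y


# Three synthetic lines that all pass through (320, 200).
demo = [(320 * math.cos(t) + 200 * math.sin(t), t) for t in (0.6, 2.3, 1.2)]
print(vanishing_point(demo))
```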

When the scene shown in a frame image satisfies a specific condition, the reference point setting unit 150 may also set the reference point by processing that depends on that condition. For example, as shown in FIGS. 18(a) and 18(b), when a frame image contains a road that curves, a part of the road located on the curving side of the road's center (for example, the center line) is set as the reference point.

This embodiment also provides the same effects as the second embodiment. In addition, because the device has the reference point setting unit 150, the reference point information does not need to be input from outside.

(Fourth Embodiment)
FIG. 19 shows the functional configuration of the saliency estimation device 10 according to the fourth embodiment. The saliency estimation device 10 of this embodiment has the same configuration as that of the first embodiment, except that it includes a trimming unit 160.

The trimming unit 160 acquires the generation conditions of the moving image data acquired by the input unit 110 and trims the frame images according to those conditions. The modification unit 120 then processes the frame images trimmed by the trimming unit 160. The generation condition input to the trimming unit 160 is, for example, the type of lens (such as a wide-angle or fisheye lens) of the camera that generated the moving image data. Depending on the generation conditions, a frame image may cover a wider range of scenery than the field of view of a stationary person. The trimming unit 160 trims the frame image so that the range of scenery it shows matches the field of view of a stationary person. The range to be trimmed from the frame image is, for example, stored in advance in the trimming unit 160 for each generation condition.
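A per-condition lookup of trimming ranges could be as simple as a table of center-crop ratios keyed by lens type. The ratios and keys below are hypothetical placeholders, not values from the patent, and a plain center crop ignores fisheye distortion, which a real system might correct first.

```python
# Hypothetical fraction of width and height kept for each generation condition.
CROP_RATIOS = {"standard": 1.0, "wide_angle": 0.75, "fisheye": 0.6}


def trim_frame(frame, lens_type, ratios=CROP_RATIOS):
    """Center-crop a frame (list of rows) to the stored ratio for its lens type."""
    h, w = len(frame), len(frame[0])
    r = ratios[lens_type]
    ch, cw = max(1, round(h * r)), max(1, round(w * r))
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in frame[top:top + ch]]


frame = [[0] * 200 for _ in range(100)]
trimmed = trim_frame(frame, "fisheye")
print(len(trimmed), len(trimmed[0]))
```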

This embodiment also provides the same effects as the first embodiment. Moreover, because the trimming unit 160 trims each frame image so that the range of scenery it shows matches the field of view of a stationary person, regions that a moving person would perceive as highly salient can be detected from the image even more accurately.

The trimming unit 160 of this embodiment may also be provided in the saliency estimation device 10 of the second or third embodiment.

(Fifth Embodiment)
The saliency estimation device 10 of this embodiment has the same configuration as that of any of the embodiments described above, except for the functional configuration of the saliency estimation unit 130.

FIG. 20 illustrates the configuration of the saliency estimation unit 130 according to this embodiment. It is the same as the saliency estimation unit 130 of the first embodiment, except that it further includes an error calculation unit 340 and a correction unit 350. Using the saliency estimation information generated for an image and saliency measurement information indicating the saliency distribution actually measured for that image, the error calculation unit 340 calculates the error between the saliency distribution indicated by the estimation information and that indicated by the measurement information. The correction unit 350 then corrects the correction information based on the calculated error.

The saliency estimation unit 130 of this embodiment performs an estimation operation and a learning operation. In the estimation operation, saliency estimation information is generated and output for an input image, as described in the first embodiment; in particular, in this embodiment the nonlinear mapping unit 320 converts the intermediate data into mapping data using the correction information. In the learning operation, machine learning is performed using training images and saliency measurement information for those training images, and the correction information is generated or corrected (updated). The correction information is used by the nonlinear mapping unit 320 and includes, for example, a plurality of correction parameters.

In this embodiment, the nonlinear mapping unit 320 converts the intermediate data into mapping data using the correction information, which has been generated, corrected, or both by machine learning. Specifically, as described in the first embodiment, the nonlinear mapping unit 320 includes a plurality of filters 325 whose coefficients are determined based on the correction information. For example, the correction information contains the coefficients of the filters 325 as a plurality of correction parameters.

FIG. 21 is a flowchart illustrating the learning operation according to this embodiment, which is described in detail below. For the learning operation, training images and saliency measurement information for those training images are prepared. For example, each training image and its saliency measurement information are held in the storage unit 390 in association with each other, and the input unit 310 and the error calculation unit 340 can read them from the storage unit 390.

A training image is any image, such as a photograph. The saliency measurement information is generated, for example, from the results of using an eye tracker to measure where people look when viewing the training image. The saliency measurement information can take the same forms as the saliency estimation information: for example, an image that visualizes saliency as luminance values, or an image color-coded by saliency, such as a heat map.

In the learning operation, input step S110, nonlinear mapping step S120, and output step S130 are performed in the same way as in the first embodiment, except that the image acquired by the input unit 310 in input step S110 is a training image. In nonlinear mapping step S120, the nonlinear mapping unit 320 reads the correction information from the storage unit 390 and uses it to convert the intermediate data into mapping data. Instead of reading the correction information from the storage unit 390, the nonlinear mapping unit 320 may acquire it directly from the correction unit 350. In the initial state, the correction parameters in the correction information may take arbitrary values.

Next, in the error calculation step S140, the error calculation unit 340 acquires the saliency estimation information from the output unit 330. It also acquires the saliency measurement information associated with the teacher image from which that estimation was generated. The error calculation unit 340 then calculates the error between the saliency estimation information and the saliency measurement information. The calculation method is not particularly limited. It is preferable, for example, to calculate at least one of the following: the L1 distance, the L2 distance (Euclidean distance, mean squared error), the Kullback-Leibler distance, the Jensen-Shannon distance, and the Pearson correlation coefficient.

Specifically, the Euclidean distance is obtained by formula (1), the Kullback-Leibler distance by formula (2), and the Jensen-Shannon distance by formula (3). Here, p_i denotes the estimation result (a value based on the saliency estimation information), and q_i denotes the true value (a value based on the saliency measurement information).
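Formulas (1) to (3) are not reproduced in this excerpt. The following Python sketch is therefore not the patent's exact formulation. It computes the five candidate error measures for flattened saliency maps using common textbook definitions, under these assumptions:

- both maps are non-negative and are normalized to sum to 1;
- the Kullback-Leibler term is taken as KL(q‖p), with the measured map q as the reference;
- `EPS` is a hypothetical smoothing constant;
- the Jensen-Shannon value is the divergence (some authors call its square root the distance).

```python
import math

EPS = 1e-12  # hypothetical smoothing constant to avoid log(0)


def normalize(v):
    """Scale a non-negative map so that it sums to 1."""
    total = sum(v)
    return [x / total for x in v]


def kl(a, b):
    """Kullback-Leibler divergence KL(a || b) in nats."""
    return sum(x * math.log((x + EPS) / (y + EPS)) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either map is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0


def saliency_errors(p, q):
    """p: estimated values p_i, q: measured values q_i (flattened maps)."""
    p, q = normalize(p), normalize(q)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture used by Jensen-Shannon
    return {
        "L1": sum(abs(x - y) for x, y in zip(p, q)),
        "L2": math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q))),
        "KL": kl(q, p),  # measured map as reference (assumed direction)
        "JS": 0.5 * kl(p, m) + 0.5 * kl(q, m),
        "CC": pearson(p, q),
    }
```

For two completely disjoint maps, such as `saliency_errors([1, 0], [0, 1])`, the results are L1 = 2, L2 = √2, JS ≈ ln 2, and a correlation of -1. The distances shrink toward 0 and the correlation rises toward 1 as the estimate approaches the measurement.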

Next, in the correction step S150, the correction unit 350 acquires the error from the error calculation unit 340. It then corrects the correction parameters so that this error decreases. The correction parameters held in the storage unit 390 are then replaced with the corrected correction parameters. The correction method is not particularly limited. It is preferable, for example, to use at least one of the following: the least squares method, quadratic programming, stochastic gradient descent (SGD), adaptive moment estimation (Adam), and the variational method.
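The sketch below illustrates two of the listed update rules, plain SGD and Adam, applied to a parameter vector given the gradient of the error. Several details are assumptions, because the patent does not state how the gradient is obtained or which settings are used:

- the flat list of parameters;
- the externally supplied gradient;
- the hyperparameter values (learning rate, β1 = 0.9, β2 = 0.999, eps).

```python
import math


def sgd_step(params, grads, lr=0.01):
    """Plain (stochastic) gradient descent: theta <- theta - lr * g."""
    return [w - lr * g for w, g in zip(params, grads)]


class Adam:
    """Adaptive moment estimation with bias-corrected moment estimates."""

    def __init__(self, n, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n  # running mean of gradients (first moment)
        self.v = [0.0] * n  # running mean of squared gradients (second moment)
        self.t = 0          # step counter used for bias correction

    def step(self, params, grads):
        self.t += 1
        out = []
        for i, (w, g) in enumerate(zip(params, grads)):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g * g
            m_hat = self.m[i] / (1 - self.b1 ** self.t)
            v_hat = self.v[i] / (1 - self.b2 ** self.t)
            out.append(w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return out
```

On its first step, Adam moves each parameter by about `lr` in the direction opposite the gradient's sign, whatever the gradient's magnitude. SGD's step, by contrast, is proportional to the gradient. This scale-insensitivity is one reason Adam is often preferred when many parameters with very different gradient scales must be corrected together.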

There are many correction parameters to be corrected. To determine their values efficiently and estimate saliency with high accuracy, it is preferable to use statistical learning (machine learning) on a large amount of teacher data. Accordingly, in the learning operation, machine learning is preferably performed through the cooperation of the nonlinear mapping unit 320, the error calculation unit 340, and the correction unit 350.

Instead of replacing the correction parameters held in the storage unit 390, the correction unit 350 may output the corrected correction parameters directly to the nonlinear mapping unit 320. In the next nonlinear mapping step S120, the nonlinear mapping unit 320 then performs processing using the corrected parameters.

One teacher image may be associated with a single piece of saliency measurement information or with several. When several are associated with one teacher image, each is based on a different measurement result. The error calculation unit 340 calculates the error between the saliency estimation information and each piece of saliency measurement information. The correction unit 350 then corrects the correction parameters so that, for example, the sum of all these errors decreases.
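When one teacher image has several measured saliency maps (for example, from different observers), the quantity to be reduced can be taken as the sum of the per-map errors. The sketch below makes the following assumptions:

- the error is the squared L2 error;
- the maps are flattened lists;
- the function names are hypothetical.

```python
def squared_l2(p, q):
    """Squared Euclidean error between two flattened maps."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def total_error(estimate, measured_maps, error_fn=squared_l2):
    """Sum of the errors between one estimate and every measured map."""
    return sum(error_fn(estimate, q) for q in measured_maps)


# Two observers who looked at different halves of the same image.
maps = [[1.0, 0.0], [0.0, 1.0]]
print(total_error([0.5, 0.5], maps))  # averaged estimate
print(total_error([1.0, 0.0], maps))  # matches only the first observer
```

With a squared-L2 error, the summed error is minimized when the estimate equals the pixel-wise mean of the measured maps. Correcting parameters to reduce the sum therefore pulls the estimate toward the consensus of the measurements rather than toward any single one.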

The learning operation may be performed on multiple pairs, each consisting of a teacher image and its saliency measurement information. Repeating the learning operation further improves the accuracy of saliency estimation.

The timing of the learning operation is not particularly limited. For example, the saliency estimation unit 130 can accept a user operation to start the learning operation and begin learning in response to it. The saliency estimation unit 130 can end the learning operation in response to a user's end operation or when a predetermined end condition is met. Examples of end conditions are reaching a predetermined number of repetitions of the learning operation, and the error falling to or below a predetermined reference value.
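The end conditions above can be sketched as a loop that stops when the error falls to a reference value or when a preset number of iterations is reached. To keep the example self-contained, the "model" is a hypothetical single gain `w` applied to a teacher input (estimate = w · x), trained by gradient descent on a mean squared error. The actual nonlinear mapping unit 320 is far larger. `max_iters`, `tol`, and `lr` are assumed values.

```python
def train(pairs, w=0.0, lr=0.1, max_iters=1000, tol=1e-6):
    """Fit the hypothetical model estimate = w * x by gradient descent.

    pairs: list of (teacher_input, measured_map) tuples of flattened maps.
    Returns (w, last_evaluated_error, iterations, reason).
    """
    n = sum(len(x) for x, _ in pairs)
    err = float("inf")
    for it in range(1, max_iters + 1):
        err, grad = 0.0, 0.0
        for x, q in pairs:
            for xi, qi in zip(x, q):
                r = w * xi - qi          # residual of the estimate
                err += r * r / n         # mean squared error over all pixels
                grad += 2 * r * xi / n   # derivative of err with respect to w
        if err <= tol:                   # end condition: error at or below reference
            return w, err, it, "error below reference value"
        w -= lr * grad                   # correction step
    return w, err, max_iters, "iteration limit reached"


w, err, iters, reason = train([([1.0, 2.0], [2.0, 4.0])])
print(round(w, 3), iters, reason)
```

Checking both conditions in the same loop means training ends as soon as either is satisfied. The returned `reason` records which condition ended it.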

As described above, according to the present embodiment, as in the first embodiment, the nonlinear mapping unit 320 includes the feature extraction unit 321 and the upsampling unit 322. The feature extraction unit 321 extracts features from the intermediate data, and the upsampling unit 322 upsamples the data generated by the feature extraction unit 321. Saliency can therefore be estimated at a low computational cost.

In addition, according to the present embodiment, the saliency estimation unit 130 includes the error calculation unit 340 and the correction unit 350. Using the correction information corrected by the learning operation therefore enables more accurate saliency estimation.

(Sixth Embodiment)
The saliency estimation device 10 according to the present embodiment has the same configuration as the saliency estimation device 10 according to any of the embodiments described above, except for the functional configuration of the saliency estimation unit 130.

FIG. 22 illustrates the configuration and usage environment of the arithmetic unit 40 according to the present embodiment. The arithmetic unit 40 is a device that generates the correction information used by the saliency estimation unit 130. It includes an error calculation unit 440 and a correction unit 450. The error calculation unit 440 uses two inputs: the saliency estimation information generated for a teacher image, and the saliency measurement information, which indicates the saliency distribution actually measured for that teacher image. From these, it calculates the error between the saliency distribution indicated by the saliency estimation information and the saliency distribution indicated by the saliency measurement information. The correction unit 450 calculates the correction information based on this error.

The saliency estimation unit 130 according to the present embodiment is the same as the saliency estimation unit 130 according to the first embodiment. It includes the input unit 310, the nonlinear mapping unit 320, and the output unit 330. It need not include the error calculation unit 340 and the correction unit 350 described in the fifth embodiment. The input unit 310, the nonlinear mapping unit 320, and the output unit 330 are each the same as the corresponding unit in at least one of the first and fifth embodiments. The error calculation unit 440 operates in the same way as the error calculation unit 340 of the fifth embodiment. The correction unit 450 operates in the same way as the correction unit 350 of the fifth embodiment. The saliency estimation unit 130 and the arithmetic unit 40 cooperate to perform the learning operation and the estimation operation described in the fifth embodiment. The saliency estimation unit 130 and the arithmetic unit 40 may be physically separated and connected to each other via, for example, a communication network.

In the learning operation according to the present embodiment, machine learning is preferably performed through the cooperation of the nonlinear mapping unit 320, the error calculation unit 440, and the correction unit 450.

The output unit 330 may temporarily store the generated saliency estimation information in the storage unit 390. The error calculation unit 440 may then read out and use the saliency estimation information stored there.

In the example of this figure, the storage unit 390 is provided separately from the saliency estimation unit 130 and the arithmetic unit 40. The invention is not limited to this example: the storage unit 390 may be provided in the saliency estimation unit 130 or in the arithmetic unit 40. When the storage unit 390 is provided inside the arithmetic unit 40, it is realized, for example, by the storage device 1080 of the computer 1000 that implements the arithmetic unit 40. The storage unit 390 may also be realized jointly by two storage devices 1080: that of the computer 1000 implementing the saliency estimation unit 130, and that of the computer 1000 implementing the arithmetic unit 40.

In addition, according to the present embodiment, the arithmetic unit 40 includes the error calculation unit 440 and the correction unit 450. Using the correction information corrected by the learning operation therefore enables more accurate saliency estimation.

(Seventh Embodiment)
The saliency estimation device 10 according to the present embodiment has the same configuration as the saliency estimation device 10 according to any of the embodiments described above, except for the functional configuration of the saliency estimation unit 130.

FIG. 23 illustrates the configuration of the saliency estimation unit 130 according to the present embodiment. It is the same as the saliency estimation unit 130 according to at least one of the first and fifth embodiments, except that it further includes a synthesis unit 360 and a display unit 380.

The synthesis unit 360 generates composite information by combining the saliency distribution indicated by the saliency estimation information with the image input to the input unit 310 (the input image). Specifically, the synthesis unit 360 acquires the saliency estimation information from the output unit 330 and the input image from, for example, the storage unit 390. It then outputs composite information that shows the input image and the saliency distribution together. The composite information is output, for example, to the display unit 380 provided in the saliency estimation unit 130. The composite information output from the synthesis unit 360 may also be stored in the storage unit 390 or acquired by an external device.

FIG. 24 illustrates an image represented by the composite information generated by the synthesis unit 360. In this example, the composite information is an image in which a heat map showing saliency is superimposed on the input image. The format of the composite information is not particularly limited. For example, it may be the input image with salient regions enclosed by circles or rectangles. The compositing method is also not particularly limited; alpha blending is one example.
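The following is a minimal sketch of alpha blending a saliency heat map onto the input image. None of these choices come from the patent. It assumes:

- grayscale pixel values in 0..255 and saliency values in [0, 1];
- a simple hypothetical blue-to-red color ramp in place of a real colormap;
- a fixed blend weight `alpha`.

```python
def heat_color(s):
    """Map saliency s in [0, 1] to RGB on a hypothetical blue-to-red ramp."""
    s = min(max(s, 0.0), 1.0)
    return (255 * s, 0.0, 255 * (1 - s))


def blend(image, saliency, alpha=0.5):
    """Alpha-blend a saliency heat map over a grayscale image.

    image: 2-D list of gray levels 0..255; saliency: 2-D list in [0, 1].
    Each output pixel is (1 - alpha) * gray + alpha * heat, per RGB channel.
    """
    out = []
    for img_row, sal_row in zip(image, saliency):
        row = []
        for g, s in zip(img_row, sal_row):
            heat = heat_color(s)
            row.append(tuple(round((1 - alpha) * g + alpha * c) for c in heat))
        out.append(row)
    return out


print(blend([[100, 100]], [[0.0, 1.0]], alpha=0.5))
```

At `alpha = 0` the output is the original image rendered in gray. At `alpha = 1` it is the pure heat map. Intermediate values keep the scene recognizable while tinting salient areas red.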

The saliency estimation unit 130 according to the present embodiment can be implemented in, for example, a mobile terminal (a smartphone, a tablet, or the like) equipped with an imaging device such as a camera. Highly salient, important objects can then be extracted on the spot while shooting with the terminal, and visualized clearly.

In addition, according to the present embodiment, the saliency estimation unit 130 further includes the synthesis unit 360. The saliency at each position in the image can therefore be visualized clearly.

(Eighth Embodiment)
The saliency estimation device 10 according to the present embodiment has the same configuration as the saliency estimation device 10 according to any of the embodiments described above, except for the functional configuration of the saliency estimation unit 130.

FIG. 25 illustrates the configuration of the saliency estimation unit 130 according to the present embodiment. It is the same as the saliency estimation unit 130 according to at least one of the first, fifth, and seventh embodiments, except that it further includes a mask image generation unit 370, a region extraction unit 372, and an object detection unit 374.

The mask image generation unit 370 acquires the saliency estimation information from the output unit 330 and generates a mask image. Specifically, it works on the saliency distribution indicated by the saliency estimation information. Regions whose saliency is below a predetermined reference become the mask region, and regions whose saliency is at or above that reference become the non-mask region. In other words, the mask image generation unit 370 binarizes the saliency distribution. The reference is set in advance and held in the storage unit 390, from which the mask image generation unit 370 can read it.
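The binarization can be sketched as a per-pixel threshold. In this sketch, 1 marks the non-mask (salient) region and 0 marks the mask region. The `threshold` argument stands in for the predetermined reference held in the storage unit 390, and its value is an assumption.

```python
def make_mask(saliency, threshold):
    """Binarize a 2-D saliency map: 1 where saliency >= threshold, else 0."""
    return [[1 if s >= threshold else 0 for s in row] for row in saliency]


print(make_mask([[0.1, 0.5], [0.7, 0.49]], threshold=0.5))
```

Using `>=` places pixels exactly at the reference in the non-mask region, which matches the description ("at or above the reference").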

The region extraction unit 372 acquires the input image and the mask image. It then extracts highly salient regions from the input image by applying the mask image to it. For example, the region extraction unit 372 can extract such regions by performing a logical operation between the input image and the mask image.
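One way to apply the mask is a pixel-wise logical AND: pixels in the mask region are set to zero. The bounding box of the remaining pixels then gives a crop that a detector could process. Zeroing masked pixels and cropping to a bounding box are illustrative choices, not details given in the patent. The function names are hypothetical.

```python
def apply_mask(image, mask):
    """Logical AND with the mask: keep pixels where mask is 1, zero the rest."""
    return [[p if m else 0 for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def mask_bbox(mask):
    """Inclusive (top, left, bottom, right) box of non-mask pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop(image, bbox):
    """Cut the inclusive bounding box out of a 2-D image."""
    top, left, bottom, right = bbox
    return [row[left:right + 1] for row in image[top:bottom + 1]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = [[0, 0, 0], [0, 1, 1], [0, 1, 0]]
print(apply_mask(image, mask))          # masked pixels set to 0
print(crop(image, mask_bbox(mask)))     # region handed to a detector
```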

The object detection unit 374 then detects objects in the region extracted by the region extraction unit 372. The detection method is not particularly limited; one example is a method using a Single Shot MultiBox Detector (SSD). The saliency estimation unit 130 according to the present embodiment first extracts highly salient regions and performs object detection only within them, so false detections are suppressed.

The saliency estimation unit 130 according to the present embodiment is mounted, for example, on a moving body such as an automobile. The object detection results from the object detection unit 374 can then be used for automated driving and driving assistance.

In addition, according to the present embodiment, the saliency estimation unit 130 further includes the mask image generation unit 370, the region extraction unit 372, and the object detection unit 374. Objects in the input image can therefore be detected with high accuracy.

Although embodiments and examples have been described above with reference to the drawings, they are illustrations of the present invention, and various configurations other than those described above may also be adopted.

10 Saliency estimation device
110 Input unit
120 Correction unit
122 Field of view setting unit
124 Correction information generation unit
126 Correction processing unit
130 Saliency estimation unit
140 Speed estimation unit
150 Reference point setting unit
160 Trimming unit
202 Resolution correction unit
204 Saturation correction unit
206 Brightness correction unit

Claims

A saliency estimation device comprising:
a correction unit that generates a corrected image by correcting an image of the scene seen from a first viewpoint; and
a saliency estimation unit that processes the corrected image to generate saliency estimation information indicating a saliency distribution in the corrected image or in the image,
wherein the correction unit
generates brightness information indicating a change in the brightness of at least a part of the image, using speed information on the moving speed of the first viewpoint and the relative position of the at least a part from a reference point in the image, and
the saliency estimation information is generated based on a corrected image in which the brightness of the at least a part has been corrected using the brightness information.

The saliency estimation device according to claim 1, wherein
the correction unit
further generates, using the speed information and the relative position, resolution information indicating a change in the resolution of the at least a part and saturation information indicating a change in the saturation of the at least a part, and
corrects the resolution and saturation of the at least a part of the image using the resolution information and the saturation information.

The saliency estimation device according to claim 1 or 2, wherein
the correction unit generates the brightness information such that the brightness decreases as the distance from the reference point in the image to the at least a part increases.

The saliency estimation device according to any one of claims 1 to 3, wherein
the image is a frame image included in a moving image.

The saliency estimation device according to any one of claims 1 to 4, wherein
the correction unit sets the reference point by processing the image.

The saliency estimation device according to any one of claims 1 to 5, further comprising
a trimming unit that trims the image using generation conditions of the image, wherein
the correction unit processes the trimmed image to generate the corrected image.

The saliency estimation device according to any one of claims 1 to 6, wherein
the saliency estimation unit generates the saliency estimation information by inputting the corrected image into a model generated by machine learning.

A saliency estimation method in which a computer:
generates a corrected image by correcting an image of the scene seen from a first viewpoint; and
processes the corrected image to generate saliency estimation information indicating a saliency distribution in the corrected image or in the image,
wherein the computer further:
generates brightness information indicating a change in the brightness of at least a part of the image, using speed information on the moving speed of the first viewpoint and the relative position of the at least a part from a reference point in the image; and
generates the saliency estimation information based on a corrected image in which the brightness of the at least a part has been corrected using the brightness information.

A program causing a computer to have:
a correction function of generating a corrected image by correcting an image of the scene seen from a first viewpoint; and
an estimation function of processing the corrected image to generate saliency estimation information indicating a saliency distribution in the corrected image or in the image,
the program further causing the computer to have, as at least a part of the correction function:
a function of generating brightness information indicating a change in the brightness of at least a part of the image, using speed information on the moving speed of the first viewpoint and the relative position of the at least a part from a reference point in the image; and
a function of generating the saliency estimation information based on a corrected image in which the brightness of the at least a part has been corrected using the brightness information.