
JP2022177579A - Learning device, estimation device, learning method, estimation method, and program

Info

Publication number: JP2022177579A
Application number: JP2021083943A
Authority: JP
Inventors: 入江豪 (Go Irie); Onkar Krishna; 相澤清晴 (Kiyoharu Aizawa)
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2021-05-18
Filing date: 2021-05-18
Publication date: 2022-12-01

Abstract

To provide a technique for estimating a focus-of-attention region that lies in a part of a space that does not appear in an image or video. SOLUTION: A learning device includes a control unit that updates a mathematical model indicating the relation between image data and a focus-of-attention region in a space that does not appear in the image indicated by the image data. The update is based on training data, each item of which is a pair of image data and correct focus-of-attention region data, that is, data indicating a focus-of-attention region in a space that does not appear in the image indicated by the image data. Using the mathematical model, the control unit estimates a focus-of-attention region from the image data included in the training data, and updates the mathematical model so that the estimation result comes closer to the focus-of-attention region indicated by the correct focus-of-attention region data included in the training data. SELECTED DRAWING: Figure 6

Description

The present invention relates to a learning device, an estimation device, a learning method, an estimation method, and a program.

Techniques have been developed for predicting the gaze region (focus of attention, FoA), that is, the region within a space such as a scene to which humans tend to direct their eyes. A representative example is saliency estimation, which detects the gaze region in an image or video by analyzing that image or video. It has been studied for many years in robot vision and computer vision and has been used in a wide range of fields, including object recognition, image and video editing, image coding, image quality assessment, autonomous driving, and online marketing.

AI (artificial intelligence) technology is increasingly being deployed in society. The ability to perceive or recognize the real world as humans do is one of the basic requirements for AI technology, and gaze-region prediction is one of the technologies that underpin it. In particular, AI robots and AI agents that act in the real world like humans are expected to perceive the outside world from first-person images or videos captured by a mounted camera. Expectations are therefore high for technology that predicts gaze regions for first-person images or videos. For simplicity, gaze-region prediction is described below using images as the example; since a video is a set of images, the description applies equally to videos.

Recent advances in deep learning have brought major progress in saliency estimation for first-person images, and excellent prediction performance has been achieved. For example, Non-Patent Document 1 proposes a model based on a three-dimensional convolutional network that can accurately estimate the gaze region in first-person images in an autonomous-driving scenario.

Non-Patent Document 2 focuses on age-related differences in gaze regions and proposes a technique that uses image translation to convert the estimated gaze region of adults into the estimated gaze region of elderly people. This technique also targets first-person images, in autonomous-driving scenarios and from a pedestrian's viewpoint, and enables highly accurate saliency prediction.

Non-Patent Document 1: Andrea Palazzi, Davide Abati, Simone Calderara, Francesco Solera, and Rita Cucchiara, “Predicting the drivers focus of attention: the dr(eye)ve project”, IEEE transactions on pattern analysis and machine intelligence, 2018.
Non-Patent Document 2: Onkar Krishna, Go Irie, Takahito Kawanishi, Kunio Kashino, and Kiyoharu Aizawa, “Translating Adult’s Focus of Attention to Elderly’s”, In Proceedings of International Conference on Pattern Recognition, 2020.

However, these techniques are designed solely to estimate gaze regions that appear in the image. Humans acting in the real world do not always direct their attention to the observable part of a scene. They sometimes predict or judge regions that are invisible yet deserve attention, such as the area behind them or the far side of a wall, and then take preparatory actions or act to avoid danger.

AI systems that act in the real world alongside humans are therefore expected to have similar capabilities. Specifically, an AI system is expected to be able to predict gaze regions within regions that are not captured in the image.

However, existing image saliency estimation techniques all estimate salient regions only within the region captured in the image; none of them predicts gaze regions in regions not captured in the image. As noted above, the same holds for video.

In view of the above circumstances, an object of the present invention is to provide a technique for predicting a gaze region within a region of a space that is not captured in an image or video.

One aspect of the present invention is a learning device comprising a control unit that updates a mathematical model based on training data. The training data is paired data consisting of image data and correct gaze region data, the latter being data that indicates a gaze region in a space not captured in the image represented by the image data. The mathematical model represents the relationship between image data and a gaze region in a space not captured in the image represented by that image data. The control unit estimates a gaze region from the image data included in the training data using the mathematical model, and updates the mathematical model so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

One aspect of the present invention is an estimation device comprising a control unit that uses an updated mathematical model to estimate, based on input image data, a gaze region in a space not captured in the image represented by that image data. The updated mathematical model is a mathematical model that has been updated by a learning device until a predetermined termination condition is satisfied. The learning device comprises a control unit that updates, based on training data (paired data consisting of image data and correct gaze region data indicating a gaze region in a space not captured in the image represented by the image data), a mathematical model representing the relationship between image data and a gaze region in a space not captured in the image represented by that image data. The control unit of the learning device estimates a gaze region from the image data included in the training data using the mathematical model, and updates the mathematical model so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

One aspect of the present invention is a learning method comprising a control step of updating a mathematical model based on training data. The training data is paired data consisting of image data and correct gaze region data, the latter being data that indicates a gaze region in a space not captured in the image represented by the image data. The mathematical model represents the relationship between image data and a gaze region in a space not captured in the image represented by that image data. In the control step, a gaze region is estimated from the image data included in the training data using the mathematical model, and the mathematical model is updated so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

One aspect of the present invention is an estimation method comprising a control step of using an updated mathematical model to estimate, based on input image data, a gaze region in a space not captured in the image represented by that image data. The updated mathematical model is a mathematical model that has been updated by a learning device until a predetermined termination condition is satisfied. The learning device comprises a control unit that updates, based on training data (paired data consisting of image data and correct gaze region data indicating a gaze region in a space not captured in the image represented by the image data), a mathematical model representing the relationship between image data and a gaze region in a space not captured in the image represented by that image data. The control unit of the learning device estimates a gaze region from the image data included in the training data using the mathematical model, and updates the mathematical model so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

One aspect of the present invention is a program for causing a computer to function as the learning device described above.

The present invention makes it possible to predict a gaze region within a region of a space that is not captured in an image or video.

FIG. 1 is a diagram showing an example of the configuration of the prediction system 100 of the embodiment.
FIG. 2 is an explanatory diagram of a first example of the prediction result expression format in the embodiment.
FIG. 3 is an explanatory diagram of a second example of the prediction result expression format in the embodiment.
FIG. 4 is an explanatory diagram of a third example of the prediction result expression format in the embodiment.
FIG. 5 is an explanatory diagram of a fourth example of the prediction result expression format in the embodiment.
FIG. 6 is a diagram showing an example of the configuration of the predictor 10 in the embodiment.
FIG. 7 is a flowchart showing an example of the flow of processing executed by the predictor 10 in the embodiment.
FIG. 8 is a flowchart showing an example of the flow of enhanced prediction processing in the embodiment.
FIG. 9 is a flowchart showing an example of the flow of reward acquisition processing in the embodiment.
FIG. 10 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment.
FIG. 11 is a diagram showing an example of the functional configuration of the control unit 11 in the embodiment.
FIG. 12 is a diagram showing an example of the hardware configuration of the prediction device 2 in the embodiment.
FIG. 13 is a diagram showing an example of the functional configuration of the control unit 21 in the embodiment.

(Embodiment)
FIG. 1 is a diagram showing an example of the configuration of the prediction system 100 of the embodiment. The prediction system 100 includes a learning device 1 and a prediction device 2. For simplicity, the prediction system 100 is described below for the case where image data is input to the learning device 1 and the prediction device 2. However, because a video is a time series of images, video data may be input to the learning device 1 and the prediction device 2 in place of image data.

The learning device 1 receives image data as input. Based on the input image data, the learning device 1 updates a prediction model by a machine learning method. The prediction model is a mathematical model that represents the relationship between input image data and a gaze region (focus of attention, FoA) in a space not captured in the image represented by that image data. A gaze region in a space not captured in the image is, in other words, a gaze region in the space outside the space shown by the image. A gaze region is a region, among the regions within a space (hereinafter "in-space regions"), to which humans tend to direct their eyes.

The prediction device 2 receives image data as input. Using the trained prediction model obtained by the learning device 1, the prediction device 2 predicts the gaze region based on the input image data. The prediction device 2 includes a predictor control unit 211. The predictor control unit 211, described in detail later, controls the operation of the circuit that represents the trained prediction model.

"Trained" means that learning has been executed until a predetermined termination condition (hereinafter "learning termination condition") is satisfied. A trained mathematical model is therefore the mathematical model at the time the learning termination condition is satisfied. The learning termination condition is, for example, that learning has been performed a predetermined number of times. It may also be, for example, that the change in the mathematical model caused by learning is smaller than a predetermined change.
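The two example termination conditions can be sketched as a single check. This is a minimal illustration, not the patent's implementation: the function name, the parameter names, and the choice to combine the two conditions with a logical OR are all assumptions made here.

```python
def learning_finished(step, max_steps, model_change, min_change):
    """Return True when either example learning termination condition holds.

    step         -- number of learning iterations performed so far
    max_steps    -- the predetermined number of iterations (assumed name)
    model_change -- size of the latest change in the model (assumed measure)
    min_change   -- the predetermined change threshold (assumed name)
    """
    reached_step_limit = step >= max_steps
    change_is_small = model_change < min_change
    # Combining the two conditions with OR is an assumption of this sketch.
    return reached_step_limit or change_is_small
```

A training loop would call this once per iteration and stop updating the model as soon as it returns True.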

A mathematical model is a set containing one or more processes whose execution conditions and order are predetermined. A process included in a mathematical model is, for example, a process of obtaining the value of a predetermined function by inputting a value into that function.

Performing learning means updating the mathematical model, which in turn means updating the values of the parameters of the circuit that represents the mathematical model. At least part of the processing included in the mathematical model is represented by, for example, a neural network. A neural network is one example of a circuit (such as an electronic circuit, electric circuit, optical circuit, or integrated circuit) that represents at least part of the processing of a mathematical model. When part of the circuit representing the mathematical model is a neural network, the parameters of the neural network are suitably adjusted based on a predefined quantity, for example the value of a predefined objective function (that is, the loss).
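As a rough illustration of adjusting parameters based on a loss, the sketch below applies plain gradient descent to a tiny linear model that maps a feature vector to a gaze point (x, y). The linear model, the squared-error loss, the learning rate, and all names and values are assumptions chosen for illustration; the text does not specify the objective function or the update rule.

```python
def predict(params, feature):
    # params holds one weight row per output coordinate (x and y).
    return [sum(w * f for w, f in zip(row, feature)) for row in params]


def squared_error(pred, target):
    # Assumed loss: squared distance between estimate and correct gaze point.
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def update(params, feature, target, lr):
    # One gradient-descent step on the squared error.
    pred = predict(params, feature)
    new_params = []
    for row, p, t in zip(params, pred, target):
        grad_scale = 2.0 * (p - t)  # derivative of the loss w.r.t. this output
        new_params.append([w - lr * grad_scale * f for w, f in zip(row, feature)])
    return new_params


params = [[0.0, 0.0], [0.0, 0.0]]
feature = [1.0, 0.5]   # hypothetical image feature
target = [2.0, 0.0]    # hypothetical correct gaze point
loss_before = squared_error(predict(params, feature), target)
for _ in range(100):
    params = update(params, feature, target, lr=0.1)
loss_after = squared_error(predict(params, feature), target)
```

Each step moves the estimate closer to the correct gaze point, so `loss_after` is smaller than `loss_before`.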

<Examples of prediction result expression formats>
This section describes the format used to express the result of prediction by the trained prediction model or by the prediction model (hereinafter "prediction result expression format").

FIG. 2 is an explanatory diagram of a first example of the prediction result expression format in the embodiment. FIG. 2 shows an example of a format that expresses the predicted gaze region as the position of a point that readily attracts attention (hereinafter "gaze point"). FIG. 2 shows an image 2H+1 pixels high and 2W+1 pixels wide. If, for example, the pixel at the center of the image is denoted (0, 0), the position (that is, the coordinates) of any pixel in the image is expressed as (x, y) using the horizontal position x and the vertical position y.

The format of FIG. 2 can express a gaze point by its coordinates even when the gaze point lies outside the space captured in the image of FIG. 2. That is, even when the x-coordinate of the gaze point lies outside the range −W < x < W and the y-coordinate lies outside the range −H < y < H, the format of FIG. 2 can express the gaze point by its coordinates. The x-axis and the y-axis are predetermined coordinate axes orthogonal to each other.

As an example of a gaze point, FIG. 2 shows a point at position (2W, 0), which is 2W pixels to the right of the center. Because it is 2W to the right of the center, the point at (2W, 0) lies outside the image.
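The coordinate format can be illustrated with a small check of whether a gaze point falls inside the image. This is a sketch under assumptions: the function name and the concrete values of W and H are hypothetical, and the bounds are taken as inclusive because an image 2W+1 pixels wide centered on 0 spans x = −W to W.

```python
def is_inside_image(x, y, W, H):
    # The image is (2W+1) pixels wide and (2H+1) pixels high, centered on (0, 0),
    # so its pixels span -W..W horizontally and -H..H vertically (inclusive).
    return -W <= x <= W and -H <= y <= H


W, H = 320, 240            # hypothetical image half-sizes
gaze_point = (2 * W, 0)    # the example point from FIG. 2
outside = not is_inside_image(gaze_point[0], gaze_point[1], W, H)
```

The point (2W, 0) is still a valid coordinate pair even though `is_inside_image` reports that it lies outside the image, which is what allows this format to express out-of-image gaze points.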

FIG. 3 is an explanatory diagram of a second example of the prediction result expression format in the embodiment. FIG. 3 shows an example of a format that expresses the position of the predicted gaze point using the distance r from the center and the angle θ (that is, polar coordinates). In the format of FIG. 3, the distance r need not be used to express the position of the gaze point. The position of the gaze point may be expressed by the angle θ alone (that is, the gaze direction).
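A conversion from the (x, y) coordinates of the first format to this polar form might look as follows. The angle convention (measured from the positive x-axis, in radians, via `atan2`) is an assumption; the figure that fixes the actual convention is not reproduced in this text.

```python
import math


def to_polar(x, y):
    # Distance from the image center and angle from the positive x-axis.
    r = math.hypot(x, y)
    theta = math.atan2(y, x)   # radians in (-pi, pi]; convention assumed here
    return r, theta


def gaze_direction(x, y):
    # Direction-only variant: the distance r is dropped.
    return math.atan2(y, x)


r, theta = to_polar(640.0, 0.0)   # the point (2W, 0) with a hypothetical W = 320
```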

FIG. 4 is an explanatory diagram of a third example of the prediction result expression format in the embodiment. FIG. 4 shows an example of a format that discretizes the position of a point and expresses it using the discretization result. One discretization method, shown in FIG. 4, divides the plane inside and outside the image into 24 regions in units of W × H. Each region after division is hereinafter called a discretized region. Under this division, each discretized region has size W × H. Each discretized region is assigned one identifier from 1 up to the number of divisions, and any point is expressed by the identifier of the discretized region to which it belongs.

For example, point A in FIG. 4 belongs to the discretized region with identifier 5, so "5" is output. A point in a discretized region to which no identifier is assigned is expressed by, for example, the identifier of the nearest discretized region that has one. Alternatively, when point A belongs to a discretized region without an identifier, the format may indicate this with information such as "no corresponding region". The size of each discretized region need not be W × H, and the number of discretized regions need not be 24. The user may determine the size and number of discretized regions in advance as appropriate for the setting in which the prediction system 100 is applied.
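The discretized format can be sketched as a grid lookup. The layout used here is purely an assumption, since FIG. 4 is not reproduced in this text: 6 columns by 4 rows of W × H cells centered on the image, numbered row by row starting from 1. Points outside the grid return None, standing in for the "no corresponding region" option.

```python
import math


def region_id(x, y, W, H, cols=6, rows=4):
    # Assumed layout: a cols x rows grid of W x H cells centered on (0, 0),
    # with identifiers 1..cols*rows assigned row by row.
    col = math.floor((x + cols * W / 2) / W)
    row = math.floor((y + rows * H / 2) / H)
    if 0 <= col < cols and 0 <= row < rows:
        return row * cols + col + 1
    return None  # stands in for "no corresponding region"
```

With this assumed layout, the identifier actually assigned to a given point will generally differ from the one in FIG. 4; only the idea of mapping a position to a region identifier carries over.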

FIG. 5 is an explanatory diagram of a fourth example of the prediction result expression format in the embodiment. FIG. 5 shows an example of an output that expresses the gaze region as a distribution, here an isotropic two-dimensional normal distribution with mean (W, H) and variance σ^2. The variance need not be isotropic; it may differ between the x-axis direction and the y-axis direction. Nor does the distribution have to be a normal distribution. It may be any probability distribution that can indicate the position and extent of the gaze region.
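For reference, the density of an isotropic two-dimensional normal distribution can be evaluated as below. The function name and the numeric values are hypothetical; only the distribution itself (mean at a point, single variance σ^2 shared by both axes) comes from the text.

```python
import math


def isotropic_gaussian_density(x, y, mean_x, mean_y, variance):
    # p(x, y) = exp(-((x - mx)^2 + (y - my)^2) / (2 * var)) / (2 * pi * var)
    sq_dist = (x - mean_x) ** 2 + (y - mean_y) ** 2
    return math.exp(-sq_dist / (2.0 * variance)) / (2.0 * math.pi * variance)


W, H = 320.0, 240.0          # hypothetical image half-sizes
variance = 50.0 ** 2         # hypothetical sigma^2
peak = isotropic_gaussian_density(W, H, W, H, variance)
off_peak = isotropic_gaussian_density(W + 50.0, H, W, H, variance)
```

The density is highest at the mean and falls off with distance, so the mean indicates the position of the gaze region and the variance its extent.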

The prediction result expression format is not limited to the examples of FIGS. 2 to 5. Any format may be used as long as it can express the result of prediction by the trained prediction model or the prediction model and can express a spatial region at least in the space outside the space captured in the image.

Returning to FIG. 1, the learning device 1 may update the prediction model by any method that allows the prediction model to be learned. The prediction model may be represented using, for example, a neural network. The circuit that represents the prediction model is hereinafter called the predictor 10.

The learning device 1 updates the prediction model by updating the parameters of the predictor 10. The parameters of the predictor 10 at the time the learning termination condition is satisfied are sent to the prediction device 2. The prediction device 2 includes a circuit similar to the predictor 10 and operates that circuit using the parameters obtained from the learning device 1. In this way, the prediction device 2 executes the trained prediction model obtained by the learning device 1. The circuit in the prediction device 2 that is similar to the predictor 10 is therefore the circuit controlled by the predictor control unit 211. For simplicity, this circuit in the prediction device 2 is also called the predictor 10 below.

How the learning device 1 obtains the trained prediction model is described in detail later. As groundwork for that description, the predictor 10 is described first.

<Description of the predictor 10>
The predictor 10 is a circuit that represents the prediction model. It may be any circuit capable of outputting the prediction result of the prediction model (that is, the gaze region in a space not captured in the image represented by the image data) in a predetermined prediction result expression format. In other words, the predictor 10 may be any circuit that can predict, from image data, the gaze region in a space not captured in the image represented by that image data and output the prediction in a predetermined prediction result expression format. The predictor 10 is, for example, a circuit composed of a convolutional neural network (CNN), a long short-term memory (LSTM) network, and two fully connected (FC) layers.

When the predictor 10 is a circuit composed of a CNN, an LSTM, and two fully connected layers, the CNN may have any network structure, provided that the predictor 10 can represent the prediction model and output its result in the predetermined prediction result expression format. The CNN may be, for example, the ResNet described in Reference 1 below. ResNet can take image data directly as input, which makes it suitable for the prediction system 100.

Reference 1: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

For simplicity, the prediction system 100 is described below for the case where the predictor 10 is a circuit composed of a CNN, an LSTM, and two fully connected layers.

FIG. 6 is a diagram showing an example of the configuration of the predictor 10 in the embodiment. The predictor 10 shown in FIG. 6 includes a CNN 101, an LSTM 102, a prediction FC 103, and a stop determination FC 104. Image data is input to the CNN 101. The CNN 101 is a convolutional neural network (CNN) that obtains, from the input image data, a feature of that image data (hereinafter "image feature").

The LSTM 102 is a long short-term memory network that obtains a region feature based on the image feature obtained by the CNN 101 and on auxiliary information. The region feature is the output of the LSTM 102 and is input to the prediction FC 103 and the stop determination FC 104, which are located after the LSTM 102. The region feature is therefore a kind of intermediate output. The region feature is information that includes at least part of the information indicated by the image feature and the auxiliary information, and at least part of the information indicated by the history of past gaze-region prediction results. The type or amount of information contained in the region feature is updated by learning.
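The wiring among the four components can be sketched as a single forward step. This shows data flow only: the four callables are stand-ins passed in by the caller rather than real networks, and carrying an explicit `state` value between calls is an assumption reflecting how LSTMs usually retain history, not something this passage specifies.

```python
def predictor_step(image, auxiliary, state, cnn, lstm, prediction_fc, stop_fc):
    # CNN 101: image data -> image feature
    image_feature = cnn(image)
    # LSTM 102: image feature + auxiliary information -> region feature
    region_feature, state = lstm(image_feature, auxiliary, state)
    # The region feature feeds both heads that follow the LSTM.
    prediction = prediction_fc(region_feature)   # prediction FC 103
    stop_signal = stop_fc(region_feature)        # stop determination FC 104
    return prediction, stop_signal, state


# Toy stand-ins so the wiring can be exercised without any ML library.
toy_cnn = lambda image: sum(image)
toy_lstm = lambda feat, aux, state: (feat + aux + state, state + 1)
toy_prediction_fc = lambda region: (region, 0)
toy_stop_fc = lambda region: region > 10

prediction, stop_signal, next_state = predictor_step(
    [1, 2, 3], 0.5, 0, toy_cnn, toy_lstm, toy_prediction_fc, toy_stop_fc
)
```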

補助情報はＬＳＴＭ１０２に入力される情報である。補助情報は、ＬＳＴＭ１０２に入力されるタイミングに応じた値を示す。補助情報が示す値は、０では無い分散を有する所定の分布にしたがう値である。補助情報は、例えばＬＳＴＭ１０２に入力される回数に依存して定まるベクトルを示す。補助情報は、例えば１つ前のタイミングの入力以前に得られた領域特徴を示してもよい。補助情報は、例えば１つ前のタイミングの入力以前に得られた複数の領域特徴の分布の統計量を示してもよい。補助情報は、例えば画像特徴と同じ次元数ｄを持ち、各要素の値が以下の式（１）及び（２）により規定されるベクトル（以下「補助ベクトル」という。）を示してもよい。 Auxiliary information is information that is input to the LSTM 102 . Auxiliary information indicates a value corresponding to the timing of input to the LSTM 102 . The value indicated by the auxiliary information is a value that follows a predetermined distribution with non-zero variance. The auxiliary information indicates, for example, a vector determined depending on the number of times it is input to the LSTM 102 . The auxiliary information may indicate, for example, area features obtained before the previous timing input. The auxiliary information may indicate, for example, a statistic of the distribution of a plurality of area features obtained before the previous timing input. The auxiliary information may indicate a vector (hereinafter referred to as "auxiliary vector") having the same number of dimensions d as the image feature and whose element values are defined by the following equations (1) and (2).

ａ（ｉ、ｋ）はｉ回目の入力時における補助ベクトルのｋ番目の要素の値を表す。ｉ番目の入力時とは、補助情報がＬＳＴＭ１０２に入力されるｉ回目のタイミングを意味する。ｃは定数である。定数ｃの値は、例えば１００００である。 a(i, k) represents the value of the k-th element of the auxiliary vector at the i-th input. The i-th input time means the i-th timing at which the auxiliary information is input to the LSTM 102 . c is a constant. The value of constant c is 10000, for example.
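Equations (1) and (2) are not reproduced in this text. Purely as an illustration, the sketch below assumes they take the sinusoidal form widely used for positional encodings, a(i, 2m) = sin(i / c^(2m/d)) and a(i, 2m+1) = cos(i / c^(2m/d)). That form, the function name `auxiliary_vector`, and the zero-based element index k are assumptions and are not taken from the specification; only the dimension d, the input count i, and the constant c = 10000 come from the description above.

```python
import math


def auxiliary_vector(i: int, d: int, c: float = 10000.0) -> list:
    """Return a d-dimensional auxiliary vector for the i-th input (assumed sinusoidal form)."""
    vec = []
    for k in range(d):
        # Elements 2m and 2m+1 share one frequency; this pairing is an assumption.
        freq = 1.0 / (c ** ((2 * (k // 2)) / d))
        angle = i * freq
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


a1 = auxiliary_vector(1, d=8)  # auxiliary vector at the first input
a2 = auxiliary_vector(2, d=8)  # a different vector at the second input
```

Under this assumed form the vector depends only on the input count i, so it can be recomputed at each pass instead of being stored.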

以下説明の簡単のため、補助情報が、補助ベクトルを示す場合を例に予測システム１００を説明する。 For simplicity of explanation, the prediction system 100 will be described with an example in which the auxiliary information indicates an auxiliary vector.

＜補助情報が奏する効果について＞ <Effects of auxiliary information>
補助情報が奏する効果について説明する。補助情報の奏する効果の説明のため、まずは予測器１０が備えるＬＳＴＭ（すなわちＬＳＴＭ１０２）について説明する。ＬＳＴＭ１０２は、ＣＮＮ１０１の出力である画像特徴を入力として受け取り、領域特徴を出力する。入力された１つの画像における注視領域の数は必ずしも一つとは限らない。また、入力される画像によってその数は変化しうる。 The effect of the auxiliary information is explained below. To explain it, the LSTM provided in the predictor 10 (that is, the LSTM 102) is described first. The LSTM 102 receives the image feature output by the CNN 101 as input and outputs a region feature. The number of focus-of-attention regions in a single input image is not necessarily one, and it may vary from one input image to another.

予測器１０は、ＬＳＴＭ１０２に画像特徴が複数回入力されることにより、１又は複数の注視領域を出力する。ＬＳＴＭは再帰的ニューラルネットの一種であり、内部に状態変数を持ち、入力と状態変数の双方に基づいて出力が決定されるニューラルネットワークである。ＬＳＴＭの状態変数は、入力を受けるたびに更新される。そのため、ＬＳＴＭ１０２に同一の画像特徴が入力された場合であっても、入力のたびにその出力は変化し得る。したがって、ＬＳＴＭを用いる予測器１０は、同一の画像データから得られた画像特徴が複数回入力された場合であっても、入力のたびに必ずしも同一ではない注視領域を予測することが可能である。 The predictor 10 outputs one or more focus-of-attention regions as the image feature is input to the LSTM 102 multiple times. An LSTM is a type of recurrent neural network: it holds internal state variables, and its output is determined by both the input and the state variables. The state variables of the LSTM are updated each time an input is received. Therefore, even if the same image feature is input to the LSTM 102, its output may change with each input. Consequently, even when an image feature obtained from the same image data is input multiple times, the predictor 10 using the LSTM can predict a focus-of-attention region that is not necessarily the same on each input.

補助情報が用いられる場合、入力された画像特徴が過去の学習時の画像特徴と同一であったとしても、過去に予測器１０に入力された情報と異なる情報が予測器１０に入力される。その結果、予測器１０に入力された画像特徴が過去の学習時の画像特徴と同一であったとしても、過去の結果とは異なる結果が予測器１０から出力される頻度が高まる。このように、補助情報は、予測の結果が同一でない頻度を高める効果を奏する。 When the auxiliary information is used, even if the input image feature is the same as the image feature at the time of past learning, information different from the information input to the predictor 10 in the past is input to the predictor 10 . As a result, even if the image feature input to the predictor 10 is the same as the image feature at the time of past learning, the frequency at which the predictor 10 outputs a result different from the past result increases. In this way, the auxiliary information has the effect of increasing the frequency of non-identical prediction results.
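The behaviour described above (the same image feature producing different outputs on successive inputs) can be illustrated with a toy stateful cell. This is a minimal sketch, not an LSTM: the class `ToyRecurrentCell`, its single scalar state, and its weights are hypothetical stand-ins chosen only to show how an internal state and a step-dependent auxiliary value change the output.

```python
import math


class ToyRecurrentCell:
    """Single-unit recurrent cell standing in for the LSTM 102 (illustrative only)."""

    def __init__(self, w_in: float = 0.5, w_state: float = 0.8, w_aux: float = 0.3):
        self.w_in, self.w_state, self.w_aux = w_in, w_state, w_aux
        self.state = 0.0  # internal state variable, updated on every input

    def step(self, image_feature: float, aux: float) -> float:
        # The output depends on the input, the auxiliary value, and the previous state.
        self.state = math.tanh(
            self.w_in * image_feature + self.w_state * self.state + self.w_aux * aux
        )
        return self.state


cell = ToyRecurrentCell()
f = 1.0  # the same image feature is fed at every step
outputs = [cell.step(f, aux=math.sin(i)) for i in range(1, 4)]
```

Although `f` never changes, the three values in `outputs` differ because both the state and the auxiliary value change from step to step.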

このように予測器１０は例えば、画像データと補助情報とに基づき、画像データが示す画像には写らない空間内の注視領域を予測し、予測した結果を予め定められた予測結果表現形式で出力する回路である。以下、画像データに少なくとも基づき画像データが示す画像には写らない空間内の注視領域を予測する処理を、予測処理という。 In this manner, the predictor 10 is, for example, a circuit that predicts, based on image data and auxiliary information, a focus-of-attention region in a space that does not appear in the image indicated by the image data, and outputs the prediction result in a predetermined prediction result representation format. Hereinafter, the process of predicting, based at least on image data, a focus-of-attention region in a space that does not appear in the image indicated by the image data is referred to as the prediction process.

領域特徴は、予測ＦＣ１０３及び停止判定ＦＣ１０４に入力される。予測ＦＣ１０３は、入力された領域特徴に基づき注視領域を出力する。予測ＦＣ１０３の出力の形式（すなわち予測結果表現形式）は、予め定められた形式である。予測結果表現形式は、例えば予測ＦＣ１０３に入力される領域特徴に応じてユーザが予め定めた形式である。 Area features are input to the prediction FC 103 and stop determination FC 104 . The prediction FC 103 outputs a gaze area based on the input area feature. The output format of the prediction FC 103 (that is, the prediction result expression format) is a predetermined format. The prediction result representation format is a format predetermined by the user according to the area feature input to the prediction FC 103, for example.

予測ＦＣ１０３は、例えば注視領域を図２に示す例のように注視点が座標値として表現される場合には、２次元の座標値（ｘ、ｙ）を出力する。予測ＦＣ１０３は、例えば図４に示す例のように点の位置が離散化されて表現される場合には、各離散化領域に注視領域が存在する確率を出力する。予測ＦＣ１０３は、例えば図５に示す例のように注視領域の位置及び範囲が分布を用いて表現されている場合には、その分布のパラメータを出力する。予測ＦＣ１０３が出力する分布のパラメータは、例えば図５の例であれば、平均ｘ及びｙの値と、分散σ^２の値とである。 The prediction FC 103 outputs two-dimensional coordinate values (x, y) when, for example, the focus-of-attention region is expressed as the coordinate values of a gaze point as in the example shown in FIG. 2. When the positions of the points are discretized as in the example shown in FIG. 4, the prediction FC 103 outputs, for each discretized region, the probability that a focus-of-attention region exists there. When the position and extent of the focus-of-attention region are expressed using a distribution as in the example shown in FIG. 5, the prediction FC 103 outputs the parameters of that distribution. In the example of FIG. 5, these parameters are the means x and y and the variance σ^2.

停止判定ＦＣ１０４は、領域特徴に基づき、予測処理を停止するか否かを判定する。より具体的には、停止判定ＦＣ１０４は、領域特徴に基づき、予測処理を停止するか否かを表す二値の値である停止判定情報ｔを出力する。停止判定ＦＣ１０４が予測処理を停止する条件は学習により得られる。学習により得られるとは、例えば損失関数が最小化されるように条件が更新されることで得られることを意味する。 The stop judgment FC 104 judges whether or not to stop the prediction process based on the area feature. More specifically, the stop determination FC 104 outputs stop determination information t, which is a binary value indicating whether or not to stop the prediction process, based on the area feature. A condition for the stop judgment FC 104 to stop the prediction process is obtained by learning. Obtained by learning means obtained by, for example, updating the conditions so that the loss function is minimized.

予測器１０の構成がＬＳＴＭに画像特徴が複数回入力されることによって１又は複数の注視領域を出力する構成である場合、１又は複数の注視領域が出力される。しかしながら、上述したように何回出力を得るべきかについては自明ではない。そこで、停止判定ＦＣ１０４が領域特徴に基づき状況に応じた判定を行う。状況に応じたとは具体的には、予測処理の過程で生じた情報に基づいて、ということを意味する。 If the configuration of the predictor 10 is such that the image features are input to the LSTM multiple times to output one or more regions of interest, then one or more regions of interest are output. However, it is not obvious how many times the output should be obtained as described above. Therefore, the stop judgment FC 104 makes a judgment according to the situation based on the region feature. Contextual specifically means based on information generated during the course of the prediction process.

なお、出力を得る回数は、例えば予め定められた回数（以下「最大予測回数」という。）Ｔであってもよい。このような場合、停止判定ＦＣ１０４は、例えば出力の得られた回数が最大予測回数Ｔに到達した場合に予測処理を停止すると判定し、出力の得られた回数が最大予測回数Ｔ未満である場合に予測処理を停止しないと判定する。 The number of times an output is obtained may be, for example, a predetermined number T (hereinafter referred to as the "maximum number of predictions"). In that case, the stop determination FC 104 determines, for example, that the prediction process is to be stopped when the number of outputs obtained has reached the maximum number of predictions T, and that it is not to be stopped when the number of outputs obtained is less than T.

予測器１０に入力される画像データごとに適切な予測の回数（すなわち注視領域の数）は異なる。そのため、停止判定ＦＣ１０４による予測処理を停止するか否かの判定は、予め定められた最大予測回数Ｔを用いた判定よりも、上述の、領域特徴に基づいた判定の方が好ましい。 The appropriate number of predictions (that is, the number of focus-of-attention regions) differs for each piece of image data input to the predictor 10. For the stop determination FC 104, deciding whether to stop the prediction process based on the region feature, as described above, is therefore preferable to deciding based on the predetermined maximum number of predictions T.

図７は、実施形態における予測器１０が実行する処理の流れの一例を示すフローチャートである。ＣＮＮ１０１に画像データが入力される（ステップＳ１０１）。画像データは、例えばカメラで撮影された画像の画像データである。画像データは、例えば映像から抽出されたフレームの画像データであってもよく、例えば、自動車の車載カメラなどによってキャプチャされた映像のフレームの画像データであってもよい。 FIG. 7 is a flow chart showing an example of the flow of processing executed by the predictor 10 in the embodiment. Image data is input to CNN 101 (step S101). Image data is, for example, image data of an image captured by a camera. The image data may be, for example, image data of a frame extracted from a video, or may be image data of a frame of a video captured, for example, by an on-board camera of an automobile.

次に、予測器制御部２１１が、予測処理の実行が開始された回数ｉを１に設定する（ステップＳ１０２）。回数ｉを１に設定するとは、補助情報を初期化することを意味する。次にＣＮＮ１０１が入力された画像の画像データに基づき、画像特徴ｆを取得する（ステップＳ１０３）。次に、画像特徴ｆ及び補助情報ａｉに基づきＬＳＴＭ１０２が領域特徴を取得する（ステップＳ１０４）。次に、領域特徴に基づき予測ＦＣ１０３が注視領域を予測する（ステップＳ１０５）。次に、予測ＦＣ１０３は、予測の結果を示す予測結果Ｏｉを出力する（ステップＳ１０６）。 Next, the predictor control unit 211 sets the number i of times the execution of prediction processing is started to 1 (step S102). Setting the number i to 1 means initializing the auxiliary information. Next, based on the image data of the input image, the CNN 101 acquires the image feature f (step S103). Next, the LSTM 102 acquires area features based on the image feature f and the auxiliary information ai (step S104). Next, the prediction FC 103 predicts the region of interest based on the region features (step S105). Next, the prediction FC 103 outputs a prediction result Oi indicating the prediction result (step S106).

次に、停止判定ＦＣ１０４が領域特徴に基づき予測処理を停止するか否かを判定する（ステップＳ１０７）。次に、停止判定ＦＣ１０４が、判定の結果を示す情報（すなわち停止判定情報ｔｉ）を出力する（ステップＳ１０８）。次に、予測器制御部２１１が、停止判定情報に基づき予測終了条件が満たされるか否かを判定する（ステップＳ１０９）。予測終了条件は、予測処理の終了に関する条件であって少なくとも停止判定情報に基づく条件である。予測終了条件は例えば、停止判定情報tiが停止を示すという条件と、予測処理の実行が開始された回数が最大予測回数Ｔ以上（すなわち、i≧T）であるという条件とのいずれか一方が満たされる、という条件である。 Next, the stop determination FC 104 determines, based on the region feature, whether or not to stop the prediction process (step S107). Next, the stop determination FC 104 outputs information indicating the result of the determination (that is, stop determination information ti) (step S108). Next, the predictor control unit 211 determines, based on the stop determination information, whether or not the prediction end condition is satisfied (step S109). The prediction end condition is a condition on ending the prediction process and is based at least on the stop determination information. For example, the prediction end condition is satisfied when either of the following holds: the stop determination information ti indicates stop, or the number of times execution of the prediction process has been started is equal to or greater than the maximum number of predictions T (that is, i ≥ T).

予測終了条件が満たされる場合、処理が終了する。一方、予測終了条件が満たされない場合、予測器制御部２１１は、予め定められた所定の更新の規則にしたがい補助情報ａｉを更新する（ステップＳ１１０）。ステップＳ１１０の次に、ステップＳ１０４の処理に戻る。 If the prediction end condition is satisfied, the process ends. If the prediction end condition is not satisfied, the predictor control unit 211 updates the auxiliary information ai according to a predetermined update rule (step S110). After step S110, the process returns to step S104.

なお、ステップＳ１０６はステップＳ１０５の実行後であってステップＳ１０９の実行前に実行されればどのようなタイミングで実行されてもよい。なお、ステップＳ１０８はステップＳ１０７の実行後であってステップＳ１０９の実行前に実行されればどのようなタイミングで実行されてもよい。なお、ステップＳ１０５の処理は、ステップＳ１０４の実行後であってステップＳ１０９の実行前であれば、ステップＳ１０７の処理より後に実行されてもよい。 Note that step S106 may be executed at any timing as long as it is executed after step S105 and before step S109 is executed. Note that step S108 may be executed at any timing as long as it is executed after step S107 and before step S109 is executed. Note that the process of step S105 may be performed after the process of step S107 as long as it is after the process of step S104 and before the process of step S109.

なお、最大予測回数Ｔは、任意の正数であり、例えばＴ＝５である。最大予測回数は任意の正数でよいが、学習装置１により予測モデルの学習時に用いられる教師データが示す正解の注視領域の数の最大値よりも大きな値であることが望ましい。 Note that the maximum number of predictions T is any positive number, for example, T=5. The maximum number of predictions may be any positive number, but it is desirable that it be a value larger than the maximum number of correct fixation regions indicated by the teacher data used by the learning device 1 when learning the prediction model.
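The flow of FIG. 7, including the cap imposed by the maximum number of predictions T, can be sketched as a simple loop. In this sketch the CNN, LSTM, and two FC layers are collapsed into a single hypothetical callable `step` that returns a prediction and a stop flag; `run_prediction`, `fake_step`, and the point-style output are illustrative names and choices, not part of the specification.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def run_prediction(
    step: Callable[[int], Tuple[Point, bool]],
    max_predictions: int = 5,
) -> List[Point]:
    """Repeat prediction until the stop flag is set or i reaches T (max_predictions)."""
    results: List[Point] = []
    i = 1  # S102: initialise the counter (and with it the auxiliary information)
    while True:
        output, stop = step(i)  # S104 to S108: prediction result O_i and stop flag t_i
        results.append(output)
        if stop or i >= max_predictions:  # S109: prediction end condition
            return results
        i += 1  # S110: move on to the auxiliary information for the next pass


# Hypothetical stand-in for the CNN/LSTM/FC stack: asks to stop on the third pass.
def fake_step(i: int) -> Tuple[Point, bool]:
    return (float(i), float(-i)), i == 3


regions = run_prediction(fake_step)  # ends because t_3 indicates stop
capped = run_prediction(lambda i: ((0.0, 0.0), False), max_predictions=5)  # ends at i = T
```

The second call shows why T acts as a safety cap: if the stop flag is never raised, the loop still ends after T predictions.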

このように、予測器１０は画像特徴抽出処理、予測処理及び停止判定処理を実行する。画像特徴抽出処理は、画像データに基づき画像特徴を取得する処理である。図７の例ではステップＳ１０３の処理である。予測処理は、図７の例では、ステップＳ１０４、ステップＳ１０５及びステップＳ１０６の一連の流れが示す処理である。停止判定処理は、画像データに少なくとも基づき予測処理を停止するか否かを判定する処理である。停止判定処理は、図７の例では、ステップＳ１０４、ステップＳ１０７及びステップＳ１０８の一連の流れが示す処理である。 Thus, the predictor 10 executes an image feature extraction process, the prediction process, and a stop determination process. The image feature extraction process acquires an image feature based on image data; in the example of FIG. 7, it is the process of step S103. In the example of FIG. 7, the prediction process is the process represented by the sequence of steps S104, S105, and S106. The stop determination process determines, based at least on the image data, whether or not to stop the prediction process; in the example of FIG. 7, it is the process represented by the sequence of steps S104, S107, and S108.

このような処理により、予測器１０は、最大Ｔ回の予測処理を通じ、最大Ｔ個の予測の結果を得ることができる。以降、予測器１０による予測の結果の数をＮと表す。 Through such a process, the predictor 10 can obtain a maximum of T prediction results through a maximum of T prediction processes. Hereinafter, the number of prediction results by the predictor 10 is represented as N.

＜学習装置１が学習済みの予測モデルを取得する方法＞ <Method by which the learning device 1 acquires a trained prediction model>
学習装置１は、訓練用データを用いて予測モデルの学習を行う。訓練用データは、画像データと正解注視領域の集合を示すデータ（以下「正解注視領域データ」という。）との対のデータである。正解注視領域は対応する画像データが示す画像には写らない空間内の注視領域である。そのため正解注視領域の集合は、対応する画像に含まれる注視領域の集合である。 The learning device 1 trains the prediction model using training data. The training data is a pair of image data and data indicating a set of correct focus-of-attention regions (hereinafter referred to as "correct focus-of-attention region data"). A correct focus-of-attention region is a focus-of-attention region in a space that does not appear in the image indicated by the corresponding image data. The set of correct focus-of-attention regions is therefore the set of focus-of-attention regions belonging to the corresponding image.

より具体的には訓練用データＤは、画像データを学習データ（すなわち説明変数側のデータ）とし、正解注視領域データを教師データ（すなわち目的変数側のデータ）として含むデータである。学習データの画像データが示す画像は、注視領域を１つだけ含む画像であってもよいし、複数含む画像であってもよい。以下、学習データの画像データを画像データＩと表し、正解注視領域の集合を集合｛Ｓｊ｝（ｊ＝１、・・・、Ｍ）と表す。したがって、訓練用データは集合Ｄ＝｛（Ｉ、｛Ｓｊ｝）｝である。そこで以下、訓練用データを訓練用データＤと表す。 More specifically, the training data D is data including image data as learning data (that is, data on the explanatory variable side) and correct gaze region data as teacher data (that is, data on the objective variable side). The image indicated by the image data of the learning data may be an image including only one attention area, or may be an image including a plurality of attention areas. Hereinafter, image data of learning data will be referred to as image data I, and a set of correct fixation regions will be referred to as a set {Sj} (j=1, . . . , M). Therefore, the training data is the set D={(I, {Sj})}. Therefore, the training data will be referred to as training data D hereinafter.

予測モデルの学習の方法は、訓練用データを用いた学習の方法であればどのような方法であってもよく、例えば強化学習の方法であってもよい。以下、強化学習による予測モデルの学習の処理（以下「予測モデル強化学習処理」という。）の一例を説明する。 The method of learning the prediction model may be any method as long as it is a method of learning using training data, and may be, for example, a method of reinforcement learning. An example of processing for learning a prediction model by reinforcement learning (hereinafter referred to as “prediction model reinforcement learning processing”) will be described below.

予測モデル強化学習処理は、強化予測処理と、報酬取得処理と、強化予測学習処理と、停止判定学習処理と、を含む。各処理の説明の前に、強化学習の概略を説明する。 The prediction model reinforcement learning process includes a reinforcement prediction process, a reward acquisition process, a reinforcement prediction learning process, and a stop determination learning process. Before describing each process, an outline of reinforcement learning will be described.

強化学習は、ある状況下での行動を決定するエージェントの最適な行動決定方策を学習する学習方法である。強化学習は、一連のエージェントの行動の結果、もたらされた最終的な状況が望ましいものであるか否かに応じた報酬を規定することによって、エージェントを学習させる学習方法である。 Reinforcement learning is a learning method that learns the optimal action decision policy of an agent that decides actions under a certain situation. Reinforcement learning is a learning method that makes an agent learn by prescribing a reward depending on whether the final situation brought about as a result of a series of actions of the agent is desirable or not.

予測モデル強化学習処理では、強化学習におけるエージェントとして予測器１０を用いる。予測モデル強化学習処理では、強化学習における行動として注視領域の予測と、予測処理の停止の判定と、を用いる。予測モデル強化学習処理では、強化学習における状況として、画像データＩと、過去の予測の結果の正誤と、停止の判定の結果の正誤と、を用いる。 In the prediction model reinforcement learning process, the predictor 10 is used as an agent in reinforcement learning. In the prediction model reinforcement learning process, prediction of the region of interest and determination of termination of the prediction process are used as actions in reinforcement learning. In the prediction model reinforcement learning process, the image data I, the correct/wrong result of the past prediction, and the correct/wrong result of the stop determination are used as the situation in the reinforcement learning.

上述したように強化学習は、試行錯誤による探索型の学習である。より具体的には、強化学習は、更新の対象となる数理モデルを用いて結果を得た後、得られた結果に基づき報酬を算出し、報酬に基づき更新の対象の数理モデルを更新する。したがって、強化学習ではまず、更新の対象となる数理モデルを用いて結果を得る処理が行われる。予測モデル強化学習処理における、更新の対象となる数理モデルを用いて結果を得る処理、が強化予測処理である。 As described above, reinforcement learning is trial-and-error search-type learning. More specifically, in reinforcement learning, after obtaining a result using a mathematical model to be updated, a reward is calculated based on the obtained result, and the mathematical model to be updated is updated based on the reward. Therefore, in reinforcement learning, first, processing is performed to obtain a result using a mathematical model to be updated. Reinforcement prediction processing is the processing of obtaining results using a mathematical model to be updated in the prediction model reinforcement learning processing.

したがって、強化予測処理は、予測器１０を用いて予測処理及び停止判定処理を実行する処理である。 Therefore, the reinforcement prediction process is a process that executes the prediction process and the stop determination process using the predictor 10.

図８は、実施形態における強化予測処理の流れの一例を示すフローチャートである。後述する学習制御部１１１が、訓練用データＤを取得する（ステップＳ２０１）。次に、学習制御部１１１が、強化予測処理の実行が開始された回数ｉを１に設定する（ステップＳ２０２）。回数ｉを１に設定するとは、補助情報を初期化することを意味する。次にＣＮＮ１０１が入力された画像の画像データに基づき、画像特徴ｆを取得する（ステップＳ２０３）。次に、画像特徴ｆ及び補助情報ａｉに基づきＬＳＴＭ１０２が領域特徴を取得する（ステップＳ２０４）。次に、領域特徴に基づき予測ＦＣ１０３が注視領域を予測する（ステップＳ２０５）。次に、予測ＦＣ１０３が予測の結果を示す予測結果Ｏｉを出力する。出力された予測結果Ｏｉは学習制御部１１１により後述の記憶部１３等の所定の記憶装置に記録される（ステップＳ２０６）。 FIG. 8 is a flowchart showing an example of the flow of the reinforcement prediction process in the embodiment. The learning control unit 111, described later, acquires the training data D (step S201). Next, the learning control unit 111 sets the number of times i that execution of the reinforcement prediction process has been started to 1 (step S202). Setting i to 1 means initializing the auxiliary information. Next, the CNN 101 acquires the image feature f based on the image data of the input image (step S203). Next, the LSTM 102 acquires a region feature based on the image feature f and the auxiliary information ai (step S204). Next, the prediction FC 103 predicts a focus-of-attention region based on the region feature (step S205). Next, the prediction FC 103 outputs a prediction result Oi indicating the result of the prediction. The learning control unit 111 records the output prediction result Oi in a predetermined storage device such as the storage unit 13 described later (step S206).

次に、停止判定ＦＣ１０４が領域特徴に基づき予測処理を停止するか否かを判定する（ステップＳ２０７）。次に、停止判定ＦＣ１０４が、判定の結果を示す情報（すなわち停止判定情報ｔｉ）を出力する。出力された停止判定情報ｔｉは学習制御部１１１により所定の記憶装置に記録される（ステップＳ２０８）。次に、学習制御部１１１が、停止判定情報に基づき強化予測終了条件が満たされるか否かを判定する（ステップＳ２０９）。強化予測終了条件は、強化予測処理の終了に関する条件であって少なくとも停止判定情報に基づく条件である。強化予測終了条件は、停止させるか否かの判定の対象の処理が予測処理に代えて強化予測処理である点で予測終了条件と異なる条件である。強化予測終了条件は例えば、停止判定情報tiが停止を示すという条件と、強化予測処理の実行が開始された回数が最大予測回数Ｔ以上（すなわち、i≧T）であるという条件とのいずれか一方が満たされる、という条件である。 Next, the stop determination FC 104 determines, based on the region feature, whether or not to stop the prediction process (step S207). Next, the stop determination FC 104 outputs information indicating the result of the determination (that is, stop determination information ti). The learning control unit 111 records the output stop determination information ti in a predetermined storage device (step S208). Next, the learning control unit 111 determines, based on the stop determination information, whether or not the reinforcement prediction end condition is satisfied (step S209). The reinforcement prediction end condition is a condition on ending the reinforcement prediction process and is based at least on the stop determination information. It differs from the prediction end condition in that the process whose stopping is being decided is the reinforcement prediction process rather than the prediction process. For example, the reinforcement prediction end condition is satisfied when either of the following holds: the stop determination information ti indicates stop, or the number of times execution of the reinforcement prediction process has been started is equal to or greater than the maximum number of predictions T (that is, i ≥ T).

強化予測終了条件が満たされる場合、処理が終了する。一方、強化予測終了条件が満たされない場合、学習制御部１１１は、予め定められた所定の更新の規則にしたがい補助情報ａｉを更新する（ステップＳ２１０）。ステップＳ２１０の次に、ステップＳ２０４の処理に戻る。 If the reinforcement prediction end condition is satisfied, the process ends. If the reinforcement prediction end condition is not satisfied, the learning control unit 111 updates the auxiliary information ai according to a predetermined update rule (step S210). After step S210, the process returns to step S204.

なお、ステップＳ２０６はステップＳ２０５の実行後であってステップＳ２０９の実行前に実行されればどのようなタイミングで実行されてもよい。なお、ステップＳ２０８はステップＳ２０７の実行後であってステップＳ２０９の実行前に実行されればどのようなタイミングで実行されてもよい。なお、ステップＳ２０５の処理は、ステップＳ２０４の実行後であってステップＳ２０９の実行前であれば、ステップＳ２０７の処理より後に実行されてもよい。 Note that step S206 may be executed at any timing as long as it is executed after step S205 and before step S209 is executed. Note that step S208 may be executed at any timing as long as it is executed after step S207 and before step S209 is executed. Note that the process of step S205 may be performed after the process of step S207 as long as it is after the process of step S204 and before the process of step S209.

図８の処理は、予め用意された全ての訓練データＤに対して実行される。 The processing of FIG. 8 is executed for all training data D prepared in advance.

報酬取得処理は、正誤判定処理を含む。正誤判定処理は、強化予測処理によって得られた予測結果の集合｛Ｏｉ｝が正解注視領域の集合｛Ｓｊ｝を正しく予測できたか否かを判定する処理である。報酬取得処理では、正誤判定処理の実行の後に、正誤判定処理の結果に基づいて、報酬が取得される。報酬の取得は例えば演算により取得される。 The reward acquisition process includes a correctness determination process. The correctness determination process determines whether the set of prediction results {Oi} obtained by the reinforcement prediction process correctly predicted the set of correct focus-of-attention regions {Sj}. In the reward acquisition process, the reward is acquired based on the result of the correctness determination process after that process has been executed. The reward is obtained, for example, by computation.

図９は、実施形態における報酬取得処理の流れの一例を示すフローチャートである。学習制御部１１１が成功予測数Ｑを初期化する（ステップＳ３０１）。初期化の結果、成功予測数Ｑには０が代入される。次に、学習制御部１１１は、強化予測処理によって得られた予測結果の集合｛Ｏｉ｝のうち未だ予測の成否が判定されていない１つの予測結果Ｏｉを選択する（ステップＳ３０２）。次に、学習制御部１１１は、ステップＳ３０２で選択された予測結果Ｏｉの予測の成否を判定する（ステップＳ３０３）。予測の成否の判定とは、具体的には、正解注視領域｛Ｓｊ｝のうちの少なくとも１つを予測できたか否かを判定することを意味する。正解注視領域｛Ｓｊ｝のうちの少なくとも１つを予測できた場合、予測は成功であり、正解注視領域｛Ｓｊ｝のいずれも予測できなかった場合、予測が成功しなかった（すなわち否である）ことを意味する。 FIG. 9 is a flowchart showing an example of the flow of the reward acquisition process in the embodiment. The learning control unit 111 initializes the number of successful predictions Q (step S301). As a result of the initialization, Q is set to 0. Next, the learning control unit 111 selects, from the set of prediction results {Oi} obtained by the reinforcement prediction process, one prediction result Oi whose success or failure has not yet been determined (step S302). Next, the learning control unit 111 determines whether the prediction of the prediction result Oi selected in step S302 succeeded (step S303). Specifically, this means determining whether at least one of the correct focus-of-attention regions {Sj} was predicted. If at least one of the correct focus-of-attention regions {Sj} was predicted, the prediction is a success; if none of them was predicted, the prediction is a failure.

予測の成否の判定の方法は、予測結果表現形式に依存する。一例として図２又は図３のように注視領域が注視点によって与えられている場合について、予測の成否の判定の方法の一例を説明する。この場合、Ｏｉ、｛Ｓｊ｝共に点を表している。そのため、｛Ｓｊ｝の中からＯｉの距離が最も近いものをＳ＊と決定し、Ｓ＊とＯｉとの距離が一定以下であれば予測の成功と判定し、一定より大きければ予測の失敗と判定する方法で予測の成否は判定される。 The method of determining the success or failure of a prediction depends on the prediction result representation format. As one example, consider the case where the focus-of-attention region is given by a gaze point as in FIG. 2 or FIG. 3. In this case, both Oi and {Sj} represent points. The element of {Sj} closest to Oi is therefore taken as S*; the prediction is judged a success if the distance between S* and Oi is at most a fixed threshold, and a failure if it is greater.

予測の成否の判定の方法の他の例として図４のように離散化された領域になっている場合について、予測の成否の判定の方法の一例を説明する。この場合、Ｏｉと｛Ｓｊ｝のうちの少なくとも一つのＳ＊と、が同一の領域を示している場合に予測の成功と判定し、示していない場合に予測の失敗と判定する方法で予測の成否は判定される。また、図５のように分布によって表現されている場合には、Ｏｉの分布が覆う領域と｛Ｓｊ｝のうちの少なくとも一つＳ＊が覆う領域との重なりが一定以上である場合に予測の成功、一定未満である場合に予測の失敗と判定する方法で予測の成否は判定される。 As another example, consider the case where the regions are discretized as in FIG. 4. In this case, the prediction is judged a success if Oi and at least one element S* of {Sj} indicate the same region, and a failure otherwise. When the regions are expressed by distributions as in FIG. 5, the prediction is judged a success if the overlap between the region covered by the distribution of Oi and the region covered by at least one element S* of {Sj} is at least a fixed amount, and a failure if it is less.
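The success judgements for the gaze-point format and the discretized format can be sketched as two small matching functions. The names `match_point` and `match_cell`, the use of Euclidean distance, the integer cell identifiers, and the convention of returning the index of the matched correct region (or `None` on failure) are assumptions made for illustration. The distribution format of FIG. 5 is omitted because the specification does not fix how the overlap is measured.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def match_point(pred: Point, targets: List[Point], max_dist: float) -> Optional[int]:
    """Return the index of the nearest correct point S* if it lies within max_dist, else None."""
    if not targets:
        return None
    nearest = min(range(len(targets)), key=lambda j: math.dist(pred, targets[j]))
    return nearest if math.dist(pred, targets[nearest]) <= max_dist else None


def match_cell(pred_cell: int, target_cells: List[int]) -> Optional[int]:
    """Return the index of a correct region in the same discretized cell, else None."""
    return target_cells.index(pred_cell) if pred_cell in target_cells else None


hit = match_point((0.0, 0.0), [(3.0, 4.0), (1.0, 0.0)], max_dist=1.5)  # nearest is index 1
miss = match_point((0.0, 0.0), [(3.0, 4.0)], max_dist=1.5)             # distance 5.0 exceeds 1.5
```

Returning the index rather than a boolean makes it straightforward to remove the matched region from {Sj} afterwards.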

ステップＳ３０３の次に、学習制御部１１１は、成功予測数Ｑを更新するとともに、予測結果Ｏｉによって予測された正解注視領域を正解注視領域の集合｛Ｓｊ｝から取り除く（ステップＳ３０４）。成功予測数Ｑの更新は、具体的には成功予測数の値を１増加させる処理である。 After step S303, the learning control unit 111 updates the number Q of successful predictions, and removes the correct fixation region predicted by the prediction result Oi from the correct fixation region set {Sj} (step S304). Specifically, updating the predicted success number Q is a process of increasing the value of the predicted success number by one.

次に学習制御部１１１は、報酬取得条件が満たされたか否かを判定する（ステップＳ３０５）。報酬取得条件は、強化予測処理によって得られた予測結果全ての予測結果Ｏｉについて予測の成否が判定されたという条件と、正解注視領域の集合｛Ｓｊ｝が空集合であるという条件と、の少なくとも一方が満たされるという条件である。なお、予測の成否の判定の処理は、具体的には、ステップＳ３０３の処理である。 Next, the learning control unit 111 determines whether or not the reward acquisition condition is satisfied (step S305). The reward acquisition condition is satisfied when at least one of the following holds: success or failure has been determined for every prediction result Oi obtained by the reinforcement prediction process, or the set of correct focus-of-attention regions {Sj} is empty. The determination of success or failure is, specifically, the process of step S303.

報酬取得条件が満たされない場合（ステップＳ３０５：ＮＯ）、ステップＳ３０２の処理に戻る。一方、報酬取得条件が満たされた場合（ステップＳ３０５：ＹＥＳ）、学習制御部１１１は予め定義された報酬の値を取得する（ステップＳ３０６）。 If the reward acquisition condition is not satisfied (step S305: NO), the process returns to step S302. On the other hand, if the reward acquisition condition is satisfied (step S305: YES), the learning control unit 111 acquires a predefined reward value (step S306).
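Steps S301 to S305 of FIG. 9 amount to counting how many predictions hit a correct region that has not already been matched. The sketch below assumes that Q is incremented and a region removed only when the prediction is judged a success, which the description of step S304 implies but does not state outright. `count_successes`, the `match` callback, and the `same_cell` matcher are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Pred = TypeVar("Pred")
Target = TypeVar("Target")


def count_successes(
    predictions: Sequence[Pred],
    targets: Sequence[Target],
    match: Callable[[Pred, List[Target]], Optional[int]],
) -> int:
    """Return Q, the number of predictions that hit a not-yet-matched correct region."""
    remaining = list(targets)  # working copy of {S_j}
    q = 0                      # S301: initialise Q
    for pred in predictions:   # S302: pick the next unjudged prediction O_i
        j = match(pred, remaining)  # S303: judge success or failure
        if j is not None:
            q += 1             # S304: count the success
            remaining.pop(j)   # S304: remove the matched region from {S_j}
        if not remaining:      # S305: every correct region has been predicted
            break
    return q


# Hypothetical matcher for the discretized format: the same cell means success.
def same_cell(pred: int, cells: List[int]) -> Optional[int]:
    return cells.index(pred) if pred in cells else None


q = count_successes([4, 4, 7], [4, 7], same_cell)
```

Because a matched region is removed, the repeated prediction of cell 4 in the usage line is not counted twice; only the first hit and the later hit on cell 7 contribute to Q.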

＜報酬について＞ <About the reward>
報酬について説明する。報酬は、例えば以下の式（３）で定義される量Ｒ_ｐｒｅｄである。 The reward is described below. The reward is, for example, the quantity R_pred defined by the following equation (3).

望ましい予測モデルは、報酬が適切であればあるほど得られる確率が高まる。また、複数ある正解注視領域のうち、できる限り多くの注視領域を、なるべく少ない予測回数で、より正確に予測できる学習を実行することが好ましい。したがって、できる限り多くの注視領域を、なるべく少ない予測回数で、より正確に予測できた場合により大きな報酬を与えることにより、好ましい学習が実行される確率が高まる。 The more appropriate the reward, the higher the probability of obtaining a desirable prediction model. It is also preferable for learning to proceed so that, of the multiple correct focus-of-attention regions, as many as possible are predicted as accurately as possible in as few predictions as possible. Giving a larger reward when that is achieved therefore raises the probability that favorable learning is carried out.

上記式（３）の左辺の値は、０以上１以下の値である。上記式（３）の左辺の値は、予測回数Ｎが正解注視領域数Ｍと同数で、かつ、全ての予測が成功したとき、すなわちＭ＝Ｎ＝Ｑのときに最大値１．０となる。したがって式（３）で定義される報酬Ｒ_ｐｒｅｄは、複数ある正解注視領域のうち、できる限り多くの注視領域を、なるべく少ない予測回数で、より正確に予測できた場合に高い報酬を与えるという性質を満たす。そのため、式（３）で定義される報酬Ｒ_ｐｒｅｄは、予測モデルの学習に好適である。 The value of the left-hand side of equation (3) is between 0 and 1 inclusive. It takes its maximum value of 1.0 when the number of predictions N equals the number of correct focus-of-attention regions M and all predictions succeed, that is, when M = N = Q. The reward R_pred defined by equation (3) therefore has the property of giving a high reward when, of the multiple correct focus-of-attention regions, as many as possible are predicted accurately in as few predictions as possible. For this reason, the reward R_pred defined by equation (3) is well suited to training the prediction model.
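Equation (3) is not reproduced in this text, so its exact form cannot be confirmed here. As a stand-in only, the sketch below uses Q / max(N, M), which is one function that has the properties stated above: it lies between 0 and 1, and it reaches 1.0 only when M = N = Q. The function name `prediction_reward` and this formula are assumptions, not the specification's definition of R_pred.

```python
def prediction_reward(q: int, n: int, m: int) -> float:
    """Hypothetical stand-in for R_pred: Q successes in N predictions for M correct regions."""
    if n <= 0 or m <= 0:
        return 0.0
    if not 0 <= q <= min(n, m):
        # Each success consumes one prediction and one correct region.
        raise ValueError("Q must satisfy 0 <= Q <= min(N, M)")
    return q / max(n, m)


perfect = prediction_reward(q=3, n=3, m=3)   # all three regions found in three predictions
wasteful = prediction_reward(q=2, n=5, m=3)  # extra predictions and a missed region lower the reward
```

Under this stand-in, both unnecessary predictions (N greater than M) and missed regions (Q less than M) reduce the reward, which matches the qualitative behaviour the description asks for.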

強化予測学習処理について説明する。強化予測学習処理は、報酬取得処理で得られた報酬に基づき、予測モデルを更新する。強化予測学習処理は、例えば予測器１０がＣＮＮ１０１、ＬＳＴＭ１０２、予測ＦＣ１０３及び停止判定ＦＣ１０４を備える場合、ＣＮＮ１０１、ＬＳＴＭ１０２及び予測ＦＣ１０３を報酬に基づき更新する。以下、説明の簡単のためＣＮＮ１０１、ＬＳＴＭ１０２及び予測ＦＣ１０３を結合した深層ニューラルネットワークを、φと表す。 The reinforcement prediction learning process is described next. The reinforcement prediction learning process updates the prediction model based on the reward obtained in the reward acquisition process. For example, when the predictor 10 comprises the CNN 101, the LSTM 102, the prediction FC 103, and the stop determination FC 104, the reinforcement prediction learning process updates the CNN 101, the LSTM 102, and the prediction FC 103 based on the reward. Hereinafter, for simplicity, the deep neural network formed by connecting the CNN 101, the LSTM 102, and the prediction FC 103 is denoted φ.

The reinforcement prediction learning process updates the network φ based on the reward. Any reward-based method may be used for this update. For example, the update method may be the policy gradient method. When the policy gradient method is used, the weights Wφ of the network φ are updated based on equations (4) and (5) below.

α is an arbitrary positive real number representing the learning rate. For example, α is 0.01.
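As a rough illustration of a policy gradient update with learning rate α, the following sketch applies one generic REINFORCE-style step. It is not equations (4) and (5) of this description. The linear softmax policy standing in for the network φ, the function names, and the array shapes are all assumptions made for the example.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def reinforce_step(weights: np.ndarray, features: np.ndarray,
                   action: int, reward: float,
                   alpha: float = 0.01) -> np.ndarray:
    """One REINFORCE update for a hypothetical linear softmax policy.

    weights:  (K, D) array standing in for the weights of phi
    features: (D,) input feature vector
    action:   index of the candidate region that was predicted
    reward:   scalar reward obtained for the prediction
    """
    probs = softmax(weights @ features)
    # Gradient of log pi(action) w.r.t. the logits: onehot(action) - probs.
    grad_logits = -probs
    grad_logits[action] += 1.0
    grad_weights = np.outer(grad_logits, features)
    # Gradient ascent on the expected reward, scaled by the learning rate.
    return weights + alpha * reward * grad_weights
```

A positive reward raises the probability of the predicted region and a zero reward leaves the weights unchanged, which is the qualitative behavior expected of a reward-based update.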

Thus, in the prediction model reinforcement learning process, the prediction model is used to estimate gaze regions based on the image data included in the training data (that is, the reinforcement prediction process is executed). By executing the reward acquisition process and the reinforcement prediction learning process, the prediction model is updated so that the estimation result obtained by the reinforcement prediction process approaches the gaze regions indicated by the correct gaze region data included in the training data.

Training the network φ with reinforcement learning can produce a prediction model with higher estimation accuracy than training it with supervised learning. The reasons for this are explained below.

One reason is that it is not obvious which of the multiple correct gaze regions {Sj} the i-th prediction result Oi of the predictor 10 should be compared with. Another reason is that prediction by the predictor 10 is repeated until the prediction end condition is satisfied, so learning must proceed by judging the results of multiple predictions comprehensively. For example, even if the first prediction result O1 correctly indicates the first correct gaze region S1, the second and subsequent prediction results O2, ..., ON can hardly be called meaningful if they indicate exactly the same correct gaze region S1.

Thus, the quality of a prediction model cannot be judged from each prediction result individually; it must be judged from the results of multiple predictions, that is, comprehensively. Furthermore, it is desirable that the prediction model predict the correct gaze regions without excess or omission. A further reason is that it is desirable for not only the gaze region prediction results {Oi} but also the stop determination information {ti} to be highly accurate.

The stop determination learning process will now be described. The stop determination learning process updates the content of the stop determination process. For example, it updates the CNN 101, the LSTM 102, and the stop determination FC 104. The stop determination process desirably decides to stop at the timing at which all correct gaze regions have been predicted. That timing is, for example, the timing at which {Sj} becomes the empty set in step S305.

Unlike the prediction process, the stop determination process has clear teaching data for its output. Therefore, even when the stop determination process is updated by a supervised machine learning method, the resulting process is at least as accurate as one updated by reinforcement learning.

The stop determination learning process updates the CNN 101, the LSTM 102, and the stop determination FC 104 of the predictor 10. For simplicity, the deep neural network formed by combining the CNN 101, the LSTM 102, and the stop determination FC 104 is denoted ψ below. Any supervised learning method may be used in the stop determination learning process. For example, the stop determination learning process updates the content of the stop determination process by the gradient method. The stop determination learning process is described below for the case where the gradient method is used.

For simplicity, stop determination information whose content is correct is called "correct stop determination information" below, and the correct stop determination information with identifier i is denoted ui. For example, ui indicates 1 when {Sj} has become empty and 0 otherwise. The stop determination learning process updates the weights Wψ of ψ so that the loss function of equation (6) below becomes small.
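The labeling rule for ui can be sketched as follows. The sketch assumes, purely for illustration, that each correct gaze region is identified by a hashable ID and that a prediction succeeds when it names that ID exactly; the actual success criterion is the one defined for the reward acquisition process. The function name is hypothetical.

```python
def stop_labels(correct_regions, predictions):
    """Return u_i for each prediction: 1 once {S_j} is empty, else 0.

    correct_regions: iterable of IDs of the correct gaze regions {S_j}
    predictions:     sequence of predicted region IDs O_1, ..., O_N
    """
    remaining = set(correct_regions)
    labels = []
    for predicted in predictions:
        # A matched region is removed, so a repeated hit gains nothing.
        remaining.discard(predicted)
        labels.append(1 if not remaining else 0)
    return labels


print(stop_labels({"S1", "S2"}, ["S1", "S1", "S2"]))  # [0, 0, 1]
```

In this example the second prediction repeats S1, so {Sj} is not yet empty and the correct stop determination information stays 0 until S2 is found.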

The loss function of equation (6) is the loss function known as binary cross entropy. It takes a small value when the stop determination information ti matches the correct stop determination information ui. Therefore, by updating Wψ so that the value of the loss function of equation (6) (that is, the loss) decreases, a network ψ capable of outputting ti close to ui is obtained. The update method is, for example, the gradient method.
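A minimal sketch of a binary cross entropy computation is given below. It assumes that each ti is a probability between 0 and 1 and that the loss is averaged over the predictions. Whether equation (6) averages or sums is not stated in this passage, and the clipping constant eps is added here only for numerical safety.

```python
import math


def binary_cross_entropy(stop_outputs, correct_labels, eps=1e-12):
    """Mean binary cross entropy between t_i (outputs) and u_i (labels)."""
    if len(stop_outputs) != len(correct_labels) or not stop_outputs:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, u in zip(stop_outputs, correct_labels):
        # Clip to avoid log(0) when an output is exactly 0 or 1.
        t = min(max(t, eps), 1.0 - eps)
        total += -(u * math.log(t) + (1 - u) * math.log(1.0 - t))
    return total / len(stop_outputs)
```

Outputs close to the labels, such as t = [0.1, 0.9] for u = [0, 1], give a smaller loss than outputs far from them, such as t = [0.9, 0.1].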

The learning device 1 trains the prediction model by repeatedly executing, in order, the subroutines of the reinforcement prediction process, the reward acquisition process, the reinforcement prediction learning process, and the stop determination learning process.

FIG. 10 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment. The learning device 1 includes a control unit 11 that includes a processor 91, such as a CPU (Central Processing Unit), and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input/output interface 12, and a storage unit 13.

More specifically, in the learning device 1, the processor 91 reads the program stored in the storage unit 13 and stores the read program in the memory 92. When the processor 91 executes the program stored in the memory 92, the learning device 1 functions as a device including the control unit 11, the input/output interface 12, and the storage unit 13.

The control unit 11 controls the operation of the various functional units of the learning device 1. The input/output interface 12 includes an interface for connecting the learning device 1 to an external device. The input/output interface 12 communicates with the external device by wire or wirelessly. The external device is, for example, the prediction device 2. The input/output interface 12 also includes input devices such as a mouse, a keyboard, and a touch panel. The input/output interface 12 may instead include an interface that connects these input devices to the learning device 1.

The input/output interface 12 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The input/output interface 12 may instead include an interface that connects such a display device to the learning device 1. The input/output interface 12 receives, for example, training data.

The storage unit 13 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 13 stores various information about the learning device 1. For example, the storage unit 13 stores various information generated as a result of processing executed by the control unit 11. The storage unit 13 stores, for example, the parameter values of the predictor 10. The storage unit 13 stores, for example, the auxiliary information in advance. The storage unit 13 may also store the maximum number of predictions T in advance.

FIG. 11 is a diagram showing an example of the functional configuration of the control unit 11 in the embodiment. The control unit 11 includes a learning control unit 111 and the predictor 10. The learning control unit 111 controls the operation of the predictor 10 and trains the predictor 10. The learning control unit 111 executes, for example, the prediction model reinforcement learning process.

FIG. 12 is a diagram showing an example of the hardware configuration of the prediction device 2 in the embodiment. The prediction device 2 includes a control unit 21 that includes a processor 93, such as a CPU (Central Processing Unit), and a memory 94 connected by a bus, and executes a program. By executing the program, the prediction device 2 functions as a device including the control unit 21, an input/output interface 22, and a storage unit 23.

More specifically, in the prediction device 2, the processor 93 reads the program stored in the storage unit 23 and stores the read program in the memory 94. When the processor 93 executes the program stored in the memory 94, the prediction device 2 functions as a device including the control unit 21, the input/output interface 22, and the storage unit 23.

The control unit 21 controls the operation of the various functional units of the prediction device 2. The input/output interface 22 includes an interface for connecting the prediction device 2 to an external device. The input/output interface 22 communicates with the external device by wire or wirelessly. The external device is, for example, the learning device 1. The input/output interface 22 also includes input devices such as a mouse, a keyboard, and a touch panel. The input/output interface 22 may instead include an interface that connects these input devices to the prediction device 2.

The input/output interface 22 includes a display device such as a CRT display, a liquid crystal display, or an organic EL display. The input/output interface 22 may instead include an interface that connects such a display device to the prediction device 2. The input/output interface 22 receives, for example, the image data to be estimated.

The storage unit 23 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 23 stores various information about the prediction device 2. For example, the storage unit 23 stores various information generated as a result of processing executed by the control unit 21. The storage unit 23 stores, for example, the trained prediction model. The storage unit 23 stores, for example, the auxiliary information in advance. The storage unit 23 may also store the maximum number of predictions T in advance.

FIG. 13 is a diagram showing an example of the functional configuration of the control unit 21 in the embodiment. The control unit 21 includes a predictor control unit 211 and the predictor 10. The predictor control unit 211 controls the operation of the predictor 10 of the prediction device 2. That is, the predictor control unit 211 causes the predictor 10 to execute the trained prediction model.

<Application example>
An application example of the prediction system 100 will be described. The prediction device 2 is installed, for example, in a vehicle. In this case, the prediction device 2 predicts a region outside the visible area that should be watched, based on an image acquired by a camera mounted on the vehicle. After the prediction, the prediction device 2 indicates the area that should be watched or attended to, for example by a visual or audible warning.

The prediction device 2 may also be mounted on a robot, for example. In this case, the prediction device 2 predicts a region outside the visible area that should be watched, from an image acquired by a camera mounted on the robot. After the prediction, the prediction device 2 controls the robot's behavior by transmitting the prediction result to a control device that controls the robot's behavior based on that result.

The prediction system 100 configured in this way includes the learning device 1, which obtains a mathematical model indicating the relationship between image data and a gaze region in a space that does not appear in the image indicated by the image data. The prediction system 100 can therefore predict a gaze region located in a part of the space that does not appear in the image. As described above, the same applies when video data is used instead of image data.

The prediction system 100 also includes the prediction device 2, which predicts a gaze region using the mathematical model obtained by the learning device 1. The prediction system 100 can therefore predict a gaze region located in a part of the space that does not appear in the image. As described above, the same applies when video data is used instead of image data.

(Modification)
Each of the prediction system 100, the learning device 1, and the prediction device 2 may be implemented using a plurality of information processing devices communicably connected via a network. All or part of the functions of the prediction system 100, the learning device 1, and the prediction device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line. The trained prediction model is an example of an updated mathematical model. The prediction device 2 is an example of an estimation device. Prediction is an example of estimation.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment, and designs and the like that do not depart from the gist of the present invention are also included.

Description of symbols: 100: prediction system; 1: learning device; 2: prediction device; 10: predictor; 11: control unit; 12: input/output interface; 13: storage unit; 111: learning control unit; 21: control unit; 22: input/output interface; 23: storage unit; 211: predictor control unit; 91: processor; 92: memory; 93: processor; 94: memory

Claims

A learning device comprising:
a control unit that updates a mathematical model indicating the relationship between image data and a gaze region in a space that does not appear in the image indicated by the image data, based on training data that is a pair of image data and correct gaze region data, the correct gaze region data being data indicating a gaze region in a space that does not appear in the image indicated by the image data,
wherein the control unit uses the mathematical model to estimate a gaze region based on the image data included in the training data, and updates the mathematical model so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

The learning device according to claim 1, wherein the control unit updates the mathematical model by a reinforcement learning method.

The learning device according to claim 1 or 2, wherein the mathematical model is expressed using a long short-term memory network, and
the mathematical model estimates the gaze region also based on auxiliary information, the auxiliary information being information that is input to the long short-term memory network and whose value corresponds to the timing of input to the long short-term memory network and follows a predetermined distribution with non-zero variance.

An estimation device comprising:
a control unit that estimates, based on input image data, a gaze region in a space that does not appear in the image indicated by the image data, using a mathematical model that has been updated, until a predetermined termination condition is satisfied, by a learning device,
wherein the learning device includes a control unit that updates the mathematical model, which indicates the relationship between image data and a gaze region in a space that does not appear in the image indicated by the image data, based on training data that is a pair of image data and correct gaze region data, the correct gaze region data being data indicating a gaze region in a space that does not appear in the image indicated by the image data, and the control unit of the learning device uses the mathematical model to estimate a gaze region based on the image data included in the training data and updates the mathematical model so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

A learning method comprising:
a control step of updating a mathematical model indicating the relationship between image data and a gaze region in a space that does not appear in the image indicated by the image data, based on training data that is a pair of image data and correct gaze region data, the correct gaze region data being data indicating a gaze region in a space that does not appear in the image indicated by the image data,
wherein, in the control step, a gaze region is estimated using the mathematical model based on the image data included in the training data, and the mathematical model is updated so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

An estimation method comprising:
a control step of estimating, based on input image data, a gaze region in a space that does not appear in the image indicated by the image data, using a mathematical model that has been updated, until a predetermined termination condition is satisfied, by a learning device,
wherein the learning device includes a control unit that updates the mathematical model, which indicates the relationship between image data and a gaze region in a space that does not appear in the image indicated by the image data, based on training data that is a pair of image data and correct gaze region data, the correct gaze region data being data indicating a gaze region in a space that does not appear in the image indicated by the image data, and the control unit uses the mathematical model to estimate a gaze region based on the image data included in the training data and updates the mathematical model so that the estimation result approaches the gaze region indicated by the correct gaze region data included in the training data.

A program for causing a computer to function as the learning device according to any one of claims 1 to 3.