JP2023143222A - Information processing system and information processing method

Info

Publication number: JP2023143222A
Application number: JP2022050484A
Authority: JP
Inventors: 輝黄瀬; Akira Kinose
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2022-03-25
Filing date: 2022-03-25
Publication date: 2023-10-06

Abstract

To provide an information processing system which allows a user to intuitively and easily understand a learning state of a world model.

SOLUTION: An information processing system comprises: a designation unit which designates a learned world model, environmental data about an environment in which an agent operates, and a learned strategy model which specifies operation of the agent; a strategy execution unit which sequentially executes a strategy in the environment, on the basis of the environmental data and the strategy model; an observation image processing unit which captures the environment by a virtual camera for each execution of the strategy, and generates a plurality of observation images; a latent state processing unit which compresses each of the plurality of observation images to derive each of a plurality of latent states, on the basis of the world model; and a display unit which displays the plurality of latent states.

SELECTED DRAWING: Figure 2

Description

The present disclosure relates to an information processing system and an information processing method.

In recent years, in the field of vision-based reinforcement learning, research has advanced on world models, which build a model of the environment so that an agent can predict the results of its actions. Put differently, a world model is a reinforcement learning technique that learns a predictive model of the environment's state transitions. For example, it is known that an agent can learn a world model from image inputs and actions (see Non-Patent Document 1).

Non-Patent Document 1: Google AI Blog, The latest from Google Research, "Introducing PlaNet: A Deep Planning Network for Reinforcement Learning", <URL:https://webbigdata.jp/ai/post-2867>

In research and development of world models, engineers need to check how an agent's world model has been learned. However, because a world model is expressed as vector data, it is difficult for engineers to understand intuitively. It is also difficult to interactively check the restored image or the state of the world model as time passes, as observations change due to actions, or as the observation position changes. Moreover, there has conventionally been no means of checking whether the learning result of a world model matches the engineer's expectations, that is, whether the learning result is good or bad.

The present disclosure provides an information processing system and an information processing method that make the learning state of a world model easy to understand intuitively.

One aspect of the present disclosure is an information processing system comprising: a designation unit that designates a learned world model, environmental data regarding an environment in which an agent operates, and a learned policy model that defines the behavior of the agent; a policy execution unit that sequentially executes a policy in the environment based on the environmental data and the policy model; an observation image processing unit that images the environment with a virtual camera each time the policy is executed and generates a plurality of observation images; a latent state processing unit that, based on the world model, compresses each of the plurality of observation images to derive each of a plurality of latent states; and a display unit that displays the plurality of latent states.

Another aspect of the present disclosure is an information processing method comprising the steps of: designating a learned world model, environmental data regarding an environment in which an agent operates, and a learned policy model that defines the behavior of the agent; sequentially executing a policy in the environment based on the environmental data and the policy model; imaging the environment with a virtual camera each time the policy is executed to generate a plurality of observation images; compressing, based on the world model, each of the plurality of observation images to derive each of a plurality of latent states; and displaying the plurality of latent states on a display unit.

According to the present disclosure, the learning state of a world model can be made easy to understand intuitively.

Block diagram showing a configuration example of an information processing system according to an embodiment of the present disclosure
Block diagram showing a configuration example of the GUI unit
Block diagram showing a configuration example of the simulator unit
Diagram showing an example of the list screen
Diagram showing an example of the model setting screen
Diagram showing an example of the model setting screen with path names displayed
Diagram showing an example of the camera setting screen
Diagram showing an example of the camera setting screen when a plurality of virtual cameras are provided
Diagram showing an example of latent state mapping
Sequence diagram showing an operation example of the information processing system

Hereinafter, embodiments will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.

For example, the term "unit" or "device" used in the embodiments is not limited to a physical configuration realized mechanically by hardware, but also includes a configuration whose functions are realized by software such as a program. Moreover, the functions of one configuration may be realized by two or more physical configurations, and the functions of two or more configurations may be realized by, for example, one physical configuration.

(Explanation of terms)
This embodiment deals with reinforcement learning, including world models. First, terms related to reinforcement learning and world models are explained.

"Reinforcement learning" is a machine learning technique in which an agent learns a task through repeated trial-and-error interaction with a dynamic environment. The agent obtains a reward that evaluates how good the current state is. A task is, for example, a robot device acting as the agent grasping an object. When the task is accomplished, a reward is obtained. Reinforcement learning learns, through a series of actions, the policy that yields the greatest reward.

An "agent" is the entity that interacts with the environment and learns a policy. As specific examples, the agent is an AI (Artificial Intelligence), a robot device, or a controller.

A "state" indicates the situation of the environment. As specific examples, the state is the position of the robot device, the angle of the arm of the robot device (robot arm), and the like. The state is therefore, for example, information about the controlled object.

An "action" is an operation by which the agent influences the environment and changes its state. As specific examples, the action is movement of the robot device, torque applied to the robot arm, and the like.

A "reward" is an index that evaluates how good the agent's state or action is. In reinforcement learning, a task is learned by learning to maximize the reward value. As a specific example, the reward is based on the distance between the goal and the robot device, that is, the distance between the target position and the current position of the robot device. Note that the magnitude of the reward value is set according to the criterion for goodness, so it does not necessarily follow the magnitude of the parameter used to calculate the reward. For example, when the robot device is to learn that approaching the goal is a "good" result, the reward is calculated using an index whose value increases as the distance decreases (for example, the reciprocal of the distance). Conversely, when the robot device is to learn that moving away from the goal is a "good" result, the reward is calculated using an index whose value increases as the distance increases (for example, the distance value itself).
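The two reward definitions above can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure: the function names and the small constant `eps` (added to avoid division by zero when the robot is exactly at the goal) are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(a, b)


def reward_approach(robot_pos, goal_pos, eps=1e-6):
    """Reward that grows as the robot gets closer to the goal.

    Uses the reciprocal of the distance; eps is a hypothetical guard
    against division by zero.
    """
    return 1.0 / (distance(robot_pos, goal_pos) + eps)


def reward_retreat(robot_pos, goal_pos):
    """Reward that grows as the robot moves away from the goal.

    Uses the distance value itself.
    """
    return distance(robot_pos, goal_pos)
```

With either function the learner still maximizes the reward; only the mapping from distance to reward changes.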

A "policy" is a function that returns the action to be taken in a given state (for example, the situation of the environment or the state of the agent). As a specific example, the policy is the operating strategy of the robot device.

A "latent state" (latent variable) is used, for example, in a world model. A latent state is information (for example, vector information) obtained by compressing observation information of the environment (for example, an observation image) and latently represents the situation of the environment.

A "latent space" is used, for example, in a world model. The latent space is the vector space in which latent states are represented. Therefore, from how latent states are positioned in the latent space, it is possible to understand how the agent perceives the observation information, that is, the learning state of the world model.

A "restored image" is used, for example, in a world model. A restored image is an image in which the observation information (for example, an observation image) is reconstructed from the compressed latent state. Therefore, from the restored image, it is possible to understand whether the agent perceives the observation image correctly, that is, the learning state of the world model.

(Configuration of information processing system)
Next, the configuration of the information processing system is described.

FIG. 1 is a block diagram showing a configuration example of an information processing system 5 according to an embodiment of the present disclosure. The information processing system 5 includes a GUI (Graphical User Interface) unit 100 and a simulator unit 200.

The GUI unit 100 and the simulator unit 200 are communicably connected via a communication network or by direct communication. The GUI unit 100 is operated by a user and provides display information to the user. The simulator unit 200 is a physics simulator used for reinforcement learning experiments. Using an API (Application Programming Interface) or the like, the simulator unit 200 acquires information on the environment (for example, images, the position and joints of the robot, time, or rewards) and receives actions as input and applies them to the environment.

The GUI unit 100 is, for example, a PC (Personal Computer) or another information processing device. The simulator unit 200 is, for example, a server device or another information processing device. The GUI unit 100 and the simulator unit 200 may therefore be in a client-server relationship.

FIG. 2 is a block diagram showing a configuration example of the GUI unit 100. The GUI unit 100 includes a processor 110, a memory 120, a communication device 130, an operation device 140, and a display device 150.

The processor 110 may include an MPU (Micro Processing Unit), a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphics Processing Unit), or the like. The processor 110 may be configured with various integrated circuits (for example, an LSI (Large Scale Integration) or an FPGA (Field Programmable Gate Array)). The processor 110 realizes various functions by executing programs held in the memory 120. The processor 110 performs overall control of each part of the GUI unit 100 and performs various kinds of processing.

The memory 120 includes a primary storage device (for example, a RAM (Random Access Memory) or a ROM (Read Only Memory)). The memory 120 may include a secondary storage device (for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive)), a tertiary storage device (for example, an optical disc or an SD card), or the like. The memory 120 may also be an external storage medium that is removable from the GUI unit 100. The memory 120 stores various data, information, programs, and the like.

The memory 120 may hold, for example, an environment file, a learned world model, and a learned policy model. The environment file is one form of environmental data and includes information on the environment in which the agent acts. The policy model is a learning model that has learned a policy and is the model according to which the agent acts. One or more of each of the environment file, the learned world model, and the learned policy model may be prepared and held in the memory 120.

The communication device 130 communicates various data, information, and the like. The communication device 130 communicates according to a wired or wireless communication method. The communication method may be a WAN (Wide Area Network), a LAN (Local Area Network), cellular communication for mobile phones (for example, LTE or 5G), short-range communication (for example, infrared communication or Bluetooth (registered trademark) communication), power line communication, or the like.

The operation device 140 may include a mouse, a keyboard, a touch pad, a touch panel, a microphone, or another input device. The operation device 140 accepts input of various data and information.

The display device 150 may include a liquid crystal display device, an organic EL device, or another display device. The display device 150 displays various data and information, for example, the screens and images described later.

The processor 110 performs at least the processing necessary to present the learning state of the world model. As described above, a world model is a reinforcement learning technique that learns a predictive model of the environment's state transitions. A state transition of the environment indicates which state is reached when which action is taken in which state. By taking the world model into account, the agent becomes able to latently predict the outcome of a series of actions, which reduces trial and error in task learning.
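As a toy illustration of learning state transitions and then predicting the outcome of a series of actions, the sketch below counts observed transitions in a table. It is a deliberately simplified stand-in: a world model as discussed here operates on latent states derived from images, and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class TabularTransitionModel:
    """Toy transition model: counts observed (state, action) -> next_state."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def update(self, state, action, next_state):
        """Record one observed transition."""
        self._counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Return the most frequently observed next state, or None if unseen."""
        seen = self._counts.get((state, action))
        if not seen:
            return None
        return seen.most_common(1)[0][0]

    def imagine(self, state, actions):
        """Predict the state reached after a sequence of actions, without acting."""
        for action in actions:
            state = self.predict(state, action)
            if state is None:
                break
        return state
```

Once transitions have been recorded with `update`, `imagine` evaluates an action sequence purely inside the model, which is the sense in which a learned transition model can reduce trial and error in the real environment.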

As functional components, the processor 110 has at least an information designation unit 111, a latent state processing unit 112, and a restored image processing unit 113.

The information designation unit 111 designates various kinds of information used in determining the learning state of the world model. For example, the information designation unit 111 designates information that defines the virtual environment (for example, an environment file or a learned world model) or information that defines the behavior of the agent (for example, a learned policy model). The information designation unit 111 also designates information that defines the observation reference for the virtual environment (for example, information on a virtual camera that images the virtual environment (camera information)). The camera information includes, for example, the position (camera coordinates) and orientation of the virtual camera with respect to the virtual environment. The information designation unit 111 may designate and set the various models and the camera information manually, for example, based on information held in the memory 120 or on operation information input to the operation device 140.

The latent state processing unit 112 maps latent states onto the latent space based on the learned world model and the observation images acquired from the simulator unit 200. The latent state processing unit 112 may calculate a latent state by compressing an observation image. The latent state may be vector data. The latent state processing unit 112 inputs an observation image to the world model and derives the latent state as the output of the world model. This latent state is, for example, multidimensional information. The latent state processing unit 112 may apply dimension compression to the obtained multidimensional latent state to derive (for example, calculate) a three-dimensional latent state. The latent state processing unit 112 may instead compress it to a two-dimensional latent state. The latent state processing unit 112 maps the three-dimensional latent state onto a latent state mapping screen showing the latent space and causes the display device 150 to display it. The latent state processing unit 112 may instead map a two-dimensional latent state onto a two-dimensional plane.

If the user has undergone special training and can handle spaces of four or more dimensions, the latent state processing unit 112 may compress the latent state to four or more dimensions and display four-dimensional information on the latent state mapping screen. However, the number of dimensions that a typical user can handle intuitively is one to three, and in one dimension all latent states are mapped to a single point, so that multiple latent states cannot be distinguished at all. It is therefore desirable for the latent state processing unit 112 to handle two-dimensional or three-dimensional latent states. To accommodate the needs of various users, the latent state processing unit 112 may allow the number of dimensions to be switched.

The latent states shown in the latent state mapping change depending on the method used for dimension compression and on the parameters retained by the compression. The latent state processing unit 112 may therefore allow the dimension compression method, the parameters to be retained, and the like to be changed. In this way, the GUI unit 100 can provide an interface that is easy to use for experienced users who know which parameters they want to evaluate with priority.
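One common way to perform this kind of dimension compression is principal component analysis (PCA). The text does not name a specific method, so the choice of PCA, the function name `reduce_latents`, and the parameter `n_components` are assumptions made for illustration; the sketch relies on NumPy.

```python
import numpy as np


def reduce_latents(latents, n_components=3):
    """Project latent vectors of shape (n_samples, n_dims) onto their
    top principal components, returning shape (n_samples, n_components)."""
    x = np.asarray(latents, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical data: 50 latent states of 32 dimensions each.
rng = np.random.default_rng(0)
latents = rng.normal(size=(50, 32))
points_3d = reduce_latents(latents, n_components=3)  # for a 3-D mapping
points_2d = reduce_latents(latents, n_components=2)  # for a 2-D mapping
```

Exposing `n_components` as a parameter corresponds to switching the number of displayed dimensions. Changing the compression method would mean replacing the body of the function with another technique.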

The restored image processing unit 113 generates a restored image based on the learned world model and the latent state derived by the latent state processing unit 112. For example, the restored image processing unit 113 inputs the latent state to the world model and derives the restored image as the output of the world model. The restored image processing unit 113 may generate the restored image based on the multidimensional latent state or on the dimension-compressed two-dimensional or three-dimensional latent state. For example, the restored image processing unit 113 may generate the restored image by decompressing the latent state. The restored image processing unit 113 causes the display device 150 to display the generated restored image.

Because the latent state is obtained by compressing the observation image, the restored image corresponds to the observation image. Depending on the learning state of the world model, however, the observation image and the restored image may or may not match. The restored image processing unit 113 may determine the degree of correlation (for example, the degree of matching) between the observation image and the restored image. The restored image processing unit 113 can determine the degree of correlation, for example, by aggregating the differences between the pixel values of the observation image and those of the restored image. The restored image processing unit 113 may also determine the degree of correlation more precisely using known techniques such as those used in image retrieval. The restored image processing unit 113 may determine that the learning state of the world model is good when the degree of correlation satisfies a predetermined criterion, and that the learning state is poor when it does not. Alternatively, the latent state mapping information may be displayed on the display device 150 without the restored image processing unit 113 determining the learning state of the world model, and the user may then judge from the latent state mapping whether the learning state of the world model is good or bad.
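The pixel-difference check described above can be sketched as follows, assuming grayscale images represented as nested lists of pixel values. The mean absolute difference as the aggregate, the function names, and the threshold value are all hypothetical choices, since the predetermined criterion is not specified.

```python
def mean_abs_difference(observed, restored):
    """Mean absolute per-pixel difference between two same-sized
    grayscale images given as lists of rows."""
    if len(observed) != len(restored) or any(
        len(a) != len(b) for a, b in zip(observed, restored)
    ):
        raise ValueError("images must have the same shape")
    count = sum(len(row) for row in observed)
    if count == 0:
        raise ValueError("images must not be empty")
    total = sum(
        abs(p - q)
        for a, b in zip(observed, restored)
        for p, q in zip(a, b)
    )
    return total / count


def learning_state_is_good(observed, restored, threshold=10.0):
    """Judge the learning state as good when the images differ by no more
    than the (hypothetical) threshold on average."""
    return mean_abs_difference(observed, restored) <= threshold
```

A smaller difference means a higher degree of matching, so the criterion here is an upper bound on the difference rather than a lower bound on a correlation score.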

FIG. 3 is a block diagram showing a configuration example of the simulator unit 200. The simulator unit 200 includes, for example, a processor 210, a memory 220, and a communication device 230.

The processor 210 may include an MPU, a CPU, a DSP, a GPU, or the like. The processor 210 may be configured with various integrated circuits (for example, an LSI or an FPGA). The processor 210 realizes various functions by executing programs held in the memory 220. The processor 210 performs overall control of each part of the simulator unit 200 and performs various kinds of processing. The processor 210 of the simulator unit 200 may have higher performance than the processor 110 of the GUI unit 100.

The memory 220 includes a primary storage device (for example, a RAM or a ROM). The memory 220 may include a secondary storage device (for example, an HDD or an SSD), a tertiary storage device (for example, an optical disc or an SD card), or the like. The memory 220 may also be an external storage medium. The memory 220 stores various data, information, programs, and the like.

The communication device 230 communicates various data, information, and the like. The communication device 230 communicates according to a wired or wireless communication method. The communication method may be a WAN, a LAN, cellular communication for mobile phones (for example, LTE or 5G), short-range communication (for example, infrared communication or Bluetooth (registered trademark) communication), power line communication, or the like.

As functional components, the processor 210 has at least an information setting unit 211, a policy execution unit 212, and an observation image processing unit 213.

The information setting unit 211 sets various kinds of information. For example, it acquires the environment file, the camera information, and the policy model from the GUI unit 100 and sets them. The environment file, camera information, and policy model that are set are the same as those designated in the GUI unit 100. This setting information may be held in the memory 220.
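The hand-off from the GUI unit 100 to the simulator unit 200 could be represented as in the sketch below. The data layout, the JSON encoding, the field names, and the file names are all assumptions made for illustration. The one detail taken from the description is that the environment file, the camera information, and the policy model are passed to the simulator unit 200, while the world model is used by the latent state processing unit 112 on the GUI side.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CameraInfo:
    position: tuple     # (x, y, z) camera coordinates in the virtual environment
    orientation: tuple  # (roll, pitch, yaw) in degrees; an assumed convention


@dataclass(frozen=True)
class Designation:
    environment_file: str
    world_model: str
    policy_model: str
    camera: CameraInfo


def simulator_settings(designation):
    """Return the settings for the simulator side as a JSON string.

    The world model entry is dropped because, in this sketch, only the
    GUI side uses it.
    """
    payload = asdict(designation)
    payload.pop("world_model")
    return json.dumps(payload, sort_keys=True)


# Hypothetical file names.
designation = Designation(
    environment_file="env.xml",
    world_model="world_model.bin",
    policy_model="policy.bin",
    camera=CameraInfo(position=(0.0, 1.0, 2.0), orientation=(0.0, 0.0, 90.0)),
)
settings_json = simulator_settings(designation)
```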

The policy execution unit 212 executes the policy according to the learned policy model that has been set. Specifically, the policy execution unit 212 inputs information indicating the state of the environment (for example, the set environment file and camera information) to the policy model and derives, as the output of the policy model, the action the agent should take (for example, control information for the robot device to grasp the ball). The state of the environment can change as a result of the derived action of the agent.

The observation image processing unit 213 controls the virtual camera and acquires observation images from it. The virtual camera images the dynamic environment according to the set camera information. The acquired observation images are transmitted to the GUI unit 100 by the communication device 230. The virtual camera may acquire observation images sequentially while the policy is being executed, and may also acquire observation images before and after execution of the policy.
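The interplay between policy execution and observation capture can be sketched as a simple loop. The environment class `LineWorld`, its methods, and the function `run_policy` are hypothetical stand-ins for the physics simulator and its API, kept tiny so that the example runs on its own.

```python
class LineWorld:
    """Toy environment: an agent on a one-dimensional track of `width` cells."""

    def __init__(self, width=5):
        self.width = width
        self.pos = 0

    def state(self):
        """Return the current state (the agent's cell index)."""
        return self.pos

    def step(self, action):
        """Apply an action (a signed step), clamped to the track."""
        self.pos = max(0, min(self.width - 1, self.pos + action))

    def render(self):
        """Stand-in for the virtual camera: one row of pixels,
        255 where the agent is and 0 elsewhere."""
        return [255 if i == self.pos else 0 for i in range(self.width)]


def run_policy(env, policy, steps):
    """Execute the policy step by step, capturing one image per execution."""
    images = [env.render()]          # observation before the first action
    for _ in range(steps):
        action = policy(env.state())
        env.step(action)
        images.append(env.render())  # observation after each execution
    return images


# A trivial policy that always moves one cell to the right.
images = run_policy(LineWorld(), lambda state: 1, steps=3)
```

Each captured image would then be passed to the world model to derive one latent state, so a run of three executions yields four points on the latent state mapping when the initial observation is included.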

FIG. 4 is a diagram showing a list screen GL displayed on the display device 150.

The list screen GL shows the process from the environment defined by the environment file through to obtaining the restored image, and includes the screens and images obtained at each point in this process. Specifically, the list screen GL includes an environment screen G1, a model setting screen G2, a camera setting screen G3, an observation screen G4, a latent state mapping screen G5, and a restoration screen G6.

環境画面Ｇ１は、環境ファイルにより定義される環境（仮想環境）を表示する。環境画面Ｇ１では、環境が例えば３次元モデル化して示されている。環境画面Ｇ１が示す環境では、エージェントとしてのロボット装置１０、ボール２０、及び遮蔽物３０等が配置されている。この環境では、例えば、ロボット装置１０が、ロボットアーム１５等を動かしながら、ボール２０を掴もうとしている。そのため、タスクは、例えば、ロボットアーム１５をボール２０に近づけることである。 The environment screen G1 displays an environment (virtual environment) defined by an environment file. In the environment screen G1, the environment is shown as a three-dimensional model, for example. In the environment shown by the environment screen G1, a robot device 10 as an agent, a ball 20, a shield 30, and the like are arranged. In this environment, for example, the robot device 10 is trying to grab the ball 20 while moving the robot arm 15 and the like. Therefore, the task is, for example, to bring the robot arm 15 closer to the ball 20.

モデル設定画面Ｇ２は、各種のモデル等に関する情報を設定するための設定画面である。モデル設定画面Ｇ２は、例えば、環境に関する環境ファイル、世界モデル、及び方策モデルの設定を支援する。カメラ設定画面Ｇ３は、仮想カメラに関するカメラ情報を設定するための画面である。 The model setting screen G2 is a setting screen for setting information regarding various models and the like. The model setting screen G2 supports, for example, setting of an environment file, a world model, and a policy model regarding the environment. The camera setting screen G3 is a screen for setting camera information regarding the virtual camera.

観測画面Ｇ４は、仮想カメラから環境が撮像されて得られた観測画像を表示する画面である。観測画像は、方策の実行毎に順次得られるので、方策の実行毎に更新される。また、方策の実行毎にエージェントが動作するので、環境内の様子は順次変化し得る。図４の観測画面Ｇ４では、仮想カメラの撮像範囲に遮蔽物３０が入り込み、仮想カメラから見るとロボット装置１０の一部が遮蔽物３０の背後に隠れ、見えにくくなっている状態の観測画像が表示されている。 The observation screen G4 is a screen that displays an observation image obtained by imaging the environment with the virtual camera. Observation images are obtained sequentially each time the policy is executed, so the screen is updated each time the policy is executed. Furthermore, since the agent operates each time the policy is executed, the state of the environment may change sequentially. The observation screen G4 in FIG. 4 displays an observation image in which the shield 30 has entered the imaging range of the virtual camera and, as seen from the virtual camera, part of the robot device 10 is hidden behind the shield 30 and is difficult to see.

潜在状態マッピング画面Ｇ５は、観測画像に基づく潜在状態が２次元座標上又は３次元座標上にマッピングされる画面である。潜在状態マッピング画面Ｇ５では、複数の観測画像に対応する複数の潜在状態が点で示されてマッピングされている。潜在状態は観測画像を圧縮したものであるので、潜在状態マッピング画面Ｇ５は、観測画像が示す環境の特徴点を複数表示することで、方策の実行による複数の環境の状態の関係性を示している。 The latent state mapping screen G5 is a screen on which latent states based on observation images are mapped onto two-dimensional or three-dimensional coordinates. On the latent state mapping screen G5, a plurality of latent states corresponding to a plurality of observation images are mapped and shown as points. Since a latent state is a compressed form of an observation image, the latent state mapping screen G5 displays a plurality of feature points of the environment indicated by the observation images, thereby showing the relationship among the multiple environmental states resulting from the execution of the policy.

復元画面Ｇ６は、潜在状態に基づく復元画像を表示する画面である。図４の復元画面Ｇ６では、仮想カメラの撮像範囲に遮蔽物３０が入り込み、仮想カメラから見るとロボット装置１０の一部が遮蔽物３０の背後に隠れ、見えにくくなっている状態の復元画像が表示されている。なお、図４の例では、観測画像と復元画像とが一致しておらず、やや異なっている。 The restoration screen G6 is a screen that displays a restored image based on a latent state. The restoration screen G6 in FIG. 4 displays a restored image in which the shield 30 has entered the imaging range of the virtual camera and, as seen from the virtual camera, part of the robot device 10 is hidden behind the shield 30 and is difficult to see. Note that in the example of FIG. 4, the observation image and the restored image do not match and are slightly different.

なお、図４に示された一覧画面ＧＬにおける各画面の配置は一例であり、これに限られない。また、図４に示された一覧画面ＧＬに含まれる各画面の一部が非表示であってもよいし、図４に示されていない他の画面が一覧画面ＧＬに含まれて表示されてもよい。 Note that the arrangement of the screens on the list screen GL shown in FIG. 4 is an example, and the arrangement is not limited to this. Further, some of the screens included in the list screen GL shown in FIG. 4 may be hidden, and other screens not shown in FIG. 4 may be included and displayed in the list screen GL.

図５は、モデル設定画面Ｇ２１の一例を示す図である。図６は、パス名が表示されたモデル設定画面Ｇ２２の一例を示す図である。モデル設定画面Ｇ２１は、表示デバイス１５０に表示される。モデル設定画面Ｇ２１，Ｇ２２は、モデル設定画面Ｇ２の一例である。 FIG. 5 is a diagram showing an example of the model setting screen G21. FIG. 6 is a diagram showing an example of the model setting screen G22 on which path names are displayed. Model setting screen G21 is displayed on display device 150. The model setting screens G21 and G22 are examples of the model setting screen G2.

図５に示すように、情報指定部１１１は、操作デバイス１４０を介して、モデル設定画面Ｇ２１を用いて、環境ファイル、世界モデル、及び方策モデルを指定する。指定される各情報（環境ファイル、世界モデル、及び方策モデル）は、それぞれ、操作デバイス１４０を介して入力されてもよいし、メモリ１２０に保持された複数種類のうちの１つが選択されてもよい。情報指定部１１１は、指定された各情報をシミュレータ部２００に送信（例えばアップロード）し、シミュレータ部２００に各情報を設定するよう指示する。 As shown in FIG. 5, the information specifying unit 111 uses the model setting screen G21 via the operating device 140 to specify an environment file, a world model, and a policy model. Each piece of specified information (environment file, world model, and policy model) may be input via the operating device 140, or may be selected from among a plurality of types held in the memory 120. The information specifying unit 111 transmits (for example, uploads) each piece of specified information to the simulator unit 200 and instructs the simulator unit 200 to set each piece of information.

つまり、情報指定部１１１は、通信デバイス１３０を介して環境ファイルをシミュレータ部２００にアップロードすることで、入力された実行環境をシミュレータ部２００に設定（反映）させる。情報指定部１１１は、通信デバイス１３０を介して世界モデルをシミュレータ部２００にアップロードすることで、エージェントが環境の変化の予測や画像復元を行うためのモデル（世界モデル）をシミュレータ部２００に設定（反映）させる。情報指定部１１１は、通信デバイス１３０を介して方策モデルをシミュレータ部２００にアップロードすることで、エージェントが行動するためのモデル（方策モデル）をシミュレータ部２００に設定（反映）させる。なお、設定情報のアップロードが完了すると、それぞれ、パス名が表示されてもよい（図６参照）。なお、情報指定部１１１が世界モデル及び方策モデルをシミュレータ部２００にアップロードして設定させることは必須ではなく、ＧＵＩ部１００側で世界モデル及び方策モデルに基づく処理が完結してもよい。 That is, the information specifying unit 111 uploads the environment file to the simulator unit 200 via the communication device 130, thereby causing the simulator unit 200 to set (reflect) the input execution environment. The information specifying unit 111 uploads the world model to the simulator unit 200 via the communication device 130, thereby causing the simulator unit 200 to set (reflect) the model (world model) with which the agent predicts changes in the environment and restores images. The information specifying unit 111 uploads the policy model to the simulator unit 200 via the communication device 130, thereby causing the simulator unit 200 to set (reflect) the model (policy model) with which the agent acts. Note that when the upload of each piece of setting information is completed, its path name may be displayed (see FIG. 6). Note also that it is not essential for the information specifying unit 111 to upload the world model and the policy model to the simulator unit 200 and have them set there; the processing based on the world model and the policy model may be completed on the GUI unit 100 side.

なお、情報指定部１１１は、操作デバイス１４０による操作無しで、所定の指定基準に従って、環境ファイル、世界モデル及び方策モデルを指定してもよい。 Note that the information specifying unit 111 may specify the environment file, the world model, and the policy model according to predetermined specification criteria without any operation using the operating device 140.

図７は、カメラ設定画面Ｇ３１の一例を示す図である。図８は、仮想カメラが複数設けられる場合のカメラ設定画面Ｇ３２の一例を示す図である。カメラ設定画面Ｇ３１，Ｇ３２は、表示デバイス１５０に表示される。カメラ設定画面Ｇ３１，Ｇ３２は、カメラ設定画面Ｇ３の一例である。 FIG. 7 is a diagram showing an example of the camera setting screen G31. FIG. 8 is a diagram showing an example of the camera setting screen G32 when a plurality of virtual cameras are provided. Camera setting screens G31 and G32 are displayed on display device 150. Camera setting screens G31 and G32 are examples of camera setting screen G3.

情報指定部１１１は、操作デバイス１４０を介して、カメラ設定画面Ｇ３を用いて、カメラ情報の詳細を設定する。カメラ情報は、環境を撮像する仮想カメラに関する情報である。カメラ情報は、例えば、環境に対する仮想カメラの位置（カメラ座標）や向きの情報を含む。この場合、情報指定部１１１は、操作デバイス１４０を介して、仮想カメラが配置される位置として、環境におけるＸ座標、Ｙ座標、及びＺ座標を指定してよい。また、情報指定部１１１は、操作デバイス１４０を介して、仮想カメラの向きとして、ロール、ピッチ及びヨーの値を指定してよい。また、情報指定部１１１は、操作デバイス１４０を介して操作せずに、例えば、メモリ１２０にカメラ情報のデフォルト情報を保持しておき、このデフォルト情報をカメラ情報として指定してもよい。 The information specifying unit 111 uses the camera setting screen G3 via the operating device 140 to set details of camera information. The camera information is information regarding a virtual camera that images the environment. The camera information includes, for example, information on the position (camera coordinates) and orientation of the virtual camera with respect to the environment. In this case, the information specifying unit 111 may specify, via the operating device 140, the X coordinate, Y coordinate, and Z coordinate in the environment as the position where the virtual camera is placed. Further, the information specifying unit 111 may specify the values of roll, pitch, and yaw as the orientation of the virtual camera via the operating device 140. Further, the information specifying unit 111 may store default information of camera information in the memory 120, for example, and specify this default information as the camera information without operating via the operating device 140.
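The camera information described above (X, Y, Z position plus roll, pitch, and yaw) can be held in a small record such as the one below. This is only a sketch: the class name `CameraInfo`, the use of degrees, the yaw-pitch-roll (Z-Y-X) rotation order, and the convention that an unrotated camera looks along +X are all assumptions that the source does not specify.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CameraInfo:
    """Hypothetical container for virtual-camera settings (angles in degrees)."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    roll: float = 0.0
    pitch: float = 0.0
    yaw: float = 0.0

    def forward(self) -> Tuple[float, float, float]:
        """Viewing direction, assuming R = Rz(yaw) * Ry(pitch) * Rx(roll)
        and an unrotated camera that looks along +X (roll does not move this axis)."""
        yaw = math.radians(self.yaw)
        pitch = math.radians(self.pitch)
        return (math.cos(yaw) * math.cos(pitch),
                math.sin(yaw) * math.cos(pitch),
                -math.sin(pitch))


default_camera = CameraInfo()                               # stands in for default information held in memory
side_camera = CameraInfo(x=1.0, y=-2.0, z=0.5, yaw=90.0)    # values a user might enter on the screen
```

Constructing `CameraInfo()` without arguments corresponds to falling back to default camera information, while the second instance corresponds to values specified through the setting screen.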

また、情報指定部１１１は、環境画面Ｇ１上での操作デバイス１４０の操作に基づいて、カメラ情報を設定してもよい。例えば、情報指定部１１１は、環境画面Ｇ１上でのマウス操作によって仮想カメラをドラッグし、仮想カメラの位置や向きを調整してもよい。 Further, the information specifying unit 111 may set camera information based on the operation of the operating device 140 on the environment screen G1. For example, the information specifying unit 111 may adjust the position and orientation of the virtual camera by dragging the virtual camera using a mouse operation on the environment screen G1.

また、カメラ情報の設定時には、表示デバイス１５０に一覧画面ＧＬが表示されてもよい。この場合、ユーザは、カメラ情報の設定時に、環境に設置される（エージェントが取得する）カメラ情報を、一覧画面ＧＬに含まれる環境画面Ｇ１を見ながら設定可能である。 Further, when setting camera information, a list screen GL may be displayed on the display device 150. In this case, when setting camera information, the user can set camera information installed in the environment (obtained by the agent) while viewing the environment screen G1 included in the list screen GL.

また、情報指定部１１１は、カメラ情報を設定した際、設定されたカメラ情報を、環境画面Ｇ１及び観測画面Ｇ４にインタラクティブに反映してよい。つまり、情報指定部１１１は、設定された位置（カメラ座標）や向き（カメラ角度）に従って仮想カメラにより撮像された画像に基づいて、表示される環境画面Ｇ１及び観測画面Ｇ４を更新してよい。この場合、ユーザは、カメラ情報の指定を調整しながら（例えば視点移動しながら）、環境を観測するために仮想カメラをどの位置又は向き等に配置したらよいかを直感的に把握できる。なお、観測画面Ｇ４の観測画像は、シミュレータ部２００と協働して動作することで、通信デバイス１３０を介して取得可能である。 Further, when setting camera information, the information specifying unit 111 may interactively reflect the set camera information on the environment screen G1 and the observation screen G4. That is, the information specifying unit 111 may update the displayed environment screen G1 and observation screen G4 based on the image captured by the virtual camera according to the set position (camera coordinates) and direction (camera angle). In this case, the user can intuitively understand in what position or direction the virtual camera should be placed in order to observe the environment while adjusting the specification of camera information (for example, while moving the viewpoint). Note that the observation image on the observation screen G4 can be acquired via the communication device 130 by operating in cooperation with the simulator section 200.

また、環境を撮像する仮想カメラは、１つでなく、複数設けられてよい。例えば、情報指定部１１１は、操作デバイス１４０を介して、カメラ追加ボタンＢ１（図７参照）の押下を受け付けることで、２つ目以降の仮想カメラのカメラ情報を設定してもよい（図８参照）。複数の仮想カメラのカメラ情報が異なることで、様々な視点から環境が観測可能である。 Further, a plurality of virtual cameras, rather than just one, may be provided to image the environment. For example, the information specifying unit 111 may set the camera information of the second and subsequent virtual cameras upon accepting, via the operating device 140, a press of the camera addition button B1 (see FIG. 7) (see FIG. 8). Because the virtual cameras have different camera information, the environment can be observed from various viewpoints.

次に、方策モデルの実行について説明する。 Next, execution of the policy model will be explained.

方策モデルは、観測画像に応じて次の行動を出力するモデルである。方策モデルが変更されると、方策の実行によるエージェントの挙動が変わる。 The policy model is a model that outputs the next action according to the observed image. When the policy model is changed, the behavior of the agent due to policy execution changes.

ＧＵＩ部１００では、情報指定部１１１は、操作デバイス１４０を介して方策の実行指示を受けた場合、通信デバイス１３０を介して方策の実行指示をシミュレータ部２００に送信する。例えば、操作デバイス１４０を介した方策の実行指示は、モデル設定画面Ｇ２（Ｇ２２）における再生ボタンＢ２（図６参照）の押下であってよい。シミュレータ部２００では、方策実行部２１２は、方策の実行指示を受信すると、この実行指示に従って、設定された方策モデルに従って方策を実行する。方策の実行に従って、エージェントが環境内で動作する。 In the GUI unit 100 , when the information specifying unit 111 receives an instruction to execute the policy via the operating device 140 , the information specifying unit 111 transmits the instruction to execute the policy to the simulator unit 200 via the communication device 130 . For example, the instruction to execute the policy via the operating device 140 may be a press of the play button B2 (see FIG. 6) on the model setting screen G2 (G22). In the simulator unit 200, upon receiving the policy execution instruction, the policy execution unit 212 executes the policy according to the set policy model in accordance with the execution instruction. Agents operate within the environment according to the execution of the policy.

また、ＧＵＩ部１００では、情報指定部１１１は、操作デバイス１４０を介して方策の実行の停止指示を受けた場合、通信デバイス１３０を介して方策の実行の停止指示をシミュレータ部２００に送信する。例えば、操作デバイス１４０を介した方策の実行の停止指示は、例えば、モデル設定画面Ｇ２（Ｇ２２）における停止ボタンＢ３（図６参照）の押下であってよい。シミュレータ部２００では、プロセッサ２１０が、方策の実行の停止指示を受信すると、この停止指示に従って、設定された方策の実行を停止する。方策モデルの実行の停止に従って、エージェントが環境内での動作を停止する。 Furthermore, in the GUI unit 100 , when receiving an instruction to stop the execution of the policy via the operating device 140 , the information specifying unit 111 transmits the instruction to stop the execution of the policy to the simulator unit 200 via the communication device 130 . For example, the instruction to stop execution of the policy via the operating device 140 may be, for example, pressing the stop button B3 (see FIG. 6) on the model setting screen G2 (G22). In the simulator unit 200, when the processor 210 receives the instruction to stop execution of the policy, it stops the execution of the set policy in accordance with this stop instruction. The agent stops working in the environment as the policy model stops running.

ＧＵＩ部１００及びシミュレータ部２００は、相互に協働して、方策の実行による環境の変化を反映した環境画面Ｇ１と観測画面Ｇ４とをインタラクティブに表示する。具体的には、シミュレータ部２００では、通信デバイス２３０は、方策モデルの実行の際、変化した環境画像と観測画像とをＧＵＩ部１００に送信する。ＧＵＩ部１００では、方策実行部２１２は、通信デバイス１３０を介してシミュレータ部２００から環境画像と観測画像とを受信し、表示デバイス１５０を介して環境画像を含む環境画面Ｇ１と観測画像を含む観測画面Ｇ４とを表示する。なお、方策の実行が継続される期間には、環境画像と観測画像との変化は順次発生し得る。そのため、ＧＵＩ部１００は、順次変化する環境画像と観測画像とをシミュレータ部２００から取得して表示してよい。よって、情報処理システム５は、方策の実行によって次々と変化する環境と観測画像とを可視化してユーザに提供できる。よって、ユーザは、方策の実行による環境の変化等をリアルタイムに直感的に把握できる。 The GUI unit 100 and the simulator unit 200 cooperate with each other to interactively display the environment screen G1 and the observation screen G4, which reflect changes in the environment caused by the execution of the policy. Specifically, in the simulator unit 200, the communication device 230 transmits the changed environment image and observation image to the GUI unit 100 when the policy model is executed. In the GUI unit 100, the policy execution unit 212 receives the environment image and the observation image from the simulator unit 200 via the communication device 130, and displays, via the display device 150, the environment screen G1 including the environment image and the observation screen G4 including the observation image. Note that while the execution of the policy continues, the environment image and the observation image may change sequentially. Therefore, the GUI unit 100 may acquire the sequentially changing environment images and observation images from the simulator unit 200 and display them. The information processing system 5 can thus visualize the environment and observation images, which change one after another as the policy is executed, and provide them to the user. As a result, the user can intuitively grasp, in real time, changes in the environment caused by the execution of the policy.

また、方策実行部２１２は、通信デバイス２３０を介してＧＵＩ部１００の操作デバイス１４０で入力された操作情報を取得し、この移動指示に従って、方策モデルの実行時に仮想カメラを移動させてもよい。例えば、方策実行部２１２は、方策モデルの実行中に、環境画面Ｇ１に対するマウス操作によって仮想カメラのカメラアイコンＣＩをドラッグする操作情報を取得し、この操作情報に基づいて仮想カメラの位置や向きを調整（例えば視点移動）してもよい。つまり、方策実行部２１２は、方策モデルの実行中に、方策モデルの実行前に設定された仮想カメラのカメラ情報を変更してもよい。例えば、カメラアイコンＣＩの位置をドラッグ操作によって移動させることで、方策実行部２１２は、仮想カメラの位置（視点）を変更してよい。例えば、カメラアイコンＣＩのレンズ（不図示）の位置を移動させることで、方策実行部２１２は、仮想カメラの向き（視線方向）を変更してよい。なお、図４に示したこのカメラアイコンＣＩの表示例は一例であり、カメラアイコンＣＩが他の位置や表示態様で表示されてもよい。表示デバイス１５０は、カメラ情報が変更された仮想カメラに撮像された画像に基づく情報（例えば環境画像、観測画像、潜在状態、復元画像）を表示できる。よって、ユーザは、方策モデルの実行によるエージェントの動作挙動や、潜在状態や復元画像等を確認しながら、インタラクティブに好適なカメラ情報（例えばカメラの位置や向き）を探索し、仮想カメラの撮像に基づく画像を調整可能である。 Furthermore, the policy execution unit 212 may acquire, via the communication device 230, operation information input with the operating device 140 of the GUI unit 100 and, in accordance with the movement instruction it represents, move the virtual camera while the policy model is being executed. For example, during execution of the policy model, the policy execution unit 212 may acquire operation information indicating that the camera icon CI of the virtual camera has been dragged by a mouse operation on the environment screen G1, and adjust the position and orientation of the virtual camera (for example, move the viewpoint) based on this operation information. That is, while the policy model is being executed, the policy execution unit 212 may change the camera information of the virtual camera that was set before the execution of the policy model. For example, when the camera icon CI is moved by a drag operation, the policy execution unit 212 may change the position (viewpoint) of the virtual camera. For example, when the position of the lens (not shown) of the camera icon CI is moved, the policy execution unit 212 may change the orientation (line-of-sight direction) of the virtual camera. Note that the display of the camera icon CI shown in FIG. 4 is just an example, and the camera icon CI may be displayed at another position or in another display mode. The display device 150 can display information (for example, an environment image, an observation image, a latent state, and a restored image) based on images captured by the virtual camera whose camera information has been changed. Therefore, while checking the behavior of the agent under the executed policy model, the latent states, the restored image, and so on, the user can interactively search for suitable camera information (for example, camera position and orientation) and adjust the images obtained from the virtual camera.

図９は、潜在状態マッピング画面Ｇ５の一例を示す図である。図９では、潜在状態マッピング画面Ｇ５で描画されている潜在状態の様子が時系列で変化している。また、図９では、潜在状態マッピング画面Ｇ５にマッピングされた１つの潜在状態を基に、復元画像が生成されることを例示している。 FIG. 9 is a diagram showing an example of the latent state mapping screen G5. In FIG. 9, the state of the latent state drawn on the latent state mapping screen G5 changes over time. Further, FIG. 9 illustrates that a restored image is generated based on one latent state mapped on the latent state mapping screen G5.

シミュレータ部２００では、観測画像処理部２１３は、仮想カメラを制御する。仮想カメラは、設定されたカメラ情報に従って、環境を撮像して観測画像を生成する。方策の実行に従ってエージェントが動作することで環境に影響を与え、環境が変化し得る。仮想カメラは、変化する環境を時系列で順次撮像し、観測画像を順次生成してよい。通信デバイス２３０は、順次生成された観測画像をＧＵＩ部１００へ送信する。 In the simulator unit 200, an observation image processing unit 213 controls a virtual camera. The virtual camera images the environment and generates an observation image according to the set camera information. The action of the agent according to the execution of the policy affects the environment, and the environment can change. The virtual camera may sequentially capture images of a changing environment in chronological order and may sequentially generate observation images. The communication device 230 sequentially transmits the generated observation images to the GUI unit 100.

ＧＵＩ部１００では、通信デバイス１３０は、シミュレータ部２００から観測画像を受信する。潜在状態処理部１１２は、受信された観測画像を圧縮して、観測画像に対応する潜在状態を生成する。この場合、潜在状態処理部１１２は、指定された世界モデルを用いて、観測画像に基づいて潜在状態を生成してよい。例えば、潜在状態処理部１１２は、観測画像を世界モデルの入力とし、世界モデルの出力として多次元の潜在状態を取得する。潜在状態処理部１１２は、取得された多次元の潜在状態を次元圧縮し、２次元又は３次元の潜在状態を生成する。この場合、潜在状態処理部１１２は、次元圧縮手法（例えばＰＣＡ（principal component analysis））に従って、次元圧縮された潜在状態を生成してよい。２次元又は３次元の潜在状態は、視認可能に２次元平面又は３次元空間にマッピングが可能である。潜在状態処理部１１２は、潜在空間を示す潜在状態マッピング画面Ｇ５に、例えば３次元の潜在状態をマッピングして描画する。 In the GUI unit 100, the communication device 130 receives observation images from the simulator unit 200. The latent state processing unit 112 compresses a received observation image and generates the latent state corresponding to that observation image. In this case, the latent state processing unit 112 may generate the latent state from the observation image using the specified world model. For example, the latent state processing unit 112 supplies the observation image as input to the world model and obtains a multidimensional latent state as the output of the world model. The latent state processing unit 112 then reduces the dimensionality of the obtained multidimensional latent state to generate a two-dimensional or three-dimensional latent state. In this case, the latent state processing unit 112 may generate the dimension-compressed latent state using a dimension compression technique (for example, PCA (principal component analysis)). A two-dimensional or three-dimensional latent state can be mapped visibly onto a two-dimensional plane or into a three-dimensional space. The latent state processing unit 112 maps and draws, for example, three-dimensional latent states on the latent state mapping screen G5, which shows the latent space.
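The dimension compression step can be sketched with PCA as follows. The sketch assumes that the multidimensional latent states have already been obtained from the world model and are given as rows of floats. The function name `pca_project` is hypothetical, and power iteration with deflation is used only to keep the example free of external libraries; a practical system would more likely rely on a numerical library.

```python
import math
from typing import List


def pca_project(rows: List[List[float]], n_components: int = 2,
                iterations: int = 200) -> List[List[float]]:
    """Project rows onto their leading principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix of the centered latent states (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    components = []
    for k in range(n_components):
        v = [1.0 / (i + 1 + k) for i in range(d)]   # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:                          # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Toy latent states (assumption): most variance along axis 0, less along axis 1.
latents = [[-2.0, -1.0, 0.0], [-2.0, 1.0, 0.0], [2.0, -1.0, 0.0], [2.0, 1.0, 0.0]]
points = pca_project(latents, n_components=2)
```

Each row of `points` is a two-dimensional coordinate that could be drawn as one dot on a mapping screen; passing `n_components=3` would give coordinates for a three-dimensional view.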

また、潜在状態処理部１１２は、シミュレータ部２００から観測画像を順次受信する。よって、潜在状態処理部１１２は、順次受信された複数の観測画像に基づいて複数の潜在状態を生成する。よって、潜在状態処理部１１２は、潜在状態マッピング画面Ｇ５に、生成された複数の潜在状態を描画する。観測画像は時系列で順次得られるので、描画される潜在状態も時系列で増加していく。潜在状態処理部１１２は、生成された各潜在状態を潜在状態マッピング画面Ｇ５にマッピングし、表示デバイス１５０に表示させる。 Further, the latent state processing unit 112 sequentially receives observation images from the simulator unit 200. Therefore, the latent state processing unit 112 generates a plurality of latent states based on a plurality of sequentially received observation images. Therefore, the latent state processing unit 112 draws the plurality of generated latent states on the latent state mapping screen G5. Since the observed images are obtained sequentially in chronological order, the number of latent states drawn also increases in chronological order. The latent state processing unit 112 maps each generated latent state onto the latent state mapping screen G5 and displays it on the display device 150.

図９では、まず、現在の観測画像に対応する現在の潜在状態ｐ１がマッピングされ、潜在状態マッピング画面Ｇ５１に表示されている。ここでの現在の観測画像は、例えば、方策モデルの実行前の環境が撮像された観測画像である。 In FIG. 9, first, the current latent state p1 corresponding to the current observation image is mapped and displayed on the latent state mapping screen G51. The current observed image here is, for example, an observed image of the environment before execution of the policy model.

潜在状態処理部１１２は、シミュレータ部２００により方策の実行が開始されると、方策実行の毎ステップで得られる各観測画像に対応する各潜在状態を、潜在状態マッピング画面Ｇ５２に描画していく。よって、方策の実行中の時間経過とともに、マッピングされる潜在状態の数が増加していく。潜在状態マッピング画面Ｇ５２では、潜在状態ｐ１１は、現在の潜在状態（この方策の実行中に得られた最新の潜在状態）である。潜在状態ｐ１２は、方策の実行によって順次得られた潜在状態（潜在状態の系列）であって、潜在状態ｐ１１以外の潜在状態を示す。 When the simulator unit 200 starts executing the policy, the latent state processing unit 112 draws each latent state corresponding to each observation image obtained at each step of the policy execution on the latent state mapping screen G52. Therefore, the number of mapped latent states increases as time passes while the policy is being executed. In the latent state mapping screen G52, the latent state p11 is the current latent state (the latest latent state obtained during execution of this strategy). The latent state p12 is a latent state (sequence of latent states) sequentially obtained by executing the policy, and indicates a latent state other than the latent state p11.

潜在状態マッピング画面Ｇ５における潜在状態同士の近さは、潜在状態の意味的な類似度を示す。したがって、ユーザは、潜在状態マッピング画面Ｇ５の表示を確認することで、エージェントがどの潜在状態に対応するどの観測同士が近いと認識しているか、を把握できる。方策の実行による環境の変化は、時系列で少しずつ変化することが想定され、時系列で大きく変化する可能性は少ない。そのため、時系列で近い複数の観測画像に対応する複数の潜在状態は、潜在状態マッピング画面Ｇ５上では相互に近い位置に描画されると想定される。よって、ユーザは、時系列で近い時点で得られた複数の潜在状態が潜在状態マッピング画面Ｇ５上で遠い位置に配置されている場合には、観測画像から潜在状態を導出する世界モデルの学習が不十分であったり、誤った方向に学習されていたりする可能性があることを認識できる。 The closeness of latent states on the latent state mapping screen G5 indicates their semantic similarity. Therefore, by checking the latent state mapping screen G5, the user can see which observations, and which corresponding latent states, the agent regards as close to one another. The environment is expected to change little by little over time as the policy is executed, and is unlikely to change abruptly. It is therefore assumed that latent states corresponding to observation images that are close in time are drawn close to each other on the latent state mapping screen G5. Consequently, if latent states obtained at nearby points in time are placed far apart on the latent state mapping screen G5, the user can recognize that the world model that derives latent states from observation images may be insufficiently trained or may have been trained in the wrong direction.
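The check that the user performs visually (temporally adjacent latent states should lie close together) can also be expressed numerically. The sketch below is an assumption layered on the text, not something the source describes: it flags steps whose distance to the previous mapped point exceeds a multiple of the median step distance, and both the name `flag_jumps` and the factor of 3 are arbitrary illustrative choices.

```python
import math
import statistics
from typing import List, Sequence


def consecutive_distances(points: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance between each mapped latent state and the next one."""
    return [math.dist(a, b) for a, b in zip(points, points[1:])]


def flag_jumps(points: Sequence[Sequence[float]], factor: float = 3.0) -> List[int]:
    """Indices of latent states that lie unusually far from their predecessor."""
    distances = consecutive_distances(points)
    if not distances:
        return []
    typical = statistics.median(distances)
    return [i + 1 for i, dist in enumerate(distances) if dist > factor * typical]


# Toy 2-D mapping (assumption): the fourth point jumps away from the trajectory.
mapped = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]
suspicious = flag_jumps(mapped)
```

A non-empty result would correspond to the situation in which latent states obtained at nearby times appear far apart, hinting that the world model may need further training.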

なお、潜在状態が３次元に次元圧縮されている場合は、潜在状態マッピング画面Ｇ５に表示されている画面は、３次元の情報を表示デバイス１５０の投影面である２次元に投影した情報である。この場合、ユーザが他の視点から投影した結果を観測することを希望するのであれば、潜在状態処理部１１２は、操作デバイス１４０を介して、潜在状態がマッピングされる３次元空間を示す潜在状態マッピング画面Ｇ５を、任意の方向に回転させて、表示デバイス１５０を介して表示してよい。例えば、潜在状態処理部１１２は、マウスによるドラッグ操作を受け付けて、ドラッグ操作に応じて、３次元空間を規定する相互に直交する３軸の向きを回転させてよい。これにより、ユーザは、複数の潜在状態の位置関係を様々な視点及び視線方向に沿って確認できる。 Note that when the latent states are compressed to three dimensions, what is displayed on the latent state mapping screen G5 is the three-dimensional information projected onto the two-dimensional projection surface of the display device 150. In this case, if the user wishes to observe the result projected from another viewpoint, the latent state processing unit 112 may, in response to input via the operating device 140, rotate the latent state mapping screen G5, which shows the three-dimensional space onto which the latent states are mapped, in any direction and display it via the display device 150. For example, the latent state processing unit 112 may accept a drag operation with a mouse and, in accordance with the drag operation, rotate the three mutually orthogonal axes that define the three-dimensional space. This allows the user to check the positional relationships among a plurality of latent states from various viewpoints and line-of-sight directions.
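Rotating the three-dimensional latent space and projecting it onto the two-dimensional display surface can be sketched as below. The mapping from drag distance to angle (`drag_to_angles`, 0.5 degrees per pixel), the rotation order (vertical axis first, then horizontal axis), and the orthographic projection that simply drops the depth coordinate are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def drag_to_angles(dx_pixels: float, dy_pixels: float,
                   degrees_per_pixel: float = 0.5) -> Tuple[float, float]:
    """Hypothetical mapping from a mouse drag to (yaw, pitch) in degrees."""
    return dx_pixels * degrees_per_pixel, dy_pixels * degrees_per_pixel


def rotate_and_project(points: Sequence[Sequence[float]],
                       yaw_deg: float, pitch_deg: float) -> List[Tuple[float, float]]:
    """Rotate 3-D points about the Y axis, then the X axis, and drop depth."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    projected = []
    for x, y, z in points:
        # Rotation about the vertical (Y) axis.
        x1 = math.cos(yaw) * x + math.sin(yaw) * z
        z1 = -math.sin(yaw) * x + math.cos(yaw) * z
        # Rotation about the horizontal (X) axis.
        y2 = math.cos(pitch) * y - math.sin(pitch) * z1
        # Orthographic projection: keep the screen-plane coordinates only.
        projected.append((x1, y2))
    return projected


yaw, pitch = drag_to_angles(180.0, 0.0)             # a 180-pixel horizontal drag
view = rotate_and_project([(0.0, 0.0, 1.0)], yaw, pitch)
```

In this example a point that initially lies straight behind the origin along the depth axis moves to the side of the screen after a quarter-turn, which is the effect a user would see when dragging the view.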

また、潜在状態マッピング画面Ｇ５３は、潜在状態マッピング画面Ｇ５２よりも更に時系列で後の状態であり、潜在状態の数が増えている。潜在状態マッピング画面Ｇ５３では、潜在状態ｐ２１は、現在の潜在状態（この方策の実行中に得られた最新の潜在状態）である。潜在状態ｐ２２は、方策の実行によって順次得られた潜在状態（潜在状態の系列）であって、潜在状態ｐ２１以外の潜在状態を示す。 Further, the latent state mapping screen G53 is a state that is later in time series than the latent state mapping screen G52, and the number of latent states is increasing. In the latent state mapping screen G53, the latent state p21 is the current latent state (the latest latent state obtained during execution of this strategy). The latent state p22 is a latent state (sequence of latent states) sequentially obtained by executing the policy, and indicates a latent state other than the latent state p21.

ここで、復元画像処理部１１３は、操作デバイス１４０を介して、潜在状態マッピング画面Ｇ５３に描画された複数の潜在状態のうち、１つの潜在状態ｐ２３を指定してよい。復元画像処理部１１３は、指定された潜在状態ｐ２３を、他の潜在状態ｐ２１，ｐ２２とは異なる表示態様で表示デバイス１５０に表示させてよい。復元画像処理部１１３は、指定された潜在状態ｐ２３に基づいて、潜在状態ｐ２３に対応する復元画像を生成する。例えば、復元画像処理部１１３は、指定された潜在状態ｐ２３に対して復元処理して、復元画像を生成する。この場合、復元画像処理部１１３は、指定された潜在状態ｐ２３を世界モデルの入力とし、世界モデルの出力として復元画像を取得してよい。復元画像処理部１１３は、生成された復元画像を表示デバイス１５０に表示させる。ユーザは、この復元画像を含む復元画面Ｇ６１の表示を確認することで、エージェントがどのように環境を理解しているか、エージェントの世界モデルの学習度合い、等を確認できる。 Here, the restored image processing unit 113 may specify, via the operating device 140, one latent state p23 from among the plurality of latent states drawn on the latent state mapping screen G53. The restored image processing unit 113 may cause the display device 150 to display the specified latent state p23 in a display mode different from that of the other latent states p21 and p22. The restored image processing unit 113 generates a restored image corresponding to the latent state p23 based on the specified latent state p23. For example, the restored image processing unit 113 performs restoration processing on the specified latent state p23 to generate the restored image. In this case, the restored image processing unit 113 may supply the specified latent state p23 as input to the world model and obtain the restored image as the output of the world model. The restored image processing unit 113 causes the display device 150 to display the generated restored image. By checking the restoration screen G61 including this restored image, the user can confirm how the agent understands the environment, how well the agent's world model has been trained, and so on.

潜在状態は、観測画像が圧縮されたものである。また、復元画像は潜在状態が復元されたものであるので、観測画像が圧縮されて復元されたものである。そのため、観測画像から潜在状態を導出する世界モデルの学習精度が高い場合には、観測画像を示す観測画面Ｇ４と復元画像を示す復元画面Ｇ６（Ｇ６１）は一致するはずである。図４の一覧画面ＧＬでは、観測画面Ｇ４の観測画像と復元画面Ｇ６の復元画像とはやや異なる。そのため、この場合には、ユーザは、世界モデルの学習状態が完全ではないことを理解できる。 A latent state is a compressed observation image. A restored image is obtained by restoring a latent state, so it is an observation image that has been compressed and then restored. Therefore, if the world model that derives latent states from observation images has been trained with high accuracy, the observation screen G4 showing the observation image and the restoration screen G6 (G61) showing the restored image should match. In the list screen GL of FIG. 4, the observation image on the observation screen G4 and the restored image on the restoration screen G6 are slightly different. In this case, therefore, the user can understand that the training of the world model is not yet complete.
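The comparison between the observation image and the restored image is made by eye in the source. As a complement, the difference could be quantified, for example with the mean squared error (MSE) or the peak signal-to-noise ratio (PSNR). The sketch below assumes grayscale images given as nested lists of pixel values in the range 0 to 255; the function names and the choice of metrics are illustrative assumptions, not part of the described system.

```python
import math
from typing import Sequence


def mse(observed: Sequence[Sequence[float]],
        restored: Sequence[Sequence[float]]) -> float:
    """Mean squared pixel difference between two equally sized images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(observed, restored):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    return total / count


def psnr(observed: Sequence[Sequence[float]],
         restored: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    error = mse(observed, restored)
    if error == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


# Toy 2x2 images (assumption): the restored image deviates slightly.
observed_image = [[0, 255], [128, 64]]
restored_image = [[0, 250], [120, 64]]
error = mse(observed_image, restored_image)
quality = psnr(observed_image, restored_image)
```

A small error (high PSNR) would be consistent with a well-trained world model, while a large error would match the "slightly different" case shown in FIG. 4.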

次に、情報処理システム５の動作例について説明する。 Next, an example of the operation of the information processing system 5 will be described.

図１０は、情報処理システム５の動作例を示すシーケンス図である。 FIG. 10 is a sequence diagram showing an example of the operation of the information processing system 5.

まず、ＧＵＩ部１００では、情報指定部１１１は、操作デバイス１４０を介して学習済みの世界モデルと学習済みの方策モデルとを入力して指定する（Ｓ１１）。ここでの学習済みモデルは、世界モデルと方策モデルとを含む。例えば、学習済みの世界モデルと方策モデルとは、それぞれ１つ以上がメモリ１２０に保持されており、どの世界モデルと方策モデルとを用いるかが指定されてよい。 First, in the GUI unit 100, the information specifying unit 111 inputs and specifies a learned world model and a learned policy model via the operating device 140 (S11). The trained models here include a world model and a policy model. For example, one or more learned world models and policy models may be stored in the memory 120, and which world model and policy model to use may be specified.

情報指定部１１１は、操作デバイス１４０を介して環境条件を入力し、環境条件を指定する（Ｓ１２）。ここでの環境条件は、環境に関する環境ファイル、仮想カメラのカメラ情報（例えば仮想カメラの位置、角度（向き））、環境に影響を与えるエージェントの行動の情報（例えばロボット装置の動作の情報）、等を含む。例えば、１つ以上の環境ファイルがメモリ１２０に保持されており、どの環境ファイルを用いるかが指定されてよい。 The information specifying unit 111 inputs environmental conditions via the operating device 140 and specifies the environmental conditions (S12). The environmental conditions here include an environment file related to the environment, camera information of the virtual camera (for example, the position and angle (orientation) of the virtual camera), information on agent actions that affect the environment (for example, information on the operation of the robot device), and the like. For example, one or more environment files may be held in the memory 120, and which environment file to use may be specified.

通信デバイス１３０は、決定された方策モデル及び環境条件をシミュレータ部２００に送信し、シミュレータ部２００に方策モデル及び環境条件を設定（反映）するよう指示する（Ｓ１３）。 The communication device 130 transmits the determined policy model and environmental conditions to the simulator unit 200, and instructs the simulator unit 200 to set (reflect) the policy model and environmental conditions (S13).

シミュレータ部２００では、情報設定部２１１は、通信デバイス２３０を介して、ＧＵＩ部１００からの方策モデル及び環境条件を受信し、この方策モデル及び環境条件を設定し、設定情報をメモリ２２０に保持させる（Ｓ２１）。 In the simulator unit 200, the information setting unit 211 receives the policy model and environmental conditions from the GUI unit 100 via the communication device 230, sets the policy model and environmental conditions, and causes the memory 220 to retain the setting information. (S21).

方策実行部２１２は、設定された環境条件及び方策モデルに基づいて、エージェントが配置された環境で方策を実行する（Ｓ２２）。方策の実行により、環境においてエージェントが動作するので、環境が変化し得る。観測画像処理部２１３は、仮想カメラを制御し、設定されたカメラ情報に従って、環境を撮像して観測画像を生成する（Ｓ２３）。つまり、観測画像処理部２１３は、環境条件の反映後の環境情報として、仮想カメラを介して観測画像を取得する。通信デバイス２３０は、取得された観測画像をＧＵＩ部１００へ送信する（Ｓ２４）。 The policy execution unit 212 executes the policy in the environment where the agent is placed based on the set environmental conditions and policy model (S22). Execution of the policy may cause the environment to change as the agent operates in the environment. The observation image processing unit 213 controls the virtual camera to capture an image of the environment and generate an observation image according to the set camera information (S23). That is, the observed image processing unit 213 acquires an observed image via the virtual camera as environmental information after reflecting the environmental conditions. The communication device 230 transmits the acquired observation image to the GUI unit 100 (S24).

なお、方策は、方策モデルに従って、時系列で連続的に実行されていく。つまり、方策実行部２１２は、方策モデルを順次実行し、観測画像処理部２１３は、仮想カメラを介して観測画像を順次取得する。通信デバイス２３０は、順次得られた観測画像を１つずつ順次送信してもよいし、いくつかの観測画像をまとめて送信することを順次繰り返してもよい。 Note that the policy is executed continuously in chronological order according to the policy model. That is, the policy execution unit 212 sequentially executes the policy model, and the observed image processing unit 213 sequentially acquires observed images via the virtual camera. The communication device 230 may sequentially transmit the sequentially obtained observation images one by one, or may repeat sequentially transmitting several observation images at once.
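The sequential execution in steps S22 to S24 can be sketched as a loop that executes the policy once, captures one observation image, and hands the images over either one by one or in groups. This is a minimal sketch under stated assumptions: `ToyEnv`, `toy_policy`, and `rollout` are hypothetical stand-ins for the environment, the learned policy model, and the simulator loop, and the one-row "image" stands in for the output of the virtual camera.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List

Image = List[List[float]]  # grayscale image as rows of pixel values


@dataclass
class ToyEnv:
    """Hypothetical 1-D environment: a gripper moves toward a ball."""
    gripper: int = 0
    ball: int = 5
    width: int = 8

    def step(self, action: int) -> None:
        # Move the gripper by -1, 0 or +1 and keep it inside the scene.
        self.gripper = max(0, min(self.width - 1, self.gripper + action))

    def render(self) -> Image:
        """Stand-in for the virtual camera: a 1 x width image."""
        row = [0.0] * self.width
        row[self.ball] = 0.5
        row[self.gripper] = 1.0  # the gripper hides the ball when they overlap
        return [row]


def toy_policy(image: Image) -> int:
    """Stand-in for a learned policy model: move toward the ball."""
    row = image[0]
    gripper = row.index(1.0)
    ball = row.index(0.5) if 0.5 in row else gripper
    return (ball > gripper) - (ball < gripper)


def rollout(env: ToyEnv, policy: Callable[[Image], int],
            steps: int, batch_size: int = 1) -> Iterator[List[Image]]:
    """Execute the policy (S22), capture an image per step (S23),
    and yield the images in groups of batch_size (S24)."""
    batch: List[Image] = []
    for _ in range(steps):
        env.step(policy(env.render()))
        batch.append(env.render())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # send any remaining images
        yield batch
```

With `batch_size=1` the images are transmitted one at a time; a larger value corresponds to transmitting several observation images together, repeated sequentially.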

In the GUI unit 100, the communication device 130 receives the observation images from the simulator unit 200 (S14). In this case, the communication device 130 may receive the observation images sequentially one at a time, or may repeatedly receive several observation images together.

潜在状態処理部１１２は、観測画像と学習済みの世界モデル（指定された世界モデル）とに基づいて、観測画像を圧縮して、観測画像に対応する潜在状態を導出する（Ｓ１５）。潜在状態の生成は、観測画像毎に行われるので、複数の潜在状態が導出される。複数の潜在状態は、ベクトル空間である潜在空間にマッピングされる。この場合、当初の多次元の潜在状態が、２次元又は３次元の潜在状態に次元圧縮されてよい。 The latent state processing unit 112 compresses the observed image and derives a latent state corresponding to the observed image based on the observed image and the learned world model (designated world model) (S15). Since the latent state is generated for each observed image, a plurality of latent states are derived. Multiple latent states are mapped into a latent space, which is a vector space. In this case, the original multidimensional latent state may be dimensionally compressed into a two-dimensional or three-dimensional latent state.
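The dimensional compression mentioned in step S15 can be sketched as follows. The disclosure does not name a specific technique, so this sketch assumes principal component analysis (PCA) computed by power iteration; the function names `center`, `top_component`, and `pca_project` are hypothetical, and the multidimensional latent states are assumed to have already been obtained from the world model.

```python
import math
import random
from typing import List

Vector = List[float]


def center(data: List[Vector]) -> List[Vector]:
    """Subtract the mean latent state from every latent state."""
    n, dim = len(data), len(data[0])
    mean = [sum(v[j] for v in data) / n for j in range(dim)]
    return [[v[j] - mean[j] for j in range(dim)] for v in data]


def top_component(data: List[Vector], iters: int = 200, seed: int = 0) -> Vector:
    """Estimate the dominant principal direction by power iteration."""
    rng = random.Random(seed)
    dim = len(data[0])
    vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, vec)) for row in data]
        new = [sum(s * row[j] for s, row in zip(scores, data)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:  # no variance left; any direction projects to zero
            break
        vec = [x / norm for x in new]
    return vec


def pca_project(latents: List[Vector], out_dim: int = 2) -> List[Vector]:
    """Compress multidimensional latent states to out_dim coordinates."""
    data = center(latents)
    coords: List[Vector] = [[] for _ in data]
    for _ in range(out_dim):
        comp = top_component(data)
        scores = [sum(a * b for a, b in zip(row, comp)) for row in data]
        for point, score in zip(coords, scores):
            point.append(score)
        # Deflate: remove the found direction before searching for the next one.
        data = [[x - score * c for x, c in zip(row, comp)]
                for row, score in zip(data, scores)]
    return coords
```

Calling `pca_project(latents, out_dim=3)` would give three-dimensional coordinates instead. Other dimensionality reduction methods could be substituted without changing the rest of the flow.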

The restored image processing unit 113 restores the latent state and generates a restored image based on the derived latent state and the learned world model (the specified world model) (S16). The restored image processing unit 113 may generate a restored image corresponding to a latent state specified via the operating device 140, for example. Alternatively, without any latent state being specified, the restored image processing unit 113 may generate one or more restored images corresponding to any one or more latent states.
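The relationship between compression (S15) and restoration (S16) can be sketched with a toy encoder and decoder. This is an illustration only: `encode` and `decode` are hypothetical stand-ins for the learned world model (here simple block averaging and repetition), and `mse` is one possible way, assumed here, to quantify how close a restored image is to the observation image. The image length is assumed to be a multiple of `block`.

```python
from typing import List


def encode(pixels: List[float], block: int = 2) -> List[float]:
    """Toy encoder: average each run of `block` pixels (lossy compression)."""
    return [sum(pixels[i:i + block]) / block
            for i in range(0, len(pixels), block)]


def decode(latent: List[float], block: int = 2) -> List[float]:
    """Toy decoder: repeat each latent value `block` times."""
    return [value for value in latent for _ in range(block)]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two images of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


observed = [0.0, 2.0, 4.0, 4.0]
latent = encode(observed)      # compressed latent state
restored = decode(latent)      # restored image
error = mse(observed, restored)
```

A small error suggests that the latent state retains the features needed for restoration, while a large error suggests that information was lost during compression.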

The display device 150 displays the observation image, the derived latent states, and the restored image (S17). For example, the observation image is displayed on the observation screen G4, the latent states are displayed on the latent state mapping screen G5, and the restored image is displayed on the restoration screen G6. The display device 150 may also display a list screen GL that includes these screens. In this way, the GUI unit 100 computes and visualizes the restored image and the latent space.

このように、情報処理システム５は、潜在空間を示す潜在状態マッピング画面Ｇ５における複数の潜在状態の分布や、潜在状態が復元された復元画像をユーザに提示できる。よって、情報処理システム５は、世界モデルの学習状態の良し悪しを可視化できる。 In this way, the information processing system 5 can present to the user the distribution of a plurality of latent states on the latent state mapping screen G5 showing the latent space and a restored image in which the latent states are restored. Therefore, the information processing system 5 can visualize whether the learning state of the world model is good or bad.

In the world model, the observed image and the restored image are not perfectly reversible, and the amount of information may gradually decrease as compression and restoration are repeated. The world model is trained so that the restored image approaches the original observed image, by acquiring latent states that capture the characteristic features of the task's observation information. Furthermore, because the world model by nature performs time-series prediction, feature points (latent states) with similar properties are designed to lie close together in the vector space.

ユーザは、複数の潜在状態の分布を確認することで、エージェントがどの観測同士が近い（意味的に類似している）と判断しているのかを認識できる。また、ユーザは、復元画像を確認することで、エージェントがどのように理解しているのか、世界モデルの学習具合を確認できる。ユーザは、復元画像が正しく復元されていれば、潜在状態についても復元についても好適に学習できていることが確認できる。したがって、情報処理システム５は、ユーザ（例えばエンジニア）による世界モデルの解釈性及び実験の円滑化を向上できる。 By checking the distribution of multiple latent states, the user can recognize which observations the agent has determined to be close (semantically similar). In addition, by checking the restored image, the user can check how the agent understands and the learning progress of the world model. If the restored image is correctly restored, the user can confirm that both the latent state and the restoration are successfully learned. Therefore, the information processing system 5 can improve the interpretability of the world model and facilitate experiments by users (for example, engineers).

なお、本実施形態では、主に仮想カメラが１つであり、単視点であることを例示したが、仮想カメラが複数あり、多視点であってもよい。 Note that in this embodiment, there is mainly one virtual camera and a single viewpoint, but there may be a plurality of virtual cameras and a multi-viewpoint.

また、本実施形態では、タスクとして、エージェントのロボット装置１０がボール２０を掴むことを例示したが、これに限られない。つまり、タスクは、ロボット装置１０が物を掴む以外の動作であってもよい。また、エージェントは、ロボット装置以外であってもよく、例えば自動運転シミュレータでも、ゲームシミュレータでもよい。本実施形態は、世界モデルの強化学習を用いる技術全般に適用可能である。 Furthermore, in this embodiment, the task is illustrated as the agent's robot device 10 grabbing the ball 20, but the task is not limited to this. In other words, the task may be an operation other than the robot device 10 grasping an object. Furthermore, the agent may be other than a robot device, such as an automatic driving simulator or a game simulator. This embodiment is applicable to all technologies using reinforcement learning of a world model.

また、本実施形態では、ＧＵＩ部１００とシミュレータ部２００とで分担する処理は、他の分担方法であってもよい。例えば、ＧＵＩ部１００は、操作デバイス１４０を介した入力と表示デバイス１５０による表示以外の処理をなるべくシミュレータ部２００側で実施するようにしてもよい。例えば、潜在状態処理部１１２と復元画像処理部１１３とを、シミュレータ部２００が有してもよい。 Further, in this embodiment, the processing to be shared between the GUI unit 100 and the simulator unit 200 may be shared by another method. For example, the GUI unit 100 may perform processing other than input via the operating device 140 and display by the display device 150 on the simulator unit 200 side as much as possible. For example, the simulator unit 200 may include the latent state processing unit 112 and the restored image processing unit 113.

また、本実施形態では、ＧＵＩ部１００及びシミュレータ部２００は、一体の情報処理装置として構成されてもよい。つまり、シミュレータ部２００を省略して、ＧＵＩ部１００側だけで処理を完結してよい。つまり、ＧＵＩ部１００が、方策実行部２１２と観測画像処理部２１３とを有してもよい。 Furthermore, in this embodiment, the GUI section 100 and the simulator section 200 may be configured as an integrated information processing device. In other words, the simulator section 200 may be omitted and the processing may be completed only on the GUI section 100 side. That is, the GUI section 100 may include the policy execution section 212 and the observed image processing section 213.

また、本実施形態では、仮想環境を画像として観測することを想定した例を説明したが、本実施形態の思想は、音声などの他の情報を観測する場合にも適用できる。例えば、音声を観測する場合、仮想カメラを仮想マイクとし、観測画像や復元画像に替えて、仮想マイクによって計測された音声や復元された音声を出力すればよい。 Further, in this embodiment, an example has been described assuming that the virtual environment is observed as an image, but the idea of this embodiment can also be applied to the case of observing other information such as audio. For example, when observing audio, the virtual camera may be used as a virtual microphone, and the audio measured by the virtual microphone or the restored audio may be output instead of the observed image or restored image.

Furthermore, in this embodiment, the display device 150 displays the plurality of latent states as points placed at two-dimensional or three-dimensional coordinates in the dimensionally compressed latent space, but it may use other forms of display associated with the two-dimensional or three-dimensional coordinates. For example, the display device 150 may display a table listing each coordinate as a numerical value. When displaying the latent states as two-dimensional or three-dimensional points, the display device 150 may also draw lines or the like between neighboring points. In this case, the display device 150 may change the appearance of a line (its color or thickness) to reflect the degree of similarity in the higher-dimensional space. Because higher-dimensional information is lost in the process of dimensional compression, latent states that appear close in the latent state mapping may be far apart in the higher dimensions. For example, when the latent states are compressed to two dimensions, two points may appear close in the remaining planar coordinates even though their coordinates in the depth direction, which were lost during compression, are far apart. The same problem exists when information of four or more dimensions is compressed to three dimensions. Therefore, by displaying the information lost in the process of dimensional compression as other information such as the appearance of the lines, the display device 150 can intuitively express a degree of similarity that takes the lost information into account.
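The line display described above can be sketched as follows. This is a minimal sketch under assumptions: `neighbor_edges` is a hypothetical helper, neighboring points are taken to be those within `radius` of each other in the compressed coordinates, and the mapping from higher-dimensional distance to line width (`max_width / (1 + distance)`) is one arbitrary choice among many.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def neighbor_edges(points_2d: Sequence[Sequence[float]],
                   latents: Sequence[Sequence[float]],
                   radius: float,
                   max_width: float = 3.0) -> List[Tuple[int, int, float]]:
    """Return (i, j, width) for each pair of points that look close on screen.

    The width shrinks as the distance between the original (uncompressed)
    latent states grows, so a thin line marks a pair that only appears close.
    """
    edges = []
    for i, j in combinations(range(len(points_2d)), 2):
        if math.dist(points_2d[i], points_2d[j]) <= radius:
            high_dim_distance = math.dist(latents[i], latents[j])
            width = max_width / (1.0 + high_dim_distance)
            edges.append((i, j, width))
    return edges
```

For example, two points that are 0.1 apart on the plane but 9 apart along a discarded axis would be connected by a much thinner line than two points that are close in every dimension.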

[Overview of embodiment]
As described above, the information processing system 5 of the above embodiment includes an information specifying unit 111 (an example of a specifying unit) that specifies a learned world model, environment data (for example, an environment file) regarding the environment in which an agent operates, and a learned policy model that defines the behavior of the agent. The information processing system 5 includes a policy execution unit 212 that sequentially executes the policy in the environment based on the specified environment data and policy model, and an observation image processing unit 213 that captures the environment with a virtual camera each time the policy is executed and generates a plurality of observation images. The information processing system 5 includes a latent state processing unit 112 that compresses each of the plurality of observation images based on the world model to derive each of a plurality of latent states, and a display device 150 (an example of a display unit) that displays the plurality of latent states.

In other words, the information processing system 5 can sample the results of execution under the specified environment and policy and visualize the latent state mapping. The latent states are obtained from the observation images based on the world model. Each latent state corresponds to an observation image and represents a characteristic part of the environment. Therefore, when the features represented by a plurality of latent states are similar, the mapped latent states lie close together, and when the features are dissimilar, the mapped latent states lie far apart. Accordingly, by checking the display of the latent state mapping, the user can more easily and intuitively understand the learning state of the world model, for example based on the difference between the distance between latent states and the distance the user would expect.

また、潜在状態処理部１１２は、世界モデルに基づいて、複数の観測画像のそれぞれを圧縮して複数の多次元の潜在状態のそれぞれを導出し、複数の多次元の潜在状態のそれぞれを次元圧縮して、複数の２次元又は３次元の潜在状態のそれぞれを導出してよい。表示デバイス１５０は、複数の潜在状態を次元圧縮された潜在空間における２次元又は３次元の座標に対応付けて表示してよい。 Further, the latent state processing unit 112 compresses each of the plurality of observed images based on the world model, derives each of the plurality of multidimensional latent states, and dimensionally compresses each of the plurality of multidimensional latent states. Each of a plurality of two-dimensional or three-dimensional latent states may be derived in this way. The display device 150 may display a plurality of latent states in association with two-dimensional or three-dimensional coordinates in the dimensionally compressed latent space.

これにより、情報処理システム５は、世界モデルに基づく潜在状態の導出当初には多次元の潜在状態であっても、表示可能に調整できる。 Thereby, the information processing system 5 can adjust the latent state so that it can be displayed even if the latent state is multidimensional at the beginning of deriving the latent state based on the world model.

また、情報処理システム５は、復元画像処理部１１３、を更に備えてよい。復元画像処理部１１３は、複数の潜在状態のうち一の潜在状態を指定し、世界モデルに基づいて、指定された一の潜在状態を復元して復元画像を生成してよい。表示デバイス１５０は、復元画像を表示してよい。 Furthermore, the information processing system 5 may further include a restored image processing section 113. The restored image processing unit 113 may designate one of the plurality of latent states, restore the specified one latent state based on the world model, and generate a restored image. Display device 150 may display the restored image.

これにより、ユーザは、復元画像の表示を確認することで、エージェントがどのように環境を理解しているのか、エージェントの世界モデルの学習度合い、等を確認できる。 Thereby, by checking the display of the restored image, the user can check how the agent understands the environment, the degree of learning of the agent's world model, etc.

また、情報指定部１１１は、環境に対する仮想カメラの位置及び向きを含むカメラ情報を指定してよい。観測画像処理部２１３は、カメラ情報に基づいて、環境を仮想カメラで撮像して、複数の観測画像を生成してよい。 Further, the information specifying unit 111 may specify camera information including the position and orientation of the virtual camera with respect to the environment. The observation image processing unit 213 may generate a plurality of observation images by capturing an image of the environment with a virtual camera based on the camera information.

これにより、情報処理システム５は、様々な位置から様々な向きで撮像された観測画像を取得でき、世界モデルによる学習状態を詳細に把握できる。 Thereby, the information processing system 5 can acquire observation images taken from various positions and in various directions, and can grasp the learning state by the world model in detail.

The information processing system 5 may further include an operating device 140 (an example of a first operation unit). The observation image processing unit 213 may change the camera information specified by the information specifying unit 111 based on an operation on the operating device 140, capture the environment with the virtual camera based on the changed camera information, and generate a changed observation image. The display device 150 may display the changed observation image.

これにより、情報処理システム５は、カメラ情報の変更に従って生成される観測画像を確認しながら、仮想カメラのカメラ情報を指定できる。よって、ユーザ所望の観測画像が得られるように、カメラ情報を調整できる。 Thereby, the information processing system 5 can specify the camera information of the virtual camera while checking the observation image generated according to the change of the camera information. Therefore, camera information can be adjusted so that the observation image desired by the user can be obtained.
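The camera information and its adjustment by user operation can be sketched as a small immutable record. This is a minimal sketch: the disclosure states only that the camera information includes the position and orientation of the virtual camera, so the representation of orientation as yaw and pitch angles, the class name `CameraInfo`, and the function `apply_operation` are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class CameraInfo:
    """Hypothetical camera information: position plus yaw/pitch orientation."""
    position: Tuple[float, float, float]
    yaw_deg: float
    pitch_deg: float


def apply_operation(camera: CameraInfo, dx: float = 0.0, dy: float = 0.0,
                    dz: float = 0.0, dyaw: float = 0.0,
                    dpitch: float = 0.0) -> CameraInfo:
    """Return changed camera information; the original stays untouched."""
    x, y, z = camera.position
    # Keep the pitch within straight down / straight up.
    pitch = max(-90.0, min(90.0, camera.pitch_deg + dpitch))
    return replace(camera,
                   position=(x + dx, y + dy, z + dz),
                   yaw_deg=(camera.yaw_deg + dyaw) % 360.0,
                   pitch_deg=pitch)
```

Each operation on the operating device would produce a new `CameraInfo`, after which the environment is captured again and the changed observation image is displayed for the user to check.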

The information processing system 5 may further include an operating device 140 (an example of a second operation unit). The display device 150 may rotate and display the latent space in which the latent states are drawn, based on an operation on the operating device 140.

As a result, compared with a case where the viewpoint (camera position) and line-of-sight direction (camera orientation) for checking the latent space are fixed, the information processing system 5 allows the latent space to be viewed from various viewpoints and in various directions, so that the details of the positional relationships among the plurality of latent states placed in the latent space can be checked easily.
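Rotating the displayed latent space can be sketched as applying a rotation to the three-dimensional coordinates of the latent states before drawing them. This is an illustration only: rotation about the vertical (z) axis is assumed here, and `rotate_z` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple


def rotate_z(points: Sequence[Sequence[float]],
             angle_deg: float) -> List[Tuple[float, float, float]]:
    """Rotate 3-D latent coordinates about the z axis by angle_deg degrees."""
    angle = math.radians(angle_deg)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]
```

A drag operation on the operating device could be mapped to `angle_deg`, and rotations about the other axes follow the same pattern with the roles of the coordinates exchanged.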

また、情報処理システム５は、ＧＵＩ部１００（操作表示装置の一例）と、シミュレータ部２００（シミュレータ装置の一例）と、を備えてよい。ＧＵＩ部１００とシミュレータ部２００とは通信可能に接続されてよい。ＧＵＩ部１００は、情報指定部１１１と、潜在状態処理部１１２と、表示デバイス１５０と、を備えてよい。シミュレータ部２００は、方策実行部２１２と、観測画像処理部２１３と、を備えてよい。 Further, the information processing system 5 may include a GUI section 100 (an example of an operation display device) and a simulator section 200 (an example of a simulator device). The GUI section 100 and the simulator section 200 may be communicably connected. The GUI section 100 may include an information specifying section 111, a latent state processing section 112, and a display device 150. The simulator unit 200 may include a policy execution unit 212 and an observed image processing unit 213.

これにより、情報処理システム５は、例えば処理負荷の高い方策の実行を、処理能力の高いシミュレータ部２００側で実施でき、装置毎の処理能力に応じて分散処理できる。 Thereby, the information processing system 5 can execute, for example, a strategy with a high processing load on the simulator unit 200 side having a high processing capacity, and can perform distributed processing according to the processing capacity of each device.

また、情報処理システム５は、単一の情報処理装置（例えばＧＵＩ部１００）により構成されてよい。これにより、情報処理システム５は、世界モデルの学習状態を確認するための処理を、単一の情報処理装置によって完結できる。 Further, the information processing system 5 may be configured by a single information processing device (for example, the GUI section 100). Thereby, the information processing system 5 can complete the process for checking the learning state of the world model using a single information processing device.

以上、図面を参照しながら各種の実施形態について説明したが、本開示はかかる例に限定されないことは言うまでもない。当業者であれば、特許請求の範囲に記載された範疇内において、各種の変更例又は修正例に想到し得ることは明らかであり、それらについても当然に本開示の技術的範囲に属するものと了解される。また、開示の趣旨を逸脱しない範囲において、上記実施形態における各構成要素を任意に組み合わせてもよい。 Although various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to such examples. It is clear that those skilled in the art can come up with various changes or modifications within the scope of the claims, and these naturally fall within the technical scope of the present disclosure. Understood. Further, each of the constituent elements in the above embodiments may be arbitrarily combined without departing from the spirit of the disclosure.

The processes such as operations, procedures, steps, and stages in the devices, systems, programs, and methods shown in the claims, specification, and drawings may be executed in any order, unless the order is explicitly indicated by expressions such as "before" or "prior to", or the output of an earlier process is used in a later process. Even if the operation flows in the claims, specification, and drawings are described using "first", "next", and the like for convenience, this does not mean that they must be carried out in that order.

本開示は、世界モデルの学習状態を直感的に理解し易くできる情報処理装置及び情報処理方法等に有用である。 The present disclosure is useful for information processing devices, information processing methods, and the like that make it easy to intuitively understand the learning state of a world model.

5 Information processing system
100 GUI unit
110 Processor
111 Information specifying unit
112 Latent state processing unit
113 Restored image processing unit
120 Memory
130 Communication device
140 Operating device
150 Display device
200 Simulator unit
210 Processor
211 Information setting unit
212 Policy execution unit
213 Observation image processing unit
220 Memory
230 Communication device
GL List screen
G1 Environment screen
G2 Model setting screen
G3 Camera setting screen
G4 Observation screen
G5 Latent state mapping screen
G6 Restoration screen

Claims

An information processing system comprising:
a specifying unit that specifies a learned world model, environment data regarding an environment in which an agent operates, and a learned policy model that defines the behavior of the agent;
a policy execution unit that sequentially executes a policy in the environment based on the environment data and the policy model;
an observation image processing unit that captures an image of the environment with a virtual camera each time the policy is executed and generates a plurality of observation images;
a latent state processing unit that compresses each of the plurality of observation images based on the world model to derive each of a plurality of latent states; and
a display unit that displays the plurality of latent states.

The latent state processing unit is
compressing each of the plurality of observed images based on the world model to derive each of the plurality of multidimensional latent states;
Dimensionally compressing each of the plurality of multidimensional latent states to derive each of the plurality of two-dimensional or three-dimensional latent states,
The display unit displays the plurality of latent states in association with two-dimensional or three-dimensional coordinates in the dimensionally compressed latent space.
The information processing system according to claim 1.

further comprising a restored image processing unit;
The restored image processing unit includes:
specifying one latent state among the plurality of latent states;
generating a restored image by restoring the specified one latent state based on the world model;
the display unit displays the restored image;
The information processing system according to claim 1 or 2.

The specifying unit specifies camera information including the position and orientation of the virtual camera with respect to the environment,
The observation image processing unit generates the plurality of observation images by capturing an image of the environment with the virtual camera based on the camera information.
The information processing system according to any one of claims 1 to 3.

further comprising a first operation unit,
wherein the observation image processing unit
changes the camera information specified by the specifying unit based on an operation on the first operation unit, and
captures an image of the environment with the virtual camera based on the changed camera information to generate a changed observation image, and
the display unit displays the changed observation image.
The information processing system according to claim 4.

further comprising a second operation unit,
wherein the display unit rotates and displays the latent space in which the latent states are drawn, based on an operation on the second operation unit.
The information processing system according to any one of claims 1 to 5.

The information processing system includes an operation display device and a simulator device,
The operation display device and the simulator device are communicably connected,
The operation display device includes the specifying unit, the observation image processing unit, the latent state processing unit, and the display unit,
The simulator device includes the policy execution unit.
The information processing system according to any one of claims 1 to 6.

The information processing system is configured by a single information processing device,
The information processing system according to any one of claims 1 to 6.

An information processing method comprising:
specifying a learned world model, environment data regarding an environment in which an agent operates, and a learned policy model that defines the behavior of the agent;
sequentially executing a policy in the environment based on the environment data and the policy model;
capturing an image of the environment with a virtual camera each time the policy is executed to generate a plurality of observation images;
compressing each of the plurality of observation images based on the world model to derive each of a plurality of latent states; and
displaying the plurality of latent states on a display unit.