JP2023137880A

JP2023137880A - Information processing device, information processing method and program

Info

Publication number: JP2023137880A
Application number: JP2022044303A
Authority: JP
Inventors: 裕太郎石田; Yutaro Ishida; 太郎高橋; Taro Takahashi
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2023-09-29

Abstract

To provide an information processing device, an information processing method, and a program that enhance efficiency of machine learning. SOLUTION: In a system including an information processing device, a robot, and a sensor, an information processing device 10 includes: an acquisition unit 11 that acquires an expert trajectory for a specific task; a generator 12 that performs reinforcement learning on the basis of at least a first reward based on behavior information being identified as the expert trajectory, and a second reward based on the specific task being executed on the basis of the behavior information, and generates the behavior information on the basis of the acquired information and a result of the reinforcement learning; and a discriminator 13 that discriminates whether input information is the generated behavior information or the expert trajectory. The generator 12 performs reinforcement learning at a first point of time by setting a ratio of the second reward to the first reward as a first ratio, and performs reinforcement learning at a second point of time later than the first point of time by setting the ratio of the second reward to the first reward as a second ratio higher than the first ratio. SELECTED DRAWING: Figure 3

Description

本開示は、情報処理装置、情報処理方法、及びプログラムに関する。 The present disclosure relates to an information processing device, an information processing method, and a program.

In recent years, GAIL (Generative Adversarial Imitation Learning), a method that combines an imitation learning algorithm based on inverse reinforcement learning with a generative adversarial network (GAN), has been attracting attention (see Non-Patent Document 1).

Inverse reinforcement learning estimates the reward function of the environment from an expert's action trajectory (expert data, Expert Trajectory), so imitation learning can be performed even when no reward is obtained from the environment. Imitation learning based on inverse reinforcement learning must solve two problems: finding a reward function from the expert's action trajectory, and finding the expert's policy (Expert Policy) from the obtained reward function by reinforcement learning. In GAIL, by contrast, the GAN mechanism makes it possible to obtain the expert's policy from the expert's action trajectory.

Non-Patent Document 2 discloses a technique that uses GAIL for machine learning of robot arm control. In Non-Patent Document 2, an image captured by a camera and information indicating the positions and angular velocities of the joints of the robot arm are acquired. A generator is then trained using a hybrid reward γ that combines the imitation learning reward γ_gail and the reinforcement learning reward γ_task as in equation (1) below. The imitation learning reward γ_gail is a reward based on having fooled the discriminator. The reinforcement learning reward γ_task is a reward based on completion of the task (work) by the robot.

Here, λ is a preset constant with a value from 0 to 1. s_t is the input data to the generator, and a_t is the output data from the generator. When λ is 0, only reinforcement learning (RL, Reinforcement Learning) is performed; when λ is 1, the method reduces to ordinary GAIL (for example, as described in Non-Patent Document 1).
$$\gamma(s_t, a_t) = \lambda\,\gamma_{\mathrm{gail}}(s_t, a_t) + (1 - \lambda)\,\gamma_{\mathrm{task}}(s_t, a_t) \quad \cdots (1)$$
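As a quick numerical check of equation (1), the following uses illustrative values that are not taken from the source (λ = 0.7, γ_gail = 0.5, γ_task = 1.0):

```latex
\gamma(s_t, a_t) = 0.7 \times 0.5 + (1 - 0.7) \times 1.0 = 0.35 + 0.30 = 0.65
```

With λ fixed, the imitation term keeps the same 70% weight throughout training, whatever the training progress.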

Non-Patent Document 1: Jonathan Ho and Stefano Ermon, "Generative adversarial imitation learning," NIPS, 2016.
Non-Patent Document 2: Yuke Zhu, et al., "Reinforcement and Imitation Learning for Diverse Visuomotor Skills," RSS, 2018.

しかしながら、従来技術では、例えば、機械学習を効率的に実行できない場合がある。 However, with the conventional technology, for example, machine learning may not be efficiently executed.

本開示の目的は、機械学習の効率を向上させることができる情報処理装置、情報処理方法、及びプログラムを提供することである。 An object of the present disclosure is to provide an information processing device, an information processing method, and a program that can improve the efficiency of machine learning.

In a first aspect of the present disclosure, there is provided an information processing device including: an acquisition unit that acquires information indicating an expert's action trajectory for a specific task; a generator that performs reinforcement learning based on at least a first reward, which is based on behavior information being identified as information indicating the expert's action trajectory, and a second reward, which is based on the specific task being executed on the basis of that behavior information, and that generates behavior information based on the information acquired by the acquisition unit and the result of the reinforcement learning; and a discriminator that identifies whether input information is behavior information generated by the generator or information indicating the expert's action trajectory. The generator performs the reinforcement learning at a first point in time with the ratio of the second reward to the first reward set to a first ratio, and performs the reinforcement learning at a second point in time, later than the first point in time, with the ratio of the second reward to the first reward set to a second ratio higher than the first ratio.

In a second aspect of the present disclosure, there is provided an information processing method that executes: a process of acquiring information indicating an expert's action trajectory for a specific task; a process of performing reinforcement learning based on at least a first reward, which is based on behavior information being identified as information indicating the expert's action trajectory, and a second reward, which is based on the specific task being executed on the basis of that behavior information, and generating behavior information based on the information acquired by the acquiring process and the result of the reinforcement learning; and a process of identifying whether input information is information generated by the generating process or information indicating the expert's action trajectory. In the generating process, the reinforcement learning is performed at a first point in time with the ratio of the second reward to the first reward set to a first ratio, and at a second point in time, later than the first point in time, with the ratio of the second reward to the first reward set to a second ratio higher than the first ratio.

In a third aspect of the present disclosure, there is provided a program that causes a computer to execute: a process of acquiring information indicating an expert's action trajectory for a specific task; a process of performing reinforcement learning based on at least a first reward, which is based on behavior information being identified as information indicating the expert's action trajectory, and a second reward, which is based on the specific task being executed on the basis of that behavior information, and generating behavior information based on the information acquired by the acquiring process and the result of the reinforcement learning; and a process of identifying whether input information is information generated by the generating process or information indicating the expert's action trajectory. In the generating process, the reinforcement learning is performed at a first point in time with the ratio of the second reward to the first reward set to a first ratio, and at a second point in time, later than the first point in time, with the ratio of the second reward to the first reward set to a second ratio higher than the first ratio.

一側面によれば、機械学習の効率を向上させることができる。 According to one aspect, the efficiency of machine learning can be improved.

FIG. 1 is a diagram illustrating an example of the configuration of an information processing system according to an embodiment.
FIG. 2 is a diagram illustrating an example of the hardware configuration of an information processing device according to the embodiment.
FIG. 3 is a diagram illustrating an example of the configuration of the information processing device according to the embodiment.
FIG. 4 is a flowchart illustrating an example of learning processing of the information processing device according to the embodiment.
FIG. 5 is a flowchart illustrating an example of inference processing of the information processing device according to the embodiment.

The principles of the present disclosure are described with reference to several exemplary embodiments. It should be understood that these embodiments are described for illustrative purposes only and help those skilled in the art understand and practice the present disclosure without suggesting any limitation on its scope. The disclosure described herein may be implemented in various ways other than those described below.
In the following description and claims, unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
Embodiments of the present disclosure will be described below with reference to the drawings.

<System configuration>
The configuration of an information processing system 1 according to an embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the configuration of the information processing system 1 according to the embodiment. In the example of FIG. 1, the information processing system 1 includes an information processing device 10, a robot 20, and a sensor 30. The numbers of information processing devices 10, robots 20, and sensors 30 are not limited to those in the example of FIG. 1. The information processing device 10 and the sensor 30 may be housed inside the housing of the robot 20. The information processing device 10, the robot 20, and the sensor 30 are connected so that they can communicate wirelessly or by wire.

情報処理装置１０は、機械学習を用いてロボット２０を制御する装置である。情報処理装置１０は、例えば、人間等が道具を用いてタスク（作業）を実行する際の動作をセンサ３０によりエキスパートの行動軌跡を示す情報として取得し、取得した情報に基づいて学習を行う。そして、情報処理装置１０は、ロボット２０に当該道具を人間等と同様に用いらせて当該タスクを実行させる。 The information processing device 10 is a device that controls the robot 20 using machine learning. For example, the information processing device 10 acquires, through the sensor 30, the movement of a human or the like when performing a task (work) using a tool as information indicating an expert's action trajectory, and performs learning based on the acquired information. Then, the information processing device 10 causes the robot 20 to use the tool in the same way as a human or the like to execute the task.

ロボット２０は、アーム等により各種の道具を用いたタスクを行うロボットである。ロボット２０は、道具を用いたタスクを実行できる装置であればよく、外観の形状は限定されない。ロボット２０は、例えば、家庭用、探索用、工場用等の各種の目的で用いることができる。センサ３０は、ロボット２０の周辺を測定するセンサである。センサ３０は、例えば、カメラ、またはＬｉＤＡＲでもよい。 The robot 20 is a robot that performs tasks using various tools using an arm or the like. The robot 20 may be any device that can perform tasks using tools, and its external shape is not limited. The robot 20 can be used for various purposes such as home use, exploration use, and factory use. The sensor 30 is a sensor that measures the surroundings of the robot 20. Sensor 30 may be, for example, a camera or LiDAR.

<Hardware configuration>
FIG. 2 is a diagram illustrating an example of the hardware configuration of the information processing device 10 according to the embodiment. In the example of FIG. 2, the information processing device 10 (computer 100) includes a processor 101, a memory 102, and a communication interface 103. These components may be connected by a bus or the like. The memory 102 stores at least a portion of a program 104. The communication interface 103 includes the interfaces necessary for communication with other network elements.

プログラム１０４が、プロセッサ１０１及びメモリ１０２等の協働により実行されると、コンピュータ１００により本開示の実施形態の少なくとも一部の処理が行われる。メモリ１０２は、ローカル技術ネットワークに適した任意のタイプのものであってもよい。メモリ１０２は、非限定的な例として、非一時的なコンピュータ可読記憶媒体でもよい。また、メモリ１０２は、半導体ベースのメモリデバイス、磁気メモリデバイスおよびシステム、光学メモリデバイスおよびシステム、固定メモリおよびリムーバブルメモリなどの任意の適切なデータストレージ技術を使用して実装されてもよい。コンピュータ１００には１つのメモリ１０２のみが示されているが、コンピュータ１００にはいくつかの物理的に異なるメモリモジュールが存在してもよい。プロセッサ１０１は、任意のタイプのものであってよい。プロセッサ１０１は、汎用コンピュータ、専用コンピュータ、マイクロプロセッサ、デジタル信号プロセッサ（ＤＳＰ：Digital Signal Processor）、および非限定的な例としてマルチコアプロセッサアーキテクチャに基づくプロセッサの１つ以上を含んでよい。コンピュータ１００は、メインプロセッサを同期させるクロックに時間的に従属する特定用途向け集積回路チップなどの複数のプロセッサを有してもよい。 When the program 104 is executed by the cooperation of the processor 101, the memory 102, etc., the computer 100 performs at least part of the processing of the embodiment of the present disclosure. Memory 102 may be of any type suitable for the local technology network. Memory 102 may be, by way of non-limiting example, a non-transitory computer-readable storage medium. Memory 102 may also be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. Although only one memory 102 is shown in computer 100, there may be several physically different memory modules present in computer 100. Processor 101 may be of any type. Processor 101 may include one or more of a general purpose computer, a special purpose computer, a microprocessor, a digital signal processor (DSP), and a processor based on a multi-core processor architecture, by way of non-limiting example. Computer 100 may have multiple processors, such as application specific integrated circuit chips, that are time dependent on a clock that synchronizes the main processors.

本開示の実施形態は、ハードウェアまたは専用回路、ソフトウェア、ロジックまたはそれらの任意の組み合わせで実装され得る。いくつかの態様はハードウェアで実装されてもよく、一方、他の態様はコントローラ、マイクロプロセッサまたは他のコンピューティングデバイスによって実行され得るファームウェアまたはソフトウェアで実装されてもよい。 Embodiments of the present disclosure may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor, or other computing device.

本開示はまた、非一時的なコンピュータ可読記憶媒体に有形に記憶された少なくとも１つのコンピュータプログラム製品を提供する。コンピュータプログラム製品は、プログラムモジュールに含まれる命令などのコンピュータ実行可能命令を含み、対象の実プロセッサまたは仮想プロセッサ上のデバイスで実行され、本開示のプロセスまたは方法を実行する。プログラムモジュールには、特定のタスクを実行したり、特定の抽象データ型を実装したりするルーチン、プログラム、ライブラリ、オブジェクト、クラス、コンポーネント、データ構造などが含まれる。プログラムモジュールの機能は、様々な実施形態で望まれるようにプログラムモジュール間で結合または分割されてもよい。プログラムモジュールのマシン実行可能命令は、ローカルまたは分散デバイス内で実行できる。分散デバイスでは、プログラムモジュールはローカルとリモートの両方のストレージメディアに配置できる。 The present disclosure also provides at least one computer program product tangibly stored on a non-transitory computer readable storage medium. A computer program product includes computer-executable instructions, such as instructions contained in program modules, that are executed on a device on a target real or virtual processor to perform the processes or methods of the present disclosure. Program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or divided among program modules as desired in various embodiments. Machine-executable instructions of program modules can be executed locally or within distributed devices. In distributed devices, program modules can be located in both local and remote storage media.

Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code is provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device. When the program code is executed by the processor or controller, the functions/operations in the flowcharts and/or block diagrams are carried out. The program code may run entirely on a machine, partly on a machine, as a standalone software package, partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

プログラムは、様々なタイプの非一時的なコンピュータ可読媒体を用いて格納され、コンピュータに供給することができる。非一時的なコンピュータ可読媒体は、様々なタイプの実体のある記録媒体を含む。非一時的なコンピュータ可読媒体の例には、磁気記録媒体、光磁気記録媒体、光ディスク媒体、半導体メモリ等が含まれる。磁気記録媒体には、例えば、フレキシブルディスク、磁気テープ、ハードディスクドライブ等が含まれる。光磁気記録媒体には、例えば、光磁気ディスク等が含まれる。光ディスク媒体には、例えば、ブルーレイディスク、ＣＤ（Compact Disc）－ＲＯＭ（Read Only Memory）、ＣＤ－Ｒ（Recordable）、ＣＤ－ＲＷ（ReWritable）等が含まれる。半導体メモリには、例えば、ソリッドステートドライブ、マスクＲＯＭ、ＰＲＯＭ（Programmable ROM）、ＥＰＲＯＭ（Erasable PROM）、フラッシュＲＯＭ、ＲＡＭ（random access memory）等が含まれる。また、プログラムは、様々なタイプの一時的なコンピュータ可読媒体によってコンピュータに供給されてもよい。一時的なコンピュータ可読媒体の例は、電気信号、光信号、及び電磁波を含む。一時的なコンピュータ可読媒体は、電線及び光ファイバ等の有線通信路、又は無線通信路を介して、プログラムをコンピュータに供給できる。 The program can be stored and provided to a computer using various types of non-transitory computer-readable media. Non-transitory computer-readable media includes various types of tangible storage media. Examples of non-transitory computer-readable media include magnetic recording media, magneto-optical recording media, optical disk media, semiconductor memory, and the like. Magnetic recording media include, for example, flexible disks, magnetic tapes, hard disk drives, and the like. The magneto-optical recording medium includes, for example, a magneto-optical disk. Optical disc media include, for example, Blu-ray discs, CDs (Compact Discs)-ROMs (Read Only Memory), CD-Rs (Recordables), CD-RWs (ReWritables), and the like. Semiconductor memories include, for example, solid state drives, mask ROMs, PROMs (Programmable ROMs), EPROMs (Erasable PROMs), flash ROMs, RAMs (Random Access Memory), and the like. The program may also be provided to the computer on various types of temporary computer-readable media. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves. The temporary computer-readable medium can provide the program to the computer via wired communication channels, such as electrical wires and fiber optics, or wireless communication channels.

<Configuration>
Next, the configuration of the information processing device 10 according to the embodiment will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating an example of the configuration of the information processing device 10 according to the embodiment. In the example of FIG. 3, the information processing device 10 includes an acquisition unit 11, a generator 12, a discriminator 13, and a control unit 14. Each of these units may be realized by cooperation between one or more programs installed in the information processing device 10 and hardware such as the processor 101 and the memory 102 of the information processing device 10.

取得部１１は、特定のタスクに対するエキスパートの行動軌跡を示す情報と、ロボット２０に関する環境を示す情報とを、情報処理装置１０内部の記憶装置または外部装置から取得する。 The acquisition unit 11 acquires information indicating an expert's action trajectory for a specific task and information indicating an environment related to the robot 20 from a storage device inside the information processing device 10 or an external device.

The generator 12 generates information indicating behavior based on the information acquired by the acquisition unit 11 and the result of reinforcement learning. The generator 12 determines the reward γ_gail (an example of the "first reward") based on the result of discrimination by the discriminator 13 when either the information indicating the expert's action trajectory acquired by the acquisition unit 11 or the information indicating behavior generated based on the result of reinforcement learning is output to the discriminator 13. The generator 12 also determines the reward γ_task (an example of the "second reward") based on execution (completion, success) of the specific task by the robot 20.

The generator 12 also performs reinforcement learning at a first point in time with the ratio of the reward γ_task to the reward γ_gail set to a first ratio, and performs reinforcement learning at a second point in time, later than the first point in time, with the ratio of the reward γ_task to the reward γ_gail set to a second ratio higher than the first ratio.

識別器１３は、生成器１２から入力された情報が生成器１２により生成された情報であるか取得部１１により取得されたエキスパートの行動軌跡を示す情報であるかを識別する。制御部１４は、生成器１２により生成された行動を示す情報に基づいてロボット２０を制御する。 The discriminator 13 identifies whether the information input from the generator 12 is information generated by the generator 12 or information indicating the expert's action trajectory acquired by the acquisition unit 11. The control unit 14 controls the robot 20 based on information indicating the behavior generated by the generator 12.

<Processing>
<<Learning phase>>
Next, an example of the learning process of the information processing device 10 according to the embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart illustrating an example of the learning process of the information processing device 10 according to the embodiment.

In step S101, the acquisition unit 11 of the information processing device 10 acquires information indicating an expert's action trajectory for a specific task. The specific task may be, for example, driving a nail with a hammer or scooping water with a cup. Here, the acquisition unit 11 may acquire, as the information indicating the expert's action trajectory, information indicating, for example, the position and posture at each point in time of a human arm and a tool (for example, a hammer or a cup) while the human uses the tool. The information indicating the expert's action trajectory may be generated, for example, by analyzing images captured by the sensor 30, which is a camera, with a convolutional neural network (CNN). The information indicating the expert's action trajectory may also be generated, for example, based on data measured by a sensor 30 attached to at least one of the human arm and the tool.

続いて、情報処理装置１０の取得部１１は、環境を示す情報（環境情報、ロボット２０の行動軌跡を示す情報）を取得する（ステップＳ１０２）。環境情報は、例えば、カメラであるセンサ３０で撮影された画像をＣＮＮで分析することにより生成されてもよい。また、環境情報は、例えば、ロボット２０のアーム及び道具の少なくとも一方に設けられた（装着された）センサ３０で測定されたデータに基づいて生成されてもよい。環境情報には、例えば、道具の位置及び姿勢を示す情報が含まれてもよい。また、環境情報には、例えば、ロボット２０のアームの関節の位置及び角速度を示す情報が含まれてもよい。 Subsequently, the acquisition unit 11 of the information processing device 10 acquires information indicating the environment (environment information, information indicating the action trajectory of the robot 20) (step S102). The environmental information may be generated, for example, by analyzing an image taken by the sensor 30, which is a camera, using CNN. Further, the environmental information may be generated based on data measured by a sensor 30 provided (attached) to at least one of an arm and a tool of the robot 20, for example. The environmental information may include, for example, information indicating the position and orientation of the tool. Further, the environmental information may include, for example, information indicating the positions and angular velocities of the joints of the arms of the robot 20.

Subsequently, the generator 12 of the information processing device 10 determines (updates) the hybrid reward γ_s (step S103). Here, the generator 12 may calculate the hybrid reward γ_s by, for example, determining the value of the weighting coefficient α in equation (2) below.
$$\gamma_s(s_t, a_t) = \alpha\,\gamma_{\mathrm{gail}}(s_t, a_t) + (1 - \alpha)\,\gamma_{\mathrm{task}}(s_t, a_t) \quad \cdots (2)$$

Here, α may be a variable ranging from 0 to 1. s_t is the environment information, and a_t is information indicating behavior (behavior information). When α is 0, only ordinary reinforcement learning is performed; when α is 1, only ordinary GAIL (imitation learning) is performed.
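Equation (2) can be sketched in a few lines of Python. The function name `hybrid_reward` and the range check are illustrative assumptions and do not appear in the source; only the weighted sum itself comes from equation (2).

```python
def hybrid_reward(alpha: float, r_gail: float, r_task: float) -> float:
    """Hybrid reward of equation (2): alpha * r_gail + (1 - alpha) * r_task.

    `alpha` is the weighting coefficient; the source only says it may range
    from 0 to 1, so values outside that range are rejected here (assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * r_gail + (1.0 - alpha) * r_task
```

Setting `alpha=1.0` returns only the imitation term and `alpha=0.0` returns only the task term, matching the two limiting cases described in the text.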

The reward γ_gail is the reward for imitation learning by ordinary GAIL. The reward γ_task (an example of the "second reward") is a reinforcement learning reward based on execution (completion, success) of the specific task by the robot 20.

The imitation learning reward γ_gail is the reward used in ordinary GAIL. That is, γ_gail is a reward based on the behavior information generated by the generator 12 being identified (judged, determined, discriminated) as information indicating the expert's action trajectory acquired by the acquisition unit 11; in other words, it reflects having fooled the discriminator 13, or how closely the generated behavior information resembles an expert's action trajectory.

For example, the generator 12 may input to the discriminator 13 data combining s_t, the input data to the generator 12, and a_t, the output data from the generator 12. In step S107, described later, the discriminator 13 may then calculate (estimate, infer) a value (accuracy, confidence) of how closely the data resembles the expert's action trajectory. The generator 12 may then set the reward γ_gail higher the higher the value p of resemblance to the expert's action trajectory that the discriminator 13 calculates for the data.
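The source only requires that γ_gail increase with the discriminator's value p. One mapping with that property is −log(1 − p), a form commonly used in GAIL implementations; the source does not specify this formula, so it is an assumption here, as are the names `gail_reward` and `eps`.

```python
import math


def gail_reward(p_expert: float, eps: float = 1e-8) -> float:
    """Map the discriminator's expert-likeness p (in [0, 1]) to a reward.

    Uses -log(1 - p), an assumed choice that grows monotonically with p.
    `eps` keeps the logarithm finite when p is exactly 1.
    """
    p = min(max(p_expert, 0.0), 1.0)  # clamp to a valid probability
    return -math.log(1.0 - p + eps)
```

Any other monotonically increasing function of p would satisfy the description equally well; this one simply rewards near-certain "expert" verdicts much more strongly than borderline ones.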

The reinforcement learning reward γ_task is a reward based on execution of the specific task by the robot 20. The control unit 14, for example, controls the robot based on the behavior information a_t generated by the generator 12. Then, for example, when the position, posture, and the like of the tool used by the robot arm controlled based on the behavior information a_t change (transition) from the position, posture, and the like at the task start point in the expert's action trajectory to those at the task completion point, the generator 12 may set the value of the reinforcement learning reward γ_task to a specific nonzero value.
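The task reward described above can be sketched as a sparse success check. Everything specific in this sketch is an assumption for illustration: the flat pose vector, the Euclidean tolerance `tol`, the reward value `success_value`, and the function names are not given in the source.

```python
import math
from typing import Sequence

Pose = Sequence[float]  # hypothetical flat pose vector, e.g. (x, y, z)


def near(a: Pose, b: Pose, tol: float) -> bool:
    """True when two pose vectors lie within `tol` in Euclidean distance."""
    return math.dist(a, b) <= tol


def task_reward(tool_poses: Sequence[Pose], expert_start: Pose,
                expert_goal: Pose, tol: float = 0.05,
                success_value: float = 1.0) -> float:
    """Return a nonzero reward only if the tool moved from the expert's
    task-start pose to the expert's task-completion pose (within `tol`)."""
    if not tool_poses:
        return 0.0
    started = near(tool_poses[0], expert_start, tol)
    finished = near(tool_poses[-1], expert_goal, tol)
    return success_value if started and finished else 0.0
```

A real system would likely compare position and orientation separately, since a single Euclidean distance over mixed units is crude; the sketch only shows the start-to-goal transition test.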

The generator 12 may set α to a smaller value as learning progresses. In this way, for example, the influence of imitation learning by GAIL can be made relatively large in the early stage of learning, and the influence of reinforcement learning from having executed the task can be made relatively large in the later stage. It is therefore considered that the efficiency of machine learning can be improved through a learning process similar to that of humans, in which one first learns by imitating what one sees and, once a certain amount has been learned, fine-tunes through one's own trial and error. In this case, the generator 12 may set the ratio of the reward γ_task to the reward γ_gail to a first ratio at a first point in time, and to a second ratio higher than the first ratio at a second point in time later than the first point in time.
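A decreasing α can be sketched as a schedule over training steps. The linear shape, the endpoint values 0.9 and 0.1, and the function names are assumptions; the source states only that α may become smaller as learning progresses, which makes the weight ratio (1 − α)/α of γ_task to γ_gail grow over time.

```python
import math


def alpha_schedule(step: int, total_steps: int,
                   alpha_start: float = 0.9, alpha_end: float = 0.1) -> float:
    """Linearly decay alpha from alpha_start to alpha_end (assumed shape)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    return alpha_start + (alpha_end - alpha_start) * frac


def task_to_gail_ratio(alpha: float) -> float:
    """Weight of the task reward relative to the GAIL reward in equation (2)."""
    return math.inf if alpha == 0.0 else (1.0 - alpha) / alpha
```

Evaluating `task_to_gail_ratio(alpha_schedule(step, total_steps))` at an early and a late step gives a first ratio and a higher second ratio, in the sense used in the text.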

In this case, the generator 12 may determine the first ratio and the second ratio according to, for example, at least one of the number of times the reinforcement learning of step S104 has been performed and the performance of the reinforcement learning result. Here, the generator 12 may determine (identify, calculate) the performance value of the reinforcement learning result based on, for example, at least one of the time required for the robot 20 to execute the specific task, the power consumed in executing the specific task, and the number of times information indicating the behavior was generated in executing the specific task. In this case, the generator 12 may, for example, set the performance value higher the shorter the time from when the robot 20 starts the specific task until it completes the task. The generator 12 may also set the performance value higher the less power the robot 20 consumes to execute the specific task, and higher the fewer pieces of behavior information are generated before the robot 20 executes the specific task. The power consumption of the robot 20 may be measured by the sensor 30 of the robot 20.

In the example of formula (2) above, the hybrid reward γ_s is determined based on the GAIL imitation learning reward γ_gail and the reinforcement learning reward γ_task, but the technology of the present disclosure is not limited to this. The generator 12 may determine the hybrid reward γ_s based on, for example, rewards of other learning methods in addition to the rewards γ_gail and γ_task.
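
One possible way to turn the measurements above into a performance value, and that value into α, is sketched below. The disclosure only states that the performance value is higher for shorter time, lower power, and fewer generated actions; the weighted-cost formula, the weights, the use of the larger of two progress measures, and all names here are assumptions made for illustration.

```python
def performance_value(duration_s, energy_j, num_actions,
                      w_time=1.0, w_energy=1.0, w_actions=1.0):
    """Map task measurements to a score in (0, 1].

    The score rises as time, energy, or action count falls. The weighted
    sum and the 1 / (1 + cost) mapping are hypothetical choices.
    """
    cost = w_time * duration_s + w_energy * energy_j + w_actions * num_actions
    return 1.0 / (1.0 + cost)


def alpha_from_progress(num_updates, performance, max_updates=10000,
                        alpha_start=0.9, alpha_end=0.1):
    """Choose alpha from the update count and the performance value.

    Progress is taken as the larger of the two normalized measures, so
    either more updates or better performance shifts weight to the task reward.
    """
    count_progress = min(max(num_updates / max_updates, 0.0), 1.0)
    perf_progress = min(max(performance, 0.0), 1.0)
    progress = max(count_progress, perf_progress)
    return alpha_start + (alpha_end - alpha_start) * progress
```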

Subsequently, the generator 12 of the information processing device 10 performs reinforcement learning using the hybrid reward γ_s of formula (2) above (step S104). The generator 12 then generates, based on the result of the reinforcement learning, behavior information a_t for the environmental information s_t given as input data (step S105). Here, the generator 12 may generate and output, as the behavior information a_t, data for controlling the robot, for example. In this case, the behavior information may include, for example, information indicating the angular velocity of each joint of the robot's arm.

Subsequently, the discriminator 13 of the information processing device 10 calculates (estimates, infers) the value (accuracy, confidence) of p, the likelihood that the combination of the environmental information s_t and the behavior information a_t is the expert's behavior trajectory, and outputs the value to the generator 12 (step S106).

Subsequently, the discriminator 13 of the information processing device 10 is trained by supervised learning, using a neural network (NN) for example, so that it calculates p as 0 when the behavior information a_t was generated by the generator 12 and as 1 when the behavior information a_t is the expert's behavior trajectory (step S107). As in GAIL, this allows the discriminator 13 and the generator 12 to be trained adversarially in the manner of a GAN. Here, the discriminator 13 may be trained by supervised learning based on, for example, data combining the environmental information s_t, the behavior information a_t, and a ground-truth label indicating whether the behavior information a_t was generated by the generator 12.
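
The supervised training of the discriminator can be sketched as binary classification with label 1 for expert pairs and label 0 for generated pairs. To keep the example self-contained, a logistic regression model stands in for the neural network mentioned above; this substitution, the learning rate, the epoch count, and the function names are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    """Return p, the estimated probability that x = (s_t, a_t) is expert data."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_discriminator(samples, labels, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on binary cross-entropy.

    samples: list of feature vectors, each the concatenation of s_t and a_t.
    labels:  1 for the expert's behavior trajectory, 0 for generated behavior.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            grad = predict(w, b, x) - y  # d(BCE)/d(logit)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b
```

The output `predict(w, b, x)` plays the role of p in step S106; in a full GAIL-style loop the discriminator and the generator would be updated alternately.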

Subsequently, the control unit 14 of the information processing device 10 operates the robot 20 based on the behavior information generated by the generator 12 (step S108). Here, the control unit 14 may transmit a control command corresponding to the behavior information to the robot 20. Subsequently, the generator 12 of the information processing device 10 determines whether to end learning (step S109). Here, the generator 12 may determine that learning is to be ended when, for example, at least one of the number of times the robot 20 has executed the specific task, the number of times the reinforcement learning of step S104 has been performed, and the performance of the reinforcement learning result is greater than or equal to a threshold.

If it is determined that learning is not to be ended (NO in step S109), the process proceeds to step S102. If it is determined that learning is to be ended (YES in step S109), the learning process ends.
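
The end-of-learning decision in step S109 can be sketched as a simple threshold check. The threshold values and the function name `should_stop` are hypothetical; the disclosure only states that learning may end when at least one of the three quantities reaches a threshold.

```python
def should_stop(task_count, update_count, performance,
                max_tasks=1000, max_updates=10000, target_performance=0.95):
    """Return True (YES in step S109) when at least one quantity has
    reached its threshold; thresholds are illustrative assumptions."""
    return (task_count >= max_tasks
            or update_count >= max_updates
            or performance >= target_performance)


def count_iterations(max_updates):
    """Toy loop: repeat the learning steps until should_stop() says YES."""
    updates = 0
    while not should_stop(task_count=0, update_count=updates,
                          performance=0.0, max_updates=max_updates):
        updates += 1  # stands in for one pass through steps S102 to S108
    return updates
```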

＜＜Inference phase＞＞
Next, an example of the inference processing of the information processing device 10 according to the embodiment will be described with reference to FIG. 5. FIG. 5 is a flowchart illustrating an example of the inference processing of the information processing device 10 according to the embodiment.

In step S201, the control unit 14 of the information processing device 10 determines (recognizes) the content of the task to be executed by the robot 20. Here, the control unit 14 may determine the content of the task based on, for example, user input such as voice or a button operation. The control unit 14 may determine, for example, that the task is driving a nail with a hammer or scooping water with a cup.

Subsequently, the acquisition unit 11 of the information processing device 10 acquires environmental information (step S202). The processing of step S202 may be the same as, for example, the processing of step S102 in FIG. 4.

Subsequently, the generator 12 of the information processing device 10 generates, based on the result of the reinforcement learning, behavior information a_t for the environmental information s_t given as input data (step S203). The processing of step S203 may be the same as, for example, the processing of step S105 in FIG. 4.

Subsequently, the control unit 14 of the information processing device 10 operates the robot 20 based on the behavior information generated by the generator 12 (step S204). The processing of step S204 may be the same as, for example, the processing of step S108 in FIG. 4.
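
Steps S202 to S204 can be sketched as a small control loop. The flowchart description does not state that these steps repeat, so the loop, the `task_done` check, the `max_steps` limit, and the callable names are all assumptions; the policy is assumed to have been selected for the task determined in step S201.

```python
def run_inference(get_observation, policy, send_command, task_done,
                  max_steps=100):
    """Repeat observe -> generate action -> command robot until done.

    get_observation: returns environmental information s_t      (step S202)
    policy:          maps s_t to behavior information a_t       (step S203)
    send_command:    sends a_t to the robot, e.g. joint angular
                     velocities                                 (step S204)
    task_done:       hypothetical completion check
    Returns the number of steps executed.
    """
    steps = 0
    for steps in range(1, max_steps + 1):
        s_t = get_observation()
        a_t = policy(s_t)
        send_command(a_t)
        if task_done():
            break
    return steps
```

Passing the sensor, policy, and robot interfaces as callables keeps the sketch independent of any particular robot API.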

<Modified example>
The information processing device 10 may be a device contained in a single housing, but the information processing device 10 of the present disclosure is not limited to this. Each unit of the information processing device 10 may be realized by, for example, cloud computing configured with one or more computers. Such information processing devices are also included as examples of the "information processing device" of the present disclosure.

Note that the present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit.

1 Information processing system
10 Information processing device
11 Acquisition unit
12 Generator
13 Discriminator
14 Control unit
20 Robot
30 Sensor

Claims

An information processing device comprising:
an acquisition unit that acquires information indicating an expert's behavior trajectory for a specific task;
a generator that performs reinforcement learning based on at least a first reward, which is based on behavior information being identified as information indicating the expert's behavior trajectory, and a second reward, which is based on the specific task being executed based on the behavior information, and that generates behavior information based on the information acquired by the acquisition unit and a result of the reinforcement learning; and
a discriminator that identifies whether input information is behavior information generated by the generator or information indicating the expert's behavior trajectory,
wherein the generator performs the reinforcement learning at a first time point with a ratio of the second reward to the first reward set to a first ratio, and performs the reinforcement learning at a second time point later than the first time point with the ratio of the second reward to the first reward set to a second ratio higher than the first ratio.

The information processing device according to claim 1, wherein the generator determines the first ratio and the second ratio according to the number of times the reinforcement learning has been performed.

The information processing device according to claim 1 or 2, wherein the generator determines the first ratio and the second ratio according to the performance of the result of the reinforcement learning.

The information processing device according to claim 3, wherein the generator determines the value of the performance based on at least one of the time required to execute the specific task, the power consumed in executing the specific task, and the number of times the behavior information was generated in executing the specific task.

The information processing device according to any one of claims 1 to 4, wherein the expert's behavior trajectory for the specific task is the behavior trajectory of a human in a task using a specific tool, and the behavior information includes information indicating the angular velocity of a joint of an arm of a robot.

An information processing method comprising executing:
a process of acquiring information indicating an expert's behavior trajectory for a specific task;
a process of performing reinforcement learning based on at least a first reward, which is based on behavior information being identified as information indicating the expert's behavior trajectory, and a second reward, which is based on the specific task being executed based on the behavior information, and generating behavior information based on the information acquired by the acquiring process and a result of the reinforcement learning; and
a process of identifying whether input information is information generated by the generating process or information indicating the expert's behavior trajectory,
wherein, in the generating process, the reinforcement learning is performed at a first time point with a ratio of the second reward to the first reward set to a first ratio, and the reinforcement learning is performed at a second time point later than the first time point with the ratio of the second reward to the first reward set to a second ratio higher than the first ratio.

A program causing a computer to execute:
a process of acquiring information indicating an expert's behavior trajectory for a specific task;
a process of performing reinforcement learning based on at least a first reward, which is based on behavior information being identified as information indicating the expert's behavior trajectory, and a second reward, which is based on the specific task being executed based on the behavior information, and generating behavior information based on the information acquired by the acquiring process and a result of the reinforcement learning; and
a process of identifying whether input information is information generated by the generating process or information indicating the expert's behavior trajectory,
wherein, in the generating process, the reinforcement learning is performed at a first time point with a ratio of the second reward to the first reward set to a first ratio, and the reinforcement learning is performed at a second time point later than the first time point with the ratio of the second reward to the first reward set to a second ratio higher than the first ratio.