JP2023145809A

JP2023145809A - Reinforcement learning device, reinforcement learning system, object operation device, model generation method and reinforcement learning program

Info

Publication number: JP2023145809A
Application number: JP2020119349A
Authority: JP
Inventors: 康博藤田; Yasuhiro Fujita
Original assignee: Preferred Networks Inc
Current assignee: Preferred Networks Inc
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2023-10-12
Also published as: WO2022009859A1

Abstract

To provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program that can increase the success probability of a predetermined operation on an object. SOLUTION: A reinforcement learning device includes at least one memory and at least one processor. The at least one processor is configured to be able to execute: inputting information regarding a captured image captured by an imaging device of which at least one of the position and orientation changes, and information regarding a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and updating a parameter of the training model based on the result of the operation on the object when the operation of the end effector is controlled based on the information output by the training model. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program.

A reinforcement learning system is known that takes images captured by a fixed camera as input and learns, through reinforcement learning, the motion of an end effector so that a predetermined operation (for example, a gripping operation by the end effector) succeeds on an object of a specified type among multiple types of objects placed within a predetermined area.

With this reinforcement learning system, if the object of the specified type is placed in a position where it can be photographed, repeating reinforcement learning can increase the success probability of the predetermined operation. If the object of the specified type is not placed in a position where it can be photographed, however, reinforcement learning cannot proceed and the success probability of the predetermined operation cannot be increased.

M. Danielczuk et al., "Mechanical Search: Multi-Step Retrieval of a Target Object Occluded by Cluster", in ICRA, 2019.
E. Jang, C. Devin, V. Vanhoucke, and S. Levine, "Grasp2Vec: Learning Object Representations from Self-Supervised Grasping", in CoRL, 2018.

The present disclosure provides a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program that can increase the success probability of a predetermined operation on an object.

A reinforcement learning device according to one aspect of the present disclosure has, for example, the following configuration. That is, the device includes:
at least one memory; and
at least one processor,
wherein the at least one processor is configured to be able to execute:
inputting information regarding a captured image captured by an imaging device of which at least one of the position and orientation changes, and information regarding a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and
updating parameters of the training model based on the result of the operation on the object when the operation of the end effector is controlled based on the information output by the training model.

FIG. 1 is a diagram showing an example of the system configuration of a reinforcement learning system.
FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
FIG. 3 is a first diagram showing an example of the functional configuration of a reinforcement learning device.
FIG. 4 is a first flowchart showing the flow of reinforcement learning processing.
FIG. 5 is a first diagram showing an example of execution of reinforcement learning processing.
FIG. 6 is a second diagram showing an example of execution of reinforcement learning processing.
FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
FIG. 8 is a second flowchart showing the flow of reinforcement learning processing.

Each embodiment will be described below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and redundant explanation is omitted.

[First embodiment]
<System configuration of reinforcement learning system>
First, the system configuration of the reinforcement learning system will be explained. FIG. 1 is a diagram showing an example of the system configuration of a reinforcement learning system. As shown in FIG. 1, the reinforcement learning system 100 includes a manipulator 110 and a reinforcement learning device 120.

The manipulator 110 is a device that performs a predetermined operation on an object of a specified type (the object to be operated, as indicated by the target object image) from among, for example, an object group 130 in which multiple types of objects are placed together.

The main body 113 of the manipulator 110 has a plurality of arms connected via a plurality of joints, and is configured so that the position and orientation of its tip are controlled by controlling the angle of each joint.

A gripping mechanism section 111 (an example of an end effector), which performs a predetermined operation (a gripping operation in this embodiment) on an object of a specified type, is attached to the tip of the main body 113 of the manipulator 110. The gripping operation on the object of the specified type is performed by controlling the opening and closing of the gripping mechanism section 111.

An imaging device 112 is also attached to the tip of the main body 113 of the manipulator 110. In other words, the imaging device 112 is configured so that its position and orientation change as the position and orientation of the gripping mechanism section 111 change. The imaging device 112 outputs captured images, each including R-value, G-value, and B-value images, at a predetermined frame period. Alternatively, the imaging device 112 may output captured images that include, in addition to the R-value, G-value, and B-value images, distance information to each position on the object surface, at a predetermined frame period. Alternatively, the imaging device 112 may output distance images containing distance information to each position on the object surface at a predetermined frame period. The captured image may also be a moving image. In the following, for simplicity, the imaging device 112 is assumed, as an example, to output captured images including R-value, G-value, and B-value images at a predetermined frame period.

Furthermore, the support base 114, which supports the main body 113 of the manipulator 110, has a built-in drive control device 115 that "controls the operation of the gripping mechanism section 111" (that is, controls the position and orientation of the gripping mechanism section 111 and its opening and closing).

The drive control device 115 acquires the captured image captured by the imaging device 112 and transmits it to the reinforcement learning device 120. The drive control device 115 also acquires sensor signals detected by various sensors (not shown) arranged in the gripping mechanism section 111 and the main body 113 of the manipulator 110, and transmits them to the reinforcement learning device 120.

In response to transmitting the captured image and the sensor signals, the drive control device 115 also acquires, from the reinforcement learning device 120, information for controlling the operation of the gripping mechanism section 111. This information may include any command relating to the operation of the gripping mechanism section 111, for example:
- information indicating the state of the gripping mechanism section 111 after the operation (a target value), or
- specific manipulated variables or controlled variables for controlling the position and orientation of the gripping mechanism section 111 and its opening and closing.
The information for controlling the operation of the gripping mechanism section 111 may also include information for controlling the operation of the manipulator 110. In the following, the drive control device 115 is described as acquiring information indicating the state of the gripping mechanism section 111 after the operation, as an example of the information for controlling the operation of the gripping mechanism section 111.

Furthermore, upon acquiring the information indicating the state of the gripping mechanism section 111 after the operation, the drive control device 115 controls the following, based on the various sensor signals (information indicating the state of the gripping mechanism section 111 before the operation):
- the various actuators (not shown) in the gripping mechanism section 111 of the manipulator 110, and
- the various actuators (not shown) in the main body 113 of the manipulator 110.
As a result, the position and orientation of the gripping mechanism section 111 and the opening and closing of the gripping mechanism section 111 are controlled.
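The step from "state before the operation" and "state after the operation" to actuator commands can be sketched as follows. This is a minimal illustration only: the `GripperState` class, the `compute_commands` function, and the `max_step` rate limit are hypothetical names and assumptions, not something the source specifies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GripperState:
    """Hypothetical state of the gripping mechanism (not defined in the source)."""
    joint_angles: List[float]  # radians, one per joint of the main body
    gripper_open: bool         # True if the gripping mechanism is open


def compute_commands(current: GripperState, target: GripperState,
                     max_step: float = 0.1) -> Dict[str, object]:
    """Turn (state before operation, target state) into actuator commands.

    Each joint moves toward its target by at most `max_step` radians per call.
    The rate limit is an illustrative assumption.
    """
    if len(current.joint_angles) != len(target.joint_angles):
        raise ValueError("current and target must have the same number of joints")
    deltas = []
    for cur, tgt in zip(current.joint_angles, target.joint_angles):
        error = tgt - cur
        # Clamp the requested change to the range [-max_step, +max_step].
        deltas.append(max(-max_step, min(max_step, error)))
    return {
        "joint_deltas": deltas,
        "gripper_command": "open" if target.gripper_open else "close",
    }


# State before the operation (from sensor signals) and the target state
# (from the reinforcement learning device); the values are made up.
current = GripperState(joint_angles=[0.0, 0.5, -0.2], gripper_open=True)
target = GripperState(joint_angles=[0.05, 0.9, -0.2], gripper_open=False)
commands = compute_commands(current, target)
```

Here the second joint's requested change of 0.4 rad is clamped to 0.1 rad, so reaching the target state would take several control cycles.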

The reinforcement learning device 120 has a reinforcement learning model (an example of a training model) that takes as input the captured image transmitted from the drive control device 115 and a target object image indicating the object to be gripped by the gripping mechanism section 111, and outputs information indicating the state of the gripping mechanism section 111 after the operation. A neural network, for example, may be used as the reinforcement learning model.

When information regarding the captured image transmitted from the drive control device 115 is input to the reinforcement learning model, features extracted from the captured image may be input instead of the captured image itself. Features extracted from the captured image are, for example, the features output from an intermediate layer when the captured image is input to a neural network.
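As a rough illustration of "features output from an intermediate layer", the sketch below runs a flattened image through a tiny two-layer network and returns the hidden-layer activations. The `TinyEncoder` class, the layer sizes, and the random weights are all hypothetical; a real system would presumably use a trained image network.

```python
import random
from typing import List


def make_layer(n_in: int, n_out: int, rng: random.Random) -> List[List[float]]:
    """Random weight matrix with n_out rows and n_in columns (illustrative only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense_relu(weights: List[List[float]], x: List[float]) -> List[float]:
    """Fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


class TinyEncoder:
    """Hypothetical two-layer network used only to show intermediate features."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = make_layer(n_in, n_hidden, rng)
        self.w2 = make_layer(n_hidden, n_out, rng)

    def features(self, pixels: List[float]) -> List[float]:
        """Return the intermediate (hidden) layer output for a flattened image."""
        return dense_relu(self.w1, pixels)

    def forward(self, pixels: List[float]) -> List[float]:
        """Return the final layer output, computed from the intermediate features."""
        return dense_relu(self.w2, self.features(pixels))


# A 2x2 RGB image flattened to 12 values in [0, 1]; the values are made up.
pixels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.0, 0.5]
encoder = TinyEncoder(n_in=12, n_hidden=4, n_out=2)
features = encoder.features(pixels)  # this vector would be fed to the model
```

Feeding `features` rather than `pixels` to the reinforcement learning model reduces the input size, at the cost of depending on how the encoder was obtained.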

The information regarding the target object image input to the reinforcement learning model may be a captured image including R-value, G-value, and B-value images, or a captured image including R-value, G-value, and B-value images together with distance information to each position on the object surface. Alternatively, the target object image may be a distance image containing distance information to each position on the object surface, or it may be a moving image. Instead of the target object image itself, features extracted from the target object image (for example, features output from an intermediate layer when the target object image is input to a neural network) may be input to the reinforcement learning model. In the following, for simplicity, a captured image including R-value, G-value, and B-value images is assumed to be input as an example of the target object image.

The operation of the gripping mechanism section 111 is controlled based on the information, output by the reinforcement learning model, indicating the state of the gripping mechanism section 111 after the operation. The reinforcement learning device 120 then acquires the result of the operation on the object to be gripped (for example, a determination of whether the gripping operation succeeded) and updates the model parameters of the reinforcement learning model based on the acquired result.
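The observe, act, and judge cycle described here can be sketched with a toy environment. Everything below is a hypothetical stand-in: `ToyGraspEnv` reduces the captured image to a camera cell index, the target object image to an identifier, and `scripted_policy` is a fixed placeholder for the reinforcement learning model. The parameter update that would follow each episode is omitted.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, str, int]  # (observation, action, next observation)


class ToyGraspEnv:
    """Hypothetical 1-D world: a camera on the gripper moves over a row of cells."""

    def __init__(self, object_cell: int = 3, n_cells: int = 5):
        self.object_cell = object_cell
        self.n_cells = n_cells
        self.camera_cell = 0

    def reset(self) -> int:
        self.camera_cell = 0
        return self.camera_cell

    def step(self, action: str) -> Tuple[int, bool, bool]:
        """Apply an action and return (next observation, done, success)."""
        if action == "grasp":
            # The episode ends on a grasp; it succeeds only over the object.
            return self.camera_cell, True, self.camera_cell == self.object_cell
        move = 1 if action == "right" else -1
        self.camera_cell = max(0, min(self.n_cells - 1, self.camera_cell + move))
        return self.camera_cell, False, False


def run_episode(env: ToyGraspEnv, policy: Callable[[int, str], str],
                target_id: str, max_steps: int = 20) -> Tuple[List[Transition], bool]:
    """Collect the state changes of one episode and the final operation result."""
    transitions: List[Transition] = []
    obs = env.reset()
    success = False
    for _ in range(max_steps):
        action = policy(obs, target_id)            # model output
        next_obs, done, success = env.step(action)  # drive the end effector
        transitions.append((obs, action, next_obs))
        obs = next_obs
        if done:
            break
    return transitions, success


def scripted_policy(obs: int, target_id: str) -> str:
    """Placeholder for the model: move right until cell 3, then grasp."""
    return "grasp" if obs == 3 else "right"


transitions, success = run_episode(ToyGraspEnv(), scripted_policy, "object-A")
```

Because the camera moves with the gripper, each action changes what the next observation contains, which is the property the embodiment relies on.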

In this way, to increase the success probability of the gripping operation when gripping an object of a specified type from the object group 130, in which multiple types of objects are placed together, the reinforcement learning system 100 does the following:
- It performs reinforcement learning using information regarding captured images captured by the imaging device 112, whose position and orientation change as the position and orientation of the gripping mechanism section 111 change.

As a result, even if, for example, the object to be gripped is not placed in a position where it can be photographed, the gripping mechanism section 111 can be operated during the course of reinforcement learning so that the object becomes photographable. In other words, this embodiment can provide a reinforcement learning system 100 that can increase the success probability of the gripping operation regardless of how the object to be gripped is placed.

In this embodiment, as indicated by reference numeral 140, the vertical direction of the page in FIG. 1 is defined as the Z-axis direction, the horizontal direction of the page as the Y-axis direction, and the depth direction of the page as the X-axis direction.

<Hardware configuration of each device configuring the reinforcement learning system>
Next, the hardware configuration of the manipulator 110 (the mechanical system is omitted here, and only the hardware configuration related to the control system is shown) and the hardware configuration of the reinforcement learning device 120, which together constitute the reinforcement learning system 100, will be explained with reference to FIG. 2. FIG. 2 is a diagram showing an example of the hardware configuration of each device configuring the reinforcement learning system.

(1) Hardware configuration of the manipulator
As shown in FIG. 2, the manipulator 110 includes a sensor group 211 and an actuator group 212 in addition to the imaging device 112 and the drive control device 115.

The sensor group 211 includes n sensors. In this embodiment, the n sensors include at least:
- sensors for calculating the position and orientation of the gripping mechanism section 111 (sensors that measure each joint angle of the main body 113), and
- a sensor that detects the opening and closing of the gripping mechanism section 111.
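How joint-angle measurements yield a tip position and orientation can be illustrated with forward kinematics for a planar serial arm. This is a simplified sketch under stated assumptions: the arm is restricted to the Y-Z plane, the link lengths are invented, and `planar_forward_kinematics` is a hypothetical helper, not part of the source.

```python
import math
from typing import List, Tuple


def planar_forward_kinematics(joint_angles: List[float],
                              link_lengths: List[float]) -> Tuple[float, float, float]:
    """Return (y, z, orientation) of the arm tip for a planar serial arm.

    Angles are relative joint angles in radians, measured from the Y axis for
    the first joint and from the previous link for the others.
    """
    if len(joint_angles) != len(link_lengths):
        raise ValueError("need one link length per joint angle")
    y = 0.0
    z = 0.0
    heading = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle                 # accumulate orientation along the chain
        y += length * math.cos(heading)  # horizontal displacement of this link
        z += length * math.sin(heading)  # vertical displacement of this link
    return y, z, heading


# Two unit-length links: the first points straight up, the second bends back
# to horizontal, so the tip ends up at roughly (y, z) = (1, 1) with heading 0.
tip_y, tip_z, tip_heading = planar_forward_kinematics(
    [math.pi / 2, -math.pi / 2], [1.0, 1.0])
```

A real manipulator would need a full 3-D kinematic model, but the principle of accumulating joint angles along the chain is the same.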

The actuator group 212 includes m actuators. In this embodiment, the m actuators include at least:
- actuators for controlling the position and orientation of the gripping mechanism section 111 (actuators for controlling each joint angle of the main body 113), and
- an actuator for controlling the opening and closing of the gripping mechanism section 111.

The drive control device 115 includes a sensor signal processing device 201, an actuator drive device 202, and a controller 203. The sensor signal processing device 201 receives the sensor signals transmitted from the sensor group 211 and notifies the controller 203 of the sensor signal data. The actuator drive device 202 acquires control signal data from the controller 203 and transmits control signals to the actuator group 212.

The controller 203 acquires the captured image transmitted from the imaging device 112 and transmits it to the reinforcement learning device 120. The controller 203 also transmits the sensor signal data notified by the sensor signal processing device 201 to the reinforcement learning device 120.

In response to transmitting the captured image and the sensor signal data, the controller 203 acquires, from the reinforcement learning device 120, information indicating the state of the gripping mechanism section 111 after the operation. Upon acquiring this information, the controller 203 generates, based on the sensor signal data, control signal data for operating the actuator group 212 and notifies the actuator drive device 202 of it.

(2) Hardware configuration of the reinforcement learning device
Next, the hardware configuration of the reinforcement learning device 120 will be explained. As shown in FIG. 2, the reinforcement learning device 120 includes, as components, a processor 221, a main storage device (memory) 222, an auxiliary storage device 223, a network interface 224, and a device interface 225. The reinforcement learning device 120 is realized as a computer in which these components are connected via a bus 226.

In the example of FIG. 2, the reinforcement learning device 120 is shown with one of each component, but it may include more than one of the same component. Also, although FIG. 2 shows a single reinforcement learning device 120, the reinforcement learning program may be installed on multiple reinforcement learning devices, with each device executing the same or a different part of the program's processing. In this case, a form of distributed computing may be adopted in which the devices communicate via the network interface 224 or the like to carry out the overall processing. In other words, the reinforcement learning device 120 may be configured as a system that realizes its functions by having one or more computers execute instructions stored in one or more storage devices. Alternatively, the various data transmitted from the drive control device 115 may be processed by one or more reinforcement learning devices provided on the cloud, with the processing results transmitted back to the drive control device 115.

The various operations of the reinforcement learning device 120 may be executed in parallel using one or more processors, or using multiple reinforcement learning devices that communicate via the communication network 240. The operations may also be distributed among multiple arithmetic cores within the processor 221 and executed in parallel. Some or all of the processing, means, and the like of the present disclosure may be executed by an external device 230 (at least one of a processor and a storage device) provided on the cloud and able to communicate with the reinforcement learning device 120 via the communication network 240. In this way, the reinforcement learning device 120 may take the form of parallel computing by one or more computers.

The processor 221 may be an electronic circuit (a processing circuit, processing circuitry, CPU, GPU, FPGA, ASIC, or the like). The processor 221 may also be a semiconductor device or the like that includes a dedicated processing circuit. The processor 221 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 221 may also include an arithmetic function based on quantum computing.

The processor 221 performs various operations based on data and instructions input from the devices in the internal configuration of the reinforcement learning device 120, and outputs operation results and control signals to those devices. The processor 221 controls each component of the reinforcement learning device 120 by executing an OS (Operating System), applications, and the like.

The processor 221 may refer to one or more electronic circuits arranged on a single chip, or to one or more electronic circuits arranged on two or more chips or devices. When multiple electronic circuits are used, they may communicate with each other by wire or wirelessly.

The main storage device 222 is a storage device that stores the instructions executed by the processor 221 and various data, and the data stored in the main storage device 222 is read out by the processor 221. The auxiliary storage device 223 is a storage device other than the main storage device 222. These storage devices mean any electronic components capable of storing various data, and may be semiconductor memories. A semiconductor memory may be either volatile or nonvolatile. The storage device for saving various data in the reinforcement learning device 120 may be realized by the main storage device 222 or the auxiliary storage device 223, or by a memory built into the processor 221.

A plurality of processors 221 may be connected (coupled) to one main storage device 222, or a single processor 221 may be connected to it. Alternatively, a plurality of main storage devices 222 may be connected (coupled) to one processor 221. When the reinforcement learning device 120 is composed of at least one main storage device 222 and a plurality of processors 221 connected (coupled) to that at least one main storage device 222, it may include a configuration in which at least one of the plurality of processors 221 is connected (coupled) to the at least one main storage device 222. This configuration may also be realized by main storage devices 222 and processors 221 included in multiple reinforcement learning devices 120. Furthermore, a configuration in which the main storage device 222 is integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.

The network interface 224 is an interface for connecting to the communication network 240 wirelessly or by wire. An appropriate interface, such as one that complies with existing communication standards, is used as the network interface 224. Through the network interface 224, various data may be exchanged with the drive control device 115 and other external devices 230 connected via the communication network 240. The communication network 240 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or the like, or a combination of them, as long as information can be exchanged between the computer and the drive control device 115 or other external devices 230. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet, and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).

The device interface 225 is an interface, such as USB, that connects directly to the external device 250.

The external device 250 is a device connected to the computer. The external device 250 may be, as one example, an input device. The input device is, for example, a camera, a microphone, a motion capture device, any of various sensors, a keyboard, a mouse, or a touch panel, and provides the acquired information to the computer. It may also be a device that has an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.

The external device 250 may also be, as one example, an output device. The output device may be a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or it may be a speaker or the like that outputs audio. It may also be a device that has an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.

また、外部装置２５０は、記憶装置（メモリ）であってもよい。例えば、外部装置２５０はネットワークストレージ等であってもよく、外部装置２５０はＨＤＤ等のストレージであってもよい。 Further, the external device 250 may be a storage device (memory). For example, the external device 250 may be a network storage or the like, and the external device 250 may be a storage such as an HDD.

また、外部装置２５０は、強化学習装置１２０の構成要素の一部の機能を有する装置でもよい。つまり、コンピュータは、外部装置２５０の処理結果の一部又は全部を送信または受信してもよい。 Further, the external device 250 may be a device that has some functions of the components of the reinforcement learning device 120. That is, the computer may transmit or receive part or all of the processing results of the external device 250.

＜強化学習装置の機能構成＞
次に、強化学習装置１２０の機能構成として、ここでは、２種類の機能構成例について説明する。図３は、強化学習装置の機能構成の一例を示す第１の図である。図３（ａ）に示すように、強化学習装置１２０は、更新部３１０、状態入力部３２０、強化学習モデル３３０を有する。
<Functional configuration of reinforcement learning device>
Next, two examples of the functional configuration of the reinforcement learning device 120 will be described. FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device. As shown in FIG. 3(a), the reinforcement learning device 120 includes an update unit 310, a state input unit 320, and a reinforcement learning model 330.

更新部３１０は、報酬算出部３１１を有し、強化学習モデル３３０のモデルパラメータを更新する。具体的には、更新部３１０は、把持対象の物体に対する把持操作が成功したか否かの判定結果と、把持機構部１１１の動作が制御されたことによる状態の変化を示す情報とを取得する。また、報酬算出部３１１は、更新部３１０により取得された判定結果に基づき、報酬を算出する。そして、更新部３１０は、これまでに取得または算出した各種情報（状態の変化を示す情報、報酬等）に基づいて、強化学習モデル３３０のモデルパラメータを更新する。 The update unit 310 includes a reward calculation unit 311 and updates the model parameters of the reinforcement learning model 330. Specifically, the update unit 310 acquires a determination result as to whether the gripping operation on the object to be gripped succeeded, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111. The reward calculation unit 311 calculates a reward based on the determination result acquired by the update unit 310. The update unit 310 then updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (information indicating the change in state, the reward, etc.).

なお、把持対象の物体に対する把持操作が成功したか否かの判定は、例えば、撮影画像に基づいて自動で行われてもよい。あるいは、把持対象の物体に対する把持操作が成功したか否かの判定は、強化学習システム１００のユーザが行ってもよい。 Note that the determination as to whether or not the gripping operation on the object to be gripped has been successful may be automatically performed based on the photographed image, for example. Alternatively, the user of the reinforcement learning system 100 may determine whether the grasping operation on the object to be grasped is successful.

また、上述した報酬の算出方法は一例にすぎず、更新部３１０は、把持操作が成功したか否かの判定結果以外の情報に基づいて、報酬を算出してもよい。例えば、更新部３１０は、把持操作が成功するまでに要した動作時間や動作回数、把持操作する際のマニピュレータ１１０全体の動作の大きさ（エネルギ効率）など、各種情報に基づいて、報酬を算出してもよい。 The reward calculation method described above is only an example, and the update unit 310 may calculate the reward based on information other than the determination result as to whether the gripping operation succeeded. For example, the update unit 310 may calculate the reward based on various information such as the operation time or number of operations required until the gripping operation succeeded, or the magnitude of the overall motion of the manipulator 110 during the gripping operation (energy efficiency).
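As a minimal sketch of the reward calculation described above, the following Python example combines the success determination with penalties for the number of operations and the magnitude of motion. The class `GraspOutcome`, the function `compute_reward`, the field names, and the penalty weights are all hypothetical and are not taken from the disclosure; they only illustrate one way such a reward could be shaped.

```python
from dataclasses import dataclass


@dataclass
class GraspOutcome:
    """Result of one grasp attempt (all field names are hypothetical)."""
    success: bool       # determination result: did the gripping operation succeed?
    num_steps: int      # number of operations needed before the attempt ended
    motion_cost: float  # overall manipulator motion, used as an energy proxy


def compute_reward(outcome: GraspOutcome,
                   step_penalty: float = 0.01,
                   motion_penalty: float = 0.1) -> float:
    """Return 1.0 for success (0.0 otherwise) minus assumed effort penalties."""
    base = 1.0 if outcome.success else 0.0
    return (base
            - step_penalty * outcome.num_steps
            - motion_penalty * outcome.motion_cost)
```

Setting both penalty weights to zero reduces this to the success-only reward of the basic example.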

状態入力部３２０は、駆動制御装置１１５より送信された撮影画像と、ユーザにより入力された目標物体画像とを取得し、強化学習モデル３３０に通知する。 The state input unit 320 acquires the captured image transmitted from the drive control device 115 and the target object image input by the user, and notifies the reinforcement learning model 330 of the acquired image.

強化学習モデル３３０は、更新部３１０によりモデルパラメータが更新される。また、モデルパラメータが更新された後の強化学習モデル３３０は、状態入力部３２０より通知された撮影画像と目標物体画像とを入力として、把持機構部１１１の動作後の状態を示す情報を出力する。本実施形態において、強化学習モデル３３０は、把持機構部１１１の動作後の状態を示す情報として、例えば、
・把持機構部１１１の動作後の位置及び姿勢を示す情報、
・把持機構部１１１の動作後の開閉を示す情報、
を出力する。 The model parameters of the reinforcement learning model 330 are updated by the update unit 310. After its model parameters have been updated, the reinforcement learning model 330 receives the captured image and the target object image from the state input unit 320 as input and outputs information indicating the state of the gripping mechanism unit 111 after operation. In this embodiment, the reinforcement learning model 330 outputs, as the information indicating the state of the gripping mechanism unit 111 after operation, for example:
- information indicating the position and orientation of the gripping mechanism unit 111 after operation; and
- information indicating whether the gripping mechanism unit 111 is open or closed after operation.

一方、図３（ｂ）は、他の機能構成例を示している。図３（ｂ）に示すように、強化学習装置１２０の状態入力部３２０は、撮影画像及び目標物体画像に加えて、把持機構部１１１の動作前（現在）の状態を示す情報を取得し、強化学習モデル３３０に通知するように構成されている。ここでいう、把持機構部１１１の動作前（現在）の状態を示す情報には、例えば、
・把持機構部１１１の動作前（現在）の位置及び姿勢を示す情報、
・把持機構部１１１の動作前（現在）の開閉を示す情報、
が含まれる。 On the other hand, FIG. 3(b) shows another functional configuration example. As shown in FIG. 3(b), the state input unit 320 of the reinforcement learning device 120 is configured to acquire, in addition to the captured image and the target object image, information indicating the (current) state of the gripping mechanism unit 111 before operation, and to pass it to the reinforcement learning model 330. Here, the information indicating the (current) state of the gripping mechanism unit 111 before operation includes, for example:
- information indicating the (current) position and orientation of the gripping mechanism unit 111 before operation; and
- information indicating whether the gripping mechanism unit 111 is (currently) open or closed before operation.

この場合、強化学習モデル３３０は、状態入力部３２０より通知された撮影画像、目標物体画像、把持機構部１１１の動作前（現在）の状態を示す情報を入力として、把持機構部１１１の動作後の状態を示す情報を出力する。 In this case, the reinforcement learning model 330 receives, as input, the captured image, the target object image, and the information indicating the (current) state of the gripping mechanism unit 111 before operation, all passed from the state input unit 320, and outputs information indicating the state of the gripping mechanism unit 111 after operation.
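The two configurations of FIG. 3 differ only in whether the current state of the gripping mechanism unit is part of the model input. A minimal Python sketch of such input and output records follows; the type names `GripperState` and `ModelInput`, their fields, and the use of nested lists as an image placeholder are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Placeholder image type (rows of pixel intensities); a real system would
# likely use an array library instead. This is an assumption.
Image = List[List[float]]


@dataclass
class GripperState:
    """State of the gripping mechanism unit, before or after operation."""
    position: Tuple[float, float, float]     # x, y, z
    orientation: Tuple[float, float, float]  # e.g. roll, pitch, yaw
    is_open: bool                            # open/closed state


@dataclass
class ModelInput:
    """Input passed from the state input unit to the model."""
    captured_image: Image
    target_object_image: Image
    # None in the FIG. 3(a) configuration, set in the FIG. 3(b) configuration.
    current_state: Optional[GripperState] = None


def uses_current_state(model_input: ModelInput) -> bool:
    """True when the input follows the FIG. 3(b) configuration."""
    return model_input.current_state is not None
```

In both configurations the model output would be a `GripperState` describing the gripping mechanism unit after operation.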

＜強化学習処理の流れ＞
次に、強化学習装置１２０による強化学習処理の流れについて説明する。図４は、強化学習処理の流れを示す第１のフローチャートである。以下、図４を参照しながら、強化学習処理の流れについて説明する。なお、図４に示す強化学習処理は、あくまで一例であり、他のモデル生成方法により強化学習処理が実行されることで強化学習済みのモデルが生成されてもよい。
<Flow of reinforcement learning processing>
Next, the flow of reinforcement learning processing by the reinforcement learning device 120 will be described. FIG. 4 is a first flowchart showing the flow of reinforcement learning processing. The flow of reinforcement learning processing will be described below with reference to FIG. 4. Note that the reinforcement learning processing shown in FIG. 4 is merely an example, and a trained model may be generated by executing reinforcement learning processing using another model generation method.

ステップＳ４０１において、強化学習装置１２０の状態入力部３２０は、目標物体画像を取得する。 In step S401, the state input unit 320 of the reinforcement learning device 120 acquires a target object image.

ステップＳ４０２において、強化学習装置１２０の状態入力部３２０は、撮影画像を取得する。 In step S402, the state input unit 320 of the reinforcement learning device 120 acquires a captured image.

ステップＳ４０３において、強化学習装置１２０の状態入力部３２０は、把持機構部１１１の動作前（現在）の状態を示す情報を取得するように構成されている場合にあっては、把持機構部１１１の動作前（現在）の状態を示す情報を取得する。 In step S403, if the state input unit 320 of the reinforcement learning device 120 is configured to acquire information indicating the (current) state of the gripping mechanism unit 111 before operation, it acquires that information.

ステップＳ４０４において、強化学習装置１２０の強化学習モデル３３０は、目標物体画像、撮影画像、（及び把持機構部１１１の動作前の状態を示す情報）を入力として、把持機構部１１１の動作後の状態を示す情報を出力する。なお、強化学習モデル３３０は、把持機構部１１１の動作後の状態を示す情報として、様々な情報を網羅的に出力するように構成されているものとする。この結果、強化学習処理中の把持機構部１１１の動作には、可能な動作の集合の中から選択された最適な動作と、可能な動作の集合の中からランダムに選択された動作とが含まれることになる。 In step S404, the reinforcement learning model 330 of the reinforcement learning device 120 receives the target object image and the captured image (and the information indicating the state of the gripping mechanism unit 111 before operation) as input, and outputs information indicating the state of the gripping mechanism unit 111 after operation. It is assumed that the reinforcement learning model 330 is configured to output a wide variety of information as the information indicating the state of the gripping mechanism unit 111 after operation. As a result, the operations of the gripping mechanism unit 111 during reinforcement learning processing include both optimal operations selected from the set of possible operations and operations selected at random from that set.

ステップＳ４０５において、強化学習装置１２０は、強化学習モデル３３０により出力された、把持機構部１１１の動作後の状態を示す情報を、駆動制御装置１１５に送信する。 In step S405, the reinforcement learning device 120 transmits information output by the reinforcement learning model 330 indicating the state of the gripping mechanism section 111 after operation to the drive control device 115.

ステップＳ４０６において、強化学習装置１２０の更新部３１０は、把持機構部１１１の動作が制御されたことによる状態の変化を示す情報を取得する。 In step S406, the updating unit 310 of the reinforcement learning device 120 acquires information indicating a change in state due to the controlled operation of the gripping mechanism unit 111.

ステップＳ４０７において、強化学習装置１２０の更新部３１０は、把持対象の物体に対する把持操作が成功したか否かの判定結果を取得し、強化学習装置１２０の報酬算出部３１１は、取得した判定結果に基づき、報酬を算出する。 In step S407, the update unit 310 of the reinforcement learning device 120 acquires a determination result as to whether the gripping operation on the object to be gripped succeeded, and the reward calculation unit 311 of the reinforcement learning device 120 calculates a reward based on the acquired determination result.

ステップＳ４０８において、強化学習装置１２０の更新部３１０は、これまでに取得または算出した各種情報（状態の変化を示す情報、報酬等）に基づいて、強化学習モデル３３０のモデルパラメータを更新する。 In step S408, the updating unit 310 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 330 based on various information (information indicating a change in state, reward, etc.) acquired or calculated so far.

ステップＳ４０９において、強化学習装置１２０の状態入力部３２０は、現在の目標物体画像から、異なる目標物体画像へと切り替えるか否かを判定する。 In step S409, the state input unit 320 of the reinforcement learning device 120 determines whether to switch from the current target object image to a different target object image.

ステップＳ４０９において、異なる目標物体画像に切り替えないと判定した場合には（ステップＳ４０９においてＮｏの場合には）、ステップＳ４０２に戻る。 If it is determined in step S409 not to switch to a different target object image (No in step S409), the process returns to step S402.

一方、ステップＳ４０９において、異なる目標物体画像に切り替えると判定した場合には（ステップＳ４０９においてＹｅｓの場合には）、ステップＳ４１０に進む。 On the other hand, if it is determined in step S409 to switch to a different target object image (in the case of Yes in step S409), the process advances to step S410.

ステップＳ４１０において、強化学習装置１２０の更新部３１０は、強化学習処理の終了条件を満たすか否かを判定する。なお、強化学習処理の終了条件とは、例えば、強化学習システム１００のユーザによって規定された条件であり、一例として、所定の物体に対する把持操作の目標成功確率等が挙げられる。 In step S410, the update unit 310 of the reinforcement learning device 120 determines whether the conditions for ending the reinforcement learning process are satisfied. Note that the termination condition for the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and includes, for example, the target success probability of a grasping operation on a predetermined object.

ステップＳ４１０において、強化学習処理の終了条件を満たさないと判定した場合には（ステップＳ４１０においてＮｏの場合には）、ステップＳ４０１に戻る。 In step S410, if it is determined that the termination condition for the reinforcement learning process is not satisfied (in the case of No in step S410), the process returns to step S401.

一方、ステップＳ４１０において、強化学習処理の終了条件を満たすと判定した場合には（ステップＳ４１０においてＹｅｓの場合には）、強化学習処理を終了する。なお、強化学習処理を終了した後の強化学習モデル３３０は、強化学習済みモデルとして、把持機構部１１１の動作を制御するための情報を、駆動制御装置１１５に対して出力する装置（物体操作装置と称す）に適用される。 On the other hand, if it is determined in step S410 that the end condition of the reinforcement learning processing is satisfied (Yes in step S410), the reinforcement learning processing ends. After the reinforcement learning processing has ended, the reinforcement learning model 330 is applied, as a trained model, to a device (referred to as an object manipulation device) that outputs information for controlling the operation of the gripping mechanism unit 111 to the drive control device 115.
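The flow of steps S401 to S410 can be sketched as a nested loop. The Python example below is illustrative only: `ToyEnv` and `ToyAgent` are hypothetical stand-ins for the manipulator side and the reinforcement learning model, the switch to a different target object image (S409) is assumed to happen after a fixed number of attempts, and the end condition (S410) is assumed to be a target success rate. None of these choices is specified by the disclosure.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for the camera, manipulator, and drive control device."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def target_image(self):           # S401: target object image from the user
        return "target"

    def captured_image(self):         # S402: image from the moving camera
        return "frame"

    def gripper_state(self):          # S403: only used in the FIG. 3(b) configuration
        return (0.0, 0.0, 0.0)

    def execute(self, next_state):    # S405-S406: send command, observe state change
        return {"moved_to": next_state}

    def grasp_succeeded(self):        # S407: judged from the image or by the user
        return self.rng.random() < 0.5


class ToyAgent:
    """Hypothetical stand-in for the reinforcement learning model and update unit."""

    def __init__(self):
        self.updates = 0

    def act(self, target, frame, state):    # S404: output post-operation state
        return state

    def update(self, transition, reward):   # S408: update model parameters
        self.updates += 1


def train(env, agent, attempts_per_target=5, target_success_rate=0.4, max_targets=100):
    """Run the S401-S410 loop and return the observed success rate."""
    successes = 0
    attempts = 0
    for _ in range(max_targets):
        target = env.target_image()                         # S401
        for _ in range(attempts_per_target):                # S409 "No": back to S402
            frame = env.captured_image()                    # S402
            state = env.gripper_state()                     # S403
            next_state = agent.act(target, frame, state)    # S404
            transition = env.execute(next_state)            # S405, S406
            reward = 1.0 if env.grasp_succeeded() else 0.0  # S407
            agent.update(transition, reward)                # S408
            attempts += 1
            successes += int(reward > 0)
        # S410: assumed end condition, a user-defined target success rate
        if successes / attempts >= target_success_rate:
            break
    return successes / attempts
```

A call such as `train(ToyEnv(seed=0), ToyAgent())` runs the loop end to end; replacing the toy classes with real components would leave the control flow unchanged.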

物体操作装置に適用された強化学習済みモデルは、図４のステップＳ４０１～Ｓ４０５の処理を実行する（つまり、状態の変化を示す情報の取得、報酬の算出、モデルパラメータの更新等は行わない）。また、ステップＳ４０４では、把持機構部１１１の動作後の状態を示す情報として、最適な情報が出力されるように構成される。つまり、把持機構部１１１は、強化学習処理中とは異なり、様々な動作を網羅的に行う代わりに、可能な動作の集合の中から選択された最適な動作を行う。 The reinforcement learning model applied to the object manipulation device executes the processes of steps S401 to S405 in FIG. 4 (that is, it does not acquire information indicating changes in state, calculate rewards, update model parameters, etc.) . Further, in step S404, the configuration is such that optimal information is output as information indicating the state of the gripping mechanism section 111 after the operation. That is, unlike during reinforcement learning processing, the gripping mechanism unit 111 performs an optimal motion selected from a set of possible motions instead of exhaustively performing various motions.

＜強化学習処理の実行例＞
次に、強化学習システム１００による強化学習処理の実行例について説明する。図５及び図６は、強化学習処理の実行例を示す第１及び第２の図である。図５（ａ）に示す目標物体画像５１０がユーザにより入力されると、強化学習装置１２０は、目標物体画像５１０に含まれる物体５１１を把持対象の物体として認識する。
<Execution example of reinforcement learning processing>
Next, an example of execution of reinforcement learning processing by the reinforcement learning system 100 will be described. FIGS. 5 and 6 are first and second diagrams showing an example of execution of reinforcement learning processing. When the target object image 510 shown in FIG. 5(a) is input by the user, the reinforcement learning device 120 recognizes the object 511 included in the target object image 510 as the object to be gripped.

このように、目標物体画像５１０の入力により把持対象の物体の種類を指定する構成とすることで、強化学習装置１２０によれば、ユーザは、物体群１３０に含まれる任意の物体を、把持対象の物体として指定することができる。 In this way, because the type of the object to be gripped is specified by inputting the target object image 510, the reinforcement learning device 120 allows the user to designate any object included in the object group 130 as the object to be gripped.

図５（ｂ）において矢印５００は、物体５１１が把持対象の物体として認識された時点での撮像装置１１２の位置及び姿勢（撮影位置及び撮影方向）を示している。また、撮影画像５２１は、矢印５００に示す位置及び姿勢のもとで物体群１３０を撮影した場合の撮影画像を示している。 In FIG. 5B, an arrow 500 indicates the position and orientation (photographing position and photographing direction) of the imaging device 112 at the time when the object 511 is recognized as the object to be grasped. Further, a photographed image 521 shows a photographed image when the object group 130 is photographed under the position and orientation shown by the arrow 500.

撮影画像５２１に示すように、物体群１３０を上方向（Ｚ軸方向）から撮影した場合、把持対象の物体５１１は、他の物体５１２によって遮蔽され、撮像装置１１２は、物体５１１を撮影することができない。つまり、この状態では、物体５１１を把持することができない。 As shown in the captured image 521, when the object group 130 is photographed from above (Z-axis direction), the object 511 to be gripped is occluded by another object 512, and the imaging device 112 cannot photograph the object 511. That is, in this state, the object 511 cannot be gripped.

このため、強化学習装置１２０では、把持対象の物体５１１が撮影可能となるように把持機構部１１１の位置及び姿勢を変化させるべく、把持機構部１１１の動作後の状態を示す情報を出力する。これにより、駆動制御装置１１５では、把持機構部１１１の動作後の状態を示す情報に基づいて、把持機構部１１１の動作を制御する。 Therefore, the reinforcement learning device 120 outputs information indicating the state of the gripping mechanism 111 after operation in order to change the position and posture of the gripping mechanism 111 so that the object 511 to be gripped can be photographed. Thereby, the drive control device 115 controls the operation of the gripping mechanism 111 based on information indicating the state of the gripping mechanism 111 after the operation.

図５（ｃ）において矢印５０１は、把持機構部１１１の動作が制御されることで変化した、変化後の撮像装置１１２の位置及び姿勢（撮影位置及び撮影方向）を示している。また、撮影画像５２２は、矢印５０１に示す位置及び姿勢のもとで物体群１３０を撮影した場合の撮影画像を示している。 In FIG. 5C, an arrow 501 indicates the changed position and orientation (photographing position and photographing direction) of the imaging device 112, which are changed by controlling the operation of the gripping mechanism section 111. Further, a photographed image 522 shows a photographed image when the object group 130 is photographed under the position and orientation shown by the arrow 501.

撮影画像５２２に示すように、物体群１３０を横方向（Ｘ軸方向）から撮影することで、把持対象の物体５１１が撮影可能になっている。 As shown in the photographed image 522, by photographing the object group 130 from the lateral direction (X-axis direction), the object 511 to be grasped can be photographed.

このため、強化学習装置１２０では、把持対象の物体５１１が把持可能となるように把持機構部１１１の位置及び姿勢を更に変化させるべく、把持機構部１１１の動作後の状態を示す情報を出力する。これにより、駆動制御装置１１５では、把持機構部１１１の動作後の状態を示す情報に基づいて、把持機構部１１１の動作を制御する。 For this reason, the reinforcement learning device 120 outputs information indicating the state of the gripping mechanism 111 after the operation in order to further change the position and orientation of the gripping mechanism 111 so that the object 511 to be gripped can be gripped. . Thereby, the drive control device 115 controls the operation of the gripping mechanism 111 based on information indicating the state of the gripping mechanism 111 after the operation.

図６（ａ）において矢印６０１は、マニピュレータ１１０の動作が制御されることで変化した、変化後の撮像装置１１２の位置及び姿勢（撮影位置及び撮影方向）を示している。また、撮影画像６１１は、矢印６０１に示す位置及び姿勢のもとで物体群１３０を撮影した場合の撮影画像を示している。 In FIG. 6A, an arrow 601 indicates the changed position and orientation (photographing position and photographing direction) of the imaging device 112, which are changed by controlling the operation of the manipulator 110. Further, a photographed image 611 shows a photographed image when the object group 130 is photographed under the position and orientation shown by the arrow 601.

撮影画像６１１に示すように、把持対象の物体５１１に近づけたことで、把持対象の物体５１１が把持可能となっている。 As shown in the photographed image 611, the object 511 to be gripped can be gripped by approaching the object 511 to be gripped.

このため、強化学習装置１２０では、把持機構部１１１に把持対象の物体５１１を把持させるべく、把持機構部１１１の動作後の状態を示す情報を出力する。これにより、駆動制御装置１１５では、動作後の把持機構部１１１の状態を示す情報に基づいて、把持機構部１１１の動作を制御する。 For this reason, the reinforcement learning device 120 outputs information indicating the state of the gripping mechanism 111 after the operation in order to cause the gripping mechanism 111 to grip the object 511 to be gripped. Thereby, the drive control device 115 controls the operation of the gripping mechanism 111 based on information indicating the state of the gripping mechanism 111 after the operation.

図６（ｂ）において矢印６０２は、把持機構部１１１の動作が制御されることで変化した、変化後の撮像装置１１２の位置及び姿勢（撮影位置及び撮影方向）を示している（物体５１１が把持され、所定の高さまで持ち上げられた状態を示している）。また、撮影画像６１２は、矢印６０２に示す位置及び姿勢のもとで物体５１１を撮影した場合の撮影画像を示している。 In FIG. 6(b), an arrow 602 indicates the position and orientation (photographing position and photographing direction) of the imaging device 112 after the change, which is changed by controlling the operation of the gripping mechanism section 111 (the object 511 is (It is shown being grasped and lifted to a predetermined height). Further, a photographed image 612 shows a photographed image when the object 511 is photographed under the position and orientation shown by the arrow 602.

このように、撮像装置１１２の位置及び姿勢が、把持機構部１１１の位置及び姿勢の変化に伴って変化するように構成したうえで、当該撮像装置により撮影された撮影画像を用いて強化学習を行うことで、
・撮像装置からの見え方が変わるなどの長期的な視点での評価ができる。
・把持対象の物体を把持するという動作を試行する過程で、把持対象の物体を探索するという動作を試行することができる。 In this way, by configuring the imaging device 112 so that its position and orientation change with changes in the position and orientation of the gripping mechanism unit 111, and by performing reinforcement learning using images captured by that imaging device:
- The evaluation can take a long-term view, for example of how the appearance from the imaging device changes.
- In the course of trying the action of gripping the object to be gripped, the action of searching for that object can also be tried.

つまり、強化学習の過程で、把持対象の物体が撮影可能となるように、把持機構部の動作を制御することができる。この結果、把持対象の物体の載置状態によらず、把持操作の成功確率を上げることができる。 That is, in the process of reinforcement learning, the operation of the gripping mechanism can be controlled so that the object to be gripped can be photographed. As a result, the success probability of the grasping operation can be increased regardless of the placement state of the object to be grasped.

＜まとめ＞
以上の説明から明らかなように、第１の実施形態に係る強化学習システム１００は、
・把持機構部の位置及び姿勢の変化に伴って、位置及び姿勢が変化する撮像装置により撮影された撮影画像と、把持機構部が把持する把持対象の物体を示す目標物体画像とを、把持機構部の動作後の状態を示す情報を出力する強化学習モデルに入力する。
・把持機構部の動作後の状態を示す情報に基づき把持機構部の動作が制御された場合の、把持対象の物体に対する操作結果（エンドエフェクタによる把持操作が成功したか否かの判定結果）に基づいて、強化学習モデルのモデルパラメータを更新する。
<Summary>
As is clear from the above description, the reinforcement learning system 100 according to the first embodiment:
- inputs a captured image, taken by an imaging device whose position and orientation change with changes in the position and orientation of the gripping mechanism unit, and a target object image showing the object to be gripped by the gripping mechanism unit, to a reinforcement learning model that outputs information indicating the state of the gripping mechanism unit after operation; and
- updates the model parameters of the reinforcement learning model based on the result of the operation on the object to be gripped (the determination result as to whether the gripping operation by the end effector succeeded) when the operation of the gripping mechanism unit is controlled based on the information indicating its state after operation.

これにより、強化学習システム１００によれば、把持対象の物体が遮蔽されるように載置されていた場合でも、強化学習の過程で把持対象の物体が撮影可能となるように、把持機構部の動作を制御することができる。 As a result, according to the reinforcement learning system 100, even if the object to be gripped is placed so that it is occluded, the operation of the gripping mechanism unit can be controlled in the course of reinforcement learning so that the object to be gripped can be photographed.

つまり、第１の実施形態によれば、載置状態によらず、指定された種類の物体に対して把持操作の成功確率を上げることが可能な、強化学習装置、強化学習システム、物体操作装置、モデル生成方法及び強化学習プログラムを提供することができる。 In other words, the first embodiment can provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program that can increase the probability of a successful gripping operation on a specified type of object regardless of how the object is placed.

［第２の実施形態］
第２の実施形態では、Ｑ学習により、強化学習を行う場合について説明する。以下、第２の実施形態について、上記第１の実施形態との相違点を中心に説明する。
[Second embodiment]
In the second embodiment, a case in which reinforcement learning is performed by Q-learning will be described. The second embodiment will be described below, focusing on the differences from the first embodiment.

＜強化学習装置の機能構成＞
はじめに、第２の実施形態に係る強化学習装置１２０の機能構成例について説明する。図７は、強化学習装置の機能構成の一例を示す第２の図である。図７に示すように、第２の実施形態に係る強化学習装置１２０は、更新部７１０、強化学習モデル７２０を有する。強化学習モデル７２０には、例えば、ニューラルネットワークが用いられてもよい。
<Functional configuration of reinforcement learning device>
First, an example of the functional configuration of the reinforcement learning device 120 according to the second embodiment will be described. FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device. As shown in FIG. 7, the reinforcement learning device 120 according to the second embodiment includes an update unit 710 and a reinforcement learning model 720. For example, a neural network may be used as the reinforcement learning model 720.

更新部７１０は、報酬算出部７１１、パラメータ更新部７１２を有し、強化学習モデル７２０のモデルパラメータを更新する。 The update unit 710 includes a reward calculation unit 711 and a parameter update unit 712, and updates model parameters of the reinforcement learning model 720.

具体的には、更新部７１０は、把持対象の物体に対する把持操作が成功したか否かの判定結果、及び、把持機構部１１１の動作が制御されたことによる状態の変化を示す情報を取得する。 Specifically, the updating unit 710 acquires a determination result as to whether or not the gripping operation on the object to be gripped has been successful, and information indicating a change in the state due to the controlled operation of the gripping mechanism unit 111. .

また、報酬算出部７１１は、把持対象の物体に対する把持操作が成功したか否かの判定結果に基づき報酬を算出する。なお、把持対象の物体に対する把持操作が成功したか否かの判定方法や、報酬の算出方法は、上記第１の実施形態において説明済みであるため、ここでは説明を省略する。 Further, the reward calculation unit 711 calculates a reward based on the determination result of whether or not the gripping operation on the object to be gripped is successful. Note that the method for determining whether or not the gripping operation on the object to be gripped has been successful and the method for calculating the reward have already been explained in the first embodiment, and therefore will not be described here.

また、パラメータ更新部７１２は、強化学習モデル７２０に含まれる、画像解析部７２１、状態及び動作入力部７２２、期待値算出部７２４の各モデルパラメータを更新する。なお、パラメータ更新部７１２は、
・更新部７１０により取得された、状態の変化を示す情報、
・報酬算出部７１１により算出された報酬（即時報酬）、
・後述する期待値算出部７２４において算出された、割引累積報酬の期待値（Ｑ値）の予測値、
に基づいて、モデルパラメータを更新する。 Furthermore, the parameter update unit 712 updates the model parameters of the image analysis unit 721, the state and action input unit 722, and the expected value calculation unit 724 included in the reinforcement learning model 720. The parameter update unit 712 updates the model parameters based on:
- information indicating a change in state, acquired by the update unit 710;
- the reward (immediate reward) calculated by the reward calculation unit 711; and
- the predicted value of the expected discounted cumulative reward (Q value) calculated by the expected value calculation unit 724, described later.
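The three inputs listed above correspond to the quantities used in a standard Q-learning update. The following Python sketch shows the usual temporal-difference target and error; the disclosure does not give the update formula, so the function names `td_target` and `td_error`, the discount factor `gamma`, and the handling of terminal steps are assumptions based on textbook Q-learning rather than on the described device.

```python
from typing import Sequence


def td_target(reward: float,
              next_q_values: Sequence[float],
              gamma: float = 0.9,
              terminal: bool = False) -> float:
    """Standard Q-learning target: r + gamma * max over a' of Q(s', a', g).

    reward        -- immediate reward from the reward calculation unit
    next_q_values -- predicted Q values for candidate actions in the next state
    gamma         -- assumed discount factor for the cumulative reward
    terminal      -- True when the episode ended (no future reward is added)
    """
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_predicted: float,
             reward: float,
             next_q_values: Sequence[float],
             gamma: float = 0.9,
             terminal: bool = False) -> float:
    """Difference between the target and the current prediction Q(s, a, g)."""
    return td_target(reward, next_q_values, gamma, terminal) - q_predicted
```

In a neural-network implementation, the squared `td_error` would typically serve as the loss whose gradient updates the parameters of units 721, 722, and 724.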

強化学習モデル７２０は、更新部７１０によりモデルパラメータが更新される。また、モデルパラメータが更新された後の強化学習モデル７２０は、撮影画像、目標物体画像、把持機構部１１１の動作前の状態（ｓ）を示す情報を入力として、把持機構部１１１の動作後の状態を示す情報を出力する。 The model parameters of the reinforcement learning model 720 are updated by the update unit 710. After its model parameters have been updated, the reinforcement learning model 720 receives the captured image, the target object image, and information indicating the state (s) of the gripping mechanism unit 111 before operation as input, and outputs information indicating the state of the gripping mechanism unit 111 after operation.

具体的には、図７に示すように、強化学習モデル７２０は、画像解析部７２１、状態及び動作入力部７２２、加算部７２３、期待値算出部７２４、調整部７２５を有する。 Specifically, as shown in FIG. 7, the reinforcement learning model 720 includes an image analysis section 721, a state and action input section 722, an addition section 723, an expected value calculation section 724, and an adjustment section 725.

画像解析部７２１は、駆動制御装置１１５より送信された撮影画像と、ユーザにより入力された目標物体（ｇ）画像とを取得することで処理を実行し、実行結果を加算部７２３に出力する。なお、画像解析部７２１は、例えば、ニューラルネットワークを用いて構成される。より具体的には、画像解析部７２１は、例えば、第１の畳み込み層、第１のＭａｘＰｏｏｌｉｎｇ層、第２の畳み込み層、第２のＭａｘＰｏｏｌｉｎｇ層等により構成される。 The image analysis unit 721 executes processing by acquiring the captured image transmitted from the drive control device 115 and the target object (g) image input by the user, and outputs the execution result to the addition unit 723. Note that the image analysis unit 721 is configured using, for example, a neural network. More specifically, the image analysis unit 721 includes, for example, a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.

状態及び動作入力部７２２は、把持機構部１１１の動作前の状態（ｓ）を示す情報と、把持機構部１１１の動作（ａ）を示す情報とを取得することで処理を実行し、実行結果を加算部７２３に出力する。なお、状態及び動作入力部７２２は、例えば、ニューラルネットワークを用いて構成される。より具体的には、状態及び動作入力部７２２は、第１の線形層、第２の線形層、形状変換層等により構成される。また、状態及び動作入力部７２２には、後述する期待値算出部７２４により算出される最大のＱ値を探索するために、調整部７２５により調整された、把持機構部１１１の動作（ａ）を示す情報が、所定回数（例えば、２０回）入力される。 The state and action input unit 722 executes processing upon acquiring information indicating the state (s) of the gripping mechanism unit 111 before operation and information indicating the action (a) of the gripping mechanism unit 111, and outputs the execution result to the addition unit 723. The state and action input unit 722 is configured using, for example, a neural network. More specifically, the state and action input unit 722 is composed of a first linear layer, a second linear layer, a reshaping layer, and the like. In addition, in order to search for the maximum Q value calculated by the expected value calculation unit 724 described later, information indicating the action (a) of the gripping mechanism unit 111, as adjusted by the adjustment unit 725, is input to the state and action input unit 722 a predetermined number of times (for example, 20 times).

加算部７２３は、画像解析部７２１より出力された実行結果と、状態及び動作入力部７２２より出力された実行結果とを加算して、期待値算出部７２４に入力する。 The adding unit 723 adds the execution result output from the image analysis unit 721 and the execution result output from the state and action input unit 722 and inputs the result to the expected value calculation unit 724.

期待値算出部７２４は、加算部７２３において加算された、画像解析部７２１の実行結果と、状態及び動作入力部７２２の実行結果とが入力されることで処理を実行し、Ｑ値（Ｑ（ｓ，ａ，ｇ））を算出する。期待値算出部７２４では、調整部７２５により調整された、把持機構部１１１の動作（ａ）を示す情報の数に応じた数のＱ値を算出する。なお、期待値算出部７２４は、例えば、ニューラルネットワークを用いて構成される。より具体的には、期待値算出部７２４は、第１の畳み込み層、第１のＭａｘＰｏｏｌｉｎｇ層、第２の畳み込み層、第２のＭａｘＰｏｏｌｉｎｇ層等により構成される。 The expected value calculation unit 724 executes processing by inputting the execution result of the image analysis unit 721 and the execution result of the state and action input unit 722, which are added in the addition unit 723, and calculates the Q value (Q( s, a, g)). The expected value calculation unit 724 calculates a number of Q values corresponding to the number of pieces of information indicating the operation (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725. Note that the expected value calculation unit 724 is configured using, for example, a neural network. More specifically, the expected value calculation unit 724 includes a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.

調整部７２５は、期待値算出部７２４においてＱ値が算出されるごとに、把持機構部１１１の動作（ａ）を示す情報を調整し、状態及び動作入力部７２２に入力する。調整部７２５では、把持機構部１１１の動作（ａ）を示す情報を、所定回数（例えば、２０回）調整し、その間に算出されたＱ値の中から最大のＱ値を抽出する。なお、調整部７２５は、例えば、ε－グリーディ法に基づいて、把持機構部１１１の可能な動作の集合の中から、いずれかの動作（ａ）を示す情報を特定する。 The adjustment unit 725 adjusts information indicating the operation (a) of the gripping mechanism unit 111 every time the expected value calculation unit 724 calculates the Q value, and inputs the information to the state and operation input unit 722. The adjustment unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 a predetermined number of times (for example, 20 times), and extracts the maximum Q value from among the Q values calculated during that time. Note that the adjustment unit 725 specifies information indicating one of the movements (a) from a set of possible movements of the gripping mechanism unit 111, for example, based on the ε-greedy method.

ε－グリーディ法によれば、最大のＱ値に対応する動作（ａ）を示す情報が特定される場合もあれば、ランダムに選択された動作（ａ）を示す情報が特定される場合もある。 According to the ε-greedy method, information indicating the action (a) corresponding to the maximum Q value may be identified, or information indicating a randomly selected action (a) may be identified. .
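A minimal Python sketch of this selection follows. It evaluates a fixed number of candidate actions (20, matching the example count above) and applies the ε-greedy rule. How the adjustment unit generates each candidate is not specified in the disclosure, so drawing candidates from a caller-supplied sampler is an assumption, as are the names `select_action`, `q_function`, and `sample_action` and the default `epsilon`.

```python
import random
from typing import Any, Callable, Optional


def select_action(q_function: Callable[[Any, Any, Any], float],
                  state: Any,
                  goal: Any,
                  sample_action: Callable[[random.Random], Any],
                  epsilon: float = 0.1,
                  num_candidates: int = 20,
                  rng: Optional[random.Random] = None) -> Any:
    """Pick an action (a) for state (s) and goal (g) with the epsilon-greedy rule.

    q_function    -- returns the predicted Q(s, a, g) for one candidate action
    sample_action -- hypothetical generator of one candidate action
    epsilon       -- probability of returning a randomly chosen candidate
    """
    rng = rng or random.Random()
    candidates = [sample_action(rng) for _ in range(num_candidates)]
    if rng.random() < epsilon:
        # Exploration: an action chosen at random from the candidates.
        return rng.choice(candidates)
    # Exploitation: the candidate with the maximum predicted Q value.
    return max(candidates, key=lambda a: q_function(state, a, goal))
```

Calling the function with `epsilon=0.0` always returns the highest-scoring candidate, which matches the behavior described for the trained model after reinforcement learning processing has ended.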

更に、調整部７２５は、特定した把持機構部１１１の動作（ａ）を示す情報と、把持機構部１１１の動作前の状態（ｓ）を示す情報とに基づいて、把持機構部１１１の動作後の状態を示す情報を導出し、駆動制御装置１１５に送信する。 Further, the adjustment unit 725 adjusts the state after the operation of the gripping mechanism 111 based on the information indicating the specified operation (a) of the gripping mechanism 111 and the information indicating the state (s) before the operation of the gripping mechanism 111. The information indicating the state of is derived and transmitted to the drive control device 115.

このように、第２の実施形態に係る強化学習装置１２０では、ε－グリーディ法を用いることで、把持機構部１１１の動作後の状態を示す情報として、様々な情報を網羅的に出力することができる。この結果、強化学習処理中の把持機構部１１１の動作には、可能な動作の集合の中から選択された最適な動作（Ｑ値が最大となる動作）と、可能な動作の集合の中からランダムに選択された動作とが含まれることになる。 In this way, by using the ε-greedy method, the reinforcement learning device 120 according to the second embodiment can output a wide variety of information as the information indicating the state of the gripping mechanism unit 111 after operation. As a result, the operations of the gripping mechanism unit 111 during reinforcement learning processing include both the optimal operation selected from the set of possible operations (the operation with the maximum Q value) and operations selected at random from that set.

Note that the functional configuration shown in FIG. 7 is merely one example of a configuration of the reinforcement learning model 720 that implements these functions, and the reinforcement learning model 720 may be configured with other functional configurations. For example, in the above description, the image analysis unit 721, the state and action input unit 722, and the expected value calculation unit 724 are each configured using a neural network, but the reinforcement learning model 720 as a whole may be configured using a neural network.

The above description concerns the functions during the reinforcement learning process; the functions after the reinforcement learning process has finished are the same as in the first embodiment. That is, after the reinforcement learning process has finished, the update unit 710 does not acquire information indicating a change in state, calculate a reward, update model parameters, or the like. In addition, the adjustment unit 725 outputs the optimal information (information indicating the state of the gripping mechanism unit 111 after the action, derived from the information indicating the action (a) with the maximum Q value) as the information indicating the state of the gripping mechanism unit 111 after the action. With the reinforcement-learned model, a behavior rule that maximizes the expected value (Q value) of the discounted cumulative reward is thus obtained.

<Flow of reinforcement learning processing>
Next, the flow of the reinforcement learning process performed by the reinforcement learning device 120 according to the second embodiment will be described. FIG. 8 is a second flowchart showing the flow of the reinforcement learning process. The flow is described below with reference to FIG. 8. Note that the reinforcement learning process shown in FIG. 8 is merely an example, and a reinforcement-learned model may be generated by executing the reinforcement learning process with another model generation method.

In step S801, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a target object image.

In step S802, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a photographed image.

In step S803, the reinforcement learning model 720 of the reinforcement learning device 120 acquires information indicating the (current) state (s) of the gripping mechanism unit 111 before the action.

Steps S804 to S807 specify information indicating one action (a) from the set of possible actions based on, for example, the ε-greedy method, and output a wide variety of information indicating the state of the gripping mechanism unit 111 after the action.

Specifically, when the information indicating the action (a) corresponding to the optimal Q value is to be specified from the set of possible actions, steps S804 to S806 are executed and the process then proceeds to step S807. When information indicating a randomly selected action (a) is to be specified from the set of possible actions, the process proceeds directly to step S807.

In step S804, the reinforcement learning model 720 of the reinforcement learning device 120 calculates a Q value.

In step S805, the reinforcement learning model 720 of the reinforcement learning device 120 determines whether the Q value has been calculated the predetermined number of times. If it is determined in step S805 that the Q value has not been calculated the predetermined number of times (No in step S805), the process proceeds to step S806.

In step S806, the reinforcement learning model 720 of the reinforcement learning device 120 adjusts the information indicating the action (a) of the gripping mechanism unit 111 and returns to step S804.

On the other hand, if it is determined in step S805 that the Q value has been calculated the predetermined number of times (Yes in step S805), the process proceeds to step S807.

In step S807, if steps S804 to S806 have been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies the information indicating the action (a) corresponding to the maximum Q value, derives information indicating the state of the gripping mechanism unit 111 after the action, and transmits it to the drive control device 115. If steps S804 to S806 have not been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies information indicating a randomly selected action (a), derives information indicating the state of the gripping mechanism unit 111 after the action, and transmits it to the drive control device 115.

In step S808, the update unit 710 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the action of the gripping mechanism unit 111.

In step S809, the update unit 710 of the reinforcement learning device 120 acquires the result of determining whether the gripping operation on the object to be gripped succeeded, and calculates an immediate reward. The update unit 710 of the reinforcement learning device 120 also acquires the predicted value of the expected value (Q value) of the discounted cumulative reward calculated by the expected value calculation unit 724.

In step S810, the update unit 710 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 720 using the acquired information indicating the change in state, the calculated immediate reward, and the acquired predicted value of the expected value (Q value) of the discounted cumulative reward.
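Steps S809 and S810 can be illustrated with a small Python sketch. It rests on several assumptions that the text does not state: a binary reward of 1 for a successful grasp and 0 otherwise, a standard Q-learning temporal-difference target, and a linear Q function standing in for the neural network of the reinforcement learning model 720. The names `immediate_reward`, `td_target`, and `td_update` are hypothetical.

```python
from typing import List, Sequence


def immediate_reward(grasp_succeeded: bool) -> float:
    """Assumed binary reward computed from the grasp success determination."""
    return 1.0 if grasp_succeeded else 0.0


def td_target(reward: float, next_q_max: float, gamma: float, terminal: bool) -> float:
    """Q-learning target: r for a terminal step, otherwise r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * next_q_max


def td_update(
    weights: Sequence[float],
    features: Sequence[float],
    target: float,
    learning_rate: float,
) -> List[float]:
    """One gradient step on 0.5 * (target - Q)^2 for a linear Q = w . x."""
    predicted_q = sum(w * x for w, x in zip(weights, features))
    error = target - predicted_q
    return [w + learning_rate * error * x for w, x in zip(weights, features)]
```

Here `predicted_q` corresponds to the predicted Q value obtained from the expected value calculation unit 724, and `features` is a placeholder for whatever representation of the state change and action the real model uses.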

In step S811, the reinforcement learning device 120 determines whether to switch from the current target object image to a different target object image.

If it is determined in step S811 not to switch to a different target object image (No in step S811), the process returns to step S802.

On the other hand, if it is determined in step S811 to switch to a different target object image (Yes in step S811), the process proceeds to step S812.

In step S812, the update unit 710 of the reinforcement learning device 120 determines whether the termination condition for the reinforcement learning process is satisfied. The termination condition for the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100; one example is a target success probability for the gripping operation on a predetermined object.
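One way to express a termination condition based on a target success probability is sketched below in Python. The sliding window over the most recent grasp attempts and the class name `SuccessRateMonitor` are assumptions for illustration; the text only states that a target success probability may serve as the condition.

```python
from collections import deque


class SuccessRateMonitor:
    """Tracks recent grasp outcomes and reports when a target rate is reached."""

    def __init__(self, target_rate: float, window: int) -> None:
        self.target_rate = target_rate
        # Only the most recent `window` outcomes are kept.
        self.results = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_stop(self) -> bool:
        # Wait until the window is full so the estimate is not based on few samples.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) >= self.target_rate
```

A caller would invoke `record` after each grasp attempt and consult `should_stop` at step S812.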

If it is determined in step S812 that the termination condition for the reinforcement learning process is not satisfied (No in step S812), the process returns to step S801.

On the other hand, if it is determined in step S812 that the termination condition for the reinforcement learning process is satisfied (Yes in step S812), the reinforcement learning process ends. The reinforcement learning model 720 obtained when the reinforcement learning process ends is applied to the object manipulation device as a reinforcement-learned model.
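The control flow of steps S801 to S812 can be summarized as two nested loops. The Python sketch below shows only that structure; the `TrainingHooks` protocol and its method names are hypothetical placeholders for the processing of each step, not an interface defined in this disclosure.

```python
from typing import Any, Protocol


class TrainingHooks(Protocol):
    """Hypothetical callbacks, one per step (or group of steps) in FIG. 8."""
    def acquire_target_image(self) -> Any: ...                           # S801
    def acquire_camera_image(self) -> Any: ...                           # S802
    def acquire_gripper_state(self) -> Any: ...                          # S803
    def select_and_send_action(self, target: Any, image: Any, state: Any) -> None: ...  # S804-S807
    def observe_transition(self) -> Any: ...                             # S808
    def compute_reward(self) -> float: ...                               # S809
    def update_model(self, transition: Any, reward: float) -> None: ...  # S810
    def should_switch_target(self) -> bool: ...                          # S811
    def should_terminate(self) -> bool: ...                              # S812


def run_reinforcement_learning(hooks: TrainingHooks) -> int:
    """Run the loop of FIG. 8 and return the number of model updates performed."""
    updates = 0
    while True:
        target = hooks.acquire_target_image()            # S801
        while True:
            image = hooks.acquire_camera_image()         # S802
            state = hooks.acquire_gripper_state()        # S803
            hooks.select_and_send_action(target, image, state)
            transition = hooks.observe_transition()      # S808
            reward = hooks.compute_reward()              # S809
            hooks.update_model(transition, reward)       # S810
            updates += 1
            if hooks.should_switch_target():             # S811: Yes -> S812
                break
        if hooks.should_terminate():                     # S812: Yes -> end
            return updates
```

The inner loop corresponds to returning to step S802 while the target object image is unchanged, and the outer loop to returning to step S801 while the termination condition is not yet satisfied.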

The reinforcement-learned model applied to the object manipulation device executes steps S801 to S807 in FIG. 8 (that is, it does not acquire information indicating a change in state, calculate a reward, update model parameters, or the like). In step S807, the model is configured to output the optimal information as the information indicating the state of the gripping mechanism unit 111 after the action. In other words, unlike during the reinforcement learning process, the gripping mechanism unit 111 does not perform a wide variety of actions but instead performs the optimal action (the action with the maximum Q value) selected from the set of possible actions.

<Summary>
As is clear from the above description, the reinforcement learning system 100 according to the second embodiment provides the same effects as the first embodiment.

[Third embodiment]
In the first and second embodiments described above, a case has been described in which a gripping operation is performed on an object of a specified type. However, the predetermined operation performed on the object of the specified type is not limited to a gripping operation and may be any other operation. That is, the end effector attached to the tip portion of the main body 113 of the manipulator 110 is not limited to the gripping mechanism unit 111 and may be any other operating mechanism unit. The arbitrary operations referred to here include, for example, a pressing operation that pushes an object of the specified type, a suction operation that holds an object of the specified type by suction, and an attraction operation that attracts an object of the specified type with an electromagnet or the like.

In the first and second embodiments described above, the imaging device is attached to the tip portion of the manipulator, but the mounting position of the imaging device is not limited to the tip portion of the manipulator. Any other position may be used as long as the position and orientation of the imaging device change in accordance with changes in the position and orientation of the gripping mechanism unit.

Note that the gripping mechanism unit and the imaging device may, for example, be attached to different manipulators, and the reinforcement learning model described above is applicable in that case as well. The reinforcement learning model in this case may be configured to output, in addition to information for controlling the action of the gripping mechanism unit, information for controlling at least one of the position and orientation of the imaging device.

In the first and second embodiments described above, the information indicating the state of the gripping mechanism unit before the action, which is input to the reinforcement learning model, has been described as including information indicating the position and orientation of the gripping mechanism unit and information indicating whether the gripping mechanism unit is open or closed. However, the information indicating the state of the gripping mechanism unit before the action is not limited to these, and other information may be input.

In the first and second embodiments described above, the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) are configured as separate bodies, but the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) may be configured as a single unit. Alternatively, the drive control device 115 and the reinforcement learning device 120 (or the object manipulation device) may be configured as a single unit.

In the first and second embodiments described above, the reinforcement learning process has been described as being performed by actually controlling the action of the gripping mechanism unit 111 based on the information, output from the reinforcement learning device 120, indicating the state of the gripping mechanism unit 111 after the action. However, it is not necessary to actually control the action of the gripping mechanism unit 111, and the reinforcement learning process may be performed using a simulator that simulates the real environment. In this case, the imaging device may likewise be configured to change its position and orientation and to capture images on the simulator that simulates the real environment. The predetermined operation on the object to be operated and the generation of the operation result may also be performed on the simulator that simulates the real environment.
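To make the simulator variant concrete, the following Python sketch shows a deliberately tiny, one-dimensional stand-in for such a simulator. Everything in it is an assumption for illustration: the class `SimulatedGraspEnv`, the scalar "image" (the object offset seen from a camera that moves with the gripper), and the distance tolerance used to decide grasp success. A realistic simulator would render images and model contact physics.

```python
from dataclasses import dataclass


@dataclass
class SimulatedGraspEnv:
    """Toy 1-D environment: a gripper with an attached camera and one object."""
    object_x: float
    gripper_x: float = 0.0
    tolerance: float = 0.05

    def capture_image(self) -> float:
        # The camera moves with the gripper, so the observation is the
        # object position expressed in the camera (gripper) frame.
        return self.object_x - self.gripper_x

    def move(self, dx: float) -> None:
        # Changing the gripper position also changes the camera viewpoint.
        self.gripper_x += dx

    def attempt_grasp(self) -> bool:
        # The operation result is generated inside the simulator.
        return abs(self.object_x - self.gripper_x) <= self.tolerance
```

For example, with `object_x=0.5` the first observation is `0.5`; after `move(0.5)` the observation becomes `0.0` and `attempt_grasp()` succeeds.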

In the first and second embodiments described above, the reinforcement learning device 120 has been described as performing the reinforcement learning process for the case where an end effector is attached to the tip portion of the main body 113 of the manipulator 110. However, the reinforcement learning device 120 may also perform the reinforcement learning process for the case where the manipulator 110, with no end effector attached to its tip portion, operates the object to be operated using the main body 113. In this case, the reinforcement learning device 120 may output information for controlling the action of the tip portion of the main body 113 of the manipulator 110.

In the first and second embodiments described above, the tip portion of the main body 113 of the manipulator 110 has been described as being configured so that its position and orientation change, but it may be configured so that at least one of its position and orientation changes. That is, the gripping mechanism unit 111 may be configured so that at least one of its position and orientation changes. The imaging device 112 may also be configured so that at least one of its position and orientation changes as at least one of the position and orientation of the gripping mechanism unit 111 changes. In this case, the reinforcement learning device 120 may output, as the information for controlling the action of the gripping mechanism unit 111, information for controlling at least one of the position and orientation of the gripping mechanism unit 111 and information for controlling the opening and closing of the gripping mechanism unit 111.

[Other embodiments]
In this specification (including the claims), when the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used, it includes any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include multiple instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. Furthermore, it also includes adding elements other than the listed elements (a, b, and c), such as having d as in a-b-c-d.

In this specification (including the claims), when expressions such as "with data as input", "based on data", "according to data", or "in response to data" (including similar expressions) are used, unless otherwise specified, they include cases where the data itself is used as input and cases where data that has undergone some processing (for example, data with added noise, normalized data, or an intermediate representation of the data) is used as input. When it is stated that some result is obtained "based on", "according to", or "in response to" data, this includes cases where the result is obtained based only on that data, and may also include cases where the result is obtained under the influence of other data, factors, conditions, and/or states in addition to that data. When it is stated that "data is output", unless otherwise specified, this includes cases where the data itself is used as output and cases where data that has undergone some processing (for example, data with added noise, normalized data, or an intermediate representation of the data) is used as output.

In this specification (including the claims), when the terms "connected" and "coupled" are used, they are intended as non-limiting terms that include any of direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted as appropriate according to the context in which they are used, but forms of connection/coupling that are not intentionally or naturally excluded should be construed non-restrictively as being included in the terms.

In this specification (including the claims), when the expression "A configured to B" is used, it may include that the physical structure of element A has a configuration capable of performing operation B, and that a permanent or temporary setting/configuration of element A is configured/set to actually perform operation B. For example, when element A is a general-purpose processor, it is sufficient that the processor has a hardware configuration capable of performing operation B and is configured to actually perform operation B by a permanent or temporary program (instruction) setting. When element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it is sufficient that the circuit structure of the processor is implemented so as to actually perform operation B, regardless of whether control instructions and data are actually attached.

In this specification (including the claims), when terms meaning inclusion or possession (for example, "comprising", "including", and "having") are used, they are intended as open-ended terms, including the case of containing or possessing something other than the object indicated by the object of the term. When the object of such a term is an expression that does not specify a quantity or that suggests the singular (an expression with the article "a" or "an"), the expression should be interpreted as not being limited to a specific number.

In this specification (including the claims), even if an expression such as "one or more" or "at least one" is used in one place and an expression that does not specify a quantity or that suggests the singular (an expression with the article "a" or "an") is used in another place, the latter expression is not intended to mean "one". In general, expressions that do not specify a quantity or that suggest the singular (expressions with the article "a" or "an") should be interpreted as not necessarily being limited to a specific number.

In this specification, when it is stated that a specific effect (advantage/result) is obtained with a specific configuration of a certain embodiment, it should be understood that, unless there is a reason to the contrary, the effect is also obtained in one or more other embodiments having that configuration. However, it should be understood that the presence or absence of the effect generally depends on various factors, conditions, and/or states, and that the configuration does not always produce the effect. The effect is merely obtained by the configuration described in the embodiments when various factors, conditions, and/or states are satisfied, and is not necessarily obtained in the claimed invention that defines the configuration or a similar configuration.

In this specification (including the claims), when a plurality of pieces of hardware perform predetermined processing, the pieces of hardware may cooperate to perform the predetermined processing, or some of the hardware may perform all of the predetermined processing. Alternatively, some hardware may perform part of the predetermined processing and other hardware may perform the rest. In this specification (including the claims), when an expression such as "one or more pieces of hardware perform a first process, and the one or more pieces of hardware perform a second process" is used, the hardware that performs the first process and the hardware that performs the second process may be the same or different. In other words, it is sufficient that the hardware that performs the first process and the hardware that performs the second process are included in the one or more pieces of hardware. Note that the hardware may include an electronic circuit, a device including an electronic circuit, or the like.

In this specification (including the claims), when a plurality of storage devices (memories) store data, each individual storage device (memory) among the plurality of storage devices (memories) may store only part of the data or may store all of the data.

Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the individual embodiments described above. Various additions, changes, substitutions, partial deletions, and the like are possible without departing from the conceptual idea and gist of the present invention derived from the content defined in the claims and equivalents thereof. For example, in all of the embodiments described above, the numerical values used in the description are given as examples and are not limiting. Likewise, the order of the operations in the embodiments is given as an example and is not limiting.

100: Reinforcement learning system
110: Manipulator
111: Gripping mechanism unit
112: Imaging device
113: Main body
115: Drive control device
120: Reinforcement learning device
310: Update unit
311: Reward calculation unit
320: State input unit
330: Reinforcement learning model
510: Target object image
511: Object
521, 522: Photographed images
611, 612: Photographed images
710: Update unit
711: Reward calculation unit
712: Parameter update unit
720: Reinforcement learning model
721: Image analysis unit
722: State and action input unit
723: Addition unit
724: Expected value calculation unit
725: Adjustment unit

Claims

at least one memory;
at least one processor;
The at least one processor includes:
inputting information regarding a photographed image captured by an imaging device, at least one of whose position and orientation changes, and information regarding a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector;
and updating the parameters of the training model based on the operation result on the object when the operation of the end effector is controlled based on the information output by the training model.
Reinforcement learning device.

The at least one of the position and orientation of the imaging device changes depending on at least one of the position and orientation of the end effector.
The reinforcement learning device according to claim 1.

the imaging device is attached to the end effector;
The reinforcement learning device according to claim 2.

at least one of the position and orientation of the imaging device is controlled based on an output from the training model;
The reinforcement learning device according to claim 1.

The end effector is a gripping mechanism that grips the object,
5. The at least one processor updates parameters of the training model based on a determination result of whether or not the gripping mechanism unit has successfully gripped the object. Reinforcement learning device described.

5. The at least one processor inputs into the training model information regarding at least one of the position and orientation of the gripping mechanism before operation, and information regarding opening and closing of the gripping mechanism before operation. Reinforcement learning device described in.

7. The reinforcement learning device according to claim 6, wherein the training model outputs information regarding at least one of the position and orientation of the gripping mechanism after the operation, and information regarding opening/closing of the gripping mechanism after the operation.

The reinforcement learning device according to any one of claims 1 to 7, wherein a predetermined operation on the object by the end effector and a change in at least one of the position and orientation of the imaging device are executed on a simulator.

A reinforcement learning device according to any one of claims 1 to 8,
a manipulator to which the end effector and the imaging device are attached;
A reinforcement learning system with

at least one memory that stores a training model whose parameters have been updated by the reinforcement learning device according to any one of claims 1 to 8;
at least one processor;
The at least one processor includes:
inputting, into the training model, information regarding a photographed image captured by an imaging device whose position or orientation changes, and information regarding a target object image indicating an object to be operated by an end effector; and
controlling the operation of the end effector based on information output by the training model;
configured to be executable,
Object manipulation device.

the end effector;
the imaging device;
The object manipulation device according to claim 10, further comprising:

A model generation method executed by at least one processor, the method comprising:
a step of inputting information regarding a photographed image captured by an imaging device, at least one of whose position and orientation changes, and information regarding a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and
a step of updating parameters of the training model based on an operation result on the object when the operation of the end effector is controlled based on information output by the training model.

A reinforcement learning program for causing at least one computer to execute:
a step of inputting information regarding a photographed image captured by an imaging device, at least one of whose position and orientation changes, and information regarding a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and
a step of updating the parameters of the training model based on the operation result on the object when the operation of the end effector is controlled based on the information output by the training model.