JP2022122670A - Robot model learning device, robot model machine learning method, robot model machine learning program, robot control device, robot control method, and robot control program

Info

Publication number: JP2022122670A
Application number: JP2021020049A
Authority: JP
Inventors: 政志 濱屋 (Masaya Hamaya); 一敏 田中 (Kazutoshi Tanaka)
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 2021-02-10
Filing date: 2021-02-10
Publication date: 2022-08-23
Also published as: EP4292778A1; US20240083023A1; CN116867620A; WO2022172812A1

Abstract

PROBLEM TO BE SOLVED: To enable a robot model to be learned efficiently by machine learning.

SOLUTION: A robot control device 40 acquires an actual value of the position and orientation of a robot 10 and an actual value of an external force applied to the robot 10; executes a robot model LM including a state transition model that calculates a predicted value of the position and orientation of the robot 10, based on the actual value of the position and orientation at a given time and an action command that can be given to the robot 10, and an external force model that calculates a predicted value of the external force applied to the robot 10; calculates a reward based on an error in the position and orientation and on the predicted value of the external force; generates, in each control cycle, a plurality of action command candidates and gives them to the robot model LM; determines the action command that maximizes the reward based on the rewards calculated for the respective candidates; and updates the external force model so as to minimize the error between the predicted value of the external force calculated by the external force model for the determined action command and the corresponding actual value of the external force.

SELECTED DRAWING: Figure 1

Description

The disclosed technology relates to a robot model learning device, a robot model machine learning method, a robot model machine learning program, a robot control device, a robot control method, and a robot control program.

Robot models are trained by machine learning so that the control law a robot needs to accomplish a task can be acquired automatically.

For example, Patent Literature 1 discloses a control device for an industrial robot that has a function of detecting the force and moment applied to its manipulator. The control device comprises a control unit that controls the industrial robot based on control commands; a data acquisition unit that acquires, as acquired data, at least one of the force and the moment applied to the manipulator of the industrial robot; and a preprocessing unit that generates, as state data based on the acquired data, force state data including information on the force applied to the manipulator and control command adjustment data indicating an adjustment action for the control commands of the manipulator. Based on the state data, the control device executes machine learning processing on the adjustment action for the control commands of the manipulator.

Patent Literature 1: JP 2020-055095 A

However, when a robot model is trained by machine learning, setting the parameters and designing the reward function are difficult, which makes efficient learning hard to achieve.

The disclosed technology has been made in view of the above points. Its object is to provide a robot model learning device, a robot model machine learning method, a robot model machine learning program, a robot control device, a robot control method, and a robot control program that enable efficient learning when a robot model is trained by machine learning.

A first aspect of the disclosure is a robot model learning device comprising: an acquisition unit that acquires an actual value of the position and orientation of a robot and an actual value of an external force applied to the robot; a robot model including a state transition model that calculates, based on the actual value of the position and orientation at a given time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model that calculates a predicted value of the external force applied to the robot; a model execution unit that executes the robot model; a reward calculation unit that calculates a reward based on the error between the predicted value of the position and orientation and a target value of the position and orientation to be reached, and on the predicted value of the external force; an action determination unit that, in each control cycle, generates a plurality of candidates for the action command, gives them to the robot model, and determines the action command that maximizes the reward based on the rewards the reward calculation unit calculates for the respective candidates; and an external force model updating unit that updates the external force model so as to reduce the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value.
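
As a minimal sketch of the per-cycle action determination described in this aspect, the following Python code samples candidate action commands at random, scores each with stand-in models, and keeps the candidate with the highest reward. The function names, the additive transition model, the single-gain force model, the reward weight, and the uniform sampling scheme are all assumptions made for illustration; the disclosure does not specify them here.

```python
import random

def transition_model(pose, action):
    # Hypothetical stand-in for the state transition model:
    # the predicted next pose is the current pose plus the action.
    return [p + a for p, a in zip(pose, action)]

def external_force_model(pose, action, gain):
    # Hypothetical stand-in for the external force model:
    # a linear map of the action with a single learnable gain.
    return [gain * a for a in action]

def reward(pred_pose, target_pose, pred_force, force_weight=0.1):
    # Negative squared pose error, further reduced by the
    # magnitude of the predicted external force (assumed form).
    err = sum((p - t) ** 2 for p, t in zip(pred_pose, target_pose))
    force = sum(f ** 2 for f in pred_force) ** 0.5
    return -err - force_weight * force

def decide_action(pose, target_pose, gain, n_candidates=200, rng=None):
    # Generate candidates, evaluate each on the robot model,
    # and return the candidate that maximizes the reward.
    rng = rng or random.Random(0)
    best_action, best_reward = None, float("-inf")
    for _ in range(n_candidates):
        action = [rng.uniform(-1.0, 1.0) for _ in pose]
        r = reward(transition_model(pose, action), target_pose,
                   external_force_model(pose, action, gain))
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action, best_reward
```

In a learning loop built on this sketch, the selected action would be sent to the robot, and the external force measured afterward would be compared with the prediction to update the external force model.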

The first aspect may further comprise a state transition model updating unit that updates the state transition model so as to reduce the error between the predicted value of the position and orientation calculated by the state transition model based on the determined action command and the actual value of the position and orientation corresponding to that predicted value.

In the first aspect, when the external force is a corrected external force, that is, an external force that suppresses expansion of the error, the reward calculation unit may calculate the reward by a calculation in which the predicted value of the corrected external force is a factor that decreases the reward.

In the first aspect, when the external force is a hostile external force, that is, an external force that suppresses reduction of the error, the reward calculation unit may calculate the reward by a calculation in which the predicted value of the hostile external force is a factor that increases the reward.

In the first aspect, the reward calculation unit may calculate the reward by a calculation in which the predicted value of the corrected external force is a factor that decreases the reward when the external force is a corrected external force that suppresses expansion of the error, and by a calculation in which the predicted value of the hostile external force is a factor that increases the reward when the external force is a hostile external force that suppresses reduction of the error.
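
The reward rule above can be sketched as follows. The function name, the string labels, the linear form, and the weight values are assumptions for illustration only; the text states only that the corrected external force lowers the reward and the hostile external force raises it.

```python
def compute_reward(pose_error, predicted_force, force_kind,
                   corrected_weight=0.1, hostile_weight=0.1):
    """Assumed linear reward.

    pose_error and predicted_force are non-negative magnitudes.
    force_kind is "corrected", "hostile", or None (no external force).
    """
    base = -pose_error  # a smaller error gives a larger reward
    if force_kind == "corrected":
        # A predicted corrected external force decreases the reward.
        return base - corrected_weight * predicted_force
    if force_kind == "hostile":
        # A predicted hostile external force increases the reward.
        return base + hostile_weight * predicted_force
    return base
```

Under this form, an action that is expected to need a large corrective push is penalized, while an action that still reaches the target under an expected hostile push is favored.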

In the first aspect, the reward calculation unit may calculate the reward by a calculation in which, during task execution, the range of variation of the reward decrease based on the predicted value of the corrected external force is smaller than the range of variation of the reward based on the error, and the range of variation of the reward increase based on the predicted value of the hostile external force is smaller than the range of variation of the reward based on the error.
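
One way to satisfy this range condition, assuming a reward that is linear in the predicted force, is to choose the force weight from the expected spans of the two terms. The helper below is a hypothetical sketch; the name, the `margin` parameter, and the linear-reward assumption are not taken from the disclosure.

```python
def force_weight_for_span(error_reward_span, force_span, margin=0.5):
    """Return a weight w such that w * force_span equals
    margin * error_reward_span, which is smaller than error_reward_span.

    error_reward_span: expected variation of the error-based reward
                       over one task execution (assumed known).
    force_span:        expected variation of the predicted force.
    margin:            fraction in (0, 1) of the error-based span
                       that the force term may occupy.
    """
    if error_reward_span <= 0 or force_span <= 0:
        raise ValueError("spans must be positive")
    if not 0 < margin < 1:
        raise ValueError("margin must lie strictly between 0 and 1")
    return margin * error_reward_span / force_span
```

For example, with an error-based span of 4.0 and a force span of 20.0, the default margin gives a weight of 0.1, so the force term varies by at most 2.0.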

In the first aspect, the external force model may include a corrected external force model that outputs a predicted value of the corrected external force when the external force is the corrected external force, and a hostile external force model that outputs a predicted value of the hostile external force when the external force is the hostile external force. The external force model updating unit may include a corrected external force model updating unit that, when the external force is the corrected external force, updates the corrected external force model so as to reduce the difference between the predicted value of the corrected external force calculated by the corrected external force model based on the determined action command and the actual value of the external force, and a hostile external force model updating unit that, when the external force is the hostile external force, updates the hostile external force model so as to reduce the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force.

In the first aspect, the robot model may include an integrated external force model comprising the corrected external force model and the hostile external force model, both of which are neural networks. At least one layer among the one or more intermediate layers and the output layer of the hostile external force model integrates, by the progressive neural network technique, the output of the layer preceding the corresponding layer of the corrected external force model. The hostile external force model outputs a predicted value of the external force and identification information indicating whether that external force is a corrected external force or a hostile external force, and the integrated external force model uses the output of the hostile external force model as its own output. The reward calculation unit may calculate the reward by a calculation in which the predicted value of the external force is a factor that decreases the reward when the identification information indicates a corrected external force, and a factor that increases the reward when the identification information indicates a hostile external force.
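
The layer integration described above can be illustrated with a small forward pass. In this sketch each model has one hidden layer, and the output layer of the hostile column adds a lateral contribution computed from the hidden layer of the corrected column. The layer sizes, the tanh activation, the parameter dictionary layout, and the use of a sign test on the second output as the identification information are assumptions for illustration.

```python
import math

def dense(weights, bias, x):
    """Affine layer: weights is a list of rows, returns W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def tanh_layer(weights, bias, x):
    return [math.tanh(v) for v in dense(weights, bias, x)]

def integrated_force_model(x, corrected, hostile, lateral_out):
    # Hidden layer of the corrected external force model (first column).
    h_corrected = tanh_layer(corrected["W1"], corrected["b1"], x)
    # Hidden layer of the hostile external force model (second column).
    h_hostile = tanh_layer(hostile["W1"], hostile["b1"], x)
    # Output layer of the hostile column: its own input plus a lateral
    # connection from the preceding layer of the corrected column.
    own = dense(hostile["W2"], hostile["b2"], h_hostile)
    lateral = dense(lateral_out, [0.0] * len(own), h_corrected)
    force, logit = (o + l for o, l in zip(own, lateral))
    # Assumed convention: a positive logit identifies a hostile force.
    kind = "hostile" if logit > 0.0 else "corrected"
    return force, kind
```

The integrated model returns the output of the hostile column, consisting of the force prediction and the identification, which matches the statement that the integrated external force model uses the output of the hostile external force model as its own output.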

The first aspect may further comprise a reception unit that accepts a designation of whether the external force is the corrected external force or the hostile external force, and a learning control unit that enables operation of the corrected external force model updating unit when the designation is the corrected external force and enables operation of the hostile external force model updating unit when the designation is the hostile external force.

The first aspect may further comprise a learning control unit that determines, based on the actual value of the position and orientation and the actual value of the external force, whether the external force is the corrected external force or the hostile external force, enables operation of the corrected external force model updating unit when the result of the determination is the corrected external force, and enables operation of the hostile external force model updating unit when the result of the determination is the hostile external force.
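
The text does not state how the determination is made. One plausible rule, shown here purely as an assumption, is to check whether the measured force points toward or away from the target position: a force aligned with the direction to the target counters error expansion (corrected), and a force opposed to it counters error reduction (hostile). The function name, the use of the target position, and the threshold are hypothetical.

```python
def classify_external_force(position, target, force, threshold=1e-6):
    """Return "corrected", "hostile", or None when the force is
    negligible or perpendicular to the direction of the target."""
    to_target = [t - p for p, t in zip(position, target)]
    # Dot product of the measured force with the direction to the target.
    alignment = sum(d * f for d, f in zip(to_target, force))
    if alignment > threshold:
        return "corrected"
    if alignment < -threshold:
        return "hostile"
    return None
```

A learning control unit built on this rule would enable the corrected external force model updating unit for the first label and the hostile external force model updating unit for the second, and skip the update otherwise.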

A second aspect of the disclosure is a machine learning method for a robot model, comprising: preparing a robot model including a state transition model that calculates, based on the actual value of the position and orientation of a robot at a given time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model that calculates a predicted value of the external force applied to the robot; acquiring, in each control cycle, the actual value of the position and orientation and the actual value of the external force applied to the robot; in each control cycle, generating a plurality of candidates for the action command, giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, each reward being based on the error between the predicted value of the position and orientation calculated by the state transition model for that candidate and a target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the external force model for that candidate; and updating the external force model so as to reduce the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value.
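
The final updating step can be sketched as a gradient step on the squared difference between the predicted and the actual external force. The linear model, the feature vector, the function name, and the learning rate are assumptions for illustration; the disclosure does not fix the model class or the optimizer at this point.

```python
def update_force_model(weights, features, actual_force, lr=0.1):
    """One gradient step for a hypothetical linear external force model.

    The prediction is the dot product of weights and features, where
    features would be built from the pose and the determined action.
    Returns the updated weights and the residual before the update.
    """
    predicted = sum(w * x for w, x in zip(weights, features))
    residual = predicted - actual_force
    # Gradient of 0.5 * residual**2 with respect to each weight is
    # residual * feature, so step against it.
    new_weights = [w - lr * residual * x for w, x in zip(weights, features)]
    return new_weights, residual
```

Calling this once per control cycle with the measured force shrinks the residual for a repeated input, which is the behavior the updating step requires.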

In the second aspect, the state transition model may further be updated so as to reduce the error between the predicted value of the position and orientation calculated by the state transition model based on the determined action command and the actual value of the position and orientation corresponding to that predicted value.

In the second aspect, when the external force is a corrected external force, that is, an external force that suppresses expansion of the error, the reward may be calculated by a calculation in which the predicted value of the corrected external force is a factor that decreases the reward.

In the second aspect, when the external force is a hostile external force, that is, an external force that suppresses reduction of the error, the reward may be calculated by a calculation in which the predicted value of the hostile external force is a factor that increases the reward.

In the second aspect, the reward may be calculated by a calculation in which the predicted value of the corrected external force is a factor that decreases the reward when the external force is a corrected external force that suppresses expansion of the error, and by a calculation in which the predicted value of the hostile external force is a factor that increases the reward when the external force is a hostile external force that suppresses reduction of the error.

In the second aspect, the external force model may include a corrected external force model that outputs a predicted value of the corrected external force when the external force is the corrected external force, and a hostile external force model that outputs a predicted value of the hostile external force when the external force is the hostile external force. When the external force is the corrected external force, the corrected external force model may be updated so as to reduce the difference between the predicted value of the corrected external force calculated by the corrected external force model based on the determined action command and the actual value of the external force. When the external force is the hostile external force, the hostile external force model may be updated so as to reduce the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force.

In the second aspect, the corrected external force may be applied to the robot when the error is increasing, and the hostile external force may be applied to the robot when the error is decreasing.
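
The rule for choosing which external force to apply during learning can be sketched by comparing the error in consecutive control cycles. The function name, the labels, and the `tolerance` parameter are assumptions for illustration.

```python
def choose_applied_force(previous_error, current_error, tolerance=0.0):
    """Decide which kind of external force to apply to the robot.

    Returns "corrected" while the error is growing, "hostile" while it
    is shrinking, and None when it is unchanged within the tolerance.
    """
    if current_error > previous_error + tolerance:
        return "corrected"
    if current_error < previous_error - tolerance:
        return "hostile"
    return None
```

For example, an error that rises from 0.20 to 0.25 calls for a corrected external force, and an error that falls from 0.25 to 0.20 calls for a hostile external force.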

A third aspect of the disclosure is a machine learning program for machine learning of a robot model that includes a state transition model that calculates, based on the actual value of the position and orientation of a robot at a given time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model that calculates a predicted value of the external force applied to the robot. The program causes a computer to execute processing comprising: acquiring, in each control cycle, the actual value of the position and orientation and the actual value of the external force applied to the robot; in each control cycle, generating a plurality of candidates for the action command, giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, each reward being based on the error between the predicted value of the position and orientation calculated by the state transition model for that candidate and a target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the external force model for that candidate; and updating the external force model so as to reduce the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value.

A fourth aspect of the disclosure is a robot control device comprising: a model execution unit that executes a robot model including a state transition model that calculates, based on the actual value of the position and orientation of a robot at a given time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model that calculates a predicted value of the external force applied to the robot; an acquisition unit that acquires the actual value of the position and orientation of the robot and the actual value of the external force applied to the robot; a reward calculation unit that calculates a reward based on the error between the predicted value of the position and orientation calculated by the robot model and a target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the robot model; and an action determination unit that, in each control cycle, generates a plurality of candidates for the action command, gives them to the robot model, and determines the action command that maximizes the reward based on the rewards the reward calculation unit calculates for the respective candidates.

A fifth aspect of the disclosure is a robot control method comprising: preparing a robot model including a state transition model that calculates, based on the actual value of the position and orientation of a robot at a given time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model that calculates a predicted value of the external force applied to the robot; acquiring, in each control cycle, the actual value of the position and orientation and the actual value of the external force applied to the robot; in each control cycle, generating a plurality of candidates for the action command, giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, each reward being based on the error between the predicted value of the position and orientation calculated by the state transition model for that candidate and a target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the external force model for that candidate; and controlling the robot based on the determined action command.

A sixth aspect of the disclosure is a robot control program for controlling a robot using a robot model that includes a state transition model that calculates, based on the actual value of the position and orientation of the robot at a given time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model that calculates a predicted value of the external force applied to the robot. The program causes a computer to execute processing comprising: acquiring, in each control cycle, the actual value of the position and orientation and the actual value of the external force applied to the robot; in each control cycle, generating a plurality of candidates for the action command, giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, each reward being based on the error between the predicted value of the position and orientation calculated by the state transition model for that candidate and a target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the external force model for that candidate; and controlling the robot based on the determined action command.

According to the disclosed technology, a robot model can be learned efficiently when it is trained by machine learning.

Configuration diagram of the robot system according to the first embodiment.
(A) is a diagram showing a schematic configuration of the robot 10, and (B) is an enlarged view of the tip side of the arm of the robot.
Diagram for explaining the corrected external force and the hostile external force.
Block diagram showing the hardware configuration of the robot control device.
Configuration diagram of the robot model.
Configuration diagram of a robot model according to a modification.
Configuration diagram of a robot model according to a modification.
Configuration diagram of the integrated external force model.
Flowchart of the learning process according to the first embodiment.
Flowchart of the learning process according to the second embodiment.
Configuration diagram of the robot system according to the third embodiment.
Flowchart of the learning process according to the third embodiment.
(A) is a graph showing the learning curve for the position error, and (B) is a graph showing the learning curve for the external force.
(A) is a graph showing the number of task successes under different friction, and (B) is a graph showing the number of task successes for different peg masses.
Graph showing the number of task successes for different peg materials.

An example of an embodiment of the disclosed technology is described below with reference to the drawings. In the drawings, identical or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings may be exaggerated for convenience of explanation and may differ from the actual ratios.

<First Embodiment>

FIG. 1 shows the configuration of a robot system 1 according to this embodiment. The robot system 1 has a robot 10, a state observation sensor 20, tactile sensors 30A and 30B, a robot control device 40, a display device 50, and an input device 60.

(Robot)

FIGS. 2(A) and 2(B) are diagrams showing a schematic configuration of the robot 10. The robot 10 in this embodiment is a six-axis vertical articulated robot, and a gripper (hand) 12 is provided at the tip 11a of an arm 11 via a flexible portion 13. The robot 10 performs a fitting task in which it grips a part (for example, a peg) with the gripper 12 and fits it into a hole. The robot 10 is a real robot in this embodiment, but it may be a virtual robot in a simulation.

As shown in FIG. 2(A), the robot 10 has an arm 11 with six degrees of freedom and joints J1 to J6. Each of the joints J1 to J6 connects links to each other so that they can be rotated by a motor (not shown) in the directions of arrows C1 to C6. A vertical articulated robot is used as the example here, but a horizontal articulated robot (SCARA robot) may be used instead. Likewise, although a six-axis robot is used as the example, the robot may be an articulated robot with another number of degrees of freedom, such as five or seven axes, or a parallel link robot.

The gripper 12 has a pair of clamping portions 12a and clamps a part by controlling the clamping portions 12a. The gripper 12 is connected to the tip 11a of the arm 11 via the flexible portion 13 and moves as the arm 11 moves. In this embodiment, the flexible portion 13 consists of three springs 13a to 13c arranged so that the base of each spring lies at a vertex of an equilateral triangle, but any number of springs may be used. The flexible portion 13 may also be any other mechanism that produces a restoring force against positional displacement and thereby provides flexibility. For example, the flexible portion 13 may be an elastic body such as a spring or rubber, a damper, or a pneumatic or hydraulic cylinder. The flexible portion 13 is preferably composed of passive elements. The flexible portion 13 allows the tip 11a of the arm 11 and the gripper 12 to move relative to each other in the horizontal and vertical directions by 5 mm or more, preferably 1 cm or more, and more preferably 2 cm or more.

グリッパ１２がアーム１１に対して柔軟な状態と固定された状態とを切り替えられるような機構を設けてもよい。 A mechanism may be provided to switch the gripper 12 between a flexible state and a fixed state with respect to the arm 11 .

また、ここではアーム１１の先端１１ａとグリッパ１２の間に柔軟部１３を設ける構成を例示したが、グリッパ１２の途中（例えば、指関節の場所または指の柱状部分の途中）、アームの途中（例えば、関節Ｊ１～Ｊ６のいずれかの場所またはアームの柱状部分の途中）に設けられてもよい。また、柔軟部１３は、これらのうちの複数の箇所に設けられてもよい。 Although a configuration in which the flexible portion 13 is provided between the tip 11a of the arm 11 and the gripper 12 is illustrated here, the flexible portion 13 may instead be provided partway along the gripper 12 (for example, at a finger joint or partway along a columnar portion of a finger) or partway along the arm (for example, at any of the joints J1 to J6 or partway along a columnar portion of the arm). The flexible portion 13 may also be provided at two or more of these locations.

ロボットシステム１は、上記のように柔軟部１３を備えるロボット１０の制御を行うためのロボットモデルを、機械学習（例えばモデルベース強化学習）を用いて獲得する。ロボット１０は柔軟部１３を有しているため、把持した部品を環境に接触させても安全であり、また、制御周期が遅くても嵌め込み作業などを実現可能である。一方、柔軟部１３によってグリッパ１２および部品の位置が不確定となるため、解析的なロボットモデルを得ることは困難である。そこで、本実施形態では機械学習を用いてロボットモデルを獲得する。 The robot system 1 uses machine learning (for example, model-based reinforcement learning) to acquire a robot model for controlling the robot 10 having the flexible part 13 as described above. Since the robot 10 has the flexible part 13, it is safe to bring the gripped part into contact with the environment, and even if the control cycle is slow, it is possible to perform fitting work. On the other hand, it is difficult to obtain an analytical robot model because the positions of the gripper 12 and parts are uncertain due to the flexible portion 13 . Therefore, in this embodiment, machine learning is used to acquire a robot model.

（状態観測センサ） (state observation sensor)

状態観測センサ２０は、ロボット１０の状態としてグリッパ１２の位置姿勢を観測し、観測した位置姿勢を実績値として出力する。状態観測センサ２０としては、例えば、ロボット１０の関節のエンコーダ、視覚センサ（カメラ）、モーションキャプチャ等が用いられる。モーションキャプチャ用のマーカーがグリッパ１２に取り付けられている場合には、グリッパ１２の位置姿勢が特定でき、グリッパ１２の位置姿勢から部品（作業対象物）の姿勢が推定できる。 The state observation sensor 20 observes the position and orientation of the gripper 12 as the state of the robot 10 and outputs the observed position and orientation as actual values. As the state observation sensor 20, for example, a joint encoder of the robot 10, a visual sensor (camera), a motion capture, or the like is used. When a motion capture marker is attached to the gripper 12 , the position and orientation of the gripper 12 can be identified, and the orientation of the component (workpiece) can be estimated from the position and orientation of the gripper 12 .

また、視覚センサによっても、グリッパ１２自体やグリッパ１２が把持している部品の位置姿勢をロボット１０の状態として検出できる。グリッパ１２とアーム１１との間が柔軟部である場合、アーム１１に対するグリッパ１２の変位を検出する変位センサによってもアーム１１に対するグリッパ１２の位置姿勢を特定することができる。 Also, the position and orientation of the gripper 12 itself and the part gripped by the gripper 12 can be detected as the state of the robot 10 by the visual sensor. If the flexible portion is between the gripper 12 and the arm 11 , the position and orientation of the gripper 12 with respect to the arm 11 can also be identified by a displacement sensor that detects the displacement of the gripper 12 with respect to the arm 11 .

（触覚センサ） (tactile sensor)

図２では図示は省略したが、図３に示すように、グリッパ１２のグリッパ本体１２ｂには、触覚センサ３０Ａ、３０Ｂが取り付けられている。 Although not shown in FIG. 2, tactile sensors 30A and 30B are attached to the gripper body 12b of the gripper 12, as shown in FIG. 3.

触覚センサ３０Ａ、３０Ｂは、一例として１組の挟持部１２ａが対向する方向に沿った位置に設けられている。触覚センサ３０Ａ、３０Ｂは、一例として３軸又は６軸の力を検出するセンサであり、自身に加えられる外力の大きさと方向を検出することができる。ユーザーは、触覚センサ３０Ａ、３０Ｂの両方に手（指）が接触するようにグリッパ本体１２ｂを掴んでグリッパ１２を動かすことによりグリッパ１２に外力を加える。 As an example, the tactile sensors 30A and 30B are provided at positions along the direction in which the pair of holding portions 12a face each other. The tactile sensors 30A and 30B are, for example, sensors that detect three-axis or six-axis forces, and can detect the magnitude and direction of an external force applied to them. The user applies an external force to the gripper 12 by gripping the gripper main body 12b and moving the gripper 12 so that the hands (fingers) are in contact with both the tactile sensors 30A and 30B.

外力としては、ロボット１０が実行するタスク（作業）が成功するように修正（ａｄｖｉｓｏｒｙ）する修正外力と、タスクが失敗するように敵対（ａｄｖｅｒｓａｒｉａｌ）する敵対外力と、がある。修正外力とは、ロボットモデルが予測するロボット１０の位置姿勢の予測値とロボット１０が到達すべき位置姿勢の目標値との間の誤差の拡大を抑制する外力である。また、敵対外力とは、ロボットモデルが予測するロボット１０の位置姿勢の予測値とロボット１０が到達すべき位置姿勢の目標値との間の誤差の拡大を縮小する外力である。 The external force is either a corrective (advisory) external force, which assists so that the task (work) executed by the robot 10 succeeds, or a hostile (adversarial) external force, which opposes so that the task fails. The corrective external force is an external force that suppresses the expansion of the error between the predicted value of the position and orientation of the robot 10 predicted by the robot model and the target value of the position and orientation that the robot 10 should reach. The hostile external force is an external force that suppresses the reduction of that error.

具体的には、ロボット１０が実行するタスクが、図３に示すようにペグ７０を台７２に設けられた穴７４に挿入するタスクである場合において、ペグ７０を穴７４に挿入するには、矢印Ａ方向にペグ７０を動かす必要がある。この場合、ペグ７０を穴７４に挿入するというタスクを成功させるための正しい方向である矢印Ａ方向に加える外力が修正外力である。一方、タスクを失敗させる方向であって矢印Ａ方向と反対方向である矢印Ｂ方向に加える外力が敵対外力である。 Specifically, when the task executed by the robot 10 is to insert the peg 70 into the hole 74 provided in the table 72 as shown in FIG. 3, the peg 70 must be moved in the direction of arrow A to insert it into the hole 74. In this case, an external force applied in the direction of arrow A, which is the correct direction for succeeding at the task of inserting the peg 70 into the hole 74, is a corrective external force. Conversely, an external force applied in the direction of arrow B, which is opposite to arrow A and causes the task to fail, is a hostile external force.

図３の場合、ユーザーがグリッパ１２を掴んで矢印Ａ方向に修正外力を加えると、触覚センサ３０Ｂよりも触覚センサ３０Ａによって検出される力が大きくなり、修正外力が加わっていると判定できる。一方、矢印Ｂ方向に敵対外力を加えると、触覚センサ３０Ａよりも触覚センサ３０Ｂによって検出される力が大きくなり、敵対外力が加わっていると判定できる。 In the case of FIG. 3, when the user grips the gripper 12 and applies a corrective external force in the direction of arrow A, the force detected by the tactile sensor 30A becomes larger than that detected by the tactile sensor 30B, and it can be determined that the corrective external force is applied. On the other hand, when a hostile external force is applied in the direction of arrow B, the force detected by the tactile sensor 30B is greater than that detected by the tactile sensor 30A, and it can be determined that a hostile external force is being applied.
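The two-sensor comparison described for FIG. 3 can be sketched as follows. This is a minimal illustration only: the function name, the scalar force inputs, and the `dead_band` parameter are hypothetical and do not appear in the source, and the mapping of sensor 30A to arrow A holds only for the sensor arrangement of FIG. 3.

```python
def classify_by_sensor_pair(force_30a: float, force_30b: float,
                            dead_band: float = 0.0) -> str:
    """Classify the applied external force from two force magnitudes.

    force_30a, force_30b: magnitudes detected by tactile sensors 30A and 30B.
    dead_band: assumed tolerance below which no external force is recognized.
    """
    diff = force_30a - force_30b
    if diff > dead_band:
        # 30A reads larger than 30B: force along arrow A (corrective).
        return "corrective"
    if diff < -dead_band:
        # 30B reads larger than 30A: force along arrow B (hostile).
        return "hostile"
    return "none"


print(classify_by_sensor_pair(2.0, 0.5))
print(classify_by_sensor_pair(0.5, 2.0))
```

A non-zero dead band would keep sensor noise from being labeled as either kind of force; whether such a tolerance is needed depends on the sensors actually used.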

なお、本実施形態では、グリッパ本体１２ｂに２つの触覚センサ３０Ａ、３０Ｂが設けられた場合について説明するが、これに限られない。例えば３つ以上の触覚センサをグリッパ本体１２ｂの周囲に等間隔で設けても良い。触覚センサを３つ以上設けてそれらの検出結果を総合すれば少なくともグリッパ１２の軸に垂直な面内での外力の方向がわかる場合は、各触覚センサは外力の大きさだけを検出するものであってもよい。 In this embodiment, the case where two tactile sensors 30A and 30B are provided on the gripper body 12b is described, but the configuration is not limited to this. For example, three or more tactile sensors may be provided at equal intervals around the gripper body 12b. If three or more tactile sensors are provided and combining their detection results reveals at least the direction of the external force in the plane perpendicular to the axis of the gripper 12, each tactile sensor may detect only the magnitude of the external force.

（ロボット制御装置） (robot controller)

ロボット制御装置４０は、機械学習によりロボットモデルを学習する学習装置として機能する。また、ロボット制御装置４０は、学習済みのロボットモデルを用いてロボット１０を制御する制御装置としても機能する。 The robot control device 40 functions as a learning device that learns a robot model by machine learning. The robot control device 40 also functions as a control device that controls the robot 10 using a learned robot model.

図４は、本実施形態に係るロボット制御装置のハードウェア構成を示すブロック図である。図４に示すように、ロボット制御装置４０は、一般的なコンピュータ（情報処理装置）と同様の構成であり、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）４０Ａ、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）４０Ｂ、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）４０Ｃ、ストレージ４０Ｄ、キーボード４０Ｅ、マウス４０Ｆ、モニタ４０Ｇ、及び通信インタフェース４０Ｈを有する。各構成は、バス４０Ｉを介して相互に通信可能に接続されている。 FIG. 4 is a block diagram showing the hardware configuration of the robot control device according to this embodiment. As shown in FIG. 4, the robot control device 40 has the same configuration as a general computer (information processing device), and includes a CPU (Central Processing Unit) 40A, a ROM (Read Only Memory) 40B, a RAM (Random Access Memory) 40C, a storage 40D, a keyboard 40E, a mouse 40F, a monitor 40G, and a communication interface 40H. These components are communicably connected to one another via a bus 40I.

本実施形態では、ＲＯＭ４０Ｂ又はストレージ４０Ｄには、ロボットモデルを機械学習するためのプログラム及びロボット制御プログラムが格納されている。ＣＰＵ４０Ａは、中央演算処理ユニットであり、各種プログラムを実行したり、各構成を制御したりする。すなわち、ＣＰＵ４０Ａは、ＲＯＭ４０Ｂ又はストレージ４０Ｄからプログラムを読み出し、ＲＡＭ４０Ｃを作業領域としてプログラムを実行する。ＣＰＵ４０Ａは、ＲＯＭ４０Ｂ又はストレージ４０Ｄに記録されているプログラムに従って、上記各構成の制御及び各種の演算処理を行う。ＲＯＭ４２は、各種プログラム及び各種データを格納する。ＲＡＭ４０Ｃは、作業領域として一時的にプログラム又はデータを記憶する。ストレージ４０Ｄは、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）、又はフラッシュメモリにより構成され、オペレーティングシステムを含む各種プログラム、及び各種データを格納する。キーボード４０Ｅ及びマウス４０Ｆは入力装置６０の一例であり、各種の入力を行うために使用される。モニタ４０Ｇは、例えば、液晶ディスプレイであり、表示装置５０の一例である。モニタ４０Ｇは、タッチパネル方式を採用して、入力装置６０として機能してもよい。通信インタフェース４０Ｈは、他の機器と通信するためのインタフェースであり、例えば、イーサネット（登録商標）、ＦＤＤＩ又はＷｉ－Ｆｉ（登録商標）等の規格が用いられる。 In this embodiment, the ROM 40B or the storage 40D stores a program for machine learning of the robot model and a robot control program. The CPU 40A is a central processing unit that executes various programs and controls each component. That is, the CPU 40A reads a program from the ROM 40B or the storage 40D and executes the program using the RAM 40C as a work area. The CPU 40A controls the above components and performs various arithmetic processing according to the programs recorded in the ROM 40B or the storage 40D. The ROM 40B stores various programs and various data. The RAM 40C temporarily stores programs or data as a work area. The storage 40D is composed of an HDD (Hard Disk Drive), an SSD (Solid State Drive), or flash memory, and stores various programs, including an operating system, and various data. The keyboard 40E and the mouse 40F are examples of the input device 60 and are used for various inputs. The monitor 40G is, for example, a liquid crystal display, and is an example of the display device 50. The monitor 40G may employ a touch panel system and also function as the input device 60. The communication interface 40H is an interface for communicating with other devices, and uses a standard such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark).

次に、ロボット制御装置４０の機能構成について説明する。 Next, the functional configuration of the robot control device 40 will be described.

図１に示すように、ロボット制御装置４０は、その機能構成として、取得部４１、モデル実行部４２、報酬算出部４３、行動決定部４４、外力モデル更新部４５、学習制御部４６、及びユーザーインターフェース（ＵＩ）制御部４７を有する。各機能構成は、ＣＰＵ４０ＡがＲＯＭ４０Ｂまたはストレージ４０Ｄに記憶された機械学習プログラムを読み出して、ＲＡＭ４０Ｃに展開して実行することにより実現される。なお、一部または全部の機能は専用のハードウェア装置によって実現されても構わない。 As shown in FIG. 1, the robot control device 40 has, as its functional configuration, an acquisition unit 41, a model execution unit 42, a reward calculation unit 43, an action determination unit 44, an external force model update unit 45, a learning control unit 46, and a user interface (UI) control unit 47. Each functional component is realized by the CPU 40A reading the machine learning program stored in the ROM 40B or the storage 40D, loading it into the RAM 40C, and executing it. Some or all of the functions may instead be realized by a dedicated hardware device.

取得部４１は、ロボット１０の位置姿勢の実績値及びロボット１０に加えられる外力の実績値を取得する。ロボット１０の位置姿勢とは、一例としてロボット１０のエンドエフェクタとしてのグリッパ１２の位置姿勢である。ロボット１０に加えられる外力は、一例としてロボット１０のエンドエフェクタとしてのグリッパ１２に加えられる外力である。外力の実績値は、触覚センサ３０Ａ、３０Ｂにより計測される。なお、ロボット１０が、どの部分がエンドエフェクタであるかを特定しにくいようなロボットの場合には、操作対象物に対する影響が生じるロボットの箇所という観点で適宜位置姿勢を計測する箇所や外力を加える箇所を特定すればよい。 The acquisition unit 41 acquires the actual value of the position and orientation of the robot 10 and the actual value of the external force applied to the robot 10. The position and orientation of the robot 10 are, for example, the position and orientation of the gripper 12 serving as the end effector of the robot 10. The external force applied to the robot 10 is, for example, the external force applied to the gripper 12 serving as the end effector of the robot 10. The actual value of the external force is measured by the tactile sensors 30A and 30B. If the robot 10 is a robot for which it is difficult to identify which part is the end effector, the location at which the position and orientation are measured and the location at which the external force is applied may be chosen as appropriate, from the viewpoint of which part of the robot affects the object being manipulated.

本実施形態では、アーム１１の先端１１ａに柔軟部１３を介してグリッパ１２が設けられた構成であるため、グリッパ１２に外力が加えられたときに物理的に柔軟に変位できるか、または外力に応じて制御により変位できる構成が好ましい。なお、柔軟性を有しない硬いロボットに手で外力を加えることによっても開示の技術は適用可能である。 In this embodiment, the gripper 12 is provided at the tip 11a of the arm 11 via the flexible portion 13, so a configuration in which the gripper 12 can be displaced physically and flexibly when an external force is applied to it, or can be displaced under control in response to the external force, is preferable. The disclosed technology is also applicable when an external force is applied by hand to a rigid robot that has no flexibility.

ロボット１０の位置姿勢は、本実施形態では位置３自由度、姿勢３自由度の最大計６自由度の値で表されるが、ロボット１０の可動自由度に応じてより少ない自由度であってもよい。例えばエンドエフェクタの姿勢変化が生じないロボットの場合には、「位置姿勢」は位置３自由度のみでよい。 In this embodiment, the position and orientation of the robot 10 are represented by values with up to six degrees of freedom in total, namely three degrees of freedom for position and three for orientation, but fewer degrees of freedom may be used depending on the movable degrees of freedom of the robot 10. For example, for a robot whose end effector does not change orientation, the "position and orientation" may consist of only the three positional degrees of freedom.

モデル実行部４２は、ロボットモデルＬＭを実行する。 The model execution unit 42 executes the robot model LM.

ロボットモデルＬＭは、図５に示すように、ある時間における位置姿勢の実績値（計測値）及びロボット１０に与えることができる行動指令（候補値又は決定値）に基づき、その次の時間におけるロボット１０の位置姿勢の予測値を算出する状態遷移モデルＤＭ及びロボット１０に加えられる外力の予測値を算出する外力モデルＥＭを含む。 As shown in FIG. 5, the robot model LM includes a state transition model DM and an external force model EM. Based on the actual value (measured value) of the position and orientation at a certain time and an action command (candidate value or determined value) that can be given to the robot 10, the state transition model DM calculates a predicted value of the position and orientation of the robot 10 at the next time, and the external force model EM calculates a predicted value of the external force applied to the robot 10.

なお、ロボットモデルＬＭが「に基づき」（＝入力する）、「算出する」（＝出力する）というのは、モデル実行部４２がモデルを実行する際に、入力データを用いてモデルを実行する、モデルを実行することにより出力データを算出（生成）することをいう。 Note that saying the robot model LM operates "based on" certain data (that is, takes it as input) and "calculates" certain data (that is, outputs it) means that, when the model execution unit 42 executes the model, it executes the model using the input data and calculates (generates) the output data by executing the model.

外力モデルＥＭは、修正外力の予測値を出力する修正外力モデルＥＭ１及び敵対外力の予測値を出力する敵対外力モデルＥＭ２を含む。 The external force model EM includes a modified external force model EM1 that outputs predicted values of corrected external forces and a hostile external force model EM2 that outputs predicted values of hostile external forces.
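The composition of the robot model LM described above (a state transition model DM plus the external force models EM1 and EM2, all fed the same measured pose and action command) can be sketched as a simple container. The class name, the placeholder predictors, and the control period `DT` are assumptions made for illustration; in the source these models are learned, not hand-written.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A predictor maps (measured pose, action command) to a predicted vector.
Predictor = Callable[[Sequence[float], Sequence[float]], List[float]]


@dataclass
class RobotModel:
    """Hypothetical container mirroring LM = {DM, EM1, EM2}."""
    state_transition: Predictor  # DM: predicts the next position/orientation
    corrective_force: Predictor  # EM1: predicts the corrective external force
    hostile_force: Predictor     # EM2: predicts the hostile external force

    def predict(self, pose: Sequence[float], command: Sequence[float]
                ) -> Tuple[List[float], List[float], List[float]]:
        return (self.state_transition(pose, command),
                self.corrective_force(pose, command),
                self.hostile_force(pose, command))


DT = 0.1  # assumed control period in seconds


def integrate_velocity(pose, command):
    # Placeholder DM: treats the action command as a velocity command.
    return [p + DT * v for p, v in zip(pose, command)]


def zero_force(pose, command):
    # Placeholder EM1/EM2: an untrained model that predicts no external force.
    return [0.0] * len(pose)


model = RobotModel(integrate_velocity, zero_force, zero_force)
next_pose, f_corr, f_host = model.predict([0.0, 0.0, 0.0], [1.0, 0.0, -1.0])
print(next_pose, f_corr, f_host)
```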

報酬算出部４３は、位置姿勢の予測値と到達すべき位置姿勢の目標値との間の誤差及び外力の予測値に基づいて報酬を算出する。到達すべき位置姿勢とは、タスク完了時に到達すべき位置姿勢でもよいし、タスク完了前の中間目標としての位置姿勢でもよい。 The reward calculation unit 43 calculates a reward based on the error between the position/orientation predicted value and the position/orientation target value to be reached and the external force predicted value. The position/orientation to be reached may be the position/orientation to be reached when the task is completed, or may be the position/orientation as an intermediate target before the task is completed.

報酬算出部４３は、外力が誤差の拡大を抑制する外力である修正外力である場合において、修正外力の予測値を報酬の減少要因とする計算により報酬を算出する。 When the external force is a corrected external force that suppresses the spread of the error, the remuneration calculation unit 43 calculates the remuneration by calculation using the predicted value of the corrected external force as a reduction factor of the remuneration.

誤差の拡大を抑制する外力である修正外力とは、位置姿勢の誤差が拡大していく場面で拡大の速さを鈍らせるような外力であり、誤差の拡大を縮小に転じさせる外力でなくてもよい。 The corrected external force, which is an external force that suppresses the expansion of the error, is an external force that slows the rate of expansion in a situation where the position and orientation error is expanding; it does not have to be an external force that turns the expansion of the error into a reduction.

修正外力の予測値を報酬の減少要因とする計算とは、修正外力の予測値を０にして計算した場合の報酬にくらべて外力モデルが算出した修正外力の予測値を用いて計算した場合の報酬の方が小さいことを意味する。なお、減少は時間的な減少を意味するものではなく、修正外力モデルＥＭ１が算出した修正外力の予測値を計算に用いていても報酬が時間の経過に従い減少するとは限らない。 A calculation that uses the predicted value of the corrected external force as a factor that decreases the reward means that the reward calculated using the predicted value of the corrected external force output by the external force model is smaller than the reward calculated with that predicted value set to 0. The decrease does not mean a decrease over time: even when the predicted value of the corrected external force calculated by the corrected external force model EM1 is used in the calculation, the reward does not necessarily decrease as time passes.

位置姿勢の誤差が拡大していく場合でも、位置姿勢の誤差が大きいとき（例えば図３においてペグ７０が穴７４から大きく離間しているとき）に修正外力を加えるのが好ましく、位置姿勢の誤差が小さいとき（ペグ７０が穴７４付近にあるとき）は修正外力を加えなくてもよい。また、位置姿勢の誤差が大きいときに、位置姿勢の誤差が拡大していく速さが大きいほど大きな修正外力を加えることが好ましい。 Even if the position and orientation error increases, it is preferable to apply the correction external force when the position and orientation error is large (for example, when the peg 70 is far away from the hole 74 in FIG. 3). is small (when the peg 70 is near the hole 74), no correction external force need be applied. Further, when the position/orientation error is large, it is preferable to apply a larger correction external force as the speed at which the position/orientation error increases increases.

また、報酬算出部４３は、タスク実行中における修正外力の予測値に基づく報酬の減少量の変化の幅が誤差に基づく報酬の変化の幅よりも小さくなる計算により報酬を算出する。 In addition, the reward calculation unit 43 calculates the reward by performing a calculation such that the amount of change in the amount of decrease in the reward based on the predicted value of the corrected external force during task execution is smaller than the width of change in the reward based on the error.

また、報酬算出部４３は、外力が誤差の縮小を抑制する外力である敵対外力である場合において、敵対外力の予測値を報酬の増加要因とする計算により報酬を算出する。 Further, when the external force is a hostile external force that suppresses the reduction of the error, the remuneration calculation unit 43 calculates the remuneration by calculation using the predicted value of the hostile external force as a factor for increasing the remuneration.

誤差の縮小を抑制する外力である敵対外力とは、位置姿勢の誤差が縮小していく場面で縮小の速さを鈍らせるような外力であり、誤差の縮小を拡大に転じさせる外力でなくてもよい。 The hostile external force, which is an external force that suppresses the reduction of the error, is an external force that slows the rate of reduction in a situation where the position and orientation error is shrinking; it does not have to be an external force that turns the reduction of the error into an expansion.

敵対外力の予測値を報酬の増加要因とする計算とは、敵対外力の予測値を０にして計算した場合の報酬にくらべて敵対外力モデルが算出した敵対外力の予測値を用いて計算した場合の報酬の方が大きいことを意味する。なお、増加は時間的な増加を意味するものではなく、敵対外力モデルＥＭ２が算出した敵対外力の予測値を計算に用いていても報酬が時間の経過に従い増加するとは限らない。 A calculation that uses the predicted value of the hostile external force as a factor that increases the reward means that the reward calculated using the predicted value of the hostile external force output by the hostile external force model is larger than the reward calculated with that predicted value set to 0. The increase does not mean an increase over time: even when the predicted value of the hostile external force calculated by the hostile external force model EM2 is used in the calculation, the reward does not necessarily increase as time passes.

位置姿勢の誤差が縮小していく場合でも、位置姿勢の誤差が小さいとき（例えば図３においてペグ７０が穴７４付近にあるとき）に敵対外力を加えるのが好ましく、位置姿勢の誤差が大きいときは敵対外力を加えなくてもよい。また、位置姿勢の誤差が小さいときに、位置姿勢の誤差が縮小していく速さが大きいほど大きな敵対外力を加えることが好ましい。 Even if the position and orientation error is reduced, it is preferable to apply the hostile external force when the position and orientation error is small (for example, when the peg 70 is near the hole 74 in FIG. 3), and when the position and orientation error is large. does not have to apply a hostile external force. Further, when the position and orientation error is small, it is preferable to apply a larger hostile external force as the speed of reduction of the position and orientation error increases.

また、報酬算出部４３は、タスク実行中における敵対外力の予測値に基づく報酬の増加量の変化の幅が誤差に基づく報酬の変化の幅よりも小さくなる計算により報酬を算出する。 Further, the reward calculation unit 43 calculates the reward by performing a calculation such that the amount of change in the amount of increase in the reward based on the predicted value of the hostile external force during execution of the task is smaller than the width of change in the reward based on the error.
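One way to realize a reward with these properties is sketched below. The source does not give a formula, so everything here is an assumption: the reward is the negative pose error, the corrected (corrective) force prediction subtracts a term, the hostile force prediction adds a term, and each force term is squashed with `tanh` and scaled by a small weight (`w_corr`, `w_host`, hypothetical names) so that its range of variation stays below a chosen bound. Whether that bound is actually smaller than the range of the error term depends on the task scale and must be checked when choosing the weights.

```python
import math
from typing import Sequence


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def reward(pred_pose: Sequence[float], target_pose: Sequence[float],
           pred_corrective: Sequence[float], pred_hostile: Sequence[float],
           w_corr: float = 0.1, w_host: float = 0.1) -> float:
    """Assumed reward: -error - bounded corrective term + bounded hostile term."""
    error = norm([p - t for p, t in zip(pred_pose, target_pose)])
    # tanh keeps each force term within [0, weight), whatever the force size.
    corrective_term = w_corr * math.tanh(norm(pred_corrective))
    hostile_term = w_host * math.tanh(norm(pred_hostile))
    return -error - corrective_term + hostile_term


zero = [0.0, 0.0, 0.0]
r_base = reward([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], zero, zero)
r_corr = reward([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0], zero)
r_host = reward([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], zero, [5.0, 0.0, 0.0])
print(r_base, r_corr, r_host)
```

With these definitions a predicted corrective force always lowers the reward relative to the zero-force case and a predicted hostile force always raises it, which matches the "decrease factor" and "increase factor" wording without implying any trend over time.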

行動決定部４４は、制御周期毎に、行動指令の複数の候補を生成してロボットモデルＬＭに与え、行動指令の複数の候補のそれぞれに対応して報酬算出部４３が算出する報酬に基づいて報酬を最大化する行動指令を決定する。 In each control cycle, the action determination unit 44 generates a plurality of action command candidates and gives them to the robot model LM, and determines the action command that maximizes the reward based on the rewards that the reward calculation unit 43 calculates for each of the candidates.

行動指令とは、本実施形態では速度指令であるが、位置指令、トルク指令、速度、位置、トルクの組み合わせ指令等でもよい。また、行動指令は、複数の時間にわたる行動指令の系列であってもよい。また、行動指令の複数の候補は、行動指令の複数の系列の候補であってもよい。 The action command is a speed command in this embodiment, but may be a position command, a torque command, a combination command of speed, position, torque, or the like. Also, the action command may be a sequence of action commands over multiple times. Also, the plurality of candidates for action commands may be candidates for a plurality of series of action commands.

報酬を最大化するとは、限られた時間内で探索した結果として最大化されていればよく、報酬がその状況における真の最大値になっている必要はない。 Maximizing the reward only needs to be maximized as a result of searching within a limited time, and the reward does not have to be the true maximum value in the situation.
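The candidate-and-select step can be illustrated with plain random sampling. This is a sketch under stated assumptions, not the search method of the source, which does not name one: candidates are drawn uniformly within an assumed speed limit, the model is a one-dimensional placeholder that integrates a velocity command over an assumed period `DT`, and the score uses only the pose error (the external force terms are omitted for brevity).

```python
import random
from typing import Callable, List, Tuple


def generate_candidates(n: int, dim: int, max_speed: float,
                        rng: random.Random) -> List[List[float]]:
    """Draw n velocity-command candidates uniformly in [-max_speed, max_speed]."""
    return [[rng.uniform(-max_speed, max_speed) for _ in range(dim)]
            for _ in range(n)]


def select_best(candidates: List[List[float]],
                score: Callable[[List[float]], float]
                ) -> Tuple[List[float], float]:
    """Return the candidate with the highest score among those examined."""
    best_cmd, best_reward = None, float("-inf")
    for cmd in candidates:
        r = score(cmd)
        if r > best_reward:
            best_cmd, best_reward = cmd, r
    return best_cmd, best_reward


DT = 0.1                      # assumed control period in seconds
pose, target = [0.0], [1.0]   # one-dimensional toy state and goal


def score(cmd: List[float]) -> float:
    # Placeholder for "run the robot model, then compute the reward".
    next_pose = [p + DT * v for p, v in zip(pose, cmd)]
    return -abs(next_pose[0] - target[0])


candidates = generate_candidates(64, 1, 1.0, random.Random(0))
best_cmd, best_reward = select_best(candidates, score)
print(best_cmd, best_reward)
```

The result is the best of the sampled candidates only, which is consistent with the statement that the reward need not be the true maximum for the situation.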

外力モデル更新部４５は、決定された行動指令に基づいて外力モデルが算出した外力の予測値と、当該外力の予測値に対応する外力の実績値との間の差異が小さくなるように外力モデルを更新する。 The external force model updating unit 45 updates the external force model so as to reduce the difference between the predicted value of the external force that the external force model calculated based on the determined action command and the actual value of the external force corresponding to that predicted value.

外力モデル更新部４５は、修正外力モデル更新部４５Ａ及び敵対外力モデル更新部４５Ｂを含む。 The external force model updater 45 includes a corrected external force model updater 45A and a hostile external force model updater 45B.

修正外力モデル更新部４５Ａは、行動決定部４４で決定された行動指令に基づいて修正外力モデルＥＭ１が算出した修正外力の予測値と修正外力の実績値との間の差異が小さくなるように修正外力モデルＥＭ１を更新する。 The corrected external force model updating unit 45A corrects so that the difference between the predicted value of the corrected external force calculated by the corrected external force model EM1 based on the action command determined by the action determination unit 44 and the actual value of the corrected external force becomes smaller. Update the external force model EM1.

敵対外力モデル更新部４５Ｂは、行動決定部４４で決定された行動指令に基づいて敵対外力モデルＥＭ２が算出した敵対外力の予測値と敵対外力の実績値との間の差異が小さくなるように敵対外力モデルＥＭ２を更新する。 The hostile external force model updating unit 45B adjusts the hostile force so that the difference between the predicted value of the hostile external force calculated by the hostile external force model EM2 based on the action command determined by the action determining unit 44 and the actual value of the hostile external force becomes small. Update the external force model EM2.
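The update rule "reduce the difference between predicted and measured external force" can be sketched with a deliberately simple model. The linear model, the squared-error loss, and the learning rate below are assumptions for illustration; the source leaves the model class open (a neural network is mentioned later for the integrated model).

```python
from typing import List, Sequence


class LinearForceModel:
    """Hypothetical external force model: force = W @ x, x = pose + command."""

    def __init__(self, n_in: int, n_out: int) -> None:
        self.w = [[0.0] * n_in for _ in range(n_out)]

    def predict(self, x: Sequence[float]) -> List[float]:
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

    def update(self, x: Sequence[float], measured: Sequence[float],
               lr: float = 0.1) -> float:
        """One gradient step on the squared prediction error.

        Returns the squared error measured before the step.
        """
        pred = self.predict(x)
        for row, p, m in zip(self.w, pred, measured):
            err = p - m
            for j, xj in enumerate(x):
                row[j] -= lr * err * xj
        return sum((p - m) ** 2 for p, m in zip(pred, measured))


em1 = LinearForceModel(n_in=2, n_out=1)
x = [1.0, 0.5]       # toy input: one pose value and one command value
measured = [2.0]     # toy measured external force
losses = [em1.update(x, measured) for _ in range(50)]
print(losses[0], losses[-1])
```

The same update routine could serve both updating units 45A and 45B, each holding its own model instance and each fed only the samples of its own force type.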

学習制御部４６は、位置姿勢の実績値及び外力の実績値に基づき外力が修正外力であるか敵対外力であるかを判別し、判別の結果が修正外力である場合は修正外力モデル更新部の動作を有効化し、判別の結果が敵対外力である場合は敵対外力モデル更新部４５Ｂの動作を有効化する。さらに、判別の結果が修正外力でない場合は修正外力モデル更新部４５Ａの動作を無効化し、判別の結果が敵対外力でない場合は敵対外力モデル更新部４５Ｂの動作を無効化する。 The learning control unit 46 determines whether the external force is a corrected external force or a hostile external force based on the actual value of the position and orientation and the actual value of the external force. If the result is a corrected external force, it enables the operation of the corrected external force model updating unit 45A; if the result is a hostile external force, it enables the operation of the hostile external force model updating unit 45B. Furthermore, if the result is not a corrected external force, it disables the operation of the corrected external force model updating unit 45A, and if the result is not a hostile external force, it disables the operation of the hostile external force model updating unit 45B.

なお、本実施形態では、学習制御部４６が、位置姿勢の実績値及び外力の実績値に基づき外力が修正外力であるか敵対外力であるかを自動で判別する場合について説明するが、学習制御部４６が、外力が修正外力であるか敵対外力であるかの指定を受け付ける受け付け部をさらに備えた構成としてもよい。 In this embodiment, the learning control unit 46 automatically determines whether the external force is a corrected external force or a hostile external force based on the actual values of the position and orientation and the actual values of the external force. The unit 46 may further include a receiving unit that receives designation as to whether the external force is the corrected external force or the hostile external force.

この場合、ユーザーは、入力装置６０を操作して、グリッパ１２に加える外力が修正外力であるか敵対外力であるかを指定してグリッパ１２に外力を加える。 In this case, the user operates the input device 60 to specify whether the external force applied to the gripper 12 is a corrected external force or a hostile external force, and applies the external force to the gripper 12 .

そして、学習制御部４６は、指定が修正外力である場合は修正外力モデル更新部４５Ａの動作を有効化し、指定が敵対外力である場合は敵対外力モデル更新部４５Ｂの動作を有効化する。さらに、指定が修正外力でない場合は修正外力モデル更新部４５Ａの動作を無効化し、指定が敵対外力でない場合は敵対外力モデル更新部４５Ｂの動作を無効化する。 Then, the learning control unit 46 validates the operation of the corrected external force model updating unit 45A when the designation is the corrected external force, and validates the operation of the hostile external force model updating unit 45B when the designation is the hostile external force. Furthermore, if the specified external force is not corrected, the operation of the corrected external force model updating unit 45A is invalidated, and if the specified external force is not the hostile external force, the operation of the hostile external force model updating unit 45B is invalidated.
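The enable/disable logic of the learning control unit 46 can be sketched as below. The gating function follows the text directly. The automatic classifier, however, is only an assumed rule: the source says the determination uses the measured pose and the measured external force but does not state how, so this sketch simply checks whether the force points toward or away from the target pose. All names and the threshold are hypothetical.

```python
from typing import Dict, Sequence


def classify_external_force(pose: Sequence[float], target: Sequence[float],
                            force: Sequence[float],
                            threshold: float = 1e-6) -> str:
    """Assumed rule: a force component toward the target is corrective."""
    to_target = [t - p for p, t in zip(pose, target)]
    along = sum(f * d for f, d in zip(force, to_target))
    if along > threshold:
        return "corrective"
    if along < -threshold:
        return "hostile"
    return "none"


def select_updaters(kind: str) -> Dict[str, bool]:
    """Enable exactly the updater that matches the force type, if any.

    kind may come from classify_external_force or from a user designation.
    """
    return {
        "updater_45A_enabled": kind == "corrective",
        "updater_45B_enabled": kind == "hostile",
    }


kind = classify_external_force([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])
print(kind, select_updaters(kind))
```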

なお、図５の例では、状態遷移モデルＤＭ、修正外力モデルＥＭ１、及び敵対外力モデルＥＭ２がそれぞれ独立したモデルであるが、ロボットモデルＬＭの構成はこれに限られない。例えば図６に示すロボットモデルＬＭ１のように、共用部ＣＭ、状態遷移モデル固有部ＤＭａ、修正外力モデル固有部ＥＭ１ａ、及び敵対外力モデル固有部ＥＭ２ａで構成されてもよい。この場合、共用部ＣＭは、状態遷移モデルＤＭ、修正外力モデルＥＭ１、及び敵対外力モデルＥＭ２に共通の処理を行う。修正外力モデル固有部ＥＭ１ａは、修正外力モデルＥＭ１に固有の処理を行う。敵対外力モデル固有部ＥＭ２ａは、敵対外力モデルＥＭ２に固有の処理を行う。 In the example of FIG. 5, the state transition model DM, the corrected external force model EM1, and the hostile external force model EM2 are independent models, but the configuration of the robot model LM is not limited to this. For example, like the robot model LM1 shown in FIG. 6, it may be composed of a common part CM, a state transition model specific part DMa, a corrected external force model specific part EM1a, and an enemy external force model specific part EM2a. In this case, the common part CM performs common processing for the state transition model DM, the corrected external force model EM1, and the hostile external force model EM2. The corrected external force model specific part EM1a performs processing specific to the corrected external force model EM1. The hostile external force model specific part EM2a performs processing specific to the hostile external force model EM2.

また、図７に示すロボットモデルＬＭ２のように、図５の修正外力モデルＥＭ１及び敵対外力モデルＥＭ２を備えた統合外力モデルＩＭを含む構成としてもよい。この場合、統合外力モデルＩＭは、外力の予測値を出力すると共に、外力が修正外力であるか敵対外力であるかを識別するための識別情報を出力する。 Also, like the robot model LM2 shown in FIG. 7, the configuration may include an integrated external force model IM including the corrected external force model EM1 and the hostile external force model EM2 shown in FIG. In this case, the integrated external force model IM outputs a predicted value of the external force and also outputs identification information for identifying whether the external force is a corrected external force or a hostile external force.

統合外力モデルＩＭは、プログレッシブニューラルネットワークの手法により修正外力モデルＥＭ１及び敵対外力モデルＥＭ２を統合したものであってもよい。この場合、修正外力モデルＥＭ１及び敵対外力モデルはニューラルネットワークで構成する。そして、敵対外力モデルＥＭ２の１又は複数の中間層及び出力層のうちの少なくとも１つの層は、修正外力モデルＥＭ１の対応する層の前段の層の出力をプログレッシブニューラルネットワーク（ＰＮＮ：ＰｒｏｇｒｅｓｓｉｖｅＮｅｕｒａｌＮｅｔｗｏｒｋ）の手法により統合する。 The integrated external force model IM may be obtained by integrating the corrected external force model EM1 and the hostile external force model EM2 using the progressive neural network technique. In this case, the corrected external force model EM1 and the hostile external force model EM2 are each configured as a neural network. At least one of the one or more intermediate layers and the output layer of the hostile external force model EM2 then integrates, by the progressive neural network (PNN) technique, the output of the layer that precedes the corresponding layer of the corrected external force model EM1.

図８の例では、敵対外力モデルＥＭ２の出力層ＯＵＴ２には、修正外力モデルＥＭ１の対応する出力層ＯＵＴ１の前段の層である中間層ＭＩＤ２Ａの出力が入力されている。また、敵対外力モデルＥＭ２の中間層ＭＩＤ２Ｂには、修正外力モデルＥＭ１の対応する中間層ＭＩＤ２Ａの前段の層である中間層ＭＩＤ１Ａの出力が入力されている。 In the example of FIG. 8, the output layer OUT2 of the hostile external force model EM2 is supplied with the output of the intermediate layer MID2A, which is the preceding layer of the corresponding output layer OUT1 of the corrected external force model EM1. Also, the intermediate layer MID2B of the hostile external force model EM2 receives the output of the intermediate layer MID1A, which is the layer preceding the corresponding intermediate layer MID2A of the modified external force model EM1.

このような統合外力モデルＩＭに対して、まず、修正外力モデルＥＭ１の機械学習を行い、次に、敵対外力モデルＥＭ２の機械学習を行う。修正外力を加えて修正外力モデルＥＭ１の学習を行っている間は、修正外力の実績値に対する修正外力の予測値の誤差が小さくなるように修正外力モデルＥＭ１の各層間の重みパラメータを更新し、敵対外力モデルＥＭ２は更新しない。修正外力モデルＥＭ１の１つの層（例えばＭＩＤ１Ａ）から次の層（ＭＩＤ２Ａ）に至る経路の重みパラメータと同じ層（ＭＩＤ１Ａ）から敵対外力モデルの層（ＭＩＤ２Ｂ）に至る経路の重みパラメータとは常に同じ値にする。敵対外力モデルＥＭ２の１つの層（例えばＭＩＤ２Ｂ）は、その層への修正外力モデルの層（例えばＭＩＤ１Ａ）からの重み付けられた入力と敵対外力モデルＥＭ２の前段の層（ＭＩＤ１Ｂ）からの重み付けられた入力との和を敵対外力モデルＥＭ２の後段の層（ＯＵＴ２）への出力とする。修正外力モデルＥＭ１の学習が終了した後、敵対外力を加えて敵対外力モデルＥＭ２の学習を行っている間は、敵対外力の実績値に対する敵対外力の予測値の誤差が小さくなるように敵対外力モデルＥＭ２の各層間の重みパラメータを更新し、修正外力モデルＥＭ１は更新しない。敵対外力モデルＥＭ２の学習が終了した後の運用フェーズにおいては、敵対外力モデルＥＭ２の出力を外力の予測値として使用し、修正外力モデルＥＭ１の出力は使用しない。このようにして外力の機械学習をすることにより、修正外力モデルＥＭ１と敵対外力モデルＥＭ２とを統合したモデルでありながら、先に行う修正外力についての学習結果を壊すことなく敵対外力についての学習を行うことができる。 For such an integrated external force model IM, machine learning of the corrected external force model EM1 is performed first, followed by machine learning of the hostile external force model EM2. While the corrected external force model EM1 is being trained with corrected external forces applied, the weight parameters between the layers of the corrected external force model EM1 are updated so that the error of the predicted value of the corrected external force with respect to its actual value becomes small, and the hostile external force model EM2 is not updated. The weight parameter of the path from one layer (for example, MID1A) of the corrected external force model EM1 to its next layer (MID2A) and the weight parameter of the path from that same layer (MID1A) to the layer (MID2B) of the hostile external force model are always kept at the same value. One layer (for example, MID2B) of the hostile external force model EM2 takes the sum of the weighted input to that layer from the layer of the corrected external force model (for example, MID1A) and the weighted input from the preceding layer (MID1B) of the hostile external force model EM2, and passes it as output to the subsequent layer (OUT2) of the hostile external force model EM2. After training of the corrected external force model EM1 is finished, while the hostile external force model EM2 is being trained with hostile external forces applied, the weight parameters between the layers of the hostile external force model EM2 are updated so that the error of the predicted value of the hostile external force with respect to its actual value becomes small, and the corrected external force model EM1 is not updated. In the operation phase after training of the hostile external force model EM2 is finished, the output of the hostile external force model EM2 is used as the predicted value of the external force, and the output of the corrected external force model EM1 is not used. By machine-learning the external force in this way, the hostile external force can be learned without destroying the previously obtained learning result for the corrected external force, even though the model integrates the corrected external force model EM1 and the hostile external force model EM2.
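The layer wiring described for FIG. 8 can be sketched as a forward pass over two columns. This is an illustrative reading of the text, with assumptions labeled: layer sizes, the `tanh` activation, linear output layers, and the dictionary layout are all hypothetical. Following the statement that the path MID1A to MID2B always carries the same weight as MID1A to MID2A, the lateral inputs here reuse the EM1 weights; note that the progressive networks of Rusu et al. instead learn separate lateral weights.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def dense(weights: Matrix, x: Sequence[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def tanh_vec(v: Sequence[float]) -> List[float]:
    return [math.tanh(a) for a in v]


def add(u: Sequence[float], v: Sequence[float]) -> List[float]:
    return [a + b for a, b in zip(u, v)]


def forward(x: Sequence[float], em1: Dict[str, Matrix],
            em2: Dict[str, Matrix]) -> Tuple[List[float], List[float]]:
    # Column EM1: input -> MID1A -> MID2A -> OUT1
    mid1a = tanh_vec(dense(em1["w1"], x))
    lateral_mid = dense(em1["w2"], mid1a)   # weighted output of MID1A
    mid2a = tanh_vec(lateral_mid)
    lateral_out = dense(em1["w3"], mid2a)   # weighted output of MID2A
    out1 = lateral_out
    # Column EM2: input -> MID1B -> MID2B -> OUT2, plus lateral inputs
    mid1b = tanh_vec(dense(em2["w1"], x))
    mid2b = tanh_vec(add(lateral_mid, dense(em2["w2"], mid1b)))
    out2 = add(lateral_out, dense(em2["w3"], mid2b))
    return out1, out2


em1 = {"w1": [[0.5, -0.2], [0.1, 0.3]],
       "w2": [[1.0, 0.0], [0.0, 1.0]],
       "w3": [[1.0, 1.0]]}
em2_untrained = {"w1": [[0.0, 0.0], [0.0, 0.0]],
                 "w2": [[0.0, 0.0], [0.0, 0.0]],
                 "w3": [[0.0, 0.0]]}
em2_trained = {"w1": [[0.0, 0.0], [0.0, 0.0]],
               "w2": [[0.0, 0.0], [0.0, 0.0]],
               "w3": [[0.5, 0.5]]}

x = [1.0, 2.0]
out1_a, out2_a = forward(x, em1, em2_untrained)
out1_b, out2_b = forward(x, em1, em2_trained)
print(out1_a, out2_a, out2_b)
```

Because OUT1 depends only on the EM1 weights, training EM2 while EM1 is held fixed cannot change what EM1 has learned, which is the property the paragraph above relies on.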

The integrated external force model IM uses an identification unit (not shown) to identify whether the predicted external force is a predicted corrective external force or a predicted hostile external force, and outputs the result as identification information. In this case, when the identification information indicates a predicted corrective external force, the reward calculation unit 43 calculates the reward so that the predicted external force acts as a factor that decreases the reward. When the identification information indicates a predicted hostile external force, it calculates the reward so that the predicted external force acts as a factor that increases the reward.

Note that the progressive neural network technique refers, for example, to the technique described in the following reference.

(Reference) Rusu et al., “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.

The following reference article also concerns progressive neural networks.

(Reference article) Continual learning in multiple games
https://wba-initiative.org/wp-content/uploads/2015/05/20161008-hack2-noguchi.pdf

(Robot model learning process)

FIG. 9 is a flowchart showing the flow of the machine learning process for training the robot model LM by machine learning. The machine learning process of FIG. 9 is executed by the CPU 40A, which reads the machine learning program stored in the ROM 40B or the storage 40D and loads it into the RAM 40C.

The processes of steps S100 to S108 described below are executed at fixed time intervals according to the control cycle. The control cycle is set to a length within which the processes of steps S100 to S108 can be completed.

In step S100, the CPU 40A waits until a predetermined time corresponding to the length of the control cycle has elapsed since the start of the previous control cycle. Step S100 may be omitted, so that the next control cycle starts as soon as the previous one is completed.

In step S101, the CPU 40A acquires the actual values (measured values) of the position and orientation of the robot 10 from the state observation sensor 20, and acquires the actual values (measured values) of the external force from the tactile sensors 30A and 30B.

In step S102, the CPU 40A, as the acquisition unit 41, determines whether the actual values of the position and orientation acquired in step S101 satisfy a predetermined end condition. The end condition is satisfied, for example, when the error between the actual position and orientation and the target position and orientation to be reached is within a specified value. In this embodiment, the position and orientation to be reached is the position and orientation of the robot 10 when the robot 10 has inserted the peg 70 into the hole 74.

If the determination in step S102 is affirmative, this routine ends. If the determination in step S102 is negative, the process proceeds to step S103.

In step S103, the CPU 40A, as the external force model update unit 45, updates the external force model EM. Specifically, it first determines whether the external force is a corrective external force or a hostile external force, based on the actual values of the position and orientation and the actual values of the external force acquired in step S101. For example, an external force detected as acting in a direction that suppresses growth of the error while the error between the actual position and orientation and the target position and orientation is growing can be determined to be a corrective external force. An external force detected as acting in a direction that suppresses reduction of the error while the error is shrinking can be determined to be a hostile external force. The determination method is not limited to this.
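The example determination rule above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the use of a dot product between the force and the vector pointing from the current position to the target, and the `None` result for undetermined cases are all hypothetical choices, and the description itself notes that other determination methods are possible.

```python
def classify_external_force(prev_error, curr_error, to_target, force):
    """Label a measured force as 'corrective', 'hostile', or None.

    prev_error, curr_error: error magnitudes in the previous and current cycle.
    to_target: vector from the current position toward the target (assumed).
    force: measured external force vector in the same frame (assumed).
    """
    # Positive when the force pushes toward the target, negative when away.
    toward_target = sum(f * d for f, d in zip(force, to_target))
    error_growing = curr_error > prev_error
    error_shrinking = curr_error < prev_error
    if error_growing and toward_target > 0:
        return "corrective"  # force suppresses growth of the error
    if error_shrinking and toward_target < 0:
        return "hostile"  # force suppresses reduction of the error
    return None  # not covered by the example rule
```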

When the external force is determined to be a corrective external force, the corrective external force model parameters of the corrective external force model EM1 are updated so that the difference between the predicted corrective external force, calculated by EM1 based on the determined action command, and the actual corrective external force becomes smaller.

When the external force is determined to be a hostile external force, the hostile external force model parameters of the hostile external force model EM2 are updated so that the difference between the predicted hostile external force, calculated by EM2 based on the determined action command, and the actual hostile external force becomes smaller.
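The selective update can be illustrated with a deliberately simple stand-in. The sketch below assumes a linear force predictor and a single gradient step on the squared error; neither the linear form nor the learning rate comes from the description, and the names `predict` and `update_force_model` are hypothetical. The point shown is only that the model matching the determined force type is updated and the other is left untouched.

```python
def predict(weights, features):
    """Linear stand-in for an external force model (an assumption)."""
    return sum(w * x for w, x in zip(weights, features))


def update_force_model(models, kind, features, actual_force, lr=0.1):
    """Take one gradient step on the model selected by `kind`.

    models: dict with keys 'corrective' (EM1) and 'hostile' (EM2).
    features: inputs to the model, e.g. pose and action command values.
    Returns the prediction error before the update.
    """
    weights = models[kind]
    error = predict(weights, features) - actual_force
    # Gradient of 0.5 * error**2 with respect to each weight is error * x.
    models[kind] = [w - lr * error * x for w, x in zip(weights, features)]
    return error
```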

In step S104, the CPU 40A, as the action determination unit 44, generates a plurality of candidates for the action command (or action command sequence) for the robot 10. In this embodiment, for example, n (for example, 300) speed command candidates are generated at random and output to the model execution unit 42 as action command candidates.

In step S105, the CPU 40A, as the model execution unit 42, calculates the predicted position and orientation and the predicted external force for each of the action command candidates generated in step S104. Specifically, the actual values of the position and orientation and the n action command candidates are input to the robot model LM, which calculates, for each candidate, the predicted position and orientation and the predicted corrective external force or predicted hostile external force.

In step S106, the CPU 40A, as the reward calculation unit 43, calculates a reward for each pair of predicted position and orientation and predicted external force corresponding to the n action command candidates. That is, n rewards are calculated.

When the external force is a corrective external force, the reward r1 can be calculated using equation (1) below.

$$ r_1 = -r^R - \alpha_1 \lVert s_1^H \rVert \tag{1} $$

Here, r^R is the error between the predicted position and orientation and the target position and orientation to be reached, s1^H is the corrective external force, and α1 is a preset weight. α1 is set so that the range over which the decrease in the reward r1 due to the predicted corrective external force varies during task execution is smaller than the range over which the reward r1 varies due to the error between the predicted position and orientation and the target position and orientation.

When the external force is a hostile external force, the reward r2 can be calculated using equation (2) below.

$$ r_2 = -r^R + \alpha_2 \lVert s_2^H \rVert \tag{2} $$

Here, s2^H is the hostile external force and α2 is a preset weight. α2 is set so that the range over which the increase in the reward r2 due to the predicted hostile external force varies during task execution is smaller than the range over which the reward r2 varies due to the error between the predicted position and orientation and the target position and orientation.

As equations (1) and (2) show, for the same external force, the larger the error between the predicted position and orientation and the target position and orientation, the smaller the reward. As equation (1) shows, for the same error, the larger the corrective external force, the smaller the reward. As equation (2) shows, for the same error, the larger the hostile external force, the larger the reward.
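The relationships stated above can be checked with a short sketch. It assumes the simplest form consistent with them, namely a reward that is linear in the pose error and in the magnitude of the predicted force, with the sign of the force term chosen by the force type. The function name, the Euclidean norm, and the default weights are illustrative assumptions.

```python
import math


def reward(pose_error, force, kind, alpha1=0.1, alpha2=0.1):
    """Reward for one action command candidate (assumed linear form).

    pose_error: error between predicted and target pose (r^R).
    force: predicted external force vector (s1^H or s2^H).
    kind: 'corrective' or 'hostile', as given by the identification info.
    """
    magnitude = math.sqrt(sum(f * f for f in force))
    if kind == "corrective":
        return -pose_error - alpha1 * magnitude  # force lowers the reward
    if kind == "hostile":
        return -pose_error + alpha2 * magnitude  # force raises the reward
    raise ValueError("kind must be 'corrective' or 'hostile'")
```

Keeping the weights small relative to the pose error term reflects the requirement that the force term vary over a narrower range than the error term.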

In step S107, the CPU 40A, as the action determination unit 44, determines the action command that maximizes the reward and outputs it to the robot 10. For example, a relational expression representing the correspondence between the n action command candidates and their rewards is calculated, and the candidate corresponding to the maximum reward on the curve represented by that expression is taken as the determined value. Alternatively, the so-called cross-entropy method (CEM) may be used to identify the action command that maximizes the reward. In this way, an action command that maximizes the reward is obtained.

Steps S104 to S106 may be repeated a predetermined number of times. In that case, after the first execution of step S106, the CPU 40A, as the action determination unit 44, extracts the m candidates with the highest rewards from the n pairs of action command candidates and rewards, obtains the mean and variance of those m candidates, and generates a normal distribution with that mean and variance. In the second execution of step S104, the CPU 40A, as the action determination unit 44, generates n new speed command candidates not at random but so that their probability density matches the obtained normal distribution. Steps S104 to S106 are then executed in the same way for the predetermined number of times. This improves the accuracy with which the reward is maximized.
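The repeated sampling described above follows the usual cross-entropy method pattern, which can be sketched as follows. The function name, the uniform range of the first round, the elite count, the iteration count, and the small constant added to the standard deviation are assumptions for illustration. `reward_fn` stands in for the combination of the robot model LM and the reward calculation.

```python
import random
import statistics


def cem_select_action(reward_fn, dim, n_candidates=300, n_elite=30,
                      n_iters=3, seed=0):
    """Pick the action command with the highest reward found by CEM."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    best_action, best_reward = None, float("-inf")
    for iteration in range(n_iters):
        if iteration == 0:
            # First round: candidates generated at random (step S104).
            candidates = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                          for _ in range(n_candidates)]
        else:
            # Later rounds: sample from the fitted normal distribution.
            candidates = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                          for _ in range(n_candidates)]
        # Steps S105 and S106: score every candidate.
        scored = sorted(((reward_fn(c), c) for c in candidates),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_reward:
            best_reward, best_action = scored[0]
        # Fit mean and spread to the m top-ranked candidates.
        elite = [c for _, c in scored[:n_elite]]
        mean = [statistics.fmean([c[d] for c in elite]) for d in range(dim)]
        std = [statistics.pstdev([c[d] for c in elite]) + 1e-6
               for d in range(dim)]
    return best_action, best_reward
```

A caller would pass a `reward_fn` that runs the robot model on the candidate and returns the reward, then send the returned action to the robot as in step S107.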

The robot 10 operates according to the determined action command. The user applies an external force to the robot 10, specifically to the gripper 12, in response to the motion of the robot 10. Preferably, the user applies a corrective external force to the robot 10 when the error between the predicted position and orientation and the target position and orientation to be reached is growing, and applies a hostile external force when the error is shrinking. That is, when the motion of the robot 10 is moving the peg 70 away from the hole 74, the user applies a corrective external force to the gripper 12 in the direction that brings the peg 70 closer to the hole 74. Conversely, when the motion of the robot 10 is moving the peg 70 toward the hole 74, the user applies a hostile external force to the gripper 12 in the direction that moves the peg 70 away from the hole 74.

In the process of machine learning of the external force model, it is preferable to apply corrective external forces first, because applying hostile external forces first may slow learning. The ratio of corrective to hostile external forces may be 1:1, or the proportion of corrective external forces may be made higher. As for the order, corrective external forces may be applied several times and then hostile external forces several times, or corrective and hostile external forces may be applied alternately.

Instead of a human applying the corrective or hostile external force, the force may be applied automatically, for example by a robot that applies external forces.

In step S108, the CPU 40A, as the model execution unit 42, calculates the predicted external force for the action command determined in step S107, and the process returns to step S100.

In this way, the processes of steps S100 to S108 are repeated every control cycle until the actual values of the position and orientation satisfy the end condition.
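The overall flow of steps S100 to S108 can be summarized as a loop skeleton. This is a sketch under assumptions: every callback name is hypothetical, the robot, sensors, and models are injected as plain functions, and skipping the model update in the very first cycle (when no predicted force exists yet) is an interpretation rather than something the description states.

```python
import time


def run_learning_episode(read_sensors, is_done, update_force_model,
                         choose_action, send_command, predict_force,
                         period_s=0.0, max_cycles=1000):
    """Run control cycles until the end condition holds; return the count."""
    cycles = 0
    last_start = None
    predicted_force = None
    while cycles < max_cycles:
        # S100: wait until one control period has passed since the last start.
        if last_start is not None:
            remaining = period_s - (time.monotonic() - last_start)
            if remaining > 0:
                time.sleep(remaining)
        last_start = time.monotonic()
        pose, actual_force = read_sensors()  # S101
        if is_done(pose):  # S102
            break
        if predicted_force is not None:
            update_force_model(predicted_force, actual_force)  # S103
        action = choose_action(pose)  # S104 to S107
        send_command(action)
        predicted_force = predict_force(pose, action)  # S108
        cycles += 1
    return cycles
```

The `max_cycles` guard is an added safety limit for the sketch and has no counterpart in the flowchart.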

The robot model LM is trained in this way. Because the robot model LM includes the corrective external force model EM1 and the hostile external force model EM2, and is trained while the user applies corrective or hostile external forces to the robot 10, it can be trained efficiently. The resulting robot model LM is also highly robust to environmental changes, such as changes in the shape and material of the parts the robot 10 manipulates and changes over time in the physical characteristics of the robot 10.

In the operation phase, the model execution unit 42 executes the robot model LM trained by the learning process of FIG. 9. The functional configuration of the robot control device 40 in the operation phase is the configuration of FIG. 1 with the external force model update unit 45 and the learning control unit 46 omitted. The robot control process in the operation phase is the learning process of FIG. 9 without the "acquire the actual value of the external force" part of step S101 and without the external force model update of step S103. The program that executes this process is the robot control program.

The device that executes the robot model learning process in the learning phase and the device that executes the robot control process in the operation phase may be separate devices or the same device. For example, the learning device used for training may be used as is as the robot control device 40 to perform control with the trained robot model LM. The robot control device 40 may also perform control while continuing to learn.

<Modified Example of First Embodiment>

In the first embodiment, the state transition model DM receives the actual values of the position and orientation and the action command as inputs, but not the actual values of the external force. Alternatively, the state transition model DM may also receive the actual values of the external force. In that case, the state transition model DM calculates the predicted position and orientation based on the actual values of the position and orientation, the action command, and the actual values of the external force. However, corrective or hostile external forces are applied to the tactile sensors 30A and 30B only while the external force models EM1, EM2, EM1a, EM2a, and IM are being trained. In the operation phase, the state transition model DM calculates the predicted position and orientation while the input actual value of the external force remains substantially zero. In this modified example as well, the external force model calculates the predicted external force from the actual values of the position and orientation and the action command, without receiving the actual values of the external force. The predicted external force influences action determination through its use in the reward calculation. The same modification can also be applied to the subsequent embodiments.

<Second Embodiment>

Next, a second embodiment of the disclosed technology will be described. Parts identical to those of the first embodiment are given the same reference signs, and their detailed description is omitted.

The robot system 1 according to the second embodiment is the same as that of the first embodiment, so its description is omitted.

(Robot model learning process)

FIG. 10 is a flowchart showing the flow of the machine learning process according to the second embodiment. The machine learning process of FIG. 10 is executed by the CPU 40A, which reads the machine learning program stored in the ROM 40B or the storage 40D and loads it into the RAM 40C.

The processes of steps S100 to S103 and S108 are the same as in FIG. 9, so their description is omitted.

In step S104A, the CPU 40A, as the action determination unit 44, generates one candidate for the action command (or action command sequence) for the robot 10.

In step S105A, the CPU 40A, as the model execution unit 42, calculates the predicted position and orientation and the predicted external force for the one action command candidate generated in step S104A. Specifically, the actual values of the position and orientation and the action command candidate are input to the robot model LM, which calculates the predicted position and orientation and the predicted corrective external force or predicted hostile external force corresponding to that candidate.

In step S106A, the CPU 40A, as the reward calculation unit 43, calculates a reward based on the pair of predicted position and orientation and predicted external force corresponding to the action command candidate. That is, when the external force is a corrective external force, the reward r1 is calculated by equation (1) above, and when the external force is a hostile external force, the reward r2 is calculated by equation (2) above.

In step S106B, the CPU 40A determines whether the reward calculated in step S106A satisfies a prescribed condition. The condition is satisfied, for example, when the reward exceeds a prescribed value, or when the loop of steps S104A to S106B has been executed a prescribed number of times. The prescribed number of times is set to, for example, 10, 100, or 1000.

In step S107A, the CPU 40A, as the action determination unit 44, determines the action command that maximizes the reward and outputs it to the robot 10. For example, this may be the action command at the time the reward satisfied the prescribed condition, or an action command that is predicted, from the history of how the reward changed as the action command changed, to increase the reward further.
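The one-candidate-at-a-time search of the second embodiment can be sketched as a loop with two exit conditions. The function and parameter names are hypothetical, and returning the best candidate seen so far when the loop limit is reached is an assumption; the description only says that the command at the time the condition is satisfied, or one predicted from the reward history, may be used.

```python
def search_single_candidates(reward_fn, propose_fn, reward_threshold,
                             max_loops=100):
    """Evaluate one candidate per loop until the prescribed condition holds.

    propose_fn: returns one new action command candidate (step S104A).
    reward_fn: runs the model and reward calculation for a candidate.
    Returns (best_action, best_reward, loops_executed).
    """
    best_action, best_reward = None, float("-inf")
    loops = 0
    while loops < max_loops:  # condition 2: prescribed number of loops
        loops += 1
        action = propose_fn()
        value = reward_fn(action)
        if value > best_reward:
            best_action, best_reward = action, value
        if value > reward_threshold:  # condition 1: reward exceeds the value
            break
    return best_action, best_reward, loops
```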

<Third Embodiment>

Next, a third embodiment of the disclosed technology will be described. Parts identical to those of the first embodiment are given the same reference signs, and their detailed description is omitted.

(Robot control device)

FIG. 11 shows the functional configuration of a robot control device 40X according to the third embodiment. The robot control device 40X differs from the robot control device 40 shown in FIG. 1 in that it includes a storage unit 48 and a state transition model update unit 49. The rest of the configuration is the same as that of the robot control device 40, so its description is omitted.

The storage unit 48 stores the actual values of the position and orientation of the robot 10 acquired by the acquisition unit 41.

The state transition model update unit 49 updates the state transition model DM so that the error between the predicted position and orientation, calculated by the state transition model DM based on the action command determined by the action determination unit 44, and the actual position and orientation corresponding to that prediction becomes smaller.

(Robot model learning process)

FIG. 12 is a flowchart showing the flow of the machine learning process according to the third embodiment. The machine learning process of FIG. 12 is executed by the CPU 40A, which reads the machine learning program stored in the ROM 40B or the storage 40D and loads it into the RAM 40C.

The learning process of FIG. 12 differs from the learning process of FIG. 9 in that steps S101A and S103A are added. The other steps are the same as in FIG. 9, so their description is omitted.

In step S101A, the CPU 40A, as the acquisition unit 41, stores in the storage unit 48 the actual values of the position and orientation of the robot 10 acquired in step S101.

In step S103A, the CPU 40A, as the state transition model update unit 49, updates the state transition model DM. Specifically, it first acquires, for example, 100 randomly selected sets from those stored in the storage unit 48, each set consisting of the actual position and orientation x_t at time t, the speed command value u_t as the action command, and the actual position and orientation x_{t+1} at time t+1. Next, it determines new state transition model parameters by modifying the previous state transition model parameters. The parameters are modified with the goal of minimizing the error between the position and orientation at time t+1 predicted from the actual position and orientation at time t and the actual position and orientation at time t+1.

The new state transition model parameters are then set in the state transition model DM. They are also stored in the state transition model update unit 49 for use as the "previous model parameters" in the next control cycle.
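The minibatch update of step S103A can be illustrated with a toy model. The sketch assumes a scalar linear transition model x_{t+1} ≈ a·x_t + b·u_t trained by one gradient step per call; the actual form of the state transition model DM is not specified here, and the function names, learning rate, and scalar state are assumptions.

```python
import random


def update_transition_model(params, buffer, batch_size=100, lr=0.5,
                            rng=random):
    """Return new (a, b) after one gradient step on a random minibatch.

    params: previous model parameters (a, b).
    buffer: stored tuples (x_t, u_t, x_next) of actual values and commands.
    """
    a, b = params
    batch = rng.sample(buffer, min(batch_size, len(buffer)))
    grad_a = 0.0
    grad_b = 0.0
    for x_t, u_t, x_next in batch:
        error = (a * x_t + b * u_t) - x_next  # predicted minus actual
        grad_a += error * x_t
        grad_b += error * u_t
    n = len(batch)
    return (a - lr * grad_a / n, b - lr * grad_b / n)


def mean_squared_error(params, buffer):
    """Average squared one-step prediction error over the whole buffer."""
    a, b = params
    return sum(((a * x + b * u) - y) ** 2 for x, u, y in buffer) / len(buffer)
```

Calling `update_transition_model` once per control cycle and carrying the returned parameters into the next cycle mirrors the "previous model parameters" bookkeeping described above.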

In this way, in this embodiment, the state transition model DM can be trained together with the corrective external force model EM1 and the hostile external force model EM2.

<Experimental Example>

Next, experimental examples of the disclosed technology will be described.

FIG. 13 shows the result of training a robot model in simulation by performing the task of inserting a peg into a hole while applying corrective and hostile external forces to the robot. In this simulation, the robot model was trained by applying a corrective external force seven times and then a hostile external force seven times.

The horizontal axes of FIGS. 13(A) and 13(B) represent the number of times an external force was applied. The vertical axis of FIG. 13(A) represents the peg position error. The vertical axis of FIG. 13(B) represents the magnitude of the applied external force.

FIG. 13(A) shows three results: the result of training only the state transition model without applying external forces, using the conventional method (Baseline); the result of training the robot model including the corrective and hostile external force models with the proposed method, using corrective and hostile external forces applied by subject 1 (Proposed (participant 1)); and the corresponding result using forces applied by subject 2, a different person from subject 1 (Proposed (participant 2)). As FIG. 13(A) shows, comparing the position error at the end of training, the proposed method (subjects 1 and 2) yields a smaller position error than the conventional method. As FIG. 13(B) shows, subjects 1 and 2 applied the external forces differently, yet the position error decreased regardless of the subject.

FIG. 14(A) shows the simulated number of successful peg insertions when the friction coefficient of the table in which the hole is provided was varied. FIG. 14(B) shows the simulated number of successful peg insertions when the mass of the peg was varied. As FIGS. 14(A) and 14(B) show, the proposed method achieved more successful peg insertions than the conventional method even when the friction coefficient of the table or the mass of the peg differed.

FIG. 15 shows the result of performing the same peg insertion task as in the simulation on a real robot, using pegs of different materials. Three peg materials were used: metal (Default), plastic, and sponge. As FIG. 15 shows, the proposed method achieved more successful peg insertions than the conventional method for every peg material.

As can be seen from the configuration and operation of the above embodiments and from the above experimental examples, machine learning with corrective external forces applied improves the efficiency of the machine learning of the robot model. Machine learning with hostile external forces applied improves robustness to changes in the frictional force and mass of the grasped object. Machine learning with hostile external forces applied also has the effect of improving learning efficiency.

The above embodiment merely illustrates configuration examples of the present invention. The disclosed technology is not limited to the specific forms described above, and various modifications are possible within the scope of its technical idea.

For example, although the above embodiment was described using peg fitting as an example, the work to be learned and controlled may be any work.

The robot model learning process and the robot control process, which in each of the above embodiments are executed by the CPU reading software (a program), may instead be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed specifically to execute particular processing, such as an ASIC (Application Specific Integrated Circuit). The learning process and the control process may be executed by one of these processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these processors is an electric circuit combining circuit elements such as semiconductor elements.

In each of the above embodiments, the robot model learning program and the robot control program are stored (installed) in advance in the storage 40D or the ROM 40B, but the present invention is not limited to this. The programs may be provided in a form recorded on a recording medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The programs may also be downloaded from an external device via a network.
The following appendices are further disclosed in relation to the above embodiments.
(Appendix 1)
A robot model learning device comprising:
an acquisition unit (41) that acquires an actual value of the position and orientation of a robot and an actual value of an external force applied to the robot;
a robot model (LM) including a state transition model (DM) that calculates, based on the actual value of the position and orientation at a certain time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model (EM) that calculates a predicted value of the external force applied to the robot;
a model execution unit (42) that executes the robot model;
a reward calculation unit (43) that calculates a reward based on the error between the predicted value of the position and orientation and a target value of the position and orientation to be reached, and on the predicted value of the external force;
an action determination unit (44) that, in each control cycle, generates a plurality of candidates for the action command, gives them to the robot model, and determines the action command that maximizes the reward based on the rewards calculated by the reward calculation unit for the respective candidates; and
an external force model updating unit (45) that updates the external force model so as to reduce the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value.
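The candidate evaluation performed in each control cycle in Appendix 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the one-dimensional position, the hand-written stand-ins `transition_model` and `external_force_model`, the uniform random sampling of candidates, and the weight `force_weight` are all hypothetical and were chosen only to keep the example short.

```python
import random

def transition_model(position, action):
    """Stand-in for the state transition model (DM): next predicted position."""
    return position + action

def external_force_model(position, action):
    """Stand-in for the external force model (EM): predicted external force."""
    return 0.1 * action

def reward(predicted_position, target, predicted_force, force_weight=0.05):
    """Reward from the position error and the predicted external force."""
    error = abs(target - predicted_position)
    return -error - force_weight * abs(predicted_force)

def decide_action(position, target, rng, n_candidates=64, max_step=1.0):
    """One control cycle: sample candidates, score each with the models, keep the best."""
    candidates = [rng.uniform(-max_step, max_step) for _ in range(n_candidates)]

    def score(action):
        return reward(transition_model(position, action), target,
                      external_force_model(position, action))

    return max(candidates, key=score)

if __name__ == "__main__":
    rng = random.Random(0)
    position, target = 0.0, 3.0
    for _ in range(10):
        action = decide_action(position, target, rng)
        position = transition_model(position, action)  # the real robot would move here
    print(f"final position: {position:.3f}")
```

Here the force term lowers the reward, which corresponds to the corrective case of Appendix 3; the hostile case of Appendix 6 would flip its sign.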
(Appendix 2)
The robot model learning device according to Appendix 1, further comprising a state transition model updating unit (49) that updates the state transition model so as to reduce the error between the predicted value of the position and orientation calculated by the state transition model based on the determined action command and the actual value of the position and orientation corresponding to that predicted value.
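The update steps of Appendices 1 and 2 both reduce the gap between a model's prediction for the executed action command and the value actually measured. The sketch below shows one possible form of such an update for the external force model, assuming a linear model trained by stochastic gradient descent on the squared error. The class name `LinearForceModel`, the linear form, the learning rate, and the synthetic target function are hypothetical; Appendix 12 describes the external force models as neural networks, for which the same idea applies with backpropagation.

```python
class LinearForceModel:
    """Hypothetical external force model: force = w_pos * position + w_act * action + bias."""

    def __init__(self):
        self.w_pos, self.w_act, self.bias = 0.0, 0.0, 0.0

    def predict(self, position, action):
        return self.w_pos * position + self.w_act * action + self.bias

    def update(self, position, action, actual_force, lr=0.05):
        """One gradient step on 0.5 * (prediction - actual)^2.

        Returns the absolute prediction error measured before the step.
        """
        residual = self.predict(position, action) - actual_force
        self.w_pos -= lr * residual * position
        self.w_act -= lr * residual * action
        self.bias -= lr * residual
        return abs(residual)

if __name__ == "__main__":
    def measured_force(position, action):
        # Synthetic "actual value", used only to exercise the update rule.
        return 0.5 * position - 2.0 * action + 0.3

    model = LinearForceModel()
    samples = [(p / 4.0, a / 4.0) for p in range(-4, 5) for a in range(-4, 5)]
    for _ in range(200):
        for p, a in samples:
            model.update(p, a, measured_force(p, a))
    print(round(model.w_pos, 3), round(model.w_act, 3), round(model.bias, 3))
```

The state transition model of Appendix 2 could be updated in the same way, with the measured position and orientation as the target instead of the measured force.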
(Appendix 3)
The robot model learning device according to Appendix 1 or Appendix 2, wherein, when the external force is a corrective external force, that is, an external force that suppresses growth of the error, the reward calculation unit calculates the reward by a calculation in which the predicted value of the corrective external force is a factor that decreases the reward.
(Appendix 4)
The robot model learning device according to Appendix 3, wherein the reward calculation unit calculates the reward by a calculation in which, during task execution, the range of variation in the amount by which the reward decreases based on the predicted value of the corrective external force is smaller than the range of variation in the amount by which the reward decreases based on the error.
(Appendix 5)
The robot model learning device according to Appendix 3 or Appendix 4, wherein
the external force model includes a corrective external force model (EM1) that outputs the predicted value of the corrective external force when the external force is the corrective external force, and
the external force model updating unit includes a corrective external force model updating unit (45A) that, when the external force is the corrective external force, updates the corrective external force model so as to reduce the difference between the predicted value of the corrective external force calculated by the corrective external force model based on the determined action command and the actual value of the corrective external force.
(Appendix 6)
The robot model learning device according to Appendix 1 or Appendix 2, wherein, when the external force is a hostile external force, that is, an external force that suppresses reduction of the error, the reward calculation unit calculates the reward by a calculation in which the predicted value of the hostile external force is a factor that increases the reward.
(Appendix 7)
The robot model learning device according to Appendix 6, wherein the reward calculation unit calculates the reward by a calculation in which, during task execution, the range of variation in the amount by which the reward increases based on the predicted value of the hostile external force is smaller than the range of variation in the amount by which the reward increases based on the error.
(Appendix 8)
The robot model learning device according to Appendix 6 or Appendix 7, wherein
the external force model includes a hostile external force model (EM2) that outputs the predicted value of the hostile external force when the external force is the hostile external force, and
the external force model updating unit includes a hostile external force model updating unit (45B) that, when the external force is the hostile external force, updates the hostile external force model so as to reduce the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the hostile external force.
(Appendix 9)
The robot model learning device according to Appendix 1 or Appendix 2, wherein the reward calculation unit calculates the reward by a calculation in which the predicted value of a corrective external force is a factor that decreases the reward when the external force is the corrective external force, which suppresses growth of the error, and calculates the reward by a calculation in which the predicted value of a hostile external force is a factor that increases the reward when the external force is the hostile external force, which is an external force that suppresses reduction of the error.
(Appendix 10)
The robot model learning device according to Appendix 9, wherein the reward calculation unit calculates the reward by a calculation in which, during task execution, the range of variation in the amount by which the reward decreases based on the predicted value of the corrective external force is smaller than the range of variation in the reward based on the error, and the range of variation in the amount by which the reward increases based on the predicted value of the hostile external force is smaller than the range of variation in the reward based on the error.
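One assumed way to satisfy the conditions of Appendices 9 and 10 is to bound the force term to a fixed fraction of the range of the error term. In the sketch below, the clipping limits `max_error` and `max_force`, the fraction `ratio`, and the function name are hypothetical; the appendices only require that the force-based variation be smaller than the error-based variation and do not prescribe this formula.

```python
def reward(error, predicted_force, force_kind, max_error, max_force, ratio=0.2):
    """Reward combining the position error with a bounded external force term.

    error           -- distance between predicted and target pose (>= 0)
    predicted_force -- magnitude predicted by the external force model
    force_kind      -- "corrective" (decreases reward) or "hostile" (increases it)
    ratio           -- swing of the force term as a fraction of the error-term swing
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    error_term = -min(error, max_error)                       # spans [-max_error, 0]
    level = min(abs(predicted_force), max_force) / max_force  # spans [0, 1]
    force_term = ratio * max_error * level                    # spans [0, ratio * max_error]
    if force_kind == "corrective":
        return error_term - force_term
    if force_kind == "hostile":
        return error_term + force_term
    raise ValueError(f"unknown force kind: {force_kind!r}")

if __name__ == "__main__":
    for kind in ("corrective", "hostile"):
        print(kind, reward(1.0, 10.0, kind, max_error=2.0, max_force=10.0))
```

Because `ratio` is below 1, the position error always dominates the ranking of action command candidates, and the force term only shifts the choice among candidates with similar errors.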
(Appendix 11)
The robot model learning device according to Appendix 9 or Appendix 10, wherein
the external force model includes a corrective external force model (EM1) that outputs the predicted value of the corrective external force when the external force is the corrective external force, and a hostile external force model (EM2) that outputs the predicted value of the hostile external force when the external force is the hostile external force, and
the external force model updating unit includes a corrective external force model updating unit that, when the external force is the corrective external force, updates the corrective external force model so as to reduce the difference between the predicted value of the corrective external force calculated by the corrective external force model based on the determined action command and the actual value of the external force, and a hostile external force model updating unit (45B) that, when the external force is the hostile external force, updates the hostile external force model so as to reduce the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force.
(Appendix 12)
The robot model learning device according to Appendix 11, wherein
the robot model includes an integrated external force model (IM) comprising the corrective external force model and the hostile external force model,
the corrective external force model and the hostile external force model are neural networks,
at least one layer among the one or more intermediate layers and the output layer of the hostile external force model integrates, by a progressive neural network technique, the output of the layer preceding the corresponding layer of the corrective external force model,
the integrated external force model outputs the output of the hostile external force model as the predicted value of the external force,
the integrated external force model outputs identification information indicating whether the predicted value of the external force that it outputs is a predicted value of the corrective external force or a predicted value of the hostile external force, and
the reward calculation unit calculates the reward by a calculation in which the predicted value of the external force is a factor that decreases the reward when the identification information indicates a predicted value of the corrective external force, and by a calculation in which the predicted value of the external force is a factor that increases the reward when the identification information indicates a predicted value of the hostile external force.
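The lateral connection described in Appendix 12 can be illustrated with a two-column forward pass. This is a sketch under assumptions rather than the network of the embodiment: each column has a single hidden layer, the weights are hand-picked, and the identification output is produced by a hypothetical linear head (`id_head`), since the appendix does not state how the identification information is computed.

```python
def dense(weights, bias, inputs):
    """Affine layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

class IntegratedForceModel:
    """Two-column sketch: column 1 = corrective model, column 2 = hostile model."""

    def __init__(self, col1, col2, lateral, id_head):
        self.col1 = col1        # {"W1", "b1"}: hidden layer of the corrective column
        self.col2 = col2        # {"W1", "b1", "W2", "b2"}: hostile column
        self.lateral = lateral  # column-1 hidden output -> column-2 output layer
        self.id_head = id_head  # {"W", "b"}: assumed head for the identification output

    def predict(self, features):
        h1 = relu(dense(self.col1["W1"], self.col1["b1"], features))
        h2 = relu(dense(self.col2["W1"], self.col2["b1"], features))
        # The hostile column's output layer combines its own hidden layer with
        # the output of the preceding layer of the corrective column.
        own = dense(self.col2["W2"], self.col2["b2"], h2)
        side = dense(self.lateral, [0.0] * len(own), h1)
        force = [a + b for a, b in zip(own, side)]
        logit = dense(self.id_head["W"], self.id_head["b"], h2 + h1)[0]
        kind = "hostile" if logit > 0.0 else "corrective"
        return force, kind

if __name__ == "__main__":
    model = IntegratedForceModel(
        col1={"W1": [[1.0, 0.0]], "b1": [0.0]},
        col2={"W1": [[0.0, 1.0]], "b1": [0.0], "W2": [[0.5]], "b2": [0.1]},
        lateral=[[2.0]],
        id_head={"W": [[1.0, -1.0]], "b": [0.0]},
    )
    print(model.predict([1.0, 2.0]))  # features: e.g. position and action
```

In a progressive neural network the earlier column is normally kept fixed while the later column and the lateral weights are trained, so knowledge learned for the corrective force can be reused without being overwritten.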
(Appendix 13)
The robot model learning device according to Appendix 11 or Appendix 12, further comprising:
a reception unit that receives a designation of whether the external force is the corrective external force or the hostile external force; and
a learning control unit that enables operation of the corrective external force model updating unit when the designation is the corrective external force, and enables operation of the hostile external force model updating unit when the designation is the hostile external force.
(Appendix 14)
The robot model learning device according to Appendix 11 or Appendix 12, further comprising a learning control unit (46) that determines, based on the actual value of the position and orientation and the actual value of the external force, whether the external force is the corrective external force or the hostile external force, enables operation of the corrective external force model updating unit when the result of the determination is the corrective external force, and enables operation of the hostile external force model updating unit when the result of the determination is the hostile external force.
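Appendix 14 does not state the rule used to distinguish the two kinds of force. One hypothetical rule, consistent with the definitions in Appendices 3 and 6, is to treat a measured force that pushes toward the target as corrective and any other force as hostile. The sketch below assumes that rule, assumes the target position is available to the learning control unit, and uses illustrative function names.

```python
def classify_external_force(position, target, force):
    """Return "corrective" if the force pushes toward the target, else "hostile".

    A force with no component along the direction to the target is treated
    as hostile here, which is an arbitrary choice for this sketch.
    """
    to_target = [t - p for t, p in zip(target, position)]
    along = sum(f * d for f, d in zip(force, to_target))
    return "corrective" if along > 0.0 else "hostile"

def select_updater(position, target, force, update_corrective, update_hostile):
    """Enable exactly one of the two update routines for this sample."""
    kind = classify_external_force(position, target, force)
    return update_corrective if kind == "corrective" else update_hostile

if __name__ == "__main__":
    print(classify_external_force([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))
    print(classify_external_force([0.0, 0.0], [1.0, 0.0], [-2.0, 0.5]))
```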
(Appendix 15)
A robot model machine learning method comprising:
preparing a robot model (LM) including a state transition model (DM) that calculates, based on an actual value of the position and orientation of a robot at a certain time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model (EM) that calculates a predicted value of an external force applied to the robot;
acquiring, in each control cycle, the actual value of the position and orientation and an actual value of the external force applied to the robot;
generating, in each control cycle, a plurality of candidates for the action command and giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, the rewards being calculated based on a plurality of errors between the predicted values of the position and orientation calculated by the state transition model for the respective candidates and a target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the respective candidates; and
updating the external force model so as to reduce the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value.
(Appendix 16)
The robot model machine learning method according to Appendix 15, further comprising updating the state transition model so as to reduce the error between the predicted value of the position and orientation calculated by the state transition model based on the determined action command and the actual value of the position and orientation corresponding to that predicted value.
(Appendix 17)
The robot model machine learning method according to Appendix 15 or Appendix 16, wherein, when the external force is a corrective external force, that is, an external force that suppresses growth of the error, the reward is calculated by a calculation in which the predicted value of the corrective external force is a factor that decreases the reward.
(Appendix 18)
The robot model machine learning method according to Appendix 17, wherein the reward is calculated by a calculation in which, during task execution, the range of variation in the amount by which the reward decreases based on the predicted value of the corrective external force is smaller than the range of variation in the reward based on the error.
(Appendix 19)
The robot model machine learning method according to Appendix 17 or Appendix 18, wherein
the external force model includes a corrective external force model (EM1) that outputs the predicted value of the corrective external force when the external force is the corrective external force, and
when the external force is the corrective external force, the corrective external force model is updated so as to reduce the difference between the predicted value of the corrective external force calculated by the corrective external force model based on the determined action command and the actual value of the external force.
(Appendix 20)
The robot model machine learning method according to Appendix 19, wherein the corrective external force is applied to the robot while the error is growing.
(Appendix 21)
The robot model machine learning method according to Appendix 15 or Appendix 16, wherein, when the external force is a hostile external force, that is, an external force that suppresses reduction of the error, the reward is calculated by a calculation in which the predicted value of the hostile external force is a factor that increases the reward.
(Appendix 22)
The robot model machine learning method according to Appendix 21, wherein the reward is calculated by a calculation in which, during task execution, the range of variation in the amount by which the reward increases based on the predicted value of the hostile external force is smaller than the range of variation in the reward based on the error.
(Appendix 23)
The robot model machine learning method according to Appendix 21 or Appendix 22, wherein
the external force model includes a hostile external force model (EM2) that outputs the predicted value of the hostile external force when the external force is the hostile external force, and
when the external force is the hostile external force, the hostile external force model is updated so as to reduce the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force.
(Appendix 24)
The robot model machine learning method according to Appendix 23, wherein the hostile external force is applied to the robot while the error is shrinking.
(Appendix 25)
The robot model machine learning method according to Appendix 15 or Appendix 16, wherein the reward is calculated by a calculation in which the predicted value of a corrective external force is a factor that decreases the reward when the external force is the corrective external force, which suppresses growth of the error, and by a calculation in which the predicted value of a hostile external force is a factor that increases the reward when the external force is the hostile external force, which is an external force that suppresses reduction of the error.
(Appendix 26)
The robot model machine learning method according to Appendix 25, wherein the reward is calculated by a calculation in which, during task execution, the range of variation in the amount by which the reward decreases based on the predicted value of the corrective external force is smaller than the range of variation in the reward based on the error, and the range of variation in the amount by which the reward increases based on the predicted value of the hostile external force is smaller than the range of variation in the reward based on the error.
(Appendix 27)
The robot model machine learning method according to Appendix 25 or Appendix 26, wherein
the external force model includes a corrective external force model (EM1) that outputs the predicted value of the corrective external force when the external force is the corrective external force, and a hostile external force model (EM2) that outputs the predicted value of the hostile external force when the external force is the hostile external force,
when the external force is the corrective external force, the corrective external force model is updated so as to reduce the difference between the predicted value of the corrective external force calculated by the corrective external force model based on the determined action command and the actual value of the external force, and
when the external force is the hostile external force, the hostile external force model is updated so as to reduce the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force.
(Appendix 28)
The robot model machine learning method according to Appendix 27, wherein the corrective external force is applied to the robot while the error is growing, and the hostile external force is applied to the robot while the error is shrinking.
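The rule of Appendix 28 for choosing which force to apply during learning can be written as a comparison of successive errors. The function names and the handling of an unchanged error are assumptions of this sketch; the appendix itself covers only the growing and shrinking cases.

```python
def choose_applied_force(previous_error, current_error):
    """Pick the type of external force to apply from the trend of the error."""
    if current_error > previous_error:
        return "corrective"  # error is growing: push the robot back toward the target
    if current_error < previous_error:
        return "hostile"     # error is shrinking: disturb the robot
    return None              # unchanged error: not specified in the appendix, apply nothing

def force_schedule(errors):
    """Force type for each control cycle after the first, given an error trace."""
    return [choose_applied_force(a, b) for a, b in zip(errors, errors[1:])]

if __name__ == "__main__":
    print(force_schedule([1.0, 1.2, 0.9, 0.9]))
```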
(Appendix 29)
A robot model machine learning program for machine learning of a robot model (LM) including a state transition model (DM) that calculates, based on an actual value of the position and orientation of a robot at a certain time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model (EM) that calculates a predicted value of an external force applied to the robot, the program causing a computer to execute processing comprising:
acquiring, in each control cycle, the actual value of the position and orientation and an actual value of the external force applied to the robot;
generating, in each control cycle, a plurality of candidates for the action command and giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, the rewards being calculated based on a plurality of errors between the predicted values of the position and orientation calculated by the state transition model for the respective candidates and a target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the respective candidates; and
updating the external force model so as to reduce the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value.
(Appendix 30)
A robot control device comprising:
a model execution unit (42) that executes a robot model (LM) including a state transition model (DM) that calculates, based on an actual value of the position and orientation of a robot at a certain time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model (EM) that calculates a predicted value of an external force applied to the robot;
an acquisition unit (41) that acquires the actual value of the position and orientation of the robot and an actual value of the external force applied to the robot;
a reward calculation unit (43) that calculates a reward based on the error between the predicted value of the position and orientation calculated by the robot model and a target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the robot model; and
an action determination unit (44) that, in each control cycle, generates a plurality of candidates for the action command, gives them to the robot model, and determines the action command that maximizes the reward based on the rewards calculated by the reward calculation unit for the respective candidates.
(Appendix 31)
A robot control method comprising:
preparing a robot model (LM) including a state transition model (DM) that calculates, based on an actual value of the position and orientation of a robot at a certain time and an action command that can be given to the robot, a predicted value of the position and orientation of the robot at the next time, and an external force model (EM) that calculates a predicted value of an external force applied to the robot;
acquiring, in each control cycle, the actual value of the position and orientation and an actual value of the external force applied to the robot;
generating, in each control cycle, a plurality of candidates for the action command and giving them to the robot model, and determining the action command that maximizes the reward based on a plurality of rewards calculated for the respective candidates, the rewards being calculated based on a plurality of errors between the predicted values of the position and orientation calculated by the state transition model for the respective candidates and a target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the respective candidates; and
controlling the robot based on the determined action command.
(Appendix 32)
A program for controlling a robot using a robot model (LM) that includes a state transition model (DM) that calculates a predicted value of the position and orientation of the robot at the next time based on the actual values of the position and orientation of the robot at a certain time and an action command that can be given to the robot, and an external force model (EM) that calculates a predicted value of the external force applied to the robot,
Acquiring the actual values of the position and orientation and the actual values of the external force applied to the robot for each control cycle,
For each control cycle, generating a plurality of candidates for the action command and giving them to the robot model; calculating a reward for each of the plurality of candidates based on a plurality of errors between the plurality of predicted values of the position and orientation calculated by the state transition model for the plurality of candidates and the target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the plurality of candidates; and determining, based on the plurality of rewards calculated, the action command that maximizes the reward,
controlling the robot based on the determined action command;
A robot control program that causes a computer to perform each process.

1 Robot system
10 Robot
20 State observation sensor
30A, 30B Tactile sensors
40 Robot control device
41 Acquisition unit
42 Model execution unit
43 Reward calculation unit
44 Action determination unit
45 External force model update unit
45A Hostile external force model update unit
45B Corrected external force model update unit
49 State transition model update unit
70 Peg
LM Robot model

Claims

an acquisition unit that acquires actual values of the position and orientation of a robot and actual values of the external force applied to the robot;
a robot model including a state transition model that calculates a predicted value of the position and orientation of the robot at the next time based on the actual values of the position and orientation at a certain time and an action command that can be given to the robot, and an external force model that calculates a predicted value of the external force applied to the robot;
a model execution unit that executes the robot model;
a reward calculation unit that calculates a reward based on the error between the predicted value of the position and orientation and the target value of the position and orientation to be reached, and on the predicted value of the external force;
an action determination unit that, for each control cycle, generates a plurality of candidates for the action command, gives them to the robot model, and determines the action command that maximizes the reward based on the rewards calculated by the reward calculation unit for each of the plurality of candidates;
an external force model update unit that updates the external force model so that the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value becomes smaller;
A robot model learning device comprising the units above.
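
As an illustration only, the candidate evaluation and external force model update recited above can be sketched in one dimension. Everything named here is a hypothetical stand-in and not taken from the specification: the additive `state_transition_model`, the linear `ExternalForceModel`, the evenly spaced candidate grid, and the reward weighting `force_weight`. The sketch also assumes the predicted force lowers the reward.

```python
def state_transition_model(position, action):
    # Hypothetical stand-in: the next position is the current one plus the action.
    return position + action


class ExternalForceModel:
    """Hypothetical linear external force model: force = w * action."""

    def __init__(self, w=0.0, learning_rate=0.1):
        self.w = w
        self.learning_rate = learning_rate

    def predict(self, action):
        return self.w * action

    def update(self, action, actual_force):
        # One gradient step that shrinks the squared difference between
        # the predicted and the actual external force.
        difference = self.predict(action) - actual_force
        self.w -= self.learning_rate * difference * action


def generate_candidates(n=21, low=-1.0, high=1.0):
    # Evenly spaced candidate action commands (random sampling would also do).
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def reward(predicted_position, target, predicted_force, force_weight=0.1):
    # Assumed form: position error and predicted force both lower the reward.
    return -abs(target - predicted_position) - force_weight * abs(predicted_force)


def choose_action(position, target, force_model):
    # Evaluate every candidate with both models and keep the best one.
    def score(action):
        predicted_position = state_transition_model(position, action)
        predicted_force = force_model.predict(action)
        return reward(predicted_position, target, predicted_force)

    return max(generate_candidates(), key=score)


if __name__ == "__main__":
    model = ExternalForceModel()
    action = choose_action(position=0.0, target=0.5, force_model=model)
    model.update(action, actual_force=1.0)  # measured force for this cycle
    print(action, model.w)
```

In each control cycle the chosen action would be sent to the robot, and the measured external force would then be passed to `update`.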

The robot model learning device according to claim 1, further comprising a state transition model update unit that updates the state transition model so that the error between the predicted value of the position and orientation calculated by the state transition model based on the determined action command and the actual value of the position and orientation corresponding to that predicted value becomes smaller.

The robot model learning device according to claim 2, wherein, when the external force is a corrected external force that suppresses expansion of the error, the reward calculation unit calculates the reward using a predicted value of the corrected external force as a factor that decreases the reward.

The robot model learning device according to claim 2, wherein, when the external force is a hostile external force that suppresses reduction of the error, the reward calculation unit calculates the reward using a predicted value of the hostile external force as a factor that increases the reward.

The robot model learning device according to claim 1, wherein the reward calculation unit calculates the reward using a predicted value of the corrected external force as a factor that decreases the reward when the external force is a corrected external force that suppresses expansion of the error, and using a predicted value of the hostile external force as a factor that increases the reward when the external force is a hostile external force that suppresses reduction of the error.
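
A minimal sketch of a reward with this sign convention follows. The function name `compute_reward`, the labels `"corrected"` and `"hostile"`, the use of the force magnitude, and the weight `alpha` are all assumptions for illustration. The claim only fixes the direction in which each kind of predicted force moves the reward.

```python
def compute_reward(position_error, predicted_force, force_kind, alpha=0.1):
    # Base term: a larger position/orientation error gives a lower reward.
    base = -abs(position_error)
    magnitude = abs(predicted_force)
    if force_kind == "corrected":
        # A predicted corrected external force decreases the reward.
        return base - alpha * magnitude
    if force_kind == "hostile":
        # A predicted hostile external force increases the reward.
        return base + alpha * magnitude
    raise ValueError("force_kind must be 'corrected' or 'hostile'")


if __name__ == "__main__":
    print(compute_reward(1.0, 2.0, "corrected"))
    print(compute_reward(1.0, 2.0, "hostile"))
```

Keeping `alpha` small relative to the error term means the force contribution shifts the ranking of candidates without dominating it.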

The robot model learning device according to claim 5, wherein the reward calculation unit calculates the reward by a calculation in which the range of change, during task execution, of the amount by which the reward decreases based on the predicted value of the corrected external force is smaller than the range of change of the reward based on the error, and the range of change, during task execution, of the amount by which the reward increases based on the predicted value of the hostile external force is smaller than the range of change of the reward based on the error.

The robot model learning device according to claim 5, wherein the external force model includes a corrected external force model that outputs a predicted value of the corrected external force when the external force is the corrected external force, and a hostile external force model that outputs a predicted value of the hostile external force when the external force is the hostile external force, and
the external force model update unit includes: a corrected external force model update unit that, when the external force is the corrected external force, updates the corrected external force model so that the difference between the predicted value of the corrected external force calculated by the corrected external force model based on the determined action command and the actual value of the external force becomes smaller; and a hostile external force model update unit that, when the external force is the hostile external force, updates the hostile external force model so that the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force becomes smaller.

The robot model learning device according to claim 7, wherein the robot model includes an integrated external force model that integrates the corrected external force model and the hostile external force model,
the corrected external force model and the hostile external force model are neural networks,
at least one of the one or more intermediate layers and the output layer of the hostile external force model integrates, by a progressive neural network technique, the output of the layer preceding the corresponding layer of the corrected external force model,
the integrated external force model outputs the output of the hostile external force model as the predicted value of the external force,
the integrated external force model outputs identification information indicating whether the output predicted value of the external force is a predicted value of the corrected external force or a predicted value of the hostile external force, and
the reward calculation unit calculates the reward using the predicted value of the external force as a factor that decreases the reward when the identification information indicates a predicted value of the corrected external force, and as a factor that increases the reward when the identification information indicates a predicted value of the hostile external force.
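
The lateral connection used in progressive neural networks can be sketched as follows. This is a toy under stated assumptions: one hidden layer per model, `tanh` activations, plain Python lists for the weights, and the hypothetical class names `CorrectedForceColumn` and `HostileForceColumn`. The identification output recited above is not modelled.

```python
import math


def matvec(matrix, vector):
    # Multiply a weight matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def tanh_vec(vector):
    return [math.tanh(x) for x in vector]


class CorrectedForceColumn:
    """Hypothetical corrected external force model with one hidden layer."""

    def __init__(self, w_hidden, w_out):
        self.w_hidden = w_hidden
        self.w_out = w_out

    def forward(self, x):
        hidden = tanh_vec(matvec(self.w_hidden, x))
        return hidden, matvec(self.w_out, hidden)


class HostileForceColumn:
    """Hypothetical hostile external force model with a lateral input."""

    def __init__(self, w_hidden, w_out, w_lateral):
        self.w_hidden = w_hidden
        self.w_out = w_out
        self.w_lateral = w_lateral  # weights on the corrected model's hidden layer

    def forward(self, x, corrected_hidden):
        hidden = tanh_vec(matvec(self.w_hidden, x))
        own = matvec(self.w_out, hidden)
        # The output layer also receives the output of the layer that
        # precedes the output layer in the corrected external force model.
        lateral = matvec(self.w_lateral, corrected_hidden)
        return [a + b for a, b in zip(own, lateral)]


def integrated_force_prediction(x, corrected, hostile):
    # The integrated model reports the hostile column's output.
    corrected_hidden, _ = corrected.forward(x)
    return hostile.forward(x, corrected_hidden)


if __name__ == "__main__":
    corrected = CorrectedForceColumn([[1.0, 0.0]], [[1.0]])
    hostile = HostileForceColumn([[0.0, 1.0]], [[2.0]], [[3.0]])
    print(integrated_force_prediction([1.0, 0.0], corrected, hostile))
```

In the usual progressive setup the earlier column is frozen while the later one trains, so features learned for one kind of force can be reused for the other.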

The robot model learning device according to claim 7 or 8, further comprising: a receiving unit that receives a designation of whether the external force is the corrected external force or the hostile external force; and
a learning control unit that enables operation of the corrected external force model update unit when the designation is the corrected external force, and enables operation of the hostile external force model update unit when the designation is the hostile external force.

The robot model learning device according to claim 7 or 8, further comprising a learning control unit that determines whether the external force is the corrected external force or the hostile external force based on the actual values of the position and orientation and the actual values of the external force, enables operation of the corrected external force model update unit when the determination result is the corrected external force, and enables operation of the hostile external force model update unit when the determination result is the hostile external force.
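
One way such a learning control unit might route updates is sketched below in one dimension. The decision rule is an assumption, not taken from the claim: a measured force pointing toward the target is treated as a corrected external force, and any other force as a hostile one. `RunningMeanForceModel` is likewise a hypothetical placeholder for the two external force models.

```python
def classify_external_force(position, target, actual_force):
    # Assumed rule: a force pointing toward the target counteracts growth of
    # the error, so it is labelled "corrected"; otherwise it is "hostile".
    toward_target = (target - position) * actual_force
    return "corrected" if toward_target > 0 else "hostile"


class RunningMeanForceModel:
    """Hypothetical force model that tracks the mean of the forces it is given."""

    def __init__(self):
        self.prediction = 0.0
        self.count = 0

    def update(self, actual_force):
        # An incremental mean moves the prediction toward the observed force.
        self.count += 1
        self.prediction += (actual_force - self.prediction) / self.count


def learning_control_step(position, target, actual_force,
                          corrected_model, hostile_model):
    # Enable only the update unit that matches the determined force type.
    kind = classify_external_force(position, target, actual_force)
    model = corrected_model if kind == "corrected" else hostile_model
    model.update(actual_force)
    return kind


if __name__ == "__main__":
    corrected_model = RunningMeanForceModel()
    hostile_model = RunningMeanForceModel()
    print(learning_control_step(0.0, 1.0, 2.0, corrected_model, hostile_model))
    print(learning_control_step(0.0, 1.0, -1.0, corrected_model, hostile_model))
```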

Preparing a robot model that includes a state transition model that calculates a predicted value of the position and orientation of a robot at the next time based on the actual values of the position and orientation of the robot at a certain time and an action command that can be given to the robot, and an external force model that calculates a predicted value of the external force applied to the robot,
Acquiring the actual values of the position and orientation and the actual values of the external force applied to the robot for each control cycle,
For each control cycle, generating a plurality of candidates for the action command and giving them to the robot model; calculating a reward for each of the plurality of candidates based on a plurality of errors between the plurality of predicted values of the position and orientation calculated by the state transition model for the plurality of candidates and the target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the plurality of candidates; and determining, based on the plurality of rewards calculated, the action command that maximizes the reward,
updating the external force model so that the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value becomes smaller;
A robot model machine learning method.

The robot model machine learning method according to claim 11, further comprising updating the state transition model so that the error between the predicted value of the position and orientation calculated by the state transition model based on the determined action command and the actual value of the position and orientation corresponding to that predicted value becomes smaller.

The robot model machine learning method according to claim 11 or 12, wherein, when the external force is a corrected external force that suppresses expansion of the error, the reward is calculated using a predicted value of the corrected external force as a factor that decreases the reward.

The robot model machine learning method according to claim 11 or 12, wherein, when the external force is a hostile external force that suppresses reduction of the error, the reward is calculated using a predicted value of the hostile external force as a factor that increases the reward.

The robot model machine learning method according to claim 11 or 12, wherein the reward is calculated using a predicted value of the corrected external force as a factor that decreases the reward when the external force is a corrected external force that suppresses expansion of the error, and using a predicted value of the hostile external force as a factor that increases the reward when the external force is a hostile external force that suppresses reduction of the error.

The robot model machine learning method according to claim 15, wherein the external force model includes a corrected external force model that outputs a predicted value of the corrected external force when the external force is the corrected external force, and a hostile external force model that outputs a predicted value of the hostile external force when the external force is the hostile external force,
when the external force is the corrected external force, the corrected external force model is updated so that the difference between the predicted value of the corrected external force calculated by the corrected external force model based on the determined action command and the actual value of the external force becomes smaller, and when the external force is the hostile external force, the hostile external force model is updated so that the difference between the predicted value of the hostile external force calculated by the hostile external force model based on the determined action command and the actual value of the external force becomes smaller.

The robot model machine learning method according to claim 16, wherein the corrected external force is applied to the robot when the error is increasing, and the hostile external force is applied to the robot when the error is decreasing.

A machine learning program for machine learning a robot model that includes a state transition model that calculates a predicted value of the position and orientation of a robot at the next time based on the actual values of the position and orientation of the robot at a certain time and an action command that can be given to the robot, and an external force model that calculates a predicted value of the external force applied to the robot,
Acquiring the actual values of the position and orientation and the actual values of the external force applied to the robot for each control cycle,
For each control cycle, generating a plurality of candidates for the action command and giving them to the robot model; calculating a reward for each of the plurality of candidates based on a plurality of errors between the plurality of predicted values of the position and orientation calculated by the state transition model for the plurality of candidates and the target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the plurality of candidates; and determining, based on the plurality of rewards calculated, the action command that maximizes the reward,
updating the external force model so that the difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to that predicted value becomes smaller;
A robot model machine learning program that causes a computer to perform each process.

a model execution unit that executes a robot model including a state transition model that calculates a predicted value of the position and orientation of a robot at the next time based on the actual values of the position and orientation of the robot at a certain time and an action command that can be given to the robot, and an external force model that calculates a predicted value of the external force applied to the robot;
an acquisition unit that acquires the actual values of the position and orientation of the robot and the actual values of the external force applied to the robot;
a reward calculation unit that calculates a reward based on the error between the predicted value of the position and orientation calculated by the robot model and the target value of the position and orientation to be reached, and on the predicted value of the external force calculated by the robot model;
an action determination unit that, for each control cycle, generates a plurality of candidates for the action command, gives them to the robot model, and determines the action command that maximizes the reward based on the rewards calculated by the reward calculation unit for each of the plurality of candidates;
A robot control device comprising the units above.

Preparing a robot model that includes a state transition model that calculates a predicted value of the position and orientation of a robot at the next time based on the actual values of the position and orientation of the robot at a certain time and an action command that can be given to the robot, and an external force model that calculates a predicted value of the external force applied to the robot,
Acquiring the actual values of the position and orientation and the actual values of the external force applied to the robot for each control cycle,
For each control cycle, generating a plurality of candidates for the action command and giving them to the robot model; calculating a reward for each of the plurality of candidates based on a plurality of errors between the plurality of predicted values of the position and orientation calculated by the state transition model for the plurality of candidates and the target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the plurality of candidates; and determining, based on the plurality of rewards calculated, the action command that maximizes the reward,
controlling the robot based on the determined action command;
Robot control method.

A program for controlling a robot using a robot model that includes a state transition model that calculates a predicted value of the position and orientation of the robot at the next time based on the actual values of the position and orientation of the robot at a certain time and an action command that can be given to the robot, and an external force model that calculates a predicted value of the external force applied to the robot,
Acquiring the actual values of the position and orientation and the actual values of the external force applied to the robot for each control cycle,
For each control cycle, generating a plurality of candidates for the action command and giving them to the robot model; calculating a reward for each of the plurality of candidates based on a plurality of errors between the plurality of predicted values of the position and orientation calculated by the state transition model for the plurality of candidates and the target value of the position and orientation to be reached, and on a plurality of predicted values of the external force calculated by the external force model for the plurality of candidates; and determining, based on the plurality of rewards calculated, the action command that maximizes the reward,
controlling the robot based on the determined action command;
A robot control program that causes a computer to perform each process.